Day 12: Records and Parsing

You have seen how to create the B-tree for the data file, but not how to write the data itself to disk. Today you will learn

How to create the data file
How to connect a stored key to more than one note
What some of the trade-offs are in building this database
How to parse each word and combine command-line arguments into a single record

Putting the Notes into a Data File

You are closing in on the final version of your database, but you still have not saved the actual notes. Remember that until now all you have been saving is the index, with the key value. The note itself must, of course, be saved in your data file. After all, the machinations and manipulations to index and store the keys are for the purpose of storing and retrieving the notes. It is easy to lose sight of this goal as you focus on the B-tree algorithms, but to the end user, the notes are the entire point of the exercise.

Storing the Notes

How should the note be stored? Although it is possible to create a complex binary structure around these notes, there really is no reason to do so. The text can be written out to the file, and the offset of the text can be recorded and stored in the index.

The first iteration need not worry about the fact that each key may point to multiple notes. First, get the data-storage fundamentals working, and then you can worry about the intermediate index as described on day 11.

Improving the Code

After this is done, all the fundamental database features will be in place except one. You need to be able to associate more than one note to each key. You have eliminated duplicate index entries, but you cannot yet store the fact that a given key is used by more than one note.

One solution is to create an intermediary index file; I'll call this a Word-Node-Join or WNJ record. Rather than each index pointing directly to the note in the data file, it will point to a WNJ index entry. Each WNJ index entry will point both to a note and to the next WNJ index object.

The problem with this approach, however, is that some key values will have dozens of associated data files, and you now have introduced a terrible inefficiency into your program. You went to all the work of creating a very advanced B-tree structure to avoid multiple disk reads, and here you are back to reading this index repeatedly as you track down the offsets into the data file.

There are a few potential solutions to this problem. You could create yet another B-tree, but that would bring its own inefficiencies: Most keys will associate with only one, or a very small number, of notes. You could create an array of offsets, but that has two problems: The empty slots in the array take up disk space, and when you fill all the slots you are stuck, what do you do with the next note that matches that key?

The solution is to pick a reasonable but small number of slots, and reserve the last slot as a pointer to the next array. I'll use five slots as a starting point; if you have one to four notes that match a given key, their offsets will all fit in the array. When you get a fifth note, you will create a new array and put its offset into the fifth slot.

I believe this creates a reasonable trade-off between disk usage and speed of access, but in a commercial program you would want to make this a configurable number. There are a number of approaches to doing so: You might let the user set this number in a configuration file or, in a very advanced system, the program might adjust itself based on usage patterns.

In any case, changing this value can occur only when the database is repacked. You see more about packing the database when deletion of records is covered.

Recording Length and Data

As mentioned earlier, you will want to record the length of each data record so that displaying the complete record will be easier. You also need to record the date the note was created; according to the specification, the date of each note is displayed in the list of found notes.

Why record the length of the note? Remember that you are not writing the terminating null to disk, you are writing only the contents of the string. Having the length enables you to read in and display the correct number of characters when the user asks to see the complete note. One could argue that this is inefficient; saving the null costs 1 byte per string, and the length costs 4 bytes per string; but you will find it convenient to be able to read the length and then use that figure for formatting the string. If this becomes a real-world cost over time, it is easy to change in a later version.

Case Insensitivity

When searching for a value, the user probably would prefer to match without case sensitivity. Thus, when searching for computer, the user would like to match notes that include Computer and COMPUTER. There are a number of ways to accomplish this, but the easiest and most straightforward is to save all the indexed keys in uppercase.

Very Small Words

In a few days, you will see how to add stop words--those terms you do not want to search on and for which you don't want to allocate storage. There are some assumptions that you can make right now to reduce disk storage, however, and one is that one- and two-letter words almost never will have significance. There is little reason to index words such as is, if, or, we, on, in, and so on. You therefore can set the index to ignore words of fewer than three letters.

Again, in a commercial program, you might want to make this a configurable number, allowing the user to decide the minimum length of words suitable for indexing. For now, you will hard-wire in a minimum length of three letters.

Statistics

As the database grows, you will need statistics on its usage; the next version keeps track of how many pages (node and leaf) are created, and how many indexes are used in each page. This version displays these statistics at the end of each run, along with a display of all the indexed terms.

The Final Preliminary Version

There is much more to do for ROBIN V.1, which you will see on day 13. For now, however, you have a number of significant improvements, and the following listings display the changes described earlier in this chapter:

    Listing 12.1     Btree.hpp                  308

    Listing 12.2     Datafile.cpp               313

    Listing 12.3     BTree.cpp                  317

    Listing 12.4     DiskMgr.cpp                319

    Listing 12.5     Index.cpp                  322

    Listing 12.6     Page.cpp                   323

    Listing 12.7     WNJFile.cpp                328

    Listing 12.8     The new driver program     331

Listing 12.1 Btree.hpp

    1:         // **************************************************
    2:         // PROGRAM:  Btree, Page and Index declarations
    3:         // FILE:     btree.hpp
    4:         // PURPOSE:  provide fundamental btree functionality
    5:         // NOTES:
    6:         // AUTHOR:   Jesse Liberty (jl)
    7:         // REVISIONS: 11/1/94 1.0 jl  initial release
    8:         // **************************************************
    9:         #ifndef BTREE_HPP                 // inclusion guards
    10:        #define BTREE_HPP
    11:
    12:        #include <time.h>
    13:        #include <string.h>
    14:        #include <fstream.h>
    15:        #include "stdef.hpp"
    16:
    17:        const int Order =      31;  // 31 indexes and 1 header
    18:        const int dataLen =    11; // length of a key
    19:        const int MaxPages =   20; // more than we need
    20:        const int SizeItem =   16;  // key + offset
    21:        const int SizePointer = 2; // size of offset
    22:        const int PageSize = (Order+1) * SizeItem;
    23:        const int WNJSize =     5;  // entries per wnj array
    24:
    25:        // forward declarations
    26:        class Page;
    27:        class Index;
    28:
    29:        class DataFile
    30:        {
    31:        public:
    32:           // constructors & destructor
    33:           DataFile();
    34:           ~DataFile(){}
    35:
    36:           // management of files
    37:           void Close();
    38:           void Create();
    39:           void GetRecord(long offset, char * buffer);
    40:
    41:           long Insert(char *);
    42:
    43:        private:
    44:        fstream myFile;
    45:        };
    46:
    47:        class WNJFile
    48:        {
    49:        public:
    50:           // constructors & destructor
    51:           WNJFile();
    52:           ~WNJFile(){}
    53:
    54:           // management of files
    55:           void Close();
    56:           void Create();
    57:           int* Find(int NextWNJ);
    58:           int Insert(int, int);
    59:           int Append(int);
    60:
    61:        private:
    62:        static int myCount;
    63:        fstream myFile;
    64:        union
    65:        {
    66:           int myInts[5];
    67:           char  myArray[5*4];
    68:        };
    69:        };
    70:
    71:        // DiskManager - in memory keeps track of what pages are
    72:        // already in memory and swaps to disk as required
    73:        class DiskManager
    74:        {
    75:        public:
    76:           // constructors & destructor
    77:           DiskManager();
    78:           ~DiskManager(){}
    79:
    80:           // management of files
    81:           void Close(int);
    82:           int Create();
    83:
    84:           // Page manipulation
    85:           Page* GetPage(int);
    86:           void SetPage(Page*);
    87:           void Insert(Page * newPage);
    88:           void Save(Page* pg);
    89:           int NewPage(Index&, int);
    90:           int NewPage(Index *array, long offset, int leaf, int count);
    91:
    92:        private:
    93:        Page* myPages[MaxPages];
    94:        fstream myFile;
    95:        int myCount;
    96:        };
    97:
    98:        // the btree itself - has a pointer to first page
    99:        class BTree
    100:       {
    101:       public:
    102:          // constructors and destructor
    103:          BTree();
    104:          ~BTree();
    105:
    106:          // utility functions
    107:          void AddKey(char* data, long offset);
    108:          void Insert(char* buffer, int, char**);
    109:          void Insert(char*);
    110:          void PrintTree();
    111:          int Find(char* data);
    112:
    113:          // page methods
    114:          Page* GetPage(int page)
    115:             { return theDiskManager.GetPage(page); }
    116:          int NewPage(Index& pIndex, int IsLeaf)
    117:             { return theDiskManager.NewPage(pIndex,FALSE); }
    118:
    119:
    120:          // public static member!
    121:          static int myAlNum(char ch);
    122:          static  DiskManager theDiskManager;
    123:          static WNJFile   theWNJFile;
    124:          static DataFile theDataFile;
    125:          static int GetWord(char*, char*, int&);
    126:          static void GetStats();
    127:          static int NodeIndexCtr;
    128:          static int LeafIndexCtr;
    129:          static int NodePageCtr;
    130:          static int LeafPageCtr;
    131:          static int NodeIndexPerPage[Order+1];
    132:          static int LeafIndexPerPage[Order+1];
    133:
    134:       private:
    135:          int myRoot;
    136:       };
    137:
    138:       // index objects point to pages or real data
    139:       class Index
    140:       {
    141:       public:
    142:          // constructors & destructor
    143:          Index();
    144:          Index(char *);
    145:          Index (char*, int);
    146:          Index(Index&);
    147:          ~Index(){}
    148:
    149:          // accessors
    150:          const char * GetData() const  { return myData; }
    151:          void SetData(const Index& rhs)
    152:             { strcpy(myData,rhs.GetData()); }
    153:          void SetData(const char * rhs)
    154:             { strcpy(myData,rhs); }
    155:          int GetPointer()const { return myPointer; }
    156:          void SetPointer (int pg) { myPointer = pg; }
    157:
    158:          // utility functions
    159:          void PrintKey();
    160:          void PrintPage();
    161:          int Insert(Index& ref,int findOnly = FALSE);
    162:
    163:          // overloaded operators
    164:          int operator==(const Index& rhs) const;
    165:       //   {return strcmp(myData,rhs.GetData()) == 0; }
    166:
    167:          int operator < (const Index& rhs) const
    168:           {return strcmp(myData,rhs.GetData())<0;}
    169:
    170:          int operator <= (const Index& rhs) const
    171:           {return strcmp(myData,rhs.GetData())<=0;}
    172:
    173:          int operator > (const Index& rhs) const
    174:           {return strcmp(myData,rhs.GetData())>0;}
    175:
    176:          int operator >= (const Index& rhs) const
    177:           {return strcmp(myData,rhs.GetData())>=0;}
    178:
    179:       public:
    180:          int myPointer;
    181:          char myData[SizeItem - SizePointer];
    182:       };
    183:
    184:
    185:       // pages - consist of header and array of indexes
    186:       class Page
    187:       {
    188:       public:
    189:          // constructors and destructor
    190:          Page();
    191:          Page(char*);
    192:          Page(Index&,int);
    193:          Page(Index*, long, int, int);
    194:          ~Page(){}
    195:
    196:          // insertion and searchoperations
    197:          int Insert(Index&, int findOnly = FALSE);
    198:          int Find(Index& idx) { return Insert(idx,TRUE); }
    199:          int InsertLeaf(Index&);
    200:          int FindLeaf(Index&,int findOnly);
    201:          int InsertNode(Index&,int findOnly = FALSE);
    202:          void Push(Index&,long offset=0L, int=TRUE);
    203:
    204:          // accessors
    205:          Index GetFirstIndex() { return myKeys[0]; }
    206:          int GetIsLeaf() const { return myVars.IsLeaf; }
    207:          int GetCount() const { return myVars.myCount; }
    208:          void SetCount(int cnt) { myVars.myCount=cnt; }
    209:          time_t GetTime() { return myTime; }
    210:
    211:          // page manipulation
    212:          int GetPageNumber() const { return myVars.myPageNumber; }
    213:          Page* GetPage(int page)
    214:             { return BTree::theDiskManager.GetPage(page); }
    215:          int NewPage(Index& rIndex, int IsLeaf)
    216:             { return BTree::theDiskManager.NewPage(rIndex,FALSE); }
    217:
    218:          int NewPage(Index* arr, int off,int isL, int cnt)
    219:             { return   BTree::theDiskManager.NewPage(arr,off,isL, cnt); }
    220:
    221:          // utility functions
    222:          void Nullify(long offset);
    223:          void Print();
    224:          fstream& Write(fstream&);
    225:          void ReCount();
    226:
    227:          static int GetgPageNumber(){ return gPage; }
    228:          static void SetgPageNumber(int pg) { gPage = pg; }
    229:
    230:       private:
    231:          Index * const myKeys; // will point to myVars.mk
    232:          union
    233:          {
    234:             char myBlock[PageSize];   // a page from disk
    235:             struct
    236:             {
    237:                int myCount;
    238:                int IsLeaf;
    239:                int myPageNumber;
    240:                int lastKey;
    241:                char overhead[8];
    242:                char mk[Order*sizeof(Index)];  // array of indexes
    243:             }myVars;
    244:          };
    245:
    246:          // memory only
    247:          static int gPage;
    248:          time_t myTime; // for lifo queue
    249:       };
    250:
    251:       #endif

Listing 12.2 Btree.cpp

    1:       // **************************************************
    2:       // PROGRAM:  Btree
    3:       // FILE:     btree.cpp
    4:       // PURPOSE:  provide fundamental btree functionality
    5:       // NOTES:
    6:       // AUTHOR:   Jesse Liberty (jl)
    7:       // REVISIONS: 11/1/94 1.0 jl  initial release
    8:       // **************************************************
    9:
    10:      #include "btree.hpp"
    11:      #include "stdef.hpp"
    12:      #include <assert.h>
    13:      #include <ctype.h>
    14:      #include <stdlib.h>
    15: #if defined(_BC_16BIT) || defined(_BC32_BIT)
        #include <values.h>
        #endif
        #ifdef _MSVC_16BIT
        #define MAXINT 0x7FFF //(16 bit)
        #endif
        #ifdef _MSVC_32BIT
        #define 0x7FFFFFFF //(32 bit)
        #endif
    16:
    17:      // construct the tree
    18:      // initialize myRoot pointer to nothing
    19:      // either create or read the index file
    20:      BTree::BTree():myRoot(0)
    21:      {
    22:         myRoot = theDiskManager.Create();
    23:         theDataFile.Create();
    24:         theWNJFile.Create();
    25:
    26:      }
    27:
    28:      // write out the index file
    29:      BTree::~BTree()
    30:      {
    31:         theDiskManager.Close(myRoot);
    32:         theDataFile.Close();
    33:         theWNJFile.Close();
    34:      }
    35:
    36:      // find an existing record
    37:      int BTree::Find(char * str)
    38:      {
    39:         Index index(str);
    40:         if (!myRoot)
    41:            return 0L;
    42:         else
    43:            return GetPage(myRoot)->Find(index);
    44:      }
    45:
    46:      void BTree::Insert(char * buffer, int argc, char** argv)
    47:      {
    48:         // get each word,
    49:         long offset = theDataFile.Insert(buffer);
    50:         for (int i = 1; i<argc; i++)
    51:         {
    52:            AddKey(argv[i],offset);
    53:         }
    54:
    55:      }
    56:
    57:      void BTree::Insert(char* buffer)
    58:      {
    59:
    60:         if (strlen(buffer) < 3)
    61:            return;
    62:
    63:          char *buff = buffer;
    64:          char word[PageSize];
    65:          int wordOffset = 0;
    66:          long offset;
    67:
    68:         if (GetWord(buff,word,wordOffset))
    69:         {
    70:            offset = theDataFile.Insert(buffer);
    71:            AddKey(word,offset);
    72:         }
    73:
    74:
    75:         while (GetWord(buff,word,wordOffset))
    76:         {
    77:            AddKey(word,offset);
    78:         }
    79:
    80:      }
    81:
    82:      void BTree::AddKey(char * str, long offset )
    83:      {
    84:
    85:         if (strlen(str) < 3)
    86:            return;
    87:
    88:         int retVal =0;
    89:         assert (offset < MAXINT);
    90:         Index index(str,(int)offset);
    91:         if (!myRoot)
    92:         {
    93:            myRoot = theDiskManager.NewPage (index,TRUE);
    94:         }
    95:         else
    96:         {
    97:            retVal = GetPage(myRoot)->Insert(index);
    98:            if (retVal) // our root split
    99:            {
    100:              // create a pointer to the old top
    101:              Index index(GetPage(myRoot)->GetFirstIndex());
    102:              index.SetPointer(myRoot);
    103:              // make the new page & insert the index
    104:              int PageNumber = NewPage(index,FALSE);
    105:
    106:              Page* pg = GetPage(PageNumber);
    107:
    108:              //get a pointer to the new (sibling) page
    109:              Index Sib(GetPage(retVal)->GetFirstIndex());
    110:              Sib.SetPointer(retVal);
    111:              // put it into the page
    112:              pg->InsertLeaf(Sib);
    113:
    114:              // reset myRoot to point to the new top
    115:              myRoot = PageNumber;
    116:           }
    117:        }
    118:     }
    119:
    120:     void BTree::PrintTree()
    121:     {
    122:        GetPage(myRoot)->Print();
    123:
    124:        cout << "\n\nStats:" << endl;
    125:        cout << "Node pages:   " << NodePageCtr << endl;
    126:        cout << "Leaf pages:   " << LeafPageCtr << endl;
    127:        cout << "Node indexes: " << NodeIndexCtr << endl;
    128:        cout << "Leaf indexes: " << LeafIndexCtr << endl;
    129:        for (int i = 0; i < Order +2; i++)
    130:        {
    131:           if (NodeIndexPerPage[i])
    132:           {
    133:              cout << "Pages with " << i << " nodes:  ";
    134:              cout << NodeIndexPerPage[i] << endl;
    135:           }
    136:           if (LeafIndexPerPage[i])
    137:           {
    138:              cout << "Pages with " << i << " leaves: ";
    139:              cout << LeafIndexPerPage[i] << endl;
    140:           }
    141:        }
    142:
    143:     }
    144:
    145:     int BTree::GetWord(char* string, char* word, int& wordOffset)
    146:     {
    147:
    148:        if (!string[wordOffset])
    149:           return FALSE;
    150:
    151:        char *p1, *p2;
    152:        p1 = p2 = string+wordOffset;
    153:
    154:        // eat leading spaces
    155:        for (int i = 0; i<(int)strlen(p1) && !(BTree::myAlNum(p1[0])); i++)
    156:           p1++;
    157:
    158:        // see if you have a word
    159:        if (!BTree::myAlNum(p1[0]))
    160:           return FALSE;
    161:
    162:         p2 = p1; // point to start of word
    163:
    164:        // march p2 to end of word
    165:        while (BTree::myAlNum(p2[0]))
    166:           p2++;
    167:
    168:        int len = int(p2-p1);
    169: #if defined(_MSVC_16BIT) || defined(_MSVC_32BIT)
       : {
       :    len = __min(len,(int)PageSize);
       : }
       : #else
       : {
       :    len = min(len,(int)PageSize);
       : }
       : #endif
    170:
    171:        strncpy (word,p1,len);
    172:        word[len]='\0';
    173:
    174:        for (i = int (p2-string); i<(int)strlen(string) &&
                !BTree::myAlNum(p2[0]); i++)
    175:           p2++;
    176:
    177:        wordOffset = int (p2-string);
    178:
    179:        return TRUE;
    180:     }
    181:
    182:     int BTree::myAlNum(char ch)
    183:     {
    184:        return isalnum(ch) ||
    185:           ch == '-' ||
    186:           ch == '\'' ||
    187:           ch == '(' ||
    188:           ch == ')';
    189:     }

Listing 12.3 Datafile.cpp

    1:      // **************************************************
    2:      // PROGRAM:  Data file
    3:      // FILE:     datafile.cpp
    4:      // PURPOSE:
    5:      // NOTES:
    6:      // AUTHOR:   Jesse Liberty (jl)
    7:      // REVISIONS: 11/5/94 1.0 jl  initial release
    8:      // **************************************************
    9:
    10:     #include "btree.hpp"
    11:     #include <assert.h>
    12:
    13:     // on construction, try to open the file if it exists
    14:     DataFile::DataFile():
    15:        myFile("ROBIN.DAT",
    16:        ios::binary | ios::in | ios::out | ios::nocreate | ios::app)
    17:     {
    18:     }
    19:
    20:     void DataFile::Create()
    21:     {
    22:        if (!myFile) // nocreate failed, first creation
    23:        {
    24:
    25:           // open the file, create it this time
    26:           myFile.clear();
    27:
    28:           myFile.open
    29:              ("ROBIN.DAT",ios::binary | ios::in | ios::out | ios::app);
    30:
    31:           char Header[2];
    32:           int MagicNumber = 1234; // a number we can check for
    33:           memcpy(Header,&MagicNumber,2);
    34:           myFile.clear();
    35:           myFile.flush();
    36:           myFile.seekp(0);
    37:           myFile.write(Header,2);
    38:           return;
    39:        }
    40:
    41:        // we did open the file, it already existed
    42:        // get the numbers we need
    43:        int MagicNumber;
    44:        myFile.seekg(0);
    45:        myFile.read((char *) &MagicNumber,2);
    46:
    47:        // check the magic number. If it is wrong the file is
    48:        // corrupt or this isn't the index file
    49:        if (MagicNumber != 1234)
    50:        {
    51:              // change to an exception!!
    52:              cout << "DataFile Magic number failed!";
    53:        }
    54:        return;
    55:     }
    56:
    57:     // write out the numbers we'll need next time
    58:     void DataFile::Close()
    59:     {
    60:        myFile.close();
    61:     }
    62:
    63:
    64:     long DataFile::Insert(char * newNote)
    65:     {
    66:        int len = strlen(newNote);
    67:        int fullLen = len + 2;
    68:        char buffer[PageSize];
    69:        memcpy(buffer,&len,2);
    70:        memcpy(buffer+2,newNote,len);
    71:        myFile.clear();
    72:        myFile.flush();
    73:        myFile.seekp(0,ios::end);
    74:        long offset = myFile.tellp();
    75:        myFile.write(buffer,fullLen);
    76:        myFile.flush();
    77:        return offset;
    78:     }
    79:
    80:     void DataFile::GetRecord(long offset, char* buffer)
    81:     {
    82:        int len;
    83:        char tmpBuff[PageSize];
    84:        myFile.flush();
    85:        myFile.clear();
    86:        myFile.seekg(offset);
    87:        myFile.read(tmpBuff,PageSize);
    88:        memcpy(&len,tmpBuff,2);
    89:        strncpy(buffer,tmpBuff+2,len);
    90:        buffer[len] = '\0';
    91:     }

Listing 12.4 Idxmgr.cpp

    1:       // **************************************************
    2:       // PROGRAM:  diskmgr
    3:       // FILE:     diskmgr.cpp
    4:       // PURPOSE:  provide i/o for btree
    5:       // NOTES:
    6:       // AUTHOR:   Jesse Liberty (jl)
    7:       // REVISIONS: 11/5/94 1.0 jl  initial release
    8:       // **************************************************
    9:
    10:      #include "btree.hpp"
    11:      #include <assert.h>
    12:
    13:      // on construction, try to open the file if it exists
    14:      DiskManager::DiskManager():
    15:         myFile("ROBIN.IDX",ios::binary | ios::in | ios::out | ios::nocreate)
    16:      {
    17:         // initialize the pointers to null
    18:         for (int i = 0; i< MaxPages; i++)
    19:            myPages[i] = 0;
    20:         myCount = 0;
    21:
    22:      }
    23:
    24:      // called by btree constructor
    25:      // if we opened the file, read in the numbers we need
    26:      // otherwise create the file
    27:      int DiskManager::Create()
    28:      {
    29:         if (!myFile) // nocreate failed, first creation
    30:         {
    31:            // open the file, create it this time
    32:            myFile.open("ROBIN.IDX",ios::binary | ios::in | ios::out);
    33:
    34:            char Header[PageSize];
    35:            int MagicNumber = 1234; // a number we can check for
    36:            memcpy(Header,&MagicNumber,2);
    37:            int NextPage = 1;
    38:            memcpy(Header+2,&NextPage,2);
    39:            memcpy(Header+4,&NextPage,2);
    40:            Page::SetgPageNumber(NextPage);
    41:            char title[]="ROBIN.IDX. Ver 1.00";
    42:            memcpy(Header+6,title,strlen(title));
    43:            myFile.flush();
    44:            myFile.clear();
    45:            myFile.seekp(0);
    46:            myFile.write(Header,PageSize);
    47:            return 0;
    48:         }
    49:
    50:         // we did open the file, it already existed
    51:         // get the numbers we need
    52:         int MagicNumber;
    53:         myFile.seekg(0);
    54:         myFile.read((char *) &MagicNumber,2);
    55:
    56:         // check the magic number. If it is wrong the file is
    57:         // corrupt or this isn't the index file
    58:         if (MagicNumber != 1234)
    59:         {
    60:               // change to an exception!!
    61:               cout << "Index Magic number failed!";
    62:               return 0;
    63:         }
    64:
    65:         int NextPage;
    66:         myFile.seekg(2);
    67:         myFile.read((char*) &NextPage,2);
    68:         Page::SetgPageNumber(NextPage);
    69:         int FirstPage;
    70:         myFile.seekg(4);
    71:         myFile.read((char*) &FirstPage,2);
    72:         const int room =  PageSize  6;
    73:         char buffer[room];
    74:         myFile.read(buffer,room);
    75:         buffer[room]='\0';
    76:         cout << buffer << endl;
    77:         // read in all the pages
    78:         for (int i = 1; i < NextPage; i++)
    79:         {
    80:               myFile.seekg(i * PageSize);
    81:               char buffer[PageSize];
    82:               myFile.read( buffer, PageSize);
    83:               Page * pg = new Page(buffer);
    84:               Insert(pg);
    85:         }
    86:
    87:         return FirstPage;
    88:      }
    89:
    90:      // write out the numbers we'll need next time
    91:      void DiskManager::Close(int theRoot)
    92:      {
    93:
    94:         for (int i = 0; i< MaxPages; i++)
    95:            if (myPages[i])
    96:               Save(myPages[i]);
    97:         int NextPage = Page::GetgPageNumber();
    98:         if (!myFile)
    99:            cout << "Error opening myFile!" << endl;
    100:        myFile.flush();
    101:        myFile.clear();
    102:        myFile.seekp(2);
    103:        myFile.write ( (char *) &NextPage,2);
    104:        myFile.seekp(4);
    105:        myFile.write((char*) &theRoot,2);
    106:        myFile.close();
    107:     }
    108:
    109:     // wrapper function
    110:     int DiskManager::NewPage(Index& index, int bLeaf)
    111:     {
    112:        Page * newPage = new Page(index, bLeaf);
    113:        Insert(newPage);
    114:        Save(newPage);
    115:        return newPage->GetPageNumber();
    116:     }
    117:
    118:     int DiskManager::NewPage(
    119:        Index *array,
    120:        long offset,
    121:        int leaf,
    122:        int count)
    123:     {
    124:        Page * newPage = new Page(array, offset, leaf,count);
    125:        Insert(newPage);
    126:        Save(newPage);
    127:        return newPage->GetPageNumber();
    128:     }
    129:
    130:     void DiskManager::Insert(Page * newPage)
    131:     {
    132:
    133:        // add new page into array of page managers
    134:        if (myCount < MaxPages)
    135:        {
    136:           assert(!myPages[myCount]);
    137:           myPages[myCount++] = newPage;
    138:        }
    139:        else  // no room, time to page out to disk
    140:        {
    141:           int lowest = 0;
    142:           for (int i = 0; i< MaxPages; i++)
    143:              if (myPages[i]->GetTime() < myPages[lowest]->GetTime())
    144:                 lowest = i;
    145:           Save(myPages[lowest]);
    146:           delete myPages[lowest];
    147:           myPages[lowest] = newPage;
    148:        }
    149:     }
    150:
    151:     // tell the page to write itself out
    152:     void DiskManager::Save(Page* pg)
    153:     {
    154:       pg->Write(myFile);
    155:     }
    156:
    157:     // see if the page is in memory, if so return it
    158:     // otherwise get it from disk
    159:
    160:     Page * DiskManager::GetPage(int target)
    161:     {
    162:
    163:        for (int i = 0; i< MaxPages; i++)
    164:        {
    165:           if (myPages[i]->GetPageNumber() == target)
    166:              return  myPages[i];
    167:        }
    168:        myFile.seekg(target * PageSize);
    169:        char buffer[PageSize];
    170:        myFile.read( buffer, PageSize);
    171:        Page * pg = new Page(buffer);
    172:        Insert(pg);
    173:        return pg;
    174:     }

Listing 12.5 Index.cpp

    1:      // **************************************************
    2:      // PROGRAM:  index
    3:      // FILE:     index.cpp
    4:      // PURPOSE:  provide fundamental btree functionality
    5:      // NOTES:
    6:      // AUTHOR:   Jesse Liberty (jl)
    7:      // REVISIONS: 11/1/94 1.0 jl  initial release
    8:      // **************************************************
    9:
    10:     #include "btree.hpp"
    11:     #include <ctype.h>
    12:
    13:     Index::Index(char* str):myPointer(1)
    14:     {
    15:        strncpy(myData,str,dataLen);
    16:        myData[dataLen]='\0';
    17:     }
    18:
    19:     Index::Index(char* str, int ptr):myPointer(ptr)
    20:     {
    21:
    22:        strncpy(myData,str,dataLen);
    23:        myData[dataLen]='\0';
    24:     }
    25:
    26:     Index::Index(Index& rhs):
    27:        myPointer(rhs.GetPointer())
    28:     {
    29:        strcpy(myData, rhs.GetData());
    30:     }
    31:
    32:     Index::Index():myPointer(0)
    33:     {
    34:        myData[0]='\0';
    35:     }
    36:
    37:     void Index::PrintKey()
    38:     {
    39:        cout << "  " << myData;
    40:     }
    41:
    42:     void Index::PrintPage()
    43:     {
    44:        cout << "\n" << myData << ": " ;
    45:        BTree::theDiskManager.GetPage(myPointer)->Print();
    46:     }
    47:
    48:     int Index::Insert(Index& ref, int findOnly)
    49:     {
    50:        return BTree::theDiskManager.GetPage(myPointer)->Insert(ref,findOnly);
    51:     }
    52:
    53:     int Index::operator==(const Index& rhs) const
    54:     {
    55:        int len1 = strlen(myData);
    56:        int len2 = strlen(rhs.GetData());
    57:        char w1[PageSize];
    58:        char w2[PageSize];
    59:        for (int i = 0; i<len1; i++)
    60:           w1[i] = toupper(myData[i]);
    61:        for (i = 0; i<len2; i++)
    62:           w2[i] = toupper(rhs.GetData()[i]);
    63:        w1[len1] = '\0';
    64:        w2[len2]='\0';
    65:        return (strcmp(w1,w2) == 0);
    66:     }

Listing 12.6 Page.cpp

    1:      // **************************************************
    2:      // PROGRAM:  Page
    3:      // FILE:     Page.cpp
    4:      // PURPOSE:  provide fundamental btree functionality
    5:      // NOTES:
    6:      // AUTHOR:   Jesse Liberty (jl)
    7:      // REVISIONS: 11/1/94 1.0 jl  initial release
    8:      // **************************************************
    9:
    10:     #include "btree.hpp"
    11:     #include <assert.h>
    12:     #include <values.h>
    13:
    14:     // constructors
    15:
    16:     // default constructor
    17:     Page::Page()
    18:     {
    19:     }
    20:
    21:     // create a page from a buffer read in from disk
    22:     Page::Page(char *buffer):
    23:        myKeys((Index*)myVars.mk)
    24:     {
    25:        assert(sizeof(myBlock) == PageSize);
    26:        assert(sizeof(myVars) == PageSize);
    27:        memcpy(&myBlock,buffer,PageSize);
    28:        myTime = time(NULL);
    29:     }
    30:
    31:     // create a Page from the first index
    32:     Page::Page(Index& index, int bLeaf):
    33:        myKeys((Index*)myVars.mk)
    34:     {
    35:
    36:        myVars.myCount=1;
    37:        myVars.IsLeaf = bLeaf;
    38:        // if this is a leaf, this is the first
    39:        // index on the first page, set its pointer
    40:        // based on creating a new wnj. otherwise
    41:        // you are here creating a new node, do not
    42:        // set the pointer, it is already set.
    43:        if (bLeaf)
    44:           index.SetPointer(BTree::theWNJFile.Append(index.GetPointer()));
    45:        myKeys[0]=index;
    46:        myVars.myPageNumber = gPage++;
    47:        myTime = time(NULL);
    48:     }
    49:
    50:     // create a page by splitting another page
    51:     Page::Page(Index *array, long offset, int bLeaf,int count):
    52:        myKeys((Index*)myVars.mk)
    53:     {
    54:        myVars.IsLeaf = bLeaf;
    55:        myVars.myCount = 0;
    56:        assert (offset < MAXINT);
    57:        int i = 0;
    58:        int j = (int) offset;
    59:        for (; j<Order && i < count; i++, j++)
    60:        {
    61:           myKeys[i]= array[j];
    62:           myVars.myCount++;
    63:        }
    64:        myVars.myPageNumber = gPage++;
    65:        myTime = time(NULL);
    66:     }
    67:
    68:     void Page::Nullify(long offset)
    69:     {
    70:         assert (offset < MAXINT);
    71:        for (int i = (int)offset; i<Order; i++)
    72:        {
    73:              myKeys[i].SetPointer(0);
    74:              myVars.myCount--;
    75:        }
    76:     }
    77:
    78:     // decide whether I'm a leaf or a node
    79:     // and pass this index to the right
    80:     // function. If findOnly is true, don't insert
    81:     // just return the page number (for now)
    82:     int Page::Insert(Index& rIndex, int findOnly)
    83:     {
    84:        if (myVars.IsLeaf)
    85:          return   FindLeaf(rIndex,findOnly);
    86:        else
    87:          return InsertNode(rIndex,findOnly);
    88:     }
    89:
    90:
    91:     // find the right page for this index
    92:     int Page::InsertNode(Index& rIndex, int findOnly)
    93:     {
    94:        int retVal =0;
    95:        int inserted = FALSE;
    96:        int i,j;
    97:
    98:        assert(myVars.myCount>0); // nodes have at least 1
    99:        assert(myKeys[0].GetPointer()); // must be valid
    100:
    101:       // does it go before my first entry?
    102:       if (rIndex < myKeys[0])
    103:       {
    104:          if (findOnly)
    105:             return 0L; // not found
    106:
    107:          myKeys[0].SetData(rIndex);
    108:          retVal=myKeys[0].Insert(rIndex);
    109:          inserted = TRUE;
    110:       }
    111:
    112:       // does it go after my last?
    113:       if (!inserted)
    114:          for (i = myVars.myCount-1; i>=0; i--)
    115:          {
    116:             assert(myKeys[i].GetPointer());
    117:             if (rIndex >= myKeys[i])
    118:             {
    119:                 retVal=myKeys[i].Insert(rIndex,findOnly);
    120:                 inserted = TRUE;
    121:                 break;
    122:             }
    123:        }
    124:
    125:       // find where it does go
    126:       if (!inserted)
    127:          for (j = 0; j<i && j+1 < myVars.myCount; j++)
    128:          {
    129:             assert(myKeys[j+1].GetPointer());
    130:             if (rIndex < myKeys[j+1])
    131:             {
    132:                retVal=myKeys[j].Insert(rIndex,findOnly);
    133:                inserted = TRUE;
    134:                break;
    135:             }
    136:          }
    137:
    138:       assert(inserted);  // change to exception if not!
    139:
    140:       // if you had to split
    141:       if (retVal && !findOnly) // got back a pointer to a new page
    142:       {
    143:          Index * pIndex = new Index(GetPage(retVal)->GetFirstIndex());
    144:          pIndex->SetPointer(retVal);
    145:          retVal = InsertLeaf(*pIndex);
    146:       }
    147:       return retVal;
    148:    }
    149:
    150:    // called if current page is a leaf
    151:    int Page::FindLeaf(Index& rIndex, int findOnly)
    152:    {
    153:       int result = 0;
    154:
    155:       // no duplicates!
    156:       for (int i=0; i < myVars.myCount; i++)
    157:             if (rIndex == myKeys[i])
    158:             {
    159:                if (findOnly)               // return first WNJ
    160:                   //return BTree::theWNJFile.Find(myKeys[i].GetPointer());
    161:                   return myKeys[i].GetPointer();
    162:                else
    163:                   return BTree::theWNJFile.Insert(
    164:                               rIndex.GetPointer(),
    165:                                  myKeys[i].GetPointer());
    166:             }
    167:
    168:       if (findOnly)         // not found
    169:          return result;
    170:
    171:       // this index item does not yet exist
    172:       // before you push it into the index
    173:       // push an entry into the wnj.idx
    174:       // and set the index to point to that entry
    175:       rIndex.SetPointer(BTree::theWNJFile.Append(rIndex.GetPointer())); //
               new!
    176:       return InsertLeaf(rIndex);
    177:    }
    178:
    179:    int Page::InsertLeaf(Index& rIndex)
    180:    {
    181:       int result = 0;
    182:       if (myVars.myCount < Order)
    183:          Push(rIndex);
    184:       else      // overflow the page
    185:       {
    186:             // make sibling
    187:             int NewPg =
    188:                NewPage(myKeys,Order/2,myVars.IsLeaf,myVars.myCount);
    189:             Page* Sibling = GetPage(NewPg);
    190:             Nullify(Order/2);                     // nullify my right half
    191:
    192:             // does it fit in this side?
    193:             if (myVars.myCount>Order/2-1 && rIndex <= myKeys[Order/2-1])
    194:                Push(rIndex);
    195:             else  // push it into the new sibling
    196:                Sibling->Push(rIndex);
    197:
    198:             result = NewPg; // we split, pass it up
    199:          }
    200:        return result;
    201:    }
    202:
    203:    // put the new index into this page (in order)
    204:    void Page::Push(Index& rIndex,long offset,int first)
    205:    {
    206:       int inserted = FALSE;
    207:       assert(myVars.myCount < Order);
    208:       assert (offset < MAXINT);
    209:
    210:       for (int i=(int)offset; i<Order && i<myVars.myCount; i++)
    211:       {
    212:          assert(myKeys[i].GetPointer());
    213:          if (rIndex <= myKeys[i])
    214:          {
    215:                Push(myKeys[i],offset+1,FALSE);
    216:                myKeys[i]=rIndex;
    217:                inserted = TRUE;
    218:                break;
    219:          }
    220:       }
    221:       if (!inserted)
    222:          myKeys[myVars.myCount] = rIndex;
    223:
    224:       if (first)
    225:          myVars.myCount++;
    226:    }
    227:
    228:
    229:    void Page::Print()
    230:    {
    231:          if (!myVars.IsLeaf)
    232:          {
    233:             BTree::NodePageCtr++;
    234:             BTree::NodeIndexPerPage[myVars.myCount]++;
    235:             BTree::NodeIndexCtr+=myVars.myCount;
    236:          }
    237:          else
    238:          {
    239:             BTree::LeafPageCtr++;
    240:             BTree::LeafIndexPerPage[myVars.myCount]++;
    241:             BTree::LeafIndexCtr+=myVars.myCount;
    242:          }
    243:
    244:          for (int i = 0; i<Order && i < myVars.myCount; i++)
    245:          {
    246:             assert(myKeys[i].GetPointer());
    247:             if (myVars.IsLeaf)
    248:                myKeys[i].PrintKey();
    249:             else
    250:                myKeys[i].PrintPage();
    251:          }
    252:    }
    253:
    254:    // write out the entire page as a block
    255:    fstream& Page::Write(fstream& file)
    256:    {
    257:       char buffer[PageSize];
    258:       memcpy(buffer,&myBlock,PageSize);
    259:       file.flush();
    260:       file.clear();
    261:       file.seekp(myVars.myPageNumber*PageSize);
    262:       file.write(buffer,PageSize);
    263:       return file;
    264:    }

Listing 12.7 WNJFile.cpp

    1:       // **************************************************
    2:       // PROGRAM:  WNJ file
    3:       // FILE:     WNJfile.cpp
    4:       // PURPOSE:  Cross index from key to datafile
    5:       // NOTES:
    6:       // AUTHOR:   Jesse Liberty (jl)
    7:       // REVISIONS: 11/5/94 1.0 jl  initial release
    8:       // **************************************************
    9:
    10:      #include "btree.hpp"
    11:      #include <assert.h>
    12:
    13:      // on construction, try to open the file if it exists
    14:      WNJFile::WNJFile():
    15:         myFile("ROBINWNJ.IDX",
    16:         ios::binary | ios::in | ios::out | ios::nocreate)
    17:      {
    18:         for (int i = 0; i<WNJSize; i++)
    19:            myInts[i]=0L;
    20:      }
    21:
    22:      void WNJFile::Create()
    23:      {
    24:         char Header[8];
    25:         int MagicNumber=0; // a number we can check for
    26:         int zero = 0;
    27:
    28:         if (!myFile) // nocreate failed, first creation
    29:         {
    30:            // open the file, create it this time
    31:            myFile.clear();
    32:            myFile.open("ROBINWNJ.IDX",
    33:               ios::binary | ios::in | ios::out);
    34:
    35:            MagicNumber = 1234;
    36:            memcpy(Header,&MagicNumber,2);
    37:            memcpy(Header+2,&zero,2);
    38:            myFile.seekp(0);
    39:            myFile.write(Header,4);
    40:            myFile.flush();
    41:            return;
    42:         }
    43:
    44:         // we did open the file, it already existed
    45:         // get the numbers we need
    46:
    47:
    48:         myFile.seekg(0);
    49:         myFile.read(Header,4);
    50:         memcpy(&MagicNumber,Header,2);
    51:         memcpy(&myCount,Header+2,2);
    52:
    53:         // check the magic number. If it is wrong the file is
    54:         // corrupt or this isn't the index file
    55:         if (MagicNumber != 1234)
    56:         {
    57:               // change to an exception!!
    58:               cout << "WNJ Magic number failed!";
    59:               cout << "Magic number: " << MagicNumber;
    60:               cout << "\nmyCount: " << myCount << endl;
    61:         }
    62:         return;
    63:      }
    64:
    65:      // write out the numbers we'll need next time
    66:      void WNJFile::Close()
    67:      {
    68:         myFile.seekg(2);
    69:         myFile.write((char*)&myCount,2);
    70:         myFile.close();
    71:      }
    72:
    73:      int WNJFile::Append(int DataOffset)
    74:      {
    75:
    76:         int newPos = 4 + myCount++ * (WNJSize*SizePointer);
    77:         int offsets[WNJSize];
    78:         offsets[0] = DataOffset;
    79:         for (int i = 1; i<WNJSize; i++)
    80:            offsets[i]=0;
    81:         myFile.seekg(newPos);
    82:         myFile.write((char*)offsets,WNJSize*SizePointer);
    83:
    84:         return newPos;
    85:      }
    86:
    87:
    88:      int WNJFile::Insert(int DataOffset,int WNJOffset)
    89:      {
    90:         int ints[WNJSize];
    91:         myFile.seekg(WNJOffset);
    92:         myFile.read((char*)ints,WNJSize*SizePointer);
    93:
    94:         int offset=WNJOffset;
    95:
    96:         while (ints[WNJSize-1])
    97:         {
    98:            offset = ints[WNJSize-1];
    99:            myFile.clear();
    100:           myFile.flush();
    101:           myFile.seekg(ints[WNJSize-1]);
    102:           myFile.read((char*)ints,WNJSize*SizePointer);
    103:        }
    104:        if (ints[WNJSize-2]) // full!
    105:        {
    106:           ints[WNJSize-1] = Append(DataOffset);
    107:           myFile.clear();
    108:           myFile.flush();
    109:           myFile.seekg(offset);
    110:           myFile.write((char*)ints,WNJSize*SizePointer);
    111:        }
    112:        else
    113:        {
    114:           for (int i = 0; i<WNJSize-1; i++)
    115:           {
    116:              if (ints[i] == 0)
    117:              {
    118:                 ints[i] = DataOffset;
    119:                 myFile.clear();
    120:                 myFile.flush();
    121:                 myFile.seekg(offset);
    122:                 myFile.write((char*)ints,WNJSize*SizePointer);
    123:                 break;
    124:              }
    125:           }
    126:        }
    127:        return 0;
    128:     }
    129:
    130:
    131:     int* WNJFile::Find(int NextWNJ)
    132:     {
    133:        int ints[WNJSize];
    134:        int * results = new int[PageSize];
    135:
    136:        int i = 0, j=0;
    137:
    138:        while (j<PageSize)
    139:           results[j++] = 0;
    140:
    141:        j = 0;
    142:
    143:        myFile.seekg(NextWNJ);
    144:        myFile.read((char*)ints, WNJSize*SizePointer);
    145:        while (j < PageSize)
    146:        {
    147:           if (ints[i])
    148:           {
    149:              if (i == WNJSize-1)
    150:              {
    151:                 myFile.seekg(ints[WNJSize-1]);
    152:                 myFile.read((char*)ints,WNJSize*2);
    153:                 i = 0;
    154:                 continue;
    155:              }
    156:           results[j++] = ints[i++];
    157:           }
    158:           else
    159:              break;
    160:        }
    161:        return results;
    162:     }

Listing 12.8 Driver.cpp

    1:     #include "String.hpp"
    2:     #include "stdef.hpp"
    3:     #include "btree.hpp"
    4:     DiskManager BTree::theDiskManager;
    5:     DataFile BTree::theDataFile;
    6:     WNJFile BTree::theWNJFile;
    7:
    8:     int WNJFile::myCount=0L;
    9:     int Page::gPage=1;
    10:    int BTree::NodeIndexCtr=0;
    11:    int BTree::LeafIndexCtr=0;
    12:    int BTree::NodePageCtr=0;
    13:    int BTree::LeafPageCtr=0;
    14:    int BTree::NodeIndexPerPage[Order+1];
    15:    int BTree::LeafIndexPerPage[Order+1];
    16:
    17:    int main(int argc, char ** argv)
    18:    {
    19:       BTree myTree;
    20:
    21:       for (int i = 0; i < Order +2; i++)
    22:       {
    23:          BTree::NodeIndexPerPage[i] =0;
    24:          BTree::LeafIndexPerPage[i] = 0;
    25:       }
    26:
    27:       char buffer[PageSize+1];
    28:
    29:       for (;;)
    30:       {
    31:          cout << "Enter a string (blank to stop): ";
    32:          cin.getline(buffer,PageSize);
    33:          // cin.ignore(PageSize,'\n');
    34:          if (buffer[0])
    35:             myTree.Insert(buffer);
    36:          else
    37:             break;
    38:       }
    39:
    40:       for (;;)
    41:       {
    42:          cout << "Find: ";
    43:          cin.getline(buffer,255);
    44:          if (buffer[0])
    45:          {
    46:                int offset = myTree.Find(buffer);
    47:                if (offset)
    48:                {
    49:                   int* found  =  BTree::theWNJFile.Find(offset);
    50:                   int i=0;
    51:                   int len;
    52:                   time_t theTime;
    53:                   char buffer[512];
    54:                   while (found[i])
    55:                   {
    56:                      BTree::theDataFile.GetRecord(found[i],buffer);
    57:                      struct tm * ts = localtime(&theTime);
    58:                      cout << "Found: ";
    59:                      cout << ts->tm_mon << "/";
    60:                      cout << ts->tm_mday << "/";
    61:                      cout << ts->tm_year << " ";
    62:                      cout <<  buffer << endl;
    63:                      i++;
    64:                   }
    65:                   delete [] found;
    66:                }
    67:          }
    68:          else
    69:             break;
    70:      }
    71:
    72:      myTree.PrintTree();
    73:
    74:
    75:      return 0;
    76:   }

Output:

    d:\day12>ROBIN
    Enter a string (blank to stop): Computer: 486 33MHZ 16MB RAM
    Enter a string (blank to stop): Remember to buy more RAM
    Enter a string (blank to stop): The show is at noon.
    Enter a string (blank to stop): Once more into the breach
    Enter a string (blank to stop):
    Find: more
    Found: 10/25/94 Remember to buy more RAM
    Found: 10/25/94 Once more into the breach
    Find:
      16MB  33MHZ  486  Computer  Once  RAM  Remember  The  breach  buy  into
      more
     noon  show

    Stats:
    Node pages:   0
    Leaf pages:   1
    Node indexes: 0
    Leaf indexes: 14
    Pages with 14 leaves: 1

    d:\day12>type ROBIN.DAT
    -:· ·   /°+.Computer: 486 33MHZ 16MB RAM·   3°+.Remember to buy more RAM·
                                                6°+.
    The show is at noon·   ;°+.Once more into the breach

Analysis:

The first and most significant change is the addition of a data file and a WNJFile. The data file is named ROBIN.DAT and is opened much as the other files have been. However, this time the ios:app flag is used to indicate that the file should be opened in append mode. When a file is opened in append mode, all writes are to the end of the file, regardless of where the tellp() and tellg() flags are positioned.

The Insert() method, shown in line 64 of listing 12.2, could have been as simple as the write() shown in line 75, were it not that you need to know the offset in the file at which the text is written. Because the write() does not move the tellp() pointer, you must do so explicitly. The goal is to move to the end of the file before doing the write, so that you can note where the write starts (which becomes the offset of the string in the file).

The call to seekp() in line 73 takes two parameters. The second parameter is ios::end, which indicates that the pointer is to move to the end and then move back as many bytes as indicated by the first parameter. Because the first parameter is 0, this moves no bytes from the end of the file, and that value is stashed away in the local variable, offset.

It is this value, the offset of the string written to the DAT file, that is stored in the WNJFile index array.

To test the changes to the database, a new, temporary interface has been put into place. Instead of reading the command line, the driver program prompts repeatedly for new notes. Four notes are entered, and when a blank note is entered, the program prompts for a string for which to search.

Note that the word more appears only once in the index, but refers to two notes. The statistics show that no nodes have been created yet, because this is an order 31 B-tree and only 14 indexed values have been found. Note also that each note is prefaced by 8 bytes of binary information the length of the note and the date.

The driver program, shown in listing 12.8, begins by defining the three static members: BTree::theDiskManager, BTree::theDataFile, and BTree::theWNJFile. It initializes the static counter variables: BTree::NodeIndexCtr, BTree::LeafIndexCtr, BTree::NodePageCtr, BTree::LeafPageCtr, BTree::NodeIndexPerPage, and BTree::LeafIndexPerPage.

The user is prompted repeatedly for notes in lines 31 through 38, and each note is inserted into the tree in line 35. In line 42, the user is prompted for a string to find, and that string is passed to the tree in line 46. If an index into the WNJ (Word-Node-Join) file is returned, that index is used in line 49 to access the WNJ array of records. For each record, the actual note is retrieved from the data file in line 56. The 4-byte time value is passed to the standard library's localtime() function, and a pointer to a time structure is returned, which is used to print the time in lines 58 through 61.

The most significant change in this version of the program is the addition of the Word-Node-Join index, as shown in listing 12.7. Note that in this preliminary version, the value 5 is hard coded for WNJSize, but in a release version, you would want to make this configurable.

The logic of the Insert() operation is to check whether the current array has a value in its last entry (WNJSize-1). If so, that is a pointer to the next array of WNJ entries. You chase these linked arrays until you find one where WNJSize-1 is empty. At that point, it is time to insert a new value.

If the next to last (WNJSize-2) already is occupied, you must create a new array and link it into place, as shown in lines 105 through 112. Otherwise, you can add the new data offset right into the first available slot, as shown in lines 113 through 126.

Finding a WNJFile record is the inverse operation, as shown in lines 132 through 162. You chase each record, adding it to the results array, and then return that array to the calling function.

Note that the calling function will delete the array after it finishes displaying the results (line 65 of the driver program). Note also that because this is an array of pointers, you must use the bracket operator with delete.

Summary

Today you learned how to create data files and to connect keys to multiple notes. You created a word-node-join record, which provided the necessary level of abstraction to allow a single key to point to multiple notes. Rather than attempting to create fixed-length records, which would have been overly restrictive and wasteful of disk space, you recorded the length of each record in the index.

You then added a number of small but important features, including eliminating case sensitivity. This chapter represents the final code for the first version of the ROBIN Personal Information Management Utility.

Q&A

Q: The entire text of each key is stored in the index file. Why doesn't it just point to an instance of the word in the data file?

A: It is true that there is a data size cost to storing a copy of the key in the index file, but with today's computers, it is almost always the right choice. First of all, hard disk space currently is very cheap. You easily could store a copy of every English word that exists on most hard disks. Second, disk speeds still are relatively slow, as compared to processor speeds. If you were to put just a pointer to the word on the disk in the index rather than the word itself, this would mean that every comparison of a key would cost one more disk access. This is far too expensive. Finally, the version of the word in the data file has not been normalized to all uppercase for easy comparison. You would have to convert it each time it is used.

Q: Why bother with a Word-Node-Join? Why not just search for multiple-index entries?

A: Once again, the goal is to reduce the number of disk reads, because reading data from the disk is typically the slowest part of a database program.

Q: Why not create a WNJ node with 100 slots and save the bother of having the last slot point to the next record?

A: Most nodes will point only to a small number of records; 100 slots would waste a lot of disk space. Some words, however, will point to literally hundreds of notes, and it would be too restrictive to simply "run out" of nodes. The short answer is that any number large enough to avoid overrunning the limit would be very wasteful for most records.

Workshop

The Workshop provides quiz questions to help you solidify your understanding of the material covered, and exercises to provide you with experience in using what you have learned. Try to answer the quiz and exercise questions before checking the answers in Appendix A, and make sure that you understand the answers before continuing to the next chapter.

Quiz

What is the most expensive operation in a database?
What is the purpose of the Word-Node-Join?
Why did we eliminate case sensitivity?
Why not store the records all in uppercase?

[Click here for Answers]

Exercises

In ROBIN, we do some very simple parsing. In this set of exercises, you tackle a harder parsing challenge mathematical expressions. For example, look at the following expression:

    Expression:=
        Number |
        UnaryOperator Expression |
        ( Expression BinaryOperator Expression )

This is a Bacchus-Naur form of the sort of mathematical expressions that we are going to be able to parse. To read this, you need to know that the symbol := means is defined as, and | means or.

The statement shown here means Expression is defined as a number, or a UnaryOperator followed by an Expression, or an open parenthesis followed by an Expression followed by a BinaryOperator followed by another Expression followed by a close parenthesis. (We're going to require the parentheses in order to simplify our task.) The entire Bacchus-Naur definition follows.

Although it seems crazy to use the term Expression in its own definition, this actually makes sense when you think about mathematical expressions. After you make appropriate definitions of Number, UnaryOperation, and BinaryOperation, you can treat all of these as legal Expressions:

            2
            (7 + 8)
            (-1 * (((15 + 3) / (9 * -2)) - 1))

    Expression:=
        Number |
        UnaryOperator Expression |
        ( Expression BinaryOperator Expression )

    Digit :=
        0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

    Number :=
        Digit |
        Number<adjoining>Digit
            where the resulting number is assigned
            the value of Number * 10 + Digit

    UnaryOperator :=
        - | ~
            which mean, respectively:  negate, complement
    BinaryOperator :=
        + | - | * | / | & | | | ^
            where these mean, respectively:  add, subtract,
            multiply, divide, Boolean and, Boolean or,
            Boolean exclusive or

Create two of the classes: Expression and UnaryOperation. The class Expression should be the parent of the other three classes. (Note that a unary operation IS-A expression.) The class Expression should define a pure virtual function GetValue, which its child classes will override. Of course, also fill out the classes with the other methods and data members they need in order to be complete.
Before trying question 2, be sure to review the answer to question 1 and make sure that you fully understand it. Then, have your program add the classes Number and BinaryOperation.
Write a test program to test these classes. It should just create some instances of the classes manually. Think about the tree that you are building with the binary and unary operations.
Write a class Tokenizer, which accepts an input stream and an output stream. It should provide an interface to enable its callers to read tokens out of the input stream, skipping over white space. (Note that we have defined our expressions so that all tokens are single characters. This is why we broke Number down into individual digits.) All characters from the input stream should be echoed to the output stream (including white space) when they are accepted by the caller or skipped over.
The class also should provide the capability for a caller to peek at the next token without removing it from the stream. Peeking at a token should not cause it to be echoed (although any white space skipped over to get to the token should cause that space to be echoed).
The tokenizer also should have a flag to turn off echoing.
Write a set of Parse functions: Parse, ParseNumber, ParseUnaryOperation, and ParseBinaryOperation. Each of these should accept a Tokenizer and return an Expression *. Parse should peek only at the first token to figure out which of the other functions it should call. The other functions should extract the appropriate pieces from the tokenizer and return the appropriate subclass of expression. For the sub-expressions of the unary and binary operations, you should just call Parse! (These functions sound much harder than they really are.)
Write a test program to make sure that your functions work. Your program should accept legal expressions from cin, echo them back out, and output " = " and the expression's value. The test program should repeat this process until it reaches the end of the file.

Go to: Table of Contents | Next Page