You have seen how to create the B-tree for the data file, but not how to write the data itself to disk. Today you will learn
You are closing in on the final version of your database, but you still have not saved the actual notes. Remember that until now all you have been saving is the index, with the key value. The note itself must, of course, be saved in your data file. After all, the machinations and manipulations to index and store the keys are for the purpose of storing and retrieving the notes. It is easy to lose sight of this goal as you focus on the B-tree algorithms, but to the end user, the notes are the entire point of the exercise.
How should the note be stored? Although it is possible to create a complex binary structure around these notes, there really is no reason to do so. The text can be written out to the file, and the offset of the text can be recorded and stored in the index.
The first iteration need not worry about the fact that each key may point to multiple notes. First, get the data-storage fundamentals working, and then you can worry about the intermediate index as described on day 11.
After this is done, all the fundamental database features will be in place except one. You need to be able to associate more than one note to each key. You have eliminated duplicate index entries, but you cannot yet store the fact that a given key is used by more than one note.
One solution is to create an intermediary index file; I'll call this a Word-Node-Join or WNJ record. Rather than each index pointing directly to the note in the data file, it will point to a WNJ index entry. Each WNJ index entry will point both to a note and to the next WNJ index object.
The problem with this approach, however, is that some key values will have dozens of associated data files, and you now have introduced a terrible inefficiency into your program. You went to all the work of creating a very advanced B-tree structure to avoid multiple disk reads, and here you are back to reading this index repeatedly as you track down the offsets into the data file.
There are a few potential solutions to this problem. You could create yet another B-tree, but that would bring its own inefficiencies: Most keys will associate with only one, or a very small number, of notes. You could create an array of offsets, but that has two problems: The empty slots in the array take up disk space, and when you fill all the slots you are stuck, what do you do with the next note that matches that key?
The solution is to pick a reasonable but small number of slots, and reserve the last slot as a pointer to the next array. I'll use five slots as a starting point; if you have one to four notes that match a given key, their offsets will all fit in the array. When you get a fifth note, you will create a new array and put its offset into the fifth slot.
I believe this creates a reasonable trade-off between disk usage and speed of access, but in a commercial program you would want to make this a configurable number. There are a number of approaches to doing so: You might let the user set this number in a configuration file or, in a very advanced system, the program might adjust itself based on usage patterns.
In any case, changing this value can occur only when the database is repacked. You see more about packing the database when deletion of records is covered.
As mentioned earlier, you will want to record the length of each data record so that displaying the complete record will be easier. You also need to record the date the note was created; according to the specification, the date of each note is displayed in the list of found notes.
Why record the length of the note? Remember that you are not writing the terminating null to disk, you are writing only the contents of the string. Having the length enables you to read in and display the correct number of characters when the user asks to see the complete note. One could argue that this is inefficient; saving the null costs 1 byte per string, and the length costs 4 bytes per string; but you will find it convenient to be able to read the length and then use that figure for formatting the string. If this becomes a real-world cost over time, it is easy to change in a later version.
When searching for a value, the user probably would prefer to match without case sensitivity. Thus, when searching for computer, the user would like to match notes that include Computer and COMPUTER. There are a number of ways to accomplish this, but the easiest and most straightforward is to save all the indexed keys in uppercase.
In a few days, you will see how to add stop words--those terms you do not want to search on and for which you don't want to allocate storage. There are some assumptions that you can make right now to reduce disk storage, however, and one is that one- and two-letter words almost never will have significance. There is little reason to index words such as is, if, or, we, on, in, and so on. You therefore can set the index to ignore words of fewer than three letters.
Again, in a commercial program, you might want to make this a configurable number, allowing the user to decide the minimum length of words suitable for indexing. For now, you will hard-wire in a minimum length of three letters.
As the database grows, you will need statistics on its usage; the next version keeps track of how many pages (node and leaf) are created, and how many indexes are used in each page. This version displays these statistics at the end of each run, along with a display of all the indexed terms.
There is much more to do for ROBIN V.1, which you will see on day 13. For now, however, you have a number of significant improvements, and the following listings display the changes described earlier in this chapter:
Listing 12.1 Btree.hpp 308 Listing 12.2 Datafile.cpp 313 Listing 12.3 BTree.cpp 317 Listing 12.4 DiskMgr.cpp 319 Listing 12.5 Index.cpp 322 Listing 12.6 Page.cpp 323 Listing 12.7 WNJFile.cpp 328 Listing 12.8 The new driver program 331
Listing 12.1 Btree.hpp
1: // ************************************************** 2: // PROGRAM: Btree, Page and Index declarations 3: // FILE: btree.hpp 4: // PURPOSE: provide fundamental btree functionality 5: // NOTES: 6: // AUTHOR: Jesse Liberty (jl) 7: // REVISIONS: 11/1/94 1.0 jl initial release 8: // ************************************************** 9: #ifndef BTREE_HPP // inclusion guards 10: #define BTREE_HPP 11: 12: #include <time.h> 13: #include <string.h> 14: #include <fstream.h> 15: #include "stdef.hpp" 16: 17: const int Order = 31; // 31 indexes and 1 header 18: const int dataLen = 11; // length of a key 19: const int MaxPages = 20; // more than we need 20: const int SizeItem = 16; // key + offset 21: const int SizePointer = 2; // size of offset 22: const int PageSize = (Order+1) * SizeItem; 23: const int WNJSize = 5; // entries per wnj array 24: 25: // forward declarations 26: class Page; 27: class Index; 28: 29: class DataFile 30: { 31: public: 32: // constructors & destructor 33: DataFile(); 34: ~DataFile(){} 35: 36: // management of files 37: void Close(); 38: void Create(); 39: void GetRecord(long offset, char * buffer); 40: 41: long Insert(char *); 42: 43: private: 44: fstream myFile; 45: }; 46: 47: class WNJFile 48: { 49: public: 50: // constructors & destructor 51: WNJFile(); 52: ~WNJFile(){} 53: 54: // management of files 55: void Close(); 56: void Create(); 57: int* Find(int NextWNJ); 58: int Insert(int, int); 59: int Append(int); 60: 61: private: 62: static int myCount; 63: fstream myFile; 64: union 65: { 66: int myInts[5]; 67: char myArray[5*4]; 68: }; 69: }; 70: 71: // DiskManager - in memory keeps track of what pages are 72: // already in memory and swaps to disk as required 73: class DiskManager 74: { 75: public: 76: // constructors & destructor 77: DiskManager(); 78: ~DiskManager(){} 79: 80: // management of files 81: void Close(int); 82: int Create(); 83: 84: // Page manipulation 85: Page* GetPage(int); 86: void SetPage(Page*); 87: void Insert(Page * newPage); 88: void Save(Page* pg); 89: int NewPage(Index&, int); 90: int NewPage(Index *array, long offset, int leaf, int count); 91: 92: private: 93: Page* myPages[MaxPages]; 94: fstream myFile; 95: int myCount; 96: }; 97: 98: // the btree itself - has a pointer to first page 99: class BTree 100: { 101: public: 102: // constructors and destructor 103: BTree(); 104: ~BTree(); 105: 106: // utility functions 107: void AddKey(char* data, long offset); 108: void Insert(char* buffer, int, char**); 109: void Insert(char*); 110: void PrintTree(); 111: int Find(char* data); 112: 113: // page methods 114: Page* GetPage(int page) 115: { return theDiskManager.GetPage(page); } 116: int NewPage(Index& pIndex, int IsLeaf) 117: { return theDiskManager.NewPage(pIndex,FALSE); } 118: 119: 120: // public static member! 121: static int myAlNum(char ch); 122: static DiskManager theDiskManager; 123: static WNJFile theWNJFile; 124: static DataFile theDataFile; 125: static int GetWord(char*, char*, int&); 126: static void GetStats(); 127: static int NodeIndexCtr; 128: static int LeafIndexCtr; 129: static int NodePageCtr; 130: static int LeafPageCtr; 131: static int NodeIndexPerPage[Order+1]; 132: static int LeafIndexPerPage[Order+1]; 133: 134: private: 135: int myRoot; 136: }; 137: 138: // index objects point to pages or real data 139: class Index 140: { 141: public: 142: // constructors & destructor 143: Index(); 144: Index(char *); 145: Index (char*, int); 146: Index(Index&); 147: ~Index(){} 148: 149: // accessors 150: const char * GetData() const { return myData; } 151: void SetData(const Index& rhs) 152: { strcpy(myData,rhs.GetData()); } 153: void SetData(const char * rhs) 154: { strcpy(myData,rhs); } 155: int GetPointer()const { return myPointer; } 156: void SetPointer (int pg) { myPointer = pg; } 157: 158: // utility functions 159: void PrintKey(); 160: void PrintPage(); 161: int Insert(Index& ref,int findOnly = FALSE); 162: 163: // overloaded operators 164: int operator==(const Index& rhs) const; 165: // {return strcmp(myData,rhs.GetData()) == 0; } 166: 167: int operator < (const Index& rhs) const 168: {return strcmp(myData,rhs.GetData())<0;} 169: 170: int operator <= (const Index& rhs) const 171: {return strcmp(myData,rhs.GetData())<=0;} 172: 173: int operator > (const Index& rhs) const 174: {return strcmp(myData,rhs.GetData())>0;} 175: 176: int operator >= (const Index& rhs) const 177: {return strcmp(myData,rhs.GetData())>=0;} 178: 179: public: 180: int myPointer; 181: char myData[SizeItem - SizePointer]; 182: }; 183: 184: 185: // pages - consist of header and array of indexes 186: class Page 187: { 188: public: 189: // constructors and destructor 190: Page(); 191: Page(char*); 192: Page(Index&,int); 193: Page(Index*, long, int, int); 194: ~Page(){} 195: 196: // insertion and searchoperations 197: int Insert(Index&, int findOnly = FALSE); 198: int Find(Index& idx) { return Insert(idx,TRUE); } 199: int InsertLeaf(Index&); 200: int FindLeaf(Index&,int findOnly); 201: int InsertNode(Index&,int findOnly = FALSE); 202: void Push(Index&,long offset=0L, int=TRUE); 203: 204: // accessors 205: Index GetFirstIndex() { return myKeys[0]; } 206: int GetIsLeaf() const { return myVars.IsLeaf; } 207: int GetCount() const { return myVars.myCount; } 208: void SetCount(int cnt) { myVars.myCount=cnt; } 209: time_t GetTime() { return myTime; } 210: 211: // page manipulation 212: int GetPageNumber() const { return myVars.myPageNumber; } 213: Page* GetPage(int page) 214: { return BTree::theDiskManager.GetPage(page); } 215: int NewPage(Index& rIndex, int IsLeaf) 216: { return BTree::theDiskManager.NewPage(rIndex,FALSE); } 217: 218: int NewPage(Index* arr, int off,int isL, int cnt) 219: { return BTree::theDiskManager.NewPage(arr,off,isL, cnt); } 220: 221: // utility functions 222: void Nullify(long offset); 223: void Print(); 224: fstream& Write(fstream&); 225: void ReCount(); 226: 227: static int GetgPageNumber(){ return gPage; } 228: static void SetgPageNumber(int pg) { gPage = pg; } 229: 230: private: 231: Index * const myKeys; // will point to myVars.mk 232: union 233: { 234: char myBlock[PageSize]; // a page from disk 235: struct 236: { 237: int myCount; 238: int IsLeaf; 239: int myPageNumber; 240: int lastKey; 241: char overhead[8]; 242: char mk[Order*sizeof(Index)]; // array of indexes 243: }myVars; 244: }; 245: 246: // memory only 247: static int gPage; 248: time_t myTime; // for lifo queue 249: }; 250: 251: #endif
Listing 12.2 Btree.cpp
1: // ************************************************** 2: // PROGRAM: Btree 3: // FILE: btree.cpp 4: // PURPOSE: provide fundamental btree functionality 5: // NOTES: 6: // AUTHOR: Jesse Liberty (jl) 7: // REVISIONS: 11/1/94 1.0 jl initial release 8: // ************************************************** 9: 10: #include "btree.hpp" 11: #include "stdef.hpp" 12: #include <assert.h> 13: #include <ctype.h> 14: #include <stdlib.h> 15: #if defined(_BC_16BIT) || defined(_BC32_BIT) #include <values.h> #endif #ifdef _MSVC_16BIT #define MAXINT 0x7FFF //(16 bit) #endif #ifdef _MSVC_32BIT #define 0x7FFFFFFF //(32 bit) #endif 16: 17: // construct the tree 18: // initialize myRoot pointer to nothing 19: // either create or read the index file 20: BTree::BTree():myRoot(0) 21: { 22: myRoot = theDiskManager.Create(); 23: theDataFile.Create(); 24: theWNJFile.Create(); 25: 26: } 27: 28: // write out the index file 29: BTree::~BTree() 30: { 31: theDiskManager.Close(myRoot); 32: theDataFile.Close(); 33: theWNJFile.Close(); 34: } 35: 36: // find an existing record 37: int BTree::Find(char * str) 38: { 39: Index index(str); 40: if (!myRoot) 41: return 0L; 42: else 43: return GetPage(myRoot)->Find(index); 44: } 45: 46: void BTree::Insert(char * buffer, int argc, char** argv) 47: { 48: // get each word, 49: long offset = theDataFile.Insert(buffer); 50: for (int i = 1; i<argc; i++) 51: { 52: AddKey(argv[i],offset); 53: } 54: 55: } 56: 57: void BTree::Insert(char* buffer) 58: { 59: 60: if (strlen(buffer) < 3) 61: return; 62: 63: char *buff = buffer; 64: char word[PageSize]; 65: int wordOffset = 0; 66: long offset; 67: 68: if (GetWord(buff,word,wordOffset)) 69: { 70: offset = theDataFile.Insert(buffer); 71: AddKey(word,offset); 72: } 73: 74: 75: while (GetWord(buff,word,wordOffset)) 76: { 77: AddKey(word,offset); 78: } 79: 80: } 81: 82: void BTree::AddKey(char * str, long offset ) 83: { 84: 85: if (strlen(str) < 3) 86: return; 87: 88: int retVal =0; 89: assert (offset < MAXINT); 90: Index index(str,(int)offset); 91: if (!myRoot) 92: { 93: myRoot = theDiskManager.NewPage (index,TRUE); 94: } 95: else 96: { 97: retVal = GetPage(myRoot)->Insert(index); 98: if (retVal) // our root split 99: { 100: // create a pointer to the old top 101: Index index(GetPage(myRoot)->GetFirstIndex()); 102: index.SetPointer(myRoot); 103: // make the new page & insert the index 104: int PageNumber = NewPage(index,FALSE); 105: 106: Page* pg = GetPage(PageNumber); 107: 108: //get a pointer to the new (sibling) page 109: Index Sib(GetPage(retVal)->GetFirstIndex()); 110: Sib.SetPointer(retVal); 111: // put it into the page 112: pg->InsertLeaf(Sib); 113: 114: // reset myRoot to point to the new top 115: myRoot = PageNumber; 116: } 117: } 118: } 119: 120: void BTree::PrintTree() 121: { 122: GetPage(myRoot)->Print(); 123: 124: cout << "\n\nStats:" << endl; 125: cout << "Node pages: " << NodePageCtr << endl; 126: cout << "Leaf pages: " << LeafPageCtr << endl; 127: cout << "Node indexes: " << NodeIndexCtr << endl; 128: cout << "Leaf indexes: " << LeafIndexCtr << endl; 129: for (int i = 0; i < Order +2; i++) 130: { 131: if (NodeIndexPerPage[i]) 132: { 133: cout << "Pages with " << i << " nodes: "; 134: cout << NodeIndexPerPage[i] << endl; 135: } 136: if (LeafIndexPerPage[i]) 137: { 138: cout << "Pages with " << i << " leaves: "; 139: cout << LeafIndexPerPage[i] << endl; 140: } 141: } 142: 143: } 144: 145: int BTree::GetWord(char* string, char* word, int& wordOffset) 146: { 147: 148: if (!string[wordOffset]) 149: return FALSE; 150: 151: char *p1, *p2; 152: p1 = p2 = string+wordOffset; 153: 154: // eat leading spaces 155: for (int i = 0; i<(int)strlen(p1) && !(BTree::myAlNum(p1[0])); i++) 156: p1++; 157: 158: // see if you have a word 159: if (!BTree::myAlNum(p1[0])) 160: return FALSE; 161: 162: p2 = p1; // point to start of word 163: 164: // march p2 to end of word 165: while (BTree::myAlNum(p2[0])) 166: p2++; 167: 168: int len = int(p2-p1); 169: #if defined(_MSVC_16BIT) || defined(_MSVC_32BIT) : { : len = __min(len,(int)PageSize); : } : #else : { : len = min(len,(int)PageSize); : } : #endif 170: 171: strncpy (word,p1,len); 172: word[len]='\0'; 173: 174: for (i = int (p2-string); i<(int)strlen(string) && !BTree::myAlNum(p2[0]); i++) 175: p2++; 176: 177: wordOffset = int (p2-string); 178: 179: return TRUE; 180: } 181: 182: int BTree::myAlNum(char ch) 183: { 184: return isalnum(ch) || 185: ch == '-' || 186: ch == '\'' || 187: ch == '(' || 188: ch == ')'; 189: }
Listing 12.3 Datafile.cpp
1: // ************************************************** 2: // PROGRAM: Data file 3: // FILE: datafile.cpp 4: // PURPOSE: 5: // NOTES: 6: // AUTHOR: Jesse Liberty (jl) 7: // REVISIONS: 11/5/94 1.0 jl initial release 8: // ************************************************** 9: 10: #include "btree.hpp" 11: #include <assert.h> 12: 13: // on construction, try to open the file if it exists 14: DataFile::DataFile(): 15: myFile("ROBIN.DAT", 16: ios::binary | ios::in | ios::out | ios::nocreate | ios::app) 17: { 18: } 19: 20: void DataFile::Create() 21: { 22: if (!myFile) // nocreate failed, first creation 23: { 24: 25: // open the file, create it this time 26: myFile.clear(); 27: 28: myFile.open 29: ("ROBIN.DAT",ios::binary | ios::in | ios::out | ios::app); 30: 31: char Header[2]; 32: int MagicNumber = 1234; // a number we can check for 33: memcpy(Header,&MagicNumber,2); 34: myFile.clear(); 35: myFile.flush(); 36: myFile.seekp(0); 37: myFile.write(Header,2); 38: return; 39: } 40: 41: // we did open the file, it already existed 42: // get the numbers we need 43: int MagicNumber; 44: myFile.seekg(0); 45: myFile.read((char *) &MagicNumber,2); 46: 47: // check the magic number. If it is wrong the file is 48: // corrupt or this isn't the index file 49: if (MagicNumber != 1234) 50: { 51: // change to an exception!! 52: cout << "DataFile Magic number failed!"; 53: } 54: return; 55: } 56: 57: // write out the numbers we'll need next time 58: void DataFile::Close() 59: { 60: myFile.close(); 61: } 62: 63: 64: long DataFile::Insert(char * newNote) 65: { 66: int len = strlen(newNote); 67: int fullLen = len + 2; 68: char buffer[PageSize]; 69: memcpy(buffer,&len,2); 70: memcpy(buffer+2,newNote,len); 71: myFile.clear(); 72: myFile.flush(); 73: myFile.seekp(0,ios::end); 74: long offset = myFile.tellp(); 75: myFile.write(buffer,fullLen); 76: myFile.flush(); 77: return offset; 78: } 79: 80: void DataFile::GetRecord(long offset, char* buffer) 81: { 82: int len; 83: char tmpBuff[PageSize]; 84: myFile.flush(); 85: myFile.clear(); 86: myFile.seekg(offset); 87: myFile.read(tmpBuff,PageSize); 88: memcpy(&len,tmpBuff,2); 89: strncpy(buffer,tmpBuff+2,len); 90: buffer[len] = '\0'; 91: }
Listing 12.4 Idxmgr.cpp
1: // ************************************************** 2: // PROGRAM: diskmgr 3: // FILE: diskmgr.cpp 4: // PURPOSE: provide i/o for btree 5: // NOTES: 6: // AUTHOR: Jesse Liberty (jl) 7: // REVISIONS: 11/5/94 1.0 jl initial release 8: // ************************************************** 9: 10: #include "btree.hpp" 11: #include <assert.h> 12: 13: // on construction, try to open the file if it exists 14: DiskManager::DiskManager(): 15: myFile("ROBIN.IDX",ios::binary | ios::in | ios::out | ios::nocreate) 16: { 17: // initialize the pointers to null 18: for (int i = 0; i< MaxPages; i++) 19: myPages[i] = 0; 20: myCount = 0; 21: 22: } 23: 24: // called by btree constructor 25: // if we opened the file, read in the numbers we need 26: // otherwise create the file 27: int DiskManager::Create() 28: { 29: if (!myFile) // nocreate failed, first creation 30: { 31: // open the file, create it this time 32: myFile.open("ROBIN.IDX",ios::binary | ios::in | ios::out); 33: 34: char Header[PageSize]; 35: int MagicNumber = 1234; // a number we can check for 36: memcpy(Header,&MagicNumber,2); 37: int NextPage = 1; 38: memcpy(Header+2,&NextPage,2); 39: memcpy(Header+4,&NextPage,2); 40: Page::SetgPageNumber(NextPage); 41: char title[]="ROBIN.IDX. Ver 1.00"; 42: memcpy(Header+6,title,strlen(title)); 43: myFile.flush(); 44: myFile.clear(); 45: myFile.seekp(0); 46: myFile.write(Header,PageSize); 47: return 0; 48: } 49: 50: // we did open the file, it already existed 51: // get the numbers we need 52: int MagicNumber; 53: myFile.seekg(0); 54: myFile.read((char *) &MagicNumber,2); 55: 56: // check the magic number. If it is wrong the file is 57: // corrupt or this isn't the index file 58: if (MagicNumber != 1234) 59: { 60: // change to an exception!! 61: cout << "Index Magic number failed!"; 62: return 0; 63: } 64: 65: int NextPage; 66: myFile.seekg(2); 67: myFile.read((char*) &NextPage,2); 68: Page::SetgPageNumber(NextPage); 69: int FirstPage; 70: myFile.seekg(4); 71: myFile.read((char*) &FirstPage,2); 72: const int room = PageSize 6; 73: char buffer[room]; 74: myFile.read(buffer,room); 75: buffer[room]='\0'; 76: cout << buffer << endl; 77: // read in all the pages 78: for (int i = 1; i < NextPage; i++) 79: { 80: myFile.seekg(i * PageSize); 81: char buffer[PageSize]; 82: myFile.read( buffer, PageSize); 83: Page * pg = new Page(buffer); 84: Insert(pg); 85: } 86: 87: return FirstPage; 88: } 89: 90: // write out the numbers we'll need next time 91: void DiskManager::Close(int theRoot) 92: { 93: 94: for (int i = 0; i< MaxPages; i++) 95: if (myPages[i]) 96: Save(myPages[i]); 97: int NextPage = Page::GetgPageNumber(); 98: if (!myFile) 99: cout << "Error opening myFile!" << endl; 100: myFile.flush(); 101: myFile.clear(); 102: myFile.seekp(2); 103: myFile.write ( (char *) &NextPage,2); 104: myFile.seekp(4); 105: myFile.write((char*) &theRoot,2); 106: myFile.close(); 107: } 108: 109: // wrapper function 110: int DiskManager::NewPage(Index& index, int bLeaf) 111: { 112: Page * newPage = new Page(index, bLeaf); 113: Insert(newPage); 114: Save(newPage); 115: return newPage->GetPageNumber(); 116: } 117: 118: int DiskManager::NewPage( 119: Index *array, 120: long offset, 121: int leaf, 122: int count) 123: { 124: Page * newPage = new Page(array, offset, leaf,count); 125: Insert(newPage); 126: Save(newPage); 127: return newPage->GetPageNumber(); 128: } 129: 130: void DiskManager::Insert(Page * newPage) 131: { 132: 133: // add new page into array of page managers 134: if (myCount < MaxPages) 135: { 136: assert(!myPages[myCount]); 137: myPages[myCount++] = newPage; 138: } 139: else // no room, time to page out to disk 140: { 141: int lowest = 0; 142: for (int i = 0; i< MaxPages; i++) 143: if (myPages[i]->GetTime() < myPages[lowest]->GetTime()) 144: lowest = i; 145: Save(myPages[lowest]); 146: delete myPages[lowest]; 147: myPages[lowest] = newPage; 148: } 149: } 150: 151: // tell the page to write itself out 152: void DiskManager::Save(Page* pg) 153: { 154: pg->Write(myFile); 155: } 156: 157: // see if the page is in memory, if so return it 158: // otherwise get it from disk 159: 160: Page * DiskManager::GetPage(int target) 161: { 162: 163: for (int i = 0; i< MaxPages; i++) 164: { 165: if (myPages[i]->GetPageNumber() == target) 166: return myPages[i]; 167: } 168: myFile.seekg(target * PageSize); 169: char buffer[PageSize]; 170: myFile.read( buffer, PageSize); 171: Page * pg = new Page(buffer); 172: Insert(pg); 173: return pg; 174: }
Listing 12.5 Index.cpp
1: // ************************************************** 2: // PROGRAM: index 3: // FILE: index.cpp 4: // PURPOSE: provide fundamental btree functionality 5: // NOTES: 6: // AUTHOR: Jesse Liberty (jl) 7: // REVISIONS: 11/1/94 1.0 jl initial release 8: // ************************************************** 9: 10: #include "btree.hpp" 11: #include <ctype.h> 12: 13: Index::Index(char* str):myPointer(1) 14: { 15: strncpy(myData,str,dataLen); 16: myData[dataLen]='\0'; 17: } 18: 19: Index::Index(char* str, int ptr):myPointer(ptr) 20: { 21: 22: strncpy(myData,str,dataLen); 23: myData[dataLen]='\0'; 24: } 25: 26: Index::Index(Index& rhs): 27: myPointer(rhs.GetPointer()) 28: { 29: strcpy(myData, rhs.GetData()); 30: } 31: 32: Index::Index():myPointer(0) 33: { 34: myData[0]='\0'; 35: } 36: 37: void Index::PrintKey() 38: { 39: cout << " " << myData; 40: } 41: 42: void Index::PrintPage() 43: { 44: cout << "\n" << myData << ": " ; 45: BTree::theDiskManager.GetPage(myPointer)->Print(); 46: } 47: 48: int Index::Insert(Index& ref, int findOnly) 49: { 50: return BTree::theDiskManager.GetPage(myPointer)->Insert(ref,findOnly); 51: } 52: 53: int Index::operator==(const Index& rhs) const 54: { 55: int len1 = strlen(myData); 56: int len2 = strlen(rhs.GetData()); 57: char w1[PageSize]; 58: char w2[PageSize]; 59: for (int i = 0; i<len1; i++) 60: w1[i] = toupper(myData[i]); 61: for (i = 0; i<len2; i++) 62: w2[i] = toupper(rhs.GetData()[i]); 63: w1[len1] = '\0'; 64: w2[len2]='\0'; 65: return (strcmp(w1,w2) == 0); 66: }
Listing 12.6 Page.cpp
1: // ************************************************** 2: // PROGRAM: Page 3: // FILE: Page.cpp 4: // PURPOSE: provide fundamental btree functionality 5: // NOTES: 6: // AUTHOR: Jesse Liberty (jl) 7: // REVISIONS: 11/1/94 1.0 jl initial release 8: // ************************************************** 9: 10: #include "btree.hpp" 11: #include <assert.h> 12: #include <values.h> 13: 14: // constructors 15: 16: // default constructor 17: Page::Page() 18: { 19: } 20: 21: // create a page from a buffer read in from disk 22: Page::Page(char *buffer): 23: myKeys((Index*)myVars.mk) 24: { 25: assert(sizeof(myBlock) == PageSize); 26: assert(sizeof(myVars) == PageSize); 27: memcpy(&myBlock,buffer,PageSize); 28: myTime = time(NULL); 29: } 30: 31: // create a Page from the first index 32: Page::Page(Index& index, int bLeaf): 33: myKeys((Index*)myVars.mk) 34: { 35: 36: myVars.myCount=1; 37: myVars.IsLeaf = bLeaf; 38: // if this is a leaf, this is the first 39: // index on the first page, set its pointer 40: // based on creating a new wnj. otherwise 41: // you are here creating a new node, do not 42: // set the pointer, it is already set. 43: if (bLeaf) 44: index.SetPointer(BTree::theWNJFile.Append(index.GetPointer())); 45: myKeys[0]=index; 46: myVars.myPageNumber = gPage++; 47: myTime = time(NULL); 48: } 49: 50: // create a page by splitting another page 51: Page::Page(Index *array, long offset, int bLeaf,int count): 52: myKeys((Index*)myVars.mk) 53: { 54: myVars.IsLeaf = bLeaf; 55: myVars.myCount = 0; 56: assert (offset < MAXINT); 57: int i = 0; 58: int j = (int) offset; 59: for (; j<Order && i < count; i++, j++) 60: { 61: myKeys[i]= array[j]; 62: myVars.myCount++; 63: } 64: myVars.myPageNumber = gPage++; 65: myTime = time(NULL); 66: } 67: 68: void Page::Nullify(long offset) 69: { 70: assert (offset < MAXINT); 71: for (int i = (int)offset; i<Order; i++) 72: { 73: myKeys[i].SetPointer(0); 74: myVars.myCount--; 75: } 76: } 77: 78: // decide whether I'm a leaf or a node 79: // and pass this index to the right 80: // function. If findOnly is true, don't insert 81: // just return the page number (for now) 82: int Page::Insert(Index& rIndex, int findOnly) 83: { 84: if (myVars.IsLeaf) 85: return FindLeaf(rIndex,findOnly); 86: else 87: return InsertNode(rIndex,findOnly); 88: } 89: 90: 91: // find the right page for this index 92: int Page::InsertNode(Index& rIndex, int findOnly) 93: { 94: int retVal =0; 95: int inserted = FALSE; 96: int i,j; 97: 98: assert(myVars.myCount>0); // nodes have at least 1 99: assert(myKeys[0].GetPointer()); // must be valid 100: 101: // does it go before my first entry? 102: if (rIndex < myKeys[0]) 103: { 104: if (findOnly) 105: return 0L; // not found 106: 107: myKeys[0].SetData(rIndex); 108: retVal=myKeys[0].Insert(rIndex); 109: inserted = TRUE; 110: } 111: 112: // does it go after my last? 113: if (!inserted) 114: for (i = myVars.myCount-1; i>=0; i--) 115: { 116: assert(myKeys[i].GetPointer()); 117: if (rIndex >= myKeys[i]) 118: { 119: retVal=myKeys[i].Insert(rIndex,findOnly); 120: inserted = TRUE; 121: break; 122: } 123: } 124: 125: // find where it does go 126: if (!inserted) 127: for (j = 0; j<i && j+1 < myVars.myCount; j++) 128: { 129: assert(myKeys[j+1].GetPointer()); 130: if (rIndex < myKeys[j+1]) 131: { 132: retVal=myKeys[j].Insert(rIndex,findOnly); 133: inserted = TRUE; 134: break; 135: } 136: } 137: 138: assert(inserted); // change to exception if not! 139: 140: // if you had to split 141: if (retVal && !findOnly) // got back a pointer to a new page 142: { 143: Index * pIndex = new Index(GetPage(retVal)->GetFirstIndex()); 144: pIndex->SetPointer(retVal); 145: retVal = InsertLeaf(*pIndex); 146: } 147: return retVal; 148: } 149: 150: // called if current page is a leaf 151: int Page::FindLeaf(Index& rIndex, int findOnly) 152: { 153: int result = 0; 154: 155: // no duplicates! 156: for (int i=0; i < myVars.myCount; i++) 157: if (rIndex == myKeys[i]) 158: { 159: if (findOnly) // return first WNJ 160: //return BTree::theWNJFile.Find(myKeys[i].GetPointer()); 161: return myKeys[i].GetPointer(); 162: else 163: return BTree::theWNJFile.Insert( 164: rIndex.GetPointer(), 165: myKeys[i].GetPointer()); 166: } 167: 168: if (findOnly) // not found 169: return result; 170: 171: // this index item does not yet exist 172: // before you push it into the index 173: // push an entry into the wnj.idx 174: // and set the index to point to that entry 175: rIndex.SetPointer(BTree::theWNJFile.Append(rIndex.GetPointer())); // new! 176: return InsertLeaf(rIndex); 177: } 178: 179: int Page::InsertLeaf(Index& rIndex) 180: { 181: int result = 0; 182: if (myVars.myCount < Order) 183: Push(rIndex); 184: else // overflow the page 185: { 186: // make sibling 187: int NewPg = 188: NewPage(myKeys,Order/2,myVars.IsLeaf,myVars.myCount); 189: Page* Sibling = GetPage(NewPg); 190: Nullify(Order/2); // nullify my right half 191: 192: // does it fit in this side? 193: if (myVars.myCount>Order/2-1 && rIndex <= myKeys[Order/2-1]) 194: Push(rIndex); 195: else // push it into the new sibling 196: Sibling->Push(rIndex); 197: 198: result = NewPg; // we split, pass it up 199: } 200: return result; 201: } 202: 203: // put the new index into this page (in order) 204: void Page::Push(Index& rIndex,long offset,int first) 205: { 206: int inserted = FALSE; 207: assert(myVars.myCount < Order); 208: assert (offset < MAXINT); 209: 210: for (int i=(int)offset; i<Order && i<myVars.myCount; i++) 211: { 212: assert(myKeys[i].GetPointer()); 213: if (rIndex <= myKeys[i]) 214: { 215: Push(myKeys[i],offset+1,FALSE); 216: myKeys[i]=rIndex; 217: inserted = TRUE; 218: break; 219: } 220: } 221: if (!inserted) 222: myKeys[myVars.myCount] = rIndex; 223: 224: if (first) 225: myVars.myCount++; 226: } 227: 228: 229: void Page::Print() 230: { 231: if (!myVars.IsLeaf) 232: { 233: BTree::NodePageCtr++; 234: BTree::NodeIndexPerPage[myVars.myCount]++; 235: BTree::NodeIndexCtr+=myVars.myCount; 236: } 237: else 238: { 239: BTree::LeafPageCtr++; 240: BTree::LeafIndexPerPage[myVars.myCount]++; 241: BTree::LeafIndexCtr+=myVars.myCount; 242: } 243: 244: for (int i = 0; i<Order && i < myVars.myCount; i++) 245: { 246: assert(myKeys[i].GetPointer()); 247: if (myVars.IsLeaf) 248: myKeys[i].PrintKey(); 249: else 250: myKeys[i].PrintPage(); 251: } 252: } 253: 254: // write out the entire page as a block 255: fstream& Page::Write(fstream& file) 256: { 257: char buffer[PageSize]; 258: memcpy(buffer,&myBlock,PageSize); 259: file.flush(); 260: file.clear(); 261: file.seekp(myVars.myPageNumber*PageSize); 262: file.write(buffer,PageSize); 263: return file; 264: }
Listing 12.7 WNJFile.cpp
1: // ************************************************** 2: // PROGRAM: WNJ file 3: // FILE: WNJfile.cpp 4: // PURPOSE: Cross index from key to datafile 5: // NOTES: 6: // AUTHOR: Jesse Liberty (jl) 7: // REVISIONS: 11/5/94 1.0 jl initial release 8: // ************************************************** 9: 10: #include "btree.hpp" 11: #include <assert.h> 12: 13: // on construction, try to open the file if it exists 14: WNJFile::WNJFile(): 15: myFile("ROBINWNJ.IDX", 16: ios::binary | ios::in | ios::out | ios::nocreate) 17: { 18: for (int i = 0; i<WNJSize; i++) 19: myInts[i]=0L; 20: } 21: 22: void WNJFile::Create() 23: { 24: char Header[8]; 25: int MagicNumber=0; // a number we can check for 26: int zero = 0; 27: 28: if (!myFile) // nocreate failed, first creation 29: { 30: // open the file, create it this time 31: myFile.clear(); 32: myFile.open("ROBINWNJ.IDX", 33: ios::binary | ios::in | ios::out); 34: 35: MagicNumber = 1234; 36: memcpy(Header,&MagicNumber,2); 37: memcpy(Header+2,&zero,2); 38: myFile.seekp(0); 39: myFile.write(Header,4); 40: myFile.flush(); 41: return; 42: } 43: 44: // we did open the file, it already existed 45: // get the numbers we need 46: 47: 48: myFile.seekg(0); 49: myFile.read(Header,4); 50: memcpy(&MagicNumber,Header,2); 51: memcpy(&myCount,Header+2,2); 52: 53: // check the magic number. If it is wrong the file is 54: // corrupt or this isn't the index file 55: if (MagicNumber != 1234) 56: { 57: // change to an exception!! 58: cout << "WNJ Magic number failed!"; 59: cout << "Magic number: " << MagicNumber; 60: cout << "\nmyCount: " << myCount << endl; 61: } 62: return; 63: } 64: 65: // write out the numbers we'll need next time 66: void WNJFile::Close() 67: { 68: myFile.seekg(2); 69: myFile.write((char*)&myCount,2); 70: myFile.close(); 71: } 72: 73: int WNJFile::Append(int DataOffset) 74: { 75: 76: int newPos = 4 + myCount++ * (WNJSize*SizePointer); 77: int offsets[WNJSize]; 78: offsets[0] = DataOffset; 79: for (int i = 1; i<WNJSize; i++) 80: offsets[i]=0; 81: myFile.seekg(newPos); 82: myFile.write((char*)offsets,WNJSize*SizePointer); 83: 84: return newPos; 85: } 86: 87: 88: int WNJFile::Insert(int DataOffset,int WNJOffset) 89: { 90: int ints[WNJSize]; 91: myFile.seekg(WNJOffset); 92: myFile.read((char*)ints,WNJSize*SizePointer); 93: 94: int offset=WNJOffset; 95: 96: while (ints[WNJSize-1]) 97: { 98: offset = ints[WNJSize-1]; 99: myFile.clear(); 100: myFile.flush(); 101: myFile.seekg(ints[WNJSize-1]); 102: myFile.read((char*)ints,WNJSize*SizePointer); 103: } 104: if (ints[WNJSize-2]) // full! 105: { 106: ints[WNJSize-1] = Append(DataOffset); 107: myFile.clear(); 108: myFile.flush(); 109: myFile.seekg(offset); 110: myFile.write((char*)ints,WNJSize*SizePointer); 111: } 112: else 113: { 114: for (int i = 0; i<WNJSize-1; i++) 115: { 116: if (ints[i] == 0) 117: { 118: ints[i] = DataOffset; 119: myFile.clear(); 120: myFile.flush(); 121: myFile.seekg(offset); 122: myFile.write((char*)ints,WNJSize*SizePointer); 123: break; 124: } 125: } 126: } 127: return 0; 128: } 129: 130: 131: int* WNJFile::Find(int NextWNJ) 132: { 133: int ints[WNJSize]; 134: int * results = new int[PageSize]; 135: 136: int i = 0, j=0; 137: 138: while (j<PageSize) 139: results[j++] = 0; 140: 141: j = 0; 142: 143: myFile.seekg(NextWNJ); 144: myFile.read((char*)ints, WNJSize*SizePointer); 145: while (j < PageSize) 146: { 147: if (ints[i]) 148: { 149: if (i == WNJSize-1) 150: { 151: myFile.seekg(ints[WNJSize-1]); 152: myFile.read((char*)ints,WNJSize*2); 153: i = 0; 154: continue; 155: } 156: results[j++] = ints[i++]; 157: } 158: else 159: break; 160: } 161: return results; 162: }
Listing 12.8 Driver.cpp
1: #include "String.hpp" 2: #include "stdef.hpp" 3: #include "btree.hpp" 4: DiskManager BTree::theDiskManager; 5: DataFile BTree::theDataFile; 6: WNJFile BTree::theWNJFile; 7: 8: int WNJFile::myCount=0L; 9: int Page::gPage=1; 10: int BTree::NodeIndexCtr=0; 11: int BTree::LeafIndexCtr=0; 12: int BTree::NodePageCtr=0; 13: int BTree::LeafPageCtr=0; 14: int BTree::NodeIndexPerPage[Order+1]; 15: int BTree::LeafIndexPerPage[Order+1]; 16: 17: int main(int argc, char ** argv) 18: { 19: BTree myTree; 20: 21: for (int i = 0; i < Order +2; i++) 22: { 23: BTree::NodeIndexPerPage[i] =0; 24: BTree::LeafIndexPerPage[i] = 0; 25: } 26: 27: char buffer[PageSize+1]; 28: 29: for (;;) 30: { 31: cout << "Enter a string (blank to stop): "; 32: cin.getline(buffer,PageSize); 33: // cin.ignore(PageSize,'\n'); 34: if (buffer[0]) 35: myTree.Insert(buffer); 36: else 37: break; 38: } 39: 40: for (;;) 41: { 42: cout << "Find: "; 43: cin.getline(buffer,255); 44: if (buffer[0]) 45: { 46: int offset = myTree.Find(buffer); 47: if (offset) 48: { 49: int* found = BTree::theWNJFile.Find(offset); 50: int i=0; 51: int len; 52: time_t theTime; 53: char buffer[512]; 54: while (found[i]) 55: { 56: BTree::theDataFile.GetRecord(found[i],buffer); 57: struct tm * ts = localtime(&theTime); 58: cout << "Found: "; 59: cout << ts->tm_mon << "/"; 60: cout << ts->tm_mday << "/"; 61: cout << ts->tm_year << " "; 62: cout << buffer << endl; 63: i++; 64: } 65: delete [] found; 66: } 67: } 68: else 69: break; 70: } 71: 72: myTree.PrintTree(); 73: 74: 75: return 0; 76: }Output:
d:\day12>ROBIN Enter a string (blank to stop): Computer: 486 33MHZ 16MB RAM Enter a string (blank to stop): Remember to buy more RAM Enter a string (blank to stop): The show is at noon. Enter a string (blank to stop): Once more into the breach Enter a string (blank to stop): Find: more Found: 10/25/94 Remember to buy more RAM Found: 10/25/94 Once more into the breach Find: 16MB 33MHZ 486 Computer Once RAM Remember The breach buy into more noon show Stats: Node pages: 0 Leaf pages: 1 Node indexes: 0 Leaf indexes: 14 Pages with 14 leaves: 1 d:\day12>type ROBIN.DAT -:· · /°+.Computer: 486 33MHZ 16MB RAM· 3°+.Remember to buy more RAM· 6°+. The show is at noon· ;°+.Once more into the breachAnalysis:
The first and most significant change is the addition of a data file and a WNJFile. The data file is named ROBIN.DAT and is opened much as the other files have been. However, this time the ios:app flag is used to indicate that the file should be opened in append mode. When a file is opened in append mode, all writes are to the end of the file, regardless of where the tellp() and tellg() flags are positioned.
The Insert() method, shown in line 64 of listing 12.2, could have been as simple as the write() shown in line 75, were it not that you need to know the offset in the file at which the text is written. Because the write() does not move the tellp() pointer, you must do so explicitly. The goal is to move to the end of the file before doing the write, so that you can note where the write starts (which becomes the offset of the string in the file).
The call to seekp() in line 73 takes two parameters. The second parameter is ios::end, which indicates that the pointer is to move to the end and then move back as many bytes as indicated by the first parameter. Because the first parameter is 0, this moves no bytes from the end of the file, and that value is stashed away in the local variable, offset.
It is this value, the offset of the string written to the DAT file, that is stored in the WNJFile index array.
To test the changes to the database, a new, temporary interface has been put into place. Instead of reading the command line, the driver program prompts repeatedly for new notes. Four notes are entered, and when a blank note is entered, the program prompts for a string for which to search.
Note that the word more appears only once in the index, but refers to two notes. The statistics show that no nodes have been created yet, because this is an order 31 B-tree and only 14 indexed values have been found. Note also that each note is prefaced by 8 bytes of binary information the length of the note and the date.
The driver program, shown in listing 12.8, begins by defining the three static members: BTree::theDiskManager, BTree::theDataFile, and BTree::theWNJFile. It initializes the static counter variables: BTree::NodeIndexCtr, BTree::LeafIndexCtr, BTree::NodePageCtr, BTree::LeafPageCtr, BTree::NodeIndexPerPage, and BTree::LeafIndexPerPage.
The user is prompted repeatedly for notes in lines 31 through 38, and each note is inserted into the tree in line 35. In line 42, the user is prompted for a string to find, and that string is passed to the tree in line 46. If an index into the WNJ (Word-Node-Join) file is returned, that index is used in line 49 to access the WNJ array of records. For each record, the actual note is retrieved from the data file in line 56. The 4-byte time value is passed to the standard library's localtime() function, and a pointer to a time structure is returned, which is used to print the time in lines 58 through 61.
The most significant change in this version of the program is the addition of the Word-Node-Join index, as shown in listing 12.7. Note that in this preliminary version, the value 5 is hard coded for WNJSize, but in a release version, you would want to make this configurable.
The logic of the Insert() operation is to check whether the current array has a value in its last entry (WNJSize-1). If so, that is a pointer to the next array of WNJ entries. You chase these linked arrays until you find one where WNJSize-1 is empty. At that point, it is time to insert a new value.
If the next to last (WNJSize-2) already is occupied, you must create a new array and link it into place, as shown in lines 105 through 112. Otherwise, you can add the new data offset right into the first available slot, as shown in lines 113 through 126.
Finding a WNJFile record is the inverse operation, as shown in lines 132 through 162. You chase each record, adding it to the results array, and then return that array to the calling function.
Note that the calling function will delete the array after it finishes displaying the results (line 65 of the driver program). Note also that because this is an array of pointers, you must use the bracket operator with delete.
Today you learned how to create data files and to connect keys to multiple notes. You created a word-node-join record, which provided the necessary level of abstraction to allow a single key to point to multiple notes. Rather than attempting to create fixed-length records, which would have been overly restrictive and wasteful of disk space, you recorded the length of each record in the index.
You then added a number of small but important features, including eliminating case sensitivity. This chapter represents the final code for the first version of the ROBIN Personal Information Management Utility.
Q: The entire text of each key is stored in the index file. Why doesn't it just point to an instance of the word in the data file?
A: It is true that there is a data size cost to storing a copy of the key in the index file, but with today's computers, it is almost always the right choice. First of all, hard disk space currently is very cheap. You easily could store a copy of every English word that exists on most hard disks. Second, disk speeds still are relatively slow, as compared to processor speeds. If you were to put just a pointer to the word on the disk in the index rather than the word itself, this would mean that every comparison of a key would cost one more disk access. This is far too expensive. Finally, the version of the word in the data file has not been normalized to all uppercase for easy comparison. You would have to convert it each time it is used.
Q: Why bother with a Word-Node-Join? Why not just search for multiple-index entries?
A: Once again, the goal is to reduce the number of disk reads, because reading data from the disk is typically the slowest part of a database program.
Q: Why not create a WNJ node with 100 slots and save the bother of having the last slot point to the next record?
A: Most nodes will point only to a small number of records; 100 slots would waste a lot of disk space. Some words, however, will point to literally hundreds of notes, and it would be too restrictive to simply "run out" of nodes. The short answer is that any number large enough to avoid overrunning the limit would be very wasteful for most records.
The Workshop provides quiz questions to help you solidify your understanding of the material covered, and exercises to provide you with experience in using what you have learned. Try to answer the quiz and exercise questions before checking the answers in Appendix A, and make sure that you understand the answers before continuing to the next chapter.
In ROBIN, we do some very simple parsing. In this set of exercises, you tackle a harder parsing challenge mathematical expressions. For example, look at the following expression:
Expression:= Number | UnaryOperator Expression | ( Expression BinaryOperator Expression )
This is a Bacchus-Naur form of the sort of mathematical expressions that we are going to be able to parse. To read this, you need to know that the symbol := means is defined as, and | means or.
The statement shown here means Expression is defined as a number, or a UnaryOperator followed by an Expression, or an open parenthesis followed by an Expression followed by a BinaryOperator followed by another Expression followed by a close parenthesis. (We're going to require the parentheses in order to simplify our task.) The entire Bacchus-Naur definition follows.
Although it seems crazy to use the term Expression in its own definition, this actually makes sense when you think about mathematical expressions. After you make appropriate definitions of Number, UnaryOperation, and BinaryOperation, you can treat all of these as legal Expressions:
2 (7 + 8) (-1 * (((15 + 3) / (9 * -2)) - 1)) Expression:= Number | UnaryOperator Expression | ( Expression BinaryOperator Expression ) Digit := 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Number := Digit | Number<adjoining>Digit where the resulting number is assigned the value of Number * 10 + Digit UnaryOperator := - | ~ which mean, respectively: negate, complement BinaryOperator := + | - | * | / | & | | | ^ where these mean, respectively: add, subtract, multiply, divide, Boolean and, Boolean or, Boolean exclusive or
The class also should provide the capability for a caller to peek at the next token without removing it from the stream. Peeking at a token should not cause it to be echoed (although any white space skipped over to get to the token should cause that space to be echoed).
The tokenizer also should have a flag to turn off echoing.
Write a test program to make sure that your functions work. Your program should accept legal expressions from cin, echo them back out, and output " = " and the expression's value. The test program should repeat this process until it reaches the end of the file.
Go to: Table of Contents | Next Page