Day 11: Building Indexes on Disk

You have learned how to create a variety of advanced data structures, including balanced trees and B-trees. As discussed on day 10, the entire purpose of B-trees is to make access to disk-based indexes quick and easy. Today you will learn

How to design a disk-based B-tree
How to add and search for records on disk
How the B-tree will interact with your data records

Keeping It in Perspective

The goal, you will remember, is quick access to the records that match the keys your user will enter. In the case of ROBIN, that means finding the rNote objects that match the words the user types, but the more important general idea is to be able to find a record based on an indexed value.

This chapter continues to focus on the specific requirements of ROBIN: adding and removing rNote objects and rWord-based Index objects. Everything shown here, however, is directly applicable to any disk-based database.

The primary goal with all software development is to ship a product. The closer your program comes to meeting the specifications, budget, timeline, and other requirements of your employer or market demands, the more successful you will be. But remember, the goal is to ship the product.

The game you are playing is pinball. Do a good job, ship the product, score lots of points, and you get to play again.

Real-World Considerations

Let's take a short digression to talk about building a database in the real world. In all likelihood, here's how you will program your next database: you will pick up the phone, dial an 800 number to a mail-order house with a name like Programmer Credit-Card Heaven, and ask for the fastest, most reliable ISAM-based database system they have that meets your specifications. You then will plunk down a few hundred dollars and save yourself a few hundred hours of programming time.

Only in the most rare of circumstances will you actually "roll your own" database from scratch, as I am doing here. Even so, it is imperative that you understand how all this works, because whatever you buy, you will need to tweak it, extend it, and ultimately take ownership of it for your product.

Knowing What You Are Building

With ROBIN, the overall goal is to have a database of rNotes and an index file of words that enables you to quickly find all the note records that match the offered words.

The rNotes database (ROBIN.DB) consists of variable-length records. Each record, or rNote, will consist of a date (4 bytes), a length (4 bytes), and a variable-length string of characters. rNotes will be stored sequentially, with each note beginning right after the preceding note in the file. You will use the techniques you learned on day 9 for object persistence to store the notes.

The word index file (WORD.IDX) will consist of pages of indexes, leading ultimately to a leaf page with the index for the word found. This record will not point to ROBIN.DB directly; after all, each word probably will point to many different rNote records. Instead, each index will point to the first record in the Node index file (ROBIN.IDX), which will have a linked list of pointers to records in ROBIN.DB.

Here's how it works: The user requests a search on the word THE. Word.IDX, which is a disk-based B-tree, is searched. Node-page index entries lead to other node-page entries until a leaf page is found. The number recorded there is an offset into ROBIN.IDX. Each record in ROBIN.IDX points to a record in ROBIN.DB, and also points to the next entry in ROBIN.IDX (or to null when there are no more entries).

Using Disk-Based B-Trees

The B-tree you built on day 10 was memory-based; at no time were the records written to disk. The goal of a B-tree, however, is precisely thatto store the index on disk.

In the examples you saw on day 10, the index stored a pointer. Pointers are memory-based entities, however. Instead, you want not to store a pointer, but to store the appropriate page. Node-page index entries will store the page number they point to; leaf-page index entries will store the offset of the data to which they refer.

Caching

Rather than reading each page into memory each time it is needed and then tossing it away, it is far faster and more efficient if you can keep a few pages in memory. The index, however, doesn't want to keep track of whether the page it is pointing to is in memory (in which case a pointer is needed) or is on disk.

This capability would require the index to understand far too much about how pages are stored to disk. All the index should know is that it points to page number 5, for example.

A disk manager is needed. The node-page index hands the disk manager an offset and gets back a pointer to the requested page. It is the disk manager's job to decide whether the page already is in memory or must be picked up off the disk.

In the simplest case, the disk manager would keep an array of pages. When the array was full, it would toss out a page and bring in another. But which page should be tossed? It would be far more efficient to toss out a page that rarely is used than a page that is used all the time. After all, tossing out a page that is used all the time just means that the page will be loaded back in soon, and this defeats your purpose of trying to minimize disk reads.

The answer is a least recently used (LRU) queue. A recently used queue is the queue that was used most recently; a least recently used queue is the page you have gone the longest without needing. The idea is that if you haven't needed it for a while, you probably will not need it any time soon.

Every database can keep only a limited number of pages in memory at any one time. The amount is determined by the amount of memory the user has available, and truly sophisticated programs will increase or decrease that number dynamically. ROBIN will take a middle course, and allow the user to set the number of pages to be kept in memory (along with the order of the pages) from a configuration file.

Swapping Out

The problem with setting a very small number of pages to be held in memory at any one time is that the disk manager may swap out a page while it still is needed. One solution to this problem is to use lockers to lock down the page when it is in use; this is discussed on day 15 when memory management is covered.

B-trees reduce this problem somewhat by keeping a relatively shallow tree in memory. Depending on how many indexes you have on each page, you may be able to severely limit the number of pages you ever need in memory at the same time.

Determining How Big Each Page Should Be

The goal, again, is quick reads and writes. It turns out that most personal computers are fastest if they can read a block of data that is a multiple of 2. The ideal size is determined by the sector size of your disk. Because this size varies, you will want the program to get this size from a configuration file, but for now I'll use 512, the size for most PC disks.

Each index record needs to be a divisor of the order, so that an even number fit on each page. The index you have seen so far is 16 bytes: 4 bytes of pointer (now offset) and 11 bytes of data, with a final null byte terminating the string. There are 32 16-bytes in a 512-byte page, so each page will hold 32 index objects. The order of the B-tree therefore will be 32. A 32-order tree can hold 1,024 words in two levels, 32,768 words in three levels, and 33,554,432 words in five levels. Most searches probably can be accomplished in just a few disk reads, which is ideal.

Determining How Many Pages Can Be in Memory at Once

The algorithm you will use will be recursive, starting at the top node and working your way down, as seen in previous examples. Because each page has room for 32 indexes, the average case is that at any time, half of these will be in use16 indexes per page.

Ten levels of pages with 16 indexes per page provide access to a trillion keys, which will be more than enough. Ten pages, however, will take up only 16 bytes per index: 16 indexes per page times 10 pages equals 2,560 bytes or 2K.

This is the power of B-trees in a nutshell: By having the capability to hold 2K of pages in memory at any time, B-trees give you access to a trillion keys.

Swapping to Disk

The fastest way to swap to disk is to do a single memcpy. C++ gains from its lineage in C the capability to hack memory when required so that all the indexes on a page can be loaded from disk in a single memory move. This capability is discussed in detail when the Page object is covered later in this section. Listing 11.1 is the header file for the disk-based, B-tree system.

Listing 11.1 A Disk-Based, B-Tree Interface

    1:     // **************************************************
    2:     // PROGRAM:  Btree, Page and Index declarations
    3:     // FILE:     btree.hpp
    4:     // PURPOSE:  provide fundamental btree functionality
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/1/94 1.0 jl  initial release
    8:     // **************************************************
    9:     #ifndef BTREE_HPP                 // inclusion guards
    10:    #define BTREE_HPP
    11:
    12:    #include <time.h>
    13:    #include <string.h>
    14:    #include <fstream.h>
    15:    #include "stdef.hpp"
    16:
    17:    const int Order =      31;  // 31 indexes and 1 header
    18:    const int dataLen =    11; // length of a key
    19:    const int MaxPages =   20; // more than we need
    20:    const int SizeItem =   16;  // key + offset
    21:    const int SizePointer = 4; // size of offset
    22:    const int PageSize = (Order+1) * SizeItem;
    23:
    24:    // forward declarations
    25:    class Page;
    26:    class Index;
    27:
    28:    // DiskManager - in memory keeps track of what pages are
    29:    // already in memory and swaps to disk as required
    30:    class DiskManager
    31:    {
    32:    public:
    33:       // constructors & destructor
    34:       DiskManager();
    35:       ~DiskManager(){}
    36:
    37:       // management of files
    38:       void Close(int);
    39:       int Create();
    40:       int Open();
    41:
    42:       // Page manipulation
    43:       Page* GetPage(int);
    44:       void SetPage(Page*);
    45:       void Insert(Page * newPage);
    46:       void Save(Page* pg);
    47:       int NewPage(const Index&, BOOL);
    48:       int NewPage(Index *array, int offset, BOOL leaf, int count);
    49:
    50:    private:
    51:    Page* myPages[MaxPages];
    52:    fstream myFile;
    53:    int myCount;
    54:    };
    55:
    56:    // the btree itself - has a pointer to first page
    57:    class BTree
    58:    {
    59:    public:
    60:       // constructors and destructor
    61:       BTree();
    62:       ~BTree();
    63:
    64:       // utility functions
    65:       void AddKey(char* data);
    66:       void PrintTree();
    67:       int Find(char* data);
    68:
    69:       // page methods
    70:       Page* GetPage(int page)
    71:          { return theDiskManager.GetPage(page); }
    72:       int NewPage(const Index& pIndex, BOOL IsLeaf)
    73:          { return theDiskManager.NewPage(pIndex,FALSE); }
    74:
    75:       // public static member!
    76:       static  DiskManager theDiskManager;
    77:    private:
    78:       int myRoot;
    79:    };
    80:
    81:    // index objects point to pages or real data
    82:    class Index
    83:    {
    84:    public:
    85:       // constructors & destructor
    86:       Index();
    87:       Index(char *);
    88:       Index (char*, int);
    89:       Index(const Index&);
    90:       ~Index(){}
    91:
    92:       // accessors
    93:       const char * GetData() const  { return myData; }
    94:       void SetData(const Index& rhs)
    95:          { strcpy(myData,rhs.GetData()); }
    96:       void SetData(const char * rhs)
    97:          { strcpy(myData,rhs); }
    98:       int GetPointer()const { return myPointer; }
    99:       void SetPointer (int pg) { myPointer = pg; }
    100:
    101:      // utility functions
    102:      void PrintKey();
    103:      void PrintPage();
    104:      int Insert(const Index& ref,BOOL findOnly = FALSE);
    105:
    106:      // overloaded operators
    107:      int operator==(const Index& rhs) const
    108:      {return strcmp(myData,rhs.GetData()) == 0; }
    109:
    110:      int operator < (const Index& rhs) const
    111:       {return strcmp(myData,rhs.GetData())<0;}
    112:
    113:      int operator <= (const Index& rhs) const
    114:       {return strcmp(myData,rhs.GetData())<=0;}
    115:
    116:      int operator > (const Index& rhs) const
    117:       {return strcmp(myData,rhs.GetData())>0;}
    118:
    119:      int operator >= (const Index& rhs) const
    120:       {return strcmp(myData,rhs.GetData())>=0;}
    121:
    122:   public:
    123:      int myPointer;
    124:      char myData[SizeItem - SizePointer];
    125:   };
    126:
    127:
    128:   // pages - consist of header and array of indexes
    129:   class Page
    130:   {
    131:   public:
    132:      // constructors and destructor
    133:      Page();
    134:      Page(char*);
    135:      Page(const Index&,BOOL);
    136:      Page(Index*, int, BOOL, int);
    137:      ~Page(){}
    138:
    139:      // insertion and searchoperations
    140:      int Insert(const Index&, BOOL findOnly = FALSE);
    141:      int Find(const Index& idx) { return Insert(idx,TRUE); }
    142:      int InsertLeaf(const Index&,BOOL findOnly = FALSE);
    143:      int InsertNode(const Index&,BOOL findOnly = FALSE);
    144:      void Push(const Index&,int offset=0, BOOL=TRUE);
    145:
    146:      // accessors
    147:      Index GetFirstIndex() { return myKeys[0]; }
    148:      BOOL GetIsLeaf() const { return myVars.IsLeaf; }
    149:      int GetCount() const { return myVars.myCount; }
    150:      void SetCount(int cnt) { myVars.myCount=cnt; }
    151:      time_t GetTime() { return myTime; }
    152:
    153:      // page manipulation
    154:      int GetPageNumber() const { return myVars.myPageNumber; }
    155:      Page* GetPage(int page)
    156:         { return BTree::theDiskManager.GetPage(page); }
    157:      int NewPage(const Index& rIndex, BOOL IsLeaf)
    158:         { return BTree::theDiskManager.NewPage(rIndex,FALSE); }
    159:
    160:      int NewPage(Index* arr, int off,BOOL isL, int cnt)
    161:         { return   BTree::theDiskManager.NewPage(arr,off,isL, cnt); }
    162:
    163:      // utility functions
    164:      void Nullify(int offset);
    165:      void Print();
    166:      fstream& Write(fstream&);
    167:      void ReCount();
    168:
    169:      static int GetgPageNumber(){ return gPage; }
    170:      static void SetgPageNumber(int pg) { gPage = pg; }
    171:
    172:   private:
    173:      Index * const myKeys; // will point to myVars.mk
    174:      union
    175:      {
    176:         char myBlock[PageSize];   // a page from disk
    177:         struct
    178:         {
    179:            int myCount;
    180:            BOOL IsLeaf;
    181:            int myPageNumber;
    182:            int lastKey;
    183:            char mk[Order*sizeof(Index)];  // array of indexes
    184:         }myVars;
    185:      };
    186:
    187:      // memory only
    188:      static int gPage;
    189:      time_t myTime; // for lifo queue
    190:   };
    191:
    192:   #endif
    193:

Output:

None.

Analysis:

Lines 17 through 22 declare a number of constants used throughout the program. Many of these would be read out of a configuration file in a commercial version of ROBIN. Order determines how many indexes are held on each page, for example. When experimenting with this program, you may want to temporarily reduce this number, but don't forget to increase the number of pages if you do. A good rule of thumb is that their product should be greater than 600. If you reduce Order to 4, for example, be sure to increase MaxPages to at least 150.

DataLen is used when creating the key; the total length of the index's data is dataLen plus 1 for the terminating NULL. The SizeItem is the sum of the dataLen, plus the null, plus the SizePointer. Finally, each page's size is computed by assuming Order indexes plus an additional 16 bytes of header. Thus, 32 times 16, or 512 bytes.

Lines 25 and 26 provide forward declarations. The DiskManager class declaration begins in line 31 and ends in line 56. The purpose of this class is to provide transparent virtual memory to swap pages in and out of memory as needed.

Clients of this class call GetPage() (line 43), providing a page number and getting back a pointer to a Page. The Insert() method (line 45) puts a page into the DiskManager's array of Pages. Save() instructs the Page to write itself out to disk.

The BTree class is declared in lines 57 through 79. This class remains quite simple; it has only a pointer to the first page and methods to add and obtain pages. Note, in line 76, that BTree contains a public static DiskManager object. This provides controlled global access to the single DiskManager for the Btree.

The Index class is declared in lines 82 through 125. There are no significant changes here from previous versions; the Index object needs virtually no awareness of whether the Index objects will be stored in memory or on disk.

The Page class is declared in lines 129 through 190. There are four constructors provided, and they will be discussed in detail as the implementation is described.

The most important thing to notice is the union shown in lines 178 through 189. The goal here is to allow the disk manager to read 512 bytes in one gulp and then to move it into a Page in a single call to memcpy().

MyBlock is declared to be a character array of 512 bytes, and that is a union with the four members and the arrays of indexes. You then can read in the 512 bytes treating the unnamed union as a character array, and read out the member variables using the members of the structure.

The problem is that the array of indexes cannot be put into a union, because the Index objects have a constructor. Therefore, a character array, mk, is declared of the same size as the array of indexes. An Index pointer then is declared, in line 177. This pointer is not stored on disk, but when the page is created in memory, this pointer is set to point to the mk array. You then can treat myKeys as if it were the array, indexing into it as you would index into the heap using a pointer to allocated memory.

In line 188, a static unsigned int, gPage, is declared to allow each page to be awarded a unique, sequential page number. In line 189, a time stamp is declared. Note that neither of these variables is stored on disk; they are memory only, and are created and manipulated only after the class is instantiated in memory.

Implementing the Tree

Listing 11.2 shows the B-Tree implementation, listing 11.3 shows the Page implementation, listing 11.4 shows the Index implementation, and listing 11.5 shows the DiskManager implementation.

Listing 11.2 The B-Tree Implementation

    1:   // **************************************************
    2:     // PROGRAM:  Btree
    3:     // FILE:     btree.cpp
    4:     // PURPOSE:  provide fundamental btree functionality
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/1/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:    #include "btree.hpp"
    11:    #include "stdef.hpp"
    12:
    13:    // construct the tree
    14:    // initialize myRoot pointer to nothing
    15:    // either create or read the index file
    16:    BTree::BTree():myRoot(0)
    17:    {
    18:       myRoot = theDiskManager.Create();
    19:    }
    20:
    21:    // write out the index file
    22:    BTree::~BTree()
    23:    {
    24:       theDiskManager.Close(myRoot);
    25:    }
    26:
    27:    // find an existing record
    28:    int BTree::Find(char * str)
    29:    {
    30:       Index index(str);
    31:       if (!myRoot)
    32:          return 0L;
    33:       else
    34:          return GetPage(myRoot)->Find(index);
    35:    }
    36:
    37:
    38:
    39:    void BTree::AddKey(char * str)
    40:    {
    41:
    42:       int retVal =0;
    43:       Index index(str);
    44:       if (!myRoot)
    45:       {
    46:          myRoot = theDiskManager.NewPage (index,TRUE);
    47:       }
    48:       else
    49:       {
    50:          retVal = GetPage(myRoot)->Insert(index);
    51:          if (retVal) // our root split
    52:          {
    53:             // create a pointer to the old top
    54:             Index index(GetPage(myRoot)->GetFirstIndex());
    55:             index.SetPointer(myRoot);
    56:             // make the new page & insert the index
    57:             int PageNumber = NewPage(index,FALSE);
    58:
    59:             Page* pg = GetPage(PageNumber);
    60:
    61:             //get a pointer to the new (sibling) page
    62:             Index Sib(GetPage(retVal)->GetFirstIndex());
    63:             Sib.SetPointer(retVal);
    64:             // put it into the page
    65:             pg->InsertLeaf(Sib);
    66:
    67:             // reset myRoot to point to the new top
    68:             myRoot = PageNumber;
    69:          }
    70:       }
    71:    }
    72:
    73:    void BTree::PrintTree()
    74:    {
    75:       GetPage(myRoot)->Print();
    76:    }

Listing 11.3 The Page Implementation

    1:     // **************************************************
    2:     // PROGRAM:  Page
    3:     // FILE:     Page.cpp
    4:     // PURPOSE:  provide fundamental btree functionality
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/1/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:    #include "btree.hpp"
    11:    #include <assert.h>
    12:
    13:    // constructors
    14:
    15:    // default constructor
    16:    Page::Page()
    17:    {
    18:    }
    19:
    20:    // create a page from a buffer read in from disk
    21:    Page::Page(char *buffer):
    22:       myKeys((Index*)myVars.mk)
    23:    {
    24:       assert(sizeof(myBlock) == PageSize);
    25:       assert(sizeof(myVars) == PageSize);
    26:       memcpy(&myBlock,buffer,PageSize);
    27:       myTime = time(NULL);
    28:    }
    29:
    30:    // create a Page from the first index
    31:    Page::Page(const Index& index, BOOL bLeaf):
    32:       myKeys((Index*)myVars.mk)
    33:    {
    34:
    35:       myVars.myCount=1;
    36:       myVars.IsLeaf = bLeaf;
    37:       myKeys[0]=index;
    38:       myVars.myPageNumber = gPage++;
    39:       myTime = time(NULL);
    40:    }
    41:
    42:    // create a page by splitting another page
    43:    Page::Page(Index *array, int offset, BOOL bLeaf,int count):
    44:       myKeys((Index*)myVars.mk)
    45:    {
    46:       myVars.IsLeaf = bLeaf;
    47:       myVars.myCount = 0;
    48:       for (int i=0, j = offset; j<Order && i < count; i++, j++)
    49:       {
    50:          myKeys[i]= array[j];
    51:          myVars.myCount++;
    52:       }
    53:       myVars.myPageNumber = gPage++;
    54:       myTime = time(NULL);
    55:    }
    56:
    57:    void Page::Nullify(int offset)
    58:    {
    59:       for (int i = offset; i<Order; i++)
    60:       {
    61:             myKeys[i].SetPointer(0);
    62:             myVars.myCount--;
    63:       }
    64:    }
    65:
    66:    // decide whether I'm a leaf or a node
    67:    // and pass this index to the right
    68:    // function. If findOnly is true, don't insert
    69:    // just return the page number (for now)
    70:    int Page::Insert(const Index& rIndex, BOOL findOnly)
    71:    {
    72:       if (myVars.IsLeaf)
    73:         return   InsertLeaf(rIndex,findOnly);
    74:       else
    75:         return InsertNode(rIndex,findOnly);
    76:    }
    77:
    78:
    79:    // find the right page for this index
    80:    int Page::InsertNode(const Index& rIndex, BOOL findOnly)
    81:    {
    82:       int retVal =0;
    83:       BOOL inserted = FALSE;
    84:       int i,j;
    85:
    86:       assert(myVars.myCount>0); // nodes have at least 1
    87:       assert(myKeys[0].GetPointer()); // must be valid
    88:
    89:       // does it go before my first entry?
    90:       if (rIndex < myKeys[0])
    91:       {
    92:          if (findOnly)
    93:             return 0L; // not found
    94:
    95:          myKeys[0].SetData(rIndex);
    96:          retVal=myKeys[0].Insert(rIndex);
    97:          inserted = TRUE;
    98:       }
    99:
    100:      // does it go after my last?
    101:      if (!inserted)
    102:         for (i = myVars.myCount-1; i>=0; i--)
    103:         {
    104:            assert(myKeys[i].GetPointer());
    105:            if (rIndex >= myKeys[i])
    106:            {
    107:                retVal=myKeys[i].Insert(rIndex,findOnly);
    108:                inserted = TRUE;
    109:                break;
    110:            }
    111:       }
    112:
    113:      // find where it does go
    114:      if (!inserted)
    115:         for (j = 0; j<i && j+1 < myVars.myCount; j++)
    116:         {
    117:            assert(myKeys[j+1].GetPointer());
    118:            if (rIndex < myKeys[j+1])
    119:            {
    120:               retVal=myKeys[j].Insert(rIndex,findOnly);
    121:               inserted = TRUE;
    122:               break;
    123:            }
    124:         }
    125:
    126:      assert(inserted);  // change to exception if not!
    127:
    128:      // if you had to split
    129:      if (retVal && !findOnly) // got back a pointer to a new page
    130:      {
    131:         Index * pIndex = new Index(GetPage(retVal)->GetFirstIndex());
    132:         pIndex->SetPointer(retVal);
    133:         retVal = InsertLeaf(*pIndex);
    134:      }
    135:      return retVal;
    136:   }
    137:
    138:   // called if current page is a leaf
    139:   int Page::InsertLeaf(const Index& rIndex, BOOL findOnly)
    140:   {
    141:      int result = 0;
    142:
    143:      // no duplicates!
    144:      for (int i=0; i < myVars.myCount; i++)
    145:            if (rIndex == myKeys[i])
    146:            {
    147:               if (findOnly)
    148:                  return GetPageNumber();
    149:               else
    150:                  return result;
    151:            }
    152:
    153:      if (findOnly)         // not found
    154:         return result;
    155:
    156:      if (myVars.myCount < Order)
    157:         Push(rIndex);
    158:      else      // overflow the page
    159:      {
    160:            // make sibling
    161:            int NewPg =
    162:               NewPage(myKeys,Order/2,myVars.IsLeaf,myVars.myCount);
    163:            Page* Sibling = GetPage(NewPg);
    164:            Nullify(Order/2);                     // nullify my right half
    165:
    166:            // does it fit in this side?
    167:            if (myVars.myCount>Order/2-1 && rIndex <= myKeys[Order/2-1])
    168:               Push(rIndex);
    169:            else  // push it into the new sibling
    170:               Sibling->Push(rIndex);
    171:
    172:            result = NewPg; // we split, pass it up
    173:         }
    174:      return result;
    175:   }
    176:
    177:   // put the new index into this page (in order)
    178:   void Page::Push(const Index& rIndex,int offset,BOOL first)
    179:   {
    180:      BOOL inserted = FALSE;
    181:      assert(myVars.myCount < Order);
    182:      for (int i=offset; i<Order && i<myVars.myCount; i++)
    183:      {
    184:         assert(myKeys[i].GetPointer());
    185:         if (rIndex <= myKeys[i])
    186:         {
    187:               Push(myKeys[i],offset+1,FALSE);
    188:               myKeys[i]=rIndex;
    189:               inserted = TRUE;
    190:               break;
    191:         }
    192:      }
    193:      if (!inserted)
    194:         myKeys[myVars.myCount] = rIndex;
    195:
    196:      if (first)
    197:         myVars.myCount++;
    198:   }
    199:
    200:   void Page::Print()
    201:   {
    202:         for (int i = 0; i<Order && i < myVars.myCount; i++)
    203:         {
    204:            if (!myKeys[i].GetPointer())
    205:            {
    206:               cout << "error!! myKeys[" << i << "]";
    207:               cout << "data: " << myKeys[i].GetData();
    208:               assert(myKeys[i].GetPointer());
    209:            }
    210:
    211:            if (myVars.IsLeaf)
    212:               myKeys[i].PrintKey();
    213:            else
    214:               myKeys[i].PrintPage();
    215:         }
    216:   }
    217:
    218:   // write out the entire page as a block
    219:   fstream& Page::Write(fstream& file)
    220:   {
    221:      char buffer[PageSize];
    222:      memcpy(buffer,&myBlock,PageSize);
    223:      file.seekp(myVars.myPageNumber*PageSize);
    224:      file.write(buffer,PageSize);
    225:      return file;
    226:   }

Listing 11.4 The Index Implementation

    1:     // **************************************************
    2:     // PROGRAM:  index
    3:     // FILE:     index.cpp
    4:     // PURPOSE:  provide fundamental btree functionality
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/1/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:    #include "btree.hpp"
    11:
    12:    Index::Index(char* str):myPointer(1)
    13:    {
    14:       strncpy(myData,str,dataLen);
    15:       myData[dataLen]='\0';
    16:    }
    17:
    18:    Index::Index(char* str, int ptr):myPointer(ptr)
    19:    {
    20:
    21:       strncpy(myData,str,dataLen);
    22:       myData[dataLen]='\0';
    23:    }
    24:
    25:    Index::Index(const Index& rhs):
    26:       myPointer(rhs.GetPointer())
    27:    {
    28:       strcpy(myData, rhs.GetData());
    29:    }
    30:
    31:    Index::Index():myPointer(0)
    32:    {
    33:       myData[0]='\0';
    34:    }
    35:
    36:    void Index::PrintKey()
    37:    {
    38:       cout << "  " << myData;
    39:    }
    40:
    41:    void Index::PrintPage()
    42:    {
    43:       cout << "\n" << myData << ": " ;
    44:       BTree::theDiskManager.GetPage(myPointer)->Print();
    45:    }
    46:
    47:    int Index::Insert(const Index& ref, BOOL findOnly)
    48:    {
    49:       return BTree::theDiskManager.GetPage(myPointer)->Insert(ref,findOnly);
    50:    }

Listing 11.5 The DiskManager Implementation

    1:     // **************************************************
    2:     // PROGRAM:  diskmgr
    3:     // FILE:     diskmgr.cpp
    4:     // PURPOSE:  provide i/o for btree
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/5/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:    #include "btree.hpp"
    11:    #include <assert.h>
    12:
    13:    // on construction, try to open the file if it exists
    14:    DiskManager::DiskManager():
    15:       myFile("ROBIN.IDX",ios::binary | ios::in | ios::out | ios::nocreate)
    16:    {
    17:       // initialize the pointers to null
    18:       for (int i = 0; i< MaxPages; i++)
    19:          myPages[i] = 0;
    20:       myCount = 0;
    21:
    22:    }
    23:
    24:    // called by btree constructor
    25:    // if we opened the file, read in the numbers we need
    26:    // otherwise create the file
    27:    int DiskManager::Create()
    28:    {
    29:       if (!myFile) // nocreate failed, first creation
    30:       {
    31:          // open the file, create it this time
    32:          myFile.open("ROBIN.IDX",ios::binary | ios::in | ios::out);
    33:
    34:          char Header[PageSize];
    35:          int MagicNumber = 1234; // a number we can check for
    36:          memcpy(Header,&MagicNumber,4);
    37:          int NextPage = 1;
    38:          memcpy(Header+4,&NextPage,4);
    39:          memcpy(Header+8,&NextPage,4);
    40:          Page::SetgPageNumber(NextPage);
    41:          char title[]="ROBIN.IDX. Ver 1.00";
    42:          memcpy(Header+12,title,strlen(title));
    43:          myFile.seekp(0);
    44:          myFile.write(Header,PageSize);
    45:          return 0;
    46:       }
    47:
    48:
    49:       // we did open the file, it already existed
    50:       // get the numbers we need
    51:       int MagicNumber;
    52:       myFile.seekg(0);
    53:       myFile.read((char *) &MagicNumber,4);
    54:
    55:       // check the magic number. If it is wrong the file is
    56:       // corrupt or this isn't the index file
    57:       if (MagicNumber != 1234)
    58:       {
    59:             // change to an exception!!
    60:             cout << "Magic number failed!";
    61:             return 0;
    62:       }
    63:
    64:       int NextPage;
    65:       myFile.seekg(4);
    66:       myFile.read((char*) &NextPage,4);
    67:       Page::SetgPageNumber(NextPage);
    68:       int FirstPage;
    69:       myFile.seekg(8);
    70:       myFile.read((char*) &FirstPage,4);
    71:       const int room =  PageSize - 12;
    72:       char buffer[room];
    73:       myFile.read(buffer,room);
    74:       cout << buffer << endl;
    75:       // read in all the pages
    76:       for (int i = 1; i < NextPage; i++)
    77:       {
    78:             myFile.seekg(i * PageSize);
    79:             char buffer[PageSize];
    80:             myFile.read( buffer, PageSize);
    81:             Page * pg = new Page(buffer);
    82:             Insert(pg);
    83:       }
    84:
    85:       return FirstPage;
    86:    }
    87:
    88:    // write out the numbers we'll need next time
    89:    void DiskManager::Close(int theRoot)
    90:    {
    91:
    92:       for (int i = 0; i< MaxPages; i++)
    93:          if (myPages[i])
    94:             Save(myPages[i]);
    95:       int NextPage = Page::GetgPageNumber();
    96:       if (!myFile)
    97:          cout << "Error opening myFile!" << endl;
    98:       myFile.seekp(4);
    99:       myFile.write ( (char *) &NextPage,4);
    100:      myFile.seekp(8);
    101:      myFile.write((char*) &theRoot,4);
    102:      myFile.close();
    103:   }
    104:
    105:   // wrapper function
    106:   int DiskManager::NewPage(const Index& index, BOOL bLeaf)
    107:   {
    108:      Page * newPage = new Page(index, bLeaf);
    109:      Insert(newPage);
    110:      Save(newPage);
    111:      return newPage->GetPageNumber();
    112:   }
    113:
    114:   int DiskManager::NewPage(
    115:      Index *array,
    116:      int offset,
    117:      BOOL leaf,
    118:      int count)
    119:   {
    120:      Page * newPage = new Page(array, offset, leaf,count);
    121:      Insert(newPage);
    122:      Save(newPage);
    123:      return newPage->GetPageNumber();
    124:   }
    125:
    126:   void DiskManager::Insert(Page * newPage)
    127:   {
    128:   // add new page into array of page managers
    129:
    130:
    131:      if (myCount < MaxPages)
    132:      {
    133:         assert(!myPages[myCount]);
    134:         myPages[myCount++] = newPage;
    135:      }
    136:      else  // no room, time to page out to disk
    137:      {
    138:         int lowest = 0;
    139:         for (int i = 0; i< MaxPages; i++)
    140:            if (myPages[i]->GetTime() < myPages[lowest]->GetTime())
    141:               lowest = i;
    142:         Save(myPages[lowest]);
    143:         delete myPages[lowest];
    144:         myPages[lowest] = newPage;
    145:      }
    146:   }
    147:
    148:   // tell the page to write itself out
    149:   void DiskManager::Save(Page* pg)
    150:   {
    151:     pg->Write(myFile);
    152:   }
    153:
    154:   // see if the page is in memory, if so return it
    155:   // otherwise get it from disk
    156:   // note: this won't scale, with lots of page managers
    157:   // you'd need a more efficient search. 10 levels of page
    158:   // managers, with 31 indexes per page gives you room for
    159:   // 800 trillion words. Even if each page is only 1/2 full
    160:   // on average, 10 levels of depth would represent 64 million
    161:   // keys alone, not to mention the actual records.
    162:
    163:   Page * DiskManager::GetPage(int target)
    164:   {
    165:
    166:      for (int i = 0; i< MaxPages; i++)
    167:      {
    168:         if (myPages[i]->GetPageNumber() == target)
    169:            return  myPages[i];
    170:      }
    171:      myFile.seekg(target * PageSize);
    172:      char buffer[PageSize];
    173:      myFile.read( buffer, PageSize);
    174:      Page * pg = new Page(buffer);
    175:      Insert(pg);
    176:      return pg;
    177:   }

Listing 11.6 The Driver Program

    1:     #include "String.hpp"
    2:     #include "stdef.hpp"
    3:     #include "btree.hpp"
    4:     DiskManager BTree::theDiskManager;
    5:     int Page::gPage=1;
    6:     int main()
    7:     {
    8:        BTree myTree;
    9:        char buffer[255];
    10:       for (;;)
    11:       {
    12:          cout << "word: ";
    13:          cin.getline(buffer,255);
    14:          if (buffer[0])
    15:             if (buffer[0]=='?')
    16:             {
    17:                int found = myTree.Find(buffer+1);
    18:                cout << "Found record: " << found << endl;
    19:             }
    20:             else
    21:                myTree.AddKey(buffer);
    22:          else
    23:             break;
    24:       }
    25:       myTree.PrintTree();
    26:       return 0;
    27:    }

Analysis:

This is a complicated collection of code, although you saw much of the fundamental logic on day 10. The best way to review this code is probably to walk through bringing the program up, entering some data, closing down, restarting, adding more, and then closing down again.

The driver program begins in line 4 with the definition of the two static class members: BTree's DiskManager and Page's gPage, which is initialized to 1.

The initialization of the DiskManager object calls its constructor, shown in lines 6 through 14 of listing 11.5. The initialization of myFile fails, because of the nocreate flag, which says not to open the file if it doesn't already exist. In the body of the constructor, the array of PageManager object pointers is initialized so that all the pointers have the value 0. The failure of the open call is expected and planned for (and desired). It is a signal that the index file does not already exist, and will be handled in the Create() method.

The driver program declares a BTree object that invokes the BTree constructor, which is seen in line 16 of listing 11.2. The root pointer is initialized to 0 and the static member, the DiskManager's Create() method, is called.

The first time you run the program, the opening of myFile, shown in the constructor for the disk manager, fails. Now that failure is detected in line 21, so the program knows that it is dealing with a new file. In line 24, the file is opened in binary mode, allowing both input and output.

A magic number is written as the first 4 bytes of the file. This number can be checked on subsequent opens, to ensure that the file you are reading is valid.

In line 29, the variable NextPage is initialized to 1 and written to the file, mostly as a place holder. This number is used to initialize the global variable gPage, using the static member function SetgPageNumber. Finally, in line 35, a title is written to the file. All of this is written into the first 512 bytes of the file in line 37.

On subsequent builds, the if statement would fail, and this function would begin in line 43. The MagicNumber is read in line 45 and checked in line 49. If it fails, an exception should be thrown.

The NextPage variable is read from the file, and that number is used to set the gPage variable. The FirstPage variable is read and will be returned when the function returns, allowing the BTree to set its root pointer.

In lines 68 through 75, the pages are read 512 bytes at a time. Each page is provided as a parameter to the Page constructor in line 73.

The Page constructor is shown in lines 20 through 27 of listing 11.3. The index pointer, myKeys, is initialized with the address of myVars.mk cast to an Index pointer. Note that this enables you to deal with myKeys as if it were the array of Index objects, but still load the array in the union with the buffer.

How the Union Works

When the Page object is in memory, you want to deal with it as if it consisted of four int objects and an array of Index objects. When getting the Page object off the disk, you want to deal with it as a block of 512 bytes. To do this, you make a union between a character array of 512 bytes with the other variables. Because you cannot put an array of objects that have constructors into a union, you instead create an array of characters of the size of your array of Index objects. You then use the pointer myKeys much like you use a pointer to dynamically created memory.

In the body of the constructor, there are two asserts that fall out in the final code. The one meaningful statement, in line 26, is the block move of the buffer loaded off the disk into the union of data members.

Control then returns to line 74 of listing 11.5. The newly created page is inserted into the DiskManager's array. When this function ends, control returns to the constructor of the BTree, which in turn returns to the driver program.

The user is prompted repeatedly for words that are added to the BTree in line 19. After the user presses Enter, the tree is printed in line 23.

Each call to AddKey transfers control to line 27 of listing 11.2. If the root does not yet point to a page, a new page is created in line 34. This transfers control to line 109 of listing 11.5. A new Page object is created, and it is inserted into the array of Pages and saved to disk. The number of that page is returned, and that number is saved in the myRoot pointer.

When AddKey is called subsequently, the myRoot pointer has a value, and control goes to line 38 of listing 11.2. Here, the logic of inserting a page and then checking the return value begins. The return value, if non-zero, indicates that the page already pointed to split. In this case, in line 39 of listing 11.2, this indicates that the top node split and a new top node is required.

The call for inserting into a page is shown in line 38. The GetPage() function call takes a page number and returns a pointer. That pointer then is used to invoke Insert() on the page returned from the DiskManager.

This moves control to line 66 of listing 11.3. The current page knows whether it is a leaf page or a node page. The first 31 times this is called, the root node will be a leaf. This invokes the InsertLeaf method, beginning in line 133.

The current value is checked to see whether it is a duplicate of an existing key. If so, for now it is just discarded. In ROBIN, of course, the actual data, the word record, will be updated for all the notes that reference that word, but there will be only one key per word.

In line 143, the page is checked to see whether there is room. If so, the Index is Pushed into the page. Otherwise, the page must split. Splitting involves creating a new page and dividing the contents of the original page with the sibling. The new value then is pushed into whichever of the two pages (original and sibling) it belongs to.

When a page splits, the page number of the new page is passed back to the calling function. If the calling function is a node, the new value is put into that page, which in turn may have to split. If the calling function is the AddKey() function of the BTree itself, then a new topmost node is created.

Note that if the user prepends a question mark to his new word, the driver program interprets this as a call to Find(). Page's Find() method calls its insert method and overrides the default value of FALSE for its second parameter, passing in TRUE instead, indicating find, not insert.

The Find() logic closely parallels the Insert() logic, so instead of creating duplicate code paths, I use a flag value (findOnly) to choose the right behavior in a unified method.

Currently, Find() just returns the page where the Index is stored; in the real program, this will return an array of notes on which this word was found.

Connecting Nodes to Data

Leaf-node index items in the examples provided point nowhere. In a real program, of course, they would point to the actual data. The specification for ROBIN, like for most B-trees, specifically designates that words will not be duplicated in the tree, but that each word may point to zero, one, or many individual notes.

Each leaf-node index item will point to a disk-based, linked list of pointers to notes. When the program searches for an entry (for example, the word America), the correct leaf node will be found. The index will have the offset of the first of the linked note pointers. Each note pointer will, in fact, have two pointers: one to a note object and the other to the next note pointer.

As more notes are added, each word will be parsed and submitted to the WORD.IDX. If the word already exists in the WORD.IDX, the linked list will be traced and a new entry will be placed on the tail.

Deleting Notes

In version 1 of ROBIN, notes never will be deleted. The database will grow forever. Future versions will, however, allow for the deletion of NOTES.

When notes are deleted from the database, each word will be traced and the appropriate entry will be removed from the linked list of note pointers. When a word has no entries, it is ready for deletion from the linked list.

This begs the question, "How are Index objects deleted from a B-tree?" It certainly is possible to delete an index from a B-tree, but it is a lot of work, it usually isn't needed, and it often isn't desirable.

The simpler and more common solution is to mark the index for deletion, and then remove it when the entire B-tree is packed. When an Index item is marked as deleted, it is treated by the tree as if it were removed.

When the B-tree is packed, it is rebuilt, entry by entry, with those Index objects marked for deletion simply not included in the new version. Typically, the entire tree is rebuilt in a new file and then copied over the old file.

Implementing Performance Enhancements

The B-tree shown in listing 11.2 is very fast. Disk reads are held to an absolute minimum, and the techniques shown for reading data into memory are among the fastest available. The one inefficiency is the technique for searching the disk manager's array.

If the Disk Manager were going to hold a large number of entries, the current implementation of reading through each and looking for the oldest entry would be entirely unacceptable. Because you can be certain, however, that no more than 10 entries will ever be needed, there is little point in optimizing Disk Manager. If you were going to optimize, the easiest way to do so would be to keep the array sorted. Then finding the earliest entry would be trivial and instant.

Similarly, in lines 144 through 146 of listing 11.3, you see a brute force search for a matching Index item. A binary search would significantly speed this search. The idea of a binary search is to check the middle value in the array of Index objects. If the target is smaller than the middle, you then search the left half; if it is larger, you search the right half. That second search is accomplished by comparing the middle value of the chosen half. Each half is halved again until the object is found or proven not to exist.

These and similar optimization approaches are discussed in more detail on day 17, "Profiling and Performance."

Summary

Today you learned how to write a B-tree to disk. The primary design concern with writing such a B-tree is to minimize the number of disk accesses and to make the reads as efficient and fast as possible.

Q&A

Q: Why do you want to reduce the number of disk accesses?

A: Reading and writing to disk is typically the slowest part of any program other than waiting for input from a modem.

Q: Building this B-tree wasn't so hard. Why do you advise buying B-tree products rather than building them as needed?

A: Building a first version is not terribly difficult, although we are not done yet. Getting it right and rock-solid reliable is much harder, though.

Q: Why are the pages cached in memory?

A: The goal is to reduce the number of times you must go to the disk. Because the node pages will be used frequently, keeping them in memory can greatly enhance the efficiency of your program.

Workshop

The Workshop provides quiz questions to help you solidify your understanding of the material covered, and exercises to provide you with experience in using what you have learned. Try to answer the quiz and exercise questions before checking the answers in Appendix A, and make sure that you understand the answers before continuing to the next chapter.

Quiz

Why do B-trees reduce disk accesses?
How is page size determined?
What does it mean to say a page is order 32?
Why was a union used?
Why was a pointer to char used as a member variable?

[Click here for Answers]

Exercises

Modify the index manager to print out whether it gets its page from memory or from disk.
Start with clean data files. Enter this text:
```
    Silent night, holy night, all is calm, all is right, round young virgin mother and
    child, sleep in heavenly bliss.
```
Examine the output. How many pages are required to hold these 20 words? How many times are the pages accessed?
Change the header to set the order to 4. Delete the data and index files and run them again with the string from exercise 2. Examine the output. How many pages are required to hold these 20 words? How many times are the pages accessed? How often is the disk accessed?
Cut down the disk manager size to 5 pages, and run exercise 3 again, after clearing out the data and index files. Examine the output. How many pages are required to hold these 20 words? How many times are the pages accessed? How often is the disk accessed?

Go to: Table of Contents | Next Page