Day 10: Collection Classes and B-Trees

You have created a number of powerful lists, including sorted and unsorted linked lists and AVL trees. By using these lists, you can create a wide variety of custom Collection classes.

Each of these lists is optimized for different things: some for ease of creation, some for speed of insertion, and some for quick searching. Each has a limitation as well, and for a program like ROBIN, which will store its data on disk, none of these programs is quite right. Today you will learn

What dynamic and sparse arrays are
What dictionaries and sets are
How to create lists optimized for disk storage

Collection Classes

C++ provides one built-in collection type: the array. Most C++ primers show how to create a dynamic Array class, which enables you to combine the flexibility of linked lists with the ease of use of arrays.

The linked list shown in day 7, "Complex Data Structures," provides all you need to build a dynamic Array class. You will want to privatize the FindObject() method, and flesh out the offset operator ([]), but otherwise it is pretty much all there.

Sets

A set is a collection, often implemented with a list, where every data item is represented exactly once. Once an item is already in the set, a duplicate cannot be added.

Sets typically provide methods that implement the mathematical set operations, so that if you have two sets of data you can ask for the intersection of the sets, and get back only those items that appear in both sets. Similarly, you can ask for the union of the sets, and get those items that exist in either set. The returned object from these methods would be another set.

Dictionaries

Dictionaries are a special case of sets, in which each record has a key as well as its data. The key may be related to the data, but need not be. You might have a set of definitions, for example, whose keys are the words being defined. Or, alternatively, you might have a set of payment records, where the key is the social security number of the recipient.

A special case is the double-keyed dictionary, with which you can look up based on either value. A typical use of this is a name and address database, where records can be retrieved by either name or address.

Creating the Sparse Array Class

A sparse array is a special case of a dictionary in which the key is a numerical index into a virtual array. The sparse array uses the semantics of an Array class, but not all the elements exist. The array might allow you to enter the 12th, 85th, and 102nd members, for example. Rather than storing 102 elements, each entry is stored in a dictionary, keyed to the offset. When you request myArray[85], the sparse array looks up the value 85 in its dictionary and returns the associated value.

Making your sparse array raises a number of design issues. Assuming that you have no previous Collection classes on which to build your new SparseArray class, here's how you might begin.

Like all programs, you need to start with a specification:

Your sparse array will look and feel like an array.
You add and retrieve items using the offset operator.
Management of the array's internal lists will be invisible.
The array will allocate room only for those items actually stored on the list.

Implementation Issues

As far as your user is concerned, the array will hold whatever type of item (string, word, numbers, and so on) the user chooses. As far as the program is concerned, the actual stored item will be a node with a pointer to the stored item and a key. The key will be the value on which the list will be sorted and searched.

When the user asks for a key that is not already in the list, a node will be created with that key, and an item of the appropriate type will be created as well. This will enable the user to write

    SparseArray[5] = <some value>

and thereby assign that position to a new value. Listing 10.1 shows a modified rWord header file, listing 10.2 shows the sparse array template, and listing 10.3 shows a driver program that exercises the sparse array.

Note: In a real-world situation, you would not repeatedly re-create these Collection classes. You would, instead, design a suite of related Collection classes, and derive some of the more specific types from the more general types.

I have chosen to re-create many of these Collection classes to show a variety of approaches to creating such lists. This approach provides a number of design perspectives at the cost of undermining the fundamental lesson that you should strive for reusability in your designs.

Listing 10.1 A Modified rWord Header File

    1:     // **************************************************
    2:     // PROGRAM:  Basic word object
    3:     // FILE:     word.hpp
    4:     // PURPOSE:  provide simple word object
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 10/23/94 1.0 jl  initial release
    8:     //            11/3/94  1.1 jl added constructor char *
    9:     // **************************************************
    10:
    11:    #ifndef WORD_HPP
    12:    #define WORD_HPP
    13:
    14:    #include "stdef.hpp"
    15:    #include "string.hpp"
    16:
    17:    class rWord
    18:    {
    19:     public:
    20:        rWord(const String& text):
    21:        itsText(text), reserved1(0L), reserved2(0L) {}
    22:
    23:        rWord(const char* txt):
    24:           itsText(txt),reserved1(0L), reserved2(0L) {}
    25:
    26:        rWord(): itsText(), reserved1(0L), reserved2(0L) {}
    27:
    28:        ~rWord(){}
    29:
    30:        const String& GetText()const { return itsText; }
    31:        void SetText(const String& rhs) { itsText = rhs; }
    32:
    33:        BOOL operator<(const rWord& rhs)
    34:         { return itsText < rhs.GetText(); }
    35:
    36:        BOOL operator>(const rWord& rhs)
    37:         { return itsText > rhs.GetText(); }
    38:
    39:        BOOL operator==(const rWord& rhs)
    40:         { return itsText == rhs.GetText(); }
    41:
    42:        void Display() const
    43:           { cout <<   "  Text: " << itsText << endl; }
    44:
    45:
    46:     private:
    47:        String itsText;
    48:        long reserved1;
    49:        long reserved2;
    50:     };
    51:
    52:    #endif

Listing 10.2 The Sparse Array Template

    1:     // **************************************************
    2:     // PROGRAM:  SparseArray Header
    3:     // FILE:     sparse array
    4:     // PURPOSE:  provide sparse array
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/3/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:    #include "stdef.hpp"
    11:
    12:    #ifndef SPARSE_ARRAY_HPP
    13:    #define SPARSE_ARRAY_HPP
    14:
    15:     // **************** Node class ************
    16:      template <class T>
    17:      class Node
    18:      {
    19:      public:
    20:         Node (T* Obj, long key):itsObject(Obj),itsNext(0),itsKey(key){}
    21:         ~Node() { delete itsObject; itsNext=0;}
    22:         Node (const Node& rhs);
    23:         void InsertAfter(Node * newNode)
    24:          { newNode->SetNext(itsNext); itsNext=newNode;}
    25:         Node * GetNext() const    { return itsNext; }
    26:         void SetNext(Node * next) { itsNext = next; }
    27:         T & GetObject() const { return *itsObject; }
    28:         long GetKey() const { return itsKey; }
    29:         int operator<(const Node<T>& rhs) { return itsKey < rhs.GetKey();}
    30:         int operator==(long rhs) const { return itsKey == rhs; }
    31:         int operator<(long rhs) const { return itsKey < rhs; }
    32:         int operator>(long rhs) const { return itsKey > rhs; }
    33:         int operator<=(long rhs) const { return itsKey <= rhs; }
    34:         int operator>=(long rhs) const { return itsKey >= rhs; }
    35:
    36:        const Node& operator=(T* rhs){ delete itsObject; itsObject = rhs; return
               *this;}
    37:      private:
    38:         T * itsObject;
    39:         Node * itsNext;
    40:         long itsKey;
    41:      };
    42:
    43:       template <class T>
    44:       Node<T>::Node(const Node& rhs)
    45:       {
    46:         itsObject = new T(rhs.GetObject());
    47:         itsNext = rhs.GetNext();
    48:         itsKey = rhs.GetKey();
    49:       }
    50:
    51:
    52:
    53:      // **************** Object SparseArray ************
    54:      template <class T>
    55:      class SparseArray
    56:      {
    57:      public:
    58:         SparseArray();
    59:         ~SparseArray();
    60:         long      GetCount() const { return itsCount; }
    61:         void       Insert(T &, long);
    62:         void       Iterate(void (T::*f)()const);
    63:         T &         operator[](long);
    64:
    65:      private:
    66:         Node<T>  itsHead;
    67:         long itsCount;
    68:     };
    69:
    70:         // Implementations for SparseArrays...
    71:
    72:        template<class T>
    73:        SparseArray <T>::SparseArray():
    74:           itsCount(0),
    75:           itsHead(0,0)  // initialize head node to have no Object
    76:           {}
    77:
    78:        template<class T>
    79:        SparseArray <T>::~SparseArray()
    80:        {
    81:        }
    82:
    83:        template<class T>
    84:        T&  SparseArray<T>::operator[](long key)
    85:        {
    86:           for (Node<T> * pNode = itsHead.GetNext();
    87:                 pNode!=NULL;
    88:                 pNode = pNode->GetNext()
    89:                )
    90:           {
    91:              if ( *pNode >= key)
    92:                 break;
    93:           }
    94:           if (pNode && *pNode == key)
    95:             return pNode->GetObject();
    96:
    97:           else     // not found
    98:           {
    99:             T* pNew = new T;
    100:            Insert (*pNew, key);
    101:            return *pNew;
    102:          }
    103:
    104:       }
    105:
    106:       template<class T>
    107:       void SparseArray<T>::Insert(T & Object, long key)
    108:       {
    109:           Node<T> * NewNode = new Node<T>(&Object,key);
    110:
    111:           for (Node<T> * pNode = &itsHead;;pNode = pNode->GetNext())
    112:           {
    113:            int IsLess = FALSE;
    114:            if (pNode->GetNext())
    115:              IsLess = *NewNode < *(pNode->GetNext());
    116:
    117:            if ( (pNode->GetNext() == NULL) || IsLess)
    118:            {
    119:                 pNode->InsertAfter(NewNode);
    120:                 itsCount++;
    121:                 break;
    122:            }
    123:           }
    124:       }
    125:
    126:      template<class T>
    127:      void SparseArray<T>::Iterate(void (T::*func)()const)
    128:      {
    129:         for (Node<T>* pNode = itsHead.GetNext();
    130:              pNode;
    131:              pNode=pNode->GetNext()
    132:             )
    133:         {
    134:               cout << "[" << pNode->GetKey() << "] ";
    135:               (pNode->GetObject().*func)();
    136:               // cout << "\n";
    137:         }
    138:      }
    139:
    140:   #endif

Listing 10.3 A Driver Program for the Sparse Array

    1:     // Listing 10.3 - A Driver Program for the Sparse Array
    2:
    3:     #include "sparse.hpp"
    4:     #include "word.hpp"
    5:
    6:     int main   (int argc, char **argv)
    7:     {
    8:        SparseArray<rWord> array;
    9:        rWord * pWord = 0;
    10:
    11:       for (int i = 0; i<argc; i++)
    12:       {
    13:          pWord = new rWord(argv[i]);
    14:          array[i] = *pWord;
    15:          array[5*(i+10)] = array[i];
    16:       }
    17:       array.Iterate(rWord::Display);
    18:       return 0;
    19:    }

Output:

    d:\day10>1003 Imagination is more important than knowledge
    [0]   Text: D:\DAY10\1003.EXE
    [1]   Text: Imagination
    [2]   Text: is
    [3]   Text: more
    [4]   Text: important
    [5]   Text: than
    [6]   Text: knowledge
    [50]   Text: D:\DAY10\1003.EXE
    [55]   Text: Imagination
    [60]   Text: is
    [65]   Text: more
    [70]   Text: important
    [75]   Text: than
    [80]   Text: knowledge

Analysis:

In lines 23 and 24 of listing 10.1, the rWord class is extended to include a constructor that takes a C-style string.

The SparseArray class itself is declared in listing 10.2, along with its associated Node class. The Node class provides comparison operators to compare one node to another (line 29), as well as comparing the key value of a node with a provided long (lines 30 through 34).

The node itself stores a pointer to the object stored in the array (line 46), and a pointer to the next node in the array (line 47). In line 48, it also stores its key value, which is used as the offset in the array offset operator ([]), shown on line 63.

The SparseArray itself is straightforward. When a request is made for an object at a given offset, operator[] is called. The array is searched until the key is found, or until the key value is exceeded, and thus it is known that the key is not currently stored in the array.

If the key is found, the associated object is returned. If the key is not found, a default version of the object is created and inserted into the array, and the default object is returned.

Listing 10.3 provides a simple driver, which reads all the words on the command line and inserts them into the array. The first insertion, in line 14, is at the value of the argument; thus, the first argument is at array offset 0, the second at array offset 1, and so on. The second insertion, in line 15, demonstrates the capability to insert into the array at somewhat arbitrary intervals.

Note that when array[55] is created, the intervening (empty) array items are not created. The array is sparsely populated, as the specification requires.

Tip: There are a number of ways to optimize this and the other arrays you have seen. The sparse array, for example, would be quicker if a more sophisticated search algorithm were used to find the key or to decide where to insert the key. This optimization is left for day 17, "Profiling and Optimization."

Tree Structures and Object Persistence

Over the past few days, I've reviewed a number of powerful data structures. The single most significant drawback to all of them, however, is that they are not optimized for storage on disk.

In the real world, computers have limited memory, and users exit your program sooner or later. ROBIN will be quite useless if it cannot store its data, and the data structures seen so far are not good at quick retrieval from disk.

The binary trees and AVL trees seen yesterday and on days 7 and 8 have the nasty habit of becoming quite deep when they get large. Each layer of the tree represents at least one additional disk read, and the single most expensive thing you can do in your program, from the standpoint of performance, is to read the disk.

B-Trees

As you might have suspected, I have a solution to this problem: the B-tree. The B-tree was invented in 1972 by R. Bayer and E. McCreight, and was designed from the start to create shallow trees for fast disk access.

B-trees were recognized immediately as a powerful solution to the problem of disk-based storage, and virtually every commercial database system now uses B-trees or a variant as their storage mechanism.

A B-tree consists of pages. Each page has an index, which consists of a key value and a pointer. The pointer in an index can point to another page or to the data you are storing in the tree.

Every page has indexes that point to other pages or to data. In the former case, the page is called a node page; in the latter case, the page is called a leaf page. The number of indexes on a page is the page's order.

Every page therefore has a maximum of order child pages. It is a rule of B-trees that no pages, other than the top page and the node pages, ever have fewer than order/2 indexes. A leaf page can have one fewer than that (order/2-1).

New indexes are added only to leaf pages. This fact is critical: you never add an index to a node page. Node pages are created when an existing page "splits."

Here's how it works. Assume that you are creating a B-tree of order 4, to store words. To simplify the example, I'll ignore the actual data, and have the index's key be the word itself. For this example, I'll build up a tree with the words Four score and seven years ago, our fathers brought forth on this continent...

When the tree is created, its root pointer, myRoot, points to nothing, as shown in figure 10.1.

The first word, Four, is added to a new page, as shown in figure 10.2. This new page is a leaf page, and the index, four, would point to the actual data.

Words are added to the page until order words, in this case four words, at which time the page is full, as shown in figure 10.3.

Figure 10.1 An empty B-tree.

Figure 10.2 A leaf page.

Figure 10.3 A full-leaf page.

When it is time to enter the word for, the page must split to make room. The algorithm follows:

Split the page in half.
Add the new word.
If the new word is smaller than the first word, adjust the pointer.
Return a pointer to the new page.
If the root detects that a new top page is required, create it.
Add an entry in the new top page to point to what the root points to.
Add an entry in the new top page for the return value from step 4.
Point the root to the new top page.

In the case shown in figure 10.4, the word years is to be added. To do this, the page must split. The new page is returned, and the root pointer recognizes that a new top page is needed. The new page is populated with an index pointing to the entry myRoot used to point to (And), and a second entry is made pointing to the new page. myRoot then is pointed to this new node page.

Figure 10.4 Splitting the page.

The next word to be added is ago. Because this is earlier than and, it is added before and on the leaf page, and the node page is "fixed up" to point to this earlier index, as shown in figure 10.5.

When a page node is filled, as shown in figure 10.6, it splits, as shown in figure 10.7, and the new pointer is added to the node pointer. This continues until the node page is full, as shown in figure 10.8.

Figure 10.5 Fixing up the node.

Figure 10.6 Getting ready to split.

Figure 10.7 After the split.

Figure 10.8 The tree when the node page is full.

Adding the next word, new, to this tree presents a problem. The first page will have to split, and when it does, it will pass an entry to its parent node. That node, however, is full, so it too will have to split. When it does, the root pointer must recognize that a new node is required. This is shown in figure 10.9.

Figure 10.9 A third tier.

Deleting a Node

Although it is possible to delete nodes in a B-tree, you usually don't bother. ROBIN will take the expedient course of marking nodes as deleted and then rebuilding the tree when it is packed.

This method saves building the complex deletion algorithm into the class, and is the right way to handle this in any case. You will note, however, in the following section, that mark-for-deletion is not yet implemented. That detail is put off until day 11, when database management is discussed.

Implementing the B-Tree

Once the algorithms are fully understood, the implementation of the B-tree is straightforward. Listing 10.4 provides a slightly modified standard definition file (stdef.hpp). Listing 10.5 has the B-tree header file saved as btree.hpp. This provides the declarations for the B-tree, the page, and the Index classes.

Listing 10.6 is the implementation of the B-tree, listing 10.7 is the implementation of the page, and listing 10.8 is the implementation of the index. Finally, listing 10.9 is a simple driver program that creates a B-tree and prompts the user for entries.

Listing 10.4 Using stdef.hppthe Standard Definition File

    1:     // **************************************************
    2:     // PROGRAM:  Standard Definitions
    3:     // FILE:     stdef.hpp
    4:     // PURPOSE:  provide standard inclusions and definitions
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 10/23/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:
    11:    #ifndef STDDEF_HPP                 // inclusion guards
    12:    #define STDDEF_HPP
    13:    #include <iostream.h>
    14:
    15:    const int szLong = sizeof (long int);
    16:    const int szShort = sizeof (short int);
    17:    const int szInt = sizeof (int);
    18:
    19:    enum BOOL { FALSE, TRUE };
    20:
    21:    #endif

Listing 10.5 Using btree.hppthe B-Tree Header File

    1:     // **************************************************
    2:     // PROGRAM:  Btree, Page and Index declarations
    3:     // FILE:     btree.hpp
    4:     // PURPOSE:  provide fundamental btree functionality
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/1/94 1.0 jl  initial release
    8:     // **************************************************
    9:     #ifndef BTREE_HPP                 // inclusion guards
    10:    #define BTREE_HPP
    11:
    12:    #include <string.h>
    13:    #include "stdef.hpp"
    14:    const int Order = 4;
    15:    const int dataLen = 11;
    16:
    17:    // forward declarations
    18:    class Btree;
    19:    class Index;
    20:
    21:    class Page
    22:    {
    23:    public:
    24:       Page();
    25:       Page(Index*,BOOL);
    26:       Page(Index**, int, BOOL);
    27:       ~Page(){}
    28:       Page* Insert(Index*);
    29:       Page* InsertLeaf(Index*);
    30:       Page* InsertNode(Index*);
    31:       Index* GetFirstIndex() { return myKeys[0]; }
    32:       BOOL GetIsLeaf() const { return IsLeaf; }
    33:       void Push(Index*,int offset=0);
    34:       void Nullify(int offset);
    35:       void Print();
    36:       void ReCount();
    37:    private:
    38:       Index* myKeys[Order];
    39:       int myCount;
    40:       BOOL IsLeaf;
    41:
    42:    };
    43:
    44:    class Index
    45:    {
    46:    public:
    47:       Index(char *);
    48:       Index();
    49:       Index(const Index&);
    50:       char * GetData() const { return myData; }
    51:       void SetData(const Index& rhs)
    52:          { strcpy(myData,rhs.GetData()); }
    53:       Page * GetPointer()const { return myPointer; }
    54:       void SetPointer (Page* pg) { myPointer = pg; }
    55:       void PrintKey();
    56:       void PrintPage();
    57:
    58:       int operator==(const Index& rhs)
    59:       {return strcmp(myData,rhs.GetData()) == 0; }
    60:
    61:       int operator < (const Index& rhs)
    62:        {return strcmp(myData,rhs.GetData())<0;}
    63:
    64:       int operator <= (const Index& rhs)
    65:        {return strcmp(myData,rhs.GetData())<=0;}
    66:
    67:       int operator > (const Index& rhs)
    68:        {return strcmp(myData,rhs.GetData())>0;}
    69:
    70:       Page * Insert(Index* ptr)
    71:          { return myPointer->Insert(ptr);}
    72:
    73:    public:
    74:       Page* myPointer;
    75:       char * myData;
    76:    };
    77:
    78:
    79:    class BTree
    80:    {
    81:    public:
    82:       BTree():myRoot(0){}
    83:       BTree(char* data);
    84:       ~BTree();
    85:       void AddKey(char* data);
    86:       void PrintTree();
    87:
    88:    private:
    89:       Page * myRoot;
    90:    };
    91:
    92:    #endif

Listing 10.6 B-Tree Implementation

    1:     // **************************************************
    2:     // PROGRAM:  Btree
    3:     // FILE:     btree.cpp
    4:     // PURPOSE:  provide fundamental btree functionality
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/1/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:    #include "btree.hpp"
    11:    #include "stdef.hpp"
    12:
    13:    BTree::BTree(char *  data):myRoot(0)
    14:    {
    15:       Index * pIndex = new Index(data);
    16:       myRoot = new Page (pIndex,TRUE);
    17:    }
    18:
    19:    BTree::~BTree() {}
    20:
    21:    void BTree::AddKey(char * str)
    22:    {
    23:       Page * retVal =0;
    24:       Index* pIndex = new Index(str);
    25:       if (!myRoot)
    26:          myRoot = new Page (pIndex,TRUE);
    27:       else
    28:       {
    29:          retVal = myRoot->Insert(pIndex);
    30:          if (retVal) // new top node
    31:          {
    32:             // create a new topmost node page
    33:             Index * pIndex = new Index(*myRoot->GetFirstIndex());
    34:             pIndex->SetPointer(myRoot);
    35:             Page * pg = new Page(pIndex,FALSE);
    36:             Index * pSib = new Index (*retVal->GetFirstIndex());
    37:             pSib->SetPointer(retVal);
    38:             pg->InsertLeaf(pSib);
    39:             myRoot = pg;
    40:          }
    41:       }
    42:    }
    43:
    44:    void BTree::PrintTree()
    45:    {
    46:       myRoot->Print();
    47:    }

Listing 10.7 Page Implementation

    1:     // **************************************************
    2:     // PROGRAM:  Page
    3:     // FILE:     Page.cpp
    4:     // PURPOSE:  provide fundamental btree functionality
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/1/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:    #include "btree.hpp"
    11:
    12:    Page::Page()
    13:    {
    14:    }
    15:
    16:    Page::Page(Index* index, BOOL bLeaf):myCount(1),IsLeaf(bLeaf)
    17:    {
    18:       for (int i = 1; i<Order; i++)
    19:          myKeys[i]=0;
    20:       myKeys[0]=index;
    21:       myCount = 1;
    22:
    23:    }
    24:
    25:    Page::Page(Index **array, int offset, BOOL leaf):IsLeaf(leaf)
    26:    {
    27:       myCount = 0;
    28:       int i, j;
    29:       for (i = 1; i<Order; i++)
    30:          myKeys[i]=0;
    31:       for (i=0, j = offset; j<Order; i++, j++)
    32:       {
    33:          myKeys[i]= new Index(*(array[j]));
    34:          myCount++;
    35:       }
    36:    }
    37:
    38:    void Page::ReCount()
    39:    {
    40:       myCount = 0;
    41:       for (int i = 0; i<Order; i++)
    42:          if (myKeys[i])
    43:             myCount++;
    44:
    45:    }
    46:
    47:    void Page::Nullify(int offset)
    48:    {
    49:       for (int i = offset; i<Order; i++)
    50:       {
    51:          if (myKeys[i])
    52:          {
    53:             delete myKeys[i];
    54:             myKeys[i]= 0;
    55:          }
    56:       }
    57:    }
    58:
    59:    Page * Page::Insert(Index* pIndex)
    60:    {
    61:       if (IsLeaf)
    62:         return   InsertLeaf(pIndex);
    63:       else
    64:         return InsertNode(pIndex);
    65:    }
    66:
    67:    Page * Page::InsertNode(Index* pIndex)
    68:    {
    69:       Page * retVal =0;
    70:       BOOL inserted = FALSE;
    71:       int i,j;
    72:
    73:       if (myKeys[0] && *pIndex < *(myKeys[0]))
    74:       {
    75:          myKeys[0]->SetData(*pIndex);
    76:          retVal=myKeys[0]->Insert(pIndex);
    77:          inserted = TRUE;
    78:       }
    79:       if (!inserted)
    80:          for (i = Order-1; i>=0; i--)
    81:          {
    82:             if (myKeys[i])
    83:             {
    84:                if (*pIndex > *(myKeys[i]))
    85:                {
    86:                 retVal=myKeys[i]->Insert(pIndex);
    87:                 inserted = TRUE;
    88:                }
    89:              break;
    90:             }
    91:           }
    92:       if (!inserted)
    93:          for (j = 0; j<i; j++)
    94:          {
    95:             if (myKeys[j+1] && *pIndex < *(myKeys[j+1]))
    96:             {
    97:                retVal=myKeys[j]->Insert(pIndex);
    98:                break;
    99:             }
    100:         }
    101:
    102:      if (retVal) // got back a pointer to a new page
    103:      {
    104:         Index * pIndex = new Index(*retVal->GetFirstIndex());
    105:         pIndex->SetPointer(retVal);
    106:         retVal = InsertLeaf(pIndex);
    107:      }
    108:
    109:      return retVal;
    110:   }
    111:
    112:   Page * Page::InsertLeaf(Index* pIndex)
    113:   {
    114:         if (myCount < Order)
    115:         {
    116:            Push(pIndex);
    117:            myCount++;
    118:            return 0;
    119:         }
    120:         else      // overflow the page
    121:         {
    122:            Page* sibling = new Page(myKeys,Order/2,IsLeaf); // make sibling
    123:            Nullify(Order/2);                     // nullify my right half
    124:
    125:            // does it fit in this side?
    126:            if (myKeys[Order/2-1] && *pIndex <= *(myKeys[Order/2-1]))
    127:               Push(pIndex);
    128:            else
    129:               sibling->Push(pIndex);
    130:
    131:            ReCount();
    132:            return sibling;
    133:         }
    134:   }
    135:
    136:   void Page::Push(Index *pIndex,int offset)
    137:   {
    138:      for (int i=offset; i<Order; i++)
    139:      {
    140:         if (!myKeys[i]) // empty
    141:         {
    142:            myKeys[i]=pIndex;
    143:            break;
    144:         }
    145:         else
    146:         {
    147:            if (myKeys[i] && *pIndex <= *myKeys[i])
    148:            {
    149:               Push(myKeys[i],offset+1);
    150:               myKeys[i]=pIndex;
    151:               break;
    152:            }
    153:         }
    154:      }
    155:   }
    156:
    157:
    158:   void Page::Print()
    159:   {
    160:         for (int i = 0; i<Order; i++)
    161:         {
    162:            if (myKeys[i])
    163:            {
    164:               if (IsLeaf)
    165:                  myKeys[i]->PrintKey();
    166:               else
    167:                  myKeys[i]->PrintPage();
    168:            }
    169:            else
    170:               break;
    171:         }
    172:   }

Listing 10.8 Index Implementation

    1:     // **************************************************
    2:     // PROGRAM:  index
    3:     // FILE:     index.cpp
    4:     // PURPOSE:  provide fundamental btree functionality
    5:     // NOTES:
    6:     // AUTHOR:   Jesse Liberty (jl)
    7:     // REVISIONS: 11/1/94 1.0 jl  initial release
    8:     // **************************************************
    9:
    10:    #include "btree.hpp"
    11:
    12:    Index::Index(char* str)
    13:    {
    14:
    15:       myData = new char[dataLen+1];
    16:       strncpy(myData,str,dataLen);
    17:       myData[dataLen]='\0';
    18:       myPointer=0;
    19:    }
    20:
    21:    Index::Index(const Index& rhs)
    22:    {
    23:        myData = new char[dataLen+1];
    24:       strcpy(myData, rhs.GetData());
    25:       myPointer=rhs.GetPointer();
    26:    }
    27:
    28:    Index::Index():myPointer(0)
    29:    {
    30:       myData = new char[dataLen+1];
    31:       myData[0]='\0';
    32:    }
    33:
    34:    void Index::PrintKey()
    35:    {
    36:       cout << "  " << myData;
    37:    }
    38:
    39:    void Index::PrintPage()
    40:    {
    41:       cout << "\n" << myData << ": " ;
    42:       myPointer->Print();
    43:    }

Listing 10.9 The B-Tree Driver Program

    1:     // Listing 10.9 - The B-Tree Driver Program
    2:
    3:     #include "String.hpp"
    4:     #include "stdef.hpp"
    5:     #include "btree.hpp"
    6:
    7:     int main()
    8:     {
    9:        BTree myTree;
    10:       char buffer[255];
    11:       for (;;)
    12:       {
    13:
    14:          cout << "word: ";
    15:          cin.getline(buffer,255);
    16:          if (buffer[0])
    17:             myTree.AddKey(buffer);
    18:          else
    19:             break;
    20:       }
    21:       myTree.PrintTree();
    22:       return 0;
    23:    }

Output:

    d:\day10>1004
    word: four
    word: score
    word: and
    word: seven
    word: years
    word: ago
    word: our
    word: fathers
    word: brought
    word: forth
    word: on
    word: this
    word: continent
    word: a
    word: new
    word: nation
    word:

    a:
    a:   a  ago  and
    brought:   brought  continent
    fathers:
    fathers:   fathers  forth
    four:   four  nation  new  on
    score:   score  seven  this  years

Analysis:

Listing 10.5 is the interface to the B-tree, its constituent pages, and indexes. In line 14, the order of the tree is set--for this demonstration, to 4. This indicates that there will be at most 4 indexes on each page, and that node leaves that are not the top always will have at least 1.

Lines 18 and 19 are forward declarations of the Btree and Index classes, which are used by the Page class, which begins in line 21.

Page has three constructors, shown in lines 24 through 26. The last, in line 26, takes a pointer to an array of Index objects, and is used when splitting a Page. The Insert() method determines which of the following methods, InsertNode() or InsertLeaf(), is appropriate.

The Push() method is a quick approach to inserting a node into a page. This is a good method to target for optimization, however, as is discussed shortly.

The Page has an array of Index objects of size Order, a count of how many Index objects currently are in the array, and a Boolean indicating whether the current page is a leaf or a node page.

The Index class begins in line 44, and includes three constructors, including a copy constructor. There are accessors for the Index's data in lines 50 through 52, and for the Pointer, which points to the next node in the tree or to the data stored.

In lines 58 through 68, various comparison operators are provided to simplify comparing the data stored in Index objects. The operators act as wrappers on the data, enabling you to write

    if *myNode < *otherNode

rather than

    if myNode->GetData() < otherNode->GetData()

The Btree itself begins in line 79 and is a very simple class. It contains only a pointer to the topmost page. There are two constructors and a destructor, and two methods, one to add a new item to the tree and the other to print the tree.

The implementation of the Btree constructor that takes a pointer to data is in lines 13 through 17. An Index object is created and placed into a new Page object; the myRoot pointer then is set to point to that new page.

The AddKey() method is in lines 21 through 42. A pointer to a new Page object is created and initialized to NULL in line 23. A new Index object is created from the provided character string in line 24.

There are two cases considered: the first in which there is no root page, and the second in which the root page already exists. The former is simpler: a new Page object is created and the new Index is inserted.

If the myRoot pointer already points to a tree, the Insert() method is called and its return value is assigned to the retVal pointer, in line 29. If this pointer is non-null, then a new topmost page is needed, because the previous topmost page had to split and the new page was returned. In this case, the new page is created and an Index that points to the old topmost page is inserted. A second Index object also is created to point to the new sibling page. Finally, in line 39, the myRoot pointer is reassigned to point to the new topmost page.

The PrintTree() method simply passes along the command to print the tree to the topmost page.

Listing 10.7 has the implementation for the Page methods, including the constructor for new pages (lines 16 through 23) and for split-off pages (lines 25 through 36).

In the former case, myCount is initialized to 1 and IsLeaf is initialized to the value provided by whoever is creating the Page. The Index objects then are initialized to null pointers in lines 18 and 19, and the first Index pointer is set to the value passed in. Finally the myCount variable is set to 1.

The latter constructor, called when a Page is splitting, is similar, except that the myKeys array is initialized to the values passed in from the original Page, in lines 31 through 35.

The two utility methods, Recount() and Nullify(), are fairly straightforward. Recount() resets the counter, and Nullify() deletes all the Index objects starting at the provided offset.

The Insert() method examines the Page's IsLeaf Boolean and dispatches to the appropriate Insert... method. If the current Page is a node page, InsertNode, starting in line 67, is called.

In line 73, the value of the new index is compared to the value of the first index. Note the peculiar construction of

    if (myKeys[0] && *pIndex < *(myKeys[0]))

This is used repeatedly throughout the program. First check to make sure that the pointer is not null, and then access it to call its comparison operator. Remember that if myKeys[0] evaluates to 0 (false), the second part of the && clause never will be evaluated.

The inserted flag is used to simplify the complex logic of when to continue evaluating and when to stop. The node has been inserted if the new value is less than the first member of the array. In this case, in line 75, the first member of the array is set to the value in the new index and then, in line 76, the new Index item is inserted into the tree. This call to the Index's Insert() method invokes the Insert() method of the page that the first member of the array points to, as shown in lines 70 and 71 of listing 10.5.

The situation you have just seen is this: a node page has a first value that is larger than the new value. The new value must be inserted into the node pointed to by that first Index, but that first index must be set to reflect this new smaller value. This is the situation as shown in figure 10.5.

If the test in line 73 fails, the next test is performed in line 82. This bit of code looks for the last entry in the array (counting back from the last entry in the index until one is found). If the new value is greater than that entry, the new value is inserted into the page pointed to by that entry.

If no value is found, the index is searched for the first instance where the next item is greater than the new Index object, and the new Index is inserted there.

Thus, if the page has boy dog fence, the word angle would be inserted into the page pointed to by boy, and the word boy would be changed to angle.

If the word grape were the new value, it would be inserted into the page pointed to by fence via the logic in the second test.

If the word cat were the new value, it would be inserted into the page pointed to by boy, because it is smaller than the entry to the right of boy.

In line 102, retVal is examined. This will be non-null if the child node was forced to split. In that case, the tree is adjusted. If the current page is forced to split as well, retVal will be set to point to the new sibling page, and the logic will recurse to the calling object.

The logic for InsertLeaf begins in line 112. If there is room in the array, the test in line 114 will evaluate TRUE; Push() will be called, the count will be incremented, and a null pointer will be returned to the calling function.

Push() is implemented in lines 136 through 155. This is not a very efficient algorithm, but it is fairly simple. It forces a sort of the array, recursively inserting objects into their appropriate slots in the array. It relies on the fact that the array always is packed to the left; there are never slots open unless there are no greater values. Thus, the first open slot may be used. If the slot is not open, a comparison is done; and if the new value is smaller, it is inserted and the old value is pushed.

Returning to InsertLeaf(), if the page is full, the else condition starting in line 120 is evaluated. A new Page object is created, passing in a pointer to the current array of Index objects, the offset of the middle of the array, and the current page's IsLeaf Boolean. The sibling page will be a node page if the current page is one, and it will be a leaf page if the current page is a leaf.

In lines 126 through 129, the new value is evaluated and pushed into the appropriate page. Because the array now has been sliced down, ReCount() is called and the new sibling Page is returned.

The PrintKey() method calls PrintKey() or PrintPage() on each of its Index objects, depending on whether the page itself is a leaf or a node.

Listing 10.8 shows the constructors for the Index objects, as well as the two rather simple Print methods. Listing 10.9 shows the driver program that exercises this B-tree by prompting the user for words and adding the words to the tree, which then is printed.

Tip: If the first part of an and clause fails, the second part of the clause is never evaluated.

Writing Pages To Disk (Not Yet)

The entire purpose of a B-tree is to store data on a disk. The current implementation is not quite ready to do so, however. Rather than using pointers, it will need to store the offset of the Page objects in a disk file.

To facilitate getting pages off the disk and writing them back out, and to provide a cache to speed this up, you will need to create a PageManager object, as seen on day 11. You also will need the capability to read and write Page objects to disk, although this can be done as blocks of data rather than as persistent objects. All these details are considered in depth on day 11.

Summary

Today you learned what various Collection classes are and how to create them. You saw the implementation of a sparse array, and how to create a B-tree. You also explored some of the trade-offs in using each of these complex data structures.

Q&A

Q: What is the prime determining characteristic of a set?

A: The thing that makes a set special is that no value may appear more than once in the set.

Q: What is the difference between a sparse array and a dictionary?

A: A sparse array is a dictionary where the key value is an integral type (an integer or long, for example). It uses the offset operator ([]) to mimic the functionality of an array.

Q: What is the difference between a sparse array and a regular array?

A: A sparse array uses no memory for those values that are not held in the array.

Workshop

The Workshop provides quiz questions to help you solidify your understanding of the material covered, and exercises to provide you with experience in using what you have learned. Try to answer the quiz and exercise questions before checking the answers in Appendix A, and make sure that you understand the answers before continuing to the next chapter.

Quiz

What is a set?
What is a dictionary?
What is a sparse array?
What is a B-tree?
How is a B-tree different from a binary tree?

[Click here for Answers]

Exercises

Declare a Dictionary class.
Implement the Dictionary class.
Create a driver program to run the Dictionary class created in the first two exercises.
Modify the B-tree to not take duplicates.

Go to: Table of Contents | Next Page