In addition to searching for documents, you can use the IAT to route documents into particular categories. For example, say you have a collection of unsorted documents related to transportation that you want to divide into three categories: planes, trains, and automobiles. After setting up the categories, you can use the IAT to compare each document from the unsorted collection with the examples for each category. Each category must already contain one or more documents that serve as examples of what it should hold; that is, you "prime" the trains category with documents heavily related to trains, and so on. The category that provides the best fit (the highest-ranked score) is the most appropriate place to put the unsorted document.
This chapter describes how to use the IAT to define document categories and route documents among those categories.
A category for related documents is called a cluster. Clusters are represented by the IACluster class, which must be subclassed to handle particular document types. For example, the IAT provides the subclass HFSCluster, which represents a cluster of HFS documents (that is, Mac OS files). When subclassing IACluster, you must override the GetNextDoc method, which returns the next document in the cluster, and the Reset method, which resets the iterator.
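The subclassing contract described above can be sketched without the IAT headers. In the following sketch, Doc, Cluster, MemoryCluster, and CountDocs are hypothetical stand-ins invented for illustration; they are not IAT classes, and a real subclass would derive from IACluster and return IADoc pointers. The sketch assumes that GetNextDoc returns a null pointer when the cluster is exhausted and that the caller owns each returned document, as the routing listing later in this chapter suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for IADoc (assumption): carries only a name in this sketch.
struct Doc {
    std::string name;
};

// Stand-in for IACluster, reduced to the two methods a subclass overrides.
class Cluster {
public:
    virtual ~Cluster() {}
    virtual Doc* GetNextDoc() const = 0;  // next document, or nullptr at the end
    virtual void Reset() = 0;             // rewind to the first document
};

// Hypothetical subclass that serves its example documents from memory.
class MemoryCluster : public Cluster {
public:
    explicit MemoryCluster(const std::vector<std::string>& names)
        : names_(names), next_(0) {}

    // Returns a newly allocated document that the caller must delete.
    Doc* GetNextDoc() const override {
        if (next_ >= names_.size()) return nullptr;
        Doc* doc = new Doc;
        doc->name = names_[next_++];
        return doc;
    }

    void Reset() override { next_ = 0; }

private:
    std::vector<std::string> names_;
    mutable std::size_t next_;  // mutable because GetNextDoc is const
};

// Counts the documents in a cluster, starting from the first one.
std::size_t CountDocs(Cluster& cluster) {
    cluster.Reset();
    std::size_t count = 0;
    while (Doc* doc = cluster.GetNextDoc()) {
        delete doc;
        ++count;
    }
    return count;
}
```

The iteration position is declared mutable so that GetNextDoc can keep the const qualifier shown in the IACluster reference later in this chapter.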
For a given cluster, you must provide one or more example documents, which are used to establish weighting criteria when comparing the cluster to an unsorted document. For an HFSCluster, these documents are contained in the folder associated with the cluster.
After establishing your clusters, the IAT router, defined in the class IARouter, lets you identify the cluster that offers the best fit for an unsorted document.
The router does not copy or move the unsorted
document; it only identifies the cluster to which the document
belongs. Any moving or copying actions must be done by your
application.
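To make the idea of a best fit concrete, here is a minimal sketch that scores a document against each cluster and returns the index of the highest score. It is not the IAT's implementation: this chapter does not say how the router computes its score, so the sketch assumes cosine similarity between term-weight vectors. TermWeights, Normalize, Dot, and BestFit are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Sparse term-weight vector (term -> weight); a stand-in for TWVector.
typedef std::map<std::string, double> TermWeights;

// Scales a vector to unit length; a zero vector is returned unchanged.
TermWeights Normalize(const TermWeights& v) {
    double sumSquares = 0.0;
    for (const auto& entry : v) sumSquares += entry.second * entry.second;
    if (sumSquares == 0.0) return v;
    const double length = std::sqrt(sumSquares);
    TermWeights unit;
    for (const auto& entry : v) unit[entry.first] = entry.second / length;
    return unit;
}

// Dot product over the terms the two vectors share.
double Dot(const TermWeights& a, const TermWeights& b) {
    double sum = 0.0;
    for (const auto& entry : a) {
        TermWeights::const_iterator match = b.find(entry.first);
        if (match != b.end()) sum += entry.second * match->second;
    }
    return sum;
}

// Returns the index of the highest-scoring cluster; ties go to the lowest index.
std::size_t BestFit(const TermWeights& doc,
                    const std::vector<TermWeights>& clusters) {
    assert(!clusters.empty());
    const TermWeights unitDoc = Normalize(doc);
    std::size_t best = 0;
    double bestScore = Dot(unitDoc, Normalize(clusters[0]));
    for (std::size_t i = 1; i < clusters.size(); ++i) {
        const double score = Dot(unitDoc, Normalize(clusters[i]));
        if (score > bestScore) {
            bestScore = score;
            best = i;
        }
    }
    return best;
}
```

Like the router, BestFit only reports an index; deciding what to do with the document remains the caller's job.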
When seeking the best fit for a document, you can add the weighting (that is, the normalized TWVector) of that document to the weighting of the appropriate cluster, if desired. That is, each additional document that fits in the cluster helps define what should be there.
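The accumulation step can be pictured as follows. This is a sketch under stated assumptions, not the IAT's code: it assumes that "adding the weighting" means a term-by-term sum of the document's unit-length vector into the cluster's vector. TermWeights and Accumulate are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Sparse term-weight vector (term -> weight); a stand-in for TWVector.
typedef std::map<std::string, double> TermWeights;

// Adds the unit-length form of docVector to the cluster's weighting.
// Assumption: plain term-by-term addition; the IAT may combine them differently.
void Accumulate(TermWeights& clusterWeights, const TermWeights& docVector) {
    double sumSquares = 0.0;
    for (const auto& entry : docVector) {
        sumSquares += entry.second * entry.second;
    }
    if (sumSquares == 0.0) return;  // an empty document contributes nothing
    const double length = std::sqrt(sumSquares);
    for (const auto& entry : docVector) {
        clusterWeights[entry.first] += entry.second / length;
    }
}
```

Normalizing first keeps a long document from dominating the cluster's weighting merely because it contains more words.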
The following listing, "Sorting documents using the router," shows an example that displays the best-fit cluster for each of a number of unsorted documents.
// Forward declaration; the function is defined after DemoRouting.
void AddItemsToIndex(StringPtr folderPathName, VectorIndex* inIndex);

// Enter the name of the folder containing the items to route
StringPtr unSortedItems = "\pMyDisk:Folders:Disorganized Stuff";

// Enter the name of the index
StringPtr singleIndexName = "\pMyDisk:Folders:test.index";

// Enter the names of the router folders
StringPtr clusterFolders[] = {
    "\pMyDisk:Folders:Planes",
    "\pMyDisk:Folders:Trains",
    "\pMyDisk:Folders:Automobiles",
    "\p"    // empty string to mark end
};

void DemoRouting()
{
    FSSpec fsSpec;
    char str[256];

    // Create/initialize our index
    (void)FSMakeFSSpec(0, 0, singleIndexName, &fsSpec);
    IAStorage* myStorage = MakeHFSStorage(fsSpec.vRefNum, fsSpec.parID, fsSpec.name);
    myStorage->Initialize();
    HFSCorpus* myCorpus = new HFSCorpus;    // was declared as an object, not a pointer
    IAAnalysis* myAnalysis = new SimpleAnalysis();
    VectorIndex* myIndex = new VectorIndex(myStorage, myCorpus, myAnalysis);
    myIndex->Initialize();

    // Set up clusters: count the folder names up to the empty-string terminator
    uint32 clusterCount = 0;
    for (clusterCount = 0; clusterFolders[clusterCount][0] != 0; clusterCount++) {}
    HFSCluster** folders = new HFSCluster*[clusterCount];
    for (uint32 i = 0; i < clusterCount; i++) {
        folders[i] = new HFSCluster(myIndex, clusterFolders[i]);
    }

    // Instantiate a router and initialize with the corpuses representing
    // our clusters.
    IARouter myRouter(myIndex);
    myRouter.InitializeClusters((IACluster**)folders, clusterCount);

    // Index the unsorted documents
    AddItemsToIndex(unSortedItems, myIndex);
    myIndex->Flush();

    HFSTextFolderCorpus* source = new HFSTextFolderCorpus(unSortedItems);
    IADocIterator* docs = source->GetDocIterator();
    IADoc* doc = docs->GetNextDoc();

    // Now loop through each unsorted document and find the best cluster
    while (doc) {
        // false: do not add this document's weighting to the cluster
        uint32 clusterIndex = myRouter.WhichCluster(doc, false);
        // Report the cluster as a 1-based number
        printf("%s belongs in cluster %lu\n",
               PToCStr(((HFSDoc*)doc)->GetFileName(), str),
               (unsigned long)(clusterIndex + 1));
        delete doc;
        doc = docs->GetNextDoc();
    }

    // Cleanup
    delete docs;
    delete source;
    myRouter.Store();
    myIndex->Flush();
    myStorage->Commit();
    delete myIndex;
    delete myStorage;
    for (uint32 i = 0; i < clusterCount; i++) {
        delete folders[i];
    }
    delete [] folders;
} // End DemoRouting

// This function is called by DemoRouting.
void AddItemsToIndex(StringPtr folderPathName, VectorIndex* inIndex)
{
    FSSpec myFsSpec;
    OSErr err = FSMakeFSSpec(0, 0, folderPathName, &myFsSpec);
    IAAssertion(err == noErr, "Can't get folder", IAAssertionFailure);

    // Iterate over the folder named by myFsSpec (the original referred to an undeclared fsSpec)
    HFSIterator folderIterator(myFsSpec.vRefNum, FSSpecToDirID(&myFsSpec));
    IATry {
        while (folderIterator.Increment()) {
            CInfoPBRec* pb = folderIterator.GetPBRec();
            HFSDoc* doc = new HFSDoc((HFSCorpus*)inIndex->GetCorpus(),
                                     pb->hFileInfo.ioVRefNum,
                                     pb->hFileInfo.ioFlParID,
                                     pb->hFileInfo.ioNamePtr);
            inIndex->AddDoc(doc);
        }
    }
    IACatch (const IAException& exception) {
        printf("%s, %s\n", exception.What(), exception.GetLocation());
    }
} // End AddItemsToIndex
In this example, after creating the index and the corpus and specifying the type of analysis, the DemoRouting method sets up clusters based on what was defined in clusterFolders[]. Each folder in the array should contain example documents defining the type of document that should belong to the cluster. Documents to be routed should be in the unSortedItems folder.
After initializing the clusters by calling the InitializeClusters method, the router (myRouter) then simply cycles through the corpus representing the contents of unSortedItems and calls the WhichCluster method for each document. If you set the second parameter in WhichCluster to true, the weighting of the document to be routed is added to the appropriate cluster when a match is made.
In this example, after the DemoRouting method routes all the documents in unSortedItems, it calls the Store method before removing all instantiated objects. Doing so saves the cluster information and weightings so you can retrieve them at some later time. If you specified that the cluster accumulate weightings as documents were routed, the saved settings will reflect the additional weightings. If you want to route additional documents later using the stored cluster settings, you call the Restore method instead of instantiating clusters and calling InitializeClusters.
The sections that follow describe the classes and methods used for routing documents using the IAT.
The following methods allow you to use the IAT to sort documents into arbitrary categories. Note that the IAT only specifies which category to put a document in; your application must copy or move the document based on the categorization.
IARouter ( VectorIndex* index, TProgressFn* progressFn, clock_t progressFreq, void* appData);
index
A pointer to the vector index.
progressFn
A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.
progressFreq
The wait time between callbacks, in clock ticks (as defined by the ANSI C CLOCKS_PER_SEC macro).
appData
A pointer to application-specific data that is passed to the client application when the callback occurs.
virtual ~IARouter();
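The interplay of progressFn, progressFreq, and appData can be sketched with standard C++ alone. The callback signature below is an assumption, because this chapter does not show the declaration of TProgressFn. ProgressFn, ProgressTicker, and CountCalls are hypothetical names, and the polling logic only illustrates how a library might honor the frequency; it is not the IAT's code.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical callback shape; the real TProgressFn declaration is not shown here.
typedef void ProgressFn(void* appData);

// Invokes the callback whenever at least `freq` clock ticks have elapsed
// since the previous invocation.
class ProgressTicker {
public:
    ProgressTicker(ProgressFn* fn, std::clock_t freq, void* appData)
        : fn_(fn), freq_(freq), appData_(appData), last_(std::clock()) {}

    // Call this from inside a long-running loop.
    void Poll() {
        if (fn_ == nullptr) return;  // a null function disables callbacks
        const std::clock_t now = std::clock();
        if (now - last_ >= freq_) {
            last_ = now;
            fn_(appData_);
        }
    }

private:
    ProgressFn* fn_;
    std::clock_t freq_;
    void* appData_;
    std::clock_t last_;
};

// Example callback: counts invocations through the appData pointer.
void CountCalls(void* appData) {
    ++*static_cast<int*>(appData);
}
```

Under this sketch, a frequency of CLOCKS_PER_SEC / 2 would request a callback roughly every half second of processor time, since clock() measures processor time rather than wall-clock time.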
Specifies clusters to use in the routing.
void InitializeClusters ( IACluster** clusters, uint32 howManyClusters);
clusters
A pointer to an array of IACluster pointers specifying the clusters to use.
howManyClusters
The number of IACluster pointers in the array.
You call InitializeClusters when you begin routing using a new set of clusters. If you want to route documents using an older saved set of clusters, you should call Restore instead.
Specifies the cluster to which a document belongs.
uint32 WhichCluster ( IADoc* doc, bool accumulate);
doc
A pointer to a document to be routed.
accumulate
A Boolean value. True adds the normalized weighting of the specified document (that is, its TWVector) to the weighting of the cluster. The default is false.
method result A value specifying the index of the cluster to which the document belongs.
The WhichCluster method does not move or copy the document to the indicated cluster. If you want to move the document, your application must do so itself.
Saves the current router settings.
void Store ( IAStorage* storage, IABlockID block) const;
storage
A pointer to an IAStorage object. The default (obtained by passing NULL) is the storage instance that contains the index used by the router.
block
The ID of the block in which you want to store the router settings. The default block ID is 0.
Restores saved router settings.
void Restore ( IAStorage* storage, IABlockID block);
storage
A pointer to an IAStorage object containing the router information. If you specify NULL here, the IAT attempts to restore the settings from the storage instance used by the index associated with the router.
block
The ID of the block containing the router settings you want to retrieve. The default block ID is 0.
Returns the size of the current router.
IABlockSize StoreSize () const;
method result The size of the router.
Returns the pointer to the application-defined progress function.
TProgressFn* GetProgressFn () const;
method result A pointer to the application-defined function. The IAT calls back to this function to allow the client time to do other things, if desired.
Returns the progress function data.
void* GetProgressData () const;
method result A pointer to the data passed to the application at the time of the progress function callback. You specify the location of this data when the IARouter constructor is called.
Returns the time between calls to the progress function callback.
clock_t GetProgressFreq () const;
method result The wait time between callbacks, in clock ticks (as defined by the ANSI C CLOCKS_PER_SEC macro).
Returns the best cluster for a given TWVector.
uint32 BestCluster (TWVector *vector) const;
vector
A pointer to a TWVector object.
method result The index of the cluster to which the TWVector best fits.
If you want to subclass IARouter and implement your own best cluster algorithm, you may want to override this method.
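The override pattern might look like the following sketch. Router and NearestRouter are hypothetical stand-ins rather than IAT classes, Weights is a dense stand-in for TWVector, and both scoring rules (largest dot product in the base class, smallest Euclidean distance in the subclass) are assumptions chosen only to show two rules that can disagree.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Weights;  // dense stand-in for TWVector

// Stand-in for IARouter, reduced to cluster weightings and BestCluster.
class Router {
public:
    explicit Router(const std::vector<Weights>& clusters) : clusters_(clusters) {
        assert(!clusters_.empty());
    }
    virtual ~Router() {}

    // Default rule in this sketch: the largest dot product wins.
    virtual std::size_t BestCluster(const Weights& v) const {
        std::size_t best = 0;
        double bestScore = Dot(v, clusters_[0]);
        for (std::size_t i = 1; i < clusters_.size(); ++i) {
            const double score = Dot(v, clusters_[i]);
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }

protected:
    static double Dot(const Weights& a, const Weights& b) {
        assert(a.size() == b.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
        return sum;
    }

    std::vector<Weights> clusters_;
};

// Custom rule: the smallest squared Euclidean distance wins.
class NearestRouter : public Router {
public:
    explicit NearestRouter(const std::vector<Weights>& clusters)
        : Router(clusters) {}

    std::size_t BestCluster(const Weights& v) const override {
        std::size_t best = 0;
        double bestDist = SquaredDistance(v, clusters_[0]);
        for (std::size_t i = 1; i < clusters_.size(); ++i) {
            const double dist = SquaredDistance(v, clusters_[i]);
            if (dist < bestDist) { bestDist = dist; best = i; }
        }
        return best;
    }

private:
    static double SquaredDistance(const Weights& a, const Weights& b) {
        assert(a.size() == b.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            const double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
};
```

Because BestCluster is virtual in the stand-in base class, code that holds a Router reference picks up the subclass's rule without any other change.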
Clears the accumulator associated with the router.
void ClearAccumulator (void);
Adds a document's TWVector to the accumulator.
void AddDocVectorToAccumulator (TWVector* newDocVector);
newDocVector
A pointer to the TWVector object representing the document.
See also the AccumulateDocVector method.
Adds the TWVector representing a document to the accumulator.
void AccumulateDocVector(IADoc* doc);
See also the AddDocVectorToAccumulator method.
Adds the specified TWVector to the weighting of a given cluster.
void AddToAccumulator ( uint32 cluster, TWVector *docVector);
cluster
The cluster to whose weighting you want to add the TWVector.
docVector
A pointer to the TWVector for a given document.
This method adds a TWVector to the weighting of a particular cluster, not the accumulator that is generated during the WhichCluster call.
The abstract IACluster class represents a cluster of documents. You must subclass this class to represent clusters of an actual document format. For example, the HFSCluster class is a subclass of IACluster that you use to represent clusters of HFS format documents (that is, Mac OS files).
IACluster (IAIndex* index);
index
The index that contains this cluster.
virtual ~IACluster ();
Gets the next document in the cluster.
virtual IADoc* GetNextDoc () const;
method result A pointer to the next document in the cluster.
The type of document returned depends on the IACluster subclass. For example, the HFSCluster subclass returns documents of type HFSDoc.
Resets the cluster's document iterator.
virtual void Reset ();
After a reset, the GetNextDoc method begins with the first document in the cluster.
Retrieves the corpus in the index associated with the cluster.
IACorpus* GetCorpus () const;
method result A pointer to the corpus.
The HFSCluster class is a subclass of IACluster that handles clusters of HFS documents (that is, Mac OS files).
HFSCluster ( IAIndex* index, StringPtr clusterName);
index
The index to contain this cluster.
clusterName
A pointer to the pathname of the folder containing document examples for the cluster.
virtual ~HFSCluster ();
Gets the next HFS document in the cluster.
IADoc* GetNextDoc () const;
method result A pointer to the next HFS document in the cluster.
The HFSCluster subclass of IACluster returns documents of type HFSDoc.
Resets the cluster's document iterator.
void Reset ();
After a reset, the GetNextDoc method begins with the first document in the cluster.