You can use the IAT to provide a brief summary of a document. For example, when presenting a list of documents as the results of a search, you often want to include brief descriptions of each one so that the end user can easily determine which one to choose. This chapter describes the IAT classes and methods you use to summarize the contents of a given document.
A document summary is a subset of the document that describes its contents. For example, if a document describes how to build a log cabin using an axe, a document summary could be a sentence that contains the words "log cabin," "axe," and "constructing," such as "Constructing a log cabin using only your hands and an axe can be a rewarding experience." You use the IAT to find the portions of the document that best typify its contents.
To create a document summary, you must first break up the document into an arbitrary number of portions called text extents, or simply extents. An extent can be a sentence or any other grouping of characters (for example, a paragraph, or even simply a block of 72 characters). Given all the available extents, the IAT then ranks them by how closely they represent the entire document. You can then use one or more of the most highly ranked extents to summarize the document.
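The ranking idea can be illustrated with a small, self-contained sketch. It is not the IAT's algorithm, whose scoring is not described in this chapter. As a stand-in, it scores each extent by summing the document-wide frequencies of the extent's distinct terms, then orders the extents from highest to lowest score. The names normalize, terms, and rankExtents are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Lowercase a word and drop every non-alphanumeric character.
static std::string normalize(const std::string& word) {
    std::string out;
    for (std::size_t i = 0; i < word.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(word[i]);
        if (std::isalnum(c)) out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// Split an extent into normalized, non-empty terms.
static std::vector<std::string> terms(const std::string& text) {
    std::vector<std::string> result;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        std::string t = normalize(word);
        if (!t.empty()) result.push_back(t);
    }
    return result;
}

// Return extent indices ordered from most to least representative.
// Score of an extent = sum of the document-wide frequencies of its distinct terms.
std::vector<std::size_t> rankExtents(const std::vector<std::string>& extents) {
    std::map<std::string, int> docFreq;
    for (std::size_t i = 0; i < extents.size(); ++i) {
        std::vector<std::string> ts = terms(extents[i]);
        for (std::size_t j = 0; j < ts.size(); ++j) ++docFreq[ts[j]];
    }
    std::vector<double> score(extents.size(), 0.0);
    for (std::size_t i = 0; i < extents.size(); ++i) {
        std::vector<std::string> ts = terms(extents[i]);
        std::sort(ts.begin(), ts.end());
        ts.erase(std::unique(ts.begin(), ts.end()), ts.end());
        for (std::size_t j = 0; j < ts.size(); ++j) score[i] += docFreq[ts[j]];
    }
    std::vector<std::size_t> order(extents.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    // stable_sort keeps document order among extents with equal scores.
    std::stable_sort(order.begin(), order.end(),
                     [&score](std::size_t a, std::size_t b) { return score[a] > score[b]; });
    assert(order.size() == extents.size());
    return order;
}
```

This simple score favors long extents and common words; a real analysis would typically filter stopwords and weight terms, which is the role the analysis object plays in the IAT.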
To create a summary for a given document, you must create an extent parser, break up the document into ranked extents, and then display the highest-ranked extents.
The following sections describe these steps in more detail.
An extent parser breaks up a document into a group of text extents. The IAT provides the abstract class IAExtentParser, which contains functions you can use to create an extent parser. Your application can implement an extent parser by subclassing IAExtentParser and overriding the GettingNextExtent method. If you want to parse by sentence, you can use the subclass IAANSISentenceParser included in the IAT instead of writing your own parser.
The IAT class IADocumentAbstractor contains methods that let you parse a document into text extents and then rank them. The higher the ranking, the more closely the extent resembles the entire document. You can then display one or more of the highest-ranked extents to summarize the contents of the document.
The listing "Breaking up a document into ranked extents" shows a function that creates a list of ranked extents from a file.
void DemoAbstractor (StringPtr file)
{
    FSSpec  mMacFileSpec;
    short   mDataForkRefNum;

    // Data to use with the abstractor's Summarize method
    uint32            numberOfSentences = 1;
    TermIndex*        contextIndex = NULL;
    clock_t           progFrequency = 10000;
    void*             callerData = NULL;
    RankedProgressFn* progressFn = NULL;

    // Files used with the EnglishAnalysis object and the sentence parser
    char* stopwordFile = "EnglishStopwords";
    char* stemDictDoc  = "EnglishSubstitutions";
    char* abbrevFile   = "EnglishAbbreviations";

    // Open the file and read into a buffer
    // Turn the filename into a Mac OS FSSpec
    OSErr err = FSMakeFSSpec(0, 0, file, &mMacFileSpec);
    if (err != noErr) { return; }

    // Open file
    err = FSpOpenDF(&mMacFileSpec, fsRdPerm, &mDataForkRefNum);
    if (err != noErr) { return; }

    // Read the file
    Handle dataHandle = nil;
    long   fileLength;
    err = GetEOF(mDataForkRefNum, &fileLength);
    if (err != noErr) { return; }
    dataHandle = NewHandle(fileLength + 1);
    if (dataHandle == nil) return;
    err = SetFPos(mDataForkRefNum, fsFromStart, 0);
    if (err != noErr) { return; }
    HLock(dataHandle);
    err = FSRead(mDataForkRefNum, &fileLength, *dataHandle);
    if (err != noErr) { return; }

    // Get pointer to buffer and the buffer length
    *((*dataHandle) + fileLength) = '\0';
    char*  buffer = (char*)(*dataHandle);
    uint32 bufferLength = strlen(buffer);

    // Create the analysis object to use with the parser
    EnglishAnalysis* myAnalysis = new EnglishAnalysis(stopwordFile, stemDictDoc);

    // Create the storage object
    IAStorage* myStorage = MakeHFSStorage(0, 0, "\ptemp.index");
    IADeleteOnUnwind delStorage(myStorage);

    // Designate the extent parser
    IAExtentParser* myParser =
        new IAANSISentenceParser((byte*)buffer, bufferLength, abbrevFile);

    // Create the abstractor and get the extents
    IADocumentAbstractor myAbstractor(myParser, myStorage, myAnalysis);
    myAbstractor.Summarize(progressFn, progFrequency, callerData,
                           numberOfSentences, contextIndex);

    // The DumpInformation function uses GetNumberOfExtents
    // and GetExtents to get the top-ranked sentence
    uint32 showThisManySentences = 1;
    DumpInformation(myAbstractor, showThisManySentences);

    // Cleanup
    delete myParser;
    HUnlock(dataHandle);

    // Close the file
    err = FSClose(mDataForkRefNum);
    if (err != noErr) { return; }
    FlushVol(nil, mMacFileSpec.vRefNum);
}
The DemoAbstractor function opens the HFS file and reads the contents into a buffer. It initializes the sentence parser (myParser) with the buffer and an abbreviations file. The abbreviations file EnglishAbbreviations holds abbreviated words (such as i.e. or e.g.) that would otherwise trigger end-of-sentence conditions.
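The role of the abbreviations list can be sketched in a few lines. The following is a simplified illustration, not the IAANSISentenceParser implementation: it treats a period, exclamation point, or question mark at the end of a word as a sentence boundary unless the whole word appears in a set of abbreviations. The names wordEndingAt and splitSentences are hypothetical, and the abbreviations are passed in memory rather than read from a file.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Return the whitespace-delimited word whose last character is text[end].
static std::string wordEndingAt(const std::string& text, std::size_t end) {
    assert(end < text.size());
    std::size_t start = end;
    while (start > 0 && !std::isspace(static_cast<unsigned char>(text[start - 1]))) --start;
    return text.substr(start, end - start + 1);
}

// Split text into sentences, ignoring marks that belong to listed abbreviations.
std::vector<std::string> splitSentences(const std::string& text,
                                        const std::set<std::string>& abbreviations) {
    std::vector<std::string> sentences;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char c = text[i];
        if (c != '.' && c != '!' && c != '?') continue;
        // Only a mark at the end of a word can end a sentence.
        bool atWordEnd = (i + 1 == text.size()) ||
                         std::isspace(static_cast<unsigned char>(text[i + 1]));
        if (!atWordEnd) continue;
        // A listed abbreviation does not end the sentence.
        if (abbreviations.count(wordEndingAt(text, i)) != 0) continue;
        sentences.push_back(text.substr(begin, i - begin + 1));
        // Skip the whitespace that separates this sentence from the next.
        begin = i + 1;
        while (begin < text.size() &&
               std::isspace(static_cast<unsigned char>(text[begin]))) ++begin;
        i = begin - 1;  // the loop increment moves i to the next sentence start
    }
    if (begin < text.size()) sentences.push_back(text.substr(begin));
    return sentences;
}
```

With the abbreviation "e.g." listed, the text "Use tools, e.g. an axe. Work slowly." splits into two sentences; with an empty list, the period after "e.g." is taken as a boundary and the same text splits into three.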
After setting up the sentence parser, DemoAbstractor initializes the document abstractor (myAbstractor), specifying the extent parser, the storage medium, and the type of analysis to use. This abstractor uses the EnglishAnalysis subclass of IAAnalysis to analyze the text extents (sentences) and extract the relevant tokens. The filters defined in EnglishAnalysis require the text files EnglishSubstitutions and EnglishStopwords for proper operation.
After the abstractor is created, the Summarize method breaks up the document into a ranked list of sentence extents. As with many other IAT components, you can specify a callback during the Summarize call to give time to your application, if desired. This example has no callback, however, so the progressFn pointer is set to NULL.
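The callback arrangement can be illustrated with a generic sketch. The real RankedProgressFn signature is not shown in this chapter, so the ProgressFnSketch type below is an assumption, as are the function names processWithProgress and cancelAfterThree. The sketch shows only the general pattern: a worker loop that calls back no more often than once per progFrequency clock ticks, passes callerData through untouched, and stops if the callback asks it to.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>

// Hypothetical callback type; the real RankedProgressFn may differ.
// Returning true asks the worker to cancel.
typedef bool ProgressFnSketch(std::size_t itemsDone, void* callerData);

// Process itemCount work items, calling back at most once per progFrequency
// clock ticks. Returns the number of items processed.
std::size_t processWithProgress(std::size_t itemCount,
                                ProgressFnSketch* progressFn,
                                std::clock_t progFrequency,
                                void* callerData) {
    std::clock_t lastCall = std::clock();
    std::size_t done = 0;
    while (done < itemCount) {
        ++done;  // stand-in for parsing or ranking one extent
        if (progressFn == NULL) continue;  // no callback requested
        std::clock_t now = std::clock();
        if (now - lastCall >= progFrequency) {
            lastCall = now;
            if (progressFn(done, callerData)) break;  // client cancelled
        }
    }
    assert(done <= itemCount);
    return done;
}

// Example callback: counts its invocations through callerData and
// cancels on the third call.
bool cancelAfterThree(std::size_t /*itemsDone*/, void* callerData) {
    int* calls = static_cast<int*>(callerData);
    ++*calls;
    return *calls >= 3;
}
```

Because the frequency is expressed in clock ticks, a value such as CLOCKS_PER_SEC / 2 would request a callback roughly every half second of processor time. A frequency of 0 makes the sketch call back after every item.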
The DemoAbstractor function calls the DumpInformation function, shown in the next section, to display the highest-ranked sentences.
After creating a ranked list of sentence extents, you can display one or more as the document summary. The example in "Breaking up a document into ranked extents" calls the DumpInformation function to present the summary. The listing "Displaying the sentences with the highest rank" shows a possible implementation for DumpInformation that simply cycles through the array of ranked sentences and displays one or more with the highest ranks.
void DumpInformation (const IADocumentAbstractor& abstractor, uint32 showlevel)
{
    uint32 paragraphNumber = 0;
    bool   firstHasBeenShown = false;
    bool   showScore = true;
    bool   showRank = true;
    uint32 numberTopWords = 5;
    bool   showSentences = true;

    // Get the number of sentences and the array of pointers to those
    // sentences
    uint32 cnt = abstractor.GetNumberOfExtents();
    IAExtentDoc** sentences = abstractor.GetExtents();

    // Now loop and display the sentences or other information
    for (uint32 i = 0; i < cnt; i++) {
        // Display only sentences that exist, are not empty, and have a
        // rank within the requested number of top-ranked sentences.
        if (sentences[i] && (sentences[i]->GetRank() < showlevel)
                && (sentences[i]->GetLength() > 0)) {
            // Get the sentence from the pointer
            IAExtentDoc* doc = sentences[i];

            // The sentence parser also groups extents.
            // If an extent begins a new group, add a separator.
            if (!firstHasBeenShown) {
                firstHasBeenShown = true;
                paragraphNumber = doc->GetGroupNumber();
            }
            if (doc->GetGroupNumber() > paragraphNumber) {
                paragraphNumber = doc->GetGroupNumber();
                printf("Paragraph Separator\r\r");
            }

            // Display the weighted score for the sentence
            if (showScore) {
                printf("(%3.2f) ", doc->GetRankedHit()->GetScore());
            }

            // Display its rank (ranks are zero-based, so add 1)
            if (showRank) {
                printf("(%d) ", 1 + (int)doc->GetRank());
            }

            // Display the top-ranked words in the extent
            if (numberTopWords > 0) {
                if (showlevel > 0) { printf("["); }
                uint32 numberToShow = numberTopWords;
                if (numberToShow > doc->GetRankedHit()->GetMatchingTermsLen()) {
                    numberToShow = doc->GetRankedHit()->GetMatchingTermsLen();
                }
                for (uint32 w = 0; w < numberToShow; w++) {
                    printf("%s ", doc->GetRankedHit()->GetMatchingTerms()[w]->GetData());
                }
                if (showlevel > 0) { printf("]\n"); }
            }

            // Display the sentence, then free the copy returned by GetExtent
            if (showSentences) {
                char* txt = (char*)doc->GetExtent();
                printf("Summary => %s\n", txt);
                IAFreeArray(txt);
            }
        }
    }
}
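The selection logic in DumpInformation can be tried without the IAT by using a stand-in record. The sketch below is an illustration under assumptions: ExtentSketch and buildSummary are hypothetical names, ranks are zero-based as the listing implies, and extents are supplied in document order. It keeps the extents whose rank is below the requested level and adds a blank line wherever the paragraph (group) changes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the fields DumpInformation reads from each extent.
struct ExtentSketch {
    unsigned rank;     // 0 is the highest rank
    unsigned group;    // paragraph number
    std::string text;  // the extent itself
};

// Build a summary from the showLevel highest-ranked extents, kept in
// document order, with a blank line wherever the paragraph changes.
std::string buildSummary(const std::vector<ExtentSketch>& extents, unsigned showLevel) {
    std::string summary;
    bool firstShown = false;
    unsigned currentGroup = 0;
    for (std::size_t i = 0; i < extents.size(); ++i) {
        const ExtentSketch& e = extents[i];
        if (e.rank >= showLevel || e.text.empty()) continue;  // not selected
        if (firstShown && e.group != currentGroup) summary += "\n";
        firstShown = true;
        currentGroup = e.group;
        summary += e.text + "\n";
    }
    return summary;
}
```

Keeping the selected extents in document order, rather than rank order, tends to make a multi-sentence summary read more naturally.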
The sections that follow describe the classes and methods used for summarizing documents using the IAT.
The following methods allow you to use the IAT to summarize the contents of a document.
IADocumentAbstractor ( IAExtentParser* parser, IAStorage* storage, IAAnalysis* analysis);
parser The parser to use to break up the document into extents. This parser must be instantiated from a subclass of IAExtentParser.
storage The storage instance to use for the summarization.
analysis The analysis instance used to rank the extents in the document.
virtual ~IADocumentAbstractor();
Summarizes a document as a ranked list of text extents.
virtual void Summarize ( RankedProgressFn* progressFn, clock_t progFrequency, void* callerData, uint32 maxLevel, TermIndex* contextIndex);
progressFn A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.
progFrequency The wait time between callbacks, in clock ticks (using the ANSI CLOCKS_PER_SEC standard).
callerData A pointer to application-specific data that is passed to the client application when the callback occurs.
maxLevel The maximum number of extents to include in the summary.
contextIndex A pointer to the context index to use for this summarization. The context index is used for calculating term scale factors. If you do not specify a context index, this pointer defaults to NULL.
After creating the ranked list of extents, you can use the GetExtents method to retrieve them.
Specifies the analysis instance used to rank the extents.
void SetAnalysis (IAAnalysis* analysis);
analysis A pointer to the analysis object.
Specifies the storage instance.
void SetStorage (IAStorage* storage);
storage A pointer to the storage object.
void SetParser (IAExtentParser* parser);
parser A pointer to the extent parser object.
Returns the number of extents parsed in the document.
uint32 GetNumberOfExtents () const;
This method returns the number of text extents. The definition of the extent is dependent on the extent parser used.
IAExtentDoc** GetExtents () const;
method result An array of IAExtentDoc pointers. You can determine the number of extents in the array by calling the GetNumberOfExtents method.
This method returns an array of pointers to the parsed text extents. The definition of the extent is dependent on the extent parser used.
You use this class to create extent parsers to use with document summarization. Your extent parser must be a subclass of IAExtentParser. If you want a sentence to be the size of the extent, you can simply use the IAANSISentenceParser subclass instead of writing your own. See The IAANSISentenceParser Class for more information.
IAExtentParser ( byte* buffer, uint32 bufferLength, bool removezReturns);
buffer A pointer to the buffer holding the text to be parsed.
bufferLength The size of the buffer.
removezReturns If true, the parser removes carriage returns from the parsed extents. The default is to remove returns.
virtual ~IAExtentParser();
Specifies the buffer containing the text to be parsed.
void SetBuffer ( byte* buffer, uint32 bufferLength);
buffer A pointer to the buffer containing the text to be parsed.
bufferLength The length of the buffer.
Returns the buffer containing the text to be parsed.
byte* GetBuffer ();
method result A pointer to the buffer.
Returns the length of the buffer containing the text to be parsed.
uint32 GetBufferLength ();
method result The length of the buffer.
Sets the beginning of a text extent.
void SetStartOfExtent (byte* start);
start A pointer to the beginning of a text extent.
In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.
Returns the beginning of a text extent.
byte* GetStartOfExtent ();
method result A pointer to the beginning of a text extent.
In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.
Sets the length of a text extent.
void SetExtentLength (uint32 length);
length The length of a text extent.
In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.
Returns the length of a text extent.
uint32 GetExtentLength ();
method result The length of the text extent.
In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.
Sets the beginning of the next extent.
void SetStartOfNext (byte* start);
start A pointer to the beginning of the next extent.
If you are creating a subclass of the IAExtentParser class, you may want to use this method to keep track of the beginning of your own parser-defined extent (such as a paragraph).
Returns the beginning of the next extent.
byte* GetStartOfNext ();
method result A pointer to the beginning of the next extent.
If you are subclassing the IAExtentParser class, you may want to use this method to keep track of the beginning of your own parser-defined extent (such as a paragraph).
Sets the number of bytes left in the buffer.
void SetBytesLeft (uint32 length);
length The number of bytes remaining to be parsed in the buffer.
In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.
Returns the number of bytes left in the buffer.
uint32 GetBytesLeft ();
method result The number of bytes remaining to be parsed in the buffer.
In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.
Assigns a group number to a text extent.
void SetGroupNumber (uint32 gNum);
If desired, you can assign extents to arbitrary groups. For example, if the extent is a sentence, a paragraph is a group of extents. An extent contained in the fifth paragraph of a document could have a group number of 5.
See also the GetGroupNumber and SetOrderNumber methods.
Returns the group number assigned to an extent.
uint32 GetGroupNumber ();
method result The group number assigned to the extent.
If desired, you can assign extents to arbitrary groups. For example, if the extent is a sentence, a paragraph is a group of extents. An extent contained in the fifth paragraph of a document could have a group number of 5.
See also the SetGroupNumber and GetOrderNumber methods.
Assigns an order number to a text extent.
void SetOrderNumber (uint32 oNum);
oNum The order number to assign.
Extents contained in groups often have order numbers to make tracking them easier. For example, the third sentence (extent) in a paragraph (group) might have the order number 3.
See also the GetOrderNumber and SetGroupNumber methods.
Returns the order number assigned to an extent.
uint32 GetOrderNumber ();
method result The order number assigned to the extent.
Extents contained in groups often have order numbers to make tracking them easier. For example, the third sentence (extent) in a paragraph (group) might have the order number 3.
See also the SetOrderNumber and GetGroupNumber methods.
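As an illustration of how group and order numbers relate, the following sketch numbers sentences by paragraph. It does not use the IAT; NumberedExtent and numberExtents are hypothetical names, and the numbering starts at 1 to match the examples in the descriptions of these methods (fifth paragraph, group number 5; third sentence, order number 3).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record pairing an extent with its group and order numbers.
struct NumberedExtent {
    unsigned group;    // paragraph number, starting at 1
    unsigned order;    // position of the sentence in its paragraph, starting at 1
    std::string text;
};

// Flatten paragraphs (each a list of sentences) into numbered extents.
std::vector<NumberedExtent> numberExtents(
        const std::vector<std::vector<std::string> >& paragraphs) {
    std::vector<NumberedExtent> result;
    for (std::size_t g = 0; g < paragraphs.size(); ++g) {
        for (std::size_t o = 0; o < paragraphs[g].size(); ++o) {
            NumberedExtent e;
            e.group = static_cast<unsigned>(g + 1);
            e.order = static_cast<unsigned>(o + 1);
            e.text = paragraphs[g][o];
            result.push_back(e);
        }
    }
    return result;
}
```

A custom parser could record the same two numbers through SetGroupNumber and SetOrderNumber as it produces each extent.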
Determines whether the parsing process was cancelled.
bool IsCancelled ();
method result True if the parsing process was cancelled by the client.
In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.
Determines whether carriage returns were removed from the document.
bool AreReturnsRemoved ();
method result True if carriage returns were removed.
In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.
virtual SentenceDoc* GetNextExtent ( RankedProgressFn* progressFn, clock_t progFrequency, void* callerData);
progressFn A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.
progFrequency The wait time between callbacks, in clock ticks (using the ANSI CLOCKS_PER_SEC standard).
callerData A pointer to application-specific data that is passed to the client application when the callback occurs.
method result A pointer to the next parsed extent.
This method calls the protected method GettingNextExtent, which does the actual work of parsing the next text extent.
virtual IAExtentDoc* GettingNextExtent ( RankedProgressFn* progressFn, clock_t progFrequency, void* callerData, bool *endBuffer) = 0;
progressFn A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.
progFrequency The wait time between callbacks, in clock ticks (using the ANSI CLOCKS_PER_SEC standard).
callerData A pointer to application-specific data that is passed to the client application when the callback occurs.
endBuffer If true, the parser has reached the end of the buffer.
method result A pointer to the next parsed extent.
This is a pure virtual method that does the actual work for the public method GetNextExtent. When implementing your own extent parsers, you must override this method.
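The division of labor between the public GetNextExtent method and the protected GettingNextExtent method can be sketched with a stand-in base class. Everything below is an assumption made for illustration: ExtentParserSketch is not IAExtentParser, its methods take simplified arguments (no progress callback, no IAExtentDoc), and FixedBlockParser is a hypothetical subclass whose extents are fixed-size blocks of characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in base class mirroring the public/protected split described above.
class ExtentParserSketch {
public:
    ExtentParserSketch(const char* buffer, std::size_t length)
        : start_(buffer), bytesLeft_(length) { assert(buffer != NULL); }
    virtual ~ExtentParserSketch() {}

    // Public entry point. Returns false when the buffer is exhausted;
    // otherwise stores the next extent and advances past it.
    bool GetNextExtent(std::string& extent) {
        if (bytesLeft_ == 0) return false;
        std::size_t used = GettingNextExtent(start_, bytesLeft_);
        assert(used > 0 && used <= bytesLeft_);
        extent.assign(start_, used);
        start_ += used;
        bytesLeft_ -= used;
        return true;
    }

protected:
    // Subclasses decide where the next extent ends and return its length.
    virtual std::size_t GettingNextExtent(const char* start, std::size_t bytesLeft) = 0;

private:
    const char* start_;      // beginning of the next extent
    std::size_t bytesLeft_;  // bytes remaining to be parsed
};

// A parser whose extents are fixed-size blocks of characters.
class FixedBlockParser : public ExtentParserSketch {
public:
    FixedBlockParser(const char* buffer, std::size_t length, std::size_t blockSize)
        : ExtentParserSketch(buffer, length), blockSize_(blockSize) {
        assert(blockSize > 0);
    }

protected:
    std::size_t GettingNextExtent(const char* /*start*/, std::size_t bytesLeft) {
        // The last block may be shorter than blockSize_.
        return bytesLeft < blockSize_ ? bytesLeft : blockSize_;
    }

private:
    std::size_t blockSize_;
};
```

The base class owns the bookkeeping (start of the next extent, bytes left), so a subclass only has to answer one question: how long is the next extent.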
The IAANSISentenceParser class is a subclass of IAExtentParser that breaks up a document into text extents that are sentences.
IAANSISentenceParser ( byte* buffer, uint32 bufferLength, const char* abbrevFile, bool removezReturns);
buffer A pointer to the buffer that holds the text to be parsed.
bufferLength The size of the buffer.
abbrevFile The path to a text file containing abbreviated words. The sentence parser assumes that a punctuation mark such as a period (.) signals the end of a sentence unless the word containing the mark is listed in this file (for example, the abbreviation e.g. should go in the file). In general, this path should point to the file EnglishAbbreviations.
removezReturns If true, the parser removes carriage returns from the parsed sentences. The default is to remove returns.
virtual ~IAANSISentenceParser();
Sets whether a parsed sentence must begin with an upper-case letter.
void SetRequiredUpper (bool value);
value If true, the parser requires a sentence to begin with an upper-case letter. The default setting is true.
Determines whether a parsed sentence is required to start with an upper-case letter.
bool IsRequiredUpper () const;
method result True if the parser requires a sentence to begin with an upper-case letter.
Returns the next parsed sentence in the document.
virtual SentenceDoc* GettingNextExtent ( RankedProgressFn* progressFn, clock_t progFrequency, void* callerData, bool* endBuffer);
progressFn A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.
progFrequency The wait time between callbacks, in clock ticks (using the ANSI CLOCKS_PER_SEC standard).
callerData A pointer to application-specific data that is passed to the client application when the callback occurs.
endBuffer If true on output, the parser has reached the end of the buffer.
method result A pointer to the next parsed sentence.
This method is called by the public method GetNextExtent.
The IAExtentCorpus class is a subclass of IACorpus that represents a set of strings in memory (such as a collection of text extents). The actual body of the strings is maintained in a buffer managed by the client. For more information about corpus classes, see Chapter 8, "Corpus Category," in the Apple Information Access Toolkit v.1.0 Programmer's Guide.
IAExtentCorpus (uint32 corpType); IAExtentCorpus (byte* buffer, uint32 corpType);
buffer A pointer to the buffer holding the text associated with this corpus.
corpType An integer specifying the type of extents this corpus handles. If you do not specify this parameter, the default corpus type is ExtentCorpusType.
virtual ~IAExtentCorpus();
Tells the index the type of documents this corpus represents.
IADoc* GetProtoDoc ();
method result A pointer to a new instance of the document type (for example, an instance of IAExtentDoc).
Retrieves the text associated with an extent.
IADocText* GetDocText (const IADoc* doc);
doc
The extent
whose text you want to retrieve.
method result A pointer to the extent text.
Retrieves the buffer associated with the corpus.
byte* GetBuffer ();
method result A pointer to the buffer.
The IAExtentDoc class is a subclass of IADoc that represents a text extent in a corpus represented by IAExtentCorpus. See the preceding section for more information about IAExtentCorpus.
IAExtentDoc ( const byte* buffer, uint32 offset, uint32 textLength, uint32 extentNumber, uint32 groupNumber, uint32 rank, RankedHit* rankHitVal);
buffer A pointer to the buffer containing the text extent.
offset The offset of the text extent in the buffer, in bytes.
textLength The length of the text extent, in bytes.
extentNumber The extent number assigned to the extent.
groupNumber The group number assigned to the extent.
rank The rank assigned to the extent. If you do not specify this parameter, the default rank is 0.
rankHitVal The ranked hit value assigned to the extent. If you do not specify this parameter, the default ranked hit value is NULL.
virtual ~IAExtentDoc();
Creates a deep copy of the extent.
IAStorable* DeepCopy () const;
method result A pointer to the copy of the extent.
Calling DeepCopy creates a copy of the extent as well as copies of any objects referenced by the extent.
Returns the storage size of the extent.
uint32 StoreSize () const;
method result The size of the extent, in bytes.
void Store (IAOutputBlock *output) const;
output A pointer to the output block in which to store the extent.
Restores the extent from a given block.
IAStorable* Restore (IAInputBlock *input) const;
input A pointer to the block containing the extent.
method result A pointer to the restored extent.
Checks if the extent is less than another.
bool LessThan (const IAOrderedStorable neighbor) const;
neighbor The extent you want to compare against.
method result True if the extent is less than neighbor.
Checks to see if two extents are equal.
bool Equal (const IAOrderedStorable neighbor) const;
neighbor The extent you want to compare against.
method result True if the two extents are equal.
Gets the buffer associated with the extent.
byte* GetText () const;
method result A pointer to the buffer containing the text.
uint32 GetLength () const;
method result The length of the extent, in bytes.
Gets the offset of the extent within the buffer.
uint32 GetOffset () const;
method result The offset of the extent, in bytes.
Gets the extent as a null-terminated string.
byte* GetExtent () const;
method result A pointer to the extent.
Space for the returned extent is allocated by IAMallocArray, so you must call IAFreeArray to free the memory when you no longer need the extent.
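One way to make sure the returned copy is always released is to tie it to a scope. The sketch below assumes a C++11 compiler, which is newer than the toolkit itself. The declarations of IAMallocArray and IAFreeArray are not shown in this chapter, so the sketch uses stand-ins (SketchMallocArray, SketchFreeArray, SketchGetExtent); ArrayFreer, ExtentText, and extentTextLength are also hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Stand-ins for IAMallocArray and IAFreeArray. The counter tracks how many
// arrays are currently allocated so that leaks are easy to spot.
static int g_liveArrays = 0;

char* SketchMallocArray(std::size_t n) {
    ++g_liveArrays;
    return static_cast<char*>(std::malloc(n));
}

void SketchFreeArray(void* p) {
    if (p != NULL) {
        --g_liveArrays;
        std::free(p);
    }
}

// Stand-in for GetExtent: returns a newly allocated, null-terminated copy
// of length bytes starting at buffer + offset.
char* SketchGetExtent(const char* buffer, std::size_t offset, std::size_t length) {
    char* copy = SketchMallocArray(length + 1);
    assert(copy != NULL);
    std::memcpy(copy, buffer + offset, length);
    copy[length] = '\0';
    return copy;
}

// Deleter so that a unique_ptr releases the copy with the matching routine.
struct ArrayFreer {
    void operator()(char* p) const { SketchFreeArray(p); }
};
typedef std::unique_ptr<char, ArrayFreer> ExtentText;

// Use the extent text inside a scope; the copy is freed when text goes
// out of scope, including on an early return.
std::size_t extentTextLength(const char* buffer, std::size_t offset, std::size_t length) {
    ExtentText text(SketchGetExtent(buffer, offset, length));
    return std::strlen(text.get());
}
```

With the real toolkit, the same deleter idea would call IAFreeArray on the pointer returned by GetExtent, assuming IAFreeArray accepts that pointer type.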
void SetRank (uint32 rank) const;
rank The rank to assign to the extent.
uint32 GetRank () const;
method result The rank assigned to the extent.
Sets the ranked hit value of an extent.
void SetRankedHit (RankedHit* rankHitVal) const;
rankHitVal The ranked hit value to assign to the extent.
Gets the ranked hit value of an extent.
RankedHit* GetRankedHit () const;
method result A pointer to the ranked hit value assigned to the extent.
Gets the extent number assigned to the extent.
uint32 GetExtentNumber () const;
method result The extent number assigned to the extent.
Gets the group number assigned to the extent.
uint32 GetGroupNumber () const;
method result The group number assigned to the extent.