

DOCUMENT SUMMARIZATION

You can use the IAT to provide a brief summary of a document. For example, when presenting a list of documents as the results of a search, you often want to include brief descriptions of each one so that the end user can easily determine which one to choose. This chapter describes the IAT classes and methods you use to summarize the contents of a given document.

Document Summaries

A document summary is a subset of the document that describes its contents. For example, if a document describes how to build a log cabin using an axe, a document summary could be a sentence that contains the words "log cabin," "axe," and "constructing," such as "Constructing a log cabin using only your hands and an axe can be a rewarding experience." You use the IAT to find the portions of the document that best typify its contents.

To create a document summary, you must first break up the document into an arbitrary number of portions called text extents, or simply extents. An extent can be a sentence or any other grouping of characters (for example, a paragraph, or even simply a block of 72 characters). Given all the available extents, the IAT then ranks them by how closely each one represents the entire document. You can then use one or more of the most highly ranked extents to summarize the document.
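The idea of ranking extents by how well they represent the whole document can be sketched without the IAT. The following is a minimal illustration under stated assumptions, not the IAT's ranking algorithm: ToyExtent and RankExtents are hypothetical names, and the score is simply the average document-wide frequency of the words in each extent.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a ranked extent; not an IAT type.
struct ToyExtent {
    std::string text;
    double score;
};

// Lowercase a word and drop any non-alphanumeric characters.
static std::string NormalizeWord(const std::string& word) {
    std::string out;
    for (unsigned char c : word) {
        if (std::isalnum(c)) out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Split text into whitespace-separated, normalized words.
static std::vector<std::string> Words(const std::string& text) {
    std::vector<std::string> words;
    std::istringstream in(text);
    std::string raw;
    while (in >> raw) {
        std::string w = NormalizeWord(raw);
        if (!w.empty()) words.push_back(w);
    }
    return words;
}

// Score each extent by the average document-wide frequency of its words,
// then sort so the highest-scoring extent comes first.
std::vector<ToyExtent> RankExtents(const std::vector<std::string>& extents) {
    std::map<std::string, int> freq;
    for (const std::string& e : extents)
        for (const std::string& w : Words(e)) ++freq[w];

    std::vector<ToyExtent> ranked;
    for (const std::string& e : extents) {
        std::vector<std::string> ws = Words(e);
        double total = 0.0;
        for (const std::string& w : ws) total += freq[w];
        ranked.push_back({e, ws.empty() ? 0.0 : total / ws.size()});
    }
    std::stable_sort(ranked.begin(), ranked.end(),
        [](const ToyExtent& a, const ToyExtent& b) { return a.score > b.score; });
    return ranked;
}
```

An extent built mostly from words that recur throughout the document rises to the top, which is the intuition behind using the highest-ranked extents as a summary. A real analysis would also remove stopwords and apply stemming.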

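To create a summary for a given document, you designate an extent parser, use a document abstractor to break the document into ranked extents, and then present one or more of the highest-ranked extents. The following sections describe these steps in more detail.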

Extent Parsers

An extent parser breaks up a document into a group of text extents. The IAT provides the abstract class IAExtentParser, which contains methods you can use to create an extent parser. Your application can implement an extent parser by subclassing IAExtentParser and overriding the GettingNextExtent method. If you want to parse by sentence, you can use the subclass IAANSISentenceParser included in the IAT instead of writing your own parser.
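The subclassing pattern described above can be sketched with stand-in classes. ToyExtentParser and ToyBlockParser below are hypothetical and deliberately simplified: they mirror only the arrangement of a public method that delegates to a protected pure virtual method, and they omit the progress callback and document types that the real IAExtentParser uses.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the parser base class; not the IAT declaration.
class ToyExtentParser {
public:
    ToyExtentParser(const char* buffer, std::size_t length)
        : fBuffer(buffer), fLength(length), fNext(0) {}
    virtual ~ToyExtentParser() {}

    // Public entry point: returns true and fills 'extent' while text remains.
    bool GetNextExtent(std::string& extent) {
        if (fNext >= fLength) return false;
        extent = GettingNextExtent();
        return true;
    }

protected:
    // Subclasses do the actual parsing here and advance fNext.
    virtual std::string GettingNextExtent() = 0;

    const char* fBuffer;
    std::size_t fLength;
    std::size_t fNext;  // offset of the next unparsed byte
};

// An extent parser whose extents are fixed-width blocks of characters.
class ToyBlockParser : public ToyExtentParser {
public:
    ToyBlockParser(const char* buffer, std::size_t length, std::size_t width)
        : ToyExtentParser(buffer, length), fWidth(width == 0 ? 1 : width) {}

protected:
    std::string GettingNextExtent() override {
        std::size_t remaining = fLength - fNext;
        std::size_t count = remaining < fWidth ? remaining : fWidth;
        std::string extent(fBuffer + fNext, count);
        fNext += count;
        return extent;
    }

private:
    std::size_t fWidth;
};
```

A block width of 72 would correspond to the "block of 72 characters" extent mentioned earlier. Only the protected method changes when you define a different kind of extent.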

The Document Abstractor

The IAT class IADocumentAbstractor contains methods that let you parse a document into text extents and then rank them. The higher the ranking, the more closely the extent resembles the entire document. You can then display one or more of the highest-ranked extents to summarize the contents of the document.

The listing "Breaking up a document into ranked extents" shows a function that creates a list of ranked extents from a file.

Breaking up a document into ranked extents
void DemoAbstractor (StringPtr file) {
	FSSpec			mMacFileSpec;
	short			mDataForkRefNum;
 
// Data to use with the abstractor's Summarize method
	uint32 numberOfSentences = 1;
	TermIndex* contextIndex = NULL;
	clock_t progFrequency = 10000;
	void* callerData = NULL;
	RankedProgressFn* progressFn = NULL;
 
// Files used with the EnglishAnalysis object and the sentence parser
	char* stopwordFile = "EnglishStopwords";
	char* stemDictDoc = "EnglishSubstitutions";
	char* abbrevFile = "EnglishAbbreviations";
 
// Open the file and read into a buffer
	// Turn the filename into a Mac OS FSSpec
		OSErr iErr = FSMakeFSSpec(0, 0, file, &mMacFileSpec);
		if (iErr != noErr) return;
 
	// Open file
		OSErr err = FSpOpenDF(&mMacFileSpec, fsRdPerm, &mDataForkRefNum);
			if (err != noErr) {
				return;
			}
 
	// Read the file
		Handle dataHandle = nil;
 
		long fileLength;
		err = GetEOF(mDataForkRefNum, &fileLength);
		if (err != noErr) {
			return;
			}
		dataHandle = NewHandle(fileLength + 1);
		if (dataHandle == nil) return;
	
		err = SetFPos(mDataForkRefNum, fsFromStart, 0);
		if (err != noErr) {
			return;
			}
		HLock(dataHandle);	
 
		err = FSRead(mDataForkRefNum, &fileLength, *dataHandle);
		if (err != noErr) {
			return;
			}
 
	// Get pointer to buffer and the buffer length
		*((*dataHandle)+fileLength) = '\0';
		char* buffer = (char*)(*dataHandle);
		uint32 bufferLength = strlen(buffer);
 
 
// Create the analysis object to use with the parser
	EnglishAnalysis* myAnalysis = new EnglishAnalysis(stopwordFile, stemDictDoc);
 
// Create the storage object
	IAStorage* myStorage = MakeHFSStorage(0,0,"\ptemp.index");
	IADeleteOnUnwind delStorage(myStorage);
 
// Designate the extent parser
	IAExtentParser* myParser = new IAANSISentenceParser((byte*)buffer,
									bufferLength, abbrevFile);
 
// Create the abstractor and get the extents
	IADocumentAbstractor myAbstractor(myParser, myStorage, myAnalysis);
	myAbstractor.Summarize(progressFn, progFrequency, callerData,
		numberOfSentences, contextIndex);
 
// The DumpInformation function uses GetNumberOfExtents
// and GetExtents to get the top-ranked sentence
	uint32 showThisManySentences = 1;
	DumpInformation(myAbstractor, showThisManySentences);	
 
// Cleanup
	delete myParser;
	HUnlock (dataHandle);
 
// Close the file
		
	err = FSClose(mDataForkRefNum);
		
	if (err != noErr) {
		return;
	}
	FlushVol(nil, mMacFileSpec.vRefNum);
}

The DemoAbstractor function opens the HFS file and reads the contents into a buffer. It initializes the sentence parser (myParser) with the buffer and an abbreviations file. The abbreviations file EnglishAbbreviations holds abbreviations (such as i.e. or e.g.) that would otherwise trigger end-of-sentence conditions.
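The role of the abbreviations file can be illustrated with a small standalone sketch. It is not the IAANSISentenceParser implementation. SplitSentences and its abbreviation set are hypothetical names used only to show why a listed abbreviation such as e.g. must not end a sentence.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

static bool IsSpace(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Toy sentence splitter: a word ending in a period closes the sentence
// unless that word appears in 'abbreviations'.
std::vector<std::string> SplitSentences(const std::string& text,
                                        const std::set<std::string>& abbreviations) {
    std::vector<std::string> sentences;
    std::size_t sentenceStart = 0;
    std::size_t i = 0;
    while (i < text.size()) {
        // Collect the next whitespace-delimited word.
        std::size_t wordStart = i;
        while (i < text.size() && !IsSpace(text[i])) ++i;
        std::string word = text.substr(wordStart, i - wordStart);

        bool endsSentence = !word.empty() && word[word.size() - 1] == '.' &&
                            abbreviations.count(word) == 0;
        if (endsSentence)
            sentences.push_back(text.substr(sentenceStart, i - sentenceStart));

        // Skip whitespace so the next sentence starts on a word.
        while (i < text.size() && IsSpace(text[i])) ++i;
        if (endsSentence) sentenceStart = i;
    }
    // Keep any trailing text that did not end with a period.
    if (sentenceStart < text.size())
        sentences.push_back(text.substr(sentenceStart));
    return sentences;
}
```

With "e.g." in the set, the text "Use a tool, e.g. an axe. Then rest." splits into two sentences. With an empty set, the same text splits into three because the abbreviation is mistaken for a sentence end.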

After setting up the sentence parser, DemoAbstractor initializes the document abstractor (myAbstractor), specifying the extent parser, the storage medium, and the type of analysis to use. This abstractor uses the EnglishAnalysis subclass of IAAnalysis to analyze the text extents (sentences) and extract the relevant tokens. The filters defined in EnglishAnalysis require the text files EnglishSubstitutions and EnglishStopwords for proper operation.

After creating the abstractor, the Summarize function then breaks up the document into a ranked list of sentence extents. As with many other IAT components, you can specify a callback during the Summarize call to give time to your application, if desired. This example has no callback, however, so the progressFn pointer is set to NULL.
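The general shape of a periodic progress callback can be sketched as follows. This chapter does not give the signature of RankedProgressFn, so ToyProgressFn, its boolean "keep going" return value, and ProcessWithProgress are all assumptions made for illustration. Only the use of clock ticks for the callback interval comes from the text.

```cpp
#include <cstddef>
#include <ctime>

// Hypothetical callback type. Returning false asks the caller to stop.
typedef bool ToyProgressFn(void* callerData);

// Processes 'itemCount' items, calling progressFn whenever at least
// progFrequency clock ticks have passed since the previous callback.
// Returns the number of items processed before finishing or being cancelled.
std::size_t ProcessWithProgress(std::size_t itemCount,
                                ToyProgressFn* progressFn,
                                std::clock_t progFrequency,
                                void* callerData) {
    std::clock_t lastCallback = std::clock();
    std::size_t done = 0;
    while (done < itemCount) {
        ++done;  // stand-in for one unit of real work
        if (progressFn == NULL) continue;
        std::clock_t now = std::clock();
        if (now - lastCallback >= progFrequency) {
            lastCallback = now;
            if (!progressFn(callerData)) break;  // client cancelled
        }
    }
    return done;
}

// Example callback: counts invocations through callerData and
// cancels on the third call.
bool CountAndCancelAfterThree(void* callerData) {
    int* calls = static_cast<int*>(callerData);
    ++*calls;
    return *calls < 3;
}
```

Passing NULL for the callback, as DemoAbstractor does, simply runs the work to completion. The callerData pointer lets the callback reach application state without global variables.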

The DemoAbstractor function calls the DumpInformation function, shown in the next section, to display the highest-ranked sentences.

Presenting the Summary

After creating a ranked list of sentence extents, you can display one or more as the document summary. The example in the listing "Breaking up a document into ranked extents" calls the DumpInformation function to present the summary. The listing "Displaying the sentences with the highest rank" shows a possible implementation of DumpInformation that simply cycles through the array of ranked sentences and displays one or more with the highest ranks.

Displaying the sentences with the highest rank.
void DumpInformation (const IADocumentAbstractor& abstractor,
		uint32 showlevel)
{
	uint32 paragraphNumber = 0;
	bool firstHasBeenShown = false;
	bool showScore = true;
	bool showRank = true;
	uint32 numberTopWords = 5;
	bool showSentences = true;
 
// Get the number of sentences and the array of pointers to those
// sentences
	uint32 cnt = abstractor.GetNumberOfExtents();
	IAExtentDoc** sentences = abstractor.GetExtents();
 
// Now loop and display the sentences or other information
	for(uint32 i=0; i< cnt; i++) {
 
	// Keep looping until you've displayed the number of sentences
	// requested, or until you run out of sentences.
		if(sentences[i] && (sentences[i]->GetRank() < showlevel) &&
				(sentences[i]->GetLength() > 0)) {
 
		// Get the sentence from the pointer
			IAExtentDoc* doc = sentences[i];
 
		// The sentence parser also groups extents.
		// If an extent begins a new group, add a separator.
			if(!firstHasBeenShown){
				firstHasBeenShown = true;
				paragraphNumber =  doc->GetGroupNumber();
			}
			
			if( doc->GetGroupNumber() > paragraphNumber){
				paragraphNumber = doc->GetGroupNumber();
				printf ("Paragraph Separator\r\r");
			}
 
		// Display the weighted score for the sentence
			if(showScore){
				printf("(%3.2f) ", doc->GetRankedHit()->GetScore());
			}
		// Display its rank
			if(showRank){
				printf("(%d)    ", 1 + (int)doc->GetRank());
				}
 
		// Display the top-ranked words in the extent
			if(numberTopWords >0 ) {
 
				if(showlevel > 0) {
					printf ("[");
					}
				
				uint32  numberToShow = numberTopWords;
				if (numberToShow >
							doc->GetRankedHit()->GetMatchingTermsLen()) {
						numberToShow =
							doc->GetRankedHit()->GetMatchingTermsLen();
						}
 
				for(uint32 w = 0 ; w< numberToShow; w++) {
					printf ("%s ",
				doc->GetRankedHit()->GetMatchingTerms()[w]->GetData());
					}
 
				if(showlevel >0) {
					printf ("]\n");
					}
				}
 
		// Display the sentence
			if(showSentences) {
				char* txt = (char*)doc->GetExtent();
				printf ("Summary => %s\n", txt);
				IAFreeArray(txt);
			}
		}
	}
}
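The selection test at the heart of DumpInformation can be isolated in a short standalone sketch. RankedExtent and SelectSummary are hypothetical stand-ins, not IAT types. The sketch assumes, as the listing above suggests, that rank 0 denotes the best extent and that an extent is shown when its rank is below the requested level and its text is not empty.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a ranked extent; rank 0 is the best match.
struct RankedExtent {
    std::string text;
    unsigned rank;
};

// Returns the non-empty extents whose rank is below showLevel, keeping
// the order in which they appear in the input array.
std::vector<std::string> SelectSummary(const std::vector<RankedExtent>& extents,
                                       unsigned showLevel) {
    std::vector<std::string> summary;
    for (std::size_t i = 0; i < extents.size(); ++i) {
        if (extents[i].rank < showLevel && !extents[i].text.empty())
            summary.push_back(extents[i].text);
    }
    return summary;
}
```

Calling SelectSummary with a showLevel of 1 yields only the single best extent, which matches the showThisManySentences value used by DemoAbstractor.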

The sections that follow describe the classes and methods used for summarizing documents using the IAT.

The IADocumentAbstractor Class

Ancestors None.

Subclasses None.

Header file Abstractor.h

Description

The following methods allow you to use the IAT to summarize the contents of a document.

Public Methods

IADocumentAbstractor

Constructor for this class.

IADocumentAbstractor (
				IAExtentParser* parser,
				IAStorage* storage,
				IAAnalysis* analysis);

parser   The parser to use to break up the document into extents. This parser must be instantiated from a subclass of IAExtentParser.

storage   The storage instance to use for the summarization.

analysis   The analysis instance used to rank the extents in the document.

~IADocumentAbstractor

Destructor for this class.

virtual ~IADocumentAbstractor();

Summarize

Summarizes a document as a ranked list of text extents.

virtual void Summarize (
				RankedProgressFn* progressFn,
				clock_t progFrequency,
				void* callerData,
				uint32 maxLevel,
				TermIndex* contextIndex);

progressFn   A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.

progFrequency   The wait time between callbacks, in clock ticks (as defined by the ANSI C CLOCKS_PER_SEC macro).

callerData   A pointer to application-specific data that is passed to the client application when the callback occurs.

maxLevel   The maximum number of extents to include in the summary.

contextIndex   A pointer to the context index to use for this summarization. The context index is used for calculating term scale factors. If you do not specify a context index, this pointer defaults to NULL.

DISCUSSION

After creating the ranked list of extents, you can use the GetExtents method to retrieve them.

SetAnalysis

Specifies the analysis instance used to rank the extents.

void SetAnalysis (IAAnalysis* analysis);

analysis  A pointer to the analysis object.

SetStorage

Specifies the storage instance.

void SetStorage (IAStorage* storage);

storage  A pointer to the storage object.

SetParser

Specifies the extent parser.

void SetParser (IAExtentParser* parser);

parser   A pointer to the extent parser object.

GetNumberOfExtents

Returns the number of extents parsed in the document.

uint32 GetNumberOfExtents () const; 

 

DISCUSSION

This method returns the number of text extents. The definition of the extent is dependent on the extent parser used.

 

GetExtents


Gets the parsed extents that make up the document, in ranked order.

IAExtentDoc** GetExtents () const; 

method result   A pointer to an array of IAExtentDoc pointers. You can determine the number of extents in the array by calling the GetNumberOfExtents method.
 

DISCUSSION

This method returns an array of pointers to the parsed text extents. The definition of the extent is dependent on the extent parser used.

 

The IAExtentParser Class

Ancestors None.

Subclass  IAANSISentenceParser

Header file  Abstractor.h

Description

You use this class to create extent parsers to use with document summarization. Your extent parser must be a subclass of IAExtentParser. If you want a sentence to be the size of the extent, you can simply use the IAANSISentenceParser subclass instead of writing your own. See The IAANSISentenceParser Class for more information.

Public Methods

IAExtentParser

Constructor for this class.

IAExtentParser (
				byte* buffer,
				uint32 bufferLength,
				bool removezReturns);

buffer  A pointer to the buffer holding the text to be parsed.

bufferLength  The size of the buffer.

removezReturns  If true, the parser removes carriage returns from the parsed extents. The default is to remove returns.

~IAExtentParser

Destructor for this class.

virtual ~IAExtentParser();

SetBuffer

Specifies the buffer containing the text to be parsed.

void SetBuffer (
				byte* buffer,
				uint32 bufferLength);

buffer  A pointer to the buffer containing the text to be parsed.

bufferLength  The length of the buffer.

GetBuffer

Returns the buffer containing the text to be parsed.

byte* GetBuffer ();

method result   A pointer to the buffer.

GetBufferLength

Returns the length of the buffer containing the text to be parsed.

uint32 GetBufferLength ();

method result  The length of the buffer.

SetStartOfExtent

Sets the beginning of a text extent.

void SetStartOfExtent (byte* start);

start  A pointer to the beginning of a text extent.

DISCUSSION

In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.

GetStartOfExtent

Returns the beginning of a text extent.

byte* GetStartOfExtent ();

method result  A pointer to the beginning of a text extent.

DISCUSSION

In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.

SetExtentLength

Sets the length of a text extent.

void SetExtentLength (uint32 length);

length  The length of a text extent.

DISCUSSION

In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.

GetExtentLength

Returns the length of a text extent.

uint32 GetExtentLength ();

method result  The length of the text extent.

DISCUSSION

In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.

SetStartOfNext

Sets the beginning of the next extent.

void SetStartOfNext (byte* start);

start  A pointer to the beginning of the next extent.

DISCUSSION

If you are creating a subclass of the IAExtentParser class, you may want to use this method to keep track of the beginning of your own parser-defined extent (such as a paragraph).

GetStartOfNext

Returns the beginning of the next extent.

byte* GetStartOfNext ();

method result   A pointer to the beginning of the next extent.

DISCUSSION

If you are subclassing the IAExtentParser class, you may want to use this method to keep track of the beginning of your own parser-defined extent (such as a paragraph).

SetBytesLeft

Sets the number of bytes left in the buffer.

void SetBytesLeft (uint32 length);

length   The number of bytes remaining to be parsed in the buffer.

DISCUSSION

In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.

GetBytesLeft

Returns the number of bytes left in the buffer.

uint32 GetBytesLeft ();

method result   The number of bytes remaining to be parsed in the buffer.

  

DISCUSSION

In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.

SetGroupNumber

Assigns a group number to a text extent.

void SetGroupNumber (uint32 gNum);

gNum  The group to assign.

DISCUSSION

If desired, you can assign extents to arbitrary groups. For example, if the extent is a sentence, a paragraph is a group of extents. An extent contained in the fifth paragraph of a document could have a group number of 5.

SEE ALSO

The GetGroupNumber method.

The SetOrderNumber method.

GetGroupNumber

Returns the group number assigned to an extent.

uint32 GetGroupNumber ();

method result   The group number assigned to the extent.

DISCUSSION

If desired, you can assign extents to arbitrary groups. For example, if the extent is a sentence, a paragraph is a group of extents. An extent contained in the fifth paragraph of a document could have a group number of 5.

SEE ALSO

The SetGroupNumber method.

The GetOrderNumber method.

SetOrderNumber

Assigns an order number to a text extent.

void SetOrderNumber (uint32 oNum);

oNum   The order number to assign.

DISCUSSION

Extents contained in groups often have order numbers to make tracking them easier. For example, the third sentence (extent) in a paragraph (group) might have the order number 3.

SEE ALSO

The GetOrderNumber method.

The SetGroupNumber method.

GetOrderNumber

Returns the order number assigned to an extent.

uint32 GetOrderNumber ();

method result   The order number assigned to the extent.

DISCUSSION

Extents contained in groups often have order numbers to make tracking them easier. For example, the third sentence (extent) in a paragraph (group) might have the order number 3.

SEE ALSO

The SetOrderNumber method.

The GetGroupNumber method.
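The relationship between group numbers and order numbers can be sketched as follows. NumberedExtent and NumberExtents are hypothetical names, and the 1-based numbering is an assumption chosen to match the examples above (fifth paragraph, group 5; third sentence, order 3). A custom parser is free to number extents differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record pairing an extent with its group and order numbers.
struct NumberedExtent {
    std::string text;
    unsigned groupNumber;  // paragraph the sentence belongs to (1-based here)
    unsigned orderNumber;  // position of the sentence within its paragraph
};

// Takes paragraphs that are already split into sentences and numbers them.
std::vector<NumberedExtent> NumberExtents(
        const std::vector<std::vector<std::string> >& paragraphs) {
    std::vector<NumberedExtent> result;
    for (std::size_t g = 0; g < paragraphs.size(); ++g) {
        for (std::size_t o = 0; o < paragraphs[g].size(); ++o) {
            NumberedExtent e;
            e.text = paragraphs[g][o];
            e.groupNumber = static_cast<unsigned>(g + 1);
            e.orderNumber = static_cast<unsigned>(o + 1);
            result.push_back(e);
        }
    }
    return result;
}
```

Keeping both numbers with each extent lets display code, such as a routine that prints a separator when the group number changes, restore paragraph structure after the extents have been reordered by rank.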

IsCancelled

Determines whether the parsing process was cancelled.

bool IsCancelled ();

method result   True if the parsing process was cancelled by the client.

DISCUSSION

In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.

AreReturnsRemoved

Determines whether carriage returns were removed from the document.

bool AreReturnsRemoved ();

method result   True if carriage returns were removed.

DISCUSSION

In most cases, you need to use this method only if you are creating a subclass of IAExtentParser.

GetNextExtent

Returns the next text extent.

virtual IAExtentDoc* GetNextExtent (
				RankedProgressFn* progressFn,
				clock_t progFrequency,
				void* callerData);

progressFn   A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.

progFrequency   The wait time between callbacks, in clock ticks (as defined by the ANSI C CLOCKS_PER_SEC macro).

callerData   A pointer to application-specific data that is passed to the client application when the callback occurs.

method result   A pointer to the next parsed extent.

DISCUSSION

This method calls the protected method GettingNextExtent, which does the actual work of parsing the next text extent.

Protected Method

GettingNextExtent

Returns the next text extent.

 virtual IAExtentDoc* GettingNextExtent (
					RankedProgressFn* progressFn,
					clock_t progFrequency,
					void* callerData,
					bool *endBuffer) = 0;

progressFn  A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.

progFrequency  The wait time between callbacks, in clock ticks (as defined by the ANSI C CLOCKS_PER_SEC macro).

callerData  A pointer to application-specific data that is passed to the client application when the callback occurs.

endBuffer  If true on output, the parser has reached the end of the buffer.

method result  A pointer to the next parsed extent.

DISCUSSION

This is a pure virtual method that does the actual work for the public method GetNextExtent. When implementing your own extent parsers, you must override this method.

The IAANSISentenceParser Class

Ancestor   IAExtentParser

Subclasses  None.

Header file  Abstractor.h

Description

The IAANSISentenceParser class is a subclass of IAExtentParser that breaks up a document into text extents that are sentences.

Public Methods

IAANSISentenceParser

Constructor for this class.

IAANSISentenceParser (
				byte* buffer,
				uint32 bufferLength,
				const char* abbrevFile,
				bool removezReturns);

buffer  A pointer to the buffer that holds the text to be parsed.

bufferLength  The size of the buffer.

abbrevFile  The path to a text file containing abbreviated words. The sentence parser assumes that a punctuation mark such as a period (.) signals the end of a sentence unless the word containing the mark is listed in this file (for example, the abbreviation e.g. should go in the file). In general, this path should point to the file EnglishAbbreviations.

removezReturns  If true, the parser removes carriage returns from the parsed sentences. The default is to remove returns.

~IAANSISentenceParser

Destructor for this class.

virtual ~IAANSISentenceParser();

SetRequiredUpper

Sets whether a parsed sentence must begin with an upper-case letter.

void SetRequiredUpper (bool value);

value   If true, the parser requires a sentence to begin with an upper-case letter. The default setting is true.

IsRequiredUpper

Indicates whether a parsed sentence is required to start with an upper-case letter.

bool IsRequiredUpper () const;

method result   If true, the parser requires a sentence to begin with an upper-case letter.

Protected Method

GettingNextExtent

Returns the next parsed sentence in the document.

virtual IAExtentDoc* GettingNextExtent (
				RankedProgressFn* progressFn,
				clock_t progFrequency,
				void* callerData,
				bool* endBuffer);

progressFn   A pointer to an application-defined progress function. If not NULL, the IAT calls this function periodically to give the client application control.

progFrequency   The wait time between callbacks, in clock ticks (as defined by the ANSI C CLOCKS_PER_SEC macro).

callerData   A pointer to application-specific data that is passed to the client application when the callback occurs.

endBuffer   If true on output, the parser has reached the end of the buffer.

method result   A pointer to the next parsed sentence.

DISCUSSION

This method is called by the public method GetNextExtent.

 

The IAExtentCorpus Class

Ancestor   IACorpus

Subclasses   None.

Header file   Abstractor.h

Description

The IAExtentCorpus class is a subclass of IACorpus that represents a set of strings in memory (such as a collection of text extents). The actual body of the strings is maintained in a buffer managed by the client. For more information about corpus classes, see Chapter 8, "Corpus Category," in the Apple Information Access Toolkit v.1.0 Programmer's Guide.

Public Methods

IAExtentCorpus

Constructor for this class.

IAExtentCorpus (uint32 corpType);
IAExtentCorpus (byte* buffer, uint32 corpType);

buffer   A pointer to the buffer holding the text associated with this corpus.

corpType   An integer specifying the type of extents this corpus handles. If you do not specify this parameter, the default corpus type is ExtentCorpusType.

~IAExtentCorpus

Destructor for this class.

virtual ~IAExtentCorpus();

GetProtoDoc

Tells the index the type of documents this corpus represents.

IADoc* GetProtoDoc ();

method result   A pointer to a new instance of the document type (for example, an instance of IAExtentDoc).

GetDocText

Retrieves the text associated with an extent.

IADocText* GetDocText (const IADoc* doc);

doc   The extent whose text you want to retrieve.

method result   A pointer to the extent text.

GetBuffer

Retrieves the buffer associated with the corpus.

byte* GetBuffer ();

method result   A pointer to the buffer.

The IAExtentDoc Class

Ancestor   IADoc

Subclasses   None.

Header file   Abstractor.h

Description

The IAExtentDoc class is a subclass of IADoc that represents a text extent in a corpus represented by IAExtentCorpus. See The IAExtentCorpus Class for more information about IAExtentCorpus.

Public Methods

IAExtentDoc

Constructor for this class.

IAExtentDoc (
			const byte* buffer,
			uint32 offset,
			uint32 textLength,
			uint32 extentNumber,
			uint32 groupNumber,
			uint32 rank,
			RankedHit* rankHitVal);

buffer  A pointer to the buffer containing the text extent.

offset   The offset of the text extent in the buffer, in bytes.

textLength   The length of the text extent, in bytes.

extentNumber  The extent number assigned to the extent.

groupNumber  The group number assigned to the extent.

rank   The rank assigned to the extent. If you do not specify this parameter, the default rank is 0.

rankHitVal   The ranked hit value assigned to the extent. If you do not specify this parameter, the default ranked hit value is NULL.

~IAExtentDoc

Destructor for this class.

virtual ~IAExtentDoc();

DeepCopy

Creates a deep copy of the extent.

IAStorable* DeepCopy () const;

method result   A pointer to the copy of the extent.

DISCUSSION

Calling DeepCopy creates a copy of the extent as well as copies of any objects referenced by the extent.

StoreSize

Returns the storage size of the extent.

uint32 StoreSize () const;

method result   The size of the extent, in bytes.

Store

Stores the extent.

void Store (IAOutputBlock *output) const;

output   A pointer to the output block in which to store the extent.

SEE ALSO

The method .

Restore

Restores the extent from a given block.

IAStorable* Restore (IAInputBlock *input) const;

input   A pointer to the block containing the extent.

method result   A pointer to the restored extent.

SEE ALSO

The method .

LessThan

Checks if the extent is less than another.

bool LessThan (const IAOrderedStorable neighbor) const;

neighbor   The extent you want to compare against.

method result   True if the extent is less than neighbor.

Equal

Checks to see if two extents are equal.

bool Equal (const IAOrderedStorable neighbor) const;

neighbor   The extent you want to compare against.

method result   True if the two extents are equal.

GetText

Gets the buffer associated with the extent.

byte* GetText () const;

method result   A pointer to the buffer containing the text.

GetLength

Gets the length of an extent.

uint32 GetLength () const;

method result   The length of the extent, in bytes.

GetOffset

Gets the offset of the extent within the buffer.

uint32 GetOffset () const;

method result   The offset of the extent, in bytes.

GetExtent

Gets the extent as a null-terminated string.

byte* GetExtent () const;

method result   A pointer to the extent.

DISCUSSION

Space for the returned extent is allocated by IAMallocArray, so you must call IAFreeArray to free the memory when you no longer need the extent.
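One way to make sure the returned array is always freed is to wrap it in a smart pointer with a custom deleter. The sketch below uses stand-in functions (ToyMallocArray, ToyFreeArray, ToyGetExtent) because this chapter does not give the signatures of IAMallocArray and IAFreeArray. Treat the whole block as an assumption-laden illustration of the ownership rule rather than toolkit code, and note that it relies on std::unique_ptr, which requires a C++11 compiler.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Hypothetical stand-ins for the toolkit allocator pair. The counter
// tracks live arrays so the example can demonstrate that nothing leaks.
static int gLiveArrays = 0;

char* ToyMallocArray(std::size_t n) {
    char* p = static_cast<char*>(std::malloc(n));
    if (p) ++gLiveArrays;
    return p;
}

void ToyFreeArray(char* p) {
    if (p) {
        --gLiveArrays;
        std::free(p);
    }
}

// Stand-in for a GetExtent-style call that returns a caller-owned,
// null-terminated copy of the text.
char* ToyGetExtent(const char* text) {
    std::size_t n = std::strlen(text) + 1;
    char* copy = ToyMallocArray(n);
    if (copy) std::memcpy(copy, text, n);
    return copy;
}

// Deleter that releases the array with the matching free function.
struct ToyArrayDeleter {
    void operator()(char* p) const { ToyFreeArray(p); }
};

// Owning handle: the array is freed when the handle goes out of scope.
typedef std::unique_ptr<char[], ToyArrayDeleter> OwnedExtent;
```

Constructing an OwnedExtent from the returned pointer releases the array automatically on every exit path, including early returns, so no explicit free call is needed at each one.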

SetRank

Sets the rank of an extent.

void SetRank (uint32 rank) const;

rank   The rank to assign to the extent.

SEE ALSO

The method .

GetRank

Gets the rank of an extent.

uint32 GetRank () const;

method result   The rank assigned to the extent.

SEE ALSO

The method .

SetRankedHit

Sets the ranked hit value of an extent.

void SetRankedHit (RankedHit* rankHitVal) const;

rankHitVal   The ranked hit value to assign to the extent.

SEE ALSO

The method .

GetRankedHit

Gets the ranked hit value of an extent.

RankedHit* GetRankedHit () const;

method result   The ranked hit value assigned to the extent.

SEE ALSO

The method .

GetExtentNumber

Gets the extent number assigned to the extent.

uint32 GetExtentNumber () const;

method result   The extent number assigned to the extent.

SEE ALSO

.

GetGroupNumber

Gets the group number assigned to the extent.

uint32 GetGroupNumber () const;

method result   The group number assigned to the extent.

SEE ALSO

.