Search Engine


Search Overview
The search engine is bundled with the Sambar Server. All files being indexed must reside under the Sambar Server document directory and be available to the HTTP server. URLs are created by the index server for all files found as part of the indexing task. Should files be removed or new files added, the index must be regenerated.
The indexing process is initiated from the System Administration console of the Sambar Server (WWW interface).

Search Indexer
By default, all pages under the Documents Directory are indexed (see the configuration management section for details on indexing specific directories). No words found in the stopword.ini file are indexed. A hash table index is built of all alpha-numeric strings found in the files searched. This hash index is very fast to search, but relatively bulky from a disk usage standpoint.
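
As a rough illustration of the idea described above (not the Sambar implementation), a hash index with stop-word filtering can be sketched in Python. The stop-word set, the `build_index` name, and the sample URLs are assumptions made for this example. The real stop words come from the stopword.ini file, whose format is not shown here.

```python
import re
from collections import defaultdict

# Hypothetical stop words; the server reads its list from stopword.ini.
STOP_WORDS = {"the", "a", "of", "and"}

def build_index(documents, stop_words=STOP_WORDS):
    """Map each alphanumeric word to a {url: occurrence count} table."""
    index = defaultdict(dict)
    for url, text in documents.items():
        # Split the text into alphanumeric strings, ignoring case.
        for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
            if word in stop_words:
                continue  # stop words are never indexed
            index[word][url] = index[word].get(url, 0) + 1
    return index

# Hypothetical documents keyed by URL.
docs = {
    "/docs/a.htm": "The Louvre and the galleries of Paris",
    "/docs/b.htm": "Paris film noir; a Paris story",
}
index = build_index(docs)
print(index["paris"])
```

A lookup is a single hash probe per query word, which is why such an index is fast to search. Storing an entry for every distinct word is what makes it bulky on disk.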

The Search Indexer provides the ability to specify the files to be indexed. The WWW Server must have read access to all the files being indexed. Files may be filtered by file extension, individual files, directory, or by a directory and all its sub-directories. All index files are placed in the search sub-directory located in the installation directory of the Sambar Server.

Documents are indexed by file name, file size, and last modified date. In addition, for HTML files, the TITLE is parsed and used as the description of the file. In this release, weighting is based on the number of times a word appears in a document, with additional weight given to words that appear in the title or a heading.

Multiple indexes may be built and individually searched. Additional indexes are defined by editing the config.ini file and adding additional Index File directives. Each Index File directive results in an index of that name being made available.

Indexes are restricted to files found within the default directory identified in the config.ini file. Directories associated with virtual-hosts cannot presently be indexed. In a future release the ability to search across multiple indexes will be supported as will the ability to index files associated with virtual-host directories.

Stop Word List

The server administrator can specify a list of stop words that are judged to be trivial with respect to the content of the source files. Before the index server begins indexing a group of files, the stop word list is loaded from the config/stopword.ini file. After the files have been indexed, the administrator has access to a list of all the words processed by the index engine (search/search.wrd). This list may be used as a guide for customizing the stop word list for subsequent re-indexing.

Query String
The following rules govern search patterns:

paris galerie louvre
Finds documents containing as many of these words and phrases as possible, ranked so that documents with the most matches are presented first.
Lower-case search will find matches of capitalized words also. For example, paris will find matches for paris, Paris, and PARIS.

noir +film -pinot
Matches may be required, optional, or prohibited. Precede a required word or phrase with + and a prohibited one with -. This query finds documents that contain film and do not contain pinot; noir is optional, so documents that also contain noir are ranked ahead of those that do not.

The following chart shows sample queries that use these operators:

Searching...    Results in...
cable + car     Documents with both words.
cable car       Documents with either word. This produces the greatest number of matches.
cable -car      Documents about cable, but not about cable cars.
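
The required, optional, and prohibited rules can be sketched as follows. This is a minimal illustration rather than the server's parser: the `parse_query` and `matches` names are hypothetical, phrases are not handled, and only the prefixed form (+film, -pinot) is recognized.

```python
def parse_query(query):
    """Split a query into required, optional, and prohibited words."""
    required, optional, prohibited = [], [], []
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.append(term[1:])
        elif term.startswith("-") and len(term) > 1:
            prohibited.append(term[1:])
        elif term not in ("+", "-"):
            optional.append(term)  # bare operators are skipped in this sketch
    return required, optional, prohibited

def matches(doc_words, required, optional, prohibited):
    """Return True if a document's word set satisfies the query."""
    if any(word in doc_words for word in prohibited):
        return False
    if not all(word in doc_words for word in required):
        return False
    # With no required words, at least one optional word must be present.
    return bool(required) or any(word in doc_words for word in optional)

required, optional, prohibited = parse_query("noir +film -pinot")
print(required, optional, prohibited)
print(matches({"film", "noir"}, required, optional, prohibited))
print(matches({"film", "pinot"}, required, optional, prohibited))
```

Optional words do not affect whether a document matches when a required word is present; in this sketch they would only influence ranking.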

Wildcard Searches
If the Allow Wildcarding flag in the config/config.ini file is set to true, the arguments to the search engine are examined for wildcard characters. If any are found, the search index is walked and each entry is compared with the pattern.

Wildcard search patterns are:

* The star (*) character matches any sequence of zero or more characters.

? The question-mark (?) character matches any single character.

[] Brackets ([]) match a single character in the string being searched against any one of the characters listed within the brackets.
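
The three wildcard forms behave much like shell-style patterns, so walking an index with them can be approximated with Python's standard `fnmatch` module. This is only a sketch: the `expand_wildcard` name and the word list are made up for the example, and the server's exact matching rules may differ from those of `fnmatch`.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, indexed_words):
    """Walk the indexed words and keep those that match the pattern."""
    pattern = pattern.lower()
    return sorted(word for word in indexed_words if fnmatchcase(word, pattern))

# Hypothetical set of words taken from an index.
words = ["cab", "cable", "cables", "car", "cart", "cat"]

print(expand_wildcard("cab*", words))    # star: any run of characters
print(expand_wildcard("ca?", words))     # question mark: exactly one character
print(expand_wildcard("ca[rt]", words))  # brackets: one of the listed characters
```

Because every index entry is compared with the pattern, a wildcard search costs more than an exact lookup, which may be why it is controlled by a configuration flag.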



Ranking Simple Queries

The Sambar Search engine ranks the results based on a scoring algorithm; documents with a higher score appear at the head of the ranking list. A document has a higher score if the following hold:
  • the query words or phrases are found in the special sections of the document such as the title or headings.
  • the document contains multiple instances of the query word or phrase.
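
A scoring scheme consistent with the two criteria above might look like the following sketch. The weights, the `score` and `rank` names, and the document layout are all assumptions chosen for illustration; the actual values used by the search engine are not documented here.

```python
# Hypothetical weights: title and heading matches count for more than body text.
TITLE_WEIGHT = 5
HEADING_WEIGHT = 3
BODY_WEIGHT = 1

def score(doc, query_words):
    """Sum weighted occurrence counts of the query words in one document."""
    total = 0
    for word in query_words:
        total += TITLE_WEIGHT * doc["title"].count(word)
        total += HEADING_WEIGHT * doc["headings"].count(word)
        total += BODY_WEIGHT * doc["body"].count(word)
    return total

def rank(documents, query_words):
    """Return (url, score) pairs for matching documents, best first."""
    scored = [(url, score(doc, query_words)) for url, doc in documents.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical documents, already split into lower-case words per section.
documents = {
    "/a.htm": {"title": ["paris", "guide"], "headings": [],
               "body": ["paris", "hotels", "paris"]},
    "/b.htm": {"title": ["travel"], "headings": ["paris"], "body": ["paris"]},
    "/c.htm": {"title": ["wine"], "headings": [], "body": ["pinot"]},
}
print(rank(documents, ["paris"]))
```

Here /a.htm is listed first because the query word appears in its title and twice in its body, while /c.htm is omitted because it contains no query word.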

Multiple Indexes
Multiple search indexes can be created and used with the Sambar Server. Each Index File entry in the config.ini file identifies a different search index. Initially, the search index is empty; by using the System Administration console, one or more directories can be indexed for use by the search engine.


Copyright 1995 to 1997 Sambar Technologies