Search Overview
The search engine is bundled with the Sambar Server. All files being indexed
must reside under the Sambar Server document directory and be available
to the HTTP server. URLs are created by the index server for all files
found as part of the indexing task. Should files be removed or new files
added, the index must be regenerated.
The indexing process is initiated from the System Administration console
of the Sambar Server (WWW interface).
Search Indexer
By default, all pages under the Documents Directory are indexed
(see the configuration management section for details on indexing
specific directories). Words listed in the stopword.ini
file are not indexed. A hash table index is built from all alphanumeric
strings found in the indexed files. This hash index is very fast to
search, but relatively bulky in terms of disk usage.
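The indexing step described above can be illustrated with a minimal Python sketch. It is not the Sambar implementation: the stop word set, the document names, and the function build_index are assumptions made for illustration, and a Python dict stands in for the hash table.

```python
import re
from collections import defaultdict

# Hypothetical stop words; the real list is read from stopword.ini.
STOP_WORDS = {"the", "a", "of", "and"}

# An "alphanumeric string" is taken here to be a run of letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def build_index(documents, stop_words=STOP_WORDS):
    """Map each lower-cased token to {document name: occurrence count}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for token in TOKEN.findall(text):
            word = token.lower()
            if word in stop_words:
                continue  # stop words are never indexed
            index[word][name] = index[word].get(name, 0) + 1
    return dict(index)


# Hypothetical documents, keyed by file name.
docs = {
    "a.htm": "The Louvre and the galerie of Paris",
    "b.htm": "Paris, Paris: a film",
}
index = build_index(docs)
print(index["paris"])
```

Looking up a word is a single dictionary access, which is why such an index is fast to search; storing an entry for every distinct word is what makes it bulky.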
The Search Indexer provides the ability to specify the files to be indexed.
The WWW Server must have read access to all the files being indexed.
Files may be filtered by file extension, individual files, directory, or by a
directory and all its sub-directories. All index files are placed in
the search sub-directory located in the installation directory of the
Sambar Server.
Documents are indexed by file name, file size and last modified date.
In addition, in the case of HTML files, the TITLE is parsed and used
as the description of the file. In this release, weighting is based only
on the number of times a word appears in a document, with additional
weight given to words appearing in the title or a heading.
Multiple indexes may be built and individually searched.
Additional indexes are defined by editing the config.ini file and adding
additional Index File directives. Each Index File directive
results in an index of that name being made available.
Indexes are restricted to files found within the default directory identified
in the config.ini file.
Directories associated with virtual-hosts cannot presently be indexed.
In a future release the ability to search across multiple indexes will
be supported as will the ability to index files associated with virtual-host
directories.
Stop Word List
The server administrator has the ability to specify a list of stop words
that are judged to be trivial with respect to the content of the source files.
Before the index server begins indexing a group of files, it loads the
stop words from the config/stopword.ini file.
After the files have been indexed, the administrator has access to a list of
all the words which have been processed (search/search.wrd) by the
index engine. This may be used as a guide for customizing the stop word list
for subsequent re-indexing.
Query String
The following rules govern search patterns:
- paris galerie louvre
- Finds documents containing as many of these words and phrases as
possible, ranked so that documents with the most matches are presented first.
- A lower-case search term also matches capitalized words. For example,
paris will find matches for paris, Paris, and
PARIS.
- noir +film -pinot
- Matches may be required, optional, or prohibited. Precede a required
word or phrase with + and a prohibited one with -. This query finds documents
that contain film and do not contain pinot; noir is optional, and documents
that also contain it have more matches and so are presented earlier.
These boolean operators determine whether a document is included in the
results. The following chart shows sample usage:
Searching... | Results in... |
cable + car | Documents with both words. |
cable car | Documents with either word. This yields the greatest number of matches. |
cable -car | Documents about cable, but not about cable cars. |
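As an illustration of the required, optional, and prohibited rules, the following Python sketch splits a query into its three groups and tests a document against them. The functions parse_query and matches are hypothetical, and the sketch assumes each operator is attached directly to its word (as in +film); how the Sambar engine itself tokenizes queries is not specified here.

```python
def parse_query(query):
    """Split a query into (required, optional, prohibited) word lists."""
    required, optional, prohibited = [], [], []
    for term in query.lower().split():
        if term in ("+", "-"):
            continue  # a bare operator carries no word; ignored in this sketch
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            prohibited.append(term[1:])
        else:
            optional.append(term)
    return required, optional, prohibited


def matches(doc_words, required, optional, prohibited):
    """Return True if a document's words satisfy the parsed query."""
    words = set(doc_words)
    if any(term in words for term in prohibited):
        return False
    if not all(term in words for term in required):
        return False
    # With no required words, at least one optional word must be present.
    return bool(required) or any(term in words for term in optional)


required, optional, prohibited = parse_query("noir +film -pinot")
print(required, optional, prohibited)
```

Optional words do not affect whether a document matches once a required word is present; in this sketch they would only influence ranking.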
Wildcard Searches
If the Allow Wildcarding flag in the config/config.ini
file is set to true, the arguments to the search engine
are examined for wildcard characters. If any are found, the search index
is walked and each entry is compared with the pattern.
Wildcard search patterns are:
* The star (*) character performs an expansive pattern match,
matching any run of characters.
? The question-mark (?) character
matches any single character.
[] Brackets ([]) match a single character in the string being
searched against any one of the characters listed within the
brackets.
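The walk over the index can be sketched in Python with the standard fnmatch module, whose *, ? and [] patterns resemble the ones listed above. This is an approximation for illustration only: the Sambar engine's exact matching rules may differ, and expand_wildcard and the word list are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, indexed_words):
    """Walk the indexed words and return those matching the pattern."""
    pattern = pattern.lower()
    return sorted(word for word in indexed_words if fnmatchcase(word, pattern))


# Hypothetical contents of a search index.
words = ["cable", "cables", "car", "cart", "cat", "cot"]

print(expand_wildcard("ca*", words))
print(expand_wildcard("ca?", words))
print(expand_wildcard("c[ao]t", words))
```

Because every entry in the index is compared with the pattern, a wildcard search is slower than a plain hash lookup, which may be why it is controlled by a configuration flag.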
Ranking Simple Queries
The Sambar Search engine ranks the results using a scoring algorithm;
documents with a higher score appear at the top of the ranking list.
A document receives a higher score if either of the following holds:
- the query words or phrases are found in the special sections of the
document such as the title or headings.
- the document contains multiple instances of the query word or phrase.
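The two scoring criteria can be sketched in Python as follows. The weight given to title and heading matches (TITLE_WEIGHT) is an assumed value, since the actual weights are not documented here, and the document structure and function names are hypothetical.

```python
# Assumed extra weight for a match in the title or a heading.
TITLE_WEIGHT = 5


def score(doc, terms):
    """Count body occurrences and add extra weight for title/heading hits."""
    body = doc["body"].lower().split()
    special = (doc["title"] + " " + doc["headings"]).lower().split()
    total = 0
    for term in terms:
        total += body.count(term) + TITLE_WEIGHT * special.count(term)
    return total


def rank(docs, terms):
    """Return document names ordered from highest to lowest score."""
    return sorted(docs, key=lambda name: score(docs[name], terms), reverse=True)


# Hypothetical documents.
docs = {
    "a.htm": {"title": "Paris museums", "headings": "", "body": "the louvre is in paris"},
    "b.htm": {"title": "Travel notes", "headings": "", "body": "paris paris paris"},
    "c.htm": {"title": "Wine", "headings": "", "body": "pinot noir"},
}
print(rank(docs, ["paris"]))
```

With these assumed weights, a single title match outweighs three occurrences in the body, so a.htm is ranked ahead of b.htm.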
Multiple Indexes
Multiple search indexes can be created and used with the Sambar
Server. Each Index File entry in the config.ini file
identifies a different search index. Initially, the search index
is empty; by using the System Administration console, one or more
directories can be indexed for use by the search engine.