- How to Search a WAIS Database
-
- The WAIS search engine is at the heart of the WAIS Server and
- Workstation products. The WAIS search engine receives a user's
- question, searches its database for documents most relevant to the
- question, and returns a relevance-ranked list of documents to the
- user. Each document is given a score from 1 to 1000, based on how well
- it matched the user's question (how many words it contained, their
- importance in the document, etc.). A question is an expression
- containing a combination of natural language, relevant documents, and
- boolean terms. Other key features of the WAIS search engine include
- fielded search, right truncation (wildcard searching), and relevance
- ranking.
-
- Natural Language
-
- The server can be queried using natural language questions. The server
- does not understand the question; rather, it takes the words and
- phrases in the question and finds documents that have those words and
- phrases in them. "Tell me about portable computers." is an example of
- a natural language question. In this example, the WAIS Server would
- search for documents containing the words 'portable' and 'computers';
- the other words, 'tell', 'me', and 'about', are called "stop words" --
- they are so common that they occur in almost every document, so they
- are not used for searching a document.
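-
- The stop-word step described above can be sketched as follows. This
- is only an illustration: the STOP_WORDS set and the search_terms
- function are hypothetical names, and the actual stop-word list used
- by the WAIS Server is not given here.
-
```python
# Hypothetical stop-word list; the real WAIS list is not given in this document.
STOP_WORDS = {"tell", "me", "about", "the", "a", "an", "of", "in", "is"}

def search_terms(question):
    """Lowercase the question, strip surrounding punctuation, drop stop words."""
    words = [w.strip(".,?!\"'").lower() for w in question.split()]
    return [w for w in words if w and w not in STOP_WORDS]

print(search_terms("Tell me about portable computers."))
# prints ['portable', 'computers']
```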
-
- Boolean Operators
-
- The boolean operators AND, OR, NOT, and ADJ help establish
- logical relationships between concepts expressed in natural language.
- These operators are especially useful in narrowing down the search.
-
- o The AND operator is helpful in restricting a search when a particular
- pair of terms is known. For instance, when searching for documents on
- the weather in Boston, a question such as "weather AND Boston" would
- return only those documents that contain both the word "weather" and
- the word "Boston".
-
- o The OR operator is often used to join two different phrases of a
- Boolean search. A question such as "hurricane OR tornado" would search
- for all documents containing either the word "hurricane", or the word
- "tornado", or both. A natural language question is much like having an
- implicit OR between the words, except that the search engine does more
- work in a natural language query to determine the relevance of words
- and their relationships in a phrase.
-
- o The NOT operator is used to reject any documents that contain certain
- words. The question "basketball NOT college" would find all documents
- that contain the word "basketball" but do not contain the word
- "college". (Note, however, that this question would also eliminate
- articles on professional players that mention their alma maters!)
-
- o The adjacent operator, ADJ, is used to ensure that one word is
- followed by another in the returned document, with no other words in
- between. For example, "cordless ADJ telephone" returns only documents
- with exactly "cordless telephone" and not any documents that only
- contain the words "cordless" and "telephone" separately.
-
- Mixed Natural Language and Boolean Operators
-
- Unique to the WAIS Inc server is the ability for users to combine
- natural language and boolean operators to better target their
- searches. For example, suppose you were looking for documents
- specifically on portable laptop computers that are not made by Apple.
- The question could then be "Tell me about portable laptop computers
- NOT Apple."
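-
- One way to picture the four operators is as set operations over the
- documents that contain each word. The sketch below assumes a tiny
- in-memory collection; DOCS, containing, and adj are hypothetical
- names, and this is not how the WAIS Server itself is implemented.
-
```python
# Hypothetical sample collection: document id -> text.
DOCS = {
    1: "weather report for Boston",
    2: "hurricane season weather outlook",
    3: "cordless telephone review",
    4: "telephone with a cordless handset",
}

def words(text):
    return text.lower().split()

def containing(word):
    """Ids of all documents that contain the word (case-insensitive)."""
    return {d for d, t in DOCS.items() if word.lower() in words(t)}

def adj(first, second):
    """Ids of documents where `first` is immediately followed by `second`."""
    first, second = first.lower(), second.lower()
    hits = set()
    for d, t in DOCS.items():
        w = words(t)
        if any(x == first and y == second for x, y in zip(w, w[1:])):
            hits.add(d)
    return hits

weather_and_boston = containing("weather") & containing("Boston")       # AND
hurricane_or_tornado = containing("hurricane") | containing("tornado")  # OR
telephone_not_review = containing("telephone") - containing("review")   # NOT
cordless_adj_telephone = adj("cordless", "telephone")                   # ADJ
```
-
- With this sample data, "cordless ADJ telephone" matches only document
- 3, while document 4 contains both words but not next to each other.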
-
- Fielded Search
-
- For data collections whose documents are structured in a semi-regular
- format, the regular portions of the documents can be tagged by the
- WAIS parser as fields. A client can then ask a WAIS server to limit
- its search to those documents containing a user-specified value of a
- particular field. This is called a "Fielded Search".
-
- The mail-or-rmail parse format is an example of a parse format in
- which fields are tagged. For this parse format, the WAIS parser
- detects the "to" and "cc" fields, the "from" and "sender" fields, the
- "subject" field, and the "date" field. An example of a question using
- natural language, a boolean operator, and fielded search is: "company
- picnic AND from=barbara". The WAIS server would then return documents
- containing messages about a company picnic that barbara sent.
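-
- A minimal sketch of the idea, assuming messages have already been
- parsed into fields. MESSAGES and fielded_search are hypothetical, and
- for simplicity the sketch requires every text word to be present,
- whereas a natural language question would be relevance-ranked.
-
```python
# Hypothetical parsed mail messages; field names follow the text above.
MESSAGES = [
    {"from": "barbara", "subject": "Company picnic",
     "body": "The company picnic is on Friday."},
    {"from": "ben", "subject": "Re: picnic",
     "body": "I will come to the company picnic."},
    {"from": "barbara", "subject": "Budget",
     "body": "Quarterly budget attached."},
]

def fielded_search(messages, text_words, field, value):
    """Keep messages whose field equals value and whose text has all words."""
    hits = []
    for msg in messages:
        text = (msg["subject"] + " " + msg["body"]).lower().replace(".", " ")
        present = text.split()
        field_ok = msg.get(field, "").lower() == value.lower()
        if field_ok and all(w.lower() in present for w in text_words):
            hits.append(msg)
    return hits

# Roughly "company picnic AND from=barbara"
results = fielded_search(MESSAGES, ["company", "picnic"], "from", "barbara")
print([m["subject"] for m in results])
```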
-
- Right Truncation (Wildcards)
-
- A user can specify right truncation by ending a word with the asterisk
- ('*') wild card character. This tells the search engine to search on
- words matching the base characters before the '*' and to ignore any
- trailing characters. For example, you might use right truncation in a
- question such as "geo*", which may retrieve documents containing the
- words: geographer, geography, geologist, geometry, geometrical, etc.
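-
- Right truncation amounts to a prefix match. A small sketch, with a
- hypothetical matches_truncated function and a made-up vocabulary:
-
```python
def matches_truncated(term, word):
    """True if word matches term; a trailing '*' means 'any ending'."""
    if term.endswith("*"):
        return word.lower().startswith(term[:-1].lower())
    return word.lower() == term.lower()

# Made-up vocabulary for illustration.
vocabulary = ["geographer", "geography", "geologist", "geometry", "gecko", "ego"]
matched = [w for w in vocabulary if matches_truncated("geo*", w)]
print(matched)
# prints ['geographer', 'geography', 'geologist', 'geometry']
```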
-
- Grouping Search Terms
-
- A user can group search terms and phrases together using parentheses.
- For example, if you wished to search for information about snowstorms,
- tornadoes, or hurricanes in New York City, you might search for
- "(snowstorms OR tornadoes OR hurricanes) AND (New ADJ York ADJ City)".
- You can also nest your parentheses; for example, "from = ( (ben ADJ
- wais) OR (brewster ADJ think) )" searches for messages from either
- ben@wais.com or brewster@think.com.
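-
- To show how parentheses shape a question, the sketch below parses a
- purely boolean question into a nested tree. It rests on assumptions
- not stated here: operators are uppercase, all have equal precedence,
- and they are applied left to right unless parentheses say otherwise.
- It does not handle field names or natural language phrases, and the
- function names are hypothetical.
-
```python
import re

OPERATORS = {"AND", "OR", "NOT", "ADJ"}

def tokenize(question):
    """Split into parentheses and runs of other non-space characters."""
    return re.findall(r"\(|\)|[^\s()]+", question)

def parse(question):
    tokens = tokenize(question)
    tree, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree

def parse_expr(tokens, pos):
    """operand (OPERATOR operand)*, combined left to right."""
    left, pos = parse_operand(tokens, pos)
    while pos < len(tokens) and tokens[pos] in OPERATORS:
        op = tokens[pos]
        right, pos = parse_operand(tokens, pos + 1)
        left = (op, left, right)
    return left, pos

def parse_operand(tokens, pos):
    """A single word, or a parenthesized sub-expression."""
    if pos >= len(tokens):
        raise ValueError("question ended where a term was expected")
    if tokens[pos] == "(":
        tree, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return tree, pos + 1
    if tokens[pos] == ")" or tokens[pos] in OPERATORS:
        raise ValueError("unexpected token: " + tokens[pos])
    return tokens[pos], pos + 1

tree = parse("((ben ADJ wais) OR (brewster ADJ think))")
print(tree)
# prints ('OR', ('ADJ', 'ben', 'wais'), ('ADJ', 'brewster', 'think'))
```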
-
- Relevance Ranking
-
- Each document is scored based on its relevance to a user's question,
- where the most relevant document has the highest score, or rank --
- 1000 being the highest, 1 being the lowest. A document receives a
- higher score if the words in the question are in the headline, or if
- the words appear many times, or if phrases occur as in the question. A
- document's score is derived using techniques such as word weighting,
- term weighting, proximity relationships, and word density. Note that
- questions made up of natural language, relevant documents, and boolean
- expressions are all weighted using these techniques.
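-
- How raw scores are mapped onto the 1 to 1000 range is not described
- here. One plausible sketch, purely as an assumption, scales every raw
- score relative to the best one so that the top document gets 1000:
-
```python
def scale_scores(raw_scores):
    """Map raw scores onto 1..1000; the best document gets 1000.

    Assumed scheme for illustration, not the documented WAIS method.
    """
    if not raw_scores:
        return {}
    top = max(raw_scores.values())
    if top <= 0:
        return {doc: 1 for doc in raw_scores}
    return {doc: max(1, round(1000 * s / top)) for doc, s in raw_scores.items()}

def ranked(raw_scores):
    """Documents with scaled scores, most relevant first."""
    scaled = scale_scores(raw_scores)
    return sorted(scaled.items(), key=lambda item: item[1], reverse=True)

print(ranked({"doc-a": 5.0, "doc-b": 2.5, "doc-c": 0.001}))
# prints [('doc-a', 1000), ('doc-b', 500), ('doc-c', 1)]
```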
-
- Word Weight
-
- If a word in a document is found to match a word in the user's
- question, the word is assigned a weight, and this weight adds to the
- overall score of the document. The exact weight that a word receives
- depends on the emphasis given to the word by the author, and on where
- in the document the word was found. For example, a word is weighted
- highest if it appears in the headline, lower if the word has all
- capital letters or if the first letter of the word is capitalized, and
- finally, lowest if it appears only in the text. The WAIS parser
- determines word weights as it reads through the original data
- collection.
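-
- The ordering above can be sketched as a small function. The numeric
- values are hypothetical, since only the ordering (headline, then
- capitalized or all-capital words, then plain text) is given here.
-
```python
# Hypothetical values; only their relative order comes from the text.
HEADLINE_WEIGHT = 3
EMPHASIZED_WEIGHT = 2   # all capitals, or first letter capitalized
PLAIN_WEIGHT = 1

def word_weight(word, in_headline):
    """Weight of one matching word based on where and how it appears."""
    if in_headline:
        return HEADLINE_WEIGHT
    if word[:1].isupper():
        return EMPHASIZED_WEIGHT
    return PLAIN_WEIGHT

print(word_weight("boston", True), word_weight("NASA", False),
      word_weight("Boston", False), word_weight("weather", False))
# prints 3 2 2 1
```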
-
- Term Weight
-
- Each word used in a data collection is assigned a numerical value,
- called the term weight, based on the frequency of occurrence of that
- word over all documents in the data collection. Words that occur
- frequently are not weighted as highly as those that appear less
- frequently. Very common words are either ignored or diminished in the
- scoring. For example, since the term "animal" may occur frequently
- in many of the documents in a data collection, its term weight is
- small compared to a term such as "hippopotamus", which may occur only
- a few times.
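-
- The exact formula for the term weight is not given here. As an
- assumption, the sketch below uses a common inverse-document-frequency
- style weight, log(N / df) + 1, where N is the number of documents and
- df is the number of documents containing the word.
-
```python
import math

def term_weights(documents):
    """Weight each word by log(N / df) + 1 (an assumed formula)."""
    n = len(documents)
    doc_freq = {}
    for text in documents:
        # Count each word once per document.
        for word in set(text.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n / df) + 1.0 for w, df in doc_freq.items()}

# Made-up collection for illustration.
collection = [
    "the animal ate",
    "an animal slept",
    "the hippopotamus is an animal",
]
weights = term_weights(collection)
print(weights["animal"] < weights["hippopotamus"])
# prints True
```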
-
- Proximity Relationships
-
- Proximity relationships mean that if the words in a natural
- language question are located close together in a document, they are
- given a higher weight than those found farther apart. The idea behind
- a proximity relationship is that if a document contains a phrase
- similar to one in the user's question, that document is more likely to
- be relevant.
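-
- As an illustration only, proximity can be approximated by the
- smallest distance, in words, between two question words in a
- document. The 1 / distance bonus is an assumed formula, and the
- function names are hypothetical.
-
```python
def min_distance(doc_words, a, b):
    """Smallest gap in word positions between a and b, or None if absent."""
    pos_a = [i for i, w in enumerate(doc_words) if w == a]
    pos_b = [i for i, w in enumerate(doc_words) if w == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

def proximity_bonus(text, a, b):
    """Assumed bonus: 1 / distance, so adjacent words score 1.0."""
    d = min_distance(text.lower().split(), a.lower(), b.lower())
    if d is None or d == 0:
        return 0.0
    return 1.0 / d

print(proximity_bonus("portable computers are popular", "portable", "computers"))
# prints 1.0
print(proximity_bonus("portable radios and desktop computers", "portable", "computers"))
# prints 0.25
```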
-
- Word Density
-
- The ratio of the number of times a word appears in a document to the
- size of the document is called the word density. It is a measure of
- how important a word is to the overall content of the document. A
- higher word density results in a higher relevance ranking.
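-
- The ratio can be computed directly. In this sketch, document size is
- taken to be the number of whitespace-separated words, which is an
- assumption; word_density is a hypothetical name.
-
```python
def word_density(text, word):
    """Occurrences of word divided by the total number of words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(word.lower()) / len(words)

print(word_density("hippopotamus facts about the hippopotamus", "hippopotamus"))
# prints 0.4
```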
-
- Courtesy of WAIS Inc.
-