The query syntax in the ICIB
Although the traversal of hypertext links is the main navigation paradigm
of the WWW, the HTTP protocol also allows for supplying keywords and passing
them to the server.
Using this mechanism to allow for free text
searches is an important feature in the ICIB. The ICIB not only implements
simple keyword searches, but has a complex query language which can be
used for retrieving documents and definitions. The use of thesauri also
allows for operating on "concepts". This section describes the
text query features of the ICIB server.
The simplest query specifies a single keyword: a query with the keyword "picture" retrieves a list of documents containing the word "picture". To speed up searching, the ICIB server uses an inverted index of words, which lists every word occurring in the documents together with its frequency. This allows for nearly instantaneous lookup of a given term. Searches are case insensitive, so it does not matter whether, for example, a word starts with a capital letter. A query also matches only whole words, which prevents results such as retrieving "cartoon" when the query is looking for "car".
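The lookup described above can be sketched as follows. This is a minimal illustration, not the ICIB implementation: the tokenizer, the function names build_index and search, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: a "word" is a run of letters, digits, or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")

def build_index(documents):
    """Map each lowercased word to {document name: frequency}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for word in WORD.findall(text.lower()):
            index[word][name] = index[word].get(name, 0) + 1
    return index

def search(index, keyword):
    """Return the sorted names of documents containing keyword as a whole word."""
    return sorted(index.get(keyword.lower(), {}))

# Hypothetical sample documents.
docs = {
    "a.html": "A picture of a car. Another Picture.",
    "b.html": "A cartoon about computers.",
}
index = build_index(docs)
print(search(index, "Picture"))  # ['a.html']
print(search(index, "car"))      # ['a.html'], "cartoon" does not match
```

Because the index is keyed by whole lowercased words, case insensitivity and whole-word matching both follow from the data structure rather than from scanning the document text.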
Keywords may also contain wildcards that allow for selecting all strings matching a specific pattern. The ICIB server uses "ed"-style regular
expressions. The regular expression syntax is the following:
- Any character except a special character matches
itself. Special characters are the regular expression
delimiter plus \[. and sometimes ^*$.
- A . matches any character.
- A \ followed by any character except a digit or ()
matches that character.
- A nonempty string s bracketed [s] (or [^s]) matches any
character in (or not in) s. In s, \ has no special
meaning, and ] may only appear as the first letter. A
substring a-b, with a and b in ascending ASCII order,
stands for the inclusive range of ASCII characters.
- A regular expression of any of the four forms above,
followed by *, matches a sequence of 0 or more matches
of that regular expression.
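As an illustration of the rules above, a wildcard keyword could be matched against the words of the index. The sketch below uses Python's re module as a stand-in: for the constructs listed here (".", "[s]", "[^s]", "*") it behaves the same way, but it is not an "ed"-style engine, since characters such as "+", "?", "|" and "()" are special in Python and not in "ed". The function name match_words, the sample vocabulary, and the choice to require a match on the entire word are assumptions.

```python
import re

def match_words(pattern, vocabulary):
    """Return vocabulary words that match the pattern in their entirety."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(w for w in vocabulary if regex.fullmatch(w))

# Hypothetical index vocabulary.
vocabulary = ["car", "card", "cart", "cartoon", "picture", "pixel"]
print(match_words("car.", vocabulary))       # ['card', 'cart']
print(match_words("car.*", vocabulary))      # ['car', 'card', 'cart', 'cartoon']
print(match_words("pi[a-w].*", vocabulary))  # ['picture']
```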
Combining keywords
The combination of two or more keywords can be used to create more complex queries. Connecting keywords with the "AND" operator limits the search to documents containing both strings: the query "computer AND picture" retrieves only documents containing both words. Such a query will mostly retrieve documents that deal with computer-generated or manipulated images (although it will also find text fragments like "The book gives a detailed picture of what's going on inside a computer"). Using the "OR" operator retrieves documents containing either word. Queries can be built up from multiple keywords connected with AND and OR; the AND operator takes precedence.
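The precedence rule means that "a OR b AND c" is read as "a OR (b AND c)". A minimal sketch of such an evaluation, assuming uppercase operators separated by single spaces and a hypothetical index mapping each word to the set of documents containing it:

```python
def evaluate(query, index):
    """Evaluate keywords joined by AND / OR; AND binds tighter than OR."""
    result = set()
    for group in query.split(" OR "):
        # Each OR-separated group is a conjunction of keywords.
        terms = [t.strip().lower() for t in group.split(" AND ")]
        docs = set(index.get(terms[0], ()))
        for term in terms[1:]:
            docs &= set(index.get(term, ()))
        result |= docs
    return sorted(result)

# Hypothetical index: word -> names of documents containing it.
index = {
    "computer": {"a.html", "b.html"},
    "picture": {"b.html", "c.html"},
    "video": {"d.html"},
}
print(evaluate("computer AND picture", index))           # ['b.html']
print(evaluate("computer OR picture", index))            # ['a.html', 'b.html', 'c.html']
print(evaluate("video OR computer AND picture", index))  # ['b.html', 'd.html']
```

Splitting on OR first and on AND second is what gives AND the higher precedence; no parentheses are supported in this sketch.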
Using the Thesaurus
Another important feature is the use of a built-in thesaurus. Usually, a thesaurus is used to find synonyms for words, but a thesaurus is much more than a synonym list. It is a semantic network containing concepts that are related to one another in various ways. Since thesauri allow for dealing with a concept instead of just a single keyword, they can be very useful in free text queries. A simple example is the fact that the words "image" and "picture" are often used synonymously, so a query for "picture" should also retrieve files containing the string "image". Other relations between concepts are "is an abbreviation for" and "is a broader term for". Being able to identify terms related to a certain concept makes it possible to find information that is difficult or impossible to select using simple keyword search. The ICIB uses a technical thesaurus with terms meaningful in the context of image communication standards, which can be used in queries to the ICIB.
Since there can be situations where the automatic use of a thesaurus is not wanted, it can be switched on and off by using a special notation in the query string. For example, typing "{picture}" will retrieve all documents containing either the word picture or related concepts (it could be read as "retrieve the set of all concepts related to picture"), whereas the query string "picture" only retrieves documents that contain the literal string.
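The brace notation can be pictured as a term-expansion step that runs before the index lookup. The sketch below is illustrative only: the THESAURUS table holds just the "picture"/"image" pair mentioned above, and the function name expand is an assumption.

```python
# Hypothetical thesaurus: term -> set of related terms.
THESAURUS = {
    "picture": {"image"},
    "image": {"picture"},
}

def expand(term, thesaurus=THESAURUS):
    """Return the set of keywords to look up for one query term.

    "{picture}" expands to the term plus its related concepts;
    a bare "picture" is searched literally.
    """
    if term.startswith("{") and term.endswith("}"):
        word = term[1:-1].lower()
        return {word} | thesaurus.get(word, set())
    return {term.lower()}

print(sorted(expand("{picture}")))  # ['image', 'picture']
print(sorted(expand("picture")))    # ['picture']
```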
Queries in logical text sections
A document can be divided into logical sections. For example, there could be parts like "Title", "Author", "Summary", or "Abstract". Addressing a section within a query is useful for tasks such as finding all documents which have a given keyword in the "Abstract" part of the document. By supplying the field name, a query can be restricted to the corresponding field. The query
"{picture} IN Abstract"
will retrieve only documents which contain the string "picture" or a related concept in the document field "Abstract".
In order to mark up a text section for this purpose, one has to put
it inside a "group" field, with the name of this field
as role. Here's an example:
<H1> Abstract: </H1>
<GROUP role="abstract"> This is the abstract text..
foo bar.. </GROUP>
Note: although this is already finished, I'd like to get some feedback
first on whether or not group/role are adequate for this purpose, so this
is not part of this index extension package.
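To illustrate how an indexer might pick up sections marked this way, the following sketch collects the text of each GROUP element by its role and tests a keyword against one section. It uses Python's standard html.parser module; the class and function names are assumptions, role names are compared case-insensitively as an assumption, and nested GROUP elements are not handled.

```python
import re
from html.parser import HTMLParser

class GroupExtractor(HTMLParser):
    """Collect the text inside each <GROUP role="..."> element, keyed by role."""

    def __init__(self):
        super().__init__()
        self.sections = {}
        self._role = None  # role of the GROUP currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "group":  # HTMLParser lowercases tag names
            self._role = (dict(attrs).get("role") or "").lower()
            self.sections.setdefault(self._role, "")

    def handle_endtag(self, tag):
        if tag == "group":
            self._role = None

    def handle_data(self, data):
        if self._role is not None:
            self.sections[self._role] += data

def contains_in_section(html, keyword, section):
    """True if keyword occurs as a whole word in the named section."""
    parser = GroupExtractor()
    parser.feed(html)
    parser.close()
    text = parser.sections.get(section.lower(), "")
    return keyword.lower() in re.findall(r"\w+", text.lower())

page = """<H1> Abstract: </H1>
<GROUP role="abstract"> This is the abstract text..
foo bar.. </GROUP>
<P> A picture appears only outside the abstract. </P>"""
print(contains_in_section(page, "foo", "Abstract"))      # True
print(contains_in_section(page, "picture", "Abstract"))  # False
```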
Defining a context
Another means of refining a search query is a document-relative context, which defines a specific subset of documents. How the context is resolved depends on the file from which the query has been issued. For example, if a query is entered from a document which deals with a standard in the area of television, a context "Area" could relate to all documents dealing with television. Issuing the query
"{picture} AND digital IN Abstract CONTEXT video"
from within a document dealing with a television standard would limit the search to the specified set of documents.
This is achieved by having an optional file .context
in every directory, which consists of a context name and the
corresponding directory. Here's an example:
AREA /icib/tv
ORG /icib/tv/ccir
Note: Currently, the indexing software retrieves files local
to the directory, or its subdirectories, of the file from which the query
has been issued. This behaviour can be modified using the CONTEXT modifier.
Please note that future implementations may change this so that a search
looks in all directories by default, with a special context identifier
"LOCAL" making the search local to the current directory
subhierarchy.
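The following sketch shows one way the .context file and the current default could fit together. It assumes that each line holds a context name followed by a directory, that context names are case insensitive, and that a query without CONTEXT covers the issuing file's directory and its subdirectories. The function names and the document path are hypothetical.

```python
import posixpath

def parse_context(text):
    """Parse .context lines of the form "NAME directory" into a dict."""
    contexts = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            contexts[parts[0].upper()] = parts[1].strip()
    return contexts

def search_root(document_path, contexts, context=None):
    """Directory subtree that a query issued from document_path covers."""
    if context is None:
        # Current default per the note: the issuing file's own directory.
        return posixpath.dirname(document_path)
    return contexts[context.upper()]

def in_scope(candidate, root):
    """True if candidate lies in root or one of its subdirectories."""
    return candidate.startswith(root.rstrip("/") + "/")

contexts = parse_context("AREA /icib/tv\nORG /icib/tv/ccir\n")
doc = "/icib/tv/ccir/rec601.html"  # hypothetical document path
print(search_root(doc, contexts))          # /icib/tv/ccir
print(search_root(doc, contexts, "AREA"))  # /icib/tv
print(in_scope("/icib/tv/other/x.html", search_root(doc, contexts, "AREA")))  # True
print(in_scope("/icib/tv/other/x.html", search_root(doc, contexts)))          # False
```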