Using WAIS Index Searches with GN

version 1.17

Overview

Starting with version 1.1, the GN gopher/http protocol server supports WAIS index searches. This means you can index a collection of files with the index software designed for use with WAIS (Wide Area Information Server), and the gn server will respond to user queries with a menu of those documents from your collection that contain a match for the user-supplied search term. Simple boolean combinations like `horses and cows' or `fox not goose' are supported.

WAIS index support is provided by means of an auxiliary program provided with the gn distribution, called waisgn. In order to use this program the server maintainer must first obtain and compile the WAIS software distribution (directions are given below). This provides the program waisindex which creates the indices and the libraries which must be linked with the waisgn program. When the gn server receives a WAIS index query it execs (in the UNIX sense) the waisgn program passing the search term to it. That is, it turns itself into the waisgn program by replacing the server in memory with the waisgn binary. There is no need to run a WAIS server.
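
The exec step can be illustrated from the shell. The sketch below is not gn code. It only shows the UNIX behaviour described above: after an exec, the same process (same process ID) is running a different program. The function name show_exec is made up for this illustration.

```shell
# Illustration only, not part of gn: exec replaces the running program
# without creating a new process, so the process ID stays the same.
show_exec() {
    # The outer sh prints its PID, then execs a second sh, which prints
    # its own PID. Because exec does not fork, both lines are identical.
    sh -c 'echo "$$"; exec sh -c "echo \$\$"'
}
```

Calling show_exec prints the same number twice. In the same way, the process that began as gn finishes the query as waisgn.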

One reason for this design is size. The gn server is relatively small and hence fairly efficient. The server I run is about 64K in size. The waisgn program source is small but the libraries with which it must be linked are not. The final binary for waisgn is 400K to 500K in size. The design with two separate programs has several advantages. First the efficiency of servers which are not using WAIS is not degraded by its presence. Also WAIS is a complicated system to set up and run. Having it done with a separate program makes it much easier to check that things are functioning correctly and fix them if they are not.

Do You Really Need WAIS?

The process of obtaining, compiling and setting up the WAIS software is slightly complicated. It is much easier to use the built-in search capabilities of the gn server if they meet your needs. If you want to search the contents of fewer than 100 files of moderate size, for example, try using the gn grep search (see the Installation and Maintenance Guide). Similarly, if you have a single large structured file (e.g. a mail file) of size less than one megabyte, then the gn structured file search (type 7m) will probably meet your needs. On the other hand, if you have more than 100 files for a single search you may well need WAIS, so read ahead.
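
As a rough aid, the file-count rule of thumb above can be written as a small shell function. This is only a sketch: the name suggest_search is invented here, the treatment of exactly 100 files is a choice made for the example, and the single structured file case (type 7m) is not covered.

```shell
# Rule of thumb from the paragraph above, as a sketch. The function name and
# the handling of exactly 100 files (treated as "wais") are choices made
# here, not something the gn documentation specifies.
suggest_search() {
    dir=$1
    count=0
    for f in "$dir"/*; do
        # Count regular files only; subdirectories and dot files are skipped.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -lt 100 ]; then
        echo "grep"    # few files: try the built-in grep search first
    else
        echo "wais"    # many files: a WAIS index is probably worth it
    fi
}
```

For example, suggest_search /path/to/data prints either grep or wais. File size is not examined, so a directory holding a few very large files still deserves a second look.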

Obtaining and Installing the Software

Here are the steps to build and install waisgn with your gn server.

1. Edit the file "config.h" in the main gn source directory and change the entry

     #define WAISGN  "/usr/local/etc/waisgn"
to reflect the location where you want to keep the waisgn binary. Now run make in the gn source directory to make a new version of gn which is aware of this value.

2. Get the WAIS software. You can use either freeWAIS from

   ftp://ftp.cnidr.org/pub/NIDR.tools/freeWAIS-0.202.tar.Z
or
   ftp://ftp.bio.indiana.edu/util/wais/iubio-wais-8b5-d.tar.Z
Then build WAIS per the instructions with that distribution. I use the IUbio version because it results in a smaller binary for waisgn. I have tested waisgn with both. There is a noticeable difference in the "best match" ranks they produce, but I have no information on what the differences in scoring algorithms are.

3. In the waisgn src directory make symbolic links to the directories "bin" and "ir" in the main WAIS source directory. The commands to do this are, for example

     ln -s /path/to/freeWAIS-0.202/bin
     ln -s /path/to/freeWAIS-0.202/ir
Then examine the contents of these directories to make sure the links are working.

4. In the waisgn source directory run make, producing the waisgn binary. Copy the waisgn binary to the location you designated as WAISGN in step 1.
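
Step 3 asks you to make the two links and then confirm that they work. A minimal sketch of doing both at once follows. The function name link_wais_dirs is made up for this example, and both directory arguments are placeholders to be replaced with your own absolute paths.

```shell
# Sketch of step 3: link "bin" and "ir" from the WAIS source tree into the
# waisgn source directory and verify that each link resolves.
# Pass absolute paths; a relative target would be resolved relative to the
# link's own directory and could dangle.
link_wais_dirs() {
    wais_src=$1      # main WAIS source directory, e.g. /path/to/freeWAIS-0.202
    waisgn_src=$2    # the waisgn source directory
    for d in bin ir; do
        ln -s "$wais_src/$d" "$waisgn_src/$d" || return 1
        # ln -s succeeds even if the target is missing, so test the target:
        # -d follows the link and fails if it points nowhere.
        if [ ! -d "$waisgn_src/$d" ]; then
            echo "link $d does not resolve to a directory" >&2
            return 1
        fi
    done
}
```

After a successful call, running make in the waisgn source directory proceeds as in step 4.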

Index Your Files

This is done with the program "waisindex", which is in the bin subdirectory of the main WAIS source directory. I suggest doing this by making a directory, say "myindex", in which the index files will reside. Then cd to that directory and use the command
     waisindex -t filename /complete/path/to/files...
or
     waisindex -t first_line /complete/path/to/files...
where "files" is typically replaced by a wildcard expression matching all the files you want to index. You could also have multiple wildcard expressions for the files. The difference between these two commands is that the menu which waisgn produces will title the matching documents either by the name of the file containing a match or by the first line of the contents of the file containing a match. Note that in the first form the argument is literally the string "filename"; that string is not replaced with the name of a file. It is also possible to obtain the titles of the matching objects from the Name= field of a menu file (see below).
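
The procedure above (make the index directory, cd to it, run waisindex) can be wrapped in a small helper. This is a sketch only: the name build_index is invented for the example, and it assumes waisindex can be found on your PATH.

```shell
# Sketch of the indexing procedure described above.
# Assumes waisindex is on PATH; the function name is made up for this example.
build_index() {
    index_dir=$1     # directory to hold the index files, e.g. /path/to/myindex
    title_type=$2    # literally "filename" or "first_line"
    shift 2          # remaining arguments: complete paths of the files to index
    mkdir -p "$index_dir" || return 1
    # Run in a subshell so the caller's working directory is left unchanged.
    ( cd "$index_dir" && waisindex -t "$title_type" "$@" )
}
```

A call might look like build_index /path/to/myindex first_line /complete/path/to/files..., with the last argument replaced by your own wildcard expressions.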

Testing waisgn

After your files are indexed it is appropriate to test waisgn as a standalone program before using it with the gn server. Since waisgn takes a large number of arguments a simple shell script to test it is included in the waisgn source directory. This script, called wgtest, must be edited to define the values of GN_ROOT and INDEX. The value of GN_ROOT should be the complete path of your gn data directory (with no final '/') and INDEX should be the path of the file index.inv you created by running waisindex relative to the GN_ROOT (it should start with '/'). Now run this script with the command
	wgtest words...
where words... is a list of search terms, perhaps just one. The output of this script includes a fair amount of diagnostic verbiage, but should include gopher protocol lines for the files which contain matches for your search term.

If this is successful you are ready to create the menu entries for your files and search item.
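
If wgtest produces no gopher lines, one thing worth checking first is the pair of values edited into the script. The sketch below applies the rules stated above to GN_ROOT and INDEX. The function name check_wgtest_settings is hypothetical and is not part of the gn distribution.

```shell
# Sanity check for the two values edited into wgtest (sketch only).
check_wgtest_settings() {
    gn_root=$1       # complete path of the gn data directory
    index=$2         # path of index.inv relative to GN_ROOT
    case $gn_root in
        */) echo "GN_ROOT must not end with '/'" >&2; return 1 ;;
    esac
    case $index in
        /*) ;;       # starts with '/', as required
        *) echo "INDEX must start with '/'" >&2; return 1 ;;
    esac
    # The two values joined together should name the index file itself.
    if [ ! -f "$gn_root$index" ]; then
        echo "no index file at $gn_root$index" >&2
        return 1
    fi
}
```

The function prints nothing and returns success when both values look right.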

Creating the Menu Entries

For WAIS indexing to work there are several menu files which must have the correct entries.
The Data Files
After a search has been completed the user will want to access the matching files. This will only be possible if these files are listed in a .cache file created from a menu file in the directory containing the data files. Of course, this menu file can be created by hand, but this may be laborious if there are several hundred data files. A simple script called mkmenu is provided in the waisgn source directory to aid in this task. Its first argument is the path of the data directory relative to the GN_ROOT directory. This path should begin and end with '/'. Subsequent arguments are the names of the data files -- typically a single wildcard expression. Thus the command for this script might be
     mkmenu /path/to/data/ *.txt

The script will create a menu file in the data directory which must be processed by mkcache to produce a .cache file. By default mkmenu assumes the files are text files (type 0). This can be changed by editing the TYPE variable in the script.

The Index Files
The files you created in the directory "myindex" by running waisindex need to be accessed by gn. These files are, of course, not sent to a client by gn but putting one of them in a menu file is the way gn knows that you are granting permission to run waisgn on this index. Thus in the directory "myindex" you should create a menu file with the entry
     Name=Index
     Path=7w/relative/path/to/myindex/index.inv
and run mkcache on it. Note: You can if you wish put more than one set of index files in the same directory by using the "-d" option with waisindex. E.g. waisindex -d index1 ... will produce index1.inv, index1.dct etc., and waisindex -d index2 ... will produce index2.inv, etc. If you have done this then you can put multiple entries in the file .../myindex/menu.
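
When several indices share one directory, the menu entries differ only in the index name, so they can be generated. The sketch below prints one entry per index in the Name=/Path= form shown above. The function name index_menu_entries is made up, using the index name as the Name= text is just a placeholder choice, and the blank line between entries is an assumption about the menu file layout.

```shell
# Print one menu entry per index (sketch only).
# Assumption: entries in a menu file are separated by a blank line.
index_menu_entries() {
    rel_dir=$1       # index directory relative to GN_ROOT,
                     # e.g. /relative/path/to/myindex
    shift            # remaining arguments: names given to "waisindex -d"
    for idx in "$@"; do
        printf 'Name=%s\nPath=7w%s/%s.inv\n\n' "$idx" "$rel_dir" "$idx"
    done
}
```

For example, index_menu_entries /relative/path/to/myindex index1 index2 prints two entries whose Path= lines end in index1.inv and index2.inv. The output still has to be saved as the menu file and processed by mkcache.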

Index Search Menu Items
Finally it is necessary to put an entry in a menu that the users will actually see. This can be anywhere in your gn hierarchy or more than one place if you choose. It should be an entry like
     Name=WAIS search of all my documents
     Path=7w/relative/path/to/myindex/index.inv

If all the files you are indexing have an html version as well as a plaintext version and the html version has the same name as the plaintext except with a .html extension, then you should index only the plaintext files and should use

     Path=7wh/relative/path/to/myindex/index.inv
(i.e. 7wh instead of 7w) in the menu files described above. Also you should set the GNTYPE variable in the mkmenu program to have the value "0h" so that your menu of indexed files will consist of items of type 0h (see the Installation and User's Guide).

Cache files

If you wish you may edit the menu file in your data directory (probably produced by mkmenu) to put something in the Name= field more informative than the file name, so the user sees a better menu. By default waisgn does not take the menu title from the .cache file for WAIS index items, but gets it directly from the WAIS indexes instead. This is more efficient and in the typical case the menu item would be the file name in either case.

This default behavior can be overridden by using the type "7wc" instead of "7w" in the menu files mentioned above. When this is done the menu items will be the contents of the Name= field of the menu file in the data directory just as it is for non-search gn items.

Ranges in files

It is possible to use waisindex to index files and have each file considered as a composite of a number of separate documents. For example, if the command
     waisindex -t para files...
is used each separate paragraph (separated by blank lines) will be considered a document and its first line will be used as the title. Likewise if the "-t mail_or_rmail" option is used the files will be assumed to be standard UNIX mail files and each message will be considered as a separate document. To see a complete list of "-t" options run waisindex with no arguments.

In order for gn and waisgn to know to use a byte range for the matching documents, it is necessary to use "7wr" in place of "7w" in the menu files mentioned above (i.e. the menu file in the "myindex" directory and the one listing the search item). It is also necessary to use a different type for the data files. Instead of their gn type being '0' it should be "range". This means that the menu item for the data file "file1" should be

     Name=Whatever
     Path=range/relative/path/to/file1
     Type=0

Troubleshooting

If things aren't working right for you here are some steps to try.

1. Run gn from the command line either with logging enabled or with the command "gn -L /dev/tty" so logging output will go to your terminal. When gn pauses for input enter the Path field of your index followed by a tab and a search term. That is, you should enter something like the following, where <TAB> stands for a single tab character:

     7w/relative/path/to/myindex/index.inv<TAB>search term
The error messages may give you some idea of what is going wrong.

2. Run the wgtest program as described above in the section on testing. Make sure that waisgn is functioning properly as a standalone program.

3. If you still can't isolate your problem, edit the file waisgn.h and change the #define LOG_DEBUG at the end from FALSE to TRUE. Then recompile waisgn. It will now write the diagnostic error messages that you get with wgtest into the file /tmp/waisgn.debug where you can examine them. The advantage over wgtest is that you can check the communication between gn and waisgn.
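
For step 1, a literal tab can be awkward to type at a terminal. A minimal sketch that builds the request line with printf follows. The function name gn_query is made up, and piping the line into gn instead of typing it is an assumption about how gn reads its input.

```shell
# Build a search request line: selector, a tab, the search term, a newline.
gn_query() {
    selector=$1      # the Path field of the index entry
    term=$2          # the search term, quoted if it contains spaces
    printf '%s\t%s\n' "$selector" "$term"
}

# Possible use (assumes gn reads the request from standard input):
#   gn_query 7w/relative/path/to/myindex/index.inv 'search term' | gn -L /dev/tty
```

The same line can be redirected to a file and inspected if you want to confirm that the tab is present.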

Acknowledgements

The waisgn program is based very loosely on a function in Don Gilbert's Go_Ask_WAIS utility. I am very grateful for the help in dealing with WAIS that his routine has provided and for his kind permission to use it here. Any errors are mine and not Don's.

John Franks -- Dept of Math. Northwestern University <john@math.nwu.edu>