Index Processor

The other major portion of our work is a complete index processor implementation. The resulting processor is a single C program called MakeIndex. Actually, MakeIndex had several predecessors all written as UNIX shell scripts with embedded sed [24] and awk [25] code. We quickly became unsatisfied with them for various reasons. One of the concerns was efficiency. Interpreted languages such as sed and awk are satisfactory for rapid prototyping, but they are slow. In our experience, MakeIndex was able to process a book index of 3,300 entries in less than 50 seconds of user time plus an extra 5% of system time on a client node of SUN 3/50 (MC68020 at 12.5 MHz). This is at least an order of magnitude faster than using its most recent sed/awk predecessor, which has only half the features of MakeIndex.

The decision to do a C implementation was made because of C's dynamic storage allocation and data structures. Since we wanted as general a solution as possible, it would be difficult, if not impossible, to implement all the desired features using sed/awk. For instance, in awk a dynamically linked list of index entries is impossible, and the style handling mechanism with a comprehensive error processing facility is difficult to realize. Originally developed in UNIX C with portability in mind, MakeIndex also runs on VAX/VMS, TOPS-20, MS/DOS systems with a few trivial changes. This portability would have been very difficult for an implementation in UNIX shell scripts. Furthermore, MakeIndex has more features and is more extensible than tools like IdxTEX [26], a LATEX-specific index processor that runs only on VAX/VMS systems.

Our approach is the direct opposite of that taken by make.index, a host of indexing tools [23] built for troff. As part of the UNIX troff document preparation environment, these tools are canonical examples of the pipelining model. They are a collection of small awk programs working together in a very long pipeline. The claim is that by breaking the system down into small pieces, it is easier for the user to adapt the package to various situations not possible to envision in advance. Their approach, therefore, seems to follow the ``do it yourself'' metaphor whereby you get a toolkit and assemble the final product yourself. The basic package covers only those features used by its authors. A third party user has to modify their awk code if something different is desired.

The design of MakeIndex is quite different. Our intention is to build a complete system by analyzing the tasks involved in index processing. Parameters are derived to form a table-driven style handling facility. Formatter and format independence is achieved by simple style specification. All the underlying data structures and processing mechanisms are transparent. Yet it is robust enough to handle different precedence schemes, ordering rules, etc. To use the system, the user needs not deal with the processing details. If the default is inappropriate, the only adaptation required is a table specification for the style facility.

For instance, by assigning the output index header index.head defined in make.index to our preamble and other related commands to item_i's, it is easy to produce an alphabetized index in troff format from a raw index generated by troff. The same technique applies to other formatting systems. Scribe, for example, has an indexing subsystem that supports a subset of the functionality described above. By adapting its raw index file format as input style and its output appearance commands as output style, its indexing capability can be readily expanded using MakeIndex. In both cases, there is no need to modify the original code used to generate the raw index (Step II), nor is it necessary to modify MakeIndex itself. Likewise, MakeIndex can be integrated with direct-manipulation document development or publishing systems by binding their descriptive markup tags to the attributes of MakeIndex's input format and output style.

To summarize the differences in approach between MakeIndex and make.index, let us examine the points of comparison: speed, size of the programs, ease of use, portability, and extensibility. MakeIndex is an order of magnitude faster. The awk programs of make.index are less than 100 lines of code, while MakeIndex is large: 6,000 lines of C code and a binary of 70K bytes. MakeIndex is a self-contained C program, adapted from our original UNIX implementation, to run on a variety of machines from micros to mainframes. The make.index system will run on any UNIX system which has awk so it too spans many systems, although it is UNIX-specific while MakeIndex is not.

MakeIndex is easy to use. It is monolithic and covers almost every conceivable case likely to occur. We actually examined many of the books in the Addison-Wesley Computer Science Series and found that more than 95% of the cases could be handled by the program. The make.index suite, on the other hand, handles the standard cases and all customizations must be done by the user by modifying awk code. With MakeIndex, user changes will very rarely occur. If they should occur, and we have never had them happen, then the user must change the C source code. The question as to whether it is easier to modify awk code or C code is left to the reader. Our answer is given in MakeIndex.