This is a program identifier database package. These tools provide a
logical extension to ctags, which is limited in that it only stores the
location of function and type *definitions*. The ID facility
stores the locations for all uses of identifiers, pre-processor
names, and numbers (in decimal, octal or hex).
When fixing or enhancing a large program (particularly one that is
unfamiliar) it is often necessary to audit the use of global
data-structures in order to verify that the proposed modification will
not trigger any hidden `gotchas'. Often this entails grepping through
many thousands of lines of source code spread over dozens and sometimes
hundreds of source files in multiple sub-directories. This process
places a significant load on computing resources, and takes a long
time. There is even the danger that a programmer will avoid doing a
complete audit due to the perceived cost--he or she will rely on memory
and hope that there are no booby traps.
The id-database is most useful for maintaining large programs that
consist of many source files. The database is simply a two dimensional
boolean array indexed by identifier-name and source-file-name. For a
given identifier and source-file, if the identifier occurs in the file,
the boolean value is TRUE. The database may be queried either by
identifier-name or file-name.
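The two-dimensional boolean array can be pictured with a small sketch.
This is only an illustration in Python, not the format that mkid(1)
actually writes; the function names and the sample data are made up
for the example.

```python
# Illustrative only: a tiny in-memory model of the identifier/file matrix.
# The real ID file layout is not described in this tutorial.

def build_matrix(occurrences):
    """occurrences maps file name -> set of identifiers found in that file."""
    files = sorted(occurrences)
    idents = sorted({i for ids in occurrences.values() for i in ids})
    # matrix[row][col] is True when idents[row] occurs in files[col]
    matrix = [[ident in occurrences[f] for f in files] for ident in idents]
    return idents, files, matrix

def files_for(ident, idents, files, matrix):
    """Query by identifier-name: which files mention it?"""
    row = matrix[idents.index(ident)]
    return [f for f, hit in zip(files, row) if hit]

def ids_for(fname, idents, files, matrix):
    """Query by file-name: which identifiers occur in it?"""
    col = files.index(fname)
    return [i for i, row in zip(idents, matrix) if row[col]]

sample = {
    "bool.h": {"TRUE", "FALSE"},
    "lid.c": {"TRUE", "getenv"},
    "mkid.c": {"TRUE", "Verbose"},
}
idents, files, matrix = build_matrix(sample)
print(files_for("TRUE", idents, files, matrix))   # ['bool.h', 'lid.c', 'mkid.c']
print(ids_for("lid.c", idents, files, matrix))    # ['TRUE', 'getenv']
```

Reading along a row answers a query by identifier; reading down a
column answers a query by file.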
The following types of queries are supported:
* name lookup
list all the files where an identifier occurs. The name
may be a regular expression.
* name apropos
list all the files for all identifiers that have the sub-string
name in them. Matches are done in a case-insensitive manner.
* name `grep'
search for an identifier in all the files where it occurs.
This is an optimized `grep' over all the sources--we only
search on files that contain the identifier.
* name edit
invoke an editor on the files where an identifier occurs,
and use the identifier as an initial search string.
* file lookup
list all identifiers that occur in a file, or list
the identifiers that are common between two files.
* non-unique names
list the names of all identifiers whose names are non-unique
within some number of characters. This is useful when porting
a program from a `flexnames' system to one with more limited names.
* solo
list all identifiers that occur exactly once in a software
system. This may be useful for locating identifiers that are
declared but never used, or library functions that are used
but never declared.
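Three of the queries above (apropos, common identifiers between two
files, and solo) are easy to model in a few lines. The sketch below is
a hypothetical Python illustration working on plain token lists; it
says nothing about how lid(1) computes these answers from the
database, and all names in it are invented for the example.

```python
from collections import Counter

def apropos(names, substring):
    """Case-insensitive sub-string match over identifier names."""
    needle = substring.lower()
    return sorted(n for n in names if needle in n.lower())

def common_ids(tokens_by_file, first, second):
    """Identifiers that occur in both files."""
    return sorted(set(tokens_by_file[first]) & set(tokens_by_file[second]))

def solo_ids(tokens_by_file):
    """Identifiers that occur exactly once across all files."""
    counts = Counter()
    for tokens in tokens_by_file.values():
        counts.update(tokens)   # token lists keep repeats, so uses are counted
    return sorted(name for name, n in counts.items() if n == 1)

tokens_by_file = {
    "a.c": ["main", "helper", "helper", "unused_var"],
    "b.c": ["helper", "printf", "idFILE"],
}
print(apropos(["FILE", "idFILE", "getenv", "srcfile"], "file"))
print(common_ids(tokens_by_file, "a.c", "b.c"))
print(solo_ids(tokens_by_file))
```

Note that solo needs occurrence counts, so this model keeps every
token rather than a single boolean per file.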
The first four queries are handled by one program. The type of query
is determined by the name the program was invoked with. The four links
are lid(1) for `lookup id', aid(1) for `apropos id', gid(1) for `grep
id' and eid(1) for `edit id'. One or more identifiers may be passed on
the command line. The identifiers may be literal strings or regular
expressions. Here are some examples:
$ lid FILE
FILE extern.h {fid,gets0,getsFF,idx,init,lid,mkid,opensrc,scan-asm,scan-c}.c
$ lid FILE$
AF_FILE mkid.c
AF_IDFILE mkid.c
FILE extern.h {fid,gets0,getsFF,idx,init,lid,mkid,opensrc,scan-asm,scan-c}.c
IDFILE id.h {fid,lid,mkid}.c
IdFILE {fid,lid}.c
argFILE mkid.c
gidFILE lid.c
idFILE {init,mkid}.c
inFILE {gets0,getsFF,scan-asm,scan-c}.c
openSrcFILE extern.h {idx,mkid,opensrc}.c
srcFILE {idx,mkid,opensrc}.c
$ lid ^get
get opensrc.c
getAdaId getscan.c
getAsmId extern.h {getscan,scan-asm}.c
getCId extern.h {getscan,scan-c}.c
getDirToName extern.h {fid,lid,paths}.c
getId {idx,mkid}.c
getLanguage extern.h {getscan,idx,mkid}.c
getLispId getscan.c
getPascalId getscan.c
getRoffId getscan.c
getSCCS extern.h opensrc.c
getScanner extern.h {getscan,idx,mkid}.c
getTeXId getscan.c
getTextId getscan.c
getc {gets0,getsFF,lid,scan-asm,scan-c}.c
getchar lid.c
getenv extern.h lid.c
gets lid.c
getsFF extern.h {bitsvec,fid,getsFF,lid,mkid}.c
As you can see, when a regular expression is used, it is possible to
get more than one line of output. If you wish multiple lines to be
merged into one, supply the `-m' option:
$ lid -m ^get
^get extern.h {bitsvec,fid,gets0,getsFF,getscan,idx,lid,mkid,opensrc,paths,scan-asm,scan-c}.c
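The difference between the plain and the merged output can be
sketched as follows. This is a hypothetical Python model: the index is
an ordinary dictionary, the function name `lookup' is invented, and
Python's `re' syntax is assumed to be close enough to the regular
expressions that lid(1) accepts.

```python
import re

def lookup(index, pattern, merge=False):
    """index maps identifier -> set of file names containing it."""
    rx = re.compile(pattern)
    hits = {name: files for name, files in index.items() if rx.search(name)}
    if not merge:
        # One output line per matching identifier.
        return [(name, sorted(files)) for name, files in sorted(hits.items())]
    # Merged: one line, labelled with the pattern, holding the union of files.
    merged = set().union(*hits.values())
    return [(pattern, sorted(merged))]

index = {
    "getc": {"lid.c", "gets0.c"},
    "getenv": {"lid.c", "extern.h"},
    "FILE": {"extern.h", "mkid.c"},
}
for line in lookup(index, "^get"):
    print(line)
for line in lookup(index, "^get", merge=True):
    print(line)
```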
The query program searches for numbers numerically rather than
textually. Therefore you may search for multiple representations of a
number. It is best to illustrate this with examples:
$ lid -a 0x10
020 numtst.c
0x00010 numtst.c
0x0010 scan-c.c
0x10 {id,radix}.h {scan-asm,stoi}.c
16 numtst.c
The `-a' argument tells lid(1) to look for 0x10 in all radixes. (For
numbers 0 through 7, lid(1) looks for all radixes by default. For numbers
greater than 7, lid(1) only looks for the radix that the argument is
supplied in.) It is also possible to restrict the search to selected
radixes by supplying an argument consisting of one or more of the
key-letters `o', `d', and `x' for octal, decimal, and hexadecimal
respectively:
$ lid -o 0x10
020 numtst.c
$ lid -x 16
0x00010 numtst.c
0x0010 scan-c.c
0x10 {id,radix}.h {scan-asm,stoi}.c
$ lid -d 020
16 numtst.c
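Numeric matching can be modelled by converting each literal to its
value and remembering the radix it was written in. The Python sketch
below is an assumption-laden illustration, not lid(1)'s code: the
function names are invented, and only the C-style `0x' and leading-`0'
prefixes are recognized.

```python
def parse_literal(text):
    """Return (value, radix_letter) for a C-style integer literal."""
    lower = text.lower()
    if lower.startswith("0x"):
        return int(lower, 16), "x"
    if len(lower) > 1 and lower.startswith("0"):
        return int(lower, 8), "o"
    return int(lower, 10), "d"

def match_numbers(db_numbers, query, radixes=None):
    """Find database literals equal in value to query, limited to radixes."""
    value, radix = parse_literal(query)
    if radixes is None:
        # Default described above: 0 through 7 match in every radix,
        # larger numbers only in the radix the argument was written in.
        radixes = "odx" if value <= 7 else radix
    result = []
    for literal in db_numbers:
        lit_value, lit_radix = parse_literal(literal)
        if lit_value == value and lit_radix in radixes:
            result.append(literal)
    return result

db = ["020", "0x00010", "0x0010", "0x10", "16", "0x20"]
print(match_numbers(db, "0x10", "odx"))  # like -a: every spelling of sixteen
print(match_numbers(db, "0x10", "o"))    # ['020']
print(match_numbers(db, "020", "d"))     # ['16']
```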
The grep interface behaves somewhat like the following command:
$ grep -w -n `lid TRUE`
Here's some sample output for the equivalent gid command:
$ gid TRUE
bool.h:5: #define TRUE (0==0)
lid.c:102: case 'm': forceMerge = TRUE; break;
lid.c:170: Merging = TRUE;
lid.c:204: crunching = TRUE;
lid.c:553: hitDigits = TRUE;
lid.c:787: return TRUE;
mkid.c:117: Verbose = TRUE;
mkid.c:191: keepLang = TRUE;
scan-asm.c:79: static bool eatUnder = TRUE;
scan-asm.c:80: static bool preProcess = TRUE;
scan-asm.c:96: static bool newLine = TRUE;
scan-asm.c:130: newLine = TRUE;
scan-asm.c:141: newLine = TRUE;
scan-asm.c:145: newLine = TRUE;
scan-asm.c:150: newLine = TRUE;
scan-asm.c:165: newLine = TRUE;
scan-c.c:88: static bool eatUnder = TRUE;
scan-c.c:101: static bool newLine = TRUE;
scan-c.c:138: newLine = TRUE;
scan-c.c:199: newLine = TRUE;
scan-c.c:205: newLine = TRUE;
scan-c.c:210: newLine = TRUE;
wmatch.c:37: return TRUE;
Notice that each line is reported in the same format as a
C-preprocessor error message. This feature allows gid(1) lines to be
digested by any program that parses error messages, such as error(1)
and gnu-emacs.
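The `file:line: text' format is simple to produce. Here is a rough
Python sketch of the idea behind gid(1): search only the files already
known to contain the identifier, match whole words, and report each
hit in error-message style. The function name and the in-memory
`sources' dictionary are assumptions made for the example; gid(1)
itself reads the real files named by the database.

```python
import re

def gid(identifier, sources):
    """sources maps file name -> text, for files known to hold identifier."""
    # \b gives whole-word matching, roughly what `grep -w' does.
    word = re.compile(r"\b" + re.escape(identifier) + r"\b")
    lines = []
    for fname in sorted(sources):
        for lineno, text in enumerate(sources[fname].splitlines(), start=1):
            if word.search(text):
                lines.append(f"{fname}:{lineno}: {text}")
    return lines

sources = {
    "bool.h": "#define FALSE (0!=0)\n#define TRUE (0==0)\n",
    "lid.c": "int x;\nMerging = TRUE;\nNOT_TRUE_AT_ALL = 1;\n",
}
for line in gid("TRUE", sources):
    print(line)
```

The third line of the sample lid.c is not reported, because TRUE is
only part of a longer name there.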
If you want to edit all files that have an identifier, you may
conveniently do so with eid(1):
$ eid TRUE
TRUE bool.h {lid,mkid,scan-asm,scan-c,wmatch}.c
Edit? [y1-9^S/nq]
Before the editor is invoked, you are given the lid(1) output to review
and confirm. If you want to edit all files listed, respond with a
newline or with `y'. If you want to skip some number of files into the
argument list, respond with a single digit `1' through `9' to skip that
many files, or do a string-search to the first file you want with
`^S<string>' or `/<string>'. If you don't want to edit anything, type
`n' to go on to the next argument you gave to eid(1) or type `q' to
quit altogether.
The behavior of the editing interface is controlled by three
environment variables called EIDARG, EIDLDEL, and EIDRDEL. The best
way to illustrate their use is by an example. Here is how to define them
for vi(1) (using /bin/sh syntax):
EIDARG='+/%s/' # printf(3) string for initial search-string argument
EIDLDEL='\<' # left word-delimiter
EIDRDEL='\>' # right word-delimiter
`EID[LR]DEL' are positioned around the identifier as left and right
word-delimiters if your editor supports that notion. Then the whole
name-string is sprintf(3)'ed into `EIDARG' to construct the initial
search-string argument to the editor. If your editor can't digest such
an argument, simply leave these variables undefined in the
environment.
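The way the three variables combine can be shown in a few lines of
Python. This is a sketch of the string construction only; the function
name `editor_search_arg' is made up, and returning None for an unset
EIDARG is an assumption about what `no initial search' would mean.

```python
def editor_search_arg(name, env):
    """Build the initial search argument, or None if EIDARG is unset."""
    template = env.get("EIDARG")
    if template is None:
        return None
    # Wrap the identifier in the left and right word-delimiters first...
    delimited = env.get("EIDLDEL", "") + name + env.get("EIDRDEL", "")
    # ...then substitute the result for %s, as sprintf(3) would.
    return template % delimited

vi_env = {"EIDARG": "+/%s/", "EIDLDEL": r"\<", "EIDRDEL": r"\>"}
print(editor_search_arg("TRUE", vi_env))   # +/\<TRUE\>/
print(editor_search_arg("TRUE", {}))       # None
```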
Some emacs users are appalled at the notion of starting up a fresh editor
simply to follow an identifier. For those who are fortunate enough to have
a programmable emacs such as gnu-emacs, it is fairly simple to devise
a command that invokes gid(1) and digests its output as though it were
/lib/cpp error strings to be examined. (Sorry, no such code is provided
at this posting...)
Another type of query is to find all identifiers that are non-unique
within some number of characters. This is useful for finding potential
portability problems when moving to a system whose compiler or linker
limits the number of significant characters in a name. The `-u<n>'
argument does the trick. Here's a list of identifiers that may yield
multiply-defined errors in a symbol table that only knows about the
first 7 characters:
$ lid -u7
SCAN_TEX getscan.c
SCAN_TEXT getscan.c
idh_argc id.h {init,mkid}.c
idh_argo id.h {init,mkid}.c
idh_namc id.h {fid,mkid}.c
idh_namo id.h {fid,init,lid,mkid}.c
oldHashSize mkid.c
oldHashTable mkid.c
Better yet, if you want to edit these, try
$ eid -u7
^SCAN_TE getscan.c
Edit? [y1-9^S/nq] n
^idh_arg getscan.c id.h {init,mkid}.c
Edit? [y1-9^S/nq] n
^idh_nam {fid,getscan}.c id.h {init,lid,mkid}.c
Edit? [y1-9^S/nq] n
^oldHash {fid,getscan}.c id.h {init,lid,mkid}.c
Edit? [y1-9^S/nq] n
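The idea behind the `-u<n>' query is to group names by their first n
characters and report every group with more than one member. A minimal
Python sketch follows; the function name is invented and the name list
is a small sample, not real database contents.

```python
from collections import defaultdict

def non_unique(names, n):
    """Map each n-character prefix to the distinct names that share it."""
    groups = defaultdict(list)
    for name in sorted(set(names)):
        groups[name[:n]].append(name)
    # Only prefixes claimed by two or more names are a portability risk.
    return {prefix: group for prefix, group in groups.items() if len(group) > 1}

names = ["SCAN_TEX", "SCAN_TEXT", "idh_argc", "idh_argo",
         "oldHashSize", "oldHashTable", "getenv"]
for prefix, group in non_unique(names, 7).items():
    print(prefix, group)
```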
An additional feature of lid(1) is that pathnames are automatically
adjusted for the current working directory. Large programs such as the
UNIX kernel are often partitioned into subsystems whose sources live in
different directories. What follows are several examples of the same
search conducted from different points in the UNIX kernel source
hierarchy:
$ cd /src/uts/m68k
$ lid bdevsw
bdevsw sys/conf.h cf/conf.c io/bio.c os/{fio,main,prf,sys3}.c
$ cd io
$ lid bdevsw
bdevsw ../sys/conf.h ../cf/conf.c bio.c ../os/{fio,main,prf,sys3}.c
$ cd ../os
$ lid bdevsw
bdevsw ../sys/conf.h ../cf/conf.c ../io/bio.c {fio,main,prf,sys3}.c
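Both the path adjustment and the `{a,b}.c' abbreviation can be
imitated in a short Python sketch. It assumes that paths are stored
relative to the directory where the database was built, which is a
guess made for illustration; the function names are hypothetical.

```python
import posixpath

def relative_to(cwd, root, paths):
    """Rewrite paths stored relative to root so they are relative to cwd."""
    return [posixpath.relpath(posixpath.join(root, p), cwd) for p in paths]

def brace_group(paths):
    """Collapse files sharing a directory and suffix into dir/{a,b}.ext form."""
    groups = {}
    for p in paths:
        head, tail = posixpath.split(p)
        stem, ext = posixpath.splitext(tail)
        groups.setdefault((head, ext), []).append(stem)
    out = []
    for (head, ext), stems in groups.items():
        body = stems[0] if len(stems) == 1 else "{" + ",".join(stems) + "}"
        out.append(posixpath.join(head, body + ext))
    return " ".join(out)

stored = ["sys/conf.h", "cf/conf.c", "io/bio.c",
          "os/fio.c", "os/main.c", "os/prf.c", "os/sys3.c"]
root = "/src/uts/m68k"
print(brace_group(relative_to(root, root, stored)))
print(brace_group(relative_to(root + "/io", root, stored)))
```

With this sample data the two lines printed have the same shape as the
file lists in the first two lid(1) examples above.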
The database is built with mkid(1). The user supplies pathnames
either on the command line or on stdin. Here's the output of the
`verbose' option to mkid(1):
$ mkid -v *.h *.c
c: bitops.h
c: bool.h
c: extern.h
c: id.h
c: patchlevel.h
c: radix.h
c: string.h
c: basename.c
c: bitcount.c
c: bitops.c
c: bitsvec.c
c: bsearch.c
c: bzero.c
c: document.c
c: fid.c
c: gets0.c
c: getsFF.c
c: getscan.c
c: hash.c
c: idx.c
c: init.c
c: lid.c
c: mkid.c
c: numtst.c
c: opensrc.c
c: paths.c
c: scan-asm.c
c: scan-c.c
c: stoi.c
c: tty.c
c: uerror.c
c: wmatch.c
Compressing Hash Table...
Sorting Hash Table...
Writing `ID'...
Names: 593, Numbers: 64, Strings: 43, Solo: 119, Total: 697
Occurrances: 11.67, Load: 0.17, Probes: 1.07
Mkid(1) echoes the name of each file as it is scanned, prefixed by the
name of the language it thinks the file is written in. Mkid(1) reports
how many unique names and numbers were found, how many names occurred
only once, and the total for names and numbers. It also reports the
average number of occurrences for all names and numbers. Next, there
are some hash-table statistics on the load-factor and the average
number of open-addressed probes.
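For readers unfamiliar with the two hash-table figures, here is a toy
open-addressed table in Python that reports a load-factor and an
average probe count. Linear probing and the way the average is taken
are assumptions made for this sketch; the tutorial does not say which
probing scheme or formula mkid(1) uses.

```python
class ProbeTable:
    """Open-addressed table with linear probing that counts its own probes."""

    def __init__(self, size, hash_fn=hash):
        self.slots = [None] * size
        self.hash_fn = hash_fn
        self.used = 0        # occupied slots
        self.probes = 0      # slots examined over all insert() calls
        self.operations = 0  # number of insert() calls

    def insert(self, key):
        # Assumes the table never fills up completely.
        i = self.hash_fn(key) % len(self.slots)
        self.operations += 1
        while True:
            self.probes += 1
            if self.slots[i] is None:
                self.slots[i] = key
                self.used += 1
                return
            if self.slots[i] == key:
                return           # already present
            i = (i + 1) % len(self.slots)

    def load(self):
        return self.used / len(self.slots)

    def average_probes(self):
        return self.probes / self.operations

table = ProbeTable(64)
for name in ["getc", "getchar", "getenv", "gets", "getc", "FILE"]:
    table.insert(name)
print("Load: %.2f, Probes: %.2f" % (table.load(), table.average_probes()))
```

A low load-factor keeps the probe average close to 1, which is the
pattern visible in the mkid(1) report above.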
Mkid(1) can take arguments from the command line, from stdin, or from
a file. A file full of filenames may also contain mkid options of the form
-<option>. Filenames and options appear in the file one-per-line. Typical
usage for this feature is as follows:
$ find . -name '*.[chys]' -print >IDFILES
$ mkid -aIDFILES
-- or --
$ find . -name '*.[chys]' -print |mkid -
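A file of arguments in this one-per-line form is easy to take apart.
The following Python sketch shows one plausible reading of the format;
the function name is invented, and both skipping blank lines and
treating every line that starts with `-' as an option are assumptions.

```python
def read_arg_file(lines):
    """Split one-per-line arguments into (options, filenames)."""
    options, filenames = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # assumption: blank lines are ignored
        if line.startswith("-"):
            options.append(line)          # e.g. -S.y=c
        else:
            filenames.append(line)
    return options, filenames

sample = ["-S.y=c\n", "./lid.c\n", "\n", "./mkid.c\n"]
print(read_arg_file(sample))   # (['-S.y=c'], ['./lid.c', './mkid.c'])
```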
Mkid(1) stashes the filenames and relevant arguments in the database
itself. It uses these to support the `incremental-update' option.
If invoked with `-u', mkid(1) checks the modification times of all
constituent files, and only re-scans those that are newer than the
database itself. It is invoked like so:
$ mkid -u
In summation, mkid(1) can get arguments from one of four places:
1) the command line, 2) a file, 3) stdin, 4) the database itself.
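The incremental-update check amounts to a comparison of modification
times. The Python sketch below illustrates that comparison with
temporary files; the function name `stale_sources' and the file names
are hypothetical, and the real mkid(1) takes its file list from the
database rather than from an argument.

```python
import os
import tempfile

def stale_sources(db_path, sources):
    """Return the sources modified more recently than the database file."""
    db_time = os.path.getmtime(db_path)
    return [s for s in sources if os.path.getmtime(s) > db_time]

with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "ID")
    old = os.path.join(tmp, "old.c")
    new = os.path.join(tmp, "new.c")
    for path in (db, old, new):
        open(path, "w").close()
    os.utime(old, (1000, 1000))   # older than the database
    os.utime(db, (2000, 2000))
    os.utime(new, (3000, 3000))   # edited after the database was built
    print([os.path.basename(p) for p in stale_sources(db, [old, new])])
```

Only new.c would be re-scanned in this example.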
Mkid(1) accepts a number of scanner-specific arguments. Generally,
these are introduced with `-S<lang>' where <lang> is the name of
a language, such as `c' or `asm'. You can get a scanner-specific
usage-report with `-S<lang>?'. (Of course, the `?' must be escaped
to get it past the shell.)
Here's scanner-usage for the assembly language scanner:
$ mkid -Sasm\?
The Assembler scanner arguments take the form -Sasm<arg>, where
<arg> is one of the following: (<cc> denotes one or more characters)
-c<cc> . . . . <cc> introduce(s) a comment until end-of-line.
(+|-)u . . . . (Do|Don't) strip a leading `_' from ids.
(+|-)a<cc> . . Allow <cc> in ids, and (keep|ignore) those ids.
(+|-)p . . . . (Do|Don't) handle C-preprocessor directives.
(+|-)C . . . . (Do|Don't) handle C-style comments. (/* */)
`-Sasm-c<cc>' tells the scanner what characters are used to introduce comments
that extend to end-of-line.
Use `-Sasm+u' if your C compiler prepends leading underscores to external
names. This way, mkid(1) will strip leading underscores, and the name
`foo' in a C source will be correctly associated with the name `_foo'
in an assembler source. If your compiler doesn't prepend leading
underscores, use `-Sasm-u'.
Many assemblers allow special characters to be mixed with
alpha-numerics in label, constant and register names. Common choices
are `.', `%', and `$'. Thus, a label such as `L%123' should be scanned
as one token, not broken up into the name `L' and the number 123.
`-Sasm-a%.' tells the scanner to allow `%' and `.' in tokens, but to throw
away tokens containing `%' or `.'. `-Sasm+a%.' tells the scanner to keep such
tokens and put them into the database.
`-Sasm+p' tells the scanner to handle `#include' and `#define' lines as
in C source, and `-Sasm+C' tells it to ignore C-style comments.
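To make the effect of these switches concrete, here is a toy scanner
in Python. It is a loose model of the behavior described above, not
the real assembler scanner: the function and parameter names are
invented, mnemonics are treated like any other name, and preprocessor
lines and C-style comments are not handled.

```python
import re

def scan_asm(text, comment_chars=";", strip_underscore=True,
             extra_chars="", keep_extra=False):
    """Return the names and numbers found in assembler text."""
    extra = re.escape(extra_chars)
    token_rx = re.compile(rf"[A-Za-z_{extra}][A-Za-z0-9_{extra}]*|[0-9]+")
    tokens = []
    for line in text.splitlines():
        # Like -Sasm-c<cc>: drop everything from a comment character onward.
        for c in comment_chars:
            line = line.split(c, 1)[0]
        for tok in token_rx.findall(line):
            if any(c in tok for c in extra_chars) and not keep_extra:
                continue        # like -Sasm-a<cc>: one token, thrown away
            if strip_underscore and tok.startswith("_"):
                tok = tok[1:]   # like -Sasm+u
            if tok:
                tokens.append(tok)
    return tokens

src = "_foo: mov L%123, r0 ; load\n jmp _bar\n"
print(scan_asm(src))
print(scan_asm(src, extra_chars="%."))
print(scan_asm(src, extra_chars="%.", keep_extra=True))
```

Without any extra characters, `L%123' falls apart into `L' and `123',
which is exactly the problem the `a' switch is meant to avoid.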
Here's the scanner-usage for C:
$ mkid -Sc\?
The C scanner arguments take the form -Sc<arg>, where <arg>
is one of the following: (<cc> denotes one or more characters)
(+|-)u . . . . (Do|Don't) strip a leading `_' from ids in strings.
-s<cc> . . . . Allow <cc> in string ids.
The `+u' argument is akin to the argument for the assembly-language
scanner. Mkid(1) keeps the contents of quoted-strings if the string
contains a single valid C name and nothing else. E.g. mkid(1) would
keep the contents of "_proc". Such strings are interesting because
they may contain symbol names that a program uses for nlist lookups.
So, if your compiler prepends underscores to external symbols, use
`-Sc+u' so that mkid(1) will strip them back off.
Mkid(1) normally throws away the contents of quoted strings that have
anything other than a single name in them. You can tell mkid(1) to
accept additional characters in strings with `-Sc-s<cc>' where <cc> is
one or more special characters. E.g. `-Sc-s/.-:,' will include most of
the strings containing pathnames that you are likely to encounter.
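The string-handling rules can be sketched in Python as follows. This
is a simplified model under stated assumptions: the function name is
made up, and string literals containing backslash escapes are not
parsed properly, which the real C scanner would certainly have to do.

```python
import re

def string_ids(source, strip_underscore=False, extra_chars=""):
    """Return contents of string literals that look like a single name."""
    extra = re.escape(extra_chars)
    name_rx = re.compile(rf"[A-Za-z_{extra}][A-Za-z0-9_{extra}]*")
    kept = []
    # Toy literal matcher: only handles strings without backslashes.
    for body in re.findall(r'"([^"\\]*)"', source):
        if not name_rx.fullmatch(body):
            continue                 # more than a single name: thrown away
        if strip_underscore and body.startswith("_"):
            body = body[1:]          # like -Sc+u
        if body:
            kept.append(body)
    return kept

src = 'nlist("_proc"); open("/usr/lib/id.c"); puts("hello world");'
print(string_ids(src))                              # ['_proc']
print(string_ids(src, strip_underscore=True))       # ['proc']
print(string_ids(src, extra_chars="/.-:,"))         # ['_proc', '/usr/lib/id.c']
```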
Another class of scanner argument allows you to associate a suffix
with a language. E.g. `-S.y=c' tells mkid(1) to use the C language
scanner on all files ending with .y. You can ask mkid(1) for the
available scanners and associated suffixes like so:
$ mkid -S\?=\?
.c=c, .h=c, .y=c, .s=asm, .p=pascal, .pas=pascal
Please note, mkid(1) is lying to you about its Pascal prowess!
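The suffix-to-language association behaves like a small lookup table
that `-S.<suffix>=<lang>' arguments can extend. A Python sketch of
that idea follows; the default table is copied from the output above,
while the function name and the choice to return None for an unknown
suffix are assumptions made for the example.

```python
import posixpath

DEFAULT_SUFFIXES = {".c": "c", ".h": "c", ".y": "c",
                    ".s": "asm", ".p": "pascal", ".pas": "pascal"}

def language_for(filename, overrides=()):
    """Pick a scanner language from the file suffix."""
    table = dict(DEFAULT_SUFFIXES)
    for arg in overrides:
        # arg looks like "-S.y=c": drop the "-S", split at the first "=".
        suffix, lang = arg[2:].split("=", 1)
        table[suffix] = lang
    return table.get(posixpath.splitext(filename)[1])

print(language_for("scan-c.c"))               # c
print(language_for("gram.l"))                 # None
print(language_for("gram.l", ["-S.l=c"]))     # c
```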
At the time of this posting, there are scanners for C and assembly
language sources. There are also stubs for Pascal, Ada and LISP. The
scanners are very fast. The assembly language scanner knows how
to throw away C-style comments as well as the traditional `comment-
character-until-end-of-line' style. In order to test new scanners,
there is a scanner driver called idx(1). Idx(1) simply calls the
scanner to get identifiers one-at-a-time and prints them on stdout one-per-line.
For more information, read the manual pages!
Happy Hacking,
--
-- Greg McGary
-- P.O. Box 286
-- Lincoln, MA 01773
--
-- 9/15/87
--
-- Until the end of 1987,
-- Consulting to Sun's East Coast Division:
-- gmcgary@ecd.sun.com
-- gmcgary@suneast.uu.net
--
-- After that, probably consulting in Europe...