home *** CD-ROM | disk | FTP | other *** search
- Subject: Get Hep to Kanji-Ready Five-Algorithm [ef]?grep
-
- # "I need very little,
- and of that little,
- I need very little." -- St. Francis of Assisi
-
- Hybrid blitz 'egrep', whose urquell is a veritable chimaera of at least
- five string search techniques, is now further tuned.
-
- Posted to USENET (and the mod.sources archive) some months ago, our
- implementation of "plug-compatible" egrep.c has been extended to support:
-
- transparent 'grep' expression translation, to bring the speed of
- hyper-'egrep' to bear upon searches specified via the less capable
- 'grep' syntax.
-
- interception of 'fgrep' for alternations of low (<= 7) cardinality,
- using a novel method of Boyer-Moore table superimposition and
- pre-computation magic. the resulting speedup applies also to simple
- metacharacter-free 'egrep'-style alternations.
-
- (the above two improvements are made invisible by linking the
- grep/egrep/fgrep triumvirate.)
-
- a revised strategy of fallback to standard 'egrep' for hard
- cases, which eliminates costly popen() plumbing in favor of a
- statistically-based re-exec() dynamic.
-
- more complete application of fast match to standard options,
- including -n line numbering.
-
- preparation for Kanji pattern input, based upon parity-marked EUC
- codes. new egrep.c is "eight-bit clean". the fast algorithms
- unfortunately currently apply only to meta-free patterns and
- simple alternations; full Kanji regular expression treatment
- remains problematic. however, the new code will pass difficult
- input through to [ef]?grep utilities in the UNIX Japan standard
- substrate.
-
- Kanji capability is perhaps the most important addition, as the
- number of UNIX systems in the Orient proliferate, providing a "new market"
- for Boyer-Moore-style search. However, actual search efficacy remains
- unknown until the Gaijin author gains feedback from JUNET or points beyond.
- For all we know, Nippon text search utilities may already incorporate
- the methods. Tests conducted so far have been with artificial Kanji files.
-
- In case you are w(o|a)ndering, be reminded that no vendor source
- changes are required to use the code. It is configured as a turbo-charged
- "front-end" to the existing section one commands, though it is (barely)
- conceivable to adapt things, at a loss in worst-case efficiency, for
- (heaven forfend!) non-Unix systems running C. And, since we do not offer
- a minimalist grep, do not expect it to run well on minimalist UNIX clones.
-
- Below appears a brief timing run on Webster's 2nd wordlist. Notes
- on implementation changes since original release follow in the next message.
- If you want to skip the rest, fine. The easy instructions -- download
- from comp.sources [or 'anonymous' ftp of egrep.shar.Z from ames-aurora.arpa
- for the (im)patient], and run:
-
- make
- sh eg.shell # regression test amusement
- make install
-
- after perusing README.FIRST. Though the bundle in ~ftp/pub at NASA Ames
- Research Center contains prerequisite support from Univ. of Toronto's
- Henry Spencer, we are not re-posting regcomp()/regexec() to comp.sources.
- John Gilmore of the GNU project has a modified regexec(), but it is not
- mandatory for running the new egrep.
-
- Contrary to popular belief, old egrep is not I/O bound for large
- large files on most machines. The new version is. One sample:
-
- time egrep 'u.*nix' /usr/dict/web2 (2.5 MB)
- (yielding {Coturnix, Pseudophoenix, [Tt]urnix}), shows
-
- user sys real (load ave. < 1)
-
- VAX 11/750, 4.3 BSD, Fujitsu Eagles
- (new) 6.8 3.8 11
- (old) 70.0 5.5 87
-
- Sun 3/160, 3.2 OS, Eagle on SMD
- (new) 1.7 2.2 5
- (old) 14.7 1.5 16
-
- Cray 2, Sys 5, no disk striping
- (new) .93 .11 1
- (old) 8.07 .21 8
-
- notes: New egrep was actually run with -i option, but not old egrep.
- Also, fumbling for three-character residue is not [ef]?grep's forte,
- so the example is conservative.
-
- Sun 3 has higher sys time for some unknown reason (a guess: VAX 4.3 kernel
- handles read() calls with oddsize buffers differently?). Cray 2 reportedly
- does disk I/O at 5-10 megabytes per second, but the architecture/compiler
- is bad at byte addressing -- no cache, no vectors here. Unfair comparison:
- new egrep on Sun beats old egrep on Cray 2, even with fast Cray I/O!
-
- Speculation: the code might be useful on the Macintosh II, even if the Unix
- filesystem (Sys 5?) were to waste 3/4 of the 1 MB/sec SCSI disk bandwidth.
- Mac 2 testers please forward info to ames!jaw.
-
- Another metric is inner loop efficiency:
-
- # instructions
- VAX Berkeley cc 5
- Sun 68020 3.2 cc 6
- Stallman's GNU 68020 cc 4
- MIPS R2000 6
- Cray 2 25
-
- Thanks goes to mips!dce (David Elliott) for his testing time, as well as
- to RMS for two-upping Sun's C compiler.
-
- Of course, if you have a Connection Machine dedicated to running their
- admirable full-text keyworder on "in-core" text, you won't need [ef]?grep at
- all. And, for unindexed text on fine-grained parallel machines, reg. expr.
- search algorithms can be made to run with a lower time bound (see the Hillis
- book). But if your files are on disk, even a specialized search chip won't help
- once things become I/O or bus limited. For this reason, vectorization on a
- Cray(ette) would be a bust, though Cray buffs may write the author for other
- scalar speedup ideas...
-
- [continued]
-