-
-
- Four AWK Implementations for MS-DOS - How Do They Compare?
- Copyright (c) 1989, 1990, by George A. Theall
-
-
-
-
- In the fall of 1988 I was introduced to a little-known programming
- language named AWK. Its main feature is undoubtedly the speed with which
- programs can be developed. AWK has been available on Unix systems for
- about 10 years now but has only recently crossed over to the MS-DOS world.
- Despite this slow start, AWK's power, versatility, and flexibility should
- make it a hit with anyone who is serious about using their PC.
-
-
- When I first discovered AWK, it seemed perfectly suited to the type of
- data manipulation which much of my research work involves. At the time,
- two companies were marketing implementations of AWK for MS-DOS: Mortice
- Kern Systems (MKS) and Polytron/SAGE Software. Both claimed to have a
- complete implementation of AWK as described in _The AWK Programming
- Language_ by Aho, Kernighan, and Weinberger, the language's developers. I
- decided on MKS's product because it was the cheaper of the two and support
- was available via electronic mail.
-
-
- Initially I was a bit uncomfortable with my decision. Though both
- companies have excellent reputations, I had not seen any comparisons of
- the two AWKs. Since then I have worked with the products from MKS and
- Polytron/SAGE as well as non-commercial implementations from Rob Duff and
- the GNU Project. Given my experience, two questions come to mind: How
- significantly do these implementations differ? And more importantly, why
- spend roughly $100 for a commercial program when you can download a copy
- of Duff's AWK or GNU AWK from a local BBS for just the cost of a phone
- call?
-
-
- All four implementations reviewed here claim to conform to the de facto
- standard of _The AWK Programming Language_. And, with the exception of
- Duff AWK's inability to handle the pipe form of the getline function, all
- four uphold this claim. Some basic features of each are tabulated below:
-
-
-                TABLE 1. Features of Each AWK Implementation
-
- ---------------------------------------------------------------------------
- Feature                       DUFF_AWK    GAWK        MKS_AWK     POLY_AWK
- ---------------------------   ---------   ---------   ---------   ---------
- Executable Size (in Kbytes)   63          131         56/87       131
- Uses 80x87?                   unknown     unknown     yes (1)     yes
- Maximum record size           1024        16384       2048        32000
- Extensions?                   no          yes (2)     no          yes (3)
- Read programs from stdin?     yes         no          no          no
- ---------------------------------------------------------------------------
- (1) MKS AWK comes with four executables - one set of two provides direct
- support for a numeric coprocessor; the other uses it if available.
- File sizes reported above are for the set with indirect support.
-
- (2) Chief among GAWK's extensions are an IGNORECASE variable as well as
- egrep-style regular expressions as described by the POSIX standard.
-
- (3) PolyAWK's extensions relate to case-conversion, manipulation of
- environment variables, and bitwise operators.
-
- DUFF_AWK represents version 2.14 of Rob Duff's implementation; GAWK refers
- to version 2.11, patchlevel 1, of GNU AWK as ported to MS-DOS by Kent
- Williams; MKS_AWK denotes version 2.3 of the small and large models from
- MKS; and POLY_AWK refers to version 1.3 of Polytron/SAGE's product.
-
-
- To compare the implementations I devised a set of programs based on
- tasks for which I commonly use AWK. Each program processes three input
- files, constructed of lines such as:
-
- PD1:<MSDOS.APL>
- SAPLPC-A.ARC 208K BINARY 04/02/88
- SAPLPC-B.ARC 225K BINARY 04/02/88
-
- PD1:<MSDOS.ARC-LBR>
- ADIR103.ARC 8K BINARY 05/24/87 5-col .ARC file ...
- ADIR140.ARC 10K BINARY 02/05/88 Dave Rand's ARC ...
-
- The input files themselves differ only in their sizes, which are:
-
- 112 936 7449 SMALL.FIL
- 1006 7199 60711 MEDIUM.FIL
- 10569 74685 631238 LARGE.FIL
-
- as reported by the word-counting task. [Fields are as follows: number of
- lines, number of words, number of characters, and file name.] These tasks
- do not purport to represent all of AWK's capabilities nor is there much
- justification for selecting them. Nevertheless, they do point to some
- interesting differences.
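-
- As a minimal sketch of how AWK sees such lines (assuming the default
- whitespace field splitting and using two of the sample lines above), the
- size lands in the second field, and adding zero converts "208K" to the
- number 208 by ignoring the trailing K. The shell variable names are
- illustrative only:
-
```sh
sample='SAPLPC-A.ARC 208K BINARY 04/02/88
ADIR103.ARC 8K BINARY 05/24/87 5-col .ARC file ...'

# Print the field count, the file name, and the size as a number.
fields=$(printf '%s\n' "$sample" | awk '{ print NF, $1, $2 + 0 }')
printf '%s\n' "$fields"
```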
-
-
- All tasks were run under MS-DOS v3.30 on an NEC PowerMate SX machine
- operating at 16MHz and equipped with a fast (28ms) hard disk but without a
- math coprocessor. Before each implementation was tested, the hard disk was
- optimized and the system was rebooted with no AUTOEXEC.BAT or CONFIG.SYS
- file. This resulted in a "clean" system with roughly 600K of conventional
- memory available. For each implementation, six tasks were performed using
- the three input files, results were tabulated, and then the executable and
- output files were moved onto a floppy. To avoid _human_ measurement
- inaccuracies, execution times were measured with Brant Cheikes' TM
- utility, which rounds to the nearest second.
-
-
- Results are presented in Table 2 below. For each version and each task,
- three execution times are reported - the times required to process
- SMALL.FIL, MEDIUM.FIL, and LARGE.FIL respectively. The actual AWK programs
- appear at the end of this document.
-
-
-                     TABLE 2. AWK Program Execution Times
-                                 (in seconds)
-
- -----------------------------------------------------------------------------
- Task            DUFF_AWK     GAWK          MKS_AWK     MKS_AWKL     POLY_AWK
- -------------   ----------   -----------   ---------   ----------   ---------
- Record Count    2/14/145     2/3/21        1/2/12      1/2/23       1/2/15
- Word Count      4/25/252     3/16/160      1/5/43      2/6/52       2/5/42
- Line Number     4/21/209     1/7/67        1/6/62      1/6/70       2/6/55
- Reg. Expr.      4/22/222     2/5/40        2/15/150    4/23/228     2/5/41
- Column Sums     3/23/238     2/8/78        2/4/36      1/6/54       1/4/39
- Spelling        15/22*/22*   12/123/138*   6/13*/13*   8/71/1020*   4/25/127*
- -----------------------------------------------------------------------------
- * indicates the program ran out of memory and aborted.
-
- DUFF_AWK represents version 2.14 of Rob Duff's implementation; GAWK refers
- to version 2.11, patchlevel 1, of GNU AWK as ported to MS-DOS by Kent
- Williams; MKS_AWK and MKS_AWKL denote version 2.3 of the small and large
- models from MKS; and POLY_AWK refers to version 1.3 of Polytron/SAGE's
- product.
-
-
- Note that while _actual_ execution times will vary from one situation
- or machine to another, _relative_ times are useful in making comparisons.
- Note also that the figures reported above are from a single run rather
- than averages of multiple runs. The problem with multiple runs is one of
- time: it takes about 1 hour for a single set of runs on the SX!
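-
- To make the relative comparison concrete, the following sketch divides
- each LARGE.FIL record-count time from Table 2 by the fastest of them. The
- choice of task and the two-decimal rounding are arbitrary, made only for
- illustration:
-
```sh
# LARGE.FIL record-count times (in seconds), copied from Table 2.
times='DUFF_AWK 145
GAWK 21
MKS_AWK 12
MKS_AWKL 23
POLY_AWK 15'

relative=$(printf '%s\n' "$times" | awk '
    { name[NR] = $1; secs[NR] = $2 + 0 }              # remember each row
    NR == 1 || ($2 + 0) < best { best = $2 + 0 }      # track the fastest time
    END {
        for (i = 1; i <= NR; i++)                     # ratio to the fastest
            printf("%-9s %5.2f\n", name[i], secs[i] / best)
    }')
printf '%s\n' "$relative"
```
-
- A ratio of 1.00 marks the fastest implementation for that task; larger
- values show how many times longer the others took.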
-
-
- Among the commercial products, there is no clear-cut leader. For tasks
- involving SMALL.FIL, execution times for the three implementations are all
- within a few seconds of each other so that observed differences are
- probably due largely to TM's rounding to the nearest second. As the size
- of the input file grows, however, two things become clear: First, the need
- to use far pointers exacts a noticeable performance penalty in MKS_AWKL.
- Second, POLY_AWK excels at regular expression matching (as exemplified by
- the regular-expressions and spelling tasks).
-
-
- The commercial products do, though, offer clear performance advantages
- compared to the non-commercial implementations. Of those tasks completed
- successfully, DUFF_AWK turned in the slowest execution times almost
- without exception. Its performance in the record-counting and
- line-numbering tasks suggests the problem is due to poor disk I/O. Results
- for GNU AWK are a mixed bag: GAWK turns in the fastest times for
- regular expression matching yet at the same time ranks among the slowest
- for the word-counting and column-sums tasks.
-
-
- In terms of how the language is implemented by each package, I found
- one outright bug and several interesting differences while performing
- these benchmarks. [NB: The two implementations from MKS differ only in
- execution speed and available storage area; therefore, what is said below
- about MKS_AWK applies to MKS_AWKL as well.] The bug involves GAWK's
- handling of the metacharacter '+' for regular expression matching. An
- earlier version of the column-sums task had to be changed because it used
- this metacharacter. The differences arise because several areas of the
- language are left up to the implementors themselves; they do not indicate
- any lack of compliance with the de facto standard of _The AWK Programming
- Language_.
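-
- The earlier version of the column-sums program is not reproduced in this
- document. As a sketch, assume it matched the size field with [0-9]+; the
- final program's [0-9][0-9]* (one digit, then zero or more digits) is an
- equivalent pattern that avoids the + metacharacter. In an AWK that handles
- + correctly, both forms should give the same sum:
-
```sh
sample='SAPLPC-A.ARC 208K BINARY 04/02/88
PD1:<MSDOS.APL>
ADIR103.ARC 8K BINARY 05/24/87'

# Assumed earlier form: + means "one or more of the preceding item".
with_plus=$(printf '%s\n' "$sample" |
    awk '$2 ~ /[0-9]+K$/ { sum += $2 } END { print sum + 0 }')

# Form used in the final task: the same match spelled out without +.
without_plus=$(printf '%s\n' "$sample" |
    awk '$2 ~ /[0-9][0-9]*K$/ { sum += $2 } END { print sum + 0 }')

printf '%s %s\n' "$with_plus" "$without_plus"
```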
-
-
- The most disturbing difference concerns the function printf() in
- DUFF_AWK: printf("%d", i) properly displays only numbers in the range
- [-32768, 32767]. [Strangely enough, this behavior does not occur if the
- statement print i is used!] Although it is possible to avoid problematic
- results when working with numbers outside this range, I'd certainly like
- to see some mention of this and similar limitations in the documentation.
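-
- One way to sidestep the problem, sketched here on the assumption that the
- limitation affects only the %d conversion, is to format whole numbers with
- %.0f instead. The task programs at the end of this document also format
- their totals with %8.0f and %16.0f:
-
```sh
big=$(awk 'BEGIN {
    i = 100000              # outside the 16-bit range [-32768, 32767]
    printf("%.0f\n", i)     # floating-point format with no decimal places
}')
printf '%s\n' "$big"
```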
-
-
- Also annoying is the treatment of associative array indices: MKS_AWK
- reverses them, both POLY_AWK and DUFF_AWK alphabetize them, and GAWK
- rearranges them in some apparently inexplicable fashion. Although the
- standard says this is implementation-dependent, it is often desirable to
- output the indices in the order in which they were created. With the MKS
- product, it's just a hassle; with the others, it's basically impossible.
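-
- One common workaround, which costs extra memory and is offered only as a
- sketch, is to record each index in a second, numerically indexed array the
- first time it appears and to loop over that array at output time. The
- names Count, Order, and n are illustrative:
-
```sh
ordered=$(printf '%s\n' pear apple pear fig | awk '
    !($1 in Count) { Order[++n] = $1 }   # first sighting: remember position
    { Count[$1]++ }                      # tally every occurrence
    END {
        for (i = 1; i <= n; i++)         # numeric loop, not "for (w in Count)"
            print Order[i], Count[Order[i]]
    }')
printf '%s\n' "$ordered"
```
-
- Because the output loop runs over consecutive integers, the result does
- not depend on how a given implementation orders "for (w in ...)".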
-
-
- It's also worth mentioning that MKS_AWK, unlike the other three
- implementations, does not regard ^Z as an end-of-file marker; whether or
- not it should is unclear. This caused some initial consternation when I
- was comparing the results of the word-counting task, but otherwise seems
- of little import.
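-
- For an implementation that passes ^Z through as data, a program can stop
- at the marker itself. The following is only a sketch, assuming ^Z (octal
- 032) should end the input wherever it appears; the counters mirror lc and
- wc from the word-counting task:
-
```sh
# Two real lines followed by a lone Ctrl-Z, as some DOS editors write it.
counts=$(printf 'one two\nthree\n\032' | awk '
    {
        z = index($0, "\032")               # position of Ctrl-Z, 0 if absent
        if (z) $0 = substr($0, 1, z - 1)    # keep only text before the marker
        if (!z || length($0)) { lc++; wc += NF }
        if (z) exit                         # ignore everything after it
    }
    END { print lc + 0, wc + 0 }')
printf '%s\n' "$counts"
```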
-
-
- Documentation for each package is sparse, but this is because all
- conform to the standard closely. Of the four, DUFF_AWK's is by far the
- best. It includes not only a discussion of the language but also a brief
- tutorial and a ***large*** collection of sample programs. For those new to
- AWK this is reason enough to grab a copy of Rob Duff's package. GAWK comes
- with a single Unix-style man page which focuses primarily on GNU-specific
- extensions; additional documentation is available separately. MKS supplies
- a copy of _The AWK Programming Language_ and a reference manual covering
- not only AWK but several other utilities included with its package. I've
- found one mistake (about maximum record size) and one omission (about
- placement of temporary files), but overall it's quite adequate. Polytron
- also supplies the book and a single README file. The latter mentions first
- those examples in the AWK book which don't work because of shortcomings
- inherent to MS-DOS, then Polytron/SAGE's own extensions to the language.
-
-
- What conclusions can be drawn from these comparisons? Well, as the old
- adage says, "you don't get something for nothing". That is, choosing a
- non-commercial over a commercial implementation will save you money
- initially, but every time you use it you're likely to "pay" a price in
- terms of slower performance. Whether this is worth $100 depends on _your_
- particular situation. On one hand, if you're interested in learning about
- AWK, use it only infrequently, or process small files (and don't work in a
- commercial environment) you'll probably be quite happy with either Duff's
- AWK or GAWK. On the other, if you can justify spending the money, you'll
- face a tough choice between the offerings from MKS and Polytron, with
- Polytron's product looking slightly better. Nevertheless, all four
- implementations are well worth your consideration, and I have no qualms
- about recommending them as effective tools for users of DOS-based
- computers.
-
-
- Disclaimer: Apart from being a satisfied owner of Mortice Kern Systems'
- AWK and Polytron's PolyShell, I have no direct connection with the
- companies or people mentioned above.
-
-
- If you have any comments about this article, or the AWK language in
- general, please get in touch. For those with email access, I can be
- reached as GTHEALL@PENNDRLS (BITNET) or GTHEALL@PENNDRLS.UPENN.EDU
- (Internet). Otherwise, give me a call at +1 215 898 3419.
-
-
- Tasks Used in Comparisons
-
-
-
- 1. Record-counting task
- END {print NR}
-
-
- 2. Word-counting task
- FName != FILENAME {
-     if (FName)
-         printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
-     FName = FILENAME
-     cc = wc = lc = 0
- }
-
- {
-     cc += length($0) + 1    # don't forget LF!
-     wc += NF
-     lc++
- }
-
- END {
-     printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
- }
-
-
- 3. Line-numbering task
- {print NR ": ", $0}
-
-
- 4. Regular-expressions task
- /test|Test|TEST/
-
-
- 5. Column-sums task
- $2 ~ /[0-9][0-9]*K$/ {
-     sum += $2
- }
-
- END {
-     printf("Sum of 2nd column: %16.0fK\n", sum)
- }
-
-
- 6. Spelling task
- # List words occurring only once in a document.
- # A "word" is defined as a sequence of alphanumerics
- # or underscores.
-
- # Scan thru each line and compute word frequencies.
- # The associative array Words[] holds these frequencies.
- {
-     # replace non-alphanumerics with blanks throughout line
-     gsub(/[^A-Za-z0-9_]/, " ")
-
-     # count how many times each word used
-     for (i = 1; i <= NF; i++)    # scan all fields ...
-         Words[$i]++              # increment word count
- }
-
- # Print out infrequently-used words.
- END {
-     for (w in Words)             # scan over all words ...
-         if (Words[w] == 1)       # if word only appears once ...
-             print w              # print it
- }
-