Four AWK Implementations for MS-DOS - How Do They Compare?
Copyright (c) 1989, 1990, by George A. Theall
In the fall of 1988 I was introduced to a little known programming
language named AWK. Its main feature is undoubtedly the speed with which
programs can be developed. AWK has been available on Unix systems for
about 10 years now but only recently crossed over to the MS-DOS world.
Despite this slow start, AWK's power, versatility, and flexibility should
make it a hit for anyone who is serious about using their PC.
When I first discovered AWK, it seemed perfectly suited to the type of
data manipulation which much of my research work involves. At the time,
two companies were marketing implementations of AWK for MS-DOS: Mortice
Kern Systems (MKS) and Polytron/SAGE Software. Both claimed to have a
complete implementation of AWK as described in _The AWK Programming
Language_ by Aho, Kernighan, and Weinberger, the language's developers. I
decided on MKS' product - it was the cheaper of the two, and support was available
via electronic mail.
Initially I was a bit uncomfortable with my decision. Though both
companies have excellent reputations, I had not seen any comparisons of
the two AWKs. Since then I have worked with the products from MKS and
Polytron/SAGE as well as non-commercial implementations from Rob Duff and
the GNU Project. Given my experience, two questions come to mind: How
significantly do these implementations differ? And more importantly, why
spend roughly $100 for a commercial program when you can download a copy
of Duff's AWK or GNU AWK from a local BBS for just the cost of a phone
call?
All four implementations reviewed here claim to conform to the de facto
standard of _The AWK Programming Language_. And, with the exception of
Duff AWK's inability to handle the pipe form of the getline function, all
four uphold this claim. Some basic features of each are tabulated below:
TABLE 1. Features of Each AWK Implementation
---------------------------------------------------------------------------
Feature                       DUFF_AWK    GAWK        MKS_AWK     POLY_AWK
---------------------------   ---------   ---------   ---------   ---------
Executable Size (in Kbytes)   63          131         56/87       131
Uses 80x87?                   unknown     unknown     yes (1)     yes
Maximum record size           1024        16384       2048        32000
Extensions?                   no          yes (2)     no          yes (3)
Read programs from stdin?     yes         no          no          no
---------------------------------------------------------------------------
(1) MKS AWK comes with four executables - one set of two provides direct
support for a numeric coprocessor; the other uses it if available.
File sizes reported above are for the set with indirect support.
(2) Chief among GAWK's extensions are an IGNORECASE variable as well as
egrep-style regular expressions as described by the POSIX standard.
(3) PolyAWK's extensions relate to case-conversion, manipulation of
environment variables, and bitwise operators.
DUFF_AWK represents version 2.14 of Rob Duff's implementation; GAWK refers
to version 2.11, patchlevel 1, of GNU AWK as ported to MS-DOS by Kent
Williams; MKS_AWK denotes version 2.3 of the small and large models from
MKS; and POLY_AWK refers to version 1.3 of Polytron/SAGE's product.
To compare the implementations I devised a set of programs based on
tasks for which I commonly use AWK. Each program processes three input
files, constructed of lines such as:
PD1:<MSDOS.APL>
SAPLPC-A.ARC 208K BINARY 04/02/88
SAPLPC-B.ARC 225K BINARY 04/02/88
PD1:<MSDOS.ARC-LBR>
ADIR103.ARC 8K BINARY 05/24/87 5-col .ARC file ...
ADIR140.ARC 10K BINARY 02/05/88 Dave Rand's ARC ...
The input files themselves differ only in their sizes, which are:
     112     936    7449 SMALL.FIL
    1006    7199   60711 MEDIUM.FIL
   10569   74685  631238 LARGE.FIL
as reported by the word-counting task. [Fields are as follows: number of
lines, number of words, number of characters, and file name.] These tasks
do not purport to represent all of AWK's capabilities nor is there much
justification for selecting them. Nevertheless, they do point to some
interesting differences.
All tasks were run under MS-DOS v3.30 on an NEC PowerMate SX machine
operating at 16MHz and equipped with a fast (28ms) hard disk but without a
math coprocessor. Before each implementation was tested, the hard disk was
optimized and the system was rebooted with no AUTOEXEC.BAT or CONFIG.SYS
file. This resulted in a "clean" system with roughly 600K of conventional
memory available. For each implementation, six tasks were performed using
the three input files, results were tabulated, and then the executable and
output files were moved onto a floppy. To avoid _human_ measurement
inaccuracies, execution times were calculated with Brant Cheikes' TM
utility, which rounds to the nearest second.
Results are presented in Table 2 below. For each version and each task,
three execution times are reported - the times required to process
SMALL.FIL, MEDIUM.FIL, and LARGE.FIL respectively. The actual AWK programs
appear at the end of this document.
TABLE 2. AWK Program Execution Times
(in seconds)
----------------------------------------------------------------------------
Task            DUFF_AWK     GAWK          MKS_AWK     MKS_AWKL     POLY_AWK
-------------   ----------   -----------   ---------   ----------   ---------
Record Count.   2/14/145     2/3/21        1/2/12      1/2/23       1/2/15
Word Count.     4/25/252     3/16/160      1/5/43      2/6/52       2/5/42
Line Number.    4/21/209     1/7/67        1/6/62      1/6/70       2/6/55
Reg. Expr.      4/22/222     2/5/40        2/15/150    4/23/228     2/5/41
Column sums     3/23/238     2/8/78        2/4/36      1/6/54       1/4/39
Spelling        15/22*/22*   12/123/138*   6/13*/13*   8/71/1020*   4/25/127*
----------------------------------------------------------------------------
* indicates the program ran out of memory and aborted.
DUFF_AWK represents version 2.14 of Rob Duff's implementation; GAWK refers
to version 2.11, patchlevel 1, of GNU AWK as ported to MS-DOS by Kent
Williams; MKS_AWK and MKS_AWKL denote versions 2.3 of the small and large
models from MKS; and POLY_AWK refers to version 1.3 of Polytron/SAGE's
product.
Note that while _actual_ execution times will vary from one situation
or machine to another, _relative_ times are useful in making comparisons.
Note also that the figures reported above are from a single run rather
than averages of multiple runs. The problem with multiple runs is one of
time: it takes about 1 hour for a single set of runs on the SX!
Among the commercial products, there is no clear-cut leader. For tasks
involving SMALL.FIL, execution times for the three implementations are all
within a few seconds of each other so that observed differences are
probably due largely to TM's rounding to the nearest second. As the size
of the input file grows, however, two things become clear: First, the need
to use far pointers exacts a noticeable performance penalty in MKS_AWKL.
Second, POLY_AWK excels at regular expression matching (as exemplified by
the regular-expressions and spelling tasks).
The commercial products do, though, offer clear performance advantages
compared to the non-commercial implementations. Of those tasks completed
successfully, DUFF_AWK turned in the slowest execution times almost
without exception. Its performance in the record-counting and line-numbering
tasks suggests the problem is due to poor disk I/O. Results for GNU AWK are
a mixed bag: GAWK turns in the fastest times for
regular expression matching yet at the same time ranks among the slowest
for the word-counting and column-sums tasks.
In terms of how the language is implemented by each package, I found
one outright bug and several interesting differences while performing
these benchmarks. [NB: The two implementations from MKS differ only in
execution speed and available storage area; therefore, what is said below
about MKS_AWK applies to MKS_AWKL as well.] The bug involves GAWK's
handling of the metacharacter '+' for regular expression matching. An
earlier version of the column-sums task had to be changed because it used
this metacharacter. The differences arise because several areas of the
language are left up to the implementors themselves; they do not indicate
any lack of compliance with the de facto standard of _The AWK Programming
Language_.
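To illustrate the kind of change involved, here is a minimal sketch. The '+' form below is hypothetical, since the earlier program is not reproduced in this article; the second form is the pattern used in the column-sums task at the end of this document. The sample lines are borrowed from the input shown earlier, the variable names are made up for illustration, and the commands assume a POSIX shell with some awk on the path rather than MS-DOS.

```shell
# Three sample lines; the last has no second field and should be skipped.
data='SAPLPC-A.ARC 208K BINARY 04/02/88
SAPLPC-B.ARC 225K BINARY 04/02/88
PD1:<MSDOS.APL>'

# Hypothetical earlier form: "+" means one or more digits.
with_plus=$(printf '%s\n' "$data" |
    awk '$2 ~ /[0-9]+K$/ { s += $2 } END { printf("%.0f\n", s) }')

# Form that avoids "+": one digit followed by zero or more digits.
without_plus=$(printf '%s\n' "$data" |
    awk '$2 ~ /[0-9][0-9]*K$/ { s += $2 } END { printf("%.0f\n", s) }')

echo "$with_plus $without_plus"
```

On an implementation that handles '+' correctly, both patterns select the same records and the two sums agree.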
The most disturbing difference concerns the function printf() in
DUFF_AWK: printf("%d", i) properly displays only numbers in the range
[-32768, 32767]. [Strangely enough, this behavior does not occur if the
statement print i is used!] Although it is possible to avoid problematic
results when working with numbers outside this range, I'd certainly like
to see some mention of this and similar limitations in the documentation.
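One way to sidestep the problem may be to format integers with "%.0f" rather than "%d", which is what the word-counting and column-sums tasks at the end of this document do. The sketch below assumes that DUFF_AWK's "%.0f" conversion is not subject to the same range limit; the article does not confirm this. The variable name is made up for illustration, and the command assumes a POSIX shell with some awk on the path.

```shell
# 631238 is the character count of LARGE.FIL, well outside [-32768, 32767].
# Assumption: "%.0f" formats the value as floating point and so avoids the
# range limit described for "%d".
big=$(awk 'BEGIN { i = 631238; printf("%.0f\n", i) }')

echo "$big"
```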
Also annoying is the treatment of associative array indices: MKS_AWK
reverses them, both POLY_AWK and DUFF_AWK alphabetize them, and GAWK
rearranges them in some apparently inexplicable fashion. Although the
standard says this is implementation-dependent, it is often desirable to
output the indices in the proper order. With the MKS product, it's just a
hassle; with the others, it's basically impossible.
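When output order matters, a common approach is to avoid relying on the "for (w in Words)" traversal and to record each index in a second, numerically indexed array the first time it is seen. The following is a minimal sketch of that idea, not taken from any of the four packages; the array names Count and Order and the sample input are made up for illustration, and the command assumes a POSIX shell with some awk on the path.

```shell
order=$(printf 'pear apple pear fig\n' | awk '
{
    for (i = 1; i <= NF; i++) {
        if (!($i in Count))        # first time this word is seen
            Order[++n] = $i        # remember its position
        Count[$i]++                # tally the word
    }
}
END {
    # Walk the numeric index, not the associative one, so the
    # words come out in the order in which they first appeared.
    for (k = 1; k <= n; k++)
        print Order[k], Count[Order[k]]
}')

echo "$order"
```

The cost is a second array holding one entry per distinct index, which may matter on a memory-constrained system.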
It's also worth mentioning that MKS_AWK, unlike the other three
implementations, does not regard ^Z as an end-of-file marker; whether or
not it should is unclear. This caused some initial consternation when I
was comparing the results of the word-counting task, but otherwise seems
of little import.
Documentation for each package is sparse, but this is because all
conform to the standard closely. Of the four, DUFF_AWK's is by far the
best. It includes not only a discussion of the language but also a brief
tutorial and a ***large*** collection of sample programs. For those new to
AWK this is reason enough to grab a copy of Rob Duff's package. GAWK comes
with a single Unix-style man page which focuses primarily on GNU-specific
extensions; additional documentation is available separately. MKS supplies
a copy of _The AWK Programming Language_ and a reference manual covering
not only AWK but several other utilities included with its package. I've
found one mistake (about maximum record size) and one omission (about
placement of temporary files), but overall it's quite adequate. Polytron
also supplies the book and a single README file. The latter mentions first
those examples in the AWK book which don't work because of shortcomings
inherent to MS-DOS, then Polytron/SAGE's own extensions to the language.
What conclusions can be made from these comparisons? Well, as the old
adage says, "you don't get something for nothing". That is, choosing a
non-commercial over a commercial implementation will save you money
initially, but every time you use it you're likely to "pay" a price in
terms of slower performance. Whether this is worth $100 depends on _your_
particular situation. On one hand, if you're interested in learning about
AWK, use it only infrequently, or process small files (and don't work in a
commercial environment) you'll probably be quite happy with either Duff's
AWK or GAWK. On the other, if you can justify spending the money, you'll
face a tough choice between the offerings from MKS and Polytron, with
Polytron's product looking slightly better. Nevertheless, all four
implementations are well worth your consideration, and I have no qualms
about recommending them as effective tools for users of DOS-based
computers.
Disclaimer: Apart from being a satisfied owner of Mortice Kern Systems'
AWK and Polytron's PolyShell, I have no direct connection with the
companies or people mentioned above.
If you have any comments about this article, or the AWK language in
general, please get in touch. For those with email access, I can be
reached as GTHEALL@PENNDRLS (BITNET) or GTHEALL@PENNDRLS.UPENN.EDU
(Internet). Otherwise, give me a call at +1 215 898 3419.
Tasks Used in Comparisons

1. Record-counting task

    END {print NR}

2. Word-counting task

    FName != FILENAME {
        if (FName)
            printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
        FName = FILENAME
        cc = wc = lc = 0
    }
    {
        cc += length($0) + 1        # don't forget LF!
        wc += NF
        lc ++
    }
    END {
        printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
    }

3. Line-numbering task

    {print NR ": ", $0}

4. Regular-expressions task

    /test|Test|TEST/

5. Column-sums task

    $2 ~ /[0-9][0-9]*K$/ {
        sum += $2
    }
    END {
        printf("Sum of 2nd column: %16.0fK\n", sum)
    }

6. Spelling task

    # List words occurring only once in a document.
    # A "word" is defined as a sequence of alphanumerics
    # or underscores.

    # Scan thru each line and compute word frequencies.
    # The associative array Words[] holds these frequencies.
    {
        # replace non-alphanumerics with blanks throughout line
        gsub(/[^A-Za-z0-9_]/, " ")

        # count how many times each word used
        for (i = 1; i <= NF; i++)   # scan all fields ...
            Words[$i]++             # increment word count
    }

    # Print out infrequently-used words.
    END {
        for (w in Words)            # scan over all words ...
            if (Words[w] == 1)      # if word only appears once ...
                print w             # print it
    }