Simtel MSDOS - Coast to Coast

Text File | 1990-05-22 | 14KB | 309 lines

Four AWK Implementations for MS-DOS - How Do They Compare?
Copyright (c) 1989, 1990, by George A. Theall

In the fall of 1988 I was introduced to a little known programming
language named AWK. Its main feature is undoubtedly the speed with
which programs can be developed. AWK has been available on Unix
systems for about 10 years now but only recently crossed over to the
MS-DOS world. Despite this slow start, AWK's power, versatility, and
flexibility should make it a hit for anyone who is serious about using
their PC.

When I first discovered AWK, it seemed perfectly suited to the type of
data manipulation which much of my research work involves. At the
time, two companies were marketing implementations of AWK for MS-DOS:
Mortice Kern Systems (MKS) and Polytron/SAGE Software. Both claimed to
have a complete implementation of AWK as described in _The AWK
Programming Language_ by Aho, Kernighan, and Weinberger, the
language's developers. I decided on MKS' product - it was the cheaper,
and support was available via electronic mail.

Initially I was a bit uncomfortable with my decision. Though both
companies have excellent reputations, I had not seen any comparisons
of the two AWKs. Since then I have worked with the products from MKS
and Polytron/SAGE as well as non-commercial implementations from Rob
Duff and the GNU Project. Given my experience, two questions come to
mind: How significantly do these implementations differ? And more
importantly, why spend roughly $100 for a commercial program when you
can download a copy of Duff's AWK or GNU AWK from a local bbs for just
the cost of a phone call?

All four implementations reviewed here claim to conform to the de
facto standard of _The AWK Programming Language_. And, with the
exception of Duff AWK's inability to handle the pipe form of the
getline function, all four uphold this claim. Some basic features of
each are tabulated below:

TABLE 1.
Features of Each AWK Implementation
---------------------------------------------------------------------------
Feature                       DUFF_AWK    GAWK        MKS_AWK     POLY_AWK
---------------------------   ---------   ---------   ---------   ---------
Executable Size (in Kbytes)   63          131         56/87       131
Uses 80x87?                   unknown     unknown     yes (1)     yes
Maximum record size           1024        16384       2048        32000
Extensions?                   no          yes (2)     no          yes (3)
Read programs from stdin?     yes         no          no          no
---------------------------------------------------------------------------
(1) MKS AWK comes with four executables - one set of two provides
    direct support for a numeric coprocessor; the other uses it if
    available. File sizes reported above are for the set with indirect
    support.
(2) Chief among GAWK's extensions are an IGNORECASE variable as well
    as egrep-style regular expressions as described by the POSIX
    standard.
(3) PolyAWK's extensions relate to case-conversion, manipulation of
    environment variables, and bitwise operators.

DUFF_AWK represents version 2.14 of Rob Duff's implementation; GAWK
refers to version 2.11, patchlevel 1, of GNU AWK as ported to MS-DOS
by Kent Williams; MKS_AWK denotes version 2.3 of the small and large
models from MKS; and POLY_AWK refers to version 1.3 of Polytron/SAGE's
product.

To compare the implementations I devised a set of programs based on
tasks for which I commonly use AWK. Each program processes three input
files, constructed of lines such as:

    PD1:<MSDOS.APL>
    SAPLPC-A.ARC   208K  BINARY  04/02/88
    SAPLPC-B.ARC   225K  BINARY  04/02/88
    PD1:<MSDOS.ARC-LBR>
    ADIR103.ARC      8K  BINARY  05/24/87  5-col .ARC file ...
    ADIR140.ARC     10K  BINARY  02/05/88  Dave Rand's ARC ...

The input files themselves differ only in their sizes, which are:

         112     936    7449 SMALL.FIL
        1006    7199   60711 MEDIUM.FIL
       10569   74685  631238 LARGE.FIL

as reported by the word-counting task. [Fields are as follows: number
of lines, number of words, number of characters, and file name.]
These tasks do not purport to represent all of AWK's capabilities nor
is there much justification for selecting them. Nevertheless, they do
point to some interesting differences.

All tasks were run under MS-DOS v3.30 on an NEC PowerMate SX machine
operating at 16MHz and equipped with a fast (28ms) hard disk but
without a math coprocessor. Before each implementation was tested, the
hard disk was optimized and the system was rebooted with no
AUTOEXEC.BAT or CONFIG.SYS file. This resulted in a "clean" system
with roughly 600K of conventional memory available. For each
implementation, six tasks were performed using the three input files,
results were tabulated, and then the executable and output files were
moved onto a floppy. To avoid _human_ measurement inaccuracies
execution times were calculated with Brant Cheikes' TM utility, which
rounds to the nearest second.

Results are presented in Table 2 below. For each version and each
task, three execution times are reported - the times required to
process SMALL.FIL, MEDIUM.FIL, and LARGE.FIL respectively. The actual
AWK programs appear at the end of this document.

TABLE 2.
AWK Program Execution Times (in seconds)
----------------------------------------------------------------------------
Task            DUFF_AWK     GAWK          MKS_AWK     MKS_AWKL     POLY_AWK
-------------   ----------   -----------   ---------   ----------   ---------
Record Count.   2/14/145     2/3/21        1/2/12      1/2/23       1/2/15
Word Count.     4/25/252     3/16/160      1/5/43      2/6/52       2/5/42
Line Number.    4/21/209     1/7/67        1/6/62      1/6/70       2/6/55
Reg. Expr.      4/22/222     2/5/40        2/15/150    4/23/228     2/5/41
Column sums     3/23/238     2/8/78        2/4/36      1/6/54       1/4/39
Spelling        15/22*/22*   12/123/138*   6/13*/13*   8/71/1020*   4/25/127*
----------------------------------------------------------------------------
* indicates the program ran out of memory and aborted.
DUFF_AWK represents version 2.14 of Rob Duff's implementation; GAWK
refers to version 2.11, patchlevel 1, of GNU AWK as ported to MS-DOS
by Kent Williams; MKS_AWK and MKS_AWKL denote version 2.3 of the small
and large models from MKS; and POLY_AWK refers to version 1.3 of
Polytron/SAGE's product.

Note that while _actual_ execution times will vary from one situation
or machine to another, _relative_ times are useful in making
comparisons. Note also that the figures reported above are from a
single run rather than averages of multiple runs. The problem with
multiple runs is one of time: it takes about 1 hour for a single set
of runs on the SX!

Among the commercial products, there is no clear-cut leader. For tasks
involving SMALL.FIL, execution times for the three implementations are
all within a few seconds of each other so that observed differences
are probably due largely to TM's rounding to the nearest second. As
the size of the input file grows, however, two things become clear:
First, the need to use far pointers exacts a noticeable performance
penalty in MKS_AWKL. Second, POLY_AWK excels at regular expression
matching (as exemplified by the regular-expressions and spelling
tasks).

The commercial products do, though, offer clear performance advantages
compared to the non-commercial implementations. Of those tasks
completed successfully, DUFF_AWK turned in the slowest execution times
almost without exception. Its performance in the record-counting and
line-numbering tasks suggests the problem is due to poor disk I/O.
Results for GNU AWK are a mixed bag: GAWK turns in the fastest times
for regular expression matching yet at the same time ranks among the
slowest for the word-counting and column-sums tasks.

In terms of how the language is implemented by each package, I found
one outright bug and several interesting differences while performing
these benchmarks.
[NB: The two implementations from MKS differ only in execution speed
and available storage area; therefore, what is said below about
MKS_AWK applies to MKS_AWKL as well.]

The bug involves GAWK's handling of the metacharacter '+' for regular
expression matching. An earlier version of the column-sums task had to
be changed because it used this metacharacter.

The differences arise because several areas of the language are left
up to the implementors themselves; they do not indicate any lack of
compliance with the de facto standard of _The AWK Programming
Language_. The most disturbing difference concerns the function
printf() in DUFF_AWK: printf("%d", i) displays properly only numbers
in the range [-32768, 32767]. [Strangely enough, this behavior does
not occur if the statement print i is used!] Although it is possible
to avoid problematic results when working with numbers outside this
range, I'd certainly like to see some mention of this and similar
limitations in the documentation.

Also annoying is the treatment of associative array indices: MKS_AWK
reverses them, both POLY_AWK and DUFF_AWK alphabetize them, and GAWK
rearranges them in some apparently inexplicable fashion. Although the
standard says this is implementation-dependent, it is often desirable
to output the indices in the proper order. With the MKS product, it's
just a hassle; with the others, it's basically impossible.

It's also worth mentioning that MKS_AWK, unlike the other three
implementations, does not regard ^Z as an end-of-file marker; whether
or not it should is unclear. This caused some initial consternation
when I was comparing the results of the word-counting task, but
otherwise seems of little import.

Documentation for each package is sparse, but this is because all
conform to the standard closely. Of the four, DUFF_AWK's is by far the
best. It includes not only a discussion of the language but also a
brief tutorial and a ***large*** collection of sample programs.
For those new to AWK this is reason enough to grab a copy of Rob
Duff's package. GAWK comes with a single Unix-style man page which
focuses primarily on GNU-specific extensions; additional documentation
is available separately. MKS supplies a copy of _The AWK Programming
Language_ and a reference manual covering not only AWK but several
other utilities included with its package. I've found one mistake
(about maximum record size) and one omission (about placement of
temporary files), but overall it's quite adequate. Polytron also
supplies the book and a single README file. The latter mentions first
those examples in the AWK book which don't work because of
shortcomings inherent to MS-DOS, then Polytron/SAGE's own extensions
to the language.

What conclusions can be drawn from these comparisons? Well, as the old
adage says, "you don't get something for nothing". That is, choosing a
non-commercial over a commercial implementation will save you money
initially, but every time you use it you're likely to "pay" a price in
terms of slower performance. Whether this is worth $100 depends on
_your_ particular situation. On one hand, if you're interested in
learning about AWK, use it only infrequently, or process small files
(and don't work in a commercial environment) you'll probably be quite
happy with either Duff's AWK or GAWK. On the other, if you can justify
spending the money, you'll face a tough choice between the offerings
from MKS and Polytron, with Polytron's product looking slightly
better. Nevertheless, all four implementations are well worth your
consideration, and I have no qualms about recommending them as
effective tools for users of DOS-based computers.

Disclaimer: Apart from being a satisfied owner of Mortice Kern
Systems' AWK and Polytron's PolyShell, I have no direct connection
with the companies or people mentioned above.

If you have any comments about this article, or the AWK language in
general, please get in touch.
For those with email access, I can be reached as GTHEALL@PENNDRLS
(BITNET) or GTHEALL@PENNDRLS.UPENN.EDU (Internet). Otherwise, give me
a call at +1 215 898 3419.

Tasks Used in Comparisons

1. Record-counting task

    END {print NR}

2. Word-counting task

    FName != FILENAME {
        if (FName)
            printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
        FName = FILENAME
        cc = wc = lc = 0
    }
    {
        cc += length($0) + 1        # don't forget LF!
        wc += NF
        lc ++
    }
    END {
        printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
    }

3. Line-numbering task

    {print NR ": ", $0}

4. Regular-expressions task

    /test|Test|TEST/

5. Column-sums task

    $2 ~ /[0-9][0-9]*K$/ { sum += $2 }
    END { printf("Sum of 2nd column: %16.0fK\n", sum) }

6. Spelling task

    # List words occurring only once in a document.
    # A "word" is defined as a sequence of alphanumerics
    # or underscores.

    # Scan thru each line and compute word frequencies.
    # The associative array Words[] holds these frequencies.
    {
        # replace non-alphanumerics with blanks throughout line
        gsub(/[^A-Za-z0-9_]/, " ")

        # count how many times each word used
        for (i = 1; i <= NF; i++)       # scan all fields ...
            Words[$i]++                 # increment word count
    }

    # Print out infrequently-used words.
    END {
        for (w in Words)                # scan over all words ...
            if (Words[w] == 1)          # if word only appears once ...
                print w                 # print it
    }
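
The earlier complaint about associative array indices coming back in
an implementation-dependent order can usually be sidestepped in
portable AWK by recording the order of first appearance in a second,
numerically indexed array. What follows is a minimal sketch of that
idea applied to the spelling task; it was not part of the benchmarks.
The array Order[], the counter n, and the wrapper name once_in_order
are hypothetical names chosen for illustration, and the wrapper
assumes a Bourne-style shell with an awk on the search path.

```sh
# Hypothetical wrapper; the AWK program between the quotes is the point.
once_in_order() {
    awk '
    {
        # same definition of a word as in the spelling task
        gsub(/[^A-Za-z0-9_]/, " ")

        for (i = 1; i <= NF; i++)       # scan all fields ...
            if (Words[$i]++ == 0)       # first sighting of this word ...
                Order[++n] = $i         # remember when it first appeared
    }
    END {
        for (k = 1; k <= n; k++)        # walk words in first-seen order
            if (Words[Order[k]] == 1)   # if word only appears once ...
                print Order[k]          # print it
    }
    ' "$@"
}
```

Called as once_in_order SMALL.FIL, this should list the once-only
words in the order they first occur in the file, whatever order the
implementation uses for "for (w in Words)". Because Order[] holds
every distinct word a second time, it is likely to need more memory
than the original task, which already ran out of memory on the larger
inputs in Table 2.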