Awk - A Pattern Scanning and Processing Language
(Second Edition)
Alfred V. Aho
Brian W. Kernighan
Peter J. Weinberger
ABSTRACT
Awk is a programming language whose basic
operation is to search a set of files for patterns,
and to perform specified actions upon lines
or fields of lines which contain instances of
those patterns. Awk makes certain data selection
and transformation operations easy to express; for
example, the awk program
length > 72
prints all input lines whose length exceeds 72
characters; the program
NF % 2 == 0
prints all lines with an even number of fields;
and the program
{ $1 = log($1); print }
replaces the first field of each line by its loga-
rithm.
Awk patterns may include arbitrary boolean
combinations of regular expressions and of rela-
tional operators on strings, numbers, fields,
variables, and array elements. Actions may
include the same pattern-matching constructions as
in patterns, as well as arithmetic and string
expressions and assignments, if-else, while, for
statements, and multiple output streams.
This report contains a user's guide, a discussion
of the design and implementation of awk,
and some timing statistics.
September 1, 1978
1. Introduction
Awk is a programming language designed to make many
common information retrieval and text manipulation tasks
easy to state and to perform.
The basic operation of awk is to scan a set of input
lines in order, searching for lines which match any of a set
of patterns which the user has specified. For each pattern,
an action can be specified; this action will be performed on
each line that matches the pattern.
Readers familiar with the UNIX* program grep[1] will
recognize the approach, although in awk the patterns may be
more general than in grep, and the actions allowed are more
involved than merely printing the matching line.
* UNIX is a trademark of Bell Laboratories.
For example, the awk program
{print $3, $2}
prints the third and second columns of a table in that
order. The program
$2 ~ /A|B|C/
prints all input lines with an A, B, or C in the second
field. The program
$1 != prev { print; prev = $1 }
prints all lines in which the first field is different from
the previous first field.
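As a minimal sketch of the last program, assuming a POSIX shell with awk on the search path, the following runs it on a small invented table that is sorted on its first column. The sample data and the shell variable name out are illustrative only.

```shell
# Invented sample input: five lines whose first field takes three values.
# Only the first line of each run of equal first fields should be printed.
out=$(printf 'a 1\na 2\nb 3\nb 4\nc 5\n' |
      awk '$1 != prev { print; prev = $1 }')
echo "$out"
```

Because prev starts out as the null string, the very first line always differs from it and is printed.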
1.1. Usage
The command
awk program [files]
executes the awk commands in the string program on the set
of named files, or on the standard input if there are no
files. The statements can also be placed in a file pfile,
and executed by the command
awk -f pfile [files]
1.2. Program Structure
An awk program is a sequence of statements of the form:
pattern { action }
pattern { action }
...
Each line of input is matched against each of the patterns
in turn. For each pattern that matches, the associated
action is executed. When all the patterns have been tested,
the next line is fetched and the matching starts over.
Either the pattern or the action may be left out, but
not both. If there is no action for a pattern, the matching
line is simply copied to the output. (Thus a line which
matches several patterns can be printed several times.) If
there is no pattern for an action, then the action is per-
formed for every input line. A line which matches no pat-
tern is ignored.
Since patterns and actions are both optional, actions
must be enclosed in braces to distinguish them from pat-
terns.
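The matching rules above can be sketched with a two-statement program, assuming a POSIX shell and awk on the search path. The input lines are invented for illustration: the first matches both patterns, the second matches neither, and the third matches only the first pattern.

```shell
# First statement: pattern only, so matching lines are copied to the output.
# Second statement: pattern and action.
out=$(printf 'apple pie\nplain\napple\n' | awk '
/apple/
/pie/ { print "pie:", $0 }
')
echo "$out"
```

The line "apple pie" appears twice in the output, once for each pattern it matches, and the line "plain" does not appear at all.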
1.3. Records and Fields
Awk input is divided into ``records'' terminated by a
record separator. The default record separator is a newline,
so by default awk processes its input a line at a
time. The number of the current record is available in a
variable named NR.
Each input record is considered to be divided into
``fields.'' Fields are normally separated by white space -
blanks or tabs - but the input field separator may be
changed, as described below. Fields are referred to as $1,
$2, and so forth, where $1 is the first field, and $0 is the
whole input record itself. Fields may be assigned to. The
number of fields in the current record is available in a
variable named NF.
The variables FS and RS refer to the input field and
record separators; they may be changed at any time to any
single character. The optional command-line argument -Fc
may also be used to set FS to the character c.
If the record separator is empty, an empty input line
is taken as the record separator, and blanks, tabs and new-
lines are treated as field separators.
The variable FILENAME contains the name of the current
input file.
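A short sketch of these variables, assuming a POSIX shell and a present-day awk; the colon-separated input and the variable names fields and para are invented for illustration. The first command sets the field separator with -F and prints NR, NF and the first field; the second sets RS to the empty string so that a blank line separates records.

```shell
# Field separator set on the command line: prints record number,
# field count, and first field of each record.
fields=$(printf 'root:x:0\nbwk:x:101:10\n' | awk -F: '{ print NR, NF, $1 }')
echo "$fields"

# Empty record separator: records are separated by an empty line, and
# newlines count as field separators along with blanks and tabs.
para=$(printf 'a b\nc\n\nd\n' | awk 'BEGIN { RS = "" } { print NR ": " NF }')
echo "$para"
```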
1.4. Printing
An action may have no pattern, in which case the action
is executed for all lines. The simplest action is to print
some or all of a record; this is accomplished by the awk
command print. The awk program
{ print }
prints each record, thus copying the input to the output
intact. More useful is to print a field or fields from each
record. For instance,
print $2, $1
prints the first two fields in reverse order. Items
separated by a comma in the print statement will be
separated by the current output field separator when output.
Items not separated by commas will be concatenated, so
print $1 $2
runs the first and second fields together.
The predefined variables NF and NR can be used; for
example
{ print NR, NF, $0 }
prints each record preceded by the record number and the
number of fields.
Output may be diverted to multiple files; the program
{ print $1 >"foo1"; print $2 >"foo2" }
writes the first field, $1, on the file foo1, and the second
field on file foo2. The >> notation can also be used:
print $1 >>"foo"
appends the output to the file foo. (In each case, the out-
put files are created if necessary.) The file name can be a
variable or a field as well as a constant; for example,
print $1 >$2
uses the contents of field 2 as a file name.
Naturally there is a limit on the number of output
files; currently it is 10.
Similarly, output can be piped into another process (on
UNIX only); for instance,
print | "mail bwk"
mails the output to bwk.
The variables OFS and ORS may be used to change the
current output field separator and output record separator.
The output record separator is appended to the output of the
print statement.
Awk also provides the printf statement for output formatting:
printf format, expr, expr, ...
formats the expressions in the list according to the specif-
ication in format and prints them. For example,
printf "%8.2f %10ld\n", $1, $2
prints $1 as a floating point number 8 digits wide, with two
after the decimal point, and $2 as a 10-digit long decimal
number, followed by a newline. No output separators are
produced automatically; you must add them yourself, as in
this example. The version of printf is identical to that
used with C.[2]
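The difference between print and printf can be sketched as follows, assuming a POSIX shell and awk on the search path. The input line is invented, and the format uses %5d rather than the %10ld of the text, purely to keep the output short.

```shell
# print joins its comma-separated items with OFS and appends ORS;
# printf emits only what the format string asks for.
out=$(printf '3.14159 42\n' | awk '
BEGIN { OFS = "-" }
{
    print $2, $1
    printf "%8.2f|%5d\n", $1, $2
}')
echo "$out"
```

The first output line shows the two fields joined by the output field separator; the second shows that printf adds no separators of its own, so the newline must be written in the format.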
2. Patterns
A pattern in front of an action acts as a selector that
determines whether the action is to be executed. A variety
of expressions may be used as patterns: regular expressions,
arithmetic relational expressions, string-valued expres-
sions, and arbitrary boolean combinations of these.
2.1. BEGIN and END
The special pattern BEGIN matches the beginning of the
input, before the first record is read. The pattern END
matches the end of the input, after the last record has been
processed. BEGIN and END thus provide a way to gain control
before and after processing, for initialization and wrapup.
As an example, the field separator can be set to a
colon by
BEGIN { FS = ":" }
... rest of program ...
Or the input lines may be counted by
END { print NR }
If BEGIN is present, it must be the first pattern; END must
be the last if used.
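BEGIN and END can be combined in one program, as in this sketch (assuming a POSIX shell and awk on the search path; the input data is invented). BEGIN sets the field separator before any input is read, the middle statement runs once per record, and END reports the totals.

```shell
# Sum the second colon-separated field and count the records.
out=$(printf 'a:1\nb:2\nc:3\n' | awk '
BEGIN { FS = ":" }
      { sum += $2 }
END   { print NR, sum }
')
echo "$out"
```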
2.2. Regular Expressions
The simplest regular expression is a literal string of
characters enclosed in slashes, like
/smith/
This is actually a complete awk program which will print all
lines which contain any occurrence of the name ``smith''.
If a line contains ``smith'' as part of a larger word, it
will also be printed, as in
blacksmithing
Awk regular expressions include the regular expression
forms found in the UNIX text editor ed[1] and grep (without
back-referencing). In addition, awk allows parentheses for
grouping, | for alternatives, + for ``one or more'', and ?
for ``zero or one'', all as in lex. Character classes may
be abbreviated: [a-zA-Z0-9] is the set of all letters and
digits. As an example, the awk program
/[Aa]ho|[Ww]einberger|[Kk]ernighan/
will print all lines which contain any of the names ``Aho,''
``Weinberger'' or ``Kernighan,'' whether capitalized or not.
Regular expressions (with the extensions listed above)
must be enclosed in slashes, just as in ed and sed. Within
a regular expression, blanks and the regular expression
metacharacters are significant. To turn off the magic
meaning of one of the regular expression characters, precede
it with a backslash. An example is the pattern
/\/.*\//
which matches any string of characters enclosed in slashes.
One can also specify that any field or variable matches
a regular expression (or does not match it) with the opera-
tors ~ and !~. The program
$1 ~ /[jJ]ohn/
prints all lines where the first field matches ``john'' or
``John.'' Notice that this will also match ``Johnson'',
``St. Johnsbury'', and so on. To restrict it to exactly
[jJ]ohn, use
$1 ~ /^[jJ]ohn$/
The caret ^ refers to the beginning of a line or field; the
dollar sign $ refers to the end.
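The effect of the anchors can be seen by running both patterns on the same input. This is a sketch that assumes a POSIX shell and awk on the search path; the four names and the variables loose and exact are invented for illustration.

```shell
names='John\nJohnson\njohn\nMary\n'

# Unanchored: any first field that contains john or John.
loose=$(printf "$names" | awk '$1 ~ /[jJ]ohn/')

# Anchored at both ends: the first field must be exactly john or John.
exact=$(printf "$names" | awk '$1 ~ /^[jJ]ohn$/')

echo "$loose"
echo "$exact"
```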
2.3. Relational Expressions
An awk pattern can be a relational expression involving
the usual relational operators <, <=, ==, !=, >=, and >. An
example is
$2 > $1 + 100
which selects lines where the second field is more than 100
greater than the first field. Similarly,
NF % 2 == 0
prints lines with an even number of fields.
In relational tests, if neither operand is numeric, a
string comparison is made; otherwise it is numeric. Thus,
$1 >= "s"
selects lines that begin with an s, t, u, etc. In the
absence of any other information, fields are treated as
strings, so the program
$1 > $2
will perform a string comparison.
2.4. Combinations of Patterns
A pattern can be any boolean combination of patterns,
using the operators || (or), && (and), and ! (not). For
example,
$1 >= "s" && $1 < "t" && $1 != "smith"
selects lines where the first field begins with ``s'', but
is not ``smith''. && and || guarantee that their operands
will be evaluated from left to right; evaluation stops as
soon as the truth or falsehood is determined.
2.5. Pattern Ranges
The ``pattern'' that selects an action may also consist
of two patterns separated by a comma, as in
pat1, pat2 { ... }
In this case, the action is performed for each line between
an occurrence of pat1 and the next occurrence of pat2
(inclusive). For example,
/start/, /stop/
prints all lines between start and stop, while
NR == 100, NR == 200 { ... }
does the action for lines 100 through 200 of the input.
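Both kinds of range can be tried on a few invented lines, as in this sketch (assuming a POSIX shell and awk on the search path; the record numbers 2 and 3 stand in for the 100 and 200 of the text).

```shell
# Range delimited by regular expressions; both end lines are included.
words=$(printf 'a\nstart\nb\nstop\nc\n' | awk '/start/, /stop/')
echo "$words"

# Range delimited by record numbers.
lines=$(printf 'a\nb\nc\nd\n' | awk 'NR == 2, NR == 3 { print NR ": " $0 }')
echo "$lines"
```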
3. Actions
An awk action is a sequence of action statements terminated
by newlines or semicolons. These action statements
can be used to do a variety of bookkeeping and string
manipulating tasks.
3.1. Built-in Functions
Awk provides a ``length'' function to compute the
length of a string of characters. This program prints each
record, preceded by its length:
{print length, $0}
length by itself is a ``pseudo-variable'' which yields the
length of the current record; length(argument) is a function
which yields the length of its argument, as in the
equivalent
{print length($0), $0}
The argument may be any expression.
Awk also provides the arithmetic functions sqrt, log,
exp, and int, for square root, base e logarithm, exponential,
and integer part of their respective arguments.
The name of one of these built-in functions, without
argument or parentheses, stands for the value of the func-
tion on the whole record. The program
length < 10 || length > 20
prints lines whose length is less than 10 or greater than
20.
The function substr(s, m, n) produces the substring of
s that begins at position m (origin 1) and is at most n
characters long. If n is omitted, the substring goes to the
end of s. The function index(s1, s2) returns the position
where the string s2 occurs in s1, or zero if it does not.
The function sprintf(f, e1, e2, ...) produces the value
of the expressions e1, e2, etc., in the printf format speci-
fied by f. Thus, for example,
x = sprintf("%8.2f %10ld", $1, $2)
sets x to the string produced by formatting the values of $1
and $2.
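The string functions can be exercised without any input by placing the calls in a BEGIN action. This is a sketch assuming a POSIX shell and awk on the search path; the test string is borrowed from section 2.2, and the %5d format replaces the %10ld of the text only to keep the result short.

```shell
out=$(awk 'BEGIN {
    s = "blacksmithing"
    print length(s)            # number of characters in s
    print substr(s, 6, 5)      # five characters starting at position 6
    print substr(s, 11)        # from position 11 to the end of s
    print index(s, "smith")    # position of the first occurrence
    print index(s, "xyz")      # zero, since "xyz" does not occur
    x = sprintf("%8.2f %5d", 3.14159, 42)
    print "[" x "]"            # brackets make the padding visible
}')
echo "$out"
```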
3.2. Variables, Expressions, and Assignments
Awk variables take on numeric (floating point) or
string values according to context. For example, in
x = 1
x is clearly a number, while in
x = "smith"
it is clearly a string. Strings are converted to numbers
and vice versa whenever context demands it. For instance,
x = "3" + "4"
assigns 7 to x. Strings which cannot be interpreted as
numbers in a numerical context will generally have numeric
value zero, but it is unwise to count on this behavior.
By default, variables (other than built-ins) are ini-
tialized to the null string, which has numerical value zero;
this eliminates the need for most BEGIN sections. For exam-
ple, the sums of the first two fields can be computed by
{ s1 += $1; s2 += $2 }
END { print s1, s2 }
Arithmetic is done internally in floating point. The
arithmetic operators are +, -, *, /, and % (mod). The C
increment ++ and decrement -- operators are also available,
and so are the assignment operators +=, -=, *=, /=, and %=.
These operators may all be used in expressions.
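The following sketch pulls these points together, assuming a POSIX shell and awk on the search path; the three input lines are invented. The sums s1 and s2 are never initialized, the strings "3" and "4" are converted to numbers by the + operator, and % gives the remainder.

```shell
out=$(printf '1 10\n2 20\n3 30\n' | awk '
    { s1 += $1; s2 += $2 }    # s1 and s2 start as null, that is, zero
END {
    x = "3" + "4"             # string operands converted to numbers
    print s1, s2, x, 7 % 3
}')
echo "$out"
```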
3.3. Field Variables
Fields in awk share essentially all of the properties
of variables - they may be used in arithmetic or string
operations, and may be assigned to. Thus one can replace
the first field with a sequence number like this:
{ $1 = NR; print }
or accumulate two fields into a third, like this:
{ $1 = $2 + $3; print $0 }
or assign a string to a field:
{ if ($3 > 1000)
      $3 = "too big"
  print
}
which replaces the third field by ``too big'' when it is,
and in any case prints the record.
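Two of these assignments can be combined in one action, as in this sketch (assuming a POSIX shell and awk on the search path; the two input lines are invented). The first field is overwritten with the sum of the other two, and the third is replaced by a string when it exceeds 1000.

```shell
out=$(printf 'x 2 3\ny 5 2000\n' | awk '{
    $1 = $2 + $3
    if ($3 > 1000)
        $3 = "too big"
    print
}')
echo "$out"
```

In the second record the sum is computed before the third field is replaced, so the first field still holds 2005.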
Field references may be numerical expressions, as in
{ print $i, $(i+1), $(i+n) }
Whether a field is deemed numeric or string depends on
context; in ambiguous cases like
if ($1 == $2) ...
fields are treated as strings.
Each input line is split into fields automatically as
necessary. It is also possible to split any variable or
string into fields:
n = split(s, array, sep)
splits the string s into array[1], ..., array[n]. The
number of elements found is returned. If the sep argument
is provided, it is used as the field separator; otherwise FS
is used as the separator.
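A brief sketch of split, assuming a POSIX shell and awk on the search path. The date string and the array name part are invented for illustration.

```shell
# Split a string on "-" and print the element count plus two elements.
out=$(awk 'BEGIN {
    n = split("1978-09-01", part, "-")
    print n, part[1], part[3]
}')
echo "$out"
```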
3.4. String Concatenation
Strings may be concatenated. For example
length($1 $2 $3)
returns the length of the first three fields. Or in a print
statement,
print $1 " is " $2
prints the two fields separated by `` is ''. Variables and
numeric expressions may also appear in concatenations.
3.5. Arrays
Array elements are not declared; they spring into
existence by being mentioned. Subscripts may have any
non-null value, including non-numeric strings. As an example of
a conventional numeric subscript, the statement
x[NR] = $0
assigns the current input record to the NR-th element of the
array x. In fact, it is possible in principle (though
perhaps slow) to process the entire input in a random order
with the awk program
{ x[NR] = $0 }
END { ... program ... }
The first action merely records each input line in the array
x.
Array elements may be named by non-numeric values,
which gives awk a capability rather like the associative
memory of Snobol tables. Suppose the input contains fields
with values like apple, orange, etc. Then the program
/apple/ { x["apple"]++ }
/orange/ { x["orange"]++ }
END { print x["apple"], x["orange"] }
increments counts for the named array elements, and prints
them at the end of the input.
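Run on a few invented lines, the counting program behaves as in this sketch (assuming a POSIX shell and awk on the search path). The line "pear" matches neither pattern and is not counted.

```shell
out=$(printf 'apple\norange\napple\npear\napple\n' | awk '
/apple/  { x["apple"]++ }
/orange/ { x["orange"]++ }
END      { print x["apple"], x["orange"] }
')
echo "$out"
```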
3.6. Flow-of-Control Statements
Awk provides the basic flow-of-control statements if-else,
while, for, and statement grouping with braces, as in
C. We showed the if statement in section 3.3 without
describing it. The condition in parentheses is evaluated;
if it is true, the statement following the if is done. The
else part is optional.
The while statement is exactly like that of C. For
example, to print all input fields one per line,
i = 1
while (i <= NF) {
    print $i
    ++i
}
The for statement is also exactly that of C:
for (i = 1; i <= NF; i++)
    print $i
does the same job as the while statement above.
There is an alternate form of the for statement which
is suited for accessing the elements of an associative
array:
for (i in array)
    statement
does statement with i set in turn to each subscript of array.
The elements are accessed in an apparently random order.
Chaos will ensue if i is altered, or if any new elements are
accessed during the loop.
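A sketch of this form of the for statement, assuming a POSIX shell with awk and sort on the search path; the input words and the array name count are invented. Since the order of traversal is not defined, the output is passed through sort to make it predictable.

```shell
# Count occurrences of each first field, then walk the array of counts.
out=$(printf 'apple\norange\napple\n' | awk '
    { count[$1]++ }
END { for (k in count) print k, count[k] }
' | sort)
echo "$out"
```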
The expression in the condition part of an if, while or
for can include relational operators like <, <=, >, >=, ==
(``is equal to''), and != (``not equal to''); regular
expression matches with the match operators ~ and !~; the
logical operators ||, &&, and !; and of course parentheses
for grouping.
The break statement causes an immediate exit from an
enclosing while or for; the continue statement causes the
next iteration to begin.
The statement next causes awk to skip immediately to
the next record and begin scanning the patterns from the
top. The statement exit causes the program to behave as if
the end of the input had occurred.
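The two statements can be seen together in this sketch, which assumes a POSIX shell and a present-day awk; the input and the word "quit" used as a stop marker are invented for illustration. Lines beginning with # are skipped by next, and exit stops reading at the marker while still running the END action.

```shell
out=$(printf '# comment\na\nb\nquit\nc\n' | awk '
/^#/   { next }           # go straight to the next record
/quit/ { exit }           # behave as if the input had ended here
       { n++; print }
END    { print n " lines" }
')
echo "$out"
```

The final input line is never read, so only two lines are counted.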
Comments may be placed in awk programs: they begin with
the character # and end with the end of the line, as in
print x, y # this is a comment
4. Design
The UNIX system already provides several programs that
operate by passing input through a selection mechanism.
Grep, the first and simplest, merely prints all lines which
match a single specified pattern. Egrep provides more general
patterns, i.e., regular expressions in full generality;
fgrep searches for a set of keywords with a particularly
fast algorithm. Sed[1] provides most of the editing facilities
of the editor ed, applied to a stream of input. None
of these programs provides numeric capabilities, logical
relations, or variables.
Lex[3] provides general regular expression recognition
capabilities, and, by serving as a C program generator, is
essentially open-ended in its capabilities. The use of lex,
however, requires a knowledge of C programming, and a lex
program must be compiled and loaded before use, which
discourages its use for one-shot applications.
Awk is an attempt to fill in another part of the matrix
of possibilities. It provides general regular expression
capabilities and an implicit input/output loop. But it also
provides convenient numeric processing, variables, more gen-
eral selection, and control flow in the actions. It does
not require compilation or a knowledge of C. Finally, awk
provides a convenient way to access fields within lines; it
is unique in this respect.
Awk also tries to integrate strings and numbers completely,
by treating all quantities as both string and
numeric, deciding which representation is appropriate as
late as possible. In most cases the user can simply ignore
the differences.
Most of the effort in developing awk went into deciding
what awk should or should not do (for instance, it doesn't
do string substitution) and what the syntax should be (no
explicit operator for concatenation) rather than on writing
or debugging the code. We have tried to make the syntax
powerful but easy to use and well adapted to scanning files.
For example, the absence of declarations and the use of implicit
initializations, while probably a bad idea for a general-purpose
programming language, are desirable in a language
that is meant to be used for tiny programs that may even be
composed on the command line.
In practice, awk usage seems to fall into two broad
categories. One is what might be called ``report genera-
tion'' - processing an input to extract counts, sums, sub-
totals, etc. This also includes the writing of trivial data
validation programs, such as verifying that a field contains
only numeric information or that certain delimiters are
properly balanced. The combination of textual and numeric
processing is invaluable here.
A second area of use is as a data transformer, convert-
ing data from the form produced by one program into that
expected by another. The simplest examples merely select
fields, perhaps with rearrangements.
5. Implementation
The actual implementation of awk uses the language
development tools available on the UNIX operating system.
The grammar is specified with yacc;[4] the lexical analysis
is done by lex; the regular expression recognizers are
deterministic finite automata constructed directly from the
expressions. An awk program is translated into a parse tree
which is then directly executed by a simple interpreter.
Awk was designed for ease of use rather than processing
speed; the delayed evaluation of variable types and the
necessity to break input into fields make high speed difficult
to achieve in any case. Nonetheless, the program has
not proven to be unworkably slow.
Table I below shows the execution (user + system) time
on a PDP-11/70 of the UNIX programs wc, grep, egrep, fgrep,
sed, lex, and awk on the following simple tasks:
1. count the number of lines.
2. print all lines containing ``doug''.
3. print all lines containing ``doug'', ``ken'' or
``dmr''.
4. print the third field of each line.
5. print the third and second fields of each line, in that
order.
6. append all lines containing ``doug'', ``ken'', and
``dmr'' to files ``jdoug'', ``jken'', and ``jdmr'',
respectively.
7. print each line prefixed by ``line-number : ''.
8. sum the fourth column of a table.
The program wc merely counts words, lines and characters in
its input; we have already mentioned the others. In all
cases the input was a file containing 10,000 lines as
created by the command ls -l; each line has the form
-rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx
The total length of this input is 452,960 characters. Times
for lex do not include compile or load.
As might be expected, awk is not as fast as the specialized
tools wc, sed, or the programs in the grep family,
but is faster than the more general tool lex. In all cases,
the tasks were about as easy to express as awk programs as
programs in these other languages; tasks involving fields
were considerably easier to express as awk programs. Some
of the test programs are shown in awk, sed and lex.
References
1. K. Thompson and D. M. Ritchie, UNIX Programmer's
Manual, Sixth Edition, Bell Laboratories, May 1975.
2. B. W. Kernighan and D. M. Ritchie, The C Programming
Language, Prentice-Hall, Englewood Cliffs, New Jersey,
1978.
3. M. E. Lesk, "Lex - A Lexical Analyzer Generator," Comp.
Sci. Tech. Rep. No. 39, Bell Laboratories, Murray Hill,
New Jersey, October 1975.
4. S. C. Johnson, "Yacc - Yet Another Compiler-Compiler,"
Comp. Sci. Tech. Rep. No. 32, Bell Laboratories, Murray
Hill, New Jersey, July 1975.
                                  Task
Program |     1 |     2 |     3 |     4 |     5 |     6 |     7 |     8 |
________|_______|_______|_______|_______|_______|_______|_______|_______|
wc      |   8.6 |       |       |       |       |       |       |       |
grep    |  11.7 |  13.1 |       |       |       |       |       |       |
egrep   |   6.2 |  11.5 |  11.6 |       |       |       |       |       |
fgrep   |   7.7 |  13.8 |  16.1 |       |       |       |       |       |
sed     |  10.2 |  11.6 |  15.8 |  29.0 |  30.5 |  16.1 |       |       |
lex     |  65.1 | 150.1 | 144.2 |  67.7 |  70.3 | 104.0 |  81.7 |  92.8 |
awk     |  15.0 |  25.6 |  29.9 |  33.3 |  38.9 |  46.4 |  71.4 |  31.1 |
________|_______|_______|_______|_______|_______|_______|_______|_______|
Table I. Execution Times of Programs. (Times are in sec.)
The programs for some of these jobs are shown below. The lex
programs are generally too long to show.

AWK:
1. END {print NR}
2. /doug/
3. /ken|doug|dmr/
4. {print $3}
5. {print $3, $2}
6. /ken/ {print >"jken"}
   /doug/ {print >"jdoug"}
   /dmr/ {print >"jdmr"}
7. {print NR ": " $0}
8. {sum = sum + $4}
   END {print sum}

SED:
1. $=
2. /doug/p
3. /doug/p
   /doug/d
   /ken/p
   /ken/d
   /dmr/p
   /dmr/d
4. /[^ ]* [ ]*[^ ]* [ ]*\([^ ]*\) .*/s//\1/p
5. /[^ ]* [ ]*\([^ ]*\) [ ]*\([^ ]*\) .*/s//\2 \1/p
6. /ken/w jken
   /doug/w jdoug
   /dmr/w jdmr

LEX:
1. %{
   int i;
   %}
   %%
   \n i++;
   . ;
   %%
   yywrap() {
       printf("%d\n", i);
   }
2. %%
   ^.*doug.*$ printf("%s\n", yytext);
   . ;
   \n ;