AWK AWK
NAME
awk - pattern scanning and processing language
SYNOPSIS
awk [-ffile] [-Fstr] [-t] [-l] [program] [var=text] [file] ...
DESCRIPTION
Awk scans each input file for lines that match any of a set of patterns
specified in the program. With each pattern in the program there can be an
associated action that will be performed when a line of a file matches the
pattern.
The AWK program may be specified as a file with the -f option as:
-ffilename.ext
-f filename.ext
in which case the AWK program is read from the named file. If the file does
not exist then an error message will be printed.
The AWK program may also be specified as a single argument as:
filename.ext
filename[.awk]
or as a valid AWK program:
{ for (i in ARGV) printf ("%d: %s\n", i, ARGV[i]) }
AWK will first try to open the first argument as a file. If it can't open the
file, it adds the extension ".awk" and tries again. Finally, AWK will attempt
to read the argument directly as an AWK program.
If the filename is a minus sign (-) then the AWK program is read from the
standard input. The program may then be terminated with either a ctrl-Z or
a period (.) on a line by itself. The second method is useful for entering
an AWK program followed by the data for the program. If no program file is
specified then the program is read from standard input.
If the -f option is selected the full path/name/extension must be specified.
If only the filename is specified AWK will first attempt to open the named
file, then the file with the extension ".AWK", finally AWK will attempt to
parse the parameter as a program. Multiple -f options may be used to get
the program source from many files.
Files are read in order; the file name '-' means standard input. Each line
is matched against the pattern portion of every pattern-action statement;
the associated action is performed for each matched pattern.
If a file name has the form variable=value, program variables may be changed
before a file is read. The assignment takes place when the argument would be
treated as the next file to read. Any assignments before the first file
take place before the first BEGIN block is executed. An assignment after
the last file will occur before any END block unless an exit was performed.
awk "{ print code, NR, $0 }" code=10 file1 code=75 file2
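As a rough sketch of the assignment timing described above, the following runs
the same one-line program over two temporary files. It assumes a POSIX shell
and a current awk (such as gawk or mawk) rather than this DOS version, hence
the single quotes. The function name tag_records and the file names are made
up for the illustration.

```sh
# Hypothetical helper: shows that "code" changes between the two files.
tag_records() {
  dir=$(mktemp -d) || return 1
  printf 'alpha\n' > "$dir/file1"
  printf 'beta\n'  > "$dir/file2"
  # code=10 is assigned before file1 is read, code=75 before file2
  awk '{ print code, NR, $0 }' code=10 "$dir/file1" code=75 "$dir/file2"
  rm -rf "$dir"
}
tag_records
```

Under those assumptions the two lines printed are "10 1 alpha" and
"75 2 beta". NR keeps counting across files while code changes with each
assignment.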
If no files are specified the input is read from standard input.
An input line is made up of fields separated by the field separator FS.
The fields are denoted by $1, $2 ...; $0 denotes the entire line:
$0 = "now is the time"
$1 = "now" $2 = "is"
$3 = "the" $4 = "time"
with the default FS (white space). If the field separator is set to comma (,)
with "-F," on the command line then the fields might be:
$0 = "a, b,c,, ,"
$1 = "a" $2 = " b" $3 = "c"
$4 = "" $5 = " " $6 = ""
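The comma example can be checked with a short script. This is only a sketch,
assuming a POSIX shell and a current awk. show_fields is an illustrative name
that does not come from this manual.

```sh
# Hypothetical helper: print NF and every field in brackets.
show_fields() {
  printf '%s\n' 'a, b,c,, ,' |
    awk -F, '{
      printf "%d:", NF
      for (i = 1; i <= NF; i++) printf " [%s]", $i
      print ""
    }'
}
show_fields
```

With a comma as separator, blanks are kept as part of the fields and empty
fields are counted, so this prints "6: [a] [ b] [c] [] [ ] []".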
A pattern-action statement has the form:
pattern { action }
A missing { action } has the same effect as { print $0 }; a missing pattern
always matches.
Pattern-Actions are separated by semicolons or newlines. A statement may be
continued on the next line by putting a backslash (\) at the end of the line.
{ words += NF }; END { print words }
A pattern is a test that is performed on each input line. If the pattern
matches the input line then the corresponding action is performed.
Patterns come in several forms:
Form          Example          Meaning
BEGIN         BEGIN {N=1}      initialize N before input is read
END           END {print N}    print N after all input is read
function      function x(y)    define a function called x
text match    /stop/           line contains the string "stop"
expression    $1 == 3          first field is the number 3
compound      /x/ && NF > 2    more than two fields and contains "x"
range         NR==10,NR==20    records ten through twenty inclusive
BEGIN and END patterns are special patterns that match before any files are
read and after all files have been read, respectively. There may be multiple
occurrences of these patterns, and the associated actions are executed in the
order that they occur.
If there is only a series of BEGIN blocks in the awk program and no other
pattern/action blocks except function declarations then no input files are
read. If only END blocks are defined then all the files are read and NR will
be set to the number of records in all the files.
BEGIN { page = 5 }
A function pattern is never matched and serves to declare a user defined
function. You can declare a function with more parameters than are
passed as arguments so that the extra parameters can act as local
variables.
function show(a, i) { for (i in a) print a[i] }
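The extra-parameter idiom can be sketched as follows, assuming a POSIX shell
and a current awk. The names sum_fields, sum, total and i are chosen for the
illustration only.

```sh
# Hypothetical helper: sum the fields of each input line.
sum_fields() {
  awk '
    # total and i are extra parameters, so they are local to sum()
    function sum(n,   total, i) {
      for (i = 1; i <= n; i++) total += $i
      return total
    }
    # the global i is left untouched by the call
    { i = "outer"; print sum(NF), i }
  '
}
printf '1 2 3\n' | sum_fields
```

This prints "6 outer": the loop counter inside the function does not disturb
the global variable of the same name.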
A regular expression by itself is matched against the input record ($0). That
is "/abc/" is equivalent to "$0 ~ /abc/".
Any expression will match if it evaluates to a nonzero number or a non-null
string. Also any logical combination of expressions and regular expressions
may be used as a pattern.
FILENAME != oldname && FILENAME != "skip"
The last special pattern is two patterns separated by a comma. This pattern
specifies a range of records that match the pattern. The pattern starts to
match when the first pattern matches and stops matching when the second
pattern matches. If they both match on the same input record then only that
record will match the pattern.
/AUTHOR/,/NOTES/
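A small sketch of the range pattern on made-up input, assuming a POSIX shell
and a current awk. author_section is an illustrative name.

```sh
# Hypothetical helper: print the records from AUTHOR through NOTES.
author_section() { awk '/AUTHOR/,/NOTES/'; }

# Start and end on different records: three lines are printed.
printf 'NAME\nAUTHOR\nRob Duff\nNOTES\nLIMITS\n' | author_section

# Start and end on the same record: only that record is printed.
printf 'NAME\nAUTHOR and NOTES\nLIMITS\n' | author_section
```

The first call prints AUTHOR, Rob Duff and NOTES. The second prints only the
line "AUTHOR and NOTES".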
An action is a sequence of statements that are performed when a pattern
matches.
A statement can be one of the following:
{ STATEMENT_LIST }
EXPRESSION
print EXPRESSION_LIST
printf FORMAT, EXPRESSION_LIST
if ( EXPRESSION ) STATEMENT [ else STATEMENT ]
for ( VARIABLE in ARRAY ) STATEMENT
for ( EXPRESSION; EXPRESSION; EXPRESSION) STATEMENT
while ( EXPRESSION ) STATEMENT
do STATEMENT while ( EXPRESSION )
break
continue
next
delete ARRAY[SUBSCRIPT]
exit [ EXPRESSION ]
return [ EXPRESSION ]
A STATEMENT_LIST is a list of statements separated by newlines or semicolons.
As with pattern-actions, statements may be extended over more than one line
with backslash (\).
{
    print "value:", i, \
          "number:", j
    i = i + $3; j++
}
Expressions take on string or numeric values depending on the operators.
There is only one string operator, concatenation, indicated by adjacent
expressions. The following are the operators in order of increasing
precedence:
Operation          Operator        Example    Meaning
assignment         = *= /= %=      x += 2     two is added to x
                   += -= ^=
conditional        ?:              x?y:z      if x then y else z
logical OR         ||              x||y       if (x) 1 else if (y) 1 else 0
logical AND        &&              x&&y       if (x) if (y) 1 else 0 else 0
array membership   in              x in y     if (exists(y[x])) 1 else 0
matching           ~ !~            $1~/x/     if ($1 contains x) 1 else 0
relational         == != >         x==y       if (x equals y) 1 else 0
                   <= >= <
concatenation                      "x" "y"    a new string "xy"
add, subtract      + -             x+y        sum of x and y
mul, div, mod      * / %           x*y        product of x and y
unary plus minus   + -             -x         negative of x
logical not        !               !x         if (x is 0 or null) 1 else 0
exponentiation     ^               x^y        x to the yth power
inc, dec           ++ --           x++        x then add 1 to x
field              $               $3         the 3rd field
grouping           ()              ($1)++     increment the 1st field
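A few entries of the precedence table can be exercised with a short sketch,
assuming a POSIX shell and a current awk. The function and variable names are
illustrative only.

```sh
# Hypothetical helper: three expressions whose value depends on precedence.
precedence_demo() {
  awk 'BEGIN {
    joined = 1 " " 2 + 3          # + binds tighter than concatenation
    same   = ("a" "b" == "ab")    # concatenation binds tighter than ==
    power  = 2 * 3 ^ 2            # ^ binds tighter than *
    print joined
    print same
    print power
  }'
}
precedence_demo
```

The three lines printed are "1 5", "1" and "18".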
Variables may be scalars, array elements (denoted x[i]) or fields (denoted
$expression). Variable names begin with a letter or underscore and may
contain any number of letters, digits, or underscores.
Variables are initialized to both zero and the null string. Fields and the
command line arguments will be both string and numeric if they can be
completely represented as numbers. The range for numbers is 1E-306..1E306.
Array subscripts may be any string. Multi dimensional arrays are simulated in
AWK by concatenating the individual indexes with the subscript separator
between them. So array[1,1] is equivalent to array[1 SUBSEP 1]. Individual
array elements may be removed with the delete statement, and the whole array
erased with an assignment to the bare variable.
delete a[i] # delete one element
a = "" # delete all elements
Simply referencing an array element will cause it to be created and
initialized. To avoid creating unwanted elements use the in operator.
if (i in a) print a[i] # print one element (if it exists)
for (i in a) print a[i] # print all elements (that exist)
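The subscript rules above can be sketched as follows, assuming a POSIX shell
and a current awk. Only the delete statement and the in operator are used,
since assigning to a bare array name is specific to this implementation. The
names array_demo, count, grid and probe are made up.

```sh
# Hypothetical helper: shows SUBSEP keys and element creation by reference.
array_demo() {
  awk '
    # k and n are extra parameters used as locals
    function count(arr,   k, n) { for (k in arr) n++; return n + 0 }
    BEGIN {
      grid[1,1] = "x"                  # stored under the key 1 SUBSEP 1
      key = 1 SUBSEP 1
      if (key in grid) print "found"
      if (!("missing" in grid)) print "absent"   # the test creates nothing
      print count(grid)
      probe = grid["probe"]            # a bare reference creates the element
      print count(grid)
      delete grid["probe"]
      print count(grid)
    }'
}
array_demo
```

The output is found, absent, 1, 2, 1: the in test leaves the array alone, the
plain reference adds an element, and delete removes it again.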
Comparison will be numeric if both operands are numeric; otherwise a string
comparison will be made. Operands will be coerced to strings if necessary.
Uninitialized variables will compare as numeric if the other operand is
numeric or uninitialized. E.g. 2 > "10" (string comparison) and 2 < 10
(numeric comparison) are both true.
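The comparison rules can be sketched like this, assuming a POSIX shell and a
current awk. The comparisons are parenthesized and assigned first, because a
bare > after print would be read as output redirection. All names are
illustrative.

```sh
# Hypothetical helper: string versus numeric comparison.
compare_demo() {
  printf '2 10\n' | awk '{
    as_string = (2 > "10")    # "10" is a string constant: string comparison
    as_number = (2 < 10)      # both numeric: numeric comparison
    as_fields = ($1 > $2)     # fields that look like numbers compare numerically
    print as_string, as_number, as_fields
  }'
}
compare_demo
```

This prints "1 1 0": the string comparison 2 > "10" is true, while the
same-looking comparison on the numeric fields 2 and 10 is false.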
There are a number of built-in variables:
Variable   Meaning                                      Default
ARGC       number of command line arguments             -
ARGV       array of command line arguments              -
FILENAME   name of current input file                   -
FNR        record number in current file                -
FS         controls the input field separator           " "
NF         number of fields in current record           -
NR         number of records read so far                -
OFMT       output format for numbers                    "%.6g"
OFS        output field separator                       " "
ORS        output record separator                      "\n"
RLENGTH    length of string matched by match function   -
RS         controls input record separator              "\n"
RSTART     start of string matched by match function    -
SUBSEP     subscript separator                          "\034"
ARGC and ARGV are the count and values of the command line arguments. ARGV[0]
is the full path/name of AWK.EXE, and the rest are all the command line
arguments except the "-F", "-f" and program arguments which are used by AWK.
The field separator is a string that is interpreted as a regular expression.
A single space has a special meaning: it is changed to /[ \t]+/ and any leading
spaces or tabs are removed. A BEGIN action may be used to set the separator
or it may be set by using the -F command line option.
BEGIN { FS = "," } sets FS to a single comma
"-F[ ]" sets FS to a single space
The record separator is a string that is either a newline or the null string.
If the record separator RS is set to the null string then multi line records
may be read. In this case the record separator is an empty line. Setting RS
to "\n" will restore the default behavior.
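Multi-line records can be sketched as below, assuming a POSIX shell and a
current awk. The input text and the name first_words are made up.

```sh
# Hypothetical helper: treat blank-line separated blocks as records.
first_words() {
  awk 'BEGIN { RS = "" } { print NR, $1 }'
}
printf 'a b\nc\n\nd e\n' | first_words
```

The two blocks separated by the empty line are read as two records, so this
prints "1 a" and "2 d".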
There are a number of built in functions:
Function             Value returned               Notes
atan2(y, x)          arctangent of y/x            in the range -pi to pi
cos(x)               cosine of x                  x in radians
exp(x)               exponentiation of x          (e ^ x)
gsub(r, s)           number of substitutions      substitute s for all r in $0
gsub(r, s, t)        number of substitutions      substitute s for all r in t
index(s)             position of s in $0          0 if not in $0
index(s, t)          position of t in s           0 if not in s
int(x)               integer part of x
length(s)            number of characters in s
log(x)               natural log of x
match(s, r)          position of r in s or 0      sets RSTART and RLENGTH
rand()               random number                0 <= rand < 1
sin(x)               sine of x                    x in radians
split(s, a)          number of fields             split s into a on FS
split(s, a, fs)      number of fields             split s into a on fs
sprintf(f, e, ...)   formatted string
sqrt(x)              square root of x
sub(r, s)            number of substitutions      substitute s for one r in $0
sub(r, s, t)         number of substitutions      substitute s for one r in t
substr(s, p)         substring of s from p to end
substr(s, p, n)      substring of s from p of length n
system(s)            exit status                  execute command s
The numeric procedure srand(x) sets a new seed for the random number
generator. srand() sets the seed from the system time.
The regular expression arguments of sub, gsub, and match may be either regular
expressions delimited by slashes or any expression. The expression is coerced
to a string and the resulting string is converted into a regular expression.
This coercion and conversion occurs every time the procedure is called, so the
regular expression form will always be faster.
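A brief sketch of some of the string functions, assuming a POSIX shell and a
current awk. The sample string and the name string_functions are
illustrative.

```sh
# Hypothetical helper: gsub, match, substr and two-argument index.
string_functions() {
  awk 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "0", s)              # replaces every o in s, returns the count
    print n, s
    if (match(s, /w0r/)) print RSTART, RLENGTH
    print substr(s, 1, 5), index(s, "w")
  }'
}
string_functions
```

The output is "2 hell0 w0rld", then "7 3", then "hell0 7".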
The print and printf statements come in several forms:
Form                               Meaning
print                              print $0 on standard output
print expression, ...              prints expressions separated by OFS
print(expression, ...)
printf format, expression, ...
printf(format, expression, ...)
print >"file"                      print $0 on file "file"
print >>"file"                     append $0 to file "file"
printf(format, ...) >"file"
printf(format, ...) >>"file"
close("file")                      close the file "file"
The print statement prints its arguments on the standard output, or the
specified file, separated by the current output field separator, and
terminated by the output record separator. The printf statement formats its
expression-list according to the format. The file is only opened once unless
it is closed between executions of the print statement. A file that is open
for output must be closed if it is to be used for input. The "file" argument
may be any expression that evaluates to a DOS file name.
There is one function that is used for input. It has several forms:
Form                Meaning
getline             read the next record into $0
getline s           read the next record into s
getline <"file"     read a record from file "file" into $0
getline s <"file"   read a record from file "file" into s
getline returns -1 if there is an error (such as non existent file), 0 on
end of file and 1 otherwise. The pipe form mentioned in the book is not
implemented in this version.
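Writing a file and reading it back with getline might look like the sketch
below. It assumes a POSIX shell and a current awk. The -v option used to pass
the file name is a feature of such awks and is not listed in the synopsis of
this version. The temporary file and the name write_then_read are made up.

```sh
# Hypothetical helper: write two lines, close the file, read them back.
write_then_read() {
  tmp=$(mktemp) || return 1
  awk -v f="$tmp" 'BEGIN {
    print "one" > f
    print "two" > f          # the file is still open, so this follows "one"
    close(f)                 # must be closed before it is used for input
    while ((getline line < f) > 0) n++
    print n
    # -1 signals an error such as a non existent file
    if ((getline line < "/nonexistent/file") < 0) print "error"
  }'
  rm -f "$tmp"
}
write_then_read
```

This prints 2, the number of records read back, followed by "error" for the
file that does not exist.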
The for ( i in a ) statement assigns to i the indexes of a for all elements
in a. The while (), do while (), and for (;;) statements are as in C, as are
break and continue.
The next statement stops processing the pattern-action statements and reads
in the next record. An exit will cause the END actions to be performed or, if
encountered in an END action, will cause termination of the program. The
optional expression is returned as the exit status unless overridden by a
further exit statement in an END action.
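The interplay of next, exit and END can be sketched as follows, assuming a
POSIX shell and a current awk. The input, the status value 3 and the name
stop_early are chosen for the illustration.

```sh
# Hypothetical helper: skip comments, stop at the first "stop" record.
stop_early() {
  awk '
    /^#/          { next }       # skip the remaining rules for this record
    $1 == "stop"  { exit 3 }     # END still runs, then awk exits with 3
                  { print }
    END           { print "records:", NR }
  '
}
printf '# comment\na\nstop\nb\n' | stop_early || echo "status: $?"
```

Only "a" is printed from the input. The END action then reports
"records: 3", and the shell sees the exit status 3.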
The return statement may be used only in function declarations. It may have
an optional value to return as the value of the function. The value of a
function defaults to zero/null (0/"").
REGULAR EXPRESSIONS
A \ followed by a single character matches that character.
The ^ matches the beginning of the string.
The $ matches the end of the string.
A . matches any character.
A single character with no special meaning matches that character.
A string enclosed in brackets [] matches any single character in that string.
Ranges of ASCII character codes may be abbreviated as 'a-z0-9'. A right
bracket ] may occur only as the first character of the string. A literal -
must be placed where it can't be mistaken as a range indicator. If the first
character is the caret ^ then any character not in the string will match.
A regular expression followed by * matches a sequence of 0 or more
matches of the regular expression.
A regular expression followed by + matches a sequence of 1 or more
matches of the regular expression.
A regular expression followed by ? matches a sequence of 0 or 1
matches of the regular expression.
Two adjacent (concatenated) regular expressions match a match of the first
followed by a match of the second.
Two regular expressions separated by | match either a match for the
first or a match for the second.
A regular expression enclosed in parentheses matches a match for the
regular expression.
The order of precedence of operators at the same parenthesis level is
[] then *+? then concatenation then |.
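The effect of the precedence rule on alternation can be sketched as below,
assuming a POSIX shell and a current awk. The sample lines and function names
are made up.

```sh
# Hypothetical helpers: alternation without and with parentheses.
anchored_either() { awk '/^ab|cd$/'; }     # starts with ab, or ends with cd
anchored_both()   { awk '/^(ab|cd)$/'; }   # the whole line is ab or cd

printf 'abxx\nxxcd\nxabx\ncd\n' | anchored_either
printf 'abxx\nxxcd\nxabx\ncd\n' | anchored_both
```

Because | has the lowest precedence, the first form prints abxx, xxcd and cd,
while the parenthesized form prints only cd.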
PRINTF FORMAT
Any character except % and \ is printed as that character.
A \ followed by up to three octal digits is the ASCII character
represented by that number.
A \ followed by n, t, r, b, f, v, or p is newline, tab, return, backspace,
form feed, vertical tab, or escape.
%[-][number][.number][l][c|d|E|e|F|f|G|g|o|s|X|x|%] prints an expression:
The optional leading - means left justified in the field
The optional first number is the field width
The optional . and second number is the precision
The optional l denotes a long expression
The final character denotes the form of the expression
c character
d decimal
e exponential floating point
f fixed, or exponential floating point
g decimal, fixed, or exponential floating point
o octal
s string
x hexadecimal
An upper case E, F, or G denotes use of upper case E in exponential format.
An upper case X denotes hexadecimal in upper case.
Two percent characters (%%) will print as one.
A format will match the regular expression:
/[^%]*(%(%|(-?([0-9]+)?(\.[0-9]+)?l?[cdEeFfGgosXx]))[^%]*)*/
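A short sketch of several conversions at once, assuming a POSIX shell and a
current awk. The values and the name format_demo are arbitrary.

```sh
# Hypothetical helper: justification, width, precision, hex and %%.
format_demo() {
  awk 'BEGIN { printf "%-5s|%5d|%.2f|%x|%%\n", "ab", 42, 3.14159, 255 }'
}
format_demo
```

This prints "ab   |   42|3.14|ff|%": a left-justified string in a field of
five, a right-justified decimal in a field of five, a value with two digits
of precision, a hexadecimal number and a literal percent sign.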
EXAMPLES
Print lines longer than 72 characters (missing action is print):
length($0) > 72
Print first two fields in opposite order (missing pattern is always match):
{ print $2, $1 }
Add up first column, print sum and average:
{ s = s + $1 }
END { print "sum is", s, "average is", s/NR }
Print fields in reverse order:
{ for (i = NF; i > 0; --i ) print $i }
Print all lines between start/stop pairs:
/start/,/stop/
Print all lines whose first field is different from previous one:
$1 != prev { print; prev = $1 }
Convert date from MM/DD/YY to metric (YYMMDD):
{ n = split(date, a, "/"); date = a[3] a[1] a[2] }
Copy a C program and insert include files:
$1 == "#include" && $2 ~ /^"/ {
    include = $2;
    gsub(/"/, "", include);
    while ((getline <include) > 0) print
    next
}
{ print }
AUTHOR
Rob Duff, Vancouver, B.C., V5N 1Y9
BBS: (604)877-7752 Fido: 1:153/713.0
DATE
08-Feb-90
SEE ALSO
M. E. Lesk and E. Schmidt,
LEX - Lexical Analyser Generator
A. V. Aho, B. W. Kernighan, P. J. Weinberger,
Awk - a pattern scanning and processing language
A. V. Aho, B. W. Kernighan, P. J. Weinberger,
The AWK Programming Language
Addison-Wesley 1988 ISBN 0-201-07981-X
NOTES
There are no explicit conversions between numbers and strings. To force an
expression to be treated as a number add 0 to it; to force it to be a string
concatenate "" to it. Array indices are strings, so indices that have the same
numerical value may still index different values (e.g. "01" vs "1").
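Both notes can be sketched in a few lines, assuming a POSIX shell and a
current awk. All names are illustrative.

```sh
# Hypothetical helper: forced conversions and string subscripts.
conversion_demo() {
  awk 'BEGIN {
    a["01"] = "x"; a["1"] = "y"     # two distinct subscripts
    for (k in a) n++
    s = "10"
    num = (s + 0 > 9)               # adding 0 forces a numeric comparison
    t = s ""; u = 9 ""              # concatenating "" forces strings
    str = (t > u)                   # as strings, "10" sorts before "9"
    print n, num, str
  }'
}
conversion_demo
```

This prints "2 1 0": the array holds two elements, 10 > 9 is true as a
number, and "10" > "9" is false as a string.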
LIMITS
stack depth is 500
number of files is 10
largest string is 4000
input line size is 2000
number of variables is 100
function call depth is 100
highest field number is 100