Usenet 1994 January

/ Usenet 1994 January / usenetsourcesnewsgroupsinfomagicjanuary1994.iso / sources / unix / volume22 / gawk2.11 / part07 / gawk.texinfo.06

Text File | 1990-06-07 | 48.5 KB | 1,399 lines

the @code{BEGIN} rule was executed. Some applications came to depend
upon this ``feature''. When @code{awk} was changed to be more consistent,
the @samp{-v} option was added to accommodate applications that depended
upon this old behaviour.

The variable assignment feature is most useful for assigning to variables
such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
output formats, before scanning the data files. It is also useful for
controlling state if multiple passes are needed over a data file. For
example:@refill

@cindex multiple passes over data
@cindex passes, multiple
@example
awk 'pass == 1  @{ @var{pass 1 stuff} @}
     pass == 2  @{ @var{pass 2 stuff} @}' pass=1 datafile pass=2 datafile
@end example

@node AWKPATH Variable,, Other Arguments, Command Line
@section The @code{AWKPATH} Environment Variable
@cindex @code{AWKPATH} environment variable
@cindex search path
@cindex directory search
@cindex path, search
@c @cindex differences between @code{gawk} and @code{awk}

The previous section described how @code{awk} program files can be named
on the command line with the @samp{-f} option. In some @code{awk}
implementations, you must supply a precise path name for each program
file, unless the file is in the current directory.

But in @code{gawk}, if the file name supplied in the @samp{-f} option
does not contain a @samp{/}, then @code{gawk} searches a list of
directories (called the @dfn{search path}), one by one, looking for a
file with the specified name.

The search path is actually a string containing directory names
separated by colons. @code{gawk} gets its search path from the
@code{AWKPATH} environment variable. If that variable does not exist,
@code{gawk} uses the default path, which is
@samp{.:/usr/lib/awk:/usr/local/lib/awk}.@refill

The search path feature is particularly useful for building up
libraries of useful @code{awk} functions.
The library files can be placed in a standard directory that is in the default path, and then specified on the command line with a short file name. Otherwise, the full file name would have to be typed for each file. Path searching is not done if @code{gawk} is in compatibility mode. @xref{Command Line}. @strong{Note:} if you want files in the current directory to be found, you must include the current directory in the path, either by writing @file{.} as an entry in the path, or by writing a null entry in the path. (A null entry is indicated by starting or ending the path with a colon, or by placing two colons next to each other (@samp{::}).) If the current directory is not included in the path, then files cannot be found in the current directory. This path search mechanism is identical to the shell's. @c someday, @cite{The Bourne Again Shell}.... @node Language History, Gawk Summary, Command Line, Top @chapter The Evolution of the @code{awk} Language This manual describes the GNU implementation of @code{awk}, which is patterned after the System V Release 4 version. Many @code{awk} users are only familiar with the original @code{awk} implementation in Version 7 Unix, which is also the basis for the version in Berkeley Unix. This chapter briefly describes the evolution of the @code{awk} language. @menu * V7/S5R3.1:: The major changes between V7 and System V Release 3.1. * S5R4:: The minor changes between System V Releases 3.1 and 4. * S5R4/GNU:: The extensions in @code{gawk} not in System V Release 4. @end menu @node V7/S5R3.1, S5R4, Language History, Language History @section Major Changes Between V7 and S5R3.1 The @code{awk} language evolved considerably between the release of Version 7 Unix (1978) and the new version first made widely available in System V Release 3.1 (1987). This section summarizes the changes, with cross-references to further details. @itemize @bullet @item The requirement for @samp{;} to separate rules on a line (@pxref{Statements/Lines}). 
@item User-defined functions, and the @code{return} statement (@pxref{User-defined}). @item The @code{delete} statement (@pxref{Delete}). @item The @code{do}-@code{while} statement (@pxref{Do Statement}). @item The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and @code{srand} (@pxref{Numeric Functions}). @item The built-in functions @code{gsub}, @code{sub}, and @code{match} (@pxref{String Functions}). @item The built-in functions @code{close} and @code{system} (@pxref{I/O Functions}). @item The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART}, and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}). @item The conditional expression using the operators @samp{?} and @samp{:} (@pxref{Conditional Exp}). @item The exponentiation operator @samp{^} (@pxref{Arithmetic Ops}) and its assignment operator form @samp{^=} (@pxref{Assignment Ops}).@refill @item C-compatible operator precedence, which breaks some old @code{awk} programs (@pxref{Precedence}). @item Regexps as the value of @code{FS} (@pxref{Field Separators}), or as the third argument to the @code{split} function (@pxref{String Functions}).@refill @item Dynamic regexps as operands of the @samp{~} and @samp{!~} operators (@pxref{Regexp Usage}). @item Escape sequences (@pxref{Constants}) in regexps.@refill @item The escape sequences @samp{\b}, @samp{\f}, and @samp{\r} (@pxref{Constants}). @item Redirection of input for the @code{getline} function (@pxref{Getline}). @item Multiple @code{BEGIN} and @code{END} rules (@pxref{BEGIN/END}). @item Simulation of multidimensional arrays (@pxref{Multi-dimensional}). @end itemize @node S5R4, S5R4/GNU, V7/S5R3.1, Language History @section Minor Changes between S5R3.1 and S5R4 The System V Release 4 version of Unix @code{awk} added these features: @itemize @bullet @item The @code{ENVIRON} variable (@pxref{Built-in Variables}). @item Multiple @samp{-f} options on the command line (@pxref{Command Line}). 
@item The @samp{-v} option for assigning variables before program execution begins (@pxref{Command Line}). @item The @samp{--} option for terminating command line options. @item The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences (@pxref{Constants}). @item A defined return value for the @code{srand} built-in function (@pxref{Numeric Functions}). @item The @code{toupper} and @code{tolower} built-in string functions for case translation (@pxref{String Functions}). @item A cleaner specification for the @samp{%c} format-control letter in the @code{printf} function (@pxref{Printf}). @item The use of constant regexps such as @code{/foo/} as expressions, where they are equivalent to use of the matching operator, as in @code{$0 ~ /foo/}. @end itemize @node S5R4/GNU, , S5R4, Language History @section Extensions In @code{gawk} Not In S5R4 The GNU implementation, @code{gawk}, adds these features: @itemize @bullet @item The @code{AWKPATH} environment variable for specifying a path search for the @samp{-f} command line option (@pxref{Command Line}). @item The @samp{-C} and @samp{-V} command line options (@pxref{Command Line}). @item The @code{IGNORECASE} variable and its effects (@pxref{Case-sensitivity}). @item The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and @file{/dev/fd/@var{n}} file name interpretation (@pxref{Special Files}). @item The @samp{-c} option to turn off these extensions (@pxref{Command Line}). @item The @samp{-a} and @samp{-e} options to specify the syntax of regular expressions that @code{gawk} will accept (@pxref{Command Line}). @end itemize @node Gawk Summary, Sample Program, Language History, Top @appendix @code{gawk} Summary @ignore See, man pages are good for something. This chapter started life as the gawk.1 man page for 2.11. @end ignore This appendix provides a brief summary of the @code{gawk} command line and the @code{awk} language. It is designed to serve as ``quick reference.'' It is therefore terse, but complete. 
@menu * Command Line Summary:: Recapitulation of the command line. * Language Summary:: A terse review of the language. * Variables/Fields:: Variables, fields, and arrays. * Rules Summary:: Patterns and Actions, and their component parts. * Functions Summary:: Defining and calling functions. @end menu @node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary @appendixsec Command Line Options Summary The command line consists of options to @code{gawk} itself, the @code{awk} program text (if not supplied via the @samp{-f} option), and values to be made available in the @code{ARGC} and @code{ARGV} predefined @code{awk} variables: @example awk @r{[@code{-F@var{fs}}] [@code{-v @var{var}=@var{val}}] [@code{-V}] [@code{-C}] [@code{-c}] [@code{-a}] [@code{-e}] [@code{--}]} '@var{program}' @var{file} @dots{} awk @r{[@code{-F@var{fs}}] @code{-f @var{source-file}} [@code{-f @var{source-file} @dots{}}] [@code{-v @var{var}=@var{val}}] [@code{-V}] [@code{-C}] [@code{-c}] [@code{-a}] [@code{-e}] [@code{--}]} @var{file} @dots{} @end example The options that @code{gawk} accepts are: @table @code @item -F@var{fs} Use @var{fs} for the input field separator (the value of the @code{FS} predefined variable). @item -f @var{program-file} Read the @code{awk} program source from the file @var{program-file}, instead of from the first command line argument. @item -v @var{var}=@var{val} Assign the variable @var{var} the value @var{val} before program execution begins. @item -a Specifies use of traditional @code{awk} syntax for regular expressions. This means that @samp{\} can be used to quote regular expression operators inside of square brackets, just as it can be outside of them. @item -e Specifies use of @code{egrep} syntax for regular expressions. This means that @samp{\} does not serve as a quoting character inside of square brackets. @item -c Specifies compatibility mode, in which @code{gawk} extensions are turned off. 
@item -V Print version information for this particular copy of @code{gawk} on the error output. This option may disappear in a future version of @code{gawk}. @item -C Print the short version of the General Public License on the error output. This option may disappear in a future version of @code{gawk}. @item -- Signal the end of options. This is useful to allow further arguments to the @code{awk} program itself to start with a @samp{-}. This is mainly for consistency with the argument parsing conventions of POSIX. @end table Any other options are flagged as invalid, but are otherwise ignored. @xref{Command Line}, for more details. @node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary @appendixsec Language Summary An @code{awk} program consists of a sequence of pattern-action statements and optional function definitions. @example @var{pattern} @{ @var{action statements} @} function @var{name}(@var{parameter list}) @{ @var{action statements} @} @end example @code{gawk} first reads the program source from the @var{program-file}(s) if specified, or from the first non-option argument on the command line. The @samp{-f} option may be used multiple times on the command line. @code{gawk} reads the program text from all the @var{program-file} files, effectively concatenating them in the order they are specified. This is useful for building libraries of @code{awk} functions, without having to include them in each new @code{awk} program that uses them. To use a library function in a file from a program typed in on the command line, specify @samp{-f /dev/tty}; then type your program, and end it with a @kbd{C-d}. @xref{Command Line}. The environment variable @code{AWKPATH} specifies a search path to use when finding source files named with the @samp{-f} option. If the variable @code{AWKPATH} is not set, @code{gawk} uses the default path, @samp{.:/usr/lib/awk:/usr/local/lib/awk}. 
If a file name given to the @samp{-f} option contains a @samp{/} character,
no path search is performed. @xref{AWKPATH Variable}, for a full
description of the @code{AWKPATH} environment variable.@refill

@code{gawk} compiles the program into an internal form, and then proceeds to
read each file named in the @code{ARGV} array. If there are no files named
on the command line, @code{gawk} reads the standard input.

If a ``file'' named on the command line has the form
@samp{@var{var}=@var{val}}, it is treated as a variable assignment: the
variable @var{var} is assigned the value @var{val}.

For each line in the input, @code{gawk} tests to see if it matches any
@var{pattern} in the @code{awk} program. For each pattern that the line
matches, the associated @var{action} is executed.

@node Variables/Fields, Rules Summary, Language Summary, Gawk Summary
@appendixsec Variables and Fields

@code{awk} variables are dynamic; they come into existence when they are
first used. Their values are either floating-point numbers or strings.
@code{awk} also has one-dimensional arrays; multidimensional arrays may be
simulated. There are several predefined variables that @code{awk} sets as a
program runs; these are summarized below.

@menu
* Fields Summary::              Input field splitting.
* Built-in Summary::            @code{awk}'s built-in variables.
* Arrays Summary::              Using arrays.
* Data Type Summary::           Values in @code{awk} are numbers or strings.
@end menu

@node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields
@appendixsubsec Fields

As each input line is read, @code{gawk} splits the line into
@var{fields}, using the value of the @code{FS} variable as the field
separator. If @code{FS} is a single character, fields are separated by
that character. Otherwise, @code{FS} is expected to be a full regular
expression. In the special case that @code{FS} is a single blank, fields
are separated by runs of blanks and/or tabs.
Note that the value of @code{IGNORECASE} (@pxref{Case-sensitivity}) also affects how fields are split when @code{FS} is a regular expression. Each field in the input line may be referenced by its position, @code{$1}, @code{$2}, and so on. @code{$0} is the whole line. The value of a field may be assigned to as well. Field numbers need not be constants: @example n = 5 print $n @end example @noindent prints the fifth field in the input line. The variable @code{NF} is set to the total number of fields in the input line. References to nonexistent fields (i.e., fields after @code{$NF}) return the null-string. However, assigning to a nonexistent field (e.g., @code{$(NF+2) = 5}) increases the value of @code{NF}, creates any intervening fields with the null string as their value, and causes the value of @code{$0} to be recomputed, with the fields being separated by the value of @code{OFS}.@refill @xref{Reading Files}, for a full description of the way @code{awk} defines and uses fields. @node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields @appendixsubsec Built-in Variables @code{awk}'s built-in variables are: @table @code @item ARGC The number of command line arguments (not including options or the @code{awk} program itself). @item ARGV The array of command line arguments. The array is indexed from 0 to @code{ARGC} - 1. Dynamically changing the contents of @code{ARGV} can control the files used for data.@refill @item ENVIRON An array containing the values of the environment variables. The array is indexed by variable name, each element being the value of that variable. Thus, the environment variable @code{HOME} would be in @code{ENVIRON["HOME"]}. Its value might be @file{/u/close}. Changing this array does not affect the environment seen by programs which @code{gawk} spawns via redirection or the @code{system} function. (This may change in a future version of @code{gawk}.) Some operating systems do not have environment variables. 
The array @code{ENVIRON} is empty when running on these systems. @item FILENAME The name of the current input file. If no files are specified on the command line, the value of @code{FILENAME} is @samp{-}. @item FNR The input record number in the current input file. @item FS The input field separator, a blank by default. @item IGNORECASE The case-sensitivity flag for regular expression operations. If @code{IGNORECASE} has a nonzero value, then pattern matching in rules, field splitting with @code{FS}, regular expression matching with @samp{~} and @samp{!~}, and the @code{gsub}, @code{index}, @code{match}, @code{split} and @code{sub} predefined functions all ignore case when doing regular expression operations.@refill @item NF The number of fields in the current input record. @item NR The total number of input records seen so far. @item OFMT The output format for numbers, @code{"%.6g"} by default. @item OFS The output field separator, a blank by default. @item ORS The output record separator, by default a newline. @item RS The input record separator, by default a newline. @code{RS} is exceptional in that only the first character of its string value is used for separating records. If @code{RS} is set to the null string, then records are separated by blank lines. When @code{RS} is set to the null string, then the newline character always acts as a field separator, in addition to whatever value @code{FS} may have.@refill @item RSTART The index of the first character matched by @code{match}; 0 if no match. @item RLENGTH The length of the string matched by @code{match}; @minus{}1 if no match. @item SUBSEP The string used to separate multiple subscripts in array elements, by default @code{"\034"}. @end table @xref{Built-in Variables}. @node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields @appendixsubsec Arrays Arrays are subscripted with an expression between square brackets (@samp{[} and @samp{]}). The expression may be either a number or a string. 
Since arrays are associative, string indices are meaningful and are not converted to numbers. If you use multiple expressions separated by commas inside the square brackets, then the array subscript is a string consisting of the concatenation of the individual subscript values, converted to strings, separated by the subscript separator (the value of @code{SUBSEP}). The special operator @code{in} may be used in an @code{if} or @code{while} statement to see if an array has an index consisting of a particular value. @group @example if (val in array) print array[val] @end example @end group If the array has multiple subscripts, use @code{(i, j, @dots{}) in array} to test for existence of an element. The @code{in} construct may also be used in a @code{for} loop to iterate over all the elements of an array. @xref{Scanning an Array}. An element may be deleted from an array using the @code{delete} statement. @xref{Arrays}, for more detailed information. @node Data Type Summary, , Arrays Summary, Variables/Fields @appendixsubsec Data Types The value of an @code{awk} expression is always either a number or a string. Certain contexts (such as arithmetic operators) require numeric values. They convert strings to numbers by interpreting the text of the string as a numeral. If the string does not look like a numeral, it converts to 0. Certain contexts (such as concatenation) require string values. They convert numbers to strings by effectively printing them. To force conversion of a string value to a number, simply add 0 to it. If the value you start with is already a number, this does not change it. To force conversion of a numeric value to a string, concatenate it with the null string. The @code{awk} language defines comparisons as being done numerically if possible, otherwise one or both operands are converted to strings and a string comparison is performed. Uninitialized variables have the string value @code{""} (the null, or empty, string). 
In contexts where a number is required, this is equivalent to 0.

@xref{Variables}, for more information on variable naming and
initialization; @pxref{Conversion}, for more information on how variable
values are interpreted.@refill

@node Rules Summary, Functions Summary, Variables/Fields, Gawk Summary
@appendixsec Patterns and Actions

@menu
* Pattern Summary::     Quick overview of patterns.
* Regexp Summary::      Quick overview of regular expressions.
* Actions Summary::     Quick overview of actions.
@end menu

An @code{awk} program is mostly composed of rules, each consisting of a
pattern followed by an action. The action is enclosed in @samp{@{} and
@samp{@}}. Either the pattern may be missing, or the action may be
missing, but, of course, not both. If the pattern is missing, the
action is executed for every single line of input. A missing action is
equivalent to this action,

@example
@{ print @}
@end example

@noindent
which prints the entire line.

Comments begin with the @samp{#} character, and continue until the end of
the line. Blank lines may be used to separate statements. Normally, a
statement ends with a newline; however, this is not the case for lines
ending in a @samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or
@samp{||}. Lines ending in @code{do} or @code{else} also have their
statements automatically continued on the following line. In other cases,
a line can be continued by ending it with a @samp{\}, in which case the
newline is ignored.@refill

Multiple statements may be put on one line by separating them with a
@samp{;}. This applies to both the statements within the action part of a
rule (the usual case), and to the rule statements themselves.

@xref{Comments}, for information on @code{awk}'s commenting convention;
@pxref{Statements/Lines}, for a description of the line continuation
mechanism in @code{awk}.
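The rule forms described above can be tried out with a short shell
session such as the following. This is only a sketch: the sample data and
the shell variable names are invented for illustration, and it assumes a
POSIX-style shell with an @code{awk} that follows the rules summarized
here.

```shell
# Hypothetical sample input: three whitespace-separated records.
data='alpha 1
beta 2
gamma 3'

# A rule with a pattern but no action: matching lines are printed whole.
no_action=$(printf '%s\n' "$data" | awk '$2 >= 2')

# A rule with an action but no pattern: the action runs for every line.
# Two statements share one line, separated by a semicolon.
no_pattern=$(printf '%s\n' "$data" | awk '{ n++; print n ": " $1 }')

# A line ending in && is continued on the next line automatically.
continued=$(printf '%s\n' "$data" | awk '$2 > 1 &&
$1 != "gamma" { print $1 }   # comments run to the end of the line
')

printf '%s\n' "$no_action" "$no_pattern" "$continued"
```

The first program prints the last two records unchanged, the second
numbers every record, and the third prints only the name from the second
record.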
@node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary @appendixsubsec Patterns @code{awk} patterns may be one of the following: @example /@var{regular expression}/ @var{relational expression} @var{pattern} && @var{pattern} @var{pattern} || @var{pattern} @var{pattern} ? @var{pattern} : @var{pattern} (@var{pattern}) ! @var{pattern} @var{pattern1}, @var{pattern2} BEGIN END @end example @code{BEGIN} and @code{END} are two special kinds of patterns that are not tested against the input. The action parts of all @code{BEGIN} rules are merged as if all the statements had been written in a single @code{BEGIN} rule. They are executed before any of the input is read. Similarly, all the @code{END} rules are merged, and executed when all the input is exhausted (or when an @code{exit} statement is executed). @code{BEGIN} and @code{END} patterns cannot be combined with other patterns in pattern expressions. @code{BEGIN} and @code{END} rules cannot have missing action parts.@refill For @samp{/@var{regular-expression}/} patterns, the associated statement is executed for each input line that matches the regular expression. Regular expressions are the same as those in @code{egrep}, and are summarized below. A @var{relational expression} may use any of the operators defined below in the section on actions. These generally test whether certain fields match certain regular expressions. The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and'', logical ``or'', and logical ``not'', respectively, as in C. They do short-circuit evaluation, also as in C, and are used for combining more primitive pattern expressions. As in most languages, parentheses may be used to change the order of evaluation. The @samp{?:} operator is like the same operator in C. If the first pattern matches, then the second pattern is matched against the input record; otherwise, the third is matched. Only one of the second and third patterns is matched. 
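As an informal sketch of how the combining operators behave, the
following shell session runs two small programs over made-up data. The
staff records and the shell variable names are hypothetical; the
parentheses in the second program are not required and are there only to
make the grouping obvious.

```shell
# Hypothetical input: name, department, and salary on each line.
staff='ann eng 120
bob ops 80
cy eng 90
dee ops 130'

# && and ! combine simpler patterns; the END rule runs once all
# input has been read.
eng_low=$(printf '%s\n' "$staff" | awk '
$2 == "eng" && !($3 > 100) { print $1 }
END { print NR " records" }')

# ?: chooses which pattern is tested: for eng lines the second
# pattern ($3 > 100), for all other lines the third ($3 < 100).
picked=$(printf '%s\n' "$staff" | awk '
($2 == "eng") ? ($3 > 100) : ($3 < 100) { print $1 }')

printf '%s\n' "$eng_low" "$picked"
```

Here the first program selects only @samp{cy} and then reports four
records, while the second selects @samp{ann} and @samp{bob}.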
The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a range pattern. It matches all input lines starting with a line that matches @var{pattern1}, and continuing until a line that matches @var{pattern2}, inclusive. A range pattern cannot be used as an operand to any of the pattern operators. @xref{Patterns}, for a full description of the pattern part of @code{awk} rules. @node Regexp Summary, Actions Summary, Pattern Summary, Rules Summary @appendixsubsec Regular Expressions Regular expressions are the extended kind found in @code{egrep}. They are composed of characters as follows: @table @code @item @var{c} matches the character @var{c} (assuming @var{c} is a character with no special meaning in regexps). @item \@var{c} matches the literal character @var{c}. @item . matches any character except newline. @item ^ matches the beginning of a line or a string. @item $ matches the end of a line or a string. @item [@var{abc}@dots{}] matches any of the characters @var{abc}@dots{} (character class). @item [^@var{abc}@dots{}] matches any character except @var{abc}@dots{} and newline (negated character class). @item @var{r1}|@var{r2} matches either @var{r1} or @var{r2} (alternation). @item @var{r1r2} matches @var{r1}, and then @var{r2} (concatenation). @item @var{r}+ matches one or more @var{r}'s. @item @var{r}* matches zero or more @var{r}'s. @item @var{r}? matches zero or one @var{r}'s. @item (@var{r}) matches @var{r} (grouping). @end table @xref{Regexp}, for a more detailed explanation of regular expressions. The escape sequences allowed in string constants are also valid in regular expressions (@pxref{Constants}). @node Actions Summary, , Regexp Summary, Rules Summary @appendixsubsec Actions Action statements are enclosed in braces, @samp{@{} and @samp{@}}. Action statements consist of the usual assignment, conditional, and looping statements found in most languages. 
The operators, control statements, and input/output statements available are patterned after those in C. @menu * Operator Summary:: @code{awk} operators. * Control Flow Summary:: The control statements. * I/O Summary:: The I/O statements. * Printf Summary:: A summary of @code{printf}. * Special File Summary:: Special file names interpreted internally. * Numeric Functions Summary:: Built-in numeric functions. * String Functions Summary:: Built-in string functions. * String Constants Summary:: Escape sequences in strings. @end menu @node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary @appendixsubsubsec Operators The operators in @code{awk}, in order of increasing precedence, are @table @code @item = += -= *= /= %= ^= Assignment. Both absolute assignment (@code{@var{var}=@var{value}}) and operator assignment (the other forms) are supported. @item ?: A conditional expression, as in C. This has the form @code{@var{expr1} ? @var{expr2} : @var{expr3}}. If @var{expr1} is true, the value of the expression is @var{expr2}; otherwise it is @var{expr3}. Only one of @var{expr2} and @var{expr3} is evaluated.@refill @item || Logical ``or''. @item && Logical ``and''. @item ~ !~ Regular expression match, negated match. @item < <= > >= != == The usual relational operators. @item @var{blank} String concatenation. @item + - Addition and subtraction. @item * / % Multiplication, division, and modulus. @item + - ! Unary plus, unary minus, and logical negation. @item ^ Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment operator). @item ++ -- Increment and decrement, both prefix and postfix. @item $ Field reference. @end table @xref{Expressions}, for a full description of all the operators listed above. @xref{Fields}, for a description of the field reference operator. 
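A few one-line programs can make the precedence table more concrete.
The following is a minimal sketch, with shell variable names chosen only
for this illustration; it exercises the field reference operator,
concatenation, and the conditional expression.

```shell
# $ binds tighter than arithmetic: $NF-1 means ($NF) - 1, not $(NF-1).
field_refs=$(echo '10 20 30' | awk '{ print $NF-1, $(NF-1) }')

# + and * bind tighter than concatenation (written as a blank), so
# this concatenates the three values 3, " ", and 12.
concat=$(awk 'BEGIN { print 1 + 2 " " 3 * 4 }')

# Relational operators bind tighter than ?:, and assignment is lowest
# of all, so no parentheses are needed here.  (Directly after print, a
# bare > would be taken as output redirection instead.)
cond=$(awk 'BEGIN { n = 3; s = n > 1 ? "many" : "one"; print s }')

printf '%s\n' "$field_refs" "$concat" "$cond"
```

The three programs print @samp{29 20}, @samp{3 12}, and @samp{many},
respectively.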
@node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary
@appendixsubsubsec Control Statements

The control statements are as follows:

@example
if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]}
while (@var{condition}) @var{statement}
do @var{statement} while (@var{condition})
for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement}
for (@var{var} in @var{array}) @var{statement}
break
continue
delete @var{array}[@var{index}]
exit @r{[} @var{expression} @r{]}
@{ @var{statements} @}
@end example

@xref{Statements}, for a full description of all the control statements
listed above.

@node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary
@appendixsubsubsec I/O Statements

The input/output statements are as follows:

@table @code
@item getline
Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}.

@item getline <@var{file}
Set @code{$0} from next record of @var{file}; set @code{NF}.

@item getline @var{var}
Set @var{var} from next input record; set @code{NR}, @code{FNR}.

@item getline @var{var} <@var{file}
Set @var{var} from next record of @var{file}.

@item next
Stop processing the current input record. The next input record is read
and processing starts over with the first pattern in the @code{awk}
program. If the end of the input data is reached, the @code{END}
rule(s), if any, are executed.

@item print
Prints the current record.

@item print @var{expr-list}
Prints expressions.

@item print @var{expr-list} > @var{file}
Prints expressions on @var{file}.

@item printf @var{fmt, expr-list}
Format and print.

@item printf @var{fmt, expr-list} > @var{file}
Format and print on @var{file}.
@end table

Other input/output redirections are also allowed. For @code{print} and
@code{printf}, @samp{>> @var{file}} appends output to the @var{file},
while @samp{| @var{command}} writes on a pipe. In a similar fashion,
@samp{@var{command} | getline} pipes input into @code{getline}.
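The redirections can be sketched in one small program. The scratch file
created with @code{mktemp}, the @code{echo} command, and all variable
names below are assumptions made for this illustration only. The loop
tests for a result greater than zero so that it stops both at end of
file and on an error.

```shell
tmp=$(mktemp)    # scratch file for the redirection examples

io_out=$(awk -v out="$tmp" 'BEGIN {
    print "first"  > out     # > creates or truncates the file once
    print "second" > out     # later prints go to the same open file
    close(out)
    print "third" >> out     # >> appends to the existing contents
    close(out)

    # Read the file back one record at a time: getline var < file.
    while ((getline line < out) > 0)
        n++
    close(out)

    # command | getline reads one record of the command output.
    "echo piped" | getline word
    print n, word
}')

rm -f "$tmp"
echo "$io_out"
```

The file ends up holding three records, so the program prints
@samp{3 piped}.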
@code{getline} returns 0 on end of file, and @minus{}1 on an error.@refill

@xref{Getline}, for a full description of the @code{getline} statement.
@xref{Printing}, for a full description of @code{print} and
@code{printf}. Finally, @pxref{Next Statement}, for a description of
how the @code{next} statement works.@refill

@node Printf Summary, Special File Summary, I/O Summary, Actions Summary
@appendixsubsubsec @code{printf} Summary

The @code{awk} @code{printf} statement and @code{sprintf} function
accept the following conversion specification formats:

@table @code
@item %c
An ASCII character. If the argument used for @samp{%c} is numeric, it is
treated as a character and printed. Otherwise, the argument is assumed to
be a string, and only the first character of that string is printed.

@item %d
A decimal number (the integer part).

@item %i
Also a decimal integer.

@item %e
A floating point number of the form
@samp{@r{[}-@r{]}d.ddddddE@r{[}+-@r{]}dd}.@refill

@item %f
A floating point number of the form
@r{[}@code{-}@r{]}@code{ddd.dddddd}.

@item %g
Use @samp{%e} or @samp{%f} conversion, whichever is shorter, with
nonsignificant zeros suppressed.

@item %o
An unsigned octal number (again, an integer).

@item %s
A character string.

@item %x
An unsigned hexadecimal number (an integer).

@item %X
Like @samp{%x}, except use @samp{A} through @samp{F} instead of
@samp{a} through @samp{f} for decimal 10 through 15.@refill

@item %%
A single @samp{%} character; no argument is converted.
@end table

There are optional, additional parameters that may lie between the @samp{%}
and the control letter:

@table @code
@item -
The expression should be left-justified within its field.

@item @var{width}
The field should be padded to this width. If @var{width} has a leading
zero, then the field is padded with zeros. Otherwise it is padded with
blanks.

@item .@var{prec}
A number indicating the maximum width of strings or digits to the right
of the decimal point.
@end table @xref{Printf}, for examples and for a more detailed description. @node Special File Summary, Numeric Functions Summary, Printf Summary, Actions Summary @appendixsubsubsec Special File Names When doing I/O redirection from either @code{print} or @code{printf} into a file, or via @code{getline} from a file, @code{gawk} recognizes certain special file names internally. These file names allow access to open file descriptors inherited from @code{gawk}'s parent process (usually the shell). The file names are: @table @file @item /dev/stdin The standard input. @item /dev/stdout The standard output. @item /dev/stderr The standard error output. @item /dev/fd/@var{n} The file denoted by the open file descriptor @var{n}. @end table @noindent These file names may also be used on the command line to name data files. @xref{Special Files}, for a longer description that provides the motivation for this feature. @node Numeric Functions Summary, String Functions Summary, Special File Summary, Actions Summary @appendixsubsubsec Numeric Functions @code{awk} has the following predefined arithmetic functions: @table @code @item atan2(@var{y}, @var{x}) returns the arctangent of @var{y/x} in radians. @item cos(@var{expr}) returns the cosine in radians. @item exp(@var{expr}) the exponential function. @item int(@var{expr}) truncates to integer. @item log(@var{expr}) the natural logarithm function. @item rand() returns a random number between 0 and 1. @item sin(@var{expr}) returns the sine in radians. @item sqrt(@var{expr}) the square root function. @item srand(@var{expr}) use @var{expr} as a new seed for the random number generator. If no @var{expr} is provided, the time of day is used. The return value is the previous seed for the random number generator. 
@end table @node String Functions Summary, String Constants Summary, Numeric Functions Summary, Actions Summary @appendixsubsubsec String Functions @code{awk} has the following predefined string functions: @table @code @item gsub(@var{r}, @var{s}, @var{t}) for each substring matching the regular expression @var{r} in the string @var{t}, substitute the string @var{s}, and return the number of substitutions. If @var{t} is not supplied, use @code{$0}. @item index(@var{s}, @var{t}) returns the index of the string @var{t} in the string @var{s}, or 0 if @var{t} is not present. @item length(@var{s}) returns the length of the string @var{s}. @item match(@var{s}, @var{r}) returns the position in @var{s} where the regular expression @var{r} occurs, or 0 if @var{r} is not present, and sets the values of @code{RSTART} and @code{RLENGTH}. @item split(@var{s}, @var{a}, @var{r}) splits the string @var{s} into the array @var{a} on the regular expression @var{r}, and returns the number of fields. If @var{r} is omitted, @code{FS} is used instead. @item sprintf(@var{fmt}, @var{expr-list}) prints @var{expr-list} according to @var{fmt}, and returns the resulting string. @item sub(@var{r}, @var{s}, @var{t}) this is just like @code{gsub}, but only the first matching substring is replaced. @item substr(@var{s}, @var{i}, @var{n}) returns the @var{n}-character substring of @var{s} starting at @var{i}. If @var{n} is omitted, the rest of @var{s} is used. @item tolower(@var{str}) returns a copy of the string @var{str}, with all the upper-case characters in @var{str} translated to their corresponding lower-case counterparts. Nonalphabetic characters are left unchanged. @item toupper(@var{str}) returns a copy of the string @var{str}, with all the lower-case characters in @var{str} translated to their corresponding upper-case counterparts. Nonalphabetic characters are left unchanged. @item system(@var{cmd-line}) Execute the command @var{cmd-line}, and return the exit status. 
@end table @xref{Built-in}, for a description of all of @code{awk}'s built-in functions. @node String Constants Summary, , String Functions Summary, Actions Summary @appendixsubsubsec String Constants String constants in @code{awk} are sequences of characters enclosed between double quotes (@code{"}). Within strings, certain @dfn{escape sequences} are recognized, as in C. These are: @table @code @item \\ A literal backslash. @item \a The ``alert'' character; usually the ASCII BEL character. @item \b Backspace. @item \f Formfeed. @item \n Newline. @item \r Carriage return. @item \t Horizontal tab. @item \v Vertical tab. @item \x@var{hex digits} The character represented by the string of hexadecimal digits following the @samp{\x}. As in ANSI C, all following hexadecimal digits are considered part of the escape sequence. (This feature should tell us something about language design by committee.) E.g., @code{"\x1B"} is a string containing the ASCII ESC (escape) character. @item \@var{ddd} The character represented by the 1-, 2-, or 3-digit sequence of octal digits. Thus, @code{"\033"} is also a string containing the ASCII ESC (escape) character. @item \@var{c} The literal character @var{c}. @end table The escape sequences may also be used inside constant regular expressions (e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace characters).@refill @xref{Constants}. @node Functions Summary, , Rules Summary, Gawk Summary @appendixsec Functions Functions in @code{awk} are defined as follows: @example function @var{name}(@var{parameter list}) @{ @var{statements} @} @end example Actual parameters supplied in the function call are used to instantiate the formal parameters declared in the function. Arrays are passed by reference; other variables are passed by value. If fewer arguments are passed than there are names in @var{parameter list}, the extra names are given the null string as their value. Extra names have the effect of local variables.
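@noindent
As a rough illustration of these parameter-passing rules, the following sketch can be run from the shell. The names @code{byref_demo}, @code{fill}, and @code{squares} are made up for this example and are not part of @code{awk}; the shell function wrapper exists only so that the program can be invoked by name.

```shell
# byref_demo is a hypothetical wrapper, used only to give the program a name.
byref_demo() {
    awk '
    # fill() is an illustrative helper.  Its third parameter, i, is never
    # supplied by the caller, so it serves as a local variable.
    function fill(arr, n, i) {
        for (i = 1; i <= n; i++)
            arr[i] = i * i    # arrays are passed by reference
        n = 0                 # scalars are passed by value
    }
    BEGIN {
        i = "untouched"
        count = 3
        fill(squares, count)
        print squares[3], count, i
    }'
}

byref_demo
```

This prints @samp{9 3 untouched}: the array filled in by @code{fill} is visible to the caller, while the assignment to @code{n} and the use of @code{i} inside the function leave the caller's @code{count} and @code{i} alone.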
The open-parenthesis in a function call must immediately follow the function name, without any intervening white space. This is to avoid a syntactic ambiguity with the concatenation operator. The word @code{func} may be used in place of @code{function}. @xref{User-defined}, for a more complete description. @node Sample Program, Notes, Gawk Summary, Top @appendix Sample Program The following example is a complete @code{awk} program, which prints the number of occurrences of each word in its input. It illustrates the associative nature of @code{awk} arrays by using strings as subscripts. It also demonstrates the @samp{for @var{x} in @var{array}} construction. Finally, it shows how @code{awk} can be used in conjunction with other utility programs to do a useful task of some complexity with a minimum of effort. Some explanations follow the program listing.@refill
@example
awk '
# Print list of word frequencies
@{
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}'
@end example
The first thing to notice about this program is that it has two rules. The first rule, because it has an empty pattern, is executed on every line of the input. It uses @code{awk}'s field-accessing mechanism (@pxref{Fields}) to pick out the individual words from the line, and the built-in variable @code{NF} (@pxref{Built-in Variables}) to know how many fields are available. For each input word, an element of the array @code{freq} is incremented to reflect that the word has been seen an additional time.@refill The second rule, because it has the pattern @code{END}, is not executed until the input has been exhausted.
It prints out the contents of the @code{freq} table that has been built up inside the first action.@refill Note that this program has several problems that would prevent it from being useful by itself on real text files:@refill @itemize @bullet @item Words are detected using the @code{awk} convention that fields are separated by whitespace and that other characters in the input (except newlines) don't have any special meaning to @code{awk}. This means that punctuation characters count as part of words.@refill @item The @code{awk} language considers upper and lower case characters to be distinct. Therefore, @samp{foo} and @samp{Foo} are not treated by this program as the same word. This is undesirable since in normal text, words are capitalized if they begin sentences, and a frequency analyzer should not be sensitive to that.@refill @item The output does not come out in any useful order. You're more likely to be interested in which words occur most frequently, or in having an alphabetized table of how frequently each word occurs.@refill @end itemize The way to solve these problems is to use other system utilities to process the input and output of the @code{awk} script. Suppose the @code{awk} program shown above (the text between the single quotes) is saved in the file @file{frequency.awk}. Then the shell command:@refill
@example
tr A-Z a-z < file1 | tr -cd 'a-z \012' \
  | awk -f frequency.awk \
  | sort +1 -nr
@end example
@noindent
produces a table of the words appearing in @file{file1} in order of decreasing frequency. The first @code{tr} command in this pipeline translates all the upper case characters in @file{file1} to lower case. The second @code{tr} command deletes all the characters in the input except lower case characters, spaces, and newlines. (The spaces must be kept, since they are what separates one word from the next.) The second argument to the second @code{tr} is quoted to protect the space and the backslash in it from being interpreted by the shell. The @code{awk} program reads this suitably massaged data and produces a word frequency table, which is not ordered.
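@noindent
The clean-up done by the two @code{tr} steps can also be sketched inside the @code{awk} program itself, using the @code{tolower} and @code{gsub} functions. This is only an illustration, not part of the pipeline described here: @code{wordfreq} is a hypothetical shell function name, the sample input is invented, and the sketch assumes an @code{awk} in which changing @code{$0} causes the record to be split into fields again.

```shell
# wordfreq is a hypothetical name for this sketch; it reads standard input.
wordfreq() {
    awk '
    {
        $0 = tolower($0)         # fold case; assigning $0 re-splits the fields
        gsub(/[^a-z \t]/, "")    # delete everything but letters and blanks
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }'
}

printf 'The cat saw the dog.\nThe dog ran!\n' | wordfreq
```

As with the original program, the table comes out in no particular order, so it still needs to be passed through a @code{sort} stage.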
The @code{awk} script's output is now sorted by the @code{sort} command and printed on the terminal. The options given to @code{sort} in this example specify to sort by the second field of each input line (skipping one field), that the sort keys should be treated as numeric quantities (otherwise @samp{15} would come before @samp{5}), and that the sorting should be done in descending (reverse) order.@refill See the general operating system documentation for more information on how to use the @code{tr} and @code{sort} commands.@refill @ignore @strong{ADR: I have some more substantial programs courtesy of Rick Adams at UUNET. I am planning on incorporating those either in addition to or instead of this program.} @strong{I would also like to incorporate the general @code{translate} function that I have written.} @end ignore @node Notes, Glossary, Sample Program, Top @appendix Implementation Notes This appendix contains information mainly of interest to implementors and maintainers of @code{gawk}. Everything in it applies specifically to @code{gawk}, and not to other implementations. @menu * Compatibility Mode:: How to disable certain @code{gawk} extensions. * Future Extensions:: New features we may implement soon. * Improvements:: Suggestions for improvements by volunteers. @end menu @node Compatibility Mode, Future Extensions, Notes, Notes @appendixsec Downwards Compatibility and Debugging @xref{S5R4/GNU}, for a summary of the GNU extensions to the @code{awk} language and program. All of these features can be turned off either by compiling @code{gawk} with @samp{-DSTRICT} (not recommended), or by invoking @code{gawk} with the @samp{-c} option.@refill If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there are two more options available on the command line. @table @samp @item -d Print out debugging information during execution. @item -D Print out the parse stack information as the program is being parsed. 
@end table Both of these options are intended only for serious @code{gawk} developers, and not for the casual user. They probably have not even been compiled into your version of @code{gawk}, since they slow down execution. The code for recognizing special file names such as @file{/dev/stdin} can be disabled at compile time with @samp{-DNO_DEV_FD}, or with @samp{-DSTRICT}.@refill @node Future Extensions, Improvements, Compatibility Mode, Notes @appendixsec Probable Future Extensions This section briefly lists extensions that indicate the directions we are currently considering for @code{gawk}. @table @asis @item ANSI C compatible @code{printf} The @code{printf} and @code{sprintf} functions may be enhanced to be fully compatible with the specification for the @code{printf} family of functions in ANSI C.@refill @item @code{RS} as a regexp The meaning of @code{RS} may be generalized along the lines of @code{FS}. @item Control of subprocess environment Changes made in @code{gawk} to the array @code{ENVIRON} may be propagated to subprocesses run by @code{gawk}. @item Data bases It may be possible to map an NDBM/GDBM file into an @code{awk} array. @item Single-character fields The null string, @code{""}, as a field separator, will cause field splitting and the @code{split} function to separate individual characters. Thus, @code{split("abcd", a, "")} would yield @code{a[1] == "a"}, @code{a[2] == "b"}, and so on. @item Fixed-length fields and records A mechanism may be provided to allow the specification of fixed length fields and records. @item Regexp syntax The @code{egrep} syntax for regular expressions, now specified with the @samp{-e} option, may become the default, since the POSIX standard may specify this. @c this is @emph{very} long term --- not worth including right now. @ignore @item The C Comma Operator We may add the C comma operator, which takes the form @code{@var{expr1},@var{expr2}}. The first expression is evaluated, and the result is thrown away.
The value of the full expression is the value of @var{expr2}.@refill @end ignore @end table @node Improvements,, Future Extensions, Notes @appendixsec Suggestions for Improvements Here are some projects that would-be @code{gawk} hackers might like to take on. They vary in size from a few days to a few weeks of programming, depending on which one you choose and how fast a programmer you are. Please send any improvements you write to the maintainers at the GNU project.@refill @enumerate @item State machine regexp matcher: At present, @code{gawk} uses the backtracking regular expression matcher from the GNU subroutine library. If a regexp is really going to be used a lot of times, it is faster to convert it once to a description of a finite state machine, then run a routine simulating that machine every time you want to match the regexp. You might be able to use the matching routines used by GNU @code{egrep}. @item Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like) parser to convert the script given it into a syntax tree; the syntax tree is then executed by a simple recursive evaluator. Both of these steps incur a lot of overhead, since parsing can be slow (especially if you also do the previous project and convert regular expressions to finite state machines at compile time) and the recursive evaluator performs many procedure calls to do even the simplest things.@refill It should be possible for @code{gawk} to convert the script's parse tree into a C program which the user would then compile, using the normal C compiler and a special @code{gawk} library to provide all the needed functions (regexps, fields, associative arrays, type coercion, and so on).@refill An easier possibility might be for an intermediate phase of @code{awk} to convert the parse tree into a linear byte code form like the one used in GNU Emacs Lisp. 
The recursive evaluator would then be replaced by a straight line byte code interpreter that would be intermediate in speed between running a compiled program and doing what @code{gawk} does now.@refill @item An error message section has not been included in this version of the manual. Perhaps some nice beta testers will document some of the messages for the future. @end enumerate @node Glossary, Index , Notes, Top @appendix Glossary @table @asis @item Action A series of @code{awk} statements attached to a rule. If the rule's pattern matches an input record, the @code{awk} language executes the rule's action. Actions are always enclosed in curly braces. @xref{Actions}.@refill @item Amazing @code{awk} Assembler Henry Spencer at the University of Toronto wrote a retargetable assembler completely as @code{awk} scripts. It is thousands of lines long, including machine descriptions for several 8-bit microcomputers. It is distributed with @code{gawk} and is a good example of a program that would have been better written in another language.@refill @item Assignment An @code{awk} expression that changes the value of some @code{awk} variable or data object. An object that you can assign to is called an @dfn{lvalue}. @xref{Assignment Ops}.@refill @item @code{awk} Language The language in which @code{awk} programs are written. @item @code{awk} Program An @code{awk} program consists of a series of @dfn{patterns} and @dfn{actions}, collectively known as @dfn{rules}. For each input record given to the program, the program's rules are all processed in turn. @code{awk} programs may also contain function definitions.@refill @item @code{awk} Script Another name for an @code{awk} program. @item Built-in Function The @code{awk} language provides built-in functions that perform various numerical and string computations. Examples are @code{sqrt} (for the square root of a number) and @code{substr} (for a substring of a string). 
@xref{Built-in}.@refill @item Built-in Variable The variables @code{ARGC}, @code{ARGV}, @code{ENVIRON}, @code{FILENAME}, @code{FNR}, @code{FS}, @code{NF}, @code{IGNORECASE}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS}, @code{RLENGTH}, @code{RSTART}, @code{RS}, and @code{SUBSEP} have special meaning to @code{awk}. Changing some of them affects @code{awk}'s running environment. @xref{Built-in Variables}.@refill @item C The system programming language that most GNU software is written in. The @code{awk} programming language has C-like syntax, and this manual points out similarities between @code{awk} and C when appropriate.@refill @item Compound Statement A series of @code{awk} statements, enclosed in curly braces. Compound statements may be nested. @xref{Statements}.@refill @item Concatenation