

RISC World

Great awk

Gavin Wraith continues his series on awk.

(If you missed the first part, there's a copy of the Mawk interpreter in the SOFTWARE directory on the CD-ROM.)

Records and Fields: Example 3

In the previous article I said that pattern-action statements applied to the text line by line and that the variable $n denoted the n-th word of the current line. That was a deliberate oversimplification.

Adopting jargon from database tradition, we may say that every textfile can be thought of as a sequence of records, separated by a record-separator, and that each record can be thought of as a sequence of fields, separated by a field-separator. Awk has a number of built in variables, among them:

    FS          input field-separator
    RS          input record-separator
    OFS         output field-separator
    ORS         output record-separator
    NF          number of fields in the current record
    NR          number of records read so far
    FNR         number of records read from current input file
    FILENAME    pathname of current input file

The default value for both RS and ORS is the string '\n' - in other words, a newline character. This means that by default a record is a line of text. The default value for FS and OFS is ' ', a single blank space. When FS takes this special value, input fields are separated by blank spaces and/or tabs, and leading blank spaces and tabs are ignored. So, in effect, fields are words. The variable OFS determines how the comma symbol is interpreted between the expressions following print. The variable ORS determines the string that is output automatically after a print statement.
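These defaults can be checked without any input file, because assigning a string to $0 inside a BEGIN action makes awk split it into fields just as it would a line it had read. The following is only a minimal sketch: the helper name rejoin and the sample text are invented for illustration.

```awk
# rejoin() is a hypothetical helper, not part of awk: split 'line'
# using the current FS and rebuild it with 'sep' as OFS.
function rejoin(line, sep) {
    $0 = line          # awk splits the new record into fields
    OFS = sep
    $1 = $1            # assigning to any field rebuilds $0 using OFS
    return $0
}

BEGIN {
    $0 = "  the quick\tbrown   fox"   # leading blanks and the tab are ignored
    print NF                           # 4
    print $2                           # quick
    print rejoin($0, "-")              # the-quick-brown-fox
    OFS = ", "; ORS = "!\n"
    print $1, $4                       # the, fox!
}
```

Because the program consists only of a BEGIN action, awk runs it and stops without waiting for any input.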

If RS is set to an empty string '' then input records are separated by one or more blank lines, and field-separators are either newlines or values given by FS. So the following settings:

    BEGIN { RS = ""; FS = "\n" }

would interpret paragraphs as records and individual lines as fields, in a 'standard' text file where paragraph breaks are blank lines.
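One way to watch paragraph mode at work is to write a few lines to a scratch file and read them back with getline, which obeys the same RS and FS settings. This is a sketch under assumptions: the file name ParaTest and the sample text are made up, and the program needs somewhere writable to put the file.

```awk
# Count the records and fields awk sees in paragraph mode.
# 'scratch' is a hypothetical temporary file name; choose one that
# suits your system.
BEGIN {
    scratch = "ParaTest"
    printf "first line\nsecond line\n\n\nonly line\n" > scratch
    close(scratch)

    RS = ""; FS = "\n"           # paragraphs are records, lines are fields
    while ((getline < scratch) > 0) {
        records++
        print "record " records " has " NF " field(s); $1 is: " $1
    }
    close(scratch)
}
```

The two blank lines in the middle of the sample count as a single separator, so the loop should report two records, the first with two fields and the second with one.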

Incidentally, as the example above shows, the semicolon can be used to put many statements on the same line. Unlike C, the semicolon is a statement separator, not a statement terminator, which is why we don't need one after FS = "\n". The sample file Authors contains the following two-line records:

J D Salinger
The Catcher in the Rye

M Peake
Gormenghast

J Cowper Powys
Wolf Solent

Himself
Augustus Carp

By applying them to the Authors file, can you explain the difference in behaviour between the following two awk programs?

    # Convert1
    BEGIN { RS = ""
        FS = "\n"
        OFS = " wrote "
    }
    { print }

    # Convert2
    BEGIN { RS = ""
        FS = "\n"
        OFS = " was written by "
    }
    { print $2,$1 }

Without those commas, OFS plays no role. print by itself is equivalent to print $0, which implies that you have to write print "" to get a newline on its own. For fancier formatted output you can use the printf and sprintf functions. If you are used to C you will have no difficulty with these; if not, there's not room to describe them here.
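As a brief taste of formatted output, the sketch below uses sprintf to build a fixed-width line and printf to print it. The helper name row and the sample values are invented for illustration; the format letters are the same ones that C uses.

```awk
# row() is a hypothetical helper: left-align the name in a 10 character
# column and right-align the amount in a 5 character column.
function row(name, pence) {
    return sprintf("%-10s|%5d|", name, pence)
}

BEGIN {
    printf("%s\n", row("Peake", 42))       # Peake     |   42|
    printf("%s\n", row("Salinger", 1250))  # Salinger  | 1250|
}
```

Note that printf, unlike print, does not append ORS, which is why the newline has to appear in the format string.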

To undo the effect of Convert2 try the following awk program:

    # revert
    BEGIN { FS = " was written by " }
    { printf("%s\n%s\n\n", $2,$1) }

Command Line Arguments and Files

Awk was devised for Unix in the days before Desktop environments. Have a look at !mawk.Docs.manpage for the complete specification of the command line arguments for the executable file !mawk.bin.mawk. As awk programs are often so short, awk was written to accept 'throwaway' programs inside single quotes on the command line to the awk (in our case mawk) command.

In RISC OS, where path names are limited to 256 characters, this is not so convenient. Instead one must use the
    -f <program pathname>

option. If you shift-double-click on the Obey file !mawk.Apps.!RunAwk.!Run you will see in the last line how this is used when you drag a textfile onto the !RunAwk icon. The contents of this file are shown below:

| AWK
if "<Awk$Prog>" = "" then echo No awk program chosen
if "<Awk$Prog>" = "" then obey
if "%*0" = "" then obey
?leaf <Awk$Prog>
do taskwindow "mawk -f <Awk$Prog> %*0" -name <leaf>=>%0 -quit

There is nothing to stop you using the mawk command with its full variety of command line options in an Obey file. Furthermore, you can use the -f <program pathname> option lots of times in the same command, which is useful for loading in libraries of function definitions. For instance, you could use the following:

mawk -f <prog_1> ... -f <prog_m>  <file_1> ... <file_n>

A word about the sequence in which things happen needs to be said here. The program does not run until all m program files have been read in. Then all the function definitions found in them are compiled, and the built in array ARGV is initialized so that ARGV[i] has the value <file_i>, with ARGV[0] holding the name of the command itself. The built in variable ARGC is given the value n+1, the number of elements in ARGV. Then all the BEGIN actions are grouped together and executed. Then the input files are read in, in sequence, and finally the END actions are grouped together and executed.
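A quick way to see what awk has been handed is to list the argument array from a BEGIN action. The sketch below does just that; the helper name describe is invented for illustration.

```awk
# List the command line arguments as awk sees them.
# describe() is a hypothetical helper used only for this sketch.
function describe(i) {
    return "ARGV[" i "] = " ARGV[i]
}

BEGIN {
    # ARGV[0] is normally the name of the command; the arguments that
    # follow the program options start at ARGV[1].
    for (i = 0; i < ARGC; i++)
        print describe(i)
}
```

Since the program has only a BEGIN action, none of the arguments is opened as an input file, so it can safely be run with any arguments you care to try.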

There is a useful trick for passing in values. The command line arguments do not have to be the pathnames of input files. For example, suppose you wanted <file_k> to specify an output file. Then you would include in one of the programs a line of the form

BEGIN { out = ARGV[k] ; ARGV[k] = "" }

Setting the k-th element of ARGV to a null string will suppress input of the k-th command line argument. The variable out can then be used for output statements. This technique will be demonstrated in the next example.

Invoicing customers: Example 4

Suppose you are the milkman. You want to add up the amount due from customers and send out bills to them. The directory Invoices holds a collection of files (how many there are and what they are called is irrelevant), each of which keeps a record of milk deliveries, every record having the form:

<customer> <amount>

Double-click on the Obey file Bill to create a directory of bills to send out to customers.

Shift-double-clicking on Bill will reveal its contents:

| Bill customers
dir <Obey$Dir>
enumdir Invoices invlist
cdir Letters
mawk -f Total invlist Invoices Letters
delete invlist

The fifth line makes the awk program Total act on the temporary file invlist, and passes the input and output directory names as ARGV[2] and ARGV[3]. Shift-double-clicking on Total will reveal:

    # Total up invoices
    BEGIN { invoicesdirectory = ARGV[2]
            outdir = ARGV[3]
            ARGV[2] = ARGV[3]  = "" # remove args
            sep = "." } # directory separator symbol

    NF { list[$1] = "" } # read what invoice files there are

    END { for (file in list)
            invoice(invoicesdirectory sep file, account)
          for (customer in account)
            if ((owing = account[customer])) # no bill if zero
              letter(outdir sep customer,customer,owing) }

    function invoice(f,a) {
      while ((getline < f) > 0)
        a[$1] += $2   # add to the customer's running total
      close(f) }

    function letter(out,customer,amount) {
        print "Milk Bill for",sysvar("Sys$Date") > out
        print "Dear",customer > out
        print "You owe", amount, "pence." > out
        print "Thank you." > out
        close(out) }

One or two points need comment:

  • In line 4, ARGV[2] = ARGV[3] = "" works because in awk - as in C - assignment statements return values.
  • The pattern NF in line 6 holds precisely for nonempty lines. The array list is indexed by the leaf names of files in Invoices and has uninteresting values.
  • In lines 8 and 11 concatenation of strings is used, by simply separating string expressions by blank spaces.
  • Note how we assign the value of account[customer] to the variable owing inside a condition. This is a typical C turn of phrase which avoids recomputing the value. The '=' here is not a comparison (which would be '==') but an assignment, and the test is effectively whether or not account[customer] is zero.
  • In the definition of the function invoice() we repeatedly read in a line of text from the file f using the getline < operator and we create the account[] array by passing it as a parameter to the function.
  • In the definition of the function letter() we see the print statement redirected to the file signified by the variable out.
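Several of these points can be tried out in isolation. The sketch below keeps a running total per customer in an array passed to a function, then uses an assignment inside a condition and string concatenation when reporting. The helper name add and the sample records are invented for illustration, and the records are supplied as strings rather than read from files.

```awk
# add() is a hypothetical helper: split a "<customer> <amount>" record
# and add the amount to that customer's running total in array a.
# 'parts' is a local variable, declared as an extra parameter.
function add(a, record,    parts) {
    if (split(record, parts, " ") == 2)
        a[parts[1]] += parts[2]
}

BEGIN {
    add(account, "Peake 30")
    add(account, "Salinger 0")
    add(account, "Peake 12")
    sep = "."                                # directory separator, as in Total
    for (customer in account)
        if ((owing = account[customer]))     # assignment, then test for non-zero
            print "Letters" sep customer, owing   # Letters.Peake 42
}
```

Salinger owes nothing, so the condition fails for that customer and only one line is printed.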

You can execute RISC OS commands from within an awk program using the standard built in function system(), which is analogous to Basic's OSCLI. The function sysvar() is not standard; it is an extension specially for RISC OS. It takes a string denoting a system variable as argument and returns its value. See the file !mawk.Docs.Riscos for a quick summary of these extensions.

This milk account example is, of course, just a skeleton. There are a great many aspects of the model that we have left out or simplified. Nevertheless, I hope you can see that by using Obey files and awk programs together you can achieve a great deal with very few lines of code.

Summary

In this article we have seen that awk parses text as a sequence of records, made up of a sequence of fields. Combining awk programs with Obey files lets us construct more involved applications that use the full range of the command line, whereas the click and drag approach given in the first article merely restricts awk programs to the role of text filters. In later articles we will look more closely at patterns and consider how to use other applications to display output in tables or graphs.

Gavin Wraith (gavin@wraith.u-net.com)
