Simtel MSDOS - Coast to Coast

Text File | 1991-05-04 | 23KB | 609 lines

The AWK Programming Language Users Manual and Tutorial

This document is an introduction to the use of AWK for manipulating text and the textual representation of numbers. This mouthful means that you can use AWK to manipulate words and numbers.

1. Basic Concepts

1.1 AWK Programs

AWK programs consist of a series of PATTERNS and ACTIONS. Patterns are boolean (logical) expressions; if a Pattern evaluates to true (a non-zero number or a non-null string) then the associated Action is performed. Actions are program fragments in a "C" like language.

The Pattern-Action statements comprising an AWK program are evaluated in turn for each input RECORD. That is, a Record is read and the Patterns in the program are evaluated in order; for each Pattern that succeeds, its Action is performed. For example:

    NR == 5 { print }

is a simple program that prints the fifth line of a file. NR is a built-in variable that is equal to the number of records AWK has read so far. The double equal sign is the equality comparison operator from C.

As you can see from the above example, a Pattern is a naked expression and an Action is a compound statement or list of program statements enclosed in braces ({}). You may omit the Action in a Pattern/Action statement, in which case the default action is { print }. You may, on the other hand, omit the Pattern, which defaults to true, so that the Action is always taken. Finally, if you omit both the Pattern and the Action you have a blank line, which is ignored.

1.2 Fields and Records

To AWK all data are divided into FIELDS and RECORDS. A field is any string of characters delimited by the Field Separator, or FS for short. Similarly a record is any string of characters delimited by the Record Separator, or RS. In the simplest form a Field is a string of characters surrounded by white space (blanks or tabs), and a Record is a line of text.
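
As a rough illustration of the default splitting (a sketch, not part of the original manual), the following shell wrapper feeds two lines to a POSIX awk and reports how many fields each record has. The helper name show_fields and the second input line are made up for the example, and the sketch borrows NF and the dollar ($) operator, which are described further on.

```shell
# show_fields is a hypothetical helper; the second input line is made up.
# Fields are separated by runs of blanks or tabs, records by newlines.
show_fields() {
    printf 'Rob Duff 7\nAnn   Lee\t12\n' |
    awk '{ print NR ": " NF " fields, first=" $1 ", last=" $NF }'
}

show_fields
```

Both records come out with three fields even though the second one mixes several blanks and a tab between its words.
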

You can make the Field Separator as complex as you like by providing your own REGULAR EXPRESSION for the FS. The Record Separator is limited to the null string "" or a newline "\n". The null string means that a blank line separates multi-line records, and the newline means that each line is a record.

You can refer to the Fields in the current Record with the dollar ($) operator:

    $3 < 10 { print NR, $0 }

Here $0 denotes the entire Record, and $3 is the third field. If the tenth line of the input file was:

    Rob Duff 7

the output generated for this record by the one line program would be:

    10 Rob Duff 7

Since the third field (7) is less than 10, the program prints the record number (10), followed by a space (the Output Field Separator OFS), followed by the whole record ($0).

1.3 Regular Expressions

You may wonder why something like NR == 5 would be called a Pattern. The name comes from pattern matching with Regular Expressions. A Regular Expression is a formula for matching strings. The simplest is a straight match of a string of characters within a line:

    /with/

will match a Record that contains the substring "with" at any point. A more complex Regular Expression is Alternation (one string or another):

    /with|line/

which is equivalent to

    /with/ || /line/

and will match any Record that has "with" or has "line" as a substring. A somewhat less simple concept is the CLASS or set of characters:

    /[0-9]/

will match any digit and

    /[a-zA-Z]/

will match any upper or lower case letter. A special Class is the period (.) which will match any character.

Next we come to the repetition operators. There are three of them: the asterisk (*) for zero or more occurrences of a pattern, the plus sign (+) for one or more, and the question mark (?) for zero or one occurrence. For example the pattern:

    /[0-9]+/

will match one or more digits, or in other words it will recognize numbers.
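
To see alternation, a character class with repetition, and the optional operator side by side, here is a small sketch run through a POSIX awk from the shell. The helper name match_demo and the four input lines are invented for the example; the AWK described in this manual may differ in details from the awk on your system.

```shell
# match_demo is a hypothetical helper; the input lines are made up.
match_demo() {
    printf 'a line\nwith 42\nnothing\ncolor colour\n' |
    awk '
        /with|line/ { print NR ": with-or-line" }   # alternation
        /[0-9]+/    { print NR ": has a number" }   # one or more digits
        /colou?r/   { print NR ": color/colour" }   # the u is optional
    '
}

match_demo
```

The second record satisfies two Patterns, so both of their Actions run for it, in program order. The third record matches nothing and produces no output.
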

As you can see a Regular Expression is delimited by slashes (/) in the same fashion as a string is delimited by quotes ("). You can use a Regular Expression in a pattern all by itself or in a logical expression:

    /line/            { print "line", NR }
    NR > 5 && /with/  { print NR, $0 }

or you can use the match operator tilde (~) when you want to match anything other than the Current Record ($0):

    $2 ~ /ff/ { print $2,$3 }

Finally if you want to find out if a string does not match a pattern you use the not-match (!~) operator.

1.4 Expressions

Expressions in AWK are something that C programmers will be comfortable with immediately and that those familiar with other languages can pick up quickly. You may be either disappointed or relieved by the restrictions on expressions in AWK. You cannot for instance get the address of anything, and arrays are one-dimensional (multi-dimensions are simulated). Most of the power of the C language is available for expressions.

Perhaps the most familiar expressions are those involving arithmetic and assignment:

    a = b + c * 2           # a becomes b plus c times 2

A less familiar kind of expression involves comparison and boolean operators:

    a > b && c == 1         # a is greater than b and c equals 1

You have already encountered the Field operator ($) and the pattern matching operator (~):

    $1 ~ /[0-9]/            # field #1 contains a digit

The one operation that has no operator is string concatenation:

    name = name ".DAT"      # add file extension to name

Doubtless the least familiar to non C programmers are the conditional (?:) operator and the increment/decrement operators:

    x = (a > b) ? a-- : b-- # x becomes the greater of a and b, then
                            # the greater of a and b is decremented

Beware of the traps that some of these operators can let you fall into.
The assignment operator can be used anywhere, so if you use it instead of the equality operator you will not get what you expect:

    a == 3      # gives 1 if a is 3, otherwise 0
    a = 3       # gives 3 always and sets a to 3

So even the assignment expression has a value that can be used within a larger expression:

    b = (a = 5) + 2     # b becomes 7 and a becomes 5

The comparison operators will do either a string comparison or a numeric comparison depending on the arguments to the operation. If both left and right expressions are numeric then a numeric compare is done, otherwise a string compare is performed. To force AWK to perform the kind of comparison you want, you must ensure either that both expressions are numeric or that one of them is a string. In the simplest case this can be done by adding a zero to make an expression numeric or by concatenating a null string to make it a string.

    3 < "10"    # false -- string compare
    3 < 10      # true -- numeric compare

Fields, command line assignments, ARGV and the arrays created by the split function have the special property of (possibly) being both a number and a string. If the field can be fully represented as a number by AWK then the field will have the combined type, and any variable that has been assigned the value of one of these fields will have the combined type. If both sides of a comparison are combined type, or one side is a number and the other is this combined type, then a numeric comparison is done.

1.5 Variables

Variables spring into existence out of the ylem by being mentioned. They can have either a numeric or string type. The type of a variable is determined by the type of the expression that is assigned to it:

    a = x + 0   # a is a number
    a = x ""    # a is a string

Before any value is assigned to a variable its type is indeterminate, that is, it is both a string and a number (this is important when doing comparisons and for printing). An uninitialized variable will compare equal to the null string ("") and to zero (0). It will print as the null string.

    print (x, x=="", x==0)

will print:

     1 1

1.6 Arrays

Any variable can also be an array. In AWK arrays are one-dimensional, but multi-dimensions are simulated by separating the subscripts with commas, which are joined using a BUILTIN VARIABLE called SUBSEP:

    a[1,2,3]

is equivalent to:

    a[1 SUBSEP 2 SUBSEP 3]

where the default value for SUBSEP is ascii ^Z (SUB).

Arrays are indexed by strings, so the elements a["1"] and a["01"] are not the same, and the elements a["1"] and a[1] may or may not be the same depending on the Output Format (OFMT).

Since arrays are indexed by strings there must be a way of stepping through the array using strings. The way we do it is with a special form of the for loop:

    for (i in array) print i, array[i]

will step through the array (in lexical order) printing each index and the associated array element.

You can still access arrays using numbers if they have been put into the array using numbers, since the string representation for the indices will be the same (unless you change OFMT) between creating the array and using it.

One departure from most programming languages that this kind of array provides is fractional indices for array elements. You can for instance have an array element indexed by any number:

    a[3.14159] = "pi"

Once you have finished with an array element you may remove it using the delete statement, or you may assign every element out of existence:

    delete a[pi]
    a = ""

Finally there is a test for array membership that you must use if you don't want extra array elements, since the very mention of an array element will cause it to spring into existence. You must use:

    if (i in a) ...

because by using:

    if (a[i] == "") ...

you will create all kinds of unwanted array elements that consist of uninitialized variables.

1.7 Built In Variables

There are a number of variables that AWK defines so that you can get information, control certain aspects of AWK and in two cases get extra information about a function.

Two variables give information about the command line, ARGC and ARGV. ARGC is the number of arguments, excluding the options and program that AWK itself uses, and ARGV is an array containing the values of the command line arguments, with ARGV[0] being AWK's name and ARGV[1] to ARGV[ARGC-1] the rest of the command line arguments.

Information about the file being read in is contained in FILENAME, and in FNR, which contains the number of records read from the current file. Neither of these variables has any valid meaning during a BEGIN or END action.

The variables that describe the current input record are NR, the number of records read so far, and NF, the number of fields in the current record. NF will change any time you assign a value to $0 or to any field after $NF. Any fields between $NF and $n where n > NF will be set to the null string ("").

Output is controlled by the Output Record Separator (ORS), the Output Field Separator (OFS), and the Output Format (OFMT). If you print some items such as:

    print 1.20, "test", 001

then each comma will be replaced by the OFS (default blank) and the ORS will be printed at the end. The numbers will be printed according to the OFMT (default %.6g) so:

    1.2 test 1

will be the result of the print statement.

Input is controlled by the Record Separator (RS) and the Field Separator (FS). If the RS is the null string ("") then the input record is delimited by a blank line. If it is a string consisting of the newline ("\n") then each line will be a record. The FS has more versatility: it can be any Regular Expression, so you have full control over the parsing of fields.

The pseudo multi-dimensional array Subscript Separator (SUBSEP) is described in the section on arrays.

Finally the variables RLENGTH and RSTART are set by the match function to be the length of the string matched and the index of the first character matched respectively.

1.8 Control Structures

AWK has three basic control structures for alternation, iteration, and repetition.
They are respectively the if statement, the while statement, and the for statement.

The if statement allows two mutually exclusive paths of program execution:

    if (a > b) print "yes"; else print "no"

Either of the paths may be a null statement, and if the second statement is omitted then the else keyword may also be omitted:

    if (command == "print") print

The while statement comes in two flavours, test first and test last. The test last sort is known as the do-while statement:

    do i = a[i]; while (a[i] != 0)

and the test first sort as the while statement:

    while (a[i] != 0) i = a[i]

Both of these will continue to loop as long as the expression in parentheses evaluates to true (non zero or non null).

The for statement is a generalized loop generator in that any three expressions can be used to control it. Most often a familiar combination of initialization, testing and modification is done:

    for (i = 0; i < ARGC; i++) print ARGV[i]

although the three expressions are not limited to this style of loop. Indeed any of the expressions may be omitted; the middle (test) expression is taken to be true if it is missing. The for loop above is equivalent to:

    i = 0; while (i < ARGC) { print ARGV[i]; i++ }

As you can see the first expression is evaluated before the loop, the second is tested before the statement is executed and the third is evaluated after the statement.

The while, do-while, and for statements have two special statements that can be used inside them to control the flow in extraordinary ways. First, the break statement can be used to jump out of the loop entirely, and second, the continue statement is used to jump past the rest of the statements (in a COMPOUND STATEMENT) and start another iteration of the loop (at the third expression in the case of the for loop).

    for (i = 1; i < NF; i++) {
        if ($i < 10) continue   # skip if < 10
        if ($i > 20) break      # stop if > 20
        x += $i                 # accumulate values
    }

Finally there are two statements that control the AWK program globally. They are the next and exit statements.
They function similarly to the continue and break statements for loops. The next statement will cause the AWK program to stop in its tracks and start with the next record. The exit statement will cause the AWK program to stop processing input records and start the END actions (if any), or to stop altogether if the exit statement is in an END action. There is an optional numeric argument to the exit statement that is returned as the ERRORLEVEL of the program.

1.9 Statements

There are four kinds of statements in AWK: expressions, flow-control, printing and compound statements. Expressions can be assignments or expressions with side effects. For instance:

    a = a + 1

and

    a++

Expressions with neither assignments nor side effects may be used as statements, but why bother? Flow control statements were outlined in section 1.8. Compound statements are groups of statements delimited by braces that may be used anywhere single statements are used (as in the flow control statements).

    for (i = 0; i < 4; i++) {
        sum = sum + a[i]; print i, a[i]
    }

Statements may be separated by semi-colons (;) or by the end-of-line. If you want to extend a statement across more than one line you break the line with a backslash (\). You may break a statement without a backslash after a comma, left brace, &&, ||, do, else and the right parenthesis in an if or for statement.

    if (a > b ||
        c < d) {
        print ("silly",
               "program"); print a,b,c,d
    }

A comment beginning with the octothorp (#) may be put at the end of any line (including a blank line).

    # print current record
    print       # printing current record ($0)
    # current record printed

2. Advanced Concepts

2.1 Functions

AWK allows you to write your own functions. Two keywords are provided for this purpose. They are function and return, for definition and value respectively. You declare a function with a special pattern action pair.

    function factorial(a) {
        return (a <= 1) ? 1 : factorial(a-1) * a
    }

You invoke the function as normal, but there must be no space between the function name and the left parenthesis. Any extra arguments that you provide are evaluated and discarded, and any parameters that you do not provide arguments for become uninitialized local variables.

    function print_array(a, i) {
        for (i in a) print i, a[i]
    }

    { telephone[$1] = $2 }          # collect name/telno

    END {
        print_array(telephone)      # print name and telno
    }
    .
    harry_rag (111)555-1212

When you are using recursive functions like factorial, you should be aware that there is a limit on the level of recursion that you can do because of the size of the evaluation stack.

3. Anatomy of an AWK program

3.1 The Problem

(MS|PC)DOS normally prints the dates of files in a directory in the form MM-DD-YY. The problem is to convert that to the form DD-Mmm-YY where Mmm is the first three letters of each month.

3.2 The Data

     Volume in drive C is R_DUFF
     Directory of C:\AWK

    .            <DIR>     11-16-88   1:53p
    ..           <DIR>     11-16-88   1:53p
    AWK             23721  11-16-88   2:15p
    AWK      C      14945   1-24-89   9:20p
    AWK      DOC    15791   2-19-89   5:09p
    AWK      EXE   118361   2-19-89   3:18p
    AWK      H       5380  11-13-88   1:58p
    AWK      MAN    19552  11-20-88  12:15p
    AWK      OBJ    10132   1-24-89   9:20p
            9 File(s)   7235584 bytes free

Here there are two kinds of records with dates and four without dates. The two with dates are 4 and 5 fields long, and the ones without dates are 6, 3 and 5 fields long. Since we only want to modify the fields that have dates in them we have to differentiate between the two types of size 5 records.

3.2 The Program

We have a BEGIN section, followed by a function declaration, followed by three pattern/action statements.

    # dir - list directory with date interpretation

    BEGIN {
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ");
    }

The BEGIN section creates the month interpretation array.
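
If you want to poke at the month array on its own, the following sketch looks up one month name from the shell. The helper name month_name is invented for the example, and the -v option used to hand the number to the program is the POSIX awk convention, which may not match the AWK described here.

```shell
# month_name is a hypothetical helper: print the name for a month number.
month_name() {
    awk -v n="$1" 'BEGIN {
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")
        # int() makes the value numeric so that "01" indexes month[1]
        print month[int(n)]
    }'
}

month_name 11
month_name 01
```

The split call numbers the elements from 1, so month[1] is "Jan" and month[12] is "Dec". A month written with a leading zero has to be made numeric first, because arrays are indexed by strings and "01" is not the same index as "1".
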

    function date(i) {
        if ($i ~ /[0-9]+-[0-9]+-[0-9]+/) {
            n = split($i, mdy, "-")
            mdy[1] = month[int(mdy[1])]
            $i = sprintf("%2d-%3s-%02d", mdy[2], mdy[1], mdy[3])
            return 1
        }
        return 0
    }

The date function checks with a regular expression for a valid date in a particular field. The regular expression matches 1 or more digits ([0-9]+) followed by a dash (-), followed by 1 or more digits (again) followed by another dash, finally ending with 1 or more digits. These are the three numbers separated by dashes that MSDOS uses for dates.

If the date is valid then the field is split into sub-fields at the dashes by the split function. This will give an array (mdy) with three elements corresponding to the three numbers in the date. The month sub-field is replaced by the three character name from the month array by coercing the month number to numeric (to remove leading zeros) and using this as an index into the array generated in the BEGIN action. The sub-fields are then reassembled into the format that we want using sprintf: the day first (mdy[2]), followed by the month name (mdy[1]), followed by the year (mdy[3]), all separated by dashes. This string is assigned to the field that we just took apart. Finally a success code is returned (1). If the date is not valid then only the failure code (0) is returned and no substitution is performed.

    NF == 5 {
        if (date(4))
            $0 = sprintf("%-9s%-3s%9s%11s%8s", $1,$2,$3,$4,$5)
    }

This pattern checks for those records that have 5 fields in them. They are of two types:

    AWK      MAN    19552  11-20-88  12:15p

and

            9 File(s)   7235584 bytes free

Only the first type has a date for us to change. We therefore check field number 4 for a date, and if it is one then we rebuild the record with the fields in the correct format for our new date. If we didn't use sprintf and assign a new value to $0 then all of the fields would be separated by a blank (OFS) instead of nicely lined up.
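
The difference between the two ways of rebuilding a record is easy to try for yourself. This sketch, run through a POSIX awk from the shell, assigns a new date to field 4 of one directory line and prints the record with and without the sprintf step. The helper names and the shell variable line are invented for the example.

```shell
# One directory line from the sample data, with its original spacing.
line='AWK      MAN    19552  11-20-88  12:15p'

# Hypothetical helper: assigning to a field makes awk rebuild the
# record with OFS (a single blank) between the fields.
rebuilt_with_ofs() {
    printf '%s\n' "$line" | awk '{ $4 = "20-Nov-88"; print }'
}

# Hypothetical helper: laying the record out again with sprintf
# keeps the columns lined up.
rebuilt_with_sprintf() {
    printf '%s\n' "$line" |
    awk '{
        $4 = "20-Nov-88"
        $0 = sprintf("%-9s%-3s%9s%11s%8s", $1, $2, $3, $4, $5)
        print
    }'
}

rebuilt_with_ofs
rebuilt_with_sprintf
```

The first call prints the five fields squeezed together with single blanks. The second prints them in fixed-width columns, which is why the program goes to the trouble of assigning to $0.
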

    NF == 4 {
        date(3)
        if ($2 ~ /<DIR>/)
            $0 = sprintf("%-9s%9s%14s%8s", $1,$2,$3,$4)
        else
            $0 = sprintf("%-9s%12s%11s%8s", $1,$2,$3,$4)
    }

This pattern checks for those records that have 4 fields in them. They are of two types:

    AWK             23721  11-16-88   2:15p

and

    ..           <DIR>     11-16-88   1:53p

They both have dates in the same field but we want to format them differently. You will notice that the file size and the <DIR> indication do not line up, so we must format them separately. Hence we use our date function to fix up the third field and then format one way if the second field matches the <DIR> indicator and another way if it doesn't.

    { print }

Finally we get to print every record, some modified by the preceding actions and some in their original form.

3.3 The Output

     Volume in drive C is R_DUFF
     Directory of C:\AWK

    .            <DIR>     16-Nov-88   1:53p
    ..           <DIR>     16-Nov-88   1:53p
    AWK             23721  16-Nov-88   2:15p
    AWK      C      14945  24-Jan-89   9:20p
    AWK      DOC    15791  19-Feb-89   5:09p
    AWK      EXE   118361  19-Feb-89   3:18p
    AWK      H       5380  13-Nov-88   1:58p
    AWK      MAN    19552  20-Nov-88  12:15p
    AWK      OBJ    10132  24-Jan-89   9:20p
            9 File(s)   7235584 bytes free

3.4 Conclusion

The result ends up looking much the same as the input but with a more readable date. We have thus created an AWK program that can be used to pretty up directories.
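
For convenience, here is the whole program from section 3 assembled in one place and wrapped in a shell function so that it can be tried with a POSIX awk. This is only a sketch: the helper names dir_dates and sample_listing are invented, the sample is a shortened version of the directory listing used in this section, and the column spacing of the sample is an assumption.

```shell
# sample_listing is a hypothetical helper that prints a short DOS-style
# directory listing (a subset of the data used in section 3).
sample_listing() {
    cat <<'EOF'
 Volume in drive C is R_DUFF
 Directory of C:\AWK

.            <DIR>     11-16-88   1:53p
AWK             23721  11-16-88   2:15p
AWK      C      14945   1-24-89   9:20p
        9 File(s)   7235584 bytes free
EOF
}

# dir_dates is a hypothetical wrapper: it reads a listing on standard
# input and rewrites MM-DD-YY dates as DD-Mmm-YY.
dir_dates() {
    awk '
    # dir - list directory with date interpretation
    BEGIN {
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")
    }

    # Rewrite field i in place if it looks like a date; return 1 on success.
    function date(i) {
        if ($i ~ /[0-9]+-[0-9]+-[0-9]+/) {
            n = split($i, mdy, "-")
            mdy[1] = month[int(mdy[1])]
            $i = sprintf("%2d-%3s-%02d", mdy[2], mdy[1], mdy[3])
            return 1
        }
        return 0
    }

    # Files with an extension: name, ext, size, date, time.
    NF == 5 {
        if (date(4))
            $0 = sprintf("%-9s%-3s%9s%11s%8s", $1, $2, $3, $4, $5)
    }

    # Directories and files without an extension.
    NF == 4 {
        date(3)
        if ($2 ~ /<DIR>/)
            $0 = sprintf("%-9s%9s%14s%8s", $1, $2, $3, $4)
        else
            $0 = sprintf("%-9s%12s%11s%8s", $1, $2, $3, $4)
    }

    { print }
    '
}

sample_listing | dir_dates
```

The records that carry no date (the volume and directory lines, the blank line and the summary line) pass through untouched, because no field of theirs is ever assigned.
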