The World of Computer Software

home *** CD-ROM | disk | FTP | other *** search

/ The World of Computer Software / World_Of_Computer_Software-02-385-Vol-1of3.iso / p / pccts.zip / begtut.ms next >

Wrap

Text File | 1992-12-08 | 38KB | 1,154 lines

.de iH \\$1 \\$2 \\$3 \\$4 \\$5 \\$6 \\$7 \\$8 \\.XS .if \\n(NS-4 .if \\n(NS-3 .if \\n(NS-2 .if \\n(NS-1 \\*(SN \\$1 \\$2 \\$3 \\$4 \\$5 \\$6 \\$7 \\$8 \\.XE .. .de 1s .nr PS 12 .nr VS 13 .br .LP .. .de 2s .nr PS 12 .nr VS 16 .br .LP .. .de cB .nr PS 10 .nr VS 11 .br .LP .KS .LD .ft CW .. .de cE .DE .KE .2s .ft R .. .EH 'PCCTS Introductory Tutorial 1.0x' .OH 'PCCTS Introductory Tutorial 1.0x' .EF '''Page %' .OF '''Page %' .ds d $ .EQ delim $$ .EN .TL \s+4Introductory Tutorial .sp 2 .br PCCTS 1.0x\s-4 .AU Terence Parr, Hank Dietz, Will Cohen .AI School of Electrical Engineering Purdue University West Lafayette, IN 47907 \fIFall 1992\fP .sp 1 \f(CWparrt@ecn.purdue.edu\fP \f(CWhankd@ecn.purdue.edu\fP \f(CWcohenw@ecn.purdue.edu\fP .2s .LP .br .QP The Purdue Compiler-Construction Tool Set (PCCTS) is a set of public domain software tools designed to facilitate the implementation of compilers and other translation systems. In many ways, PCCTS is similar to a highly integrated version of YACC and LEX; where ANTLR (ANother Tool for Language Recognition) corresponds to YACC and DLG (DFA-based Lexical analyzer Generator) functions like LEX. However, PCCTS has many additional features which make it easier to use for a wide range of translation problems. .QP This document introduces the basic functionality of PCCTS by example. The user need not be familiar with parsing theory or other compiler tools, but any familiarity reduces the learning curve substantially. The PCCTS reference manual is a necessary supplement to this tutorial as information here regarding PCCTS structures and operation is incomplete. .LP .bp .NH .iH Introduction .PP PCCTS allows the user to describe languages (e.g. programming language, OS shell, game, editor); from such a description, a C program is generated that recognizes and, optionally, translates phrases in that language. The user must specify the following: .IP \fI(i)\fP How the input stream is to be broken up into lexemes (tokens) which comprise the vocabulary of the language. .IP \fI(ii)\fP How the tokens are to be grouped; i.e. what structure/grammar is to be applied to the token stream. .IP \fI(iii)\fP C actions which perform a user-specified translation. Along with this specification, the user must also describe token attributes\(emobjects that actions use to communicate with the lexical analysis phase of translation. .LP Similarly, this tutorial is broken up into sections on lexical analysis, syntactic analysis, and actions/translation. .NH .iH Lexical Analysis .PP Before understanding a phrase in English, one must separate the stream of characters into a stream of words; e.g. the phrase: \*Qthisisveryhardtoread\*U accentuates this fact\(emrecognition cannot easily be done from a character stream, only from word/token streams. .PP Compilers and other translators are very strict about this \*Qtokenization\*U and generally describe tokens via \fIregular expressions\fR\(emexpressions that describe sets of character sequences. The regular expressions are, in fact, language descriptions as well. For example, \f(CWhello\fR is a regular expression that recognizes a sequence of five characters; namely, the word: \*Qhello\*U. To inform PCCTS that \*Qhello\*U is to be a word in the vocabulary of your language, the following description would be placed in your grammar file. .cB #token LABEL "hello" .cE .LP where \f(CWLABEL\fR is some label (C \f(CW#define\fP) that you want associated with that token. To test regular expressions in PCCTS, let us form a simple, complete description which recognizes \*Qhello\*U (we will use this description as a base for all examples in this section): .cB #header <<#include "charbuf.h">> <<main() { ANTLR(a(), stdin); }>> #token WORD "hello" a : WORD ; .cE .LP This is a minimal description in that it contains everything needed for PCCTS to generate an executable (actually, to generate all C files needed for the C compiler to generate an executable). The \f(CW#header <<...>>\fP instruction informs PCCTS that the C code inside the \f(CW<<...>>\fP action is necessary to define attributes and to compile the actions found elsewhere; for this section, we will ignore its significance. The second action gives a main program that specifies where C is to begin execution. It contains one statement which asks ANTLR to begin parsing at rule \f(CWa\fP. The third instruction defines a token \f(CWhello\fP. The fourth component of this description is a rule definition. Rules definitions have the form: .cB \fIrule\fP: $alternative sub 1$ | $alternative sub 2$ | ... | $alternative sub n$ ; .cE .LP where each alternative is a sequence of grammatical structures that are to be matched\(emone of possible structures is a simple token reference (\f(CWWORD\fP, in our case). Therefore, rule \f(CWa\fP says, \*Qmatch the token identified as \f(CWWORD\fP on the input stream\*U. The C function generated for rule \f(CWa\fP asks the lexical analyzer, generated by PCCTS, to collect characters until it sees a complete token. Each token in the vocabulary is given a unique number which the lexical analyzer returns to indicate what token was just matched. Function \f(CWa()\fP then verifies that the number associated with \f(CWWORD\fP is indeed returned by the lexical analyzer. .PP The above example can be tested via the following sequence of commands: .cB antlr -gk t.g dlg -i parser.dlg scan.c cc -I../h -o t t.c scan.c err.c .cE The first command generates the parser, \f(CWt.c\fP, the lexical description, \f(CWparser.dlg\fP, and a support file, \f(CWerr.c\fP. The second command converts the lexical description to a C file that constitutes our scanner (lexical analyzer). The third command compiles all C files needed to generate the executable (the \f(CW-I../h\fP option tells the C compiler where to look for the standard PCCTS include files; you will have to change this to where the PCCTS include files are located). The output on our UNIX system looks like this (assuming the example is in file \f(CWt.g\fP): .cB % antlr -gk t.g Antlr parser generator Version 1.06 1989-1992 % dlg -i parser.dlg scan.c dlg Version 1.0 1989, 1990, 1991 % cc -I../h -o t t.c scan.c err.c .cE .LP To test the grammar file, run the executable: .cB % t hello % .cE .LP No error message is generated and \f(CWt\fP terminates successfully. If a token not in the vocabulary of our language is typed, an error message appears. We have only one word in our vocabulary, and hence, anything other than \*Qhello\*U generates an error. .cB % t bob invalid token near line 1 (text was 'b') invalid token near line 1 (text was 'o') invalid token near line 1 (text was 'b') invalid token near line 1 (text was ' ') ^Dline 1: syntax error at "EOF" missing WORD % .cE .LP The first \*Qinvalid token\*U errors are from the scanner, the last message is from the parser (function \f(CWa()\fP) indicating that end-of-file was found when a \f(CWWORD\fP was expected. \f(CWEOF\fP was returned by the scanner because \f(CWbob\fP was ignored and end-of-file appeared immediately afterwards; \f(CWEOF\fP is a predefined token in any PCCTS vocabulary. .PP Adding more tokens to your language's vocabulary is easy\(emsimply add more \f(CW#token\fP definitions. Consider this new example: .cB #token "\e " <<zzskip();>> /* ignore blanks */ #token "\et" <<zzskip();>> /* ignore tabs */ #token "\en" <<zzline++; zzskip();>> /* ignore newlines */ #token A "apple" #token P "pear" .cE .LP This example introduces lexical actions\(emactions that are executed upon recognition of a particular regular expression. For most language descriptions, lexical actions are not used except to tell the scanner to skip a token or continue looking for more characters. \f(CWzzskip()\fP is a standard PCCTS function (generally, PCCTS variables/functions/defines are prefixed with \f(CWzz\fP to avoid name collisions with user variables) which forces the scanner to ignore the currently matched token and to try to find another. Essentially, the first three token definitions here tell the scanner that it is to ignore white space, but to increment the current line number when it sees a newline. The fourth and fifth definitions introduce two words into our vocabulary. Notice that only the last two have labels associated with them. Any \f(CW#token\fP instruction may give a label, but they are not necessary. Labels are handy when you want an action to refer to the value (token number) of a particular token; also, when a regular expression is complicated or confusing, often it is better to use a label throughout your grammar rather than repeating the regular expression. To illustrate this, we present the following four equivalent partial PCCTS descriptions: .IP \fI(i)\fP Repeated use of labels. .cB #token A "apple" #token P "pear" a : A P | P A ; .cE .IP \fI(ii)\fP Repeated use of expressions. .cB #token "apple" #token "pear" a : "apple" "pear" | "pear" "apple" ; .cE .IP \fI(iii)\fP Repeated use of implicitly-defined expressions. .cB a : "apple" "pear" | "pear" "apple" ; .cE .IP \fI(iv)\fP Mixed use of labels and expressions. .cB #token A "apple" #token P "pear" a : "apple" P | "pear" A ; .cE .LP Each unique token regular-expression string in PCCTS gets its own token number. Token labels are words that begin with a uppercase letter whereas rules begin with lowercase letters. Repeating the same token string in a grammar merely refers to the same token; strings can only appear once in \f(CW#token\fP definitions, however, as this instruction attempts to define a new token. An implicitly-defined token is one that is referenced but that has no formal \f(CW#token\fP instruction. In fact, we use the \f(CW#token\fP only when the expression is long, when a lexical action must be attached, or when a label is required (so that a C action can refer to it). .PP Each rule \f(CWa\fP above indicates that either \f(CWapple\fP followed by \f(CWpear\fP is to be matched or \f(CWpear\fP followed by \f(CWapple\fP is to be matched. .PP Once again, let's test this vocabulary description with a complete, executable example: .cB #header <<#include "charbuf.h">> <<main() { ANTLR(a(), stdin); }>> #token "\e " <<zzskip();>> /* ignore blanks */ #token "\et" <<zzskip();>> /* ignore tabs */ #token "\en" <<zzline++; zzskip();>> /* ignore newlines */ a : "apple" "pear" | "pear" "apple" ; .cE .LP To build the executable, we proceed as before: .cB % antlr -gk t.g Antlr parser generator Version 1.06 1989-1992 % dlg -i parser.dlg scan.c dlg Version 1.0 1989, 1990, 1991 % cc -g -I../h -o t t.c scan.c err.c .cE .LP To test the example, type: .cB % t apple pear % .cE No error is reported due to the validity of the input. Note that the newline and the spaces were ignored because of the \f(CWzzskip()\fP actions associated with our token definitions for white space. To ensure that \f(CWt\fP is actually doing something useful, try: .cB % t apple apple line 2: syntax error at "apple" missing pear ^D% .cE .LP PCCTS generates parsers that automatically report errors and try to resynchronize the parser; hence, in this case, a control-D (\f(CW^D\fP) is necessary to terminate the program because \f(CWt\fP is looking for another token with which to resynchronize. Because of the \f(CWzzline++\fP statement in the action for newline, the error is correctly reported on line 2. .PP The regular expressions used in the above examples are simple and do not use any of the \fImeta-characters\fP or regular expression operators. Before presenting a more realistic example, we illustrate the use of some useful regular expression meta-characters (for a complete description see PCCTS documentation): .IP \f(CW@\fR \f(CWEOF\fP character .IP \f(CW\et\fR tab character .IP \f(CW\en\fR newline character .IP \f(CW\e\fP\fIc\fR character escape; used to obtain actual character for meta-characters .IP \f(CW(\fIe\f(CW)\fR keep expression \fIe\fP as an indivisible group .IP \f(CW[\fIc\f(CW]\fR match one character from list \fIc\fP .IP \f(CW[\fIx\f(CW-\fIy\f(CW]\fR match one character from range \fIx\fP to \fIy\fP .IP \f(CW~[\fIc\f(CW]\fR match one character not in list \fIc\fP .IP \f(CW{\fIe\f(CW}\fR expression \fIe\fP is optional .IP \fIe\f(CW*\fR match zero or more of \fIe\fP .IP \fIe\f(CW+\fR match one or more of \fIe\fP .IP \fIe\f(CW|\fIf\fR match either expression \fIe\fP \fBor\fR \fIf\fP .LP Naturally, the above operators and meta-characters can be used in many combinations to produce very complicated expressions. To illustrate more complex expressions, we define the vocabulary of a calculator (ignoring white space for the moment). .cB #token NUM "[0-9]+" #token VAR "[a-zA-Z][a-zA-Z0-9]*" #token "\e(" #token "\e)" #token "\e+" #token "\e-" #token "\e*" #token "/" .cE .LP A number is defined as a sequence of one or more decimal digits. Variables begin with an upper or lowercase letter, but can otherwise contain digits as well; note that \f(CW*\fP is used rather than \f(CW+\fP for variables because \f(CW+\fP would force \f(CWVAR\fP to recognize at least two characters. This calculator has some tokens in its vocabulary that are identical to those of the regular expressions, so these must be escaped to tell the scanner to look for those actual characters. To create an executable, we form a grammar which accepts one of the words in the vocabulary: .cB #header <<#include "charbuf.h">> <<main() { ANTLR(a(), stdin); }>> #token "\e " <<zzskip();>> /* ignore blanks */ #token "\et" <<zzskip();>> /* ignore tabs */ #token "\en" <<zzline++; zzskip();>> /* ignore newlines */ .cE .cB #token NUM "[0-9]+" #token VAR "[a-zA-Z][a-zA-Z0-9]*" #token "\e(" #token "\e)" #token "\e+" #token "\e-" #token "\e*" #token "/" a : NUM | VAR | "\e(" | "\e)" | "\e+" | "\e-" | "\e*" | "/" ; .cE .LP As before, we create the executable with (assuming the example is in \f(CWt.g\fR): .cB antlr -gk t.g dlg -i parser.dlg scan.c cc -g -I../h -o t t.c scan.c err.c .cE .LP The executable, \f(CWt\fP, will recognize any one token from our vocabulary. The next section discusses how one employs rules to specify valid, structured sequences; i.e. how one defines the syntax of a language. .NH .iH Syntactic Analysis .PP The syntax of a language is the grammatical structure which summarizes the set of valid phrases in that language. Because one cannot normally delineate all possible sentences, languages are described via a set of rules which obey the laws of a \fImeta-language\fP, which is literally a \*Qlanguage to describe languages\*U just as the syntax of regular expressions represents a language. This section describes the format of a PCCTS language description\(emthe syntax of PCCTS rules and how they may be used to impose a structure upon a stream of input tokens. .PP The basic template used to build a grammar is: .nf .LP \f(CW#header\fP \fIaction\fP .br \fIaction(s)\fR and/or \f(CW#token\fR \fIdefinition(s) .br rule(s) .br action(s)\fR and/or \f(CW#token\fR \fIdefinition(s) .fi .LP To compile, all grammars must define a number of things inside the \f(CW#header\fP action; this instruction is not optional and must appear first in your file. The rest of the file is basically a sequence of user actions, token and rule definitions\(emexcept that actions, not contained within rules, must be placed before or after the rule definitions. .PP Rules have the basic form: .cB \fIrule\fP: $alternative sub 1$ | $alternative sub 2$ | ... | $alternative sub n$ ; .cE .LP where $alternative sub i$ is a sequence of the following elements: .IP \fItoken\fP\ \ Match \fItoken\fP on the input stream. .IP \fIrule\fP\ \ Visit \fIrule\fP and match whatever is specified. .IP \fIaction\fR\ \ Execute C \fIaction\fP. .IP "\f(CW($a sub 1$ | $a sub 2$ | ... | $a sub n$)\fR" Introduce a subrule\(emmatch one $a sub i$. .IP "\f(CW{$a sub 1$ | $a sub 2$ | ... | $a sub n$}\fR" Introduce an optional subrule; match one $a sub i$ or none. .IP "\f(CW($a sub 1$ | $a sub 2$ | ... | $a sub n$)*\fR" Conditionally match any sequence of $a sub i$'s. .IP "\f(CW($a sub 1$ | $a sub 2$ | ... | $a sub n$)+\fR" Match any sequence of $a sub i$'s. .LP Examples of rule definitions are: .cB w : WORD ("," WORD)* ; .cE .LP where rule \f(CWw\fP matches a list of comma-separated \f(CWWORD\fP's. The \f(CW("," WORD)*\fP construction says match zero or more \f(CW","\ WORD\fP sequences. Consider, .cB st : "if" expr "then" st {"else" st} ";" | WORD ":=" expr | "begin" ( st ";" )+ "end" ; .cE .LP where \f(CWexpr\fP is some rule that matches an arithmetic expression. Rule \f(CWst\fP matches statements such as: .cB if $expr sub 1$ then begin i := $expr sub 2$; j := $expr sub 3$; end else k := $expr sub 4$; .cE .LP The first alternative has an optional subrule that matches an \f(CWelse\fP-clause if it exists. The third alternative matches one or more semicolon-delimited statements, which are enclosed in \f(CWbegin\fP and \f(CWend\fP. Let's examine the description of a simple expression. .cB e : e1 ( ("\e+" | "\e-") e1 )* ; e1 : WORD | INT ; .cE .LP Rule \f(CWe\fP matches simple expressions with only plus and minus as operators; e.g. \f(CWa+3-b\fP or \f(CWa\fP. Note that we have nested the \f(CW("\e+" | "\e-")\fP subrule within the \f(CW(...)*\fP subrule. .PP Let's build a complete PCCTS language description by extending the expression example. Rules to handle multiplication and division will be added as well as token definitions to ignore white space etc...: .cB #header <<#include "charbuf.h">> <<main() { ANTLR(calc(), stdin); }>> #token "[\e \et]" <<zzskip();>> /* ignore blanks, tabs */ #token "\en" <<zzline++;>> /* ignore newlines */ #token INT "[0-9]+" #token FLOAT "[0-9]+ {. [0-9]+}" calc: ( e "\en" )* "@" ; e : e1 ( ("\e+" | "\e-") e1 )* ; e1 : e2 ( ("\e*" | "/") e2 )* ; e2 : INT | FLOAT ; .cE .LP Note that newlines are no longer to be ignored, hence, the \f(CWzzskip()\fP function call has been removed from its lexical action. Our language is a set of expressions terminated by newlines followed by end-of-file (\f(CW@\fP is a predefined lexical meta-symbol referring to end-of-file). Without actions, testing this grammar is uninteresting because no output is generated (unless, of course, an invalid expression is given). Therefore, let us place an action among the rule elements to generate some output. Augment rule \f(CWcalc\fP as follows: .cB calc: ( e "\en" <<printf("found expression\en");>> )* "@" ; .cE .LP Essentially, we have added C code to print out a brief message after an expression-newline pair has been encountered. Create the executable, \f(CWt\fP, as before with: .cB antlr -gk t.g dlg -i parser.dlg scan.c cc -I../h -o t t.c scan.c err.c .cE .LP Test the program via a few simple expressions: .cB % t 3+4*5 found expression 3.15 / 6 - 2.1 found expression ^D% .cE .LP This example grammar is not recursive; i.e. no rule references another rule that directly or indirectly returns to itself. But, recursion is a very powerful tool. It allows the concept of \fIself-similarity\fP. In other words, structures in which some subcomponents are similar to the outer structure. Pascal has several self-similar constructs: record field definitions, procedure definitions, expressions, and type definitions to name a few. .PP To illustrate recursive grammars, we extend the above expression example to allow parenthesized subexpressions such as \f(CW(3+4)*7\fP. .cB e2 : INT | FLOAT | "\e(" e "\e)" ; .cE .LP Placing the subexpression construct at the lowest recursion level makes it have the highest precedence because of the nature of top-down, depth-first parsing. To see this, consider the parse tree for \f(CW(3+4)*5\fP (beginning at rule \f(CWe\fP): .KS .PS .ps 11 box invis "\f(CWe\fP" with .sw at (2.24,9.76) width 0.25 height 0.25 box invis "\f(CWe2\fP" with .sw at (2.24,9.26) width 0.25 height 0.25 box invis "\f(CWe3\fP" with .sw at (1.74,8.76) width 0.25 height 0.25 box invis "\f(CW*\fP" with .sw at (2.24,8.76) width 0.25 height 0.25 box invis "\f(CW5\fP" with .sw at (2.74,8.76) width 0.25 height 0.25 box invis "\f(CWe\fP" with .sw at (1.74,8.26) width 0.25 height 0.25 box invis "\f(CW(\fP" with .sw at (1.24,8.26) width 0.25 height 0.25 box invis "\f(CW)\fP" with .sw at (2.24,8.26) width 0.25 height 0.25 box invis "\f(CW3\fP" with .sw at (1.24,7.76) width 0.25 height 0.25 box invis "\f(CW+\fP" with .sw at (1.74,7.76) width 0.25 height 0.25 box invis "\f(CW4\fP" with .sw at (2.24,7.76) width 0.25 height 0.25 line -> from 2.362,9.762 to 2.362,9.512 line -> from 2.362,9.262 to 2.362,9.012 line -> from 2.362,9.262 to 1.863,9.012 line -> from 2.362,9.262 to 2.862,9.012 line -> from 1.863,8.762 to 1.863,8.512 line -> from 1.863,8.762 to 1.363,8.512 line -> from 1.863,8.762 to 2.362,8.512 line -> from 1.863,8.262 to 1.363,8.012 line -> from 1.863,8.262 to 1.863,8.012 line -> from 1.863,8.262 to 2.362,8.012 .PE .KE .LP Clearly, \f(CW3+4\fP must be evaluated before the \f(CW*\fP for a valid result; this is precisely a depth-first evaluation of the parse tree (which PCCTS parsers do naturally). The deeper the recursive nesting, the higher the precedence. Extending the input expression to \f(CW(3+4)*(5-6)\fP yields: .KS .PS .ps 11 box invis "\f(CWe3\fP" with .sw at (1.24,6.01) width 0.25 height 0.25 box invis "\f(CWe\fP" with .sw at (1.24,5.51) width 0.25 height 0.25 box invis "\f(CW(\fP" with .sw at (0.74,5.51) width 0.25 height 0.25 box invis "\f(CW)\fP" with .sw at (1.74,5.51) width 0.25 height 0.25 box invis "\f(CW3\fP" with .sw at (0.74,5.01) width 0.25 height 0.25 box invis "\f(CW+\fP" with .sw at (1.24,5.01) width 0.25 height 0.25 box invis "\f(CW4\fP" with .sw at (1.74,5.01) width 0.25 height 0.25 line -> from 1.363,6.013 to 1.363,5.763 line -> from 1.363,6.013 to 0.863,5.763 line -> from 1.363,6.013 to 1.863,5.763 line -> from 1.363,5.513 to 0.863,5.263 line -> from 1.363,5.513 to 1.363,5.263 line -> from 1.363,5.513 to 1.863,5.263 box invis "\f(CWe3\fP" with .sw at (3.24,6.01) width 0.25 height 0.25 box invis "\f(CWe\fP" with .sw at (3.24,5.51) width 0.25 height 0.25 box invis "\f(CW(\fP" with .sw at (2.74,5.51) width 0.25 height 0.25 box invis "\f(CW)\fP" with .sw at (3.74,5.51) width 0.25 height 0.25 box invis "\f(CW5\fP" with .sw at (2.74,5.01) width 0.25 height 0.25 box invis "\f(CW-\fP" with .sw at (3.24,5.01) width 0.25 height 0.25 box invis "\f(CW6\fP" with .sw at (3.74,5.01) width 0.25 height 0.25 line -> from 3.362,6.013 to 3.362,5.763 line -> from 3.362,6.013 to 2.862,5.763 line -> from 3.362,6.013 to 3.862,5.763 line -> from 3.362,5.513 to 2.862,5.263 line -> from 3.362,5.513 to 3.362,5.263 line -> from 3.362,5.513 to 3.862,5.263 box invis "\f(CWe\fP" with .sw at (2.24,7.01) width 0.25 height 0.25 box invis "\f(CWe2\fP" with .sw at (2.24,6.51) width 0.25 height 0.25 box invis "\f(CW*\fP" with .sw at (2.24,6.01) width 0.25 height 0.25 line -> from 2.362,7.013 to 2.362,6.763 line -> from 2.362,6.513 to 2.362,6.263 line -> from 2.362,6.513 to 1.363,6.263 line -> from 2.362,6.513 to 3.362,6.263 .PE .KE .LP Again, both operands of the \f(CW*\fP must be evaluated before it can proceed. .PP As another example of recursive definitions, consider type definitions for a Pascal-like language. Types look like: .cB char integer array [5] of char array [100] of array [20] of integer .cE .LP A grammar similar to the following could be used: .cB type: "char" | "integer" | "array" "\e[" INT "\e]" "of" type ; .cE The recursive invocation of \f(CWtype\fP by the \f(CWarray\fP alternative effectively allows chains of \f(CWarray\fP specifications. The parse tree for .cB array [100] of array [20] of integer .cE .LP looks like: .KS .PS .ps 11 box invis "\f(CW[\fP" with .sw at (2.49,8.76) width 0.25 height 0.25 box invis "\f(CWarray\fP" with .sw at (1.99,8.76) width 0.25 height 0.25 box invis "\f(CW20\fP" with .sw at (2.99,8.76) width 0.25 height 0.25 box invis "\f(CW]\fP" with .sw at (3.49,8.76) width 0.25 height 0.25 box invis "\f(CWof\fP" with .sw at (3.99,8.76) width 0.25 height 0.25 box invis "\f(CWtype\fP" with .sw at (4.49,8.76) width 0.25 height 0.25 line -> from 3.362,9.262 to 2.112,9.012 line -> from 3.362,9.262 to 2.612,9.012 line -> from 3.362,9.262 to 3.112,9.012 line -> from 3.362,9.262 to 3.612,9.012 line -> from 3.362,9.262 to 4.112,9.012 line -> from 3.362,9.262 to 4.612,9.012 box invis "\f(CWtype\fP" with .sw at (3.24,9.26) width 0.25 height 0.25 box invis "\f(CW[\fP" with .sw at (1.24,9.26) width 0.25 height 0.25 box invis "\f(CWarray\fP" with .sw at (0.74,9.26) width 0.25 height 0.25 box invis "\f(CW100\fP" with .sw at (1.74,9.26) width 0.25 height 0.25 box invis "\f(CW]\fP" with .sw at (2.24,9.26) width 0.25 height 0.25 box invis "\f(CWof\fP" with .sw at (2.74,9.26) width 0.25 height 0.25 box invis "\f(CWtype\fP" with .sw at (1.99,9.76) width 0.25 height 0.25 line -> from 2.112,9.762 to 0.863,9.512 line -> from 2.112,9.762 to 1.363,9.512 line -> from 2.112,9.762 to 1.863,9.512 line -> from 2.112,9.762 to 2.362,9.512 line -> from 2.112,9.762 to 2.862,9.512 line -> from 2.112,9.762 to 3.362,9.512 box invis "\f(CWinteger\fP" with .sw at (4.49,8.26) width 0.25 height 0.25 line -> from 4.612,8.762 to 4.612,8.512 .PE .KE .LP In this case, we are less interested in precedence and more interested in allowing chains of array specifications. .PP In general, recursion and repetition constructs such as \f(CW(...)+\fP are needed to avoid delineating all possible phrases in a language. Grammars are descriptions of the patterns found among the phrases of a particular language just as $size +2 SIGMA$ notation summarizes an infinite series. .PP The recognition of input languages, via the use of grammars, performs two tasks: it ensures phrase validity and directs translation to an output language. The next section demonstrates how actions, embedded among the grammar elements, can be used to effect a translation. .NH .iH Translation .PP Given a grammar, PCCTS constructs a recognizer for phrases in that input language. No translation from input to output is performed. User actions must be supplied in the correct positions to generate output. Translation occurs when an action produces output which is a function of the input phrase. Actions have access to input phrase token values through an abstraction called an \fIattribute\fP. These attributes are user-defined types and can be as simple as the text associated with a token. .PP This section introduces the notion of an attribute as a means of communicating with the lexical analyzer and presents a number of examples that explain how and where actions can be used to generate output. .NH 2 .iH Attributes .PP Attributes are objects associated with all rules and rule elements, but we will only concern ourselves here with attributes associated with token and rule references. Attributes are referenced in actions with the notation \f(CW\*d\fP$i$ where $i$ indicates that the attribute for the $i sup th$ token in that production is desired. Attributes are run-time objects and have no value until run-time. They are generally used to access the actual text (or a function of the text) of the tokens matched on the input stream. The set of all tokens defines the \fIvocabulary\fP of the input language. The term \*Qtoken\*U collectively refers to the token type (an integer that identifies it as part of the vocabulary) and the token text (the actual string that matched the regular expression for the token type). .PP Before illustrating attributes, we begin with an example. The vocabulary of an input language (known a priori) may be the set { \f(CWWORD\fP, \f(CW"begin"\fP, \f(CWINT\fP }, which is the set of integer token types. The text associated with a token type is only known at parser run-time because it depends on the input characters. Let us say that the grammatical structure of the language is any sequence of tokens in the vocabulary (ignoring white space); then, a valid sentence could be: .cB begin hello 34 13 bob .cE .LP The parser would see a token stream of tuples of the form (token type, token text): .cB (begin, begin) (WORD, hello) (INT, 34) (INT, 13) (WORD, bob) .cE .LP A different input sentence, with the same sequence of \fBtoken types\fP is: .cB begin hi 2 99 ptr .cE .LP which would yield the same sequence of token types, but a different set of token text: .cB (begin, begin) (WORD, hi) (INT, 2) (INT, 99) (WORD, ford) .cE The grammar might look like: .cB a : ( WORD | "begin" | INT )+ ; .cE .LP Only the token types are referenced in the grammar as they describe the structure of the language and are a shorthand notation for the set of valid input sentences. Obviously, one could not delineate all possible sentences as there are infinitely many. For a PCCTS description to perform a translation that is specific to the particular input, actions must access the text of the input tokens, not just the token type. Attributes are provided to provide access to the text (or some function thereof) of an input token. To illustrate this, we give a complete example and then, later, describe the particulars: .cB #header <<#include "charptr.h">> <<main() { ANTLR(a(), stdin); }>> #token "[\e \et]" <<zzskip();>> #token "\en" <<zzline++; zzskip();>> a : ( WORD <<printf(" %s", \*d1);>> | "begin" <<printf(" begin");>> | INT <<printf(" %s", \*d1);>> )+ ; #token WORD "[a-z]+" #token INT "[0-9]+" .cE .LP This example defines attributes to be strings representing what was found on the input stream and prints the stream of tokens back out. In other words, attributes are merely a copy of the words found; the mapping from token/lexeme to attribute is an identity mapping (do nothing but copy). For the moment, concentrate on the actions. \f(CW\*d1\fP refers to the attribute of the first item in the production in which the action occurs; in this case, only one item appears per production. Note that the action for the \f(CW"begin"\fP token does not need to refer to its attribute as it will always be \f(CWbegin\fP. The rest of this section describes the particulars needed to understand everything else in the example. .PP PCCTS requires that the user define the data type or structure of the attributes as well as specify how to convert from lexemes to attributes. The type is always defined by a C \f(CWtypedef\fP named \f(CWAttrib\fP and must be defined in the action associated with the \f(CW#header\fP instruction. For example, if one wishes the attribute for a token to be simple integers, the following is a sufficient type definition: .cB #header <<typedef int Attrib;>> .cE .LP However, this does not tell PCCTS how to convert a token to an attribute. This is accomplished with a function called \f(CWzzcr_attr()\fP which defines the value of an attribute given complete information about a lexeme (token number and associated text). It has the general form: .cB void zzcr_attr(a,token,text) Attrib *a; int token; char *text; { /* *a = function(token, text); */ } .cE where \f(CWa\fP points to an attribute created by PCCTS at run-time. The user simply has to assign a value to \f(CW*a\fP. In our case, we will use a macro version to set our attributes to the integer value of the input: .cB #define zzcr_attr(a,tok,txt) {*(a) = atoi(txt);} .cE .LP This specifies that whenever a token is matched on the input stream by the parser, an attribute of type \f(CWint\fP is to be created and assigned the result of \f(CWatoi(\fItext\f(CW)\fR where \fItext\fP is the character string matched for the token. The attribute is then made available as \f(CW\*d\fP$i$ to actions in the production that matched the token. For example, .cB #header << typedef int Attrib; #define zzcr_attr(a,tok,txt) {*(a) = atoi(txt);} >> <<main() { ANTLR(a(), stdin); }>> #token "[\e \et]" <<zzskip();>> #token "\en" <<zzline++; zzskip();>> a : "hi" "[0-9]+" <<printf("\*d1, \*d2 are %d, %d\en", \*d1, \*d2);>> ; .cE .LP \f(CW\*d1\fP refers to the first token in the alternative, \f(CW"hi"\fR; similarly, \f(CW\*d2\fP refers to the the second token, \f(CW"[0-9]+"\fR. When executed, the executable \f(CWt\fP (created as before) yields: .cB % t hi 34 \*d1, \*d2 are 0, 34 % .cE .LP where \f(CWatoi()\fP of a non-numeric string is 0, but the text \f(CW34\fP gets converted to an integer (binary word) version of 34 and printed back out as a number. .PP The token type can be tested to ensure that it is an integer before applying the \f(CWatoi()\fP function via: .cB #header << typedef int Attrib; #define zzcr_attr(a,tok,txt) {if ( tok==INT ) *(a) = atoi(txt);} >> .cE .LP where \f(CWINT\fP is defined to be \f(CW"[0-9]+"\fR. This defines an attribute for all \f(CWINT\fP tokens found on the input stream. Other tokens have undefined attributes. .PP Attributes can have multiple elements or assume one of many values. For example, we can extend the above example to handle \f(CWFLOAT\fP tokens as well: .cB #header <<typedef union { int ival; float fval; } Attrib;>> .cE .cB << void zzcr_attr(a,token,text) Attrib *a; int token; char *text; { switch ( token ) { case INT : (a)->ival = atoi(text); break; case FLOAT : (a)->fval = atof(text); break; } } >> .cE .LP The \f(CWtypedef\fP specifies that attributes are integer or floating point values. When the regular expression for a floating point number (integer number) is matched on the input stream, \f(CWzzcr_attr()\fP converts the string of characters representing that number to a C \f(CWfloat\fP (\f(CWint\fP). .PP Attributes can become even more complicated, but typically, attributes are merely a copy of the text found on the input stream. A standard PCCTS attribute definition is available as \f(CWcharbuf.h\fP and is defined as follows: .cB /* PCCTS attribute -- constant width text */ #ifndef D_TextSize #define D_TextSize 30 #endif typedef struct { char text[D_TextSize]; } Attrib; #define zzcr_attr(a,tok,t) strncpy((a)->text, t, D_TextSize-1); .cE .LP These attributes are referred to by \f(CW\*d\fP$i$\f(CW.text\fP in actions. .PP Each alternative begins a new sequence of \f(CW\*d\fP$i$'s and from an enclosing scope/level, entire subules are counted as one unit. This is best explained with an example: .cB a : A B ( C D )+ E | F G ; .cE .LP From an action after token \f(CWE\fP, \f(CWA\fP is \f(CW\*d1\fP, \f(CWB\fP is \f(CW\*d2\fP, the entire subrule \f(CW( C D )\fP is \f(CW\*d3\fP, and \f(CWE\fP is \f(CW\*d4\fP; \f(CWC\fP and \f(CWD\fR are inaccessible from outside the scope of the subrule. From an action inside the subrule just after the token \f(CWD\fP, \f(CWC\fP is \f(CW\*d1\fP and \f(CWD\fP is \f(CW\*d2\fP. In alternative two from an action after \f(CWG\fP, \f(CWF\fP is \f(CW\*d1\fP and \f(CWF\fP is \f(CW\*d2\fP. Attributes have a scoping just like variables in a programming langauage. .PP Attributes are a means of communicating with the lexical analyzer. Actions may use these attributes to provide a translation. The next section utilizes the concepts presented here to build translators. .NH 2 .iH Actions .PP Actions are rule elements just like token references, but perform a different function. Token references indicate that a particular token is to be matched on the input stream at that point in the parse. Actions indicate that this action is to be performed at that point in the parse, immediately following the preceding token match. For example, .cB a : A <<$action sub 1$>> B ; | ( C )+ <<$action sub 2$>> ; .cE .LP $action sub 1$ is executed after the parser has found an \f(CWA\fP, but before it has found a \f(CWB\fP. $action sub 2$ is executed only after a sequence of one or more \f(CWC\fP's has been found. .PP As a more concrete example, we augment the above \f(CWcalc\fP example to print something more useful than \f(CWfound expression\fP: .cB calc: ( e "\en" <<printf("\en");>> )* "@" ; e : e1 ( ( "\e+" <<printf(" add");>> | "\e-" <<printf(" sub");>> ) e1 )* ; .cE .cB e1 : e2 ( ( "\e*" <<printf(" mult");>> | "/" <<printf(" div");>> ) e2 )* ; e2 : INT <<printf(" INT");>> | FLOAT <<printf(" FLOAT");>> ; .cE .LP Essentially, we have added C code to print out the operand types and operators. Create the executable, \f(CWt\fP, as before with .cB antlr -gk t.g dlg -i parser.dlg scan.c cc -I../h -o t t.c scan.c err.c .cE .LP Test the program via a few simple expressions: .cB % t 3+4*5 INT add INT mult INT 3.15 / 6 - 2.1 FLOAT div INT sub FLOAT ^D% .cE .LP Now, let's use the attributes to generate code for a simple \fIreverse-polish\fP stack machine whose operations are defined as follows: .IP "\f(CWpush $opnd$\fR" Push $opnd$ onto the stack. .IP "\f(CWprint\fR" Print the value of the top of stack; POP the value off the stack. .IP "\f(CWadd\ \ \ \fR" PUSH(POP + POP) .IP "\f(CWsub\ \ \ \fR" $a$ := POP .br $b$ := POP .br PUSH($b$ - $a$) .IP "\f(CWmult\ \ \ \fR" PUSH(POP * POP) .IP "\f(CWdiv\ \ \ \fR" $a$ := POP .br $b$ := POP .br PUSH($b$ / $a$) .LP Modify the rules as follows: .cB #header <<#include "charbuf.h">> <<main() { ANTLR(calc(), stdin); }>> #token "[\e \et]" <<zzskip();>> /* ignore blanks, tabs */ #token "\en" <<zzline++;>> /* ignore newlines */ #token INT "[0-9]+" #token FLOAT "[0-9]+ {. [0-9]+}" .cE .cB calc: ( e "\en" <<printf("\etprint\en");>> )* "@" ; e : <<char *op;>> e1 ( ( "\e+" <<op="\etadd\en";>> | "\e-" <<op="\etsub\en";>> ) e1 <<printf("%s", op);>> )* ; .cE .cB e1 : <<char *op;>> e2 ( ( "\e*" <<op="\etmult\en";>> | "/" <<op="\etdiv\en";>> ) e2 <<printf("%s", op);>> )* ; e2 : INT <<printf("\etpush %s\en", \*d1.text);>> | FLOAT <<printf("\etpush %s\en", \*d1.text);>> ; .cE .LP .bp .EF '''' .OF '''' .LP .PX