home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
OS/2 Shareware BBS: 10 Tools
/
10-Tools.zip
/
pccts1.zip
/
UPDT106.TXT
< prev
next >
Wrap
Text File
|
1993-09-07
|
17KB
|
560 lines
The Purdue Compiler Construction Tool Set
Version 1.06 Update
December 1, 1992
1. Enhancements
This section describes the update of PCCTS Version 1.00 to Ver-
sion 1.06. The list of changes here is not complete since many "lit-
tle" fixes were made. Here, we concentrate on the enhancements to the
system; file BUGS100 contains a list of the bug fixes incorporated in
the 1.06 release. In addition, note that the manual has not been
updated and will not be until the next major release (or until we get
around to it).
1.1. 1.06 Is Written in 1.00
The grammar for the PCCTS meta-language (file format) has been
implemented in Version 1.00, making heavy use of #lexclass directives.
File lexhelp.c has been eliminated due to the superiority of 1.00 to
1.00B.
1.2. ANTLR Compiles Under ANSI C
Because of the rewrite of the grammar and some rewrites of ANTLR
code, ANTLR now compiles with ANSI compilers without a wimper (except
for two "unknown escape sequence" warnings). Your mileage may vary.
1.3. Grammar Files Distributed
antlr.g and dlg_p.g are now included in the source distribution
for 1.06. Note that the 1.00 PCCTS (both ANTLR and DLG) are required
to process these grammar files. DO NOT DESTROY YOUR OLD COPY OF 1.00
PCCTS (at least save the executables).
1.4. Script Generates Makefiles for PCCTS Projects
A C program called genmk.c is available in the support directory
of the PCCTS release which has the following usage:
genmk project f1.g f2.g ... fn.g
It generates a makefile that creates an executable, project, from a
set of grammar files. For example, genmk t t.g generates:
Page 1
PCCTS
#
# PCCTS makefile for: t.g
#
DLG_FILE = parser.dlg
ERR_FILE = err.c
HDR_FILE = stdpccts.h
TOK_FILE = tokens.h
K = 1
ANTLR_H = .
BIN = .
ANTLR = $(BIN)/antlr
DLG = $(BIN)/dlg
CFLAGS = -I. -I$(ANTLR_H)
AFLAGS = -fe err.c -fh stdpccts.h -fl parser.dlg -ft tokens.h -k $(K)
-gk
DFLAGS = -C2 -i
GRM = t.g
SRC = scan.c t.c err.c
OBJ = scan.o t.o err.o
t: $(OBJ) $(SRC)
cc -o t $(CFLAGS) $(OBJ)
t.c parser.dlg : t.g
$(ANTLR) $(AFLAGS) t.g
scan.c : parser.dlg
$(DLG) $(DFLAGS) parser.dlg scan.c
This program is handy when beginning a new PCCTS project or when first
learning about PCCTS.
1.5. DLG Supports Case Insensitive Scanners
DLG has two new options which provide control over the case sen-
sitivity of the generated scanner. Specifically, case insensitivity
implies that when a character is referenced in a regular expression,
DLG behaves as if the user had typed both upper and lower case ver-
sions of that character; i.e. (a|A) where a is some character. The
new options are:
-ci Make lexical analyzer case insensitive
-cs Make lexical analyzer case sensitive (default).
1.6. Delayed Lookahead Fetches in Generated Parser
Currently, PCCTS generates parsers which always have k tokens of
lookahead. This is done by following the strategy that another token
is fetched (zzCONSUME) when one is matched (zzmatch). This can be a
Page 2
PCCTS
problem for actions that need to occur after a token has been matched,
but before the next token of lookahead is fetched. This is somewhat
overcome in PCCTS 1.00 by delaying the fetch if an action is found
immediately after a token reference. With the new delayed lookahead
scheme, the next token is not consumed until the next match is
required. This means that any action before the next match (not
necessarily adjacent to the previous match) will be executed before a
lookahead fetch occurs. Turn on ANTLR option -gk and DLG option -i to
enable this feature. This feature appears to work with AST's and
attributes with the constraints mentioned below in the incompatibili-
ties section (e.g. use of LA(i) and LATEXT(i) has been restricted).
It has been tested with the C (k>1 and Pascal (k==1) examples provided
in the release and with several other large grammars.
This feature is primarily useful for developers of interactive
tools. Previously, it was really hard to get PCCTS to generate truly
interactive tools. It appeared as if the parser was always waiting on
a token fetch rather than executing an appropriate action. E.g. in
PCCTS 1.00,
a : ( "A" "B" "C" "\n" ) <<printf("got it\n");>>
;
would not print got it until one token after the newline had been
typed. PCCTS 1.06 will generate parsers that print the message
immediately upon newline and will exit without waiting for another
token as there are no token references following the action.
Another way in which delayed lookahead is useful lies in transla-
tors which add symbols to the symbol table which must be examined by a
lexical action. If a lookahead fetch occurs too fast, the lexical
action may miss the introduction of a symbol into the symbol table.
1.7. Tutorials Available
With release 1.06, we are distributing both a beginning and
advanced tutorial. They have not been thoroughly "debugged" much are
much better than nothing:
Beginning
This tutorial introduces the basic functionality of PCCTS by
example. The user need not be familiar with parsing theory or
other compiler tools, but any familiarity reduces the learning
curve substantially.
Advanced
Constructing a translator can be viewed as an iterative refine-
ment process moving from language recognition to intermediate-
form transformation. This tutorial presents one possible
sequence of refinements. It uses as many features of PCCTS as is
reasonable without regards to optimality. It develops a compiler
for a simple string manipulation language called sc. The
Page 3
PCCTS
resulting compiler generates code for a simple stack machine.
1.8. Error Messages for k>1
Previous versions of PCCTS did not handle error message correctly
for k>1. For example, with two tokens of lookahead and the following
grammar:
a : "A" "B" "D"
| "A" "C" "E"
;
an incorrect input of A D D would yield:
line 1: syntax error at "A" missing A
which is wrong (and incredibly confusing). The new error mechanism
generates the following error message upon the same incorrect input:
line 1: syntax error at "A D"; "D" not in { B C }
which is infinitely superior. Unfortunately, situations may arise
when even this method will give an invalid message. This may occur
when alternatives have lookahead sequences which are permutations of
the same tokens.
The definition of the standard error reporting function, zzsyn()
has been modified. The parameter list is now:
void
zzsyn(char *text,
int tok,
char *egroup,
unsigned *eset,
int etok,
int k,
char *bad_text);
Users can ignore this as it is transparent to them; unless, of course,
the standard error reporting must be modified. In addition, zzFAIL is
now a function rather than a macro.
1.9. Trace Facility has Exit Macro
Previously, only an entry trace macro was inserted in parsers
when the -gd ANTLR option was used. An exit macro has been defined
Page 4
PCCTS
which resulted in zzTRACE becoming zzTRACEIN. Also, a default trace
macro prints out "enter rule rule" if no default trace macros are
defined. To define your own, the macro definitions must appear in the
#header action. As before, the sole argument to the trace routines is
a string representing the rule which has been entered or is about to
be exited.
1.10. Resource Limitation
Occasionally, ANTLR is unable to analyze a grammar submitted by
the user. This rare situation can only occur when the grammar is
large and the amount of lookahead is greater than one. A nonlinear
analysis algorithm is used by PCCTS to handle the general case of
LL(k) parsing. The average complexity of analysis, however, is near
linear due to some fancy footwork in the implementation which reduces
the number of calls to the full LL(k) algorithm.
To avoid the situation where ANTLR takes 23 hours of CPU time and
then runs out of virtual memory, use the -rl n resource limit option
where n is the maximum number of tree nodes to be used by the analysis
algorithm. An error message will be displayed, if this limit is
reached, which indicates which grammar construct was being analyzed
when ANTLR hit a non-linearity. Use this option if ANTLR seems to go
off to lunch and your disk start swapping; try n=10000 to start. Once
the offending construct has been identified, try to remove the ambi-
guity that ANTLR was trying to overcome with large lookahead analysis.
Future versions of PCCTS may incorporate a known algorithm that does
not exhibit this exponential behavior.
1.11. Rule Prefix Option
An ANTLR option has been added that prefixes all functions
corresponding to rules with a prefix. This can be used to provide
symbol hiding in your project to isolate the parser. It can also be
used to allow rule names that correspond to C keywords such as if and
typedef.
1.12. Standard PCCTS Header
Two new ANTLR options have been added that control the creation
of a standard PCCTS header file - stdpccts.h. Option -gh instructs
ANTLR to create a file, stdpccts.h unless -fh is used, which contains
all header information needed by non-PCCTS generated files that want
to access PCCTS symbols. For example, it indicates the k of the
parser, whether trees are being constructed, whether lookahead is to
be delayed, and indicates what the user specified in the #header
action in the grammar file. Previously, the user had to manually con-
struct this information from the grammar file in order to place the
information in a non-PCCTS C file. The -fh option is merely used to
rename stdpccts.h.
1.13. Doubly-Linked AST's
A new function is available in ast.c which will doubly-link any
Page 5
PCCTS
tree that is passed in. To use this option, the user must define
zzAST_DOUBLE in the #header directive or on the command-line of the C
compile. This defines left and up fields automatically in the AST
node typedef. ANTLR generated parsers, normally, only construct
singly-linked trees. The fields can be filled in via code similar to
the following:
#header <<
#define AST_FIELDS user-fields;
>>
<<
main()
{
AST *root = NULL;
ANTLR(start(&root), stdin);
zzdouble_link(root, NULL, NULL);
}
where the function is defined as:
zzdouble_link(AST *t, AST *left, AST *up);
1.14. C++ Compatible Parsers
PCCTS parsers may now be compiled with C++ compilers; i.e. the
output is more ANSI C-like than before. It has been successfully com-
piled with GCC 2.2, but not with GCC 1.37. We do not guarantee any-
thing. To be safe, use the -ga option so that PCCTS generates ANSI-
style prototypes for functions generated from rules. As a simple
example, consider:
#header <<
#include "charbuf.h"
>>
#token "[ 0" <<zzskip();>>
<<
main()
{
ANTLR(a(), stdin);
}
>>
a : "A" "B"
;
Page 6
PCCTS
which does not do much, but does compile with G++ 2.2 except that a
few warnings are generated concerning exit() and strncpy() not being
declared before use. This is trivial to fix, of course - include the
C++ string include file and define exit().
1.15. Predicated Parsing - EXTREMELY ALPHA
Normally, parsing is a function of syntax alone. Semantics
(meaning/value of the input) are not taken into account when parsing
decisions are made. For example, the following grammar is LL(k) ambi-
guous:
a : ( expr | decl )+
;
/* recognize 'a+b+c+...' */
expr: WORD ("+" WORD)*
;
/* recognize 'int i' or 'user_type j' etc... */
decl: ( WORD
| "int"
)
WORD
;
because, upon WORD, rule a does not know whether an expr or a decl is
coming. However, adding predicates (actions suffixed with a '?') can
disambiguate the decision. E.g.
a : ( expr | decl )+
;
/* recognize 'a+b+c+...' */
expr: <<isVAR(LATEXT(1))>>? WORD ("+" WORD)*
;
/* recognize 'int i' or 'user_type j' etc... */
decl: ( <<isTYPE(LATEXT(1))>>? WORD
| "int"
)
WORD
;
where isVAR and isTYPE are user-defined macros or functions that
return true or false after looking up the symbol in the symbol table.
ANTLR attempts to find predicates when a syntactic ambiguity occurs.
Otherwise, the predicates are simply tested to ensure semantic vali-
dity. ANTLR still prints out the ambiguity message, for the moment,
but generates code in rule a that looks (kind of) like this:
Page 7
PCCTS
a()
{
...
while ( ... ) {
if ( (isVAR(LATEXT(1)) && LA(1)==WORD) && LA(1)==WORD ) {
expr();
}
else if ( (isTYPE(LATEXT(1)) && LA(1)==WORD) &&
(LA(1)==WORD || LA(1)=="int") ) {
decl();
}
}
...
}
ANTLR has "hoisted" the predicates from the other rules for use in the
prediction decision for rule a. Currently, at most one predicate is
hoisted. In addition, there may be nothing between a decision and a
predicate for it to be hoisted; i.e. no token or rule references and
certainly no actions. Predicates should be side-effect free, as well.
The context under which predicates are currently evaluated may not be
computed correctly. IT DOES NOT WORK WITH LARGE GRAMMARS and has only
been tested on small test cases.
Why are we even mentioning them if they do not really work?
Because we want users to play with them to find better ways of doing
things and to find new uses for them. We really want user feedback on
these critters before a full implementation is developed. Please send
mail to parrt@ecn.purdue.edu, hankd@ecn.purdue.edu or use the email
server mechanism at pccts@ecn.purdue.edu.
2. Acknowledgements
We acknowledge Dan Lawrence of MDBS for the new error reporting
facility concering greater than one token of lookahead; Dana Hoggatt,
also of MDBS, suggested the rule prefix idea (-gp option) and beta
tested 1.06. ### Beta testers: Peter Dahl of U of MN, ...
3. Machine Compatibility
PCCTS Version 1.06 has been successfully tested on the following
platforms:
Sun 3/60
Sun SPARC I, II
DEC 5000
Linux on 386PC
IBM RS6000
Page 8
PCCTS
4. Incompatibilities
Calls to zzTRACE are no longer generated by the -gd option. Now,
zzTRACEIN, zzTRACEOUT are called at the beginning and end of func-
tions, respectively.
Attribute variables in strings in actions need to be escaped,
\$i, whereas, before, $i was ok.
The user should no longer set the next token of lookahead or the
text of the next token in the lexical analyzer using LA(1) and
LATEXT(1). This is incompatible with the -gk option; hence, NLA and
NLATEXT should be used instead where the N means "next".
5. Future Work:
Predicates are currently under development with an in-house ver-
sion of PCCTS. We expect a release in the Spring that allows parsing
to be a function of syntax as well as semantics.
Attribute names are expected to be enhanced. For example,
instead of $i, $token_name could be used:
a : WORD TYPE <<printf("%s %s\n", $WORD, $TYPE);>>
;
We anticipate a version that supports object-oriented programming
and generates C++ instead of ANSI C. For the moment, PCCTS is compa-
tible with C++, but future versions will support C++.
Future versions, both C and C++, will be able to refer to PCCTS
symbols by pccts.symbol instead of zzsymbol. E.g. zzskip() will
become pccts.skip().
DLG will soon use lookahead of its own to allow the recognition
of more complicated expressions; specifically, those which have left
substrings in common with other regular expressions.
We expect future versions of PCCTS to dump grammar analysis and
parser construction statistics such as how many rules required 1 token
of lookahead, how many ambiguities etc...
Page 9