31 March 94
Version 1.10 of pccts
At the moment this help file is available via anonymous FTP at
Node: marvin.ecn.purdue.edu
File: (sometimes)/pub/pccts/1.10/newbie.info
File: (now) /pub/pccts/1.10/NOTES.newbie
Mail corrections or additions to moog@polhode.com
===============================================================================
General
-------------------------------------------------------------------------------
1. Tokens begin with uppercase characters. Rules begin with lowercase
characters.
2. Multiple parsers can coexist in the same application through use
of the #parser directive.
3. When you see a syntax error message that has quotation marks on
separate lines:
line 1: syntax error at "
" missing ID
that probably means that the offending element contains a newline.
4. Even if your C compiler does not support C++ style comments,
you can use them in the *non-action* portion of the ANTLR source code.
Inside an action (i.e. <<...>> ) you have to obey the comment
conventions of your compiler.
5. ANTLR counts a line which is continued across a newline using
the backslash convention as a single line. For example:
#header <<
#define abcd alpha\
beta\
gamma\
delta
>>
will cause line numbers in ANTLR error messages to be off by 3.
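The macro above still contains real newline characters, so the size of the
discrepancy is just the number of embedded newlines. A minimal sketch in C
(the helper name is made up for illustration; it is not part of pccts):

```c
/* Hypothetical helper, not part of pccts: count the newline
   characters embedded in a backslash-continued directive.  A
   directive spanning N+1 physical lines contains N newlines,
   which is exactly how far ANTLR's line counter falls behind. */
int embedded_newlines(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++)
        if (*s == '\n')
            n++;
    return n;
}
```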
6. Don't confuse #[...] with #(...).
The first creates a single AST node (usually from a token identifier and
an attribute) using the routine zzmk_ast(). The zzmk_ast routine must be
supplied by the user (or selected from one of the pccts supplied ones such
as charbuf or charptr).
The second creates an AST list (usually more than a single node) from other
ASTs by filling in the "down" field of the first node in the list to create
a root node, and the "sibling" fields of each of the remaining ASTs in the
list. A null pointer is put in the sibling field of the last AST in the
list. This is performed by the pccts supplied routine zztmake().
#token ID "[a-z]*"
#token COLON ":"
#token STMT_WITH_LABEL
id! : ID <<#0=#[STMT_WITH_LABEL,$1];>>
Creates an AST. The AST (a single node)
contains STMT_WITH_LABEL in the token
field - given a traditional version of
zzmk_ast().
rule! : id COLON expr
<<#0=#(#1,#3);>>
Creates an AST list with the AST returned by "id" at its
root and the AST for "expr" as its first (and only) child.
The following example is equivalent, but more confusing
because the two steps above have been combined into a single
action statement:
rule! : ID COLON expr
<<#0=#(#[STMT_WITH_LABEL,$1],#3);>>
===============================================================================
Section on switches and options
-------------------------------------------------------------------------------
7. Don't forget about the ANTLR -gd option which provides a trace of
rules which are triggered and exited.
8. When you want to inspect the code generated by ANTLR you may want to
use the ANTLR -gs switch. This causes ANTLR to test for a token being
an element of a lookahead set by using explicit tests rather than by
using the faster bit-oriented operations, which are difficult to read.
9. When using the ANTLR -gk option you probably want to use the DLG -i
option. As far as I can tell neither option works by itself.
Unfortunately they have different abbreviations so that one can't
use the same symbol for both in a makefile.
10. When you are debugging code in the rule section and there is no
change to the lexical scanner, you can avoid regeneration of scanner.c
by using the ANTLR -gx option. However some items from stdpccts.h
can affect the scanner, such as -k -ck and the addition of semantic
predicates - so this optimization should be used with a little care.
11. One cannot use an interactive scanner (ANTLR -gk option) with the
ANTLR infinite lookahead and backtracking options (syntactic predicates).
12. If you want backtracking, but not the prefetching of characters and
tokens that one gets with lookahead, then you might want to supply
your own input routine and use ANTLRs (input supplied by string)
or ANTLRf (input supplied by function) rather than plain ANTLR, which
is used in most of the examples.
See Example 4 below for an example of an ANTLRf input function.
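As a rough sketch of the kind of input function ANTLRf expects (the buffer,
the name next_char, and the EOF end-of-input convention are assumptions
here - check Example 4 and your version of pccts for the exact convention):

```c
#include <stdio.h>

/* Hypothetical ANTLRf input source: hands the scanner one character
   at a time from an in-memory buffer.  Returning EOF at the end of
   input is an assumption - check the convention expected by your
   version of pccts. */
static const char *input_ptr = "a = b + c;\n";

int next_char(void)
{
    if (*input_ptr == '\0')
        return EOF;                      /* signal end of input */
    return (unsigned char)*input_ptr++;  /* next character from buffer */
}
```

The parse would then be started with something like ANTLRf(start(),
next_char). Because every character passes through this function, it is
also a convenient place for pre-lexing tricks which are hard to express
in DLG.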
13. The format of the #line directive is controlled by the macro
#define LineInfoFormatStr "# %d \"%s\"\n"
which is defined in generic.h. A change requires recompilation of ANTLR.
14. To make the lexical scanner case insensitive use the DLG -ci
switch. The analyzer does not change the text, it just ignores case
when matching it against the regular expressions.
As of 24-Feb-94 there was an off-by-one problem in the testing for
lowercase in range expressions. The problem appears if the last character
in the range is "z" or "Z" (e.g. [a-z]). The temporary workaround is to
use a range of the form "[a-yz]".
15. The lexical routines zzmode(), zzskip(), and zzmore() do NOT work like
coroutines. Basically, all they do is set status bits or fields in a
structure owned by the lexical analyzer and then return immediately. Thus it
is OK to call these routines anywhere from within a lexical action. You
can even call them from within a subroutine called from a lexical action
routine.
See Example 5 below for routines which maintain a stack of modes.
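As a rough sketch of such helpers (the stack itself is hypothetical; the
zzmode() stub below only stands in for the real DLG routine so that the
sketch is self-contained):

```c
#include <assert.h>

/* Hypothetical mode stack built on top of zzmode().  Because
   zzmode() only records the new mode and returns, it is safe to
   call it from helpers like these inside any lexical action. */
#define MAX_MODE_DEPTH 16

int current_mode = 0;                      /* stand-in for DLG's state */
void zzmode(int m) { current_mode = m; }   /* stub for the sketch */

static int mode_stack[MAX_MODE_DEPTH];
static int mode_sp = 0;

void pushMode(int new_mode)
{
    assert(mode_sp < MAX_MODE_DEPTH);      /* guard against runaway nesting */
    mode_stack[mode_sp++] = current_mode;  /* remember where we were */
    zzmode(new_mode);
}

void popMode(void)
{
    assert(mode_sp > 0);                   /* more pops than pushes */
    zzmode(mode_stack[--mode_sp]);
}
```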
===============================================================================
Section on #token and lexical issues
-------------------------------------------------------------------------------
16. To gobble up everything to a newline use: "~[\n]*".
17. To match any single character use: "~[]".
18. If a #token symbol is spelled incorrectly in a rule it will not be
reported by ANTLR. ANTLR will assign it a new #token number which,
of course, will never be matched. Look at tokens.h for misspelled
terminals or inspect "zztokens[]" in err.c.
19. If you happen to define the same #token name twice (for instance
because of inadvertent duplication of a line) you will receive no
error message from ANTLR or DLG. ANTLR will simply leave a hole in
the assignment of token numbers, and you will find strange numbers in
the test of LA(1) and LA(2) rather than the expected token names.
Look at tokens.h or inspect "zztokens[]" in err.c.
20. One cannot continue a regular expression in a #token statement across
lines. If one tries to use "\" to continue the line the lexical analyzer
will think you are trying to match a newline character.
21. The escaped literals in #token regular expressions are not identical
to the ANSI escape sequences. For instance "\v" will yield a match
for "v", not a vertical tab.
\t \n \r \b - the only escaped letters
22. In #token regular expressions spaces and tabs which are
not escaped are ignored - thus making it easy to add white space to
a regular expression.
#token symbol "[a-z A-Z] [a-z A-Z 0-9]*"
23. You cannot supply an action (even a null action) for a #token
statement without a regular expression. You'll receive the message:
warning: action cannot be attached to a token name
(...token name...); ignored
This is a minor problem when the #token is created for use with
attributes or AST nodes and has no regular expression:
#token CAST_EXPR
#token SUBSCRIPT_EXPR
#token ARGUMENT_LIST
<<
... Code related to parsing
>>
ANTLR assumes the code block is the action associated with the #token
immediately preceding it. It is not obvious what the problem is because
the line number referenced is the end of the code block (">>") rather
than the beginning. My solution is to follow such #token statements
with a #token which does have a regular expression or a rule.
24. The #token statement allows the action to appear on a second line.
This sometimes leads to the misinterpretation of a code block which
follows:
#token RETURN "return"
<<
... some code related to parsing ...
>>
ANTLR will interpret the code block as being the action of the
#token. The workaround is to put in an intervening #token statement
or a null action:
#token RETURN "return" <<;>>
25. Since the lexical analyzer wants to find the longest possible string
that matches a regular expression, it is probably best not to use expressions
like "~[]*" which will gobble up everything to the end-of-file.
26. When a string is matched by two #token regular expressions, the lexical
analyzer will choose the one which appears first in the source code. Thus
more specific regular expressions should appear before more general ones:
#token HELP "help" /* should appear before "symbol" */
#token symbol "[a-zA-Z]*" /* should appear after keywords */
Another example of this is defining hexadecimal characters before decimal
characters.
Some of these may be caught by using the DLG switch -Wambiguity.
27. zzbegexpr and zzendexpr point to the start and end of the string last
matched by a regular expression in a #token statement.
However, zzlextext may be larger than the string pointed to by zzbegexpr and
zzendexpr because it includes substrings accumulated through the use of
zzmore().
28. ZZCOL in the lexical scanner controls the update of column information.
This doesn't cause the zzsyn routine to report the position of tokens
causing the error. You'll still have to write that yourself. The
problem, I think, is that, due to look-ahead, the value of zzendcol
will not be synchronized with the token causing the error, so that
the problem becomes non-trivial.
29. If you want to use ZZCOL to keep track of the column position
remember to adjust zzendcol in the lexical action when a character is not
one print position wide (e.g. tabs or non-printing characters).
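A sketch of that adjustment for tabs, assuming 1-based columns and tab
stops every 8 columns (in a real lexical action one would assign the
result back to zzendcol; the helper name is hypothetical):

```c
/* Hypothetical column-advance helper, assuming 1-based columns and
   tab stops every 8 columns.  In a lexical action one would write
   something like: zzendcol = next_column(zzendcol, zzchar); */
int next_column(int col, int ch)
{
    if (ch == '\t')
        return col + (8 - (col - 1) % 8);  /* jump to next tab stop */
    return col + 1;                        /* ordinary one-column character */
}
```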
30. In version 1.00 it was common to change the token code based on
semantic routines in the #token actions. With the addition of semantic
predicates in 1.06 this technique is now frowned upon.
Old style:
#token TypedefName
#token ID "[a-z A-Z]*"
<<{if (isTypedefName(LATEXT(1))) NLA=TypedefName;};>>
New Style:
#token ID "[a-z A-Z]*"
typedefName : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1>> ID;
See the section on semantic predicates for a longer explanation.
31. The macro #define ZZCHAR_T <character-type> allows the user to
specify the type of character constants generated by ANTLR and DLG. However
the algorithms in DLG need to be revised in order to be 8-bit clean
or to support wide characters.
32. DLG has no operator like grep's "^" which anchors a pattern to the
beginning of a line. One can use tests based on zzbegcol only if column
information is selected (#define ZZCOL) AND one is not using infinite
lookahead mode (syntactic predicates). A technique which does not depend
on zzbegcol is to look for the newline character and then enter a special
#lexclass.
Consider the problem of recognizing lines which have a "!" as the first
character of a line. A possible solution suggested by Doug Cuthbertson
is:
#token "\n" <<zzline++; zzmode(BEGIN_LINE);>>
*** or ***
#token "\n" <<zzline++;
if (zzchar=='!') zzmode(BEGIN_LINE);>>
#lexclass BEGIN_LINE
#token BANG "!" <<zzmode(START);>>
#token "~[]" <<zzmode(START); zzmore();>>
When a newline is encountered the #lexclass BEGIN_LINE is entered. If
the next character is a "!" it returns the token "BANG" and returns
to #lexclass START. If the next character is anything else it calls
zzmore to accumulate additional characters for the token and, as before,
returns to #lexclass START. (The order of calls to zzmode() and zzmore()
is not significant).
There are two limitations to this.
a. If there are other single character tokens which can appear in the first
column then zzmore() won't work because the entire token has already been
consumed. Thus all single character tokens which can appear in column 1
must appear in both #lexclass START and #lexclass BEGIN_LINE.
b. The first character of the first line is not preceded by a newline,
so DLG will start in the wrong state. Thus you might want to rename
"BEGIN_LINE" to "START" and "START" to "NORMAL".
Another solution is to use ANTLRf (input from a function) to insert
your own function to do a limited amount of lexical processing
which is difficult to express in DLG.
See Example 4 below.
===============================================================================
Section on #lexclass
-------------------------------------------------------------------------------
33. Special care should be taken when using "in-line" regular expressions
in rules if there are multiple lexical classes (#lexclass). ANTLR will
place such regular expressions in the last lexical class defined. If
the last lexical class was not START you may be surprised.
#lexclass START
....
#lexclass COMMENT
....
inline_example: symbol "=" expression
This will place "=" in the #lexclass COMMENT (where
it will never be matched) rather than the START #lexclass
where the user meant it to be.
Since it is okay to specify parts of the #lexclass in several pieces
it might be a good idea when using #lexclass to place "#lexclass START"
just before the first rule - then any in-line definitions of tokens
will be placed in the START #lexclass automatically.
#lexclass START
...
#lexclass A
...
#lexclass B
...
#lexclass START
34. Good examples of the use of #lexclass are the definitions for C
and C++ style comments, character literals, and string literals which
can be found in pccts/antlr/lang/C/decl.g - or see Example 1 below.
===============================================================================
Section on rules
-------------------------------------------------------------------------------
35. If a rule is not used (is an orphan) it can lead to unanticipated
reports of ambiguity. Use the ANTLR cross-reference option (-cr) to
locate rules which are not referenced.
36. There is no way to express the idea "Any single token is acceptable
at this point in the parse". In other words, there is no syntactic
equivalent to the lexical regular expression "~[]". There are plans to
add such a capability (".") to version 1.20 or 2.00 of ANTLR.
37. Don't confuse init-actions with actions which precede a rule.
If the first element following the start of a rule or sub-rule
is an action it is always interpreted as an init-action.
An init-action occurs in a scope which includes the entire rule or sub-rule.
An action which is NOT an init-action is enclosed in "{" and "}" during
generation of code for the rule and has essentially zero scope - the
action itself.
The difference between an init-action and an action which precedes
a rule can be especially confusing when an action appears at the start
of an alternative:
These APPEAR to be almost identical, but they aren't:
b : <<int i=0;>> b1 > [i] /* b1 <<...>> is an init-action */
| <<int j=0;>> b2 > [j] /* b2 <<...>> is part of the rule */
; /* and will cause a compilation error */
On line "b1" the <<...>> appears immediately after the beginning of the
rule making it an init-action. On line "b2" the <<...>> does NOT appear at
the start of a rule or sub-rule, thus it is interpreted as an action which
happens to precede the rule.
This can be especially dangerous if you are in the habit of rearranging
the order of alternatives in a rule. For instance:
Changing this:
b : <<int i=0,j=0;>> <<i++;>> b1 > [i] /* c1 */
| <<j++;>> b1 > [i] /* c2 */
;
to:
b : /* empty production */ /* d1 */
| <<int i=0,j=0;>> <<i++;>> b1 > [i] /* d2 */
| <<j++;>> b1 > [i]
;
or to this:
b
: <<j++;>> b1 > [i] /* e1 */
| <<int i=0,j=0;>> <<i++;>> b1 > [i] /* e2 */
changes an init-action into a non-init action, and vice-versa.
38. In the case of sub-rules such as (...)+ and (...)* the
init-action is executed just once before the sub-rule is entered.
Consider the following example from section 3.6.1 (page 29) of the 1.00
manual:
a : <<List *p=NULL;>> // initialize list
Type
( <<int i=0;>> // initialize index
Var <<append(p,i++,$1);>>
)*
<<OperateOn(p);>>
;
39. Associativity and precedence of operations is determined by
nesting of rules. In the example below "=" associates to the right
and has the lowest precedence. Operators "+" and "*" associate to
the left with "*" having the highest precedence.
expr0 : expr1 {"=" expr0};
expr1 : expr2 ("\+" expr2)*;
expr2 : expr3 ("\*" expr3)*;
expr3 : ID;
See Example 2.
40. Fail actions for a rule can be placed after the final ";" of
a rule. These will be:
"executed after a syntax error is detected but before
a message is printed and the attributes have been destroyed.
However, attributes are not valid here because one does not
know at what point the error occurred and which attributes
even exist. Fail actions are often useful for cleaning up
data structures or freeing memory."
(Page 29 of 1.00 manual)
Example of a fail action:
a : <<List *p=NULL;>>
( Var <<append(p,$1);>> )+
<<operateOn(p);rmlist(p);>>
; <<rmlist(p);>>
************** <--- Fail Action
41. When you have rules with large amounts of lookahead (that may
cross several lines) you can use the ANTLR -gk option to make an
ANTLR-generated parser delay lookahead fetches until absolutely
necessary. To get better line number information (e.g. for error
messages or #line directives) place an action which will save
"zzline" in a variable at the start of the production where you
want better line number information:
a : <<int saveCurrentLine;>>
<<saveCurrentLine = zzline;>> A B C
<< /* use saveCurrentLine not zzline here */ >>
| <<saveCurrentLine = zzline;>> A B D
<< /* use saveCurrentLine not zzline here */ >>
;
After the production has been matched you can use saveCurrentLine
rather than the bogus "zzline".
Contributed by Terence "The ANTLR Guy" Parr (parrt@acm.org)
A new macro ZZINF_LINE() has been added to extract line information
that is saved in a manner similar to LATEXT. This patch was added
to the pccts FTP site on 27-Feb-94.
42. An easy way to get a list of the names of all the rules is
to grep tokens.h for the string "void", or to edit the output from the
ANTLR -cr option (cross-reference).
===============================================================================
Section on Attributes
-------------------------------------------------------------------------------
43. Attributes are built automatically only for terminals. For
rules (non-terminals) one must assign an attribute to $0, use the
$[token,...] convention for creating attributes, or use zzcr_attr.
44. If you define a zzcr_attr or zzmk_attr which allocates resources
such as strings from the heap don't forget to define a zzd_attr routine
to release the resources when the attribute is deleted.
45. Attributes go out of scope when the rule or sub-rule that defines
them is exited. Don't try to pass them to an outer rule or a sibling
rule. The only exception is $0 which may be passed back to the containing
rule as a return argument. However, if the attribute contains a pointer
which is copied (e.g. charptr.c) then extra caution is required because
of the actions of zzd_attr(). See the next item for more information.
46. The charptr.c routines use a pointer to a string. The string itself
will go out of scope when the rule or sub-rule is exited. Why? The
string is copied to the heap when ANTLR calls the routine zzcr_attr
supplied by charptr.c - however ANTLR also calls the charptr.c supplied
routine zzd_attr (which frees the allocated string) as soon as the rule
or sub-rule exits. The result is that in order to pass charptr.c strings
to outer rules (for instance to $0) it is necessary to make an independent
copy of the string using strdup or else zero the pointer to prevent its
deallocation.
47. To initialize $0 of a sub-rule use a construct like the following:
decl : typeID
Var <<$2.type = $1;>>
( "," Var <<$2.type = $0;>>)*[$1]
****
See section 4.1.6.1 (page 29) of the 1.00 manual
48. One can use the zzdef0 macro to define a standard method for
initializing $0 of a rule or sub-rule. If the macro is defined it is
invoked as zzdef0(&($0)).
See section 4.1.6.1 (page 29) of the 1.00 manual
49. The expression $$ refers to the attribute of the named rule.
The expression $0 refers to the attribute of the enclosing rule,
(which might be a sub-rule).
rule : a b (c d (e f g) h) i
For (e f g) $0 becomes $3 of (c d ... h). For (c d ... h) $0 becomes
$3 of (a b ... i). However $$ always is equivalent to $rule.
50. If you construct temporary attributes in the middle of the
recognition of a rule, remember to deallocate the structure should the
rule fail. The code for failure goes after the ";" and before the next
rule. For this reason it is sometimes desirable to defer some processing
until the rule is recognized, rather than doing it at the otherwise most
natural point.
===============================================================================
Section on ASTs
-------------------------------------------------------------------------------
51. If you define a zzcr_ast or zzmk_ast which allocates resources
such as strings from the heap don't forget to define a zzd_ast routine
to release the resources when the AST is deleted.
52. If you construct temporary ASTs in the middle of the recognition of a
rule, remember to deallocate the structure should the rule fail. The code
for failure goes after the ";" and before the next rule. For this reason
it is sometimes desirable to defer some processing until the rule is
recognized, rather than doing it at the otherwise most natural point.
53. If you want to place prototypes for routines that have an AST
as an argument in the #header directive you should explicitly
#include "ast.h" after the #define AST_FIELDS and before any references
to AST:
#define AST_FIELDS int token;char *text;
#include "ast.h"
#define zzcr_ast(ast,attr,tok,astText) \
create_ast(ast,attr,tok,text)
void create_ast (AST *ast,Attr *attr,int tok,char *text);
54. The make-a-root operator for ASTs ("^") can be applied only to
terminals. I think this is because a child rule might return a tree rather
than a single AST. If it did then it could not be made into a root
as it is already a root and the corresponding fields of the structure
are already in use. To make an AST returned by a called rule a root use
the expression: #(root-rule sibling1 sibling2 sibling3).
add ! : expr ("\+"^ expr) ; // Is ok
addOperator ! : expr (addOp^ expr) ; // Is NOT ok
addOp : "\+" | "\-" ; //
Example 2 describes a workaround for this restriction.
55. Do not assign to #0 of a rule unless automatic construction of ASTs
has been disabled using the "!" operator:
a! : x y z <<#0=#(#1 #2 #3);>> // ok
a : x y z <<#0=#(#1 #2 #3);>> // NOT ok
The reason for the restriction is that assignment to #0 will cause any
ASTs pointed to by #0 to be lost when the pointer is overwritten. (Three
cheers for C++ copy constructors).
The stated restriction is somewhat stronger than necessary. You can
assign to #0 even when using automated AST construction, if the old
tree pointed to by #0 is part of the new tree constructed by #(...).
For example:
#token COMMA ","
#token STMT_LIST
stmt_list: stmt (COMMA stmt)* <<#0=#(#[STMT_LIST],#0);>>
The automatically constructed tree pointed to by #0 is just put at the
end of the new list, so nothing is lost.
However, if you reassign to #0 in the middle of the rule, automatic tree
construction will result in the addition of remaining elements at the end
of the new tree. This is not recommended by TJP.
56. If you use ASTs you have to pass a root AST to ANTLR.
AST *root=NULL;
again:
ANTLR (start(&root),stdin);
walk_the_tree(root);
zzfree_ast(root);
root=NULL;
goto again;
57. zzfree_ast(AST *tree) will recursively descend the AST tree and free
all sub-trees. The user should supply a routine zzd_ast to
free any resources used by a single node - such as pointers to
character strings allocated on the heap. See Example 2 on associativity
and precedence.
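A sketch of the recursive descent zzfree_ast performs, using the
child/sibling layout described in item 6 (the type, field names, and the
my_zzd_ast hook below are simplified stand-ins for the real pccts
definitions):

```c
#include <stdlib.h>

/* Simplified stand-ins for the pccts definitions: "down" points at
   the first child, "right" at the next sibling (see item 6), and
   my_zzd_ast() plays the role of the user-supplied zzd_ast hook. */
typedef struct node {
    struct node *down, *right;
    char *text;                  /* per-node resource owned by the node */
} Node;

int nodes_freed = 0;             /* for observing the traversal */

void my_zzd_ast(Node *n)         /* release per-node resources */
{
    free(n->text);
    nodes_freed++;
}

void free_ast(Node *n)           /* what zzfree_ast does, in outline */
{
    if (n == NULL)
        return;
    free_ast(n->down);           /* free the subtree below this node */
    free_ast(n->right);          /* then the remaining siblings */
    my_zzd_ast(n);
    free(n);
}
```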
58. AST elements in rules are assigned numbers in the same fashion as
attributes with three exceptions:
1. A hole is left in the sequence when sub-rules are encountered.
2. #0 is the AST of the named rule, not the sub-rule - see the next item
3. There is nothing analogous to $i.j notation (which allows one
to refer to attributes from earlier in the rule). In other words,
you can't use #i.j notation to refer to an AST created earlier
in the rule.
There are plans to add notation similar to Sorcerer's
tagged elements, but this will not appear for a while.
(24-Mar-94)
Consider the following example:
a : b // B is #1 for the rule
(c d)* // C is #1 when scope is inside the sub-rule
// D is #2 when scope is inside the sub-rule
// You may *NOT* refer to b as #1.1
e // E is #3 for the rule
// There is NO #2 for the rule
59. The expression #0 refers to the AST of the named rule. Thus it is
a misnomer and (for consistency) should probably have been named ## or #$.
There is nothing equivalent to $0 for ASTs. This is probably because
sub-rules aren't assigned AST numbers in a rule. See above.
60. Associativity and precedence of operations is determined by nesting
of rules. In the example below "=" associates to the right and has the
lowest precedence. Operators "+" and "*" associate to the left with "*"
having the highest precedence.
expr0 : expr1 {"=" expr0};
expr1 : expr2 ("\+" expr2)*;
expr2 : expr3 ("\*" expr3)*;
expr3 : ID;
In Example 2 the zzpre_ast routine is used to walk all the AST nodes. The
AST nodes are numbered during creation so that one can see the order in
which they are created and the order in which they are deleted. Do not
confuse the "#" in the sample output with the AST numbers used to refer to
elements of a rule in the action part of the rule. The "#" in the
sample output are just to make it simpler to match elements of the
expression tree with the order in which zzd_ast is called for each
node in the tree.
61. If the make-as-root operator were not used in the rules:
;expr0 : expr1 {"=" expr0}
;expr1 : expr2 ("\+" expr2)*
;expr2 : expr3 ("\*" expr3)*
;expr3 : ID
With input:
a+b*c
The output would be:
a <#1> \+ <#2> b <#3> \* <#4> c <#5> NEWLINE <#6>
zzd_ast called for <node #6>
zzd_ast called for <node #5>
zzd_ast called for <node #4>
zzd_ast called for <node #3>
zzd_ast called for <node #2>
zzd_ast called for <node #1>
62. Suppose that one wanted to replace the terminal "+" with the rule:
addOp : "\+" | "-" ;
Then one would be unable to use the "make-as-root" operator because
it can be applied only to terminals. This is one possible workaround:
expr : (expr0 NEWLINE)
;expr0 : expr1 {"="^ expr0}
;expr1! : expr2 <<#0=#1;>>
(addOp expr2 <<#0=#(#1,#0,#2);>> )*
;expr2 : expr3 ("\*"^ expr3)*
;expr3 : ID
;addOp : "\+" | "\-"
With input:
a-b-c
The output is:
( \- <#4> ( \- <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
The "!" for rule "expr1" disables automatic constructions of ASTs in the
rule. This allows one to manipulate #0 manually. If the expression had
no addition operator then the sub-rule "(addOp expr)*" would not be
executed and #0 will be assigned the AST constructed by rule expr2 (i.e.
AST #1). However if there is an addOp present then each time the sub-rule
is rescanned due to the "(...)*" the current tree in #0 is placed as the
first of two siblings underneath a new tree. This new tree has the AST
returned by addOp (AST #1 of the addOp sub-rule) as the root.
63. There is an option for doubly linked ASTs in the module ast.c. It
is controlled by #define zzAST_DOUBLE. See page 12 of the 1.06 manual
for more information.
64. If a rule which creates an AST is called and the result is not
linked into the tree being constructed then zzd_ast will not be called
to release the resources used by the rule. This is important
when rules are used in syntactic predicates. The following construct
is dangerous because the ASTs created when "b_predicate" calls "b" will
be lost unless they are explicitly deallocated by a call to zzfree_ast().
a : (b_predicate) ? b
b_predicate ! : b !
b : c d e ;
===============================================================================
Section on Semantic Predicates
-------------------------------------------------------------------------------
65. There is a bug in 1.10 which prevents semantic predicates from including
an unescaped newline. The predicate is incorrectly "string-ized" in the
call to zzfailed_predicate.
rule: <<do_test();
this_will_not_work()>>? x y z;
rule: <<do_test();\
this_will_work()>>? x y z;
66. Semantic predicates are enclosed in "<<... >>?" but because they are
inside "if" statements they normally do not end with a ";" - unlike other
code enclosed in "<<...>>" in ANTLR.
67. Semantic predicates which are not the first element in the rule or
sub-rule become "validation predicates" and are not used for prediction.
After all, if there are no alternatives, then there is no need for
prediction - and alternatives exist only at the left edge of rules
and sub-rules. Even if a semantic predicate is on the left edge, there
is no guarantee that it will be part of the prediction expression.
Consider the following two examples:
a : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* a1 */
| ID glob /* a2 */
;
b : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* b1 */
| NUMBER glob /* b2 */
;
Rule a requires the semantic predicate to disambiguate alternatives
a1 and a2 because the rules are otherwise identical. Rule b has a
token type of NUMBER in alternative b2 so it can be distinguished from
b1 without evaluation of the semantic predicate during prediction. In
both cases the semantic predicate will be evaluated inside the rule.
When the tokens which can follow a rule allow ANTLR to disambiguate the
expression without resort to semantic predicates ANTLR may not evaluate
the semantic predicate in the prediction code. For example:
simple_func : <<LA(1)==ID ? isSimpleFunc(LATEXT(1)) : 1>>? ID
complex_func : <<LA(1)==ID ? isComplexFunc(LATEXT(1)) : 1>>? ID
function_call : "(" ")"
func : simple_func function_call
| complex_func "." ID function_call
In this case, a "simple_func" MUST be followed by a "(", and a
"complex_func" MUST be followed by a ".", so it is unnecessary to evaluate
the semantic predicates in order to predict which of the alternatives to
use. A simple test of the lookahead tokens is sufficient. As stated before,
the semantic predicates will still be used to validate the rule.
68. Suppose that the requirement that semantic predicates used in
prediction expressions appear on the left edge were lifted. Consider
the following code segment:
cast_expr /* a1 */
: LP typedef_name RP cast_expr /* a2 */
| expr13 /* a3 */
;expr13 /* a4 */
: id_name /* a5 */
| LP cast_expr RP /* a6 */
;typedef_name /* a7 */
: <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* a8 */
;id_name /* a9 */
: ID /* a10 */
Now consider the token sequences:
Token: #1 #2 #3 #4
-- ----------------------- -- --
"(" ID-which-is-typedef ")" ID
"(" ID-which-is-NOT-typedef ")"
Were the semantic predicate at line a8 hoisted to predict which alternative
of cast_expr to use (a2 or a3) the program would use the wrong lookahead
token (LA(1) and LATEXT(1)) rather than LA(2) and LATEXT(2) to check for an
ID which satisfies "isTypedefName()". This problem could perhaps be
solved by application of sufficient ingenuity, however, in the meantime
the solution is to rewrite the rules so as to move the decision point
to the left edge of the production.
First perform in-line expansion of expr13 in cast_expr:
cast_expr /* b1 */
: LP typedef_name RP cast_expr /* b2 */
| id_name /* b3 */
| LP cast_expr RP /* b4 */
Secondly, move the alternatives (in cast_expr) beginning with LP to a
separate rule so that "typedef_name" and "cast_expr" will be on the left edge:
cast_expr /* c1 */
: LP cast_expr_suffix /* c2 */
| id_name /* c3 */
;cast_expr_suffix /* c4 */
: typedef_name RP cast_expr /* c5 */
| cast_expr RP /* c6 */
;typedef_name /* c7 */
: <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* c8 */
;id_name /* c9 */
: ID /* c10 */
This will result in the desired treatment of the semantic predicate to
choose from alternatives c5 and c6.
69. Validation predicates are evaluated by the parser. If they fail, a
call to zzfailed_predicate(string) is made. To disable the message
redefine the macro zzfailed_predicate(string) or use the optional
"failed predicate" action which is enclosed in "[" and "]" and follows
immediately after the predicate:
a : <<LA(1)==ID ?
isTypedef(LATEXT(1)) : 1>>?[printf("Not a typedef\n");]
70. An expression in a semantic predicate (e.g. <<isFunc()>>? ) should not
have side-effects. If there is no match then the rest of the rule using the
semantic predicate won't be executed.
71. A documented restriction of ANTLR is the inability to hoist multiple
semantic predicates. However, no error message is given when one attempts
this. When compiled with -k 1 and -ck 2, ANTLR generates inappropriate
code in "statement" when attempting to predict "expr":
statement
: expr
| declaration
;expr
: commandName BARK
| typedefName GROWL
;declaration
: typedefName BARK
;typedefName
: <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1>>? ID
;commandName
: <<LA(1)==ID ? isCommand(LATEXT(1)) : 1>>? ID
;
Some help is obtained by using leading actions to inhibit hoisting as
described in the next note.
72. Leading actions will inhibit the hoisting of semantic predicates into
the prediction of rules.
expr_rhs
: <<;>> <<>> expr0
| command
See the section on known bugs for a more complete example.
73. When using semantic predicates in ANTLR it is *IMPORTANT* to
understand what the "-prc on" ("predicate context computation")
option does and what "-prc off" doesn't do. Consider the following
example:
+------------------------------------------------------+
| Note: All examples in this sub-section are based on |
| code generated with -k=1 and -ck=1. |
+------------------------------------------------------+
expr : upper
| lower
| number
;
upper : <<isU(LATEXT(1))>>? ID ;
lower : <<isL(LATEXT(1))>>? ID ;
number : NUMBER ;
With "-prc on" ("-prc off" is the default) the code for expr() to predict
upper() would resemble:
if (LA(1)==ID && isU(LATEXT(1)) && LA(1)==ID) { /* a1 */
upper(zzSTR); /* a2 */
} /* a3 */
else { /* a4 */
if (LA(1)==ID && isL(LATEXT(1)) && LA(1)==ID) { /* a5 */
lower(zzSTR); /* a6 */
} /* a7 */
else { /* a8 */
if (LA(1)==NUMBER) { /* a9 */
zzmatch(NUMBER); /* a10 */
} /* a11 */
else /* a12 */
{zzFAIL();goto fail;} /* a13 */
} /* a14 */
} ...
...
*******************************************************
*** ***
*** Starting with version 1.20: ***
*** Predicate tests appear AFTER lookahead tests ***
*** ***
*******************************************************
Note that each test of LATEXT(i) is guarded by a test of the token type
(e.g. "LA(1)==ID && isU(LATEXT(1))").
With "-prc off" the code would resemble:
if (isU(LATEXT(1)) && LA(1)==ID) { /* b1 */
upper(zzSTR); /* b2 */
} /* b3 */
else { /* b4 */
if (isL(LATEXT(1)) && LA(1)==ID) { /* b5 */
lower(zzSTR); /* b6 */
} /* b7 */
else { /* b8 */
if ( (LA(1)==NUMBER) ) { /* b9 */
zzmatch(NUMBER); /* b10 */
} /* b11 */
else /* b12 */
{zzFAIL();goto fail;} /* b13 */
} /* b14 */
} ...
...
Thus, when coding the grammar for use with "-prc off", it is necessary
to do something like:
        upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
        lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
This makes sure that a token of type NUMBER is not passed to isU() or
isL() when using "-prc off".
So, you say to yourself, "-prc on" is good and "-prc off" is bad. Wrong.
Consider the following slightly more complicated example in which the
first alternative of rule "expr" contains tokens of two different types:
expr : ( upper | NUMBER ) NUMBER
| lower
| ID
;
upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
number : NUMBER ;
With "-prc off" the code would resemble:
...
{ /* c1 */
if (LA(1)==ID && isU(LATEXT(1)) && /* c2 */
( LA(1)==ID || LA(1)==NUMBER) ) { /* c3 */
{ /* c4 */
if (LA(1)==ID) { /* c5 */
upper(zzSTR); /* c6 */
} /* c7 */
else { /* c8 */
if (LA(1)==NUMBER) { /* c9 */
zzmatch(NUMBER); /* c10 */
} /* c11 */
else {zzFAIL();goto fail;}/* c12 */
} /* c13 */
} ...
...
Note that if the token is a NUMBER (i.e. LA(1)==NUMBER) then the clause at
line c2 ("LA(1)==ID && ...") will always be false, which implies that the
test in the "if" statement (lines c2/c3) will always be false. (In other
words LA(1)==NUMBER implies LA(1)!=ID). Thus the sub-rule for NUMBER at
line c9 can never be reached.
With "-prc on" essentially the same code is generated, although it
is not necessary to manually code a test for token type ID preceding
the call to "isU()".
The workaround is to bypass the heart of the predicate when
testing the wrong type of token:
upper : <<LA(1)==ID ? isU(LATEXT(1)) : 1>>? ID ;
lower : <<LA(1)==ID ? isL(LATEXT(1)) : 1>>? ID ;
Then with "-prc off" the code would resemble:
...
{ /* d1 */
if ( (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* d2 */
(LA(1)==ID || LA(1)==NUMBER) ) { /* d3 */
...
...
With this correction the body of the "if" statement is now reachable
even if the token type is NUMBER - the "if" statement does what one
wants.
With "-prc on" the code would resemble:
... /* e1 */
if (LA(1)==ID && /* e2 */
(LA(1)==ID ? isU(LATEXT(1)) : 1) && /* e3 */
(LA(1)==ID || LA(1)==NUMBER) ) { /* e4 */
...
...
Note that the problem of the unreachable "if" statement body has
reappeared because of the redundant test ("e2") added by the predicate
computation.
The lesson seems to be: when a rule has alternatives that are "visible"
to ANTLR (within the lookahead distance) and that begin with different
token types, it is probably dangerous to use "-prc on".
74. You cannot use downward inheritance to pass parameters
to semantic predicates which are NOT validation predicates. The
problem appears when the semantic predicate is hoisted into a
parent rule to predict which rule to call. For instance:
        a  : b1 [flag]
           | b2
           | b3
        ;b1 [int flag]
           : <<LA(1)==ID && flag && hasPropertyABC(LATEXT(1))>>? ID
        ;b2
           : <<LA(1)==ID && hasPropertyXYZ(LATEXT(1))>>? ID
        ;b3 : ID ;
When the semantic predicate is evaluated within rule "a" to determine
whether to call b1, b2, or b3, the compiler will discover that there
is no variable named "flag" in procedure a(). If you are unlucky
enough to have a variable named "flag" in a(), then you will have a
VERY difficult-to-find bug.
The -prc option has no effect on this behavior.
75. Another reason why semantic predicates must not have side effects is
that when they are hoisted into a parent rule in order to decide which
rule to call they will be invoked twice: once as part of the prediction
and a second time as part of the validation of the rule.
Consider the example above of upper and lower. When the input does
in fact match "upper" the routine isU() will be called twice: once inside
expr() to help predict which rule to call, and a second time in upper() to
validate the prediction. If the second test fails the macro zzpred_fail()
is called.
As far as I can tell, there is no simple way to disable the use of a
semantic predicate for validation after it has been used for prediction.
===============================================================================
Section on Syntactic Predicates (also known as "Guess Mode")
-------------------------------------------------------------------------------
76. The terms "infinite lookahead", "guess mode", and "syntactic
predicates" are all equivalent. The term "syntactic predicate" emphasizes
that it is handled by the parser. The term "guess mode" emphasizes that the
parser may have to backtrack. The term "infinite lookahead" emphasizes
the implementation in ANTLR: the entire input is read, processed, and
tokenized by DLG before ANTLR begins parsing.
77. An expression in a syntactic predicate should not have side-effects.
If there is no match then the rule using the syntactic predicate won't be
executed.
78. When using syntactic predicates, the entire input is read and
tokenized by DLG before parsing by ANTLR begins. If a "wrong" guess
requires that parsing be rewound to an earlier point, all attributes
that were created during the guess are destroyed; parsing then begins
again, creating new attributes as it reparses the (previously)
tokenized input.
79. In infinite lookahead mode the line and column information is
hopelessly out-of-sync because zzline will contain the line number of
the last line of input - the entire input was scanned before parsing
began. The line and column information is not restored
during backtracking. To keep track of the line information in a
meaningful way one has to use the new ZZINF_LINE macro, which was added
as a patch to the FTP site on 27-Feb-94.
Putting line and column information in a field of the attribute will not
help. The attributes are created by ANTLR, not DLG, and when ANTLR
backtracks it destroys any attributes that were created in making the
incorrect guess.
80. As infinite lookahead mode causes the entire input to be scanned
by DLG before ANTLR begins parsing, one cannot depend on feedback from
the parser to the lexer to handle things like providing special token codes
for items which are in a symbol table (the "lex hack" for typedefs
in the C language). Instead one MUST use semantic predicates which allow
for such decisions to be made by the parser.
81. One cannot use an interactive scanner (ANTLR -gk option) with the
ANTLR infinite lookahead and backtracking options (syntactic predicates).
82. An example of the need for syntactic predicates is the case where
relational expressions involving "<" and ">" are enclosed in angle bracket
pairs.
Relation: a < b
Array Index: b <i>
Problem: a < b<i>
vs. b < a>
I was going to make this into an extended example, but I haven't had
time yet.
83. If your syntactic predicate invokes routines which build ASTs this
violates the rule that syntactic predicates should not have side effects.
You'll end up with lots of extra nodes added to the AST.
84. The following is an example of the use of syntactic predicates.
program : ( s SEMI )* ;
s : ( ID EQUALS )? ID EQUALS e
| e
;
e : t ( PLUS t | MINUS t )* ;
t : f ( TIMES f | DIV f )* ;
f : Num
| ID
| "\(" e "\)"
;
When compiled with:
antlr -fe err.c -fh stdpccts.h -fl parser.dlg -ft tokens.h \
-fm mode.h -k 1 test.g
One gets the following warning:
warning: alts 1 and 2 of the rule itself ambiguous upon { ID }
even though the manual suggests that this is okay. The only problem is
that ANTLR 1.10 should NOT issue this warning unless the -w2 option
is selected.
Included with permission of S. Salters
===============================================================================
Section on Inheritance
-------------------------------------------------------------------------------
85. A rule which uses upward inheritance:
        rule > [int result] : x | y | z;
is simply declaring a function which returns an "int" as its function
value. If the function has more than one item passed via upward
inheritance then ANTLR creates a structure to hold the result and
then copies each component of the structure to the upward inheritance
variables.
86. When writing a rule that uses downward inheritance:
rule [int *x] : r1 r2 r3
one should remember that the arguments passed via downward inheritance are
simply arguments to a function. If one is using downward inheritance
syntax to pass results back to the caller (really upward inheritance !)
then it is necessary to pass the address of the variable which will receive
the result.
87. ANTLR is smart enough to combine the declaration for an AST with
the items declared via downward inheritance when constructing the
prototype for a function which uses both ASTs and downward inheritance.
===============================================================================
Section on LA, LATEXT, NLA, and NLATEXT
-------------------------------------------------------------------------------
88. Need examples of LATEXT for various forms of lookahead.
===============================================================================
Section on Prototypes
-------------------------------------------------------------------------------
89. Prototype for typical create_ast routine:
#define zzcr_ast(ast,attr,tok,astText) \
        create_ast(ast,attr,tok,astText)
void create_ast (AST *ast,Attrib *attr,int tok,char *astText);
90. Prototype for typical make_ast routine:
AST *zzmk_ast (AST *ast,int token,char *text);
91. Prototype for typical create_attr routine:
#define zzcr_attr(attr,token,text) \
create_attr(attr,token,text)
void create_attr (Attrib *attr,int token,char *text);
92. Prototype for ANTLR (these are actually macros):
read from file: void ANTLR (void startRule(...),FILE *)
read from string: void ANTLRs (void startRule(...),zzchar_t *)
read from function: void ANTLRf (void startRule(...),int (*)())
In the call to ANTLRf the function behaves like getchar()
in that it returns EOF (-1) to indicate end-of-file.
If ASTs are used or there is downward or upward inheritance then the
call to the startRule must pass these arguments:
AST *root;
ANTLRf (startRule(&root),stdin);
===============================================================================
Section on ANTLR/DLG Internals Routines That Might Be Useful
-------------------------------------------------------------------------------
****************************
****************************
** **
** Use at your own risk **
** **
****************************
****************************
93. static int zzauto - defined in dlgauto.h
Current DLG mode. This is used by zzmode() only.
94. void zzerr (char * s) defined in dlgauto.h
Defaults to zzerrstd(char *s) in dlgauto.h
Unless replaced by a user-written error reporting routine:
fprintf(stderr,
"%s near line %d (text was '%s')\n",
((s == NULL) ? "Lexical error" : s),
zzline,zzlextext);
95. static char zzebuf[70] defined in dlgauto.h
===============================================================================
Section on Known Minor Bugs
-------------------------------------------------------------------------------
96. FoLink() bug and its fix reported 24-Feb-94. This fixes a bug in
ANTLR which caused nodes to be revisited multiple times during a traversal
of trees. Under some circumstances large grammars could take 50 times as
long to analyze.
97. As of 24-Feb-94 there was an off-by-one problem in the testing for
lowercase in range expressions. The problem appears if the last character
in the range is "z" or "Z" (e.g. [a-z]). The temporary workaround is to
use a range of the form "[a-yz]".
98. As of 22-Mar-94 there was a bug in the hoisting of semantic predicates
when all alternatives of a rule have the same lookahead token. (I believe
this is a special case of the inability of ANTLR to hoist multiple
semantic predicates). The following example uses ANTLR options
"-prc off -k 1 -ck 3". Consider the following:
obj_name
: global_func_id
| simple_class SUFFIX_DOT
| ID
;simple_class
: <<(LA(1)==ID ? isClass(LATEXT(1)) : 1)>>? ID
;global_func_id
: <<(LA(1)==ID ? isFunction(LATEXT(1)) : 1)>>? ID
One would expect prediction code to look something like the following:
if ( (test_for_class || test_for_function) && LA(1)==ID ...
...
if (test_for_class && LA(1)==ID) simple_class()
...
else if (test_for_function && LA(1)==ID) global_func_id()
...
else if (LA(1)==ID) {match();consume();}
But what one sees is the use of "&&" rather than "||". Below is the
complete example; it is not meant to execute, but only to be sufficient
to generate ANTLR output for inspection.
The workaround is to precede semantic predicates with an action
(such as "<<;>>") which inhibits hoisting into the prediction
expression.
#header
<<
#include "charptr.h"
>>
#token QUESTION
#token COMMA
#token ID
#token Eof "@"
#token SUFFIX_DOT
statement
: (information_request)?
| obj_name
;information_request
: QUESTION id_list ( QUESTION )*
| id_list ( QUESTION )+
;id_list
: obj_name ( COMMA | obj_name )*
;simple_class
: <<(LA(1)==ID ? isClass(LATEXT(1)) : 1)>>? ID
;global_func_id
: <<(LA(1)==ID ? isFunction(LATEXT(1)) : 1)>>? ID
;obj_name
: <<;>> global_func_id
| <<;>> simple_class SUFFIX_DOT
| ID
;
===============================================================================
Example 1 of #lexclass
===============================================================================
Borrowed code
-------------------------------------------------------------------------------
/*
* Various tokens
*/
#token "[\t\ ]+" << zzskip(); >> /* Ignore whitespace */
#token "\n" << zzline++; zzskip(); >> /* Count lines */
#token "\"" << zzmode(STRINGS); zzmore(); >>
#token "'" << zzmode(CHARACTERS); zzmore(); >>
#token "/\*" << zzmode(COMMENT); zzskip(); >>
#token "//" << zzmode(CPPCOMMENT); zzskip(); >>
/*
* C++ String literal handling
*/
#lexclass STRINGS
#token STRING "\"" << zzmode(START); >>
#token "\\\"" << zzmore(); >>
#token "\\n" << zzreplchar('\n'); zzmore(); >>
#token "\\r" << zzreplchar('\r'); zzmore(); >>
#token "\\t" << zzreplchar('\t'); zzmore(); >>
#token "\\[1-9][0-9]*"    << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
                             zzmore(); >>
#token "\\0[0-7]*"        << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
                             zzmore(); >>
#token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
                             zzmore(); >>
#token "\\~[\n\r]" << zzmore(); >>
#token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
#token "~[\"\n\r\\]+" << zzmore(); >>
/*
* C++ Character literal handling
*/
#lexclass CHARACTERS
#token CHARACTER "'" << zzmode(START); >>
#token "\\'" << zzmore(); >>
#token "\\n" << zzreplchar('\n'); zzmore(); >>
#token "\\r" << zzreplchar('\r'); zzmore(); >>
#token "\\t" << zzreplchar('\t'); zzmore(); >>
#token "\\[1-9][0-9]*"    << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
                             zzmore(); >>
#token "\\0[0-7]*"        << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
                             zzmore(); >>
#token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
                             zzmore(); >>
#token "\\~[\n\r]" << zzmore(); >>
#token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
#token "~[\'\n\r\\]" << zzmore(); >>
/*
* C-style comment handling
*/
#lexclass COMMENT
#token "\*/" << zzmode(START); zzskip(); >>
#token "~[\*]*" << zzskip(); >>
#token "\*~[/]" << zzskip(); >>
/*
* C++-style comment handling
*/
#lexclass CPPCOMMENT
#token "[\n\r]" << zzmode(START); zzskip(); >>
#token "~[\n\r]" << zzskip(); >>
#lexclass START
/*
* Assorted literals
*/
#token OCT_NUM "0[0-7]*"
#token L_OCT_NUM "0[0-7]*[Ll]"
#token INT_NUM "[1-9][0-9]*"
#token L_INT_NUM "[1-9][0-9]*[Ll]"
#token HEX_NUM "0[Xx][0-9A-Fa-f]+"
#token L_HEX_NUM "0[Xx][0-9A-Fa-f]+[Ll]"
#token FLOAT_NUM "([1-9][0-9]*{.[0-9]*} | {0}.[0-9]+) {[Ee]{[\+\-]}[0-9]+}"
/*
* Identifiers
*/
#token Identifier "[_a-zA-Z][_a-zA-Z0-9]*"
===============================================================================
Example 2: ASTs
===============================================================================
#header <<
#include "charbuf.h"
#include <string.h>
int nextSerial;
#define AST_FIELDS int token; int serial; char *text;
#include "ast.h"
#define zzcr_ast(ast,attr,tok,astText)                          \
        (ast)->token=tok;                                       \
        (ast)->text=strdup( (char *) &( ( (attr)->text ) ) );   \
        nextSerial++;                                           \
        (ast)->serial=nextSerial;
#define zzd_ast(node) delete_ast(node)
void delete_ast (AST *node);
>>
<<
AST *root=NULL;
void show(AST *tree) {
if (tree->token==ID) {
printf (" %s <#%d> ",
tree->text,tree->serial);}
else {
printf (" %s <#%d> ",
zztokens[tree->token],
tree->serial);
};
}
void before (AST *tree) {
printf ("(");
}
void after (AST *tree) {
printf (")");
}
void delete_ast(AST *node) {
printf ("\nzzd_ast called for <node #%d>\n",node->serial);
free (node->text);
return;
}
int main() {
nextSerial=0;
ANTLR (expr(&root),stdin);
printf ("\n");
zzpre_ast(root,show,before,after);
printf ("\n");
zzfree_ast(root);
return 0;
}
>>
#token WhiteSpace "[\ \t]" <<zzskip();>>
#token ID "[a-zA-Z]+"
#token NEWLINE "\n"
#token OpenAngle "<"
#token CloseAngle ">"
expr : (expr0 NEWLINE)
;expr0 : expr1 {"="^ expr0}
;expr1 : expr2 ("\+"^ expr2)*
;expr2 : expr3 ("\*"^ expr3)*
;expr3 : ID
-------------------------------------------------------------------------------
Sample output from this program:
a=b=c=d
( = <#2> a <#1> ( = <#4> b <#3> ( = <#6> c <#5> d <#7> ))) NEWLINE <#8>
zzd_ast called for <node #7>
zzd_ast called for <node #5>
zzd_ast called for <node #6>
zzd_ast called for <node #3>
zzd_ast called for <node #4>
zzd_ast called for <node #1>
zzd_ast called for <node #8>
zzd_ast called for <node #2>
a+b*c
( \+ <#2> a <#1> ( \* <#4> b <#3> c <#5> )) NEWLINE <#6>
zzd_ast called for <node #5>
zzd_ast called for <node #3>
zzd_ast called for <node #4>
zzd_ast called for <node #1>
zzd_ast called for <node #6>
zzd_ast called for <node #2>
a*b+c
( \+ <#4> ( \* <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
zzd_ast called for <node #3>
zzd_ast called for <node #1>
zzd_ast called for <node #5>
zzd_ast called for <node #2>
zzd_ast called for <node #6>
zzd_ast called for <node #4>
-------------------------------------------------------------------------------
Makefile (I don't know makefiles either)
DLG_FILE = parser.dlg
ERR_FILE = err.c
HDR_FILE = stdpccts.h
TOK_FILE = tokens.h
MOD_FILE = mode.h
K = 1
CK=2
ANTLR_H = /pccts/h
BIN = .
ANTLR = /pccts/bin/antlr
DLG = /pccts/bin/dlg
GD=-gd
GS=-gs
GX=
GK=-gk
INTER=-i
CFLAGS = -I. -I$(ANTLR_H) -ansi
AFLAGS = -fe $(ERR_FILE) -fh $(HDR_FILE) -fl $(DLG_FILE) -ft $(TOK_FILE) \
-fm $(MOD_FILE) -k $(K) $(GS) -ck $(CK) -gt $(GK)
DFLAGS = -C2
.SUFFIXES :
.SUFFIXES : .g .c .o
.g:
$(ANTLR) $(AFLAGS) $(A) $*.g
make $*.o
make scan.c
make scan.o
make err.o
$(CC) -o $* $*.o scan.o err.o
scan.c : parser.dlg
$(DLG) $(DFLAGS) $(INTER) $(D) parser.dlg scan.c
===============================================================================
Example 3: Syntactic Predicates
===============================================================================
Not completed.
===============================================================================
Example 4: DLG input function
===============================================================================
This example demonstrates the use of a DLG input function to work
around a limitation of DLG. In this example the user wants to
recognize an exclamation mark as the first character of a line and
treat it differently from an exclamation mark elsewhere. The
work-around is for the input function to return a non-printing
character (binary 1) when it finds an "!" in column 1. If it reads a
genuine binary 1 in column 1 of the input text it returns a "?".
The parse is started by:
int DLGchar (void);
...
ANTLRf (expr(&root),DLGchar);
...
-------------------------------------------------------------------------------
#token BANG "!"
#token BANG_COL1 "\01"
#token WhiteSpace "[\ \t]" <<zzskip();>>
#token ID "[a-zA-Z]+"
#token NEWLINE "\n"
expr! : (bang <<printf ("\nThe ! is NOT in column 1\n");>>
| bang1 <<printf ("\nThe ! is in column 1\n");>>
| id <<printf ("\nFirst token is an ID\n");>>
)* "@"
;bang! : BANG ID NEWLINE
;bang1! : BANG_COL1 ID NEWLINE
;id! : ID NEWLINE
;
-------------------------------------------------------------------------------
#include <stdio.h>
/*
Antlr DLG input function - See page 18 of pccts 1.00 manual
*/
static int firstTime=1;
static int c;
int DLGchar (void) {
if (feof(stdin)) {
return EOF;
};
if (firstTime || c=='\n') {
firstTime=0;
c=fgetc(stdin);
if (c==EOF) return (EOF);
if (c=='!') return ('\001');
if (c=='\001') return ('?');
return (c);
} else {
c=fgetc(stdin);
return (c);
};
}
===============================================================================
Example 5: Maintaining a Stack of DLG Modes
===============================================================================
Contributed by David Seidel
-------------------------------------------------------------------------------
#define MAX_MODE ????
#define ZZMAXSTK (MAX_MODE * 2)
static int zzmstk[ZZMAXSTK] = { -1 };
static int zzmdep = 0;
void
#ifdef __STDC__
zzmpush( int m )
#else
zzmpush( m )
int m;
#endif
{
if(zzmdep == ZZMAXSTK - 1)
{ sprintf(zzebuf, "Mode stack overflow ");
zzerr(zzebuf);
}
else
{ zzmstk[zzmdep++] = zzauto;
zzmode(m);
}
}
void
zzmpop()
{
if(zzmdep == 0)
{ sprintf(zzebuf, "Mode stack underflow ");
zzerr(zzebuf);
}
else
{ zzmdep--;
zzmode(zzmstk[zzmdep]);
}
}