31 March 94
Version 1.10 of pccts

At the moment this help file is available via anonymous FTP at

        Node: marvin.ecn.purdue.edu
        File: (sometimes) /pub/pccts/1.10/newbie.info
        File: (now)       /pub/pccts/1.10/NOTES.newbie

Mail corrections or additions to moog@polhode.com
===============================================================================
General
-------------------------------------------------------------------------------
1.  Tokens begin with uppercase characters.  Rules begin with lowercase
    characters.

2.  Multiple parsers can coexist in the same application through use of the
    #parser directive.

3.  When you see a syntax error message that has quotation marks on
    separate lines:

        line 1: syntax error at "
        " missing ID

    that probably means that the offending element contains a newline.

4.  Even if your C compiler does not support C++ style comments, you can use
    them in the *non-action* portion of the ANTLR source code.  Inside an
    action (i.e. <<...>> ) you have to obey the comment conventions of your
    compiler.

5.  ANTLR counts a line which is continued across a newline using the
    backslash convention as a single line.  For example:

        #header <<
        #define abcd alpha\
                     beta\
                     gamma\
                     delta
        >>

    will cause line numbers in ANTLR error messages to be off by 3.

6.  Don't confuse #[...] with #(...).

    The first creates a single AST node (usually from a token identifier and
    an attribute) using the routine zzmk_ast().  The zzmk_ast routine must be
    supplied by the user (or selected from one of the pccts supplied ones
    such as charbuf or charptr).

    The second creates an AST list (usually more than a single node) from
    other ASTs by filling in the "down" field of the first node in the list
    to create a root node, and the "sibling" fields of each of the remaining
    ASTs in the list.  A null pointer is put in the sibling field of the last
    AST in the list.  This is performed by the pccts supplied routine
    zztmake().

        #token ID       "[a-z]*"
        #token COLON    ":"
        #token STMT_WITH_LABEL

        id!
                : ID <<#0=#[STMT_WITH_LABEL,$1];>>

    Creates an AST.  The AST (a single node) contains STMT_WITH_LABEL in the
    token field - given a traditional version of zzmk_ast().

        rule!   : id COLON expr <<#0=#(#1,#3);>>

    Creates an AST list with the ID at its root and "expr" as its first (and
    only) child.

    The following example is equivalent, but more confusing because the two
    steps above have been combined into a single action statement:

        rule!   : ID COLON expr <<#0=#(#[STMT_WITH_LABEL,$1],#3);>>
===============================================================================
Section on switches and options
-------------------------------------------------------------------------------
7.  Don't forget about the ANTLR -gd option which provides a trace of rules
    which are triggered and exited.

8.  When you want to inspect the code generated by ANTLR you may want to use
    the ANTLR -gs switch.  This causes ANTLR to test for a token being an
    element of a lookahead set by using explicit tests rather than by using
    the faster bit oriented operations which are difficult to read.

9.  When using the ANTLR -gk option you probably want to use the DLG -i
    option.  As far as I can tell neither option works by itself.
    Unfortunately they have different abbreviations so that one can't use the
    same symbol for both in a makefile.

10. When you are debugging code in the rule section and there is no change to
    the lexical scanner, you can avoid regeneration of scanner.c by using the
    ANTLR -gx option.  However some items from stdpccts.h can affect the
    scanner, such as -k -ck and the addition of semantic predicates - so this
    optimization should be used with a little care.

11. One cannot use an interactive scanner (ANTLR -gk option) with the ANTLR
    infinite lookahead and backtracking options (syntactic predicates).

12.
    If you want backtracking, but not the prefetching of characters and
    tokens that one gets with lookahead, then you might want to try using
    your own input routine and then using ANTLRs (input supplied by string)
    or ANTLRf (input supplied by function) rather than plain ANTLR which is
    used in most of the examples.

    See Example 4 below for an example of an ANTLRf input function.

13. The format used in the #line directive is controlled by the macro

        #define LineInfoFormatStr "# %d \"%s\"\n"

    which is defined in generic.h.  A change requires recompilation of ANTLR.

14. To make the lexical scanner case insensitive use the DLG -ci switch.  The
    analyzer does not change the text, it just ignores case when matching it
    against the regular expressions.

    As of 24-Feb-94 there was an off-by-one problem in the testing for
    lowercase in range expressions.  The problem appears if the last
    character in the range is "z" or "Z" (e.g. [a-z]).  The temporary
    workaround is to use a range of the form "[a-yz]".

15. The lexical routines zzmode(), zzskip(), and zzmore() do NOT work like
    coroutines.  Basically, all they do is set status bits or fields in a
    structure owned by the lexical analyzer and then return immediately.
    Thus it is OK to call these routines anywhere from within a lexical
    action.  You can even call them from within a subroutine called from a
    lexical action routine.

    See Example 5 below for routines which maintain a stack of modes.
===============================================================================
Section on #token and lexical issues
-------------------------------------------------------------------------------
16. To gobble up everything to a newline use: "~[\n]*".

17. To match any single character use: "~[]".

18. If a #token symbol is spelled incorrectly in a rule it will not be
    reported by ANTLR.  ANTLR will assign it a new #token number which, of
    course, will never be matched.  Look at token.h for misspelled terminals
    or inspect "zztokens[]" in err.c.

19.
    If you happen to define the same #token name twice (for instance because
    of inadvertent duplication of a line) you will receive no error message
    from ANTLR or DLG.  ANTLR will simply leave a hole in the assignment of
    token numbers, and you will find strange numbers in the test of LA(1) and
    LA(2) rather than the expected token names.  Look at token.h or inspect
    "zztokens[]" in err.c.

20. One cannot continue a regular expression in a #token statement across
    lines.  If one tries to use "\" to continue the line the lexical analyzer
    will think you are trying to match a newline character.

21. The escaped literals in #token regular expressions are not identical to
    the ANSI escape sequences.  For instance "\v" will yield a match for "v",
    not a vertical tab.

        \t \n \r \b     - the only escaped letters

22. In #token regular expressions spaces and tabs which are not escaped are
    ignored - thus making it easy to add white space to a regular expression.

        #token symbol "[a-z A-Z] [a-z A-Z 0-9]*"

23. You cannot supply an action (even a null action) for a #token statement
    without a regular expression.  You'll receive the message:

        warning: action cannot be attached to a token name
                (...token name...); ignored

    This is a minor problem when the #token is created for use with
    attributes or AST nodes and has no regular expression:

        #token CAST_EXPR
        #token SUBSCRIPT_EXPR
        #token ARGUMENT_LIST

        <<
            ... Code related to parsing
        >>

    ANTLR assumes the code block is the action associated with the #token
    immediately preceding it.  It is not obvious what the problem is because
    the line number referenced is the end of the code block (">>") rather
    than the beginning.  My solution is to follow such #token statements with
    a #token which does have a regular expression or a rule.

24. The #token statement allows the action to appear on a second line.  This
    sometimes leads to the misinterpretation of a code block which follows:

        #token RETURN "return"

        <<
            ... some code related to parsing ...
        >>

    ANTLR will interpret the code block as being the action of the #token.
    The workaround is to put in an intervening #token statement or a null
    action:

        #token RETURN "return" <<;>>

25. Since the lexical analyzer wants to find the longest possible string that
    matches a regular expression, it is probably best not to use expressions
    like "~[]*" which will gobble up everything to the end-of-file.

26. When a string is matched by two #token regular expressions, the lexical
    analyzer will choose the one which appears first in the source code.
    Thus more specific regular expressions should appear before more general
    ones:

        #token HELP     "help"        /* should appear before "symbol" */
        #token symbol   "[a-zA-Z]*"   /* should appear after keywords  */

    Another example of this is defining hexadecimal characters before decimal
    characters.  Some of these may be caught by using the DLG switch
    -Wambiguity.

27. zzbegexpr and zzendexpr point to the start and end of the string last
    matched by a regular expression in a #token statement.

    However, zzlextext may be larger than the string pointed to by zzbegexpr
    and zzendexpr because it includes substrings accumulated through the use
    of zzmore().

28. ZZCOL in the lexical scanner controls the update of column information.
    This doesn't cause the zzsyn routine to report the position of tokens
    causing the error.  You'll still have to write that yourself.  The
    problem, I think, is that, due to look-ahead, the value of zzendcol will
    not be synchronized with the token causing the error, so that the problem
    becomes non-trivial.

29. If you want to use ZZCOL to keep track of the column position remember to
    adjust zzendcol in the lexical action when a character is not one print
    position wide (e.g. tabs or non-printing characters).

30. In version 1.00 it was common to change the token code based on semantic
    routines in the #token actions.  With the addition of semantic predicates
    in 1.06 this technique is now frowned upon.
    Old style:

        #token TypedefName
        #token ID "[a-z A-Z]*"
            <<{if (isTypedefName(LATEXT(1))) NLA=TypedefName;};>>

    New Style:

        #token ID "[a-z A-Z]*"

        typedefName : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1>>? ID;

    See the section on semantic predicates for a longer explanation.

31. The macro

        #define ZZCHAR_T <character-type>

    allows the user to specify the type of character constants generated by
    ANTLR and DLG.  However the algorithms in DLG need to be revised in order
    to be 8 bit clean or support wide characters.

32. DLG has no operator like grep's "^" which anchors a pattern to the
    beginning of a line.  One can use tests based on zzbegcol only if column
    information is selected (#define ZZCOL) AND one is not using infinite
    lookahead mode (syntactic predicates).  A technique which does not depend
    on zzbegcol is to look for the newline character and then enter a special
    #lexclass.

    Consider the problem of recognizing lines which have a "!" as the first
    character of a line.  A possible solution suggested by Doug Cuthbertson
    is:

        #token "\n"     <<zzline++; zzmode(BEGIN_LINE);>>

            *** or ***

        #token "\n"     <<zzline++;
                          if (zzchar=='!') zzmode(BEGIN_LINE);>>

        #lexclass BEGIN_LINE
        #token BANG "!" <<zzmode(START);>>
        #token "~[]"    <<zzmode(START); zzmore();>>

    When a newline is encountered the #lexclass BEGIN_LINE is entered.  If
    the next character is a "!" it returns the token "BANG" and returns to
    #lexclass START.  If the next character is anything else it calls zzmore
    to accumulate additional characters for the token and, as before, returns
    to #lexclass START.  (The order of calls to zzmode() and zzmore() is not
    significant).

    There are two limitations to this.

    a. If there are other single character tokens which can appear in the
       first column then zzmore() won't work because the entire token has
       already been consumed.  Thus all single character tokens which can
       appear in column 1 must appear in both #lexclass START and #lexclass
       BEGIN_LINE.

    b.
       The first character of the first line is not preceded by a newline.
       Thus DLG will be starting in the wrong state.  Thus you might want to
       rename "BEGIN_LINE" to "START" and "START" to "NORMAL".

    Another solution is to use ANTLRf (input from a function) to insert your
    own function to do a limited amount of lexical processing which is
    difficult to express in DLG.

    See Example 4 below.
===============================================================================
Section on #lexclass
-------------------------------------------------------------------------------
33. Special care should be taken when using "in-line" regular expressions in
    rules if there are multiple lexical classes (#lexclass).  ANTLR will
    place such regular expressions in the last lexical class defined.  If the
    last lexical class was not START you may be surprised.

        #lexclass START
        ....
        #lexclass COMMENT
        ....

        inline_example: symbol "=" expression

    This will place "=" in the #lexclass COMMENT (where it will never be
    matched) rather than the START #lexclass where the user meant it to be.

    Since it is okay to specify parts of the #lexclass in several pieces it
    might be a good idea when using #lexclass to place "#lexclass START" just
    before the first rule - then any in-line definitions of tokens will be
    placed in the START #lexclass automatically.

        #lexclass START
        ...
        #lexclass A
        ...
        #lexclass B
        ...
        #lexclass START

34. A good example of the use of #lexclass are the definitions for C and C++
    style comments, character literals, and string literals which can be
    found in pccts/antlr/lang/C/decl.g - or see Example 1 below.
===============================================================================
Section on rules
-------------------------------------------------------------------------------
35. If a rule is not used (is an orphan) it can lead to unanticipated reports
    of ambiguity.  Use the ANTLR cross-reference option (-cr) to locate rules
    which are not referenced.

36.
    There is no way to express the idea "Any single token is acceptable at
    this point in the parse".  In other words, there is no syntactic
    equivalent to the lexical regular expression "~[]".  There are plans to
    add such a capability (".") to version 1.20 or 2.00 of ANTLR.

37. Don't confuse init-actions with actions which precede a rule.  If the
    first element following the start of a rule or sub-rule is an action it
    is always interpreted as an init-action.

    An init-action occurs in a scope which includes the entire rule or
    sub-rule.  An action which is NOT an init-action is enclosed in "{" and
    "}" during generation of code for the rule and has essentially zero scope
    - the action itself.

    The difference between an init-action and an action which precedes a rule
    can be especially confusing when an action appears at the start of an
    alternative:

    These APPEAR to be almost identical, but they aren't:

        b : <<int i=0;>> b1 > [i]   /* b1 <<...>> is an init-action      */
          | <<int j=0;>> b2 > [j]   /* b2 <<...>> is part of the rule    */
          ;                         /* and will cause a compilation error */

    On line "b1" the <<...>> appears immediately after the beginning of the
    rule making it an init-action.  On line "b2" the <<...>> does NOT appear
    at the start of a rule or sub-rule, thus it is interpreted as an action
    which happens to precede the rule.

    This can be especially dangerous if you are in the habit of rearranging
    the order of alternatives in a rule.  For instance:

    Changing this:

        b : <<int i=0,j=0;>> <<i++;>> b1 > [i]      /* c1 */
          | <<j++;>> b1 > [i]                       /* c2 */
          ;

    to:

        b : /* empty production */                  /* d1 */
          | <<int i=0,j=0;>> <<i++;>> b1 > [i]      /* d2 */
          | <<j++;>> b1 > [i]
          ;

    or to this:

        b : <<j++;>> b1 > [i]                       /* e1 */
          | <<int i=0,j=0;>> <<i++;>> b1 > [i]      /* e2 */

    changes an init-action into a non-init action, and vice-versa.

38. In the case of sub-rules such as (...)+ and (...)* the init-action is
    executed just once before the sub-rule is entered.
    Consider the following example from section 3.6.1 (page 29) of the 1.00
    manual:

        a : <<List *p=NULL;>>               // initialize list
            Type
            ( <<int i=0;>>                  // initialize index
              Var <<append(p,i++,$1);>>
            )*
            <<OperateOn(p);>>
          ;

39. Associativity and precedence of operations is determined by nesting of
    rules.  In the example below "=" associates to the right and has the
    lowest precedence.  Operators "+" and "*" associate to the left with "*"
    having the highest precedence.

        expr0   : expr1 {"=" expr0};
        expr1   : expr2 ("\+" expr2)*;
        expr2   : expr3 ("\*" expr3)*;
        expr3   : ID;

    See Example 2.

40. Fail actions for a rule can be placed after the final ";" of a rule.
    These will be:

        "executed after a syntax error is detected but before a message is
        printed and the attributes have been destroyed.  However, attributes
        are not valid here because one does not know at what point the error
        occurred and which attributes even exist.  Fail actions are often
        useful for cleaning up data structures or freeing memory."

        (Page 29 of 1.00 manual)

    Example of a fail action:

        a : <<List *p=NULL;>>
            ( Var <<append(p,$1);>> )+
            <<operateOn(p);rmlist(p);>>
          ; <<rmlist(p);>>
            **************  <--- Fail Action

41. When you have rules with large amounts of lookahead (that may cross
    several lines) you can use the ANTLR -gk option to make an
    ANTLR-generated parser delay lookahead fetches until absolutely
    necessary.  To get better line number information (e.g. for error
    messages or #line directives) place an action which will save "zzline" in
    a variable at the start of the production where you want better line
    number information:

        a : <<int saveCurrentLine;>>
            <<saveCurrentLine = zzline;>> A B C
                << /* use saveCurrentLine not zzline here */ >>
          | <<saveCurrentLine = zzline;>> A B D
                << /* use saveCurrentLine not zzline here */ >>
          ;

    After the production has been matched you can use saveCurrentLine rather
    than the bogus "zzline".
    Contributed by Terence "The ANTLR Guy" Parr (parrt@acm.org)

    A new macro ZZINF_LINE() has been added to extract line information that
    is saved in a manner similar to LATEXT.  This patch was added to the
    pccts FTP site on 27-Feb-94.

42. An easy way to get a list of the names of all the rules is to grep
    tokens.h for the string "void", or to edit the output from the ANTLR -cr
    option (cross-reference).
===============================================================================
Section on Attributes
-------------------------------------------------------------------------------
43. Attributes are built automatically only for terminals.  For rules
    (non-terminals) one must assign an attribute to $0, use the $[token,...]
    convention for creating attributes, or use zzcr_attr.

44. If you define a zzcr_attr or zzmk_attr which allocates resources such as
    strings from the heap don't forget to define a zzd_attr routine to
    release the resources when the attribute is deleted.

45. Attributes go out of scope when the rule or sub-rule that defines them is
    exited.  Don't try to pass them to an outer rule or a sibling rule.  The
    only exception is $0 which may be passed back to the containing rule as a
    return argument.  However, if the attribute contains a pointer which is
    copied (e.g. charptr.c) then extra caution is required because of the
    actions of zzd_attr().  See the next item for more information.

46. The charptr.c routines use a pointer to a string.  The string itself will
    go out of scope when the rule or sub-rule is exited.  Why?  The string is
    copied to the heap when ANTLR calls the routine zzcr_attr supplied by
    charptr.c - however ANTLR also calls the charptr.c supplied routine
    zzd_attr (which frees the allocated string) as soon as the rule or
    sub-rule exits.  The result is that in order to pass charptr.c strings to
    outer rules (for instance to $0) it is necessary to make an independent
    copy of the string using strdup or else zero the pointer to prevent its
    deallocation.

47.
    To initialize $0 of a sub-rule use a construct like the following:

        decl : typeID
               Var          <<$2.type = $1;>>
               ( "," Var    <<$2.type = $0;>>)*[$1]
                                               ****

    See section 4.1.6.1 (page 29) of the 1.00 manual

48. One can use the zzdef0 macro to define a standard method for initializing
    $0 of a rule or sub-rule.  If the macro is defined it is invoked as
    zzdef0(&($0)).

    See section 4.1.6.1 (page 29) of the 1.00 manual

49. The expression $$ refers to the attribute of the named rule.  The
    expression $0 refers to the attribute of the enclosing rule (which might
    be a sub-rule).

        rule : a b (c d (e f g) h) i

    For (e f g) $0 becomes $3 of (c d ... h).  For (c d ... h) $0 becomes $3
    of (a b ... i).  However $$ always is equivalent to $rule.

50. If you construct temporary attributes in the middle of the recognition of
    a rule, remember to deallocate the structure should the rule fail.  The
    code for failure goes after the ";" and before the next rule.  For this
    reason it is sometimes desirable to defer some processing until the rule
    is recognized rather than the most appropriate place.
===============================================================================
Section on ASTs
-------------------------------------------------------------------------------
51. If you define a zzcr_ast or zzmk_ast which allocates resources such as
    strings from the heap don't forget to define a zzd_ast routine to release
    the resources when the AST is deleted.

52. If you construct temporary ASTs in the middle of the recognition of a
    rule, remember to deallocate the structure should the rule fail.  The
    code for failure goes after the ";" and before the next rule.  For this
    reason it is sometimes desirable to defer some processing until the rule
    is recognized rather than the most appropriate place.

53.
    If you want to place prototypes for routines that have an AST as an
    argument in the #header directive you should explicitly #include "ast.h"
    after the #define AST_FIELDS and before any references to AST:

        #define AST_FIELDS int token;char *text;
        #include "ast.h"
        #define zzcr_ast(ast,attr,tok,astText) \
                create_ast(ast,attr,tok,astText)
        void create_ast (AST *ast,Attr *attr,int tok,char *text);

54. The make-a-root operator for ASTs ("^") can be applied only to terminals.
    I think this is because a child rule might return a tree rather than a
    single AST.  If it did then it could not be made into a root as it is
    already a root and the corresponding fields of the structure are already
    in use.  To make an AST returned by a called rule a root use the
    expression: #(root-rule sibling1 sibling2 sibling3).

        add !           : expr ("\+"^ expr) ;   // Is ok
        addOperator !   : expr (AddOp expr)     // Is NOT ok
        addOp           : "\+" | "-";           //

    Example 2 describes a workaround for this restriction.

55. Do not assign to #0 of a rule unless automatic construction of ASTs has
    been disabled using the "!" operator:

        a!  : x y z <<#0=#(#1 #2 #3);>>     // ok
        a   : x y z <<#0=#(#1 #2 #3);>>     // NOT ok

    The reason for the restriction is that assignment to #0 will cause any
    ASTs pointed to by #0 to be lost when the pointer is overwritten.  (Three
    cheers for C++ copy constructors).

    The stated restriction is somewhat stronger than necessary.  You can
    assign to #0 even when using automated AST construction, if the old tree
    pointed to by #0 is part of the new tree constructed by #(...).  For
    example:

        #token COMMA    ","
        #token STMT_LIST

        stmt_list: stmt (COMMA stmt)* <<#0=#(#[STMT_LIST],#0);>>

    The automatically constructed tree pointed to by #0 is just put at the
    end of the new list, so nothing is lost.

    However, if you reassign to #0 in the middle of the rule, automatic tree
    construction will result in the addition of remaining elements at the end
    of the new tree.  This is not recommended by TJP.

56.
    If you use ASTs you have to pass a root AST to ANTLR.

        AST *root=NULL;
    again:
        ANTLR (start(&root),stdin);
        walk_the_tree(root);
        zzfree_ast(root);
        root=NULL;
        goto again;

57. zzfree_ast(AST *tree) will recursively descend the AST tree and free all
    sub-trees.  The user should supply a routine zzd_ast to free any
    resources used by a single node - such as pointers to character strings
    allocated on the heap.  See Example 2 on associativity and precedence.

58. AST elements in rules are assigned numbers in the same fashion as
    attributes with three exceptions:

    1. A hole is left in the sequence when sub-rules are encountered.
    2. #0 is the AST of the named rule, not the sub-rule - see the next item
    3. There is nothing analogous to $i.j notation (which allows one to refer
       to attributes from earlier in the rule).  In other words, you can't
       use #i.j notation to refer to an AST created earlier in the rule.

       There are plans to add notation similar to Sorcerer's tagged elements,
       but this will not appear for a while.  (24-Mar-94)

    Consider the following example:

        a : b           // B is #1 for the rule
            (c d)*      // C is #1 when scope is inside the sub-rule
                        // D is #2 when scope is inside the sub-rule
                        // You may *NOT* refer to b as #1.1
            e           // E is #3 for the rule
                        // There is NO #2 for the rule

59. The expression #0 refers to the AST of the named rule.  Thus it is a
    misnomer and (for consistency) should probably have been named ## or #$.
    There is nothing equivalent to $0 for ASTs.  This is probably because
    sub-rules aren't assigned AST numbers in a rule.  See above.

60. Associativity and precedence of operations is determined by nesting of
    rules.  In the example below "=" associates to the right and has the
    lowest precedence.  Operators "+" and "*" associate to the left with "*"
    having the highest precedence.

        expr0   : expr1 {"=" expr0};
        expr1   : expr2 ("\+" expr2)*;
        expr2   : expr3 ("\*" expr3)*;
        expr3   : ID;

    In Example 2 the zzpre_ast routine is used to walk all the AST nodes.
    The AST nodes are numbered during creation so that one can see the order
    in which they are created and the order in which they are deleted.  Do
    not confuse the "#" in the sample output with the AST numbers used to
    refer to elements of a rule in the action part of the rule.  The "#" in
    the sample output are just to make it simpler to match elements of the
    expression tree with the order in which zzd_ast is called for each node
    in the tree.

61. If the make-as-root operator were not used in the rules:

        ;expr0  : expr1 {"=" expr0}
        ;expr1  : expr2 ("\+" expr2)*
        ;expr2  : expr3 ("\*" expr3)*
        ;expr3  : ID

    With input:

        a+b*c

    The output would be:

        a <#1> \+ <#2> b <#3> \* <#4> c <#5> NEWLINE <#6>

        zzd_ast called for <node #6>
        zzd_ast called for <node #5>
        zzd_ast called for <node #4>
        zzd_ast called for <node #3>
        zzd_ast called for <node #2>
        zzd_ast called for <node #1>

62. Suppose that one wanted to replace the terminal "+" with the rule:

        addOp   : "\+" | "-" ;

    Then one would be unable to use the "make-as-root" operator because it
    can be applied only to terminals.

    This is one possible workaround:

        expr    : (expr0 NEWLINE)
        ;expr0  : expr1 {"="^ expr0}
        ;expr1! : expr2 <<#0=#1;>>
                  (addOp expr2 <<#0=#(#1,#0,#2);>> )*
        ;expr2  : expr3 ("\*"^ expr3)*
        ;expr3  : ID
        ;addOp  : "\+" | "\-"

    With input:

        a-b-c

    The output is:

        ( \- <#4>
            ( \- <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>

    The "!" for rule "expr1" disables automatic construction of ASTs in the
    rule.  This allows one to manipulate #0 manually.  If the expression had
    no addition operator then the sub-rule "(addOp expr)*" would not be
    executed and #0 will be assigned the AST constructed by rule expr2 (i.e.
    AST #1).  However if there is an addOp present then each time the
    sub-rule is rescanned due to the "(...)*" the current tree in #0 is
    placed as the first of two siblings underneath a new tree.  This new tree
    has the AST returned by addOp (AST #1 of the addOp sub-rule) as the root.

63. There is an option for doubly linked ASTs in the module ast.c.
    It is controlled by #define zzAST_DOUBLE.  See page 12 of the 1.06 manual
    for more information.

64. If a rule which creates an AST is called and the result is not linked
    into the tree being constructed then zzd_ast will not be called to
    release the resources used by the rule.  This is important when rules are
    used in syntactic predicates.  The following construct is dangerous
    because the ASTs created when "b_predicate" calls "b" will be lost unless
    they are explicitly deallocated by a call to zzfree_ast().

        a               : (b_predicate) ? b
        b_predicate !   : b !
        b               : c d e ;
===============================================================================
Section on Semantic Predicates
-------------------------------------------------------------------------------
65. There is a bug in 1.10 which prevents semantic predicates from including
    an unescaped newline.  The predicate is incorrectly "string-ized" in the
    call to zzfailed_predicate.

        rule:   <<do_test();
                  this_will_not_work()>>? x y z;
        rule:   <<do_test();\
                  this_will_work()>>? x y z;

66. Semantic predicates are enclosed in "<<... >>?" but because they are
    inside "if" statements they normally do not end with a ";" - unlike other
    code enclosed in "<<...>>" in ANTLR.

67. Semantic predicates which are not the first element in the rule or
    sub-rule become "validation predicates" and are not used for prediction.
    After all, if there are no alternatives, then there is no need for
    prediction - and alternatives exist only at the left edge of rules and
    sub-rules.  Even if a semantic predicate is on the left edge there is no
    guarantee that it will be part of the prediction expression.  Consider
    the following two examples:

        a : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob   /* a1 */
          | ID glob                                           /* a2 */
          ;
        b : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob   /* b1 */
          | NUMBER glob                                       /* b2 */
          ;

    Rule a requires the semantic predicate to disambiguate alternatives a1
    and a2 because the rules are otherwise identical.
    Rule b has a token type of NUMBER in alternative b2 so it can be
    distinguished from b1 without evaluation of the semantic predicate during
    prediction.  In both cases the semantic predicate will be evaluated
    inside the rule.

    When the tokens which can follow a rule allow ANTLR to disambiguate the
    expression without resort to semantic predicates ANTLR may not evaluate
    the semantic predicate in the prediction code.  For example:

        simple_func     : <<LA(1)==ID ? isSimpleFunc(LATEXT(1)) : 1>>? ID
        complex_func    : <<LA(1)==ID ? isComplexFunc(LATEXT(1)) : 1>>? ID
        function_call   : "(" ")"
        func            : simple_func function_call
                        | complex_func "." ID function_call

    In this case, a "simple_func" MUST be followed by a "(", and a
    "complex_func" MUST be followed by a ".", so it is unnecessary to
    evaluate the semantic predicates in order to predict which of the
    alternatives to use.  A simple test of the lookahead tokens is
    sufficient.  As stated before, the semantic predicates will still be used
    to validate the rule.

68. Suppose that the requirement that all semantic predicates used in
    prediction expressions appear at the left edge of a production were
    lifted.  Consider the following code segment:

        cast_expr                                            /* a1 */
            : LP typedef RP cast_expr                        /* a2 */
            | expr13                                         /* a3 */
        ;expr13                                              /* a4 */
            : id_name                                        /* a5 */
            | LP cast_expr RP                                /* a6 */
        ;typedef_name                                        /* a7 */
            : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID
                                                             /* a8 */
        ;id_name                                             /* a9 */
            : ID                                             /* a10 */

    Now consider the token sequences:

        Token:  #1   #2                       #3   #4
                --   -----------------------  --   --
                "("  ID-which-is-typedef      ")"  ID
                "("  ID-which-is-NOT-typedef  ")"

    Were the semantic predicate at line a8 hoisted to predict which
    alternative of cast_expr to use (a2 or a3) the program would use the
    wrong lookahead token (LA(1) and LATEXT(1)) rather than LA(2) and
    LATEXT(2) to check for an ID which satisfies "isTypedefName()".
    This problem could perhaps be solved by application of sufficient
    ingenuity, however, in the meantime the solution is to rewrite the rules
    so as to move the decision point to the left edge of the production.

    First perform in-line expansion of expr13 in cast_expr:

        cast_expr                                            /* b1 */
            : LP typedef RP cast_expr                        /* b2 */
            | id_name                                        /* b3 */
            | LP cast_expr RP                                /* b4 */

    Secondly, move the alternatives (in cast_expr) beginning with LP to a
    separate rule so that "typedef" and "cast_expr" will be on the left edge:

        cast_expr                                            /* c1 */
            : LP cast_expr_suffix                            /* c2 */
            | id_name                                        /* c3 */
        ;cast_expr_suffix                                    /* c4 */
            : typedef RP cast_expr                           /* c5 */
            | cast_expr RP                                   /* c6 */
        ;typedef_name                                        /* c7 */
            : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID
                                                             /* c8 */
        ;id_name                                             /* c9 */
            : ID                                             /* c10 */

    This will result in the desired treatment of the semantic predicate to
    choose from alternatives c5 and c6.

69. Validation predicates are evaluated by the parser.  If they fail a call
    to zzfailed_predicate(string) is made.  To disable the message redefine
    the macro zzfailed_predicate(string) or use the optional "failed
    predicate" action which is enclosed in "[" and "]" and follows
    immediately after the predicate:

        a : <<LA(1)==ID ? isTypedef(LATEXT(1)) : 1>>?[printf("Not a typedef\n");]

70. An expression in a semantic predicate (e.g. <<isFunc()>>? ) should not
    have side-effects.  If there is no match then the rest of the rule using
    the semantic predicate won't be executed.

71. A documented restriction of ANTLR is the inability to hoist multiple
    semantic predicates.  However, no error message is given when one
    attempts this.  When compiled with k=1 and ck=2 this generates
    inappropriate code in "statement" when attempting to predict "expr":

        statement       : expr
                        | declaration
        ;expr           : commandName BARK
                        | typedefName GROWL
        ;declaration    : typedefName BARK
        ;typedefName    : <<LA(1)==ID ? istypedefName(LATEXT(1)) : 1>>? ID
        ;commandName    : <<LA(1)==ID ? isCommand(LATEXT(1)) : 1>>?
                  ID
        ;

     Some help is obtained by using leading actions to inhibit hoisting
     as described in the next note.

72.  Leading actions will inhibit the hoisting of semantic predicates
     into the prediction of rules.

        expr_rhs
                : <<;>> <<>> expr0
                | command

     See the section on known bugs for a more complete example.

73.  When using semantic predicates in ANTLR it is *IMPORTANT* to
     understand what the "-prc on" ("predicate context computation")
     option does and what "-prc off" doesn't do.  Consider the following
     example:

        +------------------------------------------------------+
        | Note:  All examples in this sub-section are based on |
        |        code generated with -k=1 and -ck=1.           |
        +------------------------------------------------------+

        expr    : upper
                | lower
                | number
                ;
        upper   : <<isU(LATEXT(1))>>? ID ;
        lower   : <<isL(LATEXT(1))>>? ID ;
        number  : NUMBER ;

     With "-prc on" ("-prc off" is the default) the code for expr() to
     predict upper() would resemble:

        if (LA(1)==ID && isU(LATEXT(1)) && LA(1)==ID) {         /* a1  */
                upper(zzSTR);                                   /* a2  */
        }                                                       /* a3  */
        else {                                                  /* a4  */
                if (LA(1)==ID && isL(LATEXT(1)) && LA(1)==ID) { /* a5  */
                        lower(zzSTR);                           /* a6  */
                }                                               /* a7  */
                else {                                          /* a8  */
                        if (LA(1)==NUMBER) {                    /* a9  */
                                zzmatch(NUMBER);                /* a10 */
                        }                                       /* a11 */
                        else                                    /* a12 */
                                {zzFAIL();goto fail;}           /* a13 */
                }                                               /* a14 */
        }
        ...
        ...

        *******************************************************
        ***                                                 ***
        *** Starting with version 1.20:                     ***
        ***   Predicate tests appear AFTER lookahead tests  ***
        ***                                                 ***
        *******************************************************

     Note that each test of LATEXT(i) is guarded by a test of the token
     type (e.g. "LA(1)==ID && isU(LATEXT(1))").

     With "-prc off" the code would resemble:

        if (isU(LATEXT(1)) && LA(1)==ID) {                      /* b1  */
                upper(zzSTR);                                   /* b2  */
        }                                                       /* b3  */
        else {                                                  /* b4  */
                if (isL(LATEXT(1)) && LA(1)==ID) {              /* b5  */
                        lower(zzSTR);                           /* b6  */
                }                                               /* b7  */
                else {                                          /* b8  */
                        if ( (LA(1)==NUMBER) ) {                /* b9  */
                                zzmatch(NUMBER);                /* b10 */
                        }                                       /* b11 */
                        else                                    /* b12 */
                                {zzFAIL();goto fail;}           /* b13 */
                }                                               /* b14 */
        }
        ...
        ...
     Thus when coding the grammar for use with "-prc off" it is necessary
     to do something like:

        upper   : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
        lower   : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;

     This will make sure that if the token is of type NUMBER that it is
     not passed to isU() or isL() when using "-prc off".

     So, you say to yourself, "-prc on" is good and "-prc off" is bad.
     Wrong.

     Consider the following slightly more complicated example in which
     the first alternative of rule "expr" contains tokens of two
     different types:

        expr    : ( upper | NUMBER ) NUMBER
                | lower
                | ID
                ;
        upper   : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
        lower   : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
        number  : NUMBER ;

     With "-prc off" the code would resemble:

        ...
        {                                                       /* c1  */
          if (LA(1)==ID && isU(LATEXT(1)) &&                    /* c2  */
                ( LA(1)==ID || LA(1)==NUMBER) ) {               /* c3  */
            {                                                   /* c4  */
              if (LA(1)==ID) {                                  /* c5  */
                upper(zzSTR);                                   /* c6  */
              }                                                 /* c7  */
              else {                                            /* c8  */
                if (LA(1)==NUMBER) {                            /* c9  */
                  zzmatch(NUMBER);                              /* c10 */
                }                                               /* c11 */
                else {zzFAIL();goto fail;}                      /* c12 */
              }                                                 /* c13 */
            }
        ...
        ...

     Note that if the token is a NUMBER (i.e. LA(1)==NUMBER) then the
     clause at line c2 ("LA(1)==ID && ...") will always be false, which
     implies that the test in the "if" statement (lines c2/c3) will
     always be false.  (In other words LA(1)==NUMBER implies LA(1)!=ID).
     Thus the sub-rule for NUMBER at line c9 can never be reached.

     With "-prc on" essentially the same code is generated, although it
     is not necessary to manually code a test for token type ID preceding
     the call to "isU()".

     The workaround is to bypass the heart of the predicate when testing
     the wrong type of token.

        upper   : <<LA(1)==ID ? isU(LATEXT(1)) : 1>>? ID ;
        lower   : <<LA(1)==ID ? isL(LATEXT(1)) : 1>>? ID ;

     Then with "-prc off" the code would resemble:

        ...
        {                                                       /* d1 */
          if ( (LA(1)==ID ? isU(LATEXT(1)) : 1) &&              /* d2 */
                (LA(1)==ID || LA(1)==NUMBER) ) {                /* d3 */
        ...
        ...
     With this correction the body of the "if" statement is now reachable
     even if the token type is NUMBER - the "if" statement does what one
     wants.

     With "-prc on" the code would resemble:

        ...                                                     /* e1 */
          if (LA(1)==ID &&                                      /* e2 */
                (LA(1)==ID ? isU(LATEXT(1)) : 1) &&             /* e3 */
                (LA(1)==ID || LA(1)==NUMBER) ) {                /* e4 */
        ...
        ...

     Note that the problem of the unreachable "if" statement body has
     reappeared because of the redundant test ("e2") added by the
     predicate computation.

     The lesson seems to be: when using rules which have alternatives
     which are "visible" to ANTLR (within the lookahead distance) that
     have different token types it is probably dangerous to use
     "-prc on".

74.  You cannot use downward inheritance to pass parameters to semantic
     predicates which are NOT validation predicates.  The problem appears
     when the semantic predicate is hoisted into a parent rule to predict
     which rule to call.  For instance:

        a : b1 [flag]
          | b2
          | b3

        b1 [int flag]
          : <<LA(1)==ID && flag && hasPropertyABC (LATEXT(1))>>? ID ;

        b2
          : <<LA(1)==ID && hasPropertyXYZ (LATEXT(1))>>? ID ;

        b3 : ID ;

     When the semantic predicate is evaluated within rule "a" to
     determine whether to call b1, b2, or b3 the compiler will discover
     that there is no variable named "flag" for procedure "a()".  If you
     are unlucky enough to have a variable named "flag" in a() then you
     will have a VERY difficult-to-find bug.

     The -prc option has no effect on this behavior.

75.  Another reason why semantic predicates must not have side effects is
     that when they are hoisted into a parent rule in order to decide
     which rule to call they will be invoked twice: once as part of the
     prediction and a second time as part of the validation of the rule.

     Consider the example above of upper and lower.  When the input does
     in fact match "upper" the routine isU() will be called twice: once
     inside expr() to help predict which rule to call, and a second time
     in upper() to validate the prediction.  If the second test fails the
     macro zzpred_fail() is called.
     As far as I can tell, there is no simple way to disable the use of a
     semantic predicate for validation after it has been used for
     prediction.

===============================================================================
         Section on Syntactic Predicates (also known as "Guess Mode")
-------------------------------------------------------------------------------
76.  The terms "infinite lookahead", "guess mode", and "syntactic
     predicates" are all equivalent.  The term "syntactic predicate"
     emphasizes that it is handled by the parser.  The term "guess mode"
     emphasizes that the parser may have to backtrack.  The term
     "infinite lookahead" emphasizes the implementation in ANTLR: the
     entire input is read, processed, and tokenized by DLG before ANTLR
     begins parsing.

77.  An expression in a syntactic predicate should not have side-effects.
     If there is no match then the rule using the syntactic predicate
     won't be executed.

78.  When using syntactic predicates the entire input buffer is read and
     tokenized by DLG before parsing by ANTLR begins.  If a "wrong" guess
     requires that parsing be rewound to an earlier point, all attributes
     that were created during the "guess" are destroyed; parsing then
     begins again and creates new attributes as it reparses the
     (previously) tokenized input.

79.  In infinite lookahead mode the line and column information is
     hopelessly out-of-sync because zzline will contain the line number
     of the last line of input - the entire input was scanned before
     parsing was begun.  The line and column information is not restored
     during backtracking.  To keep track of the line information in a
     meaningful way one has to use the new ZZINF_LINE macro which was
     added as a patch to the FTP site on 27-Feb-94.

     Putting line and column information in a field of the attribute will
     not help.  The attributes are created by ANTLR, not DLG, and when
     ANTLR backtracks it destroys any attributes that were created in
     making the incorrect guess.

80.
     As infinite lookahead mode causes the entire input to be scanned by
     DLG before ANTLR begins parsing, one cannot depend on feedback from
     the parser to the lexer to handle things like providing special
     token codes for items which are in a symbol table (the "lex hack"
     for typedefs in the C language).  Instead one MUST use semantic
     predicates which allow for such decisions to be made by the parser.

81.  One cannot use an interactive scanner (ANTLR -gk option) with the
     ANTLR infinite lookahead and backtracking options (syntactic
     predicates).

82.  An example of the need for syntactic predicates is the case where
     relational expressions involving "<" and ">" are enclosed in angle
     bracket pairs.

        Relation:               a < b
        Array Index:            b <i>
        Problem:                a < b<i>
                        vs.     b < a>

     I was going to make this into an extended example, but I haven't had
     time yet.

83.  If your syntactic predicate invokes routines which build ASTs this
     violates the rule that syntactic predicates should not have side
     effects.  You'll end up with lots of extra nodes added to the AST.

84.  The following is an example of the use of syntactic predicates.

        program : ( s SEMI )* ;

        s       : ( ID EQUALS )? ID EQUALS e
                | e
                ;

        e       : t ( PLUS t | MINUS t )* ;

        t       : f ( TIMES f | DIV f )* ;

        f       : Num
                | ID
                | "$" e "$"
                ;

     When compiled with:

        antlr -fe err.c -fh stdpccts.h -fl parser.dlg -ft tokens.h \
              -fm mode.h -k 1 test.g

     One gets the following warning:

        warning: alts 1 and 2 of the rule itself ambiguous upon { ID }

     even though the manual suggests that this is okay.  The only problem
     is that ANTLR 1.10 should NOT issue this error message unless the
     -w2 option is selected.

     Included with permission of S. Salters

===============================================================================
                           Section on Inheritance
-------------------------------------------------------------------------------
85.  A rule which uses upward inheritance:

        rule > [int result] : x | y | z;

     is simply declaring a function which returns an "int" as a function
     value.
     If the function has more than one item passed via upward inheritance
     then ANTLR creates a structure to hold the result and then copies
     each component of the structure to the upward inheritance variables.

86.  When writing a rule that uses downward inheritance:

        rule [int *x] : r1 r2 r3

     one should remember that the arguments passed via downward
     inheritance are simply arguments to a function.  If one is using
     downward inheritance syntax to pass results back to the caller
     (really upward inheritance !) then it is necessary to pass the
     address of the variable which will receive the result.

87.  ANTLR is smart enough to combine the declaration for an AST with the
     items declared via downward inheritance when constructing the
     prototype for a function which uses both ASTs and downward
     inheritance.

===============================================================================
                 Section on LA, LATEXT, NLA, and NLATEXT
-------------------------------------------------------------------------------
88.  Need examples of LATEXT for various forms of lookahead.

===============================================================================
                           Section on Prototypes
-------------------------------------------------------------------------------
89.  Prototype for typical create_ast routine:

        #define zzcr_ast(ast,attr,tok,astText) \
                create_ast(ast,attr,tok,astText)

        void create_ast (AST *ast,Attrib *attr,int tok,char *text);

90.  Prototype for typical make_ast routine:

        AST *zzmk_ast (AST *ast,int token,char *text)

91.  Prototype for typical create_attr routine:

        #define zzcr_attr(attr,token,text) \
                create_attr(attr,token,text)

        void create_attr (Attrib *attr,int token,char *text);

92.
     Prototype for ANTLR (these are actually macros):

        read from file:     void ANTLR  (void startRule(...),FILE *)
        read from string:   void ANTLRs (void startRule(...),zzchar_t *)
        read from function: void ANTLRf (void startRule(...),int (*)())

     In the call to ANTLRf the function behaves like getchar() in that it
     returns EOF (-1) to indicate end-of-file.

     If ASTs are used or there is downward or upward inheritance then the
     call to the startRule must pass these arguments:

        AST *root;
        ANTLR (startRule(&root),stdin);

===============================================================================
        Section on ANTLR/DLG Internals Routines That Might Be Useful
-------------------------------------------------------------------------------

                        ****************************
                        ****************************
                        **                        **
                        **  Use at your own risk  **
                        **                        **
                        ****************************
                        ****************************

93.  static int zzauto - defined in dlgauto.h

     Current DLG mode.  This is used by zzmode() only.

94.  void zzerr (char * s) defined in dlgauto.h

     Defaults to zzerrstd(char *s) in dlgauto.h

     Unless replaced by a user-written error reporting routine:

        fprintf(stderr,
                "%s near line %d (text was '%s')\n",
                ((s == NULL) ? "Lexical error" : s),
                zzline,zzlextext);

95.  static char zzebuf[70] defined in dlgauto.h

===============================================================================
                        Section on Known Minor Bugs
-------------------------------------------------------------------------------
96.  FoLink() bug and its fix reported 24-Feb-94.  This fixes a bug in
     ANTLR which caused nodes to be revisited multiple times during a
     traversal of trees.  Under some circumstances large grammars could
     take 50 times as long to analyze.

97.  As of 24-Feb-94 there was an off-by-one problem in the testing for
     lowercase in range expressions.  The problem appears if the last
     character in the range is "z" or "Z" (e.g. [a-z]).  The temporary
     workaround is to use a range of the form "[a-yz]".

98.
     As of 22-Mar-94 there was a bug in the hoisting of semantic
     predicates when all alternatives of a rule have the same lookahead
     token.  (I believe this is a special case of the inability of ANTLR
     to hoist multiple semantic predicates).  The following example uses
     ANTLR options "-prc off -k 1 -ck 3".

     Consider the following:

        ;obj_name
                : global_func_id
                | simple_class SUFFIX_DOT
                | ID
        ;simple_class
                : <<(LA(1)==ID ? isClass(LATEXT(1)) : 1)>>? ID
        ;global_func_id
                : <<(LA(1)==ID ? isFunction(LATEXT(1)) : 1)>>? ID

     One would expect prediction code to look something like the
     following:

        if ( (test_for_class || test_for_function) && LA(1)==ID
           ...
           ...
           if (test_for_class && LA(1)==ID) simple_class() ...
           else if (test_for_function && LA(1)==ID) global_func_id() ...
           else if (LA(1)==ID) {match();consume();}

     But what one sees is the use of "&&" rather than "||".

     Below is the complete example: It is not meant to execute, but only
     to be sufficient to generate ANTLR output for inspection.  The
     workaround is to precede semantic predicates with an action (such as
     "<<;>>") which inhibits hoisting into the prediction expression.

        #header <<
        #include "charptr.h"
        >>

        #token QUESTION
        #token COMMA
        #token ID
        #token Eof "@"
        #token SUFFIX_DOT

        statement
                : (information_request)?
                | obj_name
        ;information_request
                : QUESTION id_list ( QUESTION )*
                | id_list ( QUESTION )+
        ;id_list
                : obj_name ( COMMA | obj_name )*
        ;simple_class
                : <<(LA(1)==ID ? isClass(LATEXT(1)) : 1)>>? ID
        ;global_func_id
                : <<(LA(1)==ID ? isFunction(LATEXT(1)) : 1)>>?
                  ID
        ;obj_name
                : <<;>> global_func_id
                | <<;>> simple_class SUFFIX_DOT
                | ID
        ;

===============================================================================
                           Example 1 of #lexclass
===============================================================================
Borrowed code
-------------------------------------------------------------------------------
/*
 * Various tokens
 */
#token "[\t\ ]+"  << zzskip(); >>               /* Ignore whitespace */
#token "\n"       << zzline++; zzskip(); >>     /* Count lines */
#token "\""       << zzmode(STRINGS); zzmore(); >>
#token "'"        << zzmode(CHARACTERS); zzmore(); >>
#token "/\*"      << zzmode(COMMENT); zzskip(); >>
#token "//"       << zzmode(CPPCOMMENT); zzskip(); >>

/*
 * C++ String literal handling
 */
#lexclass STRINGS
#token STRING "\""        << zzmode(START); >>
#token "\\\""             << zzmore(); >>
#token "\\n"              << zzreplchar('\n'); zzmore(); >>
#token "\\r"              << zzreplchar('\r'); zzmore(); >>
#token "\\t"              << zzreplchar('\t'); zzmore(); >>
#token "\\[1-9][0-9]*"    << zzreplchar((char)strtol(zzbegexpr, NULL, 10));
                             zzmore(); >>
#token "\\0[0-7]*"        << zzreplchar((char)strtol(zzbegexpr, NULL, 8));
                             zzmore(); >>
#token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr, NULL, 16));
                             zzmore(); >>
#token "\\~[\n\r]"        << zzmore(); >>
#token "[\n\r]"           << zzline++; zzmore(); /* Print warning */ >>
#token "~[\"\n\r\\]+"     << zzmore(); >>

/*
 * C++ Character literal handling
 */
#lexclass CHARACTERS
#token CHARACTER "'"      << zzmode(START); >>
#token "\\'"              << zzmore(); >>
#token "\\n"              << zzreplchar('\n'); zzmore(); >>
#token "\\r"              << zzreplchar('\r'); zzmore(); >>
#token "\\t"              << zzreplchar('\t'); zzmore(); >>
#token "\\[1-9][0-9]*"    << zzreplchar((char)strtol(zzbegexpr, NULL, 10));
                             zzmore(); >>
#token "\\0[0-7]*"        << zzreplchar((char)strtol(zzbegexpr, NULL, 8));
                             zzmore(); >>
#token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr, NULL, 16));
                             zzmore(); >>
#token "\\~[\n\r]"        << zzmore(); >>
#token "[\n\r]"           << zzline++; zzmore(); /* Print warning */ >>
#token "~[\'\n\r\\]"      << zzmore(); >>

/*
 *
 * C-style comment handling
 */
#lexclass COMMENT
#token "\*/"      << zzmode(START); zzskip(); >>
#token "~[\*]*"   << zzskip(); >>
#token "\*~[/]"   << zzskip(); >>

/*
 * C++-style comment handling
 */
#lexclass CPPCOMMENT
#token "[\n\r]"   << zzmode(START); zzskip(); >>
#token "~[\n\r]"  << zzskip(); >>

#lexclass START

/*
 * Assorted literals
 */
#token OCT_NUM   "0[0-7]*"
#token L_OCT_NUM "0[0-7]*[Ll]"
#token INT_NUM   "[1-9][0-9]*"
#token L_INT_NUM "[1-9][0-9]*[Ll]"
#token HEX_NUM   "0[Xx][0-9A-Fa-f]+"
#token L_HEX_NUM "0[Xx][0-9A-Fa-f]+[Ll]"
#token FLOAT_NUM "([1-9][0-9]*{.[0-9]*} | {0}.[0-9]+) {[Ee]{[\+\-]}[0-9]+}"

/*
 * Identifiers
 */
#token Identifier "[_a-zA-Z][_a-zA-Z0-9]*"

===============================================================================
                              Example 2: ASTs
===============================================================================
#header <<

#include "charbuf.h"
#include <string.h>

int nextSerial;

#define AST_FIELDS int token; int serial; char *text;
#include "ast.h"

#define zzcr_ast(ast,attr,tok,astText) \
        (ast)->token=tok; \
        (ast)->text=strdup( (char *) &( ( (attr)->text ) ) ); \
        nextSerial++; \
        (ast)->serial=nextSerial; \

#define zzd_ast(node) delete_ast(node)

void delete_ast (AST *node);

>>

<<

AST *root=NULL;

void show(AST *tree) {
        if (tree->token==ID) {
                printf (" %s <#%d> ",
                        tree->text,tree->serial);}
        else {
                printf (" %s <#%d> ",
                        zztokens[tree->token],
                        tree->serial);
        };
}

void before (AST *tree) {
        printf ("(");
}

void after (AST *tree) {
        printf (")");
}

void delete_ast(AST *node) {
        printf ("\nzzd_ast called for <node #%d>\n",node->serial);
        free (node->text);
        return;
}

int main() {
        nextSerial=0;
        ANTLR (expr(&root),stdin);
        printf ("\n");
        zzpre_ast(root,show,before,after);
        printf ("\n");
        zzfree_ast(root);
        return 0;
}
>>

#token WhiteSpace "[\ \t]" <<zzskip();>>
#token ID         "[a-z A-Z]*"
#token NEWLINE    "\n"
#token OpenAngle  "<"
#token CloseAngle ">"

expr    : (expr0 NEWLINE)

;expr0  : expr1 {"="^ expr0}

;expr1  : expr2 ("\+"^ expr2)*

;expr2  : expr3 ("\*"^ expr3)*

;expr3  :
          ID
-------------------------------------------------------------------------------
Sample output from this program:

a=b=c=d

( = <#2> a <#1> ( = <#4> b <#3> ( = <#6> c <#5> d <#7> ))) NEWLINE <#8>

zzd_ast called for <node #7>
zzd_ast called for <node #5>
zzd_ast called for <node #6>
zzd_ast called for <node #3>
zzd_ast called for <node #4>
zzd_ast called for <node #1>
zzd_ast called for <node #8>
zzd_ast called for <node #2>

a+b*c

( \+ <#2> a <#1> ( \* <#4> b <#3> c <#5> )) NEWLINE <#6>

zzd_ast called for <node #5>
zzd_ast called for <node #3>
zzd_ast called for <node #4>
zzd_ast called for <node #1>
zzd_ast called for <node #6>
zzd_ast called for <node #2>

a*b+c

( \+ <#4> ( \* <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>

zzd_ast called for <node #3>
zzd_ast called for <node #1>
zzd_ast called for <node #5>
zzd_ast called for <node #2>
zzd_ast called for <node #6>
zzd_ast called for <node #4>
-------------------------------------------------------------------------------
Makefile (I don't know makefiles either)

DLG_FILE = parser.dlg
ERR_FILE = err.c
HDR_FILE = stdpccts.h
TOK_FILE = tokens.h
MOD_FILE = mode.h
K = 1
CK=2
ANTLR_H = /pccts/h
BIN = .
ANTLR = /pccts/bin/antlr
DLG = /pccts/bin/dlg
GD=-gd
GS=-gs
GX=
GK=-gk
INTER=-i
CFLAGS = -I. -I$(ANTLR_H) -ansi
AFLAGS = -fe $(ERR_FILE) -fh $(HDR_FILE) -fl $(DLG_FILE) -ft $(TOK_FILE) \
         -fm $(MOD_FILE) -k $(K) $(GS) -ck $(CK) -gt $(GK)
DFLAGS = -C2

.SUFFIXES :
.SUFFIXES : .g .c .o

.g:
	$(ANTLR) $(AFLAGS) $(A) $*.g
	make $*.o
	make scan.c
	make scan.o
	make err.o
	$(CC) -o $* $*.o scan.o err.o

scan.c : parser.dlg
	$(DLG) $(DFLAGS) $(INTER) $(D) parser.dlg scan.c

===============================================================================
                      Example 3: Syntactic Predicates
===============================================================================
Not completed.
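
Until a real grammar example is written, the mechanics described in notes
76-78 can at least be imitated in plain C.  What follows is only a sketch
and is NOT code generated by ANTLR: the names "struct parser", "match",
"guess_assignment" and "parse_s" are made up for this illustration, and the
grammar is the one from note 84 with the "$" e "$" alternative left out.
The input is tokenized up front into an array, the guess for ( ID EQUALS )?
records the current position, tries to match, and always rewinds before the
real parse of the chosen alternative begins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; a real parser would take these from tokens.h. */
enum tok { T_ID, T_NUM, T_EQUALS, T_PLUS, T_MINUS, T_TIMES, T_DIV, T_EOF };

struct parser {
    const enum tok *toks;   /* whole input, tokenized before parsing starts */
    size_t pos;             /* index of the current lookahead token         */
};

/* Consume the current token if it has type t; never step past T_EOF. */
static int match(struct parser *p, enum tok t)
{
    if (p->toks[p->pos] != t) return 0;
    if (t != T_EOF) p->pos++;
    return 1;
}

/* f : Num | ID */
static int parse_f(struct parser *p)
{
    return match(p, T_NUM) || match(p, T_ID);
}

/* t : f ( TIMES f | DIV f )* */
static int parse_t(struct parser *p)
{
    if (!parse_f(p)) return 0;
    while (p->toks[p->pos] == T_TIMES || p->toks[p->pos] == T_DIV) {
        p->pos++;
        if (!parse_f(p)) return 0;
    }
    return 1;
}

/* e : t ( PLUS t | MINUS t )* */
static int parse_e(struct parser *p)
{
    if (!parse_t(p)) return 0;
    while (p->toks[p->pos] == T_PLUS || p->toks[p->pos] == T_MINUS) {
        p->pos++;
        if (!parse_t(p)) return 0;
    }
    return 1;
}

/* The guess ( ID EQUALS )? : try the match, then rewind in either case. */
static int guess_assignment(struct parser *p)
{
    size_t mark = p->pos;
    int ok = match(p, T_ID) && match(p, T_EQUALS);
    p->pos = mark;
    return ok;
}

/* s : ( ID EQUALS )? ID EQUALS e | e
 * Returns 1 if the assignment alternative was parsed, 2 if the plain
 * expression alternative was parsed, and 0 on a syntax error. */
int parse_s(struct parser *p)
{
    assert(p != NULL && p->toks != NULL);
    if (guess_assignment(p)) {
        if (match(p, T_ID) && match(p, T_EQUALS) && parse_e(p)) return 1;
        return 0;
    }
    return parse_e(p) ? 2 : 0;
}
```

Because the guess only moves "pos" and puts it back, it has no side effects,
which is the property notes 77 and 83 ask of a syntactic predicate.  Any
work done during the guess (attributes, AST nodes) would have to be undone
or repeated when the chosen alternative is parsed for real.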
===============================================================================
                       Example 4: DLG input function
===============================================================================
This example demonstrates the use of a DLG input function to work around a
limitation of DLG.  In this example the user wants to recognize an
exclamation mark as the first character of a line and treat it differently
from an exclamation mark elsewhere.  The work-around is for the input
function to return a non-printing character (binary 1) when it finds an "!"
in column 1.  If it reads a genuine binary 1 in column 1 of the input text
it returns a "?".

The parse is started by:

        int DLGchar (void);
        ...
        ANTLRf (expr(&root),DLGchar);
        ...
-------------------------------------------------------------------------------
#token BANG       "!"
#token BANG_COL1  "\01"
#token WhiteSpace "[\ \t]" <<zzskip();>>
#token ID         "[a-z A-Z]*"
#token NEWLINE    "\n"

expr!   : (bang <<printf ("\nThe ! is NOT in column 1\n");>>
          | bang1 <<printf ("\nThe ! is in column 1\n");>>
          | id <<printf ("\nFirst token is an ID\n");>>
          )* "@"

;bang!  : BANG ID NEWLINE

;bang1! : BANG_COL1 ID NEWLINE

;id!    : ID NEWLINE
;
-------------------------------------------------------------------------------
#include <stdio.h>

/*
   Antlr DLG input function - See page 18 of pccts 1.00 manual
*/

static int firstTime=1;         /* no character has been read yet     */
static int c;                   /* most recent character read         */

int DLGchar (void) {
        if (feof(stdin)) {
                return EOF;
        }
        if (firstTime || c=='\n') {
                /* next character is in column 1 */
                firstTime=0;
                c=fgetc(stdin);
                if (c==EOF) return (EOF);
                if (c=='!') return ('\001');
                if (c=='\001') return ('?');
                return (c);
        } else {
                c=fgetc(stdin);
                return (c);
        }
}

===============================================================================
                Example 5: Maintaining a Stack of DLG Modes
===============================================================================
Contributed by David Seidel
-------------------------------------------------------------------------------
#define MAX_MODE ????
#define ZZMAXSTK (MAX_MODE * 2)

static int zzmstk[ZZMAXSTK] = { -1 };
static int zzmdep = 0;

void
#ifdef __STDC__
zzmpush( int m )
#else
zzmpush( m )
int m;
#endif
{
   if(zzmdep == ZZMAXSTK - 1)
   {  sprintf(zzebuf, "Mode stack overflow ");
      zzerr(zzebuf);
   }
   else
   {  zzmstk[zzmdep++] = zzauto;
      zzmode(m);
   }
}

void
zzmpop()
{
   if(zzmdep == 0)
   {  sprintf(zzebuf, "Mode stack underflow ");
      zzerr(zzebuf);
   }
   else
   {  zzmdep--;
      zzmode(zzmstk[zzmdep]);
   }
}
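
One way to convince yourself that the push/pop logic behaves as intended is
to compile it outside a generated scanner.  The sketch below assumes simple
stand-ins for the DLG internals: zzauto, zzebuf, zzmode() and zzerr() are
replaced here by hand-written stubs (in a real scanner they come from
dlgauto.h), MAX_MODE is arbitrarily set to 4, and the mode numbers START,
STRINGS and COMMENT as well as zzerr_count and mode_stack_demo() are
invented for the illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins (assumptions) for what dlgauto.h provides in a real scanner. */
int  zzauto = 0;                /* current DLG mode                       */
char zzebuf[70];                /* error message buffer                   */
int  zzerr_count = 0;           /* hypothetical: number of errors raised  */

void zzmode(int m) { zzauto = m; }

void zzerr(const char *s)
{
    zzerr_count++;
    fprintf(stderr, "%s\n", s);
}

#define MAX_MODE 4              /* assumed number of lexical classes      */
#define ZZMAXSTK (MAX_MODE * 2)

static int zzmstk[ZZMAXSTK] = { -1 };
static int zzmdep = 0;

/* Remember the current mode, then switch to mode m. */
void zzmpush(int m)
{
    if (zzmdep == ZZMAXSTK - 1) {
        sprintf(zzebuf, "Mode stack overflow ");
        zzerr(zzebuf);
    } else {
        zzmstk[zzmdep++] = zzauto;
        zzmode(m);
    }
}

/* Return to the mode that was current before the matching zzmpush(). */
void zzmpop(void)
{
    if (zzmdep == 0) {
        sprintf(zzebuf, "Mode stack underflow ");
        zzerr(zzebuf);
    } else {
        zzmdep--;
        zzmode(zzmstk[zzmdep]);
    }
}

/* Enter two nested modes and leave them again; returns the final mode. */
int mode_stack_demo(void)
{
    enum { START = 0, STRINGS = 1, COMMENT = 2 };   /* hypothetical codes */

    zzmpush(STRINGS);
    assert(zzauto == STRINGS);
    zzmpush(COMMENT);
    assert(zzauto == COMMENT);
    zzmpop();                   /* back to STRINGS */
    assert(zzauto == STRINGS);
    zzmpop();                   /* back to START   */
    return zzauto;
}
```

In a grammar the same calls would presumably sit in token actions, with
zzmpush(mode) where zzmode(mode) is used to enter a lexical class and
zzmpop() where the class is left, so that a class can return to whichever
class invoked it rather than to a fixed one.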