OS/2 Shareware BBS: 10 Tools

home *** CD-ROM | disk | FTP | other *** search

/ OS/2 Shareware BBS: 10 Tools / 10-Tools.zip / yay-1_0.zip / yay-1_0 / doc / yay10man.txt < prev

Wrap

Text File | 1996-02-05 | 167KB | 5,392 lines

YYAAYY RReeffeerreennccee MMaannuuaall by J. A. Gardner Thinkage Ltd. 85 McIntyre Drive Kitchener, Ontario Canada N2R 1H6 Copyright 1995 by Thinkage Ltd. Thinkage Ltd. YAY Reference Manual TTaabbllee ooff CCoonntteennttss 11.. IInnttrroodduuccttiioonn........................................................................................22 1.1 A Note to Novices.................................2 22.. HHooww YYAAYY WWoorrkkss......................................................................................44 2.1 "yyparse" and "yylex".............................4 2.2 Grammar Rules.....................................5 2.3 Notes on Compiling Source Code Produced by YAY....6 33 IInnppuutt ttoo YYAAYY..........................................................................................77 3.1 The Declarations Section..........................8 3.1.1 Token Declarations..........................8 3.1.2 Precedence and Binding......................9 3.1.3 Variable/Function Declarations.............12 3.1.4 Summary....................................12 3.2 Grammar Rules....................................12 3.2.1 Recognition Actions........................14 3.2.2 Token and Symbol Values....................15 3.2.3 Precedence in the Grammar Rules............16 3.2.4 The Start Symbol...........................18 3.2.5 The End Marker.............................19 3.2.6 Declarations in yyparse....................19 3.3 The Programs Section.............................20 3.3.1 The Lexical Analyzer.......................21 44.. IInntteerrnnaall SSttrruuccttuurreess........................................................................2222 4.1 States...........................................22 4.2 Diagramming States...............................23 4.3 State Actions....................................24 4.3.1 The Accept Action..........................25 4.3.2 The Shift Action...........................25 4.3.3 The Reduce Action..........................26 4.3.4 The Goto Action............................27 4.3.5 The Error Action...........................28 55.. EErrrroorr--HHaannddlliinngg..................................................................................2299 5.1 The "error" Symbol...............................29 5.2 The Error Condition..............................29 5.3 Examples.........................................30 5.4 Error Recognition Actions........................32 5.5 The yyclearin Macro..............................33 5.6 The yyerror Function.............................33 5.6.1 Changing yychar in yyerror.................34 5.7 The yyerrflag Variable...........................35 5.8 The yyerrok Macro................................36 5.9 Other Error Support Routines.....................36 November, 1995 Page i YAY Reference Manual Thinkage Ltd. 66.. YYAAYY OOuuttppuutt..........................................................................................3388 6.1 Rules Summary....................................39 6.2 State Descriptions...............................39 6.3 Conflict Summary.................................42 6.4 Parser Statistics................................43 77.. TTyyppeess....................................................................................................4466 The Default Action....................................47 88.. AAmmbbiigguuiittiieess........................................................................................4499 8.1 Resolving Conflicts by Precedence................50 8.2 Disambiguating Rules.............................50 8.3 Conflicts in YAY Output..........................53 99.. AAddvvaanncceedd FFeeaattuurreess............................................................................5544 9.1 LALR(2) Grammars.................................54 9.1.1 The Lookahead Action.......................54 9.1.2 The yy2lex Routine.........................55 9.1.3 Conflicts..................................56 9.2 Multiple Actions.................................57 9.3 Selection Preference.............................59 9.4 Negative Numbers in $N Constructs................63 9.5 Lists and Handling Null Strings..................64 9.6 Right vs. Left Recursion.........................66 9.7 YYDEBUG..........................................68 9.8 Important Symbols................................69 9.9 How YYERROR May Be Used..........................70 9.10 The Default Action..............................73 9.11 Invalid Tokens..................................74 9.12 Dynamic Stack Allocation........................74 9.13 Synchronizing the Lookahead.....................76 9.14 YYSHIFT.........................................76 AAppppeennddiixx AA -- AAnn EExxaammppllee......................................................................7788 AAppppeennddiixx BB -- YYAAYY vvss.. YYAACCCC..................................................................8811 AAppppeennddiixx CC -- TThhee LLIINNTT CCoommmmaanndd LLiinnee................................................8822 Page ii November, 1995 Thinkage Ltd. YAY Reference Manual 11.. IInnttrroodduuccttiioonn YAY is a tool for writing compilers and other programs that parse input according to strict "grammar" rules. In the terminology of parsing theory, YAY can handle LALR(2) grammars with disambiguating rules. This is a fairly broad class of grammars -- most of the computing languages in use today can be described using LALR(2) specifications. For a precise definition of this sort of grammar, see any standard text on parsing theory, e.g. _P_r_i_n_c_i_p_l_e_s_ _ _o_f _C_o_m_p_i_l_e_r_ _D_e_s_i_g_n by A.V.Aho and J.D.Ullman, Addison-Wesley, 1977, or _L_R_ _ _P_a_r_s_i_n_g by Aho and Johnson, in Computing Surveys. The appearance of YAY input is based on the input to YACC (Yet Another Compiler-Compiler), written by S.C.Johnson of Bell Telephone Laboratories, Murray Hill, N.J. The LALR(1) version of YAY was written by K.W.Lalonde of the Software Development Group of the University of Waterloo. Enhancements to allow YAY to handle LALR(2) grammars were made by P.J.Fraser, also of SDG. The parsing algorithm used by YAY is derived from the article "Methods for Computing LALR(k) Lookahead" by B.B.Kristensen and O.L.Madsen, _A_C_M_ _ _ _T_r_a_n_s_a_c_t_i_o_n_s_ _ _ _o_n _P_r_o_g_r_a_m_m_i_n_g_ _L_a_n_g_u_a_g_e_s_ _ _a_n_d_ _ _S_y_s_t_e_m_s, Vol.3, No.1, January 1981, pp.60-82. Those interested in reading this article should first have a good grasp of parsing theory principles. 11..11 AA NNoottee ttoo NNoovviicceess YAY can produce anything from a simple parser for a desk calculator program to a very complicated parser for a programming language. Those who are using YAY for complex tasks have to know all the idiosyncrasies of the program, including a good deal about the internal workings of YAY. On the other hand, the internal workings are mostly irrelevant to someone who is making an easy straightforward parser. For this reason, novices may want to skip some the sections of the manual which describe very technical aspects of YAY. This is particularly true on first reading, when it is more important to get an overview of how YAY is used than to try to comprehend every little detail. We therefore November, 1995 Page 3 YAY Reference Manual Thinkage Ltd. suggest that novices concentrate on the following sections, and only look at other sections if the requirements of your parser make it necessary. -- All of Chapter 2 -- All of Chapter 3 -- All of Chapter 4 -- Sections 6.0, 6.1, 6.2 -- all of Chapter 7 -- Sections 9.3, 9.5, 9.6 Novices may be particularly interested in Appendix A which gives an example of YAY input for a simple parser. Page 4 November, 1995 Thinkage Ltd. YAY Reference Manual 22.. HHooww YYAAYY WWoorrkkss The input to YAY describes the rules of a grammar. YAY uses these rules to produce the source code for a program that will parse the grammar. This source code can then be compiled to obtain a program that reads input, parses it according to the grammar, and takes action based on the result. The source code produced by YAY consists of a number of data tables that represent the grammar, plus a C function named "yyparse". By default, all the symbol names used by YAY begin with "yy". This is an historical convention, dating back to YAY's predecessor YACC. Users can avoid conflicts with YAY names by avoiding symbols that start with "yy". Users who want to use a different prefix may indicate this with a line of the form %prefix PREFIX at the beginning of their YAY input. For example, %prefix ww asks for a prefix of "ww" instead of "yy". The prefix chosen should be one or two characters long -- longer prefixes will lead to name conflicts on machines where external names are truncated to six characters during the loading process. In addition, at least one of the characters in the prefix should be a lower case letter (because YAY uses an all upper case version of the prefix for some special names, and this has to be different from the real prefix). Different prefixes are useful when two YAY-produced parsers are to be merged into a single program. For the sake of convenience, however, the "yy" convention will be used in this manual. 22..11 ""yyyyppaarrssee"" aanndd ""yyyylleexx"" "yyparse" returns a value of 0 if the input it parses is valid according to the given grammar rules. It returns a 1 if the input is invalid and error recovery is impossible. "yyparse" does not do its own lexical analysis. In other words, it does not pull the input apart into tokens ready November, 1995 Page 5 YAY Reference Manual Thinkage Ltd. for parsing. Instead, it calls a routine called "yylex" every time it wishes to obtain a token from the input. "yylex" returns a value indicating the _t_y_p_e of token that has been obtained. If the token has an actual _v_a_l_u_e, this value (or some representation of the value, e.g. a pointer to a string containing the value) is returned in an external variable named "yylval". It is up to the user to write a "yylex" routine that breaks the input into tokens and returns the tokens one by one to "yyparse". We will say more about the lexical analyzer in Section 3.3. 22..22 GGrraammmmaarr RRuulleess The grammar rules given to YAY not only describe what inputs are valid according to the grammar but also specify what action should be taken when a given input is encountered. For example, if the parser recognizes a statement that assigns a value to a variable, the parser should either perform the assignment itself or take some action to ensure that the assignment will eventually take place. As an example, if the parser is part of an interactive desk calculator, it could carry out arithmetic calculations as soon as the instructions are recognized. However, if the parser is acting as the first pass in a multi-pass compiler, it may simply encode the input in a way that will be used in a later code-generation pass. In summary, the user must provide a number of things when using YAY to produce a parser: (1) grammar rules indicating what input is and isn't valid; (2) a lexical analyzer ("yylex") that breaks raw input into tokens for the parsing routine "yyparse"; (3) any source code or functions that may be needed to perform appropriate actions once particular inputs are recognized; (4) a mainline routine that performs any necessary initializations, calls "yyparse", then performs possible clean-up actions. The simplest kind of mainline is just a function "main" that calls "yyparse", then returns. Page 6 November, 1995 Thinkage Ltd. YAY Reference Manual 22..33 NNootteess oonn CCoommppiilliinngg SSoouurrccee CCooddee PPrroodduucceedd bbyy YYAAYY By default, C source code produced by YAY contains the line #define YYSCTAB const It then uses the YYSCTAB manifest to declare various data objects as ccoonnsstt. If you will be compiling the YAY output with an old C compiler that doesn't recognize the ccoonnsstt qualifier, you should change the definition to read #define YYSCTAB November, 1995 Page 7 YAY Reference Manual Thinkage Ltd. 33 IInnppuutt ttoo YYAAYY This chapter describes the input to YAY when you are defining an LALR(1) grammar. Chapter 9 will talk about additional needs for LALR(2) grammars. The input to YAY is broken into three sections: -- declarations -- grammar rules -- programs The contents of each section will be described shortly, but first we will give some overall rules for YAY input. Sections of YAY input are separated by the symbol %%. The general lay-out of YAY input is therefore declarations %% grammar rules %% programs The declarations section may be omitted if no declarations are necessary. This will mean that the input starts with the first %%. The programs section may also be omitted, from the second %% on. The simplest input for YAY is therefore %% grammar rules Blanks, tabs, and new-lines are used to separate items in YAY input. These are called _w_h_i_t_e_-_s_p_a_c_e characters. Any place one white-space character is valid, any number of blanks, tabs, and/or new-lines may be used. This means, for example, that the %% to separate sections does not have to be on a line all by itself. However, giving it a line of its own makes the YAY input easier to read. Comments may appear anywhere a blank is valid. As in C, comments begin with /* and end with */. Identifiers used in YAY input may be of arbitrary length, and may consist of all letters (upper and lower case), all digits, and the characters dot (.) and underscore (_). The first character of an identifier may not be a digit. YAY distinguishes between upper and lower case Page 8 November, 1995 Thinkage Ltd. YAY Reference Manual letters, so "this" and "THIS" and "This" are all different symbols. Literals in YAY input consist of a single ASCII character enclosed in single quotes, e.g. 'c'. The standard C escape sequences are recognized: \b -- backspace \n -- new-line \r -- carriage return \t -- tab \' -- single quote \\ -- backslash \xxx -- any ASCII character ("xxx" is octal representation) For technical reasons, the null character '\000' should never appear in YAY input. 33..11 TThhee DDeeccllaarraattiioonnss SSeeccttiioonn The declarations section describes many of the identifiers that will be used in the rest of the YAY input. There are two types of declarations: (a) Token declarations; (b) Declarations of functions and variables used in the actions that the parser takes when a particular input is recognized. The declarations section can also specify rules for the precedence and binding of operators used in the grammar. For example, the standard order of arithmetic operations would normally be set up in the declarations section. _3_._1_._1_ _ _T_o_k_e_n_ _D_e_c_l_a_r_a_t_i_o_n_s All ASCII characters are automatically recognized as tokens. For example, 'a' stands for a token that is the literal character "a". Other tokens are declared with statements of the form %token name1 name2 name3 ... This tells YAY that the given names will refer to tokens. For example, November, 1995 Page 9 YAY Reference Manual Thinkage Ltd. %token INTNUM indicates that the identifier INTNUM will refer to a particular type of token returned by the lexical analyzer "yylex". If INTNUM stands for any integer number token, you might have the following code in "yylex". c = getchar(); if ( (c >= '0') && (c <= '9') ) { yylval = 0; do { yylval = (yylval * 10) + (c - '0'); c = getchar(); } while (c >= '0' && c <= '9'); ungetc(c, stdin); return( INTNUM ); } "yylex" returns INTNUM to indicate that a certain kind of token (an integer number) has been returned. The actual value of this number is returned in "yylval". The grammar rules in the YAY input would dictate where an INTNUM token is valid. In the C source code produced by YAY, the identifiers named in a %token statement will appear as constants set up with ##ddeeffiinnee. The first named token has a #defined value of 257, the next is #defined as 258, and so on. Token values start at 257 so they will not conflict with any of the 256 ASCII characters. Because token identifiers are set up as manifest constants, they must not conflict with reserved words or other identifiers that will be used by the parser. For example, %token if yyparse ... will almost certainly lead to errors when you try to compile the source code output of YAY. By convention, this manual will always create token names in upper case, and we recommend that you follow the same practice. _3_._1_._2_ _ _P_r_e_c_e_d_e_n_c_e_ _a_n_d_ _B_i_n_d_i_n_g Parsers that evaluate expressions usually have to establish the order in which various operations are carried out. For example, parsers for arithmetic expressions usually carry out multiplications before additions. Two factors affect order of operation: precedence and binding. Page 10 November, 1995 Thinkage Ltd. YAY Reference Manual Precedence dictates which of two different operations should be carried out first. For example, in A + B * C the standard arithmetic rules of precedence dictate that the multiplication should take place before the addition. Operations that should be carried out first are said to have a _h_i_g_h_e_r precedence than operations that should be performed later. Different operators can sometimes have the same precedence. In C, for example, addition and subtraction are similar enough to share the same precedence. Binding indicates which of two similar operations should be carried out first. By "similar", we mean operations with the same precedence (e.g. addition and subtraction in C). For example, C chooses to parse A - B - C as (A - B) - C while a language like APL would use A - (B - C) If we evaluate the first operation before the second (as C does), we say the operation is _l_e_f_t_-_a_s_s_o_c_i_a_t_i_v_e. If we evaluate the second operation before the first (as APL does), we say the operation is _r_i_g_h_t_-_a_s_s_o_c_i_a_t_i_v_e_. Occasionally, a compiler may have operations which are not associative. For example, Fortran regards A .GT. B .GT. C as invalid. In this case, we say the operation is _n_o_n_- _a_s_s_o_c_i_a_t_i_v_e. The precedence and associativity of operator tokens may be declared in the declarations section using the keywords %left %right %nonassoc For example, November, 1995 Page 11 YAY Reference Manual Thinkage Ltd. %left '+' '-' indicates that the + and - operations have the same precedence and are left-associative. Associativity declarations should be given in order of precedence. Operations with lowest precedence are listed first, and those with highest precedence are listed last. Operations with equal precedence are listed on the same line. For example, %right '=' %left '+' '-' %left '*' '/' '%' says that = has a lower precedence than + and -, which in turn have a lower precedence than *, /, and %. = is also right-associative, so that A = B = C is parsed as A = (B = C) Because of the way YAY specifies precedence and associativity, operators with equal precedence will always have the same associativity. For example, if A and B have equal precedence, their precedence must have been set with one of %left A B %right A B %nonassoc A B which means A and B must have the same associativity. The names supplied with %right, %left, and %nonassoc may be literals or YAY identifiers. If they are identifiers, they are regarded as token names. YAY generates a %token directive for such names if they have not already been declared. For example, in %left '+' '-' %left '*' '/' %left UMINUS UMINUS is taken to be a token identifier. There is no need to define UMINUS as a token identifier -- a %token directive Page 12 November, 1995 Thinkage Ltd. YAY Reference Manual will be generated automatically if necessary. It is perfectly valid to have an explicit %token UMINUS if you want. However, it must precede the %left declaration. For a more technical discussion of how precedence and binding rules affect a parser, see Chapter 8. _3_._1_._3_ _ _V_a_r_i_a_b_l_e_/_F_u_n_c_t_i_o_n_ _D_e_c_l_a_r_a_t_i_o_n_s The declarations section may contain standard C declarations for variables and/or functions used in the actions specified in the grammar rules section. All such declarations should be included in a block that begins with %{ and ends with %}. For example, %{ int i, j, k; static float x = 1.0; %} gives a few variable declarations. These declarations are essentially transferred "as is" to the _b_e_g_i_n_n_i_n_g of the source code that is produced by YAY. This means that they will be external to "yyparse" and therefore extern definitions. _3_._1_._4_ _ _S_u_m_m_a_r_y The source code produced by YAY contains the following. -- code from the declarations section -- code given in the programs section -- parsing tables produced by YAY to represent the grammar -- the "yyparse" routine 33..22 GGrraammmmaarr RRuulleess A YAY grammar rule has the general form identifier : definition ; A colon separates the definition from the identifier being defined. A semicolon marks the end of the definition. November, 1995 Page 13 YAY Reference Manual Thinkage Ltd. The identifiers defined in the grammar rule section are known as _n_o_n_-_t_e_r_m_i_n_a_l_ _s_y_m_b_o_l_s. "Non-terminal" suggests that these symbols aren't the "end of the line". Instead, they are made up of smaller things: tokens or other non-terminal symbols. Here is a simple example of the definition of a non- terminal symbol. paren_expr : '(' expr ')' ; This says that a "paren_expr" consists of a left parenthesis, followed by an "expr", followed by a right parenthesis. The "expr" is either a token or a non-terminal symbol defined in another grammar rule. This grammar rule could be interpreted to say that a parenthesized expression consists of a normal expression inside parentheses. A non-terminal symbol may have more than one definition. For example, we might define "if" statements with if_stat : IF '(' expr ')' stat ; if_stat : IF '(' expr ')' stat ELSE stat ; This definition assumes that IF and ELSE are tokens recognized by the lexical analyzer (which means that this parser's "yylex" can recognize keywords). The definition also assumes that "expr" and "stat" are non-terminal symbols defined elsewhere. When a single symbol has more than one meaning, YAY lets you join the various possibilities into a single definition. Different meanings are separated by or-bars (|). Thus you could write if_stat : IF '(' expr ')' stat | IF '(' expr ')' stat ELSE stat ; This technique is highly recommended, since it makes YAY input more readable. Definitions in a grammar may be recursive. For example, list : item | list ',' item; defines "list" to be one or more items separated by commas. intexp : '(' intexp ')' | intexp '+' intexp | intexp '-' intexp Page 14 November, 1995 Thinkage Ltd. YAY Reference Manual | intexp '*' intexp | intexp '/' intexp | INTNUM ; says that an integer expression can be another integer expression in parentheses, the sum of integer expressions, the difference of integer expressions, the product of integer expressions, the quotient of integer expressions, or an integer number standing on its own (where INTNUM is a token recognized by the lexical analyzer). In recursive symbol definitions, it is often useful to have the empty string as one of the possible definitions. For example, program : /* the empty string */ | statement ';' program ; defines a program as zero or more statements separated by semicolons. Our definition of "list" above was an example of _l_e_f_t _r_e_c_u_r_s_i_o_n because "list" was on the left in the recursive definition. The definition of "program" was an example of _r_i_g_h_t_ _r_e_c_u_r_s_i_o_n. In Section 9.6, we discuss the pros and cons of the two types of recursion. _3_._2_._1_ _ _R_e_c_o_g_n_i_t_i_o_n_ _A_c_t_i_o_n_s In addition to defining what a non-terminal symbol is, a grammar rule usually describes what to do if the non- terminal symbol is encountered in parser input. We call this a _r_e_c_o_g_n_i_t_i_o_n_ _a_c_t_i_o_n_. Recognition actions are specified as part of the grammar rule. They are enclosed in brace brackets in the definition, as in break_stat : BREAK ';' { breakfn(); }; In this definition, "break_stat" is a non-terminal symbol made up of the token known as BREAK, followed by a semicolon. If this symbol is recognized, the parser will invoke a function named "breakfn". Presumably this is a user-defined function that handles a "break;" statement. Note that a semicolon is needed to mark the end of the definition even though the recognition action ends in a November, 1995 Page 15 YAY Reference Manual Thinkage Ltd. brace bracket. Programmers who are used to the way C works should bear this in mind. For compatibility with some versions of YACC, YAY lets you put an equals sign (=) before the opening brace that begins a recognition action, as in break_stat : BREAK ';' = { breakfn(); }; When a symbol has more than one definition, a different recognition action may be associated with each definition. We will show some examples of this in a moment. _3_._2_._2_ _ _T_o_k_e_n_ _a_n_d_ _S_y_m_b_o_l_ _V_a_l_u_e_s One of the most common recognition actions is to return a value. For example, if an input is recognized as an expression to be evaluated, the parser may want to return the resulting value of the expression. To return a value, the recognition action merely assigns the value to a special variable named $$. For example, hexdigit : '0' { $$ = 0; } | '1' { $$ = 1; } ... | 'A' { $$ = 10; } | 'B' { $$ = 11; } | 'C' { $$ = 12; } | 'D' { $$ = 13; } | 'E' { $$ = 14; } | 'F' { $$ = 15; } ; could be a way to convert hexadecimal digits into numeric values. In this case, "yylex" just returns the digits it finds and "yyparse" performs the actual conversion. Another common recognition action is to return a value based on one or more of the items that make up the non- terminal symbol. Inside the recognition action, $1 stands for the value of the first item in the symbol, $2 stands for the value of the second item, and so on. If the item is a token, its value is the "yylval" value returned by "yylex" when the token was read. If the item is non-terminal symbol, its value is the $$ value set by the recognition action associated with the symbol. Thus we might write Page 16 November, 1995 Thinkage Ltd. YAY Reference Manual intexp : '(' intexp ')' { $$ = $2; } /* value of parenthesized expression is expression inside parentheses */ | intexp '+' intexp { $$ = $1 + $3 ; } /* value of addition is sum of two expressions */ | intexp '-' intexp { $$ = $1 - $3 ; } /* value of subtraction is difference of two expressions */ | /* and so on */ ; Note that this particular definition shows that each part of a multiple definition may have a different recognition action. In the source code for "yyparse", this set of actions will be turned into a large sswwiittcchh statement. The cases of the sswwiittcchh will be the various possible recognition actions. If no recognition action is specified for a definition, the default is { $$ = $1 ; } This action just returns the value of the first item (if the item has a value). _3_._2_._3_ _ _P_r_e_c_e_d_e_n_c_e_ _i_n_ _t_h_e_ _G_r_a_m_m_a_r_ _R_u_l_e_s In our discussion of the declarations section, we showed how precedence could be assigned to _o_p_e_r_a_t_o_r_s. Precedence can also be assigned to _g_r_a_m_m_a_r_ _r_u_l_e_s_, and this is done in the grammar input section. One way to give a grammar rule a precedence uses the %prec construct. %prec TOKEN November, 1995 Page 17 YAY Reference Manual Thinkage Ltd. in a grammar rule indicates that the rule should have the same precedence as the specified token. As a simple example, consider the case of the unary minus operator. Suppose our declaration section contains %left '+' '-' %left '*' '/' %left UMINUS In the grammar rules section, we can write exp : exp '+' exp | exp '-' exp | exp '*' exp | exp '/' exp | '-' exp %prec UMINUS | /* and so on */ ; We could not directly set up a precedence for the unary minus, since we had already set up a precedence for the - token. Instead, we created a token named UMINUS and gave it the precedence we wanted to assign the unary minus. When we wrote out the grammar rule for the unary minus, we added %prec UMINUS to show that this rule should have the precedence of UMINUS. As another example, we might set up precedence rules for the right shift and left shift operations of C with %left RS LS ... exp : | exp '<' '<' exp %prec LS | exp '>' '>' exp %prec RS ... In this way we give the shift operations the proper precedence and avoid confusing them with the comparison operations > and <. Of course, another way to resolve this problem is to make the lexical analyzer clever enough to recognize >> and << and to return the RS or LS tokens directly. Although symbols like UMINUS, LS, and RS are treated as tokens, they do not have to correspond to actual input. They may just be placeholders for operator tokens that have two different meanings. Page 18 November, 1995 Thinkage Ltd. YAY Reference Manual We should note that the use of %prec is relatively rare in YAY. People don't usually think of %prec in their first draft of a grammar. %prec is only added in later drafts, when it is needed to resolve conflicts that appear when the rules are run through YAY. If a grammar rule is not assigned a precedence using %prec, the precedence of the rule is taken from the last _t_o_k_e_n in the rule. For example, if the rule is expr : expr '+' expr the last token in the rule is + (since "expr" is a non- terminal symbol, not a token). Thus the precedence of the rule is the same as the precedence of +. If the last token in a rule has no assigned precedence, the rule will not have a precedence. This can result in some surprises if you aren't careful. For example, if you define expr : expr '+' expr ';' the last token in the rule is ; so the rule probably won't have a precedence even if + does. A rule that contains no tokens cannot have a precedence. _3_._2_._4_ _ _T_h_e_ _S_t_a_r_t_ _S_y_m_b_o_l The first non-terminal symbol defined in the rules section is called the _s_t_a_r_t_ _s_y_m_b_o_l. This symbol is taken to be the largest, most general structure described by the grammar rules. For example, if you are generating the parser for a compiler, the start symbol should describe what a complete program looks like in the language to be parsed. If you do not want the first grammar rule to be taken as the start symbol, you can use the directive %start NAME in your rules section. This indicates that the non-terminal symbol named NAME is the start symbol. NAME must be defined somewhere in the rules section. The start symbol must be all-encompassing: every other rule in the grammar must be related to the start symbol. In a sense, the grammar forms a "tree": the root is the start November, 1995 Page 19 YAY Reference Manual Thinkage Ltd. symbol, the first set of branches are the symbols that make up the start symbol, the next set of branches are the symbols that make up the first set, and so on. Any symbol that is "outside" this tree is reported as a _u_s_e_l_e_s_s _v_a_r_i_a_b_l_e in YAY output. Useless variables are ignored by the parser -- the parser is looking for a complete start symbol, and nothing else. _3_._2_._5_ _ _T_h_e_ _E_n_d_ _M_a_r_k_e_r The end of parser input is marked by a special token called the _e_n_d_ _m_a_r_k_e_r. This token is often written as $end; the value of the token is zero. It is the job of the lexical analyzer "yylex" to return a zero to indicate $end when the end of input is reached (e.g. at end of file, or at a keyword that indicates end of input). "yyparse" terminates when it has parsed a start symbol followed by the end marker. _3_._2_._6_ _ _D_e_c_l_a_r_a_t_i_o_n_s_ _i_n_ _y_y_p_a_r_s_e You can specify C declarations that will be local to "yyparse" in much the same way that you specify external declarations in the Declarations Section. Enclose the declarations in %{ and %} symbols, as in %{ /* External declaratios */ %} %% /* Rules Section starts here */ {% /* Declarations here will be local to "yyparse" */ %} /* Rules */ %% /* Program section */ You may also put declarations at the start of recognition action code. Page 20 November, 1995 Thinkage Ltd. YAY Reference Manual 33..33 TThhee PPrrooggrraammss SSeeccttiioonn The programs section of YAY input may contain functions that should be linked in with the "yyparse" routine. YAY itself does nothing with these functions -- it simply adds the source code on the end of the source code produced from the grammar rules. In this way, the functions can be compiled at the same time that the YAY-produced code is compiled. Of course, these additional functions could be compiled separately and linked with the YAY-produced code later on (once everything is in object code format). Separate compilation of modules is strongly recommended for large parsers. However, functions that are compiled separately need a special mechanism if they want to use any manifests that are defined in the YAY-produced code, and it is sometimes simpler to make the program part of the YAY input. For example, consider the case of "yylex". Every time "yylex" obtains a token from the input, it returns to "yyparse" with a value that indicates the type of token found. Obviously then, "yylex" and "yyparse" must agree on which return values indicate which kind of tokens. Since "yyparse" already refers to tokens using manifest constants (created in the declarations section with the %token directive), it makes sense for "yylex" to use the same manifest constants. The lexical analyzer can do this very easily if it is compiled along with "yyparse". Size might be the determining factor. With very simple parsers, it's easier to put "yylex" in the Programs Section. With larger parsers, the advantages of separate compilation are well worth the extra effort. If you are going to compile "yylex" or other routines separately from "yyparse", use the Header=file option on the YAY command line. YAY will write out the manifest definitions to the file of your choice. This file can then be #included to obtain these definitions for "yylex" or any other routine that needs them. The manifests are already included in the generated parser code, so you only need them for separately compiled modules. November, 1995 Page 21 YAY Reference Manual Thinkage Ltd. _3_._3_._1_ _ _T_h_e_ _L_e_x_i_c_a_l_ _A_n_a_l_y_z_e_r The lexical analyzer "yylex" reads input and breaks it into tokens. It is "yylex" that determines what constitutes a token. For example, some lexical analyzers may return numbers one digit at a time while others collect numbers in their entirety before passing them to the parser. Similarly, some lexical analyzers may recognize keywords such as IF or WHILE and will tell the parser that an IF token or WHILE token has been found. Others may not be designed to recognize keywords, so it is up to the parser itself to distinguish between keywords and other things like variable names. As noted before, each token named in the declarations section of the YAY input is set up as a manifest constant. The value of the first token named is 257, the value of the next is 258, and so on. You can also set your own values for tokens by placing a positive integer after the first appearance of any token in the declarations section. For example, %token AA 56 assigns a value of 56 to the manifest (token) symbol AA. This mechanism is very seldom needed, and we recommend that users avoid it whenever possible. There is little else to say about requirements for "yylex". If it wishes to return the value of a token as well as an indication of its type, the value is assigned to the external variable "yylval". By default, "yylval" is defined as an iinntt value but it can also be used to hold other types of values. For more information, see the description of %union in Chapter 7. Page 22 November, 1995 Thinkage Ltd. YAY Reference Manual 44.. IInntteerrnnaall SSttrruuccttuurreess In order to use YAY effectively, it is helpful to understand some of the internal workings of the parser that YAY produces. In this chapter, we will look at some of these structures. To give us a point of reference, suppose we have produced a parser with grammar %token NUM %left '+' '-' %left '*' '/' %% expr : NUM | expr '+' expr | expr '-' expr | expr '*' expr | expr '/' expr | '(' expr ')' ; 44..11 SSttaatteess As the parser reads in token after token, it switches between various _s_t_a_t_e_s. You can think of a state as point where the parser says, "I have read _t_h_i_s particular sequence of input tokens and now I am looking for one of _t_h_e_s_e tokens." For example, a parser for the C language might be in a state where it has finished reading a complete statement and is ready for the start of a new statement. It therefore expects some token that can legitimately start a statement (e.g. a keyword like IF or WHILE, or the name of a variable for an assignment). In this state, it reads a token. Say it finds the token corresponding to the keyword IF. It then switches to a new state where it says "I have seen an IF and now I want to see the ( that begins the IF condition." When it finds the (, it switches again to a state that says "I have found IF( and now I want the start of a condition expression." States break the parsing process into simple steps. At each step, the parser knows what it has seen and what it is looking for next. November, 1995 Page 23 YAY Reference Manual Thinkage Ltd. YAY assigns numbers to every possible state the parser can enter. The 0th state is always the one that describes the parser's condition before it has read any input. Other states are numbered arbitrarily. Sometimes a particular input can only be the start of one construct. For example, the FOR keyword in C can only be the start of a FOR statement, and the FOR statement only has one form. On the other hand, a grammar may have several non- terminal symbols that start the same way. In our sample grammar, all of expr '+' expr expr '-' expr expr '*' expr expr '/' expr start with an "expr". If the parser finds that the input begins with an "expr", the parser has no idea which rule matches the input until it has read the operator following the first "expr". The parser chooses which state it will enter next by looking at the next input token. This token is called the _l_o_o_k_a_h_e_a_d_ _s_y_m_b_o_l for that state. 44..22 DDiiaaggrraammmmiinngg SSttaatteess YAY uses simple diagrams to describe the various states of the parser. These diagrams show what the parser has seen and what it is looking for next. The diagrams are given in the Parser Description report prouced by YAY (see Chapter 6). For example, consider the state where the parser has just read a complete "expr" at the beginning of a larger expression. It is now in a state where it expects to see one of the operators +, -, *, or /, or perhaps the $end marker indicating the end of input. YAY diagrams this state as $accept: expr.$end expr: expr.'+' expr expr: expr.'-' expr expr: expr.'*' expr expr: expr.'/' expr Page 24 November, 1995 Thinkage Ltd. YAY Reference Manual This lists the possible grammar constructs that the parser may be working on. (The first line $accept stands for the Start symbol.) The "." indicates how much the parser has read so far. If the lookahead symbol is *, the parser will switch to a state diagrammed by expr: expr '*'.expr In this state, the parser knows that the next thing to come will be another "expr". This means that the only valid tokens that can be read next are ( or NUM, since those are the only things that start a valid "expr". 44..33 SSttaattee AAccttiioonnss There are several possible actions that the parser can take in a state: (a) _a_c_c_e_p_t the input; (b) _s_h_i_f_t to a new state; (c) _r_e_d_u_c_e one or more input tokens to a single non- terminal symbol, according to a grammar rule; (d) _g_o_t_o a new state; (e) raise an _e_r_r_o_r condition. In order to decide which action to take, the parser checks the lookahead symbol (except in states where the parser can only take one possible action so the lookahead symbol is irrelevant). This means that a typical state has a series of possible actions based upon the possible values of the lookahead symbol. In YAY output, you might see '+' shift 8 '-' shift 7 '*' shift 6 '/' shift 5 ')' shift 9 . error This says that if the parser is in this state and the lookahead symbol is +, the parser will shift to State 8. If November, 1995 Page 25 YAY Reference Manual Thinkage Ltd. the lookahead symbol is -, the parser will shift to State 7, and so on. The "." in the final line stands for any other token not mentioned in the preceding list. If the parser finds any unexpected tokens in this particular state, it will take the error action. The sections that follow explain precisely what each state action means and what the parser does to handle these actions. _4_._3_._1_ _ _T_h_e_ _A_c_c_e_p_t_ _A_c_t_i_o_n The Accept action only happens when the parser is in a state that indicates it has seen a complete input and the lookahead symbol is the end marker $end. When the parser takes the Accept action, "yyparse" terminates and returns a zero to indicate that the input was correct. _4_._3_._2_ _ _T_h_e_ _S_h_i_f_t_ _A_c_t_i_o_n The Shift action happens when the parser is part way through a grammar construct and a new token is read in. As an example, State 4 in our sample parser is diagrammed with expr: expr.'+' expr expr: expr.'-' expr expr: expr.'*' expr expr: expr.'/' expr expr: '(' expr.')' '+' shift 8 '-' shift 7 '*' shift 6 '/' shift 5 ')' shift 9 . error This shows that the parser will shift to various other states depending on the value of the lookahead symbol. For example, if the lookahead symbol is * the parser shifts to State 6 which has the diagram Page 26 November, 1995 Thinkage Ltd. YAY Reference Manual expr: expr '*'.expr NUM shift 2 '(' shift 1 . error expr goto 11 In this new state, the parser has further shifts it can make, depending on the next lookahead symbol. When the parser shifts to a new state, it saves the previous state on a stack called the _s_t_a_t_e_ _s_t_a_c_k. The stack provides a history of the states that the parser has passed through while it was reading input. It is also a control mechanism as will be seen later in this chapter. Paralleling the state stack is a _v_a_l_u_e stack that records the values of tokens and non-terminal symbols encountered while parsing. The value of a token is the "yylval" value returned by "yylex" at the time the token was read. The value of a non-terminal symbol is the $$ value set by the recognition action associated with that symbol's definition. If the definition didn't have an associate recognition action, the value of the symbol is the value of the first item in the symbol's definition. At the same time that the Shift action pushes the current state onto the state stack, it also pushes the "yylval" value of the lookahead symbol (token) onto the value stack. _4_._3_._3_ _ _T_h_e_ _R_e_d_u_c_e_ _A_c_t_i_o_n The Reduce action takes place in states where the parser has recognized all the items that make up a non-terminal symbol. For example, the diagram of State 9 in our sample grammar is expr: '(' expr ')'. . reduce (6) At this point, the parser has seen all three components that make up the non-terminal symbol "expr": '(' expr ')' November, 1995 Page 27 YAY Reference Manual Thinkage Ltd. As the line . reduce (6) shows, it doesn't matter what the lookahead symbol is at this point. The non-terminal symbol has been recognized, and the parser is ready for a Reduce action. (Note that the (6) in the above line just means that the parser has recognized the non-terminal symbol defined in rule (6) of the grammar. We will say more about this in a later chapter.) The Reduce action performs a number of operations. First, it pops states off the state stack. If the recognized non-terminal symbol had N components, a reduction pops N-1 states off the stack. In other words, the parser goes back to the state it was in when it first began to gather the recognized construct. Next, the Reduce action pops values off the value stack. If the definition that is being reduced consisted of N items, the Reduce action conceptually pops N values off the stack. The topmost value on the stack is assigned to $N, the next to $N-1, and so on down to $1. Once the Reduce action has gathered all the $X values, the parser invokes the recognition action that was associated with the grammar rule being reduced. This recognition will use the $X values to come up with a $$ value for the non-terminal symbol. This value is pushed onto the value stack, thereby replacing the N values that were previously on the stack. If the non-terminal symbol had no recognition action, or if the recognition action did not set $$, the parser puts the value of $1 back on the stack. (In reality, the value is never popped off.) Lastly, the Reduce action sets things up so that the lookahead symbol seems to be the non-terminal symbol that was just recognized. For example, it may say that the lookahead symbol is now an "expr" instead of a token. _4_._3_._4_ _ _T_h_e_ _G_o_t_o_ _A_c_t_i_o_n The Goto action is a continuation of the Reduce process. Goto is almost the same as Shift -- the only difference is that the Goto action takes place when the lookahead symbol is a non-terminal symbol while a Shift takes place when the lookahead symbol is a token. Page 28 November, 1995 Thinkage Ltd. YAY Reference Manual For example, State 6 in our sample grammar reads expr: expr '*'.expr NUM shift 2 '(' shift 1 . error expr goto 12 The first time the parser enters this state, the lookahead symbol will be a token and the parser will Shift on the token into some state where it will begin to gather an "expr". When it has a complete "expr", it will perform a Reduce action that will return to this state and set the lookahead symbol to "expr". Now when the parser has to decide what to do next, it sees that it has an "expr" for the lookahead symbol and therefore takes the Goto action and moves to State 12. The Shift action pushes the current state onto the state stack. The Goto does not have to do this -- the state was on the stack already. Similarly, Shift pushes a value onto the value stack, but Goto does not, since the value corresponding to the non-terminal symbol was already put on the value stack by the Reduce action. When the parser reaches the new state, the lookahead symbol will be restored to whatever it was at the time of the Reduce action. Essentially then, a Goto is like a Shift except that it takes place when you come _b_a_c_k to a state with the Reduce action. Also, a Shift is based on the value of a single input token while a Goto is based on a non-terminal symbol. _4_._3_._5_ _ _T_h_e_ _E_r_r_o_r_ _A_c_t_i_o_n The parser takes the Error action when it encounters any input token which cannot validly appear in a particular input location. When this happens, the parser raises the "error" condition. Since error-handling can be quite complicated, we will devote the whole of the next chapter to the subject. November, 1995 Page 29 YAY Reference Manual Thinkage Ltd. 55.. EErrrroorr--HHaannddlliinngg If a piece of input is invalid, the parser can do nothing with it. Except in extreme cases, however, it is inappropriate for the parser to stop all processing as soon as an error is found. Instead, the parser should skip over the invalid input and resume parsing as soon after the error as possible. In this way, the parser can find many syntax errors in one pass through the input, saving time and trouble for the user. YAY therefore tries to generate a parser that can "restart" as soon as possible after an error occurs. YAY does this by letting you specify points at which the parser can pick up after errors. You may also dictate what special processing should take place if an error is encountered at one of these points. 55..11 TThhee ""eerrrroorr"" SSyymmbbooll YAY's error handling facilities use the keyword eerrrroorr to signify an erroneous input. As a result, eerrrroorr may not be used as the name of a user-defined token or non-terminal symbol. You should put "eerrrroorr" in your grammar rules where error recovery might take place. For example, you might write statement: error | /* other definitions of a statement */; This tells YAY that errors may occur in statements, and that after an error, the parser is free to restart parsing at the end of a complete statement. 55..22 TThhee EErrrroorr CCoonnddiittiioonn As noted in the last chapter, YAY will take the Error action if it finds an input that is not valid in a particular location. The Error action has the following steps. (a) See if the current state has a Shift action associated with the eerrrroorr symbol. If it does, shift on this action. (b) If the current state has no such action, pop the state off the stack and check the next state. Also pop off Page 30 November, 1995 Thinkage Ltd. YAY Reference Manual the top value on the value stack, so that the state stack and value stack stay in synch. (c) Repeat (b) until the parser finds a state that can shift on the eerrrroorr symbol. (d) Take the Shift action associated with the eerrrroorr symbol. Note that this pushes the current state on the stack, i.e. the state that is capable of handling errors. No new value is pushed onto the value stack -- the parser keeps whatever value was already associated with the state that can handle errors. When the parser shifts out of the state that can handle errors, the lookahead symbol will be whatever token caused the error condition in the first place. The parser will then try to proceed with normal processing. Of course, it is quite possible that the original lookahead symbol is invalid in the new context. If the lookahead symbol causes an error again, it is discarded and the error condition stays in effect. The parser will continue to read new tokens and discard them until it finds a token that can validly follow the eerrrroorr. The parser will then take whatever action is associated with the valid token. In a typical grammar, the state that has been handling errors will eventually be popped off the stack in a Reduce operation. Notice that the parser always Shifts on the error token. It will never Reduce on eerrrroorr, even if the grammar has a state where eerrrroorr is associated with a Reduce action. In some situations, an error condition will be raised and the parser will pop all the way to the bottom of the state stack without finding a state that can handle the eerrrroorr symbol. For example, the grammar may have no provisions for error recovery. In this case, "yyparse" simply terminates and returns a 1 to its caller. 55..33 EExxaammpplleess As a simple example, consider a parser for a simple desk calculator. All statements end in a semicolon. Thus we might see the rule November, 1995 Page 31 YAY Reference Manual Thinkage Ltd. statement : var '=' expr ';' | expr ';' | error ';' ; When an error occurs in input, the parser will pop back through the state stack until it comes to a state where the eerrrroorr symbol is recognized. For example, the state might be diagrammed as $accept: .statement $end error shift 2 NUM shift 4 . error var goto 7 expr goto 3 statement goto 5 If an error occurs anywhere in an input statement, the parser will pop back to this state, then Shift to State 2. State 2 will look like statement: error.';' . ';' shift 6 . error In other words, the next token must be a semicolon. If it isn't, another error occurs. The parser pops back to the previous state and takes the eerrrroorr shift again. Input will be discarded token by token until a semicolon is found. When the semicolon is found, the parser will be able to shift from State 2 to State 6 which is statement: error ';'. . reduce (3) The erroneous line is reduced to a "statement" non-terminal symbol. Now this example is simple, but it has its drawbacks. It will get you into trouble if the grammar has any concept of block structure or parenthesization. Why? Once an error occurs, the rule statement : error ';' Page 32 November, 1995 Thinkage Ltd. YAY Reference Manual effectively tells the parser to discard absolutely everything until it finds a ; character. If you have a parser for C, for example, it would skip over important characters like ) or } until it found a semicolon. Your parentheses and braces would be out of balance for the rest of the input and the whole parsing process would be a waste of time. The same principle applies to any rule that shows the eerrrroorr token followed by some other non-null symbol: it can lead to hopeless confusion in a lot of grammars. It is safer to write the rule in a form like statement : error | ';' | /* other stuff */ In this case, the eerrrroorr token only matches material until the parser finds something else it recognizes (e.g. the semicolon). Once this happens, the eerrrroorr state will be reduced to a "statement" symbol and popped off the stack. Parsing can then proceed as usual. 55..44 EErrrroorr RReeccooggnniittiioonn AAccttiioonnss The easiest way to generate an error message is to associate a recognition action with the grammar rule that recognizes the error. You can do something simple like statement: error { printf("You made an error!\n"); } or you can be fancier, as in line: error '\n' prompt line { $$ = $4; }; prompt: /* null token */ { printf("Please re-enter line.\n"); }; If an error occurs, the parser skips until it finds a new-line character. After the new-line, it will always find a null token matching "prompt", and the recognition action for "prompt" will print out the message Please re-enter line The final symbol in the rule is another "line", and the action after the eerrrroorr rule shows that the result of the November, 1995 Page 33 YAY Reference Manual Thinkage Ltd. rule ($$) should be the material associated with the second input line. All this means that if the user makes a mistake entering an input line, the parser prints out an error message and accepts a second input line in place of the first. This allows for an interactive user to correct an input line that was incorrectly typed the first time. Of course, this set-up will only work if the user doesn't make an error the second time the line is typed too. If the next token he or she types is also invalid, the parser will discard the token and decide that it's still gobbling up the original error. 55..55 TThhee yyyycclleeaarriinn MMaaccrroo After an Error action, the parser will restore the lookahead symbol to the value it had at the time the error was detected. However, this is sometimes undesirable. For example, your grammar may have a recognition action associated with the eerrrroorr symbol and this may read through the next lot of input until it finds the next sure-to-be- valid data. If this happens, you certainly don't want the parser to pick up the old lookahead symbol again once error recovery is finished. If you want the parser to throw away the old lookahead symbol after an error, put yyclearin ; in the recognition action associated with the eerrrroorr symbol. "yyclearin" is a macro that expands into code that discards the lookahead symbol. 55..66 TThhee yyyyeerrrroorr FFuunnccttiioonn The first thing the parser does when it performs the Error action is to call a function named "yyerror". This happens _b_e_f_o_r_e the parser begins going down the state stack in search of a state that can handle the eerrrroorr symbol. "yyerror" must be supplied by the user -- note that its name is in lower case. The simplest "yyerror" functions either abort the parsing job or just return so that the parser can perform its standard error handling. Page 34 November, 1995 Thinkage Ltd. YAY Reference Manual The "yyerror" function is passed one argument: a character string describing the type of error that just took place. This string is almost always "Syntax error" The only other argument strings that might be passed are "Not enough space for parser stacks" "Parser stack overflow" which are used when the parser runs out of memory for the state stack. Once "yyerror" returns to "yyparse", the parser will proceed popping down the stack in search of a state that can handle errors. If another error is encountered soon after the first, "yyerror" will _n_o_t be called again. The parser considers itself to be in a "potential error" situation until it finds three correct tokens in a row. This avoids the torrents of error messages that often occur as the parser wades through input in search of some recognizable sequence. Once the parser has found three correct tokens in a row, it leaves the "potential error" situation. If a new error is found later on, "yyerror" will be called again. _5_._6_._1_ _ _C_h_a_n_g_i_n_g_ _y_y_c_h_a_r_ _i_n_ _y_y_e_r_r_o_r The "yyerror" function may change the value of "yychar" (the variable that indicates the type of token just read by "yylex"). If "yyerror" does this, "yyparse" immediately gets rid of the error condition and tries to make a shift or reduce using the new "yychar" token. If there is a valid action that can be taken, "yyparse" will proceed as if the error never happened. This behavior can help you deal with at least two kinds of unusual situations: (a) When "yyerror" is attempting to correct a problem by supplying the correct input. For example, it may be possible to determine what input should have been received; the classic example is a compiler which can figure out that a particular statement should have ended in a semicolon. In this case, "yyerror" can supply the missing semicolon by assigning an November, 1995 Page 35 YAY Reference Manual Thinkage Ltd. appropriate value to "yychar". If "yyerror" guessed correctly, action can proceed as if the problem didn't occur. (b) When a particular token can be interpreted in different ways in different contexts. For example, suppose that X can be a keyword in some contexts and a variable name in others. You can write "yylex" so that it always interprets X as a keyword. If "yyerror" is invoked because that keyword is invalid in a particular context, "yyerror" can change "yychar" and see if the input is valid when X is interpreted as a variable name. If "yyerror" changes "yychar", "yyerror" may have to inform "yylex" of the situation so that "yylex" can return the proper input token the next time it is called. 55..77 TThhee yyyyeerrrrffllaagg VVaarriiaabbllee "yyerrflag" is an external integer variable whose value must be 0, 1, 2, or 3. YYABORT (described in a later section) is called if "yyerrflag" does not have one of these values. As we mentioned earlier, "yyparse" doesn't leave its error state until three consecutive valid tokens have been read. "yyerrflag" is used in this process. If "yyerrflag" is zero, "yyerror" will be invoked the next time an error is encountered. As soon as "yyerror" is invoked, it sets "yyerrflag" to 3. Each time a "yyparse" shifts on a valid token, "yyerrflag" is decremented, until it gets to zero. At this point, "yyparse" leaves its error state; if a new error occurs, "yyerror" will be called again. There are two ways in which actions can use "yyerrflag". First, they can check the variable to see if there has been a recent error. For example, if an action encounters a semantic error, it can check "yyerrflag" to see if it is non-zero (meaning that there has been a recent syntax error). If "yyerrflag" is non-zero, the semantic error is almost certainly a result of the previous syntax error. Using this information, the action may choose to modify or suppress the semantic error message. Second, actions can set "yyerrflag" to a value if they want to prevent calls to "yyerror". As long as "yyerrflag" is non-zero, "yyerror" will not be called. This may be Page 36 November, 1995 Thinkage Ltd. YAY Reference Manual useful if an action is smart enough to realize that it is still recovering from a previous error and "yyerror" should not be called again for a while. 55..88 TThhee yyyyeerrrrookk MMaaccrroo In some situations, you may want "yyerror" to be called even if the parser hasn't seen three correct tokens since the last error. For example, suppose you have a parser for a line by line desk calculator. A line of input contains errors, so "yyerror" is called. "yyerror" prints an error message to the user, throws away the rest of the line, and prompts for new input. If the next line contains an error in the first three tokens, the parser will normally start discarding input _w_i_t_h_o_u_t calling "yyerror" again. This means that "yyerror" doesn't print an error message for the user, even though the input line is wrong. To avoid this problem, you can explicitly tell the parser to leave its "potential error" state, even if it hasn't yet seen three correct tokens. Simply say yyerrok ; as part of the error recognition action. For example, you might have the rule expr : error { yyerrok; printf("Please re-enter line.\n"); yyclearin; } "yyerrok" is a macro that is expanded into code that takes the parser out of its "potential error" state and lets it start fresh. More precisely, "yyerrok" sets "yyerrflag" to zero. 55..99 OOtthheerr EErrrroorr SSuuppppoorrtt RRoouuttiinneess YYABORT halts "yyparse" in midstream and immediately returns a 1. To the function that called "yyparse", this means that "yyparse" failed for some reason. YYACCEPT halts the parser in midstream and returns a 0. To the function that called "yyparse", this means that November, 1995 Page 37 YAY Reference Manual Thinkage Ltd. "yyparse" terminated successfully, even if the entire input has not yet been scanned. YYERROR (note that this is upper case) is a macro that "fakes" an error. When YYERROR is encountered in the code, the parser will react as if it just saw an error and will go about recovering from the error. In Chapter 9, we will give an example of how YYERROR can be useful. Page 38 November, 1995 Thinkage Ltd. YAY Reference Manual 66.. YYAAYY OOuuttppuutt YAY can produce several output files. Options on the YAY command line dictate which files are actually generated. The most important output file is the one containing source code that can be compiled into the actual parser. The name of this file is specified with the Parser=file command line option. Another possible output file contains manifest definitions. The name of this file is specified with Header=file on the command line. This file is a distillation of the declarations section of the YAY input. For example, all the %token directives are restated in terms of manifest definitions. %token IF would appear as #define IF 257 in the C version of the manifests (assuming that IF was the first token in the declarations section). By including this file with #include "filename" separately compiled modules can make use of all the pertinent definitions in the YAY input. The third output file that YAY can produce is called the Parser Description. The name of the file is specified with Description=file on the command line. The Parser Description is split into three sections: (a) a summary of the grammar rules; (b) a list of state descriptions; (c) a list of statistics for the parser generated by YAY. In the sections that follow, we will show what the Parser Description looks like for the following grammar: November, 1995 Page 39 YAY Reference Manual Thinkage Ltd. %token IF ELSE A %% stmt : IF stmt ELSE stmt | IF stmt | A ; 66..11 RRuulleess SSuummmmaarryy The rules summary section of the Parser Description begins with the command line used to invoke YAY. This is intended to serve as a "heading" for the output material. We'll use the command line yay infile.y description=out +verbose YAY input is in the file infile.y and output is written to the file out. We have specified the +Verbose option because this provides the largest amount of output. See Appendix C for more information on the YAY command line. Next comes a summary of the grammar rules. In our example, we will have Rules: (0) $accept: stmt $end (1) stmt: IF stmt ELSE stmt (2) stmt: IF stmt (3) stmt: A The 0th rule will always be the definition for a symbol named $accept. This describes what a complete input looks like: the Start symbol followed by the end marker. Other rules are those given in the grammar. YAY puts a formfeed character on the line after the last grammar rule so that the next part of the Parser Description starts on a new page. 66..22 SSttaattee DDeessccrriippttiioonnss The Parser Description output contains complete descriptions of every possible state. For example, here is the description of one state from our sample grammar. Page 40 November, 1995 Thinkage Ltd. YAY Reference Manual State 2 stmt : IF.stmt ELSE stmt stmt : IF.stmt IF shift 2 A shift 1 . error stmt goto 4 By now, this sort of diagram should be familiar to you. The numbers after the word "shift" indicate the state to which the parser will Shift if the lookahead symbol happens to be IF or A. If the lookahead symbol is anything else, the parser raises the error condition and starts error recovery. If the parser pops back to state 2 by means of a Reduce Action, the lookahead symbol will now be "stmt" and the parser will Goto state 4. As another example of a state, here is State 1. State 1 stmt : A. (3) . reduce (3) This is the state that is entered when an A token has been found. The (3) on the end of the first line is a _r_u_l_e _n_u_m_b_e_r. It indicates that this particular line sums up the whole of the third grammar rule that was specified in the YAY input. The line . reduce (3) indicates that no matter what token comes next, we can reduce this particular input using grammar rule (3) and say that we have successfully collected a valid "stmt". The parser will perform a reduction by popping the top state off the stack and setting the lookahead symbol to "stmt". It is important to distinguish between A shift 1 in State 2 and . reduce (3) in State 1. In the Shift instruction, the number that follows is the number of a _s_t_a_t_e. In the Reduce November, 1995 Page 41 YAY Reference Manual Thinkage Ltd. instruction, the number that follows is the number of a _g_r_a_m_m_a_r_ _r_u_l_e (using the numbers given to the grammar rules in the first part of the Parser Description). The Parser Description always encloses rule numbers in parentheses, and leaves state numbers as they are. The complete state descriptions for the grammar are given below. State 0 $accept: .stmt $end IF shift 2 A shift 1 . error stmt goto 3 State 1 (3) stmt: A. . reduce (3) State 2 stmt: IF.stmt ELSE stmt stmt: IF.stmt IF shift 2 A shift 1 . error stmt goto 4 State 3 $accept: stmt.$end $end accept . error State 4 stmt: IF stmt.ELSE stmt (2) stmt: IF stmt. [ $end ELSE ] Shift/reduce conflict (5,2) on ELSE ELSE shift 5 . reduce (2) State 5 stmt: IF stmt ELSE.stmt Page 42 November, 1995 Thinkage Ltd. YAY Reference Manual IF shift 2 A shift 1 . error stmt goto 6 State 6 (1) stmt: IF stmt ELSE stmt. . reduce (1) The parser always begins in State 0, i.e. in a state where no input has been read yet. An acceptable input is a "stmt" followed by the end marker. When a "stmt" has been collected, the parser goes to State 3. In State 3, the required end marker $end indicates that the input should be accepted. If anything else is found, it is excess input and means an error. In State 4, the rule labelled (2) has [ $end ELSE ] on the end. This just means that the parser expects to see one of these two tokens next. 66..33 CCoonnfflliicctt SSuummmmaarryy As noted in the sample output, there is a shift/reduce conflict in State 4. If an ELSE is encountered while the parser is in this state, the parser doesn't know whether to shift (into State 5) or to reduce using rule (2). Thus YAY prints the message Shift/reduce conflict (5,2) on ELSE After the state descriptions, YAY prints a summary of all such conflicts. With our sample grammar, the output contains Conflicts: State Token Action 4 ELSE shift 5 4 ELSE reduce (2) Each line in this summary gives the number of the state in which the conflict occurs, the token(s) that cause the conflict, and the conflicting actions. November, 1995 Page 43 YAY Reference Manual Thinkage Ltd. 66..44 PPaarrsseerr SSttaattiissttiiccss The last section of the Parser Description is a set of statistics summarizing YAY's work. Here are the stats we got when we ran our sample grammar through YAY. 4 rules, 5 tokens, 2 variables, 7 states Memory: max = 8K States: 3 wasted, 4 resets Items: 18, 0 kernel, (2,0) per state, maxival=16 (1 w/s) Lalr: 1 calls, 2 recurs, (0 trans, 12 epred) Actions: 0 entries, gotos: 0 entries Exceptions: 1 states, 4 entries Simple state elim: 0%, Error default elim: 33% Optimizer: in 0, out 0 Size of tables: 24 bytes 2 seconds, final mem = 4 1 shift/reduce conflict Some of these values are machine independent (e.g. the number of rules), others are machine dependent (e.g. the amount of memory used), and some can be different every time you run the job (e.g. time elapsed while YAY was running). The statistic lines are described below. Many of these will be of no interest to the normal user; YAY only generates them for the use of those maintaining the YAY software. A number of the statistics refer to shift/reduce or reduce/reduce conflicts. We will discuss these in a later chapter. 4 rules, 5 tokens, 2 variables, 7 states The four rules are the grammar rules given in the first part of the Parser Description. The five tokens are A, IF, ELSE, $end, and error (which is always defined, even if we don't use it in this grammar). The two variables are the non-terminal symbols, "stmt" and the special $accept. The seven states are states 0 to 6. Memory: max = 8K This gives the maximum amount of dynamic memory that YAY required while producing the parser. This line may also have a "success rate" that tells how often YAY succeeded in having enough memory to handle a situation and how often it had to ask for more memory. States: 3 wasted, 4 resets The algorithm that constructs states from the grammar rules makes a guess at the number of states it will Page 44 November, 1995 Thinkage Ltd. YAY Reference Manual need, very early on in the YAY process. If this guess is too high, the excess states are said to be "wasted". When creating states from the various grammar rules, it is sometimes found that a state from one rule duplicates the state from another (for example, there were two rules that started with IF in our example above). In the final parsing tables, such duplicate states are merged together into a single state. The number of "resets" is the number of duplicate states formed, then merged. Items: 18, 0 kernel, (2,0) per state, maxival=16 (1 w/s) A state is made of items, and the kernel items are an important subset of these: the size of the resulting parsing tables and the running time for YAY are proportional to the number of items and kernel items. The rest of the statistics in this line are not of interest to normal users. Lalr: 1 call, 2 recurs, (0 trans, 12 epred) This gives the number of calls and recursive calls to the conflict resolution routine. The parenthesized figures are related to the same process. In some ways, this is a measure of the complexity of the grammar being parsed. This line will not appear if there are no reduce/reduce or shift/reduce conflicts in your grammar. Actions: 0 entries, gotos: 0 entries This gives the number of entries in the tables "yyact" and "yygo". "yyact" keeps track of the possible Shifts that a program may make and "yygo" keeps track of the "gotos" that take place at the end of states. Exceptions: 1 states, 4 entries This gives the number of entries in the table "yydef", yet another table used in YAY. "yydef" keeps track of the possible Reduce, Accept, and Error actions that a program may make. Simple state elim: 0%, Error default elim: 33% The percentage figures indicate how much table space could be saved through various optimization processes. The better written your grammar, the greater the percentage of space that can be saved. Therefore, high percentages here are an indication of a well-written grammar. November, 1995 Page 45 YAY Reference Manual Thinkage Ltd. Optimizer: in 0, out 0 More optimization statistics: not of interest to normal YAY users. Size of tables: 24 bytes The size of the tables generated to represent the parsing rules. This size is given in bytes on the host machine, so it will be inaccurate if a cross-compiler is being used on the eventual source code output. The size does not include stack space used by "yyparse" or debug tables obtained by defining YYDEBUG. 2 seconds, final mem = 4 Total real time that YAY used to produce the parser, and the final dynamic memory of the parser (in K bytes). 1 shift/reduce conflict Number of conflicts in the grammar. Page 46 November, 1995 Thinkage Ltd. YAY Reference Manual 77.. TTyyppeess Earlier we mentioned that "yylval" is iinntt by default, as are $$, $1, $2, etc. If you want these to have different types, you may redeclare them in the declarations section of the YAY input. This is done with a statement of the form %union { /* possible types for "yylval" and * $$, $1, $2, etc. */ } For example, suppose "yylval" can be either integer or floating point. You might write %union { int intval; float realval; } in the declarations section of the YAY input. YAY converts the %union statement into the following C source. typedef union { int intval; float realval; } YYSTYPE; "yylval" is always declared to have type YYSTYPE. If no %union statement is given in the YAY input, it will use #define YYSTYPE int Once YYSTYPE has been defined as a union, you may specify a particular interpretation of the union by including a statement of the form %type <interpretation> SYMBOL in the declarations section of the YAY input. The "interpretation" enclosed in the angle brackets is the name of the interpretation of the union variable that you want to use. The SYMBOL name is the name of a non-terminal symbol defined in the grammar rules. For example, you might write %type <intval> intexp %type <realval> realexp to indicate that an integer expression has an integer value and a real expression has a floating point value. November, 1995 Page 47 YAY Reference Manual Thinkage Ltd. Tokens may also be declared to have types. This is done in the %token statement in the declaration section, and again shows the union interpretation in angle brackets, e.g. %token <realval> floatnum If you use types in your YAY input, YAY will enforce compatibility of types in all expressions. For example, if you write $$ = $2 in an action, YAY will demand that the two corresponding tokens have the same type; otherwise, the assignment will be marked as invalid. The reason for this is that YAY must always know what interpretation of the union is being used in order to generate correct code. TThhee DDeeffaauulltt AAccttiioonn The default action associated with any rule can be written as $$ = $1 which means that the value of associated with $1 on the value stack is assigned to $$ on the value stack when the rule is reduced. If, for example, $1 is an integer, then $$ will be the same integer after the reduction occurs. On the other hand, suppose that the recognition action associated with a rule explicitly states $$ = $1 This explicit assignment may not have the same effect as the implicit assignment. For example, suppose that you define %union { float floatval; int intval; } Also suppose that the type associated with $$ is "floatval" and the type associated with $1 is "intval". Then the explicit statement $$ = $1 Page 48 November, 1995 Thinkage Ltd. YAY Reference Manual performs an integer to floating point conversion when the value of $1 is assigned to $$, whereas the implicit statement did an integer to integer assignment and did _n_o_t perform this conversion. You must therefore be careful and think about the effects of implicit vs. explicit assignments. November, 1995 Page 49 YAY Reference Manual Thinkage Ltd. 88.. AAmmbbiigguuiittiieess Suppose we have a grammar with the rule expr : expr '-' expr ; and the parser is reading an expression of the form expr - expr - expr The parser reads this token by token, of course, so after three tokens it will have expr - expr The parser recognizes this form. In fact, the parser could reduce this right away into a single "expr" according to the given grammar rule. However, the parser has a problem. At this point, the parser doesn't know what comes next, and perhaps the entire line will be something like expr - expr * expr If it is, the precedence rules say that the multiplication should be performed before the subtraction, so handling the subtraction first is incorrect. The parser must therefore read another token to see if it is really all right to deal with the subtraction now, or if the correct action is to skip the subtraction for the moment and deal with whatever follows the second "expr". In terms of parser states, this problem boils down to a choice between _r_e_d_u_c_i_n_g the expression expr - expr or _s_h_i_f_t_i_n_g and acquiring more input before making a reduction. This is known as a _s_h_i_f_t_/_r_e_d_u_c_e_ _ _c_o_n_f_l_i_c_t_. Sometimes a parser must also choose between two possible reductions. This kind of situation is called a _r_e_d_u_c_e_/_r_e_d_u_c_e_ _c_o_n_f_l_i_c_t_. In case you are curious, there is no such thing as a shift/shift conflict. To see why this is impossible, suppose that we have the following definitions. Page 50 November, 1995 Thinkage Ltd. YAY Reference Manual thing : a b | a c ; b : T rest_of_b; c : T rest_of_c; If the parser is in the state where it has seen "a", you would have the diagram thing : a.b thing : a.c You might think that if the lookahead symbol was the token T, the parser would be confused, since T is the first token of both "b" and "c". However, there is no confusion at all. The parser just Shifts to a state that we could diagram as b : T.rest_of_b c : T.rest_of_c 88..11 RReessoollvviinngg CCoonnfflliiccttss bbyy PPrreecceeddeennccee The precedence directives (%left, %right, and %nonassoc) let YAY-produced parsers resolve shift/reduce conflicts in an obvious way: (a) The precedence of a Shift operation is defined to be the precedence of the token on which the Shift takes place. (b) The precedence of a Reduce operation is defined to be the precedence of the grammar rule that the Reduce operation uses. If you have a shift/reduce conflict, and the Shift and Reduce operations both have a precedence, the parser chooses the operation with the higher precedence. 88..22 DDiissaammbbiigguuaattiinngg RRuulleess Precedence cannot resolve conflicts if one or both conflicting operations have no precedence. For example, consider the following. statmt: IF '(' cond ')' statmt | IF '(' cond ')' statmt ELSE statmt ; Given this rule, how should the parser interpret the input November, 1995 Page 51 YAY Reference Manual Thinkage Ltd. IF ( cond1 ) IF ( cond2 ) statmt1 ELSE statmt2 There are two equally valid interpretations of this input: IF ( cond1 ) { IF ( cond2 ) statmt1 ELSE statmt2 } and IF ( cond1 ) { IF ( cond2 ) statmt1 } ELSE statmt2 In a typical grammar, the IF and IF-ELSE statements would not have a precedence, so precedence could not resolve the conflict. Thus consider what happens at the point when the parser has read IF ( cond1 ) IF ( cond2 ) statmt1 and has just picked up ELSE as the lookahead symbol. (1) It can immediately Reduce the IF ( cond2 ) statmt1 using the first definition of "statmt" and obtain IF ( cond1 ) statmt ELSE ... thereby associating the ELSE with the first IF. (2) It can Shift, which means ignoring the first part (the IF with "cond1") and going on to handle the second part, thereby associating the ELSE with the second IF. In this case, most programming languages choose to associate the ELSE with the second IF, i.e. they want the parser to Shift instead of Reduce. Because of this (and other similar situations), YAY-produced parsers are designed to use the following rule to resolve shift/reduce conflicts. _D_i_s_a_m_b_i_g_u_a_t_i_n_g_ _R_u_l_e_ _1_: If there is a shift/reduce conflict in situations where no precedence rules have been created to resolve the conflict, the default action is to Shift. The conflict will also be reported in the YAY output so you can Page 52 November, 1995 Thinkage Ltd. YAY Reference Manual check that Shifting is actually what you want. If it isn't what you want, the grammar rules will have to be rewritten. This is known as a disambiguating rule because it helps YAY-produced parsers resolve ambiguities. The rule is only used in situations where precedence rules cannot resolve the conflict. If both the Shift operation and the Reduce operation have an assigned precedence, the parser can compare precedences and decide which operation to perform first. Even if the precedences are equal, the precedences must have originated from either %left, %right, or %nonassoc, so the parser knows how to handle the situation. The only time the disambiguating rule is needed is when one or both of the Shift or Reduce operations does not have an assigned precedence. In a similar vein, YAY-produced parsers use the following rule to resolve reduce/reduce conflicts. _D_i_s_a_m_b_i_g_u_a_t_i_n_g_ _R_u_l_e_ _2_: If there is a reduce/reduce conflict, the parser will always Reduce by the rule that was given _f_i_r_s_t in the rules section of the YAY input. Again, the conflict will be reported in the YAY output so that users may ensure that the choice of Reduction is correct. Note that precedence is _n_o_t consulted in reduce/reduce conflicts. YAY always Reduces by the earliest grammar rule, regardless of precedence. Finally, YAY-produced parsers use a third disambiguating rule to prevent excessive errors. _D_i_s_a_m_b_i_g_u_a_t_i_n_g_ _R_u_l_e_ _3_: If the parser is asked to choose between shifting on eerrrroorr or on any other item, it will choose the non- error item. The disambiguating rules are simple to state, but they can have complex repercussions if the grammar is non- trivial. If the grammar is sufficiently complicated, these simple rules for resolving conflicts may not be capable of handling all the necessary intricacies in the way you want. You should pay close attention to all conflicts noted in the parsing table report produced by YAY and should ensure that the default actions taken by the parser are the desired ones. NNoottee:: the conflict between the rules November, 1995 Page 53 YAY Reference Manual Thinkage Ltd. statmt: IF '(' cond ')' statmt | IF '(' cond ')' ELSE statmt ; can be resolved by adding %left ELSE to the declarations section of the YAY input. This puts a precedence on the ELSE action and says that it is left- associative. This is another example of using a precedence to resolve an ambiguity. 88..33 CCoonnfflliiccttss iinn YYAAYY OOuuttppuutt If your grammar has shift/reduce or reduce/reduce conflicts, you will also see a table of conflicts in the statistics section of the Parser Description. For example, if we change the rules section of our sample grammar to stmt : IF stmt ELSE stmt | IF stmt | stmt stmt | A ; we get the following conflict report. Conflicts: State Token Action 5 IF shift 2 5 IF reduce (3) 5 A shift 1 5 A reduce (3) This shows that State 5 has two shift/reduce conflicts. If the parser is in State 5 and encounters an IF token, it can shift to state 2 or reduce using rule 3. If the parser encounters an A token, it can shift to state 1 or reduce using rule 3. This is summarized in the final statistics with the line 2 shift/reduce conflicts Reading the conflict report shows you what action the parser takes in case of a conflict -- the parser always takes the _f_i_r_s_t action shown in the report. This action will be chosen in accordance with the two disambiguating rules. Page 54 November, 1995 Thinkage Ltd. YAY Reference Manual 99.. AAddvvaanncceedd FFeeaattuurreess In this chapter, we describe more specialized features of YAY. 99..11 LLAALLRR((22)) GGrraammmmaarrss An LALR(2) grammar has one more level of "lookahead" than an LALR(1) grammar. When trying to decide how to parse a given input, an LALR(1) parser sometimes looks at the next input token to see if that makes a difference. In the same situation, an LALR(2) parser can look at the next _t_w_o tokens. LALR(2) grammars are described in the same way as LALR(1) grammars. To indicate that you want an LALR(2) parser, you must put the option +LR2 on the command line that invokes YAY. An LALR(2) parser can perform a _l_o_o_k_a_h_e_a_d action as well as Shift, Reduce, Goto, Accept, and Error. For example, in a state diagram you might see A shift 1 B reduce (2) C lookahead This means that if token A is encountered in this state, the parser will shift to State 1; if token B is encountered, it will reduce using Rule (2); and if token C is encountered, it will lookahead. _9_._1_._1_ _ _T_h_e_ _L_o_o_k_a_h_e_a_d_ _A_c_t_i_o_n A state that has a Lookahead action has a list of possible states the parser can enter next. These states conflict with each other -- the Lookahead action isn't needed if there is no conflict. The parser attempts to resolve the conflict by reading one additional token. If the grammar is set up properly, this token will be valid in only one of the possible contexts, so the parser can choose that rule instead of the other possibilities. Thus the Lookahead action tests all the possibilities in the list. It begins by making a private copy of the state stack. (Actually, it just sets things up so that it looks like it has a separate copy of the state stack -- it doesn't really make a copy.) Next, it calls on a routine named November, 1995 Page 55 YAY Reference Manual Thinkage Ltd. "yy2lex" (described later) to obtain a second lookahead token. The Lookahead action then begins going through the possible rules on its list. Some of these rules may be Shifts, while others are Reduces -- the reductions are what make it necessary to simulate a copy of the state stack, because you don't want to play with the true stack. The Lookahead action takes the action associated with each possibility in the list, entering new states through Shifts or Reduces. It then sees if the second lookahead token causes an error in this new state. If an error occurs, the parser pops back to the original Lookahead state and checks the next possibility on the list. When it finally finds a rule where the second lookahead symbol doesn't cause an error, it pops back to the original Lookahead state one last time. It gets rid of the stack set-up that the Lookahead action was using and returns to the old state stack. It then takes the action that it has discovered in the list of possibilities. Note that the Lookahead action takes the first possible rule where the second lookahead symbol is valid. It is, of course, possible that there are other valid possibilities in the list. The order of the list of possibilities is based on the standard disambiguating rules: a shift comes first, followed by reductions in the order in which the corresponding grammar rules were given. _9_._1_._2_ _ _T_h_e_ _y_y_2_l_e_x_ _R_o_u_t_i_n_e In order to perform the lookahead, "yyparse" calls a routine called "yy2lex". This is a user-written lexical analyzer, just as "yylex" is. It returns the same kind of token value that "yylex" does. However, "yy2lex" should not set the "yylval" value. For some applications, "yy2lex" could even be the same routine as "yylex". There are good reasons why the two should be separate, however. Suppose, for the sake of argument, that "yyparse" called "yylex" for all input. "yyparse" might call "yylex" for the next token, then immediately call "yylex" again for a lookahead. This second call could overwrite important values like "yylval" before they had been used, thereby confusing things greatly. For such reasons, "yy2lex" and "yylex" sometimes have to be different routines. Page 56 November, 1995 Thinkage Ltd. YAY Reference Manual In practice, parsers rarely call "yy2lex" -- the routine is only needed in special circumstances. This means that "yy2lex" usually can be much simpler than "yylex": "yylex" must be able to handle all possible input tokens, while "yy2lex" is only called in a few special cases. To find out the tokens that "yy2lex" must be able to handle, run your grammar through YAY and get a list of the states where the parser performs a Lookahead instead of a Shift or Reduce. These instances are the only ones where the parser will call "yy2lex", so the associated tokens are the only ones that the routine has to handle. The values obtained by "yy2lex" are thrown away once the parser has used them in its lookahead. What this means is that the parser doesn't remember what "yy2lex" returned, so it will call on "yylex" to re-read the token for normal parsing. This means one of two things: (a) "yy2lex" can duplicate the processing that "yylex" does, then leave the pertinent information around for "yylex" to obtain. This means that "yylex" must have a way of knowing when "yy2lex" has already read the next token (e.g. "yy2lex" can set a flag). (b) "yy2lex" can cheat and only do minimal processing. For example, "yy2lex" may only have to look at the first character of the next token to figure out what is coming next. "yy2lex" can then "ungetc" that character and return an appropriate token value. When "yylex" is called, it reads the full token in the usual manner. The second approach is sufficient for most grammars. It simplifies both "yylex" and "yy2lex". _9_._1_._3_ _ _C_o_n_f_l_i_c_t_s With an additional level of lookahead, there is the potential for a staggering number of conflicts. For example, you may lookahead to a new state to resolve one conflict, only to find that the new state also has a lookahead to resolve a conflict. If an LALR(2) parser is performing a lookahead and enters a state where the secondary lookahead token invokes another Lookahead action, the parser ruthlessly resolves the second Lookahead action by choosing the first possibility on the second Lookahead list. This is not very elegant, so you should avoid this double lookahead situation. November, 1995 Page 57 YAY Reference Manual Thinkage Ltd. Conflicts of this nature are reported in the usual conflict table at the end of the State Description output. 99..22 MMuullttiippllee AAccttiioonnss A rule can have more than one action. For example, you might have a : A1 {action1} A2 {action2} A3 {action3}; The non-terminal symbol "a" consists of symbols A1, A2, and A3. When A1 is recognized, "action1" is invoked; when A2 is recognized, "action2" is invoked; and when A3 is recognized (and therefore the entire symbol "a"), "action3" is invoked. In this case, $1 -- is the value of A1 $2 -- is the value of $$ in action1 $3 -- is the value of A2 $4 -- is the value of $$ in action2 $5 -- is the value of A3 If types are involved, multiple actions become more complicated. If "action1" mentions $$, there is no way for YAY to guess what type $$ has, since it is not really associated with a token or non-terminal symbol. You must therefore state it explicitly by specifying the appropriate type name in angle brackets between the two $ signs. If we had %union { int intval; float realval; } you might say $<realval>$ in place of $$ in the action, to show that the result had type ffllooaatt. In the same way, if "action2" refers to $2 (the result of action1), you might say $<realval>2 To deal with multiple actions, YAY changes the form of the given grammar rule and creates grammar rules for dummy symbols. The dummy symbols have names made up of a $ followed by the rule number. For example, Page 58 November, 1995 Thinkage Ltd. YAY Reference Manual a : A1 {action1} A2 {action2} A3 {action3}; might be changed to the rules $21 : /* null */ {action1} ; $22 : /* null */ {action2} ; a : A1 $21 A2 $22 A3 {action3}; These rules will be shown in the Rules Summary of the Parser Description report. Note that this technique can introduce conflicts. For example, suppose you have the definitions a : A1 {action1} A2 X; b : A1 A2 Y; These are changed to $50 : /* null */ {action1}; a : A1 $50 A2 X; b : A1 A2 Y; The definitions of "a" and "b" give a shift/reduce conflict because the parser can't tell whether A1 followed by A2 has the null string for $50 in the middle. It has to decide whether to Reduce $50 or to Shift to a state diagrammed by b : A1 A2.Y This particular conflict could be resolved by using an LALR(2) parser instead of LALR(1). However, there are more complicated examples in which this is not possible. There is another situation to watch for. Consider the grammar a : A1 c A2 X ; b : A1 c A2 Y ; c : /* nothing */ {action} ; You might think that this is equivalent to a : A1 {action} A2 X; b : A1 {action} A2 Y; but it is not. The first will not produce a conflict, but the second one will. The second is converted into a : A1 $50 A2 X; b : A1 $51 A2 Y; November, 1995 Page 59 YAY Reference Manual Thinkage Ltd. $50 : {action} ; $51 : {action} ; even when the action is the same for both "a" and "b". If the parser finds an A1 followed by a null string followed by A2, it doesn't know if it should interpret the null string as $50 or $51; therefore, it gets a conflict. Obviously, it is better to use the first format, where the identical actions are associated with a separate rule. 99..33 SSeelleeccttiioonn PPrreeffeerreennccee A _s_e_l_e_c_t_i_o_n_ _p_r_e_f_e_r_e_n_c_e can be added to a grammar rule to help resolve conflicts. The following input shows a simple example of how a selection preference can resolve conflicts between two rules. a1 : a b ['+' '-'] { /* Action 1 */ } ; a2 : a b { /* Action 2 */ } ; The selection preference is indicated by zero or more _t_o_k_e_n_s inside square brackets. If the token that follows the "b" is one of the tokens inside the square brackets, the parser will use the grammar rule for "a1". If the token that follows the "b" is not one of the given tokens, the parser will use the rule for "a2". In this way, the conflict between the two rules is resolved -- the preference tells which one to select, depending on the value of the lookahead token. Notice that a selection preference states that a rule _s_h_o_u_l_d be used when the next token is one of the ones listed in the brackets and should _n_o_t be used if it is not in the brackets. The lookahead token is merely used to determine which rule to select. It is _n_o_t part of the rule itself. For example, suppose you have a1 : a b ['+' '-'] ; a2 : a b ; xx : a1 op expr ; and suppose you have an "a", a "b", and + as the lookahead token. The + indicates that the "a" and "b" should be reduced to "a1". The parser does this and finds that the "a1" is part of the "xx" rule. The + lookahead token will be associated with the "op" symbol in the "xx" rule. In Page 60 November, 1995 Thinkage Ltd. YAY Reference Manual other words, a selection preference does not "use up" an input token; it just looks at the token value to help resolve a conflict. The $end symbol may be used inside square brackets to indicate end-of-file in the input being parsed. For example, statement : line ['\n' $end] shows that this rule is preferred if a "line" is followed by a new-line character or the end of the file. The square brackets of a selection preference may contain no tokens, as in x : y z []; This says that the parser should never use this rule unless it cannot be avoided. When there are many selection preferences in the same state, the parser must compare them to check for conflicts. To do this, YAY chooses the most likely rule (based on preference) and compares this to the other possible rules. In order to understand what we mean by "most likely", an example will help. Consider the grammar a : b [ 'x' ] | b [ 'y' ] | b [ ] ; YAY must consider what happens when a "b" symbol has been encountered. If the "b" is followed by an 'x', the most likely rule is a : b [ 'x' ] If the "b" is followed by a 'y', the most likely rule is a : b [ 'y' ] If the "b" is followed by any other token, the most likely rule is a : b [ ] Once the most likely rule has been chosen, it will be compared to the other rules in the set to make sure that there are no conflicts. (In a previous release, YAY would November, 1995 Page 61 YAY Reference Manual Thinkage Ltd. compare every pair of rules, without choosing a most likely one. In this case, the rules a : b [ 'x' ] a : b [ 'y' ] would conflict with each other, because there was no way to choose between the two if the next token was not 'x' or 'y'. In this release, YAY will compare the most likely rule to all the others, but will not compare all the possible pairs.) Selection preferences can also be stated using the construct [^ T1 T2 T3 ...] where the first character is a caret (^) and T1, T2, etc. are tokens. When this is put on the end of a rule, it indicates that the rule should be used if the lookahead token is _n_o_t one of the listed tokens. For example, a1 : a b { /* Action 1 */ } ; a2 : a b [^ '+' '-'] { /* Action 2 */ } ; says that rule "a2" should be used if the token after the "b" is _n_o_t + or -. If the token is + or -, "a2" should not be used (so "a1" will be). The construct [^] is the reverse of []. It is used to indicate a rule that should _a_l_w_a_y_s be taken unless there is another rule that must be taken because of a selection preference. Selection preference constructs can be put in the middle of rules as well as on the end. For example, we could write expr : expr ['+' '-'] op expr { /* Action 1 */ } | expr op expr { /* Action 2 */ } ; This states that if the first "expr" is followed by + or - you want to use the first rule; otherwise, you want to use the second. Note that the preference does not use up the + or - token: you still need a symbol ("op") to represent such tokens. Page 62 November, 1995 Thinkage Ltd. YAY Reference Manual Selection preferences that appear in the middle of a rule are implemented in the same way as multiple actions, using dummy rules. The above example would result in something like $23 : ['+' '-'] ; expr : expr $23 op expr { /* Action 1 */ } | expr op expr { /* Action 2 */ } ; (where the 23 in $23 is just a number we chose at random). The dummy rule that is created is a null string with the selection preference on the end. The first token for "op" will be the + or - that was the lookahead token in rule $23. If a selection preference in the middle of a rule is immediately followed by an action, only one dummy rule is created to handle both the action and the preference. In most cases, a selection preference counts as a $N symbol, but it has no associated value. For example, in expr : expr ['+' '-'] op expr we have $1 -- first "expr" $2 -- no value $3 -- "op" $4 -- second "expr" If the preference is followed by an action, the preference and the action count as a single $N symbol whose value is equal to the $$ value of the action. For example, in expr : expr ['+' '-'] {action} op expr we have $1 -- first "expr" $2 -- $$ of action $3 -- op $4 -- second "expr" The %%pprreecc construct is incompatible with rules that contain selection preferences, because the preference is all that is needed to resolve conflicts. For this reason, YAY issues an error message if a rule contains both a preference and the %%pprreecc construct. November, 1995 Page 63 YAY Reference Manual Thinkage Ltd. Selection preferences can be used to resolve most conflicts. Indeed, there may be cases where the most practical course of action is to write a number of conflicting rules that contain selection preferences to resolve the conflicts, as in expr : expr ['+' '-'] op expr | expr ['*' '/' '%'] op expr | expr ['&' '|'] op expr ... NNoottee:: selection preferences of the form [error] [^ error] are not useful. Selection preferences are implemented through (dummy) Reduce actions, but the parser's error- handling routines always look for Shift actions and ignore Reductions. 99..44 NNeeggaattiivvee NNuummbbeerrss iinn $$NN CCoonnssttrruuccttss YAY lets you use constructs like $0, $-1, $-2, and so on in recognition actions. These were once important, but the techniques for specifying multiple actions have made them obsolete. YAY only supports the constructs for compatibility with older grammars. To understand what these constructs mean, it is important to think in terms of the state stack. Each $N construct is associated with a state on the stack; the value of $N is the value of the token or non-terminal symbol associated with the state at the time of a Reduce operation. (Recall that recognition actions are executed when the appropriate Reduce action takes place.) $1 is the value associated with the state that found the first component of the grammar rule; $2 is the value associated with the second state, and so on. $0 is the value associated with the state that was on top of the stack before the first component of the grammar rule was found. $-1 is the value associated with the state before that, and so on. All of these states are still on the stack, and their value can be obtained in this way. Page 64 November, 1995 Thinkage Ltd. YAY Reference Manual As an artificial example, suppose that a grammar has the rules stmt : IF condition stmt | WHILE condition stmt condition : /* something */ { /* action */ } The action associated with the condition can use the $-1 construct to find out if the preceding token was IF or WHILE. (Of course, this assumes that the only items that can precede a condition are the IF and WHILE tokens.) There are occasionally times when this sort of information is needed. 99..55 LLiissttss aanndd HHaannddlliinngg NNuullll SSttrriinnggss Grammars often define lists of items. There are two common ways to do this: list : item | list item ; or list : /* null */ | list item ; The first definition means that every "list" has at least one item. The second allows zero-length lists. Using the second definition is sometimes necessary or convenient, but it can lead to difficulties. To understand why, consider a grammar with list1 : /* null */ | list1 item1 ; list2 : /* null */ | list2 item2 ; list : list1 | list2 ; When the parser is in a position to look for a "list", it automatically finds a null string, then gets a conflict because it can't decide if the null string is an instance of "list1" or "list2". This problem is less likely to happen if you define November, 1995 Page 65 YAY Reference Manual Thinkage Ltd. list1 : item1 | list1 item1 ; list2 : item2 | list2 item2 ; list : /* null */ | list1 | list2 ; The parser can determine if it has a "list1" or a "list2" by seeing if the list starts with "item1" or "item2". A YAY-produced parser avoids infinite recursions that might result from matching the same null string over and over again. If the parser matches a null string in a particular state, goes through a few more states and shifts once more into the state where the null string was matched, it will not match the null string again. Without this behaviour, you get infinite recursions on null strings. However, the behaviour occasionally gets in the way if you _w_a_n_t to match more than one null string in a row. For example, consider how you might write the grammar rules for types that may be used in a C cast operation, as in char_ptr = (char *) float_ptr; The rules for the parenthesized cast expression might be written as cast : '(' basic_type opt_abstract ')' ; opt_abstract : /* null */ | abstract; abstract : '(' abstract ')' | opt_abstract '[' ']' | opt_abstract '(' ')' | '*' opt_abstract ; Consider what happens with a cast like (int *[]) This will be interpreted as * followed by a null "opt_abstract" followed by a null "opt_abstract" followed by square brackets. However, the parser will _n_o_t accept two null "opt_abstracts" in a row, and will take some other course of action. To correct this problem, you must rewrite the grammar rules. Instead of using the "opt_abstract" rules, have rules with and without an "abstract". Page 66 November, 1995 Thinkage Ltd. YAY Reference Manual cast : '(' basic_type abstract ')' ; abstract : /* null */ | abstract '[' ']' | '[' ']' | abstract '(' ')' | '(' ')' | '*' abstract | '*' ; 99..66 RRiigghhtt vvss.. LLeefftt RReeccuurrssiioonn Chapter 3 mentioned left and right recursion. For example, if a program consists of a number of statements separated by semicolons, we might define it with right recursion as program : statement | statement ';' program ; or with left recursion as program : statement | program ';' statement ; If you think about the way that the state stack works, you will see that the second way is much to be preferred. Consider, for example, the way something like S1 ; S2 ; S3 ; S4 would be handled (where all the Sn's are statements). With right recursion, the parser would gather "S1;" and then go looking for a program. To gather this program, it would gather S2. It would then look at the lookahead symbol ; and see that this program had the form statement ';' program The parser would then go ahead and gather the program after the semicolon. But after S3, it would find another semicolon, so would begin gathering yet another program. If you work the process through, you will find that the state stack grows to seven entries (one for each Sn and one for each ;) before the first Reduce takes place. November, 1995 Page 67 YAY Reference Manual Thinkage Ltd. On the other hand, if you have the left recursion program : program ';' statement and the same input, the parser will perform a Reduce as soon as it sees S1 ; S2 This is Reduced to a single state corresponding to the non- terminal symbol "program". The parser reads ";S3" and Reduces program ; S3 to "program" again. The process repeats for the last statement. If you follow this through, the state stack never grows longer than three states, as compared to the seven that are required for the right recursive rule. With right recursion, no reduction takes place until the entire list of elements has been read; with left recursion, a reduction takes place as each new list element is encountered. Left recursion can therefore save a lot of stack space. The choice of left or right recursion can also affect the order in which recognition actions are performed. Suppose T is a token. If we define x : /* null */ | y ',' x {a1} ; y : T {a2} ; then the input T , T , T would execute recognition actions in the order {a2} {a2} {a2} {a1} {a1} {a1} The {a2} actions are performed each time a T is Reduced to "y". The {a1} actions don't happen until the entire list has been read, because right recursion reads the entire list before any Reduce actions take place. Page 68 November, 1995 Thinkage Ltd. YAY Reference Manual On the other hand, if you define x : /* null */ | x ',' y {a1} ; y : T {a2}; the recognition actions for the same input will take place in the order {a2} {a1} {a2} {a1} {a2} {a1} With left recursion, Reduce actions take place every time a new element is read in for the list. This means that if you want the action order {a2} {a2} {a2} {a1} {a1} {a1} you must use right recursion even though it takes more stack space. 99..77 YYYYDDEEBBUUGG If you ##ddeeffiinnee a symbol named YYDEBUG in the declarations section and set the variable "yydebug" to a non-zero value, your parser will print a good deal of debugging information as it parses input. In this description, we will describe the output you may see. Every time "yylex" obtains a token, the parser prints read T (VALUE) T is the name of the token and VALUE is the numeric value. Thus if "yylex" has read an IF token, you might see read IF (257) Every time the parser enters a state, it will print out state N (X), char (C) where N is the state number as given in the State Description report, and X and C are other integers. X is another number for the state -- YAY actually renumbers the states and grammar rules after it generates the State Description report in order to improve the parser's efficiency, and X gives the state number after renumbering. C is the token type of the lookahead symbol if the symbol is November, 1995 Page 69 YAY Reference Manual Thinkage Ltd. a token. If the symbol is not a token, or if there is no lookahead symbol at the moment, C is -1. As an example, state 6 (22), char (-1) indicates that the parser has entered State 6 on the State Description Report (State 22 after renumbering) and that the current lookahead symbol is not a token. Every time the parser performs a Shift action, it prints shift N (X) where N is the number of the state to which the parser is shifting and X is the number of the same state after renumbering. Every time the parser performs a Reduce action, it prints reduce N (X), pops M (Y) This says that the parser has Reduced by grammar rule N (renumbered to X). After the reduction, the state on top of the state stack was State M (renumbered to Y). 99..88 IImmppoorrttaanntt SSyymmbboollss Debugging a parser produced by YAY can be a very difficult task, since the code is only partly produced by user input. The remaining code is standard software produced by YAY itself. The situation is aggravated by another problem: the state and rule numbers shown in the State Description report are not the same as the numbers used when the parser actually runs. In the interests of optimization, the parser sorts the states into a more convenient order. Thus the _i_n_t_e_r_n_a_l state number used by the program is usually not the same as the _e_x_t_e_r_n_a_l state number known to the user. To help those who are examining parser code using a symbolic debugger, here are a few of the important variables that the parser uses. yyval holds the value $$ at the time of a reduction. This has the type YYSTYPE. yychar holds the most recent token value returned by "yylex". Page 70 November, 1995 Thinkage Ltd. YAY Reference Manual yystate is the _i_n_t_e_r_n_a_l number of the current state. yyps points to the current top of the state stack. Thus yyps[0] is the internal number of the current state, yyps[-1] is the internal number of the previous state, and so on. yypv points to the top of the current value stack. The entries in this stack have the type YYSTYPE. When a Reduce operation executes a recognition action, this pointer is moved down the stack to the point where yypv[1] = $1 yypv[2] = $2 etc. yyi is the internal number of the rule being reduced by a Reduce action. yyrmap is an array present only when YYDEBUG is defined. It is used to convert internal rule numbers to external ones. For example, yyrmap[yyi] is the external number of the rule being reduced by a Reduce action. yysmap is an array present only when YYDEBUG is defined. It is used to convert internal state numbers to external ones. For example, yysmap[yystate] is the external number of the current state. 99..99 HHooww YYYYEERRRROORR MMaayy BBee UUsseedd The YYERROR macro creates an artificial error condition. To show how this can be useful, suppose we have a line-by- line desk calculator that allows parenthesization of expressions and suppose we have a variable named "depth" that keeps track of how deeply parentheses are nested. November, 1995 Page 71 YAY Reference Manual Thinkage Ltd. Every time the parser finds an opening parenthesis, it adds 1 to "depth". Every time it finds a closing parenthesis, it subtracts 1. Consider how the following definitions will work. expr : lp expr ')' {depth--;} | lp error {depth--;} ; lp : '(' {depth++;}; If no error occurs, the "depth" variable is incremented and decremented correctly. If an error does occur, however, what happens? Your "yyerror" routine is called on to recover from the error in the middle of an expression. Often, it is more reasonable to postpone this recovery until you reach a point where you have a whole expression. Therefore, you might use the following alternate definition. expr : lp error {depth--; YYERROR;} ; line : error '\n' prompt line { $$ = $4; } ; prompt : /* null token */ {printf("Please re-enter line.\n");}; Now, what happens when the grammar is asked to parse a line like 1 + (( a + When the end of the line is encountered, the parser recognizes an error has occurred. Going up the stack, the first state ready to handle the error is expr : lp error ; At this point, the parser will Reduce the input ( a + into an "expr". The Reduction executes the recognition action: it decrements "depth", then signals that an error has taken place. The Error action begins popping the stack again. It will find the previous opening parenthesis, recognize another lp error Page 72 November, 1995 Thinkage Ltd. YAY Reference Manual construct and perform another reduction. The parenthesis count is again decremented, and another error condition generated. This time, the grammar rule that deals with the error is the definition of "line". An error message is issued and a new line is requested. In this way, the parser has worked its way back to error-handling code that can deal with the situation. Along the way, the parser correctly decremented the "depth" variable to account for the missing parentheses. This method of dealing with errors decrements "depth" for every unbalanced opening parenthesis on the line. This corrects the "depth" count properly. Our first definition (without the YYERROR call) would only have decremented "depth" once. This example is somewhat contrived, of course -- you could always just set "depth" to zero whenever you started a new line of input. The usefulness of the technique is more apparent in situations where you obtain memory with "malloc" whenever you get an opening delimiter and free the memory with "free" whenever you get a closing delimiter. In this case, it is obvious that you need to do precisely as many "free" operations as "malloc" operations, so you must raise the error condition for each unbalanced opening delimiter. You might think that the symbol "lp" is unnecessary, and you could just define expr : '(' {depth++;} expr ')' {depth--;} | '(' error {depth--;} ; However, this would not work in general. There is no guarantee that the action {depth++;} would be executed in all cases, particularly if the token after the '(' was one that could not start an expression. As an interesting example of another way to use YYERROR, consider the following (taken from a parser for the Pascal programming language). label_list : label_list ',' label | label | error | error [LABEL CONST VAR PROC FUNC BEGIN] {YYERROR; /* more code */}; November, 1995 Page 73 YAY Reference Manual Thinkage Ltd. This deals with errors in two different ways: (a) If an error is followed by one of the tokens LABEL, CONST, etc. (representing the beginning of new declaration sections in Pascal), the input is reduced to a complete "label_list" input is Reduced to a complete "label_list" and an appropriate action is taken. This action uses YYERROR to raise the error condition, but only _a_f_t_e_r the reduction has taken place. (b) The other rule is used when the parser finds an error which is not followed by one of the listed tokens. This corresponds to an error in the middle of a label list and requires a different sort of handling. In this case, error-handling is allowed to take place immediately, without reduction, because there may be more "label_list" to come. This kind of approach can be used to distinguish different kinds of errors that may take place in a particular situation. 99..1100 TThhee DDeeffaauulltt AAccttiioonn The default action is the one that is taken when the parser finds a token which has no specified effect in the current state. Understanding how default actions work will help you understand what is going on when a YAY-produced parser encounters an error. In a state diagram, the default action is marked with a ".". The default will always be a Reduce or Error action, chosen according to the following rules. (a) If the state has no Shift actions and only one Reduce, the default will be the Reduce action. (b) Apart from (a), an empty rule will never have Reduce as a default. (c) If a state has more than one Reduce action, the parser examines the "popularity" of each Reduce. For example, if Reduction A is used with any of three different input tokens and Reduction B is used with only one input token, Reduction A is three times as "popular" as B. If one Reduce action is more than twice as popular as its closest contender (i.e. if it is taken on more than twice as many input tokens), and if that Reduce Page 74 November, 1995 Thinkage Ltd. YAY Reference Manual action is associated with a rule that contains reductions on at least _f_i_v_e tokens, the popular Reduce action is made the default. (d) In all other cases, the default action will be an Error action. For example, Error is chosen when a rule has more than one Reduce action, and there is no Reduce that is more than twice as popular as all the other contenders. Note: YAY's predecessor YACC always chooses the most popular Reduce action as default (if there is one). It does not use the same requirements as (c) above. As a result of this difference YAY's parser tables are about 20% larger than YACC's, but a YAY-generated parser usually detects errors much earlier than a parser generated by YACC. Because of the way default actions are treated, a YAY- produced parser will sometimes begin reducing when it encounters an error. Several reductions may take place before the error state is finally triggered. Thus your grammar may need some way to determine what actions have taken place between the time the erroneous token was read in and the time that the error was actually triggered. 99..1111 IInnvvaalliidd TTookkeennss It is invalid to say either %token X 0 or %token X 256 The value 0 is reserved for the end marker and 256 is reserved for "error". 99..1122 DDyynnaammiicc SSttaacckk AAllllooccaattiioonn The manifest YYSSIZE is used to determine the size of the state and value stacks used by "yyparse". YYSSIZE gives the maximum number of elements that these stacks will be expected to hold; the size of each value element is dictated by the YYSTYPE type and the size of each state element is determined by YAY. The default value of YYSSIZE is 150. By increasing the value of YYSSIZE, you can allow for grammars with a larger number of pending states. If you ##ddeeffiinnee YYALLOC in the declarations section, the state and value stacks used by "yyparse" will be allocated November, 1995 Page 75 YAY Reference Manual Thinkage Ltd. dynamically via "malloc" and freed before "yyparse" returns. What this effectively means is that "yyparse" makes itself re-entrant by saving a number of externals when it begins execution and restoring them upon completion. The externals involved are yylval yyval yypvt yynerrs yychar yyerrflag If you "longjmp" out of "yyparse" (due to an action), the externals are _n_o_t restored, and "yyparse" will not be re- entrant. If you use YYALLOC, "yyerror" will be called if "malloc" fails to allocate space for the state and value stacks. If you ##ddeeffiinnee YYALLOC with a value greater than 10, the parser allocates the state and value stacks dynamically, beginning with a size of YYSSIZE. If this is not big enough to hold the two stacks, "yyparse" attempts to use "realloc" to grow the stacks by the amount given by YYALLOC. For example, if YYALLOC is 20, "yyparse" tries to grow the stacks to a size that will allow 20 additional elements (whenever "yyparse" needs more space for the stacks). "yyerror" is called if the call to "realloc" fails to allocate additional space. If you set up your parser in such a way that it may grow the stacks, you must be careful not to take a pointer into a stack in one action and use that pointer inside a different action. The reason is that the stack may be grown using "realloc" in between the two actions. Since "realloc" may actually move the entire stack, the pointer will no longer be valid. Thus you should not create pointers with expressions like &($1) and expect those pointers to be valid in other actions. Within the code of a single action, however, such pointers wwiillll remain valid. If you ##ddeeffiinnee YYSTATIC, both the state and value stacks will be static. Otherwise, the state stack will be "auto" (allocated on the program stack) and the value stack will be static. Defining YYALLOC saves both stack space and static space; defining YYSTATIC saves stack space. Page 76 November, 1995 Thinkage Ltd. YAY Reference Manual 99..1133 SSyynncchhrroonniizziinngg tthhee LLooookkaahheeaadd If you ##ddeeffiinnee YYSYNC, the parser will always have a lookahead token when it performs a shift or reduce action. If the symbol is not defined, the parser will only obtain a lookahead token if the value of the token is needed. You would define YYSYNC in situations where you wanted to keep track of the line number on which various situations occurred (e.g. errors). If "yylex" _a_l_w_a_y_s does a lookahead, you know that the line number of the token you are working with is the line number of the second last token read. If "yylex" sometimes does not do a lookahead, you don't know if the current line number in "yylex" is the line number for the current token or for a lookahead token. You would avoid defining YYSYNC in situations where some actions do their own reading. For example, suppose that /* is a token that indicates the beginning of a comment. You could create an action for that token which reads all input up to a closing */. With this kind of action, you would not want "yylex" to read a lookahead token, since that token would be the first token inside the comment, not the first token after the comment. 99..1144 YYYYSSHHIIFFTT The generated parser program invokes a macro named YYSHIFT whenever the parser performs a shift action in response to a token. The macro is invoked without arguments. By default, YYSHIFT is defined to do nothing; however, users can redefine YYSHIFT if desired, in the token definition part of the YAY input. As an example of how you might use YYSHIFT, consider a situation where input is freeform and may be split over many lines of text, including the insertion of blank lines in the input. You may want to keep a count of the number of lines you've read as you parse the input so that you can provide error messages like "Line 12: Syntax error". When the parser finishes reading a line, it often has to read ahead to the next token before it can decide whether to reduce what it already has or keep on reading additional input. Since there may be blank lines in the input, this means that the parser may actually read ahead several lines before finding the next token. When it sees the next token, the parser may decide to reduce what it has seen already before shifting in response to the token. November, 1995 Page 77 YAY Reference Manual Thinkage Ltd. While the reduction is happening, you want your line count to reflect the last line of input. You do not want the line count to reflect the line where the parser found the new token, because the parser isn't really using that token yet; it's working with older input. You only want to update the line count when the parser is ready to shift on the new token. That's where YYSHIFT comes in. You can define the YYSHIFT macro to update the line count at the point when the token is actually used, not when it is first read. YYSHIFT is _n_o_t invoked if a token is discarded (because of error handling or some other reason). It is only invoked when the parser shifts on a token. Page 78 November, 1995 Thinkage Ltd. YAY Reference Manual AAppppeennddiixx AA AAnn EExxaammppllee This appendix gives a simple example of YAY input. We have omitted the recognition actions of the grammar rules in order to keep the example as simple as possible. The parser implements a simple string manipulation language. It has two types of objects: strings and variables. Strings can have a maximum of 100 characters and are enclosed in double quotes. Variable names are a maximum of eight characters long. There are three operations: (a) = assigns a string to a variable; (b) JOIN or + concatenates strings or the contents of variables; (c) PRINT outputs strings or the contents of variables. Every statement is a single input line and blanks are used to separate tokens on the line. Multiple = assignments are permitted as in A = B = "hello" The keyword PRINT can only appear once on the line. If a line does not contain an assignment, the value of the expression on the line is merely printed out. This means that it is valid to have a line that only contains a string or a variable name. The end of the input file marks the end of input. Here are the declarations section of the YAY input and the "yylex" routine (both written in C). %{ /* declarations */ #include <stdio.h> #define MAXSTRING 100 char str[MAXSTRING]; #define MAXVAR 8 char var[MAXVAR]; %} %token VARIABLE STRING %nonassoc PRINT %right '=' %left JOIN November, 1995 Page 79 YAY Reference Manual Thinkage Ltd. %union { char *cstr; } %% program : stat | program stat ; stat : printexpr '\n' | expr '\n' ; printexpr : PRINT expr ; expr : VARIABLE '=' expr | JOIN expr expr | STRING | VARIABLE ; %% int yylex() { char c, *stringet(), *wordget(); extern YYSTYPE yylval; while ( (c=getchar()) == ' ' ) /* skip blanks */; switch(c) { case '\n': /* end of line */ return('\n'); case EOF: /* end of file */ return(0); case '+': /* same as JOIN */ return(JOIN); case '"': /* start of string */ yylval.cstr = stringet(); return(STRING); default: /* keyword or variable */ yylval.cstr = wordget(c); if ( !strcmp("JOIN",yylval.cstr) ) return(JOIN); else if ( !strcmp("PRINT",yylval.cstr) ) return(PRINT); else return(VARIABLE); } } char *stringet() { extern char str[]; int i; Page 80 November, 1995 Thinkage Ltd. YAY Reference Manual for ( i=0; i < MAXSTRING; i++ ) { str[i] = getchar(); if ( (str[i]=='"') || (str[i]=='\n') ) break; } str[i] = '\0'; /* mark end of string */ return(str); } char *wordget(char c) { extern char var[]; int i; var[0] = c; /* first letter obtained already */ for ( i=1; i < MAXVAR; i++ ) { var[i] = getchar(); if ( (var[i]==' ') || (var[i]=='\n') ) break; } var[i] = '\0'; return(var); } void yyerror(char *s) { fprintf(stderr,s); } Note that the lexical analyzer always flags the keywords JOIN and PRINT as operations. This means that you cannot use JOIN or PRINT as variable names. In fact, this is a general limitation of LALR(1) parsers. It is very difficult to have names that are keywords in some contexts and variable names in others, without a great deal of fiddling. Therefore, we recommend that you resign yourself to having _r_e_s_e_r_v_e_d keywords, i.e. keywords forbidden for use as variable names. The above example omitted recognition actions in the Rules Section to make things easier to read. November, 1995 Page 81 YAY Reference Manual Thinkage Ltd. AAppppeennddiixx BB YYAAYY vvss.. YYAACCCC YAY differs from its predecessor YACC in a number of respects. These are summarized below. (a) YACC does not support the following. -- all the LALR(2) features described in 9.1 -- selection preferences, as described in 9.3 -- the use of YYSTATIC, YYSYNC, or YYALLOC (9.12) (b) If a state has several Reduce action, YACC always chooses the most popular of these as the default action. YAY's rules for choosing a default are given in 9.10. In general, YAY's approach will detect errors more quickly (sooner after they actually occur). (c) The two programs handle YYERROR differently. YAY always reduces and pops the current state off the stack before generating the artificial error condition. YACC doesn't do this. With YACC, you cannot use YYERROR inside a rule with "error" in it; YAY, however, will resolve the current rule and then trigger error handling. (d) Output from the two programs differs slightly. As a rule of thumb, YAY accepts any YACC grammar and creates a parser that behaves the same for correct input. However, the YAY version usually detects errors earlier than the YACC parser, and has more Error actions in the state tables. Page 82 November, 1995 Thinkage Ltd. YAY Reference Manual AAppppeennddiixx CC TThhee LLIINNTT CCoommmmaanndd LLiinnee _S_y_n_t_a_x_: yay sourcefile Parser=outfile [option]* (+|-)LR2 (-) (+|-)Verbose (-) (+|-)Warnings (+) Description=file Header=file INSTallation=file Language=C|C++ (C) Parser=file _E_x_a_m_p_l_e_s_: yay cgram.y parse=myparse.c c myparse.c yay ccgram.y lang=c++ pars=myparse.cpp _O_p_t_i_o_n_s_: sourcefile is a file containing YAY input. Language=C produces parsing tables in the C programming language. This is the default. Language=C++ produces parsing tables in the C++ programming language. Note that YAY only produces the tables; the routines that use the tables to parse input are predefined. +LR2 says that the YAY input describes an LALR(2) grammar. Without +LR2, YAY assumes that the grammar is LALR(1). The manual describes modifications that need to be made for LALR(2) grammars. Description=file translates the parsing tables into a format that humans can read, and writes this output into the given file. Header=file writes token definitions and other information necessary for separate compilation, to the named file. INSTallation=file tells YAY where to find the installation file. The installation file tells where various software components have been installed. For more information, see the section on _I_n_s_t_a_l_l_a_t_i_o_n_ _F_i_l_e_s below. November, 1995 Page 83 YAY Reference Manual Thinkage Ltd. If you do not specify an INSTallation= option on the command line, YAY checks for an environment variable named YAY_INST and uses its value as the name of the installation file. If this environment variable does not exist, YAY uses the default installation file. Parser=file writes the resulting source code for the parser into the named file. If this option is omitted, YAY just checks the syntax of your input. +Verbose produces verbose output -- everything that can be flagged is flagged. -Warnings suppresses a number of warning messages that YAY normally issues. _D_e_s_c_r_i_p_t_i_o_n_: YAY converts your context-free grammar into a C or C++ program that is written to the file specified by the Parser= option. If you use the Description= option, YAY writes a full description of the grammar to the specified file. YAY only displays a brief message on the standard output, summarizing conflicts (and other information if you specify +Verbose). On the other hand, if you do not use the Description= option, YAY writes more information to standard output, including descriptions of the states where conflicts occur. In this case, YAY actually provides additional information to help you identify the source of the conflicts; if you ask for a description file, YAY outputs less information when reporting the conflicts because it assumes you can track down additional information by looking at the description file. For this reason, you can sometimes get a quicker idea of what has gone wrong if you do not ask for a description file. _C_+_+_ _P_a_r_s_e_r_s In general, you only need to use Language=C++ if you intend YYSTYPE to contain a C++ object with constructors. If you intend to compile the parser with C++ but the %union statement does not have any elements that need constructors, it's best to use Language=C to get more efficient C code. If YYSTYPE does contain elements that need constructors, you need to define an appropriate constructor-like function for the YYSTYPE type. This function should have the prototype Page 84 November, 1995 Thinkage Ltd. YAY Reference Manual void name(YYSTYPE *p) where "name" can be any valid name. In the declarations section of the grammar, you must then add the statement #define YYSTYPE_INIT name where "name" is the name of the constructor-like function. With Language=C++, the %union statement generates a structure type rather than a union, since C++ does not allow objects with constructors to belong to unions. In many cases, the same grammar may be processed with either Language=C or Language=C++. _I_n_s_t_a_l_l_a_t_i_o_n_ _F_i_l_e_s_: An installation file specifies the pathnames for software and data files used by YAY. Installation files are text files made up of comment lines and option lines. Comment lines: Any line whose first non-blank character is # will be taken as a comment. Blank lines are also considered comments. Option lines: Option lines have the format Keyword=pathname In this documentation, keywords are written with some letters in upper case and some in lower case. You may abbreviate keywords by omitting any or all of the letters shown in lower case. The remaining letters may be entered in either upper or lower case; the documentation simply uses upper case to show which characters may not be omitted. In this version of YAY, there is only one valid option line: Library=pathname The pathname should be the directory containing the YAY parser template files (e.g. yyparse.c). November, 1995 Page 85 YAY Reference Manual Thinkage Ltd. _N_o_t_e_s_: If you define YYALLOC, the parser allocates its state and value stacks dynamically via malloc and free. This shrinks your parser and helps prevent stack overflows. Copyright 1995, Thinkage Ltd. Page 86 November, 1995