home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
OS/2 Shareware BBS: 10 Tools
/
10-Tools.zip
/
yay-1_0.zip
/
yay-1_0
/
doc
/
yay10man.txt
< prev
Wrap
Text File
|
1996-02-05
|
167KB
|
5,392 lines
YYAAYY RReeffeerreennccee MMaannuuaall
by
J. A. Gardner
Thinkage Ltd.
85 McIntyre Drive
Kitchener, Ontario
Canada N2R 1H6
Copyright 1995 by Thinkage Ltd.
Thinkage Ltd. YAY Reference Manual
TTaabbllee ooff CCoonntteennttss
11.. IInnttrroodduuccttiioonn........................................................................................22
1.1 A Note to Novices.................................2
22.. HHooww YYAAYY WWoorrkkss......................................................................................44
2.1 "yyparse" and "yylex".............................4
2.2 Grammar Rules.....................................5
2.3 Notes on Compiling Source Code Produced by YAY....6
33 IInnppuutt ttoo YYAAYY..........................................................................................77
3.1 The Declarations Section..........................8
3.1.1 Token Declarations..........................8
3.1.2 Precedence and Binding......................9
3.1.3 Variable/Function Declarations.............12
3.1.4 Summary....................................12
3.2 Grammar Rules....................................12
3.2.1 Recognition Actions........................14
3.2.2 Token and Symbol Values....................15
3.2.3 Precedence in the Grammar Rules............16
3.2.4 The Start Symbol...........................18
3.2.5 The End Marker.............................19
3.2.6 Declarations in yyparse....................19
3.3 The Programs Section.............................20
3.3.1 The Lexical Analyzer.......................21
44.. IInntteerrnnaall SSttrruuccttuurreess........................................................................2222
4.1 States...........................................22
4.2 Diagramming States...............................23
4.3 State Actions....................................24
4.3.1 The Accept Action..........................25
4.3.2 The Shift Action...........................25
4.3.3 The Reduce Action..........................26
4.3.4 The Goto Action............................27
4.3.5 The Error Action...........................28
55.. EErrrroorr--HHaannddlliinngg..................................................................................2299
5.1 The "error" Symbol...............................29
5.2 The Error Condition..............................29
5.3 Examples.........................................30
5.4 Error Recognition Actions........................32
5.5 The yyclearin Macro..............................33
5.6 The yyerror Function.............................33
5.6.1 Changing yychar in yyerror.................34
5.7 The yyerrflag Variable...........................35
5.8 The yyerrok Macro................................36
5.9 Other Error Support Routines.....................36
November, 1995 Page i
YAY Reference Manual Thinkage Ltd.
66.. YYAAYY OOuuttppuutt..........................................................................................3388
6.1 Rules Summary....................................39
6.2 State Descriptions...............................39
6.3 Conflict Summary.................................42
6.4 Parser Statistics................................43
77.. TTyyppeess....................................................................................................4466
The Default Action....................................47
88.. AAmmbbiigguuiittiieess........................................................................................4499
8.1 Resolving Conflicts by Precedence................50
8.2 Disambiguating Rules.............................50
8.3 Conflicts in YAY Output..........................53
99.. AAddvvaanncceedd FFeeaattuurreess............................................................................5544
9.1 LALR(2) Grammars.................................54
9.1.1 The Lookahead Action.......................54
9.1.2 The yy2lex Routine.........................55
9.1.3 Conflicts..................................56
9.2 Multiple Actions.................................57
9.3 Selection Preference.............................59
9.4 Negative Numbers in $N Constructs................63
9.5 Lists and Handling Null Strings..................64
9.6 Right vs. Left Recursion.........................66
9.7 YYDEBUG..........................................68
9.8 Important Symbols................................69
9.9 How YYERROR May Be Used..........................70
9.10 The Default Action..............................73
9.11 Invalid Tokens..................................74
9.12 Dynamic Stack Allocation........................74
9.13 Synchronizing the Lookahead.....................76
9.14 YYSHIFT.........................................76
AAppppeennddiixx AA -- AAnn EExxaammppllee......................................................................7788
AAppppeennddiixx BB -- YYAAYY vvss.. YYAACCCC..................................................................8811
AAppppeennddiixx CC -- TThhee LLIINNTT CCoommmmaanndd LLiinnee................................................8822
Page ii November, 1995
Thinkage Ltd. YAY Reference Manual
11.. IInnttrroodduuccttiioonn
YAY is a tool for writing compilers and other programs
that parse input according to strict "grammar" rules. In
the terminology of parsing theory, YAY can handle LALR(2)
grammars with disambiguating rules. This is a fairly broad
class of grammars -- most of the computing languages in use
today can be described using LALR(2) specifications.
For a precise definition of this sort of grammar, see
any standard text on parsing theory, e.g. _P_r_i_n_c_i_p_l_e_s_ _ _o_f
_C_o_m_p_i_l_e_r_ _D_e_s_i_g_n by A.V.Aho and J.D.Ullman, Addison-Wesley,
1977, or _L_R_ _ _P_a_r_s_i_n_g by Aho and Johnson, in Computing
Surveys.
The appearance of YAY input is based on the input to
YACC (Yet Another Compiler-Compiler), written by S.C.Johnson
of Bell Telephone Laboratories, Murray Hill, N.J. The
LALR(1) version of YAY was written by K.W.Lalonde of the
Software Development Group of the University of Waterloo.
Enhancements to allow YAY to handle LALR(2) grammars were
made by P.J.Fraser, also of SDG.
The parsing algorithm used by YAY is derived from the
article "Methods for Computing LALR(k) Lookahead" by
B.B.Kristensen and O.L.Madsen, _A_C_M_ _ _ _T_r_a_n_s_a_c_t_i_o_n_s_ _ _ _o_n
_P_r_o_g_r_a_m_m_i_n_g_ _L_a_n_g_u_a_g_e_s_ _ _a_n_d_ _ _S_y_s_t_e_m_s, Vol.3, No.1, January
1981, pp.60-82. Those interested in reading this article
should first have a good grasp of parsing theory principles.
11..11 AA NNoottee ttoo NNoovviicceess
YAY can produce anything from a simple parser for a desk
calculator program to a very complicated parser for a
programming language. Those who are using YAY for complex
tasks have to know all the idiosyncrasies of the program,
including a good deal about the internal workings of YAY.
On the other hand, the internal workings are mostly
irrelevant to someone who is making an easy straightforward
parser.
For this reason, novices may want to skip some the
sections of the manual which describe very technical aspects
of YAY. This is particularly true on first reading, when it
is more important to get an overview of how YAY is used than
to try to comprehend every little detail. We therefore
November, 1995 Page 3
YAY Reference Manual Thinkage Ltd.
suggest that novices concentrate on the following sections,
and only look at other sections if the requirements of your
parser make it necessary.
-- All of Chapter 2
-- All of Chapter 3
-- All of Chapter 4
-- Sections 6.0, 6.1, 6.2
-- all of Chapter 7
-- Sections 9.3, 9.5, 9.6
Novices may be particularly interested in Appendix A which
gives an example of YAY input for a simple parser.
Page 4 November, 1995
Thinkage Ltd. YAY Reference Manual
22.. HHooww YYAAYY WWoorrkkss
The input to YAY describes the rules of a grammar. YAY
uses these rules to produce the source code for a program
that will parse the grammar. This source code can then be
compiled to obtain a program that reads input, parses it
according to the grammar, and takes action based on the
result.
The source code produced by YAY consists of a number of
data tables that represent the grammar, plus a C function
named "yyparse". By default, all the symbol names used by
YAY begin with "yy". This is an historical convention,
dating back to YAY's predecessor YACC. Users can avoid
conflicts with YAY names by avoiding symbols that start with
"yy".
Users who want to use a different prefix may indicate
this with a line of the form
%prefix PREFIX
at the beginning of their YAY input. For example,
%prefix ww
asks for a prefix of "ww" instead of "yy". The prefix
chosen should be one or two characters long -- longer
prefixes will lead to name conflicts on machines where
external names are truncated to six characters during the
loading process. In addition, at least one of the
characters in the prefix should be a lower case letter
(because YAY uses an all upper case version of the prefix
for some special names, and this has to be different from
the real prefix).
Different prefixes are useful when two YAY-produced
parsers are to be merged into a single program. For the
sake of convenience, however, the "yy" convention will be
used in this manual.
22..11 ""yyyyppaarrssee"" aanndd ""yyyylleexx""
"yyparse" returns a value of 0 if the input it parses is
valid according to the given grammar rules. It returns a 1
if the input is invalid and error recovery is impossible.
"yyparse" does not do its own lexical analysis. In other
words, it does not pull the input apart into tokens ready
November, 1995 Page 5
YAY Reference Manual Thinkage Ltd.
for parsing. Instead, it calls a routine called "yylex"
every time it wishes to obtain a token from the input.
"yylex" returns a value indicating the _t_y_p_e of token
that has been obtained. If the token has an actual _v_a_l_u_e,
this value (or some representation of the value, e.g. a
pointer to a string containing the value) is returned in an
external variable named "yylval".
It is up to the user to write a "yylex" routine that
breaks the input into tokens and returns the tokens one by
one to "yyparse". We will say more about the lexical
analyzer in Section 3.3.
22..22 GGrraammmmaarr RRuulleess
The grammar rules given to YAY not only describe what
inputs are valid according to the grammar but also specify
what action should be taken when a given input is
encountered. For example, if the parser recognizes a
statement that assigns a value to a variable, the parser
should either perform the assignment itself or take some
action to ensure that the assignment will eventually take
place.
As an example, if the parser is part of an interactive
desk calculator, it could carry out arithmetic calculations
as soon as the instructions are recognized. However, if the
parser is acting as the first pass in a multi-pass compiler,
it may simply encode the input in a way that will be used in
a later code-generation pass.
In summary, the user must provide a number of things
when using YAY to produce a parser:
(1) grammar rules indicating what input is and isn't valid;
(2) a lexical analyzer ("yylex") that breaks raw input into
tokens for the parsing routine "yyparse";
(3) any source code or functions that may be needed to
perform appropriate actions once particular inputs are
recognized;
(4) a mainline routine that performs any necessary
initializations, calls "yyparse", then performs
possible clean-up actions. The simplest kind of
mainline is just a function "main" that calls
"yyparse", then returns.
Page 6 November, 1995
Thinkage Ltd. YAY Reference Manual
22..33 NNootteess oonn CCoommppiilliinngg SSoouurrccee CCooddee PPrroodduucceedd bbyy YYAAYY
By default, C source code produced by YAY contains the
line
#define YYSCTAB const
It then uses the YYSCTAB manifest to declare various data
objects as ccoonnsstt. If you will be compiling the YAY output
with an old C compiler that doesn't recognize the ccoonnsstt
qualifier, you should change the definition to read
#define YYSCTAB
November, 1995 Page 7
YAY Reference Manual Thinkage Ltd.
33 IInnppuutt ttoo YYAAYY
This chapter describes the input to YAY when you are
defining an LALR(1) grammar. Chapter 9 will talk about
additional needs for LALR(2) grammars.
The input to YAY is broken into three sections:
-- declarations
-- grammar rules
-- programs
The contents of each section will be described shortly, but
first we will give some overall rules for YAY input.
Sections of YAY input are separated by the symbol %%.
The general lay-out of YAY input is therefore
declarations
%%
grammar rules
%%
programs
The declarations section may be omitted if no declarations
are necessary. This will mean that the input starts with
the first %%. The programs section may also be omitted,
from the second %% on. The simplest input for YAY is
therefore
%%
grammar rules
Blanks, tabs, and new-lines are used to separate items
in YAY input. These are called _w_h_i_t_e_-_s_p_a_c_e characters. Any
place one white-space character is valid, any number of
blanks, tabs, and/or new-lines may be used. This means, for
example, that the %% to separate sections does not have to
be on a line all by itself. However, giving it a line of
its own makes the YAY input easier to read.
Comments may appear anywhere a blank is valid. As in C,
comments begin with /* and end with */.
Identifiers used in YAY input may be of arbitrary
length, and may consist of all letters (upper and lower
case), all digits, and the characters dot (.) and underscore
(_). The first character of an identifier may not be a
digit. YAY distinguishes between upper and lower case
Page 8 November, 1995
Thinkage Ltd. YAY Reference Manual
letters, so "this" and "THIS" and "This" are all different
symbols.
Literals in YAY input consist of a single ASCII
character enclosed in single quotes, e.g. 'c'. The standard
C escape sequences are recognized:
\b -- backspace
\n -- new-line
\r -- carriage return
\t -- tab
\' -- single quote
\\ -- backslash
\xxx -- any ASCII character
("xxx" is octal representation)
For technical reasons, the null character '\000' should
never appear in YAY input.
33..11 TThhee DDeeccllaarraattiioonnss SSeeccttiioonn
The declarations section describes many of the
identifiers that will be used in the rest of the YAY input.
There are two types of declarations:
(a) Token declarations;
(b) Declarations of functions and variables used in the
actions that the parser takes when a particular input
is recognized.
The declarations section can also specify rules for the
precedence and binding of operators used in the grammar.
For example, the standard order of arithmetic operations
would normally be set up in the declarations section.
_3_._1_._1_ _ _T_o_k_e_n_ _D_e_c_l_a_r_a_t_i_o_n_s
All ASCII characters are automatically recognized as
tokens. For example, 'a' stands for a token that is the
literal character "a".
Other tokens are declared with statements of the form
%token name1 name2 name3 ...
This tells YAY that the given names will refer to tokens.
For example,
November, 1995 Page 9
YAY Reference Manual Thinkage Ltd.
%token INTNUM
indicates that the identifier INTNUM will refer to a
particular type of token returned by the lexical analyzer
"yylex". If INTNUM stands for any integer number token, you
might have the following code in "yylex".
c = getchar();
if ( (c >= '0') && (c <= '9') ) {
yylval = 0;
do {
yylval = (yylval * 10) + (c - '0');
c = getchar();
} while (c >= '0' && c <= '9');
ungetc(c, stdin);
return( INTNUM );
}
"yylex" returns INTNUM to indicate that a certain kind of
token (an integer number) has been returned. The actual
value of this number is returned in "yylval". The grammar
rules in the YAY input would dictate where an INTNUM token
is valid.
In the C source code produced by YAY, the identifiers
named in a %token statement will appear as constants set up
with ##ddeeffiinnee. The first named token has a #defined value of
257, the next is #defined as 258, and so on. Token values
start at 257 so they will not conflict with any of the 256
ASCII characters.
Because token identifiers are set up as manifest
constants, they must not conflict with reserved words or
other identifiers that will be used by the parser. For
example,
%token if yyparse ...
will almost certainly lead to errors when you try to compile
the source code output of YAY. By convention, this manual
will always create token names in upper case, and we
recommend that you follow the same practice.
_3_._1_._2_ _ _P_r_e_c_e_d_e_n_c_e_ _a_n_d_ _B_i_n_d_i_n_g
Parsers that evaluate expressions usually have to
establish the order in which various operations are carried
out. For example, parsers for arithmetic expressions
usually carry out multiplications before additions. Two
factors affect order of operation: precedence and binding.
Page 10 November, 1995
Thinkage Ltd. YAY Reference Manual
Precedence dictates which of two different operations
should be carried out first. For example, in
A + B * C
the standard arithmetic rules of precedence dictate that the
multiplication should take place before the addition.
Operations that should be carried out first are said to have
a _h_i_g_h_e_r precedence than operations that should be performed
later.
Different operators can sometimes have the same
precedence. In C, for example, addition and subtraction are
similar enough to share the same precedence.
Binding indicates which of two similar operations should
be carried out first. By "similar", we mean operations with
the same precedence (e.g. addition and subtraction in C).
For example, C chooses to parse
A - B - C
as
(A - B) - C
while a language like APL would use
A - (B - C)
If we evaluate the first operation before the second (as C
does), we say the operation is _l_e_f_t_-_a_s_s_o_c_i_a_t_i_v_e. If we
evaluate the second operation before the first (as APL
does), we say the operation is _r_i_g_h_t_-_a_s_s_o_c_i_a_t_i_v_e_.
Occasionally, a compiler may have operations which are
not associative. For example, Fortran regards
A .GT. B .GT. C
as invalid. In this case, we say the operation is _n_o_n_-
_a_s_s_o_c_i_a_t_i_v_e. The precedence and associativity of operator
tokens may be declared in the declarations section using the
keywords
%left
%right
%nonassoc
For example,
November, 1995 Page 11
YAY Reference Manual Thinkage Ltd.
%left '+' '-'
indicates that the + and - operations have the same
precedence and are left-associative.
Associativity declarations should be given in order of
precedence. Operations with lowest precedence are listed
first, and those with highest precedence are listed last.
Operations with equal precedence are listed on the same
line. For example,
%right '='
%left '+' '-'
%left '*' '/' '%'
says that = has a lower precedence than + and -, which in
turn have a lower precedence than *, /, and %. = is also
right-associative, so that
A = B = C
is parsed as
A = (B = C)
Because of the way YAY specifies precedence and
associativity, operators with equal precedence will always
have the same associativity. For example, if A and B have
equal precedence, their precedence must have been set with
one of
%left A B
%right A B
%nonassoc A B
which means A and B must have the same associativity.
The names supplied with %right, %left, and %nonassoc may
be literals or YAY identifiers. If they are identifiers,
they are regarded as token names. YAY generates a %token
directive for such names if they have not already been
declared. For example, in
%left '+' '-'
%left '*' '/'
%left UMINUS
UMINUS is taken to be a token identifier. There is no need
to define UMINUS as a token identifier -- a %token directive
Page 12 November, 1995
Thinkage Ltd. YAY Reference Manual
will be generated automatically if necessary. It is
perfectly valid to have an explicit
%token UMINUS
if you want. However, it must precede the %left
declaration.
For a more technical discussion of how precedence and
binding rules affect a parser, see Chapter 8.
_3_._1_._3_ _ _V_a_r_i_a_b_l_e_/_F_u_n_c_t_i_o_n_ _D_e_c_l_a_r_a_t_i_o_n_s
The declarations section may contain standard C
declarations for variables and/or functions used in the
actions specified in the grammar rules section. All such
declarations should be included in a block that begins with
%{ and ends with %}. For example,
%{
int i, j, k;
static float x = 1.0;
%}
gives a few variable declarations. These declarations are
essentially transferred "as is" to the _b_e_g_i_n_n_i_n_g of the
source code that is produced by YAY. This means that they
will be external to "yyparse" and therefore extern
definitions.
_3_._1_._4_ _ _S_u_m_m_a_r_y
The source code produced by YAY contains the following.
-- code from the declarations section
-- code given in the programs section
-- parsing tables produced by YAY to represent the grammar
-- the "yyparse" routine
33..22 GGrraammmmaarr RRuulleess
A YAY grammar rule has the general form
identifier : definition ;
A colon separates the definition from the identifier being
defined. A semicolon marks the end of the definition.
November, 1995 Page 13
YAY Reference Manual Thinkage Ltd.
The identifiers defined in the grammar rule section are
known as _n_o_n_-_t_e_r_m_i_n_a_l_ _s_y_m_b_o_l_s. "Non-terminal" suggests that
these symbols aren't the "end of the line". Instead, they
are made up of smaller things: tokens or other non-terminal
symbols.
Here is a simple example of the definition of a non-
terminal symbol.
paren_expr : '(' expr ')' ;
This says that a "paren_expr" consists of a left
parenthesis, followed by an "expr", followed by a right
parenthesis. The "expr" is either a token or a non-terminal
symbol defined in another grammar rule. This grammar rule
could be interpreted to say that a parenthesized expression
consists of a normal expression inside parentheses.
A non-terminal symbol may have more than one definition.
For example, we might define "if" statements with
if_stat : IF '(' expr ')' stat ;
if_stat : IF '(' expr ')' stat ELSE stat ;
This definition assumes that IF and ELSE are tokens
recognized by the lexical analyzer (which means that this
parser's "yylex" can recognize keywords). The definition
also assumes that "expr" and "stat" are non-terminal symbols
defined elsewhere.
When a single symbol has more than one meaning, YAY lets
you join the various possibilities into a single definition.
Different meanings are separated by or-bars (|). Thus you
could write
if_stat : IF '(' expr ')' stat
| IF '(' expr ')' stat ELSE stat ;
This technique is highly recommended, since it makes YAY
input more readable.
Definitions in a grammar may be recursive. For example,
list : item
| list ',' item;
defines "list" to be one or more items separated by commas.
intexp : '(' intexp ')'
| intexp '+' intexp
| intexp '-' intexp
Page 14 November, 1995
Thinkage Ltd. YAY Reference Manual
| intexp '*' intexp
| intexp '/' intexp
| INTNUM ;
says that an integer expression can be another integer
expression in parentheses, the sum of integer expressions,
the difference of integer expressions, the product of
integer expressions, the quotient of integer expressions, or
an integer number standing on its own (where INTNUM is a
token recognized by the lexical analyzer).
In recursive symbol definitions, it is often useful to
have the empty string as one of the possible definitions.
For example,
program :
/* the empty string */
| statement ';' program ;
defines a program as zero or more statements separated by
semicolons.
Our definition of "list" above was an example of _l_e_f_t
_r_e_c_u_r_s_i_o_n because "list" was on the left in the recursive
definition. The definition of "program" was an example of
_r_i_g_h_t_ _r_e_c_u_r_s_i_o_n. In Section 9.6, we discuss the pros and
cons of the two types of recursion.
_3_._2_._1_ _ _R_e_c_o_g_n_i_t_i_o_n_ _A_c_t_i_o_n_s
In addition to defining what a non-terminal symbol is, a
grammar rule usually describes what to do if the non-
terminal symbol is encountered in parser input. We call
this a _r_e_c_o_g_n_i_t_i_o_n_ _a_c_t_i_o_n_.
Recognition actions are specified as part of the grammar
rule. They are enclosed in brace brackets in the
definition, as in
break_stat : BREAK ';'
{ breakfn(); };
In this definition, "break_stat" is a non-terminal symbol
made up of the token known as BREAK, followed by a
semicolon. If this symbol is recognized, the parser will
invoke a function named "breakfn". Presumably this is a
user-defined function that handles a "break;" statement.
Note that a semicolon is needed to mark the end of the
definition even though the recognition action ends in a
November, 1995 Page 15
YAY Reference Manual Thinkage Ltd.
brace bracket. Programmers who are used to the way C works
should bear this in mind.
For compatibility with some versions of YACC, YAY lets
you put an equals sign (=) before the opening brace that
begins a recognition action, as in
break_stat : BREAK ';'
= { breakfn(); };
When a symbol has more than one definition, a different
recognition action may be associated with each definition.
We will show some examples of this in a moment.
_3_._2_._2_ _ _T_o_k_e_n_ _a_n_d_ _S_y_m_b_o_l_ _V_a_l_u_e_s
One of the most common recognition actions is to return
a value. For example, if an input is recognized as an
expression to be evaluated, the parser may want to return
the resulting value of the expression. To return a value,
the recognition action merely assigns the value to a special
variable named $$. For example,
hexdigit : '0' { $$ = 0; }
| '1' { $$ = 1; }
...
| 'A' { $$ = 10; }
| 'B' { $$ = 11; }
| 'C' { $$ = 12; }
| 'D' { $$ = 13; }
| 'E' { $$ = 14; }
| 'F' { $$ = 15; } ;
could be a way to convert hexadecimal digits into numeric
values. In this case, "yylex" just returns the digits it
finds and "yyparse" performs the actual conversion.
Another common recognition action is to return a value
based on one or more of the items that make up the non-
terminal symbol. Inside the recognition action, $1 stands
for the value of the first item in the symbol, $2 stands for
the value of the second item, and so on. If the item is a
token, its value is the "yylval" value returned by "yylex"
when the token was read. If the item is non-terminal
symbol, its value is the $$ value set by the recognition
action associated with the symbol. Thus we might write
Page 16 November, 1995
Thinkage Ltd. YAY Reference Manual
intexp : '(' intexp ')'
{
$$ = $2;
}
/* value of parenthesized expression
is expression inside parentheses */
| intexp '+' intexp
{
$$ = $1 + $3 ;
}
/* value of addition is sum of two
expressions */
| intexp '-' intexp
{
$$ = $1 - $3 ;
}
/* value of subtraction is difference
of two expressions */
| /* and so on */
;
Note that this particular definition shows that each part of
a multiple definition may have a different recognition
action.
In the source code for "yyparse", this set of actions
will be turned into a large sswwiittcchh statement. The cases of
the sswwiittcchh will be the various possible recognition actions.
If no recognition action is specified for a definition,
the default is
{ $$ = $1 ; }
This action just returns the value of the first item (if the
item has a value).
_3_._2_._3_ _ _P_r_e_c_e_d_e_n_c_e_ _i_n_ _t_h_e_ _G_r_a_m_m_a_r_ _R_u_l_e_s
In our discussion of the declarations section, we showed
how precedence could be assigned to _o_p_e_r_a_t_o_r_s. Precedence
can also be assigned to _g_r_a_m_m_a_r_ _r_u_l_e_s_, and this is done in
the grammar input section.
One way to give a grammar rule a precedence uses the
%prec construct.
%prec TOKEN
November, 1995 Page 17
YAY Reference Manual Thinkage Ltd.
in a grammar rule indicates that the rule should have the
same precedence as the specified token.
As a simple example, consider the case of the unary
minus operator. Suppose our declaration section contains
%left '+' '-'
%left '*' '/'
%left UMINUS
In the grammar rules section, we can write
exp : exp '+' exp
| exp '-' exp
| exp '*' exp
| exp '/' exp
| '-' exp %prec UMINUS
| /* and so on */
;
We could not directly set up a precedence for the unary
minus, since we had already set up a precedence for the -
token. Instead, we created a token named UMINUS and gave it
the precedence we wanted to assign the unary minus. When we
wrote out the grammar rule for the unary minus, we added
%prec UMINUS
to show that this rule should have the precedence of UMINUS.
As another example, we might set up precedence rules for
the right shift and left shift operations of C with
%left RS LS
...
exp :
| exp '<' '<' exp %prec LS
| exp '>' '>' exp %prec RS
...
In this way we give the shift operations the proper
precedence and avoid confusing them with the comparison
operations > and <. Of course, another way to resolve this
problem is to make the lexical analyzer clever enough to
recognize >> and << and to return the RS or LS tokens
directly.
Although symbols like UMINUS, LS, and RS are treated as
tokens, they do not have to correspond to actual input.
They may just be placeholders for operator tokens that have
two different meanings.
Page 18 November, 1995
Thinkage Ltd. YAY Reference Manual
We should note that the use of %prec is relatively rare
in YAY. People don't usually think of %prec in their first
draft of a grammar. %prec is only added in later drafts,
when it is needed to resolve conflicts that appear when the
rules are run through YAY.
If a grammar rule is not assigned a precedence using
%prec, the precedence of the rule is taken from the last
_t_o_k_e_n in the rule. For example, if the rule is
expr : expr '+' expr
the last token in the rule is + (since "expr" is a non-
terminal symbol, not a token). Thus the precedence of the
rule is the same as the precedence of +.
If the last token in a rule has no assigned precedence,
the rule will not have a precedence. This can result in
some surprises if you aren't careful. For example, if you
define
expr : expr '+' expr ';'
the last token in the rule is ; so the rule probably won't
have a precedence even if + does.
A rule that contains no tokens cannot have a precedence.
_3_._2_._4_ _ _T_h_e_ _S_t_a_r_t_ _S_y_m_b_o_l
The first non-terminal symbol defined in the rules
section is called the _s_t_a_r_t_ _s_y_m_b_o_l. This symbol is taken to
be the largest, most general structure described by the
grammar rules. For example, if you are generating the
parser for a compiler, the start symbol should describe what
a complete program looks like in the language to be parsed.
If you do not want the first grammar rule to be taken as
the start symbol, you can use the directive
%start NAME
in your rules section. This indicates that the non-terminal
symbol named NAME is the start symbol. NAME must be defined
somewhere in the rules section.
The start symbol must be all-encompassing: every other
rule in the grammar must be related to the start symbol. In
a sense, the grammar forms a "tree": the root is the start
November, 1995 Page 19
YAY Reference Manual Thinkage Ltd.
symbol, the first set of branches are the symbols that make
up the start symbol, the next set of branches are the
symbols that make up the first set, and so on. Any symbol
that is "outside" this tree is reported as a _u_s_e_l_e_s_s
_v_a_r_i_a_b_l_e in YAY output. Useless variables are ignored by
the parser -- the parser is looking for a complete start
symbol, and nothing else.
_3_._2_._5_ _ _T_h_e_ _E_n_d_ _M_a_r_k_e_r
The end of parser input is marked by a special token
called the _e_n_d_ _m_a_r_k_e_r. This token is often written as $end;
the value of the token is zero.
It is the job of the lexical analyzer "yylex" to return
a zero to indicate $end when the end of input is reached
(e.g. at end of file, or at a keyword that indicates end of
input).
"yyparse" terminates when it has parsed a start symbol
followed by the end marker.
_3_._2_._6_ _ _D_e_c_l_a_r_a_t_i_o_n_s_ _i_n_ _y_y_p_a_r_s_e
You can specify C declarations that will be local to
"yyparse" in much the same way that you specify external
declarations in the Declarations Section. Enclose the
declarations in %{ and %} symbols, as in
%{
/* External declaratios */
%}
%%
/* Rules Section starts here */
{%
/* Declarations here will be
local to "yyparse" */
%}
/* Rules */
%%
/* Program section */
You may also put declarations at the start of recognition
action code.
Page 20 November, 1995
Thinkage Ltd. YAY Reference Manual
33..33 TThhee PPrrooggrraammss SSeeccttiioonn
The programs section of YAY input may contain functions
that should be linked in with the "yyparse" routine. YAY
itself does nothing with these functions -- it simply adds
the source code on the end of the source code produced from
the grammar rules. In this way, the functions can be
compiled at the same time that the YAY-produced code is
compiled.
Of course, these additional functions could be compiled
separately and linked with the YAY-produced code later on
(once everything is in object code format). Separate
compilation of modules is strongly recommended for large
parsers. However, functions that are compiled separately
need a special mechanism if they want to use any manifests
that are defined in the YAY-produced code, and it is
sometimes simpler to make the program part of the YAY input.
For example, consider the case of "yylex". Every time
"yylex" obtains a token from the input, it returns to
"yyparse" with a value that indicates the type of token
found. Obviously then, "yylex" and "yyparse" must agree on
which return values indicate which kind of tokens. Since
"yyparse" already refers to tokens using manifest constants
(created in the declarations section with the %token
directive), it makes sense for "yylex" to use the same
manifest constants. The lexical analyzer can do this very
easily if it is compiled along with "yyparse".
Size might be the determining factor. With very simple
parsers, it's easier to put "yylex" in the Programs Section.
With larger parsers, the advantages of separate compilation
are well worth the extra effort.
If you are going to compile "yylex" or other routines
separately from "yyparse", use the
Header=file
option on the YAY command line. YAY will write out the
manifest definitions to the file of your choice. This file
can then be #included to obtain these definitions for
"yylex" or any other routine that needs them. The manifests
are already included in the generated parser code, so you
only need them for separately compiled modules.
November, 1995 Page 21
YAY Reference Manual Thinkage Ltd.
_3_._3_._1_ _ _T_h_e_ _L_e_x_i_c_a_l_ _A_n_a_l_y_z_e_r
The lexical analyzer "yylex" reads input and breaks it
into tokens. It is "yylex" that determines what constitutes
a token. For example, some lexical analyzers may return
numbers one digit at a time while others collect numbers in
their entirety before passing them to the parser.
Similarly, some lexical analyzers may recognize keywords
such as IF or WHILE and will tell the parser that an IF
token or WHILE token has been found. Others may not be
designed to recognize keywords, so it is up to the parser
itself to distinguish between keywords and other things like
variable names.
As noted before, each token named in the declarations
section of the YAY input is set up as a manifest constant.
The value of the first token named is 257, the value of the
next is 258, and so on. You can also set your own values
for tokens by placing a positive integer after the first
appearance of any token in the declarations section. For
example,
%token AA 56
assigns a value of 56 to the manifest (token) symbol AA.
This mechanism is very seldom needed, and we recommend that
users avoid it whenever possible.
There is little else to say about requirements for
"yylex". If it wishes to return the value of a token as
well as an indication of its type, the value is assigned to
the external variable "yylval". By default, "yylval" is
defined as an iinntt value but it can also be used to hold
other types of values. For more information, see the
description of %union in Chapter 7.
Page 22 November, 1995
Thinkage Ltd. YAY Reference Manual
44.. IInntteerrnnaall SSttrruuccttuurreess
In order to use YAY effectively, it is helpful to
understand some of the internal workings of the parser that
YAY produces. In this chapter, we will look at some of
these structures.
To give us a point of reference, suppose we have
produced a parser with grammar
%token NUM
%left '+' '-'
%left '*' '/'
%%
expr : NUM
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '(' expr ')'
;
44..11 SSttaatteess
As the parser reads in token after token, it switches
between various _s_t_a_t_e_s. You can think of a state as point
where the parser says, "I have read _t_h_i_s particular sequence
of input tokens and now I am looking for one of _t_h_e_s_e
tokens."
For example, a parser for the C language might be in a
state where it has finished reading a complete statement and
is ready for the start of a new statement. It therefore
expects some token that can legitimately start a statement
(e.g. a keyword like IF or WHILE, or the name of a variable
for an assignment).
In this state, it reads a token. Say it finds the token
corresponding to the keyword IF. It then switches to a new
state where it says "I have seen an IF and now I want to see
the ( that begins the IF condition." When it finds the (,
it switches again to a state that says "I have found IF( and
now I want the start of a condition expression."
States break the parsing process into simple steps. At
each step, the parser knows what it has seen and what it is
looking for next.
November, 1995 Page 23
YAY Reference Manual Thinkage Ltd.
YAY assigns numbers to every possible state the parser
can enter. The 0th state is always the one that describes
the parser's condition before it has read any input. Other
states are numbered arbitrarily.
Sometimes a particular input can only be the start of
one construct. For example, the FOR keyword in C can only
be the start of a FOR statement, and the FOR statement only
has one form.
On the other hand, a grammar may have several non-
terminal symbols that start the same way. In our sample
grammar, all of
expr '+' expr
expr '-' expr
expr '*' expr
expr '/' expr
start with an "expr". If the parser finds that the input
begins with an "expr", the parser has no idea which rule
matches the input until it has read the operator following
the first "expr".
The parser chooses which state it will enter next by
looking at the next input token. This token is called the
_l_o_o_k_a_h_e_a_d_ _s_y_m_b_o_l for that state.
44..22 DDiiaaggrraammmmiinngg SSttaatteess
YAY uses simple diagrams to describe the various states
of the parser. These diagrams show what the parser has seen
and what it is looking for next. The diagrams are given in
the Parser Description report prouced by YAY (see Chapter
6).
For example, consider the state where the parser has
just read a complete "expr" at the beginning of a larger
expression. It is now in a state where it expects to see
one of the operators +, -, *, or /, or perhaps the $end
marker indicating the end of input. YAY diagrams this state
as
$accept: expr.$end
expr: expr.'+' expr
expr: expr.'-' expr
expr: expr.'*' expr
expr: expr.'/' expr
Page 24 November, 1995
Thinkage Ltd. YAY Reference Manual
This lists the possible grammar constructs that the parser
may be working on. (The first line $accept stands for the
Start symbol.) The "." indicates how much the parser has
read so far.
If the lookahead symbol is *, the parser will switch to
a state diagrammed by
expr: expr '*'.expr
In this state, the parser knows that the next thing to come
will be another "expr". This means that the only valid
tokens that can be read next are ( or NUM, since those are
the only things that start a valid "expr".
44..33 SSttaattee AAccttiioonnss
There are several possible actions that the parser can
take in a state:
(a) _a_c_c_e_p_t the input;
(b) _s_h_i_f_t to a new state;
(c) _r_e_d_u_c_e one or more input tokens to a single non-
terminal symbol, according to a grammar rule;
(d) _g_o_t_o a new state;
(e) raise an _e_r_r_o_r condition.
In order to decide which action to take, the parser checks
the lookahead symbol (except in states where the parser can
only take one possible action so the lookahead symbol is
irrelevant).
This means that a typical state has a series of possible
actions based upon the possible values of the lookahead
symbol. In YAY output, you might see
'+' shift 8
'-' shift 7
'*' shift 6
'/' shift 5
')' shift 9
. error
This says that if the parser is in this state and the
lookahead symbol is +, the parser will shift to State 8. If
November, 1995 Page 25
YAY Reference Manual Thinkage Ltd.
the lookahead symbol is -, the parser will shift to State 7,
and so on.
The "." in the final line stands for any other token not
mentioned in the preceding list. If the parser finds any
unexpected tokens in this particular state, it will take the
error action.
The sections that follow explain precisely what each
state action means and what the parser does to handle these
actions.
_4_._3_._1_ _ _T_h_e_ _A_c_c_e_p_t_ _A_c_t_i_o_n
The Accept action only happens when the parser is in a
state that indicates it has seen a complete input and the
lookahead symbol is the end marker $end. When the parser
takes the Accept action, "yyparse" terminates and returns a
zero to indicate that the input was correct.
_4_._3_._2_ _ _T_h_e_ _S_h_i_f_t_ _A_c_t_i_o_n
The Shift action happens when the parser is part way
through a grammar construct and a new token is read in. As
an example, State 4 in our sample parser is diagrammed with
expr: expr.'+' expr
expr: expr.'-' expr
expr: expr.'*' expr
expr: expr.'/' expr
expr: '(' expr.')'
'+' shift 8
'-' shift 7
'*' shift 6
'/' shift 5
')' shift 9
. error
This shows that the parser will shift to various other
states depending on the value of the lookahead symbol. For
example, if the lookahead symbol is * the parser shifts to
State 6 which has the diagram
Page 26 November, 1995
Thinkage Ltd. YAY Reference Manual
expr: expr '*'.expr
NUM shift 2
'(' shift 1
. error
expr goto 11
In this new state, the parser has further shifts it can
make, depending on the next lookahead symbol.
When the parser shifts to a new state, it saves the
previous state on a stack called the _s_t_a_t_e_ _s_t_a_c_k. The stack
provides a history of the states that the parser has passed
through while it was reading input. It is also a control
mechanism as will be seen later in this chapter.
Paralleling the state stack is a _v_a_l_u_e stack that
records the values of tokens and non-terminal symbols
encountered while parsing. The value of a token is the
"yylval" value returned by "yylex" at the time the token was
read. The value of a non-terminal symbol is the $$ value
set by the recognition action associated with that symbol's
definition. If the definition didn't have an associate
recognition action, the value of the symbol is the value of
the first item in the symbol's definition.
At the same time that the Shift action pushes the
current state onto the state stack, it also pushes the
"yylval" value of the lookahead symbol (token) onto the
value stack.
_4_._3_._3_ _ _T_h_e_ _R_e_d_u_c_e_ _A_c_t_i_o_n
The Reduce action takes place in states where the parser
has recognized all the items that make up a non-terminal
symbol. For example, the diagram of State 9 in our sample
grammar is
expr: '(' expr ')'.
. reduce (6)
At this point, the parser has seen all three components that
make up the non-terminal symbol "expr":
'('
expr
')'
November, 1995 Page 27
YAY Reference Manual Thinkage Ltd.
As the line
. reduce (6)
shows, it doesn't matter what the lookahead symbol is at
this point. The non-terminal symbol has been recognized,
and the parser is ready for a Reduce action. (Note that the
(6) in the above line just means that the parser has
recognized the non-terminal symbol defined in rule (6) of
the grammar. We will say more about this in a later
chapter.)
The Reduce action performs a number of operations.
First, it pops states off the state stack. If the
recognized non-terminal symbol had N components, a reduction
pops N-1 states off the stack. In other words, the parser
goes back to the state it was in when it first began to
gather the recognized construct.
Next, the Reduce action pops values off the value stack.
If the definition that is being reduced consisted of N
items, the Reduce action conceptually pops N values off the
stack. The topmost value on the stack is assigned to $N,
the next to $N-1, and so on down to $1.
Once the Reduce action has gathered all the $X values,
the parser invokes the recognition action that was
associated with the grammar rule being reduced. This
recognition will use the $X values to come up with a $$
value for the non-terminal symbol. This value is pushed
onto the value stack, thereby replacing the N values that
were previously on the stack.
If the non-terminal symbol had no recognition action, or
if the recognition action did not set $$, the parser puts
the value of $1 back on the stack. (In reality, the value
is never popped off.)
Lastly, the Reduce action sets things up so that the
lookahead symbol seems to be the non-terminal symbol that
was just recognized. For example, it may say that the
lookahead symbol is now an "expr" instead of a token.
_4_._3_._4_ _ _T_h_e_ _G_o_t_o_ _A_c_t_i_o_n
The Goto action is a continuation of the Reduce process.
Goto is almost the same as Shift -- the only difference is
that the Goto action takes place when the lookahead symbol
is a non-terminal symbol while a Shift takes place when the
lookahead symbol is a token.
Page 28 November, 1995
Thinkage Ltd. YAY Reference Manual
For example, State 6 in our sample grammar reads
expr: expr '*'.expr
NUM shift 2
'(' shift 1
. error
expr goto 12
The first time the parser enters this state, the lookahead
symbol will be a token and the parser will Shift on the
token into some state where it will begin to gather an
"expr". When it has a complete "expr", it will perform a
Reduce action that will return to this state and set the
lookahead symbol to "expr". Now when the parser has to
decide what to do next, it sees that it has an "expr" for
the lookahead symbol and therefore takes the Goto action and
moves to State 12.
The Shift action pushes the current state onto the state
stack. The Goto does not have to do this -- the state was
on the stack already. Similarly, Shift pushes a value onto
the value stack, but Goto does not, since the value
corresponding to the non-terminal symbol was already put on
the value stack by the Reduce action.
When the parser reaches the new state, the lookahead
symbol will be restored to whatever it was at the time of
the Reduce action.
Essentially then, a Goto is like a Shift except that it
takes place when you come _b_a_c_k to a state with the Reduce
action. Also, a Shift is based on the value of a single
input token while a Goto is based on a non-terminal symbol.
_4_._3_._5_ _ _T_h_e_ _E_r_r_o_r_ _A_c_t_i_o_n
The parser takes the Error action when it encounters any
input token which cannot validly appear in a particular
input location. When this happens, the parser raises the
"error" condition. Since error-handling can be quite
complicated, we will devote the whole of the next chapter to
the subject.
November, 1995 Page 29
YAY Reference Manual Thinkage Ltd.
55.. EErrrroorr--HHaannddlliinngg
If a piece of input is invalid, the parser can do
nothing with it. Except in extreme cases, however, it is
inappropriate for the parser to stop all processing as soon
as an error is found. Instead, the parser should skip over
the invalid input and resume parsing as soon after the error
as possible. In this way, the parser can find many syntax
errors in one pass through the input, saving time and
trouble for the user.
YAY therefore tries to generate a parser that can
"restart" as soon as possible after an error occurs. YAY
does this by letting you specify points at which the parser
can pick up after errors. You may also dictate what special
processing should take place if an error is encountered at
one of these points.
55..11 TThhee ""eerrrroorr"" SSyymmbbooll
YAY's error handling facilities use the keyword eerrrroorr to
signify an erroneous input. As a result, eerrrroorr may not be
used as the name of a user-defined token or non-terminal
symbol.
You should put "eerrrroorr" in your grammar rules where error
recovery might take place. For example, you might write
statement: error
| /* other definitions of a statement */;
This tells YAY that errors may occur in statements, and that
after an error, the parser is free to restart parsing at the
end of a complete statement.
55..22 TThhee EErrrroorr CCoonnddiittiioonn
As noted in the last chapter, YAY will take the Error
action if it finds an input that is not valid in a
particular location. The Error action has the following
steps.
(a) See if the current state has a Shift action associated
with the eerrrroorr symbol. If it does, shift on this
action.
(b) If the current state has no such action, pop the state
off the stack and check the next state. Also pop off
Page 30 November, 1995
Thinkage Ltd. YAY Reference Manual
the top value on the value stack, so that the state
stack and value stack stay in synch.
(c) Repeat (b) until the parser finds a state that can
shift on the eerrrroorr symbol.
(d) Take the Shift action associated with the eerrrroorr symbol.
Note that this pushes the current state on the stack,
i.e. the state that is capable of handling errors. No
new value is pushed onto the value stack -- the parser
keeps whatever value was already associated with the
state that can handle errors.
When the parser shifts out of the state that can handle
errors, the lookahead symbol will be whatever token caused
the error condition in the first place. The parser will
then try to proceed with normal processing.
Of course, it is quite possible that the original
lookahead symbol is invalid in the new context. If the
lookahead symbol causes an error again, it is discarded and
the error condition stays in effect. The parser will
continue to read new tokens and discard them until it finds
a token that can validly follow the eerrrroorr. The parser will
then take whatever action is associated with the valid
token.
In a typical grammar, the state that has been handling
errors will eventually be popped off the stack in a Reduce
operation.
Notice that the parser always Shifts on the error token.
It will never Reduce on eerrrroorr, even if the grammar has a
state where eerrrroorr is associated with a Reduce action.
In some situations, an error condition will be raised
and the parser will pop all the way to the bottom of the
state stack without finding a state that can handle the
eerrrroorr symbol. For example, the grammar may have no
provisions for error recovery. In this case, "yyparse"
simply terminates and returns a 1 to its caller.
55..33 EExxaammpplleess
As a simple example, consider a parser for a simple desk
calculator. All statements end in a semicolon. Thus we
might see the rule
November, 1995 Page 31
YAY Reference Manual Thinkage Ltd.
statement : var '=' expr ';'
| expr ';'
| error ';'
;
When an error occurs in input, the parser will pop back
through the state stack until it comes to a state where the
eerrrroorr symbol is recognized. For example, the state might be
diagrammed as
$accept: .statement $end
error shift 2
NUM shift 4
. error
var goto 7
expr goto 3
statement goto 5
If an error occurs anywhere in an input statement, the
parser will pop back to this state, then Shift to State 2.
State 2 will look like
statement: error.';'
.
';' shift 6
. error
In other words, the next token must be a semicolon. If it
isn't, another error occurs. The parser pops back to the
previous state and takes the eerrrroorr shift again. Input will
be discarded token by token until a semicolon is found.
When the semicolon is found, the parser will be able to
shift from State 2 to State 6 which is
statement: error ';'.
. reduce (3)
The erroneous line is reduced to a "statement" non-terminal
symbol.
Now this example is simple, but it has its drawbacks.
It will get you into trouble if the grammar has any concept
of block structure or parenthesization. Why? Once an error
occurs, the rule
statement : error ';'
Page 32 November, 1995
Thinkage Ltd. YAY Reference Manual
effectively tells the parser to discard absolutely
everything until it finds a ; character. If you have a
parser for C, for example, it would skip over important
characters like ) or } until it found a semicolon. Your
parentheses and braces would be out of balance for the rest
of the input and the whole parsing process would be a waste
of time. The same principle applies to any rule that shows
the eerrrroorr token followed by some other non-null symbol: it
can lead to hopeless confusion in a lot of grammars.
It is safer to write the rule in a form like
statement : error
| ';'
| /* other stuff */
In this case, the eerrrroorr token only matches material
until the parser finds something else it recognizes (e.g.
the semicolon). Once this happens, the eerrrroorr state will be
reduced to a "statement" symbol and popped off the stack.
Parsing can then proceed as usual.
55..44 EErrrroorr RReeccooggnniittiioonn AAccttiioonnss
The easiest way to generate an error message is to
associate a recognition action with the grammar rule that
recognizes the error. You can do something simple like
statement: error
{
printf("You made an error!\n");
}
or you can be fancier, as in
line: error '\n' prompt line
{ $$ = $4; };
prompt: /* null token */
{ printf("Please re-enter line.\n"); };
If an error occurs, the parser skips until it finds a
new-line character. After the new-line, it will always find
a null token matching "prompt", and the recognition action
for "prompt" will print out the message
Please re-enter line
The final symbol in the rule is another "line", and the
action after the eerrrroorr rule shows that the result of the
November, 1995 Page 33
YAY Reference Manual Thinkage Ltd.
rule ($$) should be the material associated with the second
input line.
All this means that if the user makes a mistake entering
an input line, the parser prints out an error message and
accepts a second input line in place of the first. This
allows for an interactive user to correct an input line that
was incorrectly typed the first time.
Of course, this set-up will only work if the user
doesn't make an error the second time the line is typed too.
If the next token he or she types is also invalid, the
parser will discard the token and decide that it's still
gobbling up the original error.
55..55 TThhee yyyycclleeaarriinn MMaaccrroo
After an Error action, the parser will restore the
lookahead symbol to the value it had at the time the error
was detected. However, this is sometimes undesirable.
For example, your grammar may have a recognition action
associated with the eerrrroorr symbol and this may read through
the next lot of input until it finds the next sure-to-be-
valid data. If this happens, you certainly don't want the
parser to pick up the old lookahead symbol again once error
recovery is finished.
If you want the parser to throw away the old lookahead
symbol after an error, put
yyclearin ;
in the recognition action associated with the eerrrroorr symbol.
"yyclearin" is a macro that expands into code that discards
the lookahead symbol.
55..66 TThhee yyyyeerrrroorr FFuunnccttiioonn
The first thing the parser does when it performs the
Error action is to call a function named "yyerror". This
happens _b_e_f_o_r_e the parser begins going down the state stack
in search of a state that can handle the eerrrroorr symbol.
"yyerror" must be supplied by the user -- note that its name
is in lower case.
The simplest "yyerror" functions either abort the
parsing job or just return so that the parser can perform
its standard error handling.
Page 34 November, 1995
Thinkage Ltd. YAY Reference Manual
The "yyerror" function is passed one argument: a
character string describing the type of error that just took
place. This string is almost always
"Syntax error"
The only other argument strings that might be passed are
"Not enough space for parser stacks"
"Parser stack overflow"
which are used when the parser runs out of memory for the
state stack.
Once "yyerror" returns to "yyparse", the parser will
proceed popping down the stack in search of a state that can
handle errors.
If another error is encountered soon after the first,
"yyerror" will _n_o_t be called again. The parser considers
itself to be in a "potential error" situation until it finds
three correct tokens in a row. This avoids the torrents of
error messages that often occur as the parser wades through
input in search of some recognizable sequence.
Once the parser has found three correct tokens in a row,
it leaves the "potential error" situation. If a new error
is found later on, "yyerror" will be called again.
_5_._6_._1_ _ _C_h_a_n_g_i_n_g_ _y_y_c_h_a_r_ _i_n_ _y_y_e_r_r_o_r
The "yyerror" function may change the value of "yychar"
(the variable that indicates the type of token just read by
"yylex"). If "yyerror" does this, "yyparse" immediately
gets rid of the error condition and tries to make a shift or
reduce using the new "yychar" token. If there is a valid
action that can be taken, "yyparse" will proceed as if the
error never happened.
This behavior can help you deal with at least two kinds
of unusual situations:
(a) When "yyerror" is attempting to correct a problem by
supplying the correct input. For example, it may be
possible to determine what input should have been
received; the classic example is a compiler which can
figure out that a particular statement should have
ended in a semicolon. In this case, "yyerror" can
supply the missing semicolon by assigning an
November, 1995 Page 35
YAY Reference Manual Thinkage Ltd.
appropriate value to "yychar". If "yyerror" guessed
correctly, action can proceed as if the problem didn't
occur.
(b) When a particular token can be interpreted in different
ways in different contexts. For example, suppose that
X can be a keyword in some contexts and a variable name
in others. You can write "yylex" so that it always
interprets X as a keyword. If "yyerror" is invoked
because that keyword is invalid in a particular
context, "yyerror" can change "yychar" and see if the
input is valid when X is interpreted as a variable
name.
If "yyerror" changes "yychar", "yyerror" may have to
inform "yylex" of the situation so that "yylex" can return
the proper input token the next time it is called.
55..77 TThhee yyyyeerrrrffllaagg VVaarriiaabbllee
"yyerrflag" is an external integer variable whose value
must be 0, 1, 2, or 3. YYABORT (described in a later
section) is called if "yyerrflag" does not have one of these
values.
As we mentioned earlier, "yyparse" doesn't leave its
error state until three consecutive valid tokens have been
read. "yyerrflag" is used in this process.
If "yyerrflag" is zero, "yyerror" will be invoked the
next time an error is encountered. As soon as "yyerror" is
invoked, it sets "yyerrflag" to 3. Each time a "yyparse"
shifts on a valid token, "yyerrflag" is decremented, until
it gets to zero. At this point, "yyparse" leaves its error
state; if a new error occurs, "yyerror" will be called
again.
There are two ways in which actions can use "yyerrflag".
First, they can check the variable to see if there has been
a recent error. For example, if an action encounters a
semantic error, it can check "yyerrflag" to see if it is
non-zero (meaning that there has been a recent syntax
error). If "yyerrflag" is non-zero, the semantic error is
almost certainly a result of the previous syntax error.
Using this information, the action may choose to modify or
suppress the semantic error message.
Second, actions can set "yyerrflag" to a value if they
want to prevent calls to "yyerror". As long as "yyerrflag"
is non-zero, "yyerror" will not be called. This may be
Page 36 November, 1995
Thinkage Ltd. YAY Reference Manual
useful if an action is smart enough to realize that it is
still recovering from a previous error and "yyerror" should
not be called again for a while.
55..88 TThhee yyyyeerrrrookk MMaaccrroo
In some situations, you may want "yyerror" to be called
even if the parser hasn't seen three correct tokens since
the last error.
For example, suppose you have a parser for a line by
line desk calculator. A line of input contains errors, so
"yyerror" is called. "yyerror" prints an error message to
the user, throws away the rest of the line, and prompts for
new input. If the next line contains an error in the first
three tokens, the parser will normally start discarding
input _w_i_t_h_o_u_t calling "yyerror" again. This means that
"yyerror" doesn't print an error message for the user, even
though the input line is wrong.
To avoid this problem, you can explicitly tell the
parser to leave its "potential error" state, even if it
hasn't yet seen three correct tokens. Simply say
yyerrok ;
as part of the error recognition action. For example, you
might have the rule
expr : error
{
yyerrok;
printf("Please re-enter line.\n");
yyclearin;
}
"yyerrok" is a macro that is expanded into code that takes
the parser out of its "potential error" state and lets it
start fresh. More precisely, "yyerrok" sets "yyerrflag" to
zero.
55..99 OOtthheerr EErrrroorr SSuuppppoorrtt RRoouuttiinneess
YYABORT halts "yyparse" in midstream and immediately
returns a 1. To the function that called "yyparse", this
means that "yyparse" failed for some reason.
YYACCEPT halts the parser in midstream and returns a 0.
To the function that called "yyparse", this means that
November, 1995 Page 37
YAY Reference Manual Thinkage Ltd.
"yyparse" terminated successfully, even if the entire input
has not yet been scanned.
YYERROR (note that this is upper case) is a macro that
"fakes" an error. When YYERROR is encountered in the code,
the parser will react as if it just saw an error and will go
about recovering from the error. In Chapter 9, we will give
an example of how YYERROR can be useful.
Page 38 November, 1995
Thinkage Ltd. YAY Reference Manual
66.. YYAAYY OOuuttppuutt
YAY can produce several output files. Options on the
YAY command line dictate which files are actually generated.
The most important output file is the one containing
source code that can be compiled into the actual parser.
The name of this file is specified with the Parser=file
command line option.
Another possible output file contains manifest
definitions. The name of this file is specified with
Header=file on the command line. This file is a
distillation of the declarations section of the YAY input.
For example, all the %token directives are restated in terms
of manifest definitions.
%token IF
would appear as
#define IF 257
in the C version of the manifests (assuming that IF was the
first token in the declarations section). By including this
file with
#include "filename"
separately compiled modules can make use of all the
pertinent definitions in the YAY input.
The third output file that YAY can produce is called the
Parser Description. The name of the file is specified with
Description=file on the command line. The Parser
Description is split into three sections:
(a) a summary of the grammar rules;
(b) a list of state descriptions;
(c) a list of statistics for the parser generated by YAY.
In the sections that follow, we will show what the
Parser Description looks like for the following grammar:
November, 1995 Page 39
YAY Reference Manual Thinkage Ltd.
%token IF ELSE A
%%
stmt : IF stmt ELSE stmt
| IF stmt
| A
;
66..11 RRuulleess SSuummmmaarryy
The rules summary section of the Parser Description
begins with the command line used to invoke YAY. This is
intended to serve as a "heading" for the output material.
We'll use the command line
yay infile.y description=out +verbose
YAY input is in the file infile.y and output is written to
the file out. We have specified the +Verbose option because
this provides the largest amount of output. See Appendix C
for more information on the YAY command line.
Next comes a summary of the grammar rules. In our
example, we will have
Rules:
(0) $accept: stmt $end
(1) stmt: IF stmt ELSE stmt
(2) stmt: IF stmt
(3) stmt: A
The 0th rule will always be the definition for a symbol
named $accept. This describes what a complete input looks
like: the Start symbol followed by the end marker. Other
rules are those given in the grammar.
YAY puts a formfeed character on the line after the last
grammar rule so that the next part of the Parser Description
starts on a new page.
66..22 SSttaattee DDeessccrriippttiioonnss
The Parser Description output contains complete
descriptions of every possible state. For example, here is
the description of one state from our sample grammar.
Page 40 November, 1995
Thinkage Ltd. YAY Reference Manual
State 2
stmt : IF.stmt ELSE stmt
stmt : IF.stmt
IF shift 2
A shift 1
. error
stmt goto 4
By now, this sort of diagram should be familiar to you. The
numbers after the word "shift" indicate the state to which
the parser will Shift if the lookahead symbol happens to be
IF or A. If the lookahead symbol is anything else, the
parser raises the error condition and starts error recovery.
If the parser pops back to state 2 by means of a Reduce
Action, the lookahead symbol will now be "stmt" and the
parser will Goto state 4.
As another example of a state, here is State 1.
State 1
stmt : A. (3)
. reduce (3)
This is the state that is entered when an A token has
been found. The (3) on the end of the first line is a _r_u_l_e
_n_u_m_b_e_r. It indicates that this particular line sums up the
whole of the third grammar rule that was specified in the
YAY input. The line
. reduce (3)
indicates that no matter what token comes next, we can
reduce this particular input using grammar rule (3) and say
that we have successfully collected a valid "stmt". The
parser will perform a reduction by popping the top state off
the stack and setting the lookahead symbol to "stmt".
It is important to distinguish between
A shift 1
in State 2 and
. reduce (3)
in State 1. In the Shift instruction, the number that
follows is the number of a _s_t_a_t_e. In the Reduce
November, 1995 Page 41
YAY Reference Manual Thinkage Ltd.
instruction, the number that follows is the number of a
_g_r_a_m_m_a_r_ _r_u_l_e (using the numbers given to the grammar rules
in the first part of the Parser Description). The Parser
Description always encloses rule numbers in parentheses, and
leaves state numbers as they are.
The complete state descriptions for the grammar are
given below.
State 0
$accept: .stmt $end
IF shift 2
A shift 1
. error
stmt goto 3
State 1
(3) stmt: A.
. reduce (3)
State 2
stmt: IF.stmt ELSE stmt
stmt: IF.stmt
IF shift 2
A shift 1
. error
stmt goto 4
State 3
$accept: stmt.$end
$end accept
. error
State 4
stmt: IF stmt.ELSE stmt
(2) stmt: IF stmt. [ $end ELSE ]
Shift/reduce conflict (5,2) on ELSE
ELSE shift 5
. reduce (2)
State 5
stmt: IF stmt ELSE.stmt
Page 42 November, 1995
Thinkage Ltd. YAY Reference Manual
IF shift 2
A shift 1
. error
stmt goto 6
State 6
(1) stmt: IF stmt ELSE stmt.
. reduce (1)
The parser always begins in State 0, i.e. in a state
where no input has been read yet. An acceptable input is a
"stmt" followed by the end marker. When a "stmt" has been
collected, the parser goes to State 3. In State 3, the
required end marker $end indicates that the input should be
accepted. If anything else is found, it is excess input and
means an error.
In State 4, the rule labelled (2) has
[ $end ELSE ]
on the end. This just means that the parser expects to see
one of these two tokens next.
66..33 CCoonnfflliicctt SSuummmmaarryy
As noted in the sample output, there is a shift/reduce
conflict in State 4. If an ELSE is encountered while the
parser is in this state, the parser doesn't know whether to
shift (into State 5) or to reduce using rule (2). Thus YAY
prints the message
Shift/reduce conflict (5,2) on ELSE
After the state descriptions, YAY prints a summary of
all such conflicts. With our sample grammar, the output
contains
Conflicts:
State Token Action
4 ELSE shift 5
4 ELSE reduce (2)
Each line in this summary gives the number of the state in
which the conflict occurs, the token(s) that cause the
conflict, and the conflicting actions.
November, 1995 Page 43
YAY Reference Manual Thinkage Ltd.
66..44 PPaarrsseerr SSttaattiissttiiccss
The last section of the Parser Description is a set of
statistics summarizing YAY's work. Here are the stats we
got when we ran our sample grammar through YAY.
4 rules, 5 tokens, 2 variables, 7 states
Memory: max = 8K
States: 3 wasted, 4 resets
Items: 18, 0 kernel, (2,0) per state, maxival=16 (1 w/s)
Lalr: 1 calls, 2 recurs, (0 trans, 12 epred)
Actions: 0 entries, gotos: 0 entries
Exceptions: 1 states, 4 entries
Simple state elim: 0%, Error default elim: 33%
Optimizer: in 0, out 0
Size of tables: 24 bytes
2 seconds, final mem = 4
1 shift/reduce conflict
Some of these values are machine independent (e.g. the
number of rules), others are machine dependent (e.g. the
amount of memory used), and some can be different every time
you run the job (e.g. time elapsed while YAY was running).
The statistic lines are described below. Many of these
will be of no interest to the normal user; YAY only
generates them for the use of those maintaining the YAY
software. A number of the statistics refer to shift/reduce
or reduce/reduce conflicts. We will discuss these in a
later chapter.
4 rules, 5 tokens, 2 variables, 7 states
The four rules are the grammar rules given in the first
part of the Parser Description. The five tokens are A,
IF, ELSE, $end, and error (which is always defined,
even if we don't use it in this grammar). The two
variables are the non-terminal symbols, "stmt" and the
special $accept. The seven states are states 0 to 6.
Memory: max = 8K
This gives the maximum amount of dynamic memory that
YAY required while producing the parser. This line may
also have a "success rate" that tells how often YAY
succeeded in having enough memory to handle a situation
and how often it had to ask for more memory.
States: 3 wasted, 4 resets
The algorithm that constructs states from the grammar
rules makes a guess at the number of states it will
Page 44 November, 1995
Thinkage Ltd. YAY Reference Manual
need, very early on in the YAY process. If this guess
is too high, the excess states are said to be "wasted".
When creating states from the various grammar rules, it
is sometimes found that a state from one rule
duplicates the state from another (for example, there
were two rules that started with IF in our example
above). In the final parsing tables, such duplicate
states are merged together into a single state. The
number of "resets" is the number of duplicate states
formed, then merged.
Items: 18, 0 kernel, (2,0) per state, maxival=16 (1 w/s)
A state is made of items, and the kernel items are an
important subset of these: the size of the resulting
parsing tables and the running time for YAY are
proportional to the number of items and kernel items.
The rest of the statistics in this line are not of
interest to normal users.
Lalr: 1 call, 2 recurs, (0 trans, 12 epred)
This gives the number of calls and recursive calls to
the conflict resolution routine. The parenthesized
figures are related to the same process. In some ways,
this is a measure of the complexity of the grammar
being parsed. This line will not appear if there are
no reduce/reduce or shift/reduce conflicts in your
grammar.
Actions: 0 entries, gotos: 0 entries
This gives the number of entries in the tables "yyact"
and "yygo". "yyact" keeps track of the possible Shifts
that a program may make and "yygo" keeps track of the
"gotos" that take place at the end of states.
Exceptions: 1 states, 4 entries
This gives the number of entries in the table "yydef",
yet another table used in YAY. "yydef" keeps track of
the possible Reduce, Accept, and Error actions that a
program may make.
Simple state elim: 0%, Error default elim: 33%
The percentage figures indicate how much table space
could be saved through various optimization processes.
The better written your grammar, the greater the
percentage of space that can be saved. Therefore, high
percentages here are an indication of a well-written
grammar.
November, 1995 Page 45
YAY Reference Manual Thinkage Ltd.
Optimizer: in 0, out 0
More optimization statistics: not of interest to normal
YAY users.
Size of tables: 24 bytes
The size of the tables generated to represent the
parsing rules. This size is given in bytes on the host
machine, so it will be inaccurate if a cross-compiler
is being used on the eventual source code output. The
size does not include stack space used by "yyparse" or
debug tables obtained by defining YYDEBUG.
2 seconds, final mem = 4
Total real time that YAY used to produce the parser,
and the final dynamic memory of the parser (in K
bytes).
1 shift/reduce conflict
Number of conflicts in the grammar.
Page 46 November, 1995
Thinkage Ltd. YAY Reference Manual
77.. TTyyppeess
Earlier we mentioned that "yylval" is iinntt by default, as
are $$, $1, $2, etc. If you want these to have different
types, you may redeclare them in the declarations section of
the YAY input. This is done with a statement of the form
%union {
/* possible types for "yylval" and
* $$, $1, $2, etc. */
}
For example, suppose "yylval" can be either integer or
floating point. You might write
%union {
int intval;
float realval;
}
in the declarations section of the YAY input. YAY converts
the %union statement into the following C source.
typedef union {
int intval;
float realval;
} YYSTYPE;
"yylval" is always declared to have type YYSTYPE. If no
%union statement is given in the YAY input, it will use
#define YYSTYPE int
Once YYSTYPE has been defined as a union, you may
specify a particular interpretation of the union by
including a statement of the form
%type <interpretation> SYMBOL
in the declarations section of the YAY input. The
"interpretation" enclosed in the angle brackets is the name
of the interpretation of the union variable that you want to
use. The SYMBOL name is the name of a non-terminal symbol
defined in the grammar rules. For example, you might write
%type <intval> intexp
%type <realval> realexp
to indicate that an integer expression has an integer value
and a real expression has a floating point value.
November, 1995 Page 47
YAY Reference Manual Thinkage Ltd.
Tokens may also be declared to have types. This is done
in the %token statement in the declaration section, and
again shows the union interpretation in angle brackets, e.g.
%token <realval> floatnum
If you use types in your YAY input, YAY will enforce
compatibility of types in all expressions. For example, if
you write
$$ = $2
in an action, YAY will demand that the two corresponding
tokens have the same type; otherwise, the assignment will be
marked as invalid. The reason for this is that YAY must
always know what interpretation of the union is being used
in order to generate correct code.
TThhee DDeeffaauulltt AAccttiioonn
The default action associated with any rule can be
written as
$$ = $1
which means that the value of associated with $1 on the
value stack is assigned to $$ on the value stack when the
rule is reduced. If, for example, $1 is an integer, then $$
will be the same integer after the reduction occurs.
On the other hand, suppose that the recognition action
associated with a rule explicitly states
$$ = $1
This explicit assignment may not have the same effect as the
implicit assignment. For example, suppose that you define
%union {
float floatval;
int intval;
}
Also suppose that the type associated with $$ is "floatval"
and the type associated with $1 is "intval". Then the
explicit statement
$$ = $1
Page 48 November, 1995
Thinkage Ltd. YAY Reference Manual
performs an integer to floating point conversion when the
value of $1 is assigned to $$, whereas the implicit
statement did an integer to integer assignment and did _n_o_t
perform this conversion. You must therefore be careful and
think about the effects of implicit vs. explicit
assignments.
November, 1995 Page 49
YAY Reference Manual Thinkage Ltd.
88.. AAmmbbiigguuiittiieess
Suppose we have a grammar with the rule
expr : expr '-' expr ;
and the parser is reading an expression of the form
expr - expr - expr
The parser reads this token by token, of course, so after
three tokens it will have
expr - expr
The parser recognizes this form. In fact, the parser could
reduce this right away into a single "expr" according to the
given grammar rule.
However, the parser has a problem. At this point, the
parser doesn't know what comes next, and perhaps the entire
line will be something like
expr - expr * expr
If it is, the precedence rules say that the multiplication
should be performed before the subtraction, so handling the
subtraction first is incorrect. The parser must therefore
read another token to see if it is really all right to deal
with the subtraction now, or if the correct action is to
skip the subtraction for the moment and deal with whatever
follows the second "expr".
In terms of parser states, this problem boils down to a
choice between _r_e_d_u_c_i_n_g the expression
expr - expr
or _s_h_i_f_t_i_n_g and acquiring more input before making a
reduction. This is known as a _s_h_i_f_t_/_r_e_d_u_c_e_ _ _c_o_n_f_l_i_c_t_.
Sometimes a parser must also choose between two possible
reductions. This kind of situation is called a
_r_e_d_u_c_e_/_r_e_d_u_c_e_ _c_o_n_f_l_i_c_t_.
In case you are curious, there is no such thing as a
shift/shift conflict. To see why this is impossible,
suppose that we have the following definitions.
Page 50 November, 1995
Thinkage Ltd. YAY Reference Manual
thing : a b
| a c
;
b : T rest_of_b;
c : T rest_of_c;
If the parser is in the state where it has seen "a", you
would have the diagram
thing : a.b
thing : a.c
You might think that if the lookahead symbol was the
token T, the parser would be confused, since T is the first
token of both "b" and "c". However, there is no confusion
at all. The parser just Shifts to a state that we could
diagram as
b : T.rest_of_b
c : T.rest_of_c
88..11 RReessoollvviinngg CCoonnfflliiccttss bbyy PPrreecceeddeennccee
The precedence directives (%left, %right, and %nonassoc)
let YAY-produced parsers resolve shift/reduce conflicts in
an obvious way:
(a) The precedence of a Shift operation is defined to be
the precedence of the token on which the Shift takes
place.
(b) The precedence of a Reduce operation is defined to be
the precedence of the grammar rule that the Reduce
operation uses.
If you have a shift/reduce conflict, and the Shift and
Reduce operations both have a precedence, the parser chooses
the operation with the higher precedence.
88..22 DDiissaammbbiigguuaattiinngg RRuulleess
Precedence cannot resolve conflicts if one or both
conflicting operations have no precedence. For example,
consider the following.
statmt: IF '(' cond ')' statmt
| IF '(' cond ')' statmt ELSE statmt ;
Given this rule, how should the parser interpret the input
November, 1995 Page 51
YAY Reference Manual Thinkage Ltd.
IF ( cond1 ) IF ( cond2 ) statmt1 ELSE statmt2
There are two equally valid interpretations of this input:
IF ( cond1 ) {
IF ( cond2 ) statmt1
ELSE statmt2
}
and
IF ( cond1 ) {
IF ( cond2 ) statmt1
}
ELSE statmt2
In a typical grammar, the IF and IF-ELSE statements
would not have a precedence, so precedence could not resolve
the conflict. Thus consider what happens at the point when
the parser has read
IF ( cond1 ) IF ( cond2 ) statmt1
and has just picked up ELSE as the lookahead symbol.
(1) It can immediately Reduce the
IF ( cond2 ) statmt1
using the first definition of "statmt" and obtain
IF ( cond1 ) statmt ELSE ...
thereby associating the ELSE with the first IF.
(2) It can Shift, which means ignoring the first part (the
IF with "cond1") and going on to handle the second
part, thereby associating the ELSE with the second IF.
In this case, most programming languages choose to
associate the ELSE with the second IF, i.e. they want the
parser to Shift instead of Reduce. Because of this (and
other similar situations), YAY-produced parsers are designed
to use the following rule to resolve shift/reduce conflicts.
_D_i_s_a_m_b_i_g_u_a_t_i_n_g_ _R_u_l_e_ _1_:
If there is a shift/reduce conflict in situations where
no precedence rules have been created to resolve the
conflict, the default action is to Shift. The conflict
will also be reported in the YAY output so you can
Page 52 November, 1995
Thinkage Ltd. YAY Reference Manual
check that Shifting is actually what you want. If it
isn't what you want, the grammar rules will have to be
rewritten.
This is known as a disambiguating rule because it helps
YAY-produced parsers resolve ambiguities. The rule is only
used in situations where precedence rules cannot resolve the
conflict. If both the Shift operation and the Reduce
operation have an assigned precedence, the parser can
compare precedences and decide which operation to perform
first. Even if the precedences are equal, the precedences
must have originated from either %left, %right, or
%nonassoc, so the parser knows how to handle the situation.
The only time the disambiguating rule is needed is when one
or both of the Shift or Reduce operations does not have an
assigned precedence.
In a similar vein, YAY-produced parsers use the
following rule to resolve reduce/reduce conflicts.
_D_i_s_a_m_b_i_g_u_a_t_i_n_g_ _R_u_l_e_ _2_:
If there is a reduce/reduce conflict, the parser will
always Reduce by the rule that was given _f_i_r_s_t in the
rules section of the YAY input. Again, the conflict
will be reported in the YAY output so that users may
ensure that the choice of Reduction is correct.
Note that precedence is _n_o_t consulted in reduce/reduce
conflicts. YAY always Reduces by the earliest grammar rule,
regardless of precedence.
Finally, YAY-produced parsers use a third disambiguating
rule to prevent excessive errors.
_D_i_s_a_m_b_i_g_u_a_t_i_n_g_ _R_u_l_e_ _3_:
If the parser is asked to choose between shifting on
eerrrroorr or on any other item, it will choose the non-
error item.
The disambiguating rules are simple to state, but they
can have complex repercussions if the grammar is non-
trivial. If the grammar is sufficiently complicated, these
simple rules for resolving conflicts may not be capable of
handling all the necessary intricacies in the way you want.
You should pay close attention to all conflicts noted in the
parsing table report produced by YAY and should ensure that
the default actions taken by the parser are the desired
ones.
NNoottee:: the conflict between the rules
November, 1995 Page 53
YAY Reference Manual Thinkage Ltd.
statmt: IF '(' cond ')' statmt
| IF '(' cond ')' ELSE statmt ;
can be resolved by adding
%left ELSE
to the declarations section of the YAY input. This puts a
precedence on the ELSE action and says that it is left-
associative. This is another example of using a precedence
to resolve an ambiguity.
88..33 CCoonnfflliiccttss iinn YYAAYY OOuuttppuutt
If your grammar has shift/reduce or reduce/reduce
conflicts, you will also see a table of conflicts in the
statistics section of the Parser Description. For example,
if we change the rules section of our sample grammar to
stmt : IF stmt ELSE stmt
| IF stmt
| stmt stmt
| A ;
we get the following conflict report.
Conflicts:
State Token Action
5 IF shift 2
5 IF reduce (3)
5 A shift 1
5 A reduce (3)
This shows that State 5 has two shift/reduce conflicts.
If the parser is in State 5 and encounters an IF token, it
can shift to state 2 or reduce using rule 3. If the parser
encounters an A token, it can shift to state 1 or reduce
using rule 3. This is summarized in the final statistics
with the line
2 shift/reduce conflicts
Reading the conflict report shows you what action the
parser takes in case of a conflict -- the parser always
takes the _f_i_r_s_t action shown in the report. This action
will be chosen in accordance with the two disambiguating
rules.
Page 54 November, 1995
Thinkage Ltd. YAY Reference Manual
99.. AAddvvaanncceedd FFeeaattuurreess
In this chapter, we describe more specialized features
of YAY.
99..11 LLAALLRR((22)) GGrraammmmaarrss
An LALR(2) grammar has one more level of "lookahead"
than an LALR(1) grammar. When trying to decide how to parse
a given input, an LALR(1) parser sometimes looks at the next
input token to see if that makes a difference. In the same
situation, an LALR(2) parser can look at the next _t_w_o
tokens.
LALR(2) grammars are described in the same way as
LALR(1) grammars. To indicate that you want an LALR(2)
parser, you must put the option +LR2 on the command line
that invokes YAY.
An LALR(2) parser can perform a _l_o_o_k_a_h_e_a_d action as well
as Shift, Reduce, Goto, Accept, and Error. For example, in
a state diagram you might see
A shift 1
B reduce (2)
C lookahead
This means that if token A is encountered in this state, the
parser will shift to State 1; if token B is encountered, it
will reduce using Rule (2); and if token C is encountered,
it will lookahead.
_9_._1_._1_ _ _T_h_e_ _L_o_o_k_a_h_e_a_d_ _A_c_t_i_o_n
A state that has a Lookahead action has a list of
possible states the parser can enter next. These states
conflict with each other -- the Lookahead action isn't
needed if there is no conflict. The parser attempts to
resolve the conflict by reading one additional token. If
the grammar is set up properly, this token will be valid in
only one of the possible contexts, so the parser can choose
that rule instead of the other possibilities.
Thus the Lookahead action tests all the possibilities in
the list. It begins by making a private copy of the state
stack. (Actually, it just sets things up so that it looks
like it has a separate copy of the state stack -- it doesn't
really make a copy.) Next, it calls on a routine named
November, 1995 Page 55
YAY Reference Manual Thinkage Ltd.
"yy2lex" (described later) to obtain a second lookahead
token.
The Lookahead action then begins going through the
possible rules on its list. Some of these rules may be
Shifts, while others are Reduces -- the reductions are what
make it necessary to simulate a copy of the state stack,
because you don't want to play with the true stack.
The Lookahead action takes the action associated with
each possibility in the list, entering new states through
Shifts or Reduces. It then sees if the second lookahead
token causes an error in this new state. If an error
occurs, the parser pops back to the original Lookahead state
and checks the next possibility on the list.
When it finally finds a rule where the second lookahead
symbol doesn't cause an error, it pops back to the original
Lookahead state one last time. It gets rid of the stack
set-up that the Lookahead action was using and returns to
the old state stack. It then takes the action that it has
discovered in the list of possibilities.
Note that the Lookahead action takes the first possible
rule where the second lookahead symbol is valid. It is, of
course, possible that there are other valid possibilities in
the list. The order of the list of possibilities is based
on the standard disambiguating rules: a shift comes first,
followed by reductions in the order in which the
corresponding grammar rules were given.
_9_._1_._2_ _ _T_h_e_ _y_y_2_l_e_x_ _R_o_u_t_i_n_e
In order to perform the lookahead, "yyparse" calls a
routine called "yy2lex". This is a user-written lexical
analyzer, just as "yylex" is. It returns the same kind of
token value that "yylex" does. However, "yy2lex" should not
set the "yylval" value.
For some applications, "yy2lex" could even be the same
routine as "yylex". There are good reasons why the two
should be separate, however. Suppose, for the sake of
argument, that "yyparse" called "yylex" for all input.
"yyparse" might call "yylex" for the next token, then
immediately call "yylex" again for a lookahead. This second
call could overwrite important values like "yylval" before
they had been used, thereby confusing things greatly. For
such reasons, "yy2lex" and "yylex" sometimes have to be
different routines.
Page 56 November, 1995
Thinkage Ltd. YAY Reference Manual
In practice, parsers rarely call "yy2lex" -- the routine
is only needed in special circumstances. This means that
"yy2lex" usually can be much simpler than "yylex": "yylex"
must be able to handle all possible input tokens, while
"yy2lex" is only called in a few special cases. To find out
the tokens that "yy2lex" must be able to handle, run your
grammar through YAY and get a list of the states where the
parser performs a Lookahead instead of a Shift or Reduce.
These instances are the only ones where the parser will call
"yy2lex", so the associated tokens are the only ones that
the routine has to handle.
The values obtained by "yy2lex" are thrown away once the
parser has used them in its lookahead. What this means is
that the parser doesn't remember what "yy2lex" returned, so
it will call on "yylex" to re-read the token for normal
parsing. This means one of two things:
(a) "yy2lex" can duplicate the processing that "yylex"
does, then leave the pertinent information around for
"yylex" to obtain. This means that "yylex" must have a
way of knowing when "yy2lex" has already read the next
token (e.g. "yy2lex" can set a flag).
(b) "yy2lex" can cheat and only do minimal processing. For
example, "yy2lex" may only have to look at the first
character of the next token to figure out what is
coming next. "yy2lex" can then "ungetc" that character
and return an appropriate token value. When "yylex" is
called, it reads the full token in the usual manner.
The second approach is sufficient for most grammars. It
simplifies both "yylex" and "yy2lex".
_9_._1_._3_ _ _C_o_n_f_l_i_c_t_s
With an additional level of lookahead, there is the
potential for a staggering number of conflicts. For
example, you may lookahead to a new state to resolve one
conflict, only to find that the new state also has a
lookahead to resolve a conflict.
If an LALR(2) parser is performing a lookahead and
enters a state where the secondary lookahead token invokes
another Lookahead action, the parser ruthlessly resolves the
second Lookahead action by choosing the first possibility on
the second Lookahead list. This is not very elegant, so you
should avoid this double lookahead situation.
November, 1995 Page 57
YAY Reference Manual Thinkage Ltd.
Conflicts of this nature are reported in the usual
conflict table at the end of the State Description output.
99..22 MMuullttiippllee AAccttiioonnss
A rule can have more than one action. For example, you
might have
a : A1 {action1} A2 {action2} A3 {action3};
The non-terminal symbol "a" consists of symbols A1, A2, and
A3. When A1 is recognized, "action1" is invoked; when A2 is
recognized, "action2" is invoked; and when A3 is recognized
(and therefore the entire symbol "a"), "action3" is invoked.
In this case,
$1 -- is the value of A1
$2 -- is the value of $$ in action1
$3 -- is the value of A2
$4 -- is the value of $$ in action2
$5 -- is the value of A3
If types are involved, multiple actions become more
complicated. If "action1" mentions $$, there is no way for
YAY to guess what type $$ has, since it is not really
associated with a token or non-terminal symbol. You must
therefore state it explicitly by specifying the appropriate
type name in angle brackets between the two $ signs. If we
had
%union {
int intval;
float realval;
}
you might say
$<realval>$
in place of $$ in the action, to show that the result had
type ffllooaatt. In the same way, if "action2" refers to $2 (the
result of action1), you might say
$<realval>2
To deal with multiple actions, YAY changes the form of
the given grammar rule and creates grammar rules for dummy
symbols. The dummy symbols have names made up of a $
followed by the rule number. For example,
Page 58 November, 1995
Thinkage Ltd. YAY Reference Manual
a : A1 {action1} A2 {action2} A3 {action3};
might be changed to the rules
$21 : /* null */ {action1} ;
$22 : /* null */ {action2} ;
a : A1 $21 A2 $22 A3 {action3};
These rules will be shown in the Rules Summary of the Parser
Description report.
Note that this technique can introduce conflicts. For
example, suppose you have the definitions
a : A1 {action1} A2 X;
b : A1 A2 Y;
These are changed to
$50 : /* null */ {action1};
a : A1 $50 A2 X;
b : A1 A2 Y;
The definitions of "a" and "b" give a shift/reduce conflict
because the parser can't tell whether A1 followed by A2 has
the null string for $50 in the middle. It has to decide
whether to Reduce $50 or to Shift to a state diagrammed by
b : A1 A2.Y
This particular conflict could be resolved by using an
LALR(2) parser instead of LALR(1). However, there are more
complicated examples in which this is not possible.
There is another situation to watch for. Consider the
grammar
a : A1 c A2 X ;
b : A1 c A2 Y ;
c : /* nothing */ {action} ;
You might think that this is equivalent to
a : A1 {action} A2 X;
b : A1 {action} A2 Y;
but it is not. The first will not produce a conflict, but
the second one will. The second is converted into
a : A1 $50 A2 X;
b : A1 $51 A2 Y;
November, 1995 Page 59
YAY Reference Manual Thinkage Ltd.
$50 : {action} ;
$51 : {action} ;
even when the action is the same for both "a" and "b". If
the parser finds an A1 followed by a null string followed by
A2, it doesn't know if it should interpret the null string
as $50 or $51; therefore, it gets a conflict. Obviously, it
is better to use the first format, where the identical
actions are associated with a separate rule.
99..33 SSeelleeccttiioonn PPrreeffeerreennccee
A _s_e_l_e_c_t_i_o_n_ _p_r_e_f_e_r_e_n_c_e can be added to a grammar rule to
help resolve conflicts. The following input shows a simple
example of how a selection preference can resolve conflicts
between two rules.
a1 : a b ['+' '-']
{ /* Action 1 */ } ;
a2 : a b
{ /* Action 2 */ } ;
The selection preference is indicated by zero or more _t_o_k_e_n_s
inside square brackets. If the token that follows the "b"
is one of the tokens inside the square brackets, the parser
will use the grammar rule for "a1". If the token that
follows the "b" is not one of the given tokens, the parser
will use the rule for "a2". In this way, the conflict
between the two rules is resolved -- the preference tells
which one to select, depending on the value of the lookahead
token.
Notice that a selection preference states that a rule
_s_h_o_u_l_d be used when the next token is one of the ones listed
in the brackets and should _n_o_t be used if it is not in the
brackets.
The lookahead token is merely used to determine which
rule to select. It is _n_o_t part of the rule itself. For
example, suppose you have
a1 : a b ['+' '-'] ;
a2 : a b ;
xx : a1 op expr ;
and suppose you have an "a", a "b", and + as the lookahead
token. The + indicates that the "a" and "b" should be
reduced to "a1". The parser does this and finds that the
"a1" is part of the "xx" rule. The + lookahead token will
be associated with the "op" symbol in the "xx" rule. In
Page 60 November, 1995
Thinkage Ltd. YAY Reference Manual
other words, a selection preference does not "use up" an
input token; it just looks at the token value to help
resolve a conflict.
The $end symbol may be used inside square brackets to
indicate end-of-file in the input being parsed. For
example,
statement : line ['\n' $end]
shows that this rule is preferred if a "line" is followed by
a new-line character or the end of the file.
The square brackets of a selection preference may
contain no tokens, as in
x : y z [];
This says that the parser should never use this rule unless
it cannot be avoided.
When there are many selection preferences in the same
state, the parser must compare them to check for conflicts.
To do this, YAY chooses the most likely rule (based on
preference) and compares this to the other possible rules.
In order to understand what we mean by "most likely", an
example will help. Consider the grammar
a : b [ 'x' ]
| b [ 'y' ]
| b [ ]
;
YAY must consider what happens when a "b" symbol has been
encountered. If the "b" is followed by an 'x', the most
likely rule is
a : b [ 'x' ]
If the "b" is followed by a 'y', the most likely rule is
a : b [ 'y' ]
If the "b" is followed by any other token, the most likely
rule is
a : b [ ]
Once the most likely rule has been chosen, it will be
compared to the other rules in the set to make sure that
there are no conflicts. (In a previous release, YAY would
November, 1995 Page 61
YAY Reference Manual Thinkage Ltd.
compare every pair of rules, without choosing a most likely
one. In this case, the rules
a : b [ 'x' ]
a : b [ 'y' ]
would conflict with each other, because there was no way to
choose between the two if the next token was not 'x' or 'y'.
In this release, YAY will compare the most likely rule to
all the others, but will not compare all the possible
pairs.)
Selection preferences can also be stated using the
construct
[^ T1 T2 T3 ...]
where the first character is a caret (^) and T1, T2, etc.
are tokens. When this is put on the end of a rule, it
indicates that the rule should be used if the lookahead
token is _n_o_t one of the listed tokens. For example,
a1 : a b
{ /* Action 1 */ } ;
a2 : a b [^ '+' '-']
{ /* Action 2 */ } ;
says that rule "a2" should be used if the token after the
"b" is _n_o_t + or -. If the token is + or -, "a2" should not
be used (so "a1" will be).
The construct [^] is the reverse of []. It is used to
indicate a rule that should _a_l_w_a_y_s be taken unless there is
another rule that must be taken because of a selection
preference.
Selection preference constructs can be put in the middle
of rules as well as on the end. For example, we could write
expr : expr ['+' '-'] op expr
{ /* Action 1 */ }
| expr op expr
{ /* Action 2 */ } ;
This states that if the first "expr" is followed by + or -
you want to use the first rule; otherwise, you want to use
the second. Note that the preference does not use up the +
or - token: you still need a symbol ("op") to represent such
tokens.
Page 62 November, 1995
Thinkage Ltd. YAY Reference Manual
Selection preferences that appear in the middle of a
rule are implemented in the same way as multiple actions,
using dummy rules. The above example would result in
something like
$23 : ['+' '-'] ;
expr : expr $23 op expr
{ /* Action 1 */ }
| expr op expr
{ /* Action 2 */ } ;
(where the 23 in $23 is just a number we chose at random).
The dummy rule that is created is a null string with the
selection preference on the end. The first token for "op"
will be the + or - that was the lookahead token in rule $23.
If a selection preference in the middle of a rule is
immediately followed by an action, only one dummy rule is
created to handle both the action and the preference.
In most cases, a selection preference counts as a $N
symbol, but it has no associated value. For example, in
expr : expr ['+' '-'] op expr
we have
$1 -- first "expr"
$2 -- no value
$3 -- "op"
$4 -- second "expr"
If the preference is followed by an action, the preference
and the action count as a single $N symbol whose value is
equal to the $$ value of the action. For example, in
expr : expr ['+' '-'] {action} op expr
we have
$1 -- first "expr"
$2 -- $$ of action
$3 -- op
$4 -- second "expr"
The %%pprreecc construct is incompatible with rules that
contain selection preferences, because the preference is all
that is needed to resolve conflicts. For this reason, YAY
issues an error message if a rule contains both a preference
and the %%pprreecc construct.
November, 1995 Page 63
YAY Reference Manual Thinkage Ltd.
Selection preferences can be used to resolve most
conflicts. Indeed, there may be cases where the most
practical course of action is to write a number of
conflicting rules that contain selection preferences to
resolve the conflicts, as in
expr : expr ['+' '-'] op expr
| expr ['*' '/' '%'] op expr
| expr ['&' '|'] op expr
...
NNoottee:: selection preferences of the form
[error]
[^ error]
are not useful. Selection preferences are implemented
through (dummy) Reduce actions, but the parser's error-
handling routines always look for Shift actions and ignore
Reductions.
99..44 NNeeggaattiivvee NNuummbbeerrss iinn $$NN CCoonnssttrruuccttss
YAY lets you use constructs like $0, $-1, $-2, and so on
in recognition actions. These were once important, but the
techniques for specifying multiple actions have made them
obsolete. YAY only supports the constructs for
compatibility with older grammars.
To understand what these constructs mean, it is
important to think in terms of the state stack. Each $N
construct is associated with a state on the stack; the value
of $N is the value of the token or non-terminal symbol
associated with the state at the time of a Reduce operation.
(Recall that recognition actions are executed when the
appropriate Reduce action takes place.)
$1 is the value associated with the state that found the
first component of the grammar rule; $2 is the value
associated with the second state, and so on.
$0 is the value associated with the state that was on
top of the stack before the first component of the grammar
rule was found.
$-1 is the value associated with the state before that,
and so on. All of these states are still on the stack, and
their value can be obtained in this way.
Page 64 November, 1995
Thinkage Ltd. YAY Reference Manual
As an artificial example, suppose that a grammar has the
rules
stmt : IF condition stmt
| WHILE condition stmt
condition : /* something */
{ /* action */ }
The action associated with the condition can use the $-1
construct to find out if the preceding token was IF or
WHILE. (Of course, this assumes that the only items that
can precede a condition are the IF and WHILE tokens.) There
are occasionally times when this sort of information is
needed.
99..55 LLiissttss aanndd HHaannddlliinngg NNuullll SSttrriinnggss
Grammars often define lists of items. There are two
common ways to do this:
list : item
| list item ;
or
list : /* null */
| list item ;
The first definition means that every "list" has at least
one item. The second allows zero-length lists.
Using the second definition is sometimes necessary or
convenient, but it can lead to difficulties. To understand
why, consider a grammar with
list1 : /* null */
| list1 item1 ;
list2 : /* null */
| list2 item2 ;
list : list1
| list2 ;
When the parser is in a position to look for a "list",
it automatically finds a null string, then gets a conflict
because it can't decide if the null string is an instance of
"list1" or "list2". This problem is less likely to happen
if you define
November, 1995 Page 65
YAY Reference Manual Thinkage Ltd.
list1 : item1
| list1 item1 ;
list2 : item2
| list2 item2 ;
list : /* null */
| list1
| list2
;
The parser can determine if it has a "list1" or a "list2" by
seeing if the list starts with "item1" or "item2".
A YAY-produced parser avoids infinite recursions that
might result from matching the same null string over and
over again. If the parser matches a null string in a
particular state, goes through a few more states and shifts
once more into the state where the null string was matched,
it will not match the null string again. Without this
behaviour, you get infinite recursions on null strings.
However, the behaviour occasionally gets in the way if you
_w_a_n_t to match more than one null string in a row. For
example, consider how you might write the grammar rules for
types that may be used in a C cast operation, as in
char_ptr = (char *) float_ptr;
The rules for the parenthesized cast expression might be
written as
cast : '(' basic_type opt_abstract ')' ;
opt_abstract : /* null */
| abstract;
abstract : '(' abstract ')'
| opt_abstract '[' ']'
| opt_abstract '(' ')'
| '*' opt_abstract
;
Consider what happens with a cast like
(int *[])
This will be interpreted as * followed by a null
"opt_abstract" followed by a null "opt_abstract" followed by
square brackets. However, the parser will _n_o_t accept two
null "opt_abstracts" in a row, and will take some other
course of action.
To correct this problem, you must rewrite the grammar
rules. Instead of using the "opt_abstract" rules, have
rules with and without an "abstract".
Page 66 November, 1995
Thinkage Ltd. YAY Reference Manual
cast : '(' basic_type abstract ')' ;
abstract : /* null */
| abstract '[' ']'
| '[' ']'
| abstract '(' ')'
| '(' ')'
| '*' abstract
| '*'
;
99..66 RRiigghhtt vvss.. LLeefftt RReeccuurrssiioonn
Chapter 3 mentioned left and right recursion. For
example, if a program consists of a number of statements
separated by semicolons, we might define it with right
recursion as
program : statement
| statement ';' program ;
or with left recursion as
program : statement
| program ';' statement ;
If you think about the way that the state stack works, you
will see that the second way is much to be preferred.
Consider, for example, the way something like
S1 ; S2 ; S3 ; S4
would be handled (where all the Sn's are statements).
With right recursion, the parser would gather "S1;" and
then go looking for a program. To gather this program, it
would gather S2. It would then look at the lookahead symbol
; and see that this program had the form
statement ';' program
The parser would then go ahead and gather the program after
the semicolon. But after S3, it would find another
semicolon, so would begin gathering yet another program. If
you work the process through, you will find that the state
stack grows to seven entries (one for each Sn and one for
each ;) before the first Reduce takes place.
November, 1995 Page 67
YAY Reference Manual Thinkage Ltd.
On the other hand, if you have the left recursion
program : program ';' statement
and the same input, the parser will perform a Reduce as soon
as it sees
S1 ; S2
This is Reduced to a single state corresponding to the non-
terminal symbol "program". The parser reads ";S3" and
Reduces
program ; S3
to "program" again. The process repeats for the last
statement.
If you follow this through, the state stack never grows
longer than three states, as compared to the seven that are
required for the right recursive rule. With right
recursion, no reduction takes place until the entire list of
elements has been read; with left recursion, a reduction
takes place as each new list element is encountered. Left
recursion can therefore save a lot of stack space.
The choice of left or right recursion can also affect
the order in which recognition actions are performed.
Suppose T is a token. If we define
x : /* null */
| y ',' x {a1} ;
y : T {a2} ;
then the input
T , T , T
would execute recognition actions in the order
{a2} {a2} {a2} {a1} {a1} {a1}
The {a2} actions are performed each time a T is Reduced to
"y". The {a1} actions don't happen until the entire list
has been read, because right recursion reads the entire list
before any Reduce actions take place.
Page 68 November, 1995
Thinkage Ltd. YAY Reference Manual
On the other hand, if you define
x : /* null */
| x ',' y {a1} ;
y : T {a2};
the recognition actions for the same input will take place
in the order
{a2} {a1} {a2} {a1} {a2} {a1}
With left recursion, Reduce actions take place every time a
new element is read in for the list.
This means that if you want the action order
{a2} {a2} {a2} {a1} {a1} {a1}
you must use right recursion even though it takes more stack
space.
99..77 YYYYDDEEBBUUGG
If you ##ddeeffiinnee a symbol named YYDEBUG in the
declarations section and set the variable "yydebug" to a
non-zero value, your parser will print a good deal of
debugging information as it parses input. In this
description, we will describe the output you may see.
Every time "yylex" obtains a token, the parser prints
read T (VALUE)
T is the name of the token and VALUE is the numeric value.
Thus if "yylex" has read an IF token, you might see
read IF (257)
Every time the parser enters a state, it will print out
state N (X), char (C)
where N is the state number as given in the State
Description report, and X and C are other integers. X is
another number for the state -- YAY actually renumbers the
states and grammar rules after it generates the State
Description report in order to improve the parser's
efficiency, and X gives the state number after renumbering.
C is the token type of the lookahead symbol if the symbol is
November, 1995 Page 69
YAY Reference Manual Thinkage Ltd.
a token. If the symbol is not a token, or if there is no
lookahead symbol at the moment, C is -1. As an example,
state 6 (22), char (-1)
indicates that the parser has entered State 6 on the State
Description Report (State 22 after renumbering) and that the
current lookahead symbol is not a token.
Every time the parser performs a Shift action, it prints
shift N (X)
where N is the number of the state to which the parser is
shifting and X is the number of the same state after
renumbering.
Every time the parser performs a Reduce action, it
prints
reduce N (X), pops M (Y)
This says that the parser has Reduced by grammar rule N
(renumbered to X). After the reduction, the state on top of
the state stack was State M (renumbered to Y).
99..88 IImmppoorrttaanntt SSyymmbboollss
Debugging a parser produced by YAY can be a very
difficult task, since the code is only partly produced by
user input. The remaining code is standard software
produced by YAY itself. The situation is aggravated by
another problem: the state and rule numbers shown in the
State Description report are not the same as the numbers
used when the parser actually runs. In the interests of
optimization, the parser sorts the states into a more
convenient order. Thus the _i_n_t_e_r_n_a_l state number used by
the program is usually not the same as the _e_x_t_e_r_n_a_l state
number known to the user.
To help those who are examining parser code using a
symbolic debugger, here are a few of the important variables
that the parser uses.
yyval
holds the value $$ at the time of a reduction. This
has the type YYSTYPE.
yychar
holds the most recent token value returned by "yylex".
Page 70 November, 1995
Thinkage Ltd. YAY Reference Manual
yystate
is the _i_n_t_e_r_n_a_l number of the current state.
yyps
points to the current top of the state stack. Thus
yyps[0] is the internal number of the current state,
yyps[-1] is the internal number of the previous state,
and so on.
yypv
points to the top of the current value stack. The
entries in this stack have the type YYSTYPE. When a
Reduce operation executes a recognition action, this
pointer is moved down the stack to the point where
yypv[1] = $1
yypv[2] = $2
etc.
yyi
is the internal number of the rule being reduced by a
Reduce action.
yyrmap
is an array present only when YYDEBUG is defined. It
is used to convert internal rule numbers to external
ones. For example,
yyrmap[yyi]
is the external number of the rule being reduced by a
Reduce action.
yysmap
is an array present only when YYDEBUG is defined. It
is used to convert internal state numbers to external
ones. For example,
yysmap[yystate]
is the external number of the current state.
99..99 HHooww YYYYEERRRROORR MMaayy BBee UUsseedd
The YYERROR macro creates an artificial error condition.
To show how this can be useful, suppose we have a line-by-
line desk calculator that allows parenthesization of
expressions and suppose we have a variable named "depth"
that keeps track of how deeply parentheses are nested.
November, 1995 Page 71
YAY Reference Manual Thinkage Ltd.
Every time the parser finds an opening parenthesis, it adds
1 to "depth". Every time it finds a closing parenthesis, it
subtracts 1.
Consider how the following definitions will work.
expr : lp expr ')'
{depth--;}
| lp error
{depth--;}
;
lp : '(' {depth++;};
If no error occurs, the "depth" variable is incremented
and decremented correctly. If an error does occur, however,
what happens? Your "yyerror" routine is called on to
recover from the error in the middle of an expression.
Often, it is more reasonable to postpone this recovery until
you reach a point where you have a whole expression.
Therefore, you might use the following alternate definition.
expr : lp error
{depth--; YYERROR;}
;
line : error '\n' prompt line
{ $$ = $4; } ;
prompt : /* null token */
{printf("Please re-enter line.\n");};
Now, what happens when the grammar is asked to parse a line
like
1 + (( a +
When the end of the line is encountered, the parser
recognizes an error has occurred. Going up the stack, the
first state ready to handle the error is
expr : lp error ;
At this point, the parser will Reduce the input
( a +
into an "expr". The Reduction executes the recognition
action: it decrements "depth", then signals that an error
has taken place. The Error action begins popping the stack
again. It will find the previous opening parenthesis,
recognize another
lp error
Page 72 November, 1995
Thinkage Ltd. YAY Reference Manual
construct and perform another reduction. The parenthesis
count is again decremented, and another error condition
generated.
This time, the grammar rule that deals with the error is
the definition of "line". An error message is issued and a
new line is requested. In this way, the parser has worked
its way back to error-handling code that can deal with the
situation. Along the way, the parser correctly decremented
the "depth" variable to account for the missing parentheses.
This method of dealing with errors decrements "depth"
for every unbalanced opening parenthesis on the line. This
corrects the "depth" count properly. Our first definition
(without the YYERROR call) would only have decremented
"depth" once.
This example is somewhat contrived, of course -- you
could always just set "depth" to zero whenever you started a
new line of input. The usefulness of the technique is more
apparent in situations where you obtain memory with "malloc"
whenever you get an opening delimiter and free the memory
with "free" whenever you get a closing delimiter. In this
case, it is obvious that you need to do precisely as many
"free" operations as "malloc" operations, so you must raise
the error condition for each unbalanced opening delimiter.
You might think that the symbol "lp" is unnecessary, and
you could just define
expr : '(' {depth++;} expr ')' {depth--;}
| '(' error {depth--;} ;
However, this would not work in general. There is no
guarantee that the action
{depth++;}
would be executed in all cases, particularly if the token
after the '(' was one that could not start an expression.
As an interesting example of another way to use YYERROR,
consider the following (taken from a parser for the Pascal
programming language).
label_list : label_list ',' label
| label
| error
| error [LABEL CONST VAR PROC FUNC BEGIN]
{YYERROR; /* more code */};
November, 1995 Page 73
YAY Reference Manual Thinkage Ltd.
This deals with errors in two different ways:
(a) If an error is followed by one of the tokens LABEL,
CONST, etc. (representing the beginning of new
declaration sections in Pascal), the input is reduced
to a complete "label_list" input is Reduced to a
complete "label_list" and an appropriate action is
taken. This action uses YYERROR to raise the error
condition, but only _a_f_t_e_r the reduction has taken
place.
(b) The other rule is used when the parser finds an error
which is not followed by one of the listed tokens.
This corresponds to an error in the middle of a label
list and requires a different sort of handling. In
this case, error-handling is allowed to take place
immediately, without reduction, because there may be
more "label_list" to come.
This kind of approach can be used to distinguish
different kinds of errors that may take place in a
particular situation.
99..1100 TThhee DDeeffaauulltt AAccttiioonn
The default action is the one that is taken when the
parser finds a token which has no specified effect in the
current state. Understanding how default actions work will
help you understand what is going on when a YAY-produced
parser encounters an error.
In a state diagram, the default action is marked with a
".". The default will always be a Reduce or Error action,
chosen according to the following rules.
(a) If the state has no Shift actions and only one Reduce,
the default will be the Reduce action.
(b) Apart from (a), an empty rule will never have Reduce as
a default.
(c) If a state has more than one Reduce action, the parser
examines the "popularity" of each Reduce. For example,
if Reduction A is used with any of three different
input tokens and Reduction B is used with only one
input token, Reduction A is three times as "popular" as
B. If one Reduce action is more than twice as popular
as its closest contender (i.e. if it is taken on more
than twice as many input tokens), and if that Reduce
Page 74 November, 1995
Thinkage Ltd. YAY Reference Manual
action is associated with a rule that contains
reductions on at least _f_i_v_e tokens, the popular Reduce
action is made the default.
(d) In all other cases, the default action will be an Error
action. For example, Error is chosen when a rule has
more than one Reduce action, and there is no Reduce
that is more than twice as popular as all the other
contenders.
Note: YAY's predecessor YACC always chooses the most
popular Reduce action as default (if there is one). It does
not use the same requirements as (c) above. As a result of
this difference YAY's parser tables are about 20% larger
than YACC's, but a YAY-generated parser usually detects
errors much earlier than a parser generated by YACC.
Because of the way default actions are treated, a YAY-
produced parser will sometimes begin reducing when it
encounters an error. Several reductions may take place
before the error state is finally triggered. Thus your
grammar may need some way to determine what actions have
taken place between the time the erroneous token was read in
and the time that the error was actually triggered.
99..1111 IInnvvaalliidd TTookkeennss
It is invalid to say either
%token X 0
or
%token X 256
The value 0 is reserved for the end marker and 256 is
reserved for "error".
99..1122 DDyynnaammiicc SSttaacckk AAllllooccaattiioonn
The manifest YYSSIZE is used to determine the size of
the state and value stacks used by "yyparse". YYSSIZE gives
the maximum number of elements that these stacks will be
expected to hold; the size of each value element is dictated
by the YYSTYPE type and the size of each state element is
determined by YAY. The default value of YYSSIZE is 150. By
increasing the value of YYSSIZE, you can allow for grammars
with a larger number of pending states.
If you ##ddeeffiinnee YYALLOC in the declarations section, the
state and value stacks used by "yyparse" will be allocated
November, 1995 Page 75
YAY Reference Manual Thinkage Ltd.
dynamically via "malloc" and freed before "yyparse" returns.
What this effectively means is that "yyparse" makes itself
re-entrant by saving a number of externals when it begins
execution and restoring them upon completion. The externals
involved are
yylval yyval yypvt
yynerrs yychar yyerrflag
If you "longjmp" out of "yyparse" (due to an action), the
externals are _n_o_t restored, and "yyparse" will not be re-
entrant.
If you use YYALLOC, "yyerror" will be called if "malloc"
fails to allocate space for the state and value stacks.
If you ##ddeeffiinnee YYALLOC with a value greater than 10, the
parser allocates the state and value stacks dynamically,
beginning with a size of YYSSIZE. If this is not big enough
to hold the two stacks, "yyparse" attempts to use "realloc"
to grow the stacks by the amount given by YYALLOC. For
example, if YYALLOC is 20, "yyparse" tries to grow the
stacks to a size that will allow 20 additional elements
(whenever "yyparse" needs more space for the stacks).
"yyerror" is called if the call to "realloc" fails to
allocate additional space.
If you set up your parser in such a way that it may grow
the stacks, you must be careful not to take a pointer into a
stack in one action and use that pointer inside a different
action. The reason is that the stack may be grown using
"realloc" in between the two actions. Since "realloc" may
actually move the entire stack, the pointer will no longer
be valid. Thus you should not create pointers with
expressions like
&($1)
and expect those pointers to be valid in other actions.
Within the code of a single action, however, such pointers
wwiillll remain valid.
If you ##ddeeffiinnee YYSTATIC, both the state and value stacks
will be static. Otherwise, the state stack will be "auto"
(allocated on the program stack) and the value stack will be
static. Defining YYALLOC saves both stack space and static
space; defining YYSTATIC saves stack space.
Page 76 November, 1995
Thinkage Ltd. YAY Reference Manual
99..1133 SSyynncchhrroonniizziinngg tthhee LLooookkaahheeaadd
If you ##ddeeffiinnee YYSYNC, the parser will always have a
lookahead token when it performs a shift or reduce action.
If the symbol is not defined, the parser will only obtain a
lookahead token if the value of the token is needed.
You would define YYSYNC in situations where you wanted
to keep track of the line number on which various situations
occurred (e.g. errors). If "yylex" _a_l_w_a_y_s does a lookahead,
you know that the line number of the token you are working
with is the line number of the second last token read. If
"yylex" sometimes does not do a lookahead, you don't know if
the current line number in "yylex" is the line number for
the current token or for a lookahead token.
You would avoid defining YYSYNC in situations where some
actions do their own reading. For example, suppose that /*
is a token that indicates the beginning of a comment. You
could create an action for that token which reads all input
up to a closing */. With this kind of action, you would not
want "yylex" to read a lookahead token, since that token
would be the first token inside the comment, not the first
token after the comment.
99..1144 YYYYSSHHIIFFTT
The generated parser program invokes a macro named
YYSHIFT whenever the parser performs a shift action in
response to a token. The macro is invoked without
arguments. By default, YYSHIFT is defined to do nothing;
however, users can redefine YYSHIFT if desired, in the token
definition part of the YAY input.
As an example of how you might use YYSHIFT, consider a
situation where input is freeform and may be split over many
lines of text, including the insertion of blank lines in the
input. You may want to keep a count of the number of lines
you've read as you parse the input so that you can provide
error messages like "Line 12: Syntax error".
When the parser finishes reading a line, it often has to
read ahead to the next token before it can decide whether to
reduce what it already has or keep on reading additional
input. Since there may be blank lines in the input, this
means that the parser may actually read ahead several lines
before finding the next token. When it sees the next token,
the parser may decide to reduce what it has seen already
before shifting in response to the token.
November, 1995 Page 77
YAY Reference Manual Thinkage Ltd.
While the reduction is happening, you want your line
count to reflect the last line of input. You do not want
the line count to reflect the line where the parser found
the new token, because the parser isn't really using that
token yet; it's working with older input. You only want to
update the line count when the parser is ready to shift on
the new token.
That's where YYSHIFT comes in. You can define the
YYSHIFT macro to update the line count at the point when the
token is actually used, not when it is first read.
YYSHIFT is _n_o_t invoked if a token is discarded (because
of error handling or some other reason). It is only invoked
when the parser shifts on a token.
Page 78 November, 1995
Thinkage Ltd. YAY Reference Manual
AAppppeennddiixx AA
AAnn EExxaammppllee
This appendix gives a simple example of YAY input. We
have omitted the recognition actions of the grammar rules in
order to keep the example as simple as possible.
The parser implements a simple string manipulation
language. It has two types of objects: strings and
variables. Strings can have a maximum of 100 characters and
are enclosed in double quotes. Variable names are a maximum
of eight characters long.
There are three operations:
(a) = assigns a string to a variable;
(b) JOIN or + concatenates strings or the contents of
variables;
(c) PRINT outputs strings or the contents of variables.
Every statement is a single input line and blanks are used
to separate tokens on the line. Multiple = assignments are
permitted as in
A = B = "hello"
The keyword PRINT can only appear once on the line.
If a line does not contain an assignment, the value of
the expression on the line is merely printed out. This
means that it is valid to have a line that only contains a
string or a variable name. The end of the input file marks
the end of input.
Here are the declarations section of the YAY input and
the "yylex" routine (both written in C).
%{ /* declarations */
#include <stdio.h>
#define MAXSTRING 100
char str[MAXSTRING];
#define MAXVAR 8
char var[MAXVAR];
%}
%token VARIABLE STRING
%nonassoc PRINT
%right '='
%left JOIN
November, 1995 Page 79
YAY Reference Manual Thinkage Ltd.
%union {
char *cstr;
}
%%
program : stat
| program stat
;
stat : printexpr '\n'
| expr '\n'
;
printexpr : PRINT expr ;
expr : VARIABLE '=' expr
| JOIN expr expr
| STRING
| VARIABLE
;
%%
int yylex()
{
char c, *stringet(), *wordget();
extern YYSTYPE yylval;
while ( (c=getchar()) == ' ' )
/* skip blanks */;
switch(c) {
case '\n': /* end of line */
return('\n');
case EOF: /* end of file */
return(0);
case '+': /* same as JOIN */
return(JOIN);
case '"': /* start of string */
yylval.cstr = stringet();
return(STRING);
default: /* keyword or variable */
yylval.cstr = wordget(c);
if ( !strcmp("JOIN",yylval.cstr) )
return(JOIN);
else if ( !strcmp("PRINT",yylval.cstr) )
return(PRINT);
else return(VARIABLE);
}
}
char *stringet()
{
extern char str[];
int i;
Page 80 November, 1995
Thinkage Ltd. YAY Reference Manual
for ( i=0; i < MAXSTRING; i++ ) {
str[i] = getchar();
if ( (str[i]=='"') || (str[i]=='\n') )
break;
}
str[i] = '\0'; /* mark end of string */
return(str);
}
char *wordget(char c)
{
extern char var[];
int i;
var[0] = c; /* first letter obtained already */
for ( i=1; i < MAXVAR; i++ ) {
var[i] = getchar();
if ( (var[i]==' ') || (var[i]=='\n') )
break;
}
var[i] = '\0';
return(var);
}
void yyerror(char *s)
{
fprintf(stderr,s);
}
Note that the lexical analyzer always flags the keywords
JOIN and PRINT as operations. This means that you cannot
use JOIN or PRINT as variable names. In fact, this is a
general limitation of LALR(1) parsers. It is very difficult
to have names that are keywords in some contexts and
variable names in others, without a great deal of fiddling.
Therefore, we recommend that you resign yourself to having
_r_e_s_e_r_v_e_d keywords, i.e. keywords forbidden for use as
variable names.
The above example omitted recognition actions in the
Rules Section to make things easier to read.
November, 1995 Page 81
YAY Reference Manual Thinkage Ltd.
AAppppeennddiixx BB
YYAAYY vvss.. YYAACCCC
YAY differs from its predecessor YACC in a number of
respects. These are summarized below.
(a) YACC does not support the following.
-- all the LALR(2) features described in 9.1
-- selection preferences, as described in 9.3
-- the use of YYSTATIC, YYSYNC, or YYALLOC (9.12)
(b) If a state has several Reduce action, YACC always
chooses the most popular of these as the default
action. YAY's rules for choosing a default are given
in 9.10. In general, YAY's approach will detect errors
more quickly (sooner after they actually occur).
(c) The two programs handle YYERROR differently. YAY
always reduces and pops the current state off the stack
before generating the artificial error condition. YACC
doesn't do this. With YACC, you cannot use YYERROR
inside a rule with "error" in it; YAY, however, will
resolve the current rule and then trigger error
handling.
(d) Output from the two programs differs slightly.
As a rule of thumb, YAY accepts any YACC grammar and
creates a parser that behaves the same for correct input.
However, the YAY version usually detects errors earlier than
the YACC parser, and has more Error actions in the state
tables.
Page 82 November, 1995
Thinkage Ltd. YAY Reference Manual
AAppppeennddiixx CC
TThhee LLIINNTT CCoommmmaanndd LLiinnee
_S_y_n_t_a_x_:
yay sourcefile Parser=outfile [option]*
(+|-)LR2 (-) (+|-)Verbose (-)
(+|-)Warnings (+) Description=file
Header=file INSTallation=file
Language=C|C++ (C) Parser=file
_E_x_a_m_p_l_e_s_:
yay cgram.y parse=myparse.c
c myparse.c
yay ccgram.y lang=c++ pars=myparse.cpp
_O_p_t_i_o_n_s_:
sourcefile
is a file containing YAY input.
Language=C
produces parsing tables in the C programming language.
This is the default.
Language=C++
produces parsing tables in the C++ programming
language. Note that YAY only produces the tables; the
routines that use the tables to parse input are
predefined.
+LR2
says that the YAY input describes an LALR(2) grammar.
Without +LR2, YAY assumes that the grammar is LALR(1).
The manual describes modifications that need to be made
for LALR(2) grammars.
Description=file
translates the parsing tables into a format that humans
can read, and writes this output into the given file.
Header=file
writes token definitions and other information
necessary for separate compilation, to the named file.
INSTallation=file
tells YAY where to find the installation file. The
installation file tells where various software
components have been installed. For more information,
see the section on _I_n_s_t_a_l_l_a_t_i_o_n_ _F_i_l_e_s below.
November, 1995 Page 83
YAY Reference Manual Thinkage Ltd.
If you do not specify an INSTallation= option on the
command line, YAY checks for an environment variable
named YAY_INST and uses its value as the name of the
installation file. If this environment variable does
not exist, YAY uses the default installation file.
Parser=file
writes the resulting source code for the parser into
the named file. If this option is omitted, YAY just
checks the syntax of your input.
+Verbose
produces verbose output -- everything that can be
flagged is flagged.
-Warnings
suppresses a number of warning messages that YAY
normally issues.
_D_e_s_c_r_i_p_t_i_o_n_:
YAY converts your context-free grammar into a C or C++
program that is written to the file specified by the Parser=
option.
If you use the Description= option, YAY writes a full
description of the grammar to the specified file. YAY only
displays a brief message on the standard output, summarizing
conflicts (and other information if you specify +Verbose).
On the other hand, if you do not use the Description=
option, YAY writes more information to standard output,
including descriptions of the states where conflicts occur.
In this case, YAY actually provides additional information
to help you identify the source of the conflicts; if you ask
for a description file, YAY outputs less information when
reporting the conflicts because it assumes you can track
down additional information by looking at the description
file. For this reason, you can sometimes get a quicker idea
of what has gone wrong if you do not ask for a description
file.
_C_+_+_ _P_a_r_s_e_r_s
In general, you only need to use Language=C++ if you
intend YYSTYPE to contain a C++ object with constructors.
If you intend to compile the parser with C++ but the %union
statement does not have any elements that need constructors,
it's best to use Language=C to get more efficient C code.
If YYSTYPE does contain elements that need constructors,
you need to define an appropriate constructor-like function
for the YYSTYPE type. This function should have the
prototype
Page 84 November, 1995
Thinkage Ltd. YAY Reference Manual
void name(YYSTYPE *p)
where "name" can be any valid name. In the declarations
section of the grammar, you must then add the statement
#define YYSTYPE_INIT name
where "name" is the name of the constructor-like function.
With Language=C++, the %union statement generates a
structure type rather than a union, since C++ does not allow
objects with constructors to belong to unions.
In many cases, the same grammar may be processed with
either Language=C or Language=C++.
_I_n_s_t_a_l_l_a_t_i_o_n_ _F_i_l_e_s_:
An installation file specifies the pathnames for
software and data files used by YAY. Installation files are
text files made up of comment lines and option lines.
Comment lines:
Any line whose first non-blank character is # will be
taken as a comment. Blank lines are also considered
comments.
Option lines:
Option lines have the format
Keyword=pathname
In this documentation, keywords are written with some
letters in upper case and some in lower case. You may
abbreviate keywords by omitting any or all of the
letters shown in lower case. The remaining letters may
be entered in either upper or lower case; the
documentation simply uses upper case to show which
characters may not be omitted.
In this version of YAY, there is only one valid option
line:
Library=pathname
The pathname should be the directory containing the YAY
parser template files (e.g. yyparse.c).
November, 1995 Page 85
YAY Reference Manual Thinkage Ltd.
_N_o_t_e_s_:
If you define YYALLOC, the parser allocates its state
and value stacks dynamically via malloc and free. This
shrinks your parser and helps prevent stack overflows.
Copyright 1995, Thinkage Ltd.
Page 86 November, 1995