home *** CD-ROM | disk | FTP | other *** search
- From: goer@midway.uchicago.edu (Richard L. Goerwitz)
- Newsgroups: comp.sources.misc
- Subject: v44i094: ibpag2 - Icon-Based Parser Generation System 2, v1.2, Part01/04
- Date: 25 Sep 1994 15:00:56 -0500
- Organization: University of Chicago
- Sender: kent@sparky.sterling.com
- Approved: kent@sparky.sterling.com
- Message-ID: <csm-v44i094=ibpag2.150037@sparky.sterling.com>
- Reply-To: goer@midway.uchicago.edu
- X-Md4-Signature: 451a3edb3dc22c0ddf2fa2005ecfdd61
-
- Submitted-by: goer@midway.uchicago.edu (Richard L. Goerwitz)
- Posting-number: Volume 44, Issue 94
- Archive-name: ibpag2/part01
- Environment: Icon, UNIX
- Supersedes: ibpag2: Volume 38, Issue 45-49
-
- Ibpag2 is a so-called "parser generator," i.e. a tool for automating the
- process of generating a recognizer and/or parser from abstract structural
- descriptions of an input language. Put in more practical terms, Ibpag2 is
- a piece of software that a) reads a source file containing a grammar that
- defines an input language, and then b) outputs an automaton that recognizes
- that language. The user may, at his or her option, specify actions this
- automaton should take when it sees various substructures within its input
- language. By default, however, the parser simply recognizes a given
- sequence as belonging, or not, to that language. This all may sound pretty
- exotic, but once you get past the jargon you'll find that parser generators
- have very practical (and often quite simple) uses.
-
- Ibpag2 utilizes so-called "LR" table generation and parsing
- algorithms. These algorithms facilitate construction of reasonably
- fast deterministic pushdown automata that are powerful enough to
- handle most commonly used programming language constructs. LR-based
- systems come in three main flavors: SLR(1), LALR(1), and LR(1). The
- LR(1) flavor is fairly easy to implement, but uses too many resources
- to be practical. LALR(1) algorithms are harder to implement, but much
- faster, and the parse tables they construct use considerably less
- memory than do those of their LR(1) counterparts. SLR(1) algorithms
- are the easiest to implement, compile the fastest, and use about as
- much memory as LALR(1)s. SLR(1) is the least powerful of the three,
- though, so there is a tradeoff. Ibpag2 is an "enhanced" SLR(1) parser
- generator. It is enhanced in the sense that it can operate both in
- its native SLR(1) mode, and in a more powerful "quasi-GLR" mode.
-
- As its full title ("Icon-Based Parser Generator 2") implies,
- Ibpag2 is written in Icon [2,3], as are the automata it creates.
- Ibpag2 has been tested with Icon version 8.10 and 9.0. The only
- change from the one version to the other is that under Icon 9.0 Ibpag2
- has to be compiled with string invocation turned on (-f s). So far I
- have run Ibpag2 on an i386 (Xenix 2.3.3), an i486 (Linux 1.0.9), and
- a Sun 4 (SunOS 4.1.1?). I have reports, though, of it running under
- other UNIX variants. It will probably also run under other operating
- systems, though modifications will in some instances be required.
- Using Ibpag2 under MS-DOS may not be possible, on account of the way
- DOS manages memory.
-
- Any comments on the Ibpag2 system itself or its documentation
- will be gratefully received. Please feel free to drop me,
- Richard Goerwitz, an e-mail message at goer@midway.uchicago.edu,
- or (via the post) at:
-
- 5410 S. Ridgewood Ct., 2E
- Chicago, IL 60615
-
- ------------
- #! /bin/sh
- # This is a shell archive. Remove anything before this line, then feed it
- # into a shell via "sh file" or similar. To overwrite existing files,
- # type "sh file -c".
- # Contents: README ibutil.icn
- # Wrapped by kent@sparky on Sun Sep 25 14:50:14 1994
- PATH=/bin:/usr/bin:/usr/ucb:/usr/local/bin:/usr/lbin:$PATH ; export PATH
- echo If this archive is complete, you will see the following message:
- echo ' "shar: End of archive 1 (of 4)."'
- if test -f 'README' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'README'\"
- else
- echo shar: Extracting \"'README'\" \(53409 characters\)
- sed "s/^X//" >'README' <<'END_OF_FILE'
- X
- X
- X
- X
- X
- X
- X A User's Manual for Ibpag2
- X (Icon-Based Parser Generation System 2)
- X Version 1.2
- X
- X - or -
- X
- X How to Use an LR-based Parser Generator
- X
- X
- X Richard L. Goerwitz, III
- X University of Chicago
- X
- X
- X
- X
- X
- X
- X1.__What_is_Ibpag2?
- X
- X Ibpag2 is a so-called "parser generator," i.e. a tool for
- Xautomating the process of generating a recognizer and/or parser from
- Xabstract structural descriptions of an input language. Put in more
- Xpractical terms, Ibpag2 is a piece of software that a) reads a source
- Xfile containing a grammar that defines an input language, and then b)
- Xoutputs an automaton that recognizes that language. The user may, at
- Xhis or her option, specify actions this automaton should take when it
- Xsees various substructures within its input language. By default,
- Xhowever, the parser simply recognizes a given sequence as belonging,
- Xor not, to that language. This all may sound pretty exotic, but once
- Xyou get past the jargon you'll find that parser generators have very
- Xpractical (and often quite simple) uses.
- X
- X Ibpag2 utilizes so-called "LR" table generation and parsing
- Xalgorithms. These algorithms facilitate construction of reasonably
- Xfast deterministic pushdown automata that are powerful enough to
- Xhandle most commonly used programming language constructs. LR-based
- Xsystems come in three main flavors: SLR(1), LALR(1), and LR(1). The
- XLR(1) flavor is fairly easy to implement, but uses too many resources
- Xto be practical. LALR(1) algorithms are harder to implement, but much
- Xfaster, and the parse tables they construct use considerably less
- Xmemory than do those of their LR(1) counterparts. SLR(1) algorithms
- Xare the easiest to implement, compile the fastest, and use about as
- Xmuch memory as LALR(1)s. SLR(1) is the least powerful of the three,
- Xthough, so there is a tradeoff. Ibpag2 is an "enhanced" SLR(1) parser
- Xgenerator. It is enhanced in the sense that it can operate both in
- Xits native SLR(1) mode, and in a more powerful "quasi-GLR" mode (on
- Xwhich, see section 5 below).
- X
- X As its full title ("Icon-Based Parser Generator 2") implies,
- XIbpag2 is written in Icon [2,3], as are the automata it creates.
- XIbpag2 has been tested with Icon version 8.10 and 9.0. The only
- Xchange from the one version to the other is that under Icon 9.0 Ibpag2
- Xhas to be compiled with string invocation turned on (-f s). So far I
- Xhave run Ibpag2 on an i386 (Xenix 2.3.3), an i486 (Linux 1.0.9), and
- Xa Sun 4 (SunOS 4.1.1?). I have reports, though, of it running under
- Xother UNIX variants. It will probably also run under other operating
- Xsystems, though modifications will in some instances be required.
- XUsing Ibpag2 under MS-DOS may not be possible, on account of the way
- XDOS manages memory.
- X
- X The Ibpag2 distribution adheres to de facto UNIX installation
- Xstandards: Just set the appropriate variables in the makefile, and
- Xthen "make install." For those who are using a non-UNIX system, or
- Xwho have not installed such a package before, there is a section at
- Xthe end entitled "Installing Ibpag2" that details the installation
- Xprocedure (section 6).
- X
- X Aside from the above-mentioned installation section (6), the
- Xremainder of this document aims to provide the reader a) with a
- Xsimple, practical explanation of what LR-family parser generators are
- Xand how they work (section 2), and b) with a set of directions
- Xspecifically on how to use Ibpag2 (section 3). There is also an
- Xadvanced section on debugging (4), and one on using Ibpag2 with non-LR
- Xand/or ambiguous languages (5). The discussion is geared for those
- Xthat have little or no experience in parsing or automaton theory. For
- Xvery advanced reading, consult the bibliography. For a brief summary
- Xof Ibpag's command-line options, see the main Ibpag2 source file,
- Xibpag2.icn, or invoke ibpag2 with the -h (help) option.
- X
- X In general, be warned that Ibpag2 works best with small or
- Xmedium-sized grammars. Its parse tables have to be reconstructed at
- Xrun-time, and the code for doing this can become a bit cumbersome for
- Xgrammars with more than 100 rules and fifty or so terminal symbols. I
- Xmyself have processed grammars with as many as 300 terminals and 400
- Xrules. Although the resulting automata run well enough, the output
- Xfiles are over 300k, and Ibpag2 takes a long time to create them. If
- Xyou must use Ibpag2 with a very large grammar symbols, try the -c
- Xcommand-line option (which produces compressed parse tables). This
- Xoption is discussed below, in section 4. Compiling (rather than
- Xinterpreting) Ibpag2 may result in much faster processing, as will
- Xresetting your BLOCKSIZE and STRSIZE environment variables. See the
- Xinstallation section (6) below on using the Icon compiler to create
- Xthe Ibpag2 executable. Good starting values for BLOCKSIZE and STRSIZE
- Xare triple their default values (i.e. 3 x 65000). These variables are
- Xdiscussed in the Icon manual page.
- X
- X My ultimate aim in writing this document has been to make
- Xaccessible to the non-CS portion of the Icon community what for them
- Xmight seem an inaccessible branch of applied parsing and automaton
- Xtheory. I am a philologist myself, and feel that there is a great
- Xdeal that can and ought to be done to make advanced tools accessible
- Xto people without much intrinsic interest in computer science other
- Xthan to use its tools and methods to get other work done. Despite my
- Xgoal to serve a non-technical audience, this introduction must still
- Xget pretty technical at times. Be patient, and concentrate on the
- Xpractical examples.
- X
- X Any comments on the Ibpag2 system itself or its documentation
- Xwill be gratefully received. Write to me at the address appended to
- Xthe final section (6).
- X
- X
- X2.__What_is_an_LR_Parser_Generator?
- X
- X Back in the late 50s and 60s, linguists, mathematicians, and
- Xsoftware engineers all became intensely interested in the formal
- Xproperties of languages: Could they be described as a series of logical
- Xstructures and relations? Could computers recognize and manipulate
- Xthese structures efficiently? Linguists, in particular, quickly
- Xrealized that the amount of structural complexity, ambiguity, and pure
- Xnoise in natural language would render it computationally intractable,
- Xespecially given the limited memory/throughput of then available CPUs.
- XMathematicians and engineers, however, found that many of the
- Xformalized notations they dealt with could, in fact, be (re)designed
- Xin such a way that efficient computer processing could - at least in
- Xprinciple - be achieved.
- X
- X Principle, in this case, did not squarely meet reality until
- Xviable parser generation tools came into being. Parser generation
- Xtools map an abstract structural description of a formal notation or
- X"language" to working computer code. Ideally, the designer simply
- Xmakes assertions like:
- X
- X an expression is composed of either
- X 1) a term (e.g. 10), or
- X 2) an expression, a "+" or "-", and another expression
- X
- XParser generator systems translate these assertions (the "grammar")
- Xinto a machine, i.e. automaton, that can recognize and/or manipulate
- Xinput streams that conform to the "language" so described.
- X
- X Let me dwell, for a moment, on the toy expression grammar
- Xoffered above. Note that it describes a set of simple mathematical
- Xconstructs like:
- X
- X 9
- X 9 + 3
- X 9 + 3 - 8
- X
- XAccording to the specifications given above, the nine, three, and
- Xeight alone constitute terms - which are also expressions (via rule
- X1). Because these terms are also expressions, "9 + 3" can be reduced
- Xto a larger expression by rule 2. The same is true for "9 + 3 - 8,"
- Xexcept that there rule 2 must apply twice - once for "9 + 3," and then
- Xagain for that and the remainder of the line - in effect grouping the
- Xexpressions as ( ( (9) + (3) ) - (8) ). It is also possible to group
- Xthe expression ( (9) + ( (3) - (8) ) ), although for the discussion
- Xthat immediately follows this second grouping will be ignored (see
- Xbelow on the terms "precedence" and "associativity").
- X
- X If we add actions to the above grammar specification, we can
- Xcreate a calculator-like automaton. Traditionally, LR-family automata
- X(like the ones Ibpag2 creates) contain a parser, one or more stacks,
- Xand a set of action tables. The parser reads from an input stream
- Xsegmented into "tokens" (e.g. TERM, '+', '-'), and then manipulates
- Xits stacks according to directives contained in so-called "action" and
- X"goto" tables. As it reads the input stream, the parser matches rules
- Xwith action code specified by the programmer, e.g. rule 2 above might
- Xbe matched with code that added/subtracted the expressions on either
- Xside of the '+'/'-' operator, and produced (in calculator style) the
- Xresult. Alternatively, it might be matched with code that generated
- Xan equivalent construct in another language.
- X
- X In the case of our toy expression grammar above, the
- Xcorresponding LR automaton operates as follows. Omitting and/or
- Xsimplifying some of the inner details, it first looks at the input
- Xstream to see what the next token is. If the next token is an
- Xoperator or end-of-input, it checks the top of its stack. If the top
- Xof the stack has a term on it, that term is popped off, and pushed
- Xback on, this time renamed as an expression (rule 1 above). The input
- Xtoken is then shifted from the input stream onto the stack, unless it
- Xis the end-of-input token, in which case the parser returns with a
- Xresult. If the top of the stack has an expression on it (rather than
- Xa term), the parser pops the top three elements off of the stack, and
- Xthen pushes an expression back on in its place (2).
- X
- X Even in this much-simplified form, the automaton's structure
- Xseems complex. Let us take a more detailed look, therefore, at its
- Xactual workings. For the present, let us assume that actual calcula-
- Xtions will be performed when expressions are encountered (e.g. if we
- Xfind an expression "9 + 3" we will at some point save "12" on a value
- Xstack). The following summary describes what an LR automaton might
- Xdo, based on the above toy expression grammar, and given the input
- Xstring "9 + 3 + 8":
- X
- X 1) read the 9, and push it onto the stack as a term
- X 2) see a plus sign on the input stream
- X 3) pop the term (9) off of the stack and push it back on again
- X (this time calling it an expression)
- X 4) push the plus sign onto the stack
- X 5) read the 3, and push it onto the stack as a term
- X 6) see a minus sign on the input stream
- X 7) pop the 3 off of the stack and push it back on again (this
- X time calling it an expression)
- X 8) see a minus sign still waiting on the input stream
- X 9) pop 9, +, and 3 off of the stack, apply the plus operator
- X to 9 and 3, then push the result onto the stack again a
- X single expression (the stack now has 12 on top)
- X 10) read the minus sign, and push it onto the stack
- X 11) read the 8, and push it onto the stack as a term
- X 12) see the end of input coming up on the input stream
- X 13) pop the 8 off of the stack and push it back on again as an
- X expression
- X 14) see the end-of-input token still sitting on the input
- X stream
- X 15) pop 12, -, and 8 off of the stack, apply the minus operator
- X to 12 and 8, then push the result onto the stack again (the
- X stack now has 4 on top)
- X 16) return the "answer" (i.e. 4)
- X
- X This series of actions is hard to describe, and even more so
- Xto model as part of a hand-written computer program. And, even if
- Xsuch a program were written by hand, this program would have to be
- Xmodified, at times radically, every time the grammar it assumes was
- Xaugmented or changed. What I am leading up to is that, with a parser
- Xgenerator, the hand compilation stage can be eliminated by allowing
- Xthe programmer simply to declare his/her tokens and language specs,
- Xthen have the appropriate automaton constructed with little, or no,
- Xhuman intervention. This is why parser generation tools were critical
- Xto the development of not just theoretically feasible, but truly
- X*practical*, LR-based computer language design systems.
- X
- X
- X3.__Using_Ibpag2
- X
- X To recode the above toy expression grammar in Ibpag2-compatible
- Xformat is relatively simple, especially if we omit the actions initially,
- Xand concentrate on simple recognition. We need only a set of token
- Xdeclarations and three rules. Certain modifications will have to be
- Xmade to the token declarations later on. For general illustration's
- Xsake, however, the following will suffice:
- X
- X %token TERM, '+', '-'
- X %%
- X expression : TERM
- X expression : expression, '+', expression
- X expression : expression, '-', expression
- X
- XTERM, and the addition and subtraction operators, are the tokens (i.e.
- Xthe terminals symbols out of which the grammar is constructed - the
- Xthings that the input stream is segmented into). Note the %token
- Xkeyword used to declare them. The colon means "is composed of." The
- Xdouble percent sign separates token declarations from the grammar
- Xproper.
- X
- X Adding in our actions - which above were keyed to a complex
- Xset of decisions based on input tokens and stack conditions - requires
- Xjust a few extra lines of Ibpag2 action code, set off in curly braces:
- X
- X %token TERM, '+', '-'
- X %%
- X expression : TERM { return arg1 }
- X expression : expression, '+', expression { return arg1 + arg3 }
- X expression : expression, '-', expression { return arg1 - arg3 }
- X
- XUsing a "|" shorthand for repeated left-hand sides of rules, we may
- Xreformat this as:
- X
- X %token TERM, '+', '-'
- X %%
- X expression : TERM { return arg1 }
- X | expression, '+', expression { return arg1 + arg3 }
- X | expression, '-', expression { return arg1 - arg3 }
- X
- X ArgX above refers to the Xth element of the right-hand side of
- Xthe preceding rule. So, for example, arg1 in "{ return arg1 }" above
- Xrefers to TERM - the only right-hand side element of the first rule.
- XThe action "{ return arg1 }" means, "once you find a TERM and have
- Xrenamed it as an expression, use the value of TERM as the value for
- Xthat expression." By way of contrast, the action "{ return arg1 +
- Xarg3 }" means, in conjunction with the rule it follows: "When you find
- Xan expression consisting of a sub-expression, a plus operator, and
- Xanother sub-expression, use the value of sub-expression 1 + the value
- Xof sub-expression 2 as the value for the expression as a whole."
- XTechnically, the action "{ return arg1 }" for expression : TERM is not
- Xnecessary, since the Ibpag2 parser, by default, pushes the value of
- Xthe last RHS arg onto the stack. For epsilon productions (to be
- Xdiscussed below), it pushes &null.
- X
- X One serious problem with this set of specifications is that
- Xthe operators '-' and '+' are left associative. We humans take this
- Xfor granted, because correct algebraic grouping is something our
- Xhigh-school math teachers burned into us. The computer, though, has
- Xto be told, pedantically, how to group addition and subtraction
- Xexpressions. It has to be explicitly instructed, in other words, to
- Xgroup expressions like "9 + 32 - 4" as (9 + 32) - 4. Without
- Xinstructions of this kind, the parser does not know, after it has read
- X"9 + 32" and is looking at a minus sign, whether to shift the minus
- Xsign onto the stack, and eventually try to group as 9 + (32 - 4), or
- Xto reduce "9 + 32" to an expression and group as (9 + 32) - 4.
- XAlthough in this case the grouping may not seem to matter, it
- Xsometimes does. Some operators group right to left. The unary minus
- Xsign, for example, is one such operator (--4 groups as (- (- 4))). To
- Xinclude the unary minus sign in our grammar, we might append yet
- Xanother rule:
- X
- X %token TERM
- X %left '+', '-'
- X %right '-'
- X %%
- X expression : TERM { return arg1 }
- X | expression, '+', expression { return arg1 + arg3 }
- X | expression, '-', expression { return arg1 - arg3 }
- X | '-', expression { return - arg2 }
- X
- XThe trouble with this arrangement is that the minus sign was already
- Xdeclared as left associative. To get around the conflict we use a
- X"dummy" token declaration, and a %prec declaration in the applicable
- Xrule:
- X
- X %token TERM
- X %left '+', '-'
- X %right UMINUS
- X %%
- X expression : TERM { return arg1 }
- X | expression, '+', expression { return arg1 + arg3 }
- X | expression, '-', expression { return arg1 - arg3 }
- X | '-', expression %prec UMINUS { return - arg2 }
- X
- XThe %prec declaration simply tells the parser that, even though the
- Xrule contains a '-' operator, the rule should be handled as if the
- Xoperator were UMINUS. UMINUS is not actually used as a symbol in the
- Xright-hand side of any rule (hence the designation "dummy"). It is
- Xthere simply to make the last rule behave as if the minus sign in the
- Xlast rule were different than in the second-to-last rule.
- X
- X Let us now add in multiplication and division operators to our
- Xcalculator specifications, and see what happens. Let me reiterate
- Xhere that the action "{ return arg1 }" for rule 1 (expression : TERM)
- Xis not strictly necessary, since the default is to push the last RHS
- Xarg onto the value stack:
- X
- X %token TERM
- X %left '+', '-'
- X %left '*', '/'
- X %right UMINUS
- X %%
- X expression : TERM { return arg1 }
- X | expression, '+', expression { return arg1 + arg3 }
- X | expression, '-', expression { return arg1 - arg3 }
- X | expression, '*', expression { return arg1 * arg3 }
- X | expression, '/', expression { return arg1 / arg3 }
- X | '-', expression %prec UMINUS { return - arg2 }
- X
- XNote that the multiplication and division operators were defined
- X*after* the addition and subtraction operators. The reason for this
- Xis that, technically speaking, the grammar itself is ambiguous. If we
- Xtreat all operators identically, the parser will not be able to tell
- Xwhether "9 + 1 * 3" should be parsed as (9 + 1) * 3 or as 9 + (1 * 3).
- XAs we all know from our high-school algebra, multiplication has a
- Xhigher precedence than addition. You do the multiplications before
- Xthe additions, in other words, no matter where they occur. To tell
- Xthe parser to behave in this same manner, we declare '*' after '+'.
- XNote that, despite their higher priority, the '*' and '/' operators
- Xare still left associative. Hence, given "3 / 4 * 7," the parser will
- Xgroup its input as (3 / 4) * 7. As a brain teaser, try to figure out
- Xhow the parser might group the input "9 + 3 / 4 * 7." Remember that
- Xhigher-precedence rules get done first, but that same-precedence rules
- Xget done according to associativity.
- X
- X The only fundamental problem remaining with the above grammar
- Xis that it assumes that the end of the input coincides with the end of
- Xthe line. Is it possible to redefine the language described as
- Xconsisting of arbitrary many lines? The answer to this question is
- X"yes." One can simply add another set of productions to the grammar
- Xthat state, essentially, that the input language consists of lines
- Xmade up of an expression and a carriage return or of nothing. Nothing
- Xis indicated by the keyword epsilon. Note that only the first rule
- Xhas an action field:
- X
- X lines : lines, expression, '\n' { write(arg2) }
- X | lines, '\n'
- X | epsilon
- X
- XThis rule-series may seem rather abstruse, but it becomes a bit
- Xclearer when you think about what happens on actual input. If there
- Xis no input (epsilon), nothing gets printed, because lines : epsilon
- Xhas no action field. If the parser sees an expression and a newline,
- Xthe parser takes this as an instance of epsilon, plus an expression,
- Xplus a newline. This, then, becomes the first component of rule 1 if
- Xanother expression + newline follows, or of rule two if just a newline
- Xoccurs. Every time an instance of rule 1 occurs, the action "{
- Xwrite(arg2) }" is executed, i.e. the value of the expression gets
- Xprinted. If this still seems hard to fathom, try walking through
- Xstep-by-step. Even experienced hands may find these sorts of rules
- Xdifficult to construct and debug.
- X
- X Note that "lines" is now the so-called "start symbol" of our
- Xgrammar. It is, in other words, the goal of every parse. By default
- Xthe left-hand side symbol of the first rule is the start symbol. This
- Xmay be overridden with a %start declaration in the tokens section (on
- Xwhich, see the sample Ibpag2 input file below).
- X
- X With our new, multi-line start symbol in place, the only piece
- Xthat needs to be added, in order to make our calculator specification
- Xa full working input to Ibpag2, is a tokenizer. A tokenizer is a
- Xroutine that reads input from a file or from some other stream (e.g.
- Xthe user's console), and then segments this input into tokens that its
- Xparser can understand. In some cases, the tokens must be accompanied
- Xby a literal value. For example, if we encounter a TERM, we return
- XTERM, just as it is listed in the %token declaration. But what is the
- Xliteral value of a TERM token? It could be, for example, 9, or 5, or
- X700. The tokenizer returns the symbol TERM, in this case, but then
- Xrecords that TERM's actual value by setting some global variable. In
- XIbpag2's parser, this variable is assumed to be "iilval." In the
- Xtokenizer, therefore, one might write
- X
- X iilval := (literal value)
- X suspend TERM
- X
- XFor literal operators like '+' and '*', there is no need to set
- Xiilval, since their literal value is irrelevant. One simply returns
- Xthese as integers (usually via "suspend ord(c)").
- X
- X The tokenizer routine is normally appended to the grammar
- Xafter another double percent sign. Everything after this second
- Xdouble percent sign is copied literally to the output file.
- XAlternatively, the tokenizer can be $included via Icon's preprocessor.
- XIbpag2 demands that the tokenizer be called iilex, and that it take a
- Xsingle file argument, that it be a generator, and that it fail when it
- Xreaches end-of-input. Combined with our "lines" productions, the
- Xaddition of an iilex routine to our calculator grammar yields the
- Xfollowing Ibpag2 input file:
- X
- X %token TERM
- X %left '+', '-'
- X %left '*', '/'
- X %right UMINUS
- X
- X %start lines
- X
- X %%
- X
- X expression : TERM { return arg1 }
- X | expression, '+', expression { return arg1 + arg3 }
- X | expression, '-', expression { return arg1 - arg3 }
- X | expression, '*', expression { return arg1 * arg3 }
- X | expression, '/', expression { return arg1 / arg3 }
- X | '-', expression %prec UMINUS { return - arg2 }
- X
- X lines : lines, expression, '\n' { write(arg2) }
- X | lines, '\n'
- X | epsilon
- X
- X %%
- X
- X procedure iilex(infile)
- X
- X local nextchar, c, num
- X
- X nextchar := create !(!infile || "\n" || "\n")
- X c := @nextchar | fail
- X
- X repeat {
- X if any(&digits, c) then {
- X if not (\num ||:= c) then
- X num := c
- X } else {
- X if iilval := \num then {
- X suspend TERM
- X num := &null
- X }
- X if any('+-*/()\n', c) then {
- X iilval := c
- X suspend ord(c)
- X } else {
- X if not any(' \t', c) then {
- X # deliberate error - will be handled later
- X suspend &null
- X }
- X }
- X }
- X c := @nextchar | break
- X }
- X if iilval := \num then {
- X return TERM
- X num := &null
- X }
- X
- X end
- X
- X procedure main()
- X return iiparse(&input, 1)
- X end
- X
- XAs noted above, the tokenizer (iilex) must be a generator. It must
- Xsuspend integers either directly (e.g. ord(c)), or else via symbolic
- Xdefines like TERM, created by Ibpag2 on the basis of %token, %right,
- X%left, and %nonassoc declarations. The tokenizer must fail on end of
- Xinput.
- X
- X If you like, cut the above code out, place it in a temporary file,
- Xtmp.ibp, and then feed this file to Ibpag2 by typing "ibpag2 -f tmp.ibp -o
- Xtmp.icn." If your system supports input and output redirection, type:
- X"ibpag2 < tmp.ibp > tmp.icn." Ibpag2 will turn your grammar specifications
- Xand actions into a routine called iiparse. If you look above, you will see
- Xthat I appended a main procedure that, in fact, calls iiparse(). Iiparse()
- Xtakes two arguments: 1) an input stream, and 2) a switch that, if nonnull,
- Xtells the parser to fail rather than abort on unrecoverable errors. When
- XIbpag2 is finished creating its output file (tmp.icn above), compile that
- Xfile the way you would compile any other Icon program, except that full
- Xstring invocation must be enabled (e.g. "icont -f s tmp"). Finally, run the
- Xexecutable. You should be able to type in various simple arithmetic
- Xexpressions and have the program spit back answers each time you hit a
- Xreturn. The only problem you might encounter is that the parser aborts on
- Xerroneous input.
- X
- X The issue of erroneous input brings up yet another point of
- Xgeneral Ibpag2 usage. Normally, if one is processing input, one does
- Xnot want to abort on errors, but rather just emit an error message,
- Xand to continue processing - if this is at all possible. To do this,
- XIbpag2 provides a simple but fairly effective mechanism: A reserved
- X"error" token.
- X
- X When Ibpag2 encounters an error, it will remove symbols from
- Xits stack until it has backtracked to a point where the error token is
- Xlegal. It then shifts the error token onto the stack, and tries to
- Xre-start the token stream at the point where it left off, discarding
- Xtokens if necessary in order to get itself resynchronized. The parser
- Xconsiders itself resynchronized when it has successfully read and
- Xshifted three tokens after shifting the error token. Until then it
- Xremains in an error state, and will not output additional error
- Xmessages as it discards tokens.
- X
- X This explanation may sound a bit abstruse, but in practice it
- Xis turns out to be quite simple. To implement error handling for our
- Xcalculator, we really have to add only one production to the end of
- Xthe "lines" section:
- X
- X lines : lines, expression, '\n' { write(arg2) }
- X | lines, '\n'
- X | epsilon
- X | error, '\n' {
- X write("syntax error; try again:")
- X iierrok
- X }
- X
- XGiven the above grammar, the parser will handle errors as follows: If
- Xan error occurs (say it has an expression then an operator on its
- Xstack and sees a newline on the input stream) the parser will throw
- Xout the operator, then check if the error token would be OK in this
- Xstate (which it would not). Then it would throw out the expression.
- XAt this point, the stack is in the ready-to-read-a-lines state - the
- Xstate it was in before it read the last expression. Since "lines" may
- Xconsist of error and '\n,' the error token is legal here, and so the
- Xparser pushes error onto the stack, then looks back at the input
- Xstream (where a newline is still waiting). Since the newline now
- Xcompletes the rule lines : error, '\n', the parser pushes the newline
- Xonto its stack, then executes the action associated with this
- Xproduction, i.e. it writes "syntax error; try again:" to the console,
- Xprompting the user for additional input.
- X
- X The keyword "iierrok" in the above error production's action
- Xfield is there for a subtle, but important, reason: It tells the
- Xparser to consider itself resynchronized, even if three tokens have
- Xnot yet been shifted. If iierrok were not in the action code for this
- Xrule, and the user were to supply more bad input after the prompt,
- Xthen the parser would simply discard those tokens, without emitting
- Xanother error message. Why? Because, as you will recall, the parser
- Xdiscards tokens after an error, in efforts to resynchronize itself.
- XUntil it reads and shifts three tokens successfully, it considers
- Xitself in an error state, and will not emit additional error messages.
- XThe three-token resync rule is there to prevent a cascade of
- Xirrelevant error messages touched off by a single error. In our
- Xcalculator's case above, though, we are smarter than the parser. We
- Xknow that it is resynchronized as soon as it reduces error, '\n' to
- Xlines. So if a syntax error occurs on the next token, it should be
- Xreported. Adding "iierrok" to the action insures that the parser will
- Xdo just this.
- X
- X In addition to iierrok, there are several other directives
- XIbpag2 accepts as part of the action code segments. These are as
- Xfollows:
- X
- X iiclearin clear the current input token
- X IIERROR perform error recovery
- X IIACCEPT simulate an accept action
- X
- XThere are several other directives (all implemented as macros) that
- XIbpag2 accepts in GLR mode. For a discussion of GLR mode, see below,
- Xsection 5. IIERROR in particular, and error recovery in general, work
- Xa bit differently in that mode than they do in Ibpag2's normal (i.e.
- XLR) mode.
- X
- X There are admittedly many other topics that might be covered
- Xhere. This treatment, however, is intended as a general nontechnical
- Xintroduction, and not as a complete textbook on parser generation use.
- XIf you want to learn more about this topic, consult the bibliography.
- XAlso, check the UNIX manual pages on the YACC utility (Yet Another
- XCompiler Compiler). Ibpag's input format is fairly close (too close,
- Xperhaps) to YACC's. In fact, most of what is said about YACC in UNIX
- Xdocumentation can be carried directly over to Ibpag2. Several salient
- Xdifferences, though, should be kept in mind:
- X
- X 1) YACC's "$$ = x" constructs are replaced by "return x" (e.g.
- X "$$ = $1 + $3" -> "return $1 + $3" [$1 is a synonym for
- X "arg1", $3 for "arg3", etc.])
- X
- X 2) all variables within a given action are, by default, local
- X to that action; i.e. they cannot be accessed by other
- X actions unless you declare them global elsewhere (e.g. in
- X the pass-through part of the declarations section %{ ...
- X %})
- X
- X 3) the %union and %type declarations/tags are not needed by
- X Ibpag2 (both for better and for worse)
- X
- X 4) tokens and symbols are separated from each other by a comma
- X in Ibpag2 files (e.g. %token '+', '-' and S : NP, VP)
- X
- X 5) epsilon is indicated by the keyword "epsilon" (e.g. REL :
- X epsilon), and not by an empty RHS
- X
- X 6) both epsilon and error *may* be declared as %tokens for
- X reasons of precedence, although they retain hard-coded
- X internal values (-2 and -1, respectively)
- X
- X 7) all actions must follow the last RHS symbol of the rule
- X they apply to (preceded by an optional %prec directive); to
- X achieve S : NP { action1 }, VP { action2 }, insert a dummy
- X rule: S : NP, dummy, VP { action2 }; dummy : epsilon {
- X action1 } ;
- X
- X 8) YYERROR, YYACCEPT, yyclearin, and yyerrok are the same,
- X except they are written IIERROR, IIACCEPT, iiclearin, and
- X iierrok (i.e. "ii" replaces "yy")
- X
- X 9) Ibpag2's input files are tokenized as modified Icon files,
- X and, as a consequence, Icon's reserved words must not be
- X used as symbols (e.g. "if : if, then" is no go)
- X
- XI myself find YACC to be ugly. As a result, Ibpag2 is not an exact
- XYACC clone. I would like to underscore the fact that I have no
- Xintention to move in this direction, either. It's as YACC-like as
- Xit's going to get!
- X
- X Both YACC and non-YACC users should note number 9 in the above
- Xlist. Don't use things like "while," "every," "do," etc. as symbols
- Xin your grammar! Just use the same rules for Ibpag2 nonterminals as
- Xfor Icon variables, and you'll be OK.
- X
- X For those that just can't bear using anything but a strictly
- XYACC-conformant system, I've included a preprocessor with the Ibpag2
- Xdistribution called (at one user's recommendation) "iacc." Iacc reads
- X&input - assumed to be a YACCish grammar - and sends to &output an
- XIbpag2-conformant file. I have not tested this file extensively, and
- Xthere are likely to be bugs in the way I've handled the necessary 2
- Xtoken lookaheads and value stack references. Give it a whirl, though,
- Xif you are feeling adventurous. The only reason I personally use Iacc
- Xis that some YACCs (e.g. BSD YACC) have particularly nice debugging
- Xmessages and help. If my grammar is particularly complex, I just run
- Xit through YACC without action code first, then use Iacc to convert it
- Xto Ibpag2 format. Iacc's output, as I noted, is not meant to be
- Xpretty, so I invariably end up doing a little editing - usually just
- Xrespacing a few rules, and re-inserting any comments that I might have
- Xput in the original YACC file.
- X
- X In general, Ibpag2 (like YACC) handles epsilon moves and
- Xindirect cycles. LR-mode shift-reduce conflicts are also handled in
- Xthe normal way (i.e. pick the rule with the highest priority, and, in
- Xcases where the priority is the same, check the associativities). In
- Xcontrast to YACC, Ibpag2 flags reduce/reduce conflicts as errors
- X(since these often conceal deeper precedence problems and just plain
- Xkludges). Reduce/reduce conflict errors are easily enough remedied,
- Xif need be, via (dummy) precedences. One can convert these errors to
- Xwarnings by specifying -y on the command line. With the -y option,
- Xreduce/reduce conflicts are resolved in favor of the rule that occurs
- Xfirst in the grammar. The -y switch also prevents Ibpag2 from
- Xaborting on shift/reduce conflicts, telling it instead to resolve in
- Xfavor of shift. Basically, -y is a partial YACC compatibility switch.
- XNormally (i.e. in SLR mode) Ibpag2 is much more finicky than YACC
- Xabout conflicts in its grammars.
- X
- X Also in contrast to YACC, Ibpag2 supports multiple
- Xsimultaneous parsers. Ibpag2 normally names its main parser routine
- Xiiparse(). By using the -m command-line option, however, you can
- Xoverride this default behavior, and force Ibpag2 to augment this name
- Xin some uniquely identifiable fashion. For example, "ibpag2 -m _1 <
- Xtmp.ibp > tmp.icn" will force Ibpag2 to write a parser called
- X"iiparse_1" to tmp.icn. Note that, instead of calling iilex, this
- Xiiparse_1() routine will now call iilex_1, and all necessary global
- Xvariables will have _1 appended to them (e.g. errors will become
- Xerrors_1). I don't expect that many people will have occasion to use
- Xthis feature. It is there, though, for those that want it.
- X
- X
- X4.__Debugging
- X
- X Constructing and debugging LR(1) family parsers can sometimes
- Xbe hair raising, even with a parser generator. Several precautions
- Xcan be taken, however, to minimize the agony. The first is to declare
- Xall tokens initially as part of a single %token declaration, i.e. with
- Xno precedences, and with the same associativities. Also, leave out
- Xaction code until the grammar seems to be working. In this stage, you
- Xcan even run the grammar through (BSD)YACC or GNU Bison. All you
- Xwould need to do is remove the commas between tokens and symbols, and
- Xplace a semicolon at the end of every rule. During this and all
- Xdebugging stages, supply Ibpag2 with a -v command-line switch. This
- Xwill cause Ibpag2 to write a summary of rules, tokens, and its two
- Xstate tables to "ibpag2.output" (a bit like GNU Bison, but with a
- Xhard-coded name). If you get messages about conflicts in your parse
- Xtables (e.g. "unresolvable reduce/reduce conflict, state 5, token
- X257, rules 4,5"). This file will tell you what rules these are, and
- Xwhat token number 257 is. Use precedences and associativities to
- Xclear these problems up as they arise. If you are comfortable having
- Xreduce/reduce errors resolved by the order in which the conflicting
- Xrules occur, then use the -y command-line switch. With -y on the
- Xcommand line, Ibpag2 will always resolve in favor of the earlier rule.
- XThis option will also cause it to resolve all shift/reduce conflicts
- Xin favor of shift.
- X
- X There are certain languages that are not ambiguous that SLR(1)
- Xparsers like Ibpag2 will fail to produce an unambiguous parse table
- Xfor. The classic example is
- X
- X expr : lval, '=', rval | rval
- X lval : '*', rval | ID
- X rval : lval
- X
- XC programmers will recognize this as a toy expression grammar with
- Xcode for identifiers, assignments, and pointers. The problem is that
- Xif we feed this grammar to Ibpag2, it will claim that there is a
- Xconflict on lookahead '='. In truth, there is no ambiguity. The SLR
- Xparser simply doesn't remember the pathway the parser used to get to
- Xthe state it is in when it sees '=' on the input stream. Whether the
- Xparser gets into this state by seeing '*' plus and ID, or by seeing
- Xjust an ID, it knows to turn the ID into an lval. Then it knows to
- Xturn lval into rval. At this point, though, it doesn't know whether
- Xto shift the = sign via rule 1, or to turn rval and the preceding '*'
- Xinto an lval. The parser has "forgotten" that the '*' is there
- Xwaiting on level down on the stack!
- X
- X The solution to this problem is actually quite simple (at
- Xleast in concept). Just provide a unique pathway in the grammar for
- Xthe conflicting rules. In this case, they are rules 1 and 5 (the
- Xfirst and last):
- X
- X expr : lval, '=', rval | rval
- X lval : '*', pval | ID
- X pval : lval
- X rval : lval
- X
- XNow when the parser sees '*,' it can only have a pval after it. Never
- Xmind that pval is composed of precisely the same things as rval. The
- Xpoint is that the parser generator follows a different route after
- Xseeing '*' than if it starts with ID and no preceding '*'. Hence it
- X"remembers" that that the '*' is back on the stack, waiting for the
- X"lval : '*', pval" rule to apply. There is no more conflict.
- X
- X Go ahead and run these grammars through Ibpag2 if you aren't
- Xsure what is going on. Remember to declare ID as a token, and to
- Xplace "%%" in the appropriate spot!
- X
- X If you get your parser up and running, but find that it is not
- Xfunctioning quite the way you expect, add the following line somewhere
- Xnear the start of Ibpag2's output file:
- X
- X $define IIDEBUG
- X
- XIf you like, you can add it to the beginning of your Ibpag2 input
- Xfile. Place it in the declarations section (before the first double
- Xpercent sign), and surround it by %{ and %}, e.g.:
- X
- X %{
- X $define IIDEBUG
- X %}
- X
- XThis tells Ibpag2 to send $define IIDEBUG straight through to the
- Xoutput file.
- X
- X What defining IIDEBUG does is tell iiparse, once compiled, to
- Xemit profuse debugging messages about the parser's actions, and about
- Xthe state of its stacks. This display will not make a whole lot of
- Xsense to anyone who doesn't understand LR-family parsers, so those who
- Xwant to access this feature should perhaps go through a standard
- Xreference like Aho, Sethi, and Ullman [1].
- X
- X If, after you are finished debugging your grammar, you find
- Xthat Ibpag2's output files are rather large, you may try saving space
- Xby compressing the action and goto tables. This is accomplished by
- Xinvoking Ibpag2 with the -c (compress) option. Using this option
- Xmakes debugging difficult, and makes the parser run a bit more slowly.
- XIt also only works for rather large grammars with long nonterminal
- Xsymbol names. Don't even consider it until the grammar is thoroughly
- Xdebugged and you have determined that the output file's size is just
- Xtoo great for practical use. Even then, compression may or may not
- Xhelp, depending on how long your nonterminal names are. In general,
- XIbpag2 is best as a teaching tool, or as a production system for
- Xmedium or small grammars.
- X
- X
- X5.__Using_Ibpag2_with_Non-LR_Grammars
- X
- X There may be times when you *want* to parse languages that no
- XLR-based algorithm can handle. There may be times, that is, when the
- Xgrammar you want to use contains conflicts or ambiguities that are
- Xthere by design, and not by oversight. For example, you may want to
- Xparse a natural language. Full-blown natural languages involve many
- Xhighly ambiguous constructs, and are not LR-parsable. By invoking it
- Xwith the -a option, Ibpag2 can parse or recognize certain natural
- Xlanguages, or, more practically speaking, certain NL subsets. The
- Xletter "a" in -a is supposed to stand for "ambiguous," although what
- Xthis option really does is put Ibpag2 into a quasi-GLR mode - i.e.
- Xinto a kind of "generalized" LR mode in which it can accept non-LR
- Xgrammars [4,5].
- X
- X User-visible changes to Ibpag2's operation in quasi-GLR mode
- X(i.e. with the -a option) are as follows:
- X
- X 1) iiparse() is now a generator
- X 2) action code can use suspend as well as return
- X 3) IIERROR places the current thread in an error state (i.e.
- X it doesn't *necessarily* trigger error recovery; see below)
- X 4) there are two new action-code directives (iiprune and
- X iiisolate) and a general define (AUTO_PRUNE)
- X 5) conflicts due to ambiguities in the grammar no longer
- X result in aborted processing (so, e.g., if you do not
- X specify the -y option on a grammar with reduce/reduce
- X conflicts, Ibpag2 will simply generate a parser capable of
- X producing multiple parses for the same input)
- X
- X In quasi-GLR mode, iiparse() should be invoked in a way that
- Xwill render multiple results usable, if they are available (e.g.
- X"every result := iiparse(&input) do...". Action code is also allowed
- Xto produce more than one value (i.e. to use suspend). When it does
- Xso, iiparse() creates separate parse threads for each value. So, for
- Xinstance, if your action code for some production suspends both of the
- Xfollowing lists,
- X
- X ["noun", "will", "gloss: desire"]
- X ["noun", "will", "gloss: legal document mandating how _
- X one's possessions are to be disposed _
- X of after one's death"],
- X
- Xiiparse() would create two separate parse threads - one for each
- Xresult. Note that in this case, the syntactic structure of each
- Xthread is the same. It is their semantics (i.e. the stuff on the
- Xvalue stack) that differs.
- X
- X If you use the iierrok and iiclearin macros in your action
- Xcode before suspending any result, their affect persists through all
- Xsubseqent suspensions and resulting parse threads. If you use these
- Xmacros after suspending one or more times, however, they are valid
- Xonly for the parse thread generated by the next suspension. By way of
- Xcontrast, the IIERROR macro *always* flags only the next parse thread
- Xas erroneous. Likewise, IIACCEPT always simulates an accept action on
- Xthe next suspension only. IIERROR and IIACCEPT, in other words, never
- Xhave any effect on subsequent suspensions and parse threads other than
- Xthe one that immediately follows them. This is true of iierrok and
- Xiiclearin only when used after the first suspension.
- X
- X In quasi-GLR mode, IIERROR (number three in the difference
- Xlist above) becomes a mechanism for placing the current parse thread
- Xin error mode. This is similar to, but not quite identical to, how
- XIIERROR functions in straight LR mode. In quasi-GLR mode, if other
- Xthreads can carry on the parse without error the erroneous parse
- Xthread is quietly clobbered. Full-blown error recovery only occurs if
- Xall of the other parsers halt as well. This makes sense if you think
- Xabout it. Why keep erroneous threads around when there are threads
- Xstill continuing a valid parse? For some large interactive systems,
- Xit might be necessary to keep bogus threads around longer, and weed
- Xthem out only after a lengthy grading process. If you are
- Xconstructing a system such as this, you'll have to modify Ibpag2's
- Xiiglrpar.lib file. In particular, you'll need to change the segment
- Xin iiparse() that takes out the trash, so to speak, in such a way that
- Xit does so only if the error count in a given parser either rises
- Xabove a specific threshhold or else exceeds the number of errors in
- Xthe "most correct" parser by a certain amount. This is not that hard
- Xto do. I just don't expect that most parsers people generate with
- XIbpag2 will use IIERROR or error recovery in general in so involved a
- Xfashion.
- X
- X Iiprune and iiisolate (number 4 above) are used to control the
- Xgrowth of the parallel parser array. In order to give straightforward
- X(read "implementationally trivial") support for action code, Ibpag2
- Xcannot create a parse "forest" in the sense that a standard GLR parser
- Xdoes. Instead, it simply duplicates the current parser environment
- Xwhenever it encounters a conflict in its action table. Even if the
- Xconflict turns out to reflect only a local ambiguity, the parsers, by
- Xdefault, remain separate. Put differently, Ibpag2's quasi-GLR parser,
- Xby default, makes no direct effort to reduce the size of its parser
- Xarrays or to alter the essentially linear structure of their value and
- Xstate stacks. Size reduction, where necessary and/or desirable, is up
- Xto the programmer. What the iiprune macro is there to do is to give
- Xthe programmer a way of pruning a given thread out of the active
- Xparser list. Iiisolate allows him or her to prune out every thread
- X*but* the current one. AUTO_PRUNE makes the parser behave more like a
- Xstandard GLR parser, instructing it to prune parse threads that are
- Xessentially duplicating another parse thread's efforts. The parser,
- Xthough, does not build a parse tree per se, the way most GLR parsers
- Xtypically do, but rather manipulates its value stack like a
- Xtraditional LR-family parser.
- X
- X Iiprune is useful when, for example, the semantics (i.e. your
- X"action" code segments) determine that a given parse thread is no
- Xlonger viable, and you want to signal the syntactic analyzer not to
- Xcontinue pursuing it. The difference between iiprune and IIERROR is
- Xthat iiprune clobbers the current parser immediately. IIERROR only
- Xputs it into an error state. If all active parsers end up in an error
- Xstate, and none can shift additional input symbols, then the IIERROR
- Xmacro induces error recovery. Iiprune does not. NB: iiprune, if used
- Xin action code that suspends multiple results, cancels the current and
- Xremaining results (i.e. it does not clobber parsers already spun off
- Xby previous suspensions by invocation of that same code; it merely
- Xcuts the result sequence). Iiprune essentially stands in for "fail"
- Xin this situation. Fail itself can be used in the code, but be warned
- Xthat iiparse() will still push *at least one* value onto its value
- Xstack, even if a given action code segment fails. This keeps the
- Xvalue stack in sync with the syntax. To avoid confusion, I recommend
- Xnot using "fail" in any action code.
- X
- X Iiisolate is useful if, during error recovery, you prompt the
- Xuser interactively, or do something else that cannot be elegantly done
- Xin parallel for two or more distinct parse threads. Iiisolate allows
- Xyou to preserve only the the current parse thread, and to clobber the
- Xrest. Iiisolate can also be useful as a way of making sure that only
- Xone thread carries on the parse in non-error situations. Suppose that
- Xwe have a series of productions:
- X
- X sentences : sentences sentence '\n'
- X { print_parse(arg2) }
- X | sentences '\n'
- X | error '\n'
- X | epsilon
- X
- XIf we get a sentence with more than one parse, all of the underlying
- Xthreads that produced these parses will be active for the next
- Xsentence as well. In many situations this will not be what we want.
- XIf our desire it to have only one active parse thread at the start of
- Xeach sentence, we simply tell our lexical analyzer to suspend two
- Xnewlines every time it sees a newline on the input stream. This
- Xinsures that the second rule will always apply right after the first.
- XWe then insert iiisolate directives for both it and the one error
- Xproduction:
- X
- X sentences : sentences sentence '\n'
- X { print_parse(arg2) }
- X | sentences '\n'
- X { iiisolate }
- X | error '\n'
- X { iiisolate; iierrok }
- X | epsilon
- X
- XThe effect here is to allow multiple parsers to be generated only
- Xwhile parsing "sentence". The iiisolate directive, in other words,
- Xsees to it that no sentence parse will ever begin with multiple active
- Xparsers. As with LR mode, iierrok clears the error flag for the
- X(current) parser.
- X
- X Note that if you use iiisolate in action code that suspends
- Xmultiple results, iiisolate will clobber all parsers but the one
- Xgenerated by the next suspension.
- X
- X If there is no need for close control over the details of the
- Xparser array, and you wish only to clobber parsers that end up doing
- Xthe same thing as some other parser (and hence returning identical
- Xvalues), then just make sure you add "$define AUTO_PRUNE" to the
- Xpass-through code section at the top of the file. Put differently,
- Xdefining AUTO_PRUNE instructs the quasi-GLR parser to weed out parsers
- Xthat are in the same state, and which have identical value stacks.
- XAUTO_PRUNE can often be used in place of iiisolate in situations like
- Xthe one discussed just above. Its only drawback is that it slows
- Xthe parser a bit.
- X
- X Other than these deviations (action code and iiparse becoming
- Xgenerators, IIERROR's altered behavior, and the addition of iiprune,
- Xiiisolate, and AUTO_PRUNE), Ibpag2's quasi-GLR mode - at least on the
- Xsurface - works pretty much like its straight LR mode. In fact, if
- Xyou take one of your SLR(1) grammars, and run it through Ibpag2 using
- Xthe -a option, you probably won't notice any difference in the
- Xresulting automaton unless you do some debugging or perform some
- Xtiming tests (the GLR parser is slower, though for straight SLR(1)
- Xgrammars not by much). Even with non-SLR(1) grammars, the quasi-GLR
- Xparser will clip along merrily, using all the same sorts of rules,
- Xaction code, and macros that you would typically use in LR mode!
- X
- X
- X6.__Installing_Ibpag
- X
- X If you are a UNIX user, or have a generic "make" utility, you
- Xare in luck. Just edit Makefile.dist according to the directions
- Xgiven in that file, rename it as "makefile," then execute "make."
- XIbpag2 should be created automatically. If everything goes smoothly,
- Xthen "make install" (su-ing root, if both possible and necessary for
- Xcorrect installation of the iiparse.icn file). Check with your system
- Xadministrator if you are on a public system, and aren't sure what to
- Xdo.
- X
- X Please be sure to read the directions in the makefile
- Xcarefully, and set DESTDIR and LIBDIR to the directory where you want
- Xthe executable and parser file to reside. Also, make sure the paths
- Xyou specify are correct for your Icon executables. Although Ibpag2
- Xwill apparently compile using iconc, I would recommend using the
- Xinterpreter, icont, first, unless you are planning on working with a
- Xlarge grammar.
- X
- X If you are using some other system - one that lacks "make" -
- Xthen shame on your manufacturer :-). You'll be a bit inconvenienced.
- XTry typing:
- X
- X icont -o ibpag2 follow.icn ibpag2.icn ibreader.icn \
- X ibtokens.icn ibutil.icn ibwriter.icn iohno.icn \
- X outbits.icn slritems.icn slrtbls.icn shrnktbl.icn \
- X version.icn slshupto.icn
- X
- XThe backslashes merely indicate that the next line is a continuation.
- XThe whole thing should, in other words, be on a single line. As noted
- Xabove, you may compile rather than interpret - if your OS supports the
- XIcon compiler. Just replace "icont" above with "iconc." The
- Xresulting executable will run considerably faster than with "icont,"
- Xalthough the time required to compile it may be large, and the (still
- Xsomewhat experimental) compiler may not work smoothly in all
- Xenvironments.
- X
- X If your operating system support environment variables, and
- Xyou have set up your LPATH according to the specifications in the Icon
- Xdistribution (see below), then you may copy iiparse.lib and
- Xiiglrpar.lib to some file in your LPATH. If you do not do this, or if
- Xyour OS does not support environment variables, then you must be in
- Xthe directory where you keep your Ibpag2 files when you use it, or
- Xelse invoke Ibpag2 with the -p dirname option (where dirname is the
- Xdirectory that holds the iiparse.lib and iiglrpar.lib files that come
- Xwith the Ibpag2 distribution). The .lib files contain template
- Xparsers that are critical to Ibpag2's operation. Ibpag2 will abort if
- Xit cannot find them.
- X
- X If your operating system permits the creation of macros or
- Xbatch files, it might be useful to create one that changes
- Xautomatically to the Ibpag2 source directory, and runs the executable.
- XThis has the side-benefit of making it easier for Ibapg2 to find the
- Xparser library files, iiparse.lib and iiglrpar.lib. Under DOS, for
- Xinstance, one might create a batch file that says:
- X
- X c:
- X cd c:\ibpag2
- X iconx ibpag2 %1 %2 %3 %4 %5 %6 %7 %8 %9
- X
- XDOS, it turns out, has to execute Icon files indirectly through iconx,
- Xso this technique has yet another advantage in that it hides the
- Xsecond level of indirection - although it prevents you from using
- Xinput and output redirection. Naturally, the above example assumes
- Xthat Ibpag2 is in c:\ibpag2.
- X
- X Ibpag2 assumes the existence on your system, not only of an
- XIcon interpreter or compiler, but also of an up-to-date Icon Program
- XLibrary. There are several routines included in the IPL that Bibleref
- Xuses. Make sure you (or the local system administrators) have put the
- XIPL online, and have translated the appropriate object modules. Set
- Xyour IPATH environment variable to point to the place where the object
- Xmodules reside. Set LPATH to point to the modules' source files.
- XBoth IPATH and LPATH are documented in doc directory of the Icon
- Xsource tree (ipd224.doc). If your system does not support environment
- Xvariables, copy ximage.icn, options.icn, ebcdic.icn, and escape.icn
- Xfrom the IPL into the Ibpag2 source directory, and compile them in
- Xwith the rest of the Ibpag2 source files, either by adding them to the
- XSRC variable in the makefile, or by adding them manually to the "icont
- X-o ..." command line given above.
- X
- X If you have any problems installing or using Ibpag2, please
- Xfeel free to drop me, Richard Goerwitz, an e-mail message at
- Xgoer@midway.uchicago.edu, or (via the post) at:
- X
- X 5410 S. Ridgewood Ct., 2E
- X Chicago, IL 60615
- X
- X
- X6.__Bibliography
- X
- X1. Aho, Alfred V., Sethi, Ravi, and Ullman, Jeffrey D. Compilers.
- X Addison-Wesley: Reading, Massachusetts, second printing, 1988.
- X
- X2. Griswold, Ralph E. and Griswold, Madge T. The Icon Programming
- X Language. Prentice-Hall, Inc.: Englewood Cliffs, New Jersey, USA,
- X second edition, 1990.
- X
- X3. Griswold, Ralph E., Jeffery, Clinton L., and Townsend, Gregg M.
- X Version 8.10 of the Icon Programming Language. Univ. of Arizona
- X Icon Project Document 212, 1993. (obtain via anonymous FTP from
- X cs.arizona.edu ~ftp/icon/docs/ipd212.doc)
- X
- X4. Tomita, Masaru. Efficient Parsing for Natural Language. Boston:
- X Kluwer Academic Publishers, c. 1985.
- X
- X5. Tomita, Masaru editor. Generalized LR Parsing. Boston: Kluwer
- X Academic Publishers, 1991.
- END_OF_FILE
- if test 53409 -ne `wc -c <'README'`; then
- echo shar: \"'README'\" unpacked with wrong size!
- fi
- # end of 'README'
- fi
- if test -f 'ibutil.icn' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'ibutil.icn'\"
- else
- echo shar: Extracting \"'ibutil.icn'\" \(7298 characters\)
- sed "s/^X//" >'ibutil.icn' <<'END_OF_FILE'
- X############################################################################
- X#
- X# Name: ibutil.icn
- X#
- X# Title: utilities for Ibpag2
- X#
- X# Author: Richard L. Goerwitz
- X#
- X# $Revision: 1.21 $
- X#
- X############################################################################
- X#
- X# Contains:
- X#
- X# production_2_string(p) makes production or item p human-
- X# readable
- X#
- X# print_item_list(C, i) returns human-readable version of
- X# item list C
- X#
- X# print_grammar(grammar, f) sends to file f (default &output)
- X# a human-readable printout of a grammar,
- X# as recorded in an ib_grammar structure
- X#
- X# print_action_goto_tables(atbl, gtbl, ibtoktbl, f)
- X# sends to file f (default (&output)
- X# a human-readable printout of action
- X# table atbl and goto table gtbl
- X#
- X# print_follow_sets(FOLLOW_table)
- X# returns a human-readable version
- X# of a FOLLOW table (table of sets)
- X#
- X# print_first_sets(FIRST_table)
- X# returns a human-readable version
- X# of a FIRST table (a table of sets)
- X#
- X# ibreplace(s1, s2, s3) replaces s2 with s3 in s1
- X#
- X# equivalent_items(i1, i2) succeeds if item i1 is structurally
- X# identical to item i2
- X#
- X# equivalent_item_lists(l1,l2) same as equivalent_items, but for
- X# lists of items, not individual items
- X#
- X############################################################################
- X#
- X# Links: none
- X#
- X############################################################################
- X
- X
- Xrecord production(LHS, RHS, POS, LOOK, no, prec, assoc)
- X
- X#
- X# production_2_string: production record -> string
- X# p -> s
- X#
- X# Stringizes an image of the LHS and RHS of production p in
- X# human-readable form.
- X#
- Xprocedure production_2_string(p, ibtoktbl)
- X
- X local s, m, t
- X
- X s := image(p.LHS) || " -> "
- X every m := !p.RHS do {
- X if t := \ (\ibtoktbl)[m]
- X then s ||:= t || " "
- X else s ||:= image(m) || " "
- X }
- X # if the POS field is nonnull, print it
- X s ||:= "(POS = " || image(\p.POS) || ") "
- X # if the LOOK field is nonnull, print it, too
- X s ||:= "lookahead = " || image(\p.LOOK)
- X
- X return trim(s)
- X
- Xend
- X
- X
- X#
- X# print_item_list: makes item list human readable
- X#
- Xprocedure print_item_list(C, i)
- X
- X write(&errout, "Productions for item list ", i, ":")
- X every write(&errout, "\t", production_2_string(!C[i]))
- X write(&errout)
- X return
- X
- Xend
- X
- X
- X#
- X# print_grammar: makes entire grammar human readable
- X#
- Xprocedure print_grammar(grammar, f)
- X
- X local p, i, sl
- X
- X /f := &errout
- X
- X write(f, "Start symbol:")
- X write(f, "\t", grammar.start)
- X write(f)
- X write(f, "Rules:")
- X every p := !grammar.rules do {
- X writes(f, "\tRule ", right(p.no, 3, " "), " ")
- X write(f, production_2_string(p, grammar.tbl))
- X }
- X write(f)
- X write(f, "Tokens:")
- X sl := sort(grammar.tbl, 3)
- X every i := 1 to *sl-1 by 2 do
- X write(f, "\t", left(sl[i], 5, "."), right(sl[i+1], 20, "."))
- X write(f)
- X return
- X
- Xend
- X
- X
- X#
- X# print_action_goto_tables
- X#
- X# Makes action & goto tables human readable. If a table mapping
- X# integer (i.e. char) literals to token names is supplied, the
- X# token names themselves are printed.
- X#
- Xprocedure print_action_goto_tables(atbl, gtbl, ibtoktbl, f)
- X
- X local TAB, tbl, key_set, size, i, column, k
- X
- X /f := &errout
- X TAB := "\t"
- X
- X every tbl := atbl|gtbl do {
- X
- X key_set := set(); every insert(key_set, key(tbl))
- X writes(f, TAB)
- X every k := !key_set do
- X writes(f, \(\ibtoktbl)[k] | k, TAB)
- X write(f)
- X
- X size := 0; every size <:= key(!tbl)
- X every i := 1 to size do {
- X writes(f, i, TAB)
- X every column := tbl[!key_set] do {
- X # action lists may have more than one element
- X if /column[i] then
- X writes(f, " ", TAB) & next
- X \column[i] ? {
- X if any('asr') then {
- X while any('asr') do {
- X writes(f, ="a") & next
- X writes(f, tab(upto('.<')))
- X if ="<" then tab(find(">")+1) else ="."
- X tab(many(&digits))
- X }
- X writes(f, TAB)
- X }
- X else writes(f, tab(many(&digits)), TAB)
- X }
- X }
- X write(f)
- X }
- X write(f)
- X }
- X
- X return
- X
- Xend
- X
- X
- X#
- X# print_follow_sets: make FOLLOW table human readable
- X#
- Xprocedure print_follow_sets(FOLLOW_table)
- X
- X local FOLLOW_sets, i
- X
- X FOLLOW_sets := sort(FOLLOW_table, 3)
- X write(&errout, "FOLLOW sets are as follows:")
- X every i := 1 to *FOLLOW_sets-1 by 2 do {
- X writes(&errout, "\tFOLLOW(", image(FOLLOW_sets[i]), ") = ")
- X every writes(&errout, image(! FOLLOW_sets[i+1]), " ")
- X write(&errout)
- X }
- X write(&errout)
- X return
- X
- Xend
- X
- X
- X#
- X# print_first_sets: make FIRST table human readable
- X#
- Xprocedure print_first_sets(FIRST_table)
- X
- X local FIRST_sets, i
- X
- X FIRST_sets := sort(FIRST_table, 3)
- X write(&errout, "FIRST sets are as follows:")
- X every i := 1 to *FIRST_sets-1 by 2 do {
- X writes(&errout, "\tFIRST(", image(FIRST_sets[i]), ") = ")
- X every writes(&errout, image(! FIRST_sets[i+1]), " ")
- X write(&errout)
- X }
- X write(&errout)
- X return
- X
- Xend
- X
- X
- X#
- X# ibreplace: string x string x string -> string
- X# (s1, s2, s3) -> s4
- X#
- X# Where s4 is s1, with every instance of s2 stripped out and
- X# replaced by s3. E.g. replace("hello there; hello", "hello",
- X# "hi") yields "hi there; hi". Taken straight from the IPL.
- X#
- Xprocedure ibreplace(s1,s2,s3)
- X
- X local result, i
- X
- X result := ""
- X i := *s2
- X
- X s1 ? {
- X while result ||:= tab(find(s2)) do {
- X result ||:= s3
- X move(i)
- X }
- X return result || tab(0)
- X }
- X
- Xend
- X
- X
- X#
- X# equivalent_items: record x record -> record or failure
- X# (item1, item2) -> item1 or failure
- X#
- X# Where item1 and item2 are records having LHS, RHS, POS, & LOOK
- X# fields (and possibly others, though they aren't used). Returns
- X# item1 if item1 and item2 are structurally identical as far as
- X# their LHS, RHS, LOOK, and POS fields are concerned. For SLR
- X# table generators, LOOK will always be null.
- X#
- Xprocedure equivalent_items(item1, item2)
- X
- X local i
- X
- X item1 === item2 & (return item1)
- X
- X if item1.LHS == item2.LHS &
- X item1.POS = item2.POS &
- X #
- X # This comparison doesn't have to be recursive, since I take
- X # care never to alter RHS structures. Identical RHSs should
- X # always be *the same underlying structure*.
- X #
- X item1.RHS === item2.RHS &
- X item1.LOOK === item2.LOOK
- X then
- X return item1
- X
- Xend
- X
- X
- X#
- X# equivalent_item_lists: list x list -> list or fail
- X# (il1, il2) -> il1
- X#
- X# Where il1 is one sorted list-of-items (as returned by goto() or
- X# by closure()), where il2 is another such list. Returns the
- X# first list if the LHS, RHS, and POS fields of the constituent
- X# items are all structurally identical, i.e. if the two lists
- X# contain the structurally identical items.
- X#
- Xprocedure equivalent_item_lists(il1, il2)
- X
- X local i
- X
- X il1 === il2 & (return il1)
- X if *il1 = *il2
- X then {
- X every i := 1 to *il1 do
- X equivalent_items(il1[i], il2[i]) | fail
- X }
- X else fail
- X
- X return il1
- X
- Xend
- END_OF_FILE
- if test 7298 -ne `wc -c <'ibutil.icn'`; then
- echo shar: \"'ibutil.icn'\" unpacked with wrong size!
- fi
- # end of 'ibutil.icn'
- fi
- echo shar: End of archive 1 \(of 4\).
- cp /dev/null ark1isdone
- MISSING=""
- for I in 1 2 3 4 ; do
- if test ! -f ark${I}isdone ; then
- MISSING="${MISSING} ${I}"
- fi
- done
- if test "${MISSING}" = "" ; then
- echo You have unpacked all 4 archives.
- rm -f ark[1-9]isdone
- else
- echo You still must unpack the following archives:
- echo " " ${MISSING}
- fi
- exit 0
- exit 0 # Just in case...
-