home *** CD-ROM | disk | FTP | other *** search
Text File | 1993-11-07 | 112.2 KB | 2,564 lines |
-
-
- QPARSER+ User Manual page 1
-
-
- QPARSER+ SHAREWARE User Manual
- QCAD Systems
- 1164 Hyde Avenue, San Jose, CA 95129
-
- Table of Contents
-
- Chapter 1. Introduction page 2
- Chapter 1. QPARSER+ in Brief page 3
- Chapter 2. Parser Generator Overview page 7
- Chapter 3. Grammars page 11
- Chapter 4. Building a Parser page 15
- Chapter 5. The Semantics Stack page 17
- Chapter 6. Accessing Semantic Elements page 23
- Chapter 7. More about the Lexical Analyzer page 27
- Chapter 8. Embedded Semantics page 29
- Chapter 9. Using the DEBUG Option page 33
- Chapter 10. Syntax Trees page 34
- Chapter 11. Miscellaneous Topics page 44
- Chapter 12. Debugging a Translator page 46
- Chapter 13. Registration and Support page 49
- Chapter 14. For Additional Reading page 50
-
-
- QPARSER+ User Manual page 2
-
-
- Chapter 1. Introduction
-
- Almost every computer program depends upon some type of textual
- input. The language associated with that input may be as complex
- as a programming language like Pascal and C or as simple as a
- series of yes and no responses. QPARSER+ provides the utilities
- for the development of a translator capable of understanding an
- input form and then converting that form into an alternate output
- form.
-
- This definition of computer translation may sound quite foreign to
- you, but if you work with computers it is quite possible that you
- use a `translator' every day since one of the most common
- translator applications is the language compiler. A product like
- Turbo Pascal, for example, is a translator used to convert Pascal
- statements into machine code.
-
- QPARSER+ may be used to define new programming languages, but that
- is by no means the only use for the QPARSER+ system. For example,
- many of our customers have used the product to write user-friendly
- interfaces for their programs since QPARSER+ facilitates the
- development of interfaces capable of understanding complex
- user-input forms. The applications for QPARSER+ are almost
- limitless since almost every computer program uses language in
- some form. With a little imagination, QPARSER+ can be used to
- develop products ranging from Artificial Intelligence systems to
- word processor translators capable of converting documents from
- one word processor `language' into another.
-
- Qparser+ is a professional, world-class product with an installed
- base of over 400 users. It has been preferred to Lex and Yacc by
- many of its users for its ease of use, its advanced syntax tree
- features, its generation of full source code, and the ease with
- which the source code can be modified for the sake of performance
- or unusual conditions that often arise in language implementation.
- It was designed and written by Dr. William Barrett, an author of
- a college text on compiler construction, and a languages/compiler
- specialist and consultant for more than 15 years.
-
-
-
- QPARSER+ User Manual page 3
-
-
- Chapter 1. QPARSER+ in Brief
-
- The QPARSER+ user must supply the structural rules of the input
- language as well as the actions to be taken when language
- structures are recognized. QPARSER+ will then take these two
- inputs and produce a complete translator capable of interpreting
- languages as complex as Pascal or as simple as the set of control
- codes in a word processor `language'.
-
- The QPARSER+ translator system is designed to operate in two
- phases -- a syntactic phase and a semantics phase. We will
- sometimes refer to these phases as the front end and back end of a
- translator.
-
- The syntactic phase is almost completely automatic, given a
- suitable grammar for the language. In this phase, QPARSER+ will
- take the user-defined grammar and automatically generate a parser
- capable of understanding any input language form. No programming
- is required in this phase, although Qparser requires a compiler to
- generate executable code from the generated source programs.
-
- The semantics phase requires some programming skills. In this
- phase, guided by the grammar's language rules, actions are added
- to the translator through additions to the C or Pascal source
- code. Qparser provides a number of useful functions to manage the
- symbol table, and access the parsing stack, but the central task
- of providing the actions (semantics) is up to you. This is the
- back-end, or operational side, of the translator and it must be
- specially written for each application. There are a number of
- worked-out examples and tutorials to help you through this task.
-
- QPARSER+ provides the tools to develop a complete translator using
- either C or Pascal as the host language.
-
- Features
-
- Many of the QPARSER+ features listed here can only be appreciated
- by those who have had some experience with translator writing, but
- we feel that everyone will come to appreciate them after having
- used the product for a while.
-
- The principal features of QPARSER+ are:
-
- * Correctness. The language accepted by the translator will be
- exactly that specified by the grammar.
-
- * Detection of Ambiguity. If your grammar is ambiguous, or if
- the parser is unable to assure a correct translation, a diagnostic
- message will be provided.
-
- * Automatic generation of two key components. The lexical
- scanner and the LR parser are automatically generated.
-
- * Automatic generation of Syntax Trees. The QTREES package
- included in QPARSER+ will automatically build the syntax trees
-
-
- QPARSER+ User Manual page 4
-
-
- which must be painfully constructed branch by branch with a system
- like YACC. The use of a syntax tree, generated automatically by
- Qparser functions, makes code generation a snap.
-
- * Flexibility. A translator can be generated in the source
- language of your choice. The files which define the source
- language syntax are provided only for Pascal and C, but they can
- be used as models for other languages. A program like YACC can
- only produce translators in C.
-
- * Efficient table-driven parser operations. For a large
- grammar, most compiler experts agree that a table-driven LR(1)
- approach is superior in both performance and space requirements to
- any other. Several optimizations are made to minimize table size.
-
- * Modularity of semantic operations. A translator programming
- task will become neatly partitioned into a number of relatively
- water-tight compartments.
-
- * Efficient lexical scanner. The lexical scanner operates by a
- CASE or SWITCH statement transfer on the next character to be
- scanned from the source. No backtracking, table searching or
- waste motion is involved.
-
- * Efficient and powerful symbol table mechanism. Symbol
- searching is built into the lexical scanner, which is
- automatically generated.
-
- * Error recovery. A reasonable error recovery system is built
- into each translator. No extra parser table space is required.
-
- * Ease of extending or changing the language. Once a translator
- is written following the guidelines in this manual, it'll be easy
- to extend or modify. QPARSER+ was written to be very flexible and
- you'll find that once you've mastered the basics, you can do
- translations of any type.
-
- * Ease of extending or changing the lexical scanner. The
- scanner is provided in well-documented source form. Its
- underlying assumptions are described in the user manual.
-
- Required Hardware and Software
-
- QPARSER+ runs on a PC AT or compatible under version 5.0 (or
- later) of MSDOS. We recommend 640 Kbytes RAM for a large grammar,
- but you can manage with 250 Kbytes. A floating-point coprocessor
- is not required. There are no special screen or console
- requirements -- your PC will be used in a 'dumb terminal' mode for
- all of the Qparser utilities, debugging, etc. So you can run
- under DOS or Windows -- this guide will refer to a DOS operation.
- Qparser does not support extended memory, but that shouldn't be a
- handicap unless you have an exceptionally large grammar. It's
- optimized to make efficient use of your RAM. A memory crunch --
- if any -- will appear in LR1, which requires an unpredictable
- amount of RAM, depending on the structure of your grammar.
-
-
- QPARSER+ User Manual page 5
-
-
-
- A hard disk with at least 2 Mbytes of available capacity is
- recommended. You can try out the programs on a floppy-based PC,
- but you'll quickly tire of juggling program and source files among
- your floppy drives.
-
- You'll need an editor. It must be a technical editor, not a word
- processing editor, unless your word processor can export technical
- text files. The required file must consist of printable ASCII
- characters (0x20 - 0x7E) separated by CR-LF pairs. The DOS EDIT
- utility is acceptable, as is the builtin Turbo C editor.
-
- We recommend the Microsoft C or Turbo C compiler. Almost any C
- compiler will do, but we've tested the product most intensively on
- Turbo C and Microsoft C, and "make" files are provided for these
- compilers.
-
- DOS Installation
-
- You should have been able to install the product if you can read
- this. It's simple -- 1) create a directory, like C:\QPARSER, 2)
- choose an executables directory, say C:\BIN, 3) copy all the *.EXE
- files to C:\BIN, 4) "xcopy" all the remaining files and their
- subdirectories to C:\QPARSER, and 5) modify the C:\AUTOEXEC.BAT
- file as shown below.
-
- When the installation is complete, you should find the following
- directories and files in place. We've assumed that the
- executables are installed in C:\BIN, and the Qparser work files in
- C:\QPARSER.
-
- C:\BIN LR1.EXE
- C:\BIN OPT.EXE
- C:\BIN LR1P.EXE
- C:\BIN GRAMCHK.EXE
- C:\QPARSER README.DOC
- C:\QPARSER ORDRFORM.TXT
- C:\QPARSER QPARSER.DOC
- C:\QPARSER QPHELP.HLP
- C:\QPARSER\CSKELS MAIN.C
- C:\QPARSER\CSKELS CALC.GRM
- C:\QPARSER\CSKELS CALCC.GRM
- C:\QPARSER\CSKELS CALCT.GRM
- C:\QPARSER\CSKELS M
- C:\QPARSER\CSKELS MAKEFILE.MAK
- C:\QPARSER\CSKELS SKELEVAL.C
- C:\QPARSER\CSKELS SKELTAB.C
- C:\QPARSER\CSKELS SKELDECL.H
- C:\QPARSER\CSKELS SKELLEX.C
- C:\QPARSER\CSKELS LEX.C
- C:\QPARSER\CSKELS DECL.H
- C:\QPARSER\CSKELS EVAL.C
- C:\QPARSER\CSKELS TABLE.C
- C:\QPARSER\CSKELS SYMS.C
- C:\QPARSER\CSKELS PARS.C
-
-
- QPARSER+ User Manual page 6
-
-
- C:\QPARSER\CSKELS DEBUG.C
- C:\QPARSER\LIB M
- C:\QPARSER\LIB MAKEFILE.MAK
- C:\QPARSER\LIB SETS.H
- C:\QPARSER\LIB ALLOC.H
- C:\QPARSER\LIB HELP.H
- C:\QPARSER\LIB PROTOS.H
- C:\QPARSER\LIB RESPF.H
- C:\QPARSER\LIB HELP.C
- C:\QPARSER\LIB RESPF.C
- C:\QPARSER\LIB ALLOC.C
- C:\QPARSER\LIB SETS.C
-
- Your AUTOEXEC.BAT file should contain the following entries:
-
- PATH=.... C:\BIN; ...
- SET QPHELP=C:\QPARSER\QPHELP.HLP
-
- You can install the executables under some other directory, if you
- adjust the PATH variable in the AUTOEXEC.BAT file to suit.
-
- Regarding Pascal
-
- If you're a Pascal fan, you'll want to register with QCAD to
- obtain the full Qparser+ system. It contains skeletons and
- several worked-out examples compatable with Turbo Pascal. (This
- version contains only C versions.) It probably won't be compatible
- with the Pascal provided in other computers and operating systems,
- but you can modify the source files provided to make it compatible
- with your favorite Pascal.
-
-
-
- QPARSER+ User Manual page 7
-
-
- Chapter 2. Parser Generator Overview
-
- In order to generate a parser, you need a grammar file (XXX.GRM)
- and one or more skeleton files (SKELzzzz.C).
-
- The grammar defines the rules for parsing your language, and the
- fragments of C code that are executed when a rule is satisfied.
- It has to have a particular form that will described in the next
- chapter. Program LR1.EXE is designed to operate on a grammar file
- and produce several table files that are used by the remaining
- utilities. An important table file is XXX.TBL. This contains all
- the parsing tables that LR1 has worked out from the grammar.
-
- LR1 and the other utilities OPT, LR1P and GRAMCHK, will accept a
- grammar file name with or without the suffix .GRM. LR1 will
- create several table files based on the grammer prefix XXX.
-
- Program OPT.EXE will compress the tables in an XXX.TBL file. The
- tables will usually take less than half the space when the parser
- is finished after OPT has done its job. We recommend running it,
- although it strictly isn't necessary.
-
- A skeleton file looks like a C file, and mostly contains C source
- code. However, there are sections that cause C code to be
- generated, based on table file information. Program LR1P.EXE is
- designed to work on a skeleton file, given a set of grammar files.
- It generates a valid C file, one that now is specific to the
- grammar's rules. LR1P will usually be used several times, once
- for each skeleton file. For example, LR1P will be applied to
- SKELLEX.C to yield LEX.C, in order to specialize the code for the
- particular token forms found in the grammar.
-
- You can write your own skeleton file, or revise the ones provided,
- in order to set up a new parsing regime. By trying out LR1P on
- various skeleton segments with a known grammar, you can figure out
- how it transforms a skeleton file into a compilable C file. (The
- registered version includes a manual that explains the system in
- detail).
-
- Once you've run LR1P on the appropriate files, they are ready to
- be compiled and linked. The result should be an operational
- parser, given only a grammar with no code fragments. If you've
- attached C code fragments to the rules, the resulting EXE file
- will not only parse input statement forms, but will also execute
- the code fragments at the right places during the parse.
-
- Operations Summary
-
- Here's a summary of the Qparser operations:
-
- Grammar File --> LR1 --> Grammar tables --> OPT --> Grammar
- tables
-
- (Skeleton File + Grammar tables) --> LR1P --> Compilable C file
-
-
-
- QPARSER+ User Manual page 8
-
-
- Compilable C files --> C COMPILER --> LINKER --> EXECUTABLES
-
- MAKE Files
-
- In order to make these operations easy, we've included two working
- examples (CALC.GRM, CALCC.GRM) and make files that carry through
- all the operations.
-
- Look in directory CSKELS. File "m" is a make file designed for
- the Microsoft C compiler. It runs under the Microsoft "make"
- utility. Please examine it -- it assumes that LR1, OPT, LR1P and
- QPHELP.HLP are in the C:\QPARSER directory, and the Microsoft
- compiler executables are in the C:\MCC directory. is named in the
- PATH= statement in your autoexec.bat file. Also the Microsoft
- compiler is assumed to be called by "cl". You call the Microsoft
- make utility like this:
-
- make m
-
- File "makefile.mak" is a make file designed for Turbo C; it runs
- under the Turbo "make" utility. This one contains a specific path
- directive for the Turbo C compiler directory -- "C:\TC". You will
- have to change this if it isn't in that directory. It also
- contains a specific reference to the library files which you may
- have to change. You can call the Turbo make utility like this:
-
- make -fmakefile.mak
-
- The default grammar in both of these is CALC.GRM. You can change
- that with the option
-
- GRM=YourGrammar
-
- in the make call, e.g.
-
- make GRM=YourGrammar m ... using Microsoft make
- make GRM=YourGrammar -fmakefile.mak ... using Turbo make
-
- What's happening is that any appearance of $(GRM) in the makefile
- is replaced by "YourGrammar". Also, any definition of GRM in the
- make file is overridden by the command-line definition. Look at
- the makefile and you'll see where GRM is defined by default, and
- how $(GRM) is used.
-
- +***********************+
- NOTE: in older versions of Turbo's make, the -D option doesn't
- work. You will have to change the GRM=CALC line in makefile.mak
- +***********************+
-
- The makefiles will yield the executable file MAIN.EXE. We'll
- assume that you won't change the name in what follows.
-
- Memory Model
-
- We recommend memory model "large" for both compilers. Check the
-
-
- QPARSER+ User Manual page 9
-
-
- size of the executable after linking -- if it exceeds 64K you
- can't use the "compact" or "small" model. Turbo C will compile
- and link the executable without complaint for a compact model, but
- you'll run into mysterious execution problems. Microsoft C seems
- to work with their compact model, although the executable file may
- exceed 64K.
-
- When you change the model, be sure to recompile everything in the
- LIB and CSKELS directory, and relink everything. Having a mixture
- of compact and large model object files is a recipe for disaster.
-
- Qparser Executable Options
-
- You can read about the options for LR1, OPT, LR1P and GRAMCHK by
- calling them with no parameters. For example, LR1 reports this:
-
- QPARSER vs. 4.8.3
- Copyright (c) QCAD Systems, Inc. 1990. All rights reserved.
- Usage: lr1 [options] GrammarFile [RptFile]
- options: -t dump tokens
- -p dump productions
- -n dump nullable nonterminals
- -i dump itemsets
- -mL language mode -- L==C, P, N
- -MN dump state machine
- N= 1, 2: dump level
- -r dump reads/includes/lookbacks
- -T tablefile table file name
- -s suppress single-production bypassing
- -? help: more description of the options
- -C generate table file compatible with 3.0
- (default tablefile name is grammar-filename.tbl)
-
- Most of these give additional reports of one kind or another
- without affecting the resulting grammar table files. (-MN, -s and
- -C will affect the table file form). You can ignore most of them.
- If you want to know more about any one option, run LR1 with the
- option -?. (The help file QPHELP.HLP must be installed properly,
- or you'll get a complaint about it). You can also read QPHELP.HLP
- in your text editor; it's just an ordinary text file.
-
- You should use the option
-
- -mC
-
- with LR1. This states that you intend to use C as your host
- language -- this is the compiler you will use to compile your
- translator, not necessarily the language you are creating with
- your grammar.
-
- Executable OPT requires no options, just the grammar file name,
- without a suffix:
-
- OPT gram
-
-
-
- QPARSER+ User Manual page 10
-
-
- You should use the option
-
- -mC
-
- with LR1P. LR1P is called with two or three files, like this:
-
- lr1p -mC gram skeleton [ target ]
-
- We recommend leaving the suffix OFF the grammar file, i.e. use
- CALC rather than CALC.GRM. The skeleton file name should be
- complete, as well as the target file name. Look at the make file
- for more details. If you don't specify a target file name, lr1p
- will send the expanded source file to stdout; you can redirect
- this in DOS with the ">" operator, i.e.
-
- lr1p -mC gram skeleton > target
-
- Executable GRAMCHK is used for grammar reporting and checking; you
- don't need to call it to form your compilable target files. It
- has many parameters, which are explained through the QPHELP.HLP
- file. Again, use option -mC to inform it that the target language
- is C.
-
-
-
- QPARSER+ User Manual page 11
-
-
- Chapter 3. Grammars
-
- A grammar is a set of production rules. The rules can be in any
- order, except that the so-called "goal production" must be the
- first in the set. Here's what a rule looks like:
-
- Stmt -> Expr <eol> #PRTVAL
-
- This says that there is some class of input strings, or
- "sentences" that we'll call "Stmt", that consists of the class
- "Expr" followed by an end-of-line token, "<eol>". The symbol
- "#PRTVAL" is a "tag" that will used as a handle, or reference, to
- this particular rule. Symbol PRTVAL will appear later as a C
- define in the generated code, and will stand for some small index
- number.
-
- The symbols "Stmt" and "Expr" are called "nonterminals". The
- symbol "<eol>" is called a "terminal". A terminal stands for
- itself, while a nonterminal can stand for any one of a set of
- strings.
-
- The arrow "->" is just a place-holder in the production rule.
-
- You can have several production rules with "Stmt" on the left
- side. For example, in the grammar CALC.GRM, there are several
- like that:
-
- Stmt -> Expr <eol> #PRTVAL Stmt -> <identifier> := Expr <eol>
- #ASSIGN1 Stmt -> <eol> \ allow an empty line
-
- Whenever several such rules exist, it means that "Stmt" can expand
- into any one of the given right parts. You have a choice between
- expanding Stmt into "Expr <eol>", or "<identifier> := Expr <eol>",
- or just "<eol>".
-
- The second rule represents an assignment statement as it might
- appear in Pascal. The intention is to replace the value
- associated with the user name "<identifier>" with the current
- value calculated from "Expr". We'll see that "Expr" can expand
- into a large number of algebraic expressions, so there's lots of
- possibilities in that second rule.
-
- The symbol ":=" is an example of a "token". A "token" is a short
- string that appears in the input language. It stands for itself,
- and is recognized as a whole by a special front-end character
- scanner called the "lexical analyzer".
-
- An <identifier> is a builtin nonterminal that stands for any
- string starting with a letter and continuing with letters, digits
- and underscores. (You can change this convention to accept other
- kinds of identifiers). It's used in programming languages to
- represent types, constants, variables, functions, etc. You can
- introduce a name like "X" or "Sensor2", and the parser will scan
- it and associate with the <identifier> nonterminal. Each such
- <identifier> is also a "token".
-
-
- QPARSER+ User Manual page 12
-
-
-
- Every <identifier> is also plugged into a symbol table, so that
- when the same name appears more than once, you'll automatically
- have a handle on the same thing through the symbol table.
-
- The third rule,
-
- Stmt -> <eol> \ allow an empty line
-
- makes use of Qparser's comment convention. The backslash causes
- it and the rest of the line to be ignored.
-
- A production rule has to be written on a single text line. If you
- insert a line ending in the middle, LR1 will probably complain
- about a syntax error. Use a backslash as the LAST character in a
- line in order to cause a line continuation. Or, use long lines --
- LR1 will accept a line as long as 255 characters.
-
- Recursive Rules
-
- An important principle to use in writing a grammar is to use a
- "recursive rule" when you need to express the idea of repeating
- something any number of times. Look at the CALC.GRM file again.
- You'll find a pair of rules like this:
-
- Stmts -> Stmts Stmt
- -> <empty>
-
- There are two more ideas here -- the "<empty>" means literally
- "empty", or nothing, not even a single character. That's OK -- it
- means that the Stmts nonterminal can expand into nothing at all.
- You may think that a rule like would be impossible to recognize
- during parsing. Well, it may cause some problems, but it can be
- recognized on the basis of what characters are ahead of and behind
- the "empty".
-
- The second idea is that that when there's no left member, just
- "->", Qparser assumes the same left member is used as in the
- previous rule. So we really have the rules
-
- Stmts -> Stmts Stmt
- Stmts -> <empty>
-
- Now recursion -- the first rule says that a Stmts is another Stmts
- followed by a "Stmt". It would appear that nothing's gained.
- What can this possibly mean? Well, assume that we understand what
- a Stmt stands for -- we've already looked its rule, in fact. Then
- a Stmts can be "nothing", thanks to rule.
-
- Ah ha! ...then the first rule says that a Stmts can be "nothing"
- followed by a Stmt, or just Stmt. Given that, it's clear that a
- Stmts can also be a Stmt followed by a Stmt. (The two forms of
- Stmt don't have to stand for the same thing, of course). And so
- forth, ad infinitum.
-
-
-
- QPARSER+ User Manual page 13
-
-
- This is an example of a "left recursive rule". You'll see this
- repeated over and over in the grammars. The pattern is this:
-
- Something -> Something X
-
- where X can be anything at all, except <empty>. There also has to
- be a non-recursive rule for Something, such as
-
- Something -> Y
-
- Then it's easy to see that SomeThing will stand for any of these
- sentences:
-
- Y
- Y X
- Y X X
- Y X X X
- (etc.)
-
- Recursion is how we get "any number" of something -- in this case
- X -- from a small number of rules.
-
- GOAL Production
-
- The first production rule in a Qparser grammar is assumed to be
- the "goal production". This is where you start an expansion
- process. There must also be only ONE such rule with its left
- member. You are allowed to use the special token "<eof>", which
- stands for the end-of-file in this rule, as the last token in the
- rule. You can't use <eof> anywhere else, however.
-
- Grammar CALC.GRM has this goal production:
-
- Goal -> Stmts QUIT <eol> #QUIT
-
- It clearly says that the expanded forms all have to be a "Stmts"
- followed by the keyword QUIT followed by an end-of-line. An
- end-of-file is implied at the end. Since "Stmts" expands into
- nothing or more "Stmt" forms, the CALC grammar is a list of "Stmt"
- forms followed by QUIT as the last one.
-
- Since QUIT doesn't appear anywhere else, it's only allowed at the
- end of an input file.
-
- Grammar Syntax
-
- Qparser grammar rules must be expressed through the following
- rules:
-
- * Use a backslash to start a comment, which extends to the end
- of the current line.
-
- * Use one or more spaces to separate the rule tokens.
-
- * A rule must occupy a single line. However, a backslash at the
-
-
- QPARSER+ User Manual page 14
-
-
- end of a line, followed only by a line return character, is
- interpreted as a continuation to the next line.
-
- * Most terminal tokens (like QUIT or :=) don't have to be
- quoted. They are distinguished from nonterminals by the fact that
- they don't appear as a left member of any rule. Names are assumed
- to be case-insensitive in the default lexical analyzer. You need
- to express them in upper case in the grammar.
- However, you can override case-insensitivity with suitable
- changes to file SKELLEX.C. If your language is case-sensitive,
- make sure the tokens are expressed correctly in the grammar, and
- remove the "to_upper" function found in function get_symbol, file
- SKELLEX.C.
-
- * If you need a token that starts with a quote mark, a
- backslash, or a hash mark (#), use a quote (') to open and close
- the token, for example,
-
- '#'
-
- However, be careful about quote marks and spaces as tokens --
- their use will probably confuse the lexical analyzer. If a quote
- mark is part of a token, you need to double it, i.e.
- 'one''token'.
-
- * A rule doesn't have to be tagged, i.e. (#GOAL), but we
- recommend tagging every rule except single productions. A single
- production has a single nonterminal as its right member. Tag
- everything else. The tags should be unique, and will reappear as
- C define names, so they need to look like C identifiers. You can
- use the same tag on different rules, but you'll be warned, and you
- need to make sure that having two rules with the same target won't
- cause problems.
-
- Expressions
-
- CALC describes a complete set of algebraic expressions with the
- simple operations + - * / and parenthesizing. You can explore
- these and their properties by running the CALC grammar on various
- input forms if you want to. The registered version's manual
- discusses these and other commonly-found expressions, statements,
- declarations, etc. It also has a C grammar and an Ada grammar.
- You can also find grammars in various reference books -- see the
- reference list at the end of this document.
-
-
-
- QPARSER+ User Manual page 15
-
-
- Chapter 4. Building a Parser
-
- A grammar describes the structure of a language. You usually
- aren't interested in using a grammar to generate language strings,
- as we've described above. You usually have some string, in the
- form of a file or user input, and you want to break it down into
- the structure defined by the grammar. The program that does that
- operation is called a "parser".
-
- Qparser will build a parser for you from your grammar. Let's do
- that by walking you through a parser construction for the CALC.GRM
- grammar.
-
- Start by building the library files found in the QPARSER\LIB
- directory. These are used over and over and will almost never
- change, so they've been pulled out. The make file there -- when
- run -- should create a library file called QP.LIB, containing
- these oft-used functions. They'll be used in linking your parser
- from here on.
-
- Now build the parser by running the make file found in the CSKELS
- directory. Make should run LR1 on the grammar file CALC.GRM, then
- OPT, then LR1P on each of the skeleton files listed in the make
- file. Finally, it will compile each of the files in CSKELS,
- including some that LR1P didn't touch. These, along with the
- QPARSER\LIB library, should link together to form an executable
- file. It'll be called MAIN.EXE -- the make file renames it.
-
- Try executing MAIN as follows:
-
- main -d1
-
- This invokes some Qparser debugging tools, so that you can follow
- what the parser is doing. You'll get a prompt, like this:
-
- >
-
- Enter a statement, like this, followed by an Enter:
-
- X + 16 * (5-6)
-
- You can enter spaces freely, or leave them out, between the
- tokens.
-
- The response should be this:
-
- Primary -> <identifier> #VARIABLE
-
- [idebug] ?(help, N(ame, A(ll, D(bug level,
- P(rodTrap, T(race, L(ex, S(tack, ENTER?
-
- What you've done is cause the rule Stmt -> <empty> to be applied,
- then go on. The next rule to be applied is Primary->
- <identifier>. (The <identifier> stands for the X you entered as
- the first variable name).
-
-
- QPARSER+ User Manual page 16
-
-
-
- Keep typing the Enter key, and you'll get a complete sequence of
- production rules that corresponds to the sentence you've entered.
-
- This sequence comprises the "parsing" of your sentence. It won't
- end until you enter "quit" followed by an end-of-line, then an
- end-of-file mark. Under MSDOS, this pair is entered by typing
- "control-Z Enter" with input from the console. Even after that,
- you may get one or two more rules before the thing decides
- everything is done.
-
- The last rule applied will be the goal rule,
-
- Goal -> Stmts <eol>
-
- Bottom-Up Parsing
-
- This is a "bottom-up" parser. It collects all the pieces of a
- rule's right members, like the "Stmts" and the "<eol>" in the Goal
- rule, then collapses them into the Goal. However, the input
- sentence is scanned in a left-to-right fashion.
-
-
-
- QPARSER+ User Manual page 17
-
-
- Chapter 5. The Semantics Stack
-
- The semantics stack is a first-in-last-out (LIFO) stack that
- carries information about the individual tokens used in the parse.
- It grows and shrinks as the input sentence is scanned and the
- rules are applied. Growth occurs when a token is accepted from
- the input sentence -- each token is pushed into the stack as it
- appears in a left-to-right fashion in the input list. The stack
- then shrinks when a production rule is applied. The growth action
- is called a "read" or "shift" move, from the fact that an input
- token is shifted into the stack.
-
- Each rule application is called an "apply" or "reduce" action. An
- application causes one or more things on the top of the stack to
- be peeled off and replaced by a single state representing a
- nonterminal symbol. The apply action is the one that you'll use
- to obtain a translation -- the others are just behind-the-scenes
- window dressing that you can essentially ignore. Qparser is
- structured to ignore them for you, in fact.
-
- A third action, called a "lookahead" is sometimes seen. This
- action results in a state change, but nothing happens to the stack
- and no token is shifted from the input list. It's a kind of
- gear-change required by the parser.
-
- Let's watch all this happen next. Run MAIN with the option '-d3'
- instead of '-d1'. That will cause the semantics stack to appear,
- and also follow each shift action. When we ran with option '-d1',
- the shift and lookahead actions were skipped over. Option '-d2'
- shows the stack with each reduce action, and option '-d3' shows
- the stack on every action. Option '-d4' will show error recovery
- action on a syntax error -- that one's really verbose and boring
- unless you are interested in how Qparser handles syntax errors.
-
- So type this expression at the prompt '>':
-
- X + 16 * (5 - 6)
-
- You'll see this on the screen after entering that expression and
- touching the ENTER key several times. (We've omitted the two
- prompt lines for the sake of clarity, and the states are separated
- by a blank line.)
-
- Reduce, state 1
- Stmts ->
-
- Read, state 18 on id X
- 1: 1 Stmts NULL
-
- Look, state 34, on token +
- 1: 1 Stmts NULL
- 2: 18 <identifier> IDENT: X
-
- Reduce, state 3
- 1: 1 Stmts NULL
-
-
- QPARSER+ User Manual page 18
-
-
- 2: 18 <identifier> IDENT: X
- Primary -> <identifier> #VARIABLE
-
- VARIABLE seen
- Look, state 35, on token +
- 1: 1 Stmts NULL
- 2: 18 Primary NULL
-
- What does all this mean? Well, the parser has worked through four
- states, counting the very first reduce (state 1) on Stmts ->
- <empty>. The second state (18) is a read, which shifts the first
- token X into the stack. The report tells us that X is an
- identifier.
-
- The third state (34) is a lookahead. Token + is used in this
- state transition, but it stays in the input list, as we'll see.
- The token X has been shifted into the stack and occupies the stack
- top.
-
- That leads to the fourth state (3), a reduce. This is the
- interesting one. Note that the previous lookahead didn't change
- the stack -- the identifier X is still on top. Under it is a
- Stmts, which it obtained from the first reduce action. The stack
- 'top' will always be listed on the bottom.
-
- This reduce state (3) is associated with a production about to be
- applied --
-
- Primary -> <identifier> #VARIABLE
-
- You will notice that the production's right member is an
- identifier, and it happens that an identifier is on the stack top.
- Coincidence? Not at all. It will always be that way --
-
- *****************************************
- The top elements of the stack will always
- match the applied rule right member.
- *****************************************
-
- For example, if the grammar rule at the moment of a reduce action
- is
-
- Expr -> Expr + Term #PLUS
-
- then the semantics stack will look like this:
-
- TOS-3: (stuff below the stack top)
- TOS-2: Expr
- TOS-1: +
- TOS: Term
-
- This is an important idea, and we'll keep reminding you of it.
- Note that we'll show the top of the stack (TOS) as the last thing
- printed, so the stack appears to be upside down. This is because
- we are usually only interested in a few elements near the top, not
-
-
- QPARSER+ User Manual page 19
-
-
- the stuff underneath, which may scroll off the screen.
-
- Each stack element is associated with a terminal or nonterminal
- token. We'll see that nonterminal stack elements can be used to
- carry a data structure, which can be as simple as an int, or a
- complex as a pointer to a large data structure.
-
- Terminal elements (other than <identifier>, <string>, <real>, and
- <integer>) won't carry information as a rule, although there's no
- reason they couldn't.
-
- When you type ENTER, you'll get a brief message that announces
- that this production -- tagged VARIABLE -- has been applied:
-
- VARIABLE seen
-
- You'll get a similar message every time a tagged production is
- applied. Untagged productions aren't so announced. This is built
- in as the default action on every tagged production. It's to
- remind you to fill in an "apply" action clause as explained later.
-
- Getting the Whole Parse
-
- You can view the whole parse trail without the debugging prompts
- by using option -D3 (note the capital D rather than lowercase d),
- for example:
-
- main -s -D3 input > output
-
- The "-s" option causes input lines to be printed on the screen
- along with the debugging information.
-
- "input" is a file containing the sentences to parse. It should
- contain:
-
- X + 16*(5 - 6)
- quit
-
- "output" is a file that receives the textual output from the
- running MAIN program. The -D option causes stack debug
- information to be sent to the output file, but with no prompts
- requiring attention. You'll get a whole trail of stack dumps
- which you can examine with your editor.
-
- If you try this without an "input" file, MAIN will expect input
- from the console, but you won't get a prompt. You need to type in
- the recommended statements, followed by control-Z ENTER to
- simulate the file input.
-
- The trail (in file "output") is rather long. We'll just look at a
- few sections of it to bring out a few more points.
-
- Find this piece of the trail -- it's been preceded by a couple of
- dozen states:
-
-
-
- QPARSER+ User Manual page 20
-
-
- Reduce, state 14
- 1: 1 Stmts NULL
- 2: 18 Expr NULL
- 3: 23 + NULL
- 4: 27 Term NULL
- 5: 32 * NULL
- 6: 29 ( NULL
- 7: 19 Expr NULL
- 8: 25 - NULL
- 9: 28 Term NULL
- Expr -> Expr - Term #MINUS
- MINUS seen
-
- This is the first example of a reduce state in which the
- production's right member has more than one token in it. Notice
- how the top three states (7, 8, 9) match the production's right
- part.
-
- The very next state is this one:
-
- Read, state 25, on token )
- 1: 1 Stmts NULL
- 2: 18 Expr NULL
- 3: 23 + NULL
- 4: 27 Term NULL
- 5: 32 * NULL
- 6: 29 ( NULL
- 7: 19 Expr NULL
-
- If you compare this stack with the previous one, you'll see that
- elements 1 through 6 are the same, while elements 7 through 9 have
- been replaced by a single element corresponding to an Expr.
-
- You can think of a reduce action as
-
- *********************************************************
- replacing the stack top elements matching the right member of its
- production with a single element equivalent to the production's left
- member.
- *********************************************************
-
- We said earlier that the stack grows and shrinks. We've now seen
- what makes it grow (tokens shifted into the stack through read
- actions) and what makes it shrink (apply actions).
-
- Stack Represents Past History
-
- What can be said about the rest of the stack, the portion that is
- underneath the tokens that match a production right member? Well,
- the whole stack corresponds to a kind of past history of the
- parse. It carries condensed representations of what's been
- scanned in the input string. Look at the stack again for state
- 14. The deepest member, Stmts, came from an empty reduce action.
- The next-deepest is an Expr -- this was the result of scanning the
- identifier X. Then there's a + sign followed by Term -- the Term
-
-
- QPARSER+ User Manual page 21
-
-
- clearly corresponds to the number 16.
-
- Here's a copy of the stack, and this time, we've added the parts
- of the original expression
-
- X + 16*(5 - 6)
-
- that go with it:
-
- Reduce, state 14
- 1: 1 Stmts NULL <empty>
- 2: 18 Expr NULL X
- 3: 23 + NULL +
- 4: 27 Term NULL 16
- 5: 32 * NULL *
- 6: 29 ( NULL (
- 7: 19 Expr NULL 5
- 8: 25 - NULL -
- 9: 28 Term NULL 6
- Expr -> Expr - Term #MINUS
- MINUS seen
-
- There's a right parenthesis following the 6, but the parser hasn't
- scanned that yet. If you continue the parse, it'll show up. In
- fact, here's the next state, and -- sure enough -- the right
- parenthesis has been shifted into the stack:
-
- Reduce, state 12
- 1: 1 Stmts NULL <empty>
- 2: 18 Expr NULL X
- 3: 23 + NULL +
- 4: 27 Term NULL 16
- 5: 32 * NULL *
- 6: 29 ( NULL (
- 7: 19 Expr NULL 5 - 6
- 8: 25 ) NULL )
- Primary -> ( Expr ) #PARENS
- PARENS seen
-
- More reduce actions follow in rapid succession. The next one
- covers the multiplication:
-
- Reduce, state 15
- 1: 1 Stmts NULL <empty>
- 2: 18 Expr NULL X
- 3: 23 + NULL +
- 4: 27 Term NULL 16
- 5: 32 * NULL *
- 6: 29 Primary NULL ( 5 - 6 )
- Term -> Term * Fact #MPY
- MPY seen
-
- After a lookahead, the addition is reduced:
-
- Reduce, state 13
-
-
- QPARSER+ User Manual page 22
-
-
- 1: 1 Stmts NULL <empty>
- 2: 18 Expr NULL X
- 3: 23 + NULL +
- 4: 27 Term NULL 16 * ( 5 - 6 )
- Expr -> Expr + Term #PLUS
- PLUS seen
-
- A few more reduce actions, and the input sentences are completely
- swallowed up. The parsing action ends on reducing the Goal
- production, as follows:
-
- Reduce, state 10
- 1: 1 Stmts NULL
- 2: 18 QUIT NULL
- 3: 22 <eol> NULL
- Goal -> Stmts QUIT <eol> #QUIT
- QUIT seen
-
- This reduction obviously leaves just one element on the stack,
- associated with the Goal token. Two things have to happen in
- order to terminate a parse with no errors -- the stack should
- contain just the Goal token, and no more tokens appear in the
- input sentence. If only one of these is true, then the input
- sentence is considered to have a syntax error.
-
-
-
- QPARSER+ User Manual page 23
-
-
- Chapter 6. Accessing Semantic Elements
-
- We now discuss how to do something when a production rule is
- applied. In particular, there'll be stuff associated with each of
- the nonterminals on the stack, including "<identifier>", which
- stands for various names. The rules for handling semantics are
- fairly simple, as follows:
-
- 1. The array "semstack" is an array of pointers to type
- "semrectype" (found in QPARSER\CSKELS\DECL.H) is created in each
- stack position. Each element will in general be NULL or a
- pointer. You should assume that a non-NULL element has been
- allocated from the heap, and must be released when no longer
- required. The struct "semrectype" carries all the information
- associated with the stack element. There are two type indicators,
- and a union type, corresponding to various kinds of thing that can
- be on the stack. You will probably want to expand this with your
- own special types.
-
- 2. Variable "stackx" is the index of the top-of-stack element
- in "semstack". You can access the semrectype struct located at
- index N below the top of the stack by the expression
-
- semstack[stackx-N]->(whatever)
-
- N==0 refers to the top of the stack element, N==1 to the one just
- under it, etc. In relation to a production rule like this:
-
- P -> X4 X3 X2 X1 X0
-
- the N==0 member is the RIGHT-MOST element, N==1 is next to it,
- etc. ALL the right members need to be counted.
- For example, a <real> in position X2 is accessed as
-
- semstack[stackx-2]->usem.rval
-
- Note that the "usem" is how the union type is referenced.
-
- 3. As a shorthand, the define TOS(N) is equivalent to
-
- semstack[stackx-N]
-
- Thus the <real> given above can be referenced as
-
- TOS(2)->usem.rval
-
- 4. Function "apply" is called on each apply action for which a
- production rule tag has been assigned. (It isn't called
- otherwise). This function is in file QPARSER\CSKELS\EVAL.C, which
- has been expanded from the skeleton file SKELEVAL.C. You'll find
- that this is essentially a switch statement on the production rule
- tags.
- When a particular production rule apply action is about to
- occur, the apply is called, whence the switch will separate out
- the rule that's being applied. The stack is guaranteed to carry
-
-
- QPARSER+ User Manual page 24
-
-
- the right member semrectype invocations. It's up to you to use
- these struct values and construct a semrectype to replace them.
-
- 5. Qparser will allocate a "semrectype" for each of the special
- nonterminal types <identifier>, <integer>, <real>, <string>, and
- <character>. All other tokens shifted into the stack are
- associated with NULL. All other nonterminals will have NULL in
- the stack, unless you have allocated a semrectype for certain of
- them, as explained later.
- You need to free these before returning from apply. The safest
- course of action is to call the function "disp_sem" on each of the
- right member nodes that might have something in it. It will deal
- with NULLs and the known semrectype types properly. You need to
- expand this function if you have other kinds of nodes.
-
- 6. For most parsers, you can just use the information given in
- the stack and generate some code. If the production has something
- to do with a declaration, it'll have an <identifier> in it --
- you'll want to do something with the symbol table attribute
- associated with the identifier. If that's all you are doing, then
- make sure that the apply function returns NULL. When "apply"
- returns, the parsing system will strip the stack back by the
- number of elements in the rule's right part, then push the return
- pointer from "apply"; this will be NULL in this case.
-
- 7. If you want to save something on the stack from this
- production rule -- and that gets us into some advanced parsing
- ideas -- you should allocate a semrectype struct from the heap,
- and return its pointer from "apply". That pointer will get pushed
- on the stack. Note that it is effectively associated with some
- rule left member token, for example "Expr". This pointer will
- show up later in the parse. You can identify it by the appearance
- of "Expr" in the right member of a production.
- Use "new_sem" to allocate a "semrectype". It also initializes
- various slots in the struct.
- Say you've done this. You now must make sure that:
-
- 7a. You've done the same thing for ALL the rules with "Expr"
- as their left member. It's OK to push NULL for some of these,
- provided you test for NULL when you find "Expr" appearing later.
- But you need to do something sensible for all the productions,
- since the reappearance of "Expr" may have come from any of them.
-
- 7b. Since the "Expr" reappearance may refer to a struct
- allocated from the heap, you should free that struct before
- returning from apply. Otherwise, you'll get a memory leak, since
- the parser doesn't pay attention to what's on the stack when the
- top part is scraped off. Obviously, it doesn't need to be freed
- if it's NULL.
-
- The <identifier> and the Symbol Table
-
- Every appearance of <identifier> in your grammar causes the
- scanner to expect a user name object, for example,
-
-
-
- QPARSER+ User Manual page 25
-
-
- credit_slip
- itemized_bill
- sender001
-
- or something like that. Now there will also be tokens in your
- grammar that look like a user name but are actually "reserved
- words" or "key words". In C, for example, some of these are
-
- while
- for
- switch
- case
-
- and so on. A reserved word is distinguished from an <identifier>
- by the fact that it will explicitly appear in the grammar. For
- example, in C, there might be the rule
-
- Stmt -> while ( Expr ) Stmt
-
- in which the reserved word "while" has explicitly appeared.
-
- The problem for the scanner is that both appear superficially like
- a user identifier. There are different ways of distinguishing
- these, but in Qparser, the solution chosen is this:
-
- 1. The lexical analyzer picks up and scans everything that
- starts with a letter and continues with a letter, digit or
- underbar. The code that does this specifically is in function
- "get_symbol" found in skellex.c. You can change this strategy if
- you want to, in order to accept different kinds of things as user
- names, but be careful -- the system assumes that keywords look
- like user names.
-
- 2. Each user name is looked up in a symbol table. If it's in
- the table, it might be there as the same user name seen
- previously, or it might be marked with an attribute that says
- "this is a keyword". The keyword attribute is RESERVED. If the
- name has the RESERVED attribute, the symbol table also carries its
- token code, and the token scanner returns that, not the
- <identifier> token code.
-
- 3. The keywords are put in the symbol table as part of the
- parser initialization. This is also in SKELLEX.C -- see function
- "inittables", which opens the symbol table system for operation
- and stuffs it with the known keywords. You'll also find a bunch
- of other things in the symbol table as keywords that don't look
- like user names -- ":=" for example, in CALC. These are used by a
- different section of the lexical analyzer to figure out compound
- tokens.
-
- The symbol table functions are defined in the file
- CSKELS\SYMS.C, and the structs and prototypes in CSKELS\DECL.H.
- The symbol table consists of an array of pointers to a linked list
- of "symtabtype" structs. Each symtabtype carries a pointer to the
- next symtabtype in the list, if any, and NULL if none. Each
-
-
- QPARSER+ User Manual page 26
-
-
- symtabtype also carries the symbol string as a heap-allocated
- string "sym". A scope level "level" is used to distinguish
- reserved tokens, globals, locals, etc. An attribute tag "symt"
- has a few predefined values, and will usually be expanded by you
- in order to support a variety of symbol classes, for example,
- labels, typedefs, constants, etc.
- A symbol's attribute may also include other values.
-
- Several symbol table functions are provided through the SYMS.C
- module. For example, "findsym" accepts a character string "fsym",
- and returns a pointer to the symtabtype in the symbol table with
- that symbol string, or NULL if it can't be found. "makesym" does
- the same thing, except that it will always insert the symbol at
- the head of the list if it can't find one by that name already.
- By using a combination of these functions, you can shape the
- symbol operations of your language in a variety of ways.
-
-
-
- QPARSER+ User Manual page 27
-
-
- Chapter 7. More about the Lexical Analyzer
-
- The lexical analyzer has the task of scanning the input string,
- which may be a user-entered string or a file. It pays attention
- to line endings, surplus spaces, comments, etc. and attempts to
- recognize and extract a sequence of tokens that the parser can
- operate on.
-
- The Qparser lexical analyzer (which we'll call QLA) is partially
- constructed from information obtained from the grammar. When LR1
- scans the grammar, it also produces tables of information needed
- to build the analyzer, for example, what reserved words are found,
- and what special tokens are found.
-
- How all this works is best explained in the registered version's
- manual. You can of course study the code in SKELLEX.C and a
- generated file such as LEX.C. In a nutshell, here is a guide to
- Qparser's lexical system:
-
- * QLA operates in a "skipblanks" mode or in a "recognize" mode.
- It starts in the skipblanks mode, in which it scans over -- and
- ignores -- spaces, tabs and comments. It goes into the recognize
- mode when something shows up that is neither of these. See
- function "get_token" in LEX.C which makes the mode decision and
- later branches to one of the token functions.
-
- * All the QLA functions ultimately appear as explicit C code,
- not as cryptic tables. You can adjust them for particular
- circumstances as required.
-
- * Identifiers and reserved words are considered to start with a
- letter, and continue with letters, digits and certain other
- special characters. Each identifier found in the source is
- entered in the symbol table, but only if it's not already there.
- Reserved words are separated from user names by their appearance
- in the symbol table as such. Reserved words are marked by special
- initialization code found in function inittables. See function
- "get_symbol" in LEX.C.
-
- * Numbers (integers and floats) are considered to start with a
- digit, and continue until a C-like floating point number is
- recognized. Note that a leading sign (+, -) is NOT recognized --
- you need to place these explicitly in the grammar. The default
- CSKELS number scanner accepts simple integers, simple decimal
- numbers, or decimal numbers followed by an exponent part. The
- exponent part must be "E" optionally followed by "+" or "-", and
- followed by an integer. No spaces or line returns may appear
- within the number. Floating point forms are converted to a double
- internally. Integers are converted to a long int. See function
- "get_number" in LEX.C.
- A more complete C number recognizer is provided in the
- registered version.
-
- * Strings are considered to start with a string quote ("), and
- continue through the closing quote. A line return may not divide
-
-
- QPARSER+ User Manual page 28
-
-
- a string. The default string recognizer also accepts \" as an
- embedded quote, and \\ as an embedded backslash, per the usual C
- string conventions. No other backslash conventions are provided.
- See function "get_string" in LEX.C.
- A more complete C string recognizer is provided in the
- registered version.
-
- * All other tokens, such as "+", ":=", ">>", that appear in the
- grammar, are recognized through a "greedy" algorithm. Qparser
- provides a unique way of scanning these that is explained more
- fully in the registered version. See function "get_special" in
- LEX.C. This function is modified by LR1P according to information
- derived from the grammar.
-
- Your Very Own Lexical Analyzer
-
- You can also replace all of this with your own lexical analyzer.
- It only needs to do the following:
-
- * Scan whatever source you have
-
- * Recognize the terminal tokens that appear in your grammar
- somehow.
-
- * Assign a numeric code to each token so recognized that
- corresponds to the numbers assigned by LR1. You can get a list of
- the tokens and their code numbers through option -t when you run
- LR1. Return this int code from "get_token".
-
- * If a token represents a class of strings rather than a single
- string, you need to have your lexical analyzer set up a semrectype
- for it. This is assigned to the global variable "lsemp", which in
- turn will appear on the parser stack at the appropriate place.
- lsemp is allocated from the heap. Follow the examples for
- <identifier>, <real> and <string> found in LEX.C.
-
- You can also keep the lexical analyzer extremely simple and get
- the grammar parser and your semantics codes to do all the work.
- However, note that a grammar is not a very good way to describe
- the idea of a comment that can appear anywhere, so you will need
- to build this into your lexical analyzer. You'll find similar
- problems with quoted strings.
-
-
-
- QPARSER+ User Manual page 29
-
-
- Chapter 8. Embedded Semantics
-
- Qparser supports "embedded semantics", which means that you can
- write the semantics for a rule directly in the grammar. We
- recommend doing this, rather than modify file EVAL.C after Qparser
- has produced it by expanding SKELEVAL.C.
-
- An example of this is given in the file QPARSER\CSKELS\CALCC.GRM.
- Please refer to it in the discussion that follows.
-
- An embedded semantic rule looks like this:
-
- Stmt -> Expr <eol> #PRTVAL
- {
- printf("%lg\n", STACK(1)->usem.rval);
- FREE(1);
- }
-
- The production rule is of course the first line. The tag #PRTVAL
- must appear on the same line. You can now write several lines of
- embedded C code in the following lines, provided that you follow
- these rules:
-
- 1. The left and right braces {...} must be balanced and
- matched. If they aren't, LR1 will get confused and complain about
- a syntax error. LR1 understands the C string and comment syntax,
- so you can have unbalanced braces in those.
-
- 2. If you're fussy about indenting rules, the indentation of
- the left brace in the first line will be fit to the indentation
- specified in the target file, and the remaining lines will be
- indented relative to it. Try some examples through LR1P and
- you'll get the idea.
-
- 3. The embedded semantics code fragment will appear in function
- "apply" found in EVAL.C, after expansion of SKELEVAL.C by LR1P.
- It will become the particular case branch code associated with the
- specified tag. The "break" usually used at the end of a case code
- fragment is provided by the expansion code.
-
- CALCC Semantics
-
- We're now in a position to review the CALCC semantics. CALCC has
- the same grammar as CALC, except that it has been embellished with
- embedded semantic rules. Please refer to the
- QPARSER\CSKELS\CALCC.GRM file for details.
-
- The idea here is that each of the nonterminals Expr, Term, and
- Primary will be associated with a semrectype struct, which in turn
- will carry a floating point number in its "rval" slot. (Look at
- the struct in file DECLS.H and you'll find this).
-
- The PRTVAL rule is a simple one to start with:
-
- Stmt -> Expr <eol> #PRTVAL
-
-
- QPARSER+ User Manual page 30
-
-
- {
- printf("%lg\n", STACK(1)->usem.rval);
- FREE(1);
- }
-
- Here, we just print the real number in the semrectype struct
- associated with "Expr". Note that the <eol> is in position 0 on
- the stack, and "Expr" is in position 1, so STACK(1) expands into
- semstack[stackx-1]. (The #PRTVAL tag doesn't count in this
- indexing scheme).
-
- The "Expr" element has been allocated from the heap, so it should
- be freed before returning from the apply action. That's done by
- the FREE(1) -- the "1" refers to index position 1, or the "Expr"
- position.
-
- Getting a REAL on the Stack
-
- Here's how a real semrectype struct gets on the stack:
-
- Primary -> <real> #REALVAL
- {
- tsemp= TOS(0);
- }
-
- The only place that "<real>" appears in a right member in the
- grammar is in this production rule. If you follow the rule chain
- backward, you'll see that a Term can generate a Primary, and an
- Expr can generate a Term, so it's possible for the <real> element
- to pop up as an Expr.
-
- The action here is simple -- just return the top-of-stack element,
- TOS(0). What's going on? Well, Qparser will allocate a
- semrectype when it scans a real number, and will place a pointer
- to it on the stack. So all we have to do is return it through the
- apply. The returned pointer will be pushed onto the stack and be
- associated with the Primary. That in turn will later get
- associated with Term and ultimately with Expr.
-
- Single Productions
-
- There's a bit more magic that we haven't mentioned -- the apply
- function returns the current top-of-stack pointer TOS(0) when the
- production has no flag. That happens through this code fragment
- found in "apply":
-
- case 0: /* no flag */
- if (popno[cstate]==1) tsemp= TOS(0); /* pass this through */
- break;
-
- This works fine for each untagged single production, for example,
-
- A -> B
-
- where both A and B are nonterminals. Whatever was associated with
-
-
- QPARSER+ User Manual page 31
-
-
- B becomes associated with A. However, a tagged single production
- will have a case branch, and you need to deal with the return
- value, which will be NULL by default. You also need to make sure
- that a tag appears on every empty production and every non-single
- production, otherwise you'll get interesting things on the stack
- by default.
-
- That's really how a Primary turns into a Term, etc. -- through
- the automatic semantics associated with single productions.
-
- Binary Operations
-
- Now look at a typical binary operation production:
-
- Expr -> Expr + Term #PLUS
- {
- CCBINOP(+);
- }
-
- This just calls CCBINOP. This is a macro, as follows:
-
- #define CCBINOP(op) { NEW(FLOAT,REALVAR); \
- tsemp->usem.rval= TOS(2)->usem.rval op TOS(0)->usem.rval;\
- FREE(0); FREE(2); }
-
- (This is in the grammar file CALCC.GRM, too, but isolated from the
- production rules)
-
- What CCBINOP does is allocate a new semrectype struct from the
- heap, with the types FLOAT and REALVAR. This will be assigned to
- the local pointer variable tsemp.
-
- In the second line, the operation "op" expands into "+", and
- therefore carries out an addition followed by an assignment. The
- addition is of the two right production member values, "2" and
- "0", corresponding to "Expr" and "Term", respectively. The values
- are assigned to the "rval" slot of tsemp.
-
- Finally, members 0 and 2 are freed. We could have kept one of
- them for the result value, but that's dangerous in general.
-
- Other Grammar Semantics
-
- The CCBINOP define, along with some other useful ones, are part of
- a section of the grammar that's kept apart from the production
- rules. The section opens with "#begin" in the leftmost column of
- a new line. In CALCC.GRM, the line is
-
- #begin apply_locals
-
- The section continues with whatever C code you want to put in, and
- ends with
-
- #end apply_locals
-
-
-
- QPARSER+ User Manual page 32
-
-
- The names "apply_locals" must be matched.
-
- You can have any number of such blocks of code in your grammar.
-
- They reappear in a skeleton file, such as SKELEVAL.C, wherever you
- see the phrase
-
- {## begin
- write(apply_locals);
- end; ##}
-
- You'll find this in the skeleval.c file within the function body
- of "apply". Compare with the eval.c file and you'll see that the
- code block has been properly inserted from the grammar.
-
- In fact, the "write" phrase has been conditioned with
-
- if (defined(apply_locals)) then ... ;
-
- just in case a block called "apply_locals" doesn't exist. You'll
- discover some other blocks in skeleval.c that won't be expanded by
- CALCC.GRM, since they aren't defined.
-
- Other code blocks are provided in SKELEVAL.C, to support global
- functions (eval_functions), a call just before the first
- production is called (init_sem), and a call just after the last
- production is called (end_sem). You can add more to any of the
- skeleton files by following the pattern shown there. In this way,
- ANY language-dependent feature can be embedded in your grammar, so
- that you never need to modify the standard Qparser skeleton files
- again.
-
- Running CALCC
-
- It's best to delete *.TBL. This will force a rebuild and
- recompile based on a new grammar.
-
- You can compile and link CALCC by changing the makefile to accept
- CALCC instead of CALC as the grammar, e.g. run
-
- make GRM=CALCC (etc.)
-
- It should regenerate the grammar files and source files, and link
- to produce a working calculator as file MAIN.EXE. (NOTE: Turbo's
- make utility seems to require changing the GRM=CALC line in the
- makefile instead).
-
- Try out the compiled/linked result (without the -d1 option) with
- some expressions, and you'll see how it evaluates and prints
- results.
-
-
-
- QPARSER+ User Manual page 33
-
-
- Chapter 9. Using the DEBUG Option
-
- Qparser provides several tools for checking your grammar and
- semantics. These are invoked by building your files with the make
- option -DDEBUG=1, i.e. the DEBUG variable must be defined.
- You'll find instructions on this in the make file. (Again,
- Turbo's make utility doesn't accept this; you need to modify the
- make file directly).
-
- You can invoke the stack debugging when executing the parser by
- entering the special character "!" with the input. When the
- parser encounters this, it will invoke the debugger on the current
- stack. You can then set the debugging level to 1 or higher -- try
- 2 to see the whole stack -- and each Enter will show you the
- situation just before a particular apply action.
-
- The options -dN and -DN cause the stack debugging to start before
- anything is entered, which is often useful. The lower-case 'd'
- sets the debug level (N= 0, 1, 2, ...) and pauses on each apply
- action for your response. The upper-case 'D' sets the debug
- level, but does not wait for a pause -- you can get a detailed
- trace of the actions with it.
-
- When you execute the CALCC program under debug level 2 or higher,
- you'll see the values entered appearing on the stack associated
- with various nonterminal tokens. You should be able to see how
- the system works using this.
-
- The character "!" is caught in the lexical analyzer, file
- SKELLEX.C, in function get_token. You can remove this or change
- the character to something else by adjusting the source code and
- recompiling.
-
- All the debug tools are in file DEBUG.C. You'll want to expand
- certain of its functions if you extend the symtabtype or
- semrectype structs with your own variant types, so that they will
- be reported correctly.
-
-
-
- QPARSER+ User Manual page 34
-
-
- Chapter 10. Syntax Trees
-
- Qparser can automatically build an "abstract syntax tree" that can
- hold all the information from a parser in a form especially
- convenient for generating code or handling a declaration.
-
- An abstract syntax tree, or AST for short, is rooted in some
- nonterminal token of the grammar. Each internal nodes also
- corresponds to a nonterminal token, and the leaf nodes will
- generally be an <identifier>, <number>, or <string>. (Sometimes a
- leaf node will be empty). Each node is represented by a
- "semrectype" struct, which always carries a node designator
- "semt", a node type "symt", some attributes, and slots for child
- node pointers.
-
- An AST will represent a clause in the input language stream that
- has been scanned and covered by the root nonterminal token.
-
- Let's take an example. Suppose the input clause is
-
- x - 15 * z
-
- This can be derived from an Expr, using our CALC grammar. Here
- are some of the key derivations from Expr that show how this
- clause expands out of Expr:
-
- Expr -> Expr - Term
- -> Expr - Term * Fact
- -> Expr - Term * <identifier>
- -> Expr - <integer> * <identifier>
- -> <identifier> - <integer> * <identifier>
- -> x - 15 * z
-
- We've left out several single-production derivation steps, for
- example,
-
- Fact -> Primary -> <identifier>
-
- in order to give you the general idea. In fact, the parser hides
- steps like these, too.
-
- What we'd like to do is describe this as a tree in which the nodes
- will be production tags. Note that a production tag corresponds
- to a production rule, which can also be loosely associated with a
- nonterminal. The tree we want will look like this:
-
-
-
- QPARSER+ User Manual page 35
-
-
- MINUS (-)
- |
- +-----------------+-----------------+
- | |
- <identifier> (x) MPY (*)
- |
- +-----------+-----------+
- | |
- <integer> (15) <identifier> (z)
-
- Notice how the "MPY: 15 * z" part comes out of the last few
- derivation steps in a natural way. It will come out of our
- bottom-up system first, in fact, and later get attached to the
- MINUS part. If you trace the productions and stack development by
- running CALC, you'll see the order in which these fall out.
-
- AST Represents Evaluation Order
-
- Why is this form useful? Because it concisely represents the
- order in which the operations need to be carried out. For
- example, the subtraction can't be done until the subtrahend (x)
- and the subtractor (MPY) are both known, so that requires that the
- MPY be done first. This agrees with the precedence rules of
- algebra, which are in fact built into the structure of the
- grammar. (You can set up the precedence rules in any manner you
- like by suitably organizing the production rules.)
-
- How Qparser Builds an AST
-
- Qparser generates a pair of tables that help it construct an AST
- like this. The function "make_tree", found in eval.c, is called
- as the first operation in function apply. Note that it returns a
- "tsemp" pointer. Each call of make_tree does the following, in
- general:
-
- 1. Allocates a semrectype node from the heap, by calling
- "new_sem". Let's call the node N.
- 2. Attaches certain of the stack semrectype nodes as N's
- children. The child nodes are also semrectype structs, and are
- linked to N through the array "position0". The size of
- "position0" is set by Qparser to hold just enough pointers for the
- longest possible production rule.
- 3. Returns the "tsemp" pointer which carries the current rule
- tag, and carries (as its children) the appropriate stuff from the
- semantics stack that corresponds to the current production rule
- right members.
-
- The "tsemp" pointer is returned by apply in general. If you do
- nothing else after "make_tree" is called, the apply calls will
- simply construct a big AST from your input stream. Eventually the
- goal production is invoked and the tree will represent everything
- in your input stream.
-
- Embellishments
-
-
-
- QPARSER+ User Manual page 36
-
-
- Function "make_tree" has several important embellishments, which
- you need to pay attention to:
-
- 1. It doesn't consider any terminal token as a child. For
- example, stuff like "+", ":=" are ignored, and don't get a
- placeholder in the tree.
- 2. A single production with no tag is just skipped over; that
- is, any tree rooted in B will simply be attached A, when a
- production like A->B is applied. If the single production IS
- tagged, a new node with a single child is created.
- 3. A production with no tag will be assigned the
- general-purpose tag GENL_KIND. You can sometimes figure out what
- the node is without knowing the tag. In general, we recommend
- that you assign a unique tag to EACH production, except for those
- single productions that can be bypassed in the tree.
- 4. An empty production, i.e. one like
-
- A -> <empty>
-
- will have a NULL child, i.e. position0[0] will be NULL for this.
- 5. The position0 indices go from left to right through the
- right members of a production as tree children, rather than from
- right to left as stack elements. Remember that terminals are
- skipped in the count. So the right-members in the production rule
- X -> A + B will be accessed like this in the AST:
-
- A: root->position0[0]
- B; root->position0[1]
-
- assuming that "root" is the pointer corresponding to X.
-
- In order to make it easier to refer to the children, you can use
- these macros, found in SKELDECL.H:
-
- #define POSITION(N) usem.position0[(N)-1]
- #define CHILD usem.position0[0]
- #define LEFT usem.position0[0]
- #define RIGHT usem.position0[1]
- #define ONE usem.position0[0]
- #define TWO usem.position0[1]
- #define THREE usem.position0[2]
- #define FOUR usem.position0[3]
- #define FIVE usem.position0[4]
-
- Tree Walking
-
- We said that an AST makes it easier to generate code than by
- constructing special semantics code for each production rule. The
- general idea is to write a tree-walking function that accepts a
- semrectype pointer as its argument. It then branches through a
- switch statement on the possible tags that can be associated with
- the root. You then decide what to do on each node.
-
- The biggest advantage of this approach is that the code generation
- actions can occur through your tree-walk in a different order than
-
-
- QPARSER+ User Manual page 37
-
-
- that in which they happen to pop up in the bottom-up parse. You
- can decide whether to inspect the right member ahead of the left
- member, for example, even though the left member gets built first
- in the parser.
-
- For example, if code generation requires a register assignment,
- the assignment algorithm can't be expected work in the same order
- as the production rule parsing.
-
- Another advantage is that you can walk over a tree several times
- if you need to in order to get certain actions to take place.
- This amounts to using multiple passes, except that you can control
- the passes to take place in only a limited section of the grammar.
-
- A third advantage is that this approach eliminates most of the
- worry about freeing invalid pointers or having a memory leak --
- virtually all the allocation and free operations are done in a
- safe fashion.
-
- Recursive Tree-Walking Functions
-
- Tree-walking is done with a recursive function that accepts a tree
- node (a "semrectype" pointer) as one of its arguments. It may or
- may not return a value. The art of designing such functions comes
- with practice. The general idea is that you need to decide just
- what the function is supposed to do in an abstract way, then make
- sure it does just that in each case.
-
- A common case is for it to just call itself on one or more
- children. You'll probably want to do this (but not always) on the
- left member of a recursive rule. For example, if the rule is
-
- Expr -> Expr + Term #PLUS
-
- then when the node is PLUS, you will probably start with a
- recursive call on the LEFT child, followed a call on the RIGHT
- child, followed by whatever stuff you want to do with the PLUS.
- It's possible for the left and right children to return some mixed
- types, e.g. double for the left and integer for the right.
- That's a signal that the Term value requires a cast to double in
- order to finish the addition. In Pascal, the "+" operation is
- overloaded as a string operation, so if one of the two things is a
- string, the other one must be, too, and a string concatenation is
- called for.
-
- CALC by Syntax Tree Evaluation
-
- This all sounds very complicated, but in fact, it's easier to do
- than fussing with specific production rule semantics. Look at the
- grammar file CALCT.GRM, which is a fully worked-out calculator
- that behaves just like CALCC.GRM, except that it is implemented
- through a tree-builder and tree-walker.
-
- The tree-walker is function "eval", found in CALCT.GRM. This
- accepts a tree node as a semrectype pointer, and returns a double
-
-
- QPARSER+ User Manual page 38
-
-
- value. Since we are implementing a calculator, a double will
- always be the "value" of any tree node. If we were implementing
- something else, such as a compiler, the return value might be the
- type of the tree as inferred from its members. Function "eval"
- will be called from within itself, and also from "apply" on
- completion of one the "Stmt" production rules. You can find the
- "eval" call associated with "Stmt" later in the grammar file.
-
- Function eval immediately switches on the "semt" member of the
- semrectype struct. "semt" will be one of the production rule
- tags, or one of the predefined tags IDENT, FIXED, FLOAT, STRNG,
- etc. that are reserved for an <identifier>, <number>, <string>,
- etc. See decl.h for a complete list. The tag GENL_KIND may show
- up here if you haven't tagged all your non-single production
- rules.
-
- Finding the Switch Tags to Consider
-
- The way to decide what tags need to be among your walker's switch
- list is to look at the derivations from the "apply" calls. In our
- case, the apply eval calls are from the nonterminal "Expr". So
- you need to see what "Expr" can derive -- it can derive a PLUS or
- a MINUS directly. Through "Term", it can derive MPY and DIVIDE.
- Then through "Fact" it can derive UMINUS, PARENS, VARIABLE,
- REALVAL or INTVAL. So all these tags need to appear within the
- switch statement. Note that this chain is the result of having a
- series of untagged single productions in the rule set, i.e.
-
- Expr -> Term
- Term -> Primary
-
- You can break the chain by tagging one of these, and catching the
- tag in your tree-walker. Then direct those calls somewhere else.
-
- CALC Eval Switch Cases
-
- Now let's look at the "eval" switch cases. Case PLUS is easy. We
- call eval on the left node (root->LEFT), then on the right node
- (root->RIGHT). These are supposed to each return a double value,
- and we want the double result of the "+" operation, so we return
- the sum of these function calls.
-
- Case MINUS, MPY and DIVIDE work the same way, except that we've
- put in a trap to catch a division by zero for the DIVIDE case.
- (It would be best to also trap on overflow and underflow, which
- could happen in any of the operations, rather than on just this
- obvious one).
-
- Case UMINUS is also easy. The single child is referred to by
- root->CHILD.
-
- Case PARENS, INTVAL, REALVAL and VARIABLE are also easy -- just
- return the value of root->CHILD as evaluated by eval.
-
- Case IDENT requires that we extract the identifier's value from
-
-
- QPARSER+ User Manual page 39
-
-
- its symbol table. It happens that we can depend on the identifier
- having been previously assigned-to through code in the VARIABLE
- production rule, as we'll see later.
-
- Case FIXED picks up the long value in root->usem.numval and casts
- it to a double.
-
- Case FLOAT simply returns the root->usem.rval, which is already a
- double.
-
- Case QUIT is thrown in for good measure. Note that it falls into
- the "default" case, which should always be included so that you
- catch any omissions in the tree-walker cases. This refers to the
- "flags" array, which is guaranteed to deliver a string
- representation of any of the legal root->semt values. In CALCT,
- there's no way that QUIT can be passed to eval, nor will the
- "default" case ever be called, so these could be "assert" calls or
- just left out completely.
-
- The APPLY Actions
-
- Now let's look at the apply function actions, which of course are
- associated with production rules. There isn't much. Note that
- most of the rules are tagged, but have no semantics. This means
- that the rule will get built into a syntax tree with nothing
- further required from you.
-
- However, we need to do something on the PRTVAL and ASSIGN1 rules,
- and in particular to stop the tree building on these rules.
-
- In the PRTVAL case, the required action is to evaluate the Expr
- tree, through the call
-
- evalue= eval(tsemp->CHILD);
-
- Assuming that goes well (why shouldn't it?), the result is
- printed. The tree is then disposed by a call to "disp_sem", which
- incidentally returns NULL so that tsemp returns something sensible
- from the apply call.
-
- In the ASSIGN1 production rule, we again evaluate the Expr tree.
- The value will then be associated with the <identifier> by the
- next two statements:
-
- tsemp->LEFT->usem.symp->usym.value= evalue; /* the assignment */
- tsemp->LEFT->usem.symp->symt= REALVAR; /* defined */
-
- What are these doing? The <identifier> is always associated with
- a symbol table entry, accessed through the pointer
-
- tsemp->LEFT->usem.symp
-
- where LEFT refers to the <identifier> position in the rule. The
- value associated with this identifier is then set to the recently
- evaluated tree value, evalue. The type of the value "symt" in the
-
-
- QPARSER+ User Manual page 40
-
-
- symbol table is set to REALVAR -- this states that the identifier
- has appeared on the left side of an assignment. We'll come back
- to this later.
-
- The remaining two lines are the same as for the PRTVAL case, since
- we want our calculator to print the result of the assigned-to
- variable.
-
- Note that the AST is unaffected by the two assignments. We have
- merely changed two entries in the symbol table attribute for the
- <identifier>. The tree can be disposed as usual, having served
- its purpose.
-
- The VARIABLE production rule has a trap for an unassigned-to
- identifier. The symptom of such an identifier is that it does not
- have the REALVAR attribute set in its symbol table entry. Note
- that this attribute is set when the identifier is assigned-to in
- the ASSIGN1 production semantics. So we complain about the
- problem, then set up the identifier with some reasonable values in
- case someone else notices the problem somewhere else in the code.
-
- Running CALCT
-
- You can try out the tree-builder calc as follows:
-
- 1. Delete the *.TBL, *.OBJ and *.EXE files in the CSKELS
- subdirectory. This will force "make" to rebuild them.
- 2. Run make with the option GRM=CALCT. (With the Turbo make
- file, you need to set GRM=CALCT in the makefile, since it won't
- take this as an external option.)
- 3. The result is MAIN.EXE. It should execute and perform in a
- manner similar to CALCC. You won't see any difference, unless you
- use the debugging tools to explore the stack and the trees rooted
- in the stack.
-
- Debugging Tree Programs
-
- We recommend using your compiler debugger if you have to debug a
- QTREES program. Set a breakpoint on one of the recursive
- evaluators, or on one of the switch case statements to see what's
- happening. If your debugger will allow you to call a function,
- call "print_tree" with a pointer to some tree root. It'll display
- what's in the tree.
-
- You can also get the "print_tree" display from the Qparser
- debugging line. You need to trap on a production rule apply
- operation with a debugging level or 2 or greater. When the trap
- occurs, select the "S(tack" option, then move up or down in the
- stack with the + or - key. You'll see an asterisk (*) next to the
- selected stack item. Then press T(ree to display the tree rooted
- in that position in the stack. You can move down into the tree by
- selecting one of the numbers shown at the end of the prompt -- "1"
- means choose the left-most nonterminal, "2", the one next to it,
- etc. ("1" corresponds to "position0[0]"). Move back up the tree
- by pressing "U". With enough "U" presses, you'll be back at the
-
-
- QPARSER+ User Manual page 41
-
-
- stack level.
-
- For more information, consult the help facility, which you can get
- at any time by pressing "?".
-
- Tree Walking Hints
-
- The decisions you need to make when designing a tree-walking
- evaluator are these:
-
- 1. On which production should the tree-building be interrupted
- so that you can apply a tree-walker? We recommend an interruption
- on each declaration -- which may result in declaring several
- identifiers.
- If you care about global optimizations within a function, you
- may want to build a tree for an entire function, so that you can
- do an intelligent register assignment, look for common
- subexpressions, etc. That also makes it easy to expand control
- statements like "while...do" or "for (){...}" by calling a
- statement tree-walker.
- In any case, we recommend building a tree for each assignment
- statement or expression. Usually (in a compiler) you need to walk
- through such a tree in order to establish the types, eliminate
- constant expressions, etc., before generating code segments.
-
- 2. What should the tree-walkers do? Again, much depends on
- your imagination. We've demonstrated a simple numeric evaluator.
- The extension of that to a code generator is fairly easy -- and in
- fact is one of the worked-out examples in the registered version
- of Qparser+.
- You will probably want a walker that evaluates constant
- subexpressions. Another one might assign types to all the nodes,
- based on casts and the implicit type rules of the language.
- Another one might reorganize parts of the tree around a set of
- rules that facilitates optimizations, for example, moving all the
- constants to a canonical location for commutative operations.
- Another one might perform register assignments. Yet another one
- will generate code for the control statements -- it will no doubt
- call an expression evaluator when it runs into expressions.
-
- 3. What cases must be tree-walker deal with? We've explained
- above that you need to trace a derivation through from one or more
- nonterminals. Start with the nonterminals that appear in the
- "apply" or other root calls to your evaluator. You need all the
- tags that can appear in a derivation chain stemming from those.
- If you choose to stop calling your tree-walker on some tag,
- that's OK -- it just won't penetrate that tree. If you do, you
- need to include the tags that come from that production rule's
- left member.
-
- Hints on Identifier Attributes
-
- We've pointed out that Qparser assigns every identifier to a
- symbol table slot upon its first appearance. The attribute tag is
- the "symt" field of the symtabtype struct, and is USER by default.
-
-
- QPARSER+ User Manual page 42
-
-
-
- The identifier assignment is performed with the "make_sym"
- function -- you can see where it happens by looking at function
- get_symbol in skellex.c. "make_sym" will accept an identifier
- already in the symbol table, even though it may be at a deeper
- scope than you wish. The theory is that the reference for the
- sake of consistency. may be to the higher-scope identifier, for
- example, a reference within a function to a global variable.
-
- If you discover through your tree-walker that the identifier is
- being declared, you need to pay attention to this:
-
- 1. If the identifier's attribute is USER, it means it has been
- seen for the first time in this context. You can safely set its
- attributes to something else appropriate to your declaration, for
- example, change its symt to REALVAR or something.
- 2. Or, if the identifier has some other attribute, its scope
- level will probably also be smaller than the current scope level
- "clevel". This means it was previously declared in a covering
- scope. If this is a new declaration, you MUST call "forcesym"
- with the symbol in order to push a second instance of the
- identifier into the symbol table, then set up its attributes
- appropriately.
- 3. At the end of a scope, you need to decide what has to be
- saved and what can be discarded. If the names are associated with
- function parameters, you usually need to save something in order
- to check the parameter types against subsequent function calls.
- If the names are truly local in scope, you can safely get rid of
- all of it.
- 4. Function "clearsym" will unlink all the symbol table entries
- at and below a given scope level for you. It will free the
- symtabtype struct and the "sym" string as well. However, it will
- NOT free any other space associated with the attribute, unless you
- explicitly require this. See file "syms.c", and
- "free_symtab_item" in particular. So you need to embellish this
- if you've attached other information to the symtabtype struct, and
- you want to avoid memory leaks.
-
- Releasing an AST
-
- Releasing the heap space occupied by an AST is easy -- just call
- "disp_sem" on the root node. Usually this is done within an
- "apply" case, but you can do this within a tree-walker as well if
- you like. Of course, you need to make sure that any pointers to
- the tree node, or any node within it, are neutralized by setting
- them to NULL. In particular, you should NOT set up some auxiliary
- data structures with pointers into a tree somewhere, then release
- the tree space. Learn to live with the AST and stack structures
- instead.
-
- Note that if you embellish semrectype with additional structures,
- you also need to extend "new_sem", "disp_sem", and "dump_sem".
- "dump_sem" requires some thought -- you probably won't want to
- display everything in a complicated node when displaying the
- stack. It's best to decide where to cut off the display, or how
-
-
- QPARSER+ User Manual page 43
-
-
- to condense it.
-
-
-
- QPARSER+ User Manual page 44
-
-
- Chapter 11. Miscellaneous Topics
-
- Skeleton File Semantics
-
- The skeleton files are divided into sections of code that are just
- copied to their target files, and other sections that get expanded
- by LR1P, according to information derived from the grammar. The
- sections to be expanded are covered by the special bracket codes
-
- {## ... ##}
-
- which are recognized by LR1P. Everything outside those brackets
- are simply copied to the target file by LR1P. Everything inside
- is interpreted as a Pascal-like source code generator language,
- with a precisely defined syntax.
-
- The registered version of Qparser has a complete explanation of
- the code generation language syntax, so that you can construct
- your own generator forms. It's based on a special variant of
- Pascal, and is quite powerful. You can figure out what most of
- the forms do by comparing skeleval.c with eval.c, and trying out
- some cases on your own.
-
- The more difficult forms are associated with the parser tables.
- They will be hard to decipher without some help.
-
- For most purposes, you won't need to touch these generator forms.
-
- Help Functions
-
- We pointed out that when your parser or translator is finally
- linked and ready for execution, you can invoke the debug functions
- by either using the option "-d1" or using character "!" in your
- input. When the debug line appears:
-
- [idebug] ?(help, N(ame, A(ll, D(bug level,
- P(rodTrap, T(race, L(ex, S(tack, ENTER?
-
- typing a question mark "?" should produce several screensful of
- help information about your environment. In this case, the help
- will be about this particular set of options. The first help
- report will give you some topics that you can enter by typing ?
- followed by the option name.
-
- Gramchk
-
- Program GRAMCHK.EXE hasn't been described yet. Like LR1, it
- accepts a grammar file name. Unlike LR1, it's just designed to
- give you lots of information about the grammar that would be hard
- to figure out otherwise.
-
- An accurate description of GRAMCHK's services can be obtained by
- running GRAMCHK with no parameters. You'll get a list of the
- options that it understands. To get more information on any one
- of the options, run it with the option tagged with a question
-
-
- QPARSER+ User Manual page 45
-
-
- mark, for example,
-
- gramchk -t?
-
- will generate a screen of information about the -t option.
-
-
-
- QPARSER+ User Manual page 46
-
-
- Chapter 12. Debugging a Translator
-
- Qparser should always produce a valid parser if your grammar can
- get through gramchk and lr1 without complaints, and if you have
- not changed any of the base CSKELS files.
-
- However, a parser alone isn't usually what you want. When you
- start adding semantics operations, particularly operations that
- call for leaving material in the stack and later using it, you may
- introduce an error that could cripple the system.
-
- Here are some tips on developing reliable Qparser programs:
-
- * You should compile with the DEBUG option defined. This will
- help you trace the stack and parsing operations while setting
- breakpoints and viewing program variables.
-
- * Keep DEBUG.C up to date so that the symtabtype and semrectype
- are always reported according to your latest revision. Similarly,
- check new_sem, new_sym, disp_sem and disp_sym for the correct
- disposition of the structs.
-
- * Check the terminal and nonterminal list by using option "-t"
- in LR1 or GRAMCHK. A common error is omitting the definition of a
- nonterminal as a production rule, causing it to appear to Qparser
- as a terminal. GRAMCHK will also warn you of production rules
- that seem to go nowhere or are not connected to the chain of rules
- in any way -- this is also a symptom of some grammar error.
-
- * Pay attention to the reports from LR1 and GRAMCHK. LR1 may
- complain about syntax errors -- which you should repair -- or
- state conflicts. A state conflict can be difficult to repair, and
- is beyond the scope of this little document to explain and help
- you to correct. LR1 will nevertheless generate a parser, but it
- may not behave as you expect. If GRAMCHK complains about a
- variable name that has been overloaded or some other problem, you
- should look into it, not just assume that everything will work
- anyway -- it won't.
-
- * If your compiler supports prototypes, and checks parameter
- types, you should enable that option and change the skeleton and
- other files so that they make use of this important type-checking
- feature. Then make sure your C files are type-clean.
-
- * If you've changed the internals of the lexical analyzer --
- file SKELLEX.C or LEX.C -- you should first test the lexical
- analyzer separately. That file can be compiled separately by
- compiling it with the macro TEST defined. It will become a
- program in its own right, designed to accept a file of characters,
- then report the tokens that it sees. Only after you've made sure
- that the lexical analyzer is working correctly should you attempt
- to move on to generating and testing a parser.
-
- * Develop your translator in these stages, making sure that each
- stage is correct before moving on to the next stage: a) lexical
-
-
- QPARSER+ User Manual page 47
-
-
- analyzer, if you've changed SKELLEX.C in some way, b) grammar --
- it should be complete, free of LR1 and GRAMCHK complaints and
- correctly represent your intentions, c) symbol attributes and
- correct scope rules, d) symbol declarations, e) control
- statements, f) function calls, g) executable statements.
-
- * Qparser should correctly call the apply function with the
- appropriate state number and that in turn will drop into the
- appropriate switch case number. If you aren't convinced of this,
- follow the stack traces with some simple input streams and make
- sure that that mechanism is performing correctly before adding any
- semantics code to "apply". It's all driven by tables in the code
- segment, which can still get corrupted by bugs in most systems.
-
- * Use the compiler debugger to plant a breakpoint on a
- particular case in "apply" to trap on one or another apply action.
- If your debugger can execute a function call from a breakpoint,
- execute "idebug", which will allow you to interactively examine
- the stack, symbol table variables, etc. It draws on the global
- stack information, and requires no parameters.
-
- * Make your program report what it's doing with extra printf
- statements covered by some test option, i.e. TEST1, TEST2, etc.
- Prepare some input files with situations that you are testing that
- will set up the symbol table and other environment so that you can
- track down the problem you are having. Insert "idebug()" calls
- freely where you suspect trouble; they will help you track them
- down.
-
- * Pay special attention to the use of the heap. In particular,
- please remember that Qparser allocates each of the following, and
- it's up to you to dispose of them at the appropriate time:
-
- a. One symtabtype type on the appearance of a fresh
- identifier. An identifier appearing more than once is associated
- with the same symtabtype. Call "free_symtab_item" to free a
- symtabtype. Call "clearsym" to remove a set of identifiers from
- the symbol table when falling out of some inner scope.
-
- b. One "char* sym" type within each symtabtype. This is
- freed with the symtabtype struct by free_symtab_type.
-
- c. One semrectype in the semantics stack "semstack" for each
- appearance of <identifier>, <real>, <integer>, <character> or
- <string> in the input stream. There is NO automatic disposal of
- these -- you must dispose of them at the appropriate time, usually
- after you have made use of their information in an "apply" call,
- just before its return. Call "disp_sem" to free any semrectype
- struct. It is also designed to free an entire tree rooted in a
- semrectype.
- d. One "char* strx" type on each appearance of a literal
- string in the input stream. This is associated with a
- "semrectype" and is normally freed through the "disp_sem" call.
-
- * Use functions "new_sem", "new_sym", "disp_sem" and "disp_sym"
-
-
- QPARSER+ User Manual page 48
-
-
- to allocate and free a semrectype or symtabtype. Keep them up to
- date with changes you make in those structs. This is much safer
- than "malloc" and "free", since new_sem and new_sym initialize
- critical struct members, and the "disp" functions handle NULL and
- embedded allocated material safely.
-
- If your compiler has a library function for checking the heap
- status, we urge that you enable it and pay attention to its
- complaints. The nature of this parsing system is such that you
- will be allocating heap space in one section of the program, but
- freeing it somewhere else, such that it won't be obvious that
- every allocated object is freed exactly once. We also recommend
- that you be fastidious about setting pointers to NULL before
- allocating space for them, and after freeing them, if they live on
- thereafter. You can then test a pointer for NULL before freeing
- it.
-
-
-
- QPARSER+ User Manual page 49
-
-
- Chapter 13. Registration and Support
-
- We urge that you register your Qparser+ software with QCAD Systems
- by filling out the order blank file form (file ORDRFORM.TXT) and
- returning it with the requested payment. Your registration will
- entitle you to the full version, which contains more worked-out
- examples and grammars, and skeletons in Pascal as well as C. It
- comes with a typeset 170-page user manual. You'll be entitled to
- limited free phone support to answer any questions you have or to
- help you track down problems with your translator.
-
- You'll also learn more about using abstract syntax trees, which
- are the recommended way to develop a translator product of any
- reasonable complexity.
-
- The Qparser binaries (LR1.EXE, OPT.EXE, LR1P.EXE, GRAMCHK.EXE) can
- be ported to any computer system with a K&R or ANSI C compiler by
- purchasing a license for the source code. A make file is provided
- for Unix system V, and the product has been tested on a wide
- variety of Unix platforms, as well as VMS and IBM platforms. For
- more details, please contact us below.
-
- If you need support, please write or FAX us at the following:
-
- QCAD Systems
- 1164 Hyde Avenue
- San Jose, CA 95129
- FAX/voice message: 408 727 6884
-
- Even if you don't want to register, let us know with a postcard or
- a one page FAX that you've acquired this shareware edition, and
- what you think of it. We'll be working on improvements over time,
- and are always interested in customer response to the product.
-
- Version Number
-
- By running one of the executables with no parameters, the version
- number is reported. Please give us the version number if you want
- to report a problem or a suggestion -- it's 4.8.3 as of this
- writing.
-
-
-
- QPARSER+ User Manual page 50
-
-
- Chapter 14. For Additional Reading
-
- The registered version includes a 170-page printed manual that
- describes all the Qparser internals, and an introduction to the
- underlying theory. There are more worked-out examples, and tips
- on compiler development.
-
- Dr. Barrett's compiler construction book "Compiler Construction
- -- Theory and Practice" was out of print at the time of writing
- this. A new edition should be available from MacMillan publishers
- in early 1994.
-
- Allen Holub, "Compiler Design in C", Prentice-Hall. We highly
- recommend this book for the professional language designer.
- Virtually all the theoretical material is provided in the form of
- tested C programs, which can be purchased separately in diskette
- form. It's well written and full of examples and example
- programs.
-
- Aho and Ullman, "The Theory of Parsing, Translation and
- Compiling", two volumes, Prentice-Hall. This is a classic
- reference volume on the theory. Those unfamiliar with the
- structure of proofs and abstract language notation will find it
- difficult to read.
-
- Kernighan and Ritchie, "The C Programming Language", second
- edition, Prentice-Hall. Contains a C grammar that is LR(1).
-
- Bjarne Stroustrup, "The C++ Programming Manual", second edition,
- Addison-Wesley. Contains a C++ grammar that is LR(1).
-