The World of Computer Software

home *** CD-ROM | disk | FTP | other *** search

/ The World of Computer Software / World_Of_Computer_Software-02-387-Vol-3of3.iso / p / pascal-.zip / pascal- / doc / Structure.fwi < prev next >

Wrap

Text File | 1993-01-05 | 47.1 KB | 1,222 lines

@=~ ~A~<A Pascal Subset~> The compiler described here accepts a subset of Pascal known as ``Pascal-''. It has enough features to illustrate the basic techniques for specifying a compiler. With the exception of coercions, the omitted Pascal features do not require any techniques beyond those described here. This chapter begins by describing the differences between Pascal- and Pascal, assuming that you are already familiar with Pascal. An informal description of the Pascal- vocabulary and a formal description of the structure of a Pascal- program complete the language overview. ~B~<General description~> Pascal- has only two simple types, Boolean and integer, and two structured types, array types and record types. Boolean and integer are predefined, with the names ~{Boolean~} and ~{integer~} respectively. Every structured type also has a name, which is introduced by a type definition: ~$~<Type definition example~>==~{ type table = array [1..100] of integer; stack = record contents: table; size: integer end; ~} A type definition always creates a new type; it can never rename an existing type. The following type definition is therefore incorrect: ~$~<Incorrect type definition~>==~{ type number = integer; ~} All variables must be defined and given a type: ~$~<Variable definition example~>==~{ var A: table; x, y: integer; ~} The type name must always be used in a variable declaration. In other words, it is not possible to create an ``anonymous'' type as in the following incorrect variable definition: ~$~<Incorrect variable definition~>==~{ var A: array [1..100] of integer; ~} All constants have simple types. Two predefined identifiers, ~{true~} and ~{false~}, represent the two possible values of type ~{Boolean~}. Values of type ~{integer~} are represented in the usual way. Names for constants can be introduced by constant definitions: ~$~<Constant definition example~>==~{ const max = 100; on = true; ~} Assignment statements, procedure calls, if statements, while statements and compound statements are available in Pascal-. A statement may be empty. There are no labels or goto statements, case statements, repeat statements, for statements or with statements. Procedure declarations can be nested, and a procedure may have variable and/or value parameters of any type. Procedures cannot be passed to other procedures as parameters, and functions cannot be defined. There are two standard procedures, ~{read(x)~} and ~{write(x)~}. ~{Read~} expects a single variable parameter of type integer, obtains the next integer from the standard input stream, and assigns the value of that integer to the variable parameter. ~{Write~} expects a single value parameter of type integer, and places the value of that integer onto the standard output stream as a complete line. Here is an algorithm that illustrates most of the features of Pascal-: ~$~<Complete example program~>==~{ Program ProgramExample; const n=100; type table=array[1..n] of integer; var A: table; i,x: integer; yes: Boolean; procedure search(value: integer; var found: Boolean; var index: integer); var limit: integer; begin index:=1; limit:=n; while index<limit do if A[index]=value then limit:=index else index:=index+1; found:=A[index]=value end; begin {input table} i:=1; while i<=n do begin read(A[i]); i:=i+1 end; {test search} read(x); while x<>0 do begin search(x,yes,i); write(x); if yes then write(i); read(x); end end. ~} ~B~<Vocabulary~> The vocabulary of a programming language is made up of ~/basic symbols~/ and ~/comments~/. There are three kinds of basic symbols: identifiers, denotations and delimiters. Identifiers are names that are freely chosen by the programmer to represent entities like constants, types, variables and procedures. Denotations represent specific values, according to conventions laid down by the language designer. Delimiters are used to give the program structure. In Pascal-, an identifier is called a ~{Name~}, and consists of a letter that may be followed by any sequence of letters and digits: ~$~<Name examples~>==~{ x LongName S100 CaseInsensitive CASEINSENSITIVE caseinsensitive ~} There is no distinction made between upper and lower case letters in Pascal-, so the three names on the second line of the example are indistinguishable from one another. A ~{Numeral~} is the only denotation in the vocabulary of Pascal-. It consists of a sequence of decimal digits, and denotes an integer value: ~$~<Numeral examples~>==~{ 1 123 ~} There are two kinds of delimiters in Pascal-, ~/word symbols~/ and ~/special symbols~/: ~$~<Word symbols~>==~{ and array begin const div do else end if mod not of or procedure program record then type var while ~} ~$~<Special symbols~>==~{ + - * < = > <= <> >= := ( ) [ ] , . : ; .. ~} A word symbol cannot be used as a ~{Name~} in a Pascal- program. A comment in Pascal- is an arbitrary sequence of characters enclosed in braces ~{{~} ~{}~}. Comments may extend over several lines and may be nested to arbitrary depth: ~$~<Comment examples~>==~{ if x>0 then {reserve resource} x:=x-1; {This is a {nested} comment that extends over two lines} ~} White space (spaces, tabs and new lines) and comments are called ~/separators~/. Any basic symbol may be preceded by one or more separators, and the program may be followed by zero or more separators. Word symbols, names and numerals are called ~/undelimited~/ basic symbols. Any pair of adjacent undelimited basic symbols must be separated by at least one separator: ~$~<Example of basic symbol separation~>==~{ {Incorrect} ifx>0thenx:=10divx-1; {Correct} if x>0 then{Can divide}x:=10 div x-1; ~} In this example, separators needed between ~{if~} and ~{x~}, ~{0~} and ~{then~}, ~{then~} and ~{x~}, ~{0~} and ~{div~}, ~{div~} and ~{x~}. The separators can be white space and/or comments, as indicated. ~B A complete program is called a ~/sentence~/ of the programming language in which it is written. The structure of that sentence is described in terms of its component ~/phrases~/. Compilation of a program is based on its phrase structure, as defined by a ~/grammar~/ for the programming language. The phrase structure must reflect both the way a program is written and the meaning to be conveyed by that program. It is convenient to divide a grammar into parts that describe coherent sets of phrases, and then to discuss each part individually. Here is a breakdown appropriate for Pascal-, with two rules describing the program structure given explicitly as examples of the notation: ~$~<Grammar~>==~{ Program: 'program' ProgramName ';' BlockBody '.' . BlockBody: ConstantDefinitionPart TypeDefinitionPart VariableDefinitionPart ProcedureDefinitions CompoundStatement . ~<Constant, type and variable definition grammar~> ~<Expression grammar~> ~<Statement grammar~> ~<Procedure grammar~> ~} The first grammar rule says that a sentence in Pascal- has two component phrases, a ~{ProgramName~} and a ~{BlockBody~}. Delimiters ~{program~}, ~{;~} and ~{.~} are used to separate these phrases in the program text as written. The ~{BlockBody~} phrase is the description of the algorithm, and consists of definitions and a statement. All of the definitions and the statement together constitute the meaning of the algorithm, while the name of the program does not affect the algorithm in any way. A program is written as the sequence of basic symbols ~{program~}, a name, ~{;~}, a sub-sequence describing the algorithm, and ~{.~}. Thus these two rules reflect both the way the program is written and the underlying meaning. The Pascal- grammar given here is taken almost verbatim from Brinch-Hansen's description in Section 2.4 of ``Brinch-Hansen on Pascal Compilers''. His grammar does not make some properties of the program clear, however, and these aspects have been changed. Each such change is discussed in the appropriate section below. Eli does not permit the extensions to BNF that Brinch-Hansen uses to describe optional and repeated phrases. The rules involving these extensions have been re-written to avoid the extensions. Finally, Eli uses ~{:~} instead of ~{=~} to separate the left and right hand sides of a rule, and uses ~{/~} instead of ~{|~} as the separator for alternative right-hand sides. These notational changes will not be discussed further. ~C Constant, type and variable definitions have similar forms in a Pascal- program: A word symbol describing the kind of definitions to follow, and a list of the definitions themselves. In each case, one or more names are defined. If a particular word symbol is missing, then there are no definitions of that kind. Notice that the grammar distinguishes between a definition of a name and a use of that name (e.g ~{ConstantNameDef~} and ~{ConstantNameUse~}). This distinction reflects the meaning of the program, rather than the way the program is written. The constant name ~{max~} is always written in the same way, but when it appears to the left of ~{=~} in a ~{ConstantDefinition~} that is a ~/defining occurrence~/ and when it appears to the right of ~{=~} in a ~{ConstantDefinition~} that is an ~/applied occurrence~/. All of the properties of a name are deduced from the context of its defining occurrence; those properties are made available at the applied occurrences. ~$~<Constant, type and variable definition grammar~>==~{ ConstantDefinitionPart: / 'const' ConstantDefinitions . ConstantDefinitions: ConstantDefinition / ConstantDefinitions ConstantDefinition. ConstantDefinition: ConstantNameDef '=' Constant ';' . Constant: Numeral / ConstantNameUse . TypeDefinitionPart: / 'type' TypeDefinitions . TypeDefinitions: TypeDefinition / TypeDefinitions TypeDefinition . TypeDefinition: TypeNameDef '=' NewType ';' . NewType: NewArrayType / NewRecordType . NewArrayType: 'array' '[' IndexRange ']' 'of' TypeNameUse . IndexRange: Constant '..' Constant . NewRecordType: 'record' FieldList 'end' . FieldList: RecordSection / FieldList ';' RecordSection . RecordSection: FieldNameDefList ':' TypeNameUse . FieldNameDefList: FieldNameDef / FieldNameDefList ',' FieldNameDef . VariableDefinitionPart: / 'var' VariableDefinitions . VariableDefinitions: VariableDefinition / VariableDefinitions VariableDefinition . VariableDefinition: VariableNameDefList ':' TypeNameUse ';' . VariableNameDefList: VariableNameDef / VariableNameDefList ',' VariableNameDef . ~} Brinch-Hansen does not distinguish between defining and applied occurrences of names in his grammar. This distinction is made in the text of Section 6.1 of his book, where scope rules are discussed. When one is specifying a language to a compiler generator like Eli, however, natural language descriptions must be formalized. ~C Like most programming languages, Pascal- defines precedence and association rules to eliminate the need for explicit parentheses. For example, the precedence rules ensure that the expression ~{a+b*c~} has the same meaning as ~{(a+(b*c))~}, ~/not~/ the same meaning as ~{((a+b)*c)~}. Similarly, the association rules ensure that ~{a-b-c~} has the same meaning as ~{((a-b)-c)~}, ~/not~/ the same meaning as ~{(a-(b-c))~}. These rules are embodied in the phrase structure of an expression. ~$~<Expression grammar~>==~{ Expression: SimpleExpression / SimpleExpression RelationalOperator SimpleExpression . RelationalOperator: '<' / '=' / '>' / '<=' / '<>' / '>=' . SimpleExpression: Term / SignOperator Term / SimpleExpression AddingOperator Term . SignOperator: '+' / '-' . AddingOperator: '+' / '-' / 'or' . Term: Factor / Term MultiplyingOperator Factor . MultiplyingOperator: '*' / 'div' / 'mod' / 'and' . Factor: Numeral / VariableAccess / '(' Expression ')' / NotOperator Factor . NotOperator: 'not' . VariableAccess: VariableNameUse / VariableAccess '[' Expression ']' / VariableAccess '.' FieldNameUse . ~} Precedence and association rules affect the way the program is written, determining the phrase structure of an expression. Once that phrase structure has been determined, however, precedence and association rules have no further effect on the meaning of the program. As far as the meaning of the program is concerned, it is sufficient to distinguish dyadic expressions, monadic expressions, and expressions without operators. Each operator or operand should therefore be a distinct phrase in the sentence. Brinch-Hansen's grammar does not associate the ~{not~} operator with a phrase, because it uses that operator literally in a rule: ~{Factor: 'not' Factor~}. Here the ~{not~} operator is made a distinct phrase by writing Brinch-Hansen's single rule as two: ~{Factor: NotOperator Factor .~} and ~{NotOperator: 'not' .~}. The first alternative for ~{Factor~} in Brinch-Hansen's grammar is ~{Constant~}. A ~{Constant~} is defined as being either a ~{Numeral~} or a ~{ConstantNameUse~} (see Section 2.3.1). But notice that a ~{VariableAccess~}, another alternative for ~{Factor~}, could be a ~{VariableNameUse~}. When presented with the ~{Factor~} ``~{x~}'', how can the compiler decide whether this name represents a constant or a variable? The distinction is irrelevant as far as the structure of the program is concerned, and the attempt to make it leads to an ambiguous grammar. Brinch-Hansen discusses this ambiguity in Section 5.5 of his book, and resolves it by altering the grammar to the one given here. ~C Statements in Pascal- are separated by semicolons, but the fact that a statement can also be empty means that semicolon can be considered to be a terminator. ~$~<Statement grammar~>==~{ Statement: AssignmentStatement / ProcedureStatement / IfStatement / WhileStatement / CompoundStatement / Empty . AssignmentStatement: VariableAccess ':=' Expression . ProcedureStatement: ProcedureNameUse ActualParameterList . ActualParameterList: / '(' ActualParameters ')' . ActualParameters: ActualParameter / ActualParameters ',' ActualParameter . ActualParameter: Expression . IfStatement: 'if' Expression 'then' Statement $'else' / 'if' Expression 'then' Statement 'else' Statement . WhileStatement: 'while' Expression 'do' Statement . CompoundStatement: 'begin' Statements 'end' . Statements: Statement / Statements ';' Statement . ~} Brinch-Hansen's grammar contains the rule ~{ActualParameter: Expression | VariableAccess .~}. This rule is ambiguous, because a ~{VariableAccess~} is one form of an ~{Expression~}. He is trying to distinguish a context in which a value parameter is required from that in which a variable parameter is required. These contexts can only be distinguished on the basis of information drawn from the definitions, and therefore do not affect the structure of the program. In Section 5.5 of his book, Brinch-Hansen alters the grammar to use the rule listed here. The sequence ~{$'else'~} is used in the definition of an ~{IfStatement~} to avoid another ambiguity: What is the meaning of the phrase ~{if a then if b then x:=1 else x:=2~}? If ~{a~} is ~{false~}, does ~{x~} remain unaltered or is it set to ~{2~}? The rules of Pascal require it to remain unaltered, because an else clause is associated with the closest if. Brinch-Hansen's grammar for Pascal- is ambiguous, and there is no mention of the ambiguity in his book. In the absence of specific guidance, the rules of Pascal are assumed here. This decision is made explicit in the grammar by the inclusion of the sequence ~{$'else'~}. It specifies that the alternative of which it is a part is only valid if ~/not~/ followed by the basic symbol ~{else~}. ~C Like a ~{Program~}, a ~{ProcedureDefinition~} has two constituent phrases. The ~{ProcedureBlock~} defines the algorithm abstracted by the procedure, and the ~{ProcedureNameDef~} defines the name by which that algorithm will be known. Formal parameters are a part of the abstraction, not a part of the name. Their names are local to the procedure body, whereas the procedure name is defined in the surrounding context. ~$~<Procedure grammar~>==~{ ProcedureDefinitions: / ProcedureDefinitions ProcedureDefinition . ProcedureDefinition: 'procedure' ProcedureNameDef ProcedureBlock ';' . ProcedureBlock: FormalParameterList ';' BlockBody . FormalParameterList: / '(' ParameterDefinitions ')' . ParameterDefinitions: ParameterDefinition / ParameterDefinitions ';' ParameterDefinition . ParameterDefinition: 'var' ParameterNameDefList ':' TypeNameUse / ParameterNameDefList ':' TypeNameUse . ParameterNameDefList: ParameterNameDef / ParameterNameDefList ',' ParameterNameDef . ~} ~B Here is a summary of the examples: ~$~<Examples~>~Z~{ ~<Name examples~> ~<Numeral examples~> ~<Comment examples~> ~<Word symbols~> ~<Special symbols~> ~<Example of basic symbol separation~> ~<Constant definition example~> ~<Type definition example~> ~<Incorrect type definition~> ~<Variable definition example~> ~<Incorrect variable definition~> ~<Complete example program~> ~} ~A~<Compiler Organization~> Eli embodies a large fraction of the knowledge required to build a compiler, providing an overall design that will satisfy the requirements of almost any translation problem. It assumes a particular decomposition of translation problems, supports the solution of the resulting subproblems, and combines those solutions into a complete program that carries out the desired translation. The compiler generated by Eli reads a file containing the source program, examining it character-by-character. Character sequences are recognized as basic symbols or discarded, and relationships among the basic symbols are used to build a tree that reflects the structure of the source program. Computations are then carried out over the tree, and the results of these computations output as the target program. Thus Eli assumes that the original translation problem is decomposed into the problems of determining which character sequences are basic symbols and which should be discarded, what tree structure should be imposed on sequences of basic symbols, what computations should be carried out over the resulting tree, and how the computed values should be encoded and output. There is considerable variability in the specific decompositions for particular translation problems. For example, operator overloading is very simple in Pascal-. Only the relational operators are overloaded, and they can be applied only to Boolean or integer operands. Because of this simplicity, there is no need to treat overload resolution as a separate subproblem of the Pascal- translation problem. This chapter describes the decomposition of the Pascal- translation problem and gives the general framework within which the subproblems are specified. The complete specifications appear in subsequent chapters. ~B~<Subproblems for Pascal-~> ~/Lexical analysis~/ is the process that examines the source program text, recognizing basic symbols and discarding separators. Basic symbols are classified by an integer called a ~/syntax code~/. Each delimiter has a unique syntax code, and there is one syntax code for all names and another for all numerals. A name or numeral is characterized by a single integer value called an ~/intrinsic attribute~/ in addition to the syntax code. No intrinsic attribute is computed for a delimiter. As it recognizes each basic symbol, the lexical analyzer passes the information characterizing that basic symbol to the ~/syntax analyzer~/. ~/Syntax analysis~/ determines the phrase structure of the source program and builds a tree to represent it. Each phrase results in a tree node. If a phrase has component phrases, the nodes representing those component phrases are the children of the node representing the composite phrase. Identifiers and denotations are phrases that have no components, and therefore represent leaves of the tree. Every other phrase corresponds to a rule of the grammar. Rules like ~{NotOperator: 'not'.~} correspond to phrases with no components, because literals are not considered to be phrases. The phrases corresponding to these rules will therefore result in leaves. Rules like ~{AssignmentStatement: VariableAccess ':=' Expression.~} correspond to phrases with components. The phrases corresponding to these rules will therefore result in tree nodes that have children. (In the case of ~{AssignmentStatement: VariableAccess ':=' Expression.~} there are two components, and hence two children, because names are considered to be phrases and literals are not.) The syntax analyzer uses the syntax codes provided by the lexical analyzer to define the basic symbols. When it builds a leaf corresponding to an identifier or a denotation, it stores the value of the intrinsic attribute supplied by the lexical analyzer into that node. Once the tree is built, the ~/attribute evaluator~/ establishes a number of values associated with the tree nodes. For example, each node representing an expression has a value associated with it that specifies the type of result yielded by that expression. Consider a Pascal- expression consisting of an operator and two operands. The type specification for the result in this case is determined by the expression's operator. The type specifications for the two operands are used to verify that the operands are legal for the expression's operator. Attribute evaluation is the heart of any compiler, where most of the effort and most of the complexity resides. For this reason, specifications of attribute evaluation are usually partitioned by function to make them easier to understand. The major functions of attribute evaluation in Pascal- are scope analysis (determining the relationships between name uses and definitions), type analysis (verifying the context conditions on expressions) and code generation (producing a target program to implement the source code). Each of these functions is treated in a separate chapter. One of the values established by the attribute evaluator is a tree that represents the target code. The final step in the Pascal- compilation is to linearize and output this tree. That process is carried out by a ~/program text generator~/, invoked on the value produced during attribute evaluation. Some information is associated with specific contexts in the tree, while other information is associated with specific entities in the program. For example, the type of result yielded by an expression is a property of that expression, and is associated with the node representing the expression in the tree. The type of value yielded by a variable, however, is a property of that variable. Since a particular variable could be used many places in the tree, the information about its type must be available globally. The generated compiler includes a ~/definition table~/, in which arbitrary ~/properties~/ can be stored and accessed via ~/keys~/. To make the information about a variable's type available globally, the compiler associates a key with each variable and stores the type information in the definition via that key. The key is therefore a unique internal representation of the variable, and is stored at every use of that variable. Because the key is available at each use, the type information can be accessed at each use. ~B~<How a compiler is specified~> Eli focuses the user's attention on the requirements and design decisions that are unique to the Pascal- compiler, having already provided solutions to problems common to all compilers. It accepts descriptions of those requirements and design decisions, and combines them with its understanding of compiler construction problems to produce the Pascal- compiler. Requirements and design decisions can be thought of as specifications of instances of subproblems of the Pascal- translation problem. There are three basic ways in which a user might specify an instance of a subproblem: by analogy (``this problem is the same as problem X''), by description (``this is a problem of class Y, and is characterized as follows...'') and by solution(``here is a program to solve this problem''). To support specification by analogy, Eli provides a library of solutions to common compiler subproblems. A user must have sufficient understanding of these subproblems to recognize that they have been solved before and to find the solutions in the library. For example, a Pascal- name is defined in a single procedure. If the same name is defined in a nested procedure, the outer definition is ``hidden'' within the inner procedure. This behavior is common to many programming languages, and a solution for the problem of associating the definition of a name with each use of that name can be solved by analogy by instantiating a module from Eli's library: ~$~<Instantiate a module to analyze nested definitions~>==~{ $/Tool/lib/Name/Nest.gnrc :inst ~} To support specification by description, Eli provides a set of notations for characterizing common compiler subproblems. A user must not only be able to recognize the kind of subproblem, but also understand the notation used to characterize problems of that kind. The grammar of Section 2.3 is specified in such a notation -- a notation used to describe the phrase structure of a language and thus characterize the syntax analysis subproblem for that language. To support specification by solution, Eli accepts arbitrary C code that solves a unique compiler subproblem, and incorporates it into the generated translator. A user must not only be able to recognize such subproblems, but also be able to solve them completely. An example of a subproblem of the Pascal- compiler that is specified by solution is that of scanning a comment (Section 4.3.2). Specifications by analogy and description are two different forms of re-use: A specification by analogy re-uses a particular solution, while a specification by description re-uses a problem-solving method. In each case, re-use simplifies the specification by allowing much to be omitted. Eli provides leverage in direct proportion to the set of compiler subproblems for which it embodies solutions and problem-solving methods. In addition to this kind of leverage, however, Eli supports a paradigm for structuring the specification known as ~/literate programming~/. The literate programming paradigm suggests that code (or in this case specifications) may need to be organized in one way to support human understanding and in another to support implementation. To solve this dilemma, the user creates a ~/document~/ whose organization supports human understanding. Information describing the organization that supports implementation is embedded in this document, which can be processed to yield a printed version for human use or a machine-readable version for implementation. In its simplest form, the literate programming paradigm allows a user to group together specifications that are related but have different forms. For example, some subproblems of lexical analysis are specified by analogy, others by description, and still others by solution. Literate programming groups all of these specifications together into one file, organizing them so that their tasks and relationships are clear. The document you are reading is an example of a more sophisticated use of literate programming: Not only are the various specifications that define subproblems of the Pascal- compilation problem organized according to function, but they are interspersed with explanations and commentary. ~B Eli requires that a complete translation problem be described by a single file of type ~{specs~}. A type-~{specs~} file lists all of the specifications for the subproblems of that translation problem. Subproblems specified by analogy are represented by library references (which might be to the Eli library or to some other library). Subproblems specified either by description or by solution are represented by file references. Here is content of ~{pascal-.specs~}, the file that describes the Pascal- translation problem: ~$~<The Pascal- compiler~>~Z==~{ /* Lexical and Syntax Analysis */ Structure.fw pident.gla :kwd /* Exclude keywords from the finite-state machine */ /* Scope Analysis */ Scope.fw ~<Instantiate a module to analyze nested definitions~> $/Tool/lib/Name/NoKeyMsg.gnrc :inst $/Tool/lib/Name/Unique.gnrc :inst /* Type Analysis */ Type.fw $/Tool/lib/Name/Field.gnrc :inst /* A Pascal Computer */ Computer.fw $/Tool/lib/Tech/LeafPtg.gnrc :inst /* Code Generation */ Code.fw ~} Five of the specifications are type-~{fw~} files, which are documents that follow the literate programming paradigm. ~{Structure.fw~} contains Chapters 1-5; the other documents contain one chapter each. Each of Chapters 4-9 ends with a section that describes the purpose and contents of the specification files generated from that chapter. Lines that begin with ~{$/Tool/lib/~} extract modules from the Eli library, specifying subproblems by analogy. Each of these library modules is described at the appropriate point in this report. The line ~{pident.gla :kwd~} is an instruction to Eli to produce a lexical analyzer that behaves in a certain way; it is discussed in Section 4.4. To create a Pascal- compiler from the set of specifications defined by ~{pascal-.specs~}, one or more of the following requests should be made to Eli: ~$~<Requests creating a Pascal- compiler~>~Z~{ pascal-.specs +fold +arg=(input) :stdout pascal-.specs +fold :exe >pascal.exe pascal-.specs +fold :source >src ~} The first request asks Eli to create the compiler and run it with the command-line argument ~{input~}. Its purpose is to test the generated compiler against a sample Pascal- program. The second request also asks Eli to create the compiler, but to write the executable version into file ~{pascal.exe~} in the current directory. Its purpose is to create an executable version that can be used independently of Eli itself. The third request writes into directory ~{src~} (which must exist before the request is made) all of the files necessary to create an executable version of the compiler. Its purpose is to create a source code version that can be used independently of both Eli and the computer on which Eli is currently running. Directory ~{src~} will contain C files, header files, a Makefile, and a Configure script that can tailor the other files for specific machine/operating system combinations. All of the requests specify the parameter ~{+fold~}, which causes Eli to generate a lexical analyzer that is case-insensitive. The meaning of this parameter is further elaborated in Section 4.4. ~B~<Errors and Failures~> The compiler must report any errors in the source program. Lexical errors (character sequences that are neither separators nor basic symbols) and syntactic errors (sequences of basic symbols that cannot be built into phrases) are detected automatically by the generated lexical and syntactic analyzers. Contextual errors (such as a Boolean operand for an addition operator) must be detected by tests specified as part of the attribution. When one of these tests detects an error, it must invoke the error reporting module with information about the nature of the error and its location. The location is determined from coordinate information (source program line and character position) stored in the node of the tree at which the computation is being made. Since the coordinate information is always supplied in exactly the same way, it is convenient to define a macro to simplify the error reporting: ~O~<report.head~>~{ #define Message(s,t) message(s,t,0,COORDREF) /* Produce a report at the leftmost symbol of the current fragment * On entry- * s=severity (INFO, WARNING, FATAL, DEADLY) * t=report text ***/ ~} A type-~{head~} file contains C pre-processor directives that should be supplied to the generated attribution module. All type-~{head~} files are collected, in some arbitrary order, into a single file called ~{HEAD.h~}. The generated attribution module contains the C pre-processor directive ~{#include "HEAD.h"~}. ~A~<Lexical Analysis~> The purpose of lexical analysis is to determine the sequence of basic symbols making up the source program. Vocabulary errors, such as invalid characters, missing separators and unterminated comments, are detected and reported during lexical analysis. An Eli-generated compiler produces a sequence of basic symbols even if vocabulary errors are found. That sequence contains only basic symbols that are correct according to the language definition, and mirrors the program as closely as possible given the particular errors. The lexical analyzer scans the input text character-by-character, recognizing basic symbols and discarding separators. It must determine a syntax code for each basic symbol and an intrinsic attribute for each ~{Name~} and ~{Numeral~}. ~B~<Source text~> Eli supplies a module for reading the source text and making it available to the remainder of the system. No specification is required for this module. It will be extracted from the Eli library if it is needed, unless the user supplies a module that defines its operations. The operation ~{initBuf~} initializes the source text module. It has two arguments, the character form of the name of an open file and the descriptor for that file. On return from ~{initBuf~}, ~{TokenEnd~} addresses the first character of the file. If the file is empty, ~{TokenEnd~} addresses a null character (ASCII NUL, with value 0). Otherwise ~{initBuf~} guarantees that the string addressed by ~{TokenEnd~} consists of at least one complete line, including its terminating newline character. If the character following a terminating newline is not the null character then the following line, including its terminating newline character, is also guaranteed to be a part of the string. When the client of the source module has dealt completely with the line or lines of the input file that are in memory, it invokes the source module operation ~{refillBuf~}. The ~{refillBuf~} operation has one argument, a pointer to the null character terminating the current string. Upon return from ~{refillBuf~}, the conditions described in the previous paragraph hold (i.e. ~{refillBuf~} has the same exit condition as ~{initBuf~}). The lexical analyzer imposes a coordinate system on the text, with each coordinate consisting of a line number and a column number within the line. All components of the lexical analyzer that move across line boundaries must maintain these coordinates. They are implemented by two variables exported by the source text module but maintained by its clients: ~{LineNum~} and ~{StartLine~}. ~{LineNum~} contains the integer line number, while ~{StartLine~} addresses the character position preceding the first character of the line. Thus the column number can be computed by taking the difference between the pointer to the current character and ~{StartLine~}. Horizontal tab characters, which occupy only one position in the string but possibly many in the source line are handled by decrementing ~{StartLine~} appropriately. ~B~<Intermediate code~> The syntax code associated with each basic symbol is relevant only for the interaction between the lexical and syntactic analyzers. It is supplied by Eli, and its value is of no concern. There is no need to specify it. Each distinct Pascal- ~{Name~} must be given a unique intrinsic attribute value, and instances of the same name must be given the same value. We will want to use this value to associate definitions and uses, so it must be compatible with the module we use for that process. Eli provides a module that can associate definitions with uses, and that module expects each name to be represented by a unique, small integer. A Pascal- ~{Numeral~} denotes an integer value, and this value will be placed into the target machine code as an integer. It is therefore reasonable to make the intrinsic attribute of a ~{Numeral~} the integer value it denotes. ~B The delimiters of Pascal- are specified by literals in the grammar. Eli will extract those literals from the grammar, so they need not be re-specified here. Pascal- has two non-literal basic symbols, ~{Name~} and ~{Numeral~}. These must be specified, as must the form of a comment. Because there are only a few basic symbol formats used in most programming languages, Eli provides a library of ~/canned descriptions~/. These canned descriptions provide not only the format of the basic symbol, but also the processors needed to establish the value of an appropriate intrinsic attribute. They are another example of specifying a subproblem (in this case scanning a name) by analogy. The Pascal- definition of a name is the same as the Pascal definition of an identifier, so the specification of a ~{Name~} can be stated in terms of a canned description. Numerals and comments are somewhat more complex, and their specifications are discussed below. ~$~<Scanning~>==~{ Name: PASCAL_IDENTIFIER ~<Scan a numeral~> ~<Scan a comment~> ~} This specification defines the basic symbol ~{Name~}, which appears to the left of a colon (~{:~}), by means of the canned description ~{PASCAL_IDENTIFIER~}. ~{PASCAL_IDENTIFIER~} describes a character sequence that begins with a letter and continues with a (possibly empty) sequence of letters and digits. Once this character string has been recognized, the processor ~{mkidn~} will be used to provide it with an intrinsic attribute. (The properties of canned descriptions like ~{PASCAL_IDENTIFIER~} are given in the Eli documentation dealing with lexical analysis.) ~C A Pascal- ~{Numeral~} is defined as a sequence of digits. If the generated scanner simply accepts a sequence of digits as a numeral, however, it will decompose the incorrect phrase ~{10div 3~} into three basic symbols: ~{10~}, ~{div~} and ~{3~}. Although this is probably the programmer's intent, it is a violation of the language definition. The generated scanner must therefore accept a sequence of digits and then verify that the next character is ~/not~/ a letter. If a letter is found, then the scanner must report a ``missing separator'' error. Finally, a processor must be invoked to compute the value represented by the sequence of digits and deliver that value as an intrinsic attribute. Here is the resulting specification: ~$~<Scan a numeral~>==~{ Numeral: $[0-9]+ (CheckSep) [mkint] ~} This specification defines the basic symbol ~{Numeral~}, which appears to the left of a colon (~{:~}), by means of the regular expression ~{[0-9]+~} that accepts a sequence of one or more decimal digits. When the generated scanner encounters a character that is not a decimal digit, it will invoke the auxiliary scanner ~{CheckSep~}, passing a pointer to the first digit of the sequence and an integer giving the length of the sequence. Upon return from ~{CheckSep~}, the processor ~{mkint~} is invoked. It computes the value of the digit sequence and delivers that value as an intrinsic attribute. The processor ~{mkint~} is a library routine, but the auxiliary scanner ~{CheckSep~} must be defined specially for this task. ~{CheckSep~} is a C routine that uses a table to check the character following the digit sequence. It assumes the ISO standard character set: ~$~<Check for a separator~>==~{ static char Letter[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 }; char * CheckSep(start, length) char * start; /* start of characters already recognized */ int length; /* length of what was already recognized */ { POSITION Now; if (Letter[start[length]]){ Now = curpos; Now.col += length; message(FATAL, "Missing separator", 0, &Now); } return start+length; /* First unprocessed character */ } ~} Verifying the separator that must follow a ~{Numeral~} is an example of a subproblem specified by solution. It is solved by the C routine ~{CheckSep~}, which obeys the standard interface given in the Eli documentation for auxiliary scanners: accept a pointer to the start of the basic symbol and the number of characters already recognized, and return a pointer to the first character following the basic symbol. ~C Comments can be nested in Pascal-, and all comments are delimited by braces. They may span multiple lines. This behavior does not correspond to any canned description, so the comment scanning subproblem must be specified by solution: ~$~<Scan a comment~>==~{ $"{" (Comment) ~} The auxiliary scanner ~{Comment~} begins with the character following the opening brace that was recognized by the specified regular expression, and seeks the matching closing brace. It invokes itself recursively for a nested comment, handles transitions between lines, and detects an unclosed comment: ~$~<Comment scanner~>==~{ char * Comment(start, length) char * start; /* start of characters recognized by reg expr */ int length; /* length of what was recognized in reg expr */ { register char c; register char *p = start + length; /* first char not yet processed */ for (;;) { while( (c = *p++) && c != '}' && c != '\n' && c != '{' ) ; if(c == '}') return(p); else if (c == '\n'){ LineNum++; StartLine = p-1; } else if (c == '{') p = Comment(start, p-start); else /*if (c == '\0')*/ { /* end of the current input buffer */ refillBuf(p-1); p = TokenEnd; StartLine = p-1; if (*p == '\0'){ message(FATAL, "file ends in comment", 0, &curpos); return(p); } } } } ~} ~{Comment~} simply advances ~{p~} until it reaches a ``significant'' character. If that character signals a new line, the line number is incremented; if it marks the end of the input, additional input is requested from the source module (Section 4.1). The value returned by ~{Comment~} is a pointer to the first character position beyond the end of the comment, as required by the standard auxiliary scanner interface. ~B~<Identifier table~> The processor ~{mkidn~} assigns a unique integer for each distinct name. It uses a hash table to check whether the name has been seen before. It stores the string form of each name once in an array of strings, returning the index of the stored string. That index is then made the value of the intrinsic attribute. Pascal- requires that letter case not be distinguished in identifiers and keywords. Two additional specifications must be provided to obtain this behavior: ~{mkidn~} must be specified case-insensitive and the keywords must be recognized as though they were identifiers. The ~{+fold~} parameter of an Eli request specifies that ~{mkidn~} is to ignore letter case. Keywords appear as literals in the grammar. Normally, a grammar literal is built into the implementation of the lexical analyzer. To recognize certain literals using the mechanism used to recognize identifiers requires that those literals be removed from the set of literals passed to the lexical analyzer generator. The file ~{keywords.gla~} defines the lexical structure of the Pascal- keywords: ~$~<keywords.gla~>~Z==~{ Name: PASCAL_IDENTIFIER ~} By adding the line ~{keywords.gla :kwd~} to the type-~{specs~} file that defines the Pascal- compiler (Section 3.3), we tell Eli to remove all literals having the structure defined by ~{keywords.gla~} from the set passed to the lexical analyzer generator. It also adds those literals, with their corresponding syntax codes, to the table used by ~{mkidn~}. ~B~<Specification files for lexical analysis~> Three kinds of specifications are needed to define the Pascal- lexical analysis problem to Eli. ~C A type-~{gla~} file describes the lexical structure of the comments and the basic symbols that do not appear literally in the grammar. Eli uses this information, in conjunction with the descriptions of the literal basic symbols derived from the grammar, to construct the scanner. ~O~<lexical.gla~>~{ ~<Scanning~> ~} ~C A type-~{c~} file describes a problem by giving its solution. The problems of checking for a missing separator following a digit string and scanning a Pascal- comment were described in this way. ~O~<lexical.c~>==~{ #include "err.h" #include "source.h" ~<Check for a separator~> ~<Comment scanner~> ~} No interface specification is required for this module, because all auxiliary scanners and processors obey fixed interface conventions. The scanner generator can therefore provide the appropriate ~{extern~} declarations, given the routine names. ~A~<Syntax Analysis~> The purpose of syntax analysis is to determine the phrase structure of the source program, and build a tree with one node per phrase. Structural errors, such as unbalanced parentheses, are detected and reported during syntactic analysis. An Eli-generated compiler constructs a tree even if structural errors are found. That tree's structure is correct according to the language definition, and mirrors the source program as closely as possible given the particular errors. The source language grammar specifies the phrase structure of any source program, and serves as the basic specification for syntax analysis. It may be necessary to modify the grammar to ensure that it is complete and unambiguous. To make this guarantee, it may be necessary to distinguish some phrases that have identical meanings and should correspond to identical nodes in the tree. These two points, and the specifications resulting from them, are the subject of this chapter. ~B~<Parser construction~> Eli generates a particular kind of syntax analyzer, called an ~/LALR(1) parser~/, from the grammar. This is possible only if the grammar satisfies certain conditions: At every point during a left-to-right scan of the source program's basic symbols, the parser must be able to tell whether or not it is at the end of a phrase and, if so, ~/which~/ phrase it has just completed. The Pascal- grammar as defined in Section 2.3 satisfies the LALR(1) condition, but it is not compatible with the definition of the vocabulary given in Section 2.2. This section corrects the incompatibilities. ~C The Pascal- grammar distinguishes several different kinds of names, such as ~{ConstantNameDef~}, ~{ConstantNameUse~} and ~{VariableNameUse~}. Symbols like these are not defined in the grammar, and they are not specified as basic symbols either. A human reader has no difficulty in assuming that they are, in fact, names; this correspondence must be made explicit if the grammar is to specify the parsing process. Each of these symbols (except ~{ProgramName~}) defines a specific context that determines the meaning of the name. These contexts must be represented by distinct nodes in the tree, because of their distinct meanings: ~$~<Distinguishing name contexts~>==~{ ProgramName: Name . ConstantNameDef: Name . ConstantNameUse: Name . TypeNameDef: Name . TypeNameUse: Name . FieldNameDef: Name . FieldNameUse: Name . VariableNameDef: Name . VariableNameUse: Name . ProcedureNameDef: Name . ProcedureNameUse: Name . ParameterNameDef: Name . ~} ~C The symbol ~{Empty~} is used in the grammar to emphasize that an empty statement is allowed, but this symbol is never defined. There is no need to define it, because any meaning attributed to the empty statement would be associated with the tree node representing that statement. Therefore the symbol ~{Empty~} should be replaced in the grammar by nothing at all: ~$~<Deleting the symbol ``Empty''~>==~{ #define Empty ~} Grammars are examined by the C pre-processor, as are most Eli specifications. Thus pre-processor directives an C-style comments are allowed in grammars, as they are in virtually every Eli specification. ~B~<Specification files for syntax analysis~> A type-~{con~} file specifies the grammar for the source language. This information is used to construct the parser and define some of the basic symbols. ~O~<syntax.con~>~{ ~<Deleting the symbol ``Empty''~> ~<Grammar~> ~<Distinguishing name contexts~> ~}