home *** CD-ROM | disk | FTP | other *** search
-
-
-
-
-
-
-
-
-
-
-
- REC - A Regular Expression Compiler
-
-
-
- Gerardo Cisneros
-
- Escuela Superior de Ingenieria Mecanica y Electrica
- Instituto Politecnico Nacional
-
-
-
- and
-
-
-
- Harold V. McIntosh
-
- Departamento de Aplicacion de Microcomputadoras
- Instituto de Ciencias
- Universidad Autonoma de Puebla
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Copyright (c) 1985
- Gerardo Cisneros
- H. V. McIntosh
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- PREFACE
-
- With dropping prices of microcomputers, programming is
- becoming accessible to more people, and software is assuming an
- increasing importance in the total balance of the operation of a
- computer. This has originated much controversy about an author's
- rights to his software, ranging from the idea that the world can be
- held to ransom for a brilliant idea to the suggestion that there
- should be no impediment to any copying whatsoever.
-
- The research reported in this document has been supported by
- public funds (Mexican, Northamerican, Swedish) in a number of
- institutions (Universities, Atomic Energy Commissions, Polytechnic
- Institutes) over a decade or two. Its result has especially been used
- in programming and applied mathematics courses in an attempt to get
- people to learn it and to use it. Accordingly it might be held to
- belong in the public domain.
-
- Most people read scientific literature, as well as any other
- literature, to establish a background. Should they encounter
- something particularly useful, they are expected to acknowledge its
- source, because the resulting fame more than anything is the author's
- compensation for his work. In those rare instances where a resounding
- commercial success has been the sole or a substantial consequence of
- the largely unaltered use of an author's work, it is to be expected
- that he, his associates, and his employer will expect to share in the
- benefits. It is curious that there does not seem to be a
- corresponding willingness to share in the consequences of a bungling
- or even bona fide misapplication of the work or the attendant
- catastrophe, should something go wrong.
-
- Gainesville, Florida, U.S.A.
- Puebla, Puebla, Mexico.
-
-
-
- Introduction
-
- REC (Regular Expression Compiler) is the end product of an
- effort begun many years ago at the University of Florida to find an
- aesthetic structure to take the place of the "program feature" in
- McCarthy's LISP. "Operator Predicates" were the solution evolved at
- that time. Later, when they were taken completely out of the context
- of LISP and implemented on a PDP-8 computer at the National
- Autonomous University of Mexico in 1966, the programming language REC
- arose.
-
- For more than a decade thereafter it was used in programming,
- numerical analysis and mathematical logic courses at the National
- Polytechnic Institute, also in Mexico City, acquiring a number of
- versions and giving many people considerable experience in its
- presentation and usage. REC compilers developed at the Polytechnic
- included versions suitable for matrix manipulations, text editing and
- numerical integration of second order differential equations; REC was
- also used there to program hand calculations on the Friden 132 desk
- calculator, one of the first machines to offer an operand stack. One
- version was used extensively in the administration and data
- processing of the Mexican Nuclear Energy Institute, in conjunction
- with the monitor system of the PDP-10.
-
- In the last five years, and with the advent of
- microcomputers, REC has matured at the Autonomous University of
- Puebla into a very powerful symbolic manipulation language with some
- arithmetic capabilities, which has proved to be very well suited to
- the construction of compilers and operating systems.
-
- REC possesses an extremely simple control structure
- consisting of four symbols: the two parentheses, the colon, and the
- semicolon. Besides these there may be any number of predicates and
- operators (these being a special case of predicates), which are
- responsible for all the work to be done. This represents an attempt
- to reduce computer programming to its barest essentials, both for
- pedagogical reasons, and in the expectation that it will be a
- manageable structure when it is cast in its simplest terms. The
- resulting language is very concise, not unlike APL or TECO.
-
- As the allusion to regular expressions implies, there is also
- the expectation that the language is intimately related to the theory
- of computations. In effect, REC derives its appeal from the fact that
- computers can be regarded reasonably well as Turing machines with
- very elaborate built-in shortcuts to eliminate the grotesque
- inefficiency of manipulating individual bits on a single linear tape.
- A Turing machine itself consists of a finite-state machine acting as
- the control of a tape memory; finite state machines in turn are
- conveniently described by regular expressions. The REC notation
- simply allows one to write regular expressions in a manner somewhat
- more amenable to programming the Turing machine which they control.
- If one does not wish to think so strictly in terms of Turing
- machines, REC expressions still provide a means of defining the flow
- of control in a program which is quite convenient for many
- applications.
-
- REC defines only a control structure. It takes some getting
- used to, but coincides well with contemporary ideas about "structured
- programming." Some particularly useful REC languages come into being
- when specific sets of operators and predicates are chosen. One
- selection gives the symbol-manipulation language REC/MARKOV. People
- who are familiar with DEC's PDP-10 will note a strong resemblance to
- TECO, although REC actually contemporizes with or even antedates
- TECO. One of the most important differences between these two
- languages is that TECO lacks REC's orderly control structure.
-
- The REC/MARKOV processor is remarkably compact, fitting into
- about 5Kbytes of memory in many machines (7.5Kbytes when floating
- point arithmetic operations are included), exclusive of working
- storage. In the course of this book we shall analyze the operation of
- the microcomputer implementation of REC/MARKOV mentioned earlier; we
- shall call it simply REC throughout the remainder of the book. Our
- principal objective will be to introduce a symbol manipulation
- programming language sufficiently powerful that it can handle the
- compilation of REC languages in a wide variety of different
- computers. At the same time one will have a language useful for text
- editing, assembly including macro assembly and cross assemblers,
- compiling, and even file maintenance and data analysis. To do all
- this will also entail some discussion of programming languages and
- computer architecture, but not to the extent that no previous
- familiarity with such things is needed.
-
- Chapter 1. The Structure of REC
-
- In order to define a REC expression seven characters will be
- used as metasymbols, that is, symbols which would not appear
- explicitly in an actual instance of the object being defined and
- whose significance as metasymbols is limited to the definition in
- which they are used. These characters are [, ], E, P, |, * and =.
- Using the language of regular expressions and establishing the
- convention that concatenation is implicit, that alternatives are
- separated by |, that repetition - even zero times - is denoted by *,
- that = means "is defined as", and using square brackets rather than
- parentheses as symbols of aggrupation, a REC expression E is defined
- recursively by
-
- E = ( [[P | E]*[: | ;]*]* [P | E]* )
-
- where the enclosing parentheses are essential and P denotes a
- predicate. In more common terms, a REC expression is a string
- enclosed by balanced parentheses, which consists of a sequence of
- operators, predicates, or further REC expressions separated by
- punctuation consisting of one more colons or semicolons.
-
- Operators and predicates carry out operations, not all of
- them necessarily primitive. They may be coded either in machine
- language or consist of subroutines themselves defined in REC.
- Predicates differ from operators by acquiring either of two truth
- values as a consequence of their operation; operators always have the
- value TRUE. The difference between operators and predicates makes
- itself felt in the way that predicates can affect the sequence of
- operations which is chosen to execute a program. A REC expression
- also acquires a truth value as a result of its execution and may be
- thus regarded as a composite predicate.
-
- A REC expression is executed by reading it from left to right
- in sequence. Operators or predicates are performed in the order in
- which they occur, except that jumps occur after predicates if their
- truth value turns out to be FALSE. On encountering a right
- parenthesis, the expression terminates with the value FALSE, but if a
- semicolon is encountered, termination results with the value TRUE.
- Colons mean repetition from the opening left parenthesis. Finally we
- have to say what kinds of jumps are caused by predicates whose truth
- value is FALSE. If there is a colon or semicolon to the right of the
- predicate and on the same parenthesis level, the jump is to the next
- position beyond. If there is none, the jump is beyond the terminating
- right parenthesis, resulting in the value TRUE just as if a semicolon
- had been encountered. It should be mentioned that parentheses within
- parentheses are treated as a single expression, so that all jumps
- originating inside a given level of parentheses refer to locations
- only on the same level and never to locations within subexpressions.
-
- The following figure illustrates the flow of control in REC,
- assuming the only colons, semicolons and parentheses are those
- explicitly shown, and P is any predicate.
-
-
- ( (P ) : ( P ; ) : ; )
-
-
- The structure which we have described allows Boolean
- combinations of predicates. Suppose that P, P1, and P2 are
- predicates. Then
-
- (P) is not-P, () is always FALSE,
- (P1; P2;) is (P1 or P2), (;) is always TRUE,
- (P1 P2;) is (P1 and P2).
-
- Some care has to be taken with the Boolean interpretation
- because the combinations which were described do not strictly follow
- the laws of Boolean algebra. The principal discrepancy is the failure
- of the commutative law, because the order in which the two predicates
- are executed might make a difference. One reason for the importance
- of the order might be that an irreversible operation is effected,
- such as initiating an input-output operation, or changing the value
- of a variable, or something similar. Another factor is the fact that
- algorithms are sometimes undefined, either because their arguments
- are inappropriate, or the attempt to follow them out would never
- terminate. By testing first for terminal conditions or for the
- suitability of data, the undefined conditions can be avoided.
-
- Subroutines are defined in REC by grouping them together with
- a main program within a pair of braces; any printing ASCII character
- other than the space, atsign (@) or right brace may be assigned as a
- subroutine name. The structure of a subroutine group is as follows:
-
- { E1 n1 E2 n2 ... Ek nk Em }
-
- where E1, E2, ... Ek and Em are REC expressions as defined above; n1,
- n2, ... nk are the ASCII characters assigned as names to E1, E2, ...
- Ek, respectively, and hence declaring them to be subroutines with the
- given names, and Em is the main program, with which execution of the
- braced group starts. Since a program within braces also has a truth
- value (that of its main program), such a structure may also be given
- a name within an outer level of braces, or it may appear as a
- composite predicate within another expression. Subroutines are
- recursive and definitions that become active when entering a pair of
- braces replace definitions of the same name which were active in the
- previous level; on exit from a braced expression all of its
- definitions become inactive and any definitions which were replaced
- on entry become active again. Calls are effected by the predicate
- @x, which produces a call to subroutine x and whose truth value is
- that of the expression represented by x.
-
- All of the above is illustrated in the following example, in
- which ellipses (...) are used to denote features not essential to our
- discussion and the numbers on the left are for reference only:
-
- 1 { { ( ... @A ... ) A
- 2 ( ... @C ... @D ... ) B
- 3 ( ... @A ... ) } C
- 4 ( ... ) A
- 5 ( ... ) D
- 6 ( ... @A ...
- 7 {( ... ) A ( ... @A ... ) }
- 8 ... @C ... @A ... ) }
-
-
- Assume the only subroutine definitions are the ones shown
- explicitly in the example. The main program is defined in lines 6
- through 8; its level of braces (which runs through the whole example)
- contains also definitions for subroutines C (lines 1-3), A (line 4)
- and D (line 5). @A in lines 6 and 8 are calls to subroutine A of line
- 4, while @A in lines 1 and 3 refer to subroutine A defined in line 1.
- The call @A appearing in line 7 calls subroutine A of the same line.
- Due to the mechanism whereby upon entrance to a brace level previous
- definitions of the names defined in that level are saved and the new
- definitions become active (with reversal of the process on exit), it
- will be noticed that in our example the definition of subroutines D
- (line 5) and C (lines 1-3) remain active throughout the entire
- program while the definition of A depends on which pair of braces is
- currently executing. B is only defined within the brace pair
- defining C (lines 1-3) and reference to B from any of the programs
- defined in lines 4-8 would cause immediate termination of execution
- and return to the computer's operating system.
-
- Finally, both square brackets, the space and the comma have
- been reserved for readability and documentation purposes. Brackets
- are used to delimit comments; these may be nested, so that it is
- possible to inhibit compilation of a program or portion of a program
- containing comments by enclosing it with a pair of brackets.
- Comments may appear anywhere except between an expression defining a
- subroutine and its name and between the main program of a group and
- its closing right brace. No operations have been assigned to either
- the space or the comma, so they may be freely used to improve the
- readability of programs.
-
-
- Chapter 2. Operator Arguments and Data Structures
-
-
- Insofar as REC is defined as a language exclusively in terms
- of its control structure, leaving the selection of operators and
- predicates to be chosen as best fits a given problem or application
- area, there can be differing versions of REC, such as one dedicated
- to arithmetic, or another to text editing, or even monitor systems
- whose primitive operations are fairly complicated file handling
- subroutines.
-
- However, there are still general aspects of operators, such
- as the way that they pass data to one another, or how they are formed
- into subroutines, which remain the same from one version of REC to
- another. No doubt it is desirable to establish norms and a
- vocabulary for these aspects of the language just as was done for the
- control structure. For example, the problem of data transmission can
- be considered.
-
- There are about three ways of passing information from one
- part of a program to another - at least three that have found a
- noticeably popular acceptance. These are: by means of a calling
- sequence, by depositing the information in known locations, or by
- means of a pushdown list.
-
- Calling sequences are associated with subroutine usage; one
- might write
-
- CALL SUBROUTINE
- (data # 1)
- ( ... )
- (data # n)
- RETURN: ...
-
- Here, it is the program counter which localizes the area in
- which data is stored. From the point of view of writing a program,
- especially in assembly language, the registers following the
- subroutine call are a convenient place to locate data, the more so if
- the computer being used does not provide an automatic pushdown stack.
- This is particularly so if the data is constant from one subroutine
- call to another, but differs from one calling location to another.
-
- A second device for data transmission is to store information
- in a known location which is accessible to all the subroutines and to
- the main program. A typical example of this type of communication is
- through COMMON or labelled COMMON in FORTRAN programs. The method is
- particularly useful when the data is never changed or only
- infrequently modified. Even though this is its preferred use, it is
- not a real restriction because programs can always store data in
- COMMON just before a subroutine jump. Speed can be gained by thereby
- bypassing the way in which FORTRAN copies and restores the arguments
- in subroutine calling sequences. Nevertheless the use of fixed
- locations is most likely when the data is constant, or it is so
- widely used throughout the program that it would be redundant to
- incorporate it in a large number of calling sequences.
-
- The third technique of data communication is the use of a
- stack pointer which locates the head of a structure known as a
- pushdown list. This is the preferred method of handling temporary
- storage of the type which is required in such applications as the
- execution of algebraic formulas. It is nearly unavoidable whenever
- recursive programming is involved.
-
- The idea is to maintain a pointer to an array of temporary
- storage. As the arguments of a function or subroutine are evaluated
- they are deposited in the array and the pointer advanced to be ready
- for the next result. When all of the arguments are present, the
- calculations are made and the pointer moves backward, effectively
- erasing the arguments. Then the result of the calculation is left in
- the last position on the array, in place of the first argument if
- threre was one, and the process continues.
-
- Since each subroutine knows how many arguments it needs and
- how many values it is going to deposit, the maintenance of the
- pointer and the pushdown list is quite automatic. Errors can occur
- through setting up an incorrect number or arguments, and it is always
- possible, especially in a very deep recursion, that the pushdown list
- may grow beyond available space. Some checking should always be done
- to prevent the pointer from straying out of bounds.
-
- Primitive functions which have several arguments and which
- are coded in machine language can easily locate their arguments
- through a displacement relative to the stack pointer, but it is
- sometimes convenient to have a way of directing a component of a
- composite subroutine to the nth argument. A much more systematic way
- to make these arguments accessible is to give them names, and set
- aside a special storage area for them. This leads to the idea of a
- variable, even though its name may be nothing more than a serial
- number.
-
- Two activities concern the variables. First, its value has to
- be transferred to the pushdown list when it occurs in a program so
- that it will be available for use as an argument. Second, the value
- has to be transferred from the pushdown list to the special storage
- area so as to define the variable on entry into the subroutine.
-
- Two complications can occur, both aggravated when recursive
- subroutines are allowed. One is that the same names may be in use in
- various parts of the program, particularly when this happens both in
- the subroutine and in the main program which called it. Then it is
- necessary that previous values be set aside - not destroyed - when
- the subroutine is entered, and that they be replaced upon exit. Then
- all will work smoothly and the conflict in variable names will not be
- noticed.
-
- The other complication which can occur is the one of free
- variables, wherein we might not only want to refer to the kth
- argument of a subroutine, but as well to the mth argument of a higher
- level subroutine which called it. If no recursion is involved and one
- knows how many arguments each subroutine had, the position of any
- given argument on the pushdown list can be calculated; this is not
- possible when recursive subroutine chains are permitted.
-
- Rather, it is much more convenient to use the device of named
- variables and copies of the arguments rather than working with
- offsets from the stack pointer even though under restricted
- circumstances the latter is more economical.
-
- Some argumentation also surrounds the decision as to whether
- an operator ought to erase its own arguments, or whether the erasure
- ought to be done by a separate operator. The argument in favor of
- automatic erasure is that the pushdown list is then self-balancing,
- while the argument against stems from the possibility of the iterated
- use of an operator. In that case it would seem futile to keep
- loading and erasing the same argument over and over again.
-
- Finally we might want to get the nth argument directly off
- the pushdown list without having to fetch it up to the top. This
- would facilitate the definition of subroutines with an arbitrary
- number of arguments by allowing them to be used directly without the
- intermediate step of storing them as variables and fetching them out
- again later. However this direct access will not work if some part of
- the subroutine itself involved a recursive process, and thus direct
- access would not be usable in general.
-
- As a practical matter pushdown underflow and overflow have to
- be detected and acted upon - probably resulting in a fatal error
- message.
-
- Even though the definition of techniques of argument
- communication does not form part of the official specification of
- REC, most varieties of REC will naturally require data handling
- capabilities. The current version of REC has
-
- - An array, together with a pointer and limit
- registers, which can be used as a pushdown
- list.
-
- - A space which can be used to store the
- values or addresses of variables.
-
- - A workspace into which text may be inserted and
- in which searches, deletions and substitutions may
- be carried out.
-
- Any other version of REC would be expected to have at
- least the first two data structures.
-
- In summary, the three data transference techniques and the
- error detection aspect affect current REC programming in the
- following way:
-
- - A double push down list is used systematically; "double"
- referring to the fact that both ends of the (finite) list
- are available for argument storage.
-
- - Constants are introduced into a program through a
- subroutine which expects to find the constant as part of its
- calling sequence. Parameters may also be introduced into a
- subroutine through a calling sequence, but otherwise calling
- sequences are not much used. More elaborate compilers
- may even set aside variable space rather than calling sequence
- space for constants, referring to the constants only through
- pointers.
-
- - Variables or registers which are of central importance
- to a program are set aside, and manipulated through special
- operators. This takes the form of a workspace for text
- manipulation and a set of locations in which values or pointers
- to values may be stored. Otherwise there is not much data
- communication through fixed program areas.
-
- - In order to conserve space, error detection and handling is
- kept to a minimum. Pushdown list, workspace and compile area
- overflows cause a message to be issued and immediate
- termination of execution. So does an attempt to read a
- source file or a memory buffer beyond its end during
- compilation (usually due to an excess of opening braces or
- parentheses). Other errors are noted in a location reserved
- for this purpose; the question mark '?' is the predicate the
- programmer may use to find out whether an error has occurred.