Programmer 7500

home *** CD-ROM | disk | FTP | other *** search

/ Programmer 7500 / MAX_PROGRAMMERS.iso / INFO / MISC / REC.ZIP / RECA.DOC < prev next >

Wrap

Text File | 1986-12-01 | 23.9 KB | 518 lines

REC - A Regular Expression Compiler Gerardo Cisneros Escuela Superior de Ingenieria Mecanica y Electrica Instituto Politecnico Nacional and Harold V. McIntosh Departamento de Aplicacion de Microcomputadoras Instituto de Ciencias Universidad Autonoma de Puebla Copyright (c) 1985 Gerardo Cisneros H. V. McIntosh PREFACE With dropping prices of microcomputers, programming is becoming accessible to more people, and software is assuming an increasing importance in the total balance of the operation of a computer. This has originated much controversy about an author's rights to his software, ranging from the idea that the world can be held to ransom for a brilliant idea to the suggestion that there should be no impediment to any copying whatsoever. The research reported in this document has been supported by public funds (Mexican, Northamerican, Swedish) in a number of institutions (Universities, Atomic Energy Commissions, Polytechnic Institutes) over a decade or two. Its result has especially been used in programming and applied mathematics courses in an attempt to get people to learn it and to use it. Accordingly it might be held to belong in the public domain. Most people read scientific literature, as well as any other literature, to establish a background. Should they encounter something particularly useful, they are expected to acknowledge its source, because the resulting fame more than anything is the author's compensation for his work. In those rare instances where a resounding commercial success has been the sole or a substantial consequence of the largely unaltered use of an author's work, it is to be expected that he, his associates, and his employer will expect to share in the benefits. It is curious that there does not seem to be a corresponding willingness to share in the consequences of a bungling or even bona fide misapplication of the work or the attendant catastrophe, should something go wrong. Gainesville, Florida, U.S.A. Puebla, Puebla, Mexico. Introduction REC (Regular Expression Compiler) is the end product of an effort begun many years ago at the University of Florida to find an aesthetic structure to take the place of the "program feature" in McCarthy's LISP. "Operator Predicates" were the solution evolved at that time. Later, when they were taken completely out of the context of LISP and implemented on a PDP-8 computer at the National Autonomous University of Mexico in 1966, the programming language REC arose. For more than a decade thereafter it was used in programming, numerical analysis and mathematical logic courses at the National Polytechnic Institute, also in Mexico City, acquiring a number of versions and giving many people considerable experience in its presentation and usage. REC compilers developed at the Polytechnic included versions suitable for matrix manipulations, text editing and numerical integration of second order differential equations; REC was also used there to program hand calculations on the Friden 132 desk calculator, one of the first machines to offer an operand stack. One version was used extensively in the administration and data processing of the Mexican Nuclear Energy Institute, in conjunction with the monitor system of the PDP-10. In the last five years, and with the advent of microcomputers, REC has matured at the Autonomous University of Puebla into a very powerful symbolic manipulation language with some arithmetic capabilities, which has proved to be very well suited to the construction of compilers and operating systems. REC possesses an extremely simple control structure consisting of four symbols: the two parentheses, the colon, and the semicolon. Besides these there may be any number of predicates and operators (these being a special case of predicates), which are responsible for all the work to be done. This represents an attempt to reduce computer programming to its barest essentials, both for pedagogical reasons, and in the expectation that it will be a manageable structure when it is cast in its simplest terms. The resulting language is very concise, not unlike APL or TECO. As the allusion to regular expressions implies, there is also the expectation that the language is intimately related to the theory of computations. In effect, REC derives its appeal from the fact that computers can be regarded reasonably well as Turing machines with very elaborate built-in shortcuts to eliminate the grotesque inefficiency of manipulating individual bits on a single linear tape. A Turing machine itself consists of a finite-state machine acting as the control of a tape memory; finite state machines in turn are conveniently described by regular expressions. The REC notation simply allows one to write regular expressions in a manner somewhat more amenable to programming the Turing machine which they control. If one does not wish to think so strictly in terms of Turing machines, REC expressions still provide a means of defining the flow of control in a program which is quite convenient for many applications. REC defines only a control structure. It takes some getting used to, but coincides well with contemporary ideas about "structured programming." Some particularly useful REC languages come into being when specific sets of operators and predicates are chosen. One selection gives the symbol-manipulation language REC/MARKOV. People who are familiar with DEC's PDP-10 will note a strong resemblance to TECO, although REC actually contemporizes with or even antedates TECO. One of the most important differences between these two languages is that TECO lacks REC's orderly control structure. The REC/MARKOV processor is remarkably compact, fitting into about 5Kbytes of memory in many machines (7.5Kbytes when floating point arithmetic operations are included), exclusive of working storage. In the course of this book we shall analyze the operation of the microcomputer implementation of REC/MARKOV mentioned earlier; we shall call it simply REC throughout the remainder of the book. Our principal objective will be to introduce a symbol manipulation programming language sufficiently powerful that it can handle the compilation of REC languages in a wide variety of different computers. At the same time one will have a language useful for text editing, assembly including macro assembly and cross assemblers, compiling, and even file maintenance and data analysis. To do all this will also entail some discussion of programming languages and computer architecture, but not to the extent that no previous familiarity with such things is needed. Chapter 1. The Structure of REC In order to define a REC expression seven characters will be used as metasymbols, that is, symbols which would not appear explicitly in an actual instance of the object being defined and whose significance as metasymbols is limited to the definition in which they are used. These characters are [, ], E, P, |, * and =. Using the language of regular expressions and establishing the convention that concatenation is implicit, that alternatives are separated by |, that repetition - even zero times - is denoted by *, that = means "is defined as", and using square brackets rather than parentheses as symbols of aggrupation, a REC expression E is defined recursively by E = ( [[P | E]*[: | ;]*]* [P | E]* ) where the enclosing parentheses are essential and P denotes a predicate. In more common terms, a REC expression is a string enclosed by balanced parentheses, which consists of a sequence of operators, predicates, or further REC expressions separated by punctuation consisting of one more colons or semicolons. Operators and predicates carry out operations, not all of them necessarily primitive. They may be coded either in machine language or consist of subroutines themselves defined in REC. Predicates differ from operators by acquiring either of two truth values as a consequence of their operation; operators always have the value TRUE. The difference between operators and predicates makes itself felt in the way that predicates can affect the sequence of operations which is chosen to execute a program. A REC expression also acquires a truth value as a result of its execution and may be thus regarded as a composite predicate. A REC expression is executed by reading it from left to right in sequence. Operators or predicates are performed in the order in which they occur, except that jumps occur after predicates if their truth value turns out to be FALSE. On encountering a right parenthesis, the expression terminates with the value FALSE, but if a semicolon is encountered, termination results with the value TRUE. Colons mean repetition from the opening left parenthesis. Finally we have to say what kinds of jumps are caused by predicates whose truth value is FALSE. If there is a colon or semicolon to the right of the predicate and on the same parenthesis level, the jump is to the next position beyond. If there is none, the jump is beyond the terminating right parenthesis, resulting in the value TRUE just as if a semicolon had been encountered. It should be mentioned that parentheses within parentheses are treated as a single expression, so that all jumps originating inside a given level of parentheses refer to locations only on the same level and never to locations within subexpressions. The following figure illustrates the flow of control in REC, assuming the only colons, semicolons and parentheses are those explicitly shown, and P is any predicate. ( (P ) : ( P ; ) : ; ) The structure which we have described allows Boolean combinations of predicates. Suppose that P, P1, and P2 are predicates. Then (P) is not-P, () is always FALSE, (P1; P2;) is (P1 or P2), (;) is always TRUE, (P1 P2;) is (P1 and P2). Some care has to be taken with the Boolean interpretation because the combinations which were described do not strictly follow the laws of Boolean algebra. The principal discrepancy is the failure of the commutative law, because the order in which the two predicates are executed might make a difference. One reason for the importance of the order might be that an irreversible operation is effected, such as initiating an input-output operation, or changing the value of a variable, or something similar. Another factor is the fact that algorithms are sometimes undefined, either because their arguments are inappropriate, or the attempt to follow them out would never terminate. By testing first for terminal conditions or for the suitability of data, the undefined conditions can be avoided. Subroutines are defined in REC by grouping them together with a main program within a pair of braces; any printing ASCII character other than the space, atsign (@) or right brace may be assigned as a subroutine name. The structure of a subroutine group is as follows: { E1 n1 E2 n2 ... Ek nk Em } where E1, E2, ... Ek and Em are REC expressions as defined above; n1, n2, ... nk are the ASCII characters assigned as names to E1, E2, ... Ek, respectively, and hence declaring them to be subroutines with the given names, and Em is the main program, with which execution of the braced group starts. Since a program within braces also has a truth value (that of its main program), such a structure may also be given a name within an outer level of braces, or it may appear as a composite predicate within another expression. Subroutines are recursive and definitions that become active when entering a pair of braces replace definitions of the same name which were active in the previous level; on exit from a braced expression all of its definitions become inactive and any definitions which were replaced on entry become active again. Calls are effected by the predicate @x, which produces a call to subroutine x and whose truth value is that of the expression represented by x. All of the above is illustrated in the following example, in which ellipses (...) are used to denote features not essential to our discussion and the numbers on the left are for reference only: 1 { { ( ... @A ... ) A 2 ( ... @C ... @D ... ) B 3 ( ... @A ... ) } C 4 ( ... ) A 5 ( ... ) D 6 ( ... @A ... 7 {( ... ) A ( ... @A ... ) } 8 ... @C ... @A ... ) } Assume the only subroutine definitions are the ones shown explicitly in the example. The main program is defined in lines 6 through 8; its level of braces (which runs through the whole example) contains also definitions for subroutines C (lines 1-3), A (line 4) and D (line 5). @A in lines 6 and 8 are calls to subroutine A of line 4, while @A in lines 1 and 3 refer to subroutine A defined in line 1. The call @A appearing in line 7 calls subroutine A of the same line. Due to the mechanism whereby upon entrance to a brace level previous definitions of the names defined in that level are saved and the new definitions become active (with reversal of the process on exit), it will be noticed that in our example the definition of subroutines D (line 5) and C (lines 1-3) remain active throughout the entire program while the definition of A depends on which pair of braces is currently executing. B is only defined within the brace pair defining C (lines 1-3) and reference to B from any of the programs defined in lines 4-8 would cause immediate termination of execution and return to the computer's operating system. Finally, both square brackets, the space and the comma have been reserved for readability and documentation purposes. Brackets are used to delimit comments; these may be nested, so that it is possible to inhibit compilation of a program or portion of a program containing comments by enclosing it with a pair of brackets. Comments may appear anywhere except between an expression defining a subroutine and its name and between the main program of a group and its closing right brace. No operations have been assigned to either the space or the comma, so they may be freely used to improve the readability of programs. Chapter 2. Operator Arguments and Data Structures Insofar as REC is defined as a language exclusively in terms of its control structure, leaving the selection of operators and predicates to be chosen as best fits a given problem or application area, there can be differing versions of REC, such as one dedicated to arithmetic, or another to text editing, or even monitor systems whose primitive operations are fairly complicated file handling subroutines. However, there are still general aspects of operators, such as the way that they pass data to one another, or how they are formed into subroutines, which remain the same from one version of REC to another. No doubt it is desirable to establish norms and a vocabulary for these aspects of the language just as was done for the control structure. For example, the problem of data transmission can be considered. There are about three ways of passing information from one part of a program to another - at least three that have found a noticeably popular acceptance. These are: by means of a calling sequence, by depositing the information in known locations, or by means of a pushdown list. Calling sequences are associated with subroutine usage; one might write CALL SUBROUTINE (data # 1) ( ... ) (data # n) RETURN: ... Here, it is the program counter which localizes the area in which data is stored. From the point of view of writing a program, especially in assembly language, the registers following the subroutine call are a convenient place to locate data, the more so if the computer being used does not provide an automatic pushdown stack. This is particularly so if the data is constant from one subroutine call to another, but differs from one calling location to another. A second device for data transmission is to store information in a known location which is accessible to all the subroutines and to the main program. A typical example of this type of communication is through COMMON or labelled COMMON in FORTRAN programs. The method is particularly useful when the data is never changed or only infrequently modified. Even though this is its preferred use, it is not a real restriction because programs can always store data in COMMON just before a subroutine jump. Speed can be gained by thereby bypassing the way in which FORTRAN copies and restores the arguments in subroutine calling sequences. Nevertheless the use of fixed locations is most likely when the data is constant, or it is so widely used throughout the program that it would be redundant to incorporate it in a large number of calling sequences. The third technique of data communication is the use of a stack pointer which locates the head of a structure known as a pushdown list. This is the preferred method of handling temporary storage of the type which is required in such applications as the execution of algebraic formulas. It is nearly unavoidable whenever recursive programming is involved. The idea is to maintain a pointer to an array of temporary storage. As the arguments of a function or subroutine are evaluated they are deposited in the array and the pointer advanced to be ready for the next result. When all of the arguments are present, the calculations are made and the pointer moves backward, effectively erasing the arguments. Then the result of the calculation is left in the last position on the array, in place of the first argument if threre was one, and the process continues. Since each subroutine knows how many arguments it needs and how many values it is going to deposit, the maintenance of the pointer and the pushdown list is quite automatic. Errors can occur through setting up an incorrect number or arguments, and it is always possible, especially in a very deep recursion, that the pushdown list may grow beyond available space. Some checking should always be done to prevent the pointer from straying out of bounds. Primitive functions which have several arguments and which are coded in machine language can easily locate their arguments through a displacement relative to the stack pointer, but it is sometimes convenient to have a way of directing a component of a composite subroutine to the nth argument. A much more systematic way to make these arguments accessible is to give them names, and set aside a special storage area for them. This leads to the idea of a variable, even though its name may be nothing more than a serial number. Two activities concern the variables. First, its value has to be transferred to the pushdown list when it occurs in a program so that it will be available for use as an argument. Second, the value has to be transferred from the pushdown list to the special storage area so as to define the variable on entry into the subroutine. Two complications can occur, both aggravated when recursive subroutines are allowed. One is that the same names may be in use in various parts of the program, particularly when this happens both in the subroutine and in the main program which called it. Then it is necessary that previous values be set aside - not destroyed - when the subroutine is entered, and that they be replaced upon exit. Then all will work smoothly and the conflict in variable names will not be noticed. The other complication which can occur is the one of free variables, wherein we might not only want to refer to the kth argument of a subroutine, but as well to the mth argument of a higher level subroutine which called it. If no recursion is involved and one knows how many arguments each subroutine had, the position of any given argument on the pushdown list can be calculated; this is not possible when recursive subroutine chains are permitted. Rather, it is much more convenient to use the device of named variables and copies of the arguments rather than working with offsets from the stack pointer even though under restricted circumstances the latter is more economical. Some argumentation also surrounds the decision as to whether an operator ought to erase its own arguments, or whether the erasure ought to be done by a separate operator. The argument in favor of automatic erasure is that the pushdown list is then self-balancing, while the argument against stems from the possibility of the iterated use of an operator. In that case it would seem futile to keep loading and erasing the same argument over and over again. Finally we might want to get the nth argument directly off the pushdown list without having to fetch it up to the top. This would facilitate the definition of subroutines with an arbitrary number of arguments by allowing them to be used directly without the intermediate step of storing them as variables and fetching them out again later. However this direct access will not work if some part of the subroutine itself involved a recursive process, and thus direct access would not be usable in general. As a practical matter pushdown underflow and overflow have to be detected and acted upon - probably resulting in a fatal error message. Even though the definition of techniques of argument communication does not form part of the official specification of REC, most varieties of REC will naturally require data handling capabilities. The current version of REC has - An array, together with a pointer and limit registers, which can be used as a pushdown list. - A space which can be used to store the values or addresses of variables. - A workspace into which text may be inserted and in which searches, deletions and substitutions may be carried out. Any other version of REC would be expected to have at least the first two data structures. In summary, the three data transference techniques and the error detection aspect affect current REC programming in the following way: - A double push down list is used systematically; "double" referring to the fact that both ends of the (finite) list are available for argument storage. - Constants are introduced into a program through a subroutine which expects to find the constant as part of its calling sequence. Parameters may also be introduced into a subroutine through a calling sequence, but otherwise calling sequences are not much used. More elaborate compilers may even set aside variable space rather than calling sequence space for constants, referring to the constants only through pointers. - Variables or registers which are of central importance to a program are set aside, and manipulated through special operators. This takes the form of a workspace for text manipulation and a set of locations in which values or pointers to values may be stored. Otherwise there is not much data communication through fixed program areas. - In order to conserve space, error detection and handling is kept to a minimum. Pushdown list, workspace and compile area overflows cause a message to be issued and immediate termination of execution. So does an attempt to read a source file or a memory buffer beyond its end during compilation (usually due to an excess of opening braces or parentheses). Other errors are noted in a location reserved for this purpose; the question mark '?' is the predicate the programmer may use to find out whether an error has occurred.