- .hc ~
- .bd I 2
- .de TS
- .br
- .nf
- .SP 1v
- .ul 0
- ..
- .de TE
- .SP 1v
- .fi
- ..
- .de PT
- .if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
- .if \\n%>1 'sp
- ..
- .ND July 21, 1975
- .RP
- .TM 75-1274-15 39199 39199-11
- .TL
- Lex \- A Lexical Analyzer Generator
- .AU ``MH 2C-569'' 6377
- M. E. Lesk and E. Schmidt
- .AI
- .MH
- .AB
- .sp
- .bd I 2
- .nr PS 8
- .nr VS 9
- .ps 8
- .vs 9p
- Lex helps write programs whose control flow
- is directed by instances of regular
- expressions in the input stream.
- It is well suited for editor-script type transformations and
- for segmenting input in preparation for
- a parsing routine.
- .PP
- Lex source is a table of regular expressions and corresponding program fragments.
- The table is translated to a program
- which reads an input stream, copying it to an output stream
- and partitioning the input
- into strings which match the given expressions.
- As each such string is recognized the corresponding
- program fragment is executed.
- The recognition of the expressions
- is performed by a deterministic finite automaton
- generated by Lex.
- The program fragments written by the user are executed in the order in which the
- corresponding regular expressions occur in the input stream.
- .if n .if \n(tm .ig
- .PP
- The lexical analysis
- programs written with Lex accept ambiguous specifications
- and choose the longest
- match possible at each input point.
- If necessary, substantial look~ahead
- is performed on the input, but the
- input stream will be backed up to the
- end of the current partition, so that the user
- has general freedom to manipulate it.
- .PP
- Lex can generate analyzers in either C or Ratfor, a language
- which can be translated automatically to portable Fortran.
- It is available on the PDP-11 UNIX, Honeywell GCOS,
- and IBM OS systems.
- This manual, however, will only discuss generating analyzers
- in C on the UNIX system, which is the only supported
- form of Lex under UNIX Version 7.
- Lex is designed to simplify
- interfacing with Yacc, for those
- with access to this compiler-compiler system.
- ..
- .nr PS 9
- .nr VS 11
- .AE
- .SH
- .ce 1
- Table of Contents
- .LP
- .ce 100
- .TS
- r 1l 2r .
- 1. Introduction. 1
- 2. Lex Source. 3
- 3. Lex Regular Expressions. 3
- 4. Lex Actions. 5
- 5. Ambiguous Source Rules. 7
- 6. Lex Source Definitions. 8
- 7. Usage. 8
- 8. Lex and Yacc. 9
- 9. Examples. 10
- 10. Left Context Sensitivity. 11
- 11. Character Set. 12
- 12. Summary of Source Format. 12
- 13. Caveats and Bugs. 13
- 14. Acknowledgments. 13
- 15. References. 13
- .TE
- .ce 0
- .2C
- .NH
- Introduction.
- .PP
- Lex is a program generator designed for
- lexical processing of character input streams.
- It accepts a high-level, problem-oriented specification
- for character string matching,
- and
- produces a program in a general purpose language which recognizes
- regular expressions.
- The regular expressions are specified by the user in the
- source specifications given to Lex.
- The Lex written code recognizes these expressions
- in an input stream and partitions the input stream into
- strings matching the expressions. At the bound~aries
- between strings
- program sections
- provided by the user are executed.
- The Lex source file associates the regular expressions and the
- program fragments.
- As each expression appears in the input to the program written by Lex,
- the corresponding fragment is executed.
- .PP
- .de MH
- Bell Laboratories, Murray Hill, NJ 07974.
- ..
- The user supplies the additional code
- beyond expression matching
- needed to complete his tasks, possibly
- including code written by other generators.
- The program that recognizes the expressions is generated in the
- general purpose programming language employed for the
- user's program fragments.
- Thus, a high level expression
- language is provided to write the string expressions to be
- matched while the user's freedom to write actions
- is unimpaired.
- This avoids forcing the user who wishes to use a string manipulation
- language for input analysis to write processing programs in the same
- and often inappropriate string handling language.
- .PP
- Lex is not a complete language, but rather a generator representing
- a new language feature which can be added to
- different programming languages, called ``host languages.''
- Just as general purpose languages
- can produce code to run on different computer hardware,
- Lex can write code in different host languages.
- The host language is used for the output code generated by Lex
- and also for the program fragments added by the user.
- Compatible run-time libraries for the different host languages
- are also provided.
- This makes Lex adaptable to different environments and
- different users.
- Each application
- may be directed to the combination of hardware and host language appropriate
- to the task, the user's background, and the properties of local
- implementations.
- At present, the only supported host language is C,
- although Fortran (in the form of Ratfor [2]) has been available
- in the past.
- Lex itself exists on UNIX, GCOS, and OS/370; but the
- code generated by Lex may be taken anywhere the appropriate
- compilers exist.
- .PP
- Lex turns the user's expressions and actions
- (called
- .ul
- source
- in this memo) into the host general-purpose language;
- the generated program is named
- .ul
- yylex.
- The
- .ul
- yylex
- program
- will recognize expressions
- in a stream
- (called
- .ul
- input
- in this memo)
- and perform the specified actions for each expression as it is detected.
- See Figure 1.
- .GS
- .TS
- center;
- l _ r
- l|c|r
- l _ r
- l _ r
- l|c|r
- l _ r
- c s s
- c s s.
-
- Source \(-> Lex \(-> yylex
-
- .sp 2
-
- Input \(-> yylex \(-> Output
-
- .sp
- An overview of Lex
- .sp
- Figure 1
- .TE
- .GE
- .PP
- For a trivial example, consider a program to delete
- from the input
- all blanks or tabs at the ends of lines.
- .TS
- center;
- l l.
- %%
- [ \et]+$ ;
- .TE
- is all that is required.
- The program
- contains a %% delimiter to mark the beginning of the rules, and
- one rule.
- This rule contains a regular expression
- which matches one or more
- instances of the characters blank or tab
- (written \et for visibility, in accordance with the C language convention)
- just prior to the end of a line.
- The brackets indicate the character
- class made of blank and tab; the + indicates ``one or more ...'';
- and the $ indicates ``end of line,'' as in QED.
- No action is specified,
- so the program generated by Lex (yylex) will ignore these characters.
- Everything else will be copied.
- To change any remaining
- string of blanks or tabs to a single blank,
- add another rule:
- .TS
- center;
- l l.
- %%
- [ \et]+$ ;
- [ \et]+ printf(" ");
- .TE
- The finite automaton generated for this
- source will scan for both rules at once,
- observing at
- the termination of the string of blanks or tabs
- whether or not there is a newline character, and executing
- the desired rule action.
- The first rule matches all strings of blanks or tabs
- at the end of lines, and the second
- rule all remaining strings of blanks or tabs.
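The effect of these two rules can be sketched as a hand-written C routine (the function and buffer names here are invented for illustration; the scanner Lex actually generates is table-driven and looks quite different):

```c
#include <string.h>
#include <assert.h>

/* Hand-coded sketch of the two-rule Lex program: a run of blanks
 * or tabs followed by a newline is deleted (rule 1); any other run
 * of blanks or tabs becomes a single blank (rule 2); everything
 * else is copied unchanged. */
void squeeze(const char *in, char *out)
{
    while (*in) {
        if (*in == ' ' || *in == '\t') {
            const char *p = in;
            while (*p == ' ' || *p == '\t')
                p++;                /* longest match of [ \t]+ */
            if (*p == '\n')
                ;                   /* rule 1: [ \t]+$  -> delete */
            else
                *out++ = ' ';       /* rule 2: [ \t]+   -> one blank */
            in = p;
        } else
            *out++ = *in++;
    }
    *out = '\0';
}
```

Note that, as in the Lex program, the decision between the two rules is made only after the whole run of blanks and tabs has been consumed.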
- .PP
- Lex can be used alone for simple transformations, or
- for analysis and statistics gathering on a lexical level.
- Lex can also be used with a parser generator
- to perform the lexical analysis phase; it is particularly
- easy to interface Lex and Yacc [3].
- Lex programs recognize only regular expressions;
- Yacc writes parsers that accept a large class of context free grammars,
- but require a lower level analyzer to recognize input tokens.
- Thus, a combination of Lex and Yacc is often appropriate.
- When used as a preprocessor for a later parser generator,
- Lex is used to partition the input stream,
- and the parser generator assigns structure to
- the resulting pieces.
- The flow of control
- in such a case (which might be the first half of a compiler,
- for example) is shown in Figure 2.
- Additional programs,
- written by other generators
- or by hand, can
- be added easily to programs written by Lex.
- .BS 2
- .TS
- center;
- l c c c l
- l c c c l
- l c c c l
- l _ c _ l
- l|c|c|c|l
- l _ c _ l
- l c c c l
- l _ c _ l
- l|c|c|c|l
- l _ c _ l
- l c s s l
- l c s s l.
- lexical grammar
- rules rules
- \(da \(da
-
- Lex Yacc
-
- \(da \(da
-
- Input \(-> yylex \(-> yyparse \(-> Parsed input
-
- .sp
- Lex with Yacc
- .sp
- Figure 2
- .TE
- .BE
- Yacc users
- will realize that the name
- .ul
- yylex
- is what Yacc expects its lexical analyzer to be named,
- so that the use of this name by Lex simplifies
- interfacing.
- .PP
- Lex generates a deterministic finite automaton from the regular expressions
- in the source [4].
- The automaton is interpreted, rather than compiled, in order
- to save space.
- The result is still a fast analyzer.
- In particular, the time taken by a Lex program
- to recognize and partition an input stream is
- proportional to the length of the input.
- The number of Lex rules or
- the complexity of the rules is
- not important in determining speed,
- unless rules which include
- forward context require a significant amount of re~scanning.
- What does increase with the number and complexity of rules
- is the size of the finite
- automaton, and therefore the size of the program
- generated by Lex.
- .PP
- In the program written by Lex, the user's fragments
- (representing the
- .ul
- actions
- to be performed as each regular expression
- is found)
- are gathered
- as cases of a switch.
- The automaton interpreter directs the control flow.
- Opportunity is provided for the user to insert either
- declarations or additional statements in the routine containing
- the actions, or to
- add subroutines outside this action routine.
- .PP
- Lex is not limited to source which can
- be interpreted on the basis of one character
- look~ahead.
- For example,
- if there are two rules, one looking for
- .I ab
- and another for
- .I abcdefg ,
- and the input stream is
- .I abcdefh ,
- Lex will recognize
- .I ab
- and leave
- the input pointer just before
- .I "cd. . ."
- Such backup is more costly
- than the processing of simpler languages.
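The longest-match-with-backup behavior can be sketched in C for just these two literal rules (the names are invented; real Lex records the last accepting state of its automaton rather than comparing strings):

```c
#include <string.h>
#include <assert.h>

/* Sketch of maximal munch with backup for the two rules "ab" and
 * "abcdefg": the longer successful match wins, and the input
 * pointer advances only past the matched characters, leaving the
 * rest to be rescanned. */
int scan(const char *input, int *matched_rule)
{
    const char *rules[] = { "ab", "abcdefg" };
    int best_len = 0;
    *matched_rule = -1;
    for (int r = 0; r < 2; r++) {
        size_t n = strlen(rules[r]);
        if (strncmp(input, rules[r], n) == 0 && (int)n > best_len) {
            best_len = (int)n;      /* longer match wins */
            *matched_rule = r;
        }
    }
    return best_len;                /* advance the input this far */
}
```

On the input \fIabcdefh\fR this returns 2 and selects the first rule, leaving the pointer just before \fIcd\fR, exactly the backup described above.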
- .2C
- .NH
- Lex Source.
- .PP
- The general format of Lex source is:
- .TS
- center;
- l.
- {definitions}
- %%
- {rules}
- %%
- {user subroutines}
- .TE
- where the definitions and the user subroutines
- are often omitted.
- The second
- .I %%
- is optional, but the first is required
- to mark the beginning of the rules.
- The absolute minimum Lex program is thus
- .TS
- center;
- l.
- %%
- .TE
- (no definitions, no rules) which translates into a program
- which copies the input to the output unchanged.
- .PP
- In the outline of Lex programs shown above, the
- .I
- rules
- .R
- represent the user's control
- decisions; they are a table, in which the left column
- contains
- .I
- regular expressions
- .R
- (see section 3)
- and the right column contains
- .I
- actions,
- .R
- program fragments to be executed when the expressions
- are recognized.
- Thus an individual rule might appear
- .TS
- center;
- l l.
- integer printf("found keyword INT");
- .TE
- to look for the string
- .I integer
- in the input stream and
- print the message ``found keyword INT'' whenever it appears.
- In this example the host procedural language is C and
- the C library function
- .I
- printf
- .R
- is used to print the string.
- The end
- of the expression is indicated by the first blank or tab character.
- If the action is merely a single C expression,
- it can just be given on the right side of the line; if it is
- compound, or takes more than a line, it should be enclosed in
- braces.
- As a slightly more useful example, suppose it is desired to
- change a number of words from British to American spelling.
- Lex rules such as
- .TS
- center;
- l l.
- colour printf("color");
- mechanise printf("mechanize");
- petrol printf("gas");
- .TE
- would be a start. These rules are not quite enough,
- since
- the word
- .I petroleum
- would become
- .I gaseum ;
- a way of dealing
- with this will be described later.
- .2C
- .NH
- Lex Regular Expressions.
- .PP
- The definitions of regular expressions are very similar to those
- in QED [5].
- A regular
- expression specifies a set of strings to be matched.
- It contains text characters (which match the corresponding
- characters in the strings being compared)
- and operator characters (which specify
- repetitions, choices, and other features).
- The letters of the alphabet and the digits are
- always text characters; thus the regular expression
- .TS
- center;
- l l.
- integer
- .TE
- matches the string
- .ul
- integer
- wherever it appears
- and the expression
- .TS
- center;
- l.
- a57D
- .TE
- looks for the string
- .ul
- a57D.
- .PP
- .I
- Operators.
- .R
- The operator characters are
- .TS
- center;
- l.
- " \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
- .TE
- and if they are to be used as text characters, an escape
- should be used.
- The quotation mark operator (")
- indicates that whatever is contained between a pair of quotes
- is to be taken as text characters.
- Thus
- .TS
- center;
- l.
- xyz"++"
- .TE
- matches the string
- .I xyz++
- when it appears. Note that a part of a string may be quoted.
- It is harmless but unnecessary to quote an ordinary
- text character; the expression
- .TS
- center;
- l.
- "xyz++"
- .TE
- is the same as the one above.
- Thus by quoting every non-alphanumeric character
- being used as a text character, the user can avoid remembering
- the list above of current
- operator characters, and is safe should further extensions to Lex
- lengthen the list.
- .PP
- An operator character may also be turned into a text character
- by preceding it with \e as in
- .TS
- center;
- l.
- xyz\e+\e+
- .TE
- which
- is another, less readable, equivalent of the above expressions.
- Another use of the quoting mechanism is to get a blank into
- an expression; normally, as explained above, blanks or tabs end
- a rule.
- Any blank character not contained within [\|] (see below) must
- be quoted.
- Several normal C escapes with \e
- are recognized: \en is newline, \et is tab, and \eb is backspace.
- To enter \e itself, use \e\e.
- Since newline is illegal in an expression, \en must be used;
- it is not
- required to escape tab and backspace.
- Every character but blank, tab, newline and the list above is always
- a text character.
- .PP
- .I
- Character classes.
- .R
- Classes of characters can be specified using the operator pair [\|].
- The construction
- .I [abc]
- matches a
- single character, which may be
- .I a ,
- .I b ,
- or
- .I c .
- Within square brackets,
- most operator meanings are ignored.
- Only three characters are special:
- these are \e \(mi and ^. The \(mi character
- indicates ranges. For example,
- .TS
- center;
- l.
- [a\(miz0\(mi9<>_]
- .TE
- indicates the character class containing all the lower case letters,
- the digits,
- the angle brackets, and underline.
- Ranges may be given in either order.
- Using \(mi between any pair of characters which are
- not both upper case letters, both lower case letters, or both digits
- is implementation dependent and will get a warning message.
- (E.g., [0\-z] in ASCII is many more characters
- than it is in EBCDIC).
- If it is desired to include the
- character \(mi in a character class, it should be first or
- last; thus
- .TS
- center;
- l.
- [\(mi+0\(mi9]
- .TE
- matches all the digits and the two signs.
- .PP
- In character classes,
- the ^ operator must appear as the first character
- after the left bracket; it indicates that the resulting string
- is to be complemented with respect to the computer character set.
- Thus
- .TS
- center;
- l.
- [^abc]
- .TE
- matches all characters except a, b, or c, including
- all special or control characters; or
- .TS
- center;
- l.
- [^a\-zA\-Z]
- .TE
- is any character which is not a letter.
- The \e character provides the usual escapes within
- character class brackets.
- .PP
- .I
- Arbitrary character.
- .R
- To match almost any character, the operator character
- .TS
- center;
- l.
- \&.
- .TE
- is the class of all characters except newline.
- Escaping into octal is possible although non-portable:
- .TS
- center;
- l.
- [\e40\-\e176]
- .TE
- matches all printable characters in the ASCII character set, from octal
- 40 (blank) to octal 176 (tilde).
- .PP
- .I
- Optional expressions.
- .R
- The operator
- .I ?
- indicates
- an optional element of an expression.
- Thus
- .TS
- center;
- l.
- ab?c
- .TE
- matches either
- .I ac
- or
- .I abc .
- .PP
- .I
- Repeated expressions.
- .R
- Repetitions of classes are indicated by the operators
- .I \(**
- and
- .I + .
- .TS
- center;
- l.
- \f2a\(**\f1
- .TE
- is any number of consecutive
- .I a
- characters, including zero; while
- .TS
- center;
- l.
- a+
- .TE
- is one or more instances of
- .I a.
- For example,
- .TS
- center;
- l.
- [a\-z]+
- .TE
- is all strings of lower case letters.
- And
- .TS
- center;
- l.
- [A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
- .TE
- indicates all alphanumeric strings with a leading
- alphabetic character.
- This is a typical expression for recognizing identifiers in
- computer languages.
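The identifier expression corresponds to a simple scan loop; a hand-written C equivalent (the function name is invented here) might read:

```c
#include <ctype.h>
#include <assert.h>

/* Length of the identifier [A-Za-z][A-Za-z0-9]* at the start of s,
 * or 0 if none: one leading letter, then any run of letters and
 * digits.  (The underscore is deliberately omitted, as in the
 * expression above.) */
int ident_len(const char *s)
{
    int n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}
```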
- .PP
- .I
- Alternation and Grouping.
- .R
- The operator |
- indicates alternation:
- .TS
- center;
- l.
- (ab\||\|cd)
- .TE
- matches either
- .ul
- ab
- or
- .ul
- cd.
- Note that parentheses are used for grouping, although
- they are
- not necessary on the outside level;
- .TS
- center;
- l.
- ab\||\|cd
- .TE
- would have sufficed.
- Parentheses
- can be used for more complex expressions:
- .TS
- center;
- l.
- (ab\||\|cd+)?(ef)\(**
- .TE
- matches such strings as
- .I abefef ,
- .I efefef ,
- .I cdef ,
- or
- .I cddd\| ;
- but not
- .I abc ,
- .I abcd ,
- or
- .I abcdef .
- .PP
- .I
- Context sensitivity.
- .R
- Lex will recognize a small amount of surrounding
- context. The two simplest operators for this are
- .I ^
- and
- .I $ .
- If the first character of an expression is
- .I ^ ,
- the expression will only be matched at the beginning
- of a line (after a newline character, or at the beginning of
- the input stream).
- This can never conflict with the other meaning of
- .I ^ ,
- complementation
- of character classes, since that only applies within
- the [\|] operators.
- If the very last character is
- .I $ ,
- the expression will only be matched at the end of a line (when
- immediately followed by newline).
- The latter operator is a special case of the
- .I /
- operator character,
- which indicates trailing context.
- The expression
- .TS
- center;
- l.
- ab/cd
- .TE
- matches the string
- .I ab ,
- but only if followed by
- .ul
- cd.
- Thus
- .TS
- center;
- l.
- ab$
- .TE
- is the same as
- .TS
- center;
- l.
- ab/\en
- .TE
- Left context is handled in Lex by
- .I
- start conditions
- .R
- as explained in section 10. If a rule is only to be executed
- when the Lex automaton interpreter is in start condition
- .I
- x,
- .R
- the rule should be prefixed by
- .TS
- center;
- l.
- <x>
- .TE
- using the angle bracket operator characters.
- If we considered ``being at the beginning of a line'' to be
- start condition
- .I ONE ,
- then the ^ operator
- would be equivalent to
- .TS
- center;
- l.
- <ONE>
- .TE
- Start conditions are explained more fully later.
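The trailing-context rule \fIab/cd\fR can be sketched in C (invented names; the generated automaton handles this with states, not string comparison): the rule matches \fIab\fR only when \fIcd\fR follows, and the characters of the context are left on the input to be matched again.

```c
#include <string.h>
#include <assert.h>

/* Sketch of the rule ab/cd: succeed only when "ab" is immediately
 * followed by "cd", but consume just the "ab"; *next is left at
 * the start of the trailing context. */
int match_ab_before_cd(const char *s, const char **next)
{
    if (strncmp(s, "abcd", 4) == 0) {
        *next = s + 2;              /* consume only "ab" */
        return 1;
    }
    return 0;
}
```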
- .PP
- .I
- Repetitions and Definitions.
- .R
- The operators {} specify
- either repetitions (if they enclose numbers)
- or
- definition expansion (if they enclose a name). For example
- .TS
- center;
- l.
- {digit}
- .TE
- looks for a predefined string named
- .I digit
- and inserts it
- at that point in the expression.
- The definitions are given in the first part of the Lex
- input, before the rules.
- In contrast,
- .TS
- center;
- l.
- a{1,5}
- .TE
- looks for 1 to 5 occurrences of
- .I a .
- .PP
- Finally, initial
- .I %
- is special, being the separator
- for Lex source segments.
- .2C
- .NH
- Lex Actions.
- .PP
- When an expression written as above is matched, Lex
- executes the corresponding action. This section describes
- some features of Lex which aid in writing actions. Note
- that there is a default action, which
- consists of copying the input to the output. This
- is performed on all strings not otherwise matched. Thus
- the Lex user who wishes to absorb the entire input, without
- producing any output, must provide rules to match everything.
- When Lex is being used with Yacc, this is the normal
- situation.
- One may consider that actions are what is done instead of
- copying the input to the output; thus, in general,
- a rule which merely copies can be omitted.
- Also, a character combination
- which is omitted from the rules
- and which appears as input
- is likely to be printed on the output, thus calling
- attention to the gap in the rules.
- .PP
- One of the simplest things that can be done is to ignore
- the input. Specifying a C null statement, \fI;\fR, as an action
- causes this result. A frequent rule is
- .TS
- center;
- l l.
- [ \et\en] ;
- .TE
- which causes the three spacing characters (blank, tab, and newline)
- to be ignored.
- .PP
- Another easy way to avoid writing actions is the action character
- |, which indicates that the action for this rule is the action
- for the next rule.
- The previous example could also have been written
- .TS
- center;
- l l.
- " " |
- "\et" |
- "\en" ;
- .TE
- with the same result, although in different style.
- The quotes around \en and \et are not required.
- .PP
- In more complex actions, the user
- will
- often want to know the actual text that matched some expression
- like
- .I [a\(miz]+ .
- Lex leaves this text in an external character
- array named
- .I
- yytext.
- .R
- Thus, to print the name found,
- a rule like
- .TS
- center;
- l l.
- [a\-z]+ printf("%s", yytext);
- .TE
- will print
- the string in
- .I
- yytext.
- .R
- The C function
- .I
- printf
- .R
- accepts a format argument and data to be printed;
- in this case, the format is ``print string'' (% indicating
- data conversion, and
- .I s
- indicating string type),
- and the data are the characters
- in
- .I
- yytext.
- .R
- So this just places
- the matched string
- on the output.
- This action
- is so common that
- it may be written as ECHO:
- .TS
- center;
- l l.
- [a\-z]+ ECHO;
- .TE
- is the same as the above.
- Since the default action is just to
- print the characters found, one might ask why
- give a rule, like this one, which merely specifies
- the default action?
- Such rules are often required
- to avoid matching some other rule
- which is not desired. For example, if there is a rule
- which matches
- .I read
- it will normally match the instances of
- .I read
- contained in
- .I bread
- or
- .I readjust ;
- to avoid
- this,
- a rule
- of the form
- .I [a\(miz]+
- is needed.
- This is explained further below.
- .PP
- Sometimes it is more convenient to know the end of what
- has been found; hence Lex also provides a count
- .I
- yyleng
- .R
- of the number of characters matched.
- To count both the number
- of words and the number of characters in words in the input, the user might write
- .TS
- center;
- l l.
- [a\-zA\-Z]+ {words++; chars += yyleng;}
- .TE
- which accumulates in
- .ul
- chars
- the number
- of characters in the words recognized.
- The last character in the string matched can
- be accessed by
- .TS
- center;
- l.
- yytext[yyleng\-1]
- .TE
- .PP
- Occasionally, a Lex
- action may decide that a rule has not recognized the correct
- span of characters.
- Two routines are provided to aid with this situation.
- First,
- .I
- yymore()
- .R
- can be called to indicate that the next input expression recognized is to be
- tacked on to the end of this input. Normally,
- the next input string would overwrite the current
- entry in
- .I
- yytext.
- .R
- Second,
- .I
- yyless (n)
- .R
- may be called to indicate that not all the characters matched
- by the currently successful expression are wanted right now.
- The argument
- .I
- n
- .R
- indicates the number of characters
- in
- .I
- yytext
- .R
- to be retained.
- Further characters previously matched
- are
- returned to the input. This provides the same sort of
- look~ahead offered by the / operator,
- but in a different form.
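The essential behavior of \fIyyless(n)\fR can be modeled in a few lines of C. This is only a sketch of the idea, not the Lex library internals: the match buffer and pushback stack are modeled as ordinary globals here.

```c
#include <string.h>
#include <assert.h>

char yytext[100];       /* current match, as in the Lex convention */
int  yyleng;
char pushed[100];       /* characters returned to the input */
int  npushed;

/* Keep the first n characters of the current match in yytext and
 * return the rest to the input, most recently matched first. */
void yyless(int n)
{
    while (yyleng > n)
        pushed[npushed++] = yytext[--yyleng];
    yytext[yyleng] = '\0';
}
```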
- .PP
- .I
- Example:
- .R
- Consider a language which defines
- a string as a set of characters between quotation (") marks, and provides that
- to include a " in a string it must be preceded by a \e. The
- regular expression which matches that is somewhat confusing,
- so that it might be preferable to write
- .TS
- center;
- l l.
- \e"[^"]\(** {
- if (yytext[yyleng\-1] == \(fm\e\e\(fm)
- yymore();
- else
- ... normal user processing
- }
- .TE
- which will, when faced with a string such as
- .I
- "abc\e"def\|"
- .R
- first match
- the five characters
- \fI"abc\e\|\fR;
- then
- the call to
- .I yymore()
- will
- cause the next part of the string,
- \fI"def\|\fR,
- to be tacked on the end.
- Note that the final quote terminating the string should be picked
- up in the code labeled ``normal processing''.
- .PP
- The function
- .I
- yyless()
- .R
- might be used to reprocess
- text in various circumstances. Consider the C problem of distinguishing
- the ambiguity of ``=\(mia''.
- Suppose it is desired to treat this as ``=\(mi a''
- but print a message. A rule might be
- .TS
- center;
- l l.
- =\(mi[a\-zA\-Z] {
- printf("Operator (=\(mi) ambiguous\en");
- yyless(yyleng\-1);
- ... action for =\(mi ...
- }
- .TE
- which prints a message, returns the letter after the
- operator to the input stream, and treats the operator as ``=\(mi''.
- Alternatively it might be desired to treat this as ``= \(mia''.
- To do this, just return the minus
- sign as well as the letter to the input:
- .TS
- center;
- l l.
- =\(mi[a\-zA\-Z] {
- printf("Operator (=\(mi) ambiguous\en");
- yyless(yyleng\-2);
- ... action for = ...
- }
- .TE
- will perform the other interpretation.
- Note that the expressions for the two cases might more easily
- be written
- .TS
- center;
- l l.
- =\(mi/[A\-Za\-z]
- .TE
- in the first case and
- .TS
- center;
- l.
- =/\-[A\-Za\-z]
- .TE
- in the second;
- no backup would be required in the rule action.
- It is not necessary to recognize the whole identifier
- to observe the ambiguity.
- The
- possibility of ``=\(mi3'', however, makes
- .TS
- center;
- l.
- =\(mi/[^ \et\en]
- .TE
- a still better rule.
- .PP
- In addition to these routines, Lex also permits
- access to the I/O routines
- it uses.
- They are:
- .IP 1)
- .I
- input()
- .R
- which returns the next input character;
- .IP 2)
- .I
- output(c)
- .R
- which writes the character
- .I
- c
- .R
- on the output; and
- .IP 3)
- .I
- unput(c)
- .R
- pushes the character
- .I
- c
- .R
- back onto the input stream to be read later by
- .I
- input().
- .R
- .LP
- By default these routines are provided as macro definitions,
- but the user can override them and supply private versions.
- These routines
- define the relationship between external files and
- internal characters, and must all be retained
- or modified consistently.
- They may be redefined, to
- cause input or output to be transmitted to or from strange
- places, including other programs or internal memory;
- but the character set used must be consistent in all routines;
- a value of zero returned by
- .I
- input
- .R
- must mean end of file; and
- the relationship between
- .I
- unput
- .R
- and
- .I
- input
- .R
- must be retained
- or the Lex look~ahead will not work.
- Lex does not look ahead at all if it does not have to,
- but every rule ending in
- .ft I
- + \(** ?
- .ft R
- or
- .ft I
- $
- .ft R
- or containing
- .ft I
- /
- .ft R
- implies look~ahead.
- Look~ahead is also necessary to match an expression that is a prefix
- of another expression.
- See below for a discussion of the character set used by Lex.
- The standard Lex library imposes
- a 100 character limit on backup.
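The required relationship between \fIinput\fR and \fIunput\fR can be sketched as follows, with a string standing in for the input file (the buffer names are invented; the library versions read the real input and enforce the 100 character pushback limit mentioned above):

```c
#include <assert.h>

#define YYLMAX 100
const char *yysrc;          /* remaining input; a string models the file */
char yysbuf[YYLMAX];        /* pushback stack */
int  yysptr;

/* input() drains the pushback stack before reading new text,
 * so characters given to unput() are reread in LIFO order. */
int input(void)
{
    if (yysptr > 0)
        return yysbuf[--yysptr];
    return *yysrc ? *yysrc++ : 0;   /* 0 signals end of file */
}

void unput(int c)
{
    yysbuf[yysptr++] = (char)c;
}
```

Any private versions the user supplies must preserve exactly this pairing, or the look~ahead machinery fails.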
- .PP
- Another Lex library routine that the user will sometimes want
- to redefine is
- .I
- yywrap()
- .R
- which is called whenever Lex reaches an end-of-file.
- If
- .I
- yywrap
- .R
- returns a 1, Lex continues with the normal wrapup on end of input.
- Sometimes, however, it is convenient to arrange for more
- input to arrive
- from a new source.
- In this case, the user should provide
- a
- .I
- yywrap
- .R
- which
- arranges for new input and
- returns 0. This instructs Lex to continue processing.
- The default
- .I
- yywrap
- .R
- always returns 1.
- .PP
- This routine is also a convenient place
- to print tables, summaries, etc. at the end
- of a program. Note that it is not
- possible to write a normal rule which recognizes
- end-of-file; the only access to this condition is
- through
- .I
- yywrap.
- .R
- In fact, unless a private version of
- .I
- input()
- .R
- is supplied
- a file containing nulls
- cannot be handled,
- since a value of 0 returned by
- .I
- input
- .R
- is taken to be end-of-file.
- .PP
- .2C
- .NH
- Ambiguous Source Rules.
- .PP
- Lex can handle ambiguous specifications.
- When more than one expression can match the
- current input, Lex chooses as follows:
- .IP 1)
- The longest match is preferred.
- .IP 2)
- Among rules which matched the same number of characters,
- the rule given first is preferred.
- .LP
- Thus, suppose the rules
- .TS
- center;
- l l.
- integer keyword action ...;
- [a\-z]+ identifier action ...;
- .TE
- to be given in that order. If the input is
- .I integers ,
- it is taken as an identifier, because
- .I [a\-z]+
- matches 8 characters while
- .I integer
- matches only 7.
- If the input is
- .I integer ,
- both rules match 7 characters, and
- the keyword rule is selected because it was given first.
- Anything shorter (e.g. \fIint\fR\|) will
- not match the expression
- .I integer
- and so the identifier interpretation is used.
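The disambiguation between these two rules can be written out by hand in C (invented names; a sketch of the rule, not of the generated tables): compute each rule's match length, prefer the longer, and break ties in favor of the rule given first.

```c
#include <ctype.h>
#include <string.h>
#include <assert.h>

/* Returns 0 for the keyword rule "integer", 1 for the identifier
 * rule [a-z]+, with the match length in *len.  Longest match wins;
 * on equal lengths the earlier rule (the keyword) wins. */
int classify(const char *s, int *len)
{
    int id = 0;
    while (islower((unsigned char)s[id]))
        id++;                               /* [a-z]+ match length */
    int kw = (strncmp(s, "integer", 7) == 0) ? 7 : 0;
    if (kw > id) { *len = kw; return 0; }
    if (id > kw) { *len = id; return 1; }
    *len = kw;                              /* equal: first rule wins */
    return kw ? 0 : 1;
}
```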
- .PP
- The principle of preferring the longest
- match makes rules containing
- expressions like
- .I \&.\(**
- dangerous.
- For example,
- .TS
- center;
- l.
- \&\(fm.\(**\(fm
- .TE
- might seem a good way of recognizing
- a string in single quotes.
- But it is an invitation for the program to read far
- ahead, looking for a distant
- single quote.
- Presented with the input
- .TS
- center;
- l l.
- \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
- .TE
- the above expression will match
- .TS
- center;
- l l.
- \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
- .TE
- which is probably not what was wanted.
- A better rule is of the form
- .TS
- center;
- l.
- \&\(fm[^\(fm\en]\(**\(fm
- .TE
- which, on the above input, will stop
- after
- .I \(fmfirst\(fm .
- The consequences
- of errors like this are mitigated by the fact
- that the
- .I \&.
- operator will not match newline.
- Thus expressions like
- .I \&.\(**
- stop on the
- current line.
- Don't try to defeat this with expressions like
- .I [.\en]+
- or
- equivalents;
- the Lex generated program will try to read
- the entire input file, causing
- internal buffer overflows.
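What the safer quoted-string rule computes can be sketched in C (the function name is invented): the match stops at the first closing quote, and an unterminated string on the current line simply fails to match.

```c
#include <assert.h>

/* Sketch of the rule '[^'\n]*': an opening quote, any characters
 * other than quote or newline, and a closing quote.  Returns the
 * match length including both quotes, or 0 on failure. */
int quoted_len(const char *s)
{
    int i = 1;
    if (s[0] != '\'')
        return 0;
    while (s[i] && s[i] != '\'' && s[i] != '\n')
        i++;
    return s[i] == '\'' ? i + 1 : 0;
}
```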
- .PP
- Note that Lex is normally partitioning
- the input stream, not searching for all possible matches
- of each expression.
- This means that each character is accounted for
- once and only once.
- For example, suppose it is desired to
- count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
- Some Lex rules to do this might be
- .TS
- center;
- l l.
- she s++;
- he h++;
- \en |
- \&. ;
- .TE
- where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
- Remember that . does not include newline.
- Since \fIshe\fR includes \fIhe\fR, Lex will normally
- .I
- not
- .R
- recognize
- the instances of \fIhe\fR included in \fIshe\fR,
- since once it has passed a \fIshe\fR those characters are gone.
- .PP
- Sometimes the user would like to override this choice. The action
- REJECT
- means ``go do the next alternative.''
- It causes whatever rule was second choice after the current
- rule to be executed.
- The position of the input pointer is adjusted accordingly.
- Suppose the user really wants to count the included instances of \fIhe\fR:
- .TS
- center;
- l l.
- she {s++; REJECT;}
- he {h++; REJECT;}
- \en |
- \&. ;
- .TE
- these rules are one way of changing the previous example
- to do just that.
- After counting each expression, it is rejected; whenever appropriate,
- the other expression will then be counted. In this example, of course,
- the user could note that \fIshe\fR includes \fIhe\fR but not
- vice versa, and omit the REJECT action on \fIhe\fR;
- in other cases, however, it
- would not be possible a priori to tell
- which input characters
- were in both classes.
- .PP
- Consider the two rules
- .TS
- center;
- l l.
- a[bc]+ { ... ; REJECT;}
- a[cd]+ { ... ; REJECT;}
- .TE
- If the input is
- .I ab ,
- only the first rule matches,
- and on
- .I ad
- only the second matches.
- The input string
- .I accb
- matches the first rule for four characters
- and then the second rule for three characters.
- In contrast, the input
- .I accd
- agrees with
- the second rule for four characters and then the first
- rule for three.
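- The match lengths quoted above can be checked by hand with a small
- C sketch (ours, not part of Lex) that computes the longest prefix of
- a string matched by a pattern of the form a[\fIset\fR]+:

```c
#include <string.h>

/* Hypothetical helper: longest prefix of s matching a[set]+ ,
   i.e. an 'a' followed by one or more characters drawn from set.
   Returns the match length, or 0 if there is no match. */
static int match_a_set(const char *s, const char *set)
{
    if (s[0] != 'a')
        return 0;
    int n = 1;
    while (s[n] && strchr(set, s[n]) != NULL)
        n++;
    return n >= 2 ? n : 0; /* the + requires at least one set character */
}
```

- On \fIaccb\fR the first pattern matches four characters and the second
- three; on \fIaccd\fR the lengths are reversed, as stated in the text.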
- .PP
- In general, REJECT is useful whenever
- the purpose of Lex is not to partition the input
- stream but to detect all examples of some items
- in the input, and the instances of these items
- may overlap or include each other.
- Suppose a digram table of the input is desired;
- normally the digrams overlap, that is the word
- .I the
- is considered to contain
- both
- .I th
- and
- .I he .
- Assuming a two-dimensional array named
- .ul
- digram
- to be incremented, the appropriate
- source is
- .TS
- center;
- l l.
- %%
- [a\-z][a\-z] {digram[yytext[0]][yytext[1]]++; REJECT;}
- . ;
- \en ;
- .TE
- where the REJECT is necessary to pick up
- a letter pair beginning at every character, rather than at every
- other character.
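- What the REJECT-based rule computes can be written directly in C;
- the sketch below (ours, for illustration) advances one character at a
- time, so overlapping pairs are counted, and \fIthe\fR contributes both
- \fIth\fR and \fIhe\fR:

```c
#include <ctype.h>
#include <string.h>

/* Overlapping digram count, as the REJECT version of the rule produces:
   a pair is tallied at every character position, not every other one. */
static int digram[26][26];

static void count_digrams(const char *s)
{
    memset(digram, 0, sizeof digram);
    for (; s[0] && s[1]; s++) /* advance one character at a time */
        if (islower((unsigned char)s[0]) && islower((unsigned char)s[1]))
            digram[s[0] - 'a'][s[1] - 'a']++;
}
```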
- .2C
- .NH
- Lex Source Definitions.
- .PP
- Remember the format of the Lex
- source:
- .TS
- center;
- l.
- {definitions}
- %%
- {rules}
- %%
- {user routines}
- .TE
- So far only the rules have been described. The user needs
- additional options,
- though, to define variables for use in his program and for use
- by Lex.
- These can go either in the definitions section
- or in the rules section.
- .PP
- Remember that Lex is turning the rules into a program.
- Any source not intercepted by Lex is copied
- into the generated program. There are three classes
- of such things.
- .IP 1)
- Any line which is not part of a Lex rule or action
- which begins with a blank or tab is copied into
- the Lex generated program.
- Such source input prior to the first %% delimiter will be external
- to any function in the code; if it appears immediately after the first
- %%,
- it appears in an appropriate place for declarations
- in the function written by Lex which contains the actions.
- This material must look like program fragments,
- and should precede the first Lex rule.
- .IP
- As a side effect of the above, lines which begin with a blank
- or tab, and which contain a comment,
- are passed through to the generated program.
- This can be used to include comments in either the Lex source or
- the generated code. The comments should follow the host
- language convention.
- .IP 2)
- Anything included between lines containing
- only
- .I %{
- and
- .I %}
- is
- copied out as above. The delimiters are discarded.
- This format permits entering text like preprocessor statements that
- must begin in column 1,
- or copying lines that do not look like programs.
- .IP 3)
- Anything after the second %% delimiter, regardless of formats, etc.,
- is copied out after the Lex output.
- .PP
- Definitions intended for Lex are given
- before the first %% delimiter. Any line in this section
- not contained between %{ and %}, and beginning
- in column 1, is assumed to define Lex substitution strings.
- The format of such lines is
- .TS
- center;
- l l.
- name translation
- .TE
- and it
- causes the string given as a translation to
- be associated with the name.
- The name and translation
- must be separated by at least one blank or tab, and the name must begin with a letter.
- The translation can then be called out
- by the {name} syntax in a rule.
- Using {D} for the digits and {E} for an exponent field,
- for example, might abbreviate rules to recognize numbers:
- .TS
- center;
- l l.
- D [0\-9]
- E [DEde][\-+]?{D}+
- %%
- {D}+ printf("integer");
- {D}+"."{D}\(**({E})? |
- {D}\(**"."{D}+({E})? |
- {D}+{E} printf("real");
- .TE
- Note the first two rules for real numbers;
- both require a decimal point and contain
- an optional exponent field,
- but the first requires at least one digit before the
- decimal point and the second requires at least one
- digit after the decimal point.
- To correctly handle the problem
- posed by a Fortran expression such as
- .I 35.EQ.I ,
- which does not contain a real number, a context-sensitive
- rule such as
- .TS
- center;
- l l.
- [0\-9]+/"."EQ printf("integer");
- .TE
- could be used in addition to the normal rule for integers.
- .PP
- The definitions
- section may also contain other commands, including the
- selection of a host language, a character set table,
- a list of start conditions, or adjustments to the default
- size of arrays within Lex itself for larger source programs.
- These possibilities
- are discussed below under ``Summary of Source Format,''
- section 12.
- .2C
- .NH
- Usage.
- .PP
- There are two steps in
- compiling a Lex source program.
- First, the Lex source must be turned into a generated program
- in the host general purpose language.
- Then this program must be compiled and loaded, usually with
- a library of Lex subroutines.
- The generated program
- is on a file named
- .I lex.yy.c .
- The I/O library is defined in terms of the C standard
- library [6].
- .PP
- The C programs generated by Lex are slightly different
- on OS/370, because the
- OS compiler is less powerful than the UNIX or GCOS compilers,
- and does less at compile time.
- C programs generated on GCOS and UNIX are the same.
- .PP
- .I
- UNIX.
- .R
- The library is accessed by the loader flag
- .I \-ll .
- So an appropriate
- set of commands is
- .KS
- .in 5
- lex source
- cc lex.yy.c \-ll
- .in 0
- .KE
- The resulting program is placed on the usual file
- .I
- a.out
- .R
- for later execution.
- To use Lex with Yacc see below.
- Although the default Lex I/O routines use the C standard library,
- the Lex automata themselves do not do so;
- if private versions of
- .I
- input,
- output
- .R
- and
- .I unput
- are given, the library can be avoided.
- .PP
- .2C
- .NH
- Lex and Yacc.
- .PP
- If you want to use Lex with Yacc, note that what Lex writes is a program
- named
- .I
- yylex(),
- .R
- the name required by Yacc for its analyzer.
- Normally, the default main program on the Lex library
- calls this routine, but if Yacc is loaded, and its main
- program is used, Yacc will call
- .I
- yylex().
- .R
- In this case each Lex rule should end with
- .TS
- center;
- l.
- return(token);
- .TE
- where the appropriate token value is returned.
- An easy way to get access
- to Yacc's names for tokens is to
- compile the Lex output file as part of
- the Yacc output file by placing the line
- .TS
- center;
- l.
- # include "lex.yy.c"
- .TE
- in the last section of Yacc input.
- Supposing the grammar to be
- named ``good'' and the lexical rules to be named ``better''
- the UNIX command sequence can just be:
- .TS
- center;
- l.
- yacc good
- lex better
- cc y.tab.c \-ly \-ll
- .TE
- The Yacc library (\-ly) should be loaded before the Lex library,
- to obtain a main program which invokes the Yacc parser.
- The generations of Lex and Yacc programs can be done in
- either order.
- .2C
- .NH
- Examples.
- .PP
- As a trivial problem, consider copying an input file while
- adding 3 to every positive number divisible by 7.
- Here is a suitable Lex source program
- .TS
- center;
- l l.
- %%
- int k;
- [0\-9]+ {
- k = atoi(yytext);
- if (k%7 == 0)
- printf("%d", k+3);
- else
- printf("%d",k);
- }
- .TE
- to do just that.
- The rule [0\-9]+ recognizes strings of digits;
- .I
- atoi
- .R
- converts the digits to binary
- and stores the result in
- .ul
- k.
- The operator % (remainder) is used to check whether
- .ul
- k
- is divisible by 7; if it is,
- it is incremented by 3 as it is written out.
- It may be objected that this program will alter such
- input items as
- .I 49.63
- or
- .I X7 .
- Furthermore, it increments the absolute value
- of all negative numbers divisible by 7.
- To avoid this, just add a few more rules after the active one,
- as here:
- .TS
- center;
- l l.
- %%
- int k;
- \-?[0\-9]+ {
- k = atoi(yytext);
- printf("%d", k%7 == 0 ? k+3 : k);
- }
- \-?[0\-9.]+ ECHO;
- [A-Za-z][A-Za-z0-9]+ ECHO;
- .TE
- Numerical strings containing
- a ``.'' or preceded by a letter will be picked up by
- one of the last two rules, and not changed.
- The
- .I if\-else
- has been replaced by
- a C conditional expression to save space;
- the form
- .ul
- a?b:c
- means ``if
- .I a
- then
- .I b
- else
- .I c ''.
- .PP
- For an example of statistics gathering, here
- is a program which histograms the lengths
- of words, where a word is defined as a string of letters.
- .TS
- center;
- l l.
- int lengs[100];
- %%
- [a\-z]+ lengs[yyleng]++;
- \&. |
- \en ;
- %%
- .T&
- l s.
- yywrap()
- {
- int i;
- printf("Length No. words\en");
- for(i=0; i<100; i++)
- if (lengs[i] > 0)
- printf("%5d%10d\en",i,lengs[i]);
- return(1);
- }
- .TE
- This program
- accumulates the histogram, while producing no output. At the end
- of the input it prints the table.
- The final statement
- .I
- return(1);
- .R
- indicates that Lex is to perform wrapup. If
- .I
- yywrap
- .R
- returns zero (false)
- it implies that further input is available
- and the program is
- to continue reading and processing.
- Providing a
- .I
- yywrap
- .R
- that never
- returns true causes an infinite loop.
- .PP
- As a larger example,
- here are some parts of a program written by N. L. Schryer
- to convert double precision Fortran to single precision Fortran.
- Because Fortran does not distinguish upper and lower case letters,
- this routine begins by defining a set of classes including
- both cases of each letter:
- .TS
- center;
- l l.
- a [aA]
- b [bB]
- c [cC]
- \&...
- z [zZ]
- .TE
- An additional class recognizes white space:
- .TS
- center;
- l l.
- W [ \et]\(**
- .TE
- The first rule changes
- ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
- .TS
- center;
- l.
- {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
- printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
- }
- .TE
- Care is taken throughout this program to preserve the case
- (upper or lower)
- of the original program.
- The conditional operator is used to
- select the proper form of the keyword.
- The next rule copies continuation card indications to
- avoid confusing them with constants:
- .TS
- center;
- l l.
- ^"     "[^ 0] ECHO;
- .TE
- In the regular expression, the quotes surround the
- blanks.
- It is interpreted as
- ``beginning of line, then five blanks, then
- anything but blank or zero.''
- Note the two different meanings of
- .I ^ .
- There follow some rules to change double precision
- constants to ordinary floating constants.
- .TS
- center;
- l.
- [0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |
- [0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |
- "."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
- /\(** convert constants \(**/
- for(p=yytext; \(**p != 0; p++)
- if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
- \(**p=+ \(fme\(fm\- \(fmd\(fm;
- ECHO;
- }
- .TE
- After the floating point constant is recognized, it is
- scanned by the
- .ul
- for
- loop
- to find the letter
- .I d
- or
- .I D .
- The program then adds
- .I \(fme\(fm\-\(fmd\(fm ,
- which converts
- it to the next letter of the alphabet.
- The modified constant, now single-precision,
- is written out again.
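- The conversion rests on the fact that in ASCII
- \fI\(fme\(fm\-\(fmd\(fm\fR is 1, and equally
- \fI\(fmE\(fm\-\(fmD\(fm\fR, so one addition handles both cases.
- A C sketch of the loop (ours; the archaic ``=+'' of the paper is
- written ``+='' in modern C):

```c
/* Sketch of the constant conversion: adding 'e'-'d' (which is 1 in
   ASCII) turns 'd' into 'e' and 'D' into 'E' in place. */
static void to_single(char *p)
{
    for (; *p != 0; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';
}
```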
- There follow a series of names which must be respelled to remove
- their initial \fId\fR.
- By using the
- array
- .I
- yytext
- .R
- the same action suffices for all the names (only a sample of
- a rather long list is given here).
- .TS
- center;
- l l.
- {d}{s}{i}{n} |
- {d}{c}{o}{s} |
- {d}{s}{q}{r}{t} |
- {d}{a}{t}{a}{n} |
- \&...
- {d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
- .TE
- Another list of names must have initial \fId\fR changed to initial \fIa\fR:
- .TS
- center;
- l l.
- {d}{l}{o}{g} |
- {d}{l}{o}{g}10 |
- {d}{m}{i}{n}1 |
- {d}{m}{a}{x}1 {
- yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
- ECHO;
- }
- .TE
- And one routine
- must have initial \fId\fR changed to initial \fIr\fR:
- .TS
- center;
- l l.
- {d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
- ECHO;
- }
- .TE
- To avoid such names as \fIdsinx\fR being detected as instances
- of \fIdsin\fR, some final rules pick up longer words as identifiers
- and copy some surviving characters:
- .TS
- center;
- l l.
- [A\-Za\-z][A\-Za\-z0\-9]\(** |
- [0\-9]+ |
- \en |
- \&. ECHO;
- .TE
- Note that this program is not complete; it
- does not deal with the spacing problems in Fortran or
- with the use of keywords as identifiers.
- .br
- .2C
- .NH
- Left Context Sensitivity.
- .PP
- Sometimes
- it is desirable to have several sets of lexical rules
- to be applied at different times in the input.
- For example, a compiler preprocessor might distinguish
- preprocessor statements and analyze them differently
- from ordinary statements.
- This requires
- sensitivity
- to prior context, and there are several ways of handling
- such problems.
- The \fI^\fR operator, for example, is a prior context operator,
- recognizing immediately preceding left context just as \fI$\fR recognizes
- immediately following right context.
- Adjacent left context could be extended, to produce a facility similar to
- that for adjacent right context, but it is unlikely
- to be as useful, since often the relevant left context
- appeared some time earlier, such as at the beginning of a line.
- .PP
- This section describes three means of dealing
- with different environments: a simple use of flags,
- when only a few rules change from one environment to another,
- the use of
- .I
- start conditions
- .R
- on rules,
- and the possibility of making multiple lexical analyzers all run
- together.
- In each case, there are rules which recognize the need to change the
- environment in which the
- following input text is analyzed, and set some parameter
- to reflect the change. This may be a flag explicitly tested by
- the user's action code; such a flag is the simplest way of dealing
- with the problem, since Lex is not involved at all.
- It may be more convenient,
- however,
- to have Lex remember the flags as initial conditions on the rules.
- Any rule may be associated with a start condition. It will only
- be recognized when Lex is in
- that start condition.
- The current start condition may be changed at any time.
- Finally, if the sets of rules for the different environments
- are very dissimilar,
- clarity may be best achieved by writing several distinct lexical
- analyzers, and switching from one to another as desired.
- .PP
- Consider the following problem: copy the input to the output,
- changing the word \fImagic\fR to \fIfirst\fR on every line which began
- with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
- which began with the letter \fIb\fR, and changing
- \fImagic\fR to \fIthird\fR on every line which began
- with the letter \fIc\fR. All other words and all other lines
- are left unchanged.
- .PP
- These rules are so simple that the easiest way
- to do this job is with a flag:
- .TS
- center;
- l l.
- int flag;
- %%
- ^a {flag = \(fma\(fm; ECHO;}
- ^b {flag = \(fmb\(fm; ECHO;}
- ^c {flag = \(fmc\(fm; ECHO;}
- \en {flag = 0 ; ECHO;}
- magic {
- switch (flag)
- {
- case \(fma\(fm: printf("first"); break;
- case \(fmb\(fm: printf("second"); break;
- case \(fmc\(fm: printf("third"); break;
- default: ECHO; break;
- }
- }
- .TE
- should be adequate.
- .PP
- To handle the same problem with start conditions, each
- start condition must be introduced to Lex in the definitions section
- with a line reading
- .TS
- center;
- l l.
- %Start name1 name2 ...
- .TE
- where the conditions may be named in any order.
- The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
- The conditions may be referenced at the
- head of a rule with the <> brackets:
- .TS
- center;
- l.
- <name1>expression
- .TE
- is a rule which is only recognized when Lex is in the
- start condition \fIname1\fR.
- To enter a start condition,
- execute the action statement
- .TS
- center;
- l.
- BEGIN name1;
- .TE
- which changes the start condition to \fIname1\fR.
- To resume the normal state,
- .TS
- center;
- l.
- BEGIN 0;
- .TE
- resets the initial condition
- of the Lex automaton interpreter.
- A rule may be active in several
- start conditions:
- .TS
- center;
- l.
- <name1,name2,name3>
- .TE
- is a legal prefix. Any rule not beginning with the
- <> prefix operator is always active.
- .PP
- The same example as before can be written:
- .TS
- center;
- l l.
- %START AA BB CC
- %%
- ^a {ECHO; BEGIN AA;}
- ^b {ECHO; BEGIN BB;}
- ^c {ECHO; BEGIN CC;}
- \en {ECHO; BEGIN 0;}
- <AA>magic printf("first");
- <BB>magic printf("second");
- <CC>magic printf("third");
- .TE
- where the logic is exactly the same as in the previous
- method of handling the problem, but Lex does the work
- rather than the user's code.
- .2C
- .NH
- Character Set.
- .PP
- The programs generated by Lex handle
- character I/O only through the routines
- .I
- input,
- output,
- .R
- and
- .I
- unput.
- .R
- Thus the character representation
- provided in these routines
- is accepted by Lex and employed to return
- values in
- .I
- yytext.
- .R
- For internal use
- a character is represented as a small integer
- which, if the standard library is used,
- has a value equal to the integer value of the bit
- pattern representing the character on the host computer.
- Normally, the letter
- .I a
- is represented as the same form as the character constant
- .I \(fma\(fm .
- If this interpretation is changed, by providing I/O
- routines which translate the characters,
- Lex must be told about
- it, by giving a translation table.
- This table must be in the definitions section,
- and must be bracketed by lines containing only
- ``%T''.
- The table contains lines of the form
- .TS
- center;
- l.
- {integer} {character string}
- .TE
- which indicate the value associated with each character.
- Thus the next example
- .GS 2
- .TS
- center;
- l l.
- %T
- 1 Aa
- 2 Bb
- \&...
- 26 Zz
- 27 \en
- 28 +
- 29 \-
- 30 0
- 31 1
- \&...
- 39 9
- %T
- .TE
- .sp
- .ce 1
- Sample character table.
- .GE
- maps the lower and upper case letters together into the integers 1 through 26,
- newline into 27, + and \- into 28 and 29, and the
- digits into 30 through 39.
- Note the escape for newline.
- If a table is supplied, every character that is to appear either
- in the rules or in any valid input must be included
- in the table.
- No character
- may be assigned the number 0, and no character may be
- assigned a bigger number than the size of the hardware character set.
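- The mapping defined by the sample table can be expressed as a C
- function (ours, purely illustrative; Lex builds the equivalent table
- internally from the %T section):

```c
#include <ctype.h>

/* Hypothetical lookup matching the sample %T table: both cases of a
   letter map to 1..26, newline to 27, '+' to 28, '-' to 29, and the
   digits to 30..39.  0 means "not in the table". */
static int xlate(int c)
{
    if (isalpha(c))
        return tolower(c) - 'a' + 1;
    if (c == '\n')
        return 27;
    if (c == '+')
        return 28;
    if (c == '-')
        return 29;
    if (isdigit(c))
        return 30 + (c - '0');
    return 0;
}
```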
- .2C
- .NH
- Summary of Source Format.
- .PP
- The general form of a Lex source file is:
- .TS
- center;
- l.
- {definitions}
- %%
- {rules}
- %%
- {user subroutines}
- .TE
- The definitions section contains
- a combination of
- .IP 1)
- Definitions, in the form ``name space translation''.
- .IP 2)
- Included code, in the form ``space code''.
- .IP 3)
- Included code, in the form
- .TS
- center;
- l.
- %{
- code
- %}
- .TE
- .ns
- .IP 4)
- Start conditions, given in the form
- .TS
- center;
- l.
- %S name1 name2 ...
- .TE
- .ns
- .IP 5)
- Character set tables, in the form
- .TS
- center;
- l.
- %T
- number space character-string
- \&...
- %T
- .TE
- .ns
- .IP 6)
- Changes to internal array sizes, in the form
- .TS
- center;
- l.
- %\fIx\fR\0\0\fInnn\fR
- .TE
- where \fInnn\fR is a decimal integer representing an array size
- and \fIx\fR selects the parameter as follows:
- .TS
- center;
- c c
- c l.
- Letter Parameter
- p positions
- n states
- e tree nodes
- a transitions
- k packed character classes
- o output array size
- .TE
- .LP
- Lines in the rules section have the form ``expression action''
- where the action may be continued on succeeding
- lines by using braces to delimit it.
- .PP
- Regular expressions in Lex use the following
- operators:
- .br
- .TS
- center;
- l l.
- x the character "x"
- "x" an "x", even if x is an operator.
- \ex an "x", even if x is an operator.
- [xy] the character x or y.
- [x\-z] the characters x, y or z.
- [^x] any character but x.
- \&. any character but newline.
- ^x an x at the beginning of a line.
- <y>x an x when Lex is in start condition y.
- x$ an x at the end of a line.
- x? an optional x.
- x\(** 0,1,2, ... instances of x.
- x+ 1,2,3, ... instances of x.
- x|y an x or a y.
- (x) an x.
- x/y an x but only if followed by y.
- {xx} the translation of xx from the definitions section.
- x{m,n} \fIm\fR through \fIn\fR occurrences of x.
- .TE
- .NH
- Caveats and Bugs.
- .PP
- There are pathological expressions which
- produce exponential growth of the tables when
- converted to deterministic machines;
- fortunately, they are rare.
- .PP
- REJECT does not rescan the input; instead it remembers the results of the previous
- scan. This means that if a rule with trailing context is found, and
- REJECT executed, the user
- must not have used
- .ul
- unput
- to change the characters forthcoming
- from the input stream.
- This is the only restriction on the user's ability to manipulate
- the not-yet-processed input.
- .PP
- .2C
- .NH
- Acknowledgments.
- .PP
- As should
- be obvious from the above, the outside of Lex
- is patterned
- on Yacc and the inside on Aho's string matching routines.
- Therefore, both S. C. Johnson and A. V. Aho
- are really originators
- of much of Lex,
- as well as debuggers of it.
- Many thanks are due to both.
- .PP
- The code of the current version of Lex was designed, written,
- and debugged by Eric Schmidt.
- .SG MH-1274-MEL-unix
- .sp 1
- .2C
- .NH
- References.
- .SP 1v
- .IP 1.
- B. W. Kernighan and D. M. Ritchie,
- .I
- The C Programming Language,
- .R
- Prentice-Hall, N. J. (1978).
- .IP 2.
- B. W. Kernighan,
- .I
- Ratfor: A Preprocessor for a Rational Fortran,
- .R
- Software \- Practice and Experience,
- \fB5\fR, pp. 395-406 (1975).
- .IP 3.
- S. C. Johnson,
- .I
- Yacc: Yet Another Compiler Compiler,
- .R
- Computing Science Technical Report No. 32,
- 1975,
- .MH
- .if \n(tm (also TM 75-1273-6)
- .IP 4.
- A. V. Aho and M. J. Corasick,
- .I
- Efficient String Matching: An Aid to Bibliographic Search,
- .R
- Comm. ACM
- .B
- 18,
- .R
- 333-340 (1975).
- .IP 5.
- B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
- .I
- QED Text Editor,
- .R
- Computing Science Technical Report No. 5,
- 1972,
- .MH
- .IP 6.
- D. M. Ritchie,
- private communication.
- See also
- M. E. Lesk,
- .I
- The Portable C Library,
- .R
- Computing Science Technical Report No. 31,
- .MH
- .if \n(tm (also TM 75-1274-11)
-