- .SH
- 3: Lexical Analysis
- .PP
- The user must supply a lexical analyzer to read the input stream and communicate tokens
- (with values, if desired) to the parser.
- The lexical analyzer is an integer-valued function called
- .I yylex .
- The function returns an integer, the
- .I "token number" ,
- representing the kind of token read.
- If there is a value associated with that token, it should be assigned
- to the external variable
- .I yylval .
- .PP
- The parser and the lexical analyzer must agree on these token numbers in order for
- communication between them to take place.
- The numbers may be chosen by Yacc, or chosen by the user.
- In either case, the ``#define'' mechanism of C is used to allow the lexical analyzer
- to return these numbers symbolically.
- For example, suppose that the token name DIGIT has been defined in the declarations section of the
- Yacc specification file.
- The relevant portion of the lexical analyzer might look like:
- .DS
- yylex(){
-     extern int yylval;
-     int c;
-     . . .
-     c = getchar();
-     . . .
-     switch( c ) {
-         . . .
-     case \'0\':
-     case \'1\':
-         . . .
-     case \'9\':
-         yylval = c\-\'0\';
-         return( DIGIT );
-         . . .
-     }
-     . . .
- .DE
- .PP
- The intent is to return a token number of DIGIT, and a value equal to the numerical value of the
- digit.
- Provided that the lexical analyzer code is placed in the programs section of the specification file,
- the identifier DIGIT will be defined as the token number associated
- with the token DIGIT.
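- .PP
- As a minimal sketch of how the fragment above might be completed,
- the following function returns DIGIT with the value of the digit left in yylval.
- DIGIT is defined by hand here, an assumption made so that the code compiles on its own
- (Yacc would normally supply the definition).
- The input cursor src and the helper set_input are hypothetical stand-ins
- for reading characters with getchar.

```c
/* Normally produced by Yacc from the declaration "%token DIGIT";
   defined by hand here (assumption) so the sketch stands alone. */
#define DIGIT 257

int yylval;                     /* value of the most recent token */

/* Hypothetical input source: a string instead of standard input. */
static const char *src = "";

void set_input(const char *s)
{
    src = s;
}

int yylex(void)
{
    int c;

    /* Skip blanks, tabs and newlines between tokens. */
    while (*src == ' ' || *src == '\t' || *src == '\n')
        src++;

    if (*src == '\0')
        return 0;               /* end of input */

    c = (unsigned char)*src++;
    if (c >= '0' && c <= '9') {
        yylval = c - '0';       /* numerical value of the digit */
        return DIGIT;           /* token number, returned symbolically */
    }
    return c;                   /* any other character stands for itself */
}
```

- With the input ``7 +'', successive calls return DIGIT (with yylval set to 7),
- then the character + itself, and then 0.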
- .PP
- This mechanism leads to clear,
- easily modified lexical analyzers.
- The only pitfall is the need to avoid token names in the grammar
- that are reserved or significant in C or in the parser.
- For example, the token names
- .I if
- or
- .I while
- will almost certainly cause severe
- difficulties when the lexical analyzer is compiled.
- The token name
- .I error
- is reserved for error handling, and should not be used naively
- (see Section 7).
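- .PP
- One common way to sidestep this pitfall, sketched here under stated assumptions,
- is to give keyword tokens upper-case names that cannot collide with C keywords.
- The token names IF, WHILE and IDENT, their numbers, and the helper keyword_token
- are all hypothetical; Yacc would normally define the numbers itself.

```c
#include <string.h>

/* Assumed token numbers; Yacc would normally define these from a
   declaration such as "%token IF WHILE IDENT". */
#define IF    257
#define WHILE 258
#define IDENT 259

/* Maps an already-scanned word to its token number.
   The keywords of the input language are spelled in lower case,
   while the token names are upper case and so are safe to #define. */
int keyword_token(const char *word)
{
    if (strcmp(word, "if") == 0)
        return IF;
    if (strcmp(word, "while") == 0)
        return WHILE;
    return IDENT;               /* anything else is an ordinary identifier */
}
```

- The lexical analyzer would call such a helper after collecting the characters of a word,
- and return its result as the token number.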
- .PP
- As mentioned above, the token numbers may be chosen by Yacc or by the user.
- In the default situation, the numbers are chosen by Yacc.
- The default token number for a literal
- character is the numerical value of the character in the local character set.
- Other names are assigned token numbers
- starting at 257.
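- .PP
- The default numbering can be illustrated with a short sketch.
- The token names NAME and NUMBER are hypothetical, and the assumption is made
- that Yacc hands out numbers consecutively after 257 and that characters are 8 bits wide.

```c
/* Assumed result of "%token NAME NUMBER" under the default numbering:
   the first name gets 257; consecutive numbering after that is an
   assumption of this sketch. */
#define NAME   257
#define NUMBER 258

/* Returns 1 when tok lies in the range occupied by literal characters
   (assumption: an 8-bit character set), 0 otherwise. */
int is_literal_token(int tok)
{
    return tok > 0 && tok <= 255;
}
```

- Because a literal such as \'+\' keeps its character code as its token number,
- it can never collide with a named token numbered 257 or above.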
- .PP
- To assign a token number to a token (including literals),
- the first appearance of the token name or literal
- .I
- in the declarations section
- .R
- can be immediately followed by
- a nonnegative integer.
- This integer is taken to be the token number of the name or literal.
- Names and literals not defined by this mechanism retain their default definition.
- It is important that all token numbers be distinct.
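- .PP
- When token numbers are chosen by hand, a small check can help catch accidental duplicates.
- The following is only a sketch: the token names DIGIT and NAME, their numbers,
- and the helper tokens_distinct are hypothetical, and the numbers are assumed to come
- from declarations such as ``%token DIGIT 300'' and ``%token NAME 301''.

```c
#include <stddef.h>

/* Assumed user-chosen token numbers, as they would appear to the
   lexical analyzer after "%token DIGIT 300" and "%token NAME 301". */
#define DIGIT 300
#define NAME  301

/* Returns 1 when every token number in toks[0..n-1] is distinct,
   0 as soon as two entries are equal. */
int tokens_distinct(const int *toks, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (toks[i] == toks[j])
                return 0;
    return 1;
}
```

- Such a check could be run once at startup over the full list of token numbers,
- literals included.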
- .PP
- For historical reasons, the endmarker must have a token
- number that is 0 or negative.
- This token number cannot be redefined by the user; thus, all
- lexical analyzers should be prepared to return 0 or a negative number as the token number
- upon reaching the end of their input.
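- .PP
- The endmarker convention might be sketched as follows.
- The input cursor, the helpers set_input and next_char, and the driver count_tokens
- are hypothetical; next_char stands in for getchar, and count_tokens only imitates
- the way a parser stops reading at the endmarker.

```c
#include <stdio.h>              /* for EOF */

/* Hypothetical input source: a string instead of standard input. */
static const char *cursor = "";

void set_input(const char *s)
{
    cursor = s;
}

/* Stand-in for getchar(): yields EOF once the string is exhausted. */
static int next_char(void)
{
    return *cursor ? (unsigned char)*cursor++ : EOF;
}

int yylex(void)
{
    int c = next_char();

    if (c == EOF)
        return 0;               /* endmarker: token number 0 */
    return c;                   /* every character is its own token here */
}

/* Counts the tokens seen before the endmarker. */
int count_tokens(void)
{
    int n = 0;

    while (yylex() > 0)         /* 0 or negative means end of input */
        n++;
    return n;
}
```

- In this sketch, further calls to yylex after the end of the input keep returning 0.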
- .PP
- A very useful tool for constructing lexical analyzers is
- the
- .I Lex
- program developed by Mike Lesk.
- .[
- Lesk Lex
- .]
- These lexical analyzers are designed to work in close
- harmony with Yacc parsers.
- The specifications for these lexical analyzers
- use regular expressions instead of grammar rules.
- Lex can easily be used to produce quite complicated lexical analyzers,
- but there remain some languages (such as FORTRAN) that do not
- fit any theoretical framework, and whose lexical analyzers
- must be crafted by hand.
-