home *** CD-ROM | disk | FTP | other *** search
- Another area where the user can increase a scanner's performance
- (and one that's easier to implement) arises from the fact that
- the longer the tokens matched, the faster the scanner will run.
- This is because with long tokens the processing of most input
- characters takes place in the (short) inner scanning loop, and
- does not often have to go through the additional work of setting up
- the scanning environment (e.g.,
- .B yytext)
- for the action. Recall the scanner for C comments:
- .nf
-
- %x comment
- %%
- int line_num = 1;
-
- "/*" BEGIN(comment);
-
- <comment>[^*\\n]*
- <comment>"*"+[^*/\\n]*
- <comment>\\n ++line_num;
- <comment>"*"+"/" BEGIN(INITIAL);
-
- .fi
- This could be sped up by writing it as:
- .nf
-
- %x comment
- %%
- int line_num = 1;
-
- "/*" BEGIN(comment);
-
- <comment>[^*\\n]*
- <comment>[^*\\n]*\\n ++line_num;
- <comment>"*"+[^*/\\n]*
- <comment>"*"+[^*/\\n]*\\n ++line_num;
- <comment>"*"+"/" BEGIN(INITIAL);
-
- .fi
- Now instead of each newline requiring the processing of another
- action, recognizing the newlines is "distributed" over the other rules
- to keep the matched text as long as possible. Note that
- .I adding
- rules does
- .I not
- slow down the scanner! The speed of the scanner is independent
- of the number of rules or (modulo the considerations given at the
- beginning of this section) how complicated the rules are with
- regard to operators such as '*' and '|'.
- .LP
- A final example in speeding up a scanner: suppose you want to scan
- through a file containing identifiers and keywords, one per line
- and with no other extraneous characters, and recognize all the
- keywords. A natural first approach is:
- .nf
-
- %%
- asm |
- auto |
- break |
- ... etc ...
- volatile |
- while /* it's a keyword */
-
- .|\\n /* it's not a keyword */
-
- .fi
- To eliminate the back-tracking, introduce a catch-all rule:
- .nf
-
- %%
- asm |
- auto |
- break |
- ... etc ...
- volatile |
- while /* it's a keyword */
-
- [a-z]+ |
- .|\\n /* it's not a keyword */
-
- .fi
- Now, if it's guaranteed that there's exactly one word per line,
- then we can reduce the total number of matches by a half by
- merging in the recognition of newlines with that of the other
- tokens:
- .nf
-
- %%
- asm\\n |
- auto\\n |
- break\\n |
- ... etc ...
- volatile\\n |
- while\\n /* it's a keyword */
-
- [a-z]+\\n |
- .|\\n /* it's not a keyword */
-
- .fi
- One has to be careful here, as we have now reintroduced backtracking
- into the scanner. In particular, while
- .I we
- know that there will never be any characters in the input stream
- other than letters or newlines,
- .I flex
- can't figure this out, and it will plan for possibly needing backtracking
- when it has scanned a token like "auto" and then the next character
- is something other than a newline or a letter. Previously it would
- then just match the "auto" rule and be done, but now it has no "auto"
- rule, only a "auto\\n" rule. To eliminate the possibility of backtracking,
- we could either duplicate all rules but without final newlines, or,
- since we never expect to encounter such an input and therefore don't
- how it's classified, we can introduce one more catch-all rule, this
- one which doesn't include a newline:
- .nf
-
- %%
- asm\\n |
- auto\\n |
- break\\n |
- ... etc ...
- volatile\\n |
- while\\n /* it's a keyword */
-
- [a-z]+\\n |
- [a-z]+ |
- .|\\n /* it's not a keyword */
-
- .fi
- Compiled with
- .B -Cf,
- this is about as fast as one can get a
- .I flex
- scanner to go for this particular problem.
- .LP
- A final note:
- .I flex
- is slow when matching NUL's, particularly when a token contains
- multiple NUL's.
- It's best to write rules which match
- .I short
- amounts of text if it's anticipated that the text will often include NUL's.
- .SH INCOMPATIBILITIES WITH LEX AND POSIX
- .I flex
- is a rewrite of the Unix
- .I lex
- tool (the two implementations do not share any code, though),
- with some extensions and incompatibilities, both of which
- are of concern to those who wish to write scanners acceptable
- to either implementation. At present, the POSIX
- .I lex
- draft is
- very close to the original
- .I lex
- implementation, so some of these
- incompatibilities are also in conflict with the POSIX draft. But
- the intent is that except as noted below,
- .I flex
- as it presently stands will
- ultimately be POSIX conformant (i.e., that those areas of conflict with
- the POSIX draft will be resolved in
- .I flex's
- favor). Please bear in
- mind that all the comments which follow are with regard to the POSIX
- .I draft
- standard of Summer 1989, and not the final document (or subsequent
- drafts); they are included so
- .I flex
- users can be aware of the standardization issues and those areas where
- .I flex
- may in the near future undergo changes incompatible with
- its current definition.
- .LP
- .I flex
- is fully compatible with
- .I lex
- with the following exceptions:
- .IP -
- The undocumented
- .I lex
- scanner internal variable
- .B yylineno
- is not supported. It is difficult to support this option efficiently,
- since it requires examining every character scanned and reexamining
- the characters when the scanner backs up.
- Things get more complicated when the end of buffer or file is reached or a
- NUL is scanned (since the scan must then be restarted with the proper line
- number count), or the user uses the yyless(), unput(), or REJECT actions,
- or the multiple input buffer functions.
- .IP
- The fix is to add rules which, upon seeing a newline, increment
- yylineno. This is usually an easy process, though it can be a drag if some
- of the patterns can match multiple newlines along with other characters.
- .IP
- yylineno is not part of the POSIX draft.
- .IP -
- The
- .B input()
- routine is not redefinable, though it may be called to read characters
- following whatever has been matched by a rule. If
- .B input()
- encounters an end-of-file the normal
- .B yywrap()
- processing is done. A ``real'' end-of-file is returned by
- .B input()
- as
- .I EOF.
- .IP
- Input is instead controlled by redefining the
- .B YY_INPUT
- macro.
- .IP
- The
- .I flex
- restriction that
- .B input()
- cannot be redefined is in accordance with the POSIX draft, but
- .B YY_INPUT
- has not yet been accepted into the draft (and probably won't; it looks
- like the draft will simply not specify any way of controlling the
- scanner's input other than by making an initial assignment to
- .I yyin).
- .IP -
- .I flex
- scanners do not use stdio for input. Because of this, when writing an
- interactive scanner one must explicitly call fflush() on the
- stream associated with the terminal after writing out a prompt.
- With
- .I lex
- such writes are automatically flushed since
- .I lex
- scanners use
- .B getchar()
- for their input. Also, when writing interactive scanners with
- .I flex,
- the
- .B -I
- flag must be used.
- .IP -
- .I flex
- scanners are not as reentrant as
- .I lex
- scanners. In particular, if you have an interactive scanner and
- an interrupt handler which long-jumps out of the scanner, and
- the scanner is subsequently called again, you may get the following
- message:
- .nf
-
- fatal flex scanner internal error--end of buffer missed
-
- .fi
- To reenter the scanner, first use
- .nf
-
- yyrestart( yyin );
-
- .fi
- .IP -
- .B output()
- is not supported.
- Output from the
- .B ECHO
- macro is done to the file-pointer
- .I yyout
- (default
- .I stdout).
- .IP
- The POSIX draft mentions that an
- .B output()
- routine exists but currently gives no details as to what it does.
- .IP -
- .I lex
- does not support exclusive start conditions (%x), though they
- are in the current POSIX draft.
- .IP -
- When definitions are expanded,
- .I flex
- encloses them in parentheses.
- With lex, the following:
- .nf
-
- NAME [A-Z][A-Z0-9]*
- %%
- foo{NAME}? printf( "Found it\\n" );
- %%
-
- .fi
- will not match the string "foo" because when the macro
- is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
- and the precedence is such that the '?' is associated with
- "[A-Z0-9]*". With
- .I flex,
- the rule will be expanded to
- "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
- Note that because of this, the
- .B ^, $, <s>, /,
- and
- .B <<EOF>>
- operators cannot be used in a
- .I flex
- definition.
- .IP
- The POSIX draft interpretation is the same as
- .I flex's.
- .IP -
- To specify a character class which matches anything but a left bracket (']'),
- in
- .I lex
- one can use "[^]]" but with
- .I flex
- one must use "[^\\]]". The latter works with
- .I lex,
- too.
- .IP -
- The
- .I lex
- .B %r
- (generate a Ratfor scanner) option is not supported. It is not part
- of the POSIX draft.
- .IP -
- If you are providing your own yywrap() routine, you must include a
- "#undef yywrap" in the definitions section (section 1). Note that
- the "#undef" will have to be enclosed in %{}'s.
- .IP
- The POSIX draft
- specifies that yywrap() is a function and this is very unlikely to change; so
- .I flex users are warned
- that
- .B yywrap()
- is likely to be changed to a function in the near future.
- .IP -
- After a call to
- .B unput(),
- .I yytext
- and
- .I yyleng
- are undefined until the next token is matched. This is not the case with
- .I lex
- or the present POSIX draft.
- .IP -
- The precedence of the
- .B {}
- (numeric range) operator is different.
- .I lex
- interprets "abc{1,3}" as "match one, two, or
- three occurrences of 'abc'", whereas
- .I flex
- interprets it as "match 'ab'
- followed by one, two, or three occurrences of 'c'". The latter is
- in agreement with the current POSIX draft.
- .IP -
- The precedence of the
- .B ^
- operator is different.
- .I lex
- interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
- or 'bar' anywhere", whereas
- .I flex
- interprets it as "match either 'foo' or 'bar' if they come at the beginning
- of a line". The latter is in agreement with the current POSIX draft.
- .IP -
- To refer to yytext outside of the scanner source file,
- the correct definition with
- .I flex
- is "extern char *yytext" rather than "extern char yytext[]".
- This is contrary to the current POSIX draft but a point on which
- .I flex
- will not be changing, as the array representation entails a
- serious performance penalty. It is hoped that the POSIX draft will
- be emended to support the
- .I flex
- variety of declaration (as this is a fairly painless change to
- require of
- .I lex
- users).
- .IP -
- .I yyin
- is
- .I initialized
- by
- .I lex
- to be
- .I stdin;
- .I flex,
- on the other hand,
- initializes
- .I yyin
- to NULL
- and then
- .I assigns
- it to
- .I stdin
- the first time the scanner is called, providing
- .I yyin
- has not already been assigned to a non-NULL value. The difference is
- subtle, but the net effect is that with
- .I flex
- scanners,
- .I yyin
- does not have a valid value until the scanner has been called.
- .IP -
- The special table-size declarations such as
- .B %a
- supported by
- .I lex
- are not required by
- .I flex
- scanners;
- .I flex
- ignores them.
- .IP -
- The name
- .bd
- FLEX_SCANNER
- is #define'd so scanners may be written for use with either
- .I flex
- or
- .I lex.
- .LP
- The following
- .I flex
- features are not included in
- .I lex
- or the POSIX draft standard:
- .nf
-
- yyterminate()
- <<EOF>>
- YY_DECL
- #line directives
- %{}'s around actions
- yyrestart()
- comments beginning with '#' (deprecated)
- multiple actions on a line
-
- .fi
- This last feature refers to the fact that with
- .I flex
- you can put multiple actions on the same line, separated with
- semi-colons, while with
- .I lex,
- the following
- .nf
-
- foo handle_foo(); ++num_foos_seen;
-
- .fi
- is (rather surprisingly) truncated to
- .nf
-
- foo handle_foo();
-
- .fi
- .I flex
- does not truncate the action. Actions that are not enclosed in
- braces are simply terminated at the end of the line.
- .SH DIAGNOSTICS
- .I reject_used_but_not_detected undefined
- or
- .I yymore_used_but_not_detected undefined -
- These errors can occur at compile time. They indicate that the
- scanner uses
- .B REJECT
- or
- .B yymore()
- but that
- .I flex
- failed to notice the fact, meaning that
- .I flex
- scanned the first two sections looking for occurrences of these actions
- and failed to find any, but somehow you snuck some in (via a #include
- file, for example). Make an explicit reference to the action in your
- .I flex
- input file. (Note that previously
- .I flex
- supported a
- .B %used/%unused
- mechanism for dealing with this problem; this feature is still supported
- but now deprecated, and will go away soon unless the author hears from
- people who can argue compellingly that they need it.)
- .LP
- .I flex scanner jammed -
- a scanner compiled with
- .B -s
- has encountered an input string which wasn't matched by
- any of its rules.
- .LP
- .I flex input buffer overflowed -
- a scanner rule matched a string long enough to overflow the
- scanner's internal input buffer (16K bytes by default - controlled by
- .B YY_BUF_SIZE
- in "flex.skel". Note that to redefine this macro, you must first
- .B #undefine
- it).
- .LP
- .I scanner requires -8 flag -
- Your scanner specification includes recognizing 8-bit characters and
- you did not specify the -8 flag (and your site has not installed flex
- with -8 as the default).
- .LP
- .I
- fatal flex scanner internal error--end of buffer missed -
- This can occur in an scanner which is reentered after a long-jump
- has jumped out (or over) the scanner's activation frame. Before
- reentering the scanner, use:
- .nf
-
- yyrestart( yyin );
-
- .fi
- .LP
- .I too many %t classes! -
- You managed to put every single character into its own %t class.
- .I flex
- requires that at least one of the classes share characters.
- .SH DEFICIENCIES / BUGS
- See flex(1).
- .SH "SEE ALSO"
- .LP
- flex(1), lex(1), yacc(1), sed(1), awk(1).
- .LP
- M. E. Lesk and E. Schmidt,
- .I LEX - Lexical Analyzer Generator
- .SH AUTHOR
- Vern Paxson, with the help of many ideas and much inspiration from
- Van Jacobson. Original version by Jef Poskanzer. The fast table
- representation is a partial implementation of a design done by Van
- Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
- .LP
- Thanks to the many
- .I flex
- beta-testers, feedbackers, and contributors, especially Casey
- Leedom, benson@odi.com, Keith Bostic,
- Frederic Brehm, Nick Christopher, Jason Coughlin,
- Scott David Daniels, Leo Eskin,
- Chris Faylor, Eric Goldman, Eric
- Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
- Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
- Jef Poskanzer, Jim Roskind,
- Dave Tallman, Frank Whaley, Ken Yap, and those whose names
- have slipped my marginal mail-archiving skills but whose contributions
- are appreciated all the same.
- .LP
- Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
- Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
- headaches.
- .LP
- Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
- to Benson Margulies and Fred
- Burke for C++ support; to Ove Ewerlid for the basics of support for
- NUL's; and to Eric Hughes for the basics of support for multiple buffers.
- .LP
- Work is being done on extending
- .I flex
- to generate scanners in which the
- state machine is directly represented in C code rather than tables.
- These scanners may well be substantially faster than those generated
- using -f or -F. If you are working in this area and are interested
- in comparing notes and seeing whether redundant work can be avoided,
- contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE).
- .LP
- This work was primarily done when I was at the Real Time Systems Group
- at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
- for the support I received.
- .LP
- Send comments to:
- .nf
-
- Vern Paxson
- Computer Science Department
- 4126 Upson Hall
- Cornell University
- Ithaca, NY 14853-7501
-
- vern@cs.cornell.edu
- decvax!cornell!vern
-
- .fi
-