Usenet 1994 January

home *** CD-ROM | disk | FTP | other *** search

/ Usenet 1994 January / usenetsourcesnewsgroupsinfomagicjanuary1994.iso / sources / unix / volume23 / flex2.3 / part08 / flexdoc.1.02 next >

Wrap

Text File | 1990-10-10 | 15.2 KB | 580 lines

Another area where the user can increase a scanner's performance (and one that's easier to implement) arises from the fact that the longer the tokens matched, the faster the scanner will run. This is because with long tokens the processing of most input characters takes place in the (short) inner scanning loop, and does not often have to go through the additional work of setting up the scanning environment (e.g., .B yytext) for the action. Recall the scanner for C comments: .nf %x comment %% int line_num = 1; "/*" BEGIN(comment); <comment>[^*\\n]* <comment>"*"+[^*/\\n]* <comment>\\n ++line_num; <comment>"*"+"/" BEGIN(INITIAL); .fi This could be sped up by writing it as: .nf %x comment %% int line_num = 1; "/*" BEGIN(comment); <comment>[^*\\n]* <comment>[^*\\n]*\\n ++line_num; <comment>"*"+[^*/\\n]* <comment>"*"+[^*/\\n]*\\n ++line_num; <comment>"*"+"/" BEGIN(INITIAL); .fi Now instead of each newline requiring the processing of another action, recognizing the newlines is "distributed" over the other rules to keep the matched text as long as possible. Note that .I adding rules does .I not slow down the scanner! The speed of the scanner is independent of the number of rules or (modulo the considerations given at the beginning of this section) how complicated the rules are with regard to operators such as '*' and '|'. .LP A final example in speeding up a scanner: suppose you want to scan through a file containing identifiers and keywords, one per line and with no other extraneous characters, and recognize all the keywords. A natural first approach is: .nf %% asm | auto | break | ... etc ... volatile | while /* it's a keyword */ .|\\n /* it's not a keyword */ .fi To eliminate the back-tracking, introduce a catch-all rule: .nf %% asm | auto | break | ... etc ... volatile | while /* it's a keyword */ [a-z]+ | .|\\n /* it's not a keyword */ .fi Now, if it's guaranteed that there's exactly one word per line, then we can reduce the total number of matches by a half by merging in the recognition of newlines with that of the other tokens: .nf %% asm\\n | auto\\n | break\\n | ... etc ... volatile\\n | while\\n /* it's a keyword */ [a-z]+\\n | .|\\n /* it's not a keyword */ .fi One has to be careful here, as we have now reintroduced backtracking into the scanner. In particular, while .I we know that there will never be any characters in the input stream other than letters or newlines, .I flex can't figure this out, and it will plan for possibly needing backtracking when it has scanned a token like "auto" and then the next character is something other than a newline or a letter. Previously it would then just match the "auto" rule and be done, but now it has no "auto" rule, only a "auto\\n" rule. To eliminate the possibility of backtracking, we could either duplicate all rules but without final newlines, or, since we never expect to encounter such an input and therefore don't how it's classified, we can introduce one more catch-all rule, this one which doesn't include a newline: .nf %% asm\\n | auto\\n | break\\n | ... etc ... volatile\\n | while\\n /* it's a keyword */ [a-z]+\\n | [a-z]+ | .|\\n /* it's not a keyword */ .fi Compiled with .B -Cf, this is about as fast as one can get a .I flex scanner to go for this particular problem. .LP A final note: .I flex is slow when matching NUL's, particularly when a token contains multiple NUL's. It's best to write rules which match .I short amounts of text if it's anticipated that the text will often include NUL's. .SH INCOMPATIBILITIES WITH LEX AND POSIX .I flex is a rewrite of the Unix .I lex tool (the two implementations do not share any code, though), with some extensions and incompatibilities, both of which are of concern to those who wish to write scanners acceptable to either implementation. At present, the POSIX .I lex draft is very close to the original .I lex implementation, so some of these incompatibilities are also in conflict with the POSIX draft. But the intent is that except as noted below, .I flex as it presently stands will ultimately be POSIX conformant (i.e., that those areas of conflict with the POSIX draft will be resolved in .I flex's favor). Please bear in mind that all the comments which follow are with regard to the POSIX .I draft standard of Summer 1989, and not the final document (or subsequent drafts); they are included so .I flex users can be aware of the standardization issues and those areas where .I flex may in the near future undergo changes incompatible with its current definition. .LP .I flex is fully compatible with .I lex with the following exceptions: .IP - The undocumented .I lex scanner internal variable .B yylineno is not supported. It is difficult to support this option efficiently, since it requires examining every character scanned and reexamining the characters when the scanner backs up. Things get more complicated when the end of buffer or file is reached or a NUL is scanned (since the scan must then be restarted with the proper line number count), or the user uses the yyless(), unput(), or REJECT actions, or the multiple input buffer functions. .IP The fix is to add rules which, upon seeing a newline, increment yylineno. This is usually an easy process, though it can be a drag if some of the patterns can match multiple newlines along with other characters. .IP yylineno is not part of the POSIX draft. .IP - The .B input() routine is not redefinable, though it may be called to read characters following whatever has been matched by a rule. If .B input() encounters an end-of-file the normal .B yywrap() processing is done. A ``real'' end-of-file is returned by .B input() as .I EOF. .IP Input is instead controlled by redefining the .B YY_INPUT macro. .IP The .I flex restriction that .B input() cannot be redefined is in accordance with the POSIX draft, but .B YY_INPUT has not yet been accepted into the draft (and probably won't; it looks like the draft will simply not specify any way of controlling the scanner's input other than by making an initial assignment to .I yyin). .IP - .I flex scanners do not use stdio for input. Because of this, when writing an interactive scanner one must explicitly call fflush() on the stream associated with the terminal after writing out a prompt. With .I lex such writes are automatically flushed since .I lex scanners use .B getchar() for their input. Also, when writing interactive scanners with .I flex, the .B -I flag must be used. .IP - .I flex scanners are not as reentrant as .I lex scanners. In particular, if you have an interactive scanner and an interrupt handler which long-jumps out of the scanner, and the scanner is subsequently called again, you may get the following message: .nf fatal flex scanner internal error--end of buffer missed .fi To reenter the scanner, first use .nf yyrestart( yyin ); .fi .IP - .B output() is not supported. Output from the .B ECHO macro is done to the file-pointer .I yyout (default .I stdout). .IP The POSIX draft mentions that an .B output() routine exists but currently gives no details as to what it does. .IP - .I lex does not support exclusive start conditions (%x), though they are in the current POSIX draft. .IP - When definitions are expanded, .I flex encloses them in parentheses. With lex, the following: .nf NAME [A-Z][A-Z0-9]* %% foo{NAME}? printf( "Found it\\n" ); %% .fi will not match the string "foo" because when the macro is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" and the precedence is such that the '?' is associated with "[A-Z0-9]*". With .I flex, the rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match. Note that because of this, the .B ^, $, <s>, /, and .B <<EOF>> operators cannot be used in a .I flex definition. .IP The POSIX draft interpretation is the same as .I flex's. .IP - To specify a character class which matches anything but a left bracket (']'), in .I lex one can use "[^]]" but with .I flex one must use "[^\\]]". The latter works with .I lex, too. .IP - The .I lex .B %r (generate a Ratfor scanner) option is not supported. It is not part of the POSIX draft. .IP - If you are providing your own yywrap() routine, you must include a "#undef yywrap" in the definitions section (section 1). Note that the "#undef" will have to be enclosed in %{}'s. .IP The POSIX draft specifies that yywrap() is a function and this is very unlikely to change; so .I flex users are warned that .B yywrap() is likely to be changed to a function in the near future. .IP - After a call to .B unput(), .I yytext and .I yyleng are undefined until the next token is matched. This is not the case with .I lex or the present POSIX draft. .IP - The precedence of the .B {} (numeric range) operator is different. .I lex interprets "abc{1,3}" as "match one, two, or three occurrences of 'abc'", whereas .I flex interprets it as "match 'ab' followed by one, two, or three occurrences of 'c'". The latter is in agreement with the current POSIX draft. .IP - The precedence of the .B ^ operator is different. .I lex interprets "^foo|bar" as "match either 'foo' at the beginning of a line, or 'bar' anywhere", whereas .I flex interprets it as "match either 'foo' or 'bar' if they come at the beginning of a line". The latter is in agreement with the current POSIX draft. .IP - To refer to yytext outside of the scanner source file, the correct definition with .I flex is "extern char *yytext" rather than "extern char yytext[]". This is contrary to the current POSIX draft but a point on which .I flex will not be changing, as the array representation entails a serious performance penalty. It is hoped that the POSIX draft will be emended to support the .I flex variety of declaration (as this is a fairly painless change to require of .I lex users). .IP - .I yyin is .I initialized by .I lex to be .I stdin; .I flex, on the other hand, initializes .I yyin to NULL and then .I assigns it to .I stdin the first time the scanner is called, providing .I yyin has not already been assigned to a non-NULL value. The difference is subtle, but the net effect is that with .I flex scanners, .I yyin does not have a valid value until the scanner has been called. .IP - The special table-size declarations such as .B %a supported by .I lex are not required by .I flex scanners; .I flex ignores them. .IP - The name .bd FLEX_SCANNER is #define'd so scanners may be written for use with either .I flex or .I lex. .LP The following .I flex features are not included in .I lex or the POSIX draft standard: .nf yyterminate() <<EOF>> YY_DECL #line directives %{}'s around actions yyrestart() comments beginning with '#' (deprecated) multiple actions on a line .fi This last feature refers to the fact that with .I flex you can put multiple actions on the same line, separated with semi-colons, while with .I lex, the following .nf foo handle_foo(); ++num_foos_seen; .fi is (rather surprisingly) truncated to .nf foo handle_foo(); .fi .I flex does not truncate the action. Actions that are not enclosed in braces are simply terminated at the end of the line. .SH DIAGNOSTICS .I reject_used_but_not_detected undefined or .I yymore_used_but_not_detected undefined - These errors can occur at compile time. They indicate that the scanner uses .B REJECT or .B yymore() but that .I flex failed to notice the fact, meaning that .I flex scanned the first two sections looking for occurrences of these actions and failed to find any, but somehow you snuck some in (via a #include file, for example). Make an explicit reference to the action in your .I flex input file. (Note that previously .I flex supported a .B %used/%unused mechanism for dealing with this problem; this feature is still supported but now deprecated, and will go away soon unless the author hears from people who can argue compellingly that they need it.) .LP .I flex scanner jammed - a scanner compiled with .B -s has encountered an input string which wasn't matched by any of its rules. .LP .I flex input buffer overflowed - a scanner rule matched a string long enough to overflow the scanner's internal input buffer (16K bytes by default - controlled by .B YY_BUF_SIZE in "flex.skel". Note that to redefine this macro, you must first .B #undefine it). .LP .I scanner requires -8 flag - Your scanner specification includes recognizing 8-bit characters and you did not specify the -8 flag (and your site has not installed flex with -8 as the default). .LP .I fatal flex scanner internal error--end of buffer missed - This can occur in an scanner which is reentered after a long-jump has jumped out (or over) the scanner's activation frame. Before reentering the scanner, use: .nf yyrestart( yyin ); .fi .LP .I too many %t classes! - You managed to put every single character into its own %t class. .I flex requires that at least one of the classes share characters. .SH DEFICIENCIES / BUGS See flex(1). .SH "SEE ALSO" .LP flex(1), lex(1), yacc(1), sed(1), awk(1). .LP M. E. Lesk and E. Schmidt, .I LEX - Lexical Analyzer Generator .SH AUTHOR Vern Paxson, with the help of many ideas and much inspiration from Van Jacobson. Original version by Jef Poskanzer. The fast table representation is a partial implementation of a design done by Van Jacobson. The implementation was done by Kevin Gong and Vern Paxson. .LP Thanks to the many .I flex beta-testers, feedbackers, and contributors, especially Casey Leedom, benson@odi.com, Keith Bostic, Frederic Brehm, Nick Christopher, Jason Coughlin, Scott David Daniels, Leo Eskin, Chris Faylor, Eric Goldman, Eric Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht, Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt, Jef Poskanzer, Jim Roskind, Dave Tallman, Frank Whaley, Ken Yap, and those whose names have slipped my marginal mail-archiving skills but whose contributions are appreciated all the same. .LP Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob Mulcahy, Rich Salz, and Richard Stallman for help with various distribution headaches. .LP Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to Benson Margulies and Fred Burke for C++ support; to Ove Ewerlid for the basics of support for NUL's; and to Eric Hughes for the basics of support for multiple buffers. .LP Work is being done on extending .I flex to generate scanners in which the state machine is directly represented in C code rather than tables. These scanners may well be substantially faster than those generated using -f or -F. If you are working in this area and are interested in comparing notes and seeing whether redundant work can be avoided, contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE). .LP This work was primarily done when I was at the Real Time Systems Group at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there for the support I received. .LP Send comments to: .nf Vern Paxson Computer Science Department 4126 Upson Hall Cornell University Ithaca, NY 14853-7501 vern@cs.cornell.edu decvax!cornell!vern .fi