home *** CD-ROM | disk | FTP | other *** search
Text File | 1990-01-02 | 20.1 KB | 717 lines | [TEXT/KAHL] |
- .TH FLEX 1 "20 June 1989" "Version 2.1"
- .SH NAME
- flex - fast lexical analyzer generator
- .SH SYNOPSIS
- .B flex
- [
- .B -bdfipstvFILT -c[efmF] -Sskeleton_file
- ] [
- .I filename
- ]
- .SH DESCRIPTION
- .I flex
- is a rewrite of
- .I lex
- intended to right some of that tool's deficiencies: in particular,
- .I flex
- generates lexical analyzers much faster, and the analyzers use
- smaller tables and run faster.
- .SH OPTIONS
- In addition to lex's
- .B -t
- flag, flex has the following options:
- .TP
- .B -b
- Generate backtracking information to
- .I lex.backtrack.
- This is a list of scanner states which require backtracking
- and the input characters on which they do so. By adding rules one
- can remove backtracking states. If all backtracking states
- are eliminated and
- .B -f
- or
- .B -F
- is used, the generated scanner will run faster (see the
- .B -p
- flag). Only users who wish to squeeze every last cycle out of their
- scanners need worry about this option.
- .TP
- .B -d
- makes the generated scanner run in
- .I debug
- mode. Whenever a pattern is recognized the scanner will
- write to
- .I stderr
- a line of the form:
- .nf
-
- --accepting rule #n
-
- .fi
- Rules are numbered sequentially with the first one being 1. Rule #0
- is executed when the scanner backtracks; Rule #(n+1) (where
- .I n
- is the number of rules) indicates the default action; Rule #(n+2) indicates
- that the input buffer is empty and needs to be refilled and then the scan
- restarted. Rules beyond (n+2) are end-of-file actions.
- .TP
- .B -f
- has the same effect as lex's -f flag (do not compress the scanner
- tables); the mnemonic changes from
- .I fast compilation
- to (take your pick)
- .I full table
- or
- .I fast scanner.
- The actual compilation takes
- .I longer,
- since flex is I/O bound writing out the big table.
- .IP
- This option is equivalent to
- .B -cf
- (see below).
- .TP
- .B -i
- instructs flex to generate a
- .I case-insensitive
- scanner. The case of letters given in the flex input patterns will
- be ignored, and the rules will be matched regardless of case. The
- matched text given in
- .I yytext
- will have the preserved case (i.e., it will not be folded).
- .TP
- .B -p
- generates a performance report to stderr. The report
- consists of comments regarding features of the flex input file
- which will cause a loss of performance in the resulting scanner.
- Note that the use of
- .I REJECT
- and variable trailing context (see
- .B BUGS)
- entails a substantial performance penalty; use of
- .I yymore(),
- the
- .B ^
- operator,
- and the
- .B -I
- flag entail minor performance penalties.
- .TP
- .B -s
- causes the
- .I default rule
- (that unmatched scanner input is echoed to
- .I stdout)
- to be suppressed. If the scanner encounters input that does not
- match any of its rules, it aborts with an error. This option is
- useful for finding holes in a scanner's rule set.
- .TP
- .B -v
- has the same meaning as for lex (print to
- .I stderr
- a summary of statistics of the generated scanner). Many more statistics
- are printed, though, and the summary spans several lines. Most
- of the statistics are meaningless to the casual flex user, but the
- first line identifies the version of flex, which is useful for figuring
- out where you stand with respect to patches and new releases.
- .TP
- .B -F
- specifies that the
- .ul
- fast
- scanner table representation should be used. This representation is
- about as fast as the full table representation
- .ul
- (-f),
- and for some sets of patterns will be considerably smaller (and for
- others, larger). In general, if the pattern set contains both "keywords"
- and a catch-all, "identifier" rule, such as in the set:
- .nf
-
- "case" return ( TOK_CASE );
- "switch" return ( TOK_SWITCH );
- ...
- "default" return ( TOK_DEFAULT );
- [a-z]+ return ( TOK_ID );
-
- .fi
- then you're better off using the full table representation. If only
- the "identifier" rule is present and you then use a hash table or some such
- to detect the keywords, you're better off using
- .ul
- -F.
- .IP
- This option is equivalent to
- .B -cF
- (see below).
- .TP
- .B -I
- instructs flex to generate an
- .I interactive
- scanner. Normally, scanners generated by flex always look ahead one
- character before deciding that a rule has been matched. At the cost of
- some scanning overhead, flex will generate a scanner which only looks ahead
- when needed. Such scanners are called
- .I interactive
- because if you want to write a scanner for an interactive system such as a
- command shell, you will probably want the user's input to be terminated
- with a newline, and without
- .B -I
- the user will have to type a character in addition to the newline in order
- to have the newline recognized. This leads to dreadful interactive
- performance.
- .IP
- If all this seems to confusing, here's the general rule: if a human will
- be typing in input to your scanner, use
- .B -I,
- otherwise don't; if you don't care about how fast your scanners run and
- don't want to make any assumptions about the input to your scanner,
- always use
- .B -I.
- .IP
- Note,
- .B -I
- cannot be used in conjunction with
- .I full
- or
- .I fast tables,
- i.e., the
- .B -f, -F, -cf,
- or
- .B -cF
- flags.
- .TP
- .B -L
- instructs flex to not generate
- .B #line
- directives (see below).
- .TP
- .B -T
- makes flex run in
- .I trace
- mode. It will generate a lot of messages to stdout concerning
- the form of the input and the resultant non-deterministic and deterministic
- finite automatons. This option is mostly for use in maintaining flex.
- .TP
- .B -c[efmF]
- controls the degree of table compression.
- .B -ce
- directs flex to construct
- .I equivalence classes,
- i.e., sets of characters
- which have identical lexical properties (for example, if the only
- appearance of digits in the flex input is in the character class
- "[0-9]" then the digits '0', '1', ..., '9' will all be put
- in the same equivalence class).
- .B -cf
- specifies that the
- .I full
- scanner tables should be generated - flex should not compress the
- tables by taking advantages of similar transition functions for
- different states.
- .B -cF
- specifies that the alternate fast scanner representation (described
- above under the
- .B -F
- flag)
- should be used.
- .B -cm
- directs flex to construct
- .I meta-equivalence classes,
- which are sets of equivalence classes (or characters, if equivalence
- classes are not being used) that are commonly used together.
- A lone
- .B -c
- specifies that the scanner tables should be compressed but neither
- equivalence classes nor meta-equivalence classes should be used.
- .IP
- The options
- .B -cf
- or
- .B -cF
- and
- .B -cm
- do not make sense together - there is no opportunity for meta-equivalence
- classes if the table is not being compressed. Otherwise the options
- may be freely mixed.
- .IP
- The default setting is
- .B -cem
- which specifies that flex should generate equivalence classes
- and meta-equivalence classes. This setting provides the highest
- degree of table compression. You can trade off
- faster-executing scanners at the cost of larger tables with
- the following generally being true:
- .nf
-
- slowest smallest
- -cem
- -ce
- -cm
- -c
- -c{f,F}e
- -c{f,F}
- fastest largest
-
- .fi
- Note that scanners with the smallest tables compile the quickest, so
- during development you will usually want to use the default, maximal
- compression.
- .TP
- .B -Sskeleton_file
- overrides the default skeleton file from which flex constructs
- its scanners. You'll never need this option unless you are doing
- flex maintenance or development.
- .SH INCOMPATIBILITIES WITH LEX
- .I flex
- is fully compatible with
- .I lex
- with the following exceptions:
- .IP -
- There is no run-time library to link with. You needn't
- specify
- .I -ll
- when linking, and you must supply a main program. (Hacker's note: since
- the lex library contains a main() which simply calls yylex(), you actually
- .I can
- be lazy and not supply your own main program and link with
- .I -ll.)
- .IP -
- lex's
- .B %r
- (Ratfor scanners) and
- .B %t
- (translation table) options
- are not supported.
- .IP -
- The do-nothing
- .ul
- -n
- flag is not supported.
- .IP -
- When definitions are expanded, flex encloses them in parentheses.
- With lex, the following
- .nf
-
- NAME [A-Z][A-Z0-9]*
- %%
- foo{NAME}? printf( "Found it\\n" );
- %%
-
- .fi
- will not match the string "foo" because when the macro
- is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
- and the precedence is such that the '?' is associated with
- "[A-Z0-9]*". With flex, the rule will be expanded to
- "foo([A-z][A-Z0-9]*)?" and so the string "foo" will match.
- Note that because of this, the
- .B ^, $, <s>,
- and
- .B /
- operators cannot be used in a definition.
- .IP -
- The undocumented lex-scanner internal variable
- .B yylineno
- is not supported.
- .IP -
- The
- .B input()
- routine is not redefinable, though may be called to read characters
- following whatever has been matched by a rule. If
- .B input()
- encounters an end-of-file the normal
- .B yywrap()
- processing is done. A ``real'' end-of-file is returned as
- .I EOF.
- .IP
- Input can be controlled by redefining the
- .B YY_INPUT
- macro.
- YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
- action is to place up to max_size characters in the character buffer "buf"
- and return in the integer variable "result" either the
- number of characters read or the constant YY_NULL (0 on Unix systems)
- systems) to indicate EOF. The default YY_INPUT reads from the
- file-pointer "yyin" (which is by default
- .I stdin),
- so if you
- just want to change the input file, you needn't redefine
- YY_INPUT - just point yyin at the input file.
- .IP
- A sample redefinition of YY_INPUT (in the first section of the input
- file):
- .nf
-
- %{
- #undef YY_INPUT
- #define YY_INPUT(buf,result,max_size) \\
- result = (buf[0] = getchar()) == EOF ? YY_NULL : 1;
- %}
-
- .fi
- You also can add in things like counting keeping track of the
- input line number this way; but don't expect your scanner to
- go very fast.
- .IP -
- .B output()
- is not supported.
- Output from the ECHO macro is done to the file-pointer
- "yyout" (default
- .I stdout).
- .IP -
- If you are providing your own yywrap() routine, you must "#undef yywrap"
- first.
- .IP -
- To refer to yytext outside of your scanner source file, use
- "extern char *yytext;" rather than "extern char yytext[];".
- .IP -
- .B yyleng
- is a macro and not a variable, and hence cannot be accessed outside
- of the scanner source file.
- .IP -
- flex reads only one input file, while lex's input is made
- up of the concatenation of its input files.
- .IP -
- The name
- .bd
- FLEX_SCANNER
- is #define'd so scanners may be written for use with either
- flex or lex.
- .IP -
- The macro
- .bd
- YY_USER_ACTION
- can be redefined to provide an action
- which is always executed prior to the matched rule's action. For example,
- it could be #define'd to call a routine to convert yytext to lower-case,
- or to copy yyleng to a global variable to make it accessible outside of
- the scanner source file.
- .IP -
- In the generated scanner, rules are separated using
- .bd
- YY_BREAK
- instead of simple "break"'s. This allows, for example, C++ users to
- #define YY_BREAK to do nothing (while being very careful that every
- rule ends with a "break" or a "return"!) to avoid suffering from
- unreachable statement warnings where a rule's action ends with "return".
- .SH ENHANCEMENTS
- .IP -
- .I Exclusive start-conditions
- can be declared by using
- .B %x
- instead of
- .B %s.
- These start-conditions have the property that when they are active,
- .I no other rules are active.
- Thus a set of rules governed by the same exclusive start condition
- describe a scanner which is independent of any of the other rules in
- the flex input. This feature makes it easy to specify "mini-scanners"
- which scan portions of the input that are syntactically different
- from the rest (e.g., comments).
- .IP -
- .I yyterminate()
- can be used in lieu of a return statement in an action. It terminates
- the scanner and returns a 0 to the scanner's caller, indicating "all done".
- .IP -
- .I End-of-file rules.
- The special rule "<<EOF>>" indicates
- actions which are to be taken when an end-of-file is
- encountered and yywrap() returns non-zero (i.e., indicates
- no further files to process). The action can either
- point yyin at a new file to process, in which case the
- action should finish with
- .I YY_NEW_FILE
- (this is a branch, so subsequent code in the action won't
- be executed), or it should finish with a
- .I return
- statement. <<EOF>> rules may not be used with other
- patterns; they may only be qualified with a list of start
- conditions. If an unqualified <<EOF>> rule is given, it
- applies only to the INITIAL start condition, and
- .I not
- to
- .B %s
- start conditions.
- These rules are useful for catching things like unclosed comments.
- An example:
- .nf
-
- %x quote
- %%
- ...
- <quote><<EOF>> {
- error( "unterminated quote" );
- yyterminate();
- }
- <<EOF>> {
- yyin = fopen( next_file, "r" );
- YY_NEW_FILE;
- }
-
- .fi
- .IP -
- flex dynamically resizes its internal tables, so directives like "%a 3000"
- are not needed when specifying large scanners.
- .IP -
- The scanning routine generated by flex is declared using the macro
- .B YY_DECL.
- By redefining this macro you can change the routine's name and
- its calling sequence. For example, you could use:
- .nf
-
- #undef YY_DECL
- #define YY_DECL float lexscan( a, b ) float a, b;
-
- .fi
- to give it the name
- .I lexscan,
- returning a float, and taking two floats as arguments. Note that
- if you give arguments to the scanning routine, you must terminate
- the definition with a semi-colon (;).
- .IP -
- flex generates
- .B #line
- directives mapping lines in the output to
- their origin in the input file.
- .IP -
- You can put multiple actions on the same line, separated with
- semi-colons. With lex, the following
- .nf
-
- foo handle_foo(); return 1;
-
- .fi
- is truncated to
- .nf
-
- foo handle_foo();
-
- .fi
- flex does not truncate the action. Actions that are not enclosed in
- braces are terminated at the end of the line.
- .IP -
- Actions can be begun with
- .B %{
- and terminated with
- .B %}.
- In this case, flex does not count braces to figure out where the
- action ends - actions are terminated by the closing
- .B %}.
- This feature is useful when the enclosed action has extraneous
- braces in it (usually in comments or inside inactive #ifdef's)
- that throw off the brace-count.
- .IP -
- All of the scanner actions (e.g.,
- .B ECHO, yywrap ...)
- except the
- .B unput()
- and
- .B input()
- routines,
- are written as macros, so they can be redefined if necessary
- without requiring a separate library to link to.
- .IP -
- When
- .B yywrap()
- indicates that the scanner is done processing (it does this by returning
- non-zero), on subsequent calls the scanner will always immediately return
- a value of 0. To restart it on a new input file, the action
- .B yyrestart()
- is used. It takes one argument, the new input file. It closes the
- previous yyin (unless stdin) and sets up the scanners internal variables
- so that the next call to yylex() will start scanning the new file. This
- functionality is useful for, e.g., programs which will process a file, do some
- work, and then get a message to parse another file.
- .IP -
- Flex scans the code in section 1 (inside %{}'s) and the actions for
- occurrences of
- .I REJECT
- and
- .I yymore().
- If it doesn't see any, it assumes the features are not used and generates
- higher-performance scanners. Flex tries to be correct in identifying
- uses but can be fooled (for example, if a reference is made in a macro from
- a #include file). If this happens (a feature is used and flex didn't
- realize it) you will get a compile-time error of the form
- .nf
-
- reject_used_but_not_detected undefined
-
- .fi
- You can tell flex that a feature is used even if it doesn't think so
- with
- .B %used
- followed by the name of the feature (for example, "%used REJECT");
- similarly, you can specify that a feature is
- .I not
- used even though it thinks it is with
- .B %unused.
- .IP -
- Comments may be put in the first section of the input by preceding
- them with '#'.
- .SH FILES
- .TP
- .I flex.skel
- skeleton scanner
- .TP
- .I lex.yy.c
- generated scanner (called
- .I lexyy.c
- on some systems).
- .TP
- .I lex.backtrack
- backtracking information for
- .B -b
- flag (called
- .I lex.bck
- on some systems).
- .SH "SEE ALSO"
- .LP
- lex(1)
- .LP
- M. E. Lesk and E. Schmidt,
- .I LEX - Lexical Analyzer Generator
- .SH AUTHOR
- Vern Paxson, with the help of many ideas and much inspiration from
- Van Jacobson. Original version by Jef Poskanzer. Fast table
- representation is a partial implementation of a design done by Van
- Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
- .LP
- Thanks to the many flex beta-testers and feedbackers, especially Casey
- Leedom, Frederic Brehm, Nick Christopher, Chris Faylor, Eric Goldman, Eric
- Hughes, Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Esmond Pitt,
- Jef Poskanzer, and Dave Tallman. Thanks to Keith Bostic, John Gilmore, Bob
- Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
- headaches.
- .LP
- Send comments to:
- .nf
-
- Vern Paxson
- Real Time Systems
- Bldg. 46A
- Lawrence Berkeley Laboratory
- 1 Cyclotron Rd.
- Berkeley, CA 94720
-
- (415) 486-6411
-
- vern@csam.lbl.gov
- vern@rtsg.ee.lbl.gov
- ucbvax!csam.lbl.gov!vern
-
- .fi
- I will be gone from mid-July '89 through mid-August '89. From August on,
- the addresses are:
- .nf
-
- vern@cs.cornell.edu
-
- Vern Paxson
- CS Department
- Grad Office
- 4126 Upson
- Cornell University
- Ithaca, NY 14853-7501
-
- <no phone number yet>
-
- .fi
- Email sent to the former addresses should continue to be forwarded for
- quite a while. Also, it looks like my username will be "paxson" and
- not "vern". I'm planning on having a mail alias set up so "vern" will
- still work, but if you encounter problems try "paxson".
- .SH DIAGNOSTICS
- .LP
- .I flex scanner jammed -
- a scanner compiled with
- .B -s
- has encountered an input string which wasn't matched by
- any of its rules.
- .LP
- .I flex input buffer overflowed -
- a scanner rule matched a string long enough to overflow the
- scanner's internal input buffer (16K bytes - controlled by
- .B YY_BUF_MAX
- in "flex.skel").
- .LP
- .I old-style lex command ignored -
- the flex input contains a lex command (e.g., "%n 1000") which
- is being ignored.
- .SH BUGS
- .LP
- Some trailing context
- patterns cannot be properly matched and generate
- warning messages ("Dangerous trailing context"). These are
- patterns where the ending of the
- first part of the rule matches the beginning of the second
- part, such as "zx*/xy*", where the 'x*' matches the 'x' at
- the beginning of the trailing context. (Lex doesn't get these
- patterns right either.)
- If desperate, you can use
- .B yyless()
- to effect arbitrary trailing context.
- .LP
- .I variable
- trailing context (where both the leading and trailing parts do not have
- a fixed length) entails the same performance loss as
- .I REJECT
- (i.e., substantial).
- .LP
- For some trailing context rules, parts which are actually fixed-length are
- not recognized as such, leading to the abovementioned performance loss.
- In particular, parts using '|' or {n} are always considered variable-length.
- .LP
- Use of unput() or input() trashes the current yytext and yyleng.
- .LP
- Use of unput() to push back more text than was matched can
- result in the pushed-back text matching a beginning-of-line ('^')
- rule even though it didn't come at the beginning of the line.
- .LP
- yytext and yyleng cannot be modified within a flex action.
- .LP
- Nulls are not allowed in flex inputs or in the inputs to
- scanners generated by flex. Their presence generates fatal
- errors.
- .LP
- Flex does not generate correct #line directives for code internal
- to the scanner; thus, bugs in
- .I
- flex.skel
- yield bogus line numbers.
- .LP
- Pushing back definitions enclosed in ()'s can result in nasty,
- difficult-to-understand problems like:
- .nf
-
- {DIG} [0-9] /* a digit */
-
- .fi
- In which the pushed-back text is "([0-9] /* a digit */)".
- .LP
- Due to both buffering of input and read-ahead, you cannot intermix
- calls to stdio routines, such as, for example,
- .B getchar()
- with flex rules and expect it to work. Call
- .B input()
- instead.
- .LP
- The total table entries listed by the
- .B -v
- flag excludes the number of table entries needed to determine
- what rule has been matched. The number of entries is equal
- to the number of DFA states if the scanner does not use REJECT,
- and somewhat greater than the number of states if it does.
- .LP
- To be consistent with ANSI C, the escape sequence \\xhh should
- be recognized for hexadecimal escape sequences, such as '\\x41' for 'A'.
- .LP
- It would be useful if flex wrote to lex.yy.c a summary of the flags used in
- its generation (such as which table compression options).
- .LP
- The scanner run-time speeds still have not been optimized as much
- as they deserve. Van Jacobson's work shows that the can go
- faster still.
- .LP
- The utility needs more complete documentation.
-