This section describes mosmllex, a lexer generator closely based on camllex from the Caml Light implementation by Xavier Leroy. This documentation is likewise based on that of camllex.
The mosmllex command produces a lexical analyser from a set of regular expressions with attached semantic actions, in the style of lex. Assume that file lexer.lex contains the specification of a lexical analyser. Then executing

    mosmllex lexer.lex

produces a file lexer.sml containing Moscow ML code for the lexical analyser. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take a lexer buffer as argument, and return the semantic attribute of the corresponding entry point.
Lexer buffers are an abstract data type implemented in the library unit Lexing. The functions createLexerString and createLexer from unit Lexing create lexer buffers that read from a character string, or any reading function, respectively.
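As a rough sketch of how lexer buffers might be created and used: the unit name Lexer and the entry point Token below are hypothetical (they stand for whatever lexer.lex defines), and the helper createLexerStream assumes that Nonstdio.buff_input fills a character buffer from an input stream, in the manner of the examples shipped with Moscow ML.

```sml
(* Sketch only: assumes a hypothetical generated unit Lexer (from lexer.lex)
   with an entry point named Token. *)

(* A lexer buffer that reads from a character string. *)
val lexbuf = Lexing.createLexerString "foo bar"

(* Each call to the lexing function returns the semantic attribute
   of the next lexeme found in the buffer. *)
val first  = Lexer.Token lexbuf
val second = Lexer.Token lexbuf

(* A lexer buffer that reads through a reading function. The function
   receives a character buffer and the number of characters wanted, and
   returns the number of characters actually read.
   Assumption: Nonstdio.buff_input is available with this behaviour. *)
fun createLexerStream (is : BasicIO.instream) =
    Lexing.createLexer (fn buff => fn n => Nonstdio.buff_input is buff 0 n)
```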
When used in conjunction with a parser generated by mosmlyac (see Section 10), the semantic actions compute a value belonging to the datatype token defined by the generated parsing unit.
Example uses of mosmllex can be found in directories calc and lexyacc under mosml/examples.
A lexer definition must have a rule to recognize the special symbol eof, meaning end-of-file. In general, a lexer must be able to handle all characters that can appear in the input. This is usually achieved by having the wildcard case _ at the very end of the lexer definition. If the lexer is to be used with e.g. DOS files, remember to provide a rule for the carriage-return symbol \r. Most often \r will be treated the same as \n, e.g. as whitespace.
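A minimal sketch of a lexer definition that follows this advice is shown here. The datatype token with constructors WORD and EOF and the entry point name Token are hypothetical names chosen for illustration, and raising Fail is only one possible way to reject unexpected input.

```sml
{
 (* Hypothetical token type; normally this would come from the parser unit. *)
 datatype token = WORD of string | EOF
}

rule Token = parse
    [` ` `\t` `\n` `\r`]+
      { Token lexbuf  (* skip whitespace; \r is treated like \n *) }
  | [`a`-`z``A`-`Z`]+
      { WORD (Lexing.getLexeme lexbuf) }
  | eof
      { EOF  (* the mandatory end-of-file rule *) }
  | _
      { raise Fail "illegal character in input"  (* catch-all, placed last *) }
;
```

Placing the wildcard case last means that it is only selected for characters that no earlier rule accepts.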
The format of a lexer definition is as follows:
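The layout can be sketched schematically as below. This is a reconstruction rather than an authoritative grammar: header, entrypoint, regexp and action are the placeholders described in the rest of this section, while the keywords rule, parse and and, and the closing semicolon, are assumed to follow the camllex conventions that mosmllex is based on.

```sml
{ header }
rule entrypoint =
  parse regexp { action }
      | ...
      | regexp { action }
and entrypoint =
  parse ...
and ...
;
```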
Comments are delimited by (* and *), as in ML.
The header section is arbitrary Moscow ML text enclosed in curly braces { and }. It can be omitted. If it is present, the enclosed text is copied as is at the beginning of the output file lexer.sml. Typically, the header section contains the open directives required by the actions, and possibly some auxiliary functions used in the actions.
The names of the entry points must be valid ML identifiers.
The regular expressions regexp are in the style of lex, but with a more ML-like syntax.
` char `
    A character constant, with a syntax similar to that of Moscow ML character constants; see Section 9.3.5. Match the denoted character.
_
    Match any character.
eof
    Match the end of the lexer input.
" string "
    A string constant, with a syntax similar to that of Moscow ML string constants; see Section 9.3.6. Match the denoted string.
[ character-set ]
    Match any single character belonging to the given character set. Valid character sets are: single character constants `c`; ranges of characters `c1` - `c2` (all characters between c1 and c2, inclusive); and the union of two or more character sets, denoted by concatenation.
[ ^ character-set ]
    Match any single character not belonging to the given character set.
regexp *
    Match the concatenation of zero or more strings that match regexp. (Repetition).
regexp +
    Match the concatenation of one or more strings that match regexp. (Positive repetition).
regexp ?
    Match either the empty string, or a string matching regexp. (Option).
regexp1 | regexp2
    Match any string that matches either regexp1 or regexp2.
regexp1 regexp2
    Match the concatenation of two strings, the first matching regexp1, the second matching regexp2.
( regexp )
    Match the same strings as regexp.
The operators * and + have highest precedence, followed by ?, then concatenation, then | (alternative).
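To illustrate how these precedences group a regular expression, here is a small sketch of a lexer definition. The datatype token and the entry point Token are hypothetical names used only for this example.

```sml
{
 (* Hypothetical token type used only for this illustration. *)
 datatype token = INT of string | ID of string | EOF
}

rule Token = parse
    [` ` `\t` `\n` `\r`]+
      { Token lexbuf }
  | `-`? [`0`-`9`]+
      { INT (Lexing.getLexeme lexbuf)
        (* groups as (minus?) (digit+): an optional sign, then digits *) }
  | [`a`-`z``A`-`Z`] [`a`-`z``A`-`Z``0`-`9``_`]*
      { ID (Lexing.getLexeme lexbuf)
        (* the star applies to the second character set only *) }
  | eof
      { EOF }
  | _
      { raise Fail "illegal character in input" }
;
```

Because the postfix operators bind more tightly than concatenation, parentheses are needed only when a repetition or option should apply to a longer expression, as in ("ab" "cd")+.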
An action is an arbitrary Moscow ML expression. An action is evaluated in a context where the identifier lexbuf is bound to the current lexer buffer. Some typical uses of lexbuf in conjunction with the operations on lexer buffers (provided by the Lexing library unit) are listed below.
getLexeme lexbuf
    Return the matched string.
getLexemeChar lexbuf n
    Return the n'th character in the matched string. The first character has number 0.
getLexemeStart lexbuf
    Return the absolute position in the input text of the beginning of the matched string. The first character read from the input text has position 0.
getLexemeEnd lexbuf
    Return the absolute position in the input text of the end of the matched string. The first character read from the input text has position 0.
entrypoint lexbuf
    Here entrypoint is the name of another entry point in the same lexer definition. Recursively call the lexer on the given entry point. Useful for lexing nested comments, for example.
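The following sketch shows how actions might combine the Lexing operations getLexeme, getLexemeChar, getLexemeStart and getLexemeEnd with recursive calls to entry points in order to skip nested comments. The entry point names Token and Comment and the datatype token are hypothetical; the actions are wrapped in parentheses so that the sequencing with ; is valid wherever the action text ends up in the generated code.

```sml
{
 open Lexing;

 (* Hypothetical token type: a word with its start and end positions. *)
 datatype token = WORD of string * int * int | EOF
}

rule Token = parse
    [` ` `\t` `\n` `\r`]+
      { Token lexbuf }
  | "(*"
      { (Comment lexbuf; Token lexbuf)
        (* skip the whole comment, then continue with the next token *) }
  | [`a`-`z``A`-`Z`]+
      { WORD (getLexeme lexbuf, getLexemeStart lexbuf, getLexemeEnd lexbuf) }
  | eof
      { EOF }
  | _
      { raise Fail ("illegal character "
                    ^ String.str (getLexemeChar lexbuf 0)) }

and Comment = parse
    "*)"
      { () }
  | "(*"
      { (Comment lexbuf; Comment lexbuf)
        (* first call skips the inner comment, second resumes the outer *) }
  | eof
      { raise Fail "unterminated comment" }
  | _
      { Comment lexbuf }
;
```

Each opening delimiter triggers one extra recursive call, so the nesting depth is tracked by the call stack instead of an explicit counter.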
A character constant in the lexer definition is delimited by ` (backquote) characters. The two backquotes enclose either a space or a printable character c, different from ` and \, or an escape sequence:
A string constant is a (possibly empty) sequence of characters delimited by " (double quote) characters.
A string character strchar is either a space or a printable character c, different from " and \, or an escape sequence: