
A lexer generator

 

This section describes mosmllex, a lexer generator closely based on camllex from the Caml Light implementation by Xavier Leroy. This documentation is likewise based on that of camllex.

Overview

The mosmllex command produces a lexical analyser from a set of regular expressions with attached semantic actions, in the style of lex. Assume that file lexer.lex contains the specification of a lexical analyser. Then executing

    mosmllex lexer.lex

produces a file lexer.sml containing Moscow ML code for the lexical analyser. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.

Lexer buffers are an abstract data type implemented in the library unit Lexing. The functions createLexerString and createLexer from unit Lexing create lexer buffers that read from a character string, or any reading function, respectively.
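As a rough sketch of how a generated lexing function might be called: the unit name Lexer and the entry point Token below are hypothetical (they assume a definition file named Lexer.lex that declares an entry point Token). Only Lexing.createLexerString comes from the text above.

```sml
(* Sketch only. Lexer is a hypothetical unit produced by mosmllex,
   and Token is a hypothetical entry point defined in it. *)

(* Create a lexer buffer that reads from a character string *)
val lexbuf = Lexing.createLexerString "let x = 42"

(* Each call continues reading where the previous call stopped and
   returns the semantic attribute computed by the matching action *)
val first  = Lexer.Token lexbuf
val second = Lexer.Token lexbuf
```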

When used in conjunction with a parser generated by mosmlyac (see Section 10), the semantic actions compute a value belonging to the datatype token defined by the generated parsing unit.

Example uses of mosmllex can be found in directories calc and lexyacc under mosml/examples.

Hints on using mosmllex

A lexer definition must have a rule to recognize the special symbol eof, meaning end-of-file. In general, a lexer must be able to handle all characters that can appear in the input. This is usually achieved by having the wildcard case _ at the very end of the lexer definition. If the lexer is to be used with e.g. DOS files, remember to provide a rule for the carriage-return symbol \r. Most often \r will be treated the same as \n, e.g. as whitespace.
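The hints above could be followed along these lines. This is a minimal sketch, not a complete lexer: the token datatype and the entry point name Token are made up for illustration, and raising Fail on an unexpected character is only one possible policy.

```sml
{
(* Hypothetical token type, for illustration only *)
datatype token = WORD of string | EOF
}

rule Token = parse
    [` ` `\n` `\r`]   { Token lexbuf }   (* CR is skipped like other whitespace *)
  | [`a`-`z`]+        { WORD (Lexing.getLexeme lexbuf) }
  | eof               { EOF }            (* the required end-of-file rule *)
  | _                 { raise Fail "illegal character" }  (* wildcard, placed last *)
;
```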

Syntax of lexer definitions

The format of a lexer definition is as follows:

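    { header }
    rule entrypoint =
      parse regexp { action }
          | ...
          | regexp { action }
    and entrypoint =
      parse ...
    and ...
    ;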

Comments are delimited by (* and *), as in ML.

Header

The header section is arbitrary Moscow ML text enclosed in curly braces { and }. It can be omitted. If it is present, the enclosed text is copied as is at the beginning of the output file lexer.sml. Typically, the header section contains the open directives required by the actions, and possibly some auxiliary functions used in the actions.

Entry points

The names of the entry points must be valid ML identifiers.

Regular expressions

The regular expressions regexp are in the style of lex, but with a more ML-like syntax.

`char`

A character constant, with a syntax similar to that of Moscow ML character constants; see Section 9.3.5. Match the denoted character.

_

Match any character.

eof

Match the end of the lexer input.

"string"

A string constant, with a syntax similar to that of Moscow ML string constants; see Section 9.3.6. Match the denoted string.

[ character-set ]

Match any single character belonging to the given character set. Valid character sets are: single character constants `c`; ranges of characters `c1` - `c2` (all characters between c1 and c2, inclusive); and the union of two or more character sets, denoted by concatenation.

[ ^ character-set ]

Match any single character not belonging to the given character set.

regexp *

Match the concatenation of zero or more strings that match regexp. (Repetition).

regexp +

Match the concatenation of one or more strings that match regexp. (Positive repetition).

regexp ?

Match either the empty string, or a string matching regexp. (Option).

regexp1 | regexp2

Match any string that matches either regexp1 or regexp2. (Alternative).

regexp1 regexp2

Match the concatenation of two strings, the first matching regexp1, the second matching regexp2. (Concatenation).

( regexp )

Match the same strings as regexp.

The operators * and + have highest precedence, followed by ?, then concatenation, then | (alternative).
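To illustrate how these operators combine, here is a small sketch of a lexer definition. The token datatype and the entry point name Token are hypothetical. Note the parentheses in the number rule: because + and ? bind tighter than concatenation, the optional fractional part has to be grouped explicitly.

```sml
{
(* Hypothetical token type, for illustration only *)
datatype token = ID of string | NUM of string | EOF
}

rule Token = parse
    (* one or more blanks or line terminators: skip and continue *)
    [` ` `\n` `\r`]+
      { Token lexbuf }
    (* a letter followed by zero or more letters or digits;
       each bracket is a union of ranges written by concatenation *)
  | [`a`-`z` `A`-`Z`] [`a`-`z` `A`-`Z` `0`-`9`]*
      { ID (Lexing.getLexeme lexbuf) }
    (* digits, optionally followed by a dot and more digits;
       without the parentheses, ? would apply to [`0`-`9`]+ only *)
  | [`0`-`9`]+ (`.` [`0`-`9`]+)?
      { NUM (Lexing.getLexeme lexbuf) }
  | eof
      { EOF }
  | _
      { raise Fail "illegal character" }
;
```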

Actions

An action is an arbitrary Moscow ML expression. An action is evaluated in a context where the identifier lexbuf is bound to the current lexer buffer. Some typical uses of lexbuf in conjunction with the operations on lexer buffers (provided by the Lexing library unit) are listed below.

Lexing.getLexeme lexbuf

Return the matched string.

Lexing.getLexemeChar lexbuf n

Return the n'th character in the matched string. The first character has number 0.

Lexing.getLexemeStart lexbuf

Return the absolute position in the input text of the beginning of the matched string. The first character read from the input text has position 0.

Lexing.getLexemeEnd lexbuf

Return the absolute position in the input text of the end of the matched string. The first character read from the input text has position 0.

entrypoint lexbuf

Here entrypoint is the name of another entry point in the same lexer definition. Recursively call the lexer on the given entry point. Useful for lexing nested comments, for example.
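The operations above might be combined as in the following sketch, which records the start position of each word and skips nested comments by calling a second entry point recursively. The token datatype and the entry point names Token and Comment are hypothetical, and the error handling via Fail is just one possible choice.

```sml
{
(* Hypothetical token type: a word together with its start position *)
datatype token = WORD of string * int | EOF
}

rule Token = parse
    [` ` `\n` `\r`]     { Token lexbuf }
    (* skip a comment, then resume normal lexing *)
  | "(*"                { (Comment lexbuf; Token lexbuf) }
  | [`a`-`z`]+          { WORD (Lexing.getLexeme lexbuf,
                                Lexing.getLexemeStart lexbuf) }
  | eof                 { EOF }
  | _                   { raise Fail "illegal character" }

and Comment = parse
    (* end of the current comment: return to the caller *)
    "*)"                { () }
    (* a nested comment: skip it, then finish the enclosing one *)
  | "(*"                { (Comment lexbuf; Comment lexbuf) }
  | eof                 { raise Fail "unterminated comment" }
  | _                   { Comment lexbuf }
;
```

Each entry point has its own result type here: Token returns a token, while Comment returns unit because it is only called for its effect on the lexer buffer.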

Character constants

 

A character constant in the lexer definition is delimited by ` (backquote) characters. The two backquotes enclose either a space or a printable character c, different from ` and \, or an escape sequence:

quot997

String constants

 

A string constant is a (possibly empty) sequence of characters delimited by " (double quote) characters.

quot1011

A string character strchar is either a space or a printable character c, different from " and \, or an escape sequence:

quot1027



Moscow ML 1.42