*first list everything we will define, so that the parser can reserve space
states inter goto mcomment data qdata endcheck cmdline string meta comment cstring spill default
*the last state listed is the default state at start of new element
symbols goto next ll label rsubs underscore data din cn any rcommand rfunction number punctuation quote comment squote variable meta all qin other int qcomment virgule special def
*symbols to be recognized
classes space label code comment meta error
*the order of this list is important: the last item is scanned
*first. This matters particularly for exclusive classes.
states
*now define all the parser states
*state names are significant to 2 chars only, as are symbol names.
*the default state at start of line.
state default
fonts special S def S rcommand R rsubs F goto R comment I data R label L
changes rsubs cmdline rcommand cmdline label inter goto goto data data comment mcomment special cmdline def cmdline
unknown cmdline
end
*changes tell us where to go when certain symbols appear:
*here a label symbol tells us to enter the inter state, and an rcommand
*symbol tells us to enter the cmdline state.
*unknown cmdline tells us to enter that state when an unknown symbol appears
state inter
fonts special S def S
changes special cmdline def cmdline
unknown cmdline
end
state cmdline
fonts rfunction F rcommand R goto R number N punctuation P quote Q comment I qcomment I data R variable V other E
*in parsing we work from left to right until we successfully match a symbol
changes quote string comment mcomment qcomment mcomment data data goto goto
*states and symbols may have the same names.
unknown spill
*an unknown would be, for example, an underscore at the end of a line. Unknowns
*are not parsed, so the spill state gets a look to see what it is.
end
*all state definitions must end with 'end'
state string
fonts quote Q qin Q
*fall out on success
*close quotes.
changes quote cmdline
*close quotes means back to the command line.
end
state mcomment
*when we find REM or ', we need to decide whether it is going to be a metacommand or a comment
fonts meta M
changes meta meta
unknown comment
end
state meta
fonts meta M squote Q int N
*we can have metacommands, integers, or single-quoted expressions
changes squote cstring
unknown comment
*if we find a non-meta bit, it's a comment.
end
state comment
fonts all C
*everything on to the end of the line is comment
end
state cstring
fonts squote Q any Q
changes squote meta
end
state spill
*if an underscore is found we come here
fonts underscore E any P
changes underscore endcheck
end
state endcheck
fonts
unknown cmdline
endl E P
return cmdline
end
*the return command tells the parser to go to cmdline state at the next line
state data
*in data statements, anything goes except a colon, unless it's in quotes.
fonts quotes Q cn P din D
changes quotes qdata cn cmdline
end
state qdata
fonts quote Q qin Q
changes quote data
end
state goto
*recognizes line numbers, labels, commas, and NEXT (for RESUME NEXT)
fonts next R ll G virgule P
changes next cmdline
unknown cmdline
end
symbols
*now we define the symbol classes mapped to fonts in the font commands above
*first a linenumber or label
symbol label
form
{[$,'.']}({' '})':'
#
symbol ll
form
{[$,'.']}
symbol next
alphalist upper single $('.')
NEXT
symbol int
form
#
symbol underscore
form
'_'
*'form' is followed by one or more lines of form definitions written in a special parser code
*Parser code:
*the following symbols represent items in the string being parsed:
*symbol:  represents:
$         a word of alphanumeric characters, any case.
&         a word of alphabetic chars, any case.
<         a word of lower case.
>         a word of upper case.
#         a word of digits.
A         a single capital.
a         a single minuscule.
.         any character.
0         a single digit.
!         a word consisting of any characters except \0 or \n, but including spaces; ie the rest of the line.
~'x'      any character except x.
/'xy'     a word of any characters, delimited by x or y, or the end of a line.
*literals are presented in '' with the escape sequence for ' being ''.
*case-folded literals are specified by the start sequence (space)> in
single quotes, then the literal in upper-case. Eg:
' >REM' gives a literal of REM ignoring case.
Thus a space at the beginning of a '' sequence is ignored; to start with
a space use two spaces; to start with ' >' use '  >'.
*Logic: the parser works along a form line, checking whether the text matches it.
As soon as the match fails, it drops out and tries the next form line,
until there are no more lines; then it returns a fail.
If it succeeds on any line (ie gets to the end of it) it returns a
success, ignoring further lines. So the order of the lines is important to ensure
that the maximum length is parsed (eg '3.2' might succeed as '3').
*NOTE: special symbols /'xy' and ! always succeed, so care must be taken
when using them to order symbols containing them in the fonts list
of a state in such a way as to avoid unconditional looping.
The following symbols act as logical operators:
(<form>,<form>,....)
This is 'optional'. The () must contain at least one form.
The parser keeps trying forms until one works, then skips to the
end of the bracket, omitting the others. If none works, it carries
on anyway.
[<form>,<form>,....]
This is 'choose 1'. As before but if none works, it fails the whole
line.
{<form>}
This is 'repeat at least once'. <form> is tried once. If it fails,
the whole line fails. If it succeeds, then the parser keeps trying
<form> until it does fail, then carries on with the rest of the line.
*So, complex maps can be built, Eg:
['A','B','C']('+','-')#
will succeed when the string is one of the letters A,B,C followed
by an optional + or - (but not both) and a compulsory word of at
least one and maybe more digits.
*Brackets can be nested, Eg:
#(['E','D']('+','-')#)
provides for an integer with an optional exponent ('E', or 'D' for double precision).
There must be a number, and if there is any more it must start
with 'E' or 'D', but not both, followed by an optional sign
and a compulsory exponent.
Note that this will parse '3' as well as '3E-24', since the follow-on
exponent is optional.
* or, [A,a]({[$,'.']})
allows a name which starts with a letter, of any case, followed by
an optional string of as many alphanumeric characters and full stops as you like.
Z would succeed, as would Za. and s.hello.john, but .Z would fail.
** Note: However, the parser does not backtrack. If a line fails on which
** some options have already been processed, it does not go back and
** try the other options again; it goes on to the next line.
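*Eg (a sketch with made-up literals, to illustrate the note on backtracking):
*  ('A','AB')'C'
*tried on the text ABC: the optional bracket matches 'A' and skips 'AB', then
*'C' fails against B, so the whole line fails; 'AB' is never retried.
*Listing the longer alternative first avoids this:
*  ('AB','A')'C'
*which matches ABC.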
*Comments must not break up form lists or other lists: a comment is
taken as an end-of-list marker.
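*Eg (a sketch; the symbol name 'fraction' is hypothetical and is not declared
*or used in this file): to stop '3.2' succeeding as '3', put the longer form
*line first:
*  symbol fraction
*  form
*  #'.'#
*  #
*On 3.2 the first line succeeds; on 3 the first line fails at '.', so the
*parser drops to the second line, which succeeds.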
symbol any
form
.
*any character
symbol all
form
!
symbol other
form
~'_'
symbol din
form
/'":'
symbol data
alphalist upper single $('.')
DATA
symbol cn
form
':'
symbol virgule
form
','
*symbols can also be defined in terms of lists of strings. The parser compares
*the string it has with the strings in the list, and if it finds a match, that
*symbol is returned as found. There are two sorts of list:
*alphalist: for words (generally). A parser form is provided at the top of
the list, which the parser uses to extract a notional 'word'
from the text. It then searches for a match to this word in the
list. The word must be identical- excess characters will fail.
*sizelist: for symbols, embedded words. The parser goes through the list
comparing each entry with its own length of characters from the
text. This allows tokens without identifiable separators to be
checked. Precedence is given to longer entries.
* an alphalist must be listed in full ascii order, low to high
* a sizelist is arranged in ascii order of first characters, then in
size order, longest to shortest, ie azz comes before ab comes before b.
*both types of list must have a size specifier, 'large' or 'small',
*to indicate the amount of memory to be reserved. 'large' lists can have up
*to 200 entries, 'small' ones up to 50.
*alphalists can also have size 'single', and only 1 entry.
*preceding the size specifier is an optional case specifier, upper, which
*causes the text to be put into upper case before it is compared with
*the list entries.
*any size of alphalist, and small sizelists, can be Hashed, for speed.
*Hashing is specified by giving two characters, the lowest and the
*highest ASCII values of the first characters of the words in the list,
*before the case specifier, or size specifier if there is no case.
*the list declarator ('alphalist' or 'sizelist') is followed by any hash
characters and case specifier, then the size specifier, then, in the case
of an alphalist, the parser form.
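*Eg (a sketch; the symbol names 'relop' and 'jump' and their entries are
*hypothetical and are not declared in this file):
*  symbol relop
*  sizelist small
*  <=
*  <>
*  <
*  >=
*  >
*entries are grouped in ascii order of first character ('<' before '>'),
*and within each group the longer entries come first, so that <= is
*tried before <.
*  symbol jump
*  alphalist upper small $
*  GOSUB
*  GOTO
*  RETURN
*the form $ extracts a word, which is put into upper case and must then
*match one entry exactly; the entries are in full ascii order.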