How evaluate.c works
Calculate() is just a front end for the evaluate()
function. evaluate() consists of two main parts - a scanner
(also known as a tokenizer) and a token evaluator.
The scanner turns a stream of characters (the expression string) into a
stream of tokens.
- The first thing it does is allocate enough memory, where the worst
case is that every character is an individual token. It now considers this
as a 'list' of tokens, with a 'list pointer' pointing at the first unused
token.
- We then perform a loop that traverses every character individually. We
can make the loop pointer 'skip forward' for tokens that have more than one
character.
- To translate characters into tokens very quickly, we have a
'scantable', where every possible character is given a token value. This
token value is written into the 'current token'. But that's not enough to
tokenize. So, we then make a few decisions based on what the token was.
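A sketch of what such a scantable might look like - the token names and table contents here are illustrative, not evaluate.c's actual values:

```c
/* Illustrative scantable: every possible character maps straight to a
 * token value in O(1). Token names and table contents are made up for
 * this sketch, not taken from evaluate.c. */
enum token { TK_ERROR, TK_SKIP, TK_NUM, TK_VAR,
             TK_ADD, TK_SUB, TK_OPEN, TK_CLOSE };

static enum token scantable[256];
static int scantable_ready = 0;

static void init_scantable(void)
{
    int c;
    for (c = 0; c < 256; c++) scantable[c] = TK_ERROR;
    scantable[' '] = scantable['\t'] = TK_SKIP;
    for (c = '0'; c <= '9'; c++) scantable[c] = TK_NUM;
    for (c = 'a'; c <= 'z'; c++) scantable[c] = TK_VAR;
    scantable['+'] = TK_ADD;  scantable['-'] = TK_SUB;
    scantable['('] = TK_OPEN; scantable[')'] = TK_CLOSE;
    scantable_ready = 1;
}

/* write the table's token value for this character into the current token */
enum token classify(unsigned char c)
{
    if (!scantable_ready) init_scantable();
    return scantable[c];
}
```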
- For most of the operators, all necessary scanning was done with just
that one non-alphanumeric character, so all they have to do is advance the
list pointer to the next token. A few other operators might have more
characters, so they check the next one and advance the stream if
necessary.
- Assignment is special. As we always have to put a variable name before
it, all we do is look back one token and if it is a variable, change its
token from TK_VAR to TK_ASSN. Of course, if there
isn't a variable before it, then there's definitely a syntax error
in the string.
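The look-back could be sketched like this (a minimal illustration; the token names match the text but the function is hypothetical):

```c
enum token { TK_ERROR, TK_VAR, TK_ASSN, TK_NUM };

/* Called when the scanner meets '='; 'pos' is the index the '=' would
 * occupy in the token list so far. Look back one token: a variable
 * becomes an assignment target, anything else is a syntax error. */
int mark_assignment(enum token *toks, int pos)
{
    if (pos > 0 && toks[pos - 1] == TK_VAR) {
        toks[pos - 1] = TK_ASSN;
        return 1;    /* ok */
    }
    return 0;        /* no variable before '=' - syntax error */
}
```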
- For numbers, we have to do quite a bit of work. The number scanner
is in another function so we can re-use it later, outside the scanner. The
basic premise is that while the next digit is valid, we can
multiply by 10 (or 16 for hex numbers) and add the value of the character
we just read. For decimal numbers, if we stop at a decimal point, we
turn the number into a real value and carry on reading the digits after the
decimal point.
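A decimal-only sketch of the multiply-and-add scheme (illustrative, not the actual scanner; a hex scanner would multiply by 16 instead):

```c
#include <ctype.h>

/* Read digits with the multiply-and-add scheme: value = value*10 + digit.
 * A decimal point switches to building the fractional part. If 'end' is
 * non-NULL it receives a pointer just past the number. */
double scan_number(const char *s, const char **end)
{
    double value = 0.0;
    while (isdigit((unsigned char) *s))
        value = value * 10.0 + (*s++ - '0');
    if (*s == '.') {                    /* finished over a decimal point */
        double scale = 0.1;
        s++;
        while (isdigit((unsigned char) *s)) {
            value += (*s++ - '0') * scale;
            scale *= 0.1;
        }
    }
    if (end) *end = s;
    return value;
}
```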
- For strings, marked in the scantable as TK_VAR, first we
extract a copy of the full string. If we had operators that were
strings rather than symbols, like Perl's "and", "or" and
"not", we would make the comparisons here. Next, we check whether it's
a function; if so, we turn it into a TK_FUNC token. Otherwise, we
can assume it's a variable, and thus use its name later on.
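The function-or-variable decision boils down to a name lookup; a minimal sketch, with a made-up function table rather than evaluate.c's actual list:

```c
#include <string.h>

enum token { TK_VAR, TK_FUNC };

/* illustrative function-name table, not evaluate.c's actual list */
static const char *funcnames[] = { "sin", "cos", "sqrt", "log", 0 };

/* After copying out the full identifier, decide whether it is a
 * function name or a variable reference. */
enum token classify_word(const char *name)
{
    int i;
    for (i = 0; funcnames[i]; i++)
        if (strcmp(name, funcnames[i]) == 0) return TK_FUNC;
    return TK_VAR;
}
```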
- We also have a few special tokens that aren't 'real', that is they're
part of the scanning process and not part of the evaluation.
- TK_SKIP says that this character isn't part of any token, but
it isn't wrong to see it in the expression. It allows us to have spaces in
our expressions without getting into trouble for them.
- TK_ERROR says that this character isn't part of any token,
and it is wrong to see it in the expression.
- TK_BREAK says that the expression scanning is finished and
OK, even though we haven't reached the end of the character stream yet. The
main loop that does tokenization and evaluation knows about this and will
call us again and again until we eventually do reach the end.
- After a successful scan, we 'lace up' the tokens, using pointers to
tokens, so we can easily insert other tokens without copying them about. We
also null-terminate any variable names - we couldn't do this earlier as it
would destroy the following token.
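The 'lacing up' amounts to turning the array of tokens into a singly linked list, so later passes can splice tokens in with pointer surgery. A sketch, with an illustrative token struct:

```c
#include <stddef.h>

struct token {
    int type;            /* token value from the scanner */
    struct token *next;  /* 'lace' to the following token */
};

/* lace a contiguous array of scanned tokens into a singly linked list */
void lace(struct token *t, int n)
{
    int i;
    for (i = 0; i < n; i++)
        t[i].next = (i + 1 < n) ? &t[i + 1] : NULL;
}

/* insertion is now a pointer splice - no tokens are copied about */
void insert_after(struct token *pos, struct token *newtok)
{
    newtok->next = pos->next;
    pos->next = newtok;
}
```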
Now we've done the token scan, we must perform work on the tokens until
they are fit for evaluation.
- The first thing we do is wrap the expression in a set of parentheses,
because this avoids the use of special cases in our algorithms for the
start and end. In particular, a closing set of parentheses forces a 'flush'
of pending operations, so our evaluation is guaranteed to have no
remaining operations stacked at the end of evaluation.
- Earlier, the scanner converted all the minus signs in the string to
TK_SUB, which is a binary operator, so we have to convert the ones
we think are unary to TK_NEG. In my opinion, they are the ones
that follow an open parenthesis or an operator. Put the other way around,
a TK_SUB becomes a TK_NEG provided it's not preceded
by a close parenthesis, a value, or a variable.
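That rule is a single look-back pass over the token stream; a sketch with illustrative token names:

```c
enum token { TK_NONE, TK_NUM, TK_VAR, TK_OPEN, TK_CLOSE,
             TK_SUB, TK_NEG, TK_ADD };

/* A TK_SUB is really unary (TK_NEG) unless the token before it is a
 * value, a variable or a close parenthesis. TK_NONE stands in for the
 * start of the expression. */
void fix_unary_minus(enum token *t, int n)
{
    enum token prev = TK_NONE;
    int i;
    for (i = 0; i < n; i++) {
        if (t[i] == TK_SUB &&
            prev != TK_NUM && prev != TK_VAR && prev != TK_CLOSE)
            t[i] = TK_NEG;
        prev = t[i];
    }
}
```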
- Implicit multiplication is where you can write "5a" and mean
"5*a". However, it's my opinion that you really mean
"(5*a)", so I've given implicit multiplication a higher precedence
than everything else. My rule for implicit multiplication is that it
goes between any two tokens where the left one is a variable, value or
close bracket, and the right one is a variable, value, open bracket or
function. The only exception is where the left and right tokens are the
same - it doesn't follow that "1 2 3" should evaluate to 6; it
should be a syntax error.
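The decision for each adjacent token pair can be sketched as a predicate (token names illustrative):

```c
enum token { TK_NUM, TK_VAR, TK_OPEN, TK_CLOSE, TK_FUNC, TK_ADD };

/* Implicit multiplication goes between a (value | variable | close
 * bracket) on the left and a (value | variable | open bracket |
 * function) on the right - except when both tokens are the same kind,
 * so "1 2" stays a syntax error instead of becoming "1*2". */
int needs_implicit_mul(enum token left, enum token right)
{
    int l = (left  == TK_NUM || left  == TK_VAR || left  == TK_CLOSE);
    int r = (right == TK_NUM || right == TK_VAR ||
             right == TK_OPEN || right == TK_FUNC);
    return l && r && left != right;
}
```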
- Variables have to exist and be pointed at by tokens, so we look up any
assignments in the token stream to ensure those tokens are created if
necessary, and then look for any variable references to ensure that all
variables already exist or are pulled from the environment. Admittedly, this
doesn't catch errors like "x=x+1" where x didn't exist
before this evaluation, but this is an evaluator, not a programming
language. In such cases, x has a value of 0.
Finally, we're ready for evaluation. We make two stacks - one for holding
values and one for holding operators. Why? Well, we'll be using what's
known as postfix evaluation, which is unambiguous. This requires that we
stack all values as we see them then apply the operators in the correct
order. The operators pull as many values as necessary from the stack, then
push the result back. Eventually, there's only one value on the stack - the
result. However the token stream is in infix notation - what can we do?
Convert one to the other. The idea of an infix to postfix converter is that
the innermost operations get done first, then the outermost ones. This is
done by running from left to right through the token stream and pushing
operators to a stack. However, for binary operators, all higher-precedence
operators already on the stack must be output to some destination stream
before a new operator can go on. Similarly, a close bracket forces a
'flush' of the stack back to the nearest open bracket on it. The algorithm
goes like so:
empty postfix list
FOR all tokens in infix list
SELECT current token
CASE variable, value
append current token to postfix list
ENDCASE
CASE unary operator, open bracket
push token onto stack
ENDCASE
CASE binary operator, close bracket
WHILE precedence(token on top of stack) > precedence(current token)
pull token from stack, append it to postfix list
ENDWHILE
IF current token = close bracket
pull token from stack, cause error if it's not an open bracket
ELSE
push current token onto stack
ENDIF
ENDCASE
ENDFOR
However, we've been a bit clever: instead of appending operators to a
list, we execute them right away. And that's basically it, apart from a few
decisions here and there about converting arguments between ints and
reals.
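To make that concrete, here is a minimal, integer-only sketch of the same loop with immediate execution instead of a postfix list. It is illustrative, not evaluate.c's code: the precedence values are made up, there is no error checking, and the precedence test uses >= so that left-associative operators like subtraction group correctly.

```c
#include <ctype.h>

#define MAXDEPTH 64

static int prec(char op)
{
    switch (op) {
    case '(':           return 0;
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    }
    return -1;   /* not an operator */
}

/* pull two values, apply the operator, push the result back */
static void apply(char op, int *vals, int *vsp)
{
    int b = vals[(*vsp)--], a = vals[(*vsp)--], r = 0;
    switch (op) {
    case '+': r = a + b; break;
    case '-': r = a - b; break;
    case '*': r = a * b; break;
    case '/': r = a / b; break;
    }
    vals[++(*vsp)] = r;
}

int eval(const char *s)
{
    int  vals[MAXDEPTH]; int vsp = -1;   /* value stack    */
    char ops[MAXDEPTH];  int osp = -1;   /* operator stack */

    ops[++osp] = '(';                    /* wrap the expression in parens */
    for (;;) {
        char c = (*s == '\0') ? ')' : *s;   /* end of input closes the wrapper */
        if (isdigit((unsigned char) c)) {
            int v = 0;
            while (isdigit((unsigned char) *s)) v = v * 10 + (*s++ - '0');
            vals[++vsp] = v;             /* values go straight to the stack */
            continue;
        }
        if (c == '(') {
            ops[++osp] = c;
        } else if (c == ')') {
            while (osp >= 0 && ops[osp] != '(')
                apply(ops[osp--], vals, &vsp);   /* flush to open bracket */
            osp--;                               /* pop the '(' itself */
            if (*s == '\0') break;
        } else if (prec(c) > 0) {
            /* flush higher (or equal, for left associativity) precedence */
            while (prec(ops[osp]) >= prec(c))
                apply(ops[osp--], vals, &vsp);
            ops[++osp] = c;
        }
        s++;   /* anything else (e.g. spaces) is skipped */
    }
    return vals[vsp];
}
```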
- TK_ADD, TK_SUB and TK_MUL will pull two
arguments from the stack. If both are integers, integer operations will be
used, otherwise real operations will be used. If there is one real and one
integer, the integer will be converted to a real. The result takes the type
of whichever operation was used.
- TK_EQ, TK_NE, TK_LT, TK_LE,
TK_GT and TK_GE will also convert to reals unless two
integers are present, however the result is always an integer - either
1 for true or 0 for false.
- TK_DIV, TK_POW and TK_FUNC always convert to
reals and return reals.
- TK_MOD, TK_AND, TK_OR, TK_BNOT,
TK_BAND, TK_BOR, TK_BXOR, TK_SHL and
TK_SHR always convert to integers and return integers.
- TK_NEG and TK_ASSN carry the type of their one
argument.
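The int/real promotion rule for TK_ADD could look like this - a sketch with a hypothetical tagged-value struct, not evaluate.c's actual representation:

```c
struct value {
    int is_real;   /* 0 = integer in 'i', 1 = real in 'r' */
    long i;
    double r;
};

/* TK_ADD's promotion rule: two integers use integer addition; any real
 * operand promotes the other side, and the result is then a real. */
struct value add_values(struct value a, struct value b)
{
    struct value out;
    if (!a.is_real && !b.is_real) {
        out.is_real = 0;
        out.i = a.i + b.i;
        out.r = 0.0;
    } else {
        out.is_real = 1;
        out.r = (a.is_real ? a.r : (double) a.i)
              + (b.is_real ? b.r : (double) b.i);
        out.i = 0;
    }
    return out;
}
```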
How do the 'stacks' work?
I don't use an ADT to operate the stacks; I use basic C operations. I
have an array of the appropriate type which has enough entries to hold
anything I want. I also have a 'stack pointer', which points at the current
topmost item on the stack. It is initialised to -1, to say that
the stack is empty. The following C fragments show the canonical stack
operations:
valstk = the stack
vstk = the stack pointer
if (vstk < 0) : if the stack is empty
if (vstk < n) : if the stack has n or fewer items on it
valstk[vstk] : get the item on top of the stack
valstk[++vstk] = item : push an item onto the stack
valstk[vstk--] : pull an item from the stack
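Wrapped into functions purely for illustration (evaluate.c writes these expressions inline), the fragments behave like this:

```c
#define STACK_MAX 64

int valstk[STACK_MAX];   /* the stack */
int vstk = -1;           /* the stack pointer; -1 means empty */

int  stk_empty(void)    { return vstk < 0; }         /* vstk < 0              */
void stk_push(int item) { valstk[++vstk] = item; }   /* valstk[++vstk] = item */
int  stk_top(void)      { return valstk[vstk]; }     /* valstk[vstk]          */
int  stk_pull(void)     { return valstk[vstk--]; }   /* valstk[vstk--]        */
```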
Is there anything that could be better?
- allocmem(), getenv() and the variables functions all
make accesses and updates to global shared data, which is playing with fire
in a shared library. It would be better if memory accesses and the variable
table were somehow not global, so no locking mechanism is needed in shared
libraries, and more than one evaluation could take place at once.
Kyzer/CSG