Implementation Notes for SFLOAT, a software Floating Point package for 80X8X processors

By Robert L. Smith

INTRODUCTION:

This document contains various notes, thoughts, methods, etc. which relate to the software floating-point package which I have called SFLOAT. My main motivation for writing such a package is to help eliminate a common complaint about FORTH, that it isn't useful because it doesn't have floating point. There are various possible answers to that complaint. One answer is that floating point is available in most commercial FORTH systems. On the other hand, most people initially try FORTH in a public domain package, such as F-83. In that case, they will probably not find a floating point package. (One exception is a hardware floating point package for the PC clones. Another exception is a low precision "Zen" floating point published by Martin Tracy.) Another answer to the complaint is that floating point is not needed. In a pure sense, that is certainly true. However, there are many times when the human effort required to do a problem in fixed point arithmetic becomes unreasonable for the problem at hand. In my particular case, I once wasted a great deal of time trying to approximate, in fixed point, a function involving the probability integral. The problem was trivial in floating point.

The floating point routines in this package are written for the most part in assembly language (CODE) routines for a specific processor type, the infamous Intel 8088 family. Why did I write the routines in code rather than in high level FORTH? For the usual reason of speed. In addition, detailed arithmetic functions are somewhat easier to write in code than in high level because of the absence of the carry bit in FORTH. I chose the Intel series because of its popularity. An equivalent package for the Motorola 68XXX series should be much easier to write because of the 32 bit architecture. I will leave that to others.

It was my intent to produce a set of floating point routines which are both fast and accurate. If you find errors or places where I can improve things, please let me know.

REPRESENTATION:

I was a member of the IEEE P754 committee on Floating Point. At one point I was strongly inclined to use the IEEE floating point representation. There were two reasons why I rejected it. The first reason was that I tried it and found it very difficult to work with. The second reason is that someone else had built a floating point package around the IEEE format (Philip Koopman, Jr., MVP FORTH). According to reports from Glenn Haydon, the routines were painfully slow.

Every representation method has its advantages and disadvantages. I chose a 48 bit representation. The most significant bit is the sign bit. The next 15 bits are used to indicate the power of 2, and the remaining 32 bits are used for the significand (or mantissa, as we used to say), without suppressing the leading bit. The binary point is at the left of the leading significant bit. This representation gives about 9 1/2 decimal digits of accuracy (somewhat greater than the IEEE single precision representation) and a range of 10^-4000 to 10^4000, roughly. That is far more than anyone needs, but it greatly reduces the need to test for overflow and underflow. I chose to consider any unnormalized number (with the most significant bit cleared) as zero. Finally, I set the bias on the exponent, or power of 2, to 3FFF (Hex representation, obviously).
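As a concrete illustration of this layout, here is a minimal C sketch. It is not part of the SFLOAT sources (which keep these values on the stack and are written largely in assembly); the struct and function names are invented for the example. It models the 48 bit value as a 16 bit sign/exponent word plus a 32 bit significand, and converts it to a native double for inspection.

    #include <stdint.h>
    #include <math.h>
    #include <stdio.h>

    /* Illustrative C model of the 48 bit SFLOAT layout: a 16 bit word holding
       the sign (bit 15) and the biased exponent (bits 0-14), followed by a
       32 bit significand whose leading bit is explicit.                      */
    typedef struct { uint16_t se; uint32_t frac; } sfloat;

    #define SF_BIAS 0x3FFF

    static double sfloat_to_double(sfloat a)
    {
        if ((a.frac & 0x80000000u) == 0)      /* unnormalized: treated as zero */
            return 0.0;
        int    exp  = (a.se & 0x7FFF) - SF_BIAS;
        double mant = a.frac / 4294967296.0;  /* binary point left of bit 31   */
        double v    = ldexp(mant, exp);       /* value = 0.f * 2^exp           */
        return (a.se & 0x8000) ? -v : v;
    }

    int main(void)
    {
        /* 1.0 = 0.5 * 2^1: significand 0x80000000, biased exponent 0x4000 */
        sfloat one = { 0x4000, 0x80000000u };
        printf("%g\n", sfloat_to_double(one));   /* prints 1 */
        return 0;
    }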
I am not completely satisfied with the last two decisions (the representation of zero, and the exponent bias), but I don't have the energy to re-do all of the work just to see if the decision should be altered.

The current version uses a separate floating point stack. In my initial version, I used the standard parameter stack to hold working floating point numbers. This is allowed in the Forth Vendors Group Standard Floating Point Extension. As of this writing, it appears that the ANSI Standard Forth will specify a separate Floating Point stack, so I re-wrote the routines.

ACCURACY:

In the IEEE Floating Point Standard, the basic four functions (add, subtract, multiply, and divide) and square root are required to be exact within one-half of the least significant bit. It is my intention to meet that requirement. To understand how that is done, the reader should consult some of the original papers. See, for example, "The INTEL Standard for Floating Point," by John F. Palmer, Proceedings Compsac77, IEEE Catalog No. 77CH1291-4C.

The basic idea is that all bits below the least significant bit are to be considered. If these bits represent a number greater than one-half of the least significant bit, then the result is incremented by 1. If the bits represent a number less than one-half, the result is unchanged. If the bits represent exactly one-half, then the least significant bit is examined. If the least significant bit is 0, the result is unchanged; otherwise the result is incremented by 1. This process is called "rounding to even." The trickiest part of this process occurs when subtracting two numbers which have the same sign (or adding two numbers of opposite sign).

I know the casual reader of these notes is going to think that going to all of this effort is gross overkill, but the mathematical literature has numerous cases where this is important. In any case, if you are going off on your own to do floating point arithmetic, please do not simply truncate the results unless you have very good reasons for doing so. Some of the very worst examples come from computers designed by Seymour Cray, whose philosophy seems to be that the results don't have to be computed accurately as long as they can be computed rapidly.
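A minimal sketch of that rounding decision in C, assuming the 32 bit result has already been computed together with a word of guard bits scaled so that 0x80000000 stands for exactly one-half of the least significant bit. The function is illustrative only (the names are mine), and a carry out of the increment, which would force a renormalization and exponent adjustment, is left out.

    #include <stdint.h>

    /* Round-to-nearest-even as described above.  'result' is the truncated
       32 bit significand; 'guard' holds the discarded bits, scaled so that
       0x80000000 means exactly one-half of the least significant bit.       */
    uint32_t round_to_even(uint32_t result, uint32_t guard)
    {
        if (guard > 0x80000000u)                    /* more than half an LSB */
            return result + 1;
        if (guard == 0x80000000u && (result & 1u))  /* exactly half: round to even */
            return result + 1;
        return result;                              /* less than half, or already even */
    }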
DIVISION

If you are interested in details, the floating point division routine in this package is fairly interesting. It is my second version. The initial version used a more or less standard technique for dividing by successive subtraction and shifting. I stumbled on an interesting variation on Knuth's method for division in the documentation of the Forth-based language called "Abundance" by Roedy Green. In that package, he performs a double precision divide by first shifting the divisor left until the most significant bit is set, obtaining an unsigned quotient and remainder, then adjusting the result based on the initial amount of shifting. In a floating point routine, the operands are almost always normalized, so the shifting is unnecessary, except possibly for an initial right shift of the numerator by one place.

The method is somewhat similar to the technique that you learned in grammar school, except that instead of working in base ten we use base "s", where s = 2^16. In other words, each 16 bit word is considered as a digit in this new base. We desire a two digit quotient and a two digit remainder. We will first guess the leading digit of the quotient, then determine its exact value, along with an appropriate remainder. Then we will determine the second digit and final remainder the same way. It turns out that our initial guess for each quotient digit must be no more than 2 greater than the correct digit.

At each stage the numerator consists of 3 digits, and the denominator consists of 2 digits. We may define the first quotient digit and remainder as follows:

     (s^2)*A + s*B + C             r1
     -----------------  =  q1 +  ------
          sD + E                 sD + E

The 3 terms in the numerator account for a possible right shift in the case that the initial numerator was greater than or equal to the denominator. We thus can guarantee that A <= D, and q1 < s. We also require that the remainder r1 < (sD + E). With a small amount of algebraic manipulation we can see that

     r1 = (s^2)*A + s*B + C - s*q1*D - q1*E

That doesn't appear to be of much help, but just hold on. We will make our initial guess for q1 based on a single precision division as follows:

        s*A + B            r0
        -------  =  g0 +  ----
           D                D

The key to simplifying this calculation is that we do not ignore the remainder in our trial division. Note that the remainder in this division is

     r0 = s*A + B - g0*D

Assume for the time being that q1 = g0. It may not be, but our equations will still be exact! We can calculate a remainder r1' which is

     r1' = s*r0 + C - g0*E

by making a simple single precision multiply and a double precision subtraction. Our initial guess g0 may have been too large. We can tell, because in that case r1' would be negative. If r1' is >= 0, our initial guess is the correct guess. If r1' is negative, we may correct for that by simultaneously adding the denominator (sD + E) to r1' and decrementing g0 by 1. We check again to see if the new remainder r1'' is negative. If r1'' is also negative, make a final correction by adding the denominator to r1'' and decrementing g0 again. This final quotient digit is correct, and the remainder is also correct.

The process is then repeated for the second quotient digit, by using the first remainder as the new numerator and then estimating the quotient and remainder as above. The result is that we can calculate the double precision quotient by using 2 single precision divides, two single precision multiplies, and some trivial additions and subtractions.

The final part of this process is to round the quotient precisely. This is done by comparing twice the remainder with the divisor. If 2*rem is less than the divisor, the quotient is correct. If 2*rem is greater than the divisor, the quotient must be incremented by 1. If 2*rem is exactly equal to the divisor, we round to even; that is, if the least significant bit in the quotient is 1, then add 1 to the quotient, otherwise do not add anything to the quotient.

There is an additional detail in the above: it is possible that A = D, even though (sA + B) < (sD + E). If we were to just divide, we would encounter an overflow situation. Therefore, consider the trial quotient to be g0 = s - 1. The zero-th remainder is

     r0 = sA + B - (s - 1)*D

and the first remainder is

     r1 = (s^2)*A + s*B + C - (s^2)*D + s*(D - E) + E
        = s*(B + D - E) + (C + E)

Again, the initial guess may be too large, which may be discovered by checking whether r1 is negative. If it is, then one or two corrections may need to be made, as in the previous discussion.
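For readers who want to see the digit step spelled out, here is a short C sketch of one quotient-digit calculation along the lines just described, including the A = D special case. It is only a model: the real routine is written in 8088 assembly, and the variable and function names here are invented for the example.

    #include <stdint.h>

    /* One quotient digit of the base-2^16 long division described above.
       num3 packs the three numerator digits A, B, C into its low 48 bits;
       den packs the two denominator digits D, E into 32 bits, with D >= 2^15
       (normalized) and num3 < 2^16 * den, so the quotient digit fits in 16 bits. */
    uint16_t quotient_digit(uint64_t num3, uint32_t den, uint32_t *rem)
    {
        uint16_t D  = (uint16_t)(den >> 16);
        uint16_t E  = (uint16_t)den;
        uint32_t AB = (uint32_t)(num3 >> 16);      /* the double digit s*A + B */
        uint16_t C  = (uint16_t)num3;
        uint32_t g, r0;

        if ((AB >> 16) == D) {      /* A = D: take the trial digit g0 = s - 1 */
            g  = 0xFFFFu;
            r0 = AB - (uint32_t)D * 0xFFFFu;       /* = B + D, may exceed 16 bits */
        } else {                    /* ordinary case: one single precision divide */
            g  = AB / D;
            r0 = AB % D;
        }

        /* Exact candidate remainder r1' = s*r0 + C - g*E; it goes negative only
           if the guess was too large (by at most 2 for a normalized divisor).   */
        int64_t r1 = ((int64_t)r0 << 16) + C - (int64_t)g * E;
        while (r1 < 0) {            /* each correction: g -= 1, r1 += (sD + E)   */
            g--;
            r1 += den;
        }
        *rem = (uint32_t)r1;
        return (uint16_t)g;
    }

The 32 bit quotient is then formed from two such digit steps, the remainder of the first step (shifted up one digit, with the next numerator digit appended) becoming the numerator for the second, after which the quotient is rounded against twice the final remainder as described above.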
LOGARITHM

These notes are for those few souls who are interested in the details of the implementation of the FLN function for software floating point. As noted elsewhere, the basic format for floating point is a 16 bit sign and biased exponent, followed by a 32 bit normalized fraction. The biased exponent occupies bits 0 through 14 with a bias of 3FFF (hexadecimal), and the sign is bit 15 of the first 16 bits. The binary point of the fraction is just to the left of the 32 bit fractional part (sometimes called the mantissa). If the most significant bit of the mantissa is zero, the whole number is considered zero. If the argument to the FLN function is negative or zero, it is an error condition.

After testing for those conditions, we then break down the problem into two parts, depending on the magnitude of the fraction. If the fraction is greater than or equal to 0.78125 (and, by definition, it must be less than 1.0), then we note that

     ln(X) = ln((2^I) * 0.f) = I*ln(2) + ln(0.f)

where I is the unbiased exponent and 0.f is the fractional part. Calculation of ln(0.f) is based on the approximation

     ln(1-x) = -(x + (x^2)/2 + (x^3)/3 + (x^4)/4 + . . . )

It may be noted in passing that there are better approximations than the above, but this one is selected to allow a crucial division to be made using single precision divisors.

It is necessary to reduce the range of x in the above equation. (If we did not, it would be necessary to use about 24 terms in the series when x = 0.25.) This is accomplished by dividing 0.f into intervals of 2^-5, or 1/32. The procedure is as follows:

     0.f = 1 - f1 ,  where  f1 = 1 - 0.f
     f1 * 2^5 = J + f2 * 2^5
     0.f = 1 - f1 = 1 - J*(2^-5) - f2
         = (1 - J*(2^-5)) * [1 - f2/(1 - J*(2^-5))]

where J is an integer. The calculation can proceed by using a table of values for the logarithm of (1 - J*(2^-5)), and an approximation for ln(1-x), where

     x = f2/(1 - J*(2^-5))

In this case, the maximum value for x is 0.0435. The series approximation for 33 bits of accuracy can terminate at the term (x^7)/7.

In order to maximize the accuracy of the calculation, we proceed using y = x*2^4. The maximum value of y is 0.696. We then proceed to calculate:

     (2^4)*(-ln(1-x)) = y + (y^2)*(2^-5) + (2/3)*(y^3)*(2^-9)
                          + (y^4)*(2^-14) + (4/5)*(y^5)*(2^-18)
                          + (4/6)*(y^6)*(2^-22) + (8/7)*(y^7)*(2^-27)

The last term is kept only for rounding purposes, and the factor (8/7) may be approximated by 1. It is necessary to use double precision in only the first 4 terms (well, maybe only the first 3, with extreme care). The above equation should be factored, and the calculations made from right to left so that the smallest terms are calculated first. (Writing equations in ASCII is tedious, and reading them is probably hellish, at best.) The calculation is roughly as follows:

     -(2^4)*ln(1-x) = (((((y*(2^-5) + (2/3))*y*(2^-4) + (4/5))*(y^2)*(2^-4)
                        + Y)*(2^-5) + (2/3))*(Y^3)*(2^-4) + (Y^2))*(2^-5) + Y

where Y indicates the double precision value and y the single precision value.
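To make the first case concrete, the following C sketch traces the same range reduction in double precision. It is only an illustration of the flow: SFLOAT does this in 32 bit fixed point with the scaled, factored form above, the table lookup is replaced here by a call to log(), and the names are invented for the example.

    #include <math.h>
    #include <stdio.h>

    /* First case (0.78125 <= f < 1.0): reduce f, then sum the plain series
       -ln(1-x) = x + x^2/2 + ... + x^7/7, smallest terms first.            */
    double ln_case1(double f)                /* returns ln(f)                */
    {
        double f1 = 1.0 - f;                 /* 0.f = 1 - f1                 */
        int    J  = (int)(f1 * 32.0);        /* f1*2^5 = J + f2*2^5          */
        double f2 = f1 - J / 32.0;
        double d  = 1.0 - J / 32.0;          /* 1 - J*(2^-5)                 */
        double x  = f2 / d;                  /* at most about 0.0435         */

        double s = 0.0;
        for (int n = 7; n >= 1; n--)         /* right to left: smallest terms first */
            s += pow(x, n) / n;

        return log(d) - s;   /* log(d) stands in for the table of ln(1 - J*(2^-5)) */
    }

    int main(void)
    {
        printf("%.12f %.12f\n", ln_case1(0.9), log(0.9));   /* should agree */
        return 0;
    }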
The case of the initial fraction 0.f being less than 0.78125 proceeds along somewhat similar lines. We first note that the desired result may be expressed as

     ln(X) = ln((2^(I-1)) * (2*0.f)) = (I-1)*ln(2) + ln(2*0.f)

where 1.0 <= (2*0.f) < 1.5625. In a manner similar to the first case, we divide the interval up into 19 subintervals of 2^-5, or 1/32, each. The procedure is as follows:

     2*0.f = 1 + f1 ,  where  f1 = (2*0.f) - 1
     f1 * (2^5) = J + f2 * (2^5)
     2*0.f = (1 + J*(2^-5)) * [1 + f2/(1 + J*(2^-5))]

where J is an integer. The calculation can proceed by using a table of values for the logarithm of (1 + J*(2^-5)), and an approximation for ln(1+x), where

     x = f2/(1 + J*(2^-5))

In this case, the maximum value for x is 0.03125. The approximation proceeds much as in the first case. The approximation used is:

     ln(1+x) = x - (x^2)/2 + (x^3)/3 - (x^4)/4 + (x^5)/5 - (x^6)/6 + (x^7)/8

As in the first case, the last term is kept only for rounding purposes, so its exact coefficient of 1/7 is approximated here by 1/8. Let x = (2^-5)*Y. The calculation proceeds along the lines indicated by:

     (2^5)*ln(1+x) = Y - (2^-6)*(Y^2 - (2^-5)*(Y^3)*(2/3 - (2^-6)*(y - (2^-5)*
                       (y^2)*(4/5 - (2^-5)*y*(2/3 - y*(2^-6))))))

where, as before, Y indicates the double precision value and y the single precision value.
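Finally, here is a matching double precision sketch of the second case and of the overall split on the fraction, again purely for illustration: the table lookup is replaced by log(), the plain alternating series is used instead of the scaled fixed point form, and all names are invented for the example (ln_case1 refers to the sketch following the first case).

    #include <math.h>
    #include <stdio.h>

    /* Second case (0.5 <= f < 0.78125): reduce 2*f into (1 + J/32)*(1 + x),
       then sum the alternating series for ln(1+x), smallest terms first.   */
    double ln_case2(double f)                 /* returns ln(2*f)             */
    {
        double f1 = 2.0 * f - 1.0;            /* 2*0.f = 1 + f1              */
        int    J  = (int)(f1 * 32.0);         /* f1*2^5 = J + f2*2^5         */
        double f2 = f1 - J / 32.0;
        double d  = 1.0 + J / 32.0;           /* 1 + J*(2^-5)                */
        double x  = f2 / d;                   /* at most 0.03125             */

        double s = 0.0;
        for (int n = 7; n >= 1; n--)          /* ln(1+x) = x - x^2/2 + x^3/3 - ... */
            s += ((n & 1) ? 1.0 : -1.0) * pow(x, n) / n;

        return log(d) + s;    /* log(d) stands in for the table of ln(1 + J*(2^-5)) */
    }

    /* ln(X) for X = 0.f * 2^I with 0.5 <= 0.f < 1, following the case split above. */
    double sfloat_ln(double f, int I)
    {
        if (f >= 0.78125)
            return I * log(2.0) + log(f);     /* log(f): ln_case1(f) in the earlier sketch */
        return (I - 1) * log(2.0) + ln_case2(f);
    }

    int main(void)
    {
        printf("%.12f %.12f\n", sfloat_ln(0.6, 3), log(0.6 * 8.0));   /* should agree */
        return 0;
    }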