4-9  MINIMIZING FLOATING-POINT ERRORS
 *************************************
 Practical methods for analyzing the severity of floating-point
 errors can be found in the chapter: 'practical issues'.

 Note that an interval/stochastic arithmetic package is not just a 
 diagnostic tool for FP errors, a result without an error estimation 
 is not very useful, as errors can never be eliminated completely
 in experimental data and computation.

 The chapter on 'FORTRAN pitfalls' discusses various programming 
 practises that may amplify floating-point (and other) errors, 
 and their avoidance is very important.



 Using REAL*16 (QUAD PRECISION)
 ------------------------------
 This is the most simple solution on machines that supports this 
 data type. REAL*16 takes more CPU time than REAL*8/REAL*4, 
 but introduces very small roundoff errors, and has a huge range.

 Performance cost of different size floats is VERY machine dependent
 (see the performance chapter). 

 A crude example program:


      PROGRAM REALS
C     ------------------------------------------------------------------
      REAL*4
     *              X4, Y4, Z4
C     ------------------------------------------------------------------
      REAL*8
     *              X8, Y8, Z8
C     ------------------------------------------------------------------
      REAL*16
     *              X16, Y16, Z16
C     ------------------------------------------------------------------
      X4 = 1.0E+00
      Y4 = 0.9999999E+00
      Z4 = X4 - Y4
      WRITE(*,*) Z4 ** 0.5
C     ------------------------------------------------------------------
      X8 = 1.0D+00
      Y8 = 0.9999999D+00
      Z8 = X8 - Y8
      WRITE(*,*) Z8 ** 0.5
C     ------------------------------------------------------------------
      X16 = 1.0Q+00
      Y16 = 0.9999999Q+00
      Z16 = X16 - Y16
      WRITE(*,*) Z16 ** 0.5
C     ------------------------------------------------------------------
      END



 Normalization of equations
 --------------------------
 Floating-point arithmetic is best when dealing with numbers with
 magnitudes of the order of 1.0, the 'representation density' is not
 maximal but we are in the 'middle' of the range.

 Usually you can decrease the range of numbers appearing in the 
 computation, by transforming the system of units, so that you get 
 dimensionless equations.

 The diffusion equation will serve as an example, I apologize for 
 the horrible notation:

    Ut = K * Uxx     

 Where the solution is U(X,T), the lowercase letters denote the 
 partial derivatives, and K is a constant.

 Let:
       L  be a typical length in the problem
       U0 a typical value of U 

 Substitute:

       X' = X / L     U' = U / U0

 Then:

       Ux  = Ux' / L
       Uxx = Ux'x' / (L*L)

 Substitute in the original equation:

       (U' * U0)t = (K / (L*L)) * (U' * U0)x'x'

       ((L * L) / K) U't = U'x'x'

 Substitute:

       T' = (K * T) / (L * L)

 And you get:

       U't' = U'x'x'

       With:   X' = X / L     U' = U / U0     T' = (K * T) / (L * L)



 Multi-precision arithmetic
 --------------------------
 That is a really bright idea, you can simulate floating-point numbers
 with very large sizes, using character strings (or other data types),
 and create routines for doing arithmetic on these giant numbers.
 Of course such software simulated arithmetic will be slow.

 By the way, the function overloading feature of Fortran 90, makes using
 multi-precision arithmetic packages with existing programs easy.

 Two free packages are "mpfun" and "bmp" (Brent's multiple precision),
 which are available from Netlib.



 Using special tricks
 --------------------
 A simple trick for summing a series, is sorting the numbers and adding 
 them in ascending order.

 An example program:


      PROGRAM RNDOF
C     ------------------------------------------------------------------
      INTEGER
     *              i
C     ------------------------------------------------------------------
      REAL
     *              sum
C     ------------------------------------------------------------------
      sum = 0.0
      DO I = 1, 10000000, 1
        sum = sum + 1.0 / REAL(i)
      END DO
      WRITE (*,*) 'Decreasing order: ', sum
C     ------------------------------------------------------------------
      sum = 0.0
      DO I = 10000000, 1, -1
        sum = sum + 1.0 / REAL(i)
      END DO
      WRITE (*,*) 'Increasing order: ', sum
C     ------------------------------------------------------------------
      END


 There is no need here for sorting, as the series is monotonic.
 Executing 2 * 10**7 iterations will take some CPU seconds, but 
 the result is very illuminating.


 Another way (though not as good as doubling the precision) is 
 using the Kahan Summation Formula.

 Suppose the series is stored in an array X(1:N)

      SUM = X(1)
      C = 0.0
      DO J = 2, N
        Y = X(J) - C
        T = SUM + Y
        C = (T - SUM) - Y
        SUM = T
      ENDDO


 Yet another method is using Knuth's formula.

 The recommended method is sorting and adding.



 Using integers instead of floats
 --------------------------------
 See the chapter on integers.



 Manual safeguarding
 -------------------
 You can check manually every dangerous arithmetic operation, you
 may construct special routines for 'safe' arithmetical operations.



 Hardware support
 ----------------
 IEEE conforming FPUs can raise an exception whenever a roundoff
 was performed on an arithmetical result. 

 You can write an exception handler that will report the exceptions,
 but as the result of most operations may have to be rounded, your 
 program will be slowed down, and you will get huge log files.



 Rational arithmetic
 -------------------
 We can represent (possibly with an error) every number as a quotient 
 of two integers, and keep along the dividend and divisor without
 actually performing the division.



Return to contents page