In addition, the source language standards imply rules about how expressions are to be evaluated. However, most code that has not been written with careful attention to floating point behavior does not require precise conformance to either the source language expression evaluation standards or to IEEE 754 arithmetic standards. Therefore, the MIPSpro 64-bit compilers provide a number of options which trade off source language expression evaluation rules and IEEE 754 conformance against better performance of generated code. These options allow transformations of calculations specified by the source code that may not produce precisely the same floating point result, although they involve a mathematically equivalent calculation.
Two of these options are the preferred controls. -OPT:roundoff deals with the extent to which language expression evaluation rules are observed, generally affecting the transformation of expressions involving multiple operations. -OPT:IEEE_arithmetic deals with the extent to which the generated code conforms to IEEE 754 standards for discrete IEEE-specified operations (for example, a divide or a square root). The remaining options in this class may be used to obtain finer control, but they may disappear or change in future compiler releases.
The first general option, -OPT:roundoff=n, provides control over floating point accuracy and overflow/underflow exception behavior relative to the source language rules:
roundoff=0: Do no transformations that could affect floating point results. This is the default for optimization levels -O0 through -O2.
roundoff=1: Allow transformations with limited effects on floating point results. For roundoff, limited means that only the last one or two bits of the mantissa are affected. For overflow (underflow), it means that intermediate results of the transformed calculation may overflow (underflow) within a factor of two of where the original expression would have overflowed (underflowed). Note that limited effects may become less limited when compounded by multiple transformations.
roundoff=2: Allow transformations with more extensive effects on floating point results. Allow associative rearrangement, even across loop iterations, and distribution of multiplication over addition/subtraction. Disallow only transformations known to cause cumulative roundoff errors, or overflow (underflow) for operands in a large range of valid floating point values.

Re-association can have a substantial effect on the performance of software pipelined loops by breaking recurrences. This is therefore the default for optimization level -O3.
roundoff=3: Allow any mathematically valid transformation of floating point expressions. This allows floating point induction variables in loops, even when they are known to cause cumulative roundoff errors, and fast algorithms for complex absolute value and divide, which overflow (underflow) for operands beyond the square root of the representable extremes.
The second general option, -OPT:IEEE_arithmetic=n, controls conformance to IEEE 754 arithmetic standards for individual operations:

IEEE_arithmetic=1: No degradation: do no transformations that degrade floating point accuracy below IEEE requirements. The generated code may use instructions like madd which provide greater accuracy than required by IEEE 754. This is the default.
IEEE_arithmetic=2: Minor degradation: allow transformations with limited effects on floating point results, as long as exact results remain exact. This option allows use of the MIPS IV recip and rsqrt operations.
IEEE_arithmetic=3: Conformance not required: allow any mathematically valid transformations. For instance, this allows implementing x/y as x*recip(y).

As an example of the effect of these options, consider optimizing the Fortran code fragment:
      INTEGER i, n
      REAL sum, divisor, a(n)
      sum = 0.0
      DO i = 1, n
         sum = sum + a(i)/divisor
      END DO
At roundoff=0 and IEEE_arithmetic=1, the generated code must do the n loop iterations in order, with a divide and an add in each.
Using IEEE_arithmetic=3, the divide can be treated like a(i)*(1.0/divisor). On the MIPS R8000, the reciprocal can be done with a recip instruction. But more importantly, the reciprocal can be calculated once before the loop is entered, reducing the loop body to a much faster multiply and add per iteration, which can be a single madd instruction on the R8000.
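The transformation can be sketched in C (an illustrative rewrite, not actual compiler output); the second function is effectively what the compiler generates at IEEE_arithmetic=3:

```c
/* One divide per iteration, as required at IEEE_arithmetic=1. */
float sum_divide(const float *a, int n, float divisor) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] / divisor;
    return sum;
}

/* At IEEE_arithmetic=3: the reciprocal is computed once before the loop,
   leaving a multiply and add (a single madd on the R8000) per iteration. */
float sum_recip(const float *a, int n, float divisor) {
    float r = 1.0f / divisor;
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * r;
    return sum;
}
```

For most divisors the two versions differ in the last bit or so of each quotient; when the divisor is a power of two the reciprocal is exact and they agree.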
Using roundoff=2, the loop may be reordered. The original loop takes at least 4 cycles per iteration on the R8000 (the latency of the add or madd instruction). Reordering allows the calculation of several partial sums in parallel, adding them together after loop exit. With software pipelining, a throughput of nearly 2 iterations per cycle is possible on the R8000, a factor of 8 improvement.
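The partial-sum reordering can be sketched in C (a hypothetical illustration): several accumulators run in parallel, hiding the add latency, and are combined after the loop:

```c
/* In-order summation, as required at roundoff=0 or roundoff=1. */
float sum_inorder(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* At roundoff=2: four partial sums accumulated independently, combined at
   loop exit. The low-order bits may differ from the in-order sum. */
float sum_partial(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover iterations */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

With four independent accumulators, a new addition can start every cycle even though each individual add takes four cycles to complete.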
Consider another example:
      INTEGER i, n
      COMPLEX c(n)
      REAL r
      DO i = 1, n
         r = 0.1 * i
         c(i) = CABS( CMPLX(r,r) )
      END DO
Mathematically, r can be calculated by initializing it to 0.0 before entering the loop and adding 0.1 on each iteration. But doing so causes significant cumulative errors, because the representation of 0.1 is not exact. The complex absolute value is mathematically equal to SQRT(r*r + r*r). However, calculating it this way causes an overflow if 2*r*r is greater than the maximum REAL value, even though a representable result can be calculated for a much wider range of values of r (at greater cost). Both of these transformations are forbidden at roundoff=2, but enabled at roundoff=3.
Conformance to IEEE 754 also disables optimizations that reverse the sense of a comparison, for example, turning x < y into ! (x >= y), since both x < y and x >= y may be FALSE if one of the operands is a NaN.