Software Pipelining

Software pipelining (SWP) is an important optimization for the inner loops of programs. It can improve performance dramatically by rearranging a loop so that calculations from multiple iterations overlap. The process is iterative: SWP searches for an effective schedule, then for a workable allocation of registers, and retries if either step fails. Some of the options in the -SWP group control that process. Others control how the loop body is prepared for the attempt, for example, by unrolling.

Many important loop preparation transformations involve reassociation of floating point values. See the discussion of floating point optimization above, especially the -OPT:roundoff option.

SWP normally must be careful during the initial and final iterations of a loop to not perform extra operations which might cause runtime traps. It must be similarly careful if early exits from a loop (that is, before the initially calculated trip count is reached) are possible. Turning off certain traps at runtime can give it more flexibility, producing better schedules and/or simpler wind-up/wind-down code. See the target environment option -TENV:X=n for general control over the exception environment.
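The overlap of iterations, and the wind-up/wind-down code it requires, can be illustrated with a hand-written C sketch. This is not the compiler's output; the function names scale_plain and scale_pipelined and the three-stage split (load, multiply, store) are assumptions made for illustration only.

```c
#include <stddef.h>

/* Straightforward loop: each iteration loads, multiplies, then stores. */
void scale_plain(const double *x, double *y, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * k;
}

/* Hand-pipelined equivalent with three stages: load, multiply, store. */
void scale_pipelined(const double *x, double *y, size_t n, double k)
{
    if (n < 2) {                    /* too short to fill the pipeline */
        scale_plain(x, y, n, k);
        return;
    }

    /* Wind-up: start iterations 0 and 1 without finishing them. */
    double loaded = x[0];           /* load for iteration 0 */
    double product = loaded * k;    /* multiply for iteration 0 */
    loaded = x[1];                  /* load for iteration 1 */

    /* Steady state: three different iterations are in flight at once. */
    for (size_t i = 2; i < n; i++) {
        y[i - 2] = product;         /* store for iteration i-2 */
        product = loaded * k;       /* multiply for iteration i-1 */
        loaded = x[i];              /* load for iteration i */
    }

    /* Wind-down: drain the last two iterations without loading x[n]. */
    y[n - 2] = product;
    product = loaded * k;
    y[n - 1] = product;
}
```

The wind-down code exists only so that no load is issued past the end of x. If such an extra load were known to be harmless, the steady-state loop could simply run longer, which is the kind of flexibility that relaxed trap handling gives SWP.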

-SWP:=(ON|OFF)

Enable/disable SWP (normally enabled at -O3).

-SWP:back_substitution[=(ON|OFF)]

The iteration interval (II) of a pipelined loop (that is, the number of cycles between the starts of successive iterations) is constrained by circular data dependencies across iterations, called recurrences. This option, ON by default, allows transformations which make recurrences less severe by substituting the expression which defines a variable for the variable. For example, consider the code:

DO i=1,n

a(i) = a(i-1) + 5.0

END DO

Without back-substitution, each iteration must wait for the previous iteration's add to complete, yielding a best case of 4 cycles per iteration on the R8000. Back-substitution can transform the loop to something like:

DO i=1,n

a(i) = a(i-8) + 40.0

END DO

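With appropriate initialization of the first eight elements, this version can achieve an effective rate of nearly two iterations per cycle: each add now depends on a result from eight iterations earlier, so the 4-cycle add latency is spread over eight iterations.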
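The "appropriate initialization" can be sketched in C as follows. This is a hand-written illustration under stated assumptions, not the code the compiler generates: the function names, the DIST constant, and the use of a[0] as the seed value (the counterpart of a(0) in the Fortran loop) are all hypothetical.

```c
#include <stddef.h>

#define DIST 8   /* hypothetical substitution distance, matching the example */

/* Original recurrence: a[i] depends on a[i-1]; a[0] holds the seed. */
void recur_plain(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + 5.0;
}

/* Back-substituted form: a[i] depends on a[i-DIST] instead. */
void recur_backsub(double *a, size_t n)
{
    size_t i;

    /* Initialization: the first DIST-1 results still use the original
       recurrence, so that every a[i-DIST] read below is already defined. */
    for (i = 1; i < n && i < DIST; i++)
        a[i] = a[i - 1] + 5.0;

    /* Main loop: DIST independent dependence chains can overlap. */
    for (; i < n; i++)
        a[i] = a[i - DIST] + 5.0 * DIST;
}
```

The two versions agree exactly only when every addition is exact. In general the substituted form rounds differently, which is why such transformations are tied to the -OPT:roundoff setting.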

-SWP:backtracks_max=n

SWP often backtracks and tries again when it fails to find a workable schedule. This option limits how many times it does so. Increasing the limit improves its chances of success; decreasing it may reduce the compilation time.

-SWP:body_ins_count_max=n

SWP is not attempted for loop bodies containing more than n instructions (default 100, 0 for no limit). Larger loop bodies are less likely to be successfully pipelined, and take more compilation time in the attempt, so this is another tradeoff of (potential) code improvement vs. compile time.

Loop bodies are also normally unrolled in preparation for SWP. This option limits that unrolling as well, since loops are not unrolled to more than n instructions in the unrolled body. Unrolling is also constrained by the unroll_times_max option described below. (Unrolling of loop bodies not expected to be software pipelined is controlled separately by -OPT:unroll_size and -OPT:unroll_times_max.)

-SWP:fix_recurrences[=(ON|OFF)]

This option controls both of the transformations controlled by back_substitution and interleave_reductions. See their descriptions.

-SWP:if_conversion[=(ON|OFF)]

SWP generally works much better on loop bodies without internal branches caused by conditional execution. This option causes conditional branches to be removed when possible by using conditional move instructions (MIPS4) and equivalents. For example, consider the code:

DO i=1,n

IF ( a(i) .LT. b(i) ) THEN

c(i) = a(i)

ELSE

c(i) = b(i)

END IF

END DO

The loop body can be compiled for MIPS4 as:

ldc1    $f0,a(i)
ldc1    $f1,b(i)
c.lt.s  $fcc1,$f0,$f1
movf.s  $f0,$f1,$fcc1
sdc1    $f0,c(i)

Note that there are no conditional branches in the code. This option is ON by default for MIPS4 targets only.
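The same idea can be expressed at the source level. The C sketch below is illustrative only: the function names are hypothetical, and whether a given compiler actually emits a conditional move for the select form depends on the target and the options in effect.

```c
#include <stddef.h>

/* Branching form: the loop body contains a conditional branch. */
void min_branch(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] < b[i])
            c[i] = a[i];
        else
            c[i] = b[i];
    }
}

/* If-converted form: both values are loaded unconditionally, the
   comparison produces a flag, and the flag selects the value to store. */
void min_select(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double x = a[i];
        double y = b[i];
        int lt = x < y;         /* plays the role of the condition code */
        c[i] = lt ? x : y;      /* candidate for a conditional move */
    }
}
```

Both loads happen on every iteration in the second form, so the loop body becomes a single straight-line block that a scheduler can overlap across iterations.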

-SWP:interleave_reductions[=(ON|OFF)]

This option, ON by default, has the same motivation as back-substitution. It allows transformations which make recurrences arising from reductions less severe by interleaving multiple threads of the reduction and then piecing them together at the end of the loop. For example, consider the code to sum an array:

DO i=1,n

sum = sum + a(i)

END DO

Without interleaving, each iteration must wait for the previous iteration's add to complete, yielding a best case II of 4 cycles per iteration on the R8000. Interleaving can transform the loop to something equivalent to:

DO i=1,n,8

sum1 = sum1 + a(i)
sum2 = sum2 + a(i+1)
sum3 = sum3 + a(i+2)
sum4 = sum4 + a(i+3)
sum5 = sum5 + a(i+4)
sum6 = sum6 + a(i+5)
sum7 = sum7 + a(i+6)
sum8 = sum8 + a(i+7)

END DO
sum = sum + sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7 + sum8

This version can achieve an effective II of nearly 0.5 cycles.

These transformations generally require -OPT:roundoff=2 or better.
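A C sketch of the interleaved reduction follows. It is a hand-written illustration, not compiler output: the function names and the LANES constant are hypothetical, and the cleanup loop for trip counts that are not a multiple of eight is an addition that the Fortran example above leaves out.

```c
#include <stddef.h>

#define LANES 8   /* hypothetical number of interleaved partial sums */

/* Plain reduction: every add waits for the previous one. */
double sum_plain(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Interleaved reduction: LANES independent partial sums. */
double sum_interleaved(const double *a, size_t n)
{
    double part[LANES] = {0.0};     /* all partial sums start at zero */
    size_t i = 0;

    /* Main loop: each lane carries its own dependence chain. */
    for (; i + LANES <= n; i += LANES)
        for (size_t j = 0; j < LANES; j++)
            part[j] += a[i + j];

    /* Piece the partial sums together at the end of the loop. */
    double sum = 0.0;
    for (size_t j = 0; j < LANES; j++)
        sum += part[j];

    /* Cleanup for the elements left over when n is not a multiple of LANES. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

The additions are performed in a different order than in the plain loop, so the two results can differ in the last bits for general floating point data. That reassociation is the reason a relaxed roundoff setting is needed.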

-SWP:trip_count_min=n

SWP is not attempted for loops with trip counts known to be smaller than n (default 5). When the trip count is not known at compile time, the limit is applied via a runtime test. A longer loop body can sometimes be profitably pipelined even with a smaller trip count; lowering n with this option allows that.

-SWP:unroll_times_max=n

This option controls the maximum number of times inner loop bodies are unrolled before attempting pipelining. The default is 4 for MIPS4 and 1 for MIPS3. Unrolling is also constrained by the body_ins_count_max option described above. (Unrolling of loop bodies not expected to be software pipelined is controlled separately by -OPT:unroll_size and -OPT:unroll_times_max.)
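For reference, unrolling a loop body four times corresponds roughly to the following C sketch. The function names and the UNROLL constant are hypothetical, and the separate cleanup loop is one common way to handle leftover iterations; the compiler's actual strategy may differ.

```c
#include <stddef.h>

#define UNROLL 4   /* hypothetical factor, mirroring the MIPS4 default */

/* Original loop: one element per iteration. */
void axpy_plain(double *y, const double *x, double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Unrolled loop: four copies of the body per iteration. */
void axpy_unrolled(double *y, const double *x, double alpha, size_t n)
{
    size_t i = 0;

    /* Main loop: the body is four times larger, which gives the
       scheduler more independent operations to overlap. */
    for (; i + UNROLL <= n; i += UNROLL) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }

    /* Cleanup: at most UNROLL-1 leftover iterations. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

Each copy of the body adds instructions, which is why the unrolling factor interacts with the instruction count limit on the loop body.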