Looking at the Code Produced by Software Pipelining

The best way to examine the assembly code generated by software pipelining is to compile with the -S switch. This is far better than using the disassembler, because -S adds annotations to the assembly code that name the sections described above. The annotations also provide useful statistics about the software pipelining process and give the reasons why certain loops did not pipeline. To get a summary of these annotations, do the following:

%f77 -64 -S -O3 -mips4 foo.f

This creates an annotated .s file (foo.s), which you can then search for the annotations:

%grep '#<swp' foo.s

#<swpf> is printed for loops that failed to software pipeline. #<swps> is printed for statistics and other information about the loops that did software pipeline.
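The grep step above can also be scripted when the annotations need to be sorted into successes and failures. The following is a minimal sketch, not part of the compiler tooling: the function name summarize_swp is hypothetical, and the marker shapes (#<swps> and #<swpf>, with the closing bracket treated as optional) are assumed from the sample output in this section.

```python
import re

# Assumed marker shape, based on the sample output in this section:
# "#<swps>" prefixes statistics lines, "#<swpf>" prefixes failure lines.
ANNOTATION = re.compile(r"#<(swp[sf])>?\s?(.*)")


def summarize_swp(lines):
    """Sort non-empty annotation lines into statistics and failures."""
    summary = {"swps": [], "swpf": []}
    for line in lines:
        match = ANNOTATION.search(line)
        if match and match.group(2).strip():
            summary[match.group(1)].append(match.group(2).strip())
    return summary


# Illustrative input: three annotation lines plus one ordinary
# assembly directive, which the scan should ignore.
sample = [
    "#<swps>",
    "#<swps> Pipelined loop line 11 steady state",
    "#<swps>     4 unrollings before pipelining",
    "\t.text",
]
result = summarize_swp(sample)
for text in result["swps"]:
    print(text)
```

To run it against a real file, pass an open file object instead of the sample list, for example summarize_swp(open("foo.s")).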

Another way to get a summary of the software pipelining annotations is to set the -LIST:=ON flag on the command line. For example:

%f77 -64 -O3 -mips4 -LIST:=ON foo.f

This creates a .L file which contains a summary of the flags used by the compiler (including default values) and the software pipelining annotations.

Example 1: Output from Using the -S Compiler Switch

%cat test.f
        program test
        real*8 a, x(100000),y(100000)
        do i = 1, 2000
           call daxpy(3.7d0, x, y, 100000)
        enddo
        stop
        end

        subroutine daxpy(a, x, y, nn)
        real*8 a, x(*),y(*)
        do i = 1, nn, 1
           y(i) = y(i) + a * x(i)
        enddo
        return
        end
%f77 -64 -mips4 -O3 -S test.f
%grep swps test.s
#<swps>
#<swps> Pipelined loop line 11 steady state
#<swps>
#<swps>     4 unrollings before pipelining
#<swps>     6 cycles per 4 iterations
#<swps>     8 flops       ( 33% of peak) (madds count as 2)
#<swps>     4 flops       ( 33% of peak) (madds count as 1)
#<swps>     4 madds       ( 33% of peak)
#<swps>     12 mem refs   ( 100% of peak)
#<swps>     2 integer ops ( 16% of peak)
#<swps>     18 instructions ( 75% of peak)
#<swps>     1 short trip threshold
#<swps>     7 ireg registers used
#<swps>     11 fgr registers used
#<swps>

This shows that the inner loop starting at line 11 was software pipelined. The loop was unrolled four times before pipelining. The steady state takes six cycles for every four loop iterations. The compiler calculated the statistics as follows:

If each madd counts as two floating point operations, the R8000 can do four floating point operations per cycle (two madds), so its peak for this six-cycle loop is 24. Eight floating point operations are 8/24 or 33% of peak. The figure for madds is calculated likewise: four madds against a peak of 12 are 33% of peak.
If each madd counts as one floating point operation, the R8000 can do two floating point operations per cycle, so its peak for this loop is 12. Four floating point operations are 4/12 or 33% of peak.
The R8000 can do two memory operations per cycle, so its peak for this loop is 12. Twelve memory references are 12/12 or 100% of peak.
The R8000 can do two integer operations per cycle, so its peak for this loop is 12. Two integer operations are 2/12 or 16% of peak.
The R8000 can do four instructions per cycle, so its peak for this loop is 24. Eighteen instructions are 18/24 or 75% of peak.

The statistics also show that loops of fewer than 1 iteration (the short trip threshold) would not go through the software pipeline replication area, but would be executed in the simple_loop section shown above. A total of seven integer and eleven floating point registers were used in generating the code.
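The percent-of-peak figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the per-cycle limits are the R8000 figures quoted in the text, the helper name percent_of_peak is hypothetical, and the percentages are assumed to be truncated rather than rounded, which is what the reported 16% for 2/12 suggests.

```python
# Per-cycle issue limits for the R8000, as quoted in the text above.
PEAK_PER_CYCLE = {
    "flops (madds count as 2)": 4,
    "flops (madds count as 1)": 2,
    "madds": 2,
    "mem refs": 2,
    "integer ops": 2,
    "instructions": 4,
}

# Counts reported by the compiler for one steady-state pass of the loop.
COUNTS = {
    "flops (madds count as 2)": 8,
    "flops (madds count as 1)": 4,
    "madds": 4,
    "mem refs": 12,
    "integer ops": 2,
    "instructions": 18,
}

CYCLES = 6  # "6 cycles per 4 iterations"


def percent_of_peak(count, per_cycle, cycles):
    """Integer percentage of peak; truncation is an assumption."""
    return count * 100 // (per_cycle * cycles)


for name, count in COUNTS.items():
    peak = PEAK_PER_CYCLE[name] * CYCLES
    pct = percent_of_peak(count, PEAK_PER_CYCLE[name], CYCLES)
    print(f"{count:3d} {name:26s} peak {peak:2d}  {pct:3d}% of peak")
```

The same function gives a quick check on any other pipelined loop: substitute the cycle count and the operation counts from its #<swps> block.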