The Effect of Bellows Stalls

The Effect of Bellows Stalls

The R8000 microprocessor cache organization consists of a 16 kilobyte integer- only primary cache housed on the Integer Unit, and a 4 MByte second level streaming cache. The 4 MByte streaming cache acts as the second level cache for the IU and the first level cache for the FPU.

The R8000 microprocessor contains two Tag RAM's which support the two- way interleaved streaming cache and can perform two Tag RAM accesses per cycle. One bank contains even addresses (the cache is addressed in 64- bit words) and the other bank contains odd addresses.The R8000 microprocessor allows either two loads or one load and one store in the same cycle. In order to facilitate two accesses per cycle, one address must be even and the other odd. When there are two accesses that are both odd or both even, one of the accesses is delayed by one cycle in the address bellow register. Multiple conflicts can degrade performance. If all the accesses are even or all the accesses are odd, performance bottoms out at one load per cycle. The compiler tries but cannot always guarantee instruction schedules where one access is to an even address and the other to an odd address. So, even though the examples below may be actually scheduled correctly by the compiler to eliminate bellows stalls, they illustrate the basic principles that you can follow when the compiler can't eliminate bellows stalls.

Example 2: Single Precision Loops

SAXPY (Single precision A times X Plus Y)

The simple SAXPY code below can run at close to half the performance of theoretical maximum, depending on how the arrays x and y are aligned relative to each other and the order in which the arrays are accessed.

subroutine saxpy( a, x, y, nn)
real a, x(*), y(*)
do i = 1, nn

y(i) = y(i) + a*x(i)

end do

return
end

The alignment was controlled by putting the arrays in a common block in the calling main program (i.e.)

common x,y

The difficulty with this single-precision example is that the cache is interleaved at double word boundaries. As opposed to double precision accesses, consecutive single precision accesses do not go to alternate banks. Thus y(i) and y(i+1) are in one bank, while y(i+2) and y(i+3) are in the other bank.

A solution to this SAXPY problem is to unroll the single precision loop 4 times. It is possible (in theory) to schedule accesses such that they are grouped:

y(i), y(i+2): in one cycle
y(i+1), y(i+3): in a different cycle

This trick can be applied to any number of arrays and the normal bellows effect should take care of unknown base alignment of the arrays y(i), x(i):

y(i+2), x(i+2)
y(i+1), x(i+1)
...

Whether or not the alignment is good, interleaving gets you full performance.

The following code gets close to the theoretical maximum.

program sample
real a x(100000),y(100000)

do i = 1,2000

call saxpy(3.7,x,y,100000)

enddo

stop

end

subroutine saxpy( a, x, y, nn)

real a, x(*), y(*)

do i = 1, nn,4

y(i) = y(i) + a*x(i)

y(i+2) = y(i+2) + a*x(i+2)

y(i+1) = y(i+1) + a*x(i+1)

y(i+3) = y(i+3) + a*x(i+3)

end do

return

end

Example 3: Double Precision Loops

Fortran Arrays are stored in column-major order. That means y(1,1), y(2,1),y(3,1), y(4,1) etc. are stored in contiguous memory locations. C on the other hand stores its arrays in row-major order; (y[1,1], y[1,2], y[1,3] etc. are stored contiguously.).

Fortran double precision loops can encounter bellows stalls if the column dimension is even. The simple loop below illustrates this:

        real*8 y(8,4)

        do 20 i = 1,8
           do 10 j = 1,4
              y(i,j) = y(i,j) + 1

10       continue      
20    continue

This is because the array is laid out in the secondary cache as shown in Table 6-1:

Fortran Array Layout in Secondary Cache
Bank 1 Bank 2
y(1,1) y(2,1)
y(3,1) y(4,1)
y(5,1) y(6,1)
y(7,1) y(8,1)
y(1,2) y(2,2)
etc. etc.

Fortran Array Layout in Secondary Cache
Bank 1	Bank 2
y(1,1)	y(2,1)
y(3,1)	y(4,1)
y(5,1)	y(6,1)
y(7,1)	y(8,1)
y(1,2)	y(2,2)
etc.	etc.

Note that y(1,1) and y(1,2) are in the same bank. Actually, all of the inner loop members are in the same bank. As a result there are bellows stalls.

Much better performance could have been obtained if the first dimension(column) was odd even if it didn't need the extra elements. This is because the secondary (G) cache would get laid out like this:

real*8 y(9,4)

The inner loop elements are in alternating banks now and the bellows stalls are avoided as shown in Table 6-2.

Fortran Arrays in Alternating Banks
Bank 1 Bank 2
y(1,1) y(2,1)
y(3,1) y(4,1)
y(5,1) y(6,1)
y(7,1) y(8,1)
y(9,1) y(1,2)
y(2,2) y(2,3)
etc. etc.

Fortran Arrays in Alternating Banks
Bank 1	Bank 2
y(1,1)	y(2,1)
y(3,1)	y(4,1)
y(5,1)	y(6,1)
y(7,1)	y(8,1)
y(9,1)	y(1,2)
y(2,2)	y(2,3)
etc.	etc.

The software pipelining statistics in the .s or .L file can also tell you if you are having problems with bellows stalls. The following example illustrates this:

%cat daxpy2.f
      subroutine daxpy(a, x, y, nn)
      real *8 a, x(2,nn), y(2,nn)
      
      DO i = 1,nn
         y(1,i) = y(1,i) + a * x(1,i)
      END DO
      
      return
      end
%f77 -64 -mips4 -O3 -S daxpy2.f
%grep swps daxpy2.s
#<swps>
#<swps> Pipelined loop line 4 steady state
#<swps>
#<swps>     4 unrollings before pipelining
#<swps>     6 cycles per 4 iterations
#<swps>     8 flops       ( 33% of peak) (madds count as 2
#<swps>     4 flops       ( 33% of peak) (madds count as 1
#<swps>     4 madds       ( 33% of peak)
#<swps>     12 mem refs   ( 100% of peak)
#<swps>     2 integer ops ( 16% of peak)

#<swps> 18 instructions ( 75% of peak)

#<swps>     1 short trip threshold
#<swps>     9 ireg registers used
#<swps>     11 fgr registers used
#<swps>
#<swps>     6 possible stall cycles
#<swps>     6 min possible stall cycles
#<swps>

This shows that the inner loop starting at line 4 was software pipelined. In addition to the standard software pipelining statistics, the last three lines of the report show that at best, the compiler can schedule code which may have 6 bellows stall cycles for every four loop iterations and that the compiler did produce this schedule.

Example 2: Single Precision Loops
Example 3: Double Precision Loops