The R8000 microprocessor contains two Tag RAM's which support the two- way interleaved streaming cache and can perform two Tag RAM accesses per cycle. One bank contains even addresses (the cache is addressed in 64- bit words) and the other bank contains odd addresses.The R8000 microprocessor allows either two loads or one load and one store in the same cycle. In order to facilitate two accesses per cycle, one address must be even and the other odd. When there are two accesses that are both odd or both even, one of the accesses is delayed by one cycle in the address bellow register. Multiple conflicts can degrade performance. If all the accesses are even or all the accesses are odd, performance bottoms out at one load per cycle. The compiler tries but cannot always guarantee instruction schedules where one access is to an even address and the other to an odd address. So, even though the examples below may be actually scheduled correctly by the compiler to eliminate bellows stalls, they illustrate the basic principles that you can follow when the compiler can't eliminate bellows stalls.
The simple SAXPY code below can run at close to half the performance of theoretical maximum, depending on how the arrays x and y are aligned relative to each other and the order in which the arrays are accessed.
subroutine saxpy( a, x, y, nn) real a, x(*), y(*) do i = 1, nn
y(i) = y(i) + a*x(i)
end do return endThe alignment was controlled by putting the arrays in a common block in the calling main program (i.e.)
The difficulty with this single-precision example is that the cache is interleaved at double word boundaries. As opposed to double precision accesses, consecutive single precision accesses do not go to alternate banks. Thus y(i) and y(i+1) are in one bank, while y(i+2) and y(i+3) are in the other bank.common x,y
A solution to this SAXPY problem is to unroll the single precision loop 4 times. It is possible (in theory) to schedule accesses such that they are grouped:
The following code gets close to the theoretical maximum.
program sample real a x(100000),y(100000)
do i = 1,2000
call saxpy(3.7,x,y,100000)
enddo
stop
end
subroutine saxpy( a, x, y, nn)
real a, x(*), y(*)
do i = 1, nn,4
y(i) = y(i) + a*x(i)
y(i+2) = y(i+2) + a*x(i+2)
y(i+1) = y(i+1) + a*x(i+1)
y(i+3) = y(i+3) + a*x(i+3)
end do
return
Fortran double precision loops can encounter bellows stalls if the column dimension is even. The simple loop below illustrates this:
real*8 y(8,4) do 20 i = 1,8 do 10 j = 1,4 y(i,j) = y(i,j) + 1 10 continue 20 continueThis is because the array is laid out in the secondary cache as shown in Table 6-1:
Bank 1 | Bank 2 |
---|---|
y(1,1) | y(2,1) |
y(3,1) | y(4,1) |
y(5,1) | y(6,1) |
y(7,1) | y(8,1) |
y(1,2) | y(2,2) |
etc. | etc. |
Note that y(1,1) and y(1,2) are in the same bank. Actually, all of the inner loop members are in the same bank. As a result there are bellows stalls.
Much better performance could have been obtained if the first dimension(column) was odd even if it didn't need the extra elements. This is because the secondary (G) cache would get laid out like this:
real*8 y(9,4)The inner loop elements are in alternating banks now and the bellows stalls are avoided as shown in Table 6-2.
Bank 1 | Bank 2 |
---|---|
y(1,1) | y(2,1) |
y(3,1) | y(4,1) |
y(5,1) | y(6,1) |
y(7,1) | y(8,1) |
y(9,1) | y(1,2) |
y(2,2) | y(2,3) |
etc. | etc. |
The software pipelining statistics in the .s or .L file can also tell you if you are having problems with bellows stalls. The following example illustrates this:
#<swps> 18 instructions ( 75% of peak)%cat daxpy2.f
subroutine daxpy(a, x, y, nn)
real *8 a, x(2,nn), y(2,nn)
DO i = 1,nn
y(1,i) = y(1,i) + a * x(1,i)
END DO
return
end
%f77 -64 -mips4 -O3 -S daxpy2.f
%grep swps daxpy2.s
#<swps>
#<swps> Pipelined loop line 4 steady state
#<swps>
#<swps> 4 unrollings before pipelining
#<swps> 6 cycles per 4 iterations
#<swps> 8 flops ( 33% of peak) (madds count as 2
#<swps> 4 flops ( 33% of peak) (madds count as 1
#<swps> 4 madds ( 33% of peak)
#<swps> 12 mem refs ( 100% of peak)
#<swps> 2 integer ops ( 16% of peak)
This shows that the inner loop starting at line 4 was software pipelined. In addition to the standard software pipelining statistics, the last three lines of the report show that at best, the compiler can schedule code which may have 6 bellows stall cycles for every four loop iterations and that the compiler did produce this schedule.#<swps> 1 short trip threshold
#<swps> 9 ireg registers used
#<swps> 11 fgr registers used
#<swps>
#<swps> 6 possible stall cycles
#<swps> 6 min possible stall cycles
#<swps>