do j = 1,n do i = 1,m do k = 1 , p c(i,j) = c(i,j) - a(i,k)*b(k,j)The R8000 is capable of issuing two madds and two load/stores per cycle. This simple version of matrix multiply requires 2 loads in the inner loop and one madd. Thus at best, it can run at half of peak speed. Note, though, that the same locations are loaded on multiple iterations. By unrolling the loops, we can eliminate some of these redundant loads. Consider for example, unrolling the outer loop by 2.
do j = 1,n,2 do i = 1,m do k = 1 , p c(i,j) = c(i,j) - a(i,k)*b(k,j) c(i,j+1) = c(i,j+1) - a(i,k)*b(k,j+1)We now have 3 loads for two madds. Further unrolling can bring the ratio down further.
Choosing the best unrolling factor requires trade-offs. First, we'd like to get the ratios of loads to madds below 1. Having an equal number of loads and madds may not be sufficient. Recall that memory bank conflicts can occasionally increase the time required to do a load. Thus, it's desirable to eliminate as many redundant loads as possible. On the other hand, heavily unrolled loops require many registers. Unrolling too much can lead to register spills.
Below is a good compromise for the matrix multiply example. It gets very close to peak performance.
do j = 1,n-4,5 do i = 1,m-3,4 do k = 1 , p c(i,j) = c(i,j) - a(i,k)*b(k,j) c(i,j+1) = c(i,j+1) - a(i,k)*b(k,j+1) c(i,j+2) = c(i,j+2) - a(i,k)*b(k,j+2) c(i,j+3) = c(i,j+3) - a(i,k)*b(k,j+3) c(i,j+4) = c(i,j+4) - a(i,k)*b(k,j+4) c(i+1,j) = c(i+1,j) - a(i+1,k)*b(k,j) c(i+1,j+1) = c(i+1,j+1) - a(i+1,k)*b(k,j+1) c(i+1,j+2) = c(i+1,j+2) - a(i+1,k)*b(k,j+2) c(i+1,j+3) = c(i+1,j+3) - a(i+1,k)*b(k,j+3) c(i+1,j+4) = c(i+1,j+4) - a(i+1,k)*b(k,j+4) c(i+2,j) = c(i+2,j) - a(i+2,k)*b(k,j) c(i+2,j+1) = c(i+2,j+1) - a(i+2,k)*b(k,j+1) c(i+2,j+2) = c(i+2,j+2) - a(i+2,k)*b(k,j+2) c(i+2,j+3) = c(i+2,j+3) - a(i+2,k)*b(k,j+3) c(i+2,j+4) = c(i+2,j+4) - a(i+2,k)*b(k,j+4) c(i+3,j) = c(i+3,j) - a(i+3,k)*b(k,j) c(i+3,j+1) = c(i+3,j+1) - a(i+3,k)*b(k,j+1) c(i+3,j+2) = c(i+3,j+2) - a(i+3,k)*b(k,j+2) c(i+3,j+3) = c(i+3,j+3) - a(i+3,k)*b(k,j+3) c(i+3,j+4) = c(i+3,j+4) - a(i+3,k)*b(k,j+4) enddo enddo do i =i,m do k = 1,p c(i,j) = c(i,j) - a(i,k)*b(k,j) enddo enddo enddo do j = j,n do i = 1,m-3,4 do k = 1 , p c(i,j) = c(i,j) - a(i,k)*b(k,j) c(i+1,j) = c(i+1,j) - a(i+1,k)*b(k,j) c(i+2,j) = c(i+2,j) - a(i+2,k)*b(k,j) c(i+3,j) = c(i+3,j) - a(i+3,k)*b(k,j) enddo enddo do i =i,m do k = 1,p c(i,j) = c(i,j) - a(i,k)*b(k,j) enddo enddo enddo return endNote that we have not unrolled the inner loop. The software pipelining phase of the compiler can effectively unroll inner loops automatically to eliminate redundant loads.
In addition, the KAP front end is able to automatically unroll outer loops. Given the initial matrix multiply, KAP generates the following code, which is almost as efficient as our hand coded version.
do 5 J=1,N-3,4 do 3 I=1,M-3,4 DD1 = C(I,J) DD2 = C(I,J+1) DD3 = C(I,J+2) DD4 = C(I,J+3) DD5 = C(I+1,J) DD6 = C(I+1,J+1) DD7 = C(I+1,J+2) DD8 = C(I+1,J+3) DD9 = C(I+2,J) DD10 = C(I+2,J+1) DD11 = C(I+2,J+2) DD12 = C(I+2,J+3) DD13 = C(I+3,J) DD14 = C(I+3,J+1) DD15 = C(I+3,J+2) DD16 = C(I+3,J+3) do 2 K=1,P DD1 = DD1 - A(I,K) * B(K,J) DD2 = DD2 - A(I,K) * B(K,J+1) DD3 = DD3 - A(I,K) * B(K,J+2) DD4 = DD4 - A(I,K) * B(K,J+3) DD5 = DD5 - A(I+1,K) * B(K,J) DD6 = DD6 - A(I+1,K) * B(K,J+1) DD7 = DD7 - A(I+1,K) * B(K,J+2) DD8 = DD8 - A(I+1,K) * B(K,J+3) DD9 = DD9 - A(I+2,K) * B(K,J) DD10 = DD10 - A(I+2,K) * B(K,J+1) DD11 = DD11 - A(I+2,K) * B(K,J+2) DD12 = DD12 - A(I+2,K) * B(K,J+3) DD13 = DD13 - A(I+3,K) * B(K,J) DD14 = DD14 - A(I+3,K) * B(K,J+1) DD15 = DD15 - A(I+3,K) * B(K,J+2) DD16 = DD16 - A(I+3,K) * B(K,J+3) 2 continue C(I,J) = DD1 C(I,J+1) = DD2 C(I,J+2) = DD3 C(I,J+3) = DD4 C(I+1,J) = DD5 C(I+1,J+1) = DD6 C(I+1,J+2) = DD7 C(I+1,J+3) = DD8 C(I+2,J) = DD9 C(I+2,J+1) = DD10 C(I+2,J+2) = DD11 C(I+2,J+3) = DD12 C(I+3,J) = DD13 C(I+3,J+1) = DD14 C(I+3,J+2) = DD15 C(I+3,J+3) = DD16 3 continue do 5 I=I,M,1 DD17 = C(I,J) DD18 = C(I,J+1) DD19 = C(I,J+2) DD20 = C(I,J+3) do 4 K=1,P DD17 = DD17 - A(I,K) * B(K,J) DD18 = DD18 - A(I,K) * B(K,J+1) DD19 = DD19 - A(I,K) * B(K,J+2) DD20 = DD20 - A(I,K) * B(K,J+3) 4 continue C(I,J) = DD17 C(I,J+1) = DD18 C(I,J+2) = DD19 C(I,J+3) = DD20 5 continue do 9 J=J,N,1 do 7 I=1,M-3,4 DD21 = C(I,J) DD22 = C(I+1,J) DD23 = C(I+2,J) DD24 = C(I+3,J) do 6 K=1,P DD21 = DD21 - A(I,K) * B(K,J) DD22 = DD22 - A(I+1,K) * B(K,J) DD23 = DD23 - A(I+2,K) * B(K,J) DD24 = DD24 - A(I+3,K) * B(K,J) 6 continue C(I,J) = DD21 C(I+1,J) = DD22 C(I+2,J) = DD23 C(I+3,J) = DD24 7 continue do 9 I=I,M,1 DD25 = C(I,J) do 8 K=1,P DD25 = DD25 - A(I,K) * B(K,J) 8 continue C(I,J) = DD25 9 continue return end