Matrix Multiply - A Tuning Case Study

Matrix Multiply - A Tuning Case Study

Matrix multiplication illustrates some of the issues in compiling for the R8000. Consider the simple implementation.

do j = 1,n
   do i = 1,m
       do k = 1 , p
          c(i,j) = c(i,j) - a(i,k)*b(k,j)

The R8000 is capable of issuing two madds and two load/stores per cycle. This simple version of matrix multiply requires 2 loads in the inner loop and one madd. Thus at best, it can run at half of peak speed. Note, though, that the same locations are loaded on multiple iterations. By unrolling the loops, we can eliminate some of these redundant loads. Consider for example, unrolling the outer loop by 2.

do j = 1,n,2
   do i = 1,m
      do k = 1 , p
         c(i,j) = c(i,j) - a(i,k)*b(k,j)
         c(i,j+1) = c(i,j+1) - a(i,k)*b(k,j+1)

We now have 3 loads for two madds. Further unrolling can bring the ratio down further.

Choosing the best unrolling factor requires trade-offs. First, we'd like to get the ratios of loads to madds below 1. Having an equal number of loads and madds may not be sufficient. Recall that memory bank conflicts can occasionally increase the time required to do a load. Thus, it's desirable to eliminate as many redundant loads as possible. On the other hand, heavily unrolled loops require many registers. Unrolling too much can lead to register spills.

Below is a good compromise for the matrix multiply example. It gets very close to peak performance.

do j = 1,n-4,5
  do i = 1,m-3,4
    do k = 1 , p
      c(i,j) = c(i,j) - a(i,k)*b(k,j)
      c(i,j+1) = c(i,j+1) - a(i,k)*b(k,j+1)
      c(i,j+2) = c(i,j+2) - a(i,k)*b(k,j+2)
      c(i,j+3) = c(i,j+3) - a(i,k)*b(k,j+3)
      c(i,j+4) = c(i,j+4) - a(i,k)*b(k,j+4)

      c(i+1,j) = c(i+1,j) - a(i+1,k)*b(k,j)
      c(i+1,j+1) = c(i+1,j+1) - a(i+1,k)*b(k,j+1)
      c(i+1,j+2) = c(i+1,j+2) - a(i+1,k)*b(k,j+2)
      c(i+1,j+3) = c(i+1,j+3) - a(i+1,k)*b(k,j+3)
      c(i+1,j+4) = c(i+1,j+4) - a(i+1,k)*b(k,j+4)

      c(i+2,j) = c(i+2,j) - a(i+2,k)*b(k,j)
      c(i+2,j+1) = c(i+2,j+1) - a(i+2,k)*b(k,j+1)
      c(i+2,j+2) = c(i+2,j+2) - a(i+2,k)*b(k,j+2)
      c(i+2,j+3) = c(i+2,j+3) - a(i+2,k)*b(k,j+3)
      c(i+2,j+4) = c(i+2,j+4) - a(i+2,k)*b(k,j+4)

      c(i+3,j) = c(i+3,j) - a(i+3,k)*b(k,j)
      c(i+3,j+1) = c(i+3,j+1) - a(i+3,k)*b(k,j+1)
      c(i+3,j+2) = c(i+3,j+2) - a(i+3,k)*b(k,j+2)
      c(i+3,j+3) = c(i+3,j+3) - a(i+3,k)*b(k,j+3)
      c(i+3,j+4) = c(i+3,j+4) - a(i+3,k)*b(k,j+4)
     enddo
    enddo
  do i =i,m
    do k = 1,p
      c(i,j) = c(i,j) - a(i,k)*b(k,j)
    enddo
  enddo
enddo
do j = j,n
  do i = 1,m-3,4
    do k = 1 , p
      c(i,j) = c(i,j) - a(i,k)*b(k,j)
      c(i+1,j) = c(i+1,j) - a(i+1,k)*b(k,j)
      c(i+2,j) = c(i+2,j) - a(i+2,k)*b(k,j)
      c(i+3,j) = c(i+3,j) - a(i+3,k)*b(k,j)
    enddo
  enddo
  do i =i,m
    do k = 1,p
      c(i,j) = c(i,j) - a(i,k)*b(k,j)
    enddo
  enddo
enddo
return
end

Note that we have not unrolled the inner loop. The software pipelining phase of the compiler can effectively unroll inner loops automatically to eliminate redundant loads.

In addition, the KAP front end is able to automatically unroll outer loops. Given the initial matrix multiply, KAP generates the following code, which is almost as efficient as our hand coded version.

    do 5 J=1,N-3,4
        do 3 I=1,M-3,4
          DD1 = C(I,J)
          DD2 = C(I,J+1)
          DD3 = C(I,J+2)
          DD4 = C(I,J+3)
          DD5 = C(I+1,J)
          DD6 = C(I+1,J+1)
          DD7 = C(I+1,J+2)
          DD8 = C(I+1,J+3)
          DD9 = C(I+2,J)
          DD10 = C(I+2,J+1)
          DD11 = C(I+2,J+2)
          DD12 = C(I+2,J+3)
          DD13 = C(I+3,J)
          DD14 = C(I+3,J+1)
          DD15 = C(I+3,J+2)
          DD16 = C(I+3,J+3)
          do 2 K=1,P
             DD1 = DD1 - A(I,K) * B(K,J)
             DD2 = DD2 - A(I,K) * B(K,J+1)
             DD3 = DD3 - A(I,K) * B(K,J+2)
             DD4 = DD4 - A(I,K) * B(K,J+3)
             DD5 = DD5 - A(I+1,K) * B(K,J)
             DD6 = DD6 - A(I+1,K) * B(K,J+1)
             DD7 = DD7 - A(I+1,K) * B(K,J+2)
             DD8 = DD8 - A(I+1,K) * B(K,J+3)
             DD9 = DD9 - A(I+2,K) * B(K,J)
             DD10 = DD10 - A(I+2,K) * B(K,J+1)
             DD11 = DD11 - A(I+2,K) * B(K,J+2)
             DD12 = DD12 - A(I+2,K) * B(K,J+3)
                DD13 = DD13 - A(I+3,K) * B(K,J)
                DD14 = DD14 - A(I+3,K) * B(K,J+1)
                DD15 = DD15 - A(I+3,K) * B(K,J+2)
                DD16 = DD16 - A(I+3,K) * B(K,J+3)
  2           continue
              C(I,J) = DD1
              C(I,J+1) = DD2
              C(I,J+2) = DD3
              C(I,J+3) = DD4
              C(I+1,J) = DD5
              C(I+1,J+1) = DD6
              C(I+1,J+2) = DD7
              C(I+1,J+3) = DD8
              C(I+2,J) = DD9
              C(I+2,J+1) = DD10
              C(I+2,J+2) = DD11
              C(I+2,J+3) = DD12
              C(I+3,J) = DD13
              C(I+3,J+1) = DD14
              C(I+3,J+2) = DD15
              C(I+3,J+3) = DD16
 3          continue
            do 5 I=I,M,1
              DD17 = C(I,J)
              DD18 = C(I,J+1)
              DD19 = C(I,J+2)
              DD20 = C(I,J+3)
              do 4 K=1,P
                DD17 = DD17 - A(I,K) * B(K,J)
                DD18 = DD18 - A(I,K) * B(K,J+1)
                DD19 = DD19 - A(I,K) * B(K,J+2)
                DD20 = DD20 - A(I,K) * B(K,J+3)
 4            continue
              C(I,J) = DD17
              C(I,J+1) = DD18
              C(I,J+2) = DD19
              C(I,J+3) = DD20
 5          continue
            do 9 J=J,N,1
              do 7 I=1,M-3,4
                DD21 = C(I,J)
                DD22 = C(I+1,J)
                DD23 = C(I+2,J)
                DD24 = C(I+3,J)
                do 6 K=1,P
             DD21 = DD21 - A(I,K) * B(K,J)
             DD22 = DD22 - A(I+1,K) * B(K,J)
             DD23 = DD23 - A(I+2,K) * B(K,J)
             DD24 = DD24 - A(I+3,K) * B(K,J)
 6         continue
           C(I,J) = DD21
           C(I+1,J) = DD22
           C(I+2,J) = DD23
           C(I+3,J) = DD24
 7       continue
         do 9 I=I,M,1
           DD25 = C(I,J)
           do 8 K=1,P
             DD25 = DD25 - A(I,K) * B(K,J)
 8         continue
           C(I,J) = DD25
 9       continue
         return
       end