Use of the IVDEP Directive

The IVDEP directive originated in Cray Fortran. The name stands for "Ignore Vector Dependencies".

It is a Fortran directive that tells the compiler to be less strict when deciding whether it can exploit parallelism between loop iterations. By default, the compiler does the safe thing: it tries to prove that there is no possible conflict between two memory references. Only if it can prove this is it safe to move one reference past the other.

In particular, to overlap the calculations of two consecutive iterations, the compiler must be able to perform the load from iteration i+1 before the store from iteration i.

Now suppose you have a loop like:

        do i = 1, n
          a(l(i)) = a(l(i)) + ...
        enddo

The compiler has no way to know that

&a(l(i)) != &a(l(i+1))

without knowing something about the vector l. For example, if every element of l is 5, then

&a(l(i)) == &a(l(i+1))

for all values of i.
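
To see why this matters, here is a small sketch that mimics by hand what happens when the load from the second iteration is moved ahead of the store from the first, with both elements of l equal to 5. The program name overlp, the array size, and the temporaries t1 and t2 are made up for illustration; the added value 1.0 stands in for the elided right-hand side.

```fortran
        program overlp
        integer l(2), i
        double precision a(8), t1, t2
        l(1) = 5
        l(2) = 5
c       In-order execution: each iteration loads, adds, and stores
c       before the next one starts.  a(5) ends up as 2.
        do i = 1, 8
          a(i) = 0.0d0
        enddo
        do i = 1, 2
          a(l(i)) = a(l(i)) + 1.0d0
        enddo
        print *, 'in order:   a(5) =', a(5)
c       Overlapped execution: both loads are done before either
c       store.  The second load misses the first update, so a(5)
c       ends up as 1.
        do i = 1, 8
          a(i) = 0.0d0
        enddo
        t1 = a(l(1))
        t2 = a(l(2))
        a(l(1)) = t1 + 1.0d0
        a(l(2)) = t2 + 1.0d0
        print *, 'overlapped: a(5) =', a(5)
        end
```

If l(1) and l(2) were different, both orderings would give the same answer, which is exactly the property the compiler cannot prove on its own.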

But you sometimes know something the compiler doesn't. Perhaps in the example above, l is a permutation vector and all its elements are unique. You'd like a way to tell the compiler to be less conservative. The IVDEP directive is a way to accomplish this.

Placed above a loop, the statement:

cdir$ ivdep

tells the compiler to assume that the loop contains no vector dependences, that is, that memory references in different iterations do not conflict.
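
As a minimal sketch of the placement, here is the indirect-update loop from above with a concrete right-hand side. The subroutine name scatad and the array x are assumptions made for illustration; they do not come from any real code.

```fortran
        subroutine scatad(a, l, x, n)
        integer n, l(n), i
        double precision a(*), x(n)
c       The directive is valid here only if no index value
c       appears twice in l.  A compiler that does not recognize
c       the directive sees the cdir$ line as an ordinary comment.
cdir$ ivdep
        do i = 1, n
          a(l(i)) = a(l(i)) + x(i)
        enddo
        return
        end
```

The directive applies to the loop that immediately follows it, so it goes directly above the do statement.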

Note: IVDEP IS DANGEROUS! If your code is not really free of vector dependences, you may be telling the compiler to perform an illegal transformation that causes your program to get wrong answers. IVDEP is also powerful, though, and you may well find yourself in a position where you need it. Just be very careful when you use it.
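
One way to be careful with an index vector is to verify, in a debugging run, that it really contains no repeated values before relying on the directive. The following is only a sketch under stated assumptions: the function name nodups, the bound nmax (the largest index value that may occur), and the caller-supplied work array seen are all hypothetical.

```fortran
        logical function nodups(l, n, nmax, seen)
        integer n, nmax, l(n), i
        logical seen(nmax)
c       seen(k) records whether index value k has already occurred.
        do i = 1, nmax
          seen(i) = .false.
        enddo
        nodups = .true.
        do i = 1, n
c         An out-of-range index cannot be tracked: report failure.
          if (l(i) .lt. 1 .or. l(i) .gt. nmax) then
            nodups = .false.
            return
          endif
c         A repeated index means two iterations touch the same
c         element, so IVDEP would not be safe.
          if (seen(l(i))) then
            nodups = .false.
            return
          endif
          seen(l(i)) = .true.
        enddo
        return
        end
```

The check costs a pass over l and a work array of nmax logicals, so it is better suited to a test build than to production runs.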

With an IVDEP directive, the inner loop of the routine below is scheduled in 5 (static) cycles, compared with 36 cycles without the directive. To get this result, add:

cdir$ ivdep

right above the start of the inner loop, then compile with:

-O3 -mips4

The routine is:

        subroutine m(a,nn,b,ldb,ix,kb,m,n,p,alpha,id1,id2,
     +          lda,aa)
c
        implicit double precision(a-h,o-z)
        real*8 alpha,beta
        dimension a(*),b(ldb,n),ix(nn),aa(lda,*)
        integer p
c
        double precision t
c
        do  100 j=1,(n-1),2
          idc = ix(id1+j) + id2
          idc1 = ix(id1+j+1) + id2
          do 50 k=1,(p-3),4
            t00 = b(k,j)
            t01 = b(k,j+1)
            t10 = b(k+1,j)
            t11 = b(k+1,j+1)
            t20 = b(k+2,j)
            t21 = b(k+2,j+1)
            t30 = b(k+3,j)
            t31 = b(k+3,j+1)
cdir$ ivdep
            do 30 i=1,m
              a(idc+i) = a(idc+i) + aa(i,k)*t00 +
     &          aa(i,k+1)*t10 + aa(i,k+2)*t20 + aa(i,k+3)*t30
              a(idc1+i) = a(idc1+i) + aa(i,k)*t01 +
     &          aa(i,k+1)*t11 + aa(i,k+2)*t21 + aa(i,k+3)*t31
 30         continue
 50       continue
          do 80 k=k,p
            t00 = b(k,j)
            t01 = b(k,j+1)
            do i =1,m
              a(idc+i) = a(idc+i) + aa(i,k)*t00
              a(idc1+i) = a(idc1+i) + aa(i,k)*t01
            enddo
 80       continue
 100    continue
        do j=j,n
          idc = ix(id1+j) + id2
          do k=1,p
            t00 = b(k,j)
            do i=1,m
              a(idc+i) = a(idc+i) + aa(i,k)*t00
            enddo
          enddo
        enddo
        return
        end

The inner loop now does 8 madds (multiply-adds) and 8 memory references in 5 cycles. In the limit it should be able to reach full (static) performance, but the loop overhead, incrementing the induction variables and testing for the end of the loop, also has to be scheduled.

You can get a little closer to full performance by forcing the loop to be unrolled. This does not happen by default because the decision of how much to unroll is made before optimization, when the loop still contains too many operations. The following option:

-OPT:unroll_size=1000

raises this limit from its default (320) to 1000 and allows the loop to be unrolled 2x. This yields 16 madds and 16 memory accesses in 9 cycles, or about 1.78 madds per cycle compared with 1.6 before unrolling.