It is a Fortran pragma that tells the compiler to be less strict when it is deciding whether it can get parallelism between loop iterations. By default, the compilers do the safe thing: they try to prove to themselves that there is no possible conflict between two memory references. If they can prove this, then it is safe for one of them to pass the other.
In particular, you need to be able to perform the load from iteration i+1 before the store from iteration i if you want to be able to overlap the calculation from two consecutive iterations.
Now suppose you have a loop like:
do i = 1, n
a(l(i)) = a(l(i)) + ...
enddo
The compiler has no way to know that
&a(l(i)) != &a(l(i+1))
without knowing something about the vector l. For example, if every element of l is 5, then
&a(l(i)) == &a(l(i+1))
for all values of i.
But you sometimes know something the compiler doesn't. Perhaps in the example above, l is a permutation vector and all its elements are unique. You'd like a way to tell the compiler to be less conservative. The IVDEP directive is a way to accomplish this.
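To make the hazard concrete, here is a minimal sketch (not taken from the original routine) that runs the update loop twice: once in order, and once with every load performed before any store, which is roughly what overlapping iterations amounts to. The names depdemo, runboth, lperm, lsame and tmp are hypothetical, and the increment of 1.0 stands in for the elided right-hand side.

```fortran
      program depdemo
c     Sketch: compare an in-order update loop with a reordered
c     version that performs every load before any store.
      integer n
      parameter (n = 4)
      integer lperm(n), lsame(n)
      double precision a(8), b(8)
      data lperm /3, 1, 4, 2/
      data lsame /5, 5, 5, 5/
c     Unique indices: both versions agree.
      call runboth(lperm, n, a, b)
      print *, 'permutation: a(3) =', a(3), ' b(3) =', b(3)
c     Repeated index: the two versions disagree.
      call runboth(lsame, n, a, b)
      print *, 'all fives:   a(5) =', a(5), ' b(5) =', b(5)
      end

      subroutine runboth(l, n, a, b)
c     Assumes n <= 8 and every l(i) lies in 1..8.
      integer n, l(n), i
      double precision a(8), b(8), tmp(8)
c     Start both copies from the same values.
      do i = 1, 8
         a(i) = 0.0d0
         b(i) = 0.0d0
      enddo
c     In-order version: each iteration sees earlier stores.
      do i = 1, n
         a(l(i)) = a(l(i)) + 1.0d0
      enddo
c     Reordered version: all loads first, then all stores.
      do i = 1, n
         tmp(i) = b(l(i)) + 1.0d0
      enddo
      do i = 1, n
         b(l(i)) = tmp(i)
      enddo
      end
```

With the permutation vector both versions leave 1 in element 3. With every index equal to 5, the in-order loop accumulates 4 in element 5 while the reordered one ends with 1, which is exactly the kind of wrong answer the compiler is guarding against.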
Placed above a loop, the statement:
cdir$ ivdep
tells the compiler to assume there are no dependencies in the code.
Note: IVDEP IS DANGEROUS! If your code really isn't free of vector dependences, you may be telling the compiler to perform an illegal transformation that causes your program to get wrong answers. IVDEP is also powerful, though, and you may well find yourself in a position where you need it. Just be very careful when you use it.
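As an illustration of the safe case, the sketch below applies the directive to the indexed update loop when the index vector is known to be a permutation. The names ivdemo, scatter and x are hypothetical, and x stands in for the elided right-hand side. In fixed-form source a line starting with c in column 1 is a comment, so a compiler that does not recognize the directive simply ignores it.

```fortran
      program ivdemo
c     Hypothetical driver: l is a permutation of 1..n, so no two
c     iterations of the loop in scatter touch the same element.
      integer n
      parameter (n = 5)
      integer l(n), i
      double precision a(n), x(n)
      data l /2, 4, 1, 5, 3/
      do i = 1, n
         a(i) = 10.0d0 * i
         x(i) = 1.0d0
      enddo
      call scatter(a, x, l, n)
      print *, a
      end

      subroutine scatter(a, x, l, n)
c     Assumption: every element of l is unique. If it is not,
c     the directive below makes the loop unsafe to reorder.
      integer n, l(n), i
      double precision a(*), x(n)
cdir$ ivdep
      do i = 1, n
         a(l(i)) = a(l(i)) + x(i)
      enddo
      end
```

The uniqueness of l is a promise made by the programmer, not something the compiler checks; if the caller ever passes a vector with repeated entries, the directive has to come out.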
By using an IVDEP, it was possible to get a 5 (static) cycle inner loop for the following routine. This compares with 36 cycles without this directive. It is also possible to get good results by adding:
cdir$ ivdep
right above the start of the inner loop, then using:
-O3 -mips4
to compile.
For example
      subroutine m(a,nn,b,ldb,ix,kb,m,n,p,alpha,id1,id2,
     +             lda,aa)
c
      implicit double precision(a-h,o-z)
      real*8 alpha,beta
      dimension a(*),b(ldb,n),ix(nn),aa(lda,*)
      integer p
c
      double precision t
c
      do 100 j=1,(n-1),2
         idc = ix(id1+j) + id2
         idc1 = ix(id1+j+1) + id2
         do 50 k=1,(p-3),4
            t00 = b(k,j)
            t01 = b(k,j+1)
            t10 = b(k+1,j)
            t11 = b(k+1,j+1)
            t20 = b(k+2,j)
            t21 = b(k+2,j+1)
            t30 = b(k+3,j)
            t31 = b(k+3,j+1)
cdir$ ivdep
            do 30 i=1,m
               a(idc+i) = a(idc+i) + aa(i,k)*t00 +
     &            aa(i,k+1)*t10 + aa(i,k+2)*t20 + aa(i,k+3)*t30
               a(idc1+i) = a(idc1+i) +
     &            aa(i,k)*t01 +
     &            aa(i,k+1)*t11 + aa(i,k+2)*t21 + aa(i,k+3)*t31
 30         continue
 50      continue
         do 80 k=k,p
            t00 = b(k,j)
            t01 = b(k,j+1)
            do i =1,m
               a(idc+i) = a(idc+i) + aa(i,k)*t00
               a(idc1+i) = a(idc1+i) + aa(i,k)*t01
            enddo
 80      continue
 100  continue
      do j=j,n
         idc = ix(id1+j) + id2
         do k=1,p
            t00 = b(k,j)
            do i=1,m
               a(idc+i) = a(idc+i) + aa(i,k)*t00
            enddo
         enddo
      enddo
      return
      end
The inner loop now does 8 madds and 8 memory references in 5 cycles. In the limit it should be able to hit (static) full performance, but there is the nasty issue of incrementing the induction variables and testing for the end of the loop.
You can get a little closer to full performance by forcing the loop to be unrolled. Right now this isn't happening because the decision of how much to unroll is made before optimization, and at that point the loop contains too many operations. The following option:
-OPT:unroll_size=1000
increases this limit from its default (320) to 1000 and allows the loop to be unrolled 2x. This yields 16 madds and 16 memory accesses in 9 cycles.
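For readers unfamiliar with unrolling, the sketch below shows by hand what a 2x unroll looks like on a much simpler loop. It is only an illustration of the transformation, not the code the compiler generates for the routine above; the names unrdemo, upd2, col, t and mend are hypothetical. The point is that the induction variable increment and the end-of-loop test are paid once per two original iterations, which appears to fit the figures above (8 madds in 5 cycles, then 16 in 9).

```fortran
      program unrdemo
c     Hypothetical driver for the hand-unrolled update below.
      integer m
      parameter (m = 5)
      integer i
      double precision a(m), col(m)
      do i = 1, m
         a(i) = 0.0d0
         col(i) = dble(i)
      enddo
      call upd2(a, col, 2.0d0, m)
      print *, a
      end

      subroutine upd2(a, col, t, m)
c     Hypothetical helper: a(i) = a(i) + col(i)*t, unrolled 2x.
      integer m, i, mend
      double precision a(m), col(m), t
c     Largest even count not exceeding m.
      mend = m - mod(m, 2)
c     Two original iterations per trip, one increment and test.
      do i = 1, mend, 2
         a(i)   = a(i)   + col(i)*t
         a(i+1) = a(i+1) + col(i+1)*t
      enddo
c     Cleanup iteration when m is odd.
      if (mend .lt. m) a(m) = a(m) + col(m)*t
      end
```

In practice it is usually better to let the compiler do this through its unrolling options, since hand-unrolled source is harder to read and can get in the way of the compiler's own scheduling.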