- Newsgroups: comp.lang.c
- Path: sparky!uunet!utcsri!helios.physics.utoronto.ca!alchemy.chem.utoronto.ca!mroussel
- From: mroussel@alchemy.chem.utoronto.ca (Marc Roussel)
- Subject: Allowed Fortran and C optimizations
- Message-ID: <1992Nov16.153307.4632@alchemy.chem.utoronto.ca>
- Sender: mroussel@alchemy.chem.utoronto.ca (Marc Roussel)
- Organization: Department of Chemistry, University of Toronto
- Date: Mon, 16 Nov 1992 15:33:07 GMT
- Lines: 112
-
- A few days ago, I suggested that I might post some example code
- showing that better performance can sometimes be obtained from Fortran
- than C unless one is willing to hand-tune the C code. Before I show my
- example, I would like to point out that I am not claiming that Fortran
- is better than C for everything, only that Fortran sometimes optimizes
- better than equivalent C code.
- This example is from an HP Interworks conference (New Orleans,
- August 24-27 1992) talk by Bob Montgomery of HP. All the numbers
- reported below will be for the HP 720. The benchmark is a routine
- called vec_mult_add which multiplies a vector v by a scalar a and adds a
- vector u, returning vector r. There is one small restriction: The
- routine must be Fortran-callable, i.e. everything has to be passed by
- reference.
- Here is the Fortran code:
-
-       subroutine vec_mult_add(u,v,cnt,a,r)
-       integer cnt
-       real u(cnt),v(cnt),r(cnt),a
- *     Comment: This was obviously written by someone who is used to
- *     very old Fortran compilers. The following "if" is redundant
- *     with the loop indices. I suspect that the optimizer will know
- *     that however.
-       if (cnt.le.0) return
-       do ii=1,cnt
-          r(ii) = a*v(ii) + u(ii)
-       end do
-       end
-
- On an HP 720, compiled with the -O flag (which does not invoke the HP
- preprocessor but allows all normal optimizations), this snippet, driven
- by an appropriate wrapper, achieves 18.50 MFLOPS. Remember that number.
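The wrapper itself is not shown in this post. As a rough sketch of what such a driver might look like (written in C here, with the function names mflops and time_calls made up for the illustration, and a plain C stand-in for the routine being timed), the MFLOPS figure follows from counting one multiply and one add per vector element:

```c
#include <time.h>

/* Hypothetical stand-in for the routine under test. Everything is
   passed by reference so that it stays Fortran-callable. */
void vec_mult_add(const float *u, const float *v, const int *cnt,
                  const float *a, float *r)
{
    int ii;
    for (ii = 0; ii < *cnt; ii++)
        r[ii] = *a * v[ii] + u[ii];
}

/* Each element costs one multiply and one add, i.e. 2 flops, so
   reps calls on cnt elements perform 2*cnt*reps flops in total. */
double mflops(int cnt, long reps, double seconds)
{
    return 2.0 * (double)cnt * (double)reps / (seconds * 1.0e6);
}

/* Time reps calls and return the elapsed CPU time in seconds.
   Assumption: a real harness would also have to make sure the
   optimizer does not remove the repeated, identical calls. */
double time_calls(const float *u, const float *v, int cnt, float a,
                  float *r, long reps)
{
    clock_t start = clock();
    long k;
    for (k = 0; k < reps; k++)
        vec_mult_add(u, v, &cnt, &a, r);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

As a purely arithmetical illustration (not a measurement), 9250 calls on 1000-element vectors finishing in one second would work out to mflops(1000, 9250, 1.0) = 18.5.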
- The most obvious way to implement this routine in C (remember, it
- has to be Fortran-callable) is probably
-
- vec_mult_add(u,v,cnt,a,r)
- float *u,*v;
- int *cnt;
- float *a,*r;
- {
-     int ii;
-     if (*cnt <= 0) return;
-     for (ii=0; ii<*cnt; ii++) {
-         *r++ = (*a * *v++) + *u++;
-     }
- }
-
- This version of vec_mult_add gets you either 4.73 MFLOPS or 7.09 MFLOPS
- depending on whether or not you compiled in ANSI mode. (ANSI mode is
- faster because the loop then does not contain implicit float-double
- conversions.) That's pretty sad.
- Now if you know a little bit about C compilers and optimizers (and
- I know only a very little) it will be pretty clear what parts of this
- routine are inhibiting the optimizer. To put it simply, the fact that
- everything is a pointer is forcing the optimizer to make extremely
- conservative assumptions. Let's make the following modifications:
- First, we'll make local copies of cnt and a so that the compiler doesn't
- have to worry about aliasing. Then, we'll use array notation for things
- which really are arrays:
-
- vec_mult_add(u,v,cnt,a,r)
- float u[],v[];
- int *cnt;
- float *a,r[];
- {
-     int ii;
-     int lcnt = *cnt;
-     float la = *a;
-     if (lcnt <= 0) return;
-     for (ii=0; ii<lcnt; ii++) {
-         r[ii] = la * v[ii] + u[ii];
-     }
- }
-
- Compiling this in ANSI mode with the -O flag we get 9.93 MFLOPS, i.e.
- about half the Fortran performance. Interestingly, we get 9.93 MFLOPS
- without switching to array syntax, i.e. with just the la and lcnt hacks.
- HP's C compiler has another optimizer switch which asserts that none of
- the arguments of a function call are aliased, the +Om1 flag. It's only
- when we make all the modifications shown above AND turn on this flag
- that we get 18.50 MFLOPS.
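For reference, compilers that implement the 1999 revision of the C standard (C99) or later accept the restrict qualifier, which lets a single function make the kind of no-aliasing promise that +Om1 makes for a whole compilation. The following is only a sketch under that assumption: the name vec_mult_add_restrict is made up for the illustration, and nothing here was measured on the 720.

```c
/* Sketch only: requires a C99 (or later) compiler. The restrict
   qualifiers promise that r does not overlap any of the inputs;
   calling this with overlapping r and u (or v) is undefined. */
void vec_mult_add_restrict(const float *restrict u,
                           const float *restrict v,
                           const int *restrict cnt,
                           const float *restrict a,
                           float *restrict r)
{
    int lcnt = *cnt;   /* local copies, as in the hand-tuned version */
    float la = *a;
    int ii;

    for (ii = 0; ii < lcnt; ii++)
        r[ii] = la * v[ii] + u[ii];
}
```

The promise is made by the author of the function and has to be honoured by every caller, so it moves the responsibility rather than removing it.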
- Now the +Om1 flag does what I want it to, but I wouldn't want to use
- it to compile a library or program written in C by someone else: since
- C does not force you to keep your function parameters distinct, I
- couldn't count on the program executing correctly. That means that in many
- scientific codes whose guts are made up mostly of simple loops like the one
- shown above, I can expect C to perform about half as well as Fortran.
- In my case, that's not a big deal. For other people whose codes run for
- days, it is.
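To make the aliasing hazard concrete, here is a small sketch. Both function names are made up for the illustration. The first routine is the straightforward loop; the second is a hand-written model of one thing an optimizer might do once it has been told the arguments never overlap, namely fetch the next iteration's inputs before storing the current result. It is not a description of what HP's compiler actually generates.

```c
/* Straightforward loop: each store completes before the next load. */
void vec_mult_add_plain(const float *u, const float *v, const int *cnt,
                        const float *a, float *r)
{
    int ii;
    for (ii = 0; ii < *cnt; ii++)
        r[ii] = *a * v[ii] + u[ii];
}

/* Model of a transformation that is only valid without aliasing:
   the loads for iteration ii+1 are issued before the store for ii. */
void vec_mult_add_hoisted(const float *u, const float *v, const int *cnt,
                          const float *a, float *r)
{
    int n = *cnt, ii;
    float la = *a;
    float next_v, next_u;

    if (n <= 0) return;
    next_v = v[0];
    next_u = u[0];
    for (ii = 0; ii < n; ii++) {
        float cur_v = next_v, cur_u = next_u;
        if (ii + 1 < n) {          /* load ahead of the store below */
            next_v = v[ii + 1];
            next_u = u[ii + 1];
        }
        r[ii] = la * cur_v + cur_u;
    }
}
```

With distinct arrays the two routines agree. With a perfectly legal overlapping call, say v pointing at a five-element buffer of ones, r pointing at its second element, u all zeros, a = 2 and cnt = 4, the plain loop leaves 1, 2, 4, 8, 16 in the buffer while the hoisted model leaves 1, 2, 2, 2, 2. A program that relied on the first behaviour would silently break under a no-aliasing assertion.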
- Note that you can push this argument to the absurd: HP's hand-coded
- assembler version of vec_mult_add cranks out over 28 MFLOPS on a 720.
- Assembler is not a good choice for most scientific programming because it
- takes too much programmer time and effort. In the case of C and Fortran
- however, since both languages are roughly equally easy to learn and
- code, no such argument can be made.
- Furthermore, if I'm allowed to use special compiler flags, HP's
- Fortran preprocessor (invoked with the +OP flag) gives this language an
- unfair edge since it should be able to recognize the loop as a
- vec_mult_add and call the assembler routine, thus producing 28 MFLOPS
- performance in the Fortran program. It would of course be completely
- unfair to compare the two languages on this basis.
- What it comes down to is this. If you write Fortran programs, you
- are so constrained that the compiler knows exactly what you mean, right
- off the bat. If you write C, you have to have ways to tell the compiler
- what you mean. Expressiveness, in this case, comes at a performance price.
- Whether C or Fortran is the better language for your number crunching
- application will depend on whether or not you need C's features. A lot
- of scientific work requires no data structures more sophisticated than
- static arrays. In those cases, Fortran is probably the better choice.
- I wouldn't want to create binary trees in Fortran any more, however.
-
- Marc R. Roussel
- mroussel@alchemy.chem.utoronto.ca
-