NetNews Usenet Archive 1993 #3

Text File | 1993-01-23 | 3.7 KB | 80 lines

Newsgroups: comp.sys.super
Path: sparky!uunet!pgroup!lfm
From: lfm@pgroup.com (Larry Meadows)
Subject: i860 performance (was: Re: World's Most Powerful Computing Sites)
Message-ID: <C19s04.3yF@pgroup.com>
Date: Fri, 22 Jan 1993 19:12:03 GMT
References: <1993Jan20.232809.29241@nas.nasa.gov> <1993Jan21.165159.10149@meiko.com> <1993Jan22.015827.26653@nas.nasa.gov>
Organization: The Portland Group, Portland, OR
Lines: 69

In article <1993Jan22.015827.26653@nas.nasa.gov> fineberg@nas.nasa.gov writes:
>In article <1993Jan21.165159.10149@meiko.com>, richard@meiko.com (Richard Cownie) writes:
>|>
>|> I have to disagree with you there. I know of *some* applications where
>|> the i860 can achieve a good fraction of claimed peak speed, e.g. on
>|> a double-precision matrix multiply you can do over 35MFLOPS, against
>|> a peak rate claimed as 40MFLOPS (or sometimes 60MFLOPS, because you can do
>|> 2 adds for each multiply). In any case, it's well over 50% of peak.
>|>
>|> If a T800 transputer can achieve 50% of 4.4MFLOPS on a matrix multiply,
>|> or indeed *anything* useful, I'd be interested to hear about it.
>|>
>|> Performance on big compiled Fortran programs is another kettle of fish,
>|> and here I'd agree that peak performance figures are not much help.
>|> --
>|> Richard Cownie (a.k.a. Tich), Meiko Scientific Corp
>|> email: richard@meiko.com
>|> phone: 617-890-7676
>|> fax: 617-890-5042
>
>I don't know too many people that write assembly code, and that is what you
>need to do to get 35 MFLOPs. As far as I'm concerned, assembly coded
>benchmarks are useless. And if you can't get more than 60% of peak on an
>assembly coded matrix multiply, that is bad (when Intel quotes peak speeds
>on its system it uses the 60MFLOPs number, or 75 for the Paragon). The best I
>have ever seen on an i860 was 10-15Mflops, and that was because the compiler
>had stuck in some special vector subroutines in my code where it recognized
>a daxpy operation.
>I don't know what the transputer is capable of, but I would
>be surprised if it can't do 75-90% of peak for a useless assembly coded
>benchmark.
>
>Sam

Well, I was going to stay out of this, but now I can't.

1. The iPSC/860 compiler will get around 27 mflops on a Fortran-coded
   matrix multiply. Yes, it does replace your daxpy with hand-coded routines.

2. Rating the i860 at 60/75 is absurd (yes, I know Intel does it that way).
   At most, I'd rate it at 40/50, since a d.p. multiply can only be done
   every other cycle. And the compiler isn't really good at generating
   dual-mode instructions, so often 20/25 is close to the truth.

3. Aside from the difficulty of automatically generating dual-mode code,
   the other problem with the i860 is memory bandwidth. Consider the daxpy:

	b(i) = b(i) + s * a(i)

   Two vectors must be fetched, and one vector stored. At 2 clocks per
   64-bit word, the absolute best that can be done here is 6 clocks per
   element, if all the data comes from main memory rather than
   "vector registers" (cache). So,

	(40e6 cycles-per-second / 6 cycles) * 2 f.p.-ops = 13.3 mflops

   Unfortunately, the iPSC/860 board actually takes 3 cycles for a write,
   so the count for the loop is really 7 cycles per element, or around

	11.4 mflops

The i860-XP should be somewhat better, since it can theoretically issue
a 128-bit load every other clock. Of course, everything needs to be
properly aligned, and the multiply is still 2 clocks. So around 4-5 clocks
per daxpy element is more likely.

The problem is that peak mflops performance doesn't give the whole story --
memory bandwidth is much more important. Perhaps machines should be rated
by how long it takes to do a vector add, or some such measure; then
"super-vector" performance occurs when values can be left in vector
registers.
-- 
Larry Meadows		The Portland Group
lfm@pgroup.com
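
For anyone who wants to replay the daxpy arithmetic above, here is a rough sketch in Python. The function names are made up for illustration, and the model is only the one used in the post: every load or store costs a fixed number of clocks and nothing overlaps. The 50 MHz clock used for the i860-XP lines is an assumption (it is what the "40/50" rating suggests); the post gives no mflops figure for the XP.

```python
def daxpy_cycles(load_cycles, store_cycles, loads=2, stores=1):
    """Clocks per daxpy element: fetch a(i) and b(i), store b(i)."""
    return loads * load_cycles + stores * store_cycles


def bandwidth_bound_mflops(clock_hz, cycles_per_element, flops_per_element=2):
    """Best-case mflops when memory traffic alone sets the pace.

    Each daxpy element is one multiply plus one add, hence 2 flops.
    """
    return clock_hz / cycles_per_element * flops_per_element / 1e6


CLOCK_HZ = 40e6  # 40 MHz i860, as in the post

# 2 clocks per 64-bit load and store: 6 clocks per element.
ideal = bandwidth_bound_mflops(CLOCK_HZ, daxpy_cycles(2, 2))

# iPSC/860 board: a write takes 3 clocks, so 7 clocks per element.
ipsc = bandwidth_bound_mflops(CLOCK_HZ, daxpy_cycles(2, 3))

print(f"ideal 6-clock loop:   {ideal:.1f} mflops")  # about 13.3
print(f"iPSC/860 7-clock loop: {ipsc:.1f} mflops")  # about 11.4

# i860-XP guess of 4-5 clocks per element, ASSUMING a 50 MHz clock.
XP_CLOCK_HZ = 50e6  # assumption, not stated in the post
for cycles in (4, 5):
    xp = bandwidth_bound_mflops(XP_CLOCK_HZ, cycles)
    print(f"i860-XP at {cycles} clocks: {xp:.1f} mflops")
```

The same two functions work for any loop once you count its loads, stores and flops per element, which is more or less the "rate it by a vector add" idea: the clock rate and the memory cycle counts set the ceiling, not the peak f.p. issue rate.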