- Newsgroups: comp.sys.super
- Path: sparky!uunet!pgroup!lfm
- From: lfm@pgroup.com (Larry Meadows)
- Subject: i860 performance (was: Re: World's Most Powerful Computing Sites)
- Message-ID: <C19s04.3yF@pgroup.com>
- Date: Fri, 22 Jan 1993 19:12:03 GMT
- References: <1993Jan20.232809.29241@nas.nasa.gov> <1993Jan21.165159.10149@meiko.com> <1993Jan22.015827.26653@nas.nasa.gov>
- Organization: The Portland Group, Portland, OR
- Lines: 69
-
- In article <1993Jan22.015827.26653@nas.nasa.gov> fineberg@nas.nasa.gov writes:
- >In article <1993Jan21.165159.10149@meiko.com>, richard@meiko.com (Richard Cownie) writes:
- >|>
- >|> I have to disagree with you there. I know of *some* applications where
- >|> the i860 can achieve a good fraction of claimed peak speed, e.g. on
- >|> a double-precision matrix multiply you can do over 35MFLOPS, against
- >|> a peak rate claimed as 40MFLOPS (or sometimes 60MFLOPS, because you can do
- >|> 2 adds for each multiply). In any case, it's well over 50% of peak.
- >|>
- >|> If a T800 transputer can achieve 50% of 4.4MFLOPS on a matrix multiply,
- >|> or indeed *anything* useful, I'd be interested to hear about it.
- >|>
- >|> Performance on big compiled Fortran programs is another kettle of fish,
- >|> and here I'd agree that peak performance figures are not much help.
- >|> --
- >|> Richard Cownie (a.k.a. Tich), Meiko Scientific Corp
- >|> email: richard@meiko.com
- >|> phone: 617-890-7676
- >|> fax: 617-890-5042
- >
- >I don't know too many people that write assembly code, and that is what you
- >need to do to get 35 MFLOPs. As far as I'm concerned, assembly coded
- >benchmarks are useless. And if you can't get more than 60% of peak on an
- >assembly coded matrix multiply, that is bad (when Intel quotes peak speeds
- >on its system it uses the 60MFLOPs number, or 75 for the Paragon). The best I
- >have ever seen on an i860 was 10-15Mflops, and that was because the compiler
- >had stuck in some special vector subroutines in my code where it recognized
- >a daxpy operation. I don't know what the transputer is capable of, but I would
- >be surprised if it can't do 75-90% of peak for a useless assembly coded
- >benchmark.
- >
- >Sam
-
- Well, I was going to stay out of this, but now I can't.
-
- 1. The iPSC/860 compiler will get around 27 mflops on fortran-coded matrix
- multiply. Yes, it does replace your daxpy with hand-coded routines.
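-
- For the record, the sort of fortran loop nest I mean is something like the
- following sketch (subroutine and variable names are just illustrative); the
- inner i loop is the daxpy the compiler picks off:
-
-       subroutine mxm(a, b, c, n)
-       integer n, i, j, k
-       double precision a(n,n), b(n,n), c(n,n)
- c     assumes c has been zeroed; the inner loop over i is a daxpy on column j
-       do 30 j = 1, n
-          do 20 k = 1, n
-             do 10 i = 1, n
-                c(i,j) = c(i,j) + a(i,k) * b(k,j)
- 10          continue
- 20       continue
- 30    continue
-       return
-       end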
-
- 2. Rating the i860 at 60/75 is absurd (yes, I know Intel does it that way).
- At most, I'd rate it at 40/50, since d.p. multiply can only be done every other
- cycle. And the compiler isn't really good at doing dual-mode instructions, so
- often 20/25 is close to the truth.
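-
- (To spell out the arithmetic, at the 40 MHz clock: one add per clock plus
- one d.p. multiply every other clock gives 40 + 20 = 60 "peak" mflops, while
- counting just one d.p. result per clock gives 40. The 50/75 figures are
- presumably the same arithmetic at the Paragon's faster clock.)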
-
- 3. Aside from the difficulty of automatically generating dual-mode code, the
- other problem with the i860 is memory bandwidth. Consider the daxpy:
-
- b(i) = b(i) + s * a(i)
-
- Two vectors must be fetched, and one vector stored. At 2 clocks per 64-bit
- word, the absolute best that can be done here is 6 clocks per element,
- assuming everything comes from (and goes back to) main memory rather than
- "vector registers" (cache). So,
-
- (40e6 cycles per second / 6 cycles per element) * 2 f.p. ops = 13.3 mflops
-
- Unfortunately, the iPSC/860 board actually takes 3 cycles for a write, so
- the count for the loop is really 2 + 2 + 3 = 7 cycles per element, or
- around 11.4 mflops.
-
- The i860-XP should be somewhat better, since it can theoretically
- issue a 128-bit load every other clock. Of course, everything needs to
- be properly aligned, and the multiply is still 2 clocks. So around 4-5
- clocks for a daxpy element is more likely.
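-
- Plugging that in: 4-5 clocks per element and 2 f.p. ops gives roughly
- 16-20 mflops at 40 MHz, or 20-25 mflops if the XP is clocked at 50 MHz
- as in the Paragon; still well short of the quoted peak.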
-
- The problem is that peak mflops performance doesn't give the whole story --
- memory bandwidth is much more important. Perhaps machines should be
- rated by how long it takes to do a vector add, or some such measure; then
- "super-vector" performance occurs when values can be left in vector registers.
- --
- Larry Meadows The Portland Group
- lfm@pgroup.com
-