Xref: sparky comp.arch:10663 comp.benchmarks:1672
Newsgroups: comp.arch,comp.benchmarks
Path: sparky!uunet!ukma!darwin.sura.net!news.udel.edu!perelandra.cms.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Subject: Re: DEC ALPHA Performance Claims
Message-ID: <BxM8xv.EFI@news.udel.edu>
Summary: Micros have a long way to go yet....
Keywords: memory bandwidth, vectorized codes
Sender: usenet@news.udel.edu
Nntp-Posting-Host: perelandra.cms.udel.edu
Organization: College of Marine Studies, U. Del.
References: <1992Nov12.091854.22914@walter.cray.com>
Date: Thu, 12 Nov 1992 18:34:42 GMT
Lines: 71

In article <1992Nov12.091854.22914@walter.cray.com> cmg@magnet.cray.com writes:
>
>We must be careful when comparing new architectures with old
>architectures because of the effect of software. [...]
> [....]
>The improvement of microprocessor speed of the past decade has been
>tremendous. How much of this improvement is attributable to software?

Without bothering to hunt down the numbers, it is clear that for the
LINPACK 100x100 and 1000x1000 benchmarks, the effect of software
improvements has been tremendous. Because of this, I no longer
consider any of the LINPACK numbers to be useful for characterizing
system performance (except for dense linear algebra).

The two cases have been improved by different sets of software
enhancements:

(1) In the 100x100 case, inlining has been the real key, though there
is still some room for improvement in the compile-time evaluation of
the excessive logical predicates in the BLAS routines. On the hardware
side, larger caches have helped quite a bit.
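
To make the point concrete, here is a sketch (in Python purely for
illustration; the real BLAS is Fortran) of a reference-style DAXPY with
the kind of run-time predicates meant above: the quick-return tests and
the unit-stride special case. Once the routine is inlined into the
LINPACK caller, a compiler can evaluate these predicates at compile
time and keep only the hot loop.

```python
def daxpy(n, alpha, x, incx, y, incy):
    """y := alpha*x + y, walking x and y with strides incx, incy."""
    if n <= 0 or alpha == 0.0:           # quick-return predicates
        return
    if incx == 1 and incy == 1:          # unit-stride special case
        for i in range(n):
            y[i] += alpha * x[i]
        return
    ix = 0 if incx > 0 else (1 - n) * incx   # negative-stride start points
    iy = 0 if incy > 0 else (1 - n) * incy
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy

y = [1.0, 1.0, 1.0, 1.0]
daxpy(4, 2.0, [1.0, 2.0, 3.0, 4.0], 1, y, 1)
assert y == [3.0, 5.0, 7.0, 9.0]
```

When DAXPY is called with literal increments of 1, inlining turns every
one of those branches into dead code.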

(2) In the 1000x1000 case, the improvement has been mostly due to the
industry's increasing knowledge of block-mode algorithms, although
the use of pipelined FPUs has been crucial for these algorithmic
improvements to be effective.
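
The blocking idea itself fits in a few lines. The following sketch
(toy sizes, not the benchmark's, and Python rather than Fortran) shows
the restructured loop nest: each NB x NB tile of the operands is reused
NB times per trip from memory, which is what lets a fast pipelined FPU
stay busy behind a slow memory system.

```python
N, NB = 64, 16  # matrix order and block size (illustrative values only)

A = [[float(i + j) for j in range(N)] for i in range(N)]
B = [[float(i - j) for j in range(N)] for i in range(N)]

def matmul_naive(a, b):
    """Plain triple loop: every element of b is refetched N times."""
    c = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            s = 0.0
            for k in range(N):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

def matmul_blocked(a, b):
    """Blocked loop nest: operands are reused NB times per memory load."""
    c = [[0.0] * N for _ in range(N)]
    for ii in range(0, N, NB):
        for kk in range(0, N, NB):
            for jj in range(0, N, NB):
                for i in range(ii, ii + NB):
                    for k in range(kk, kk + NB):
                        aik = a[i][k]
                        for j in range(jj, jj + NB):
                            c[i][j] += aik * b[k][j]
    return c

assert matmul_naive(A, B) == matmul_blocked(A, B)
```

Same arithmetic, roughly 1/NB the main-memory traffic; that ratio is
the whole story of the 1000x1000 improvements.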

Just as in the case of the SPEC89 Matrix300 benchmark (which was made
useless by the combination of these two sets of optimizations), this
set of enhancements has made the LINPACK test cases rather poor
predictors of scientific workload performance. For most large-scale
scientific applications, the real bottleneck continues to be
sustainable memory bandwidth --- often requiring non-unit or irregular
strides.
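
The kind of kernel I mean is a vector triad: one multiply-add per
iteration against three memory streams, so its speed is set by memory
traffic rather than peak FLOPS. A minimal sketch (sizes and the stride
value are hypothetical):

```python
def triad(a, b, c, scalar, stride=1):
    """a[i] = b[i] + scalar*c[i] over every stride-th element."""
    for i in range(0, len(a), stride):
        a[i] = b[i] + scalar * c[i]

n = 1024
b = [1.0] * n
c = [2.0] * n

a = [0.0] * n
triad(a, b, c, 3.0)                 # unit stride: every element updated
assert all(x == 7.0 for x in a)

a = [0.0] * n
triad(a, b, c, 3.0, stride=4)       # non-unit stride: the same cache
assert a[0] == 7.0 and a[1] == 0.0  # lines move for 1/4 the useful data
```

At stride 4 a cached machine still pays for full cache lines while
using a quarter of each one, which is why non-unit and irregular
strides expose the memory system so brutally.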

The combination of unbalanced architectures (fast and/or pipelined
FPUs combined with slow memory systems) with an algorithm that
benefits far more than most from blocking results in systems that
cannot deliver what these LINPACK numbers appear to promise.

----------------------------------------------------------------------

It is interesting to note that the top-of-the-line DEC Alpha system
appears to be about even with the Cray C90 in $/MFLOPS for the LINPACK
1000x1000 case. Since this is a best case for the Alpha, one must
conclude that the C90 is more cost-effective for more memory-intensive
vectorized algorithms (i.e., almost all of them).

	Alpha AXP 10000 : $ 300k/111 MFLOPS = 2.7 k$/MFLOPS
	Cray C90 - 1 cpu: $2500k/871 MFLOPS = 2.9 k$/MFLOPS

	The Cray price is a guesstimate. It is probably not too far off
	for a uniprocessor system. Multiprocessor systems are likely
	noticeably cheaper per cpu.
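
The arithmetic behind the table is easy to check (prices in k$, rates
in MFLOPS):

```python
# Reproducing the k$/MFLOPS figures in the table above.
systems = {
    "Alpha AXP 10000":  (300.0, 111.0),   # price (k$), LINPACK 1000x1000
    "Cray C90 - 1 cpu": (2500.0, 871.0),
}
for name, (price_k, mflops) in systems.items():
    print(f"{name}: {price_k / mflops:.1f} k$/MFLOPS")

assert round(300.0 / 111.0, 1) == 2.7
assert round(2500.0 / 871.0, 1) == 2.9
```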
----------------------------------------------------------------------

It is also interesting to note that the DEC AXP 3000/500 at 150 MHz
has LINPACK 100x100 and 1000x1000 numbers almost identical to those
of the IBM RS/6000-970 at 50 MHz. Prices are similar, with DEC
appearing to have a slight edge: $39k vs ~$55k.

----------------------------------------------------------------------

All in all, a disappointing set of announcements from the point of view
of this number-cruncher....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	John.McCalpin@mvs.udel.edu