Xref: sparky comp.arch:10663 comp.benchmarks:1672
Newsgroups: comp.arch,comp.benchmarks
Path: sparky!uunet!ukma!darwin.sura.net!news.udel.edu!perelandra.cms.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Subject: Re: DEC ALPHA Performance Claims
Message-ID: <BxM8xv.EFI@news.udel.edu>
Summary: Micros have a long way to go yet....
Keywords: memory bandwidth, vectorized codes
Sender: usenet@news.udel.edu
Nntp-Posting-Host: perelandra.cms.udel.edu
Organization: College of Marine Studies, U. Del.
References: <1992Nov12.091854.22914@walter.cray.com>
Date: Thu, 12 Nov 1992 18:34:42 GMT
Lines: 71

In article <1992Nov12.091854.22914@walter.cray.com> cmg@magnet.cray.com writes:
>
>We must be careful when comparing new architectures with old
>architectures because of the effect of software. [...]
> [....]
>The improvement of microprocessor speed of the past decade has been
>tremendous. How much of this improvement is attributable to software?

Without bothering to hunt down the numbers, it is clear that for the
LINPACK 100x100 and 1000x1000 benchmarks, the effect of software
improvements has been tremendous. Because of this, I no longer
consider any of the LINPACK numbers to be useful for characterizing
system performance (except for dense linear algebra).

The two cases have been improved by different sets of software
enhancements:

(1) In the 100x100 case, inlining has been the real key, though there
is still some room for improvement in the compile-time evaluation of
the excessive logical predicates in the BLAS routines. On the hardware
side, larger caches have helped quite a bit.
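
To make the point concrete, here is a sketch (in Python purely for
illustration; the real BLAS is Fortran) of a reference-style DAXPY with
the kind of run-time predicates meant above: the quick-return tests and
the unit-stride special case. Once the routine is inlined into the
LINPACK caller, a compiler can evaluate these predicates at compile
time and keep only the hot loop.

```python
def daxpy(n, alpha, x, incx, y, incy):
    """y := alpha*x + y, walking x and y with strides incx, incy."""
    if n <= 0 or alpha == 0.0:           # quick-return predicates
        return
    if incx == 1 and incy == 1:          # unit-stride special case
        for i in range(n):
            y[i] += alpha * x[i]
        return
    ix = 0 if incx > 0 else (1 - n) * incx   # negative-stride start points
    iy = 0 if incy > 0 else (1 - n) * incy
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy

y = [1.0, 1.0, 1.0, 1.0]
daxpy(4, 2.0, [1.0, 2.0, 3.0, 4.0], 1, y, 1)
assert y == [3.0, 5.0, 7.0, 9.0]
```

When DAXPY is called with literal increments of 1, inlining turns every
one of those branches into dead code.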

(2) In the 1000x1000 case, the improvement has been mostly due to the
industry's increasing knowledge of block-mode algorithms, although
the use of pipelined FPUs has been crucial for these algorithmic
improvements to be effective.
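
The blocking idea itself fits in a few lines. The following sketch
(toy sizes, not the benchmark's, and Python rather than Fortran) shows
the restructured loop nest: each NB x NB tile of the operands is reused
NB times per trip from memory, which is what lets a fast pipelined FPU
stay busy behind a slow memory system.

```python
N, NB = 64, 16  # matrix order and block size (illustrative values only)

A = [[float(i + j) for j in range(N)] for i in range(N)]
B = [[float(i - j) for j in range(N)] for i in range(N)]

def matmul_naive(a, b):
    """Plain triple loop: every element of b is refetched N times."""
    c = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            s = 0.0
            for k in range(N):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

def matmul_blocked(a, b):
    """Blocked loop nest: operands are reused NB times per memory load."""
    c = [[0.0] * N for _ in range(N)]
    for ii in range(0, N, NB):
        for kk in range(0, N, NB):
            for jj in range(0, N, NB):
                for i in range(ii, ii + NB):
                    for k in range(kk, kk + NB):
                        aik = a[i][k]
                        for j in range(jj, jj + NB):
                            c[i][j] += aik * b[k][j]
    return c

assert matmul_naive(A, B) == matmul_blocked(A, B)
```

Same arithmetic, roughly 1/NB the main-memory traffic; that ratio is
the whole story of the 1000x1000 improvements.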

Just as in the case of the SPEC89 Matrix300 benchmark (which was made
useless by the combination of these two sets of optimizations), this
set of enhancements has made the LINPACK test cases rather poor
predictors of scientific workload performance. For most large-scale
scientific applications, the real bottleneck continues to be
sustainable memory bandwidth --- often requiring non-unit or irregular
strides.
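
The kind of kernel I mean is a vector triad: one multiply-add per
iteration against three memory streams, so its speed is set by memory
traffic rather than peak FLOPS. A minimal sketch (sizes and the stride
value are hypothetical):

```python
def triad(a, b, c, scalar, stride=1):
    """a[i] = b[i] + scalar*c[i] over every stride-th element."""
    for i in range(0, len(a), stride):
        a[i] = b[i] + scalar * c[i]

n = 1024
b = [1.0] * n
c = [2.0] * n

a = [0.0] * n
triad(a, b, c, 3.0)                 # unit stride: every element updated
assert all(x == 7.0 for x in a)

a = [0.0] * n
triad(a, b, c, 3.0, stride=4)       # non-unit stride: the same cache
assert a[0] == 7.0 and a[1] == 0.0  # lines move for 1/4 the useful data
```

At stride 4 a cached machine still pays for full cache lines while
using a quarter of each one, which is why non-unit and irregular
strides expose the memory system so brutally.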

The combination of unbalanced architectures (fast and/or pipelined
FPUs combined with slow memory systems) with an algorithm that
benefits far more than most from blocking results in systems that
cannot deliver what these LINPACK numbers appear to promise.

----------------------------------------------------------------------

It is interesting to note that the top-of-the-line DEC Alpha system
appears to be about even with the Cray C90 in $/MFLOPS for the LINPACK
1000x1000 case. Since this is a best case for the Alpha, one must
conclude that the C90 is more cost-effective for more memory-intensive
vectorized algorithms (i.e., almost all of them).

	Alpha AXP 10000 : $ 300k/111 MFLOPS = 2.7 k$/MFLOPS
	Cray C90 - 1 cpu: $2500k/871 MFLOPS = 2.9 k$/MFLOPS

	The Cray price is a guesstimate. It is probably not too far off
	for a uniprocessor system. Multiprocessor systems are likely
	noticeably cheaper per cpu.
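
The arithmetic behind the table is easy to check (prices in k$, rates
in MFLOPS):

```python
# Reproducing the k$/MFLOPS figures in the table above.
systems = {
    "Alpha AXP 10000":  (300.0, 111.0),   # price (k$), LINPACK 1000x1000
    "Cray C90 - 1 cpu": (2500.0, 871.0),
}
for name, (price_k, mflops) in systems.items():
    print(f"{name}: {price_k / mflops:.1f} k$/MFLOPS")

assert round(300.0 / 111.0, 1) == 2.7
assert round(2500.0 / 871.0, 1) == 2.9
```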
----------------------------------------------------------------------

It is also interesting to note that the DEC AXP 3000/500 at 150 MHz
has LINPACK 100x100 and 1000x1000 numbers almost identical to those
of the IBM RS/6000-970 at 50 MHz. Prices are similar, with DEC
appearing to have a slight edge: $39k vs ~$55k.

----------------------------------------------------------------------

All in all, a disappointing set of announcements from the point of view
of this number-cruncher....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	John.McCalpin@mvs.udel.edu