Newsgroups: comp.benchmarks
Path: sparky!uunet!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!porpoise!marlin!aburto
From: aburto@nosc.mil (Alfred A. Aburto)
Subject: Re: Geometric Mean or Median
Message-ID: <1992Aug23.114309.3643@nosc.mil>
Organization: Naval Ocean Systems Center, San Diego
References: <1992Aug12.172209.3108@nas.nasa.gov> <Aug14.142126.38458@yuma.ACNS.ColoState.EDU> <1992Aug20.160352.13856@nas.nasa.gov>
Distribution: comp.benchmarks
Date: Sun, 23 Aug 1992 11:43:09 GMT
Lines: 85

In article <1992Aug20.160352.13856@nas.nasa.gov> eugene@wilbur.nas.nasa.gov (Eugene N. Miya) writes:
>>A discussion of this, and an offered proof
>>of the geometric mean as preferred method is in the March 1986 issue of
>>Communications of the ACM, "How Not to Lie With Statistics: The Correct
>>Way to Summarize Benchmark Results," by Fleming and Wallace.
>
>Personally, I think the "proof" is weak. Worlton has an earlier, less
>frequently cited paper saying the harmonic mean is the way to go. The above
>authors didn't do enough literature checking. If the arithmetic, the geometric,
>and the harmonic mean are all suspect, then don't trust any of them.
>
>--eugene miya, NASA Ames Research Center, eugene@orville.nas.nasa.gov
> Resident Cynic, Rock of Ages Home for Retired Hackers
> {uunet,mailrus,other gateways}!ames!eugene
>Second Favorite email message: Returned mail: Cannot send message for 3 days
>Ref: J. Worlton, Benchmarkology, email for additional ref.

-------
The thing we need to realize is that 'benchmark' results are random
numbers :-)

If we keep in mind that they are really random numbers, then this might
help avoid some of the typical problems that occur when comparing
'benchmark' results. It may help us realize that we cannot take two
isolated results and compare them, since the measurements, and the
measures of performance derived from them (means, medians, and so on),
are noisy and subject to a generally unknown error. This is true of the
SPEC suite just as much as it is true for Dhrystone, Whetstone, Linpack,
and all the rest. The SPEC results, it would appear, are more reliable
because each is the geometric mean of a number of different results.
However, even the SPEC results will vary, as will any measure of
performance, when the underlying test parameters change. See SunWorld
magazine, March 1992, p. 48, "SPEC: Odyssey of a benchmark", where
geometric mean SPECmark ratings of 17.8, 20.0, 20.8, and 25.0 were
measured on the same machine using different compilers.

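To make the arithmetic concrete, here is a small C sketch that
summarizes one set of per-benchmark performance ratios three different
ways. The ratios are made-up numbers for illustration only, not real
SPEC data:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical per-benchmark performance ratios (test machine
     * relative to a reference machine) -- invented for illustration. */
    double ratio[] = { 1.8, 4.5, 2.2, 9.0, 3.1, 0.9 };
    int n = sizeof(ratio) / sizeof(ratio[0]);
    double sum = 0.0, logsum = 0.0, invsum = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        sum    += ratio[i];          /* for the arithmetic mean          */
        logsum += log(ratio[i]);     /* geometric mean via a sum of logs */
        invsum += 1.0 / ratio[i];    /* for the harmonic mean            */
    }
    printf("arithmetic mean = %.2f\n", sum / n);
    printf("geometric  mean = %.2f\n", exp(logsum / n));
    printf("harmonic   mean = %.2f\n", (double)n / invsum);
    return 0;
}

For positive ratios the arithmetic mean is always the largest of the
three and the harmonic mean the smallest, so the three summaries can
rank two machines differently; that is exactly what the argument over
the 'correct' mean is about.
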
Well, I agree: since all our measures of performance are error-prone,
we must be skeptical when comparing any isolated raw results. And we
must be careful when comparing mean results without some indication of
the magnitude of the error involved.

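Reporting such an error estimate is cheap. A bare-bones sketch, with
invented numbers standing in for repeated runs of a single benchmark
on a single machine, might look like this:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Invented MFLOPS figures for repeated runs -- placeholders only,
     * not measured data. */
    double run[] = { 18.2, 19.7, 17.9, 20.4, 18.8 };
    int n = sizeof(run) / sizeof(run[0]);
    double sum = 0.0, ssq = 0.0, mean, sdev;
    int i;

    for (i = 0; i < n; i++)
        sum += run[i];
    mean = sum / n;
    for (i = 0; i < n; i++)
        ssq += (run[i] - mean) * (run[i] - mean);
    sdev = sqrt(ssq / (n - 1));            /* sample standard deviation */
    printf("mean = %.1f MFLOPS, std dev = %.1f, std error = %.2f\n",
           mean, sdev, sdev / sqrt((double)n));
    return 0;
}

Quoting a mean together with its standard deviation (or standard
error) at least tells the reader how much weight a small difference
between two machines can really carry.
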
I have heard a lot of Dhrystone 1.1 bashing in the past. I tried to
understand just how 'bad' Dhrystone was several years ago (it seems)
by correlating Dhrystone 1.1 results with SPECint89 results. I had
20 or so data points to work with. I thought perhaps there would be
little correlation and that Dhrystone really needed to be cast out
as a measure of performance, because of the greater confidence placed
in the SPEC results. Instead, they were highly correlated (a
correlation coefficient of 0.92 or so). Well, I had to revise my
thinking. When I plotted the results it was clear they were well
correlated. There were only two rather large and obvious differences
between the Dhrystone 1.1 and SPECint89 results. Dhrystone 1.1 showed
a couple of big 'spikes' in performance that were not present in the
SPECint89 results (for an HP 68040 and an i860, I believe). This
result indicated Dhrystone 1.1 was probably OK, but one needed a
variety of results (different systems, compilers, etc.) to help filter
out those 'spikes', which appeared unrepresentative of general integer
performance as measured by SPECint89. It also showed that it is
necessary to base performance on a number of different test programs
(as in SPEC) instead of just one program.

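The correlation check itself is simple to reproduce. Something along
these lines, with invented placeholder scores rather than my actual
Dhrystone 1.1 and SPECint89 data points, is all it takes:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Invented placeholder scores for the same set of machines --
     * not the actual data points referred to above. */
    double dhry[] = { 10.0, 14.0, 18.0, 22.0, 35.0, 40.0 };
    double spec[] = {  9.0, 13.5, 17.0, 21.0, 24.0, 26.0 };
    int n = sizeof(dhry) / sizeof(dhry[0]);
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0, r;
    int i;

    for (i = 0; i < n; i++) {
        sx  += dhry[i];
        sy  += spec[i];
        sxx += dhry[i] * dhry[i];
        syy += spec[i] * spec[i];
        sxy += dhry[i] * spec[i];
    }
    /* Pearson correlation coefficient */
    r = (n * sxy - sx * sy) /
        sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    printf("correlation = %.2f\n", r);
    return 0;
}

A scatter plot of the two sets of scores is still worth making, since
a single outlier (like the 'spikes' above) can move the coefficient
around quite a bit.
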
Even using a number of different test programs might not help make
things more definite.

This was really brought home to me recently by the posting and email
from Frank McMahon regarding the Livermore Loops MFLOPS results. The
Livermore Loops consist of 72 calculation loops typically found in
'big' programs. However, the MFLOPS results can show a huge variation
in performance on the very fast machines. Frank indicated that the
NEC SX-3 supercomputer showed a variation in performance from 3 MFLOPS
all the way to 3400 MFLOPS. The results were Poisson distributed. The
standard deviation was 500 MFLOPS. One wonders about the meaning of
any one or more measures of performance in this case. It is almost as
if it is necessary to look at individual cases very carefully instead
of some mean or median result. That is, I would not want to buy a
NEC SX-3 supercomputer if it turned out most of the work I'd be doing
corresponded to the low-end performance of that system, where a fast
68040 or 80486 might do just as well. Whew, what a nasty situation ---
the spread in performance is just too big.

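To see how little a single summary number says in a case like this,
here is a toy example. The per-loop MFLOPS values are invented
placeholders spanning a similarly huge range; they are not Frank's
SX-3 measurements:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int cmp(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Invented per-loop MFLOPS figures -- placeholders only. */
    double mflops[] = { 3, 12, 40, 95, 180, 300, 550, 900, 1500, 3400 };
    int n = sizeof(mflops) / sizeof(mflops[0]);
    double sum = 0.0, ssq = 0.0, mean, median;
    int i;

    qsort(mflops, n, sizeof(mflops[0]), cmp);   /* sort for the median */
    for (i = 0; i < n; i++)
        sum += mflops[i];
    mean = sum / n;
    for (i = 0; i < n; i++)
        ssq += (mflops[i] - mean) * (mflops[i] - mean);
    median = (n % 2) ? mflops[n / 2]
                     : (mflops[n / 2 - 1] + mflops[n / 2]) / 2.0;
    printf("min %.0f  median %.0f  mean %.0f  max %.0f  std dev %.0f\n",
           mflops[0], median, mean, mflops[n - 1], sqrt(ssq / (n - 1)));
    return 0;
}

With a spread like that, the mean, the median, and the standard
deviation each tell a different story, and none of them says whether
the loops *you* care about land at the fast end or the slow end.
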
Al Aburto
aburto@marlin.nosc.mil
-------