Newsgroups: comp.benchmarks
Path: sparky!uunet!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!porpoise!marlin!aburto
From: aburto@nosc.mil (Alfred A. Aburto)
Subject: Re: Geometric Mean or Median
Message-ID: <1992Aug23.114309.3643@nosc.mil>
Organization: Naval Ocean Systems Center, San Diego
References: <1992Aug12.172209.3108@nas.nasa.gov> <Aug14.142126.38458@yuma.ACNS.ColoState.EDU> <1992Aug20.160352.13856@nas.nasa.gov>
Distribution: comp.benchmarks
Date: Sun, 23 Aug 1992 11:43:09 GMT
Lines: 85

In article <1992Aug20.160352.13856@nas.nasa.gov> eugene@wilbur.nas.nasa.gov (Eugene N. Miya) writes:
>>A discussion of this, and an offered proof
>>of the geometric mean as preferred method is in the March 1986 issue of
>>Communications of the ACM, "How Not to Lie With Statistics: The Correct
>>Way to Summarize Benchmark Results," by Fleming and Wallace.
>
>Personally, I think the "proof" is weak. Worlton has an earlier, less
>frequently cited paper saying the harmonic mean is the way to go. The above
>authors didn't do enough literature checking. If the arithmetic, the geometric,
>and the harmonic mean are all suspect, then don't trust any of them.
>
>--eugene miya, NASA Ames Research Center, eugene@orville.nas.nasa.gov
> Resident Cynic, Rock of Ages Home for Retired Hackers
> {uunet,mailrus,other gateways}!ames!eugene
>Second Favorite email message: Returned mail: Cannot send message for 3 days
>Ref: J. Worlton, Benchmarkology, email for additional ref.

-------
The thing we need to realize is that 'benchmark' results are random
numbers :-)

If we keep in mind that they are really random numbers, then this might
help avoid some of the typical problems that occur when comparing
'benchmark' results. It may help us realize that we cannot take two
isolated results and compare them, since the measurements, and the
measures of performance derived from them (means, medians, and so on),
are noisy and subject to a generally unknown error. This is true of the
SPEC suite just as much as it is true for Dhrystone, Whetstone, Linpack,
and all the rest. The SPEC results, it would appear, are more reliable
because each is the geometric mean of a number of different results.
However, even the SPEC results will vary, as will any measure of
performance, when the underlying test parameters change. See SunWorld
magazine, March 1992, p. 48, "SPEC: Odyssey of a benchmark", where
geometric mean SPECmark ratings of 17.8, 20.0, 20.8, and 25.0 were
measured on the same machine using different compilers.

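To make the arithmetic concrete, here is a small C sketch that
summarizes one set of per-benchmark performance ratios three different
ways. The ratios are made-up numbers for illustration only, not real
SPEC data:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical per-benchmark performance ratios (test machine
     * relative to a reference machine) -- invented for illustration. */
    double ratio[] = { 1.8, 4.5, 2.2, 9.0, 3.1, 0.9 };
    int n = sizeof(ratio) / sizeof(ratio[0]);
    double sum = 0.0, logsum = 0.0, invsum = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        sum    += ratio[i];          /* for the arithmetic mean          */
        logsum += log(ratio[i]);     /* geometric mean via a sum of logs */
        invsum += 1.0 / ratio[i];    /* for the harmonic mean            */
    }
    printf("arithmetic mean = %.2f\n", sum / n);
    printf("geometric  mean = %.2f\n", exp(logsum / n));
    printf("harmonic   mean = %.2f\n", (double)n / invsum);
    return 0;
}

For positive ratios the arithmetic mean is always the largest of the
three and the harmonic mean the smallest, so the three summaries can
rank two machines differently; that is exactly what the argument over
the 'correct' mean is about.
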
Well, I agree: since all our measures of performance are error-prone,
we must be skeptical when comparing any isolated raw results. And we
must be careful when comparing mean results without some indication of
the magnitude of the error involved.

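Reporting such an error estimate is cheap. A bare-bones sketch, with
invented numbers standing in for repeated runs of a single benchmark
on a single machine, might look like this:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Invented MFLOPS figures for repeated runs -- placeholders only,
     * not measured data. */
    double run[] = { 18.2, 19.7, 17.9, 20.4, 18.8 };
    int n = sizeof(run) / sizeof(run[0]);
    double sum = 0.0, ssq = 0.0, mean, sdev;
    int i;

    for (i = 0; i < n; i++)
        sum += run[i];
    mean = sum / n;
    for (i = 0; i < n; i++)
        ssq += (run[i] - mean) * (run[i] - mean);
    sdev = sqrt(ssq / (n - 1));            /* sample standard deviation */
    printf("mean = %.1f MFLOPS, std dev = %.1f, std error = %.2f\n",
           mean, sdev, sdev / sqrt((double)n));
    return 0;
}

Quoting a mean together with its standard deviation (or standard
error) at least tells the reader how much weight a small difference
between two machines can really carry.
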
I have heard a lot of Dhrystone 1.1 bashing in the past. I tried to
understand just how 'bad' Dhrystone was several years ago (it seems)
by correlating Dhrystone 1.1 results with SPECint89 results. I had
20 or so data points to work with. I thought perhaps there would be
little correlation and that Dhrystone really needed to be cast out
as a measure of performance, because of the greater confidence placed
in the SPEC results. Instead, they were highly correlated (a
correlation coefficient of 0.92 or so). Well, I had to revise my
thinking. When I plotted the results it was clear they were well
correlated. There were only two rather large and obvious differences
between the Dhrystone 1.1 and SPECint89 results. Dhrystone 1.1 showed
a couple of big 'spikes' in performance that were not present in the
SPECint89 results (for an HP 68040 and an i860, I believe). This
result indicated Dhrystone 1.1 was probably OK, but one needed a
variety of results (different systems, compilers, etc.) to help filter
out those 'spikes', which appeared unrepresentative of general integer
performance as measured by SPECint89. It also showed that it is
necessary to base performance on a number of different test programs
(as in SPEC) instead of just one program.

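The correlation check itself is simple to reproduce. Something along
these lines, with invented placeholder scores rather than my actual
Dhrystone 1.1 and SPECint89 data points, is all it takes:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Invented placeholder scores for the same set of machines --
     * not the actual data points referred to above. */
    double dhry[] = { 10.0, 14.0, 18.0, 22.0, 35.0, 40.0 };
    double spec[] = {  9.0, 13.5, 17.0, 21.0, 24.0, 26.0 };
    int n = sizeof(dhry) / sizeof(dhry[0]);
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0, r;
    int i;

    for (i = 0; i < n; i++) {
        sx  += dhry[i];
        sy  += spec[i];
        sxx += dhry[i] * dhry[i];
        syy += spec[i] * spec[i];
        sxy += dhry[i] * spec[i];
    }
    /* Pearson correlation coefficient */
    r = (n * sxy - sx * sy) /
        sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    printf("correlation = %.2f\n", r);
    return 0;
}

A scatter plot of the two sets of scores is still worth making, since
a single outlier (like the 'spikes' above) can move the coefficient
around quite a bit.
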
Even using a number of different test programs might not help make
things more definite.

This was really brought home to me recently by the posting and email
from Frank McMahon regarding the Livermore Loops MFLOPS results. The
Livermore Loops consist of 72 calculation loops typically found in
'big' programs. However, the MFLOPS results can show a huge variation
in performance on the very fast machines. Frank indicated that the
NEC SX-3 supercomputer showed a variation in performance from 3 MFLOPS
all the way to 3400 MFLOPS. The results were Poisson distributed. The
standard deviation was 500 MFLOPS. One wonders about the meaning of
any one or more measures of performance in this case. It is almost as
if it is necessary to look at individual cases very carefully instead
of some mean or median result. That is, I would not want to buy a
NEC SX-3 supercomputer if it turned out most of the work I'd be doing
corresponded to the low-end performance of that system, where a fast
68040 or 80486 might do just as well. Whew, what a nasty situation ---
the spread in performance is just too big.

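To see how little a single summary number says in a case like this,
here is a toy example. The per-loop MFLOPS values are invented
placeholders spanning a similarly huge range; they are not Frank's
SX-3 measurements:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int cmp(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Invented per-loop MFLOPS figures -- placeholders only. */
    double mflops[] = { 3, 12, 40, 95, 180, 300, 550, 900, 1500, 3400 };
    int n = sizeof(mflops) / sizeof(mflops[0]);
    double sum = 0.0, ssq = 0.0, mean, median;
    int i;

    qsort(mflops, n, sizeof(mflops[0]), cmp);   /* sort for the median */
    for (i = 0; i < n; i++)
        sum += mflops[i];
    mean = sum / n;
    for (i = 0; i < n; i++)
        ssq += (mflops[i] - mean) * (mflops[i] - mean);
    median = (n % 2) ? mflops[n / 2]
                     : (mflops[n / 2 - 1] + mflops[n / 2]) / 2.0;
    printf("min %.0f  median %.0f  mean %.0f  max %.0f  std dev %.0f\n",
           mflops[0], median, mean, mflops[n - 1], sqrt(ssq / (n - 1)));
    return 0;
}

With a spread like that, the mean, the median, and the standard
deviation each tell a different story, and none of them says whether
the loops *you* care about land at the fast end or the slow end.
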
Al Aburto
aburto@marlin.nosc.mil
-------