- Newsgroups: comp.benchmarks
- Path: sparky!uunet!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!porpoise!marlin!aburto
- From: aburto@nosc.mil (Alfred A. Aburto)
- Subject: Re: Geometric Mean or Median
- Message-ID: <1992Aug31.002356.24988@nosc.mil>
- Organization: Naval Ocean Systems Center, San Diego
- References: <1992Aug20.160352.13856@nas.nasa.gov> <1992Aug23.114309.3643@nosc.mil> <1992Aug26.160240.20114@murdoch.acc.Virginia.EDU>
- Distribution: comp.benchmarks
- Date: Mon, 31 Aug 1992 00:23:56 GMT
- Lines: 186
-
-
- In Article <1992Aug26.160240.20114@murdoch.acc.Virginia.EDU>
- clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
-
- In article <1992Aug23.114309.3643@nosc.mil>
- aburto@nosc.mil (Alfred A. Aburto) writes:
-
- >>I have heard a lot of Dhrystone 1.1 bashing in the past. I tried to
- >>understand just how 'bad' Dhrystone was several years ago (it seems)
- >>by correlating Dhrystone 1.1 results with SPECint89 results. I had
- >>20 or so data points to work with. I thought perhaps there would be
- >>little correlation and that Dhrystone really needed to be cast out
- >>as a measure of performance, because of the greater confidence placed
- >>in the SPEC results. Instead, they were highly correlated (0.92
- >>correlation or so). Well, I had to revise my thinking.
-
- >This kind of bogus correlation was debunked long ago by SPEC. As soon as
- >more data points are added, the correlation gets worse. Try adding the
- >SparcStation 10 numbers to your test, for example.
-
- I didn't know SPEC had done that. Wish I had been informed of the
- results. I'm not surprised, but I'm curious now to see what they did.
- Actually the results (20 different systems I think) were fairly
- representative of various systems available, so I'm curious to see in
- what manner the correlation broke down.
-
- One of the problems with 'benchmarking' is the lack of good,
- well-documented databases to work from.
-
- >For that matter, just look at the HP 9000/710 versus the HP 9000/720.
- >The only differences in the hardware are the larger caches on the 720.
- >Since the 710 has large enough caches for the Dhrystone code, but not
- >for some SPECint codes, it produces the same Dhrystones as the 720 but
- >significantly lower SPECint.
-
- The issue here is cache size. We know that cache size is an important
- factor in performance relative to a program's size (or cache utilization
- size). Dhrystone is a small program and hence produces similar results,
- as you say, in small caches as in big caches. Dhrystone is not adequate
- to gain an understanding of performance trade-offs relative to cache
- size. Other programs of varying size are needed to understand the
- 'spread' in performance due to cache size relative to program size. We
- need to understand the limitations of our test programs and use them
- appropriately. It is far (far) from the mark to think that Dhrystone
- is the only test program one should use.
-
- SPECint has problems here as well, because there are plenty of 'small'
- programs available that fit in the HP 9000/710 cache and perform just
- as well there as on the HP 9000/720. Yet, as you indicated, the
- SPECint results do not reflect this fact. There are reasons HP built the
- HP 710 and 720. Lower cost might be one of them (I don't know really).
- Perhaps also HP felt that there was a segment of users who would be just
- as happy with the smaller cache in the 710. They really didn't need a
- larger cache. They would take a hit on performance sometimes with their
- larger programs (SPECint type result), but in general the smaller cache
- machine was adequate for their purposes (Dhrystone type result).
-
- >Similar poor correlations will be obtained for two different systems
- >with very different cache sizes. Compare the HP9000/720 to a smaller
- >cache machine like an IBM RS/6000 or Sun SS2. For example, here are some
- >Spring, 1991, numbers:
- >
- > SPECint89 Dhrystone 1.1 MIPS MIPS/SPECint89
- > --------- ------------------ --------------
- >HP 9000/720 39.0 57 1.46
- >DEC 5000/200 19.0 24.2 1.27
- >IBM RS6000/550 34.5 56 1.62
- >
- >If I didn't have SPECint89 numbers, but wanted to derive them from
- >available Dhrystone MIPS numbers, the third column above would indicate
- >that I have a tough job ahead of me.
-
-
- But they ARE correlated! You can see it just by looking at the
- SPECint89 and Dhrystone1.1 numbers. It is incorrect to use the third
- column (above) to make any predictions or draw conclusions, as it
- consists of ratios of the raw data (program, 'benchmark', results).
- I'll explain below.
-
- I sorted the numbers in decreasing order and I added in the nsieve MIPS
- results (see the table below). Forget about the individual magnitudes
- because the scaling in each program is different. But look at the
- numbers. They track one another. The step size from one result to the
- next is different but overall the results are tracking fairly well. The
- HP 720 ranks highest for all three program results. The DEC 200 ranks
- lowest for all three program results. The IBM 550 ranks second in all
- three program results. They are all telling the same story and they are
- correlated. To check this qualitative correlation I also calculated the
- mathematical linear correlation coefficient and the result shows that
- they are all highly correlated. Correlation coefficients: SPECint89 to
- Dhrystone1.1 = 0.982, SPECint89 to nsieve = 0.999, and Dhrystone1.1 to
- nsieve = 0.988.
-
- SPECint89 Dhrystone 1.1 MIPS nsieve MIPS
- --------- ------------------ --------------
- HP 9000/720 39.0 57 50.2
- IBM RS6000/550 34.5 56 43.8
- DEC 5000/200 19.0 24.2 17.0
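The correlations quoted above can be reproduced directly from the three-row table. Here is a minimal Python sketch (mine, not part of the original post; the data are just the table values), using the standard product-moment formula:

```python
import math

# Benchmark results from the table above (HP 9000/720, IBM RS6000/550, DEC 5000/200)
specint89 = [39.0, 34.5, 19.0]
dhrystone = [57.0, 56.0, 24.2]
nsieve    = [50.2, 43.8, 17.0]

def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient between two samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(round(pearson(specint89, dhrystone), 3))  # 0.982
print(round(pearson(specint89, nsieve), 3))     # 0.999
print(round(pearson(dhrystone, nsieve), 3))     # 0.988
```

With only three points per pair this is a sanity check on the arithmetic, not a statistical argument on its own.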
-
- The details though are different. They are different because there is
- error in all those measurements. The compilers are not the same. The
- compiler options are not the same. The programs and what they do are
- all different. Cache size is a factor too. The SPECint89 results are
- a geometric mean of 4 programs while the Dhrystone and nsieve results
- each come from a single program (and thus are more susceptible to
- error). In view of these
- errors, it is amazing to me that the results are correlated at all!
- But they are most definitely well correlated.
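For reference, a geometric mean of N results is the N-th root of their product. A minimal Python sketch (the four ratios below are made-up illustration values, not actual SPEC data):

```python
import math

def geometric_mean(values):
    """N-th root of the product of N positive values.

    Computed via logs, which avoids overflow for long lists of large values.
    """
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-program results for one machine:
ratios = [30.0, 45.0, 38.0, 42.0]
print(round(geometric_mean(ratios), 1))  # 38.3
```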
-
- Just because they are highly correlated doesn't mean you can pick numbers
- out of the raw program results above and start making comparisons or
- predictions. It just won't work because there are unaccounted errors
- in each number and between the different programs. Even worse is to
- take ratios like the MIPS/SPECint89. If there is error in the MIPS
- result and error in the SPECint89 results then the fractional error
- after the division is even worse than the fractional error in the
- original numbers. For example, (40 +/- 6) / (20 +/- 3) = 2 +/- 0.6
- (approximately, taking the worst case). The fractional error in the
- original numbers is 0.15 (15%) but it has doubled to 0.30 (30%) after
- the division. So
- you see that the ratio is an even less reliable number to use for
- comparison or prediction purposes, and particularly so because you were
- using the raw data (program or 'benchmark' results) of which you don't
- even know the error bounds. If you did have the error bounds for the
- ratios then you might have realized that you really could draw no
- conclusion at all, and this is another reason why I think we need to
- start understanding the errors in our measurements. It will help us
- avoid drawing incorrect conclusions.
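The error growth in a ratio described above can be sketched as follows. This is a worst-case propagation (the fractional errors of the operands simply added), matching the (40 +/- 6) / (20 +/- 3) example; it is an illustration of the arithmetic, not SPEC methodology:

```python
def ratio_with_error(a, da, b, db):
    """Quotient a/b with a worst-case absolute error estimate.

    For small fractional errors, the fractional error of a quotient is,
    at worst, the sum of the fractional errors of the operands:
        d(a/b) / (a/b) ~= da/a + db/b
    """
    q = a / b
    frac = da / a + db / b
    return q, q * frac

q, dq = ratio_with_error(40.0, 6.0, 20.0, 3.0)
print(q, dq)  # 2.0 0.6
```

If the two errors were known to be independent, adding them in quadrature would give a smaller (~21%) figure; the point stands either way: the ratio is noisier than its inputs.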
-
- Taking the ratio, MIPS/SPECint89, destroyed the correlation and led you
- to draw an erroneous conclusion about your 6 data samples. I have
- noticed others using this type of ratio as well, but it is simply not
- correct to do so.
-
- The correct procedure is to take the data samples (benchmark results,
- which have random errors) and do a correlation. The linear correlation
- between nsieve MIPS and SPECint89 was quite strong at 0.999, so we'll
- go with that. Now we can do a linear least-squares fit to derive a
- linear relationship between the nsieve MIPS and SPECint89 samples we
- had to work with. We find the following:
-
- SPECint89 = 8.806 + 0.595 * nsieveMIPS.
-
-
- SPECint89 Predicted SPECint89 Error
- from nsieve Measured
- HP 9000/720 38.7 39 -0.3
- IBM RS6000/550 34.9 34.5 +0.4
- DEC 5000/200 18.9 19 -0.1
-
- Pretty interesting. Also note that I used the best (peak) values for
- the nsieve numbers. This seemed reasonable since people tend to report
- peak values for benchmark results anyway.
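The fit and the prediction table above can be reproduced with a short Python sketch (mine, not from the original post; the data are the nsieve and measured SPECint89 columns):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit: ys ~= intercept + slope * xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# nsieve MIPS and measured SPECint89 (HP 720, IBM 550, DEC 200)
nsieve    = [50.2, 43.8, 17.0]
specint89 = [39.0, 34.5, 19.0]

b, m = linear_fit(nsieve, specint89)
print(round(b, 3), round(m, 3))  # 8.806 0.595

for x in nsieve:
    print(round(b + m * x, 1))  # predicted SPECint89: 38.7, 34.9, 18.9
```

Swapping in the Dhrystone column for nsieve reproduces the second fit the same way.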
-
- We can do the same thing for the Dhrystone and SPECint89 numbers:
-
- SPECint89 = 5.571 + 0.5524 * Dhrystone1.1MIPS.
-
-
- SPECint89 Predicted SPECint89 Error
- from Dhrystone 1.1 Measured
- HP 9000/720 37.1 39 -1.9
- IBM RS6000/550 36.5 34.5 +2.0
- DEC 5000/200 18.9 19 -0.1
-
- Not as good as nsieve, but still not bad as the error is less than 6%.
-
- Please note that the correlations and relationships established above
- are really _only_ valid for the 9 data samples we had to work with. It
- would be erroneous to take any other results and throw them into the
- equations and think those results were correct. They probably won't be.
- We have not done enough work for that. Besides, it was already
- indicated that the correlation breaks down as the sample size
- increases.
-
- My main concern is that we do things correctly. I think that we really
- need to start understanding the errors in our measurements (benchmark
- results). Until we do I think we are just going to keep making lots of
- mistakes, and blatant errors, with those measurements. We are really
- on shaky ground when we compare benchmark results and have no idea
- of the magnitude of the error in those measurements.
-
- Al Aburto
- aburto@marlin.nosc.mil
-
- -------
-