- Newsgroups: comp.benchmarks
- Path: sparky!uunet!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!porpoise!marlin!aburto
- From: aburto@nosc.mil (Alfred A. Aburto)
- Subject: Re: Geometric Mean or Median
- Message-ID: <1992Aug31.002356.24988@nosc.mil>
- Organization: Naval Ocean Systems Center, San Diego
- References: <1992Aug20.160352.13856@nas.nasa.gov> <1992Aug23.114309.3643@nosc.mil> <1992Aug26.160240.20114@murdoch.acc.Virginia.EDU>
- Distribution: comp.benchmarks
- Date: Mon, 31 Aug 1992 00:23:56 GMT
- Lines: 186
-
-
- In Article <1992Aug26.160240.20114@murdoch.acc.Virginia.EDU>
- clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
-
- In article <1992Aug23.114309.3643@nosc.mil>
- aburto@nosc.mil (Alfred A. Aburto) writes:
-
- >>I have heard a lot of Dhrystone 1.1 bashing in the past. I tried to
- >>understand just how 'bad' Dhrystone was several years ago (it seems)
- >>by correlating Dhrystone 1.1 results with SPECint89 results. I had
- >>20 or so data points to work with. I thought perhaps there would be
- >>little correlation and that Dhrystone really needed to be cast out
- >>as a measure of performance, because of the greater confidence placed
- >>in the SPEC results. Instead, they were highly correlated (0.92
- >>correlation or so). Well, I had to revise my thinking.
-
- >This kind of bogus correlation was debunked long ago by SPEC. As soon as
- >more data points are added, the correlation gets worse. Try adding the
- >SparcStation 10 numbers to your test, for example.
-
- I didn't know SPEC had done that. Wish I had been informed of the
- results. I'm not surprised, but I'm curious now to see what they did.
- Actually the results (20 different systems I think) were fairly
- representative of various systems available, so I'm curious to see in
- what manner the correlation broke down.
-
- One of the problems with 'benchmarking' is the lack of good,
- well-documented databases to work from.
-
- >For that matter, just look at the HP 9000/710 versus the HP 9000/720.
- >The only differences in the hardware are the larger caches on the 720.
- >Since the 710 has large enough caches for the Dhrystone code, but not
- >for some SPECint codes, it produces the same Dhrystones as the 720 but
- >significantly lower SPECint.
-
- The issue here is cache size. We know that cache size is an important
- factor in performance relative to a program's size (or cache utilization
- size). Dhrystone is a small program and hence produces similar results,
- as you say, in small caches as in big caches. Dhrystone is not adequate
- to gain an understanding of performance trade-offs relative to cache
- size. Other programs of varying size are needed to understand the
- 'spread' in performance due to cache size relative to program size. We
- need to understand the limitations of our test programs and use them
- appropriately. It is far (far) from the mark to think that Dhrystone
- is the only test program one should use.
-
- SPECint has problems here as well, because there are plenty of 'small'
- programs available that fit in the HP 9000/710 cache and perform just
- as well there as on the HP 9000/720. Yet, as you indicated, the
- SPECint results do not reflect this fact. There are reasons HP built the
- HP 710 and 720. Lower cost might be one of them (I don't know really).
- Perhaps also HP felt that there was a segment of users who would be just
- as happy with the smaller cache in the 710. They really didn't need a
- larger cache. They would take a hit on performance sometimes with their
- larger programs (SPECint type result), but in general the smaller cache
- machine was adequate for their purposes (Dhrystone type result).
-
- >Similar poor correlations will be obtained for two different systems
- >with very different cache sizes. Compare the HP9000/720 to a smaller
- >cache machine like an IBM RS/6000 or Sun SS2. For example, here are some
- >Spring, 1991, numbers:
- >
- > SPECint89 Dhrystone 1.1 MIPS MIPS/SPECint89
- > --------- ------------------ --------------
- >HP 9000/720 39.0 57 1.46
- >DEC 5000/200 19.0 24.2 1.27
- >IBM RS6000/550 34.5 56 1.62
- >
- >If I didn't have SPECint89 numbers, but wanted to derive them from
- >available Dhrystone MIPS numbers, the third column above would indicate
- >that I have a tough job ahead of me.
-
-
- But they ARE correlated! You can see it just by looking at the
- SPECint89 and Dhrystone1.1 numbers. It is incorrect to use the third
- column (above) to make any predictions or draw conclusions, as it
- consists of ratios of the raw data (program, 'benchmark', results).
- I'll explain below.
-
- I sorted the numbers in decreasing order and I added in the nsieve MIPS
- results (see the table below). Forget about the individual magnitudes
- because the scaling in each program is different. But look at the
- numbers. They track one another. The step size from one result to the
- next is different but overall the results are tracking fairly well. The
- HP 720 ranks highest for all three program results. The DEC 200 ranks
- lowest for all three program results. The IBM 550 ranks second in all
- three program results. They are all telling the same story and they are
- correlated. To check this qualitative correlation I also calculated the
- mathematical linear correlation coefficient and the result shows that
- they are all highly correlated. Correlation coefficients: SPECint89 to
- Dhrystone1.1 = 0.982, SPECint89 to nsieve = 0.999, and Dhrystone1.1 to
- nsieve = 0.988.
-
- SPECint89 Dhrystone 1.1 MIPS nsieve MIPS
- --------- ------------------ --------------
- HP 9000/720 39.0 57 50.2
- IBM RS6000/550 34.5 56 43.8
- DEC 5000/200 19.0 24.2 17.0
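The correlations quoted above can be reproduced directly from the three-row table. Here is a minimal Python sketch (mine, not part of the original post; the data are just the table values), using the standard product-moment formula:

```python
import math

# Benchmark results from the table above (HP 9000/720, IBM RS6000/550, DEC 5000/200)
specint89 = [39.0, 34.5, 19.0]
dhrystone = [57.0, 56.0, 24.2]
nsieve    = [50.2, 43.8, 17.0]

def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient between two samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(round(pearson(specint89, dhrystone), 3))  # 0.982
print(round(pearson(specint89, nsieve), 3))     # 0.999
print(round(pearson(dhrystone, nsieve), 3))     # 0.988
```

With only three points per pair this is a sanity check on the arithmetic, not a statistical argument on its own.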
-
- The details though are different. They are different because there is
- error in all those measurements. The compilers are not the same. The
- compiler options are not the same. The programs and what they do are
- all different. Cache size is a factor too. The SPECint89 results are
- a geometric mean of 4 programs while the Dhrystone and nsieve results
- each come from a single program (and thus are more susceptible to
- error). In view of these
- errors, it is amazing to me that the results are correlated at all!
- But they are most definitely well correlated.
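For reference, a geometric mean of N results is the N-th root of their product. A minimal Python sketch (the four ratios below are made-up illustration values, not actual SPEC data):

```python
import math

def geometric_mean(values):
    """N-th root of the product of N positive values.

    Computed via logs, which avoids overflow for long lists of large values.
    """
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-program results for one machine:
ratios = [30.0, 45.0, 38.0, 42.0]
print(round(geometric_mean(ratios), 1))  # 38.3
```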
-
- Just because they are highly correlated doesn't mean you can pick numbers
- out of the raw program results above and start making comparisons or
- predictions. It just won't work because there are unaccounted errors
- in each number and between the different programs. Even worse is to
- take ratios like the MIPS/SPECint89. If there is error in the MIPS
- result and error in the SPECint89 results then the fractional error
- after the division is even worse than the fractional error in the
- original numbers. For example, (40 +/- 6) / (20 +/- 3) = 2 +/- 0.6
- (approximately, taking the worst case). The fractional error in the
- original numbers is 0.15 (15%) but it has doubled to 0.30 (30%) after
- the division. So
- you see that the ratio is an even less reliable number to use for
- comparison or prediction purposes, and particularly so because you were
- using the raw data (program or 'benchmark' results) of which you don't
- even know the error bounds. If you did have the error bounds for the
- ratios then you might have realized that you really could draw no
- conclusion at all, and this is another reason why I think we need to
- start understanding the errors in our measurements. It will help us
- avoid drawing incorrect conclusions.
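The error growth in a ratio described above can be sketched as follows. This is a worst-case propagation (the fractional errors of the operands simply added), matching the (40 +/- 6) / (20 +/- 3) example; it is an illustration of the arithmetic, not SPEC methodology:

```python
def ratio_with_error(a, da, b, db):
    """Quotient a/b with a worst-case absolute error estimate.

    For small fractional errors, the fractional error of a quotient is,
    at worst, the sum of the fractional errors of the operands:
        d(a/b) / (a/b) ~= da/a + db/b
    """
    q = a / b
    frac = da / a + db / b
    return q, q * frac

q, dq = ratio_with_error(40.0, 6.0, 20.0, 3.0)
print(q, dq)  # 2.0 0.6
```

If the two errors were known to be independent, adding them in quadrature would give a smaller (~21%) figure; the point stands either way: the ratio is noisier than its inputs.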
-
- Taking the ratio, MIPS/SPECint89, destroyed the correlation and led you
- to draw an erroneous conclusion about your 6 data samples. I have
- noticed others using this type of ratio as well, but it is simply not
- correct to do so.
-
- The correct procedure is to take the data samples (benchmark results,
- which have random errors) and do a correlation. The linear correlation
- between nsieve MIPS and SPECint89 was quite strong at 0.999, so we'll
- go with that. Now we can do a linear least-squares fit to derive a
- linear relationship between the nsieve MIPS and SPECint89 samples we
- had to work with. We find the following:
-
- SPECint89 = 8.806 + 0.595 * nsieveMIPS.
-
-
- SPECint89 Predicted SPECint89 Error
- from nsieve Measured
- HP 9000/720 38.7 39 -0.3
- IBM RS6000/550 34.9 34.5 +0.4
- DEC 5000/200 18.9 19 -0.1
-
- Pretty interesting. Also note that I used the best (peak) values for
- the nsieve numbers. This seemed reasonable since people tend to report
- peak values for benchmark results anyway.
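The fit and the prediction table above can be reproduced with a short Python sketch (mine, not from the original post; the data are the nsieve and measured SPECint89 columns):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit: ys ~= intercept + slope * xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# nsieve MIPS and measured SPECint89 (HP 720, IBM 550, DEC 200)
nsieve    = [50.2, 43.8, 17.0]
specint89 = [39.0, 34.5, 19.0]

b, m = linear_fit(nsieve, specint89)
print(round(b, 3), round(m, 3))  # 8.806 0.595

for x in nsieve:
    print(round(b + m * x, 1))  # predicted SPECint89: 38.7, 34.9, 18.9
```

Swapping in the Dhrystone column for nsieve reproduces the second fit the same way.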
-
- We can do the same thing for the Dhrystone and SPECint89 numbers:
-
- SPECint89 = 5.571 + 0.5524 * Dhrystone1.1MIPS.
-
-
- SPECint89 Predicted SPECint89 Error
- from Dhrystone 1.1 Measured
- HP 9000/720 37.1 39 -1.9
- IBM RS6000/550 36.5 34.5 +2.0
- DEC 5000/200 18.9 19 -0.1
-
- Not as good as nsieve, but still not bad as the error is less than 6%.
-
- Please note that the correlations and relationships established above
- are really _only_ valid for the 9 data samples we had to work with. It
- would be erroneous to take any other results and throw them into the
- equations and think those results were correct. They probably won't be.
- We have not done enough work for that. Besides, it was already
- indicated that the correlation breaks down as the sample size
- increases.
-
- My main concern is that we do things correctly. I think that we really
- need to start understanding the errors in our measurements (benchmark
- results). Until we do I think we are just going to keep making lots of
- mistakes, and blatant errors, with those measurements. We are really
- on shaky ground when we compare benchmark results and have no idea
- of the magnitude of the error in those measurements.
-
- Al Aburto
- aburto@marlin.nosc.mil
-
- -------
-