NetNews Usenet Archive 1992 #19

Internet Message Format | 1992-08-30 | 5.1 KB

Path: sparky!uunet!convex!darwin.sura.net!sgiblab!swrinde!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!porpoise!marlin!aburto
From: aburto@nosc.mil (Alfred A. Aburto)
Newsgroups: comp.benchmarks
Subject: Re: Geometric Mean or Median
Message-ID: <1992Aug30.022440.1857@nosc.mil>
Date: 30 Aug 92 02:24:40 GMT
References: <1992Aug20.160352.13856@nas.nasa.gov> <1992Aug23.114309.3643@nosc.mil> <1992Aug26.160240.20114@murdoch.acc.Virginia.EDU>
Distribution: comp.benchmarks
Organization: Naval Ocean Systems Center, San Diego
Lines: 84

In article <1992Aug26.160240.20114@murdoch.acc.Virginia.EDU> clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
>In article <1992Aug23.114309.3643@nosc.mil> aburto@nosc.mil (Alfred A. Aburto) writes:
>>However, even the SPEC results will vary, as will any measure of
>>performance, when the underlying test parameters change. See SunWorld
>>magazine, Mar 1992, pg 48, "SPEC: Odyssey of a benchmark", where
>>geometric mean SPECmark ratings of 17.8, 20.0, 20.8, and 25.0 were
>>measured on the same machine using different compilers.
>
>This is given as an example of "noisiness" or "error proneness" in SPEC,
>but SPEC was intended to measure the SYSTEM, including the compilers. If
>one vendor has better compilers than another, it matters to the end users.
>Conversely, trying to benchmark the raw hardware in some way that filters
>out compiler differences would not be interesting to people who have to
>purchase the systems and use the compilers, not just the hardware.

It was given as an example of the need to show (indicate) the 'spread'
in the results. There is not just ONE result; there are numerous
results, depending upon many parameters that are difficult to control.
The vendor's system may get one result (25.0), but the user's system may
get an entirely different result.
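
As a rough illustration of what showing the 'spread' could look like, the
sketch below summarizes the four SPECmark ratings quoted above from the
SunWorld article. The function name `summarize` and the choice of
statistics are assumptions made for this example only; they are not a
SPEC or SunWorld procedure.

```python
import math
import statistics

# The four SPECmark ratings quoted from SunWorld (same machine,
# different compilers).
ratings = [17.8, 20.0, 20.8, 25.0]


def summarize(values):
    """Return centre and spread statistics for a list of positive ratings.

    Hypothetical helper for illustration; requires at least two values.
    """
    # Geometric mean computed via logs, so every value must be > 0.
    geo_mean = math.exp(sum(math.log(v) for v in values) / len(values))
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "geometric_mean": geo_mean,
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = summarize(ratings)
for name, value in summary.items():
    print(f"{name:>15}: {value:.2f}")
```

Reporting the minimum and maximum next to a central value keeps the
25.0 and 17.8 results visible instead of folding them into one number.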
In the SunWorld article, even after a lot of trouble, they were unable
to duplicate the vendor's result (25.0). They finally settled on 20.8 as
the best they could do and left it at that. I'm saying that unnecessary
trouble might have been avoided if the vendor had instead said something
like: "this system has a rating of 21.0 +/- 4.0, and you'll achieve peak
performance of approximately 25.0 by use of this compiler with these
options, and 17.0 with this other compiler with these options." Or some
simple statement such as that, using perhaps more appropriately the
Maximum and Minimum results instead of the standard deviation or RMS
error. It would have avoided unnecessary problems and been more
informative overall.

I don't want to hide any information at all. I'm trying to say that we
need to bring out more information in the hope that it will avoid the
type of problems discussed in the SunWorld article. I'm not claiming to
know exactly how to rectify this situation. I'm just saying there
appears to be a need to do it.

Of course the 'spread' (or 'error' if you will) in performance due to
different compilers and compiler options is only one aspect of the
problem. Different types of programs produce different results and there
is a spread in performance there too. Program size and memory usage,
main memory speed, cache type, cache size, etc. all produce a spread in
performance. The overall spread is considerable, and this is why system
testing is so difficult.

With regard to 'filtering', I was thinking of the need to 'filter' the
extreme data points. One learns about these extreme values by having
other similar _program_ results for comparison. This 'filtering' aspect
was not intended so much for a particular compiler result, but for a
particular program result. If program 'A' produces an order of magnitude
'better' result than 9 other _similar_ programs for a particular system
(compiler included) then I feel a need to do something about program 'A'.
At least I'd become very interested in program 'A' and try to figure out
why it produces results so different from the other programs. Filtering
is an option when a few of the system results for program 'A' show
extreme outliers compared to other program results on the same system
(compiler included as part of the system). It's just an option, as one
might not want to throw all the program 'A' results out due to a few
'abnormal' results (due to extreme optimization, for example, on 1 out
of 'M' programs and on a few out of 'N' systems). It's just an option,
and there are certainly many cases where one would not want to filter at
all. It depends on the data.

>The only real question in my mind about SPEC's approach is allowing
>vendors to use different compiler switch settings for each individual
>benchmark, if it produces better numbers. I don't think many users can
>compile every program that many different times and run timing tests on
>each one.

This is a good point. On the other hand one cannot fault vendors for
trying to achieve the optimum performance in each individual case, but
unfortunately, as you say, this makes it tough on the users trying to
figure out what are the best options to use for their own particular
programs.

[more to follow]

Al Aburto
aburto@marlin.nosc.mil

-------
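
The 'filtering' idea discussed in the post (one program an order of
magnitude away from 9 similar programs on the same system) might be
sketched as below. This is only one possible reading: the function name
`flag_outliers`, the use of the median of the other programs as the
reference, and the factor of 10 are assumptions, and the ratings in
`results` are invented placeholders, not measurements.

```python
import statistics


def flag_outliers(results, factor=10.0):
    """Return the programs whose result is at least `factor` times larger
    or smaller than the median of the *other* programs on the same system.

    Hypothetical helper; expects at least two programs, all results > 0.
    """
    flagged = []
    for name, value in results.items():
        # Compare each program against its peers only, so an extreme
        # value does not shift its own reference point.
        others = [v for n, v in results.items() if n != name]
        reference = statistics.median(others)
        ratio = value / reference
        if ratio >= factor or ratio <= 1.0 / factor:
            flagged.append(name)
    return flagged


# Invented placeholder ratings for ten similar programs on one system
# (compiler included); NOT real benchmark data.
results = {
    "A": 52.0, "B": 4.1, "C": 3.8, "D": 4.5, "E": 5.0,
    "F": 4.2, "G": 3.9, "H": 4.8, "I": 4.4, "J": 4.0,
}

print(flag_outliers(results))
```

Flagging only marks program 'A' for a closer look; whether to drop its
results is still a separate decision that depends on the data.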