- Path: sparky!uunet!sun-barr!ames!uakari.primate.wisc.edu!caen!hellgate.utah.edu!dog.ee.lbl.gov!porpoise!marlin!aburto
- From: aburto@nosc.mil (Alfred A. Aburto)
- Newsgroups: comp.benchmarks
- Subject: Re: Geometric Mean or Median
- Message-ID: <1992Aug14.020025.4555@nosc.mil>
- Date: 14 Aug 92 02:00:25 GMT
- References: <1992Aug9.201016.10008@nosc.mil> <12878@inews.intel.com>
- Distribution: comp.benchmarks
- Organization: Naval Ocean Systems Center, San Diego
- Lines: 97
-
- In article <12878@inews.intel.com> jwreilly@mipos2.UUCP (Jeffrey Reilly) writes:
-
- >Suppose the optimization to program 5 in case B was a legitimate
- >optimization that applied across a class of similar applications?
- >As a vendor, end-user, etc, I would think you would like to see this
- >reflected in whatever composite measure is used. I would argue that this
- >method appears too insensitive.
-
- This is a good question too. The thing is, we want the overall measure
- of performance to be representative of the bulk of the data. We don't
- want the overall measure of performance to be biased by one or a few
- extreme outlying data points, because they are not representative of
- the bulk of the data; they are low-probability events (correct or
- incorrect). You know all this, but that's the reason. I believe the
- Median does this, but it looks like I have some explaining to do
- about the definition of the Median.
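A minimal numeric sketch of that robustness argument (the ratios below are made up for illustration, not SPEC data): one extreme result for "program 5" drags both the arithmetic and geometric means upward, while the median barely moves.

```python
import math
import statistics

# Hypothetical performance ratios for five programs; in "case B",
# program 5 gets an extreme outlying result. Values are illustrative.
case_a = [1.0, 1.1, 0.9, 1.0, 1.2]
case_b = [1.0, 1.1, 0.9, 1.0, 12.0]   # one outlier

def geometric_mean(xs):
    """Geometric mean via the log-average, as used for SPECratios."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# The median is unchanged; both means are pulled up by the outlier.
print(statistics.mean(case_a), statistics.mean(case_b))
print(geometric_mean(case_a), geometric_mean(case_b))
print(statistics.median(case_a), statistics.median(case_b))
```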
-
- The next thing we want to do is define some parameters that describe
- (indicate, or give us an idea) what is happening at the tails of the
- distribution (cumulative distribution) because, as you indicate, this
- is where the really interesting things are happening. I would propose
- using in addition to the Median the Minimum and Maximum values, or
- the 10% and 90% points of the cumulative distribution for example.
- Three numbers is about 'right', I think. If we give a lot of different
- measures of performance, then people will pick from that set the one
- or few that best fit their case, not everyone will pick the same
- measures of performance, and things can get confused. We don't want
- to prevent people from learning more, though, so in addition we
- provide (make available) all the raw data so people can go analyze
- it every which way, and this is 'good', as some interesting things
- are likely to be learned.
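The proposed three-number summary (10% point, median, 90% point of the cumulative distribution) can be sketched as follows; the sample values and the interpolation rule are illustrative assumptions, not a prescribed method.

```python
def percentile(xs, p):
    """p-th percentile (0..100) by linear interpolation on sorted data."""
    s = sorted(xs)
    k = (len(s) - 1) * p / 100.0
    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

# Made-up performance ratios standing in for a validated data set.
samples = [0.8, 0.9, 0.95, 1.0, 1.0, 1.05, 1.1, 1.2, 1.4, 2.5]

# Three numbers: the tails plus the center of the distribution.
summary = (percentile(samples, 10),
           percentile(samples, 50),
           percentile(samples, 90))
print(summary)
```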
-
- To arrive at the above in a meaningful way we'll need 'good' validated
- data sets consisting of a sufficient number of samples (program results,
- routine results, kernel results, module results, ..., whatever
- constitutes a data sample). We want to pick enough samples so at least
- the cumulative distribution is relatively smooth (no big jumps in the
- distribution). I don't know how many samples are required, but I
- think that 10 isn't enough. Maybe 25 or 50 might be enough, but we'd
- need to figure that out, and it can be done. If the fluctuation is
- too big we might consider smoothing (filtering) the data. There are
- many possibilities for handling these problems (a simple binomial
- filter, or a Savitzky-Golay filter, or ...), but we need a sufficient
- number of validated samples first.
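The simple binomial filter mentioned above can be sketched as a 1-2-1 weighted average over neighboring samples; the endpoint handling and the noisy test sequence here are my own illustrative choices.

```python
def binomial_smooth(xs):
    """One pass of a 1-2-1 binomial smoothing filter.

    Interior points become (x[i-1] + 2*x[i] + x[i+1]) / 4;
    the two endpoints are left unsmoothed.
    """
    if len(xs) < 3:
        return list(xs)
    out = [xs[0]]
    for i in range(1, len(xs) - 1):
        out.append((xs[i - 1] + 2 * xs[i] + xs[i + 1]) / 4.0)
    out.append(xs[-1])
    return out

# An artificially noisy sequence: the filter damps the oscillation.
noisy = [1.0, 3.0, 1.0, 3.0, 1.0]
print(binomial_smooth(noisy))
```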
-
- >What definition of median are you using?
-
- Here is where we've had a problem (as it finally dawned on me). The
- median is defined as the 50% point of the cumulative distribution of
- samples. This is the definition I have been using, and it is the
- preferred way of deriving the Median from a set of samples. With
- small data samples it is subject to error as are all the other
- parameters. If the distribution of samples is not highly skewed you
- will find that the Median and Arithmetic Mean are very close or
- identical. See Papoulis for example: "Probability, Random Variables,
- and Stochastic Processes".
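The point about skewness can be checked numerically; the two small samples below are made up to illustrate it. For the roughly symmetric sample the mean and median nearly coincide, while for the skewed sample the mean is pulled well above the median.

```python
import statistics

# Illustrative samples: one roughly symmetric, one right-skewed.
symmetric = [0.8, 0.9, 1.0, 1.0, 1.0, 1.1, 1.2]
skewed    = [0.9, 1.0, 1.0, 1.1, 1.2, 2.0, 6.0]

# Symmetric case: mean and median (50% point) nearly coincide.
print(statistics.mean(symmetric), statistics.median(symmetric))
# Skewed case: the long right tail drags the mean above the median.
print(statistics.mean(skewed), statistics.median(skewed))
```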
-
- >What do you mean by "reliable"?
-
- It is a measure of performance that is more representative of the
- main bulk of the data and is not easily biased by one or a few
- low-probability-of-occurrence events (data samples). The Mode might
- be even more useful, and it can be inferred from the Arithmetic Mean
- and Median. Perhaps all these problems would vanish if the sample
- size were increased so that we'd get a better picture of the true
- distribution.
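One classical rule of thumb for inferring the Mode from the Arithmetic Mean and Median (an assumption on my part about what is meant here) is Pearson's empirical relation for moderately skewed distributions, mode = 3*median - 2*mean. It is an approximation, not an exact identity.

```python
def approximate_mode(mean, median):
    """Pearson's empirical relation for moderately skewed distributions:
    mode is roughly 3*median - 2*mean. A rule of thumb, not exact."""
    return 3 * median - 2 * mean

# Illustrative values: a mildly right-skewed sample (mean > median)
# gives an inferred mode below both.
print(approximate_mode(1.2, 1.1))
# For a symmetric sample, mean == median == mode.
print(approximate_mode(1.0, 1.0))
```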
-
- >Regarding current standards, the SPEC metrics (SPECint92, SPECfp92, etc)
- >are DEFINED as being the geometric mean of the appropriate SPEC suites
- >SPECratios.
-
- Yes, and I think the Median might turn out to be more 'reliable', but
- also we need more data samples to better estimate the true sample
- distribution. Also we might want to consider the confidence of our
- estimates, particularly since the goal of deriving measures of
- performance is to compare two or more of them.
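One way to attach confidence to the median estimate (a sketch of my own, using a simple percentile bootstrap on made-up sample values) is to resample the data and look at the spread of the resulting medians.

```python
import random
import statistics

def bootstrap_median_ci(samples, n_boot=2000, alpha=0.10, seed=1):
    """Percentile-bootstrap confidence interval for the median.

    Draws n_boot resamples (with replacement), takes the median of
    each, and returns the central (1 - alpha) interval of those medians.
    """
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo = medians[int(n_boot * alpha / 2)]
    hi = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Illustrative performance ratios standing in for real benchmark data.
samples = [0.9, 1.0, 1.0, 1.05, 1.1, 1.2, 1.3, 1.4, 1.5, 2.0]
lo, hi = bootstrap_median_ci(samples)
print(lo, statistics.median(samples), hi)
```

A wide interval here is a warning that two medians being compared may not be meaningfully different without more samples.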
-
- >What is it they say, "with statistics anything can be proved"? :-)
-
- Yes, numbers can tell lies and we must be careful.
- Statistics itself is really OK. The problems happen when we try to
- make claims from ill-formed data sets --- with samples that do not
- adequately describe the true distribution, one can certainly derive
- all sorts of wild and meaningless results (all lies).
-
- [Comments about references left out]
-
- >Jeff Reilly | "There is something fascinating about
- >Intel Corporation | science. One gets such wholesale returns
- >jwreilly@mipos2.intel.com | of conjecture out of such a trifling
- >(408) 765 - 5909 | investment of fact" - M. Twain
- >Disclaimer: All opinions are my own...
-
-
- Al Aburto
- aburto@marlin.nosc.mil
- -------
-