NetNews Usenet Archive 1992 #26

Text File | 1992-11-09 | 4.4 KB | 91 lines

Newsgroups: sci.math.stat
Path: sparky!uunet!caen!spool.mu.edu!umn.edu!thompson
From: thompson@atlas.socsci.umn.edu (T. Scott Thompson)
Subject: Re: Help With Statistics on "Compressed Data"
Message-ID: <thompson.721354344@daphne.socsci.umn.edu>
Keywords: statistics, compressed data,
Sender: news@news2.cis.umn.edu (Usenet News Administration)
Nntp-Posting-Host: daphne.socsci.umn.edu
Reply-To: thompson@atlas.socsci.umn.edu
Organization: Economics Department, University of Minnesota
References: <5NOV199209582623@b56vxg.kodak.com>
Date: Tue, 10 Nov 1992 00:12:24 GMT
Lines: 76

ekdug@b56vxg.kodak.com (Linda Stustman) writes:

>I'm looking for a way to estimate an upper bound on the standard deviation
>of a stream of plant data. The complication is that the data is coming from
>a process data base that has a type of compression applied to it.

>Simply put, data from the process is generated every 30 minutes (an analysis
>by a gas chromatograph). The process data base compares the new value with
>the previous one and only records the new value (with an associated time
>stamp) if the absolute value of the difference between the two readings is
>greater than a fixed threshold.

>In practice, this means 3 to 6 values a day are recorded for the variable,
>out of the 48 analyses that are actually done. My problem is to come up
>with a way of providing a reasonable estimate for the standard deviation
>of the analysis values that uses the information present in the recorded
>values, but also includes the information that the other 42 to 45 analyses
>varied less than the threshold value.

>My only idea on how to attack the problem (to date) is to assume that the
>range of +/- the threshold value corresponds to +/- 3 sigma of a normally
>distributed variable. Then, I could generate the appropriate number of
>"missing" values from a random normal distribution, add the recorded
>values and calculate a standard deviation from the resulting augmented
>"data" set.
>Perhaps arguments could be made for treating the +/- range
>of the threshold value as +/- 2 or +/- 4 sigma (and does anyone have any
>comments?).

This does not sound like a very good idea to me. Here are the random
thoughts on which this opinion is based:

It is not clear to me exactly what model you have in mind here. Are
you proposing to assume that the mean of the process is unchanged from
analysis to analysis? What about the variance? Is that also to be
assumed not changing over time? Is there any autocorrelation in
either the basic values that are being measured or the measurement
errors? Is your initial value sampled unconditionally, or are you
extracting a subsequence from a measurement process that has been
running for some time? No good statistical answer to your question
can be obtained without (at least implicitly) answering these questions.

Another important question: Is the threshold value known? If so then
it certainly would not be appropriate to fix it at +/- K standard
deviations for any value of K, since this is equivalent to saying that
you already know the standard deviation!

Suppose that you assume that the analysis at time t produced a value
x(t), that the x(t) values are independent normals with mean M and
variance V, that x(t) is observed if and only if |x(t) - x(t-1)| > E,
and that the aim of the analysis is to estimate M and V given
knowledge of E and a sample (t1,x(t1)), ..., (tn,x(tn)) of n data
points corresponding to the points at which the threshold was met.

Then it seems that you have a fairly standard, if nonlinear, maximum
likelihood problem. I haven't checked, but I suspect that all
parameters are identified. In fact, I think that this remains true
even when the threshold E is also treated as a parameter.

Your main difficulty in this scenario would be to calculate the
likelihood, since the censoring introduces a rather messy dependency
among the observations.
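
The model just described can be simulated to see how the recording
rule thins the data. What follows is only a minimal sketch under the
stated assumptions (independent normal readings, known threshold,
comparison against the previous reading whether or not it was
recorded). The function name simulate_recorded and the parameter
values are hypothetical, chosen for illustration.

```python
import random
import statistics

def simulate_recorded(n, mean, sd, threshold, seed=0):
    """Draw n readings x(t) ~ iid N(mean, sd**2) and return the (t, x(t))
    pairs that pass the rule |x(t) - x(t-1)| > threshold."""
    rng = random.Random(seed)
    prev = rng.gauss(mean, sd)   # x(0): starting value, assumed not recorded
    recorded = []
    for t in range(1, n + 1):
        x = rng.gauss(mean, sd)
        if abs(x - prev) > threshold:
            recorded.append((t, x))
        prev = x                 # compare with the previous reading, recorded or not
    return recorded

if __name__ == "__main__":
    # One year of half-hourly analyses with illustrative (made-up) parameters.
    n = 48 * 365
    rec = simulate_recorded(n=n, mean=10.0, sd=1.0, threshold=2.0)
    values = [x for _, x in rec]
    print("fraction recorded:", len(rec) / n)
    print("naive SD of recorded values:", statistics.stdev(values))
```

Under these assumptions the naive standard deviation of the recorded
values tends to come out above sd, since a value is kept only when it
jumps away from its predecessor. That is one reason the censoring has
to enter the likelihood.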

One simplification might be to first work conditionally on the
sequence of dummy variables for whether or not an observation was
taken. Clearly you will not be able to estimate any of the parameters
M, V or E from this data alone. However, I think that you can
estimate the ratio of E to Sqrt(V) (the standard deviation) from this
data alone.

--
T. Scott Thompson              email: thompson@atlas.socsci.umn.edu
Department of Economics        phone: (612) 625-0119
University of Minnesota        fax:   (612) 624-0209
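
As a rough illustration of the last point: under the independent
normal model above, x(t) - x(t-1) is N(0, 2V), so the chance that a
reading is recorded is p = 2 * (1 - Phi(E / Sqrt(2V))), which depends
only on the ratio of E to Sqrt(V). Inverting that gives a simple
moment estimate from the fraction of readings recorded. This is a
sketch and not the maximum likelihood estimate discussed in the post;
it ignores the dependence between successive indicators, and the
function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def ratio_from_record_fraction(n_recorded, n_total):
    """Moment estimate of E / sqrt(V) from the share of readings recorded.

    Assumes iid normal readings and the rule |x(t) - x(t-1)| > E, so that
    p = 2 * (1 - Phi(E / sqrt(2 V))).  Solving for the ratio gives
    E / sqrt(V) = sqrt(2) * Phi^{-1}(1 - p / 2).
    """
    if not 0 < n_recorded < n_total:
        raise ValueError("need 0 < n_recorded < n_total")
    p = n_recorded / n_total
    return sqrt(2.0) * NormalDist().inv_cdf(1.0 - p / 2.0)

if __name__ == "__main__":
    # Counts taken from the quoted question: 3 to 6 of 48 analyses recorded.
    for k in (3, 6):
        print(k, "of 48 recorded -> E/sd estimate:",
              ratio_from_record_fraction(k, 48))
```

With 3 to 6 recorded values out of 48, this gives a ratio of roughly
2.6 down to 2.2. If the assumptions held, the threshold would then sit
near 2.2 to 2.6 standard deviations, but any autocorrelation or drift
in the process would change that.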