- Newsgroups: sci.math.stat
- Path: sparky!uunet!caen!spool.mu.edu!umn.edu!thompson
- From: thompson@atlas.socsci.umn.edu (T. Scott Thompson)
- Subject: Re: Help With Statistics on "Compressed Data"
- Message-ID: <thompson.721354344@daphne.socsci.umn.edu>
- Keywords: statistics, compressed data,
- Sender: news@news2.cis.umn.edu (Usenet News Administration)
- Nntp-Posting-Host: daphne.socsci.umn.edu
- Reply-To: thompson@atlas.socsci.umn.edu
- Organization: Economics Department, University of Minnesota
- References: <5NOV199209582623@b56vxg.kodak.com>
- Date: Tue, 10 Nov 1992 00:12:24 GMT
- Lines: 76
-
- ekdug@b56vxg.kodak.com (Linda Stustman) writes:
-
- >I'm looking for a way to estimate an upper bound on the standard deviation
- >of a stream of plant data. The complication is that the data is coming from
- >a process data base that has a type of compression applied to it.
-
- >Simply put, data from the process is generated every 30 minutes (an analysis
- >by a gas chromatograph). The process data base compares the new value with
- >the previous one and only records the new value (with an associated time
- >stamp) if the absolute value of the difference between the two readings is
- >greater than a fixed threshold.
-
- >In practice, this means 3 to 6 values a day are recorded for the variable,
- >out of the 48 analyses that are actually done. My problem is to come up
- >with a way of providing a reasonable estimate for the standard deviation
- >of the analysis values that uses the information present in the recorded
- >values, but also includes the information that the other 42 to 45 analyses
- >varied less than the threshold value.
-
- >My only idea on how to attack the problem (to date) is to assume that the
- >range of +/- the threshold value corresponds to +/- 3 sigma of a normally
- >distributed variable. Then, I could generate the appropriate number of
- >"missing" values from a random normal distribution, add the recorded
- >values and calculate a standard deviation from the resulting augmented
- >"data" set. Perhaps arguments could be made for treating the +/- range
- >of the threshold value as +/- 2 or +/- 4 sigma (and does anyone have any
- >comments?).
-
- This does not sound like a very good idea to me. Here are the random
- thoughts on which this opinion is based:
-
- It is not clear to me exactly what model you have in mind here. Are
- you proposing to assume that the mean of the process is unchanged from
- analysis to analysis? What about the variance? Is that also to be
- assumed not changing over time? Is there any autocorrelation in
- either the basic values that are being measured or the measurement
- errors? Is your initial value sampled unconditionally, or are you
- extracting a subsequence from a measurement process that has been
- running for some time? No good statistical answer to your question
- can be obtained without (at least implicitly) answering these
- questions.
-
- Another important question: Is the threshold value known? If so, then
- it certainly would not be appropriate to fix it at +/- K standard
- deviations for any value of K, since this is equivalent to saying that
- you already know the standard deviation!
-
- Suppose that you assume that the analysis at time t produced a value
- x(t), that the x(t) values are independent normals with mean M and
- variance V, that x(t) is observed if and only if
-
- |x(t) - x(t-1)| > E,
-
- and that the aim of the analysis is to estimate M and V given
- knowledge of E and a sample (t1,x(t1)), ..., (tn,x(tn)) of n data
- points corresponding to the times at which the threshold was exceeded.
-
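To make the model above concrete, here is a minimal simulation sketch of the recording rule. It assumes exactly what is stated: independent normal x(t) with mean M and variance V, each compared with the previous analysis x(t-1) whether or not that one was recorded. The function name `simulate_recorded`, its arguments, and the choice to draw the starting value x(0) unconditionally are illustrative assumptions, not part of the original question.

```python
import random

def simulate_recorded(n, mean, sd, threshold, seed=0):
    """Simulate n analyses; return (t, x(t)) for those with |x(t) - x(t-1)| > E.

    Assumptions (hypothetical, for illustration): x(t) are independent
    normals, and x(0) is drawn unconditionally and never recorded.
    """
    rng = random.Random(seed)
    prev = rng.gauss(mean, sd)  # x(0)
    recorded = []
    for t in range(1, n + 1):
        x = rng.gauss(mean, sd)
        if abs(x - prev) > threshold:
            recorded.append((t, x))
        # Compare with the previous analysis, recorded or not.
        prev = x
    return recorded

if __name__ == "__main__":
    # One simulated day: 48 analyses, threshold E equal to 2.5 standard deviations.
    day = simulate_recorded(48, mean=10.0, sd=1.0, threshold=2.5)
    print(len(day), "of 48 analyses recorded")
```

If the data base actually compares each new value with the last *recorded* value rather than the last analysis, the line `prev = x` would have to move inside the `if`, and the model changes accordingly.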
- Then it seems that you have a fairly standard, if nonlinear, maximum
- likelihood problem. I haven't checked, but I suspect that all
- parameters are identified. In fact, I think that this remains true
- even when the threshold E is also treated as a parameter. Your main
- difficulty in this scenario would be to calculate the likelihood,
- since the censoring introduces a rather messy dependency among the
- observations.
-
- One simplification might be to first work conditionally on the
- sequence of dummy variables for whether or not an observation was
- recorded. Clearly you will not be able to estimate any of the
- parameters M, V or E individually from these indicators alone, since
- under the model above their distribution depends on the parameters
- only through E/Sqrt(V). However, I think that you can estimate that
- ratio of E to Sqrt(V) (the standard deviation) from the indicators
- alone.
-
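One rough way to get at the ratio of E to Sqrt(V) from the recording indicators is a moment estimate, sketched below. It is not the conditional likelihood calculation, only a shortcut under the same independent-normal model: there x(t) - x(t-1) is normal with mean 0 and variance 2V, so the probability that an analysis is recorded is p = 2[1 - Phi(E/Sqrt(2V))], and inverting gives E/Sqrt(V) = Sqrt(2) * Phi^{-1}(1 - p/2). The function name `threshold_to_sd_ratio` is hypothetical, and the estimate ignores the dependence between adjacent indicators (they share an x(t)), so it says nothing about precision.

```python
from math import sqrt
from statistics import NormalDist

def threshold_to_sd_ratio(n_recorded, n_analyses):
    """Moment estimate of E / Sqrt(V) from the fraction of analyses recorded.

    Assumes independent normal x(t) with constant mean and variance, and
    recording if and only if |x(t) - x(t-1)| > E.
    """
    p = n_recorded / n_analyses
    if not 0.0 < p < 1.0:
        raise ValueError("recorded fraction must be strictly between 0 and 1")
    # P(recorded) = 2 * (1 - Phi(E / Sqrt(2V))), solved for E / Sqrt(V).
    return sqrt(2.0) * NormalDist().inv_cdf(1.0 - p / 2.0)

if __name__ == "__main__":
    for k in (3, 6):
        print(k, "of 48 recorded -> E/Sqrt(V) estimate:", threshold_to_sd_ratio(k, 48))
```

If E is known, dividing it by this ratio gives a corresponding estimate of the standard deviation, again only as good as the independence and normality assumptions.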
- --
- T. Scott Thompson email: thompson@atlas.socsci.umn.edu
- Department of Economics phone: (612) 625-0119
- University of Minnesota fax: (612) 624-0209
-