NetNews Usenet Archive 1992 #18


Internet Message Format | 1992-08-14 | 4.0 KB

Path: sparky!uunet!decwrl!purdue!mentor.cc.purdue.edu!rain!gpetty
From: gpetty@rain.atms.purdue.edu (Grant W. Petty)
Newsgroups: sci.math.stat
Subject: Analyzing highly non-Gaussian, n-variate data
Message-ID: <56884@mentor.cc.purdue.edu>
Date: 14 Aug 92 23:41:27 GMT
Sender: news@mentor.cc.purdue.edu
Organization: Earth & Atmospheric Sciences, Purdue University
Lines: 80

I'm not sure if the subject line makes sense, but here are two questions which have been plaguing me (a non-statistician) for a long time:

1. You have a large set of measurements consisting of N-dimensional vectors, where N > 3 (in my case, N=7). Elements in the vectors are correlated, albeit in a nonlinear, non-Gaussian fashion. You have no a priori knowledge of the precise functional form of the physical relationship between the elements, though you are certain one exists. Under these circumstances, what can you do to

   a) determine the effective dimensionality of the data (i.e., the minimum number of independent parameters which are capable of explaining most of the "volume" of the cloud of points in N-dimensional space)?

   b) determine a segmented curve, surface, or hypersurface (depending on how many parameters you choose to specify) which passes "optimally" through the cloud of points?

If the data exhibited something like a multivariate Gaussian pdf, then it would make sense to just compute the eigenvalues/eigenvectors of the N x N covariance matrix; the effective dimensionality would then just be the number of eigenvectors which are required to explain the bulk of the total variance. However, this approach gives meaningless results if, say, your data all fall exactly on a single wildly contorted curve in N-space: the true dimensionality in this case would be only one, but PCA looks for something like the principal axes of an ellipsoidal volume containing the points and therefore would find several significant basis vectors.
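The PCA failure described above is easy to reproduce numerically. The following is a minimal sketch, not part of the original post, assuming NumPy is available; the test curve (`contorted_curve`, built from harmonics of a single parameter `t`) and the function names are hypothetical choices made only for illustration.

```python
import numpy as np

def pca_explained_variance(points):
    """Fraction of total variance carried by each principal component,
    sorted from largest to smallest."""
    cov = np.cov(points, rowvar=False)          # N x N covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)       # guard against tiny negative round-off
    return eigvals / eigvals.sum()

def contorted_curve(n_points=2000, n_dims=7, seed=0):
    """Hypothetical test data: every point lies on one curve in n_dims-space,
    so the data are governed by a single parameter t."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 2.0 * np.pi, n_points)
    # Each coordinate is a different harmonic of the same parameter t.
    cols = [np.sin((k + 1) * t) if k % 2 == 0 else np.cos((k + 1) * t)
            for k in range(n_dims)]
    return np.column_stack(cols)

if __name__ == "__main__":
    frac = pca_explained_variance(contorted_curve())
    needed = int(np.searchsorted(np.cumsum(frac), 0.9) + 1)
    print("variance fractions:", np.round(frac, 3))
    print("components needed for 90% of the variance:", needed)
```

For this particular curve the variance is spread roughly evenly over all seven components, so an eigenvalue cutoff suggests a high dimensionality even though one parameter generates every point.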
If you could somehow calculate the effective N-D "volume" of the cloud of points for a subset of the elements of your ensemble, and see how that volume changes as you increase the number of variables considered, it seems to me that that could give you a good clue. For example, if the cloud of points were truly one-dimensional (in some unknown non-linear transformation of your coordinate system), then the cloud of points should, in some sense, occupy a very small volume which is almost independent of the number of dimensions of the subspace onto which you are projecting the cloud of points. That is, a projection of the points onto a 2-D surface would follow a simple 2-D curve with zero area; a projection of the points onto a 3-D subspace would also yield a curve with zero volume, etc. If, on the other hand, the data were intrinsically 2-D, then projection onto a 2-D surface would yield a 2-D cloud with finite area; projection onto a 3-D space would yield a surface with finite area but zero volume, and so on for higher dimensional subspaces.

Does this make sense? And if so, do algorithms exist for looking into this behavior, given a sufficiently large set of multivariate data?

2. The second question is related: Given two N-dimensional clouds of points of the type described above, can one quantify the "overlap" between the volumes occupied by the two clouds and thus say something about whether the populations from which the points were taken appear to occupy the same or different regions in N-dimensional space? Obviously, one could compute multi-dimensional histograms and then compare the number of boxes which contain points from one, both, or neither of the clouds, but 7-dimensional histograms can get pretty bulky for any reasonable box size.

Are there any applications-oriented (rather than theoretical) textbooks which address these sorts of issues?

E-mail replies welcome

--
Grant W. Petty                          gpetty@rain.atms.purdue.edu
Assistant Prof. of Atmospheric Science  (317) 494-2544
Dept. of Earth & Atmospheric Sciences   "All standard disclaimers apply"
Purdue University, West Lafayette IN 47907-1397
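The "volume versus scale" idea in question 1 is close in spirit to the correlation dimension of Grassberger and Procaccia: count the fraction C(r) of point pairs closer than r and look at how it grows with r. If the cloud is locally D-dimensional, C(r) grows roughly like r^D at small r, so the slope of log C(r) against log r estimates D. The sketch below is offered as one possible illustration and is not from the original post; the function names, the quantile-based choice of radii, and the two test clouds are all assumptions.

```python
import numpy as np

def pairwise_distances(points):
    """All distinct pairwise Euclidean distances.
    Uses O(n^2) memory, so keep the number of points modest."""
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (points @ points.T)
    iu = np.triu_indices(len(points), k=1)       # each pair counted once
    return np.sqrt(np.clip(d2[iu], 0.0, None))   # clip round-off negatives

def correlation_dimension(points, q_lo=0.005, q_hi=0.05, n_radii=12):
    """Slope of log C(r) versus log r, where C(r) is the fraction of pairs
    closer than r. The radii span the q_lo..q_hi quantiles of the pairwise
    distances; that range is an assumption and should be tuned per data set."""
    dist = np.sort(pairwise_distances(points))
    r_lo, r_hi = np.quantile(dist, [q_lo, q_hi])
    radii = np.geomspace(r_lo, r_hi, n_radii)
    counts = np.searchsorted(dist, radii, side="right")   # pairs with d <= r
    c = counts / dist.size
    slope, _intercept = np.polyfit(np.log(radii), np.log(c), 1)
    return slope

def demo_clouds(n_points=1500, seed=0):
    """Hypothetical 7-D test data: a closed curve (one parameter) and a
    torus-like surface (two parameters)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 2.0 * np.pi, n_points)
    curve = np.column_stack([np.cos(t), np.sin(t), np.cos(2 * t), np.sin(2 * t),
                             np.cos(3 * t), np.sin(3 * t), 0.5 * np.sin(4 * t)])
    u = rng.uniform(0.0, 2.0 * np.pi, n_points)
    v = rng.uniform(0.0, 2.0 * np.pi, n_points)
    surface = np.column_stack([np.cos(u), np.sin(u), np.cos(v), np.sin(v),
                               np.cos(u + v), np.sin(u + v), 0.5 * np.sin(u - v)])
    return curve, surface

if __name__ == "__main__":
    curve, surface = demo_clouds()
    print("curve   estimate:", round(correlation_dimension(curve), 2))
    print("surface estimate:", round(correlation_dimension(surface), 2))
```

The estimate is sensitive to the range of radii: too small and noise or sparse sampling dominates, too large and the global shape of the cloud leaks in. With real measurements it is worth plotting log C(r) against log r and fitting only the straight portion.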
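The histogram comparison mentioned in question 2 need not be bulky if only the occupied boxes are stored rather than the full 7-dimensional array. Below is a minimal sketch of that box-counting comparison under that assumption; the function names, the default of 8 boxes per axis, and the Jaccard-style summary ratio are illustrative choices, not something taken from the original post.

```python
import numpy as np

def occupied_boxes(points, origin, box_size):
    """Set of integer box indices (one tuple per occupied box).
    Only occupied boxes are stored, so memory scales with the number of
    points rather than with boxes_per_axis ** n_dims."""
    idx = np.floor((points - origin) / box_size).astype(np.int64)
    return set(map(tuple, idx.tolist()))

def box_overlap(cloud_a, cloud_b, boxes_per_axis=8):
    """Count boxes holding points from cloud A only, cloud B only, or both,
    using a common grid laid over the combined extent of the two clouds."""
    both = np.vstack([cloud_a, cloud_b])
    origin = both.min(axis=0)
    span = both.max(axis=0) - origin
    span[span == 0] = 1.0                  # avoid zero box size on a constant axis
    box_size = span / boxes_per_axis
    a = occupied_boxes(cloud_a, origin, box_size)
    b = occupied_boxes(cloud_b, origin, box_size)
    shared = len(a & b)
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "both": shared,
        "jaccard": shared / len(a | b),    # shared boxes / all occupied boxes
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(500, 7))
    b = rng.normal(size=(500, 7)) + 20.0   # hypothetical well-separated cloud
    print(box_overlap(a, b))
```

One caution: in 7 dimensions the box counts depend strongly on sample size and box size. Two modest samples drawn from the same population may share few boxes simply because most occupied boxes hold a single point, so the counts are best compared against those obtained from two random halves of one cloud.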