Path: sparky!uunet!usc!elroy.jpl.nasa.gov!ames!pacbell.com!network.ucsd.edu!sdcc12!cs!demers
From: demers@cs.ucsd.edu (David DeMers)
Newsgroups: comp.ai.neural-nets
Subject: Re: Reducing Training time vs Generalisation
Keywords: back propagation, training, generalisation
Message-ID: <36931@sdcc12.ucsd.edu>
Date: 17 Aug 92 20:14:39 GMT
References: <1992Aug16.063825.15300@julian.uwo.ca> <1992Aug16.213939.15944@ccu1.aukuni.ac.nz> <arms.714014919@spedden>
Sender: news@sdcc12.ucsd.edu
Organization: CSE Dept., U.C. San Diego
Lines: 213
Nntp-Posting-Host: beowulf.ucsd.edu

In article <arms.714014919@spedden> arms@cs.UAlberta.CA (Bill Armstrong) writes:
>edwin@ccu1.aukuni.ac.nz (Edwin Ng) writes:

...

>>I'd like to ask if anyone has
>>anything to add about the quality of generalisation
>>resulting from using different parameters to speed up
>>training??

...


>Before one gets deeply into such questions, I think one should specify
>what one means by "generalization".

Let me jump back in here. I might bore you all with a
bit of "first principles", but it helps me think about what
we are trying to do. The jargon is not always used with
the same semantics across disciplines, so pretend I know
what I'm saying when I use a word in a way that seems a
bit imprecise (after all, I'm just hitting the 'R' key, not
writing a journal paper here :-)

First, let's focus on a particular, though very broad,
class of problems:

We have a finite set Z, consisting of pairs of vectors (or scalars, or
other mathematical objects, possibly not in a metric space ...):

    Z = {(x_i, y_i)},  i = 1, ..., N

where the x_i are in some space X and the y_i are in some space Y.

(Since real problems tend to be bounded, I will assume that X and Y
are compact.)

We believe that there is some functional relationship,
y_i = f(x_i) (possibly perturbed by noise).

We would like to estimate the expected value of y_m
given x_m, where x_m is not necessarily in Z.

---------------

(Most of my work is either classification or function approximation
over continuous spaces.)

"Generalization" is performance on x_m not in Z. Typically,
"interpolation" is used to describe x_m which lie inside
the convex hull of the x_i, and "extrapolation" is used
to describe x_m which lie outside that hull.
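
As an aside, the interpolation/extrapolation distinction can be made
operational: x_m is in the convex hull of the x_i iff there exist
lambda_i >= 0 summing to 1 with sum_i lambda_i x_i = x_m, which is a
linear feasibility problem. A rough Python sketch (numpy/scipy are my
choice here, not anything from the thread):

  # Sketch: is the query point x_m inside the convex hull of the x_i?
  import numpy as np
  from scipy.optimize import linprog

  def in_convex_hull(X, x_m):
      """X: (N, d) array of training inputs; x_m: (d,) query point."""
      N = X.shape[0]
      c = np.zeros(N)                      # feasibility only, no objective
      A_eq = np.vstack([X.T, np.ones(N)])  # sum_i lambda_i x_i = x_m, sum_i lambda_i = 1
      b_eq = np.append(x_m, 1.0)
      res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * N)
      return res.success                   # feasible => "interpolation"

If the program is feasible, x_m counts as interpolation in the above
sense; otherwise it is extrapolation.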

The problems being discussed in this thread seem to fall
into this framework -

  Classification: the y_i are the class memberships, and f can be
    decomposed into the characteristic functions for each class.
    We want to construct a classifier for the x_i.

  Function approximation: the above should remind us of regression
    or interpolation problems. Choose X to be {0,1}^k for boolean
    problems.

The literature on this is enormous...

The main difficulty with this problem stems from the fact that
for any data set Z, there is a huge number of possible functions
f which will reproduce Z exactly. If we are in real-valued spaces,
there will be uncountably many. If we are in k-dimensional
boolean space, there will be 2^(2^k - N) such functions (assuming
the N training points are distinct), all of which reproduce the
data in Z exactly.
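
To make the counting concrete, here is a brute-force check of the
2^(2^k - N) figure for a tiny k (my own toy illustration; it is only
feasible for very small k, since there are 2^(2^k) boolean functions
in total):

  # Count the boolean functions on {0,1}^k that agree with a data set Z.
  from itertools import product

  k = 2
  Z = {(0, 0): 0, (1, 1): 1}                # N = 2 labelled points
  inputs = list(product([0, 1], repeat=k))  # all 2^k possible inputs

  consistent = 0
  for outputs in product([0, 1], repeat=len(inputs)):  # every boolean function
      f = dict(zip(inputs, outputs))
      if all(f[x] == y for x, y in Z.items()):
          consistent += 1

  print(consistent)   # prints 4, i.e. 2**(2**k - len(Z))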

So, we have to pick one of these. How do we choose? This
is where prior information or belief comes in. Typically,
we use Occam's razor and choose the "simplest", though there
are difficulties in determining what "simple" means.

Parametric techniques bias the search to a small set of
models, e.g. linear regression. It is usually pretty easy
(computationally) to find the best parameter values for a given model,
though depending on the model, the search for the best values
of the parameters could end up as complicated as non-linear
global optimization...
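
As a concrete parametric example: for a linear model the "search" is
just ordinary least squares. A minimal numpy sketch on made-up data
(names and numbers are mine):

  # Fit y ~ w.x + b by least squares on a toy training set.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.uniform(-1, 1, size=(50, 3))                    # the x_i
  w_true, b_true = np.array([2.0, -1.0, 0.5]), 0.3
  y = X @ w_true + b_true + 0.05 * rng.normal(size=50)    # y_i = f(x_i) + noise

  A = np.hstack([X, np.ones((50, 1))])                    # design matrix with bias column
  params, *_ = np.linalg.lstsq(A, y, rcond=None)
  w_hat, b_hat = params[:-1], params[-1]                  # estimated parameters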

Non-parametric techniques attempt to "learn" the model from
the data - they usually need enormous amounts of data to
say anything statistically meaningful, but have the advantage
of reducing the need to impose a bias on the model.
k-nearest-neighbors is a non-parametric method for building
a classifier; projection pursuit is another example.
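
A bare-bones k-nearest-neighbors classifier, to make the contrast
concrete (my own sketch; Euclidean distance and majority vote assumed):

  # Non-parametric example: classify x_m by a vote among its k nearest x_i.
  import numpy as np
  from collections import Counter

  def knn_predict(X_train, y_train, x_m, k=3):
      dists = np.linalg.norm(X_train - x_m, axis=1)   # distances to every x_i
      nearest = np.argsort(dists)[:k]                 # indices of the k closest
      votes = Counter(y_train[i] for i in nearest)
      return votes.most_common(1)[0][0]               # majority class

Note there are no fitted parameters at all; the "model" is the training
data itself.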

There are some techniques which look like a cross between
the two, such as feedforward neural networks.
The network is parametric (I'm assuming the usual implementation
of setting up a fixed architecture and then training) in the
sense that one is trying to find the maximum likelihood values
for the weights (insert enough appropriate math stuff here),
but non-parametric in the sense that we really don't believe
that the true function is a composition of a bunch of sigmoids.
But we DO know that we can get arbitrarily close to any continuous
function (on a compact set) with a single-hidden-layer feedforward network.
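
For concreteness, here is a single-hidden-layer sigmoid network trained
by plain gradient descent on squared error; a minimal numpy sketch of
the usual setup, not anyone's production code:

  # One hidden layer of sigmoids, linear output, vanilla batch backprop.
  import numpy as np

  def sigmoid(a):
      return 1.0 / (1.0 + np.exp(-a))

  rng = np.random.default_rng(0)
  X = rng.uniform(-1, 1, size=(100, 2))          # training inputs x_i
  y = np.sin(3 * X[:, 0]) * X[:, 1]              # a smooth target function
  H = 10                                         # number of hidden units
  W1, b1 = 0.5 * rng.normal(size=(2, H)), np.zeros(H)
  W2, b2 = 0.5 * rng.normal(size=H), 0.0

  lr = 0.1
  for epoch in range(2000):
      h = sigmoid(X @ W1 + b1)                   # hidden activations
      y_hat = h @ W2 + b2                        # network output
      err = y_hat - y
      gW2 = h.T @ err / len(X)                   # gradient of (1/2) mean squared error
      gb2 = err.mean()
      gh = np.outer(err, W2) * h * (1 - h)       # backprop through the sigmoids
      gW1 = X.T @ gh / len(X)
      gb1 = gh.mean(axis=0)
      W1 -= lr * gW1; b1 -= lr * gb1
      W2 -= lr * gW2; b2 -= lr * gb2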

What is all this leading up to, you ask? (Or those of you
still reading, anyway...)

A reasonable way to think about all this is that you
want to minimize some distance measure between your prediction
and the "truth" over some appropriately weighted set
in the domain. For example, minimize the squared error
integrated over X (weighted by the input density).
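
In practice that integral is approximated by averaging the squared
error over a test sample drawn from the same input density, which
weights the domain exactly as described; a tiny sketch (function names
are mine):

  # Monte Carlo estimate of density-weighted integrated squared error.
  import numpy as np

  def estimated_risk(predict, X_test, y_test):
      """Mean squared error over a sample drawn from the input density."""
      return np.mean((predict(X_test) - y_test) ** 2)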

A good reference on the bias vs. variance tradeoff is
Geman, Stuart, Elie Bienenstock & Rene Doursat,
"Neural Networks and the Bias/Variance Dilemma", Neural Computation 4,
no. 1 (Jan 1992). Estimation error can be decomposed into
two components: the "bias", which is the systematic error of the
estimator averaged over training sets, and the "variance", which is
how much the estimator depends on the particular Z it was given.
Both can be quantified and estimated.
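
These estimates are easy to set up by simulation: draw many training
sets, fit the same estimator to each, and decompose its error at a test
point into the error of the average prediction (bias) and the spread of
the predictions (variance). A hedged sketch with a toy target and a
deliberately rigid linear fit (all names and numbers mine, not from the
Geman et al. paper):

  # Estimate bias^2 and variance of an estimator at a fixed point x0
  # by resampling training sets Z and refitting each time.
  import numpy as np

  rng = np.random.default_rng(0)
  f = lambda x: np.sin(2 * np.pi * x)            # the "truth"
  x0 = 0.3                                       # test point
  preds = []
  for trial in range(500):
      x = rng.uniform(0, 1, 20)                  # a fresh training set Z
      y = f(x) + 0.1 * rng.normal(size=20)
      coef = np.polyfit(x, y, deg=1)             # a biased (straight-line) model
      preds.append(np.polyval(coef, x0))

  preds = np.array(preds)
  bias_sq = (preds.mean() - f(x0)) ** 2          # systematic error of the average fit
  variance = preds.var()                         # dependence on the particular Z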


>...it should mean something like: for a test point at distance d
>from a correctly learned training point, the response was still
>correct (This definition works for boolean and multi-class problems).
>You could generalize this using interpolation and continuous
>functions.

This assumes that there is some continuity: that if x_l is near x_m,
then y_l will be near y_m. This is typical, and is usually a
worthwhile prior assumption. However, the "correct" part might
be problematic for continuous function approximation problems -
that is, what is "correct" interpolation?

>Once this is agreed upon, some questions start to make sense, like:
>why should a multilayer perceptron generalize at all?" It's not
>because of continuity, because continuous functions can oscillate as
>rapidly as you want, and could fit any finite, non-contradictory
>training set as well as you want.

As is shown in the above-referenced paper, feedforward neural
networks tend to start out as high-bias but low-variance estimators,
and training improves the bias at the cost of increased variance.
That is, a "random" network is generally pretty smooth, and
training adapts its response surface to the data. By smooth
I mean more than just differentiable: the derivatives
are bounded.

So the reason NNs generalize as well as they do is that a prior
assumption of smoothness is built into their structure.

The overfitting problem is related: you can reduce bias by increasing
the order of your model; by adding more terms (hidden units, ...)
you can fit your training data arbitrarily closely.
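
The standard way to see this is to sweep the model order and watch the
training error fall monotonically while the test error eventually turns
back up; a small polynomial-degree sketch on toy data (my own numbers):

  # Overfitting: training error keeps dropping with model order,
  # test error eventually rises again.
  import numpy as np

  rng = np.random.default_rng(1)
  f = lambda x: np.sin(2 * np.pi * x)
  x_tr, x_te = rng.uniform(0, 1, 15), rng.uniform(0, 1, 200)
  y_tr = f(x_tr) + 0.1 * rng.normal(size=15)
  y_te = f(x_te) + 0.1 * rng.normal(size=200)

  for deg in (1, 3, 5, 9, 13):
      coef = np.polyfit(x_tr, y_tr, deg)
      train_err = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
      test_err = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
      print(deg, train_err, test_err)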

>If you could get a Lipschitz bound on the result of NN training, then
>you could be confident about getting some reasonable generalization:
>i.e. if x and x' are two input vectors, then the outputs y and y'
>would satisfy |y - y'| <= C |x - x'|. This is *much* stronger than
>continuity. Determining a reasonably small C might be a problem for a
>given net.

If you can estimate the integrated squared error, you might do better -
a Lipschitz bound can be pretty pessimistic, though a worst-case
guarantee might be exactly what you want if, as you mentioned in a
previous post, you have to satisfy a customer and need a spec in the
contract.
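
For what it's worth, for a single-hidden-layer sigmoid net a (crude,
global) C can be read straight off the weights: since the sigmoid's
slope is at most 1/4, |y - y'| <= C |x - x'| with
C = 0.25 * ||W2|| * ||W1|| in operator norm. A sketch (weight names are
mine, matching the toy network earlier in this post):

  # Pessimistic global Lipschitz constant for y = W2 . sigmoid(W1 x + b1) + b2.
  import numpy as np

  def lipschitz_bound(W1, W2):
      # spectral norm for the weight matrix, Euclidean norm for the output weights
      return 0.25 * np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

Such a C is usually far larger than the slopes the trained net actually
exhibits on the data, which is the pessimism mentioned above.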

>Other criteria of good generalization might include monotonicity of the
>synthesized function.

Or smoothness bounds, which is what the spline people use.

>In the case of adaptive logic networks, generalization is based on the
>fact that perturbations of an input vector tend to get filtered out as
>the signal is processed in a tree

Thus you have built in the assumption that a small neighborhood
of a learned datum has the same response?

>The most impressive generalization that has been attained with ALNs
>was work by Dekang Lin, where he used tree growth and a training set
>of about 10^5 points, and obtained 91.1% generalization to a space of
>2^512 points (the multiplexor problem). The sparsity of the training
>set in that space boggles the mind.

So, does the modeling method bias the search towards functions much
like the true one, i.e. towards simple boolean functions?

------

For further reading, I highly recommend the Geman article
and many of its references.

Note that there is a large Bayesian camp - see, e.g., David MacKay's
thesis for the Bayesian neural network framework.

David Wolpert has an interesting approach, though his terminology
is pretty idiosyncratic. He has published in Neural Networks recently.

The fields of statistical inference (regression), spline
interpolation, approximation theory and much of statistical
pattern recognition are concerned with these questions,
so please forgive my failures here. I'd advise anyone
deeply interested in these questions to spend some time
reading statistical inference texts, and Duda & Hart for
pattern recognition.

--
Dave DeMers                              ddemers@UCSD   demers@cs.ucsd.edu
Computer Science & Engineering C-014     demers%cs@ucsd.bitnet
UC San Diego                             ...!ucsd!cs!demers
La Jolla, CA 92093-0114                  (619) 534-0688, or -8187, FAX: (619) 534-7029
