- Path: sparky!uunet!dtix!darwin.sura.net!wupost!gumby!destroyer!ubc-cs!alberta!arms
- From: arms@cs.UAlberta.CA (Bill Armstrong)
- Newsgroups: comp.ai.neural-nets
- Subject: Re: Reducing Training time vs Generalisation
- Keywords: back propagation, training, generalisation
- Message-ID: <arms.714091659@spedden>
- Date: 17 Aug 92 22:47:39 GMT
- References: <1992Aug16.063825.15300@julian.uwo.ca> <1992Aug16.213939.15944@ccu1.aukuni.ac.nz> <arms.714014919@spedden> <36931@sdcc12.ucsd.edu>
- Sender: news@cs.UAlberta.CA (News Administrator)
- Organization: University of Alberta, Edmonton, Canada
- Lines: 154
- Nntp-Posting-Host: spedden.cs.ualberta.ca
-
- demers@cs.ucsd.edu (David DeMers) writes:
-
- >In article <arms.714014919@spedden> arms@cs.UAlberta.CA (Bill Armstrong) writes:
-
- >>Before one gets deeply into such questions, I think one should specify
- >>what one means by "generalization".
-
- > Z = {(x_i, y_i)} i = 1, ..., N
-
- > where the x_i are in some space X and the y_i
- > are in some space Y
-
- >(since real problems tend to always be bounded, I will
- >assume that X and Y are compact)
-
- >we believe that there is some functional relationship,
- > y_i = f(x_i) (possibly perturbed by noise)
-
- >we would like to estimate the expected value of y_m
- >given x_m, where x_m is not necessarily in Z.
-
- >---------------
-
- >"Generalization" is performance on x_m not in Z. Typically,
- >"interpolation" is used to describe x_m which are inside
- >the convex hull of the x_i, and "extrapolation" is used
- >to describe x_m outside the set of training data.
-
- I can agree with the above formulation of what we want.
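-
- Just to make the interpolation/extrapolation distinction concrete, a
- quick throwaway sketch in Python (the data are invented, and scipy's
- Delaunay triangulation is only one convenient way to test convex-hull
- membership):
-
-     import numpy as np
-     from scipy.spatial import Delaunay
-
-     # Illustrative only: is a query point inside the convex hull of the
-     # training inputs x_i ("interpolation") or outside ("extrapolation")?
-     rng = np.random.default_rng(0)
-     x_train = rng.uniform(-1.0, 1.0, size=(50, 2))   # the x_i in Z
-     hull = Delaunay(x_train)                          # triangulates the x_i
-
-     def query_kind(x_m):
-         inside = hull.find_simplex(np.atleast_2d(x_m))[0] >= 0
-         return "interpolation" if inside else "extrapolation"
-
-     print(query_kind([0.0, 0.0]))   # almost surely inside  -> interpolation
-     print(query_kind([5.0, 5.0]))   # far outside the hull  -> extrapolation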
-
- lots of stuff deleted....
-
- >A reasonable way to think about all this is that you
- >want to minimize some distance measure of your prediction
- >and the "truth" over some appropriately weighted set
- >in the domain. For example, minimize squared error
- >integrated over X (weighted by the density).
-
- This leaves only two problems: defining the "truth" and computing the
- function that gives our prediction. Both are very difficult to do in
- general.
-
- For example, X may contain 10^100 points, so we can't evaluate the
- integrated squared error. Even if we could, it might be better to use
- a worst-case criterion, e.g. maximum absolute error, or maybe some kind
- of risk as in game theory. Anyway, the intention is clear. At this
- point we still don't know how to specify what generalization is if we
- don't know the correct y value for every x in X.
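-
- A throwaway numerical sketch of that contrast, in Python (the numbers
- are invented):
-
-     import numpy as np
-
-     # Illustrative only: average-case vs. worst-case error on a finite
-     # sample.  y_true stands in for the desired outputs, y_pred for the
-     # learned function's predictions.
-     y_true = np.zeros(10)
-     y_pred = np.full(10, 0.1)     # small error everywhere ...
-     y_pred[7] = 3.0               # ... except one blunder
-
-     mse     = np.mean((y_true - y_pred) ** 2)   # ~0.91: blunder diluted
-     max_err = np.max(np.abs(y_true - y_pred))   # 3.0: blunder dominates
-     print(mse, max_err)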
-
- >>...it should mean something like: for a test point at distance d
- >>from a correctly learned training point, the response was still
- >>correct (This definition works for boolean and multi-class problems).
- >>You could generalize this using interpolation and continuous
- >>functions.
-
- >This assumes that there is some continuity; that if x_l is near x_m,
- >then y_l will be near y_m. This is typical, and is usually a
- >worthwhile prior assumption. However, the "correct" part might
- >be problematic for continuous function approximation problems -
- >that is, what is "correct" interpolation?
-
- Continuity of the desired output function (even if unknown) and of the
- mechanism for implementing it (in theory) are OK, except they don't go
- far enough. If I give you a plot of a million (x,y) pairs in R^2, and
- ask you to interpolate based on continuity alone, then between any two
- of the given points you can still fit a continuous approximation of the
- plot of all the depths of the ocean around the equator, true to
- vertical scale and horizontally compressed. Continuity is virtually
- no restriction at all in the context we are discussing, and certainly
- means nothing for generalization in a digital world.
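-
- To see how little continuity buys, here is a throwaway Python sketch:
- two continuous functions that agree exactly on the training points but
- differ wildly in between (the "ocean floor" is faked with a
- high-frequency sine):
-
-     import numpy as np
-
-     # Illustrative only: same training data, two continuous interpolants.
-     x_train = np.linspace(0.0, 1.0, 11)   # training inputs x_i = i/10
-     y_train = x_train ** 2                # some smooth target values
-
-     def f_tame(x):
-         return np.interp(x, x_train, y_train)   # piecewise linear
-
-     def f_wild(x):
-         # also continuous, and also exact at every x_i, because the added
-         # term sin(1010*pi*(i/10)) = sin(101*pi*i) = 0 at the training inputs
-         return np.interp(x, x_train, y_train) + 50.0 * np.sin(1010 * np.pi * x)
-
-     print(f_tame(0.55), f_wild(0.55))   # about 0.305 vs. about -49.7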
-
- The point is well taken about "correctness". We usually don't have a
- definition of correctness. For example, we don't have a perfect
- description of the set of "A"s on a 32x32 grid, but we need some
- formula or other to replace this knowledge in practice. Correct
- interpolation is another thing we don't have a grasp of.
-
- >>Once this is agreed upon, some questions start to make sense, like:
- >>why should a multilayer perceptron generalize at all?" It's not
- >>because of continuity, because continuous functions can oscillate as
- >>rapidly as you want, and could fit any finite, non-contradictory
- >>training set as well as you want.
-
- >As is shown in the above-referenced paper, feedforward neural
- >networks tend to be poor-bias but low-variance estimators, which
- >improve bias at the cost of increasing variance during training.
- >That is, a "random" network is generally pretty smooth, and
- >adapts its response surface to the data.
-
- Is that the music of celestial violins I hear???
-
- >By smooth,
- >I mean more than just differentiable, I mean that the derivatives
- >are bounded.
-
- OK, differentiable is fine, but that still allows you to fit a
- differentiable ocean floor plot between the points. A bound on the
- derivatives would help *if* it is small enough, because, by the mean
- value theorem, it would give us a Lipschitz condition too.
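-
- To put rough numbers on that (L and d are invented here):
-
-     # Illustrative only: the mean value theorem argument.  If the target f
-     # and the learned function g both have |derivative| <= L everywhere,
-     # and g is exactly right at a training point x0, then at a test point
-     # x with |x - x0| = d,
-     #
-     #   |f(x) - g(x)| <= |f(x) - f(x0)| + |g(x0) - g(x)| <= 2 * L * d
-     #
-     L = 5.0    # assumed derivative bound (hypothetical)
-     d = 0.01   # distance from the correctly learned training point
-     print(2 * L * d)   # 0.1 -- and this bound holds at *every* such point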
-
- >So the reason why NNs generalize so well is because the prior
- >assumption of smoothness is inherent in the structure of NNs.
-
- Sure, in theory, you have smoothness, continuity and differentiability.
- These mean *nothing* in the world of digital arithmetic.
- ...
-
- >If you can estimate integrated squared error, you might do better -
- >Lipschitz can be pretty pessimistic, which might be what you want
- >if, as you mentioned in a previous post, you have to satisfy a
- >customer and need a spec in the contract.
-
- I disagree: squared error can only be evaluated on a finite set, which
- is usually sparse in the input space, so it gives us no control over
- errors at untested points. A Lipschitz condition covers *all* points,
- and covering them all is what we need to define generalization properly.
-
- >>Other criteria of good generalization might include monotonicity of the
- >>synthesized function.
-
- >Or smoothness bounds, which is what the spline people use.
-
- Splines are OK too, because they cover all points.
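-
- For instance, a cubic spline through the same sort of data (throwaway
- Python sketch using scipy, illustrative only):
-
-     import numpy as np
-     from scipy.interpolate import CubicSpline
-
-     # A cubic spline is defined at *every* x in the range, and the
-     # smoothness built into each polynomial piece keeps it from doing
-     # anything wild between the knots.
-     x_train = np.linspace(0.0, 1.0, 11)
-     y_train = x_train ** 2
-     s = CubicSpline(x_train, y_train)
-     print(float(s(0.55)))   # close to 0.3025 -- no ocean floor in between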
-
- >>In the case of adaptive logic networks, generalization is based on the
- >>fact that perturbations of an input vector tend to get filtered out as
- >>the signal is processed in a tree
-
- >Thus you have built in the assumption that a small neighborhood
- >of a learned datum has the same response?
-
- That's the way ALNs generalize. They don't interpolate. To get
- continuous functions, like those the theory of BP networks produces,
- you have to do some smoothing in addition to using ALNs.
-
- >So, does the modeling method used bias the search towards functions
- >much like the true one? I.e. simple boolean functions.
-
- Yes, the fact that we train trees biases the solution towards insensitive
- functions, suitable for pattern recognition. If you like, that is the
- ALN meaning of "simple".
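-
- A toy Python illustration of the filtering effect (this is not the ALN
- learning algorithm, just a fixed tree of ANDs and ORs over 8 boolean
- inputs, to show how often a single flipped input bit never reaches the
- root):
-
-     import random
-
-     def tree(b):
-         # two layers of AND/OR nodes feeding a root AND
-         return ((b[0] and b[1]) or (b[2] and b[3])) and \
-                ((b[4] or b[5]) or (b[6] or b[7]))
-
-     random.seed(0)
-     absorbed, trials = 0, 1000
-     for _ in range(trials):
-         x = [random.random() < 0.5 for _ in range(8)]
-         y = tree(x)
-         i = random.randrange(8)
-         x[i] = not x[i]          # perturb one input bit
-         absorbed += (tree(x) == y)
-     print(absorbed / trials)     # roughly 0.8: most flips are filtered out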
-
- Thanks for the thoughtful comments.
-
- PS I hope you don't take the "violins" comment too much to heart, but
- the truth is that with a least squared error criterion on the training
- set, I can get the optimal learned function to create a disaster very
- easily.
- --
- ***************************************************
- Prof. William W. Armstrong, Computing Science Dept.
- University of Alberta; Edmonton, Alberta, Canada T6G 2H1
- arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071
-