- Path: sparky!uunet!dtix!darwin.sura.net!wupost!gumby!destroyer!ubc-cs!alberta!arms
- From: arms@cs.UAlberta.CA (Bill Armstrong)
- Newsgroups: comp.ai.neural-nets
- Subject: Re: Reducing Training time vs Generalisation
- Keywords: back propagation, training, generalisation
- Message-ID: <arms.714091659@spedden>
- Date: 17 Aug 92 22:47:39 GMT
- References: <1992Aug16.063825.15300@julian.uwo.ca> <1992Aug16.213939.15944@ccu1.aukuni.ac.nz> <arms.714014919@spedden> <36931@sdcc12.ucsd.edu>
- Sender: news@cs.UAlberta.CA (News Administrator)
- Organization: University of Alberta, Edmonton, Canada
- Lines: 154
- Nntp-Posting-Host: spedden.cs.ualberta.ca
-
- demers@cs.ucsd.edu (David DeMers) writes:
-
- >In article <arms.714014919@spedden> arms@cs.UAlberta.CA (Bill Armstrong) writes:
-
- >>Before one gets deeply into such questions, I think one should specify
- >>what one means by "generalization".
-
- > Z = {(x_i, y_i)} i = 1, ..., N
-
- > where the x_i are in some space X and the y_i
- > are in some space Y
-
- >(since real problems tend to always be bounded, I will
- >assume that X and Y are compact)
-
- >we believe that there is some functional relationship,
- > y_i = f(x_i) (possibly perturbed by noise)
-
- >we would like to estimate the expected value of y_m
- >given x_m, where x_m is not necessarily in Z.
-
- >---------------
-
- >"Generalization" is performance on x_m not in Z. Typically,
- >"interpolation" is used to describe x_m which are inside
- >the convex hull of the x_i, and "extrapolation" is used
- >to describe x_m outside the set of training data.
-
- I can agree with the above formulation of what we want.
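-
- Just to make the interpolation/extrapolation distinction concrete, a
- quick throwaway sketch in Python (the data are invented, and scipy's
- Delaunay triangulation is only one convenient way to test convex-hull
- membership):
-
-     import numpy as np
-     from scipy.spatial import Delaunay
-
-     # Illustrative only: is a query point inside the convex hull of the
-     # training inputs x_i ("interpolation") or outside ("extrapolation")?
-     rng = np.random.default_rng(0)
-     x_train = rng.uniform(-1.0, 1.0, size=(50, 2))   # the x_i in Z
-     hull = Delaunay(x_train)                          # triangulates the x_i
-
-     def query_kind(x_m):
-         inside = hull.find_simplex(np.atleast_2d(x_m))[0] >= 0
-         return "interpolation" if inside else "extrapolation"
-
-     print(query_kind([0.0, 0.0]))   # almost surely inside  -> interpolation
-     print(query_kind([5.0, 5.0]))   # far outside the hull  -> extrapolation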
-
- lots of stuff deleted....
-
- >A reasonable way to think about all this is that you
- >want to minimize some distance measure of your prediction
- >and the "truth" over some appropriately weighted set
- >in the domain. For example, minimize squared error
- >integrated over X (weighted by the density).
-
- This leaves only two problems: defining the "truth" and computing the
- function that gives our prediction. Both are very difficult to do in
- general.
-
- For example, X may contain 10^100 points, so we can't evaluate the
- integrated squared error. Even if we could, it might be better to use
- a worst-case criterion, e.g. maximum absolute error, or maybe some kind
- of risk as in game theory. Anyway, the intention is clear. At this
- point we still don't know how to specify what generalization is if we
- don't know the correct y value for every x in X.
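-
- A throwaway numerical sketch of that contrast, in Python (the numbers
- are invented):
-
-     import numpy as np
-
-     # Illustrative only: average-case vs. worst-case error on a finite
-     # sample.  y_true stands in for the desired outputs, y_pred for the
-     # learned function's predictions.
-     y_true = np.zeros(10)
-     y_pred = np.full(10, 0.1)     # small error everywhere ...
-     y_pred[7] = 3.0               # ... except one blunder
-
-     mse     = np.mean((y_true - y_pred) ** 2)   # ~0.91: blunder diluted
-     max_err = np.max(np.abs(y_true - y_pred))   # 3.0: blunder dominates
-     print(mse, max_err)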
-
- >>...it should mean something like: for a test point at distance d
- >>from a correctly learned training point, the response was still
- >>correct (This definition works for boolean and multi-class problems).
- >>You could generalize this using interpolation and continuous
- >>functions.
-
- >This assumes that there is some continuity; that if x_l is near x_m,
- >then y_l will be near y_m. This is typical, and is usually a
- >worthwhile prior assumption. However, the "correct" part might
- >be problematic for continuous function approximation problems -
- >that is, what is "correct" interpolation?
-
- Continuity of the desired output function (even if unknown) and of the
- mechanism for implementing it (in theory) are OK, except they don't go
- far enough. If I give you a plot of a million (x,y) pairs in R^2, and
- ask you to interpolate based on continuity alone, then between any two
- of the given points you can still fit a continuous approximation of the
- plot of all the depths of the ocean around the equator, true to
- vertical scale and horizontally compressed. Continuity is virtually
- no restriction at all in the context we are discussing, and certainly
- means nothing for generalization in a digital world.
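-
- To see how little continuity buys, here is a throwaway Python sketch:
- two continuous functions that agree exactly on the training points but
- differ wildly in between (the "ocean floor" is faked with a
- high-frequency sine):
-
-     import numpy as np
-
-     # Illustrative only: same training data, two continuous interpolants.
-     x_train = np.linspace(0.0, 1.0, 11)   # training inputs x_i = i/10
-     y_train = x_train ** 2                # some smooth target values
-
-     def f_tame(x):
-         return np.interp(x, x_train, y_train)   # piecewise linear
-
-     def f_wild(x):
-         # also continuous, and also exact at every x_i, because the added
-         # term sin(1010*pi*(i/10)) = sin(101*pi*i) = 0 at the training inputs
-         return np.interp(x, x_train, y_train) + 50.0 * np.sin(1010 * np.pi * x)
-
-     print(f_tame(0.55), f_wild(0.55))   # about 0.305 vs. about -49.7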
-
- The point is well taken about "correctness". We usually don't have a
- definition of correctness. For example, we don't have a perfect
- description of the set of "A"s on a 32x32 grid, but we need some
- formula or other to replace this knowledge in practice. Correct
- interpolation is another thing we don't have a grasp of.
-
- >>Once this is agreed upon, some questions start to make sense, like:
- >>why should a multilayer perceptron generalize at all?" It's not
- >>because of continuity, because continuous functions can oscillate as
- >>rapidly as you want, and could fit any finite, non-contradictory
- >>training set as well as you want.
-
- >As is shown in the above-referenced paper, feedforward neural
- >networks tend to be poor-bias but low-variance estimators, which
- >improve bias at the cost of increasing variance during training.
- >That is, a "random" network is generally pretty smooth, and
- >adapts its response surface to the data.
-
- Is that the music of celestial violins I hear???
-
- >By smooth,
- >I mean more than just differentiable, I mean that the derivatives
- >are bounded.
-
- OK, differentiable is fine, but that still allows you to fit a
- differentiable ocean floor plot between the points. A bound on the
- derivatives would help *if* it is small enough, because, by the mean
- value theorem, it would give us a Lipschitz condition too.
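-
- To put rough numbers on that (L and d are invented here):
-
-     # Illustrative only: the mean value theorem argument.  If the target f
-     # and the learned function g both have |derivative| <= L everywhere,
-     # and g is exactly right at a training point x0, then at a test point
-     # x with |x - x0| = d,
-     #
-     #   |f(x) - g(x)| <= |f(x) - f(x0)| + |g(x0) - g(x)| <= 2 * L * d
-     #
-     L = 5.0    # assumed derivative bound (hypothetical)
-     d = 0.01   # distance from the correctly learned training point
-     print(2 * L * d)   # 0.1 -- and this bound holds at *every* such point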
-
- >So the reason why NNs generalize so well is because the prior
- >assumption of smoothness is inherent in the structure of NNs.
-
- Sure, in theory, you have smoothness, continuity and differentiability.
- These mean *nothing* in the world of digital arithmetic.
- ...
-
- >If you can estimate integrated squared error, you might do better -
- >Lipschitz can be pretty pessimistic, which might be what you want
- >if, as you mentioned in a previous post, you have to satisfy a
- >customer and need a spec in the contract.
-
- I disagree: squared error can only be evaluated on a finite set, which
- is usually sparse in the input space, so it gives us no control over
- errors at untested points. A Lipschitz condition covers *all* points,
- and covering them all is what we need to define generalization properly.
-
- >>Other criteria of good generalization might include monotonicity of the
- >>synthesized function.
-
- >Or smoothness bounds, which is what the spline people use.
-
- Splines are OK too, because they cover all points.
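-
- For instance, a cubic spline through the same sort of data (throwaway
- Python sketch using scipy, illustrative only):
-
-     import numpy as np
-     from scipy.interpolate import CubicSpline
-
-     # A cubic spline is defined at *every* x in the range, and the
-     # smoothness built into each polynomial piece keeps it from doing
-     # anything wild between the knots.
-     x_train = np.linspace(0.0, 1.0, 11)
-     y_train = x_train ** 2
-     s = CubicSpline(x_train, y_train)
-     print(float(s(0.55)))   # close to 0.3025 -- no ocean floor in between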
-
- >>In the case of adaptive logic networks, generalization is based on the
- >>fact that perturbations of an input vector tend to get filtered out as
- >>the signal is processed in a tree
-
- >Thus you have built in the assumption that a small neighborhood
- >of a learned datum has the same response?
-
- That's the way ALNs generalize. They don't interpolate. To get
- continuous functions, like those the theory of BP networks produces,
- you have to do some smoothing in addition to using ALNs.
-
- >So, does the modeling method used bias the search towards functions
- >much like the true one? I.e. simple boolean functions.
-
- Yes, the fact that we train trees biases the solution towards insensitive
- functions, suitable for pattern recognition. If you like, that is the
- ALN meaning of "simple".
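-
- A toy Python illustration of the filtering effect (this is not the ALN
- learning algorithm, just a fixed tree of ANDs and ORs over 8 boolean
- inputs, to show how often a single flipped input bit never reaches the
- root):
-
-     import random
-
-     def tree(b):
-         # two layers of AND/OR nodes feeding a root AND
-         return ((b[0] and b[1]) or (b[2] and b[3])) and \
-                ((b[4] or b[5]) or (b[6] or b[7]))
-
-     random.seed(0)
-     absorbed, trials = 0, 1000
-     for _ in range(trials):
-         x = [random.random() < 0.5 for _ in range(8)]
-         y = tree(x)
-         i = random.randrange(8)
-         x[i] = not x[i]          # perturb one input bit
-         absorbed += (tree(x) == y)
-     print(absorbed / trials)     # roughly 0.8: most flips are filtered out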
-
- Thanks for the thoughtful comments.
-
- PS I hope you don't take the "violins" comment too much to heart, but
- the truth is that with a least squared error criterion on the training
- set, I can get the optimal learned function to create a disaster very
- easily.
- --
- ***************************************************
- Prof. William W. Armstrong, Computing Science Dept.
- University of Alberta; Edmonton, Alberta, Canada T6G 2H1
- arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071
-