Path: sparky!uunet!usc!elroy.jpl.nasa.gov!ames!pacbell.com!network.ucsd.edu!sdcc12!cs!demers
From: demers@cs.ucsd.edu (David DeMers)
Newsgroups: comp.ai.neural-nets
Subject: Re: Reducing Training time vs Generalisation
Keywords: back propagation, training, generalisation
Message-ID: <36931@sdcc12.ucsd.edu>
Date: 17 Aug 92 20:14:39 GMT
References: <1992Aug16.063825.15300@julian.uwo.ca> <1992Aug16.213939.15944@ccu1.aukuni.ac.nz> <arms.714014919@spedden>
Sender: news@sdcc12.ucsd.edu
Organization: CSE Dept., U.C. San Diego
Lines: 213
Nntp-Posting-Host: beowulf.ucsd.edu

In article <arms.714014919@spedden> arms@cs.UAlberta.CA (Bill Armstrong) writes:
>edwin@ccu1.aukuni.ac.nz (Edwin Ng) writes:

...

>>I'd like to ask if anyone has
>>anything to add about the quality of generalisation
>>resulting from using different parameters to speed up
>>training??

...


>Before one gets deeply into such questions, I think one should specify
>what one means by "generalization".

Let me jump back in here. I might bore you all with a
bit of "first principles", but it helps me think about what
we are trying to do. The jargon is not always used with
the same semantics across disciplines, so pretend I know
what I'm saying when I use a word in a way that seems a
bit imprecise (after all, I'm just hitting the 'R' key, not
writing a journal paper here :-)

First, let's focus on a particular, though very broad,
class of problems:

We have a finite set Z, consisting of pairs of vectors (or scalars, or
other mathematical objects, possibly not in a metric space ...):

    Z = {(x_i, y_i)},  i = 1, ..., N

where the x_i are in some space X and the y_i are in some space Y.

(Since real problems tend to be bounded, I will assume that X and Y
are compact.)

We believe that there is some functional relationship,
y_i = f(x_i) (possibly perturbed by noise).

We would like to estimate the expected value of y_m
given x_m, where x_m is not necessarily in Z.

---------------

(Most of my work is either classification or function approximation
over continuous spaces.)

"Generalization" is performance on x_m not in Z. Typically,
"interpolation" is used to describe x_m which lie inside
the convex hull of the x_i, and "extrapolation" is used
to describe x_m which lie outside that hull.
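
As an aside, the interpolation/extrapolation distinction can be made
operational: x_m is in the convex hull of the x_i iff there exist
lambda_i >= 0 summing to 1 with sum_i lambda_i x_i = x_m, which is a
linear feasibility problem. A rough Python sketch (numpy/scipy are my
choice here, not anything from the thread):

  # Sketch: is the query point x_m inside the convex hull of the x_i?
  import numpy as np
  from scipy.optimize import linprog

  def in_convex_hull(X, x_m):
      """X: (N, d) array of training inputs; x_m: (d,) query point."""
      N = X.shape[0]
      c = np.zeros(N)                      # feasibility only, no objective
      A_eq = np.vstack([X.T, np.ones(N)])  # sum_i lambda_i x_i = x_m, sum_i lambda_i = 1
      b_eq = np.append(x_m, 1.0)
      res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * N)
      return res.success                   # feasible => "interpolation"

If the program is feasible, x_m counts as interpolation in the above
sense; otherwise it is extrapolation.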

The problems being discussed in this thread seem to fall
into this framework -

  Classification: the y_i are the class memberships, and f can be
    decomposed into the characteristic functions for each class.
    We want to construct a classifier for the x_i.

  Function approximation: the above should remind us of regression
    or interpolation problems. Choose X to be {0,1}^k for boolean
    problems.

The literature on this is enormous...

The main difficulty with this problem stems from the fact that
for any data set Z, there is a huge number of possible functions
f which will reproduce Z exactly. If we are in real-valued spaces,
there will be uncountably many. If we are in k-dimensional
boolean space, there will be 2^(2^k - N) such functions (assuming
the N training points are distinct), all of which reproduce the
data in Z exactly.
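
To make the counting concrete, here is a brute-force check of the
2^(2^k - N) figure for a tiny k (my own toy illustration; it is only
feasible for very small k, since there are 2^(2^k) boolean functions
in total):

  # Count the boolean functions on {0,1}^k that agree with a data set Z.
  from itertools import product

  k = 2
  Z = {(0, 0): 0, (1, 1): 1}                # N = 2 labelled points
  inputs = list(product([0, 1], repeat=k))  # all 2^k possible inputs

  consistent = 0
  for outputs in product([0, 1], repeat=len(inputs)):  # every boolean function
      f = dict(zip(inputs, outputs))
      if all(f[x] == y for x, y in Z.items()):
          consistent += 1

  print(consistent)   # prints 4, i.e. 2**(2**k - len(Z))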

So, we have to pick one of these. How do we choose? This
is where prior information or belief comes in. Typically,
we use Occam's razor and choose the "simplest", though there
are difficulties in determining what "simple" means.

Parametric techniques bias the search to a small set of
models, e.g. linear regression. It is usually pretty easy
(computationally) to find the best parameter values for a given model,
though depending on the model, the search for the best values
of the parameters could end up as complicated as non-linear
global optimization...
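
As a concrete parametric example: for a linear model the "search" is
just ordinary least squares. A minimal numpy sketch on made-up data
(names and numbers are mine):

  # Fit y ~ w.x + b by least squares on a toy training set.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.uniform(-1, 1, size=(50, 3))                    # the x_i
  w_true, b_true = np.array([2.0, -1.0, 0.5]), 0.3
  y = X @ w_true + b_true + 0.05 * rng.normal(size=50)    # y_i = f(x_i) + noise

  A = np.hstack([X, np.ones((50, 1))])                    # design matrix with bias column
  params, *_ = np.linalg.lstsq(A, y, rcond=None)
  w_hat, b_hat = params[:-1], params[-1]                  # estimated parameters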

Non-parametric techniques attempt to "learn" the model from
the data - they usually need enormous amounts of data to
say anything statistically meaningful, but have the advantage
of reducing the need to impose a bias on the model.
k-nearest-neighbors is a non-parametric method for building
a classifier; projection pursuit is another example.
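
A bare-bones k-nearest-neighbors classifier, to make the contrast
concrete (my own sketch; Euclidean distance and majority vote assumed):

  # Non-parametric example: classify x_m by a vote among its k nearest x_i.
  import numpy as np
  from collections import Counter

  def knn_predict(X_train, y_train, x_m, k=3):
      dists = np.linalg.norm(X_train - x_m, axis=1)   # distances to every x_i
      nearest = np.argsort(dists)[:k]                 # indices of the k closest
      votes = Counter(y_train[i] for i in nearest)
      return votes.most_common(1)[0][0]               # majority class

Note there are no fitted parameters at all; the "model" is the training
data itself.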

There are some techniques which look like a cross between
the two, such as feedforward neural networks.
The network is parametric (I'm assuming the usual implementation
of setting up a fixed architecture and then training) in the
sense that one is trying to find the maximum likelihood values
for the weights (insert enough appropriate math stuff here),
but non-parametric in the sense that we really don't believe
that the true function is a composition of a bunch of sigmoids.
But we DO know that we can get arbitrarily close to any continuous
function (on a compact set) with a single-hidden-layer feedforward network.
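
For concreteness, here is a single-hidden-layer sigmoid network trained
by plain gradient descent on squared error; a minimal numpy sketch of
the usual setup, not anyone's production code:

  # One hidden layer of sigmoids, linear output, vanilla batch backprop.
  import numpy as np

  def sigmoid(a):
      return 1.0 / (1.0 + np.exp(-a))

  rng = np.random.default_rng(0)
  X = rng.uniform(-1, 1, size=(100, 2))          # training inputs x_i
  y = np.sin(3 * X[:, 0]) * X[:, 1]              # a smooth target function
  H = 10                                         # number of hidden units
  W1, b1 = 0.5 * rng.normal(size=(2, H)), np.zeros(H)
  W2, b2 = 0.5 * rng.normal(size=H), 0.0

  lr = 0.1
  for epoch in range(2000):
      h = sigmoid(X @ W1 + b1)                   # hidden activations
      y_hat = h @ W2 + b2                        # network output
      err = y_hat - y
      gW2 = h.T @ err / len(X)                   # gradient of (1/2) mean squared error
      gb2 = err.mean()
      gh = np.outer(err, W2) * h * (1 - h)       # backprop through the sigmoids
      gW1 = X.T @ gh / len(X)
      gb1 = gh.mean(axis=0)
      W1 -= lr * gW1; b1 -= lr * gb1
      W2 -= lr * gW2; b2 -= lr * gb2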

What is all this leading up to, you ask? (Or those of you
still reading, anyway...)

A reasonable way to think about all this is that you
want to minimize some distance measure between your prediction
and the "truth" over some appropriately weighted set
in the domain. For example, minimize the squared error
integrated over X (weighted by the input density).
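
In practice that integral is approximated by averaging the squared
error over a test sample drawn from the same input density, which
weights the domain exactly as described; a tiny sketch (function names
are mine):

  # Monte Carlo estimate of density-weighted integrated squared error.
  import numpy as np

  def estimated_risk(predict, X_test, y_test):
      """Mean squared error over a sample drawn from the input density."""
      return np.mean((predict(X_test) - y_test) ** 2)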

A good reference on the bias vs. variance tradeoff is
Geman, Stuart, Elie Bienenstock & Rene Doursat,
"Neural Networks and the Bias/Variance Dilemma", Neural Computation 4,
no. 1 (Jan 1992). Estimation error can be decomposed into
two components: the "bias", which is the systematic error of the
estimator averaged over training sets, and the "variance", which is
how much the estimator depends on the particular Z it was given.
Both can be quantified and estimated.
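
These estimates are easy to set up by simulation: draw many training
sets, fit the same estimator to each, and decompose its error at a test
point into the error of the average prediction (bias) and the spread of
the predictions (variance). A hedged sketch with a toy target and a
deliberately rigid linear fit (all names and numbers mine, not from the
Geman et al. paper):

  # Estimate bias^2 and variance of an estimator at a fixed point x0
  # by resampling training sets Z and refitting each time.
  import numpy as np

  rng = np.random.default_rng(0)
  f = lambda x: np.sin(2 * np.pi * x)            # the "truth"
  x0 = 0.3                                       # test point
  preds = []
  for trial in range(500):
      x = rng.uniform(0, 1, 20)                  # a fresh training set Z
      y = f(x) + 0.1 * rng.normal(size=20)
      coef = np.polyfit(x, y, deg=1)             # a biased (straight-line) model
      preds.append(np.polyval(coef, x0))

  preds = np.array(preds)
  bias_sq = (preds.mean() - f(x0)) ** 2          # systematic error of the average fit
  variance = preds.var()                         # dependence on the particular Z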


>...it should mean something like: for a test point at distance d
>from a correctly learned training point, the response was still
>correct (This definition works for boolean and multi-class problems).
>You could generalize this using interpolation and continuous
>functions.

This assumes that there is some continuity: that if x_l is near x_m,
then y_l will be near y_m. This is typical, and is usually a
worthwhile prior assumption. However, the "correct" part might
be problematic for continuous function approximation problems -
that is, what is "correct" interpolation?

>Once this is agreed upon, some questions start to make sense, like:
>why should a multilayer perceptron generalize at all?" It's not
>because of continuity, because continuous functions can oscillate as
>rapidly as you want, and could fit any finite, non-contradictory
>training set as well as you want.

As is shown in the above-referenced paper, feedforward neural
networks tend to start out as high-bias but low-variance estimators,
and training improves the bias at the cost of increased variance.
That is, a "random" network is generally pretty smooth, and
training adapts its response surface to the data. By smooth
I mean more than just differentiable: the derivatives
are bounded.

So the reason NNs generalize as well as they do is that a prior
assumption of smoothness is built into their structure.

The overfitting problem is related: you can reduce bias by increasing
the order of your model; by adding more terms (hidden units, ...)
you can fit your training data arbitrarily closely.
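
The standard way to see this is to sweep the model order and watch the
training error fall monotonically while the test error eventually turns
back up; a small polynomial-degree sketch on toy data (my own numbers):

  # Overfitting: training error keeps dropping with model order,
  # test error eventually rises again.
  import numpy as np

  rng = np.random.default_rng(1)
  f = lambda x: np.sin(2 * np.pi * x)
  x_tr, x_te = rng.uniform(0, 1, 15), rng.uniform(0, 1, 200)
  y_tr = f(x_tr) + 0.1 * rng.normal(size=15)
  y_te = f(x_te) + 0.1 * rng.normal(size=200)

  for deg in (1, 3, 5, 9, 13):
      coef = np.polyfit(x_tr, y_tr, deg)
      train_err = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
      test_err = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
      print(deg, train_err, test_err)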

>If you could get a Lipschitz bound on the result of NN training, then
>you could be confident about getting some reasonable generalization:
>i.e. if x and x' are two input vectors, then the outputs y and y'
>would satisfy |y - y'| <= C |x - x'|. This is *much* stronger than
>continuity. Determining a reasonably small C might be a problem for a
>given net.

If you can estimate the integrated squared error, you might do better -
a Lipschitz bound can be pretty pessimistic, though a worst-case
guarantee might be exactly what you want if, as you mentioned in a
previous post, you have to satisfy a customer and need a spec in the
contract.
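
For what it's worth, for a single-hidden-layer sigmoid net a (crude,
global) C can be read straight off the weights: since the sigmoid's
slope is at most 1/4, |y - y'| <= C |x - x'| with
C = 0.25 * ||W2|| * ||W1|| in operator norm. A sketch (weight names are
mine, matching the toy network earlier in this post):

  # Pessimistic global Lipschitz constant for y = W2 . sigmoid(W1 x + b1) + b2.
  import numpy as np

  def lipschitz_bound(W1, W2):
      # spectral norm for the weight matrix, Euclidean norm for the output weights
      return 0.25 * np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

Such a C is usually far larger than the slopes the trained net actually
exhibits on the data, which is the pessimism mentioned above.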

>Other criteria of good generalization might include monotonicity of the
>synthesized function.

Or smoothness bounds, which is what the spline people use.

>In the case of adaptive logic networks, generalization is based on the
>fact that perturbations of an input vector tend to get filtered out as
>the signal is processed in a tree

Thus you have built in the assumption that a small neighborhood
of a learned datum has the same response?

>The most impressive generalization that has been attained with ALNs
>was work by Dekang Lin, where he used tree growth and a training set
>of about 10^5 points, and obtained 91.1% generalization to a space of
>2^512 points (the multiplexor problem). The sparsity of the training
>set in that space boggles the mind.

So, does the modeling method bias the search towards functions much
like the true one, i.e. towards simple boolean functions?

------

For further reading, I highly recommend the Geman article
and many of its references.

Note that there is a large Bayesian camp - see, e.g., David MacKay's
thesis for the Bayesian neural network framework.

David Wolpert has an interesting approach, though his terminology
is pretty idiosyncratic. He has published in Neural Networks recently.

The fields of statistical inference (regression), spline
interpolation, approximation theory and much of statistical
pattern recognition are concerned with these questions,
so please forgive my failures here. I'd advise anyone
deeply interested in these questions to spend some time
reading statistical inference texts, and Duda & Hart for
pattern recognition.

--
Dave DeMers                              ddemers@UCSD   demers@cs.ucsd.edu
Computer Science & Engineering C-014     demers%cs@ucsd.bitnet
UC San Diego                             ...!ucsd!cs!demers
La Jolla, CA 92093-0114                  (619) 534-0688, or -8187, FAX: (619) 534-7029
