Path: sparky!uunet!news.claremont.edu!ucivax!orion.oac.uci.edu!network.ucsd.edu!sdcc12!cs!demers
From: demers@cs.ucsd.edu (David DeMers)
Newsgroups: comp.ai.neural-nets
Subject: Re: Reducing Training time vs Generalisation
Keywords: back propagation, training, generalisation
Message-ID: <36967@sdcc12.ucsd.edu>
Date: 18 Aug 92 17:34:57 GMT
References: <arms.714091659@spedden> <36944@sdcc12.ucsd.edu> <arms.714146123@spedden>
Sender: news@sdcc12.ucsd.edu
Organization: CSE Dept., U.C. San Diego
Lines: 123
Nntp-Posting-Host: beowulf.ucsd.edu

In article <arms.714146123@spedden> arms@cs.UAlberta.CA (Bill Armstrong) writes:

>Here is an example of a backpropagation neural network that has very
>wild behavior at some points not in the training or test sets.

What is the training set?

...

>We assume the net has been trained on a subset of integers and also
>tested on a subset of integers.

...

>Below is the overall function f(x) produced by the net, which is also
>the specification of what it is *supposed* to do outside the interval
>(0,1). In (0,1) the specification is to be less than 0.002 in
>absolute value.

>f(x) = 40 [ 1/(1 + e^(40(x - 1/4))) + 1/(1 + e^(-40(x - 3/4))) - 1 ]

So is the specification *exactly* this function? Or is it
a set of training data for which an approximation is desired?

>The largest deviation of our trained network f(x) from 0 on all integers is

If the spec calls for < 0.002 inside (0,1) and no values within
(0,1) were used in the training set, how can you possibly
end up with this function?

This is not simply a pathological example; it is completely
absurd. How could this network be constructed by gradient
descent, or any other optimization method for finding
parameters? You have started (I presume) with an ill-behaved
network. If you begin with a set of weights such that all of
the basis functions (sigmoids) are in their linear region,
you can't reach this unstable point by training on data
consisting of integer x values and corresponding y values
(all of which are very small).

>f(0) = f(1) = 0.0018...

>So f is within 2/1000 of being 0 everywhere on our training and test
>sets. Can we be satisfied with it? No! If we happen to give an input
>of x = 1/2, we get

>f(1/2) = - 39.99...

>The magnitude of this is over 22000 times larger than anything
>appearing during training and testing, and is way out of spec.

>Such unexpected values are likely to be very rare if a lot of testing
>has been done on a trained net, but even then, the potential for
>disaster can still be lurking in the system. Unless neural nets are
>*designed* to be safe, there may be a serious risk involved in using
>them.

- The "wildness" here is postulated; I still don't see how it can
- actually happen on your facts, that the network was trained to
- zero error on a training set of integer values.
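
To make the numbers concrete, here is a quick check of the quoted
values (a Python sketch of my own, just transcribing Bill's formula;
nothing here comes from his post):

import math

def f(x):
    # Bill's example output function:
    # f(x) = 40 [ 1/(1 + e^(40(x - 1/4))) + 1/(1 + e^(-40(x - 3/4))) - 1 ]
    s1 = 1.0 / (1.0 + math.exp(40.0 * (x - 0.25)))
    s2 = 1.0 / (1.0 + math.exp(-40.0 * (x - 0.75)))
    return 40.0 * (s1 + s2 - 1.0)

# On the integers the output is tiny (largest magnitude at x = 0, 1):
for x in (-2, -1, 0, 1, 2, 3):
    print(x, f(x))      # |f(x)| <= ~0.0018

# ...but halfway between the two sigmoid steps it dives out of spec:
print(0.5, f(0.5))      # about -39.99

So the function does exactly what Bill says it does; my question is
whether gradient descent on integer-valued training data would ever
land on these weights.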

>But to achieve that goal, a design methodology must be used which is
>*guaranteed* to lead to a safe network.

What is the spec? If it is "approximate the function that
generated this data set", then you need to focus on the
approximation capabilities of your methods. You can NEVER
*guarantee* a "safe" result without some more knowledge
of what the true function is. You are making the assumption
that safe = bounded in a particular manner. If that's known,
then why not build it in?
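
For instance (a toy sketch of my own; the bound B and the generic
raw_net are made-up names, not anything from Bill's post), a known
output bound can be enforced by construction rather than hoped for:

import math

B = 0.002   # hypothetical a-priori bound on |output|, taken from the spec

def bounded_net(x, raw_net):
    # Squash the raw network output through a bounded function so
    # that |output| < B for *every* input, whatever the weights do
    # between training points.
    return B * math.tanh(raw_net(x) / B)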

>Such a methodology can be
>based on decomposition of the input space into parts where the
>function synthesized is forced to be monotonic in each variable.

This can work. CART does this; its successor MARS fits local splines.

In the neural network framework, Mike Jordan and Robert Jacobs
are working on a generalization of the modular architecture of
Jacobs, Jordan, Nowlan & Hinton, which recursively splits the
input space into nested regions and "learns" a mapping within
each region.
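
The flavor, in a toy sketch (my own rendering, not their code; two
linear experts, a linear gate, and all names invented):

import numpy as np

rng = np.random.default_rng(0)

W_experts = rng.normal(size=(2, 2))   # per-expert (slope, intercept)
W_gate    = rng.normal(size=(2, 2))   # gating net  (slope, intercept)

def mixture(x):
    # The gate softly splits the input space: softmax weights g sum
    # to 1, and the output is the gated blend of the experts'
    # predictions.  The hierarchical version nests such mixtures
    # recursively to get nested regions.
    z = np.array([x, 1.0])            # input with bias term
    g = np.exp(W_gate @ z)
    g = g / g.sum()                   # softmax gating weights
    return g @ (W_experts @ z)        # blend the expert outputs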

...

>For BP networks, I am not sure a safe design methodology can be
>developed. This is not because of the BP algorithm, per se, but
>rather because of the architecture of multilayer networks with
>sigmoids: *all* weights are used in computing *every* output (the
>effect of zero weights having been eliminated). Every output is
>calculated using some negative and some positive weights, giving very
>little hope of control over the values beyond the set of points
>tested.

True, all weights are used; however, most if not all of them are
not particularly important in networks used for real-world
applications (e.g., handwritten digit recognition).

Given a network and a data set, you can compute the partial
derivatives of, say, mean squared error with respect to the weights,
and the curvature. You want the first partials all to be 0, and
the second partials give some indication of each weight's
"contribution". See LeCun, Denker & Solla, "Optimal Brain Damage"
in NIPS-2.
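
In a sketch (my names, not theirs; grad and hess_diag are assumed
to have been computed by backprop, as in their paper):

import numpy as np

def obd_saliency(w, grad, hess_diag):
    # Optimal Brain Damage saliency.  At a (local) minimum the first
    # partials are ~0, so the cost of deleting weight w_i is
    # approximated by the diagonal second-order term:
    #     s_i = 0.5 * h_ii * w_i**2
    # Small s_i -> the weight contributes little; prune it first.
    assert np.allclose(grad, 0.0, atol=1e-3), "not at a minimum yet"
    return 0.5 * hess_diag * w**2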

It is not celestial violins but the nature of compositions
of ridge functions which allows me to say that a general
feedforward network is smooth, and that BP learning adapts
its response surface to the training data.

If weights are initialized such that the magnitude of the vector
of weights into a unit is bounded so that the unit's response will
be in the linear region, I don't believe that gradient descent
training over a data set will result in "wild" values. You are
going to have to show me how to do it.
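
The kind of initialization I mean, sketched (names invented; tanh
standing in for the sigmoid):

import numpy as np

def init_linear_region(n_in, n_units, x_scale=1.0, z_max=0.5, rng=None):
    # Scale each unit's incoming weight vector so that for inputs
    # with |x_j| <= x_scale the net input stays in [-z_max, z_max],
    # i.e. in the roughly linear region where tanh(z) ~ z.
    rng = rng or np.random.default_rng()
    W = rng.uniform(-1.0, 1.0, size=(n_units, n_in))
    # Worst case |net input| <= ||w||_1 * x_scale, so rescale each row.
    scale = z_max / (np.abs(W).sum(axis=1, keepdims=True) * x_scale)
    return W * scale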

--
Dave DeMers             ddemers@UCSD    demers@cs.ucsd.edu
Computer Science & Engineering  C-014   demers%cs@ucsd.bitnet
UC San Diego            ...!ucsd!cs!demers
La Jolla, CA 92093-0114 (619) 534-0688, or -8187, FAX: (619) 534-7029