NetNews Usenet Archive 1992 #18

Internet Message Format | 1992-08-18 | 5.4 KB

Path: sparky!uunet!news.claremont.edu!ucivax!orion.oac.uci.edu!network.ucsd.edu!sdcc12!cs!demers
From: demers@cs.ucsd.edu (David DeMers)
Newsgroups: comp.ai.neural-nets
Subject: Re: Reducing Training time vs Generalisation
Keywords: back propagation, training, generalisation
Message-ID: <36967@sdcc12.ucsd.edu>
Date: 18 Aug 92 17:34:57 GMT
References: <arms.714091659@spedden> <36944@sdcc12.ucsd.edu> <arms.714146123@spedden>
Sender: news@sdcc12.ucsd.edu
Organization: =CSE Dept., U.C. San Diego
Lines: 123
Nntp-Posting-Host: beowulf.ucsd.edu

In article <arms.714146123@spedden> arms@cs.UAlberta.CA (Bill Armstrong) writes:

>Here is an example of a backpropagation neural network that has very
>wild behavior at some points not in the training or test sets.

What is the training set?

...

>We assume the net has been trained on a subset of integers and also
>tested on a subset of integers.

...

>Below is the overall function f(x) produced by the net, which is also
>the specification of what it is *supposed* to do outside the interval
>(0,1). In (0,1) the specification is to be less than 0.002 in
>absolute value.

>f(x) = 40 [ 1/( 1 + e^40*(x - 1/4)) + 1/( 1 + e^-40*(x - 3/4)) -1 ]

So is the specification *exactly* this function? Or is it a set of
training data for which an approximation is desired?

>The largest deviation of our trained network f(x) from 0 on all integers is

If the spec calls for < 0.002 inside (0,1) and no values within (0,1)
were used in the training set, how can you possibly end up with this
function? This is not simply a pathological example, it is completely
absurd. How is this network constructed by finding parameters via
gradient descent, or any other optimization method for finding
parameters?

You have started (I presume) with an ill-behaved network.
If you begin with a set of weights such that all of the basis functions
(sigmoids) are in their linear region, you can't reach this unstable
point by training on data consisting of integer x values and
corresponding y values (all of which are very small).

>f(0) = f(1) = 0.0018...

>So f is within 2/1000 of being 0 everywhere on our training and test
>sets. Can we be satisfied with it? No! If we happen to give an input
>of x = 1/2, we get

>f(1/2) = - 39.99...

>The magnitude of this is over 22000 times larger than anything
>appearing during training and testing, and is way out of spec.

>Such unexpected values are likely to be very rare if a lot of testing
>has been done on a trained net, but even then, the potential for
>disaster can still be lurking in the system. Unless neural nets are
>*designed* to be safe, there may be a serious risk involved in using
>them.

The "wildness" here is postulated; I still don't see how it can
actually happen on your facts, that the network was trained to zero
error on a training set of integer values.

>But to achieve that goal, a design methodology must be used which is
>*guaranteed* to lead to a safe network.

What is the spec? If it is "approximate the function that generated
this data set", then you need to focus on the approximation
capabilities of your methods. You can NEVER *guarantee* a "safe"
result without some more knowledge of what the true function is.

You are making the assumption that safe = bounded in a particular
manner. If that's known, then why not build it in?

>Such a methodology can be
>based on decomposition of the input space into parts where the
>function synthesized is forced to be monotonic in each variable.

This can work. CART does this; its successor MARS fits local splines.
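
The figures quoted earlier in this article (f(0) = f(1) = 0.0018..., f(1/2) = -39.99..., and the "over 22000 times" ratio) can be checked numerically. The following is only a minimal sketch: it assumes the ASCII formula is meant to be read with exponents 40*(x - 1/4) and -40*(x - 3/4), and the helper name `sigmoid` is introduced here for illustration. Under that reading f(0) and f(1) come out slightly negative, so the quoted 0.0018 would be a magnitude.

```python
import math


def sigmoid(z):
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def f(x):
    # Assumed reading of the quoted formula:
    #   f(x) = 40 * [ 1/(1 + e^(40*(x - 1/4))) + 1/(1 + e^(-40*(x - 3/4))) - 1 ]
    # using 1/(1 + e^z) = sigmoid(-z).
    left = sigmoid(-40.0 * (x - 0.25))
    right = sigmoid(40.0 * (x - 0.75))
    return 40.0 * (left + right - 1.0)


# Values at the two integers nearest the dip, and in the middle of (0,1).
for x in (0.0, 1.0, 0.5):
    print("f(%g) = %.6f" % (x, f(x)))

# Ratio of the magnitude at x = 1/2 to the largest magnitude on the integers.
print("ratio = %.1f" % (abs(f(0.5)) / abs(f(0.0))))
```

The check only confirms the arithmetic of the posted function. It says nothing about whether gradient descent from small initial weights could ever arrive at such a function, which is the point in dispute.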

In the neural network framework, Mike Jordan and Robert Jacobs are
working on a generalization of the modular architecture of Jacobs,
Jordan, Nowlan & Hinton, which recursively splits the input space into
nested regions and "learns" a mapping within each region.

...

>For BP networks, I am not sure a safe design methodology can be
>developed. This is not because of the BP algorithm, per se, but
>rather because of the architecture of multilayer networks with
>sigmoids: *all* weights are used in computing *every* output (the
>effect of zero weights having been eliminated). Every output is
>calculated using some negative and some positive weights, giving very
>little hope of control over the values beyond the set of points
>tested.

True, all weights are used; however, most if not all of them are not
particularly important in networks used for real-world applications
(handwritten digit recognition, e.g.). Given a network and a data set,
you can compute the partial derivatives of, say, mean squared error
wrt the weights, and the curvature. You want the first partials all to
be 0, and the second partials give some indication of the
"contribution". See LeCun, Denker & Solla, "Optimal Brain Damage" in
NIPS-2.

It is not celestial violins but the nature of compositions of ridge
functions which allows me to say that a general feedforward network is
smooth, and that BP learning adapts its response surface to the
training data.

If weights are initialized such that the magnitude of the vector of
weights into a unit is bounded so that the response will be in the
linear region, I don't believe that gradient descent training over a
data set will result in "wild" values. You are going to have to show
me how to do it.

-- 
Dave DeMers                          ddemers@UCSD  demers@cs.ucsd.edu
Computer Science & Engineering C-014        demers%cs@ucsd.bitnet
UC San Diego                                   ...!ucsd!cs!demers
La Jolla, CA 92093-0114  (619) 534-0688, or -8187, FAX: (619) 534-7029
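
The remark above about second partials indicating a weight's "contribution" can be illustrated with a small sketch. It assumes the diagonal second-order estimate associated with Optimal Brain Damage, s_k = (1/2) * h_kk * w_k^2, taken at a point where the first partials vanish. The curvatures here come from central finite differences, which is a swapped-in simplification rather than the paper's own procedure, and `toy_loss` is a hypothetical quadratic error surface, not a real network.

```python
def diag_second_partials(loss, w, eps=1e-3):
    """Central-difference estimate of d^2 loss / d w_k^2 for each weight k."""
    base = loss(w)
    diag = []
    for k in range(len(w)):
        up = list(w)
        dn = list(w)
        up[k] += eps
        dn[k] -= eps
        diag.append((loss(up) - 2.0 * base + loss(dn)) / (eps * eps))
    return diag


def saliencies(loss, w, eps=1e-3):
    """Diagonal second-order estimate of the error increase from zeroing w_k.

    Assumes w is at (or near) a minimum, so first-order terms vanish and
    cross terms between weights are ignored.
    """
    h = diag_second_partials(loss, w, eps)
    return [0.5 * h_kk * w_k * w_k for h_kk, w_k in zip(h, w)]


def toy_loss(w):
    # Hypothetical error surface with its minimum at w = (1, 2); it is
    # steep along w[0] and shallow along w[1].
    return 3.0 * (w[0] - 1.0) ** 2 + 0.5 * (w[1] - 2.0) ** 2


w_min = [1.0, 2.0]
s = saliencies(toy_loss, w_min)
print("estimated saliencies:", s)
# For comparison: the actual error increase when each weight is set to zero.
print("actual increases:", [toy_loss([0.0, 2.0]), toy_loss([1.0, 0.0])])
```

On a purely quadratic surface the estimate and the actual increase agree. For a real network the estimate is only a local approximation, and a weight with low saliency is a candidate for removal in the sense the cited paper describes.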