NetNews Usenet Archive 1992 #18


Text File | 1992-08-19 | 9.4 KB | 222 lines

Newsgroups: comp.ai.neural-nets
Path: sparky!uunet!gumby!destroyer!ubc-cs!alberta!arms
From: arms@cs.UAlberta.CA (Bill Armstrong)
Subject: Wild values (was Reducing Training time ...)
Message-ID: <arms.714208873@spedden>
Keywords: back propagation, training, generalisation
Sender: news@cs.UAlberta.CA (News Administrator)
Nntp-Posting-Host: spedden.cs.ualberta.ca
Organization: University of Alberta, Edmonton, Canada
References: <arms.714091659@spedden> <36944@sdcc12.ucsd.edu> <arms.714146123@spedden> <36967@sdcc12.ucsd.edu>
Date: Wed, 19 Aug 1992 07:21:13 GMT
Lines: 208

demers@cs.ucsd.edu (David DeMers) writes:

>In article <arms.714146123@spedden> arms@cs.UAlberta.CA (Bill Armstrong) writes:

>>Here is an example of a backpropagation neural network that has very
>>wild behavior at some points not in the training or test sets.

>What is the training set?

>...

>>We assume the net has been trained on a subset of integers and also
>>tested on a subset of integers.

Pick any set of integers that contains at least the six points
x = -2, -1, 0, 1, 2, 3, each one with the f(x) value specified below.
Test on any finite set of integers you like.

>...

>>Below is the overall function f(x) produced by the net, which is also
>>the specification of what it is *supposed* to do outside the interval
>>(0,1). In (0,1) the specification is to be less than 0.002 in
>>absolute value.

>>f(x) = 40 [ 1/( 1 + e^(40*(x - 1/4)) ) + 1/( 1 + e^(-40*(x - 3/4)) ) - 1 ]

>So is the specification *exactly* this function?

No.

>Or is it
>a set of training data for which an approximation is desired?

There is a training set of (x,y) pairs where x is an integer and
y = f(x). The choice of training set is not critical. The idea is
that I want exactly f(x), except that in (0,1) I want the peak to be
removed in such a way that I have a smooth function very close to 0.
The bound on values in (0,1) is intended to say only that having a
peak is out of spec.
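As a quick check of the numbers being discussed, the function quoted above can be evaluated on the six integers named in the post. This is only a sketch: it reads the exponents as e^(40*(x - 1/4)) and e^(-40*(x - 3/4)), and the names `sigmoid`, `f` and `training_points` are illustrative, not from the original thread.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def f(x):
    # f(x) = 40 [ 1/(1 + e^(40(x - 1/4))) + 1/(1 + e^(-40(x - 3/4))) - 1 ]
    # i.e. two sigmoid units with input weights -40 and +40, output weights 40.
    return 40.0 * (sigmoid(-40.0 * (x - 0.25)) + sigmoid(40.0 * (x - 0.75)) - 1.0)

training_points = [-2, -1, 0, 1, 2, 3]  # the six integers named in the post
values = {x: f(x) for x in training_points}
worst = max(abs(v) for v in values.values())

for x, v in values.items():
    print(f"f({x:2d}) = {v:+.6f}")
print("largest |f| on the integers:", worst)
```

Under that reading, the largest magnitude occurs at x = 0 and x = 1 and stays below the 0.002 bound, which matches the 0.0018... figure quoted in the thread.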
>>The largest deviation of our trained network f(x) from 0 on all integers is

>If the spec calls for < 0.002 inside (0,1) and no values within
>(0,1) were used in the training set, how can you possibly
>end up with this function?

The idea is that you would like the net output to be small in (0,1).
I just don't want a peak.

>This is not simply a pathological example, it is completely
>absurd.

You simply haven't grasped it yet. This kind of little "absurd"
example is going to show many people how dangerous it is to use the
usual approach to neural networks. When a safety-critical system
blows up because you neglected some wild output of your neural net,
it will be too late to go back and try to understand the example.

Anyway, it is not a pathological example. Once you get the idea, you
can construct lots of examples. It's only when you reach that point
that you can begin to think about preventing wild values.

Sorry, calling my little example "absurd" won't convince people who
have a lot to lose from a misbehaved system. If they are smart, they
will want to see proof that a wild value can't cause a problem. Are
you ready to supply a proof? I don't think so, because you still
don't grasp the problem.

>How is this network constructed by finding parameters
>via gradient descent, or any other optimization method for
>finding parameters? You have started (I presume) with an
>ill-behaved network. If you begin with a set of weights
>such that all of the basis functions (sigmoids) are in
>their linear region, you can't reach this unstable point
>by training on data consisting of integer x values and
>corresponding y values (all of which are very small)

>>f(0) = f(1) = 0.0018...

I have had backprop converge on this kind of pathological example,
from some not particularly carefully chosen starting state. If the
f-values are small, I can see there is a problem with a real BP net,
but the argument is supposed to be mathematical, so numerical
accuracy is not a problem.
If you happened to initialize the system by chance to the given
weights, which do produce the desired values on the training set, the
BP algorithm would have 0 mean square error on the training set, and
would not change the weights. In other words, the weights (+ or - 40)
are stable, and you can reach them. Maybe there are starting points
from which you can't reach them, but finding those is a different
problem.

>>So f is within 2/1000 of being 0 everywhere on our training and test
>>sets. Can we be satisfied with it? No! If we happen to give an input
>>of x = 1/2, we get

>>f(1/2) = - 39.99...

>>The magnitude of this is over 22000 times larger than anything
>>appearing during training and testing, and is way out of spec.
>>Such unexpected values are likely to be very rare if a lot of testing
>>has been done on a trained net, but even then, the potential for
>>disaster can still be lurking in the system. Unless neural nets are
>>*designed* to be safe, there may be a serious risk involved in using
>>them.

>The "wildness" here is postulated; I still don't see how it can
>actually happen on your facts, that the network was trained to
>zero error on a training set of integer values.

The "wild" solution is not postulated, it is THE set of weights which
gives 0 error on the training set. The wild solution is forced upon
the net by the training data. The use of integers for training and
testing and the fact that they are uniformly spaced is also not
critical.

>>But to achieve that goal, a design methodology must be used which is
>>*guaranteed* to lead to a safe network.

>What is the spec? If it is "approximate the function that
>generated this data set", then you need to focus on the
>approximation capabilities of your methods. You can NEVER
>*guarantee* a "safe" result without some more knowledge
>of what the true function is. You are making the assumption
>that safe = bounded in a particular manner. If that's known,
>then why not build it in?

I think this is clear now.
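The size of the "wild" value can be checked with a few lines of arithmetic. The sketch below assumes the same reading of the formula as in the quoted post (exponents e^(40*(x - 1/4)) and e^(-40*(x - 3/4))); the names `f`, `wild` and `ratio` are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def f(x):
    # Same function as quoted in the post, under the stated reading.
    return 40.0 * (sigmoid(-40.0 * (x - 0.25)) + sigmoid(40.0 * (x - 0.75)) - 1.0)

# Largest magnitude seen on the integer training/test points.
largest_on_integers = max(abs(f(x)) for x in range(-2, 4))

# Value at the untested point x = 1/2.
wild = f(0.5)
ratio = abs(wild) / largest_on_integers

print("largest |f| on integers:", largest_on_integers)
print("f(1/2) =", wild)
print("ratio  =", ratio)
```

The ratio comes out a little above 22000, consistent with the figure in the quoted text. Because the targets are defined as y = f(x) at the integers, the error on those points is zero by construction, which is the basis of the remark that BP would leave these weights unchanged.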
>>Such a methodology can be
>>based on decomposition of the input space into parts where the
>>function synthesized is forced to be monotonic in each variable.

>This can work. CART does this; its successor MARS fits local splines.
>In the neural network framework, Mike Jordan and Robert Jacobs
>are working on a generalization of modular architecture of
>Jacobs, Jordan, Nowlan & Hinton, which recursively splits the
>input space into nested regions and "learns" a mapping within
>each region.

Great. Do they use monotonicity, or a scheme which allows them to get
tight bounds on *all* outputs, so they can satisfy a "spec" if we
could agree on one?

>...

>>For BP networks, I am not sure a safe design methodology can be
>>developed. This is not because of the BP algorithm, per se, but
>>rather because of the architecture of multilayer networks with
>>sigmoids: *all* weights are used in computing *every* output (the
>>effect of zero weights having been eliminated). Every output is
>>calculated using some negative and some positive weights, giving very
>>little hope of control over the values beyond the set of points
>>tested.

>True, all weights are used; however, most if not all of them are
>not particularly important in networks used for real-world
>applications (handwritten digit recognition, e.g.)

Sure, but a lot of little weights can add up, particularly if values
derived from them get multiplied by a larger weight.

>Given a network and a data set, you can compute the partial
>derivatives of, say, mean squared error wrt the weights, and
>the curvature. You want the first partials all to be 0,
>and the second partials give some indication of the "contribution".
>See LeCun, Denker & Solla, "Optimal Brain Damage" in NIPS-2.

>It is not celestial violins but the nature of compositions
>of ridge functions which allow me to say that a general
>feedforward network is smooth, and that BP learning adapts
>its response surface to the training data.
At what point will you start looking at the example above and seeing
that such a statement, though it is generally true, can also be
disastrously false?

I think I understand what you mean by compositions of ridge
functions, but if the input pattern to a net is such that you are not
outside the high slope region of some sigmoid on every path from an
input variable to an output, then you may get very rapid change of
output as you vary that input variable.

The situation you describe, where you are always in the linear region
of all sigmoids, sounds *very* undesirable. The output should benefit
by some signals getting very attenuated in effect by being near the
flat parts of sigmoids.

>If weights are initialized such that the magnitude of
>the vector of weights into a unit is bounded so that the
>response will be in the linear region, I don't believe
>that gradient descent training over a data set will
>result in "wild" values. You are going to have to
>show me how to do it.

It depends what you mean by bounded. The weights in the f(x) above
are bounded by 40. If you want them bounded by 2, you don't get a
wild value. Do you always bound your weights in absolute value by
small numbers? If you do, then you have nothing to worry about, as
long as you can train the net to do what you want (I doubt that you
can then approximate functions like |sin( 10 * x)| which has some
sharp changes of direction on [0,1].)

--
***************************************************
Prof. William W. Armstrong, Computing Science Dept.
University of Alberta; Edmonton, Alberta, Canada T6G 2H1
arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071
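One way to illustrate the remark about weights bounded by 40 versus 2 is to replace every 40 in f(x) by a gain k and compare the two cases. This family `f_k` is an assumption made purely for illustration; it is not a network from the thread, and the k = 2 member does not fit the original training data. The helper `slope` is likewise hypothetical and uses a simple central difference.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def f_k(x, k):
    # Hypothetical family: the post's f(x) with every 40 replaced by k.
    return k * (sigmoid(-k * (x - 0.25)) + sigmoid(k * (x - 0.75)) - 1.0)

def slope(x, k, h=1e-5):
    # Central-difference estimate of d f_k / dx at x.
    return (f_k(x + h, k) - f_k(x - h, k)) / (2.0 * h)

for k in (2.0, 40.0):
    at_integer = f_k(0.0, k)   # a point in the training set
    at_half = f_k(0.5, k)      # the untested point
    print(f"k = {k:4.0f}: f(0) = {at_integer:+.4f}, "
          f"f(1/2) = {at_half:+.4f}, "
          f"slope at 1/4 = {slope(0.25, k):+.3f}")
```

With k = 40 the output drops by almost 40 between x = 0 and x = 1/2, and the slope at x = 1/4 is close to -400 (roughly -k^2/4, since the logistic function has slope 1/4 at its midpoint). With k = 2 the values at x = 0 and x = 1/2 differ by only about 0.1, so nothing comparable to the wild value appears.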