Newsgroups: comp.ai.neural-nets
Path: sparky!uunet!gumby!destroyer!ubc-cs!alberta!arms
From: arms@cs.UAlberta.CA (Bill Armstrong)
Subject: Wild values (was Reducing Training time ...)
Message-ID: <arms.714208873@spedden>
Keywords: back propagation, training, generalisation
Sender: news@cs.UAlberta.CA (News Administrator)
Nntp-Posting-Host: spedden.cs.ualberta.ca
Organization: University of Alberta, Edmonton, Canada
References: <arms.714091659@spedden> <36944@sdcc12.ucsd.edu> <arms.714146123@spedden> <36967@sdcc12.ucsd.edu>
Date: Wed, 19 Aug 1992 07:21:13 GMT
Lines: 208

demers@cs.ucsd.edu (David DeMers) writes:

>In article <arms.714146123@spedden> arms@cs.UAlberta.CA (Bill Armstrong) writes:

>>Here is an example of a backpropagation neural network that has very
>>wild behavior at some points not in the training or test sets.

>What is the training set?

>...

>>We assume the net has been trained on a subset of integers and also
>>tested on a subset of integers.

Pick any set of integers that contains at least the six points x =
-2, -1, 0, 1, 2, 3, each with the f(x) value specified below.
Test on any finite set of integers you like.

>...

>>Below is the overall function f(x) produced by the net, which is also
>>the specification of what it is *supposed* to do outside the interval
>>(0,1). In (0,1) the specification is to be less than 0.002 in
>>absolute value.

>>f(x) = 40 [ 1/(1 + e^(40(x - 1/4))) + 1/(1 + e^(-40(x - 3/4))) - 1 ]

>So is the specification *exactly* this function?

No.

>Or is it
>a set of training data for which an approximation is desired?

There is a training set of (x,y) pairs where x is an integer and y =
f(x). The choice of training set is not critical. The idea is that I
want exactly f(x), except that in (0,1) I want the peak to be removed
in such a way that I have a smooth function very close to 0. The
bound on values in (0,1) is intended to say only that having a peak
is out of spec.
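
To make the numbers concrete, here is a quick check (Python syntax;
the code and names are mine, just transcribing f from above):

    import math

    def f(x):
        """The trained net's overall function, as given above."""
        return 40.0 * (1.0 / (1.0 + math.exp( 40.0 * (x - 0.25)))
                     + 1.0 / (1.0 + math.exp(-40.0 * (x - 0.75))) - 1.0)

    for x in (-2, -1, 0, 1, 2, 3):   # the training integers
        print(x, f(x))               # every value is within 0.002 of 0
    print(0.5, f(0.5))               # about -39.996: the wild value in (0,1)
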

>>The largest deviation of our trained network f(x) from 0 on all integers is

>If the spec calls for < 0.002 inside (0,1) and no values within
>(0,1) were used in the training set, how can you possibly
>end up with this function?

The idea is that you would like the net output to be small in (0,1).
I just don't want a peak.

>This is not simply a pathological example, it is completely
>absurd.

You simply haven't grasped it yet. This kind of little "absurd"
example is going to show many people how dangerous it is to use the
usual approach to neural networks. When a safety-critical system
blows up because you neglected some wild output of your neural net,
it will be too late to go back and try to understand the example.

Anyway, it is not a pathological example. Once you get the idea, you
can construct lots of examples. It's only when you reach that point
that you can begin to think about preventing wild values. Sorry,
calling my little example "absurd" won't convince people who have a
lot to lose from a misbehaved system. If they are smart, they will
want to see proof that a wild value can't cause a problem. Are you
ready to supply a proof? I don't think so, because you still don't
grasp the problem.

>How is this network constructed by finding parameters
>via gradient descent, or any other optimization method for
>finding parameters? You have started (I presume) with an
>ill-behaved network. If you begin with a set of weights
>such that all of the basis functions (sigmoids) are in
>their linear region, you can't reach this unstable point
>by training on data consisting of integer x values and
>corresponding y values (all of which are very small).

>>f(0) = f(1) = 0.0018...

I have had backprop converge on this kind of pathological example
from a not particularly carefully chosen starting state. If the
f-values are small, I can see that a real BP net might run into
numerical accuracy problems, but the argument is meant to be
mathematical, so numerical accuracy is not the issue.

If you happened to initialize the system by chance to the given
weights, which do produce the desired values on the training set, the
BP algorithm would have 0 mean square error on the training set, and
would not change the weights. In other words, the weights (+ or - 40)
are stable, and you can reach them. Maybe there are starting points
from which you can't reach them, but finding such starting points is
a different problem.
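
To see the stability claim in code (Python syntax; the 1-2-1
parameterization below is just the formula for f read off term by
term, and all the names are mine):

    import math

    def sig(z):
        return 1.0 / (1.0 + math.exp(-z))

    # f(x) = 40*sig(-40x + 10) + 40*sig(40x - 30) - 40, i.e. a net with
    # one input, two sigmoid hidden units, and a linear output unit.
    def net(x, p):
        w1, b1, w2, b2, v1, v2, c = p
        return v1 * sig(w1 * x + b1) + v2 * sig(w2 * x + b2) + c

    p = [-40.0, 10.0, 40.0, -30.0, 40.0, 40.0, -40.0]
    train = [(x, net(x, p)) for x in (-2, -1, 0, 1, 2, 3)]  # targets are exactly f(x)

    def mse(q):
        return sum((net(x, q) - y) ** 2 for x, y in train) / len(train)

    # Mean square error is exactly 0 at p, so every finite-difference
    # partial derivative comes out ~0: gradient descent stays put.
    eps = 1e-6
    print([(mse(p[:i] + [p[i] + eps] + p[i+1:]) - mse(p)) / eps
           for i in range(len(p))])

Whether you can *reach* these weights from a random start is a
separate question; the point is that once you are there, backprop has
no reason to leave.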

>>So f is within 2/1000 of being 0 everywhere on our training and test
>>sets. Can we be satisfied with it? No! If we happen to give an input
>>of x = 1/2, we get

>>f(1/2) = - 39.99...

>>The magnitude of this is over 22000 times larger than anything
>>appearing during training and testing, and is way out of spec.

>>Such unexpected values are likely to be very rare if a lot of testing
>>has been done on a trained net, but even then, the potential for
>>disaster can still be lurking in the system. Unless neural nets are
>>*designed* to be safe, there may be a serious risk involved in using
>>them.

>The "wildness" here is postulated; I still don't see how it can
>actually happen on your facts, that the network was trained to
>zero error on a training set of integer values.

The "wild" solution is not postulated; it is THE set of weights which
gives 0 error on the training set. The wild solution is forced upon
the net by the training data. The use of integers for training and
testing, and the fact that they are uniformly spaced, are also not
critical.

>>But to achieve that goal, a design methodology must be used which is
>>*guaranteed* to lead to a safe network.

>What is the spec? If it is "approximate the function that
>generated this data set", then you need to focus on the
>approximation capabilities of your methods. You can NEVER
>*guarantee* a "safe" result without some more knowledge
>of what the true function is. You are making the assumption
>that safe = bounded in a particular manner. If that's known,
>then why not build it in?

I think this is clear now.

>>Such a methodology can be
>>based on decomposition of the input space into parts where the
>>function synthesized is forced to be monotonic in each variable.

>This can work. CART does this; its successor MARS fits local splines.

>In the neural network framework, Mike Jordan and Robert Jacobs
>are working on a generalization of the modular architecture of
>Jacobs, Jordan, Nowlan & Hinton, which recursively splits the
>input space into nested regions and "learns" a mapping within
>each region.

Great. Do they use monotonicity, or a scheme which allows them to get
tight bounds on *all* outputs, so they can satisfy a "spec" if we
could agree on one?
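
The kind of guarantee monotonicity buys is easy to state: if the
fitted function is monotonic in each variable over a region, its
extremes over the whole region occur at two corners, so two
evaluations bound every output in the region. A sketch (Python
syntax; this is only the bare interval argument, not any particular
published method):

    def certify_bounds(g, lo, hi):
        """Bound g over the box [lo[i], hi[i]] in each coordinate,
        assuming g is nondecreasing in every coordinate.  (For a
        coordinate where g is nonincreasing, swap lo[i] and hi[i].)"""
        return g(lo), g(hi)

    # Example: g is monotone in both inputs on the box [0,1] x [0,2],
    # so these two evaluations bound *all* of its values there.
    g = lambda v: 3.0 * v[0] + v[1] ** 2
    print(certify_bounds(g, (0.0, 0.0), (1.0, 2.0)))   # (0.0, 7.0)

No amount of pointwise testing gives you that for an unconstrained
sigmoid net.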

>...

>>For BP networks, I am not sure a safe design methodology can be
>>developed. This is not because of the BP algorithm, per se, but
>>rather because of the architecture of multilayer networks with
>>sigmoids: *all* weights are used in computing *every* output (the
>>effect of zero weights having been eliminated). Every output is
>>calculated using some negative and some positive weights, giving very
>>little hope of control over the values beyond the set of points
>>tested.

>True, all weights are used; however, most if not all of them are
>not particularly important in networks used for real-world
>applications (handwritten digit recognition, e.g.)

Sure, but a lot of little weights can add up, particularly if values
derived from them get multiplied by a larger weight.

>Given a network and a data set, you can compute the partial
>derivatives of, say, mean squared error wrt the weights, and
>the curvature. You want the first partials all to be 0,
>and the second partials give some indication of the "contribution".
>See LeCun, Denker & Solla, "Optimal Brain Damage" in NIPS-2.

>It is not celestial violins but the nature of compositions
>of ridge functions which allows me to say that a general
>feedforward network is smooth, and that BP learning adapts
>its response surface to the training data.

At what point will you start looking at the example above and seeing
that such a statement, though it is generally true, can also be
disastrously false? I think I understand what you mean by
compositions of ridge functions, but unless, on every path from an
input variable to an output, the signal passes through the flat
region of some sigmoid, you may get a very rapid change of output as
you vary that input variable.
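
In the example above, you can see the rapid change directly by
differentiating (my computation): writing s1(x) = 1/(1 + e^(40(x - 1/4)))
and s2(x) = 1/(1 + e^(-40(x - 3/4))), we get

    f'(x) = 1600 [ s2(x)(1 - s2(x)) - s1(x)(1 - s1(x)) ]

At x = 1/4, s1 = 1/2 and s2 is essentially 0, so f'(1/4) is about
-400: one sigmoid sitting in its high-slope region is enough to drive
the output from near 0 to near -40 between x = 0 and x = 1/2.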

The situation you describe, where you are always in the linear region
of all sigmoids, sounds *very* undesirable. The output should benefit
from some signals being strongly attenuated by the flat parts of the
sigmoids.

>If weights are initialized such that the magnitude of
>the vector of weights into a unit is bounded so that the
>response will be in the linear region, I don't believe
>that gradient descent training over a data set will
>result in "wild" values. You are going to have to
>show me how to do it.

It depends what you mean by bounded. The weights in the f(x) above
are bounded by 40. If you want them bounded by 2, you don't get a
wild value. Do you always bound your weights in absolute value by
small numbers? If you do, then you have nothing to worry about, as
long as you can train the net to do what you want. (I doubt that you
can then approximate functions like |sin(10x)|, which has some sharp
changes of direction on [0,1].)
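
A rough calculation shows why (my notation, for a single hidden layer
of logistic sigmoids): if y = sum_i v_i / (1 + e^-(w_i x + b_i)), then
since the slope of a logistic sigmoid never exceeds 1/4,

    |dy/dx| <= sum_i |v_i| |w_i| / 4

With all weights bounded by 2 in absolute value, each hidden unit
contributes a slope of at most 2*2/4 = 1, while |sin(10x)| has slope
10 at its direction changes. So you would need at least 10 hidden
units just to match the slope, before even placing the kinks
correctly.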
--
***************************************************
Prof. William W. Armstrong, Computing Science Dept.
University of Alberta; Edmonton, Alberta, Canada T6G 2H1
arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071