Newsgroups: comp.ai.neural-nets
Path: sparky!uunet!gumby!destroyer!ubc-cs!alberta!arms
From: arms@cs.UAlberta.CA (Bill Armstrong)
Subject: Re: Reducing Training time vs Generalisation
Message-ID: <arms.714249520@spedden>
Sender: news@cs.UAlberta.CA (News Administrator)
Nntp-Posting-Host: spedden.cs.ualberta.ca
Organization: University of Alberta, Edmonton, Canada
References: <Bt8MI4.6Bu.1@cs.cmu.edu>
Date: Wed, 19 Aug 1992 18:38:40 GMT
Lines: 187


I agree with most of Scott Fahlman's comments about "wild"
excursions. I suppose one can dispute my insistence on referring to
this as a "safety" issue, to use a loaded "s" word, which people may
find alarmist. But I calls 'em the way I sees 'em. I don't want it
to be an issue that people can ignore because it doesn't sound
important.

Just a few comments in relation to Scott's remarks need to be made:

sef@sef-pmax.slisp.cs.cmu.edu writes:


>We went round on Bill Armstrong's dangerous example once before. Here's my
>current understanding of the situation:

>If you make a training set that looks like this,


> ..............* *.................

>and train a net with two hidden units a long time with no weight decay, you
>can get a zero-error solution with a large upward excursion between the two
>*'s. The peak of the excursion can be *much* higher than the height of the
>two "tossing points". In this case, there are also solutions that create a
>flat plateau, and these are more likely to be found by the usual learning
>algorithms.

Isn't the plateau solution a *local* minimum of error? Is this a case
where you actually should seek a local minimum rather than an absolute
minimum in training?

>If you shape the training set a bit more carefully

> * *
> .............* *.................

>and use a two-hidden unit net, you can FORCE the solution with a big
>excursion. Only the "ankle" part of the sigmoid will fit these tossing
>points. However, a net with more hidden units could again create a
>plateau, however, and this would be the more likely solution.

Same question: is this plateau a local minimum only? If you are
saying that local minima are more likely than global minima, that
contradicts a lot of people who claim not to worry about local minima
because they say it never happens to them. I think if you take enough
training points, the globally minimal solution will be unique up to
isomorphism, e.g. interchange of hidden units.
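
One way to poke at this empirically, for what it's worth: below is a
minimal sketch (my own toy numbers and code, nothing from Scott's
setup) that fits a 1-D training set shaped like his first picture with
a two-hidden-unit sigmoid net, plain gradient descent, no weight
decay, and then scans the gap between the two starred points.
Depending on the initial weights and how long you train, the scan may
show a plateau-like fit or an overshooting peak.

import numpy as np

rng = np.random.default_rng(0)

# "dots" at 0 on both sides, two "starred" points at 1 near the centre
x = np.concatenate([np.linspace(-6.0, -1.5, 10), [-0.5, 0.5], np.linspace(1.5, 6.0, 10)])
y = np.concatenate([np.zeros(10), [1.0, 1.0], np.zeros(10)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

# two hidden sigmoid units, one linear output unit
W1, b1 = rng.normal(size=2), rng.normal(size=2)
W2, b2 = rng.normal(size=2), 0.0

lr = 0.1
for step in range(20000):
    h = sigmoid(np.outer(x, W1) + b1)          # (N, 2) hidden activations
    out = h @ W2 + b2                          # (N,) network output
    g_out = 2.0 * (out - y) / len(x)           # gradient of mean squared error
    g_h = np.outer(g_out, W2) * h * (1.0 - h)  # back through the sigmoids
    W2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum()
    W1 -= lr * (x @ g_h);     b1 -= lr * g_h.sum(axis=0)

# scan the gap between the two *'s for an excursion
xs = np.linspace(-0.5, 0.5, 201)
peak = (sigmoid(np.outer(xs, W1) + b1) @ W2 + b2).max()
print("max output between the *'s:", peak, "(training targets never exceed 1.0)")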

The more hidden units and layers you have, the less transparent the
behavior of the whole net becomes. Some examples of "wild" behavior
will only appear with relatively small weights but many layers: the
linear regions of the sigmoids all lining up to produce a big excursion.

>What's happening here is that sigmoids are smooth in certain ways (bounded
>derivatives) and we're forcing an exact fit through the training points.
>So the best solution does have a big excursion in it. You often see the
>same gyrations (or worse ones) when fitting a set of points with a
>polynomial.

Excellent analogy. My gut feeling is that polynomials would be far worse
than MLPs with sigmoids, because the sigmoids don't go off to infinity.
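
A quick toy calculation shows how bad the polynomial can get (my own
numbers, not anything from the thread): interpolate a ....* *....
pattern exactly with a degree-11 polynomial and see how far the
interpolant swings away from the targets.

import numpy as np

x = np.array([-5, -4, -3, -2, -1, -0.5, 0.5, 1, 2, 3, 4, 5], dtype=float)
y = np.array([ 0,  0,  0,  0,  0,  1.0, 1.0, 0, 0, 0, 0, 0])

coeffs = np.polyfit(x, y, deg=len(x) - 1)   # exact interpolation through all 12 points
xs = np.linspace(-5, 5, 2001)
fit = np.polyval(coeffs, xs)
print("largest value of the interpolant:", fit.max())
print("most negative value:            ", fit.min())
# both lie well outside the [0, 1] range of the training targets,
# with the worst swings near the ends of the interval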

>Now from some points of view, this big excursion is a good thing. It is
>the "right" solution if you truly want to minimize training set error and
>maintain smoothness. Bill points out that this solution is not the right
>one for some other purposes. You might, for example, want to impose the
>added constraint that the output remains within -- or doesn't go too far
>beyond -- the convex hull of the training cases.

Yes. That's the idea.

>This point is hard to grasp in Bill's arguments, since he insists upon
>using loaded words like "safe" and "unsafe", but after several go-rounds
>I'm pretty sure that's what he means. I would just point out that for some
>applications, the smooth solution with big excursions might be the "safe"
>one. For example, you want to turn your airplane without snapping off the
>wings in sudden turns.

Of course. In that case, though, we should have additional training and
test points for what is clearly an interesting part of the input space,
and those will rule out the unwanted behavior whether the excursion is
what we want or what we want to avoid. The safety issue arises when the
net is forced into an excursion we are not aware of, because we did not
want it and could not test for it: there are too many points in the
input space to check.

An additional trouble is that whenever we force huge output values
from small training values, we depend a lot on the precision of the
arithmetic. I have no idea whether reduced-precision arithmetic would
make the excursions larger or smaller, but in practice we can't ignore
the issue.

>OK, suppose for a given application we do want to meet Bill's boundedness
>criterion on the outputs. There are several solutions:

>1. Sacrifice perfect fit. Weight decay does this in a natural way, finding
>a compromise between exact fit on the training set and the desire for small
>weights (or low derivatives on the output surface). The weight-decay
>parameter controls the relative weight given to fit and smallness of
>weights.

I agree this will work if the control is fine enough to reach every
point of the input space and prevent excursions, and if the fit of the
function to the spec is still good enough. I don't deny it could be
done, but is it easy, or even possible, in practice?
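
For reference, here is the penalized error I understand Scott to mean,
as a minimal sketch (the decay constant is purely illustrative, not a
recommendation from anyone in this thread):

import numpy as np

def penalized_error(outputs, targets, weights, decay=1e-3):
    """Mean squared error plus a weight-decay penalty.

    decay = 0 gives the plain fit; larger values trade training-set
    accuracy for smaller weights, i.e. smaller output derivatives.
    """
    mse = np.mean((np.asarray(outputs) - np.asarray(targets)) ** 2)
    penalty = decay * sum(np.sum(np.asarray(w) ** 2) for w in weights)
    return mse + penalty

In gradient terms the penalty just adds 2*decay*w to each weight's
gradient, shrinking every weight a little on every step; whether that
gives fine enough control over every corner of the input space is
exactly the question above.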

>2. Sacrifice smoothness. If sharp corners are OK, it is a trivial matter
>to add an extra piece of machinery that simply enforces the non-excursion
>criterion, clipping the neural net's output when it wanders outside the
>region bounded by the training set outputs.
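
Concretely, I read option 2 as a post-processing step something like
this (a sketch only; the bounds are just the extremes of the training
outputs):

import numpy as np

def clip_to_training_range(net_output, training_targets):
    # enforce the non-excursion criterion after the fact: never report
    # a value outside the range spanned by the training outputs
    lo, hi = np.min(training_targets), np.max(training_targets)
    return np.clip(net_output, lo, hi)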

If you do want some excursions and you don't want others, this won't
work. It is not a simple matter to find the problem regions and bound
them specially. I believe it is NP-complete to find them. The
plausibility of this can be argued as follows: ALNs are a special case
of MLPs; deciding whether an ALN is constant 0 or has a 1 output
somewhere is, in the worst case, CNF-satisfiability; and such a 1
output is a special case of a spike that may not be wanted.

The "spec" here is an NN that produces a constant 0. It is not a great
example, because we could achieve that differently, but maybe someone
can build on it as a basis.
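
A toy version of the worst case I have in mind (entirely my own
illustration, with a made-up little boolean net): the only general way
to certify that such a net is constantly 0 is to look at its output
over the whole input space, and that space doubles with every input.

from itertools import product

def toy_boolean_net(a, b, c, d):
    # a small, fixed AND/OR net over 4 boolean inputs (hypothetical example)
    return int((a and not b) or (c and d and not a))

n = 4
constant_zero = all(toy_boolean_net(*bits) == 0
                    for bits in product([0, 1], repeat=n))
print("constant 0?", constant_zero)   # exhaustive check: 2**n evaluations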

>3. Go to a piecewise solution. With splines, we can fit the training
>points exactly, bound the first N derivatives, and still not go on wild
>excursions, though the solution is more complex in other ways and will ner

never??

>pick up on long-range regularities in the training data. "Neural" nets
>that use radial basis units or other local-response units have the same
>character. I guess ALN's do too, though with lots of jaggy corners.

True. This is where forcing monotonicity on the ALN, when a priori
knowledge tells us the output must be monotonic, can help to capture
the long-range regularities.
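
A 1-D cartoon of my own (nothing ALN-specific in it) of why the
constraint helps: if the fitted function is forced to be nondecreasing,
its value between any two training inputs is bracketed by the fitted
values at those inputs, so no excursion can sneak in between them, and
the constraint itself carries long-range structure.

import numpy as np

def monotone_fit(train_x, train_y):
    """Crude nondecreasing fit: sort by x, then force the fitted values
    to be nondecreasing with a running maximum (not a least-squares fit)."""
    order = np.argsort(train_x)
    xs, ys = np.asarray(train_x)[order], np.asarray(train_y)[order]
    fitted = np.maximum.accumulate(ys)      # nondecreasing by construction
    def f(x):
        # piecewise-constant interpolant: value of the nearest knot to the left
        i = np.clip(np.searchsorted(xs, x, side="right") - 1, 0, len(xs) - 1)
        return fitted[i]
    return f

f = monotone_fit([0.0, 1.0, 2.0, 3.0], [0.1, 0.4, 0.3, 0.9])
print(f(1.5))   # always between the fitted values at x = 1 and x = 2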

This local vs. global question is very fundamental, and I'm grateful
that you raise the issue in this context, Scott. Maybe there are other
things that help to capture global regularities that people can think of.

The usual MLPs are very "globally" oriented, so they may be good at
capturing the global information inherent in training data. The
downside is that you can't evaluate the output using only local
information. The solution for ALNs will use monotonicity to capture
global regularities, which will result in a jaggy function like you
suggest. Then, that function will be smoothed. We *could* use a
smoothing kernel like the derivative of the usual sigmoid, or a
Gaussian, but we will use a kernel with bounded support. This keeps
the arithmetic part of the computation brief. To do that, we structure
the ALN computation not to do arithmetic, but simply to compute a
pointer to the coefficients required for computing the smoothed output
in the part of the input space where the input vector lies. Pointing
to a local approximant using ALNs puts the "missing" conditional
statements back into the NN paradigm.

In sum we get: 1. global fitting using a priori knowledge,
2. use of ALNs to compute an integer index (which they are suited to),
3. fast computation, only a small part of which is arithmetic.

So ALNs will not have to pay a time penalty like MLPs with sigmoids do
to compute smooth functions based on detection of global regularities.
(Does anyone hear soft funeral music?)

>4. Go to a higher-order solution. With extra hidden units, there will be
>solutions that fit the data but that have the extra degrees of freedom to
>meet other criteria as well. There are various ways of biasing these nets
>to favor solutions that do what you want. Basically, you build the added
>desiderata into the error function, as we do with weight decay (a bias
>toward small weights).

I don't know what you mean. I think you might find in practice that
more hidden units and layers give you more, not fewer, problems.

Thanks, Scott. I appreciate that you have succeeded in understanding
my comments on "safety", or whatever you want to call it, despite
their apparently being fairly obscure (judging by people's initial
reactions), for which I apologize.

Bill

--
***************************************************
Prof. William W. Armstrong, Computing Science Dept.
University of Alberta; Edmonton, Alberta, Canada T6G 2H1
arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071