NetNews Usenet Archive 1992 #18

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #18 / NN_1992_18.iso / spool / comp / ai / neuraln / 3246 < prev next >

Wrap

Internet Message Format | 1992-08-19 | 6.0 KB

Path: sparky!uunet!ogicse!das-news.harvard.edu!cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!news From: sef@sef-pmax.slisp.cs.cmu.edu Newsgroups: comp.ai.neural-nets Subject: Re: Reducing Training time vs Generalisation Message-ID: <Bt9GIx.9In.1@cs.cmu.edu> Date: 20 Aug 92 02:35:21 GMT Article-I.D.: cs.Bt9GIx.9In.1 Sender: news@cs.cmu.edu (Usenet News System) Organization: School of Computer Science, Carnegie Mellon Lines: 116 Nntp-Posting-Host: sef-pmax.slisp.cs.cmu.edu From: arms@cs.UAlberta.CA (Bill Armstrong) > ..............* *................. >... In this case, there are also solutions that create a >flat plateau, and these are more likely to be found by the usual learning >algorithms. Isn't the plateau solution a *local* minimum of error? Is this a case where you actually should seek a local minimum rather than an absolute minimum in training? No, the plateau solution is one of several global minima that have zero error because they go smack through all the points. For simplicity, assume two 0-1 sigmoid units and a linear output unit of unity gain. One hidden unit goes from zero to one between the the last of the left-side dots and the left *. The other hidden unit one goes from zero to one between the right * and the dots. The first hidden unit has an output weight of +x and the second has an output weight of -x, where x is the height of the *-defined plateau. OK, the solution is only EXACT as the sigmoids approach step functions, which means as the input-side weights grow large. But you get very close even with moderate weights. >If you shape the training set a bit more carefully > * * > .............* *................. >and use a two-hidden unit net, you can FORCE the solution with a big >excursion. Only the "ankle" part of the sigmoid will fit these tossing >points. However, a net with more hidden units could again create a >plateau, however, and this would be the more likely solution. Same question: is this plateau a local minimum only? Well, one ugly solution uses four hidden units and gets up to the plateau in two steps, and down in two more. Again, zero error, so the solution is one of the global minima. The more hidden units and layers you have, the less transparent the behavior of the whole net becomes. Some examples of "wild" behavior will only appear with relatively small weights, but lots of layers: linear regions of sigmoids all lining up to produce a big excursion. Sure, but they will have no incentive to do this unless the data, in some sense, forces them to. You could always throw a narrow Gaussian unit into the net, slide it over between any two training points, and give it an output weight of 10^99. But it would be wrong. >1. Sacrifice perfect fit. Weight decay does this in a natural way, finding >a compromise between exact fit on the training set and the desire for small >weights (or low derivatives on the output surface). The weight-decay >parameter controls the relative weight given to fit and smallness of >weights. I agree this will work if you can get fine enough control that reaches to every point of the input space to prevent excursions, and the fitting of the function to the spec is good enough. Though I don't deny it could be done, is this easy, or even possible to do in practice? You don't need fine control. If you just crank some weight decay into the error measure, it will keep the net from making wild excursion without being forced to by the data. For example, in the example about the big gaussian spike, it would drive the output weight to zero if the Gaussian is not helping to fit the data. >2. Sacrifice smoothness. If sharp corners are OK, it is a trivial matter >to add an extra piece of machinery that simply enforces the non-excursion >criterion, clipping the neural net's output when it wanders outside the >region bounded by the training set outputs. If you do want some excursions and you don't want others, this won't work. It is not a simple matter to find the problem regions and bound them specially. I believe it is NP complete to find them. The plausibility of this can be argued as follows: ALNs are a special case of MLPs; you can't tell if an ALN is constant 0 or has some 1 output somewhere (worst case, CNF- satisfiability); this is a special case of a spike that may not be wanted. Huh??? It's trivial to put a widget on each output line that clips the output to lie between all those seen during training. It's a bit harder, but still not NP complete, to find the convex hull of outputs and clip to that. Now, if you want some excursions and not others, you'd better tell the net or the net designer about this -- it's a bit unreasonable to expect a backrpop net to read your mind. The usual MLPs are very "globally" oriented, so they may be good at capturing the global information inherent in training data. The downside is that you can't evaluate the output just taking local information into account... (Does anyone hear soft funeral music?) Well, since you keep pounding on this, I will point out that in most backprop-style nets after training, almost all of the hidden units are saturated almost all of the time. So you can replace them with sharp thresholds and use the same kind of lazy evaluation at runtime that you propose for ALNs: work backwards through the tree and don't evaluate any input sub-trees that can't alter the current unit's state. Myself, I prefer to think in terms of parallel hardware, so lazy evaluation isn't an issue. Yes, sigmoid unit hardware is a bit more expensive to implement than simple gates, but I don't need nearly as many of them. -- Scott =========================================================================== Scott E. Fahlman School of Computer Science Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Internet: sef+@cs.cmu.edu