Path: sparky!uunet!charon.amdahl.com!pacbell.com!mips!darwin.sura.net!wupost!cs.utexas.edu!uwm.edu!ogicse!das-news.harvard.edu!cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!news
From: sef@sef-pmax.slisp.cs.cmu.edu
Newsgroups: comp.ai.neural-nets
Subject: Re: Reducing Training time vs Generalisation
Message-ID: <Bt8MI4.6Bu.1@cs.cmu.edu>
Date: 19 Aug 92 15:46:50 GMT
Article-I.D.: cs.Bt8MI4.6Bu.1
Sender: news@cs.cmu.edu (Usenet News System)
Organization: School of Computer Science, Carnegie Mellon
Lines: 86
Nntp-Posting-Host: sef-pmax.slisp.cs.cmu.edu


We went round on Bill Armstrong's dangerous example once before. Here's my
current understanding of the situation:

If you make a training set that looks like this,


..............*        *.................

and train a net with two hidden units for a long time with no weight decay,
you can get a zero-error solution with a large upward excursion between the
two *'s. The peak of the excursion can be *much* higher than the height of
the two "tossing points". In this case, there are also solutions that create
a flat plateau, and these are more likely to be found by the usual learning
algorithms.
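
A toy numpy sketch of that kind of excursion (made-up numbers, not anything
from a real simulator): two fixed steep sigmoid hidden units, with the output
weight solved so the net passes through two low "tossing points", and the
peak in between ends up far above them.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Two steep sigmoids of opposite sign form a "bump" between x = 1 and x = 2.
    a = 10.0                                  # steepness, chosen by hand

    def bump(x):
        return sigmoid(a * (x - 1.0)) - sigmoid(a * (x - 2.0))

    # Fit two low points just outside the bump: y(0.6) = 0.1, and by symmetry
    # y(2.4) is essentially the same.  Only the flat "ankle" of the sigmoids is
    # active there, so the output weight must be much larger than the targets.
    x0, y0 = 0.6, 0.1
    W = y0 / bump(x0)                         # solve W * bump(x0) = y0

    xs = np.linspace(0.0, 3.0, 301)
    ys = W * bump(xs)
    print("output weight W        :", round(W, 2))             # about 5.6
    print("value at tossing points:", round(W * bump(x0), 3))  # 0.1, as required
    print("peak between the *'s   :", round(ys.max(), 2))      # about 5.5, far above 0.1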

If you shape the training set a bit more carefully

              *      *
.............*        *.................

and use a two-hidden unit net, you can FORCE the solution with a big
excursion. Only the "ankle" part of the sigmoid will fit these tossing
points. A net with more hidden units could again create a plateau, however,
and this would be the more likely solution.

What's happening here is that sigmoids are smooth in certain ways (bounded
derivatives) and we're forcing an exact fit through the training points.
So the best solution does have a big excursion in it. You often see the
same gyrations (or worse ones) when fitting a set of points with a
polynomial.
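
The polynomial version is easy to reproduce with a few lines of numpy (the
usual equally-spaced-interpolation example; nothing here is specific to
neural nets):

    import numpy as np

    # Fit a degree-10 polynomial exactly through 11 equally spaced samples of
    # a smooth bump.  The data never exceed 1.0, but the interpolant swings
    # far outside that range between the outermost samples.
    x = np.linspace(-1.0, 1.0, 11)
    y = 1.0 / (1.0 + 25.0 * x**2)

    coeffs = np.polyfit(x, y, deg=len(x) - 1)     # exact interpolation
    xx = np.linspace(-1.0, 1.0, 1001)
    yy = np.polyval(coeffs, xx)

    print("largest training target:", y.max())           # 1.0
    print("largest |interpolant|  :", np.abs(yy).max())   # roughly 2, near the ends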

Now from some points of view, this big excursion is a good thing. It is
the "right" solution if you truly want to minimize training set error and
maintain smoothness. Bill points out that this solution is not the right
one for some other purposes. You might, for example, want to impose the
added constraint that the output remains within -- or doesn't go too far
beyond -- the convex hull of the training cases.

This point is hard to grasp in Bill's arguments, since he insists upon
using loaded words like "safe" and "unsafe", but after several go-rounds
I'm pretty sure that's what he means. I would just point out that for some
applications, the smooth solution with big excursions might be the "safe"
one. For example, you want to turn your airplane without snapping off the
wings in sudden turns.

OK, suppose for a given application we do want to meet Bill's boundedness
criterion on the outputs. There are several solutions:

1. Sacrifice perfect fit. Weight decay does this in a natural way, finding
a compromise between exact fit on the training set and the desire for small
weights (or low derivatives on the output surface). The weight-decay
parameter controls the tradeoff between exact fit and smallness of the
weights.
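
A minimal sketch of what that compromise looks like, assuming the common
quadratic form of weight decay (the function and parameter names here are
made up for illustration):

    import numpy as np

    # total error = fit error + lam * sum(w**2).  The gradient of the penalty
    # term just shrinks every weight a little on each step, trading exact fit
    # for small weights (and hence a flatter output surface).
    def decayed_update(w, grad_fit, lam=0.05, lr=0.1):
        grad = grad_fit + 2.0 * lam * w       # add d/dw of lam * w**2
        return w - lr * grad

    w = np.array([4.0, -3.0, 0.5])
    grad_fit = np.zeros_like(w)               # pretend the fit error is already zero
    for _ in range(100):
        w = decayed_update(w, grad_fit)
    print(w)   # weights have shrunk substantially even though the fit was perfect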

2. Sacrifice smoothness. If sharp corners are OK, it is a trivial matter
to add an extra piece of machinery that simply enforces the non-excursion
criterion, clipping the neural net's output when it wanders outside the
region bounded by the training set outputs.
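
A sketch of that extra machinery, assuming we simply clamp the output to the
range spanned by the training-set targets (the names and numbers are
placeholders):

    import numpy as np

    def clipped_output(net_output, train_targets):
        # enforce the non-excursion criterion with a hard, sharp-cornered clamp
        lo, hi = train_targets.min(), train_targets.max()
        return np.clip(net_output, lo, hi)

    train_targets = np.array([0.0, 0.1, 0.2, 0.1])
    raw = np.array([0.05, 0.1, 5.4, -0.3])        # the 5.4 is a wild excursion
    print(clipped_output(raw, train_targets))     # every output now lies in [0.0, 0.2]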

3. Go to a piecewise solution. With splines, we can fit the training
points exactly, bound the first N derivatives, and still not go on wild
excursions, though the solution is more complex in other ways and will never
pick up on long-range regularities in the training data. "Neural" nets
that use radial basis units or other local-response units have the same
character. I guess ALN's do too, though with lots of jaggy corners.
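
A sketch of that local-response character, using Gaussian radial basis units
with hand-picked centers and widths (all the numbers are made up; the output
weights are solved so the fit through the training points is exact):

    import numpy as np

    x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y_train = np.array([ 0.1,  0.1, 0.9, 0.1, 0.1])
    width = 0.5

    def design(x):
        # one Gaussian bump per training point
        return np.exp(-((x[:, None] - x_train[None, :]) / width) ** 2)

    w = np.linalg.solve(design(x_train), y_train)   # exact fit at the training points

    x_test = np.linspace(-4.0, 4.0, 9)
    print(np.round(design(x_test) @ w, 3))
    # the response stays bounded and falls back toward zero away from the data,
    # instead of heading off on a long excursion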

4. Go to a higher-order solution. With extra hidden units, there will be
solutions that fit the data but that have the extra degrees of freedom to
meet other criteria as well. There are various ways of biasing these nets
to favor solutions that do what you want. Basically, you build the added
desiderata into the error function, as we do with weight decay (a bias
toward small weights).
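
A sketch of building such a criterion into the error function, here with a
made-up penalty on outputs that stray outside the range of the training
targets, measured at some extra "probe" inputs placed between the training
cases:

    import numpy as np

    def total_error(train_outputs, targets, probe_outputs, lam=1.0):
        fit = np.mean((train_outputs - targets) ** 2)       # usual fit term
        lo, hi = targets.min(), targets.max()
        over = np.clip(probe_outputs - hi, 0.0, None)       # excursion above the targets
        under = np.clip(lo - probe_outputs, 0.0, None)      # excursion below the targets
        return fit + lam * np.mean(over ** 2 + under ** 2)

    targets = np.array([0.1, 0.1, 0.1, 0.1])
    # a net that fits the training cases perfectly but bulges to 3.0 in between
    print(total_error(targets, targets, np.array([0.1, 3.0, 0.1])))   # large error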

By the way, I didn't respond to the question on generalization because Dave
DeMers gave an answer much better than I could have produced.

-- Scott
===========================================================================
Scott E. Fahlman
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213

Internet: sef+@cs.cmu.edu