Path: sparky!uunet!charon.amdahl.com!pacbell.com!mips!darwin.sura.net!wupost!cs.utexas.edu!uwm.edu!ogicse!das-news.harvard.edu!cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!news
From: sef@sef-pmax.slisp.cs.cmu.edu
Newsgroups: comp.ai.neural-nets
Subject: Re: Reducing Training time vs Generalisation
Message-ID: <Bt8MI4.6Bu.1@cs.cmu.edu>
Date: 19 Aug 92 15:46:50 GMT
Article-I.D.: cs.Bt8MI4.6Bu.1
Sender: news@cs.cmu.edu (Usenet News System)
Organization: School of Computer Science, Carnegie Mellon
Lines: 86
Nntp-Posting-Host: sef-pmax.slisp.cs.cmu.edu


We went round on Bill Armstrong's dangerous example once before. Here's my
current understanding of the situation:

If you make a training set that looks like this,


..............*        *.................

and train a net with two hidden units for a long time with no weight decay,
you can get a zero-error solution with a large upward excursion between the
two *'s. The peak of the excursion can be *much* higher than the height of
the two "tossing points". In this case, there are also solutions that create
a flat plateau, and these are more likely to be found by the usual learning
algorithms.
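
A toy numpy sketch of that kind of excursion (made-up numbers, not anything
from a real simulator): two fixed steep sigmoid hidden units, with the output
weight solved so the net passes through two low "tossing points", and the
peak in between ends up far above them.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Two steep sigmoids of opposite sign form a "bump" between x = 1 and x = 2.
    a = 10.0                                  # steepness, chosen by hand

    def bump(x):
        return sigmoid(a * (x - 1.0)) - sigmoid(a * (x - 2.0))

    # Fit two low points just outside the bump: y(0.6) = 0.1, and by symmetry
    # y(2.4) is essentially the same.  Only the flat "ankle" of the sigmoids is
    # active there, so the output weight must be much larger than the targets.
    x0, y0 = 0.6, 0.1
    W = y0 / bump(x0)                         # solve W * bump(x0) = y0

    xs = np.linspace(0.0, 3.0, 301)
    ys = W * bump(xs)
    print("output weight W        :", round(W, 2))             # about 5.6
    print("value at tossing points:", round(W * bump(x0), 3))  # 0.1, as required
    print("peak between the *'s   :", round(ys.max(), 2))      # about 5.5, far above 0.1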

If you shape the training set a bit more carefully

              *      *
.............*        *.................

and use a two-hidden unit net, you can FORCE the solution with a big
excursion. Only the "ankle" part of the sigmoid will fit these tossing
points. A net with more hidden units could again create a plateau, however,
and this would be the more likely solution.

What's happening here is that sigmoids are smooth in certain ways (bounded
derivatives) and we're forcing an exact fit through the training points.
So the best solution does have a big excursion in it. You often see the
same gyrations (or worse ones) when fitting a set of points with a
polynomial.
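
The polynomial version is easy to reproduce with a few lines of numpy (the
usual equally-spaced-interpolation example; nothing here is specific to
neural nets):

    import numpy as np

    # Fit a degree-10 polynomial exactly through 11 equally spaced samples of
    # a smooth bump.  The data never exceed 1.0, but the interpolant swings
    # far outside that range between the outermost samples.
    x = np.linspace(-1.0, 1.0, 11)
    y = 1.0 / (1.0 + 25.0 * x**2)

    coeffs = np.polyfit(x, y, deg=len(x) - 1)     # exact interpolation
    xx = np.linspace(-1.0, 1.0, 1001)
    yy = np.polyval(coeffs, xx)

    print("largest training target:", y.max())           # 1.0
    print("largest |interpolant|  :", np.abs(yy).max())   # roughly 2, near the ends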

Now from some points of view, this big excursion is a good thing. It is
the "right" solution if you truly want to minimize training set error and
maintain smoothness. Bill points out that this solution is not the right
one for some other purposes. You might, for example, want to impose the
added constraint that the output remains within -- or doesn't go too far
beyond -- the convex hull of the training cases.

This point is hard to grasp in Bill's arguments, since he insists upon
using loaded words like "safe" and "unsafe", but after several go-rounds
I'm pretty sure that's what he means. I would just point out that for some
applications, the smooth solution with big excursions might be the "safe"
one. For example, you want to turn your airplane without snapping off the
wings in sudden turns.

OK, suppose for a given application we do want to meet Bill's boundedness
criterion on the outputs. There are several solutions:

1. Sacrifice perfect fit. Weight decay does this in a natural way, finding
a compromise between exact fit on the training set and the desire for small
weights (or low derivatives on the output surface). The weight-decay
parameter controls the tradeoff between exact fit and smallness of the
weights.
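
A minimal sketch of what that compromise looks like, assuming the common
quadratic form of weight decay (the function and parameter names here are
made up for illustration):

    import numpy as np

    # total error = fit error + lam * sum(w**2).  The gradient of the penalty
    # term just shrinks every weight a little on each step, trading exact fit
    # for small weights (and hence a flatter output surface).
    def decayed_update(w, grad_fit, lam=0.05, lr=0.1):
        grad = grad_fit + 2.0 * lam * w       # add d/dw of lam * w**2
        return w - lr * grad

    w = np.array([4.0, -3.0, 0.5])
    grad_fit = np.zeros_like(w)               # pretend the fit error is already zero
    for _ in range(100):
        w = decayed_update(w, grad_fit)
    print(w)   # weights have shrunk substantially even though the fit was perfect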

2. Sacrifice smoothness. If sharp corners are OK, it is a trivial matter
to add an extra piece of machinery that simply enforces the non-excursion
criterion, clipping the neural net's output when it wanders outside the
region bounded by the training set outputs.
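
A sketch of that extra machinery, assuming we simply clamp the output to the
range spanned by the training-set targets (the names and numbers are
placeholders):

    import numpy as np

    def clipped_output(net_output, train_targets):
        # enforce the non-excursion criterion with a hard, sharp-cornered clamp
        lo, hi = train_targets.min(), train_targets.max()
        return np.clip(net_output, lo, hi)

    train_targets = np.array([0.0, 0.1, 0.2, 0.1])
    raw = np.array([0.05, 0.1, 5.4, -0.3])        # the 5.4 is a wild excursion
    print(clipped_output(raw, train_targets))     # every output now lies in [0.0, 0.2]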

3. Go to a piecewise solution. With splines, we can fit the training
points exactly, bound the first N derivatives, and still not go on wild
excursions, though the solution is more complex in other ways and will never
pick up on long-range regularities in the training data. "Neural" nets
that use radial basis units or other local-response units have the same
character. I guess ALN's do too, though with lots of jaggy corners.
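
A sketch of that local-response character, using Gaussian radial basis units
with hand-picked centers and widths (all the numbers are made up; the output
weights are solved so the fit through the training points is exact):

    import numpy as np

    x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y_train = np.array([ 0.1,  0.1, 0.9, 0.1, 0.1])
    width = 0.5

    def design(x):
        # one Gaussian bump per training point
        return np.exp(-((x[:, None] - x_train[None, :]) / width) ** 2)

    w = np.linalg.solve(design(x_train), y_train)   # exact fit at the training points

    x_test = np.linspace(-4.0, 4.0, 9)
    print(np.round(design(x_test) @ w, 3))
    # the response stays bounded and falls back toward zero away from the data,
    # instead of heading off on a long excursion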

4. Go to a higher-order solution. With extra hidden units, there will be
solutions that fit the data but that have the extra degrees of freedom to
meet other criteria as well. There are various ways of biasing these nets
to favor solutions that do what you want. Basically, you build the added
desiderata into the error function, as we do with weight decay (a bias
toward small weights).
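
A sketch of building such a criterion into the error function, here with a
made-up penalty on outputs that stray outside the range of the training
targets, measured at some extra "probe" inputs placed between the training
cases:

    import numpy as np

    def total_error(train_outputs, targets, probe_outputs, lam=1.0):
        fit = np.mean((train_outputs - targets) ** 2)       # usual fit term
        lo, hi = targets.min(), targets.max()
        over = np.clip(probe_outputs - hi, 0.0, None)       # excursion above the targets
        under = np.clip(lo - probe_outputs, 0.0, None)      # excursion below the targets
        return fit + lam * np.mean(over ** 2 + under ** 2)

    targets = np.array([0.1, 0.1, 0.1, 0.1])
    # a net that fits the training cases perfectly but bulges to 3.0 in between
    print(total_error(targets, targets, np.array([0.1, 3.0, 0.1])))   # large error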

By the way, I didn't respond to the question on generalization because Dave
DeMers gave an answer much better than I could have produced.

-- Scott
===========================================================================
Scott E. Fahlman
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213

Internet: sef+@cs.cmu.edu