- Path: sparky!uunet!ogicse!das-news.harvard.edu!cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!news
- From: sef@sef-pmax.slisp.cs.cmu.edu
- Newsgroups: comp.ai.neural-nets
- Subject: Re: Reducing Training time vs Generalisation
- Message-ID: <Bt9GIx.9In.1@cs.cmu.edu>
- Date: 20 Aug 92 02:35:21 GMT
- Article-I.D.: cs.Bt9GIx.9In.1
- Sender: news@cs.cmu.edu (Usenet News System)
- Organization: School of Computer Science, Carnegie Mellon
- Lines: 116
- Nntp-Posting-Host: sef-pmax.slisp.cs.cmu.edu
-
-
- From: arms@cs.UAlberta.CA (Bill Armstrong)
-
- > ..............* *.................
-
- >... In this case, there are also solutions that create a
- >flat plateau, and these are more likely to be found by the usual learning
- >algorithms.
-
- Isn't the plateau solution a *local* minimum of error? Is this a case
- where you actually should seek a local minimum rather than an absolute
- minimum in training?
-
- No, the plateau solution is one of several global minima that have zero
- error because they go smack through all the points. For simplicity, assume
- two 0-1 sigmoid units and a linear output unit of unity gain. One hidden
- unit goes from zero to one between the last of the left-side dots and
- the left *. The other hidden unit goes from zero to one between the
- right * and the dots. The first hidden unit has an output weight of +x
- and the second has an output weight of -x, where x is the height of the
- *-defined plateau.
-
- OK, the solution is only EXACT as the sigmoids approach step functions,
- which means as the input-side weights grow large. But you get very close
- even with moderate weights.
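
A quick numerical sketch of this two-unit construction (the point locations,
plateau height, and weight sizes below are made-up choices, just to show the
plateau emerging):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layout: dots at height 0 on either side, the two *'s at height 1.0,
# with the left * at input -0.5 and the right * at input 1.5.
x_height = 1.0    # height of the *-defined plateau
w_in = 20.0       # large input-side weight, so each sigmoid is nearly a step

def net(t):
    h1 = sigmoid(w_in * (t + 0.75))   # switches 0 -> 1 just left of the left *
    h2 = sigmoid(w_in * (t - 1.75))   # switches 0 -> 1 just right of the right *
    return x_height * h1 - x_height * h2   # linear output unit, unity gain

for t in (-2.0, -0.5, 0.5, 1.5, 3.0):
    print(f"f({t:5.2f}) = {net(t):.3f}")
# ~0 on the dots, ~1 on and between the *'s: zero error via a flat plateau,
# with no wild excursion anywhere.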
-
- >If you shape the training set a bit more carefully
-
- > * *
- > .............* *.................
-
- >and use a two-hidden unit net, you can FORCE the solution with a big
- >excursion. Only the "ankle" part of the sigmoid will fit these two
- >points. A net with more hidden units could again create a
- >plateau, however, and this would be the more likely solution.
-
- Same question: is this plateau a local minimum only?
-
- Well, one ugly solution uses four hidden units and gets up to the plateau
- in two steps, and down in two more. Again, zero error, so the solution is
- one of the global minima.
-
- The more hidden units and layers you have, the less transparent the
- behavior of the whole net becomes. Some examples of "wild" behavior
- will only appear with relatively small weights, but lots of layers:
- linear regions of sigmoids all lining up to produce a big excursion.
-
- Sure, but they will have no incentive to do this unless the data, in some
- sense, forces them to. You could always throw a narrow Gaussian unit into
- the net, slide it over between any two training points, and give it an
- output weight of 10^99. But it would be wrong.
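
(A throwaway numerical version of that point, with made-up training inputs: the
spike leaves the error at the training points untouched, while the output
between points becomes astronomically wrong.)

import numpy as np

train_x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # made-up training inputs
targets = np.zeros_like(train_x)                   # say the target is 0 everywhere

def spike(t, center=0.5, width=0.01, out_weight=1e99):
    # A narrow Gaussian unit slid in between two training points,
    # given an absurd output weight.
    return out_weight * np.exp(-((t - center) ** 2) / (2.0 * width ** 2))

print(np.abs(spike(train_x) - targets).max())   # ~0: training error unchanged
print(spike(0.5))                               # 1e+99 halfway between points
# The data gives the net no incentive to build such a unit, and weight decay
# (discussed below) actively penalizes it.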
-
- >1. Sacrifice perfect fit. Weight decay does this in a natural way, finding
- >a compromise between exact fit on the training set and the desire for small
- >weights (or low derivatives on the output surface). The weight-decay
- >parameter controls the relative weight given to fit and smallness of
- >weights.
-
- I agree this will work if you can get fine enough control, reaching every
- point of the input space, to prevent excursions, and if the
- fitting of the function to the spec is good enough. Though I don't
- deny it could be done, is this easy, or even possible to do in practice?
-
- You don't need fine control. If you just crank some weight decay into the
- error measure, it will keep the net from making wild excursions without
- being forced to by the data. In the example of the big Gaussian spike,
- it would drive the output weight to zero if the Gaussian is not helping
- to fit the data.
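
Here is a minimal sketch of what "cranking some weight decay into the error
measure" looks like (toy linear data of my own; the decay coefficient and step
size are arbitrary). The third input duplicates the first, so an exact fit is
possible with wild weights, but the decay term settles on small ones:

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([a, b, a])     # inputs 1 and 3 carry the same signal
y = 2.0 * a                        # input 2 is irrelevant
w = np.array([10.0, 3.0, -8.0])    # the +10/-8 pair alone fits y exactly,
                                   # but the weights are wild

lam, lr = 0.1, 0.05                # weight-decay parameter and learning rate
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y) + lam * w   # error gradient + decay term
    w -= lr * grad

print(w)   # roughly [0.95, 0.00, 0.95]: a tiny bit of fit is sacrificed in
           # exchange for small weights, and the irrelevant weight goes to ~0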
-
- >2. Sacrifice smoothness. If sharp corners are OK, it is a trivial matter
- >to add an extra piece of machinery that simply enforces the non-excursion
- >criterion, clipping the neural net's output when it wanders outside the
- >region bounded by the training set outputs.
-
- If you do want some excursions and you don't want others, this won't
- work. It is not a simple matter to find the problem regions and bound
- them specially. I believe it is NP-complete to find them. The
- plausibility of this can be argued as follows: ALNs are a special case
- of MLPs; you can't tell if an ALN is constant 0 or has some 1 output
- somewhere (worst case, CNF-satisfiability); this is a special case of
- a spike that may not be wanted.
-
- Huh??? It's trivial to put a widget on each output line that clips the
- output to lie within the range seen during training. It's a bit harder,
- but still not NP-complete, to find the convex hull of the outputs and clip to
- that. Now, if you want some excursions and not others, you'd better tell
- the net or the net designer about this -- it's a bit unreasonable to expect
- a backprop net to read your mind.
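
For what it's worth, the per-output-range version of that widget is only a few
lines (a sketch; RangeClipper and the numbers are illustrative, not from any
particular package):

import numpy as np

class RangeClipper:
    """Clip each output to the [min, max] range it took on over the training set."""
    def __init__(self, train_outputs):            # shape: (n_examples, n_outputs)
        self.lo = train_outputs.min(axis=0)
        self.hi = train_outputs.max(axis=0)
    def __call__(self, raw_outputs):
        # Excursions outside the training range get flattened (sharp corners and all).
        return np.clip(raw_outputs, self.lo, self.hi)

train_targets = np.array([[0.0], [0.0], [1.0], [1.0], [0.0]])
clip = RangeClipper(train_targets)
print(clip(np.array([[-0.3], [0.4], [2.5]])))     # -> [[0.0], [0.4], [1.0]]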
-
- The usual MLPs are very "globally" oriented, so they may be good at
- capturing the global information inherent in training data. The
- downside is that you can't evaluate the output by taking only local
- information into account...
- (Does anyone hear soft funeral music?)
-
- Well, since you keep pounding on this, I will point out that in most
- backprop-style nets after training, almost all of the hidden units are
- saturated almost all of the time. So you can replace them with sharp
- thresholds and use the same kind of lazy evaluation at runtime that you
- propose for ALNs: work backwards through the tree and don't evaluate any
- input sub-trees that can't alter the current unit's state.
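
Here's roughly what that lazy evaluation looks like once the saturated sigmoids
are treated as hard 0/1 thresholds (the tree representation and the little AND
example are my own illustration):

def threshold_eval(unit, x, cache=None):
    """unit = {'bias': b, 'children': [(weight, child), ...]}; a child is either
    an input index (int) or another unit dict.  Returns 0 or 1, skipping any
    sub-tree that can no longer change this unit's state."""
    if cache is None:
        cache = {}
    total = unit['bias']
    # Evaluate the most influential children first.
    pending = sorted(unit['children'], key=lambda wc: -abs(wc[0]))
    lo = sum(min(0.0, w) for w, _ in pending)   # worst case still to come
    hi = sum(max(0.0, w) for w, _ in pending)   # best case still to come
    for w, child in pending:
        if total + lo > 0.0:      # guaranteed to fire no matter what
            return 1
        if total + hi <= 0.0:     # guaranteed not to fire
            return 0
        lo -= min(0.0, w)
        hi -= max(0.0, w)
        if isinstance(child, int):
            out = x[child]                      # a raw 0/1 input
        else:
            key = id(child)
            if key not in cache:                # shared sub-trees evaluated once
                cache[key] = threshold_eval(child, x, cache)
            out = cache[key]
        total += w * out
    return 1 if total > 0.0 else 0

# Tiny usage example: an AND-ish unit.  With x[0] == 0, the partial sum is
# already decisive and the second child is never examined.
and_unit = {'bias': -1.5, 'children': [(1.0, 0), (1.0, 1)]}
print(threshold_eval(and_unit, x=[0, 1]))   # -> 0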
-
- Myself, I prefer to think in terms of parallel hardware, so lazy evaluation
- isn't an issue. Yes, sigmoid unit hardware is a bit more expensive to
- implement than simple gates, but I don't need nearly as many of them.
-
- -- Scott
-
- ===========================================================================
- Scott E. Fahlman
- School of Computer Science
- Carnegie Mellon University
- 5000 Forbes Avenue
- Pittsburgh, PA 15213
-
- Internet: sef+@cs.cmu.edu
-