Newsgroups: comp.ai.neural-nets
Path: sparky!uunet!gumby!destroyer!ubc-cs!alberta!arms
From: arms@cs.UAlberta.CA (Bill Armstrong)
Subject: Re: Reducing Training time vs Generalisation
Message-ID: <arms.714249520@spedden>
Sender: news@cs.UAlberta.CA (News Administrator)
Nntp-Posting-Host: spedden.cs.ualberta.ca
Organization: University of Alberta, Edmonton, Canada
References: <Bt8MI4.6Bu.1@cs.cmu.edu>
Date: Wed, 19 Aug 1992 18:38:40 GMT
Lines: 187


I agree with most of Scott Fahlman's comments about "wild"
excursions. I suppose one can dispute my insistence on referring to
this as a "safety" issue, to use a loaded "s" word, which people may
find alarmist. But I calls 'em the way I sees 'em. I don't want it
to be an issue that people can ignore because it doesn't sound
important.

Just a few comments in relation to Scott's remarks need to be made:

sef@sef-pmax.slisp.cs.cmu.edu writes:


>We went round on Bill Armstrong's dangerous example once before. Here's my
>current understanding of the situation:

>If you make a training set that looks like this,


> ..............* *.................

>and train a net with two hidden units a long time with no weight decay, you
>can get a zero-error solution with a large upward excursion between the two
>*'s. The peak of the excursion can be *much* higher than the height of the
>two "tossing points". In this case, there are also solutions that create a
>flat plateau, and these are more likely to be found by the usual learning
>algorithms.

Isn't the plateau solution a *local* minimum of error? Is this a case
where you actually should seek a local minimum rather than an absolute
minimum in training?

>If you shape the training set a bit more carefully

> * *
> .............* *.................

>and use a two-hidden unit net, you can FORCE the solution with a big
>excursion. Only the "ankle" part of the sigmoid will fit these tossing
>points. However, a net with more hidden units could again create a
>plateau, however, and this would be the more likely solution.

Same question: is this plateau a local minimum only? If you are
saying that local minima are more likely than global minima, that
contradicts a lot of people who claim not to worry about local minima
because they say it never happens to them. I think if you take enough
training points, the globally minimal solution will be unique up to
isomorphism, e.g. interchange of hidden units.
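
One way to poke at this empirically, for what it's worth: below is a
minimal sketch (my own toy numbers and code, nothing from Scott's
setup) that fits a 1-D training set shaped like his first picture with
a two-hidden-unit sigmoid net, plain gradient descent, no weight
decay, and then scans the gap between the two starred points.
Depending on the initial weights and how long you train, the scan may
show a plateau-like fit or an overshooting peak.

import numpy as np

rng = np.random.default_rng(0)

# "dots" at 0 on both sides, two "starred" points at 1 near the centre
x = np.concatenate([np.linspace(-6.0, -1.5, 10), [-0.5, 0.5], np.linspace(1.5, 6.0, 10)])
y = np.concatenate([np.zeros(10), [1.0, 1.0], np.zeros(10)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

# two hidden sigmoid units, one linear output unit
W1, b1 = rng.normal(size=2), rng.normal(size=2)
W2, b2 = rng.normal(size=2), 0.0

lr = 0.1
for step in range(20000):
    h = sigmoid(np.outer(x, W1) + b1)          # (N, 2) hidden activations
    out = h @ W2 + b2                          # (N,) network output
    g_out = 2.0 * (out - y) / len(x)           # gradient of mean squared error
    g_h = np.outer(g_out, W2) * h * (1.0 - h)  # back through the sigmoids
    W2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum()
    W1 -= lr * (x @ g_h);     b1 -= lr * g_h.sum(axis=0)

# scan the gap between the two *'s for an excursion
xs = np.linspace(-0.5, 0.5, 201)
peak = (sigmoid(np.outer(xs, W1) + b1) @ W2 + b2).max()
print("max output between the *'s:", peak, "(training targets never exceed 1.0)")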

The more hidden units and layers you have, the less transparent the
behavior of the whole net becomes. Some examples of "wild" behavior
will only appear with relatively small weights but many layers: the
linear regions of the sigmoids all lining up to produce a big excursion.

>What's happening here is that sigmoids are smooth in certain ways (bounded
>derivatives) and we're forcing an exact fit through the training points.
>So the best solution does have a big excursion in it. You often see the
>same gyrations (or worse ones) when fitting a set of points with a
>polynomial.

Excellent analogy. My gut feeling is that polynomials would be far worse
than MLPs with sigmoids, because the sigmoids don't go off to infinity.
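
A quick toy calculation shows how bad the polynomial can get (my own
numbers, not anything from the thread): interpolate a ....* *....
pattern exactly with a degree-11 polynomial and see how far the
interpolant swings away from the targets.

import numpy as np

x = np.array([-5, -4, -3, -2, -1, -0.5, 0.5, 1, 2, 3, 4, 5], dtype=float)
y = np.array([ 0,  0,  0,  0,  0,  1.0, 1.0, 0, 0, 0, 0, 0])

coeffs = np.polyfit(x, y, deg=len(x) - 1)   # exact interpolation through all 12 points
xs = np.linspace(-5, 5, 2001)
fit = np.polyval(coeffs, xs)
print("largest value of the interpolant:", fit.max())
print("most negative value:            ", fit.min())
# both lie well outside the [0, 1] range of the training targets,
# with the worst swings near the ends of the interval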

>Now from some points of view, this big excursion is a good thing. It is
>the "right" solution if you truly want to minimize training set error and
>maintain smoothness. Bill points out that this solution is not the right
>one for some other purposes. You might, for example, want to impose the
>added constraint that the output remains within -- or doesn't go too far
>beyond -- the convex hull of the training cases.

Yes. That's the idea.

>This point is hard to grasp in Bill's arguments, since he insists upon
>using loaded words like "safe" and "unsafe", but after several go-rounds
>I'm pretty sure that's what he means. I would just point out that for some
>applications, the smooth solution with big excursions might be the "safe"
>one. For example, you want to turn your airplane without snapping off the
>wings in sudden turns.

Of course. In that case, though, we should have additional training and
test points for what is clearly an interesting part of the input space,
and those will rule out the unwanted behavior whether the excursion is
what we want or what we want to avoid. The safety issue arises when the
net is forced into an excursion we are not aware of, because we did not
want it and could not test for it: there are too many points in the
input space to check.

An additional trouble is that whenever we force huge output values
from small training values, we depend a lot on the precision of the
arithmetic. I have no idea whether reduced-precision arithmetic would
make the excursions larger or smaller, but in practice we can't ignore
the issue.

>OK, suppose for a given application we do want to meet Bill's boundedness
>criterion on the outputs. There are several solutions:

>1. Sacrifice perfect fit. Weight decay does this in a natural way, finding
>a compromise between exact fit on the training set and the desire for small
>weights (or low derivatives on the output surface). The weight-decay
>parameter controls the relative weight given to fit and smallness of
>weights.

I agree this will work if the control is fine enough to reach every
point of the input space and prevent excursions, and if the fit of the
function to the spec is still good enough. I don't deny it could be
done, but is it easy, or even possible, in practice?
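
For reference, here is the penalized error I understand Scott to mean,
as a minimal sketch (the decay constant is purely illustrative, not a
recommendation from anyone in this thread):

import numpy as np

def penalized_error(outputs, targets, weights, decay=1e-3):
    """Mean squared error plus a weight-decay penalty.

    decay = 0 gives the plain fit; larger values trade training-set
    accuracy for smaller weights, i.e. smaller output derivatives.
    """
    mse = np.mean((np.asarray(outputs) - np.asarray(targets)) ** 2)
    penalty = decay * sum(np.sum(np.asarray(w) ** 2) for w in weights)
    return mse + penalty

In gradient terms the penalty just adds 2*decay*w to each weight's
gradient, shrinking every weight a little on every step; whether that
gives fine enough control over every corner of the input space is
exactly the question above.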

>2. Sacrifice smoothness. If sharp corners are OK, it is a trivial matter
>to add an extra piece of machinery that simply enforces the non-excursion
>criterion, clipping the neural net's output when it wanders outside the
>region bounded by the training set outputs.
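
Concretely, I read option 2 as a post-processing step something like
this (a sketch only; the bounds are just the extremes of the training
outputs):

import numpy as np

def clip_to_training_range(net_output, training_targets):
    # enforce the non-excursion criterion after the fact: never report
    # a value outside the range spanned by the training outputs
    lo, hi = np.min(training_targets), np.max(training_targets)
    return np.clip(net_output, lo, hi)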

If you do want some excursions and you don't want others, this won't
work. It is not a simple matter to find the problem regions and bound
them specially. I believe it is NP-complete to find them. The
plausibility of this can be argued as follows: ALNs are a special case
of MLPs; deciding whether an ALN is constant 0 or has a 1 output
somewhere is, in the worst case, CNF-satisfiability; and such a 1
output is a special case of a spike that may not be wanted.

The "spec" here is an NN that produces a constant 0. It is not a great
example, because we could achieve that differently, but maybe someone
can build on it as a basis.
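
A toy version of the worst case I have in mind (entirely my own
illustration, with a made-up little boolean net): the only general way
to certify that such a net is constantly 0 is to look at its output
over the whole input space, and that space doubles with every input.

from itertools import product

def toy_boolean_net(a, b, c, d):
    # a small, fixed AND/OR net over 4 boolean inputs (hypothetical example)
    return int((a and not b) or (c and d and not a))

n = 4
constant_zero = all(toy_boolean_net(*bits) == 0
                    for bits in product([0, 1], repeat=n))
print("constant 0?", constant_zero)   # exhaustive check: 2**n evaluations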

>3. Go to a piecewise solution. With splines, we can fit the training
>points exactly, bound the first N derivatives, and still not go on wild
>excursions, though the solution is more complex in other ways and will ner

never??

>pick up on long-range regularities in the training data. "Neural" nets
>that use radial basis units or other local-response units have the same
>character. I guess ALN's do too, though with lots of jaggy corners.

True. This is where forcing monotonicity on the ALN, when a priori
knowledge tells us the output must be monotonic, can help to capture
the long-range regularities.
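
A 1-D cartoon of my own (nothing ALN-specific in it) of why the
constraint helps: if the fitted function is forced to be nondecreasing,
its value between any two training inputs is bracketed by the fitted
values at those inputs, so no excursion can sneak in between them, and
the constraint itself carries long-range structure.

import numpy as np

def monotone_fit(train_x, train_y):
    """Crude nondecreasing fit: sort by x, then force the fitted values
    to be nondecreasing with a running maximum (not a least-squares fit)."""
    order = np.argsort(train_x)
    xs, ys = np.asarray(train_x)[order], np.asarray(train_y)[order]
    fitted = np.maximum.accumulate(ys)      # nondecreasing by construction
    def f(x):
        # piecewise-constant interpolant: value of the nearest knot to the left
        i = np.clip(np.searchsorted(xs, x, side="right") - 1, 0, len(xs) - 1)
        return fitted[i]
    return f

f = monotone_fit([0.0, 1.0, 2.0, 3.0], [0.1, 0.4, 0.3, 0.9])
print(f(1.5))   # always between the fitted values at x = 1 and x = 2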

This local vs. global question is very fundamental, and I'm grateful
that you raise the issue in this context, Scott. Maybe there are other
things that help to capture global regularities that people can think of.

The usual MLPs are very "globally" oriented, so they may be good at
capturing the global information inherent in training data. The
downside is that you can't evaluate the output using only local
information. The solution for ALNs will use monotonicity to capture
global regularities, which will result in a jaggy function like you
suggest. Then, that function will be smoothed. We *could* use a
smoothing kernel like the derivative of the usual sigmoid, or a
Gaussian, but we will use a kernel with bounded support. This keeps
the arithmetic part of the computation brief. To do that, we structure
the ALN computation not to do arithmetic, but simply to compute a
pointer to the coefficients required for computing the smoothed output
in the part of the input space where the input vector lies. Pointing
to a local approximant using ALNs puts the "missing" conditional
statements back into the NN paradigm.

In sum we get: 1. global fitting using a priori knowledge,
2. use of ALNs to compute an integer index (which they are suited to),
3. fast computation, only a small part of which is arithmetic.

So ALNs will not have to pay a time penalty like MLPs with sigmoids do
to compute smooth functions based on detection of global regularities.
(Does anyone hear soft funeral music?)

>4. Go to a higher-order solution. With extra hidden units, there will be
>solutions that fit the data but that have the extra degrees of freedom to
>meet other criteria as well. There are various ways of biasing these nets
>to favor solutions that do what you want. Basically, you build the added
>desiderata into the error function, as we do with weight decay (a bias
>toward small weights).

I don't know what you mean. I think you might find in practice that
more hidden units and layers give you more, not fewer, problems.

Thanks, Scott. I appreciate that you have succeeded in understanding
my comments on "safety", or whatever you want to call it, despite
their apparently being fairly obscure (judging by people's initial
reactions), for which I apologize.

Bill

--
***************************************************
Prof. William W. Armstrong, Computing Science Dept.
University of Alberta; Edmonton, Alberta, Canada T6G 2H1
arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071