NetNews Usenet Archive 1992 #18

Text File | 1992-08-19 | 7.9 KB | 170 lines

Newsgroups: comp.ai.neural-nets
Path: sparky!uunet!usc!sol.ctr.columbia.edu!destroyer!ubc-cs!unixg.ubc.ca!kakwa.ucs.ualberta.ca!alberta!arms
From: arms@cs.UAlberta.CA (Bill Armstrong)
Subject: Re: Reducing Training time vs Generalisation
Message-ID: <arms.714289771@spedden>
Sender: news@cs.UAlberta.CA (News Administrator)
Nntp-Posting-Host: spedden.cs.ualberta.ca
Organization: University of Alberta, Edmonton, Canada
References: <Bt9GIx.9In.1@cs.cmu.edu>
Date: Thu, 20 Aug 1992 05:49:31 GMT
Lines: 157

sef@sef-pmax.slisp.cs.cmu.edu writes:

>   >If you shape the training set a bit more carefully
>   >
>   >             * *
>   >
>   >.............* *.................
>   >
>   >and use a two-hidden unit net, you can FORCE the solution with a big
>   >excursion.  Only the "ankle" part of the sigmoid will fit these tossing
>   >points.  However, a net with more hidden units could again create a
>   >plateau, and this would be the more likely solution.
>
>   Same question: is this plateau a local minimum only?
>
>Well, one ugly solution uses four hidden units and gets up to the plateau
>in two steps, and down in two more.  Again, zero error, so the solution is
>one of the global minima.

The plateau solution might not be a global minimum if we just specify
the values along the ....... part to fit the sigmoids perfectly only if
there is a peak.

>   The more hidden units and layers you have, the less transparent the
>   behavior of the whole net becomes.  Some examples of "wild" behavior
>   will only appear with relatively small weights, but lots of layers:
>   linear regions of sigmoids all lining up to produce a big excursion.
>
>Sure, but they will have no incentive to do this unless the data, in some
>sense, forces them to.  You could always throw a narrow Gaussian unit into
>the net, slide it over between any two training points, and give it an
>output weight of 10^99.  But it would be wrong.

A good reason not to use such units, eh!

>   >1. Sacrifice perfect fit.
>   >Weight decay does this in a natural way, finding
>   >a compromise between exact fit on the training set and the desire for small
>   >weights (or low derivatives on the output surface).  The weight-decay
>   >parameter controls the relative weight given to fit and smallness of
>   >weights.
>
>   I agree this will work if you can get fine enough control that reaches
>   to every point of the input space to prevent excursions, and the
>   fitting of the function to the spec is good enough.  Though I don't
>   deny it could be done, is this easy, or even possible to do in practice?
>
>You don't need fine control.  If you just crank some weight decay into the
>error measure, it will keep the net from making wild excursions without
>being forced to by the data.  For example, in the example about the big
>Gaussian spike, it would drive the output weight to zero if the Gaussian is
>not helping to fit the data.

I seem to recall going over this before, and I believe what is required
to upset the scheme is to have a lot of training points which force
training to fit a solution having a peak.  I.e. if six points aren't
enough, take as many as are required to overwhelm the weight decay.

I suppose you can make the weight decay greater as the number of
training points increases to make it so the weight decay can't be
overwhelmed.  But I think you would lose some genuine peaks which
weren't represented by their fair share of points in the training set.

Isn't this true: unless your training samples are well distributed, you
tend to wipe out features of less well sampled parts of the space?  You
might say this is what you want.  A little control might be desirable.

>   >2. Sacrifice smoothness.  If sharp corners are OK, it is a trivial matter
>   >to add an extra piece of machinery that simply enforces the non-excursion
>   >criterion, clipping the neural net's output when it wanders outside the
>   >region bounded by the training set outputs.
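
The clipping "machinery" in point 2 can be sketched in a few lines.  This
is only an illustration for a single scalar output; the function name
make_output_clipper and the sample values are hypothetical and do not
come from either poster's software.

```python
def make_output_clipper(training_outputs):
    """Return a function that clips a scalar net output to the range
    spanned by the outputs seen during training."""
    lo = min(training_outputs)
    hi = max(training_outputs)

    def clip(y):
        # Values inside [lo, hi] pass through unchanged.
        return max(lo, min(hi, y))

    return clip


# Hypothetical training targets; raw outputs below are made-up examples.
clip = make_output_clipper([0.0, 0.0, 1.0, 1.0, 0.0])
print(clip(7.3))   # 1.0  (excursion above the training range is cut off)
print(clip(-2.0))  # 0.0  (excursion below the training range is cut off)
print(clip(0.4))   # 0.4  (inside the range, left alone)
```

The clip bounds the output but leaves any value inside [lo, hi]
untouched, so it cannot tell a wanted in-range feature from an unwanted
one.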
>
>   If you do want some excursions and you don't want others, this won't
>   work.  It is not a simple matter to find the problem regions and bound
>   them specially.  I believe it is NP complete to find them.  The
>   plausibility of this can be argued as follows: ALNs are a special case
>   of MLPs; you can't tell if an ALN is constant 0 or has some 1 output
>   somewhere (worst case, CNF-satisfiability); this is a special case of
>   a spike that may not be wanted.
>
>Huh???  It's trivial to put a widget on each output line that clips the
>output to lie between all those seen during training.

Sure.

>It's a bit harder,
>but still not NP complete, to find the convex hull of outputs and clip to
>that.

Fine.  Unfortunately in both cases above, there can still be excursions
in the convex hull that you don't want and can't prevent this way.

>Now, if you want some excursions and not others, you'd better tell
>the net or the net designer about this -- it's a bit unreasonable to expect
>a backprop net to read your mind.

OK -- this is close to what I meant by "fine control".

>   The usual MLPs are very "globally" oriented, so they may be good at
>   capturing the global information inherent in training data.  The
>   downside is that you can't evaluate the output just taking local
>   information into account...
>   (Does anyone hear soft funeral music?)
>
>Well, since you keep pounding on this, I will point out that in most
>backprop-style nets after training, almost all of the hidden units are
>saturated almost all of the time.  So you can replace them with sharp
>thresholds and use the same kind of lazy evaluation at runtime that you
>propose for ALNs: work backwards through the tree and don't evaluate any
>input sub-trees that can't alter the current unit's state.

You have a lot more experience than I do with sigmoid type nets, so what
you have just said is extremely significant, in that you are coming
closer all the time to a logical net.
If you are able to replace sigmoids with sharp thresholds, and not
change the output of the net significantly, then you are really using
threshold *logic* nets.

Now let's see what it takes to get lazy evaluation: first of all, I
think you would have to insist that the sign of all weights on an
element be positive, and all signals in the net too.  Otherwise in
forming a weighted sum of inputs, you can not be sure you are on one
side of the sharp threshold or not until you have evaluated all inputs
(not lazy!).  I think the signals would have to be bounded too.

I think this would be OK.  ALNs are still faster, because they don't do
arithmetic, but ALNs don't have as powerful nodes.

One argument for going whole hog into ALNs is that you don't have to
train using sigmoids, then risk damaging the result of learning by going
to sharp thresholds.  If there were a training procedure for networks of
the above kind of node with a sharp threshold, that would be very
promising.  I thought backprop required differentiability to work
though.

>Myself, I prefer to think in terms of parallel hardware, so lazy evaluation
>isn't an issue.

Not true!  If you have a fixed amount of hardware, then to do large
problems, you will have to iterate it.  Lazy evaluation allows one to
suppress iterations because you don't need certain inputs, so laziness
is still very useful.  The speedup factor compared to complete
evaluation grows with the size of the problem.

>Yes, sigmoid unit hardware is a bit more expensive to
>implement than simple gates, but I don't need nearly as many of them.

Hardly seems worth while to keep sigmoids if almost all of your units
are almost always saturated though.  ALNs may have to train with lots of
nodes, but after training, we collapse entire subtrees of adaptive nodes
into just a single transistor for execution.

Thanks.

Bill
--
***************************************************
Prof. William W. Armstrong, Computing Science Dept.
University of Alberta; Edmonton, Alberta, Canada T6G 2H1
arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071
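
The lazy-evaluation condition discussed in the post (positive weights,
bounded signals, sharp threshold) can be sketched roughly as below.  This
is an illustration under stated assumptions, not ALN code and not any
backprop package: weights are assumed nonnegative, signals are assumed to
lie in [0, 1], and the name lazy_threshold_unit and the use of zero-
argument callables for inputs are hypothetical choices.

```python
def lazy_threshold_unit(weights, input_thunks, threshold):
    """Evaluate a sharp-threshold unit, skipping inputs that cannot
    change the result.

    Assumes every weight is >= 0 and every input signal is in [0, 1].
    Each entry of input_thunks is a zero-argument callable standing in
    for an input sub-tree.  Returns (output, inputs_evaluated).
    """
    partial = 0.0                 # weighted sum of inputs evaluated so far
    remaining_max = sum(weights)  # most the unevaluated inputs could add
    evaluated = 0
    for w, thunk in zip(weights, input_thunks):
        if partial >= threshold:
            # Remaining terms are >= 0, so the unit must fire.
            return 1, evaluated
        if partial + remaining_max < threshold:
            # Even with all remaining inputs at 1 the unit cannot fire.
            return 0, evaluated
        partial += w * thunk()
        evaluated += 1
        remaining_max -= w
    return (1 if partial >= threshold else 0), evaluated


# Made-up weights, inputs, and threshold for illustration.
print(lazy_threshold_unit([3.0, 2.0, 1.0], [lambda: 1.0] * 3, 2.5))
# (1, 1): the first input alone already crosses the threshold
print(lazy_threshold_unit([3.0, 2.0, 1.0],
                          [lambda: 0.0, lambda: 0.0, lambda: 1.0], 2.5))
# (0, 2): after two zeros the last input cannot reach the threshold
```

With a negative weight or an unbounded signal neither early exit is
valid, since an unevaluated input could still push the sum across the
threshold in either direction.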