- Newsgroups: comp.ai.neural-nets
- Path: sparky!uunet!gatech!destroyer!ubc-cs!alberta!arms
- From: arms@cs.UAlberta.CA (Bill Armstrong)
- Subject: Re: Summary of CUPS + new question
- Message-ID: <arms.716190162@spedden>
- Sender: news@cs.UAlberta.CA (News Administrator)
- Nntp-Posting-Host: spedden.cs.ualberta.ca
- Organization: University of Alberta, Edmonton, Canada
- References: <BuCFut.F6t.1@cs.cmu.edu>
- Date: Fri, 11 Sep 1992 05:42:42 GMT
- Lines: 87
-
- sef@sef1.slisp.cs.cmu.edu writes:
-
-
- > From: jjb@sequent.com (Jeff Berkowitz)
- >
- > Some weeks back I posted a request for real examples of the performance
- > of back propagation simulators in "backprops/second."  Bill Armstrong
- > at the University of Alberta, Canada pointed out in a posting that the
- > accepted unit of measurement was CUPS...
-
- >Actually, it's a bit more complicated than this. The cleanest form of the
- >backprop algorithm (called "batch" or "per-epoch" updating) does
- >forward/backprop cycles through the whole training set, accumulating dE/dw
- >for each weight. Then all the weights are updated at once. This gives you
- >the "true" error gradient, but can be slow if the training set is large and
- >redundant.
-
- >An alternative form (unfortunately, the famous Rumelhart-Hinton-Williams
- >paper tends to muddle these two forms together) is called "stochastic",
- >"online", or "per-example" updating. In this case you update the weights
- >by a small amount as each training example goes by. For very small step
- >sizes, this amounts to the same thing; if your steps get too large, it can
- >lead to trouble, since several atypical samples in a row can knock you far
- >off course.
-
- >People generally speak of CUPS (connection UPDATES per second) only in
- >connection with the second kind of algorithm. If you're doing batch
- >training, a more appropriate measure is CPS (connections per second), since
- >the update step is outside the inner loop.
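- The two schedules can be sketched on a toy linear model (a minimal
- numpy illustration; the data, learning rate and names below are
- made up for the example, not taken from any particular simulator):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # toy training set
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # noiseless targets

def grad(w, xi, yi):
    """dE/dw of the squared error E = 0.5*(w.xi - yi)**2 on one example."""
    return (xi @ w - yi) * xi

lr = 0.1

def batch_epoch(w):
    """"Per-epoch" (batch) updating: accumulate dE/dw over the whole
    training set, then apply one update -- the "true" gradient."""
    g = sum(grad(w, xi, yi) for xi, yi in zip(X, y))
    return w - lr * g / len(X)

def online_epoch(w):
    """"Per-example" (stochastic/online) updating: nudge the weights
    by a small amount as each training example goes by."""
    for xi, yi in zip(X, y):
        w = w - lr * grad(w, xi, yi)
    return w

w_batch = np.zeros(3)
w_online = np.zeros(3)
for _ in range(2000):
    w_batch = batch_epoch(w_batch)
    w_online = online_epoch(w_online)
```

- On a small noiseless linear problem like this, both schedules settle
- on the same weights; the differences Scott mentions show up with
- large step sizes or large, redundant training sets.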
-
- Thanks for your very informative reply to this question, Scott.  It
- seemed appropriate to say why, in training ALNs, we believe it is
- better to use "per example" updating.  I think analogous comments
- would apply to backprop-type nets, though of course the results
- could be better or worse than they are for ALNs.
-
- I think of the situation this way. Any change of a node function in
- an ALN tree which results in a changed signal along a path to the
- output for a certain training example automatically changes the
- heuristic responsibilities of all subtrees feeding the "non-path"
- inputs of nodes along that path. Now from that point on, the
- adaptation of nodes with changed heuristic responsibilities is
- different. So is the adaptation of nodes on the path, because they
- have different inputs.
-
- In the case of backprop, the change of a node function is replaced by
- a more frequent type of change, that of a weight. A changed weight
- also changes the signals along a path for a given example, and hence
- changes the back propagated values for many nodes. From that point
- on, the directions the state (weights) should change to reduce the
- square error may be different.
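- To see this in a throwaway two-layer sketch (invented numbers,
- sigmoid hidden units, one linear output unit; not any particular
- simulator): changing only the output weights changes the error
- signal backpropagated to the hidden layer for the very same
- training example.

```python
import numpy as np

def hidden_deltas(W1, w2, x, t):
    """Backprop deltas at the hidden layer of a one-hidden-layer net
    with sigmoid hidden units and a linear output unit."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))  # hidden activations
    y = w2 @ h                           # network output
    # dE/dh for squared error, passed through the sigmoid derivative
    return (y - t) * w2 * h * (1.0 - h)

x = np.array([1.0, -1.0])                # one training example
t = 0.5                                  # its target
W1 = np.array([[0.3, -0.2], [0.1, 0.4]])
w2 = np.array([0.7, -0.5])

before = hidden_deltas(W1, w2, x, t)
w2_updated = w2 - np.array([0.1, 0.05])  # some per-example weight update
after = hidden_deltas(W1, w2_updated, x, t)
# The backpropagated values for this same example are now different,
# so the direction in which the remaining weights "should" change
# to reduce the squared error is different too.
```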
-
- Up to this point, everything is analogous. In both cases, we can
- either freeze the heuristic responsibility for a given example (or the
- backpropagated signal) at what it was while we look at a whole epoch,
- or we can let it change.  In our experience with ALNs, it was
- beneficial not to spend the computational effort finding the best
- function change, i.e. the one that improves a whole epoch of
- patterns, but rather to keep the whole ALN changing in response to
- the latest state of our information, based on "per example" updating.
-
- Why? I'm not sure. But I conjecture it might be due to the
- desirability of avoiding local minima of error. Does one really want
- the ALN or MLP to cruise down into a local minimum and stay there?
- No. One wants a global minimum. But doing the computations of
- gradient descent more accurately, based on an entire epoch, guarantees
- that you come to rest at the local minimum of the valley you started
- in. So why not do a faster computation that has a chance of kicking
- the system out of the valley you are currently in?
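- A toy one-dimensional picture of this argument (an illustration I
- have made up, not backprop or the ALN algorithm): take a total
- error F with two valleys, written as the average of two
- "per-example" errors whose gradients disagree.

```python
import random

# Total error F: global minimum near x = -1.05, shallower local
# minimum near x = +0.93, barrier between them near x = 0.13.
def F(x):
    return (x * x - 1.0) ** 2 + 0.5 * x

def dF(x):
    return 4.0 * x ** 3 - 4.0 * x + 0.5

# Two "per-example" gradients whose average is the true gradient dF.
def g1(x): return dF(x) + 6.0
def g2(x): return dF(x) - 6.0

# Per-epoch (batch) gradient descent, started in the shallow valley:
x_batch = 1.5
for _ in range(2000):
    x_batch -= 0.01 * dF(x_batch)
# It comes to rest at the local minimum of the valley it started in.

# Per-example updating takes larger, noisier steps; a run of
# same-signed per-example gradients can kick the state over the
# barrier into the better valley (it may or may not, depending on
# the order the examples arrive in).
random.seed(1)
x_online = 1.5
for _ in range(2000):
    g = random.choice([g1, g2])
    x_online -= 0.05 * g(x_online)
```

- The batch run lands at the nearby local minimum even though a
- strictly lower valley exists; only the noisy per-example steps have
- a chance of crossing over.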
-
- I should add that there are other heuristics in the ALN algorithm
- that are not of the gradient-descent type (see the atree release 2.7
- on-line help, technical notes on the learning algorithm).  That is,
- some nodes are made responsible, and adaptations occur, even in
- cases where that could increase the error.  This is quite different
- from the approach of adding noise to kick the system out of local
- minima, because here the kick is given in a promising direction
- according to the heuristics.
-
- Bill
- --
- ***************************************************
- Prof. William W. Armstrong, Computing Science Dept.
- University of Alberta; Edmonton, Alberta, Canada T6G 2H1
- arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071
-