- Newsgroups: comp.ai.neural-nets
- Path: sparky!uunet!gatech!destroyer!ubc-cs!alberta!arms
- From: arms@cs.UAlberta.CA (Bill Armstrong)
- Subject: Re: Summary of CUPS + new question
- Message-ID: <arms.716190162@spedden>
- Sender: news@cs.UAlberta.CA (News Administrator)
- Nntp-Posting-Host: spedden.cs.ualberta.ca
- Organization: University of Alberta, Edmonton, Canada
- References: <BuCFut.F6t.1@cs.cmu.edu>
- Date: Fri, 11 Sep 1992 05:42:42 GMT
- Lines: 87
-
- sef@sef1.slisp.cs.cmu.edu writes:
-
-
- > From: jjb@sequent.com (Jeff Berkowitz)
- >
- > Some weeks back I posted a request for real examples of the performance
- > of back propagation simulators in "backprops/second."  Bill Armstrong
- > at the University of Alberta, Canada pointed out in a posting that the
- > accepted unit of measurement was CUPS...
-
- >Actually, it's a bit more complicated than this. The cleanest form of the
- >backprop algorithm (called "batch" or "per-epoch" updating) does
- >forward/backprop cycles through the whole training set, accumulating dE/dw
- >for each weight. Then all the weights are updated at once. This gives you
- >the "true" error gradient, but can be slow if the training set is large and
- >redundant.
-
- >An alternative form (unfortunately, the famous Rumelhart-Hinton-Williams
- >paper tends to muddle these two forms together) is called "stochastic",
- >"online", or "per-example" updating. In this case you update the weights
- >by a small amount as each training example goes by. For very small step
- >sizes, this amounts to the same thing; if your steps get too large, it can
- >lead to trouble, since several atypical samples in a row can knock you far
- >off course.
-
- >People generally speak of CUPS (connection UPDATES per second) only in
- >connection with the second kind of algorithm. If you're doing batch
- >training, a more appropriate measure is CPS (connections per second), since
- >the update step is outside the inner loop.
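- The two schedules can be sketched on a toy linear model (a minimal
- numpy illustration; the data, learning rate and names below are
- made up for the example, not taken from any particular simulator):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # toy training set
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # noiseless targets

def grad(w, xi, yi):
    """dE/dw of the squared error E = 0.5*(w.xi - yi)**2 on one example."""
    return (xi @ w - yi) * xi

lr = 0.1

def batch_epoch(w):
    """"Per-epoch" (batch) updating: accumulate dE/dw over the whole
    training set, then apply one update -- the "true" gradient."""
    g = sum(grad(w, xi, yi) for xi, yi in zip(X, y))
    return w - lr * g / len(X)

def online_epoch(w):
    """"Per-example" (stochastic/online) updating: nudge the weights
    by a small amount as each training example goes by."""
    for xi, yi in zip(X, y):
        w = w - lr * grad(w, xi, yi)
    return w

w_batch = np.zeros(3)
w_online = np.zeros(3)
for _ in range(2000):
    w_batch = batch_epoch(w_batch)
    w_online = online_epoch(w_online)
```

- On a small noiseless linear problem like this, both schedules settle
- on the same weights; the differences Scott mentions show up with
- large step sizes or large, redundant training sets.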
-
- Thanks for your very informative reply to this question, Scott.  It
- seemed appropriate to say why, in training ALNs, we believe it is
- better to use "per example" updating.  I think analogous comments
- would apply to backprop-type nets, though of course the results
- could be better or worse than they are for ALNs.
-
- I think of the situation this way. Any change of a node function in
- an ALN tree which results in a changed signal along a path to the
- output for a certain training example automatically changes the
- heuristic responsibilities of all subtrees feeding the "non-path"
- inputs of nodes along that path. Now from that point on, the
- adaptation of nodes with changed heuristic responsibilities is
- different. So is the adaptation of nodes on the path, because they
- have different inputs.
-
- In the case of backprop, the change of a node function is replaced by
- a more frequent type of change, that of a weight. A changed weight
- also changes the signals along a path for a given example, and hence
- changes the back propagated values for many nodes. From that point
- on, the directions the state (weights) should change to reduce the
- square error may be different.
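- To see this in a throwaway two-layer sketch (invented numbers,
- sigmoid hidden units, one linear output unit; not any particular
- simulator): changing only the output weights changes the error
- signal backpropagated to the hidden layer for the very same
- training example.

```python
import numpy as np

def hidden_deltas(W1, w2, x, t):
    """Backprop deltas at the hidden layer of a one-hidden-layer net
    with sigmoid hidden units and a linear output unit."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))  # hidden activations
    y = w2 @ h                           # network output
    # dE/dh for squared error, passed through the sigmoid derivative
    return (y - t) * w2 * h * (1.0 - h)

x = np.array([1.0, -1.0])                # one training example
t = 0.5                                  # its target
W1 = np.array([[0.3, -0.2], [0.1, 0.4]])
w2 = np.array([0.7, -0.5])

before = hidden_deltas(W1, w2, x, t)
w2_updated = w2 - np.array([0.1, 0.05])  # some per-example weight update
after = hidden_deltas(W1, w2_updated, x, t)
# The backpropagated values for this same example are now different,
# so the direction in which the remaining weights "should" change
# to reduce the squared error is different too.
```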
-
- Up to this point, everything is analogous. In both cases, we can
- either freeze the heuristic responsibility for a given example (or the
- backpropagated signal) at what it was while we look at a whole epoch,
- or we can let it change.  In our experience with ALNs, it was
- beneficial not to spend the computational effort finding the best
- function change, i.e. the one that improves a whole epoch of
- patterns, but rather to keep the whole ALN changing in response to
- the latest state of our information, based on "per example" updating.
-
- Why? I'm not sure. But I conjecture it might be due to the
- desirability of avoiding local minima of error. Does one really want
- the ALN or MLP to cruise down into a local minimum and stay there?
- No. One wants a global minimum. But doing the computations of
- gradient descent more accurately, based on an entire epoch, guarantees
- that you come to rest at the local minimum of the valley you started
- in. So why not do a faster computation that has a chance of kicking
- the system out of the valley you are currently in?
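- A toy one-dimensional picture of this argument (an illustration I
- have made up, not backprop or the ALN algorithm): take a total
- error F with two valleys, written as the average of two
- "per-example" errors whose gradients disagree.

```python
import random

# Total error F: global minimum near x = -1.05, shallower local
# minimum near x = +0.93, barrier between them near x = 0.13.
def F(x):
    return (x * x - 1.0) ** 2 + 0.5 * x

def dF(x):
    return 4.0 * x ** 3 - 4.0 * x + 0.5

# Two "per-example" gradients whose average is the true gradient dF.
def g1(x): return dF(x) + 6.0
def g2(x): return dF(x) - 6.0

# Per-epoch (batch) gradient descent, started in the shallow valley:
x_batch = 1.5
for _ in range(2000):
    x_batch -= 0.01 * dF(x_batch)
# It comes to rest at the local minimum of the valley it started in.

# Per-example updating takes larger, noisier steps; a run of
# same-signed per-example gradients can kick the state over the
# barrier into the better valley (it may or may not, depending on
# the order the examples arrive in).
random.seed(1)
x_online = 1.5
for _ in range(2000):
    g = random.choice([g1, g2])
    x_online -= 0.05 * g(x_online)
```

- The batch run lands at the nearby local minimum even though a
- strictly lower valley exists; only the noisy per-example steps have
- a chance of crossing over.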
-
- I should add that there are other heuristics in the ALN algorithm
- that are not of the gradient-descent type (see the atree release 2.7
- on-line help, technical notes on the learning algorithm).  That is,
- some nodes are made responsible, and adaptations occur, even in
- cases where that could increase the error.  This is quite different
- from the approach of adding noise to kick the system out of local
- minima, because here the kick is given in a promising direction
- according to the heuristics.
-
- Bill
- --
- ***************************************************
- Prof. William W. Armstrong, Computing Science Dept.
- University of Alberta; Edmonton, Alberta, Canada T6G 2H1
- arms@cs.ualberta.ca Tel(403)492 2374 FAX 492 1071
-