NetNews Usenet Archive 1993 #1

Internet Message Format | 1993-01-07 | 4.6 KB

Path: sparky!uunet!elroy.jpl.nasa.gov!usc!cs.utexas.edu!uwm.edu!ogicse!das-news.harvard.edu!cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!news
From: sef@sef-pmax.slisp.cs.cmu.edu
Newsgroups: comp.ai.neural-nets
Subject: Re: will Cascade Correlation work in stochastic mode?
Message-ID: <C0Gw28.M9E.1@cs.cmu.edu>
Date: 7 Jan 93 04:49:16 GMT
Article-I.D.: cs.C0Gw28.M9E.1
Sender: news@cs.cmu.edu (Usenet News System)
Organization: School of Computer Science, Carnegie Mellon
Lines: 76
Nntp-Posting-Host: sef-pmax.slisp.cs.cmu.edu

    From: ra@cs.brown.edu (Ronny Ashar)

    However, I would prefer a more robust algorithm. I was looking at Fahlman's Cascade Correlation. My impression is that Cascor needs epoch training only; it could be modified to work in stochastic mode, but, in that case, it will end up creating huge nets with redundant units. Is that correct?

Good question. The answer is a bit complicated. This probably won't make much sense to people who don't already understand the Cascor algorithm in some detail...

There is nothing inherently batch-oriented in the basic structure of Cascor. However, the Cascor code that I distribute, and that I run myself, uses Quickprop for updating the weights both in the candidate-training phase and in the output-training phase.

Quickprop, in its current form, is inherently a batch-update algorithm: you have to run the same batch of training examples through the system multiple times to get the speedup it offers. That batch does not necessarily have to include the whole original training set, however. You want enough examples to get a good, stable estimate of the gradient, but not a lot more than that. It is possible to change the training set used in quickprop from time to time, but whenever you do that you should zero out the previous-slope and previous-delta values to prevent problems.
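
To illustrate the reset described above, here is a minimal Python sketch. It is not the distributed Cascor or Quickprop code: it assumes a simplified Quickprop-style step (a secant jump limited by a growth factor, with a plain gradient step when there is no usable history), and the names QuickpropState, epsilon, mu, step and reset are hypothetical.

```python
class QuickpropState:
    """Per-weight history for a simplified Quickprop-style update (illustrative)."""

    def __init__(self, n_weights, epsilon=0.1, mu=1.75):
        self.epsilon = epsilon  # step size for the plain gradient fallback (assumed)
        self.mu = mu            # maximum growth factor between steps (assumed)
        self.prev_slope = [0.0] * n_weights
        self.prev_delta = [0.0] * n_weights

    def reset(self):
        """Zero the previous-slope and previous-delta values.

        Call this whenever the batch of training examples is swapped, so
        slopes measured on the old batch are not mixed with the new one.
        """
        n = len(self.prev_slope)
        self.prev_slope = [0.0] * n
        self.prev_delta = [0.0] * n

    def step(self, weights, slopes):
        """Return updated weights, given dE/dw summed over the current batch."""
        new_weights = []
        for i, (w, s) in enumerate(zip(weights, slopes)):
            prev_s = self.prev_slope[i]
            prev_d = self.prev_delta[i]
            denom = prev_s - s
            if prev_d != 0.0 and denom != 0.0:
                # Secant jump toward the minimum of the fitted parabola,
                # limited to mu times the size of the previous step.
                delta = s / denom * prev_d
                limit = self.mu * abs(prev_d)
                delta = max(-limit, min(limit, delta))
            else:
                # No usable history (first step, or just after reset()).
                delta = -self.epsilon * s
            self.prev_slope[i] = s
            self.prev_delta[i] = delta
            new_weights.append(w + delta)
        return new_weights
```

The first call to step() after reset() falls back to an ordinary gradient step, since there is no previous delta to scale.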
So if the problem is that your training data set is too large and redundant, a possible solution is to choose a smaller batch size, train on that, and switch the batches occasionally. (Martin Moller has a nice paper on choosing the batch size for his Scaled Conjugate Gradient algorithm in Neural Networks for Signal Processing 2, IEEE Press, 1992. Similar ideas could be used with Quickprop or Cascor.)

If you want true online updating, without storing up a batch of examples, you can use the structure of Cascor, but with the Quickprop updating replaced by stochastic backprop. If you do that, you cannot use the trick of caching the errors and unit values for all the cases in an epoch. Each new sample must be allowed to propagate through the active net. These changes will cost you a lot of speed, but you may get that back (and more) for highly redundant data sets in which per-epoch updating is terribly inefficient.

If you get impatient and quit the training phases too early, before backprop has really converged, then Cascor will indeed create too many units and generalize poorly.

When you are using stochastic backprop to do the updates, the training never reaches a truly quiescent state. The error keeps bouncing up and down as the individual samples go by. So you have to modify the quiescence tests that are used to terminate the training phases. You need to take an average over many samples, and declare the training phase over when there is no change in the average error for some period of time. Even so, there is the danger of stopping the training when the net is in a less than optimal state. It is probably best to gradually reduce the training rate to reduce these fluctuations.

Candidate training and output training can, to some extent, be overlapped, but it is probably a good idea to run the candidate training for a while after the output weights have been frozen.
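
The averaged quiescence test for stochastic updating, described above, might be sketched as follows. This is only one possible reading: the class name AveragedQuiescence and the parameters window, tolerance and patience are assumptions made for illustration, as is the choice to compare non-overlapping window averages against a reference value. Reducing the training rate over time is not shown.

```python
class AveragedQuiescence:
    """Detect quiescence from per-sample errors under stochastic updating.

    Errors are averaged over non-overlapping windows of `window` samples.
    Training is declared quiescent once `patience` consecutive window
    averages stay within a relative `tolerance` of a reference average.
    All three parameters are illustrative assumptions.
    """

    def __init__(self, window=100, tolerance=0.01, patience=5):
        self.window = window
        self.tolerance = tolerance
        self.patience = patience
        self._sum = 0.0
        self._count = 0
        self._reference = None  # last window average that counted as a change
        self._stalled = 0       # consecutive windows with no significant change

    def update(self, sample_error):
        """Feed one per-sample error; return True once training looks quiescent."""
        self._sum += sample_error
        self._count += 1
        if self._count < self.window:
            return self._stalled >= self.patience

        # A full window has been seen: compare its average to the reference.
        average = self._sum / self._count
        self._sum, self._count = 0.0, 0
        if (self._reference is None or
                abs(average - self._reference) > self.tolerance * abs(self._reference)):
            # The average moved noticeably, so training is still making progress.
            self._reference = average
            self._stalled = 0
        else:
            self._stalled += 1
        return self._stalled >= self.patience
```

In a training loop, update() would be called once per sample during a candidate-training or output-training phase, and the phase would end when it first returns True. Larger values of window and patience make the test slower but less likely to stop while the error is merely fluctuating.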
There may be a temporary, discontinuous up-tick in the network's error when you tenure a new hidden unit. If such a glitch is worrisome, you can minimize the damage by installing the new unit with a very small (or zero) output weight. This too will slow down training.

Sorry, I don't have any citations to published work on this topic. All this is based on my own experience and what others have told me informally. If anyone else knows of such citations, I'd like to hear about them.

-- Scott

===========================================================================
Scott E. Fahlman                    Internet:  sef+@cs.cmu.edu
Senior Research Scientist           Phone:     412 268-2575
School of Computer Science          Fax:       412 681-5739
Carnegie Mellon University          Latitude:  40:26:33 N
5000 Forbes Avenue                  Longitude: 79:56:48 W
Pittsburgh, PA 15213
===========================================================================