- Newsgroups: comp.ai.neural-nets
- Path: sparky!uunet!utcsri!torn!newshost.uwo.ca!gaul.csd.uwo.ca!koops
- From: koops@gaul.csd.uwo.ca (Luke Koops)
- Subject: Reducing Training time (summary)
- Organization: Computer Science Dept., Univ. of Western Ontario, London, Canada
- Date: Sun, 16 Aug 1992 06:38:25 GMT
- Message-ID: <1992Aug16.063825.15300@julian.uwo.ca>
- Keywords: back propagation, training
- Sender: news@julian.uwo.ca (USENET News System)
- Nntp-Posting-Host: obelix.gaul.csd.uwo.ca
- Lines: 371
-
- A while ago I posted a request asking how I could reduce the training
- time for a back-propagation neural network, and received these helpful
- responses.
-
-
- From styri@balder.nta.no Mon Aug 10 08:00 EDT 1992
-
- One way of reducing training time is to use a faster implementation...
- There's an implementation by Dr. Donald Tveter in several archives,
- this suite of s/w includes an integer (or rather fixed point math)
- implementation of the bp algorithm. Worth trying.
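-
- [A minimal sketch of the kind of 16.16 fixed-point arithmetic such an
- integer implementation relies on; this is not taken from Dr. Tveter's
- code, and all names are illustrative:
-
- typedef long fixed;                       /* 16.16 fixed point */
- #define FIX_ONE   (1L << 16)
- #define TO_FIX(x) ((fixed)((x) * FIX_ONE))
- #define TO_FLT(x) ((double)(x) / FIX_ONE)
-
- static fixed fix_mul(fixed a, fixed b)
- {
-     /* widen the product to 64 bits, then rescale back to 16.16 */
-     return (fixed)(((long long)a * b) >> 16);
- }
-
- Weights and activations are kept as scaled integers so the inner loops of
- bp avoid floating-point operations entirely. ]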
-
- Your method of training is, however, a bit too simple. Training time
- depends on a lot of factors: the momentum and learning rate, and the
- initial weights - just to mention the most important. Just increasing
- the number of nodes in the hidden layer may even lead to a poor net.
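-
- [For reference, a minimal sketch of the plain bp update these parameters
- control; eta is the learning rate, alpha the momentum term, and the names
- are illustrative:
-
- /* One weight update step of plain bp with momentum; grad[i] holds dE/dw[i]. */
- void update_weights(float *w, float *dw, const float *grad,
-                     int nweights, float eta, float alpha)
- {
-     int i;
-     for (i = 0; i < nweights; i++) {
-         dw[i] = -eta * grad[i] + alpha * dw[i];  /* momentum reuses the last step */
-         w[i] += dw[i];
-     }
- }
- ]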
-
- ---
- Haakon Styri
- Norwegian Telecom Research
-
- [note: the number of nodes is increased until a solution is found;
- if the net overgeneralizes, that's not my problem]
-
-
-
- From peterhi@syma.sussex.ac.uk Mon Aug 10 08:31 EDT 1992
- Organization: University of Sussex
-
- Being pragmatic, I would say that you should use a binary search for the number
- of hidden nodes. Start at 5; if that trains, then try 3; if that fails, go up to
- 4, and so on (a sketch follows below). You can ask any half-decent Comp-Sci BSc
- about binary searches.
- As to the order try this heuristic...
-
- Have the training set sequenced so as to have the greatest difference in
- output and the least difference in input from one vector to the next. This is
- assuming that the training set is being accessed linearly and not at random.
- There are more incorrect orderings than there are correct ones. (By the way,
- the method of ordering the data is NP-Complete and therefore I can offer you
- no code to do this for you - sorry). If you could get me a SERC fund I'll look
- into it for you :)
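-
- [A minimal sketch of the binary search over hidden-layer sizes suggested
- above. trains_successfully() is a hypothetical routine that builds and
- trains a net with n hidden units and reports whether it converged; since
- convergence is not strictly monotone in n, treat the result as a heuristic:
-
- int smallest_hidden_layer(int lo, int hi)
- {
-     int best = -1;                       /* -1: nothing in [lo,hi] trained */
-
-     while (lo <= hi) {
-         int mid = (lo + hi) / 2;
-         if (trains_successfully(mid)) {  /* converged: try a smaller net */
-             best = mid;
-             hi = mid - 1;
-         } else {                         /* failed: need more hidden units */
-             lo = mid + 1;
-         }
-     }
-     return best;
- }
- ]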
-
- How big is the training set? Could you mail it to me?
-
- Peter Hickman
- --
- --------------------------------------- peterhi@uk.ac.sussex.syma ------------
- "More beer, more shouting, resistance is useless" - USTA bars
-
-
-
- From rilbe@elixir.e.kth.se Mon Aug 10 08:36 EDT 1992
-
- Scott E. Fahlman at Carnegie-Mellon has written a report called "An Empirical
- Study of Learning Speed in Back-Propagation Networks". It is available by ftp
- at archive.cis.ohio-state.edu under /pub/neuroprose as fahlman.quickprop-tr.ps.Z.
-
- In this report he provides training times for tight encoder problems of different
- sizes and with different parameter values.
-
- He proposes different error functions and he presents a new algorithm which he
- calls quick-prop (you may have heard of it?).
-
- I have changed the PDP bp program to use the quick-prop algorithm instead of
- plain old bp. It works just fine. Here is my changed piece of code:
-
- /* The quickprop algorithm developed by Scott Fahlman at Carnegie-Mellon. */
- change_weights_follow() {
-   register int i;
-   register float *wt, *dwt, *epi, *wi, *end, *pwi, growth;
-   float tb, dp, den;
-
-   p_css = css;                 /* save last epoch's slope sum of squares */
-   css = 0.0;
-   dp = 0.0;                    /* dot product of current and previous slopes */
-
-   link_sum();
-
-   for (i = ninputs + ndelays; i < nunits; i++) {
-     /* bias update: gradient step with momentum */
-     tb = bed[i];
-     dbias[i] = tb * bepsilon[i] + momentum * dbias[i];
-     bias[i] += dbias[i];
-     css += ((double) tb) * ((double) tb);
-     dp  += ((double) tb) * ((double) pbed[i]);
-     pbed[i] = tb;
-     bed[i] = 0.0;
-
-     wt  = weight[i];
-     dwt = dweight[i];
-     wi  = wed[i];              /* current weight error derivatives (slopes) */
-     pwi = pwed[i];             /* previous slopes */
-     wed[i]  = pwi;             /* swap the two slope buffers so next epoch */
-     pwed[i] = wi;              /* accumulates into the one cleared below */
-     epi = epsilon[i];
-     end = wt + num_weights_to[i];
-
-     for (; wt < end; wt++, dwt++, wi++, pwi++, epi++) {
-       /* quickprop step: size it from the last two slopes, capped at
-          maxgrowth times the previous step */
-       if (fabs(*wi) >= fabs(*pwi) && signbit(*wi) == signbit(*pwi))
-         growth = maxgrowth;
-       else {
-         growth = (*wi) / ((*pwi) - (*wi));
-         if (fabs(growth) > maxgrowth) growth = copysign(maxgrowth, growth);
-       }
-       *dwt = growth * (*dwt);
-       if (signbit(*wi) == signbit(*pwi)) *dwt += (*epi) * (*wi);
-       *wt += *dwt;
-       if (fabs(*wt) > maxweight) *wt = copysign(maxweight, *wt);
-       css += ((double) (*wi)) * ((double) (*wi));
-       dp  += ((double) (*wi)) * ((double) (*pwi));
-       *pwi = 0.0;              /* clear this buffer for the next epoch */
-     }
-   }
-
-   /* gcor: correlation between successive gradient directions */
-   den = p_css * css;
-   if (den > 0.0) gcor = dp / sqrt(den);
-   else gcor = 0.0;
-
-   pos_neg_constraints();
- }
-
-
-
- The variables "maxgrowth" and "maxweight" should be installed into the menu
- system of the PDP software using the "install_var" function. (Study the bp file
- to see how it is done.)
-
- Another change which is easy to implement is to add 0.1 to the sigmoid-prime
- function. This is also suggested in the report mentioned above. To do this,
- study the "compute_error" function and change "act*(1-act)" to "0.1+act*(1-act)".
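-
- [The point of the offset: act*(1-act) vanishes as a unit saturates towards
- 0 or 1, so badly wrong units can stop learning; the constant keeps some
- error signal flowing. A minimal sketch, with an illustrative name rather
- than the actual PDP identifier:
-
- /* sigmoid-prime with the 0.1 offset from Fahlman's report */
- double sigmoid_prime_offset(double act)
- {
-     return 0.1 + act * (1.0 - act);
- }
- ]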
-
- I hope you will get better performance with these changes.
- As mentioned above, the report also indicates good parameter values for
- ordinary bp which you could try out.
-
- Leif Rilbe
- rilbe@elixir.e.kth.se
-
-
-
-
- From Scott_Fahlman@SEF-PMAX.SLISP.CS.CMU.EDU Mon Aug 10 10:22 EDT 1992
-
- Luke,
-
- You may have seen this stuff already, since I recently posted a pointer to
- it. In any case, you really should know about it, given the project that
- you are doing. Your method of finding the right architecture by gradually
- increasing a single hidden layer was, as far as I know, first done by Timur
- Ash (see citations in the Cascor paper). Too bad it's near the end of your
- experiment and not the start...
-
- -- Scott
- ===========================================================================
- Scott E. Fahlman
- School of Computer Science
- Carnegie Mellon University
- 5000 Forbes Avenue
- Pittsburgh, PA 15213
-
- Internet: sef+@cs.cmu.edu
- ===========================================================================
- Public-domain simulation programs for the Quickprop, Cascade-Correlation,
- and Recurrent Cascade-Correlation learning algorithms are available via
- anonymous FTP on the Internet. This code is distributed without charge on
- an "as is" basis. There is no warranty of any kind by the authors or by
- Carnegie-Mellon University.
-
- Instructions for obtaining the code via FTP are included below. If you
- can't get it by FTP, contact me by E-mail (sef+@cs.cmu.edu) and I'll try
- *once* to mail it to you. Specify whether you want the C or Lisp version.
- If it bounces or your mailer rejects such a large message, I don't have
- time to try a lot of other delivery methods.
-
- I would appreciate hearing about any interesting applications of this code,
- and will try to help with any problems people run into. Of course, if the
- code is incorporated into any products or larger systems, I would
- appreciate an acknowledgement of where it came from.
-
- If for some reason these programs do not work for you, please contact me
- and I'll try to help. Common errors: (1) Some people don't notice that the
- symmetric sigmoid output units in cascor have a range of -0.5 to +0.5 (for
- reasons that are mostly historical). If you try to force this algorithm to
- produce an output of +1.0 or +37.3, it isn't going to work. (2) Note that
- quickprop (which is used inside of Cascade-Correlation) is designed to
- update the weights after every epoch, and it assumes that all the epochs
- are identical. If you try to run this code updating after every training
- case, you will lose badly. If you want to change the training set, it is
- important to zero out the PREV-SLOPES and DELTAS vectors, and also to
- rebuild the caches in Cascade-Correlation.
-
- HOW TO GET IT:
-
- For people (at CMU, MIT, and soon some other places) with access to the
- Andrew File System (AFS), you can access the files directly from directory
- "/afs/cs.cmu.edu/project/connect/code". This file system uses the same
- syntactic conventions as BSD Unix: case sensitive names, slashes for
- subdirectories, no version numbers, etc. The protection scheme is a bit
- different, but that shouldn't matter to people just trying to read these
- files.
-
- For people accessing these files via FTP:
-
- 1. Create an FTP connection from wherever you are to machine
- "ftp.cs.cmu.edu". The internet address of this machine is 128.2.206.173,
- for those who need it.
-
- 2. Log in as user "anonymous" with your own ID as password. You may see an
- error message that says "filenames may not have /.. in them" or something
- like that. Just ignore it.
-
- 3. Change remote directory to "/afs/cs/project/connect/code". NOTE: You
- must do this in a single operation. Some of the super directories on this
- path are protected against outside users.
-
- 4. At this point FTP should be able to get a listing of files in this
- directory with DIR and fetch the ones you want with GET. (The exact FTP
- commands you use depend on your local FTP server.)
-
- Current contents:
-
- quickprop1.lisp Original Common Lisp version of Quickprop.
- quickprop1.c C version by Terry Regier, U. Cal. Berkeley.
- cascor1.lisp Original Common Lisp version of Cascade-Correlation.
- cascor1.c C version by Scott Crowder, Carnegie Mellon
- rcc1.lisp Common Lisp version of Recurrent Cascade-Correlation.
- rcc1.c C version, trans. by Conor Doherty, Univ. Coll. Dublin
- vowel.c Code for Tony Robinson's vowel benchmark.
- am4.tar.Z Aspirin/Migraine code from MITRE.
- backprop.lisp Overlay for quickprop1.lisp. Turns it into backprop.
- ---------------------------------------------------------------------------
- Tech reports describing these algorithms can also be obtained via FTP.
- These are Postscript files, processed with the Unix compress/uncompress
- program.
-
- unix> ftp ftp.cs.cmu.edu (or 128.2.206.173)
- Name: anonymous
- Password: <your user id>
- ftp> cd /afs/cs/project/connect/tr
- ftp> binary
- ftp> get filename.ps.Z
- ftp> quit
- unix> uncompress filename.ps.Z
- unix> lpr filename.ps (or however you print postscript files)
-
- For "filename", sustitute the following:
-
- qp-tr Paper on Quickprop and other backprop speedups.
- cascor-tr Cascade-Correlation paper.
- rcc-tr Recurrent Cascade-Correlation paper.
- precision Hoehfeld-Fahlman paper on Cascade-Correlation with
- limited numerical precision.
- ---------------------------------------------------------------------------
- The following are the published conference and journal versions of the
- above (in some cases shortened and revised):
-
- Scott E. Fahlman (1988) "Faster-Learning Variations on Back-Propagation: An
- Empirical Study" in (\it Proceedings, 1988 Connectionist Models Summer
- School}, D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski (eds.),
- Morgan Kaufmann Publishers, Los Altos CA, pp. 38-51.
-
- Scott E. Fahlman and Christian Lebiere (1990) "The Cascade-Correlation
- Learning Architecture", in {\it Advances in Neural Information Processing
- Systems 2}, D. S. Touretzky (ed.), Morgan Kaufmann Publishers, Los Altos
- CA, pp. 524-532.
-
- Scott E. Fahlman (1991) "The Recurrent Cascade-Correlation Architecture" in
- {\it Advances in Neural Information Processing Systems 3}, R. P. Lippmann,
- J. E. Moody, and D. S. Touretzky (eds.), Morgan Kaufmann Publishers, Los
- Altos CA, pp. 190-196.
-
- Marcus Hoehfeld and Scott E. Fahlman (1992) "Learning with Limited
- Numerical Precision Using the Cascade-Correlation Learning Algorithm" in
- IEEE Transactions on Neural Networks, Vol. 3, no. 4, July 1992, pp.
- 602-611.
-
-
-
-
- From neal@stat.washington.edu Mon Aug 10 17:03 EDT 1992
- Organization: U. Washington Dept. of Statistics
-
- I have found that a quick speed-up, if you have floating-point values,
- is to normalize the data: compute the mean and standard deviation of
- each column and then replace each member of the column by its z-score.
- In C it would look something like this:
-
-
- for (j = 0; j < ncols; j++)
- {
-     double sum = 0.0, sumsq = 0.0;
-
-     /* column mean and (population) standard deviation */
-     for (i = 0; i < nrows; i++) {
-         sum   += observation[i][j];
-         sumsq += observation[i][j] * observation[i][j];
-     }
-     mean[j] = sum / nrows;
-     sd[j]   = sqrt(sumsq / nrows - mean[j] * mean[j]);
-
-     /* replace each value by its z-score */
-     for (i = 0; i < nrows; i++)
-     {
-         observation[i][j] = (observation[i][j] - mean[j]) / sd[j];
-     }
- }
-
- Then dump your observations into the bp nn.
-
- Good Luck.
-
- Phil Neal
-
-
-
-
- From chinet!drt@clout.chi.il.us Wed Aug 12 10:59 EDT 1992
- From: drt@chinet.chi.il.us (Donald Tveter)
-
- I've been puttering with simple backprop improvements for some time. I
- posted the source to comp.sources.misc in February and it should be available
- from any site that stores that group, or I could email you the C code (for UNIX,
- 16-bit DOS and 32-bit DOS).
-
- If you're not interested in that, here are the simplest things you can do which
- will also make a big difference:
-
- 1) fudge the derivative in the output layer. Change it from s(1-s) to 1. This
- will probably require you to use a much smaller eta and you may need an
- even smaller value for the lower layer. (This is called the differential
- step size method.)
-
- 2) Sometimes changing the activation function to 1 / (1 + exp(-Dx)) helps.
- Normally the best value for D is greater than 1, but sometimes it works
- best for D < 1. (A sketch of 1 and 2 follows this list.)
-
- 3) Add direct connections from the input layer to the output layer.
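-
- [A minimal sketch of points 1 and 2; the function names are illustrative,
- not from any particular bp implementation:
-
- #include <math.h>
-
- /* point 2: sigmoid with an adjustable sharpness D */
- double act_fn(double x, double D)
- {
-     return 1.0 / (1.0 + exp(-D * x));
- }
-
- /* point 1: fudge the output-layer derivative from s*(1-s) to 1;
-    this usually calls for a much smaller eta */
- double output_deriv(double s)
- {
-     (void) s;            /* the activation no longer enters the derivative */
-     return 1.0;
- }
- ]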
-
- Don Tveter
- drt@chinet.chi.il.us
-
-
-
-
-
- [ from Luke koops:
-
- Unfortunately, I did not take advantage of any of the aforementioned
- techniques for speeding up training, because I found a simple way to
- improve my training times: there was a vast increase in training speed
- when I switched from training by epoch to training by pattern (the two
- update schemes are sketched after this note). This improvement was
- sufficient for me.
-
- I noticed that the performance of neural networks was degraded when
- the training set became too large. The test set had a size of 1024
- patterns and the training set was chosen from this set with
- repetition. When I made a training set consisting of about 2400
- examples (representing approx. 90% unique instances of the test set),
- the backprop network failed to learn. This is a counterintuitive
- result; you would think that increasing the amount of information
- available would improve learning. I suspect that the error
- derivatives accumulated over a whole epoch of training tended to
- cancel each other out, thus creating large local minima that represent
- poor generalizations. I guess this would be an example of
- "underfitting" the data. ]
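-
- [A minimal sketch of the two update schemes contrasted above; forward(),
- backward() and apply_deltas() are hypothetical helpers, and only the
- placement of the weight update differs:
-
- void train_by_epoch(int npatterns)     /* accumulate, then update once */
- {
-     int p;
-     for (p = 0; p < npatterns; p++) {
-         forward(p);
-         backward(p);                   /* adds this pattern's derivatives */
-     }
-     apply_deltas();                    /* one weight update per epoch */
- }
-
- void train_by_pattern(int npatterns)   /* update after every pattern */
- {
-     int p;
-     for (p = 0; p < npatterns; p++) {
-         forward(p);
-         backward(p);
-         apply_deltas();
-     }
- }
- ]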
-
-
- Thanks for all your help, and interesting suggestions.
-
- -Luke Koops (koops@csd.uwo.ca)
-
-
- --
- /
- ***/
- *\/* koops@csd.uwo.ca
- ****
-