- Newsgroups: comp.ai.neural-nets
- Path: sparky!uunet!utcsri!torn!newshost.uwo.ca!gaul.csd.uwo.ca!koops
- From: koops@gaul.csd.uwo.ca (Luke Koops)
- Subject: Reducing Training time (summary)
- Organization: Computer Science Dept., Univ. of Western Ontario, London, Canada
- Date: Sun, 16 Aug 1992 06:38:25 GMT
- Message-ID: <1992Aug16.063825.15300@julian.uwo.ca>
- Keywords: back propagation, training
- Sender: news@julian.uwo.ca (USENET News System)
- Nntp-Posting-Host: obelix.gaul.csd.uwo.ca
- Lines: 371
-
- A while ago I posted a request asking how I could reduce the training
- time for a back-propagation neural network, and received these helpful
- responses.
-
-
- From styri@balder.nta.no Mon Aug 10 08:00 EDT 1992
-
- One way of reducing training time is to use a faster implementation...
- There's an implementation by Dr. Donald Tveter in several archives,
- this suite of s/w includes an integer (or rather fixed point math)
- implementation of the bp algorithm. Worth trying.
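-
- [A minimal sketch of the kind of 16.16 fixed-point arithmetic such an
- integer implementation relies on; this is not taken from Dr. Tveter's
- code, and all names are illustrative:
-
- typedef long fixed;                       /* 16.16 fixed point */
- #define FIX_ONE   (1L << 16)
- #define TO_FIX(x) ((fixed)((x) * FIX_ONE))
- #define TO_FLT(x) ((double)(x) / FIX_ONE)
-
- static fixed fix_mul(fixed a, fixed b)
- {
-     /* widen the product to 64 bits, then rescale back to 16.16 */
-     return (fixed)(((long long)a * b) >> 16);
- }
-
- Weights and activations are kept as scaled integers so the inner loops of
- bp avoid floating-point operations entirely. ]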
-
- Your method of training is, however, a bit too simple. Training time
- depends on a lot of factors: the momentum and learning rate, and the
- initial weights - just to mention the most important. Just increasing
- the number of nodes in the hidden layer may even lead to a poor net.
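-
- [For reference, a minimal sketch of the plain bp update these parameters
- control; eta is the learning rate, alpha the momentum term, and the names
- are illustrative:
-
- /* One weight update step of plain bp with momentum; grad[i] holds dE/dw[i]. */
- void update_weights(float *w, float *dw, const float *grad,
-                     int nweights, float eta, float alpha)
- {
-     int i;
-     for (i = 0; i < nweights; i++) {
-         dw[i] = -eta * grad[i] + alpha * dw[i];  /* momentum reuses the last step */
-         w[i] += dw[i];
-     }
- }
- ]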
-
- ---
- Haakon Styri
- Norwegian Telecom Research
-
- [note: the number of nodes is increased until a solution is found;
- if the net overgeneralizes, that's not my problem]
-
-
-
- From peterhi@syma.sussex.ac.uk Mon Aug 10 08:31 EDT 1992
- Organization: University of Sussex
-
- Being pragmatic, I would say that you should use a binary search for the number
- of hidden nodes. Start at 5; if that trains, then try 3; if that fails, go up to
- 4, and so on (a sketch follows below). You can ask any half-decent Comp-Sci BSc
- about binary searches.
- As to the order try this heuristic...
-
- Have the training set sequenced so as to have the greatest difference in
- output and the least difference in input from one vector to the next. This is
- assuming that the training set is being accessed linearly and not at random.
- There are more incorrect orderings than there are correct ones. (By the way,
- the method of ordering the data is NP-Complete and therefore I can offer you
- no code to do this for you - sorry). If you could get me a SERC fund I'll look
- into it for you :)
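-
- [A minimal sketch of the binary search over hidden-layer sizes suggested
- above. trains_successfully() is a hypothetical routine that builds and
- trains a net with n hidden units and reports whether it converged; since
- convergence is not strictly monotone in n, treat the result as a heuristic:
-
- int smallest_hidden_layer(int lo, int hi)
- {
-     int best = -1;                       /* -1: nothing in [lo,hi] trained */
-
-     while (lo <= hi) {
-         int mid = (lo + hi) / 2;
-         if (trains_successfully(mid)) {  /* converged: try a smaller net */
-             best = mid;
-             hi = mid - 1;
-         } else {                         /* failed: need more hidden units */
-             lo = mid + 1;
-         }
-     }
-     return best;
- }
- ]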
-
- How big is the training set? Could you mail it to me?
-
- Peter Hickman
- --
- --------------------------------------- peterhi@uk.ac.sussex.syma ------------
- "More beer, more shouting, resistance is useless" - USTA bars
-
-
-
- From rilbe@elixir.e.kth.se Mon Aug 10 08:36 EDT 1992
-
- Scott E. Fahlman at Carnegie-Mellon has written a report called "An Empirical
- Study of Learning Speed in Back-Propagation Networks". It is available by ftp
- at archive.cis.ohio-state.edu under /pub/neuroprose as fahlman.quickprop-tr.ps.Z.
-
- In this report he provides training times for tight encoder problems of different
- sizes and with different parameter values.
-
- He proposes different error functions and he presents a new algorithm which he
- calls quick-prop (you may have heard of it?).
-
- I have changed the PDP bp program to use the quick-prop algorithm instead of
- plain old bp. It works just fine. Here is my changed piece of code:
-
- /* The quickprop algorithm developed by Scott Fahlman at Carnegie-Mellon. */
- change_weights_follow() {
-   register int i;
-   register float *wt, *dwt, *epi, *wi, *end, *pwi, growth;
-   float tb, dp, den;
-
-   p_css = css;                 /* save last epoch's slope sum of squares */
-   css = 0.0;
-   dp = 0.0;                    /* dot product of current and previous slopes */
-
-   link_sum();
-
-   for (i = ninputs + ndelays; i < nunits; i++) {
-     /* bias update: gradient step with momentum */
-     tb = bed[i];
-     dbias[i] = tb * bepsilon[i] + momentum * dbias[i];
-     bias[i] += dbias[i];
-     css += ((double) tb) * ((double) tb);
-     dp  += ((double) tb) * ((double) pbed[i]);
-     pbed[i] = tb;
-     bed[i] = 0.0;
-
-     wt  = weight[i];
-     dwt = dweight[i];
-     wi  = wed[i];              /* current weight error derivatives (slopes) */
-     pwi = pwed[i];             /* previous slopes */
-     wed[i]  = pwi;             /* swap the two slope buffers so next epoch */
-     pwed[i] = wi;              /* accumulates into the one cleared below */
-     epi = epsilon[i];
-     end = wt + num_weights_to[i];
-
-     for (; wt < end; wt++, dwt++, wi++, pwi++, epi++) {
-       /* quickprop step: size it from the last two slopes, capped at
-          maxgrowth times the previous step */
-       if (fabs(*wi) >= fabs(*pwi) && signbit(*wi) == signbit(*pwi))
-         growth = maxgrowth;
-       else {
-         growth = (*wi) / ((*pwi) - (*wi));
-         if (fabs(growth) > maxgrowth) growth = copysign(maxgrowth, growth);
-       }
-       *dwt = growth * (*dwt);
-       if (signbit(*wi) == signbit(*pwi)) *dwt += (*epi) * (*wi);
-       *wt += *dwt;
-       if (fabs(*wt) > maxweight) *wt = copysign(maxweight, *wt);
-       css += ((double) (*wi)) * ((double) (*wi));
-       dp  += ((double) (*wi)) * ((double) (*pwi));
-       *pwi = 0.0;              /* clear this buffer for the next epoch */
-     }
-   }
-
-   /* gcor: correlation between successive gradient directions */
-   den = p_css * css;
-   if (den > 0.0) gcor = dp / sqrt(den);
-   else gcor = 0.0;
-
-   pos_neg_constraints();
- }
-
-
-
- The variables "maxgrowth" and "maxweight" should be installed into the menu
- system of the PDP software using the "install_var" function. (Study the bp file
- to see how it is done.)
-
- Another change which is easy to implement is to add 0.1 to the sigmoid-prime
- function. This is also suggested in the report mentioned above. To do this,
- study the "compute_error" function and change "act*(1-act)" to "0.1+act*(1-act)".
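-
- [The point of the offset: act*(1-act) vanishes as a unit saturates towards
- 0 or 1, so badly wrong units can stop learning; the constant keeps some
- error signal flowing. A minimal sketch, with an illustrative name rather
- than the actual PDP identifier:
-
- /* sigmoid-prime with the 0.1 offset from Fahlman's report */
- double sigmoid_prime_offset(double act)
- {
-     return 0.1 + act * (1.0 - act);
- }
- ]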
-
- I hope you will get better performance with these changes.
- As mentioned above, the report also indicates good parameter values for
- ordinary bp which you could try out.
-
- Leif Rilbe
- rilbe@elixir.e.kth.se
-
-
-
-
- From Scott_Fahlman@SEF-PMAX.SLISP.CS.CMU.EDU Mon Aug 10 10:22 EDT 1992
-
- Luke,
-
- You may have seen this stuff already, since I recently posted a pointer to
- it. In any case, you really should know about it, given the project that
- you are doing. Your method of finding the right architecture by gradually
- increasing a single hidden layer was, as far as I know, first done by Timur
- Ash (see citations in the Cascor paper). Too bad it's near the end of your
- experiment and not the start...
-
- -- Scott
- ===========================================================================
- Scott E. Fahlman
- School of Computer Science
- Carnegie Mellon University
- 5000 Forbes Avenue
- Pittsburgh, PA 15213
-
- Internet: sef+@cs.cmu.edu
- ===========================================================================
- Public-domain simulation programs for the Quickprop, Cascade-Correlation,
- and Recurrent Cascade-Correlation learning algorithms are available via
- anonymous FTP on the Internet. This code is distributed without charge on
- an "as is" basis. There is no warranty of any kind by the authors or by
- Carnegie-Mellon University.
-
- Instructions for obtaining the code via FTP are included below. If you
- can't get it by FTP, contact me by E-mail (sef+@cs.cmu.edu) and I'll try
- *once* to mail it to you. Specify whether you want the C or Lisp version.
- If it bounces or your mailer rejects such a large message, I don't have
- time to try a lot of other delivery methods.
-
- I would appreciate hearing about any interesting applications of this code,
- and will try to help with any problems people run into. Of course, if the
- code is incorporated into any products or larger systems, I would
- appreciate an acknowledgement of where it came from.
-
- If for some reason these programs do not work for you, please contact me
- and I'll try to help. Common errors: (1) Some people don't notice that the
- symmetric sigmoid output units in cascor have a range of -0.5 to +0.5 (for
- reasons that are mostly historical). If you try to force this algorithm to
- produce an output of +1.0 or +37.3, it isn't going to work. (2) Note that
- quickprop (which is used inside of Cascade-Correlation) is designed to
- update the weights after every epoch, and it assumes that all the epochs
- are identical. If you try to run this code updating after every training
- case, you will lose badly. If you want to change the training set, it is
- important to zero out the PREV-SLOPES and DELTAS vectors, and also to
- rebuild the caches in Cascade-Correlation.
-
- HOW TO GET IT:
-
- For people (at CMU, MIT, and soon some other places) with access to the
- Andrew File System (AFS), you can access the files directly from directory
- "/afs/cs.cmu.edu/project/connect/code". This file system uses the same
- syntactic conventions as BSD Unix: case sensitive names, slashes for
- subdirectories, no version numbers, etc. The protection scheme is a bit
- different, but that shouldn't matter to people just trying to read these
- files.
-
- For people accessing these files via FTP:
-
- 1. Create an FTP connection from wherever you are to machine
- "ftp.cs.cmu.edu". The internet address of this machine is 128.2.206.173,
- for those who need it.
-
- 2. Log in as user "anonymous" with your own ID as password. You may see an
- error message that says "filenames may not have /.. in them" or something
- like that. Just ignore it.
-
- 3. Change remote directory to "/afs/cs/project/connect/code". NOTE: You
- must do this in a single operation. Some of the super directories on this
- path are protected against outside users.
-
- 4. At this point FTP should be able to get a listing of files in this
- directory with DIR and fetch the ones you want with GET. (The exact FTP
- commands you use depend on your local FTP server.)
-
- Current contents:
-
- quickprop1.lisp Original Common Lisp version of Quickprop.
- quickprop1.c C version by Terry Regier, U. Cal. Berkeley.
- cascor1.lisp Original Common Lisp version of Cascade-Correlation.
- cascor1.c C version by Scott Crowder, Carnegie Mellon
- rcc1.lisp Common Lisp version of Recurrent Cascade-Correlation.
- rcc1.c C version, trans. by Conor Doherty, Univ. Coll. Dublin
- vowel.c Code for Tony Robinson's vowel benchmark.
- am4.tar.Z Aspirin/Migraine code from MITRE.
- backprop.lisp Overlay for quickprop1.lisp. Turns it into backprop.
- ---------------------------------------------------------------------------
- Tech reports describing these algorithms can also be obtained via FTP.
- These are Postscript files, processed with the Unix compress/uncompress
- program.
-
- unix> ftp ftp.cs.cmu.edu (or 128.2.206.173)
- Name: anonymous
- Password: <your user id>
- ftp> cd /afs/cs/project/connect/tr
- ftp> binary
- ftp> get filename.ps.Z
- ftp> quit
- unix> uncompress filename.ps.Z
- unix> lpr filename.ps (or however you print postscript files)
-
- For "filename", sustitute the following:
-
- qp-tr Paper on Quickprop and other backprop speedups.
- cascor-tr Cascade-Correlation paper.
- rcc-tr Recurrent Cascade-Correlation paper.
- precision Hoehfeld-Fahlman paper on Cascade-Correlation with
- limited numerical precision.
- ---------------------------------------------------------------------------
- The following are the published conference and journal versions of the
- above (in some cases shortened and revised):
-
- Scott E. Fahlman (1988) "Faster-Learning Variations on Back-Propagation: An
- Empirical Study" in (\it Proceedings, 1988 Connectionist Models Summer
- School}, D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski (eds.),
- Morgan Kaufmann Publishers, Los Altos CA, pp. 38-51.
-
- Scott E. Fahlman and Christian Lebiere (1990) "The Cascade-Correlation
- Learning Architecture", in {\it Advances in Neural Information Processing
- Systems 2}, D. S. Touretzky (ed.), Morgan Kaufmann Publishers, Los Altos
- CA, pp. 524-532.
-
- Scott E. Fahlman (1991) "The Recurrent Cascade-Correlation Architecture" in
- {\it Advances in Neural Information Processing Systems 3}, R. P. Lippmann,
- J. E. Moody, and D. S. Touretzky (eds.), Morgan Kaufmann Publishers, Los
- Altos CA, pp. 190-196.
-
- Marcus Hoehfeld and Scott E. Fahlman (1992) "Learning with Limited
- Numerical Precision Using the Cascade-Correlation Learning Algorithm" in
- IEEE Transactions on Neural Networks, Vol. 3, no. 4, July 1992, pp.
- 602-611.
-
-
-
-
- From neal@stat.washington.edu Mon Aug 10 17:03 EDT 1992
- Organization: U. Washington Dept. of Statistics
-
- I have found that a quick speed-up, if you have floating-point values,
- is to normalize the data: compute the mean and standard deviation of
- each column and then replace each member of the column by its z-score.
- In C it would look something like this:
-
-
- for (j = 0; j < ncols; j++)
- {
-     double sum = 0.0, sumsq = 0.0;
-
-     /* column mean and (population) standard deviation */
-     for (i = 0; i < nrows; i++) {
-         sum   += observation[i][j];
-         sumsq += observation[i][j] * observation[i][j];
-     }
-     mean[j] = sum / nrows;
-     sd[j]   = sqrt(sumsq / nrows - mean[j] * mean[j]);
-
-     /* replace each value by its z-score */
-     for (i = 0; i < nrows; i++)
-     {
-         observation[i][j] = (observation[i][j] - mean[j]) / sd[j];
-     }
- }
-
- Then dump your observations into the bp nn.
-
- Good Luck.
-
- Phil Neal
-
-
-
-
- From chinet!drt@clout.chi.il.us Wed Aug 12 10:59 EDT 1992
- From: drt@chinet.chi.il.us (Donald Tveter)
-
- I've been puttering with simple backprop improvements for some time. I
- posted the source to comp.sources.misc in February and it should be available
- from any site that stores that group, or I could email you the C code (for UNIX,
- 16-bit DOS and 32-bit DOS).
-
- If you're not interested in that, here are the simplest things you can do which
- will also make a big difference:
-
- 1) fudge the derivative in the output layer. Change it from s(1-s) to 1. This
- will probably require you to use a much smaller eta and you may need an
- even smaller value for the lower layer. (This is called the differential
- step size method.)
-
- 2) Sometimes changing the activation function to 1 / (1 + exp(-Dx)) helps.
- Normally the best value for D is greater than 1, but sometimes it works
- best for D < 1. (A sketch of 1 and 2 follows this list.)
-
- 3) Add direct connections from the input layer to the output layer.
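-
- [A minimal sketch of points 1 and 2; the function names are illustrative,
- not from any particular bp implementation:
-
- #include <math.h>
-
- /* point 2: sigmoid with an adjustable sharpness D */
- double act_fn(double x, double D)
- {
-     return 1.0 / (1.0 + exp(-D * x));
- }
-
- /* point 1: fudge the output-layer derivative from s*(1-s) to 1;
-    this usually calls for a much smaller eta */
- double output_deriv(double s)
- {
-     (void) s;            /* the activation no longer enters the derivative */
-     return 1.0;
- }
- ]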
-
- Don Tveter
- drt@chinet.chi.il.us
-
-
-
-
-
- [ from Luke koops:
-
- Unfortunately, I did not take advantage of any of the aforementioned
- techniques for speeding up training, because I found a simple way to
- improve my training times: there was a vast increase in training speed
- when I switched from training by epoch to training by pattern (the two
- update schemes are sketched after this note). This improvement was
- sufficient for me.
-
- I noticed that the performance of neural networks was degraded when
- the training set became too large. The test set had a size of 1024
- patterns and the training set was chosen from this set with
- repetition. When I made a training set consisting of about 2400
- examples (representing approx. 90% unique instances of the test set),
- the backprop network failed to learn. This is a counterintuitive
- result; you would think that increasing the amount of information
- available would improve learning. I suspect that the error
- derivatives accumulated over a whole epoch of training tended to
- cancel each other out, thus creating large local minima that represent
- poor generalizations. I guess this would be an example of
- "underfitting" the data. ]
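-
- [A minimal sketch of the two update schemes contrasted above; forward(),
- backward() and apply_deltas() are hypothetical helpers, and only the
- placement of the weight update differs:
-
- void train_by_epoch(int npatterns)     /* accumulate, then update once */
- {
-     int p;
-     for (p = 0; p < npatterns; p++) {
-         forward(p);
-         backward(p);                   /* adds this pattern's derivatives */
-     }
-     apply_deltas();                    /* one weight update per epoch */
- }
-
- void train_by_pattern(int npatterns)   /* update after every pattern */
- {
-     int p;
-     for (p = 0; p < npatterns; p++) {
-         forward(p);
-         backward(p);
-         apply_deltas();
-     }
- }
- ]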
-
-
- Thanks for all your help, and interesting suggestions.
-
- -Luke Koops (koops@csd.uwo.ca)
-
-
- --
- /
- ***/
- *\/* koops@csd.uwo.ca
- ****
-