NetNews Usenet Archive 1992 #19

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #19 / NN_1992_19.iso / spool / comp / sources / misc / 3878 < prev next >

Wrap

Text File | 1992-09-02 | 50.6 KB | 1,164 lines

Newsgroups: comp.sources.misc Path: sparky!kent From: drt@chinet.chi.il.us (Donald Tveter) Subject: v31i129: backprop - Fast Backpropagation, Part01/04 Message-ID: <csm-v31i129=backprop.130027@sparky.IMD.Sterling.COM> Followup-To: comp.sources.d X-Md4-Signature: 412f11eddccad5739dd0a89dffa28abf Sender: kent@sparky.imd.sterling.com (Kent Landfield) Organization: Sterling Software Date: Wed, 2 Sep 1992 18:06:40 GMT Approved: kent@sparky.imd.sterling.com Lines: 1150 Submitted-by: drt@chinet.chi.il.us (Donald Tveter) Posting-number: Volume 31, Issue 129 Archive-name: backprop/part01 Supersedes: backprop: Volume 28, Issue 63-66 Environment: UNIX, DOS The programs described below were produced for my own use in studying back-propagation and for doing some examples that are found in my introduction to Artificial Intelligence textbook, The Basis of Artificial Intelligence, to be published by Computer Science Press. (I hope some time before the sun burns out or before the Cubs win the World Series or before Congress balances the budget or ... .) I have copyrighted these files but I hereby give permission to anyone to use them for experimentation, educational purposes or to redistribute them on a not for profit basis. All others that want to use these programs for business or commercial purposes I will charge you a small fee. You should contact me by mail at the address in the readme. Changes/Additions vs. the February 1992 Version The programs now have the quickprop algorithm included with a few experimental modifications of this algorithm as well. Also I would be interested in hearing from anyone with suggestions, bug reports and major successes or failures. (Caution though: email in and out of the whole Chicago area is often unreliable. Also from time to time I get email and reply but the reply bounces. If its really important include a physical mail address.) There are four simulators that can be constructed from the included files. The program, rbp, does back-propagation using real weights and arithmetic. The program, bp, does back-propagation using 16-bit integer weights, 16 and 32-bit integer arithmetic and some floating point arithmetic. The program, sbp, uses 16-bit integer symmetric weights but only allows two-layer networks. The program srbp does the same using real weights. The purpose of sbp and srbp is to produce networks that can be used with the Boltzman machine relaxation algorithm (not included). In general the 16-bit integer programs are the most useful because they are the fastest. Unfortunately, sometimes 16-bit integer weights don't have enough range or precision and then using the floating point versions may be necessary. Many other speed-up techniques are included in these programs. See the file readme for more information. Dr. Donald R. Tveter 5228 N. Nashville Ave. Chicago, Illinois 60656 USENET: drt@chinet.chi.il.us ------ #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh <file", e.g.. If this archive is complete, you # will see the following message at the end: # "End of archive 1 (of 4)." # Contents: readme # Wrapped by drt@chinet on Mon Aug 24 08:57:28 1992 PATH=/bin:/usr/bin:/usr/ucb ; export PATH if test -f 'readme' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'readme'\" else echo shar: Extracting \"'readme'\" $46585 characters$ sed "s/^X//" >'readme' <<'END_OF_FILE' XFast Back-Propagation XCopyright (c) 1990, 1991, 1992 by Donald R. Tveter X X X1. Introduction X X The programs described below were produced for my own use in studying Xback-propagation and for doing some examples that are found in my Xintroduction to Artificial Intelligence textbook, The Basis of XArtificial Intelligence, to be published by Computer Science Press. (I Xhope some time before the sun burns out or before the Cubs win the World XSeries or before Congress balances the budget or ... .) I have Xcopyrighted these files but I hereby give permission to anyone to use Xthem for experimentation, educational purposes or to redistribute them Xon a not for profit basis. All others that want to use these programs Xfor business or commercial purposes I will charge you a small fee. You Xshould contact me by mail at: X XDr. Donald R. Tveter X5228 N. Nashville Ave. XChicago, Illinois 60656 XUSENET: drt@chinet.chi.il.us X XAlso I would be interested in hearing from anyone with suggestions, Xbug reports and major successes or failures. (Caution though: email Xin and out of the whole Chicago area is often unreliable. Also from Xtime to time I get email and reply but the reply bounces. If its Xreally important include a physical mail address.) X X Note: this is use at your own risk software: there is no guarantee Xthat it is bug-free. Use of this software constitutes acceptance for Xuse in an as is condition. There are no warranties with regard to this Xsoftware. In no event shall the author be liable for any damages Xwhatsoever arising out of or in connection with the use or performance Xof this software. X X There are four simulators that can be constructed from the included Xfiles. The program, rbp, does back-propagation using real weights and Xarithmetic. The program, bp, does back-propagation using 16-bit integer Xweights, 16 and 32-bit integer arithmetic and some floating point Xarithmetic. The program, sbp, uses 16-bit integer symmetric weights but Xonly allows two-layer networks. The program srbp does the same using Xreal weights. The purpose of sbp and srbp is to produce networks that Xcan be used with the Boltzman machine relaxation algorithm (not Xincluded). X X In general the 16-bit integer programs are the most useful because Xthey are the fastest. Unfortunately, sometimes 16-bit integer weights Xdon't have enough range or precision and then using the floating point Xversions may be necessary. Many other speed-up techniques are included Xin these programs. X XChanges/Additions vs. the February 1992 Version X X The programs now have the quickprop algorithm included with a few Xexperimental modifications of this algorithm as well. X X2. Making the Simulators X X See the file, readme.mak. (This file is getting too large for some Xmailers.) X X3. A Simple Example X X Each version would normally be called with the name of a file to read Xcommands from, as in: X Xbp xor X XWhen no file name is specified bp will take commands from the keyboard X(UNIX stdin file). After the data file is read commands are then taken Xfrom the keyboard. X X The commands are one letter commands and most of them have optional Xparameters. The `A', `B', `d' and `f' commands allow a number of Xsub-commands on a line. The maximum length of any line is 256 Xcharacters. An `*' is a comment and it can be used to make the Xremainder of the line a comment. Here is an example of a data file to Xdo the xor problem: X X* input file for the xor problem X Xm 2 1 1 * make a 2-1-1 network Xc 1 1 3 1 * add this extra connection Xc 1 2 3 1 * add this extra connection Xs 7 * seed the random number function Xk 0 1 * give the network random weights X Xn 4 * read four new patterns into memory X1 0 1 X0 0 0 X0 1 1 X1 1 0 X Xe 0.5 * set eta to 0.5 (and eta2 to 0.5) Xa 0.9 * set alpha to 0.9 X XFirst in this example, the m command will make a network with 2 units Xin the input layer, 1 unit in the second layer and 1 unit in the third Xlayer. The following c commands create extra connections from layer 1 Xunit 1 to layer 3 unit 1 and from layer 1 unit 2 to layer 3 unit 1. The X`s' command sets the seed for the random number function. The `k' Xcommand then gives the network random weights. The `k' command has Xanother use as well. It can be used to try to kick a network out of a Xlocal minimum. Here, the meaning of "k 0 1" is to examine all the Xweights in the network and for every weight equal to 0 (and they all Xstart out at 0), add in a random number between -1 and +1. The `n' Xcommand specifies four new patterns to be read into memory. With the X`n' command, any old patterns that may have been present are removed. XThere is also an `x' command that behaves like the `n' command, except Xthe `x' commands ADDS the extra patterns to the current training Xset. The input pattern comes first followed by the output pattern. XThe statement, e 0.5, sets eta, the learning rate for the upper layer to X0.5 and eta2 for the lower layers to 0.5 as well. The last line sets Xalpha, the momentum parameter, to 0.9. X X After these commands are executed the following messages and prompt Xappears: X XFast Back-Propagation Copyright (c) 1990, 1991, 1992 by Donald R. Tveter Xtaking commands from stdin now X[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? X XThe characters within the square brackets are a list of the possible Xcommands. To run 100 iterations of back-propagation and print out the Xstatus of the learning every 20 iterations type "r 100 20" at the Xprompt: X X[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? r 100 20 X XThis gives: X Xrunning . . . X 20 iterations 0.00% right (0 right 4 wrong) 0.49927 error/unit X 40 iterations 0.00% right (0 right 4 wrong) 0.43188 error/unit X 60 iterations 75.00% right (3 right 1 wrong) 0.09033 error/unit X 62 iterations 100.00% right (4 right 0 wrong) 0.07129 error/unit Xpatterns learned to within 0.10 at iteration^G 62 X XThe program immediately prints out the "running . . ." message. After Xeach 20 iterations a summary of the learning process is printed giving Xthe percentage of patterns that are right, the number right and wrong Xand the average value of the absolute values of the errors of the output Xunits. The program stops when the each output for each pattern has been Xlearned to within the required tolerance, in this case the default value Xof 0.1. A ctrl-G is normally printed out as well to sound the bell. If Xthe second number defining how often to print out a summary is omitted Xthe summaries will not be printed. Sometimes the integer versions will Xdo a few extra iterations before declaring the problem done because of Xtruncation errors in the arithmetic done to check for convergence. The Xstatus figures for iteration i are computed when making the forward pass Xof the iteration and before the weights are updated so these values are Xone iteration out of date. This saves on CPU time, however, if you Xreally need up-do-date statistics use the u+ option described in the Xformat specifications. X XListing Patterns X X To get a listing of the status of each pattern use the `P' command Xto give: X X[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? P X 1 0.90 (0.098) ok X 2 0.05 (0.052) ok X 3 0.94 (0.062) ok X 4 0.07 (0.074) ok X 62 iterations 100.00% right (4 right 0 wrong) 0.07129 error/unit X XThe numbers in parentheses give the sum of the absolute values of the Xoutput errors for each pattern. An `ok' is given to every pattern that Xhas been learned to within the required tolerance. To get the status of Xone pattern, say, the fourth pattern, type "P 4" to give: X X 0.07 (0.074) ok X XTo get a summary without the complete listing use "P 0". To get the Xoutput targets for a given pattern, say pattern 3, use "O 3". X X A particular test pattern can be input to the network with the `p' Xcommand, as in: X X[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? p 1 0 X 0.90 X XExamining Weights X X It is often interesting to see the values of some particular weights Xin the network. To see a listing of all the weights in a network use Xthe save weights command described below and then cat the file weights, Xhowever, to see the weights leading into a particular node, say the node Xin row 2, node 1 use the w command as in: X X[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? w 2 1 Xlayer unit unit value weight input from unit X 1 1 1.00000 9.53516 9.53516 X 1 2 0.00000 -8.40332 0.00000 X 2 t 1.00000 4.13086 4.13086 X sum = 13.66602 X XThis listing also gives data on how the current activation value of the Xnode is computed using the weights and the activations values of the Xnodes feeding into unit 1 of layer 2. The `t' unit is the threshold Xunit. X XThe Help Command X X To get a short description of any command, type `h' followed by the Xletter of the command. Here, we type h h for help with help: X X[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? h h X Xh <letter> gives help for command <letter>. X XTo list the status of all the parameters in the program, use `?'. X XTo Quit the Program X X Finally, to end the program, the `q' (for quit) command is entered: X X[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? q X X X4. The Format Command X X There are several ways to input and output patterns, numbers and Xother values and there is one format command, `f', that is used to set Xthese options. In the format command a number of options can be Xspecified on a single line as for example in: X Xf b+ ir oc s- wB X XInput Patterns X X The programs are able to read pattern values in two different Xformats. The default input format is the compressed format. In it, Xeach value is one character and it is not necessary to have blanks Xbetween the characters. For example, in compressed format the patterns Xfor xor could be written out in either of the following ways: X X101 10 1 X000 00 0 X011 01 1 X110 11 0 X XThe second example is preferable because it makes it easier to see the Xinput and the output patterns. Compressed format can also be used to Xinput patterns with the `p' command. In addition to using 1 and 0 as Xinput the character, `?' can be used. This character is initially Xdefined to be 0.5, but it can be redefined using the Q command like so: X XQ -1 X XThis sets the value of ? to -1. Other valid input characters are the Xletters, `h', `i', `j' and `k'. The `h' stands for `hidden'. Its Xmeaning in an input string is that the value at this point in the string Xshould be taken from the next unit in the second layer of the network. XThis notation is useful for specifying simple recurrent networks. XNaturally, `i', `j' and `k' stand for taking input values from the Xthird, fourth and fifth layers (if they exist). A simple example of a Xrecurrent network is given later. X X The other input format for numbers is real. The number portion must Xstart with a digit (.35 is not allowed, but 0.35 is). Exponential Xnotation is not allowed. Real numbers have to be separated by a space. XThe `h', `i', `j', `k' and `?' characters are also allowed with real Xinput patterns. To take input in the real format it is necessary to Xset the input format to be real using the `f' (format) command as in: X Xf ir X XTo change back to the compressed format use: X Xf ic X XOutput of Patterns X X Output format is controlled with the `f' command as in: X Xf or Xf oc Xf oa X XThe first sets the output to real numbers. The second sets the output Xto be compressed mode where the value printed will be a `1' when the Xunit value is greater than 1.0 - tolerance, a `^' when the value is Xabove 0.5 but less than 1.0 - tolerance, a `v' when the value is less Xthan 0.5 but greater than the tolerance. Below the tolerance value a X`0' is printed. The tolerance can be changed using the `t' command (not Xa part of the format command). For example, to make all values greater Xthan 0.8 print as `1' and all values less than 0.2 print as `0' use: X Xt 0.2 X XOf course this same tolerance value is also used to check to see if all Xthe patterns have converged. The third output format is meant to give X"analog compressed" output. In this format a `c' is printed when a Xvalue is close enough to its target value. Otherwise, if the answer is Xclose to 1, a `1' is printed, if the answer is close to 0, a `0' is Xprinted, if the answer is above the target but not close to 1, a `^' is Xprinted and if the answer is below the target but not close to 0, a `v' Xis printed. This output format is designed for problems where the Xoutput is a real number, as for instance, when the problem is to make a Xnetwork learn sin(x). X X For the sake of convenience the output format (and only the output Xformat) can be set without using the `f' so that: X Xor X Xwill also make the output format real. X XBreaking up the Output Values X X In the compressed formats the default is to print a blank after Xevery 10 values. This can be altered using the `b' (for inserting Xbreaks) command. The use for this command is to separate output values Xinto logical groups to make the output more readable. For instance, you Xmay have 24 output units where it makes sense to insert blanks after the X4th, 7th and 19th positions. To do this, specify: X Xb 4 7 19 X XThen for example the output will look like: X X 1 10^0 10^ ^000v00000v0 01000 (0.17577) X 2 1010 01v 0^0000v00000 ^1000 (0.16341) X 3 0101 10^ 00^00v00000v 00001 (0.16887) X 4 0100 0^0 000^00000v00 00^00 (0.19880) X XThe `b' command allows up to 20 break positions to be specified. The Xdefault output format is the real format with 10 numbers per line. For Xthe output of real values the `b' command specifies when to print a Xcarriage return rather than when to print a blank. (Note: the `b' Xcommand is not part of the `f' command.) X XPattern Formats X X There are two different types of problems that back-propagation can Xhandle, the general type of problem where every output unit can take on Xan arbitrary value and the classification type of problem where the goal Xis to turn on output unit i and turn off all the other output units when Xthe pattern is of class i. The xor problem is an example of the general Xtype of problem. For an example of a classification problem, suppose Xyou have a number of data points scattered about through two-dimensional Xspace and you have to classify the points as either class 1, class 2 or Xclass 3. For a pattern of class 1 you can always set up the output: X"1 0 0", for class 2: "0 1 0" and for class 3: "0 0 1", however doing Xthe translation to bit patterns can be annoying so another notation can Xbe used. Instead of specifying the bit patterns you can set the pattern Xformat option to classification (as opposed to the default value of Xgeneral) like so: X Xf pc X Xand then the program will read data in the form: X X 1.33 3.61 1 * shorthand for 1 0 0 X 0.42 -2.30 2 * shorthand for 0 1 0 X -0.31 4.30 3 * shorthand for 0 0 1 X Xand translate it to the bit string form when the pattern is loaded onto Xthe output units. To switch to the general form use "f pg". X X In addition to controlling the input of data the p command within Xthe format command is used to control the output of patterns from a set Xof test patterns kept on a file. If the format is either c or g then Xwhen the test set is run thru the network you will only get a summary of Xhow many patterns are correct. If the format is either C or G you will Xget a listing of the output values for each pattern as well as the Xsummary. When reading patterns, C works the same as c and G works the Xsame as g. X XControlling Summaries X X When the program is learning patterns you can have it print out the Xstatus of the learning process at regular intervals. The default is to Xprint out only a summary of how learning is going however you can also Xprint out the status of every pattern at regular intervals. To get the Xwhole set of patterns use "f s-" to turn off the summary format and "f Xs+" to go back to summarizing. X XRinging the Bell X X To ring the bell when the learning has been completed use "f b+" and Xto turn off the bell use "f b-". X XEchoing Input X X When you are reading commands from a file it is often worthwhile to Xsee those commands echoed on the screen. To do this, use "f e+" and to Xturn off the echoing, use "f e-". X XPaging X X The program is set up to write 24 lines to the screen and then pause. XAt the pause the program prints out a ":" (as does the UNIX System V Xpager, pg). At this point typing a carriage return will get you one Xmore page. Typing a "q" followed by a carriage return will quit the Xprocess you're working on and give you another prompt. So, if you're Xrunning for example the xor problem and you type, "r 100 1" you will run X24 iterations through the program, these will print out and then there Xwill be a pause. Notice that the program will not be busy computing Xanything more during the pause. To reset the number of lines written to Xthe screen to say, 12, use "f P 12". Setting the value to 0 will drop Xthe paging altogether. X X Note that the program will not be paging at all if you take a series Xof commands from the original data file or some other input file and the Xoutput produced by these commands is less than the page size. That is Xto say, a new line count is started every time a new command is read and Xif the output of that command is less than the page size there won't be Xany paging. X XMaking a Copy of Your Session X X To make a copy of what appears on the screen use "f c+" to start Xwriting to the file "copy" and "f c-" to stop writing to this file. XEnding the session automatically closes this file as well. X XUp-To-Date Statistics X X During the ith pass thru the network the program will collect Xstatistics on how many patterns are correct and how much error there is Xoverall. These numbers are gathered before the weights are updated and Xso the results listed for iteration i really show the situation after Xthe weight update for iteration i-1. To complicate matters more the Xweight updates can be done continuously instead of after all the Xpatterns have been presented so the statistics you get here are skewed Xeven more. If you want to have up-to-date statistics with either Xmethod use "f u+" and to go back to statistics that are out of date Xuse "f u-". The penalty with "f u+" is that the program needs to do Xanother forward pass. When using the continuous update method it is Xhighly advisable to use "f u+" at least when you get close to complete Xconvergence because the default method of checking may claim the Xlearning is finished when it isn't or it may continue training after the Xtolerances have been met. X X 5. Taking Training and Testing Patterns from Files X X In the xor example given above the four patterns were part of the Xdata file and to read them in the following lines were used: X Xn 4 X1 0 1 X0 0 0 X0 1 1 X1 1 0 X XHowever it is also convenient to take patterns from a file that Xcontains nothing but a list of patterns (and possibly comments). To Xread a new set of patterns from some file, patterns, use: X Xn f patterns X XTo add an extra group of patterns to the current set you can use: X Xx f patterns X X In addition to keeping a set of training patterns you can take Xtesting patterns from a file as well. To specify the file you can Xinvoke the program with a second file name as in: X Xbp xor xtest X XIn addition, if you do the following: X Xt f xtest X Xthe program will set xtest as the test file and immediately do the Xtesting. Once the file has been defined you can test the patterns on Xthe test file by "t f" or just "t". (This leaves the t command doing Xdouble duty since "t <real>" will set the tolerance to <real>.) Also in Xaddition, the test file can be set without being tested by using X XB t f xtest X Xas explained in the benchmarking section. X X X6. Saving and Restoring Weights and Related Values X X Sometimes the amount of time and effort needed to produce a set of Xweights to solve a problem is so great that it is more convenient to Xsave the weights rather than constantly recalculate them. Weights can Xbe saved as real values in an ASCII format (the default) or as binary Xto save space. The old way to save the weights (which still works) is Xto enter the command, "S". The weights are then written on a file Xcalled "weights" or to the last file name you have specified, if you're Xusing the new version of the command. The following file comes from the Xxor problem: X X62r file = ../xor3 X 9.5351562500 X -8.4033203125 X 4.1308593750 X 5.5800781250 X -4.9755859375 X -11.3095703125 X 8.0527343750 X XTo write the weights the program starts with the second layer, writes Xout the weights leading into these units in order with the threshold Xweight last, then it moves on to the third layer, and so on. To Xrestore these weights type an `R' for restore. At this time the Xprogram reads the header line and sets the total number of iterations Xthe program has gone through to be the first number it finds on the Xheader line. It then reads the character immediately after the number. XThe `r' indicates that the weights will be real numbers represented as Xcharacter strings. If the weights were binary the character would be a X`b' rather than an `r'. Also, if the character is `b', the next Xcharacter is read. This next character indicates how many bytes are Xused per value. The integer versions bp and sbp write files with 2 Xbytes per weight while the real versions rbp and srbp write files with X8 bytes per weight for double precision reals and 4 bytes per weight for Xsingle precision reals. With this notation weight files written by one Xprogram can be read by the other. A binary weight format is specified Xwithin the `f' command by using "f wb". A real format is specified by Xusing "f wr". If your program specifies that weights should be written Xin one format but the weight file you read from is different a notice Xwill be given. There is no check made to see if the number of weights Xon the file equals the number of weights in the network. X X The above formats specify that only weights are written out and this Xis all you need once the patterns have converged. However, if you're Xstill training the network and want to break off training and pick up Xthe training from exactly the same point later on you need to save the Xold weight changes when using momentum and the parameters for the Xdelta-bar-delta method if you are using this technique. To save these Xextra parameters on the weights file use "f wR" to write the extra Xvalues as real and "f wB" to write the extra values as binary. X X In the above example the command, S, was used to save the weights Ximmediately. Another alternative is to save weights at regular Xintervals. The command, S 100, will automatically save weights every X100 iterations the program does. The default rate at which to save Xweights is set at 32767 for 16-bit compilers and 2147483647 for 32-bit Xcompilers which generally means that no weights will ever be saved. X X To save weights to a file other than "weights" you can say: X"s w <filename>", where, of course, <filename> is the file you want to Xsave to. To continue saving to the same file you can just do "s w". XNaturally if you restore weights it will be from this current weights Xfile as well. You can restore weights from another file by using: X"r w <filename>". Of course this also sets the name of the file to Xwrite to so if you're not careful you could lose your original weights Xfile. X X7. Initializing Weights and Giving the Network a `Kick' X X All the weights in the network initially start out at 0. In Xsymmetric networks then no learning may result because error signals Xcancel themselves out. Even in non-symmetric networks the training Xprocess will usually converge faster if the weights start out at small Xrandom values. To do this the `k' command will take the network and Xalter the weights in the following ways. Suppose the command given is: X Xk 0 0.5 X XNow if a weight is exactly 0, then the weight will be changed to a Xrandom value between +0.5 and -0.5. The above command can therefore be Xused to initialize the weights in the network. A more complex use of Xthe `k' command is to decrease the magnitude of large weights in the Xnetwork by a certain random amount. For instance in the following Xcommand: X Xk 2 8 X Xall the weights in the network that are greater than or equal to 2 will Xbe decreased by a random number between 0 and 8. Weights less than or Xequal to -2 will be increased by a random number between 0 and 8. The Xseed to the random number generator can be changed using the `s' command Xas in "s 7". The integer parameter in the `s' command is of type, Xunsigned. X X Another method of giving a network a kick is to add hidden layer Xunits. The command: X XH 2 0.5 X Xadds one unit to layer 2 of the network and all the weights that are Xcreated are initialized to between - 0.5 and + 0.5. X X The subject of kicking a back-propagation network out of local minima Xhas barely been studied and there is no guarantee that the above methods Xare very useful in general. X X8. Setting the Algorithm to Use X X A number of different variations on the original back-propagation Xalgorithm have been proposed in order to speed up convergence and some Xof these have been built into these simulators. These options are set Xusing the `A' command and a number of options can go on the one line. X XActivation Functions X X To set the activation function use: X XA al * for the linear activation function XA ap * for the piece-wise activation function XA as * for the smooth activation function XA at * for the piecewise near-tanh function that runs from -1 to +1 XA aT * for the continuous near-tanh function that runs from -1 to +1 X XWhen using the linear activation function it is only appropriate to use Xthe differential step-size derivative and a two-layer network. The Xsmooth activation function is: X Xf = 1.0 / (1.0 + exp(-x)) X Xwhere x is the input to a unit. The piece-wise function is an Xapproximation to the function, f and it will normally save some CPU Xtime even though it may increase the number of iterations you need to Xsolve the problem. The continuous near-tanh function is 2f - 1 and the Xpiece-wise version approximates this function with a series of straight Xlines. X XSharpness (or Gain) X X The sharpness (or gain) is the parameter, D, in the function: X X1.0 / (1.0 + exp(-D*x)). X XA sharper sigmoid shaped activation function (larger D) will produce Xfaster convergence (see "Speeding Up Back Propagation" by Yoshio Izui Xand Alex Pentland in the Proceedings of IJCNN-90-WASH-DC, Lawrence XErlbaum Associates, 1990). To set this parameter to say, 8, use X"A D 8". The default value is 1. Unfortunately, too large a value Xfor D will hurt convergence so this is not a perfect solution to Xspeeding up learning. Sometimes the best value for D may be less than X1.0. A larger D is also useful in the integer version of Xback-propagation where the weights are limited to between -32 and X+31.999. A larger D value in effect magnifies the weights and makes it Xpossible for the weights to stay smaller. Values of D less than one may Xbe useful in extracting a network from a local minima (see "Handwritten XNumeral Recognition by Multi-layered Neural Network with Improved XLearning Algorithm" by Yamada, Kami, Temma and Tsukumo in Proceedings of Xthe 1989 IJCNN, IEEE Press). A smaller value of D will also force the Xweights and the weight changes to be larger and this may be of value Xwhen the weight changes become less than the weight resolution of 0.001 Xin the integer version. X XThe Derivatives X X The correct derivative for the standard activation function is s(1-s) Xwhere s is the activation value of a unit however when s is near 0 or 1 Xthis term will give only very small weight changes during the learning Xprocess. To counter this problem Fahlman proposed the following one Xfor the output layer: X X0.1 + s(1-s) X X(For the original description of this method see "Faster Learning XVariations of Back-Propagation: An Empirical Study", by Scott E. XFahlman, in Proceedings of the 1988 Connectionist Models Summer School, XMorgan Kaufmann, 1989.) Besides Fahlman's derivative and the original Xone the differential step size method (see "Stepsize Variation Methods Xfor Accelerating the Back-Propagation Algorithm", by Chen and Mars, in XIJCNN-90-WASH-DC, Lawrence Erlbaum, 1990) takes the derivative to be 1 Xin the layer going into the output units and uses the correct derivative Xterm for all other layers. The learning rate for the inner layers is Xnormally set to some smaller value. To set a value for eta2 give two Xvalues in the `e' command as in: X Xe 0.1 0.01 X XTo set the derivative use the `A' command as in: X XA dd * use the differential step size derivative (default) XA dF * use Fahlman's derivative in only the output layer XA df * use Fahlman's derivative in all layers XA do * use the original, correct derivative X XUpdate Methods X X The choices are the periodic (batch) method, the continuous (online) Xmethod, delta-bar-delta and quickprop. The following commands set the Xupdate methods: X XA uc * for the continuous update method XA ud * for the delta-bar-delta method XA up * for the original periodic update method (default) XA uq * for the quickprop algorithm X XThe delta-bar-delta method uses a number of special parameters and these Xare set using the `d' command. Delta-bar-delta can be used with any of Xthe derivatives and the algorithm will find its own value of eta for Xeach weight. X XOther Algorithm Options X X The `b' command controls whether or not to backpropagate error for Xunits that have learned their response to within a given tolerance. The Xdefault is to always backpropagate error. The advantage to not Xbackpropagating error is that this can save computer time. This Xparameter can be set like so: X XA b+ * always backpropagate error XA b- * don't backpropagate error when close X X The `s' sub-command will set the number of times to skip a pattern Xwhen the pattern has been learned to within the desired tolerance. To Xskip 3 iterations, use "A s 3", to reset to not skip any patterns Xuse "A s 0". X X The `t' sub-command will take the given pattern (only one at a Xtime) out of the training set so that you can then train the other Xpatterns and test the network's response to this one pattern that was Xremoved. To test pattern 3 use "A t 3" and to reset to use all the Xpatterns use "A t 0". X X X9. The Delta-Bar-Delta Method X X The delta-bar-delta method attempts to find a learning rate, eta, for Xeach individual weight. The parameters are the initial value for the Xetas, the amount by which to increase an eta that seems to be too small, Xthe rate at which to decrease an eta that is too large, a maximum value Xfor each eta and a parameter used in keeping a running average of the Xslopes. Here are examples of setting these parameters: X Xd d 0.5 * sets the decay rate to 0.5 Xd e 0.1 * sets the initial etas to 0.1 Xd k 0.25 * sets the amount to increase etas by (kappa) to 0.25 Xd m 10 * sets the maximum eta to 10 Xd n 0.005 * an experimental noise parameter Xd t 0.7 * sets the history parameter, theta, to 0.7 X XThese settings can all be placed on one line: X Xd d 0.5 e 0.1 k 0.25 m 10 t 0.7 X XThe version implemented here does not use momentum. The symmetric Xversions sbp and srbp do not implement delta-bar-delta. X X The idea behind the delta-bar-delta method is to let the program find Xits own learning rate for each weight. The `e' sub-command sets the Xinitial value for each of these learning rates. When the program sees Xthat the slope of the error surface averages out to be in the same Xdirection for several iterations for a particular weight the program Xincreases the eta value by an amount, kappa, given by the `k' parameter. XThe network will then move down this slope faster. When the program Xfinds the slope changes signs the assumption is that the program has Xstepped over to the other side of the minimum and so it cuts down the Xlearning rate by the decay factor given by the `d' parameter. For Xinstance, a d value of 0.5 cuts the learning rate for the weight in Xhalf. The `m' parameter specifies the maximum allowable value for an Xeta. The `t' parameter (theta) is used to compute a running average of Xthe slope of the weight and must be in the range 0 <= t < 1. The Xrunning average at iteration i, a[i], is defined as: X Xa[i] = (1 - t) * slope[i] + t * a[i-1], X Xso small values for t make the most recent slope more important than the Xprevious average of the slope. Determining the learning rate for Xback-propagation automatically is, of course, very desirable and this Xmethod often speeds up convergence by quite a lot. Unfortunately, bad Xchoices for the delta-bar-delta parameters give bad results and a lot of Xexperimentation may be necessary. If you have n patterns in the Xtraining set try starting e and k around 1/n. The n parameter is an Xexperimental noise term that is only used in the integer version. It Xchanges a weight in the wrong direction by the amount indicated when the Xprevious weight change was 0 and the new weight change would be 0 and Xthe slope is non-zero. (I found this to be effective in an integer Xversion of quickprop so I tossed it into delta-bar-delta as well. If Xyou find this helps please let me know.) For more on delta-bar-delta Xsee "Increased Rates of Convergence" by Robert A. Jacobs, in Neural XNetworks, Volume 1, Number 4, 1988. X X10. Quickprop X X Quickprop (see "Faster-Learning Variations on Back-Propagation: An XEmpirical Study", by Scott E. Fahlman, in Proceedings of the 1988 XConnectionist Models Summer School", Morgan Kaufmann, 1989.) is similar Xto delta-bar-delta in that the algorithm attempts to increase the size Xof a weight change for each weight while the process continues to go Xdownhill in terms of error. The main acceleration technique is to make Xthe next weight change mu times the previous weight change. Fahlman Xsuggests mu = 1.75 is generally quite good so this is the initial value Xfor mu but slightly larger or slightly smaller values are sometimes Xbetter. In addition when this is done Fahlman also adds in a value, Xeta times the current slope, in the traditional backprop fashion. I had Xto wonder if this was a good idea so in this code I've included a Xcapability to add it in or not add it in. So far it seems to me that Xsometimes adding in this extra term helps and sometimes it doesn't. The Xdefault is to always use the extra term. X X A second key property of quickprop is that when a weight change Xchanges the slope of the error curve from positive to negative or Xvice-versa, quickprop attempts to jump to the minimum by assuming the Xcurve is a parabola. X X A third factor involved in quickprop comes about from the fact that Xthe weights often grow very large very quickly. To minimize this Xproblem there is a decay factor designed to keep the weights small. XFahlman does not mention a value for this parameter but I usually try X0.0001 to 0.00001. I've found that too large a decay factor can stall Xout the learning process so that if your network isn't learning fast Xenough or isn't learning at all one possible fix is to decrease the Xdecay factor. The small values you need present a problem for the Xinteger version since the smallest value you can represent is about X0.001. To get around this problem the decay value you input should be X1000 times larger than the value you intend to use. So to get 0.0001, Xinput 0.1, to get 0.00001, input 0.01, etc. The code has been written Xso that the factor of 1000 is taken into account during the calculations Xinvolving the decay factor. To keep the values consistent both the Xinteger and floating point versions use this convention. If you use Xquickprop elsewhere you need to take this factor of 1000 into account. XThe default value for the decay factor is 0.1 (=0.0001). X X I built in one additional feature for the integer version. I found Xthat by adding small amounts of noise the time to convergence can be Xbrought down and the number of failures can be decreased somewhat. This Xseems to be especially true when the weight changes get very small. The Xnoise consists of moving uphill in terms of error by a small amount when Xthe previous weight change was zero. Good values for the noise seem to Xbe around 0.005. X X The parameters for quickprop are all set in the `qp' command like Xso: X Xqp d 0.1 * the default decay factor Xqp e 0.5 * the default value for eta Xqp m 1.75 * the default value for mu Xqp n 0 * the default value for noise Xqp s+ * always include the slope X Xor a whole series can go on one line: X Xqp d 0.1 e 0.5 m 1.75 n 0 s+ X X X11. Recurrent Networks X X Recurrent back-propagation networks take values from higher level Xunits and use them as activation values for lower level units. This Xgives a network a simple kind of short-term memory, possibly a little Xlike human short-term memory. For instance, suppose you want a network Xto memorize the two short sequences, "acb" and "bcd". In the middle of Xboth of these sequences is the letter, "c". In the first case you want Xa network to take in "a" and output "c", then take in "c" and output X"b". In the second case you want a network to take in "b" and output X"c", then take in "c" and output "d". To do this a network needs a Xsimple memory of what came before the "c". X X Let the network be an 7-3-4 network where input units 1-4 and output Xunits 1-4 stand for the letters a-d. Furthermore, let there be 3 hidden Xlayer units. The hidden units will feed their values back down to the Xinput units 5-7 where they become input for the next step. To see why Xthis works, suppose the patterns have been learned by the network. XInputting the "a" from the first string produces some random pattern of Xactivation, p1, on the hidden layer units and "c" on the output units. XThe pattern p1 is copied down to units 5-7 of the input layer. Second, Xthe letter, "c" is presented to the network together with p1 now on Xunits 5-7. This will give "b" on the output units. However, if the "b" Xfrom the second string is presented first there will be a different Xrandom pattern, p2, on the hidden layer units. These values are copied Xdown to input units 5-7. These values combine with the "c" to produce Xthe output, "d". X X The training patterns for the network can be: X X 1000 000 0010 * "a" prompts the output, "c" X 0010 hhh 0100 * inputting "c" should produce "b" X X 0100 000 0010 * "b" prompts the output, "c" X 0010 hhh 0001 * inputting "c" should produce "d" X Xwhere the first four values on each line are the normal input, the Xmiddle three either start out all zeros or take their values from the Xprevious values of the hidden units. The code for taking these values Xfrom the hidden layer units is "h". The last set of values represents Xthe output that should be produced. To take values from the third layer Xof a network, the code is "i". For the fourth and fifth layers (if they Xexist) the codes are "j" and "k". Training recurrent networks can take Xmuch longer than training standard networks and the average error can Xjump up and down quite a lot. X X X12. The Benchmarking Command X X The main purpose of the benchmarking command is to make it possible Xto run a number of tests of a problem with different initial weights and Xaverage the number of iterations and CPU time for networks that Xconverged. A second purpose is to run a training set thru the network a Xnumber of times and for each try a test pattern or a test set can be Xchecked at regular intervals. X X A typical command to simply test the current parameters on a number Xof networks is: X XB g 5 m 15 k 1 r 1000 200 X XThe "g 5" specifies that you'd like to set the goal of getting 5 Xnetworks to converge but the "m 15" sets a maximum of 15 tries to Xreach this goal. The k specifies that each initial network will get Xa kick by setting each weight to a random number between -1 and 1. XThe "r 1000 200" portion specifies that you should run up to 1000 Xiterations on a network and print the status of learning every 200 Xiterations. This follows the normal run command and the second Xparameter defining how often to print the statistics can be omitted. XFor example, here is some output from benchmarking with the xor problem: X X[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? B g 5 m 5 k 1 r 200 X seed = 7; running . . . Xpatterns learned to within 0.10 at iteration 62 X seed = 7; running . . . X seed = 7; running . . . Xpatterns learned to within 0.10 at iteration 54 X seed = 7; running . . . Xpatterns learned to within 0.10 at iteration 39 X seed = 7; running . . . Xpatterns learned to within 0.10 at iteration 44 X1 failures; 4 successes; average = 49.750000 0.333320 sec/network X XThe time/network includes however much time is used to print messages so Xto time the process effectively all printing should be minimized. When Xthe timing is done on a UNIX PC the time returned by clock will overflow Xafter 2147 seconds or about 36 minutes. If you system has the same Xlimitation take care that ALL of the benchmarking you do in a single Xcall of the program adds up to less than 36 minutes. X X In the above example the seed that was used to set the random values Xfor the weights was set to 7 (outside the benchmarking command) however Xif you set a number of seeds as in: X Xs 3 5 7 18484 99 X Xthe seeds will be taken in order for each network. When there are more Xnetworks to try than there are seeds the random values keep coming from Xthe last seed value so actually you can get by using a single seed. XThe idea behind allowing multiple seeds is so that if one network does Xsomething interesting you can use that seed to run a network with the Xsame initial weights outside of the benchmarking command. X X Once the benchmarking parameters have been set it is only necessary Xto include the run portion in order to start the benchmarking process Xagain, thus, "B r 200" will run benchmarking again using the current set Xof parameters. Also, the parameters can be set without starting the Xbenchmarking process by just not including the `r' parameters in the B Xcommand as in: X XB g 5 m 5 k 1 X X In addition to getting data on convergence you can have the program Xrun test patterns thru the network at the print statistics rate given Xin the `r' sub-command. To specify the test file, test100, use: X XB t f test100 X XTo run the training data thru for up to 1000 iterations and test every X200 iterations use: X XB r 1000 200 X XIf the pattern format specification p is set to either c or g you will Xget a summary of the patterns on the test file. If p is either C or G Xyou will get the results for each pattern listed as well as the summary. XTo stop testing the data on the data file use "B t 0". X X Sometimes you may have so little data available that it is difficult Xto separate the patterns into a training set and a test set. One Xsolution is to remove each pattern from the training set, train the Xnetwork on the remaining patterns and then test the network on the Xpattern that was removed. To remove a pattern, say pattern 1 from the Xtraining set use: X XB t 1 X XTo systematically remove each pattern from the training set use a data Xfile with the following commands: X XB t 1 r 200 50 XB t 2 r 200 50 XB t 3 r 200 50 X ... etc. X Xand the pattern will be tested every 50 iterations. If, in the course Xof training the network, all the patterns converge, the program will Xprint out a line starting with a capital S followed by a test of the Xtest pattern. If the programs hits the point where statistics on the Xlearning process have to be printed and the network has not converged Xthen a capital F will print out followed by a test of the test pattern. XTo stop this testing use "B t 0". X X It would be nice to have the program average up and tabulate all the Xdata that comes out of the benchmarking command but I thought I'd leave Xthat to users for now. You can use the record command to save the Xoutput from the entire session and then run it thru some other program, Xsuch as an awk program in order to sort everything out. X X X13. Miscellaneous Commands X X Below is a list of some miscellaneous commands, a short example of Xeach and a short description of the command. X X X! Example: ! ls X XAnything after `!' will be passed on to the OS as a command to execute. X X XC Example: C X XThe C command will clear the network of values, reset the number of Xiterations to 0 and reset other values so that another run can be made. XThe seed value is reset so you can run the same network over with the Xsame initial weights but with different learning parameters. Even Xthough the seed is reset the weights are not initialized so you Xmust do this step after clearing the network. X X Xi Example: i f X XEntering "i f" will read commands from the file, f. When there are no Xmore commands on the file the program resumes reading from the previous Xfile being used. You can also have an `i' command within the file f, Xhowever the depth to which you can nest the number of active files is 4 Xand stdin itself counts as the first one. Once an input file has been Xspecified you can simply type "i" to read from the file again. X X Xl Example: l 2 X XEntering "l 2" will print the values of the units on layer 2, or Xwhatever layer is specified. X X XT Example: T -3 X XIn sbp and srbp only, "T -3" sets all the threshold weights to -3 or Xwhatever value is specified and freezes them at this value. X X XW Example: W 0.9 X XEntering "W 0.9" will remove (whittle away) all the weights with Xabsolute values less than 0.9. X XIn addition, under UNIX when a user generated interrupt occurs (by Xtyping DEL) the program will drop its current task and take the next Xcommand from the keyboard. Under DOS you can use ctrl-C to stop the Xcurrent task and take the next command from the keyboard. Also under XDOS, when the program is executing a run command hitting the escape key Xwill get the program to stop after the current iteration. X X X13. Limitations X X Weights in the bp and sbp programs are 16-bit integer weights where Xthe real value of the weight has been multiplied by 1024. The integer Xversions cannot handle weights less than -32 or greater than 31.999. XThe weight changes are all checked for overflow but there are other Xplaces in these programs where calculations can possibly overflow as Xwell and none of these places are checked. Input values for the integer Xversions can run from -31.994 to 31.999. Due to the method used to Ximplement recurrent connections, input values in the real version are Xlimited to -31994.0 and above. END_OF_FILE if test 46585 -ne `wc -c <'readme'`; then echo shar: \"'readme'\" unpacked with wrong size! fi # end of 'readme' fi echo shar: End of archive 1 $of 4$. cp /dev/null ark1isdone MISSING="" for I in 1 2 3 4 ; do if test ! -f ark${I}isdone ; then MISSING="${MISSING} ${I}" fi done if test "${MISSING}" = "" ; then echo You have unpacked all 4 archives. rm -f ark[1-9]isdone else echo You still need to unpack the following archives: echo " " ${MISSING} fi ## End of shell archive. exit 0 exit 0 # Just in case...