Article 9277 of comp.ai.neural-nets:
Path: serval!netnews.nwnet.net!usenet.coe.montana.edu!caen!usenet.cis.ufl.edu!hkim
From: hkim@insect.cis.ufl.edu (Hyeoncheol Kim)
Newsgroups: comp.ai.neural-nets
Subject: SUMMARY: weight decay references
Date: 30 May 1993 20:36:21 GMT
Organization: Univ. of Florida CIS Dept.
Lines: 874
Distribution: world
Message-ID: <1ub5s5INNisg@snoopy.cis.ufl.edu>
NNTP-Posting-Host: insect.cis.ufl.edu
Hello,
I posted a request for references on the subject of weight decay and
pruning to this newsgroup a week or so ago.
Here is a summary of the responses I have received so far.
Thanks a lot for your help. I really appreciate it.
Hyeoncheol Kim
hkim@cis.ufl.edu
Dept. of Computer and Information Sciences
University of Florida
Gainesville, Florida, USA.
Enjoy...
SUMMARY BEGINS...
==================================================================
Date: Mon, 24 May 1993 12:17:29 -0400
From: omlinc@cs.rpi.edu
To: hkim@cis.ufl.edu
Subject: Re: references on pruning wanted ...
Status: R
Here is a list of references I received a couple
of months ago in reply to a similar request:
From KRUSCHKE@ucs.indiana.edu Tue Mar 16 15:25:52 1993
Date: Tue, 16 Mar 93 15:25:48 EST
From: "John K. Kruschke" <KRUSCHKE@ucs.indiana.edu>
Subject: weight decay
To: omlinc@cs.rpi.edu
Status: R
I did some work on methods to reduce the hidden layers of back-prop
networks (references below), using variants of weight decay, and
there's no reason they couldn't be applied to recurrent networks.
I've moved on to other research now, but I'd be very interested in
whatever results you get (either supportive or not). Good luck, in
any case!
Kruschke, J. K. and Movellan, J. R. (1991).
Benefits of gain: Speeded learning and minimal hidden layers in
back-propagation networks.
IEEE Transactions on Systems, Man and Cybernetics, v.21, pp.273-280.
Kruschke, J. K. (1989b).
Distributed bottlenecks for improved generalization in
back-propagation networks.
International J. of Neural Networks Research and Applications,
v.1, pp.187-193.
Kruschke, J. K. (1989a).
Improving generalization in back-propagation networks with distributed
bottlenecks.
In: Proceedings of the IEEE International Joint Conference on Neural
Networks, v.1, 443-447. Washington DC, June 1989.
Kruschke, J. K. (1988).
Creating local and distributed bottlenecks in hidden layers of
back-propagation networks.
In: D. Touretzky, G. Hinton and T. Sejnowski (eds.),
Proceedings of the 1988 Connectionist Models Summer School,
pp.120-126. San Mateo, CA: Morgan Kaufmann.
------------------------------------------------------------
John K. Kruschke Asst. Prof. of Psych. & Cog. Sci.
Dept. of Psychology internet: kruschke@indiana.edu
Indiana University bitnet: kruschke@iubacs
Bloomington, IN 47405-4201 office: (812) 855-3192
USA lab: (812) 855-9613
============================================================
From KRUSCHKE@ucs.indiana.edu Tue Mar 16 15:29:00 1993
Date: Tue, 16 Mar 93 15:28:00 EST
From: "John K. Kruschke" <KRUSCHKE@ucs.indiana.edu>
Subject: more on weight decay
To: omlinc@cs.rpi.edu
Status: R
This is pretty dated now, but it might be of historical interest.
==========
Date: Tue, 3 Jan 89 00:30:12 PST
From: kruschke@cogsci.berkeley.edu (John Kruschke)
To: connectionists@cs.cmu.edu
Here is the compilation of responses to my request for info on
weight decay.
I have kept editing to a minimum, so you can see exactly what the
author of the reply said. Where appropriate, I have included some
comments of my own, set off in square brackets. The responses are
arranged into three broad topics: (1) Boltzmann-machine related;
(2) back-prop related; (3) psychology related.
Thanks to all, and happy new year! --John
-----------------------------------------------------------------
ORIGINAL REQUEST:
I'm interested in all the information I can get regarding
WEIGHT DECAY in back-prop, or in other learning algorithms.
*In return* I'll collate all the info contributed and send the
compilation out to all contributors.
Info might include the following:
REFERENCES:
- Applications which used weight decay
- Theoretical treatments
Please be as complete as possible in your citation.
FIRST-HAND EXPERIENCE
- Application domain, details of I/O patterns, etc.
- exact decay procedure used, and results
(Please send info directly to me: kruschke@cogsci.berkeley.edu
Don't use the reply command.)
T H A N K S ! --John Kruschke.
-----------------------------------------------------------------
From: Geoffrey Hinton <hinton@ai.toronto.edu>
Date: Sun, 4 Dec 88 13:57:45 EST
Weight-decay is a version of what statisticians call "Ridge
Regression".
We used weight-decay in Boltzmann machines to keep the energy barriers
small. This is described in section 6.1 of:
Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984)
Boltzmann Machines: Constraint satisfaction networks that learn.
Technical Report CMU-CS-84-119, Carnegie-Mellon University.
I used weight decay in the family trees example. Weight decay was
used to improve generalization and to make the weights easier to
interpret (because, at equilibrium, the magnitude of a weight =
its usefulness). This is in:
Rumelhart, D.~E., Hinton, G.~E., and Williams, R.~J. (1986)
Learning representations by back-propagating errors.
{\it Nature}, {\bf 323}, 533--536.
I used weight decay to achieve better generalization in a hard
generalization task that is reported in:
Hinton, G.~E. (1987)
Learning translation invariant recognition in a massively
parallel network. In Goos, G. and Hartmanis, J., editors,
{\it PARLE: Parallel Architectures and Languages Europe},
pages~1--13, Lecture Notes in Computer Science,
Springer-Verlag, Berlin.
Weight-decay can also be used to keep "fast" weights small. The fast
weights act as a temporary context. One use of such a context is
described in:
Hinton, G.~E. and Plaut, D.~C. (1987)
Using fast weights to deblur old memories.
{\it Proceedings of the Ninth Annual Conference of the
Cognitive Science Society}, Seattle, WA.
--Geoff
-----------------------------------------------------------------
[In his lecture at the International Computer Science Institute,
Berkeley CA, on 16-DEC-88, Geoff also mentioned that weight decay is
good for wiping out the initial values of weights so that only the
effects of learning remain.
In particular, if the change (due to learning) on two weights is the
same for all updates, then the two weights converge to the same
value. This is one way to generate symmetric weights from
non-symmetric starting values.
--John]
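[A minimal Python sketch of this effect (the numbers are made up, not
from the original posts): with multiplicative decay, two weights that
receive the same update every step forget their different initial
values and converge to the common fixed point delta/(1 - gamma).
  gamma, delta = 0.99, 0.05   # decay factor, common per-step update
  w1, w2 = 8.0, -3.0          # different initial weights
  for _ in range(2000):
      w1 = gamma * w1 + delta
      w2 = gamma * w2 + delta
  print(w1, w2, delta / (1 - gamma))   # all three are ~5.0
(end sketch)]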
-----------------------------------------------------------------
From: Michael.Franzini@SPEECH2.CS.CMU.EDU
Date: Sun, 4 Dec 1988 23:24-EST
My first-hand experience confirms what I'm sure many other people have
told you: that (in general) weight decay in backprop increases
generalization. I've found that it's particularly important for small
training sets, and its effect diminishes as the training set size
increases.
Weight decay was first used by Barak Pearlmutter. The first mention
of weight decay is, I believe, in an early paper of Hinton's (possibly
the Plaut, Nowlan, and Hinton CMU CS tech report), and it is
attributed to "Barak Pearlmutter, Personal Communication" there.
The version of weight decay that (I'm fairly sure) all of us at CMU
use is one in which each weight is multiplied by 0.999 every epoch.
Scott Fahlman has a more complicated version, which is described in
his QUICKPROP tech report. [QuickProp is also described in his paper
in the Proceedings of the 1988 Connectionist Models Summer School,
published by Morgan Kaufmann. --John]
The main motivation for using it is to eliminate spurious large
weights which happen not to interfere with recognition of training
data but would interfere with recognizing testing data. (This was
Barak's motivation for trying it in the first place.) However, I have
heard more theoretical justifications (which, unfortunately, I can't
reproduce).
In case Barak didn't reply to your message, you might want to contact
him directly at bap@cs.cmu.edu.
--Mike
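[A minimal Python sketch of the decay scheme Mike describes: multiply
every weight by a constant just under 1 once per epoch, on top of the
usual gradient step. The toy least-squares problem, learning rate, and
decay constant below are illustrative assumptions only.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 5))
  y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5])  # some true weights are zero

  w = rng.normal(size=5)
  lr, decay = 0.01, 0.999
  for epoch in range(500):
      grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
      w -= lr * grad                     # ordinary gradient step
      w *= decay                         # weight decay, once per epoch
  print(np.round(w, 2))                  # spurious weights are pulled toward 0
(end sketch)]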
-----------------------------------------------------------------
From: Barak.Pearlmutter@F.GP.CS.CMU.EDU
Date: 8 Dec 1988 16:36-EST
We first used weight decay as a way to keep weights in a boltzmann
machine from growing too large. We added a term to the thing being
minimized, G, so that
G' = G + 1/2 h \sum_{i<j} w_{ij}^2
where G' is our new thing to minimize. This gives
\partial G'/\partial w_{ij} = \partial G/\partial w_{ij} + h w_{ij}
which is just weight decay with some mathematical motivation. As Mike
mentioned, I was the person who thought of weight decay in this
context (in the shower no less), but parameter decay has been used
forever, in adaptive control for example.
It sort of worked okay for Boltzmann machines, but works much better
in backpropagation. As a historical note I should mention that there
were some competing techniques for keeping weights small in Boltzmann
machines, such as Mark Derthick's "differential glommetry", in which
the effective target temperature of the wake phase is higher than
that of the sleep phase. I don't know if there is an analogue for
this in backpropagation, but there certainly is for mean field theory
networks.
Getting back to weight decay, it was noted immediately that G has the
unit "bits" while $w_{ij}^2$ has the unit "weight^2", sort of a
problem from a dimensional analysis point of view. Solving this
conundrum, Rick Szeliski pointed out that if we're going to transmit
our weights by telephone and know a priori that weights have Gaussian
distributions,
P(w_{ij}=x) \propto e^{-1/2 h x^2}
where h is set to get the correct variance, then transmitting a weight
w will take $1/2 h w^2$ bits (up to an additive constant), which we can
add to G with dimensional confidence.
Of course, this argument extends to fast/slow split weights nicely; the
other guy already knows the slow weights, so we need transmit only the
fast weights.
By "ridge regression" I guess Geoff means that valleys in weight space
that cause the weights to grow asymptotically are made to tilt up after
a while, so that the asymptotic tailing off is eliminated. It's like
adding a bowl to weight space, so minima have to be within the bowl.
An interesting side effect of weight decay is that, once we get to a
minimum, so $\partial G'/\partial w = 0$, then
w_{ij} \propto - \partial G/\partial w_{ij}
so we can do a sort of eyeball significance analysis, since a weight's
magnitude is proportional to how sensitive the error is to changing
it.
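[In code, Barak's formulation amounts to adding h*w to the gradient;
one gradient step with learning rate lr then shrinks the weights by a
factor (1 - lr*h) before the usual update, which is exactly the
multiplicative decay described earlier. A sketch, with illustrative
names:
  def penalized_grad(grad_G, w, h=1e-3):
      # dG'/dw = dG/dw + h*w, from G' = G + (h/2) * sum of w**2
      return grad_G + h * w

  # one step:  w <- w - lr * (grad_G + h*w) = (1 - lr*h)*w - lr*grad_G
(end sketch)]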
-----------------------------------------------------------------
From: russ%yummy@gateway.mitre.org (Russell Leighton)
Date: Mon, 5 Dec 88 09:17:56 EST
We always use weight decay in backprop. It is particularly important
in escaping local minima. Decay moves the transfer functions of all
the semi-linear (sigmoidal) nodes toward the linear region. The
important point is that all nodes move proportionally so no
information in the weights is "erased" but only scaled. When the nodes
that have trapped the system in the local minima are scaled enough,
the system moves onto a different trajectory through weight space.
Oscillations are still possible, but are less likely.
We use decay with a process we call "shaping" (see Wieland and
Leighton, "Shaping Schedules as a Method for Accelerating Leanring",
Abstracts of the First Annual INNS Meeting, Boston, 1988) that we use
to speed learning of some difficult problems.
ARPA: russ%yummy@gateway.mitre.org
Russell Leighton
MITRE Signal Processing Lab
7525 Colshire Dr.
McLean, Va. 22102
USA
-----------------------------------------------------------------
From: James Arthur Pittman <hi.pittman@MCC.COM>
Date: Tue, 6 Dec 88 09:34 CST
Probably he will respond to you himself, but Alex Weiland of MITRE
presented a paper at INNS in Boston on shaping, in which the order of
presentation of examples in training a back-prop net was altered to
reflect a simpler rule at first. Over a number of epochs he gradually
changed the examples to slowly change the rule to the one desired. The
nets learned much faster than if he just tossed the examples at the
net in random order. He told me that it would not work without weight
decay. He said their rule-of-thumb was the decay should give the
weights a half-life of 2 to 3 dozen epochs (usually a value such as
0.9998). But I neglected to ask him if he felt that the number of
epochs or the number of presentations was important. Perhaps if one
had a significantly different training set size, that rule-of-thumb
would be different?
I have started some experiments similar to his shaping, using some
random variation of the training data (where the random variation
grows over time). Weiland also discussed this in his talk. I haven't
yet compared decay with no-decay. I did try (as a lark) using decay
with a regular (non-shaping) training, and it did worse than we
usually get (on same data and same network type/size/shape). Perhaps
I was using a stupid decay value (0.9998 I think) for that situation.
I hope to get back to this, but at the moment we are preparing for a
software release to our shareholders (MCC is owned by 20 or so
computer industry corporations). In the next several weeks a lot of
people will go on Christmas vacation, so I will be able to run a bunch
of nets all at once. They call me the machine vulture.
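[The half-life rule of thumb converts to a decay factor as
decay = 0.5**(1/n) for a half-life of n steps. A quick check in Python
shows that 0.9998 corresponds to a half-life of about 3465 steps, so
the two quoted figures are consistent only if a "step" is a single
pattern presentation and an epoch contains on the order of a hundred
presentations -- exactly the epoch-versus-presentation ambiguity noted
above.
  import math
  print(math.log(0.5) / math.log(0.9998))  # half-life: ~3465 steps
  print(0.5 ** (1.0 / 36))                 # per-epoch factor for a
                                           # 36-epoch half-life: ~0.981
(end sketch)]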
-----------------------------------------------------------------
From: Tony Robinson <ajr@digsys.engineering.cambridge.ac.uk>
Date: Sat, 3 Dec 88 11:10:20 GMT
Just a quick note in reply to your message to `connectionists' to say
that I have tried to use weight decay with back-prop on networks with
order 24 i/p, 24 hidden, 11 o/p units. The problem was vowel
recognition (I think), it was about 18 months ago, and the problem was
of the unsolvable type (i.e. non-zero final energy).
My conclusion was that weight decay only made matters worse, and my
justification (to myself) for abandoning weight decay was that you are
not even pretending to do gradient descent any more, and any good
solution formed quickly becomes garbaged by scaling the weights.
If you want to avoid hidden units sticking on their limiting values,
why not use hidden units with no limiting values? For instance, I find
the activation function f(x) = x * x works better than
f(x) = 1.0 / (1.0 + exp(-x)) anyway.
Sorry I haven't got anything formal to offer, but I hope these notes
help.
Tony Robinson.
-----------------------------------------------------------------
From: jose@tractatus.bellcore.com (Stephen J Hanson)
Date: Sat, 3 Dec 88 11:54:02 EST
Actually, "costs" or "penalty" functions are probably better terms. We
had a poster last week at NIPS that discussed some of the pitfalls and
advantages of two kinds of costs. I can send you the paper when we
have a version available.
Stephen J. Hanson (jose@bellcore.com)
-----------------------------------------------------------------
[ In a conversation in his office on 06-DEC-88, Dave Rumelhart
described to me several cost functions he has tried.
The motive for the functions he has tried is different from the motive
for standard weight decay. Standard weight decay,
\sum_{i,j} w_{i,j}^2 ,
is used to *distribute* weights more evenly over the given
connections, thereby increasing robustness (cf. earlier replies).
He has tried several other cost functions in an attempt to *localize*,
or concentrate, the weights on a small subset of the given
connections. The goal is to improve generalization. His favorite is
\sum_{i,j} ( w_{i,j}^2 / ( K + w_{i,j}^2 ) )
where K is a constant, around 1 or 2. Note that this function is
negatively accelerating, whereas standard weight decay is positively
accelerating. This function penalizes small weights (proportionally)
more than large weights, just the opposite of standard weight decay.
He has also tried, with less satisfying results,
\sum ( 1 - \exp( -\alpha w_{i,j}^2 ) )
and
\sum \ln ( K + w_{i,j}^2 ).
Finally, he has tried a cost function designed to make all the fan-in
weights of a single unit decay, when possible. That is, the unit is
effectively cut out of the network. The function is
\sum_i (\sum_j w_{i,j}^2) / ( K + \sum_j w_{i,j}^2 ).
Each weight is thereby penalized (inversely) proportionally to the
total fan-in weight of its node.
--John ]
[1991: Some papers that have explored Rumelhart's ideas:
Hanson, S. J. and Pratt, L. Y. (1989). Comparing biases for minimal
network construction with back-propagation. In: D. S. Touretzky
(ed.), Advances in Neural Information Processing Systems 1,
pp.177-185. San Mateo, CA: Morgan Kaufmann.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991).
Generalization by weight-elimination with application to forecasting.
In: R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.),
Advances in Neural Information Processing Systems 3,
San Mateo, CA: Morgan Kaufmann.
(end references).]
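[A small Python sketch of Rumelhart's favorite cost above (the
"weight-elimination" penalty of Weigend et al., 1991), with its
gradient; K = 1 as suggested. Note how the pull toward zero,
2*K*w/(K + w**2)**2, stays strong for small weights but fades for
large ones, which is what concentrates the weights on few connections.
  import numpy as np

  def weight_elimination_cost(W, K=1.0):
      return np.sum(W**2 / (K + W**2))

  def weight_elimination_grad(W, K=1.0):
      return 2.0 * K * W / (K + W**2) ** 2
(end sketch)]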
-----------------------------------------------------------------
[ This is also a relevant place to mention my paper in the Proceedings
of the 1988 Connectionist Models Summer School, "Creating local and
distributed bottlenecks in back-propagation networks". I have since
developed those ideas, and have expressed the localized bottleneck
method as gradient descent on an additional cost term. The cost term
is quite general, and some forms of decay are simply special cases of
it. --John]
[ 1991: Here are references to that work:
Kruschke, J. K., & Movellan, J. R. (1991). Benefits of gain: Speeded
learning and minimal hidden layers in back propagation networks.
IEEE Transactions on Systems, Man and Cybernetics, v.21, pp.273-280.
Kruschke, J. K. (1989b). Distributed bottlenecks for improved
generalization in back-propagation networks. International Journal of
Neural Networks Research and Applications, v.1, pp.187-193.
Kruschke, J. K. (1989a). Improving generalization in back-propagation
networks with distributed bottlenecks. In: Proceedings of the IEEE
International Joint Conference on Neural Networks, Washington D.C.
June 1989, v.1, pp.443-447.
Kruschke, J. K. (1988). Creating local and distributed bottlenecks in
hidden layers of back-propagation networks. In: D. Touretzky,
G. Hinton, & T. Sejnowski (eds.), Proceedings of the 1988
Connectionist Models Summer School, pp.120-126. San Mateo, CA:
Morgan Kaufmann.
(end references).]
-----------------------------------------------------------------
From: john moody <moody-john@YALE.ARPA>
Date: Sun, 11 Dec 88 22:54:11 EST
Scalettar and Zee did some interesting work on weight decay with back prop
for associative memory. They found that a Unary Representation emerged (see
Baum, Moody, and Wilczek; Bio Cybernetics Aug or Sept 88 for info on Unary
Reps). Contact Tony Zee at UCSB (805)961-4111 for info on weight decay paper.
--John Moody
[ 1991: Here's a reference on that excellent paper:
Scalettar, R. & Zee, A. (1986). A feed-forward memory with decay.
Technical Report NSF-ITP-86-118, Institute for Theoretical Physics,
University of California at Santa Barbara. Later published as:
Emergence of grandmother memory in feed forward networks: learning
with noise and forgetfulness. In: D. Waltz & J. A. Feldman (eds.),
Connectionist models and their implications: Readings from Cognitive
Science. Ablex, 1988.
(end reference).]
-----------------------------------------------------------------
From: gluck@psych.Stanford.EDU (Mark Gluck)
Date: Sat, 10 Dec 88 16:51:29 PST
I'd appreciate a copy of your weight decay collation. I have a paper in
manuscript form which illustrates how adding weight decay to the
linear-LMS one-layer
net improves its ability to predict human generalization in classification
learning.
mark gluck
dept of psych
stanford univ,
stanford, ca 94305
-----------------------------------------------------------------
From: INAM000 <INAM%MCGILLB.bitnet@jade.berkeley.edu> (Tony Marley)
Date: SUN 04 DEC 1988 11:16:00 EST
I have been exploring some ideas re COMPETITIVE LEARNING with "noisy
weights" in modeling simple psychophysics. The task is the classical
one of identifying one of N signals by a simple (verbal) response
-e.g. the stimuli might be squares of different sizes, and one has to
identify the presented one by saying the appropriate integer. We
know from classical experiments that people cannot perform this task
perfectly once N gets larger than about 7, but performance degrades
smoothly for larger N.
I have been developing simulations where the mapping is learnt by
competitive learning, with the weights decaying/varying over time when
they are not reset by relevant inputs. I have not got too many
results to date, as I have been taking the psychological data
seriously, which means worrying about reaction times, sequential
effects, "end effects" (stimuli at the end of the range more
accurately identified), range effects (increasing the stimulus range
has little effect), etc.
Tony Marley
-----------------------------------------------------------------
From: aboulanger@bbn.com (Albert Boulanger)
Date: Fri, 2 Dec 88 19:43:14 EST
This one concerns the Hopfield model. In
James D Keeler,
"Basin of Attraction of Neural Network Models",
Snowbird Conference Proceedings (1986), 259-264,
it is shown that the basins of attraction become very complicated as
the number of stored patterns increases. He uses a weight modification
method called "unlearning" to smooth out these basins.
Albert Boulanger
BBN Systems & Technologies Corp.
aboulanger@bbn.com
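[For concreteness, a hedged Python sketch of "unlearning" in the style
of Hopfield, Feinstein and Palmer (1983); the details below are
assumptions, not necessarily Keeler's exact procedure. The net settles
from a random state, and whatever attractor it finds is weakly
subtracted from the weights, flattening spurious basins.
  import numpy as np

  def settle(W, s, steps=50):
      for _ in range(steps):
          s = np.sign(W @ s)        # synchronous update, for simplicity
          s[s == 0] = 1.0
      return s

  def unlearn(W, eps=0.01, rng=None):
      rng = np.random.default_rng(0) if rng is None else rng
      s = rng.choice([-1.0, 1.0], size=W.shape[0])
      a = settle(W, s)              # attractor reached from a random state
      W = W - eps * np.outer(a, a)  # subtract it weakly
      np.fill_diagonal(W, 0.0)
      return W
(end sketch)]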
-----------------------------------------------------------------
From: Joerg Kindermann <unido!gmdzi!joerg@uunet.UU.NET>
Date: Mon, 5 Dec 88 08:21:03 -0100
We used a form of weight decay not for learning but for recall in
multilayer feedforward networks. See the following abstract. Input
patterns are treated as ``weights'' coming from a constant valued
external unit.
If you would like a copy of the technical report, please send e-mail to
joerg@gmdzi.uucp
or write to:
Dr. Joerg Kindermann
Gesellschaft fuer Mathematik und Datenverarbeitung
Schloss Birlinghoven
Postfach 1240
D-5205 St. Augustin 1
WEST GERMANY
Detection of Minimal Microfeatures by Internal Feedback
J. Kindermann & A. Linden
Abstract
We define the notion of minimal microfeatures and introduce a new
method of internal feedback for multilayer networks. Error signals are
used to modify the input of a net. When combined with input DECAY,
internal feedback allows the detection of sets of minimal
microfeatures, i.e. those subpatterns which the network actually uses
for discrimination. Additional noise on the training data increases
the number of minimal microfeatures for a given pattern. The detection
of minimal microfeatures is a first step towards a subsymbolic system
with the capability of self-explanation. The paper provides examples
from the domain of letter recognition.
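[A loose Python reading of the abstract; the exact procedure is in the
technical report, so the interface below (a caller-supplied gradient
of the error with respect to the input) and all constants are
assumptions. Error signals adjust the input while input decay erodes
whatever the trained net does not actually need for its decision; the
surviving subpattern approximates the minimal microfeatures.
  def minimal_microfeatures(grad_wrt_input, x, steps=200,
                            lr=0.1, decay=0.98):
      x = x.copy()
      for _ in range(steps):
          x -= lr * grad_wrt_input(x)  # keep the net's output correct
          x *= decay                   # input decay erodes unused parts
      return x
(end sketch)]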
-----------------------------------------------------------------
From: Helen M. Gigley <hgigley@note.nsf.gov>
Date: Mon, 05 Dec 88 11:03:23 -0500
I am responding to your request even though my use of decay is not
with respect to learning in connectionist-like models. My focus has
been on a functioning system that can be lesioned.
One question I have is what is the behavioral association to weight
decay? What aspects of learning is it intended to reflect? I can
understand that activity decay over time of each cell is meaningful
and reflects a cellular property, but what is weight decay in
comparable terms?
Now, I will send you offprints if you would like of my work and am
including a list of several publications which you may be able to
peruse. The model, HOPE, is a hand-tuned structural connectionist
model that is designed to enable lesioning without redesign or
reprogramming to study possible processing causes of aphasia. Decay
factors as an integral part of dynamic time-dependent processes are
one of several aspects of processing in a neural environment which
potentially affect the global processing results even though they are
defined only locally. If I can be of any additional help please let
me know.
Helen Gigley
References:
Gigley, H.M. Neurolinguistically Constrained Simulation of Sentence
Comprehension: Integrating Artificial Intelligence and Brain Theory.
Ph.D. Dissertation, UMass/Amherst, 1982. Available from University
Microfilms, Ann Arbor, MI.
Gigley, H.M. HOPE--AI and the dynamic process of language behavior.
in Cognition and Brain Theory 6(1) :39-88, 1983.
Gigley, H.M. Grammar viewed as a functioning part of a cognitive
system. Proceedings of ACL 23rd Annual Meeting, Chicago, 1985.
Gigley, H.M. Computational Neurolinguistics -- What is it all about?
in IJCAI Proceedings, Los Angeles, 1985.
Gigley, H.M. Studies in Artificial Aphasia--experiments in processing
change. In Journal of Computer Methods and Programs in Biomedicine,
22 (1): 43-50, 1986.
Gigley, H.M. Process Synchronization, Lexical Ambiguity Resolution,
and Aphasia. In Steven L. Small, Garrison Cottrell, and Michael
Tanenhaus (eds.) Lexical Ambiguity Resolution, Morgan Kaufmann, 1988.
-----------------------------------------------------------------
From: bharucha@eleazar.Dartmouth.EDU (Jamshed Bharucha)
Date: Tue, 13 Dec 88 16:56:00 EST
I haven't tried weight decay but am curious about it. I am working on
back-prop learning of musical sequences using a Jordan-style net. The
network develops a musical schema after learning lots of sequences
that have culture-specific regularities. I.e., it learns to generate
expectancies for tones following a sequential context. I'm interested
in knowing how to implement forgetting, whether short term or long
term.
Jamshed.
-----------------------------------------------------------------
======== END OF WEIGHT DECAY NOTES ==============================
Christian
PS: I also submitted a paper to NIPS which deals
with improving the generalization performance
of recurrent networks through neuron pruning.
It is available via anonymous ftp.
The address is external.nj.nec.com. The file
is /pub/giles/papers/prune.ps.Z.
====================================================================
Date: Mon, 24 May 93 14:39:52 -0500
From: John Kruschke <kruschke@pallas.psych.indiana.edu>
To: hkim@thedog.cis.ufl.edu
Subject: Re: weight decay: references wanted...
Organization: Indiana University
Status: R
Some references on weight decay / node pruning or creation /
dimensionality reduction:
Kruschke, J. K. & Movellan, J. R. (1991). Benefits of gain: speeded
learning and minimal hidden layers in back-propagation networks.
IEEE Transactions on Systems, Man and Cybernetics, v.21, pp.273-280.
Kruschke, J. K. (1989). Distributed bottlenecks for improved
generalization in back-propagation networks. International Journal
of Neural Networks Research and Applications, v.1, pp.187--193.
Kruschke, J. K. (1989). Creating local and distributed bottlenecks in
hidden layers of back-propagation networks. In: D. Touretzky,
G. Hinton and T. Sejnowski (Eds.), Proceedings of the 1988
Connectionist Models Summer School, pp.120-126. San Mateo, CA:
Morgan Kaufmann.
(The first two papers listed above report extensions of ideas introduced
in the third paper listed above.)
One of the appealing properties of the methods described in those papers
is that they can be construed as either node pruning or as node creating
methods. There is a pool of candidate nodes, and the degree of their
participation can be continuously adjusted, so that nodes can be retired
("pruned") or recruited ("created") as needed, depending on how effectively
error is being reduced with the currently participating nodes.
John K. Kruschke
Dept. of Psychology and Cognitive Science Program
Indiana University
Bloomington, IN 47405
P.S. Please send me or post the complete list of references you get! Thanks.
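[One way to realize the continuously adjustable participation Kruschke
describes, sketched in Python; this gating form is an illustrative
assumption, not his exact formulation. Each candidate hidden unit gets
a gain g[j] on its net input, learned along with the weights; units
whose gain falls toward zero are effectively retired, and units whose
gain grows are recruited.
  import numpy as np

  def hidden_layer(X, W, g):
      # sigmoid units with a per-unit gain on the net input
      return 1.0 / (1.0 + np.exp(-g * (X @ W)))

  # after training: participating = np.abs(g) > 0.1 (threshold assumed)
(end sketch)]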
=====================================================================
Date: Tue, 25 May 93 11:49:27 EDT
From: gatech!concert.net!array!nasir@bikini.cis.ufl.edu (Nasir Ghani)
Message-Id: <9305251549.AA15245@array.UUCP>
To: hkim@cis.ufl.edu
Status: R
Hi,
There are some articles in the NIPS (Neural Information Processing
Systems) proceedings which deal with the pruning (weights only, not
neurons) of neural networks (backpropagation nets, I am talking about).
I used the work from a paper by Yann Le Cun and colleagues in NIPS, I
think 1990 or so, and they had a method called optimal brain damage.
This is a rather structured approach, and you may just try ad-hoc
approaches such as a magnitude metric or something to knock off weights
below a certain threshold. I found, for my work at least, that the OBD
took a LOT longer to compute and the results were marginally better than
a simple threshold sort of rule. It really depends on the data you have
and how much of it you have for training purposes. Hope this helps.
Later
Nasir Ghani
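[The two criteria Nasir compares, side by side in Python. The OBD
saliency uses the diagonal-Hessian approximation of Le Cun, Denker and
Solla (1990), s_i = h_ii * w_i**2 / 2; the diagonal Hessian entries
must come from the network's second derivatives and are a supplied
array here. The simple rule just ranks weights by magnitude.
  import numpy as np

  def obd_saliency(w, h_diag):
      return 0.5 * h_diag * w**2

  def prune_mask(scores, frac=0.3):
      # keep weights whose score is above the lowest `frac` quantile
      return scores > np.quantile(scores, frac)

  # magnitude rule:  prune_mask(np.abs(w))
  # OBD rule:        prune_mask(obd_saliency(w, h_diag))
(end sketch)]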
==========================================================================
Date: Wed, 26 May 93 11:48:04 EST
From: young@s1.elec.uq.oz.au (Steven Young)
Message-Id: <9305260148.AA01404@s2.elec.uq.oz.au>
To: hkim@thedog.cis.ufl.edu
Subject: Re: weight decay: references wanted...
Status: R
I posted the following in reply to a similar request from Elliot Furman
Newsgroups: comp.ai.neural-nets
Subject: Re: Pruning units and weights
Date: Wed, 12 May 1993 00:22:13 GMT
furman@leland.Stanford.EDU (Elliot M Furman) writes:
>Can anyone tell me how to prune unnecessary units and weights?
>I would like to start training a fully feedforward NN with
>too many units and then prune those that aren't contributing
>much to the "solution".
This is the standard approach that comes to mind when people consider
pruning and is suggested in various papers I have seen (many people
attribute the idea to Rumelhart?). I'll include a list of references
that I know of at the end of this post.
One approach is the method of weight decay: remove connections
whose final (trained) weight is small (pick a parameter value, and if
a weight is less than that in absolute value, remove it). There are
a number of papers on including weight decay as part of the error
function for minimization with standard descent techniques. There
are a range of approaches and many papers expounding this idea:
(Hanson and Pratt, 1989), (Chauvin, 1989), (Le Cun, Denker and Solla, 1990),
(Ji, Snapp, Psaltis, 1990), (Bishop, 1990), (Weigend, Rumelhart, and Huberman,
1991).
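[A minimal Python sketch of that recipe: after training with weight
decay in the error function, delete every connection whose final
weight is small in absolute value. The threshold is a made-up
illustrative value.
  import numpy as np

  def prune_small_weights(W, threshold=0.05):
      mask = np.abs(W) >= threshold  # connections worth keeping
      return W * mask, mask          # pruned weights held at zero
(end sketch)]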
There are some other schemes for making pruning decisions directly.
One simple rule (suggested initially by Sietsma and Dow (1988)) is to
check whether network units in the same layer are duplicating function
and, if so, to remove one of the duplicating units. Mozer and Smolensky (1989) have
suggested a different scheme called skeletonization which makes a
decision based on the `relevance' of the unit. Relevance is checked
by comparing the performance of the network with the unit included and
removed.
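[The relevance check, in its simplest ablation form (Mozer and
Smolensky also derive a cheaper derivative-based approximation): a
unit's relevance is how much the error rises when the unit is removed.
The error_fn interface below, which evaluates the net with masked
units zeroed out, is an assumption.
  import numpy as np

  def relevances(error_fn, n_units):
      base = error_fn(np.ones(n_units))
      rho = np.zeros(n_units)
      for j in range(n_units):
          mask = np.ones(n_units)
          mask[j] = 0.0                   # ablate unit j
          rho[j] = error_fn(mask) - base  # relevance = error increase
      return rho
(end sketch)]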
J. Sietsma, R. J. F. Dow, `Neural Network Pruning --- Why and How',
ICNN 1988, vol I, pages 325--333, 1988.
Yves Chauvin, `A Back-Propagation Algorithm with optimal use of Hidden
Units', NIPS 1, pages 519--526, 1989.
Stephen Jos{\'e} Hanson, Lorien Y. Pratt, `Comparing Biases for Minimal
Network Construction with Back-Propagation', NIPS 1, pages 177--185, 1989.
Michael C. Mozer, Paul Smolensky, `Skeletonization: A technique for
trimming the fat from a network via relevance assessment', NIPS 1,
pages 107--115, 1989.
Michael C. Mozer, Paul Smolensky, `Using Relevance to Reduce Network
Size Automatically', Connection Science, vol. 1, no. 1, pages 3--16, 1989.
C. M. Bishop, `Curvature-Driven Smoothing in Backpropagation Neural
Networks', INNC-1990-Paris, pages 749--752, 1990.
Chuanyi Ji, Robert R. Snapp, Demetri Psaltis, `Generalizing Smoothness
Constraints from Discrete Samples', Neural Computation, vol. 2, pages
188-197, 1990.
Yann Le Cun, John S. Denker, Sara A. Solla, `Optimal Brain Damage',
NIPS 2, pages 598--605, 1990.
Jocelyn Sietsma, Robert J.F. Dow, `Creating Artificial Neural Networks
That Generalize' Neural Networks, vol 4, pages 67--79, 1991.
Andreas S. Weigend, David E. Rumelhart, Bernardo A. Huberman,
`Generalization by Weight-Elimination with Application to Forecasting',
NIPS 3, pages 875--882, 1991.
Hope this is helpful.
Steven
--
Steven Young PhD Student | Dept of Electrical Engineering
email : young@s1.elec.uq.oz.au | University of Queensland
Murphy was an anarchist! | AUSTRALIA 4072 Ph:61+7 3653564
---------
And John Kruschke posted the following recently:
Date: Mon, 17 May 93 14:08:09 -0500
From: John Kruschke <kruschke@pallas.psych.indiana.edu>
Subject: Re: Pruning units and weights
Newsgroups: comp.ai.neural-nets
Here are a couple more references to papers that describe methods for pruning
unneeded nodes from backprop networks:
Kruschke, J. K. (1989). Distributed bottlenecks for improved generalization
in back-propagation networks. International Journal of Neural Networks
Research and Applications, v.1, pp.187-193.
Kruschke, J. K., & Movellan, J. R. (1991). Benefits of gain: Speeded learning
and minimal hidden layers in back-propagation networks. IEEE Transactions
on Systems, Man and Cybernetics, v.21, pp.273-280.
John K. Kruschke
Dept. of Psychology
Indiana University
Bloomington, IN 47405 USA
--------