Article 9277 of comp.ai.neural-nets:
Path: serval!netnews.nwnet.net!usenet.coe.montana.edu!caen!usenet.cis.ufl.edu!hkim
From: hkim@insect.cis.ufl.edu (Hyeoncheol Kim)
Newsgroups: comp.ai.neural-nets
Subject: SUMMARY: weight decay references
Date: 30 May 1993 20:36:21 GMT
Organization: Univ. of Florida CIS Dept.
Lines: 874
Distribution: world
Message-ID: <1ub5s5INNisg@snoopy.cis.ufl.edu>
NNTP-Posting-Host: insect.cis.ufl.edu
Hello,
I posted a request for references on the subject of weight decay and
pruning to this newsgroup a week or so ago.
Here is a summary of the responses I have received so far.
Thanks a lot for your help. I really appreciate it.
Hyeoncheol Kim
hkim@cis.ufl.edu
Dept. of Computer and Information Sciences
University of Florida
Gainesville, Florida, USA.
Enjoy...
SUMMARY BEGINS...
==================================================================
Date: Mon, 24 May 1993 12:17:29 -0400
From: omlinc@cs.rpi.edu
To: hkim@cis.ufl.edu
Subject: Re: references on pruning wanted ...
Status: R
Here is a list of references I received a couple
of months ago in reply to a similar request:
From KRUSCHKE@ucs.indiana.edu Tue Mar 16 15:25:52 1993
Date: Tue, 16 Mar 93 15:25:48 EST
From: "John K. Kruschke" <KRUSCHKE@ucs.indiana.edu>
Subject: weight decay
To: omlinc@cs.rpi.edu
Status: R
I did some work on methods to reduce the hidden layers of back-prop
networks (references below), using variants of weight decay, and
there's no reason they couldn't be applied to recurrent networks.
I've moved on to other research now, but I'd be very interested in
whatever results you get (either supportive or not). Good luck, in
any case!
Kruschke, J. K. and Movellan, J. R. (1991).
Benefits of gain: Speeded learning and minimal hidden layers in
back-propagation networks.
IEEE Transactions on Systems, Man and Cybernetics, v.21, pp.273-280.
Kruschke, J. K. (1989b).
Distributed bottlenecks for improved generalization in
back-propagation networks.
International J. of Neural Networks Research and Applications,
v.1, pp.187-193.
Kruschke, J. K. (1989a).
Improving generalization in back-propagation networks with distributed
bottlenecks.
In: Proceedings of the IEEE International Joint Conference on Neural
Networks, v.1, 443-447. Washington DC, June 1989.
Kruschke, J. K. (1988).
Creating local and distributed bottlenecks in hidden layers of
back-propagation networks.
In: D. Touretzky, G. Hinton and T. Sejnowski (eds.),
Proceedings of the 1988 Connectionist Models Summer School,
pp.120-126. San Mateo, CA: Morgan Kaufmann.
------------------------------------------------------------
John K. Kruschke Asst. Prof. of Psych. & Cog. Sci.
Dept. of Psychology internet: kruschke@indiana.edu
Indiana University bitnet: kruschke@iubacs
Bloomington, IN 47405-4201 office: (812) 855-3192
USA lab: (812) 855-9613
============================================================
From KRUSCHKE@ucs.indiana.edu Tue Mar 16 15:29:00 1993
Date: Tue, 16 Mar 93 15:28:00 EST
From: "John K. Kruschke" <KRUSCHKE@ucs.indiana.edu>
Subject: more on weight decay
To: omlinc@cs.rpi.edu
Status: R
This is pretty dated now, but it might be of historical interest.
==========
Date: Tue, 3 Jan 89 00:30:12 PST
From: kruschke@cogsci.berkeley.edu (John Kruschke)
To: connectionists@cs.cmu.edu
Here is the compilation of responses to my request for info on
weight decay.
I have kept editing to a minimum, so you can see exactly what the
author of the reply said. Where appropriate, I have included some
comments of my own, set off in square brackets. The responses are
arranged into three broad topics: (1) Boltzmann-machine related;
(2) back-prop related; (3) psychology related.
Thanks to all, and happy new year! --John
-----------------------------------------------------------------
ORIGINAL REQUEST:
I'm interested in all the information I can get regarding
WEIGHT DECAY in back-prop, or in other learning algorithms.
*In return* I'll collate all the info contributed and send the
compilation out to all contributors.
Info might include the following:
REFERENCES:
- Applications which used weight decay
- Theoretical treatments
Please be as complete as possible in your citation.
FIRST-HAND EXPERIENCE
- Application domain, details of I/O patterns, etc.
- exact decay procedure used, and results
(Please send info directly to me: kruschke@cogsci.berkeley.edu
Don't use the reply command.)
T H A N K S ! --John Kruschke.
-----------------------------------------------------------------
From: Geoffrey Hinton <hinton@ai.toronto.edu>
Date: Sun, 4 Dec 88 13:57:45 EST
Weight-decay is a version of what statisticians call "Ridge
Regression".
We used weight-decay in Boltzmann machines to keep the energy barriers
small. This is described in section 6.1 of:
Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984)
Boltzmann Machines: Constraint satisfaction networks that learn.
Technical Report CMU-CS-84-119, Carnegie-Mellon University.
I used weight decay in the family trees example. Weight decay was
used to improve generalization and to make the weights easier to
interpret (because, at equilibrium, the magnitude of a weight =
its usefulness). This is in:
Rumelhart, D.~E., Hinton, G.~E., and Williams, R.~J. (1986)
Learning representations by back-propagating errors.
{\it Nature}, {\bf 323}, 533--536.
I used weight decay to achieve better generalization in a hard
generalization task that is reported in:
Hinton, G.~E. (1987)
Learning translation invariant recognition in a massively
parallel network. In Goos, G. and Hartmanis, J., editors,
{\it PARLE: Parallel Architectures and Languages Europe},
pages~1--13, Lecture Notes in Computer Science,
Springer-Verlag, Berlin.
Weight-decay can also be used to keep "fast" weights small. The fast
weights act as a temporary context. One use of such a context is
described in:
Hinton, G.~E. and Plaut, D.~C. (1987)
Using fast weights to deblur old memories.
{\it Proceedings of the Ninth Annual Conference of the
Cognitive Science Society}, Seattle, WA.
--Geoff
-----------------------------------------------------------------
[In his lecture at the International Computer Science Institute,
Berkeley CA, on 16-DEC-88, Geoff also mentioned that weight decay is
good for wiping out the initial values of weights so that only the
effects of learning remain.
In particular, if the change (due to learning) on two weights is the
same for all updates, then the two weights converge to the same
value. This is one way to generate symmetric weights from
non-symmetric starting values.
--John]
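[A minimal Python sketch of this effect (the numbers are made up, not
from the original posts): with multiplicative decay, two weights that
receive the same update every step forget their different initial
values and converge to the common fixed point delta/(1 - gamma).
  gamma, delta = 0.99, 0.05   # decay factor, common per-step update
  w1, w2 = 8.0, -3.0          # different initial weights
  for _ in range(2000):
      w1 = gamma * w1 + delta
      w2 = gamma * w2 + delta
  print(w1, w2, delta / (1 - gamma))   # all three are ~5.0
(end sketch)]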
-----------------------------------------------------------------
From: Michael.Franzini@SPEECH2.CS.CMU.EDU
Date: Sun, 4 Dec 1988 23:24-EST
My first-hand experience confirms what I'm sure many other people have
told you: that (in general) weight decay in backprop increases
generalization. I've found that it's particularly important for small
training sets, and its effect diminishes as the training set size
increases.
Weight decay was first used by Barak Pearlmutter. The first mention
of weight decay is, I believe, in an early paper of Hinton's (possibly
the Plaut, Nowlan, and Hinton CMU CS tech report), and it is
attributed to "Barak Pearlmutter, Personal Communication" there.
The version of weight decay that (I'm fairly sure) all of us at CMU
use is one in which each weight is multiplied by 0.999 every epoch.
Scott Fahlman has a more complicated version, which is described in
his QUICKPROP tech report. [QuickProp is also described in his paper
in the Proceedings of the 1988 Connectionist Models Summer School,
published by Morgan Kaufmann. --John]
The main motivation for using it is to eliminate spurious large
weights which happen not to interfere with recognition of training
data but would interfere with recognizing testing data. (This was
Barak's motivation for trying it in the first place.) However, I have
heard more theoretical justifications (which, unfortunately, I can't
reproduce).
In case Barak didn't reply to your message, you might want to contact
him directly at bap@cs.cmu.edu.
--Mike
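[A minimal Python sketch of the decay scheme Mike describes: multiply
every weight by a constant just under 1 once per epoch, on top of the
usual gradient step. The toy least-squares problem, learning rate, and
decay constant below are illustrative assumptions only.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 5))
  y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5])  # some true weights are zero

  w = rng.normal(size=5)
  lr, decay = 0.01, 0.999
  for epoch in range(500):
      grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
      w -= lr * grad                     # ordinary gradient step
      w *= decay                         # weight decay, once per epoch
  print(np.round(w, 2))                  # spurious weights are pulled toward 0
(end sketch)]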
-----------------------------------------------------------------
From: Barak.Pearlmutter@F.GP.CS.CMU.EDU
Date: 8 Dec 1988 16:36-EST
We first used weight decay as a way to keep weights in a boltzmann
machine from growing too large. We added a term to the thing being
minimized, G, so that
G' = G + 1/2 h \sum_{i<j} w_{ij}^2
where G' is our new thing to minimize. This gives
\partial G'/\partial w_{ij} = \partial G/\partial w_{ij} + h w_{ij}
which is just weight decay with some mathematical motivation. As Mike
mentioned, I was the person who thought of weight decay in this
context (in the shower no less), but parameter decay has been used
forever, in adaptive control for example.
It sort of worked okay for Boltzmann machines, but works much better
in backpropagation. As a historical note I should mention that there
were some competing techniques for keeping weights small in Boltzmann
machines, such as Mark Derthick's "differential glommetry", in which
the effective target temperature of the wake phase is higher than
that of the sleep phase. I don't know if there is an analogue for
this in backpropagation, but there certainly is for mean field theory
networks.
Getting back to weight decay, it was noted immediately that G has the
unit "bits" while $w_{ij}^2$ has the unit "weight^2", sort of a
problem from a dimensional analysis point of view. Solving this
conundrum, Rick Szeliski pointed out that if we're going to transmit
our weights by telephone and know a priori that weights have Gaussian
distributions,
P(w_{ij}=x) \propto e^{-1/2 h x^2}
where h is set to get the correct variance, then transmitting a weight
w will take $1/2 h w^2$ bits (up to an additive constant), which we can
add to G with dimensional confidence.
Of course, this argument extends to fast/slow split weights nicely; the
other guy already knows the slow weights, so we need transmit only the
fast weights.
By "ridge regression" I guess Geoff means that valleys in weight space
that cause the weights to grow asymptotically are made to tilt up after
a while, so that the asymptotic tailing off is eliminated. It's like
adding a bowl to weight space, so minima have to be within the bowl.
An interesting side effect of weight decay is that, once we get to a
minimum, so $\partial G'/\partial w = 0$, then
w_{ij} \propto - \partial G/\partial w_{ij}
so we can do a sort of eyeball significance analysis, since a weight's
magnitude is proportional to how sensitive the error is to changing
it.
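[In code, Barak's formulation amounts to adding h*w to the gradient;
one gradient step with learning rate lr then shrinks the weights by a
factor (1 - lr*h) before the usual update, which is exactly the
multiplicative decay described earlier. A sketch, with illustrative
names:
  def penalized_grad(grad_G, w, h=1e-3):
      # dG'/dw = dG/dw + h*w, from G' = G + (h/2) * sum of w**2
      return grad_G + h * w

  # one step:  w <- w - lr * (grad_G + h*w) = (1 - lr*h)*w - lr*grad_G
(end sketch)]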
-----------------------------------------------------------------
From: russ%yummy@gateway.mitre.org (Russell Leighton)
Date: Mon, 5 Dec 88 09:17:56 EST
We always use weight decay in backprop. It is particularly important
in escaping local minima. Decay moves the transfer functions of all
the semi-linear (sigmoidal) nodes toward the linear region. The
important point is that all nodes move proportionally so no
information in the weights is "erased" but only scaled. When the nodes
that have trapped the system in the local minima are scaled enough,
the system moves onto a different trajectory through weight space.
Oscillations are still possible, but are less likely.
We use decay with a process we call "shaping" (see Wieland and
Leighton, "Shaping Schedules as a Method for Accelerating Leanring",
Abstracts of the First Annual INNS Meeting, Boston, 1988) that we use
to speed learning of some difficult problems.
ARPA: russ%yummy@gateway.mitre.org
Russell Leighton
MITRE Signal Processing Lab
7525 Colshire Dr.
McLean, Va. 22102
USA
-----------------------------------------------------------------
From: James Arthur Pittman <hi.pittman@MCC.COM>
Date: Tue, 6 Dec 88 09:34 CST
Probably he will respond to you himself, but Alex Weiland of MITRE
presented a paper at INNS in Boston on shaping, in which the order of
presentation of examples in training a back-prop net was altered to
reflect a simpler rule at first. Over a number of epochs he gradually
changed the examples to slowly change the rule to the one desired. The
nets learned much faster than if he just tossed the examples at the
net in random order. He told me that it would not work without weight
decay. He said their rule-of-thumb was the decay should give the
weights a half-life of 2 to 3 dozen epochs (usually a value such as
0.9998). But I neglected to ask him if he felt that the number of
epochs or the number of presentations was important. Perhaps if one
had a significantly different training set size, that rule-of-thumb
would be different?
I have started some experiments similar to his shaping, using some
random variation of the training data (where the random variation
grows over time). Weiland also discussed this in his talk. I haven't
yet compared decay with no-decay. I did try (as a lark) using decay
with a regular (non-shaping) training, and it did worse than we
usually get (on same data and same network type/size/shape). Perhaps
I was using a stupid decay value (0.9998 I think) for that situation.
I hope to get back to this, but at the moment we are preparing for a
software release to our shareholders (MCC is owned by 20 or so
computer industry corporations). In the next several weeks a lot of
people will go on Christmas vacation, so I will be able to run a bunch
of nets all at once. They call me the machine vulture.
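[The half-life rule of thumb converts to a decay factor as
decay = 0.5**(1/n) for a half-life of n steps. A quick check in Python
shows that 0.9998 corresponds to a half-life of about 3465 steps, so
the two quoted figures are consistent only if a "step" is a single
pattern presentation and an epoch contains on the order of a hundred
presentations -- exactly the epoch-versus-presentation ambiguity noted
above.
  import math
  print(math.log(0.5) / math.log(0.9998))  # half-life: ~3465 steps
  print(0.5 ** (1.0 / 36))                 # per-epoch factor for a
                                           # 36-epoch half-life: ~0.981
(end sketch)]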
-----------------------------------------------------------------
From: Tony Robinson <ajr@digsys.engineering.cambridge.ac.uk>
Date: Sat, 3 Dec 88 11:10:20 GMT
Just a quick note in reply to your message to `connectionists' to say
that I have tried to use weight decay with back-prop on networks with
order 24 i/p, 24 hidden, 11 o/p units. The problem was vowel
recognition (I think), it was about 18 months ago, and the problem was
of the unsolvable type (i.e. non-zero final energy).
My conclusion was that weight decay only made matters worse, and my
justification (to myself) for abandoning weight decay was that you are
not even pretending to do gradient descent any more, and any good
solution formed quickly becomes garbaged by scaling the weights.
If you want to avoid hidden units sticking on their limiting values,
why not use hidden units with no limiting values? For instance, I find
the activation function f(x) = x * x works better than
f(x) = 1.0 / (1.0 + exp(-x)) anyway.
Sorry I haven't got anything formal to offer, but I hope these notes
help.
Tony Robinson.
-----------------------------------------------------------------
From: jose@tractatus.bellcore.com (Stephen J Hanson)
Date: Sat, 3 Dec 88 11:54:02 EST
Actually, "costs" or "penalty" functions are probably better terms. We
had a poster last week at NIPS that discussed some of the pitfalls and
advantages of two kinds of costs. I can send you the paper when we
have a version available.
Stephen J. Hanson (jose@bellcore.com)
-----------------------------------------------------------------
[ In a conversation in his office on 06-DEC-88, Dave Rumelhart
described to me several cost functions he has tried.
The motive for the functions he has tried is different from the motive
for standard weight decay. Standard weight decay,
\sum_{i,j} w_{i,j}^2 ,
is used to *distribute* weights more evenly over the given
connections, thereby increasing robustness (cf. earlier replies).
He has tried several other cost functions in an attempt to *localize*,
or concentrate, the weights on a small subset of the given
connections. The goal is to improve generalization. His favorite is
\sum_{i,j} ( w_{i,j}^2 / ( K + w_{i,j}^2 ) )
where K is a constant, around 1 or 2. Note that this function is
negatively accelerating, whereas standard weight decay is positively
accelerating. This function penalizes small weights (proportionally)
more than large weights, just the opposite of standard weight decay.
He has also tried, with less satisfying results,
\sum ( 1 - \exp( -\alpha w_{i,j}^2 ) )
and
\sum \ln ( K + w_{i,j}^2 ).
Finally, he has tried a cost function designed to make all the fan-in
weights of a single unit decay, when possible. That is, the unit is
effectively cut out of the network. The function is
\sum_i (\sum_j w_{i,j}^2) / ( K + \sum_j w_{i,j}^2 ).
Each weight is thereby penalized (inversely) proportionally to the
total fan-in weight of its node.
--John ]
[1991: Some papers that have explored Rumelhart's ideas:
Hanson, S. J. and Pratt, L. Y. (1989). Comparing biases for minimal
network construction with back-propagation. In: D. S. Touretzky
(ed.), Advances in Neural Information Processing Systems 1,
pp.177-185. San Mateo, CA: Morgan Kaufmann.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991).
Generalization by weight-elimination with application to forecasting.
In: R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.),
Advances in Neural Information Processing Systems 3,
San Mateo, CA: Morgan Kaufmann.
(end references).]
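[A small Python sketch of Rumelhart's favorite cost above (the
"weight-elimination" penalty of Weigend et al., 1991), with its
gradient; K = 1 as suggested. Note how the pull toward zero,
2*K*w/(K + w**2)**2, stays strong for small weights but fades for
large ones, which is what concentrates the weights on few connections.
  import numpy as np

  def weight_elimination_cost(W, K=1.0):
      return np.sum(W**2 / (K + W**2))

  def weight_elimination_grad(W, K=1.0):
      return 2.0 * K * W / (K + W**2) ** 2
(end sketch)]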
-----------------------------------------------------------------
[ This is also a relevant place to mention my paper in the Proceedings
of the 1988 Connectionist Models Summer School, "Creating local and
distributed bottlenecks in back-propagation networks". I have since
developed those ideas, and have expressed the localized bottleneck
method as gradient descent on an additional cost term. The cost term
is quite general, and some forms of decay are simply special cases of
it. --John]
[ 1991: Here are references to that work:
Kruschke, J. K., & Movellan, J. R. (1991). Benefits of gain: Speeded
learning and minimal hidden layers in back propagation networks.
IEEE Transactions on Systems, Man and Cybernetics, v.21, pp.273-280.
Kruschke, J. K. (1989b). Distributed bottlenecks for improved
generalization in back-propagation networks. International Journal of
Neural Networks Research and Applications, v.1, pp.187-193.
Kruschke, J. K. (1989a). Improving generalization in back-propagation
networks with distributed bottlenecks. In: Proceedings of the IEEE
International Joint Conference on Neural Networks, Washington D.C.
June 1989, v.1, pp.443-447.
Kruschke, J. K. (1988). Creating local and distributed bottlenecks in
hidden layers of back-propagation networks. In: D. Touretzky,
G. Hinton, & T. Sejnowski (eds.), Proceedings of the 1988
Connectionist Models Summer School, pp.120-126. San Mateo, CA:
Morgan Kaufmann.
(end references).]
-----------------------------------------------------------------
From: john moody <moody-john@YALE.ARPA>
Date: Sun, 11 Dec 88 22:54:11 EST
Scalettar and Zee did some interesting work on weight decay with back prop
for associative memory. They found that a Unary Representation emerged (see
Baum, Moody, and Wilczek; Bio Cybernetics Aug or Sept 88 for info on Unary
Reps). Contact Tony Zee at UCSB (805)961-4111 for info on weight decay paper.
--John Moody
[ 1991: Here's a reference on that excellent paper:
Scalettar, R. & Zee, A. (1986). A feed-forward memory with decay.
Technical Report NSF-ITP-86-118, Institute for Theoretical Physics,
University of California at Santa Barbara. Later published as:
Emergence of grandmother memory in feed forward networks: learning
with noise and forgetfulness. In: D. Waltz & J. A. Feldman (eds.),
Connectionist models and their implications: Readings from Cognitive
Science. Ablex, 1988.
(end reference).]
-----------------------------------------------------------------
From: gluck@psych.Stanford.EDU (Mark Gluck)
Date: Sat, 10 Dec 88 16:51:29 PST
I'd appreciate a copy of your weight decay collation. I have a paper in
manuscript form which illustrates how adding weight decay to the
linear-LMS one-layer
net improves its ability to predict human generalization in classification
learning.
mark gluck
dept of psych
stanford univ,
stanford, ca 94305
-----------------------------------------------------------------
From: INAM000 <INAM%MCGILLB.bitnet@jade.berkeley.edu> (Tony Marley)
Date: SUN 04 DEC 1988 11:16:00 EST
I have been exploring some ideas re COMPETITIVE LEARNING with "noisy
weights" in modeling simple psychophysics. The task is the classical
one of identifying one of N signals by a simple (verbal) response
-e.g. the stimuli might be squares of different sizes, and one has to
identify the presented one by saying the appropriate integer. We
know from classical experiments that people cannot perform this task
perfectly once N gets larger than about 7, but performance degrades
smoothly for larger N.
I have been developing simulations where the mapping is learnt by
competitive learning, with the weights decaying/varying over time when
they are not reset by relevant inputs. I have not got too many
results to date, as I have been taking the psychological data
seriously, which means worrying about reaction times, sequential
effects, "end effects" (stimuli at the end of the range more
accurately identified), range effects (increasing the stimulus range
has little effect), etc.
Tony Marley
-----------------------------------------------------------------
From: aboulanger@bbn.com (Albert Boulanger)
Date: Fri, 2 Dec 88 19:43:14 EST
This one concerns the Hopfield model. In
James D Keeler,
"Basin of Attraction of Neural Network Models",
Snowbird Conference Proceedings (1986), 259-264,
it is shown that the basins of attraction become very complicated as
the number of stored patterns increases. He uses a weight modification
method called "unlearning" to smooth out these basins.
Albert Boulanger
BBN Systems & Technologies Corp.
aboulanger@bbn.com
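[For concreteness, a hedged Python sketch of "unlearning" in the style
of Hopfield, Feinstein and Palmer (1983); the details below are
assumptions, not necessarily Keeler's exact procedure. The net settles
from a random state, and whatever attractor it finds is weakly
subtracted from the weights, flattening spurious basins.
  import numpy as np

  def settle(W, s, steps=50):
      for _ in range(steps):
          s = np.sign(W @ s)        # synchronous update, for simplicity
          s[s == 0] = 1.0
      return s

  def unlearn(W, eps=0.01, rng=None):
      rng = np.random.default_rng(0) if rng is None else rng
      s = rng.choice([-1.0, 1.0], size=W.shape[0])
      a = settle(W, s)              # attractor reached from a random state
      W = W - eps * np.outer(a, a)  # subtract it weakly
      np.fill_diagonal(W, 0.0)
      return W
(end sketch)]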
-----------------------------------------------------------------
From: Joerg Kindermann <unido!gmdzi!joerg@uunet.UU.NET>
Date: Mon, 5 Dec 88 08:21:03 -0100
We used a form of weight decay not for learning but for recall in
multilayer feedforward networks. See the following abstract. Input
patterns are treated as ``weights'' coming from a constant valued
external unit.
If you would like a copy of the technical report, please send e-mail to
joerg@gmdzi.uucp
or write to:
Dr. Joerg Kindermann
Gesellschaft fuer Mathematik und Datenverarbeitung
Schloss Birlinghoven
Postfach 1240
D-5205 St. Augustin 1
WEST GERMANY
Detection of Minimal Microfeatures by Internal Feedback
J. Kindermann & A. Linden
Abstract
We define the notion of minimal microfeatures and introduce a new
method of internal feedback for multilayer networks. Error signals are
used to modify the input of a net. When combined with input DECAY,
internal feedback allows the detection of sets of minimal
microfeatures, i.e. those subpatterns which the network actually uses
for discrimination. Additional noise on the training data increases
the number of minimal microfeatures for a given pattern. The detection
of minimal microfeatures is a first step towards a subsymbolic system
with the capability of self-explanation. The paper provides examples
from the domain of letter recognition.
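[A loose Python reading of the abstract; the exact procedure is in the
technical report, so the interface below (a caller-supplied gradient
of the error with respect to the input) and all constants are
assumptions. Error signals adjust the input while input decay erodes
whatever the trained net does not actually need for its decision; the
surviving subpattern approximates the minimal microfeatures.
  def minimal_microfeatures(grad_wrt_input, x, steps=200,
                            lr=0.1, decay=0.98):
      x = x.copy()
      for _ in range(steps):
          x -= lr * grad_wrt_input(x)  # keep the net's output correct
          x *= decay                   # input decay erodes unused parts
      return x
(end sketch)]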
-----------------------------------------------------------------
From: Helen M. Gigley <hgigley@note.nsf.gov>
Date: Mon, 05 Dec 88 11:03:23 -0500
I am responding to your request even though my use of decay is not
with respect to learning in connectionist-like models. My focus has
been on a functioning system that can be lesioned.
One question I have is what is the behavioral association to weight
decay? What aspects of learning is it intended to reflect? I can
understand that activity decay over time of each cell is meaningful
and reflects a cellular property, but what is weight decay in
comparable terms?
Now, I will send you offprints if you would like of my work and am
including a list of several publications which you may be able to
peruse. The model, HOPE, is a hand-tuned structural connectionist
model that is designed to enable lesioning without redesign or
reprogramming to study possible processing causes of aphasia. Decay
factors as an integral part of dynamic time-dependent processes are
one of several aspects of processing in a neural environment which
potentially affect the global processing results even though they are
defined only locally. If I can be of any additional help please let
me know.
Helen Gigley
References:
Gigley, H.M. Neurolinguistically Constrained Simulation of Sentence
Comprehension: Integrating Artificial Intelligence and Brain Theory.
Ph.D. Dissertation, UMass/Amherst, 1982. Available from University
Microfilms, Ann Arbor, MI.
Gigley, H.M. HOPE--AI and the dynamic process of language behavior.
in Cognition and Brain Theory 6(1) :39-88, 1983.
Gigley, H.M. Grammar viewed as a functioning part of a cognitive
system. Proceedings of ACL 23rd Annual Meeting, Chicago, 1985.
Gigley, H.M. Computational Neurolinguistics -- What is it all about?
in IJCAI Proceedings, Los Angeles, 1985.
Gigley, H.M. Studies in Artificial Aphasia--experiments in processing
change. In Journal of Computer Methods and Programs in Biomedicine,
22 (1): 43-50, 1986.
Gigley, H.M. Process Synchronization, Lexical Ambiguity Resolution,
and Aphasia. In Steven L. Small, Garrison Cottrell, and Michael
Tanenhaus (eds.) Lexical Ambiguity Resolution, Morgan Kaufmann, 1988.
-----------------------------------------------------------------
From: bharucha@eleazar.Dartmouth.EDU (Jamshed Bharucha)
Date: Tue, 13 Dec 88 16:56:00 EST
I haven't tried weight decay but am curious about it. I am working on
back-prop learning of musical sequences using a Jordan-style net. The
network develops a musical schema after learning lots of sequences
that have culture-specific regularities. I.e., it learns to generate
expectancies for tones following a sequential context. I'm interested
in knowing how to implement forgetting, whether short term or long
term.
Jamshed.
-----------------------------------------------------------------
======== END OF WEIGHT DECAY NOTES ==============================
Christian
PS: I also submitted a paper to NIPS which deals
with improving the generalization performance
of recurrent networks through neuron pruning.
It is available via anonymous ftp.
The address is external.nj.nec.com. The file
is /pub/giles/papers/prune.ps.Z.
====================================================================
Date: Mon, 24 May 93 14:39:52 -0500
From: John Kruschke <kruschke@pallas.psych.indiana.edu>
To: hkim@thedog.cis.ufl.edu
Subject: Re: weight decay: references wanted...
Organization: Indiana University
Status: R
Some references on weight decay / node pruning or creation /
dimensionality reduction:
Kruschke, J. K. & Movellan, J. R. (1991). Benefits of gain: speeded
learning and minimal hidden layers in back-propagation networks.
IEEE Transactions on Systems, Man and Cybernetics, v.21, pp.273-280.
Kruschke, J. K. (1989). Distributed bottlenecks for improved
generalization in back-propagation networks. International Journal
of Neural Networks Research and Applications, v.1, pp.187--193.
Kruschke, J. K. (1989). Creating local and distributed bottlenecks in
hidden layers of back-propagation networks. In: D. Touretzky,
G. Hinton and T. Sejnowski (Eds.), Proceedings of the 1988
Connectionist Models Summer School, pp.120-126. San Mateo, CA:
Morgan Kaufmann.
(The first two papers listed above report extensions of ideas introduced
in the third paper listed above.)
One of the appealing properties of the methods described in those papers
is that they can be construed as either node pruning or as node creating
methods. There is a pool of candidate nodes, and the degree of their
participation can be continuously adjusted, so that nodes can be retired
("pruned") or recruited ("created") as needed, depending on how effectively
error is being reduced with the currently participating nodes.
John K. Kruschke
Dept. of Psychology and Cognitive Science Program
Indiana University
Bloomington, IN 47405
P.S. Please send me or post the complete list of references you get! Thanks.
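[One way to realize the continuously adjustable participation Kruschke
describes, sketched in Python; this gating form is an illustrative
assumption, not his exact formulation. Each candidate hidden unit gets
a gain g[j] on its net input, learned along with the weights; units
whose gain falls toward zero are effectively retired, and units whose
gain grows are recruited.
  import numpy as np

  def hidden_layer(X, W, g):
      # sigmoid units with a per-unit gain on the net input
      return 1.0 / (1.0 + np.exp(-g * (X @ W)))

  # after training: participating = np.abs(g) > 0.1 (threshold assumed)
(end sketch)]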
=====================================================================
Date: Tue, 25 May 93 11:49:27 EDT
From: gatech!concert.net!array!nasir@bikini.cis.ufl.edu (Nasir Ghani)
Message-Id: <9305251549.AA15245@array.UUCP>
To: hkim@cis.ufl.edu
Status: R
Hi,
There are some articles in the NIPS (Neural Information Processing
Systems) proceedings which deal with the pruning (weights only, not
neurons) of neural networks (backpropagation nets, I am talking about).
I used the work from a paper by Yann Le Cun and colleagues in NIPS, I
think 1990 or so, and they had a method called optimal brain damage.
This is a rather structured approach, and you may just try ad-hoc
approaches such as a magnitude metric or something to knock off weights
below a certain threshold. I found, for my work at least, that the OBD
took a LOT longer to compute and the results were marginally better than
a simple threshold sort of rule. It really depends on the data you have
and how much of it you have for training purposes. Hope this helps.
Later
Nasir Ghani
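[The two criteria Nasir compares, side by side in Python. The OBD
saliency uses the diagonal-Hessian approximation of Le Cun, Denker and
Solla (1990), s_i = h_ii * w_i**2 / 2; the diagonal Hessian entries
must come from the network's second derivatives and are a supplied
array here. The simple rule just ranks weights by magnitude.
  import numpy as np

  def obd_saliency(w, h_diag):
      return 0.5 * h_diag * w**2

  def prune_mask(scores, frac=0.3):
      # keep weights whose score is above the lowest `frac` quantile
      return scores > np.quantile(scores, frac)

  # magnitude rule:  prune_mask(np.abs(w))
  # OBD rule:        prune_mask(obd_saliency(w, h_diag))
(end sketch)]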
==========================================================================
Date: Wed, 26 May 93 11:48:04 EST
From: young@s1.elec.uq.oz.au (Steven Young)
Message-Id: <9305260148.AA01404@s2.elec.uq.oz.au>
To: hkim@thedog.cis.ufl.edu
Subject: Re: weight decay: references wanted...
Status: R
I posted the following in reply to a similar request from Elliot Furman
Newsgroups: comp.ai.neural-nets
Subject: Re: Pruning units and weights
Date: Wed, 12 May 1993 00:22:13 GMT
furman@leland.Stanford.EDU (Elliot M Furman) writes:
>Can anyone tell me how to prune unnecessary units and weights?
>I would like to start training a fully feedforward NN with
>too many units and then prune those that aren't contributing
>much to the "solution".
This is the standard approach that comes to mind when people consider
pruning and is suggested in various papers I have seen (many people
attribute the idea to Rumelhart?). I'll include a list of references
that I know of at the end of this post.
One approach is the method of weight decay: remove connections
whose final (trained) weight is small (pick a parameter value, and if
a weight is less than that in absolute value, remove it). There are
a number of papers on including weight decay as part of the error
function for minimization with standard descent techniques. There
are a range of approaches and many papers expounding this idea:
(Hanson and Pratt, 1989), (Chauvin, 1989), (Le Cun, Denker and Solla, 1990),
(Ji, Snapp, Psaltis, 1990), (Bishop, 1990), (Weigend, Rumelhart, and Huberman,
1991).
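[A minimal Python sketch of that recipe: after training with weight
decay in the error function, delete every connection whose final
weight is small in absolute value. The threshold is a made-up
illustrative value.
  import numpy as np

  def prune_small_weights(W, threshold=0.05):
      mask = np.abs(W) >= threshold  # connections worth keeping
      return W * mask, mask          # pruned weights held at zero
(end sketch)]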
There are some other schemes for making pruning decisions directly.
One simple rule (suggested initially by Sietsma and Dow (1988)) is to
check whether network units in the same layer are duplicating function
and, if so, to remove one of the duplicating units. Mozer and Smolensky (1989) have
suggested a different scheme called skeletonization which makes a
decision based on the `relevance' of the unit. Relevance is checked
by comparing the performance of the network with the unit included and
removed.
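[The relevance check, in its simplest ablation form (Mozer and
Smolensky also derive a cheaper derivative-based approximation): a
unit's relevance is how much the error rises when the unit is removed.
The error_fn interface below, which evaluates the net with masked
units zeroed out, is an assumption.
  import numpy as np

  def relevances(error_fn, n_units):
      base = error_fn(np.ones(n_units))
      rho = np.zeros(n_units)
      for j in range(n_units):
          mask = np.ones(n_units)
          mask[j] = 0.0                   # ablate unit j
          rho[j] = error_fn(mask) - base  # relevance = error increase
      return rho
(end sketch)]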
J. Sietsma, R. J. F. Dow, `Neural Network Pruning --- Why and How',
ICNN 1988, vol I, pages 325--333, 1988.
Yves Chauvin, `A Back-Propagation Algorithm with optimal use of Hidden
Units', NIPS 1, pages 519--526, 1989.
Stephen Jos{\'e} Hanson, Lorien Y. Pratt, `Comparing Biases for Minimal
Network Construction with Back-Propagation', NIPS 1, pages 177--185, 1989.
Michael C. Mozer, Paul Smolensky, `Skeletonization: A technique for
trimming the fat from a network via relevance assessment', NIPS 1,
pages 107--115, 1989.
Michael C. Mozer, Paul Smolensky, `Using Relevance to Reduce Network
Size Automatically', Connection Science, vol. 1, no. 1, pages 3--16, 1989.
C. M. Bishop, `Curvature-Driven Smoothing in Backpropagation Neural
Networks', INNC-1990-Paris, pages 749--752, 1990.
Chuanyi Ji, Robert R. Snapp, Demetri Psaltis, `Generalizing Smoothness
Constraints from Discrete Samples', Neural Computation, vol. 2, pages
188-197, 1990.
Yann Le Cun, John S. Denker, Sara A. Solla, `Optimal Brain Damage',
NIPS 2, pages 598--605, 1990.
Jocelyn Sietsma, Robert J.F. Dow, `Creating Artificial Neural Networks
That Generalize' Neural Networks, vol 4, pages 67--79, 1991.
Andreas S. Weigend, David E. Rumelhart, Bernardo A. Huberman,
`Generalization by Weight-Elimination with Application to Forecasting',
NIPS 3, pages 875--882, 1991.
Hope this is helpful.
Steven
--
Steven Young PhD Student | Dept of Electrical Engineering
email : young@s1.elec.uq.oz.au | University of Queensland
Murphy was an anarchist! | AUSTRALIA 4072 Ph:61+7 3653564
---------
And John Kruschke posted the following recently:
Date: Mon, 17 May 93 14:08:09 -0500
From: John Kruschke <kruschke@pallas.psych.indiana.edu>
Subject: Re: Pruning units and weights
Newsgroups: comp.ai.neural-nets
Here are a couple more references to papers that describe methods for pruning
unneeded nodes from backprop networks:
Kruschke, J. K. (1989). Distributed bottlenecks for improved generalization
in back-propagation networks. International Journal of Neural Networks
Research and Applications, v.1, pp.187-193.
Kruschke, J. K., & Movellan, J. R. (1991). Benefits of gain: Speeded learning
and minimal hidden layers in back-propagation networks. IEEE Transactions
on Systems, Man and Cybernetics, v.21, pp.273-280.
John K. Kruschke
Dept. of Psychology
Indiana University
Bloomington, IN 47405 USA
--------