- From: saswss@unx.sas.com (Warren Sarle)
- Newsgroups: comp.ai.neural-nets,comp.answers,news.answers
- Subject: comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
- Supersedes: <nn3.posting_1027963394@hotellng.unx.sas.com>
- Followup-To: comp.ai.neural-nets
- Date: 30 Dec 2002 21:23:08 GMT
- Organization: SAS Institute Inc., Cary, NC, USA
- Lines: 2181
- Approved: news-answers-request@MIT.EDU
- Expires: 3 Feb 2003 21:23:07 GMT
- Message-ID: <nn3.posting_1041283387@hotellng.unx.sas.com>
- Reply-To: saswss@unx.sas.com (Warren Sarle)
- Keywords: frequently asked questions, answers
-
- Archive-name: ai-faq/neural-nets/part3
- Last-modified: 2001-05-21
- URL: ftp://ftp.sas.com/pub/neural/FAQ3.html
- Maintainer: saswss@unx.sas.com (Warren S. Sarle)
-
- Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC,
- USA. Answers provided by other authors as cited below are copyrighted by
- those authors, who by submitting the answers for the FAQ give permission for
- the answer to be reproduced as part of the FAQ in any of the ways specified
- in part 1 of the FAQ.
-
- This is part 3 (of 7) of a monthly posting to the Usenet newsgroup
- comp.ai.neural-nets. See part 1 of this posting for full information
- about what it covers.
-
- ========== Questions ==========
- ********************************
-
- Part 1: Introduction
- Part 2: Learning
- Part 3: Generalization
-
- How is generalization possible?
- How does noise affect generalization?
- What is overfitting and how can I avoid it?
- What is jitter? (Training with noise)
- What is early stopping?
- What is weight decay?
- What is Bayesian learning?
- How to combine networks?
- How many hidden layers should I use?
- How many hidden units should I use?
- How can generalization error be estimated?
- What are cross-validation and bootstrapping?
- How to compute prediction and confidence intervals (error bars)?
-
- Part 4: Books, data, etc.
- Part 5: Free software
- Part 6: Commercial software
- Part 7: Hardware and miscellaneous
-
- ------------------------------------------------------------------------
-
- Subject: How is generalization possible?
- =========================================
-
- During learning, the outputs of a supervised neural net come to approximate
- the target values given the inputs in the training set. This ability may be
- useful in itself, but more often the purpose of using a neural net is to
- generalize--i.e., to have the outputs of the net approximate target values
- given inputs that are not in the training set. Generalization is not always
- possible, despite the blithe assertions of some authors. For example,
- Caudill and Butler, 1990, p. 8, claim that "A neural network is able to
- generalize", but they provide no justification for this claim, and they
- completely neglect the complex issues involved in getting good
- generalization. Anyone who reads comp.ai.neural-nets is well aware from the
- numerous posts pleading for help that artificial neural networks do not
- automatically generalize.
-
- Generalization requires prior knowledge, as pointed out by Hume (1739/1978),
- Russell (1948), and Goodman (1954/1983) and rigorously proved by Wolpert
- (1995a, 1996a, 1996b). For any practical application, you have to know what
- the relevant inputs are (you can't simply include every imaginable input).
- You have to know a restricted class of input-output functions that contains
- an adequate approximation to the function you want to learn (you can't use a
- learning method that is capable of fitting every imaginable function). And
- you have to know that the cases you want to generalize to bear some
- resemblance to the training cases. Thus, there are three conditions that are
- typically necessary--although not sufficient--for good generalization:
-
- 1. The first necessary condition is that the inputs to the network contain
- sufficient information pertaining to the target, so that there exists a
- mathematical function relating correct outputs to inputs with the desired
- degree of accuracy. You can't expect a network to learn a nonexistent
- function--neural nets are not clairvoyant! For example, if you want to
- forecast the price of a stock, a historical record of the stock's prices
- is rarely sufficient input; you need detailed information on the
- financial state of the company as well as general economic conditions,
- and to avoid nasty surprises, you should also include inputs that can
- accurately predict wars in the Middle East and earthquakes in Japan.
- Finding good inputs for a net and collecting enough training data often
- take far more time and effort than training the network.
-
- 2. The second necessary condition is that the function you are trying to
- learn (that relates inputs to correct outputs) be, in some sense, smooth.
- In other words, a small change in the inputs should, most of the time,
- produce a small change in the outputs. For continuous inputs and targets,
- smoothness of the function implies continuity and restrictions on the
- first derivative over most of the input space. Some neural nets can learn
- discontinuities as long as the function consists of a finite number of
- continuous pieces. Very nonsmooth functions such as those produced by
- pseudo-random number generators and encryption algorithms cannot be
- generalized by neural nets. Often a nonlinear transformation of the input
- space can increase the smoothness of the function and improve
- generalization.
-
- For classification, if you do not need to estimate posterior
- probabilities, then smoothness is not theoretically necessary. In
- particular, feedforward networks with one hidden layer trained by
- minimizing the error rate (a very tedious training method) are
- universally consistent classifiers if the number of hidden units grows at
- a suitable rate relative to the number of training cases (Devroye,
- Györfi, and Lugosi, 1996). However, you are likely to get better
- generalization with realistic sample sizes if the classification
- boundaries are smoother.
-
- For Boolean functions, the concept of smoothness is more elusive. It
- seems intuitively clear that a Boolean network with a small number of
- hidden units and small weights will compute a "smoother" input-output
- function than a network with many hidden units and large weights. If you
- know a good reference characterizing Boolean functions for which good
- generalization is possible, please inform the FAQ maintainer
- (saswss@unx.sas.com).
-
- 3. The third necessary condition for good generalization is that the
- training cases be a sufficiently large and representative subset
- ("sample" in statistical terminology) of the set of all cases that you
- want to generalize to (the "population" in statistical terminology). The
- importance of this condition is related to the fact that there are,
- loosely speaking, two different types of generalization: interpolation
- and extrapolation. Interpolation applies to cases that are more or less
- surrounded by nearby training cases; everything else is extrapolation. In
- particular, cases that are outside the range of the training data require
- extrapolation. Cases inside large "holes" in the training data may also
- effectively require extrapolation. Interpolation can often be done
- reliably, but extrapolation is notoriously unreliable. Hence it is
- important to have sufficient training data to avoid the need for
- extrapolation. Methods for selecting good training sets are discussed in
- numerous statistical textbooks on sample surveys and experimental design.
-
- Thus, for an input-output function that is smooth, if you have a test case
- that is close to some training cases, the correct output for the test case
- will be close to the correct outputs for those training cases. If you have
- an adequate sample for your training set, every case in the population will
- be close to a sufficient number of training cases. Hence, under these
- conditions and with proper training, a neural net will be able to generalize
- reliably to the population.
-
- If you have more information about the function, e.g. that the outputs
- should be linearly related to the inputs, you can often take advantage of
- this information by placing constraints on the network or by fitting a more
- specific model, such as a linear model, to improve generalization.
- Extrapolation is much more reliable in linear models than in flexible
- nonlinear models, although still not nearly as safe as interpolation. You
- can also use such information to choose the training cases more efficiently.
- For example, with a linear model, you should choose training cases at the
- outer limits of the input space instead of evenly distributing them
- throughout the input space.
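-
- To make the distinction between interpolation and extrapolation concrete,
- here is a minimal numpy sketch (the data and the degree-9 polynomial
- standing in for a flexible nonlinear model are invented for illustration):
-
-    import numpy as np
-
-    rng = np.random.default_rng(0)
-    x_train = rng.uniform(-1.0, 1.0, size=40)        # training inputs
-    y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(40)
-
-    # Fit a flexible nonlinear model (a degree-9 polynomial).
-    coef = np.polyfit(x_train, y_train, deg=9)
-
-    x_inside = np.array([0.0, 0.5])    # surrounded by training cases
-    x_outside = np.array([1.5, 2.0])   # beyond the range of the data
-    for x in np.concatenate([x_inside, x_outside]):
-        pred, true = np.polyval(coef, x), np.sin(3 * x)
-        print(f"x={x:4.1f}  prediction={pred:9.2f}  truth={true:5.2f}")
-
- Inside [-1,1] the predictions are typically close to the truth; outside
- that range the fitted polynomial usually diverges wildly, illustrating why
- extrapolation is unreliable.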
-
- References:
-
- Caudill, M. and Butler, C. (1990). Naturally Intelligent Systems. MIT
- Press: Cambridge, Massachusetts.
-
- Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of
- Pattern Recognition, NY: Springer.
-
- Goodman, N. (1954/1983), Fact, Fiction, and Forecast, 1st/4th ed.,
- Cambridge, MA: Harvard University Press.
-
- Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R. (1986),
- Induction: Processes of Inference, Learning, and Discovery, Cambridge, MA:
- The MIT Press.
-
- Howson, C. and Urbach, P. (1989), Scientific Reasoning: The Bayesian
- Approach, La Salle, IL: Open Court.
-
- Hume, D. (1739/1978), A Treatise of Human Nature, Selby-Bigge, L.A.,
- and Nidditch, P.H. (eds.), Oxford: Oxford University Press.
-
- Plotkin, H. (1993), Darwin Machines and the Nature of Knowledge,
- Cambridge, MA: Harvard University Press.
-
- Russell, B. (1948), Human Knowledge: Its Scope and Limits, London:
- Routledge.
-
- Stone, C.J. (1977), "Consistent nonparametric regression," Annals of
- Statistics, 5, 595-645.
-
- Stone, C.J. (1982), "Optimal global rates of convergence for
- nonparametric regression," Annals of Statistics, 10, 1040-1053.
-
- Vapnik, V.N. (1995), The Nature of Statistical Learning Theory, NY:
- Springer.
-
- Wolpert, D.H. (1995a), "The relationship between PAC, the statistical
- physics framework, the Bayesian framework, and the VC framework," in
- Wolpert (1995b), 117-214.
-
- Wolpert, D.H. (ed.) (1995b), The Mathematics of Generalization: The
- Proceedings of the SFI/CNLS Workshop on Formal Approaches to
- Supervised Learning, Santa Fe Institute Studies in the Sciences of
- Complexity, Volume XX, Reading, MA: Addison-Wesley.
-
- Wolpert, D.H. (1996a), "The lack of a priori distinctions between
- learning algorithms," Neural Computation, 8, 1341-1390.
-
- Wolpert, D.H. (1996b), "The existence of a priori distinctions between
- learning algorithms," Neural Computation, 8, 1391-1420.
-
- ------------------------------------------------------------------------
-
- Subject: How does noise affect generalization?
- ===============================================
-
- "Statistical noise" means variation in the target values that is
- unpredictable from the inputs of a specific network, regardless of the
- architecture or weights. "Physical noise" refers to variation in the target
- values that is inherently unpredictable regardless of what inputs are used.
- Noise in the inputs usually refers to measurement error, so that if the same
- object or example is presented to the network more than once, the input
- values differ.
-
- Noise in the actual data is never a good thing, since it limits the accuracy
- of generalization that can be achieved no matter how extensive the training
- set is. On the other hand, injecting artificial noise (jitter) into the
- inputs during training is one of several ways to improve generalization for
- smooth functions when you have a small training set.
-
- Certain assumptions about noise are necessary for theoretical results.
- Usually, the noise distribution is assumed to have zero mean and finite
- variance. The noise in different cases is usually assumed to be independent
- or to follow some known stochastic model, such as an autoregressive process.
- The more you know about the noise distribution, the more effectively you can
- train the network (e.g., McCullagh and Nelder 1989).
-
- If you have noise in the target values, what the network learns depends
- mainly on the error function. For example, if the noise is independent with
- finite variance for all training cases, a network that is well-trained using
- least squares will produce outputs that approximate the conditional mean of
- the target values (White, 1990; Bishop, 1995). Note that for a binary 0/1
- variable, the mean is equal to the probability of getting a 1. Hence, the
- results in White (1990) immediately imply that for a categorical target with
- independent noise using 1-of-C coding (see "How should categories be
- encoded?"), a network that is well-trained using least squares will produce
- outputs that approximate the posterior probabilities of each class (see
- Rojas, 1996, if you want a simple explanation of why least-squares estimates
- probabilities). Posterior probabilities can also be learned using
- cross-entropy and various other error functions (Finke and Müller, 1994;
- Bishop, 1995). The conditional median can be learned by least-absolute-value
- training (White, 1992a). Conditional modes can be approximated by yet other
- error functions (e.g., Rohwer and van der Rest 1996). For noise
- distributions that cannot be adequately approximated by a single location
- estimate (such as the mean, median, or mode), a network can be trained to
- approximate quantiles (White, 1992a) or mixture components (Bishop, 1995;
- Husmeier, 1999).
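-
- As a small illustration of White's (1990) result for a binary target,
- here is a numpy sketch (the linear model stands in for a well-trained
- network, and the data are invented):
-
-    import numpy as np
-
-    rng = np.random.default_rng(1)
-    n = 100000
-    x = rng.uniform(0.0, 1.0, size=n)
-    p_true = 0.2 + 0.6 * x                  # true P(target=1 | x)
-    y = (rng.uniform(size=n) < p_true).astype(float)  # noisy 0/1 target
-
-    # Least-squares fit of y on (1, x) approximates P(target=1 | x):
-    X = np.column_stack([np.ones(n), x])
-    b = np.linalg.lstsq(X, y, rcond=None)[0]
-    print("estimated intercept and slope:", b)   # close to (0.2, 0.6)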
-
- If you have noise in the target values, the mean squared generalization
- error can never be less than the variance of the noise, no matter how much
- training data you have. But you can estimate the mean of the target values,
- conditional on a given set of input values, to any desired degree of
- accuracy by obtaining a sufficiently large and representative training set,
- assuming that the function you are trying to learn is one that can indeed be
- learned by the type of net you are using, and assuming that the complexity
- of the network is regulated appropriately (White 1990).
-
- Noise in the target values increases the danger of overfitting (Moody 1992).
-
- Noise in the inputs limits the accuracy of generalization, but in a more
- complicated way than does noise in the targets. In a region of the input
- space where the function being learned is fairly flat, input noise will have
- little effect. In regions where that function is steep, input noise can
- degrade generalization severely.
-
- Furthermore, if the target function is Y=f(X), but you observe noisy inputs
- X+D, you cannot obtain an arbitrarily accurate estimate of f(X) given X+D no
- matter how large a training set you use. The net will not learn f(X), but
- will instead learn a convolution of f(X) with the distribution of the noise
- D (see "What is jitter?").
-
- For more details, see one of the statistically-oriented references on neural
- nets such as:
-
- Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
- Oxford University Press, especially section 6.4.
-
- Finke, M., and Müller, K.-R. (1994), "Estimating a-posteriori
- probabilities using stochastic network models," in Mozer, Smolensky,
- Touretzky, Elman, & Weigend, eds., Proceedings of the 1993 Connectionist
- Models Summer School, Hillsdale, NJ: Lawrence Erlbaum Associates, pp.
- 324-331.
-
- Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
- the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
-
- Husmeier, D. (1999), Neural Networks for Conditional Probability
- Estimation: Forecasting Beyond Point Predictions, Berlin: Springer
- Verlag, ISBN 185233095.
-
- McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
- ed., London: Chapman & Hall.
-
- Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
- Generalization and Regularization in Nonlinear Learning Systems", in
- Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural
- Information Processing Systems 4, 847-854.
-
- Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
- Cambridge University Press.
-
- Rohwer, R., and van der Rest, J.C. (1996), "Minimum description length,
- regularization, and multimodal data," Neural Computation, 8, 595-609.
-
- Rojas, R. (1996), "A short proof of the posterior probability property of
- classifier neural networks," Neural Computation, 8, 41-43.
-
- White, H. (1990), "Connectionist Nonparametric Regression: Multilayer
- Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3,
- 535-550. Reprinted in White (1992b).
-
- White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles
- Using Neural Networks," in Page, C. and Le Page, R. (eds.), Proceedings
- of the 23rd Symposium on the Interface: Computing Science and Statistics,
- Alexandria, VA: American Statistical Association, pp. 190-199. Reprinted
- in White (1992b).
-
- White, H. (1992b), Artificial Neural Networks: Approximation and
- Learning Theory, Blackwell.
-
- ------------------------------------------------------------------------
-
- Subject: What is overfitting and how can I avoid it?
- =====================================================
-
- The critical issue in developing a neural network is generalization: how
- well will the network make predictions for cases that are not in the
- training set? NNs, like other flexible nonlinear estimation methods such as
- kernel regression and smoothing splines, can suffer from either underfitting
- or overfitting. A network that is not sufficiently complex can fail to
- detect fully the signal in a complicated data set, leading to underfitting.
- A network that is too complex may fit the noise, not just the signal,
- leading to overfitting. Overfitting is especially dangerous because it can
- easily lead to predictions that are far beyond the range of the training
- data with many of the common types of NNs. Overfitting can also produce wild
- predictions in multilayer perceptrons even with noise-free data.
-
- For an elementary discussion of overfitting, see Smith (1996). For a more
- rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992)
- on the bias/variance trade-off (it's not really a dilemma). We are talking
- about statistical bias here: the difference between the average value of an
- estimator and the correct value. Underfitting produces excessive bias in the
- outputs, whereas overfitting produces excessive variance. There are
- graphical examples of overfitting and underfitting in Sarle (1995, 1999).
-
- The best way to avoid overfitting is to use lots of training data. If you
- have at least 30 times as many training cases as there are weights in the
- network, you are unlikely to suffer from much overfitting, although you may
- get some slight overfitting no matter how large the training set is. For
- noise-free data, 5 times as many training cases as weights may be
- sufficient. But you can't arbitrarily reduce the number of weights for fear
- of underfitting.
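-
- The trade-off can be seen in a toy numpy sketch (invented data; the
- polynomial degree plays the role of network complexity):
-
-    import numpy as np
-
-    rng = np.random.default_rng(2)
-    def make_data(n):
-        x = rng.uniform(-1, 1, n)
-        return x, np.sin(3 * x) + 0.3 * rng.standard_normal(n)
-
-    x_tr, y_tr = make_data(30)        # small training set
-    x_te, y_te = make_data(10000)     # large sample from the population
-
-    for deg in (1, 5, 25):            # too simple, about right, too complex
-        c = np.polyfit(x_tr, y_tr, deg)   # numpy may warn at deg=25
-        tr = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
-        te = np.mean((np.polyval(c, x_te) - y_te) ** 2)
-        print(f"degree {deg:2d}: train MSE {tr:.3f}  test MSE {te:.3f}")
-
- Degree 1 underfits (both errors are high); degree 25 typically overfits
- (training error is very low while test error is high).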
-
- Given a fixed amount of training data, there are at least six approaches to
- avoiding underfitting and overfitting, and hence getting good
- generalization:
-
- o Model selection
- o Jittering
- o Early stopping
- o Weight decay
- o Bayesian learning
- o Combining networks
-
- The first five approaches are based on well-understood theory. Methods for
- combining networks do not have such a sound theoretical basis but are the
- subject of current research. These six approaches are discussed in more
- detail under subsequent questions.
-
- The complexity of a network is related to both the number of weights and the
- size of the weights. Model selection is concerned with the number of
- weights, and hence the number of hidden units and layers. The more weights
- there are, relative to the number of training cases, the more overfitting
- amplifies noise in the targets (Moody 1992). The other approaches listed
- above are concerned, directly or indirectly, with the size of the weights.
- Reducing the size of the weights reduces the "effective" number of
- weights--see Moody (1992) regarding weight decay and Weigend (1994)
- regarding early stopping. Bartlett (1997) obtained learning-theory results
- in which generalization error is related to the L_1 norm of the weights
- instead of the VC dimension.
-
- Overfitting is not confined to NNs with hidden units. Overfitting can occur
- in generalized linear models (networks with no hidden units) if either or
- both of the following conditions hold:
-
- 1. The number of input variables (and hence the number of weights) is large
- with respect to the number of training cases. Typically you would want at
- least 10 times as many training cases as input variables, but with
- noise-free targets, twice as many training cases as input variables would
- be more than adequate. These requirements are smaller than those stated
- above for networks with hidden layers, because hidden layers are prone to
- creating ill-conditioning and other pathologies.
-
- 2. The input variables are highly correlated with each other. This condition
- is called "multicollinearity" in the statistical literature.
- Multicollinearity can cause the weights to become extremely large because
- of numerical ill-conditioning--see "How does ill-conditioning affect NN
- training?"
-
- Methods for dealing with these problems in the statistical literature
- include ridge regression (similar to weight decay), partial least squares
- (similar to early stopping), and various methods with even stranger names,
- such as the lasso and garotte (van Houwelingen and le Cessie, ????).
-
- References:
-
- Bartlett, P.L. (1997), "For valid generalization, the size of the weights
- is more important than the size of the network," in Mozer, M.C., Jordan,
- M.I., and Petsche, T., (eds.) Advances in Neural Information Processing
- Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.
-
- Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
- the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
-
- Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
- Generalization and Regularization in Nonlinear Learning Systems", in
- Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural
- Information Processing Systems 4, 847-854.
-
- Sarle, W.S. (1995), "Stopped Training and Other Remedies for
- Overfitting," Proceedings of the 27th Symposium on the Interface of
- Computing Science and Statistics, 352-360,
- ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
- compressed postscript file, 747K, 10 pages)
-
- Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results,"
- ftp://ftp.sas.com/pub/neural/dojo/dojo.html
-
- Smith, M. (1996). Neural Networks for Statistical Modeling, Boston:
- International Thomson Computer Press, ISBN 1-850-32842-0.
-
- van Houwelingen,H.C., and le Cessie, S. (????), "Shrinkage and penalized
- likelihood as methods to improve predictive accuracy,"
- http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/shrinkage.pdf and
- http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/figures.pdf
-
- Weigend, A. (1994), "On overfitting and the effective number of hidden
- units," Proceedings of the 1993 Connectionist Models Summer School,
- 335-342.
-
- ------------------------------------------------------------------------
-
- Subject: What is jitter? (Training with noise)
- ===============================================
-
- Jitter is artificial noise deliberately added to the inputs during training.
- Training with jitter is a form of smoothing related to kernel regression
- (see "What is GRNN?"). It is also closely related to regularization methods
- such as weight decay and ridge regression.
-
- Training with jitter works because the functions that we want NNs to learn
- are mostly smooth. NNs can learn functions with discontinuities, but the
- functions must be piecewise continuous in a finite number of regions if our
- network is restricted to a finite number of hidden units.
-
- In other words, if we have two cases with similar inputs, the desired
- outputs will usually be similar. That means we can take any training case
- and generate new training cases by adding small amounts of jitter to the
- inputs. As long as the amount of jitter is sufficiently small, we can assume
- that the desired output will not change enough to be of any consequence, so
- we can just use the same target value. The more training cases, the merrier,
- so this looks like a convenient way to improve training. But too much jitter
- will obviously produce garbage, while too little jitter will have little
- effect (Koistinen and Holmström 1992).
-
- Consider any point in the input space, not necessarily one of the original
- training cases. That point could possibly arise as a jittered input as a
- result of jittering any of several of the original neighboring training
- cases. The average target value at the given input point will be a weighted
- average of the target values of the original training cases. For an infinite
- number of jittered cases, the weights will be proportional to the
- probability densities of the jitter distribution, located at the original
- training cases and evaluated at the given input point. Thus the average
- target values given an infinite number of jittered cases will, by
- definition, be the Nadaraya-Watson kernel regression estimator using the
- jitter density as the kernel. Hence, training with jitter is an
- approximation to training with the kernel regression estimator as target.
- Choosing the amount (variance) of jitter is equivalent to choosing the
- bandwidth of the kernel regression estimator (Scott 1992).
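-
- For reference, here is the Nadaraya-Watson estimator in numpy (a sketch;
- the Gaussian kernel stands in for the jitter density):
-
-    import numpy as np
-
-    def nadaraya_watson(x0, x_train, y_train, s):
-        """Weighted average of targets; the weights are the jitter
-        (kernel) density, centered at the training inputs and
-        evaluated at x0. The bandwidth s is the jitter std."""
-        w = np.exp(-0.5 * ((x0 - x_train) / s) ** 2)
-        return np.sum(w * y_train) / np.sum(w)
-
-    x = np.array([0.0, 1.0, 2.0, 3.0])
-    y = np.array([0.0, 1.0, 4.0, 9.0])
-    print(nadaraya_watson(1.5, x, y, s=0.5))  # smooth estimate between cases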
-
- When studying nonlinear models such as feedforward NNs, it is often helpful
- first to consider what happens in linear models, and then to see what
- difference the nonlinearity makes. So let's consider training with jitter in
- a linear model. Notation:
-
- x_ij is the value of the jth input (j=1, ..., p) for the
- ith training case (i=1, ..., n).
- X={x_ij} is an n by p matrix.
- y_i is the target value for the ith training case.
- Y={y_i} is a column vector.
-
- Without jitter, the least-squares weights are B = inv(X'X)X'Y, where
- "inv" indicates a matrix inverse and "'" indicates transposition. Note that
- if we replicate each training case c times, or equivalently stack c copies
- of the X and Y matrices on top of each other, the least-squares weights are
- inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before.
-
- With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ...,
- c, where z_ijk is produced by some random number generator, usually with
- a normal distribution with mean 0 and standard deviation s, and the
- z_ijk's are all independent. In place of the n by p matrix X, this
- gives us a big matrix, say Q, with cn rows and p columns. To compute the
- least-squares weights, we need Q'Q. Let's consider the jth diagonal
- element of Q'Q, which is
-
-    sum_{i,k} (x_ij+z_ijk)^2  =  sum_{i,k} (x_ij^2 + z_ijk^2 + 2 x_ij z_ijk)
-
- which is approximately, for c large,
-
-    c(sum_i x_ij^2 + ns^2)
-
- which is c times the corresponding diagonal element of X'X plus ns^2.
- Now consider the u,vth off-diagonal element of Q'Q, which is
-
-    sum_{i,k} (x_iu+z_iuk)(x_iv+z_ivk)
-
- which is approximately, for c large,
-
-    c(sum_i x_iu x_iv)
-
- which is just c times the corresponding element of X'X. Thus, Q'Q equals
- c(X'X+ns^2I), where I is an identity matrix of appropriate size.
- Similar computations show that the crossproduct of Q with the target values
- is cX'Y. Hence the least-squares weights with jitter of variance s^2 are
- given by
-
-    B(ns^2) = inv(c(X'X+ns^2 I))cX'Y = inv(X'X+ns^2 I)X'Y
-
- In the statistics literature, B(ns^2) is called a ridge regression
- estimator with ridge value ns^2.
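-
- This equivalence is easy to check by simulation; here is a numpy sketch
- (invented data) showing that least squares on many jittered copies of the
- training set approaches the ridge estimator:
-
-    import numpy as np
-
-    rng = np.random.default_rng(3)
-    n, p, s, c = 50, 3, 0.5, 2000
-    X = rng.standard_normal((n, p))
-    Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)
-
-    # Ridge estimator B(ns^2) = inv(X'X + n s^2 I) X'Y:
-    ridge = np.linalg.solve(X.T @ X + n * s**2 * np.eye(p), X.T @ Y)
-
-    # Least squares on c jittered copies of the data:
-    Q = np.tile(X, (c, 1)) + s * rng.standard_normal((c * n, p))
-    Yc = np.tile(Y, c)
-    jitter = np.linalg.lstsq(Q, Yc, rcond=None)[0]
-
-    print("ridge :", ridge)
-    print("jitter:", jitter)   # agrees to a couple of decimals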
-
- If we were to add jitter to the target values Y, the cross-product X'Y
- would not be affected for large c for the same reason that the off-diagonal
- elements of X'X are not affected by jitter. Hence, adding jitter to the
- targets will not change the optimal weights; it will just slow down training
- (An 1996).
-
- The ordinary least squares training criterion is (Y-XB)'(Y-XB).
- Weight decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B,
- where d is the decay rate. Weight decay can also be implemented by
- inventing artificial training cases. Augment the training data with p new
- training cases containing the matrix dI for the inputs and a zero vector
- for the targets. To put this in a formula, let's use A;B to indicate the
- matrix A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D.
- Thus the augmented inputs are X;dI and the augmented targets are Y;0,
- where 0 indicates the zero vector of the appropriate size. The squared error
- for the augmented training data is:
-
- (Y;0-(X;dI)B)'(Y;0-(X;dI)B)
- = (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
- = Y'Y - 2Y'XB + B'(X'X+d^2I)B
- = Y'Y - 2Y'XB + B'X'XB + B'(d^2I)B
- = (Y-XB)'(Y-XB)+d^2B'B
-
- which is the weight-decay training criterion. Thus the weight-decay
- estimator is:
-
- inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2I)X'Y
-
- which is the same as the jitter estimator B(d^2), i.e. jitter with
- variance d^2/n. The equivalence between the weight-decay estimator and
- the jitter estimator does not hold for nonlinear models unless the jitter
- variance is small relative to the curvature of the nonlinear function (An
- 1996). However, the equivalence of the two estimators for linear models
- suggests that they will often produce similar results even for nonlinear
- models. Details for nonlinear models, including classification problems, are
- given in An (1996).
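-
- The augmented-data identity above can also be verified directly; this
- numpy sketch (invented data) confirms that ordinary least squares on the
- augmented data reproduces the ridge/weight-decay estimator exactly:
-
-    import numpy as np
-
-    rng = np.random.default_rng(4)
-    n, p, d = 50, 3, 1.2
-    X = rng.standard_normal((n, p))
-    Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)
-
-    ridge = np.linalg.solve(X.T @ X + d**2 * np.eye(p), X.T @ Y)
-
-    X_aug = np.vstack([X, d * np.eye(p)])        # X;dI
-    Y_aug = np.concatenate([Y, np.zeros(p)])     # Y;0
-    ols_aug = np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]
-
-    print(np.allclose(ridge, ols_aug))           # True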
-
- B(0) is obviously the ordinary least-squares estimator. It can be shown
- that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in
- other words, adding jitter causes the weights to shrink. It can also be
- shown that under the usual statistical assumptions, there always exists some
- value of ns^2 > 0 such that B(ns^2) provides better expected
- generalization than B(0). Unfortunately, there is no way to calculate a
- value of ns^2 from the training data that is guaranteed to improve
- generalization. There are other types of shrinkage estimators called Stein
- estimators that do guarantee better generalization than B(0), but I'm not
- aware of a nonlinear generalization of Stein estimators applicable to neural
- networks.
-
- The statistics literature describes numerous methods for choosing the ridge
- value. The most obvious way is to estimate the generalization error by
- cross-validation, generalized cross-validation, or bootstrapping, and to
- choose the ridge value that yields the smallest such estimate. There are
- also quicker methods based on empirical Bayes estimation, one of which
- yields the following formula, useful as a first guess:
-
-    s_1^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p)B(0)'B(0)]
-
- You can iterate this a few times:
-
-    s_{l+1}^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p)B(s_l^2)'B(s_l^2)]
-
- Note that the more training cases you have, the less noise you need.
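-
- Here is one way to code the iteration (a sketch; the function name and
- the fixed iteration count are mine):
-
-    import numpy as np
-
-    def eb_ridge(X, Y, iters=5):
-        """Iterate the empirical Bayes formula above: the numerator is
-        fixed at the least-squares residual sum of squares, and the
-        denominator uses the current ridge weights B(n s^2)."""
-        n, p = X.shape
-        B = np.linalg.lstsq(X, Y, rcond=None)[0]   # B(0)
-        rss = np.sum((Y - X @ B) ** 2)             # (Y-XB(0))'(Y-XB(0))
-        for _ in range(iters):
-            s2 = p * rss / (n * (n - p) * np.sum(B ** 2))
-            B = np.linalg.solve(X.T @ X + n * s2 * np.eye(p), X.T @ Y)
-        return s2, B
-
-    rng = np.random.default_rng(5)
-    X = rng.standard_normal((40, 4))
-    Y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + rng.standard_normal(40)
-    s2, B = eb_ridge(X, Y)
-    print("estimated jitter variance:", s2)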
-
- References:
-
- An, G. (1996), "The effects of adding noise during backpropagation
- training on a generalization performance," Neural Computation, 8,
- 643-674.
-
- Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
- Oxford University Press.
-
- Holmström, L. and Koistinen, P. (1992) "Using additive noise in
- back-propagation training", IEEE Transaction on Neural Networks, 3,
- 24-38.
-
- Koistinen, P. and Holmström, L. (1992) "Kernel regression and
- backpropagation training with noise," NIPS4, 1033-1039.
-
- Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised
- Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The
- MIT Press, ISBN 0-262-18190-8.
-
- Scott, D.W. (1992) Multivariate Density Estimation, Wiley.
-
- Vinod, H.D. and Ullah, A. (1981) Recent Advances in Regression Methods,
- NY: Marcel-Dekker.
-
- ------------------------------------------------------------------------
-
- Subject: What is early stopping?
- =================================
-
- NN practitioners often use nets with many times as many parameters as
- training cases. E.g., Nelson and Illingworth (1991, p. 165) discuss training
- a network with 16,219 parameters with only 50 training cases! The method
- used is called "early stopping" or "stopped training" and proceeds as
- follows:
-
- 1. Divide the available data into training and validation sets.
- 2. Use a large number of hidden units.
- 3. Use very small random initial values.
- 4. Use a slow learning rate.
- 5. Compute the validation error rate periodically during training.
- 6. Stop training when the validation error rate "starts to go up".
-
- It is crucial to realize that the validation error is not a good estimate
- of the generalization error. One method for getting an unbiased estimate of
- the generalization error is to run the net on a third set of data, the test
- set, that is not used at all during the training process. For other methods,
- see "How can generalization error be estimated?"
-
- Early stopping has several advantages:
-
- o It is fast.
- o It can be applied successfully to networks in which the number of weights
- far exceeds the sample size.
- o It requires only one major decision by the user: what proportion of
- validation cases to use.
-
- But there are several unresolved practical issues in early stopping:
-
- o How many cases do you assign to the training and validation sets? Rules
- of thumb abound, but appear to be no more than folklore. The only
- systematic results known to the FAQ maintainer are in Sarle (1995), which
- deals only with the case of a single input. Amari et al. (1995) attempts
- a theoretical approach but contains serious errors that completely
- invalidate the results, especially the incorrect assumption that the
- direction of approach to the optimum is distributed isotropically.
- o Do you split the data into training and validation sets randomly or by
- some systematic algorithm?
- o How do you tell when the validation error rate "starts to go up"? It may
- go up and down numerous times during training. The safest approach is to
- train to convergence, then go back and see which iteration had the lowest
- validation error. For more elaborate algorithms, see Prechelt (1994,
- 1998).
-
- Statisticians tend to be skeptical of stopped training because it appears
- to be statistically inefficient due to the use of the split-sample
- technique (neither training nor validation makes use of the entire
- sample), and because the usual statistical theory does not apply. However,
- there has been recent progress addressing both of these concerns (Wang
- 1994).
-
- Early stopping is closely related to ridge regression. If the learning rate
- is sufficiently small, the sequence of weight vectors on each iteration will
- approximate the path of continuous steepest descent down the error surface.
- Early stopping chooses a point along this path that optimizes an estimate of
- the generalization error computed from the validation set. Ridge regression
- also defines a path of weight vectors by varying the ridge value. The ridge
- value is often chosen by optimizing an estimate of the generalization error
- computed by cross-validation, generalized cross-validation, or bootstrapping
- (see "What are cross-validation and bootstrapping?") There always exists a
- positive ridge value that will improve the expected generalization error in
- a linear model. A similar result has been obtained for early stopping in
- linear models (Wang, Venkatesh, and Judd 1994). In linear models, the ridge
- path lies close to, but does not coincide with, the path of continuous
- steepest descent; in nonlinear models, the two paths can diverge widely. The
- relationship is explored in more detail by Sjöberg and Ljung (1992).
-
- References:
-
- Amari, S., Murata, N., Müller, K.-R., Finke, M., and Yang, H. (1995),
- "Asymptotic Statistical Theory of Overtraining and Cross-Validation,"
- METR 95-06, Department of Mathematical Engineering and Information
- Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.
-
- Finnoff, W., Hergert, F., and Zimmermann, H.G. (1993), "Improving model
- selection by nonconvergent methods," Neural Networks, 6, 771-783.
-
- Nelson, M.C. and Illingworth, W.T. (1991), A Practical Guide to Neural
- Nets, Reading, MA: Addison-Wesley.
-
- Orr, G.B., and Mueller, K.-R., eds. (1998), Neural Networks: Tricks of
- the Trade, Berlin: Springer, ISBN 3-540-65311-2.
-
- Prechelt, L. (1998), "Early stopping--But when?" in Orr and Mueller
- (1998), 55-69.
-
- Prechelt, L. (1994), "PROBEN1--A set of neural network benchmark problems
- and benchmarking rules," Technical Report 21/94, Universitat Karlsruhe,
- 76128 Karlsruhe, Germany,
- ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.gz.
-
- Sarle, W.S. (1995), "Stopped Training and Other Remedies for
- Overfitting," Proceedings of the 27th Symposium on the Interface of
- Computing Science and Statistics, 352-360,
- ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
- compressed postscript file, 747K, 10 pages)
-
- Sjöberg, J. and Ljung, L. (1992), "Overtraining, Regularization, and
- Searching for Minimum in Neural Networks," Technical Report
- LiTH-ISY-I-1297, Department of Electrical Engineering, Linkoping
- University, S-581 83 Linkoping, Sweden, http://www.control.isy.liu.se .
-
- Wang, C. (1994), A Theory of Generalisation in Learning Machines with
- Neural Network Application, Ph.D. thesis, University of Pennsylvania.
-
- Wang, C., Venkatesh, S.S., and Judd, J.S. (1994), "Optimal Stopping and
- Effective Machine Complexity in Learning," NIPS6, 303-310.
-
- Weigend, A. (1994), "On overfitting and the effective number of hidden
- units," Proceedings of the 1993 Connectionist Models Summer School,
- 335-342.
-
- ------------------------------------------------------------------------
-
- Subject: What is weight decay?
- ===============================
-
- Weight decay adds a penalty term to the error function. The usual penalty is
- the sum of squared weights times a decay constant. In a linear model, this
- form of weight decay is equivalent to ridge regression. See "What is
- jitter?" for more explanation of ridge regression.
-
- Weight decay is a subset of regularization methods. The penalty term in
- weight decay, by definition, penalizes large weights. Other regularization
- methods may involve not only the weights but various derivatives of the
- output function (Bishop 1995).
-
- The weight decay penalty term causes the weights to converge to smaller
- absolute values than they otherwise would. Large weights can hurt
- generalization in two different ways. Excessively large weights leading to
- hidden units can cause the output function to be too rough, possibly with
- near discontinuities. Excessively large weights leading to output units can
- cause wild outputs far beyond the range of the data if the output activation
- function is not bounded to the same range as the data. To put it another
- way, large weights can cause excessive variance of the output (Geman,
- Bienenstock, and Doursat 1992). According to Bartlett (1997), the size (L_1
- norm) of the weights is more important than the number of weights in
- determining generalization.
-
- Other penalty terms besides the sum of squared weights are sometimes used.
- Weight elimination (Weigend, Rumelhart, and Huberman 1991) uses:
-
-    sum_i (w_i)^2 / ((w_i)^2 + c^2)
-
- where w_i is the ith weight and c is a user-specified constant. Whereas
- decay using the sum of squared weights tends to shrink the large
- coefficients more than the small ones, weight elimination tends to shrink
- the small coefficients more, and is therefore more useful for suggesting
- subset models (pruning).
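-
- The difference shows up in the penalties' gradients, which give the
- shrinkage pressure on each weight; a small numpy sketch (constants mine):
-
-    import numpy as np
-
-    w = np.array([0.1, 1.0, 10.0])
-    c = 1.0
-    decay_grad = 2 * w                             # d/dw of w^2
-    elim_grad = 2 * w * c**2 / (w**2 + c**2) ** 2  # d/dw of w^2/(w^2+c^2)
-    for wi, dg, eg in zip(w, decay_grad, elim_grad):
-        print(f"w={wi:5.1f}  decay grad={dg:6.2f}  elimination grad={eg:7.4f}")
-
- For weights much larger than c, the weight-elimination gradient tends to
- zero, so large weights are left nearly alone while small ones are pushed
- toward zero--hence its use for pruning.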
-
- The generalization ability of the network can depend crucially on the decay
- constant, especially with small training sets. One approach to choosing the
- decay constant is to train several networks with different amounts of decay
- and estimate the generalization error for each; then choose the decay
- constant that minimizes the estimated generalization error. Weigend,
- Rumelhart, and Huberman (1991) iteratively update the decay constant during
- training.
-
- There are other important considerations for getting good results from
- weight decay. You must either standardize the inputs and targets, or adjust
- the penalty term for the standard deviations of all the inputs and targets.
- It is usually a good idea to omit the biases from the penalty term.
-
- A fundamental problem with weight decay is that different types of weights
- in the network will usually require different decay constants for good
- generalization. At the very least, you need three different decay constants
- for input-to-hidden, hidden-to-hidden, and hidden-to-output weights.
- Adjusting all these decay constants to produce the best estimated
- generalization error often requires vast amounts of computation.
-
- Fortunately, there is a superior alternative to weight decay: hierarchical
- Bayesian learning. Bayesian learning makes it possible to estimate
- efficiently numerous decay constants.
-
- References:
-
- Bartlett, P.L. (1997), "For valid generalization, the size of the weights
- is more important than the size of the network," in Mozer, M.C., Jordan,
- M.I., and Petsche, T., (eds.) Advances in Neural Information Processing
- Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.
-
- Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
- Oxford University Press.
-
- Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
- the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
-
- Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
- Cambridge University Press.
-
- Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991).
- Generalization by weight-elimination with application to forecasting. In:
- R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.), Advances in Neural
- Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann.
-
- ------------------------------------------------------------------------
-
- Subject: What is Bayesian Learning?
- ===================================
-
- By Radford Neal.
-
- Conventional training methods for multilayer perceptrons ("backprop" nets)
- can be interpreted in statistical terms as variations on maximum likelihood
- estimation. The idea is to find a single set of weights for the network that
- maximize the fit to the training data, perhaps modified by some sort of
- weight penalty to prevent overfitting.
-
- The Bayesian school of statistics is based on a different view of what it
- means to learn from data, in which probability is used to represent
- uncertainty about the relationship being learned (a use that is shunned in
- conventional--i.e., frequentist--statistics). Before we have seen any data,
- our prior opinions about what the true relationship might be can be
- expressed in a probability distribution over the network weights that
- define this relationship. After we look at the data (or after our program
- looks at the data), our revised opinions are captured by a posterior
- distribution over network weights. Network weights that seemed plausible
- before, but which don't match the data very well, will now be seen as being
- much less likely, while the probability for values of the weights that do
- fit the data well will have increased.
-
- Typically, the purpose of training is to make predictions for future cases
- in which only the inputs to the network are known. The result of
- conventional network training is a single set of weights that can be used to
- make such predictions. In contrast, the result of Bayesian training is a
- posterior distribution over network weights. If the inputs of the network
- are set to the values for some new case, the posterior distribution over
- network weights will give rise to a distribution over the outputs of the
- network, which is known as the predictive distribution for this new case. If
- a single-valued prediction is needed, one might use the mean of the
- predictive distribution, but the full predictive distribution also tells you
- how uncertain this prediction is.
-
- Why bother with all this? The hope is that Bayesian methods will provide
- solutions to such fundamental problems as:
-
- o How to judge the uncertainty of predictions. This can be solved by
- looking at the predictive distribution, as described above.
- o How to choose an appropriate network architecture (e.g., the number of
- hidden layers, the number of hidden units in each layer).
- o How to adapt to the characteristics of the data (e.g., the smoothness of
- the function, the degree to which different inputs are relevant).
-
- Good solutions to these problems, especially the last two, depend on using
- the right prior distribution, one that properly represents the uncertainty
- that you probably have about which inputs are relevant, how smooth the
- function is, how much noise there is in the observations, etc. Such
- carefully vague prior distributions are usually defined in a hierarchical
- fashion, using hyperparameters, some of which are analogous to the weight
- decay constants of more conventional training procedures. The use of
- hyperparameters is discussed by MacKay (1992a, 1992b, 1995) and Neal (1993a,
- 1996), who in particular use an "Automatic Relevance Determination" scheme
- that aims to allow many possibly-relevant inputs to be included without
- damaging effects.
-
- Selection of an appropriate network architecture is another place where
- prior knowledge plays a role. One approach is to use a very general
- architecture, with lots of hidden units, maybe in several layers or groups,
- controlled using hyperparameters. This approach is emphasized by Neal
- (1996), who argues that there is no statistical need to limit the complexity
- of the network architecture when using well-designed Bayesian methods. It is
- also possible to choose between architectures in a Bayesian fashion, using
- the "evidence" for an architecture, as discussed by Mackay (1992a, 1992b).
-
- Implementing all this is one of the biggest problems with Bayesian methods.
- Dealing with a distribution over weights (and perhaps hyperparameters) is
- not as simple as finding a single "best" value for the weights. Exact
- analytical methods for models as complex as neural networks are out of the
- question. Two approaches have been tried:
-
- 1. Find the weights/hyperparameters that are most probable, using methods
- similar to conventional training (with regularization), and then
- approximate the distribution over weights using information available at
- this maximum.
- 2. Use a Monte Carlo method to sample from the distribution over weights.
- The most efficient implementations of this use dynamical Monte Carlo
- methods whose operation resembles that of backprop with momentum.
-
- The first method comes in two flavours. Buntine and Weigend (1991) describe
- a procedure in which the hyperparameters are first integrated out
- analytically, and numerical methods are then used to find the most probable
- weights. MacKay (1992a, 1992b) instead finds the values for the
- hyperparameters that are most likely, integrating over the weights (using an
- approximation around the most probable weights, conditional on the
- hyperparameter values). There has been some controversy regarding the merits
- of these two procedures, with Wolpert (1993) claiming that analytically
- integrating over the hyperparameters is preferable because it is "exact".
- This criticism has been rebutted by MacKay (1993). It would be inappropriate
- to get into the details of this controversy here, but it is important to
- realize that the procedures based on analytical integration over the
- hyperparameters do not provide exact solutions to any of the problems of
- practical interest. The discussion of an analogous situation in a different
- statistical context by O'Hagan (1985) may be illuminating.
-
- Monte Carlo methods for Bayesian neural networks have been developed by Neal
- (1993a, 1996). In this approach, the posterior distribution is represented
- by a sample of perhaps a few dozen sets of network weights. The sample is
- obtained by simulating a Markov chain whose equilibrium distribution is the
- posterior distribution for weights and hyperparameters. This technique is
- known as "Markov chain Monte Carlo (MCMC)"; see Neal (1993b) for a review.
- The method is exact in the limit as the size of the sample and the length of
- time for which the Markov chain is run increase, but convergence can
- sometimes be slow in practice, as for any network training method.
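-
- To convey the flavor of the approach, here is a toy random-walk
- Metropolis sketch for a one-hidden-unit tanh "network" (all details are
- invented for illustration; practical implementations use the more
- efficient dynamical methods mentioned above):
-
-    import numpy as np
-
-    rng = np.random.default_rng(7)
-    x = rng.uniform(-2, 2, 30)
-    y = np.tanh(1.5 * x) + 0.1 * rng.standard_normal(30)
-
-    def log_posterior(w, noise_var=0.01, prior_var=4.0):
-        pred = w[0] * np.tanh(w[1] * x)          # tiny "network"
-        log_lik = -0.5 * np.sum((y - pred) ** 2) / noise_var
-        log_prior = -0.5 * np.sum(w ** 2) / prior_var
-        return log_lik + log_prior
-
-    w, samples = np.zeros(2), []
-    for step in range(20000):
-        w_new = w + 0.05 * rng.standard_normal(2)   # random-walk proposal
-        if np.log(rng.uniform()) < log_posterior(w_new) - log_posterior(w):
-            w = w_new                               # accept
-        if step > 10000 and step % 100 == 0:        # thin after burn-in
-            samples.append(w.copy())
-
-    # Predictive mean and uncertainty at a new input, from the sample:
-    samples = np.array(samples)
-    preds = samples[:, 0] * np.tanh(samples[:, 1] * 0.5)
-    print("predictive mean:", preds.mean(), " std:", preds.std())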
-
- Work on Bayesian neural network learning has so far concentrated on
- multilayer perceptron networks, but Bayesian methods can in principle be
- applied to other network models, as long as they can be interpreted in
- statistical terms. For some models (e.g., RBF networks), this should be a
- fairly simple matter; for others (e.g., Boltzmann Machines), substantial
- computational problems would need to be solved.
-
- Software implementing Bayesian neural network models (intended for research
- use) is available from the home pages of David MacKay and Radford Neal.
-
- There are many books that discuss the general concepts of Bayesian
- inference, though they mostly deal with models that are simpler than neural
- networks. Here are some recent ones:
-
- Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian Theory, New York:
- John Wiley.
-
- Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian
- Data Analysis, London: Chapman & Hall, ISBN 0-412-03991-5.
-
- O'Hagan, A. (1994) Bayesian Inference (Volume 2B in Kendall's Advanced
- Theory of Statistics), ISBN 0-340-52922-9.
-
- Robert, C. P. (1995) The Bayesian Choice, New York: Springer-Verlag.
-
- The following books and papers have tutorial material on Bayesian learning
- as applied to neural network models:
-
- Bishop, C. M. (1995) Neural Networks for Pattern Recognition, Oxford:
- Oxford University Press.
-
- Lee, H.K.H (1999), Model Selection and Model Averaging for Neural
- Networks, Doctoral dissertation, Carnegie Mellon University,
- Pittsburgh, USA, http://lib.stat.cmu.edu/~herbie/thesis.html
-
- MacKay, D. J. C. (1995) "Probable networks and plausible predictions - a
- review of practical Bayesian methods for supervised neural networks",
- available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/network.ps.gz.
-
- Mueller, P. and Insua, D.R. (1995) "Issues in Bayesian Analysis of Neural
- Network Models," Neural Computation, 10, 571-592, (also Institute of
- Statistics and Decision Sciences Working Paper 95-31),
- ftp://ftp.isds.duke.edu/pub/WorkingPapers/95-31.ps
-
- Neal, R. M. (1996) Bayesian Learning for Neural Networks, New York:
- Springer-Verlag, ISBN 0-387-94724-8.
-
- Ripley, B. D. (1996) Pattern Recognition and Neural Networks,
- Cambridge: Cambridge University Press.
-
- Thodberg, H. H. (1996) "A review of Bayesian neural networks with an
- application to near infrared spectroscopy", IEEE Transactions on Neural
- Networks, 7, 56-72.
-
- Some other references:
-
- Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds.,
- (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V.
- (North-Holland).
-
- Buntine, W. L. and Weigend, A. S. (1991) "Bayesian back-propagation",
- Complex Systems, 5, 603-643.
-
- MacKay, D. J. C. (1992a) "Bayesian interpolation", Neural Computation,
- 4, 415-447.
-
- MacKay, D. J. C. (1992b) "A practical Bayesian framework for
- backpropagation networks," Neural Computation, 4, 448-472.
-
- MacKay, D. J. C. (1993) "Hyperparameters: Optimize or Integrate Out?",
- available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/alpha.ps.gz.
-
- Neal, R. M. (1993a) "Bayesian learning via stochastic dynamics", in C. L.
- Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural
- Information Processing Systems 5, San Mateo, California: Morgan
- Kaufmann, 475-482.
-
- Neal, R. M. (1993b) Probabilistic Inference Using Markov Chain Monte
- Carlo Methods, available at
- ftp://ftp.cs.utoronto.ca/pub/radford/review.ps.Z.
-
- O'Hagan, A. (1985) "Shoulders in hierarchical models", in J. M. Bernardo,
- M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (editors), Bayesian
- Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (North-Holland),
- 697-710.
-
- Sarle, W. S. (1995) "Stopped Training and Other Remedies for
- Overfitting," Proceedings of the 27th Symposium on the Interface of
- Computing Science and Statistics, 352-360,
- ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
- compressed postscript file, 747K, 10 pages)
-
- Wolpert, D. H. (1993) "On the use of evidence in neural networks", in C.
- L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural
- Information Processing Systems 5, San Mateo, California: Morgan
- Kaufmann, 539-546.
-
- Finally, David MacKay maintains a FAQ about Bayesian methods for neural
- networks, at http://wol.ra.phy.cam.ac.uk/mackay/Bayes_FAQ.html .
-
- Comments on Bayesian learning
- +++++++++++++++++++++++++++++
-
- By Warren Sarle.
-
- Bayesian purists may argue over the proper way to do a Bayesian analysis,
- but even the crudest Bayesian computation (maximizing over both parameters
- and hyperparameters) is shown by Sarle (1995) to generalize better than
- early stopping when learning nonlinear functions. This approach requires the
- use of slightly informative hyperpriors and at least twice as many training
- cases as weights in the network. A full Bayesian analysis by MCMC can be
- expected to work even better under even broader conditions. Bayesian
- learning works well by frequentist standards--what MacKay calls the
- "evidence framework" is used by frequentist statisticians under the name
- "empirical Bayes." Although considerable research remains to be done,
- Bayesian learning seems to be the most promising approach to training neural
- networks.
-
- Bayesian learning should not be confused with the "Bayes classifier." In the
- latter, the distribution of the inputs given the target class is assumed to
- be known exactly, and the prior probabilities of the classes are assumed
- known, so that the posterior probabilities can be computed by a
- (theoretically) simple application of Bayes' theorem. The Bayes classifier
- involves no learning--you must already know everything that needs to be
- known! The Bayes classifier is a gold standard that can almost never be used
- in real life but is useful in theoretical work and in simulation studies
- that compare classification methods. The term "Bayes rule" is also used to
- mean any classification rule that gives results identical to those of a
- Bayes classifier.
-
- Bayesian learning also should not be confused with the "naive" or "idiot's"
- Bayes classifier (Warner et al. 1961; Ripley, 1996), which assumes that the
- inputs are conditionally independent given the target class. The naive Bayes
- classifier is usually applied with categorical inputs, and the distribution
- of each input is estimated by the proportions in the training set; hence the
- naive Bayes classifier is a frequentist method.
-
- The term "Bayesian network" often refers not to a neural network but to a
- belief network (also called a causal net, influence diagram, constraint
- network, qualitative Markov network, or gallery). Belief networks are more
- closely related to expert systems than to neural networks, and do not
- necessarily involve learning (Pearl, 1988; Ripley, 1996). Here are some URLs
- on Bayesian belief networks:
-
- o http://bayes.stat.washington.edu/almond/belief.html
- o http://www.cs.orst.edu/~dambrosi/bayesian/frame.html
- o http://www2.sis.pitt.edu/~genie
- o http://www.research.microsoft.com/dtg/msbn
-
- References for comments:
-
- Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks
- of Plausible Inference, San Mateo, CA: Morgan Kaufmann.
-
- Ripley, B. D. (1996) Pattern Recognition and Neural Networks,
- Cambridge: Cambridge University Press.
-
- Warner, H.R., Toronto, A.F., Veasy, L.R., and Stephenson, R. (1961), "A
- mathematical model for medical diagnosis--application to congenital heart
- disease," J. of the American Medical Association, 177, 177-184.
-
- ------------------------------------------------------------------------
-
- Subject: How to combine networks?
- ==================================
-
- Methods for combining networks are a subject of active research. Many
- different methods with different purposes have been proposed. The properties
- and relationships of these methods are just beginning to be understood. Some
- methods, such as boosting, are remedies for underfitting. Other methods,
- such as bagging, are mainly remedies for overfitting or instability.
- Bayesian learning naturally leads to model averaging (Hoeting et al., 1999).
- A good general reference is the book edited by Sharkey (1999), especially
- the article by Drucker (1999). Regarding the effects of bagging and weight
- decay used together, see Taniguchi and Tresp (1997).
-
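- To make the flavor of these methods concrete, here is a rough sketch of
- bagging in Python (not from the cited references; train() and predict()
- are hypothetical stand-ins for whatever fitting and prediction routines
- you use):
-
-       import random
-
-       def bag(train, predict, data, x_new, n_bags=25):
-           """Train one model per bootstrap resample of the training
-           data and average the predictions (bootstrap aggregation)."""
-           outputs = []
-           for _ in range(n_bags):
-               resample = [random.choice(data) for _ in data]
-               outputs.append(predict(train(resample), x_new))
-           return sum(outputs) / len(outputs)
-
- Averaging across resamples mainly reduces the variance of an unstable
- estimator, which is why bagging acts as a remedy for overfitting and
- instability.
-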
- Here is a list of terms used for various methods of combining models, mostly
- taken from Christoph M. Friedrich's web page (see below):
-
- o Adaboost
- o ADDEMUP
- o arcing: adaptive recombination of classifiers
- o bagging: bootstrap aggregation
- o bag-stacking: bagging plus stacking
- o boosting
- o cascading
- o combination of classifiers
- o committees of networks
- o consensus theory
- o cragging: cross aggregation (like k-fold cross validation)
- o dagging: disjoint-sample aggregation
- o dag-stacking: dagging plus stacking
- o divide and conquer classifiers
- o ensembles
- o hagging: half-sample aggregation
- o mixture of experts
- o multiple classifier systems
- o multi-stage and multi-level classifiers
- o OLC: optimal linear combination
- o pandemonium of reflective agents
- o sieving algorithms
- o stacking: feeding outputs of several models (and possibly the original
- inputs) into a second-level model (see the sketch after this list)
- o voting
-
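- A similarly rough sketch of stacking, using the same hypothetical
- train()/predict() conventions as the bagging sketch above:
-
-       def stack(train, predict, models, data):
-           """Feed the outputs of several trained models into a
-           second-level model (a careful implementation would use
-           cross-validated outputs to build the second-level data)."""
-           meta_data = [([predict(m, x) for m in models], y)
-                        for x, y in data]
-           meta_model = train(meta_data)   # train the second-level model
-           return lambda x_new: predict(meta_model,
-                                        [predict(m, x_new) for m in models])
-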
- URLs:
-
- o Christoph M. Friedrich's web page, "Combinations of Classifiers and
- Regressors Bibliography and Guide to Internet Resources" at
- http://www.tussy.uni-wh.de/~chris/ensemble/ensemble.html
-
- o Tirthankar RayChaudhuri's web page on combining estimators at
- http://www-comp.mpce.mq.edu.au/~tirthank/combest.html
-
- o Robert E. Schapire's boosting page at
- http://www.research.att.com/~schapire/boost.html
-
- o http://www.boosting.org/
-
- References:
-
- Clemen, Robert T. (1989), "Combining forecasts: A review and annotated
- bibliography", International Journal of Forecasting, Vol 5, pp 559-584.
-
- Drucker, H. (1999), "Boosting using neural networks," in Sharkey (1999),
- pp. 51-78.
-
- Hoeting, J. A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999)
- "Bayesian Model Averaging: A Tutorial (with discussion)," Statistical
- Science, 14:4, 382-417. Corrected version available at
- http://www.stat.washington.edu/www/research/online/hoeting1999.pdf
-
- Sharkey, A.J.C. (1999), Combining Artificial Neural Nets: Ensemble and
- Modular Multi-Net Systems, London: Springer.
-
- Taniguchi, M., and Tresp, V. (1997), "Averaging regularized estimators,"
- Neural Computation, 9, 1163-1178.
-
- ------------------------------------------------------------------------
-
- Subject: How many hidden layers should I use?
- ==============================================
-
- You may not need any hidden layers at all. Linear and generalized linear
- models are useful in a wide variety of applications (McCullagh and Nelder
- 1989). And even if the function you want to learn is mildly nonlinear, you
- may get better generalization with a simple linear model than with a
- complicated nonlinear model if there is too little data or too much noise to
- estimate the nonlinearities accurately.
-
- In MLPs with step/threshold/Heaviside activation functions, you need two
- hidden layers for full generality (Sontag 1992). For further discussion, see
- Bishop (1995, 121-126).
-
- In MLPs with any of a wide variety of continuous nonlinear hidden-layer
- activation functions, one hidden layer with an arbitrarily large number of
- units suffices for the "universal approximation" property (e.g., Hornik,
- Stinchcombe and White 1989; Hornik 1993; for more references, see Bishop
- 1995, 130, and Ripley, 1996, 173-180). But there is no theory yet to tell
- you how many hidden units are needed to approximate any given function.
-
- If you have only one input, there seems to be no advantage to using more
- than one hidden layer. But things get much more complicated when there are
- two or more inputs. To illustrate, examples with two inputs and one output
- will be used so that the results can be shown graphically. In each example
- there are 441 training cases on a regular 21-by-21 grid. The test sets have
- 1681 cases on a regular 41-by-41 grid over the same domain as the training
- set. If you are reading the HTML version of this document via a web browser,
- you can see surface plots based on the test set by clicking on the file
- names mentioned in the following text. Each plot is a gif file, approximately
- 9K in size.
-
- Consider a target function of two inputs, consisting of a Gaussian hill in
- the middle of a plane (hill.gif). An MLP with an identity output activation
- function can easily fit the hill by surrounding it with a few sigmoid
- (logistic, tanh, arctan, etc.) hidden units, but there will be spurious
- ridges and valleys where the plane should be flat (h_mlp_6.gif). It takes
- dozens of hidden units to flatten out the plane accurately (h_mlp_30.gif).
-
- Now suppose you use a logistic output activation function. As the input to a
- logistic function goes to negative infinity, the output approaches zero. The
- plane in the Gaussian target function also has a value of zero. If the
- weights and bias for the output layer yield large negative values outside
- the base of the hill, the logistic function will flatten out any spurious
- ridges and valleys. So fitting the flat part of the target function is easy
- (h_mlpt_3_unsq.gif and h_mlpt_3.gif). But the logistic function also tends
- to lower the top of the hill.
-
- If the target function were not a rounded hill but a mesa with a large,
- flat top with a value of one, the logistic output activation function would
- be able to smooth out the top of the mesa just as it smooths out the plane
- below. Target functions like this, with large flat areas taking values of
- either zero or one, are just what you have in many noise-free classification
- problems. In such cases, a single hidden layer is likely to work well.
-
- When using a logistic output activation function, it is common practice to
- scale the target values to a range of .1 to .9. Such scaling is bad in a
- noise-free classification problem, because it prevents the logistic function
- from smoothing out the flat areas (h_mlpt1-9_3.gif).
-
- For the Gaussian target function, [.1,.9] scaling would make it easier to
- fit the top of the hill, but would reintroduce undulations in the plane. It
- would be better for the Gaussian target function to scale the target values
- to a range of 0 to .9. But for a more realistic and complicated target
- function, how would you know the best way to scale the target values?
-
- By introducing a second hidden layer containing a single sigmoid unit, and
- returning to an identity output activation function, you can let the net
- figure out the best scaling (h_mlp1_3.gif). Actually, the bias and weight
- for the output layer scale the output rather than the target values, and you
- can use whatever range of target values is convenient.
-
- For more complicated target functions, especially those with several hills
- or valleys, it is useful to have several units in the second hidden layer.
- Each unit in the second hidden layer enables the net to fit a separate hill
- or valley. So an MLP with two hidden layers can often yield an accurate
- approximation with fewer weights than an MLP with one hidden layer
- (Chester, 1990).
-
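- For readers who prefer code, here is a sketch in numpy (purely
- illustrative; this is not the program that produced the plots above) of
- the forward pass of an MLP with two inputs, a tanh first hidden layer, a
- tanh second hidden layer, and an identity output:
-
-       import numpy as np
-
-       def mlp2_forward(x, W1, b1, W2, b2, w3, b3):
-           """Two-input MLP: tanh first hidden layer, tanh second
-           hidden layer, identity output unit."""
-           h1 = np.tanh(W1 @ x + b1)    # first layer builds sigmoid ridges
-           h2 = np.tanh(W2 @ h1 + b2)   # each unit here can form one hill/valley
-           return w3 @ h2 + b3          # identity output scales freely
-
-       # Example shapes: 2 inputs, 4 first-layer units, 2 second-layer units.
-       rng = np.random.default_rng(0)
-       W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
-       W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
-       w3, b3 = rng.normal(size=2), rng.normal()
-       print(mlp2_forward(np.array([0.5, -1.0]), W1, b1, W2, b2, w3, b3))
-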
- To illustrate the use of multiple units in the second hidden layer, the next
- example resembles a landscape with a Gaussian hill and a Gaussian valley,
- both elliptical (hillanvale.gif). The table below gives the RMSE (root mean
- squared error) for the test set with various architectures. If you are
- reading the HTML version of this document via a web browser, click on any
- number in the table to see a surface plot of the corresponding network
- output.
-
- The MLP networks in the table have one or two hidden layers with a tanh
- activation function. The output activation function is the identity. Using a
- squashing function on the output layer is of no benefit for this function,
- since the only flat area in the function has a target value near the middle
- of the target range.
-
-               Hill and Valley Data: RMSE for the Test Set
-                  (Number of weights in parentheses)
-                             MLP Networks
-
-    HUs in                    HUs in Second Layer
-    First   ----------------------------------------------------------
-    Layer       0           1           2           3           4
-       1   0.204(  5)  0.204(  7)  0.189( 10)  0.187( 13)  0.185( 16)
-       2   0.183(  9)  0.163( 11)  0.147( 15)  0.094( 19)  0.096( 23)
-       3   0.159( 13)  0.095( 15)  0.054( 20)  0.033( 25)  0.045( 30)
-       4   0.137( 17)  0.093( 19)  0.009( 25)  0.021( 31)  0.016( 37)
-       5   0.121( 21)  0.092( 23)              0.010( 37)  0.011( 44)
-       6   0.100( 25)  0.092( 27)              0.007( 43)  0.005( 51)
-       7   0.086( 29)  0.077( 31)
-       8   0.079( 33)  0.062( 35)
-       9   0.072( 37)  0.055( 39)
-      10   0.059( 41)  0.047( 43)
-      12   0.047( 49)  0.042( 51)
-      15   0.039( 61)  0.032( 63)
-      20   0.025( 81)  0.018( 83)
-      25   0.021(101)  0.016(103)
-      30   0.018(121)  0.015(123)
-      40   0.012(161)  0.015(163)
-      50   0.008(201)  0.014(203)
-
- For an MLP with only one hidden layer (column 0), Gaussian hills and valleys
- require a large number of hidden units to approximate well. When there is
- one unit in the second hidden layer, the network can represent one hill or
- valley easily, which is what happens with three to six units in the first
- hidden layer. But having only one unit in the second hidden layer is of
- little benefit for learning two hills or valleys. Using two units in the
- second hidden layer enables the network to approximate two hills or valleys
- easily; in this example, only four units are required in the first hidden
- layer to get an excellent fit. Each additional unit in the second hidden
- layer enables the network to learn another hill or valley with a relatively
- small number of units in the first hidden layer, as explained by Chester
- (1990). In this example, having three or four units in the second hidden
- layer helps little, and actually produces a worse approximation when there
- are four units in the first hidden layer due to problems with local minima.
-
- Unfortunately, using two hidden layers exacerbates the problem of local
- minima, and it is important to use lots of random initializations or other
- methods for global optimization. Local minima with two hidden layers can
- have extreme spikes or blades even when the number of weights is much
- smaller than the number of training cases. One of the few advantages of
- standard backprop is that it is so slow that spikes and blades will not
- become very sharp for practical training times.
-
- More than two hidden layers can be useful in certain architectures such as
- cascade correlation (Fahlman and Lebiere 1990) and in special applications,
- such as the two-spirals problem (Lang and Witbrock 1988) and ZIP code
- recognition (Le Cun et al. 1989).
-
- RBF networks are most often used with a single hidden layer. But an extra,
- linear hidden layer before the radial hidden layer enables the network to
- ignore irrelevant inputs (see "How do MLPs compare with RBFs?"). The linear
- hidden layer allows the RBFs to take elliptical, rather than radial
- (circular), shapes in the space of the inputs. Hence the linear layer gives
- you an elliptical basis function (EBF) network. In the hill and valley
- example, an ORBFUN network requires nine hidden units (37 weights) to get
- the test RMSE below .01, but by adding a linear hidden layer, you can get an
- essentially perfect fit with three linear units followed by two radial units
- (20 weights).
-
- References:
-
- Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
- Oxford University Press.
-
- Chester, D.L. (1990), "Why Two Hidden Layers are Better than One,"
- IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268.
-
- Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation Learning
- Architecture," NIPS2, 524-532,
- ftp://archive.cis.ohio-state.edu/pub/neuroprose/fahlman.cascor-tr.ps.Z.
-
- Hornik, K., Stinchcombe, M. and White, H. (1989), "Multilayer feedforward
- networks are universal approximators," Neural Networks, 2, 359-366.
-
- Hornik, K. (1993), "Some new results on neural network approximation,"
- Neural Networks, 6, 1069-1072.
-
- Lang, K.J. and Witbrock, M.J. (1988), "Learning to tell two spirals
- apart," in Touretzky, D., Hinton, G., and Sejnowski, T., eds.,
- Proceedings of the 1988 Connectionist Models Summer School, San Mateo,
- CA: Morgan Kaufmann.
-
- Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E.,
- Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to
- handwritten ZIP code recognition", Neural Computation, 1, 541-551.
-
- McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
- ed., London: Chapman & Hall.
-
- Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
- Cambridge University Press.
-
- Sontag, E.D. (1992), "Feedback stabilization using two-hidden-layer
- nets", IEEE Transactions on Neural Networks, 3, 981-990.
-
- ------------------------------------------------------------------------
-
- Subject: How many hidden units should I use?
- =============================================
-
- The best number of hidden units depends in a complex way on:
-
- o the numbers of input and output units
- o the number of training cases
- o the amount of noise in the targets
- o the complexity of the function or classification to be learned
- o the architecture
- o the type of hidden unit activation function
- o the training algorithm
- o regularization
-
- In most situations, there is no way to determine the best number of hidden
- units without training several networks and estimating the generalization
- error of each. If you have too few hidden units, you will get high training
- error and high generalization error due to underfitting and high statistical
- bias. If you have too many hidden units, you may get low training error but
- still have high generalization error due to overfitting and high variance.
- Geman, Bienenstock, and Doursat (1992) discuss how the number of hidden
- units affects the bias/variance trade-off.
-
- Some books and articles offer "rules of thumb" for choosing an architecture;
- for example:
-
- o "A rule of thumb is for the size of this [hidden] layer to be somewhere
- between the input layer size ... and the output layer size ..." (Blum,
- 1992, p. 60).
- o "To calculate the number of hidden nodes we use a general rule of:
- (Number of inputs + outputs) * (2/3)" (from the FAQ for a commercial
- neural network software company).
- o "you will never require more than twice the number of hidden units as you
- have inputs" in an MLP with one hidden layer (Swingler, 1996, p. 53). See
- the section in Part 4 of the FAQ on The Worst books for the source of
- this myth.)
- o "How large should the hidden layer be? One rule of thumb is that it
- should never be more than twice as large as the input layer." (Berry and
- Linoff, 1997, p. 323).
- o "Typically, we specify as many hidden nodes as dimensions [principal
- components] needed to capture 70-90% of the variance of the input data
- set." (Boger and Guterman, 1997)
-
- These rules of thumb are nonsense because they ignore the number of training
- cases, the amount of noise in the targets, and the complexity of the
- function. Even if you restrict consideration to minimizing training error on
- data with lots of training cases and no noise, it is easy to construct
- counterexamples that disprove these rules of thumb. For example:
-
- o There are 100 Boolean inputs and 100 Boolean targets. Each target is a
- conjunction of some subset of inputs. No hidden units are needed.
-
- o There are two continuous inputs X and Y which take values uniformly
- distributed on a square [0,8] by [0,8]. Think of the input space as a
- chessboard, and number the squares 1 to 64. The categorical target
- variable C is the square number, so there are 64 output units. For
- example, you could generate the data as follows (this is the SAS
- programming language, but it should be easy to translate into any other
- language):
-
-       data chess;
-          /* generate a grid of (x,y) points labeled by square number */
-          step = 1/4;
-          do x = step/2 to 8-step/2 by step;
-             do y = step/2 to 8-step/2 by step;
-                c = 8*floor(x) + floor(y) + 1;   /* square number, 1 to 64 */
-                output;
-             end;
-          end;
-       run;
-
- No hidden units are needed.
-
- o The classic two-spirals problem has two continuous inputs and a Boolean
- classification target. The data can be generated as follows:
-
-       data spirals;
-          pi = arcos(-1);               /* arcos(-1) = pi */
-          do i = 0 to 96;
-             angle = i*pi/16.0;
-             radius = 6.5*(104-i)/104;  /* radius shrinks as the spiral winds */
-             x = radius*cos(angle);
-             y = radius*sin(angle);
-             c = 1;
-             output;
-             x = -x;                    /* reflect through the origin to get */
-             y = -y;                    /* the second spiral                 */
-             c = 0;
-             output;
-          end;
-       run;
-
- With one hidden layer, about 50 tanh hidden units are needed. Many NN
- programs may actually need closer to 100 hidden units to get zero
- training error.
-
- o There is one continuous input X that takes values on [0,100]. There is
- one continuous target Y = sin(X). Getting a good approximation to Y
- requires about 20 to 25 tanh hidden units. Of course, 1 sine hidden unit
- would do the job.
-
- Some rules of thumb relate the total number of trainable weights in the
- network to the number of training cases. A typical recommendation is that
- the number of weights should be no more than 1/30 of the number of training
- cases. Such rules are only concerned with overfitting and are at best crude
- approximations. Also, these rules do not apply when regularization is used.
- It is true that without regularization, if the number of training cases is
- much larger (but no one knows exactly how much larger) than the number of
- weights, you are unlikely to get overfitting, but you may suffer from
- underfitting. For a noise-free quantitative target variable, twice as many
- training cases as weights may be more than enough to avoid overfitting. For
- a very noisy categorical target variable, 30 times as many training cases as
- weights may not be enough to avoid overfitting.
-
- An intelligent choice of the number of hidden units depends on whether you
- are using early stopping or some other form of regularization. If not, you
- must simply try many networks with different numbers of hidden units,
- estimate the generalization error for each one, and choose the network with
- the minimum estimated generalization error. For examples using statistical
- criteria to choose the number of hidden units, see
- ftp://ftp.sas.com/pub/neural/dojo/dojo.html.
-
- Using conventional optimization algorithms (see "What are conjugate
- gradients, Levenberg-Marquardt, etc.?"), there is little point in trying a
- network with more weights than training cases, since such a large network is
- likely to overfit.
-
- Using standard online backprop, however, Lawrence, Giles, and Tsoi (1996,
- 1997) have shown that it can be difficult to reduce training error to a
- level near the globally optimal value, even when using more weights than
- training cases. But increasing the number of weights makes it easier for
- standard backprop to find a good local optimum, so using "oversize" networks
- can reduce both training error and generalization error.
-
- If you are using early stopping, it is essential to use lots of hidden units
- to avoid bad local optima (Sarle 1995). There seems to be no upper limit on
- the number of hidden units, other than that imposed by computer time and
- memory requirements. Weigend (1994) makes this assertion, but provides only
- one example as evidence. Tetko, Livingstone, and Luik (1995) provide
- simulation studies that are more convincing. Similar results were obtained
- in conjunction with the simulations in Sarle (1995), but those results are
- not reported in the paper for lack of space. On the other hand, there seems
- to be no advantage to using more hidden units than you have training cases,
- since bad local minima do not occur with so many hidden units.
-
- If you are using weight decay or Bayesian estimation, you can also use lots
- of hidden units (Neal 1996). However, it is not strictly necessary to do so,
- because other methods are available to avoid local minima, such as multiple
- random starts and simulated annealing (such methods are not safe to use with
- early stopping). You can use one network with lots of hidden units, or you
- can try different networks with different numbers of hidden units, and
- choose on the basis of estimated generalization error. With weight decay or
- MAP Bayesian estimation, it is prudent to keep the number of weights less
- than half the number of training cases.
-
- Bear in mind that with two or more inputs, an MLP with one hidden layer
- containing only a few units can fit only a limited variety of target
- functions. Even simple, smooth surfaces such as a Gaussian bump in two
- dimensions may require 20 to 50 hidden units for a close approximation.
- Networks with a smaller number of hidden units often produce spurious ridges
- and valleys in the output surface (see Chester 1990 and "How do MLPs compare
- with RBFs?") Training a network with 20 hidden units will typically require
- anywhere from 150 to 2500 training cases if you do not use early stopping or
- regularization. Hence, if you have a smaller training set than that, it is
- usually advisable to use early stopping or regularization rather than to
- restrict the net to a small number of hidden units.
-
- Ordinary RBF networks containing only a few hidden units also produce
- peculiar, bumpy output functions. Normalized RBF networks are better at
- approximating simple smooth surfaces with a small number of hidden units
- (see "How do MLPs compare with RBFs?").
-
- There are various theoretical results on how fast approximation error
- decreases as the number of hidden units increases, but the conclusions are
- quite sensitive to the assumptions regarding the function you are trying to
- approximate. See p. 178 in Ripley (1996) for a summary. According to a
- well-known result by Barron (1993), in a network with I inputs and H units
- in a single hidden layer, the root integrated squared error (RISE) will
- decrease at least as fast as H^(-1/2) under some quite complicated
- smoothness assumptions. Ripley cites another paper by DeVore et al. (1989)
- that says if the function has only R bounded derivatives, RISE may decrease
- as slowly as H^(-R/I). For some examples with one to four hidden layers,
- see "How many hidden layers should I use?" and "How do MLPs compare with
- RBFs?"
-
- For learning a finite training set exactly, bounds for the number of hidden
- units required are provided by Elisseeff and Paugam-Moisy (1997).
-
- References:
-
- Barron, A.R. (1993), "Universal approximation bounds for superpositions
- of a sigmoid function," IEEE Transactions on Information Theory, 39,
- 930-945.
-
- Berry, M.J.A., and Linoff, G. (1997), Data Mining Techniques, NY: John
- Wiley & Sons.
-
- Blum, A. (1992), Neural Networks in C++, NY: Wiley.
-
- Boger, Z., and Guterman, H. (1997), "Knowledge extraction from artificial
- neural network models," IEEE Systems, Man, and Cybernetics Conference,
- Orlando, FL.
-
- Chester, D.L. (1990), "Why Two Hidden Layers are Better than One,"
- IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268.
-
- DeVore, R.A., Howard, R., and Micchelli, C.A. (1989), "Optimal nonlinear
- approximation," Manuscripta Mathematica, 63, 469-478.
-
- Elisseeff, A., and Paugam-Moisy, H. (1997), "Size of multilayer networks
- for exact learning: analytic approach," in Mozer, M.C., Jordan, M.I., and
- Petsche, T., (eds.) Advances in Neural Information Processing Systems 9,
- Cambridge, MA: The MIT Press, pp. 162-168.
-
- Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
- the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
-
- Lawrence, S., Giles, C.L., and Tsoi, A.C. (1996), "What size neural
- network gives optimal generalization? Convergence properties of
- backpropagation," Technical Report UMIACS-TR-96-22 and CS-TR-3617,
- Institute for Advanced Computer Studies, University of Maryland, College
- Park, MD 20742,
- http://www.neci.nj.nec.com/homepages/lawrence/papers/minima-tr96/minima-tr96.html
-
- Lawrence, S., Giles, C.L., and Tsoi, A.C. (1997), "Lessons in Neural
- Network Training: Overfitting May be Harder than Expected," Proceedings
- of the Fourteenth National Conference on Artificial Intelligence,
- AAAI-97, AAAI Press, Menlo Park, California, pp. 540-545,
- http://www.neci.nj.nec.com/homepages/lawrence/papers/overfitting-aaai97/overfitting-aaai97-bib.html
-
- Neal, R. M. (1996) Bayesian Learning for Neural Networks, New York:
- Springer-Verlag, ISBN 0-387-94724-8.
-
- Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
- Cambridge University Press.
-
- Sarle, W.S. (1995), "Stopped Training and Other Remedies for
- Overfitting," Proceedings of the 27th Symposium on the Interface of
- Computing Science and Statistics, 352-360,
- ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
- compressed postscript file, 747K, 10 pages)
-
- Swingler, K. (1996), Applying Neural Networks: A Practical Guide,
- London: Academic Press.
-
- Tetko, I.V., Livingstone, D.J., and Luik, A.I. (1995), "Neural Network
- Studies. 1. Comparison of Overfitting and Overtraining," J. Chem. Info.
- Comp. Sci., 35, 826-833.
-
- Weigend, A. (1994), "On overfitting and the effective number of hidden
- units," Proceedings of the 1993 Connectionist Models Summer School,
- 335-342.
-
- ------------------------------------------------------------------------
-
- Subject: How can generalization error be estimated?
- ====================================================
-
- There are many methods for estimating generalization error.
-
- Single-sample statistics: AIC, SBC, MDL, FPE, Mallows' C_p, etc.
- In linear models, statistical theory provides several simple estimators
- of the generalization error under various sampling assumptions
- (Darlington 1968; Efron and Tibshirani 1993; Miller 1990). These
- estimators adjust the training error for the number of weights being
- estimated, and in some cases for the noise variance if that is known.
-
- These statistics can also be used as crude estimates of the
- generalization error in nonlinear models when you have a "large" training
- set. Correcting these statistics for nonlinearity requires substantially
- more computation (Moody 1992), and the theory does not always hold for
- neural networks due to violations of the regularity conditions.
-
- Among the simple generalization estimators that do not require the noise
- variance to be known, Schwarz's Bayesian Criterion (known as both SBC and
- BIC; Schwarz 1978; Judge et al. 1980; Raftery 1995) often works well for
- NNs (Sarle 1995, 1999). AIC and FPE tend to overfit with NNs. Rissanen's
- Minimum Description Length principle (MDL; Rissanen 1978, 1987, 1999) is
- closely related to SBC. A special issue of Computer Journal contains
- several articles on MDL, which can be found online at
- http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
- Several articles on SBC/BIC are available at the University of
- Washington's web site at http://www.stat.washington.edu/tech.reports
-
- For classification problems, the formulas are not as simple as for
- regression with normal noise. See Efron (1986) regarding logistic
- regression.
-
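- To make the adjustment concrete, here is a sketch of the standard AIC
- and SBC formulas for least-squares estimation with normal noise,
- treating the number of weights as the number of estimated parameters (a
- crude use for neural nets, as noted above):
-
-       from math import log
-
-       def aic_sbc(sse, n, p):
-           """AIC and SBC for p estimated weights, n training cases,
-           and training sum of squared errors sse (normal noise)."""
-           aic = n * log(sse / n) + 2 * p
-           sbc = n * log(sse / n) + p * log(n)
-           return aic, sbc
-
-       print(aic_sbc(sse=12.5, n=200, p=9))   # lower values are preferred
-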
- Split-sample or hold-out validation.
- The most commonly used method for estimating generalization error in
- neural networks is to reserve part of the data as a "test" set, which
- must not be used in any way during training. The test set must be a
- representative sample of the cases that you want to generalize to. After
- training, run the network on the test set, and the error on the test set
- provides an unbiased estimate of the generalization error, provided that
- the test set was chosen randomly. The disadvantage of split-sample
- validation is that it reduces the amount of data available for both
- training and validation. See Weiss and Kulikowski (1991).
-
- Cross-validation (e.g., leave one out).
- Cross-validation is an improvement on split-sample validation that allows
- you to use all of the data for training. The disadvantage of
- cross-validation is that you have to retrain the net many times. See
- "What are cross-validation and bootstrapping?".
-
- Bootstrapping.
- Bootstrapping is an improvement on cross-validation that often provides
- better estimates of generalization error at the cost of even more
- computing time. See "What are cross-validation and bootstrapping?".
-
- If you use any of the above methods to choose which of several different
- networks to use for prediction purposes, the estimate of the generalization
- error of the best network will be optimistic. For example, if you train
- several networks using one data set, and use a second (validation set) data
- set to decide which network is best, you must use a third (test set) data
- set to obtain an unbiased estimate of the generalization error of the chosen
- network. Hjorth (1994) explains how this principle extends to
- cross-validation and bootstrapping.
-
- References:
-
- Darlington, R.B. (1968), "Multiple Regression in Psychological Research
- and Practice," Psychological Bulletin, 69, 161-182.
-
- Efron, B. (1986), "How biased is the apparent error rate of a prediction
- rule?" J. of the American Statistical Association, 81, 461-470.
-
- Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
- London: Chapman & Hall.
-
- Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods:
- Validation, Model Selection, and Bootstrap, London: Chapman & Hall.
-
- Miller, A.J. (1990), Subset Selection in Regression, London: Chapman &
- Hall.
-
- Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
- Generalization and Regularization in Nonlinear Learning Systems", in
- Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural
- Information Processing Systems 4, 847-854.
-
- Raftery, A.E. (1995), "Bayesian Model Selection in Social Research," in
- Marsden, P.V. (ed.), Sociological Methodology 1995, Cambridge, MA:
- Blackwell, ftp://ftp.stat.washington.edu/pub/tech.reports/bic.ps.z or
- http://www.stat.washington.edu/tech.reports/bic.ps
-
- Rissanen, J. (1978), "Modelling by shortest data description,"
- Automatica, 14, 465-471.
-
- Rissanen, J. (1987), "Stochastic complexity" (with discussion), J. of the
- Royal Statistical Society, Series B, 49, 223-239.
-
- Rissanen, J. (1999), "Hypothesis Selection and Testing by the MDL
- Principle," Computer Journal, 42, 260-269,
- http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
-
- Sarle, W.S. (1995), "Stopped Training and Other Remedies for
- Overfitting," Proceedings of the 27th Symposium on the Interface of
- Computing Science and Statistics, 352-360,
- ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
- compressed postscript file, 747K, 10 pages)
-
- Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results,"
- ftp://ftp.sas.com/pub/neural/dojo/dojo.html
-
- Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn,
- Morgan Kaufmann.
-
- ------------------------------------------------------------------------
-
- Subject: What are cross-validation and bootstrapping?
- ======================================================
-
- Cross-validation and bootstrapping are both methods for estimating
- generalization error based on "resampling" (Weiss and Kulikowski 1991; Efron
- and Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White 1994; Shao
- and Tu 1995). The resulting estimates of generalization error are often used
- for choosing among various models, such as different network architectures.
-
- Cross-validation
- ++++++++++++++++
-
- In k-fold cross-validation, you divide the data into k subsets of
- (approximately) equal size. You train the net k times, each time leaving
- out one of the subsets from training, but using only the omitted subset to
- compute whatever error criterion interests you. If k equals the sample
- size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
- more elaborate and expensive version of cross-validation that involves
- leaving out all possible subsets of v cases.
-
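- A minimal sketch of k-fold cross-validation in Python (train() and
- error() are hypothetical stand-ins for your fitting routine and error
- criterion):
-
-       import random
-
-       def k_fold_cv(train, error, data, k=10):
-           """Average the error criterion over k held-out subsets."""
-           data = data[:]
-           random.shuffle(data)
-           folds = [data[i::k] for i in range(k)]   # k roughly equal subsets
-           errors = []
-           for i in range(k):
-               held_out = folds[i]
-               training = [case for j, f in enumerate(folds)
-                           if j != i for case in f]
-               model = train(training)              # train on k-1 subsets
-               errors.append(error(model, held_out))  # test on omitted subset
-           return sum(errors) / k
-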
- Note that cross-validation is quite different from the "split-sample" or
- "hold-out" method that is commonly used for early stopping in NNs. In the
- split-sample method, only a single subset (the validation set) is used to
- estimate the generalization error, instead of k different subsets; i.e.,
- there is no "crossing". While various people have suggested that
- cross-validation be applied to early stopping, the proper way of doing so is
- not obvious.
-
- The distinction between cross-validation and split-sample validation is
- extremely important because cross-validation is markedly superior for small
- data sets; this fact is demonstrated dramatically by Goutte (1997) in a
- reply to Zhu and Rohwer (1996). For an insightful discussion of the
- limitations of cross-validatory choice among several learning methods, see
- Stone (1977).
-
- Jackknifing
- +++++++++++
-
- Leave-one-out cross-validation is also easily confused with jackknifing.
- Both involve omitting each training case in turn and retraining the network
- on the remaining subset. But cross-validation is used to estimate
- generalization error, while the jackknife is used to estimate the bias of a
- statistic. In the jackknife, you compute some statistic of interest in each
- subset of the data. The average of these subset statistics is compared with
- the corresponding statistic computed from the entire sample in order to
- estimate the bias of the latter. You can also get a jackknife estimate of
- the standard error of a statistic. Jackknifing can be used to estimate the
- bias of the training error and hence to estimate the generalization error,
- but this process is more complicated than leave-one-out cross-validation
- (Efron, 1982; Ripley, 1996, p. 73).
-
- Choice of cross-validation method
- +++++++++++++++++++++++++++++++++
-
- Cross-validation can be used simply to estimate the generalization error of
- a given model, or it can be used for model selection by choosing one of
- several models that has the smallest estimated generalization error. For
- example, you might use cross-validation to choose the number of hidden
- units, or you could use cross-validation to choose a subset of the inputs
- (subset selection). A subset that contains all relevant inputs will be
- called a "good" subsets, while the subset that contains all relevant inputs
- but no others will be called the "best" subset. Note that subsets are "good"
- and "best" in an asymptotic sense (as the number of training cases goes to
- infinity). With a small training set, it is possible that a subset that is
- smaller than the "best" subset may provide better generalization error.
-
- Leave-one-out cross-validation often works well for estimating
- generalization error for continuous error functions such as the mean squared
- error, but it may perform poorly for discontinuous error functions such as
- the number of misclassified cases. In the latter case, k-fold
- cross-validation is preferred. But if k gets too small, the error estimate
- is pessimistically biased because of the difference in training-set size
- between the full-sample analysis and the cross-validation analyses. (For
- model-selection purposes, this bias can actually help; see the discussion
- below of Shao, 1993.) A value of 10 for k is popular for estimating
- generalization error.
-
- Leave-one-out cross-validation can also run into trouble with various
- model-selection methods. Again, one problem is lack of continuity--a small
- change in the data can cause a large change in the model selected (Breiman,
- 1996). For choosing subsets of inputs in linear regression, Breiman and
- Spector (1992) found 10-fold and 5-fold cross-validation to work better than
- leave-one-out. Kohavi (1995) also obtained good results for 10-fold
- cross-validation with empirical decision trees (C4.5). Values of k as small
- as 5 or even 2 may work even better if you analyze several different random
- k-way splits of the data to reduce the variability of the cross-validation
- estimate.
-
- Leave-one-out cross-validation also has more subtle deficiencies for model
- selection. Shao (1995) showed that in linear models, leave-one-out
- cross-validation is asymptotically equivalent to AIC (and Mallows' C_p), but
- leave-v-out cross-validation is asymptotically equivalent to Schwarz's
- Bayesian criterion (called SBC or BIC) when v = n[1-1/(log(n)-1)],
- where n is the number of training cases. SBC
- provides consistent subset-selection, while AIC does not. That is, SBC will
- choose the "best" subset with probability approaching one as the size of the
- training set goes to infinity. AIC has an asymptotic probability of one of
- choosing a "good" subset, but less than one of choosing the "best" subset
- (Stone, 1979). Many simulation studies have also found that AIC overfits
- badly in small samples, and that SBC works well (e.g., Hurvich and Tsai,
- 1989; Shao and Tu, 1995). Hence, these results suggest that leave-one-out
- cross-validation should overfit in small samples, but leave-v-out
- cross-validation with appropriate v should do better. However, when true
- models have an infinite number of parameters, SBC is not efficient, and
- other criteria that are asymptotically efficient but not consistent for
- model selection may produce better generalization (Hurvich and Tsai, 1989).
-
- Shao (1993) obtained the surprising result that for selecting subsets of
- inputs in a linear regression, the probability of selecting the "best" does
- not converge to 1 (as the sample size n goes to infinity) for leave-v-out
- cross-validation unless the proportion v/n approaches 1. At first glance,
- Shao's result seems inconsistent with the analysis by Kearns (1997) of
- split-sample validation, which shows that the best generalization is
- obtained with v/n strictly between 0 and 1, with little sensitivity to the
- precise value of v/n for large data sets. But the apparent conflict is due
- to the fundamentally different properties of cross-validation and
- split-sample validation.
-
- To obtain an intuitive understanding of Shao (1993), let's review some
- background material on generalization error. Generalization error can be
- broken down into three additive parts, noise variance + estimation variance
- + squared estimation bias. Noise variance is the same for all subsets of
- inputs. Bias is nonzero for subsets that are not "good", but it's zero for
- all "good" subsets, since we are assuming that the function to be learned is
- linear. Hence the generalization error of "good" subsets will differ only in
- the estimation variance. The estimation variance is (2p/t)s^2 where p
- is the number of inputs in the subset, t is the training set size, and s^2
- is the noise variance. The "best" subset is better than other "good" subsets
- only because the "best" subset has (by definition) the smallest value of p.
- But the t in the denominator means that differences in generalization error
- among the "good" subsets will all go to zero as t goes to infinity.
- Therefore it is difficult to guess which subset is "best" based on the
- generalization error even when t is very large. It is well known that
- unbiased estimates of the generalization error, such as those based on AIC,
- FPE, and C_p, do not produce consistent estimates of the "best" subset
- (e.g., see Stone, 1979).
-
- In leave-v-out cross-validation, t=n-v. The differences of the
- cross-validation estimates of generalization error among the "good" subsets
- contain a factor 1/t, not 1/n. Therefore by making t small enough (and
- thereby making each regression based on t cases bad enough), we can make
- the differences of the cross-validation estimates large enough to detect. It
- turns out that to make t small enough to guess the "best" subset
- consistently, we have to have t/n go to 0 as n goes to infinity.
-
- The crucial distinction between cross-validation and split-sample validation
- is that with cross-validation, after guessing the "best" subset, we train
- the linear regression model for that subset using all n cases, but with
- split-sample validation, only t cases are ever used for training. If our
- main purpose were really to choose the "best" subset, I suspect we would
- still have to have t/n go to 0 even for split-sample validation. But
- choosing the "best" subset is not the same thing as getting the best
- generalization. If we are more interested in getting good generalization
- than in choosing the "best" subset, we do not want to make our regression
- estimate based on only t cases as bad as we do in cross-validation, because
- in split-sample validation that bad regression estimate is what we're stuck
- with. So there is no conflict between Shao and Kearns, but there is a
- conflict between the two goals of choosing the "best" subset and getting the
- best generalization in split-sample validation.
-
- Bootstrapping
- +++++++++++++
-
- Bootstrapping seems to work better than cross-validation in many cases
- (Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly
- analyzing subsets of the data, you repeatedly analyze subsamples of the
- data. Each subsample is a random sample with replacement from the full
- sample. Depending on what you want to do, anywhere from 50 to 2000
- subsamples might be used. There are many more sophisticated bootstrap
- methods that can be used not only for estimating generalization error but
- also for estimating confidence bounds for network outputs (Efron and
- Tibshirani 1993). For estimating generalization error in classification
- problems, the .632+ bootstrap (an improvement on the popular .632 bootstrap)
- is one of the currently favored methods that has the advantage of performing
- well even when there is severe overfitting. Use of bootstrapping for NNs is
- described in Baxt and White (1995), Tibshirani (1996), and Masters (1995).
- However, the results obtained so far are not very thorough, and it is known
- that bootstrapping does not work well for some other methodologies such as
- empirical decision trees (Breiman, Friedman, Olshen, and Stone, 1984;
- Kohavi, 1995), for which it can be excessively optimistic.
-
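- Here is a sketch of that simplest form (same hypothetical train() and
- error() conventions as the cross-validation sketch earlier; the .632 and
- .632+ refinements, which reweight errors on cases omitted from each
- resample, are not shown):
-
-       import random
-
-       def bootstrap_error(train, error, data, n_boot=200):
-           """Average the error of models trained on random resamples
-           (drawn with replacement) of the full sample."""
-           estimates = []
-           for _ in range(n_boot):
-               resample = [random.choice(data) for _ in data]
-               estimates.append(error(train(resample), data))
-           return sum(estimates) / n_boot
-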
- For further information
- +++++++++++++++++++++++
-
- Cross-validation and bootstrapping become considerably more complicated for
- time series data; see Hjorth (1994) and Snijders (1988).
-
- More information on jackknife and bootstrap confidence intervals is
- available at ftp://ftp.sas.com/pub/neural/jackboot.sas (this is a plain-text
- file).
-
- References:
-
- Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals for
- clinical input variable effects in a network trained to identify the
- presence of acute myocardial infarction", Neural Computation, 7, 624-638.
-
- Breiman, L. (1996), "Heuristics of instability and stabilization in model
- selection," Annals of Statistics, 24, 2350-2383.
-
- Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984),
- Classification and Regression Trees, Belmont, CA: Wadsworth.
-
- Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation
- in regression: The X-random case," International Statistical Review, 60,
- 291-319.
-
- Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical
- Implications, Proceedings of a workshop held in Groningen, The
- Netherlands, September 25-26, 1986, Berlin: Springer-Verlag.
-
- Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling
- Plans, Philadelphia: SIAM.
-
- Efron, B. (1983), "Estimating the error rate of a prediction rule:
- Improvement on cross-validation," J. of the American Statistical
- Association, 78, 316-331.
-
- Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
- London: Chapman & Hall.
-
- Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation:
- The .632+ bootstrap method," J. of the American Statistical Association,
- 92, 548-560.
-
- Goutte, C. (1997), "Note on free lunches and cross-validation," Neural
- Computation, 9, 1211-1215,
- ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz.
-
- Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods Validation,
- Model Selection, and Bootstrap, London: Chapman & Hall.
-
- Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model
- selection in small samples," Biometrika, 76, 297-307.
-
- Kearns, M. (1997), "A bound on the error of cross validation using the
- approximation and estimation rates, with consequences for the
- training-test split," Neural Computation, 9, 1143-1161.
-
- Kohavi, R. (1995), "A study of cross-validation and bootstrap for
- accuracy estimation and model selection," International Joint Conference
- on Artificial Intelligence (IJCAI), pp. ?,
- http://robotics.stanford.edu/users/ronnyk/
-
- Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
- Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0
-
- Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation
- estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.)
- Advances in Neural Information Processing Systems 6, San Mateo, CA:
- Morgan Kaufman, pp. 391-398.
-
- Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
- Cambridge University Press.
-
- Shao, J. (1993), "Linear model selection by cross-validation," J. of the
- American Statistical Association, 88, 486-494.
-
- Shao, J. (1995), "An asymptotic theory for linear model selection,"
- Statistica Sinica ?.
-
- Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York:
- Springer-Verlag.
-
- Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in
- time series," in Dijkstra (1988), pp. 56-69.
-
- Stone, M. (1977), "Asymptotics for and against cross-validation,"
- Biometrika, 64, 29-35.
-
- Stone, M. (1979), "Comments on model selection criteria of Akaike and
- Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278.
-
- Tibshirani, R. (1996), "A comparison of some error estimates for neural
- network models," Neural Computation, 8, 152-163.
-
- Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn,
- Morgan Kaufmann.
-
- Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation,"
- Neural Computation, 8, 1421-1426.
-
- ------------------------------------------------------------------------
-
- Subject: How to compute prediction and confidence intervals (error bars)?
- ==========================================================================
-
- (This answer is only about half finished. I will get around to the other
- half eventually.)
-
- In addition to estimating over-all generalization error, it is often useful
- to be able to estimate the accuracy of the network's predictions for
- individual cases.
-
- Let:
-
- Y      = the target variable
- y_i    = the value of Y for the ith case
- X      = the vector of input variables
- x_i    = the value of X for the ith case
- N      = the noise in the target variable
- n_i    = the value of N for the ith case
- m(X)   = E(Y|X) = the conditional mean of Y given X
- w      = a vector of weights for a neural network
- w^     = the weights obtained by training the network
- p(X,w) = the output of a neural network given input X and weights w
- p_i    = p(x_i,w)
- L      = the number of training (learning) cases, (y_i,x_i), i=1, ..., L
- Q(w)   = the objective function
-
- Assume the data are generated by the model:
-
- Y = m(X) + N
- E(N|X) = 0
- N and X are independent
-
- The network is trained by attempting to minimize the objective function
- Q(w), which, for example, could be the sum of squared errors or the
- negative log likelihood based on an assumed family of noise distributions.
-
- Given a test input x_0, a 100c% prediction interval for y_0 is an
- interval [LPB_0,UPB_0] such that Pr(LPB_0 <= y_0 <= UPB_0) = c, where
- c is typically .95 or .99, and the probability is computed over
- repeated random selection of the training set and repeated observation
- of Y given the test input x_0. A 100c% confidence interval for p_0 is
- an interval [LCB_0,UCB_0] such that Pr(LCB_0 <= p_0 <= UCB_0) = c,
- where again the probability is computed over repeated random selection
- of the training set. Note that p_0 is a
- nonrandom quantity, since x_0 is given. A confidence interval is narrower
- than the corresponding prediction interval, since the prediction interval
- must include variation due to noise in y_0, while the confidence interval
- does not. Both intervals include variation due to sampling of the training
- set and possible variation in the training process due, for example, to
- random initial weights and local minima of the objective function.
-
- Traditional statistical methods for nonlinear models depend on several
- assumptions (Gallant, 1987):
-
- 1. The inputs for the training cases are either fixed or obtained by simple
- random sampling or some similarly well-behaved process.
- 2. Q(w) has continuous first and second partial derivatives with respect
- to w over some convex, bounded subset S_W of the weight space.
- 3. Q(w) has a unique global minimum at w^, which is an interior point of
- S_W.
- 4. The model is well-specified, which requires (a) that there exist weights
- w$ in the interior of S_W such that m(x) = p(x,w$), and (b)
- that the assumptions about the noise distribution are correct. (Sorry
- about the w$ notation, but I'm running out of plain text symbols.)
-
- These traditional methods are based on a linear approximation to p(x,w)
- in a neighborhood of w$, yielding a quadratic approximation to Q(w).
- Hence the Hessian of Q(w) (the square matrix of second-order partial
- derivatives with respect to w) frequently appears in these methods.
-
- Assumption (3) is not satisfied for neural nets, because networks with
- hidden units always have multiple global minima, and the global minima are
- often improper. Hence, confidence intervals for the weights cannot be
- obtained using standard Hessian-based methods. However, Hwang and Ding
- (1997) have shown that confidence intervals for predicted values can be
- obtained because the predicted values are statistically identified even
- though the weights are not.
-
- Cardell, Joerding, and Li (1994) describe a more serious violation of
- assumption (3), namely that for some m(x), no finite global minimum
- exists. In such situations, it may be possible to use regularization methods
- such as weight decay to obtain valid confidence intervals (De Veaux, Schumi,
- Schweinsberg, and Ungar, 1998), but more research is required on this
- subject, since the derivation in the cited paper assumes a finite global
- minimum.
-
- For large samples, the sampling variability in w^ can be approximated in
- various ways:
-
- o Fisher's information matrix, which is the expected value of the Hessian
- of Q(w) divided by L, can be used when Q(w) is the negative log
- likelihood (Spall, 1998).
- o The delta method, based on the Hessian of Q(w) or the Gauss-Newton
- approximation using the cross-product Jacobian of Q(w), can also be
- used when Q(w) is the negative log likelihood (Tibshirani, 1996; Hwang
- and Ding, 1997; De Veaux, Schumi, Schweinsberg, and Ungar, 1998).
- o The sandwich estimator, a more elaborate Hessian-based method, relaxes
- assumption (4) (Gallant, 1987; White, 1989; Tibshirani, 1996).
- o Bootstrapping can be used without knowing the form of the noise
- distribution and takes into account variability introduced by local
- minima in training, but requires training the network many times on
- different resamples of the training set (Tibshirani, 1996; Heskes,
- 1997); see the sketch below.
-
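- Here is a rough sketch of the bootstrap approach from the last item
- above, using only the simple percentile method (train() and predict()
- are hypothetical stand-ins):
-
-       import random
-
-       def bootstrap_interval(train, predict, data, x_0,
-                              n_boot=200, c=0.95):
-           """Percentile interval for p_0 = p(x_0,w), over networks
-           retrained on bootstrap resamples of the training set."""
-           preds = sorted(
-               predict(train([random.choice(data) for _ in data]), x_0)
-               for _ in range(n_boot))
-           lo = int(n_boot * (1 - c) / 2)
-           hi = int(n_boot * (1 + c) / 2) - 1
-           return preds[lo], preds[hi]
-
- This approximates a confidence interval for p_0; a prediction interval
- for y_0 must also include the noise variation, for example by adding
- resampled residuals to the bootstrap predictions.
-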
- References:
-
- Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why some feedforward
- networks cannot learn some polynomials," Neural Computation, 6, 761-766.
-
- De Veaux,R.D., Schumi, J., Schweinsberg, J., and Ungar, L.H. (1998),
- "Prediction intervals for neural networks via nonlinear regression,"
- Technometrics, 40, 273-282.
-
- Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.
-
- Heskes, T. (1997), "Practical confidence and prediction intervals," in
- Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural
- Information Processing Systems 9, Cambridge, MA: The MIT Press, pp.
- 176-182.
-
- Hwang, J.T.G., and Ding, A.A. (1997), "Prediction intervals for
- artificial neural networks," J. of the American Statistical Association,
- 92, 748-757.
-
- Nix, D.A., and Weigend, A.S. (1995), "Learning local error bars for
- nonlinear regression," in Tesauro, G., Touretzky, D., and Leen, T.,
- (eds.) Advances in Neural Information Processing Systems 7, Cambridge,
- MA: The MIT Press, pp. 489-496.
-
- Spall, J.C. (1998), "Resampling-based calculation of the information
- matrix in nonlinear statistical models," Proceedings of the 4th Joint
- Conference on Information Sciences, October 23-28, Research Triangle
- Park, NC, USA, Vol. 4, pp. 35-39.
-
- Tibshirani, R. (1996), "A comparison of some error estimates for neural
- network models," Neural Computation, 8, 152-163.
-
- White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
- Layer Feedforward Network Models", J. of the American Statistical Assoc.,
- 84, 1008-1013.
-
- ------------------------------------------------------------------------
-
- Next part is part 4 (of 7). Previous part is part 2.
-
- --
-
- Warren S. Sarle SAS Institute Inc. The opinions expressed here
- saswss@unx.sas.com SAS Campus Drive are mine and not necessarily
- (919) 677-8000 Cary, NC 27513, USA those of SAS Institute.
-