- Newsgroups: comp.ai.neural-nets
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!cis.ohio-state.edu!news.sei.cmu.edu!bb3.andrew.cmu.edu!crabapple.srv.cs.cmu.edu!news
- From: sef@sef-pmax.slisp.cs.cmu.edu
- Subject: Re: some basic questions
- Message-ID: <C0E950.10B.1@cs.cmu.edu>
- Sender: news@cs.cmu.edu (Usenet News System)
- Nntp-Posting-Host: sef-pmax.slisp.cs.cmu.edu
- Organization: School of Computer Science, Carnegie Mellon
- Date: Tue, 5 Jan 1993 18:38:55 GMT
- Lines: 59
-
-
- From: danz+@CENTRO.SOAR.CS.CMU.EDU (Dan Zhu)
-
- - Is symmetric sigmoid function (-0.5, 0.5) or (-1, 1) always
- better than the asymmetric one (0, 1)? Any reference?
-
- Almost always. In my quickprop paper (citation below) I explore this on a
- few simple learning tasks. Asymmetric sigmoid learned faster on simple
- encoder tasks, but I view that as an unusual feature of that problem.
- There's a paper by Stornetta and Huberman in the 1987 IEEE ICNN that
- claims superiority for symmetric sigmoids, and a nice paper in NIPS 3 by
- LeCun, Kanter, and Solla presents some theory that may explain why
- symmetric sigmoids work better.
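-
- For concreteness, here is a toy sketch in Python/numpy of the three output
- ranges being discussed (the function names are just placeholders, not any
- particular simulator's API):
-
-     import numpy as np
-
-     def asymmetric_sigmoid(x):
-         # standard logistic sigmoid: output in (0, 1)
-         return 1.0 / (1.0 + np.exp(-x))
-
-     def symmetric_sigmoid(x):
-         # same shape, shifted to be symmetric about zero: output in (-0.5, 0.5)
-         return 1.0 / (1.0 + np.exp(-x)) - 0.5
-
-     def tanh_sigmoid(x):
-         # another common symmetric choice: output in (-1, 1)
-         return np.tanh(x)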
-
- - Shall I do this kind of scaling with the input representation also?
-
- Your first layer of weights should be able to learn to do the scaling for
- you. However, some researchers report better and much faster results if
- you pre-scale the inputs so that all are in the same range. I suspect that
- this is a trivial difference for fast, fairly robust learning algorithms
- and a more important difference for slower algorithms that may tend to get
- stuck.
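-
- As a rough illustration, a hypothetical pre-scaling step in Python/numpy
- (map each input column linearly into [-0.5, 0.5]; reuse the same min/max
- when scaling the validation and test data):
-
-     import numpy as np
-
-     def rescale_inputs(X):
-         lo = X.min(axis=0)
-         hi = X.max(axis=0)
-         span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
-         return (X - lo) / span - 0.5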
-
- - What would be a good cut point to test the network from time to time
- to avoid the overtraining?
-
- The usual practice is to use a separate training and validation set.
- Periodically, you run the validation set through the partially trained net,
- and you stop (and maybe revert to the best weights seen so far) when you
- see the generalization starting to get worse. How often you check this
- depends on your goals: checking frequently is more expensive, but it stops
- you closer to the right place.
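-
- A rough sketch of that loop in Python (train_step and validation_error are
- hypothetical stand-ins for whatever your simulator provides):
-
-     import copy
-
-     def train_with_early_stopping(net, train_step, validation_error,
-                                   check_every=5, patience=3, max_epochs=1000):
-         # train_step(net) runs one epoch of training; validation_error(net)
-         # returns the error on the held-out validation set
-         best_err = float("inf")
-         best_net = copy.deepcopy(net)
-         bad_checks = 0
-         for epoch in range(1, max_epochs + 1):
-             train_step(net)
-             if epoch % check_every == 0:      # checking often costs more,
-                 err = validation_error(net)   # but stops you closer to the
-                 if err < best_err:            # right place
-                     best_err = err
-                     best_net = copy.deepcopy(net)
-                     bad_checks = 0
-                 else:
-                     bad_checks += 1
-                     if bad_checks >= patience:
-                         break
-         return best_net                       # "revert": keep the best net seen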
-
- - I remember I read something like "three layer (with one hidden layer)
- is sufficient for the generalization of the network...". Could anyone
- give me any pointer to the exact reference for it?
-
- I don't have the references handy. There have been various results along
- this line by Cybenko, Hal White, Sontag, and others.
-
- - Also, is there any new reference about the clue for selecting
- the range of "hidden nodes", "learning rate", "momentum" and
- judgement about the "initial weight"?
-
- There's no new magic that I know of that would allow you to choose
- parameters and network topology just by looking at the data set (at least
- for non-toy problems). If you want a near-optimal result, you have to
- adjust the learning parameters and network topology dynamically.
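-
- If you want to automate some of that adjustment, the crudest version is just
- a loop over candidate settings. A toy Python sketch (build_and_train is a
- hypothetical stand-in for your own training code, returning validation error
- for a net with h hidden units, learning rate lr, momentum mom, and initial
- weights drawn uniformly from [-r, r]):
-
-     import itertools
-
-     def crude_search(build_and_train,
-                      hidden_sizes=(4, 8, 16),
-                      learning_rates=(0.1, 0.5),
-                      momenta=(0.0, 0.9),
-                      init_ranges=(0.1, 0.5)):
-         best = None
-         for h, lr, mom, r in itertools.product(hidden_sizes, learning_rates,
-                                                momenta, init_ranges):
-             err = build_and_train(h, lr, mom, r)
-             if best is None or err < best[0]:
-                 best = (err, h, lr, mom, r)
-         return best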
-
- -- Scott
-
- ===========================================================================
- Scott E. Fahlman Internet: sef+@cs.cmu.edu
- Senior Research Scientist Phone: 412 268-2575
- School of Computer Science Fax: 412 681-5739
- Carnegie Mellon University Latitude: 40:26:33 N
- 5000 Forbes Avenue Longitude: 79:56:48 W
- Pittsburgh, PA 15213
- ===========================================================================
-
-