- Newsgroups: comp.ai.neural-nets
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!cis.ohio-state.edu!news.sei.cmu.edu!bb3.andrew.cmu.edu!crabapple.srv.cs.cmu.edu!news
- From: sef@sef-pmax.slisp.cs.cmu.edu
- Subject: Re: some basic questions
- Message-ID: <C0E950.10B.1@cs.cmu.edu>
- Sender: news@cs.cmu.edu (Usenet News System)
- Nntp-Posting-Host: sef-pmax.slisp.cs.cmu.edu
- Organization: School of Computer Science, Carnegie Mellon
- Date: Tue, 5 Jan 1993 18:38:55 GMT
- Lines: 59
-
-
- From: danz+@CENTRO.SOAR.CS.CMU.EDU (Dan Zhu)
-
- - Is symmetric sigmoid function (-0.5, 0.5) or (-1, 1) always
- better than the asymmetric one (0, 1)? Any reference?
-
- Almost always. In my quickprop paper (citation below) I explore this on a
- few simple learning tasks. Asymmetric sigmoid learned faster on simple
- encoder tasks, but I view that as an unusual feature of that problem.
- There's a paper by Stornetta and Huberman in the 1987 IEEE ICNN that
- claims superiority for symmetric sigmoids, and a nice paper in NIPS 3 by
- LeCun, Kanter, and Solla presents some theory that may explain why
- symmetric sigmoids work better.
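-
- For concreteness, here is a toy sketch in Python/numpy of the three output
- ranges being discussed (the function names are just placeholders, not any
- particular simulator's API):
-
-     import numpy as np
-
-     def asymmetric_sigmoid(x):
-         # standard logistic sigmoid: output in (0, 1)
-         return 1.0 / (1.0 + np.exp(-x))
-
-     def symmetric_sigmoid(x):
-         # same shape, shifted to be symmetric about zero: output in (-0.5, 0.5)
-         return 1.0 / (1.0 + np.exp(-x)) - 0.5
-
-     def tanh_sigmoid(x):
-         # another common symmetric choice: output in (-1, 1)
-         return np.tanh(x)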
-
- - Shall I do this kind of scaling with the input representation also?
-
- Your first layer of weights should be able to learn to do the scaling for
- you. However, some researchers report better and much faster results if
- you pre-scale the inputs so that all are in the same range. I suspect that
- this is a trivial difference for fast, fairly robust learning algorithms
- and a more important difference for slower algorithms that may tend to get
- stuck.
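-
- As a rough illustration, a hypothetical pre-scaling step in Python/numpy
- (map each input column linearly into [-0.5, 0.5]; reuse the same min/max
- when scaling the validation and test data):
-
-     import numpy as np
-
-     def rescale_inputs(X):
-         lo = X.min(axis=0)
-         hi = X.max(axis=0)
-         span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
-         return (X - lo) / span - 0.5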
-
- - What would be a good cut point to test the network from time to time
- to avoid the overtraining?
-
- The usual practice is to use a separate training and validation set.
- Periodically, you run the validation set through the partially trained net,
- and you stop (and maybe revert to the best weights seen so far) when you
- see the generalization starting to get worse. How often you check this
- depends on your goals: checking frequently is more expensive, but it stops
- you closer to the right place.
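-
- A rough sketch of that loop in Python (train_step and validation_error are
- hypothetical stand-ins for whatever your simulator provides):
-
-     import copy
-
-     def train_with_early_stopping(net, train_step, validation_error,
-                                   check_every=5, patience=3, max_epochs=1000):
-         # train_step(net) runs one epoch of training; validation_error(net)
-         # returns the error on the held-out validation set
-         best_err = float("inf")
-         best_net = copy.deepcopy(net)
-         bad_checks = 0
-         for epoch in range(1, max_epochs + 1):
-             train_step(net)
-             if epoch % check_every == 0:      # checking often costs more,
-                 err = validation_error(net)   # but stops you closer to the
-                 if err < best_err:            # right place
-                     best_err = err
-                     best_net = copy.deepcopy(net)
-                     bad_checks = 0
-                 else:
-                     bad_checks += 1
-                     if bad_checks >= patience:
-                         break
-         return best_net                       # "revert": keep the best net seen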
-
- - I remember I read something like "three layer (with one hidden layer)
- is sufficient for the generalization of the network...". Could anyone
- give me any pointer to the exact reference for it?
-
- I don't have the references handy. There have been various results along
- this line by Cybenko, Hal White, Sontag, and others.
-
- - Also, is there any new reference about the clue for selecting
- the range of "hidden nodes", "learning rate", "momentum" and
- judgement about the "initial weight"?
-
- There's no new magic that I know of that would allow you to choose
- parameters and network topology just by looking at the data set (at least
- for non-toy problems). If you want a near-optimal result, you have to
- adjust the learning parameters and network topology dynamically.
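-
- If you want to automate some of that adjustment, the crudest version is just
- a loop over candidate settings. A toy Python sketch (build_and_train is a
- hypothetical stand-in for your own training code, returning validation error
- for a net with h hidden units, learning rate lr, momentum mom, and initial
- weights drawn uniformly from [-r, r]):
-
-     import itertools
-
-     def crude_search(build_and_train,
-                      hidden_sizes=(4, 8, 16),
-                      learning_rates=(0.1, 0.5),
-                      momenta=(0.0, 0.9),
-                      init_ranges=(0.1, 0.5)):
-         best = None
-         for h, lr, mom, r in itertools.product(hidden_sizes, learning_rates,
-                                                momenta, init_ranges):
-             err = build_and_train(h, lr, mom, r)
-             if best is None or err < best[0]:
-                 best = (err, h, lr, mom, r)
-         return best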
-
- -- Scott
-
- ===========================================================================
- Scott E. Fahlman Internet: sef+@cs.cmu.edu
- Senior Research Scientist Phone: 412 268-2575
- School of Computer Science Fax: 412 681-5739
- Carnegie Mellon University Latitude: 40:26:33 N
- 5000 Forbes Avenue Longitude: 79:56:48 W
- Pittsburgh, PA 15213
- ===========================================================================
-
-