- Path: sparky!uunet!snorkelwacker.mit.edu!ai-lab!sun-of-smokey!marcus
- From: marcus@sun-of-smokey.NoSubdomain.NoDomain (Jeff Marcus)
- Newsgroups: comp.ai.neural-nets
- Subject: Re: need for unique test sets
- Message-ID: <25761@life.ai.mit.edu>
- Date: 23 Jul 92 14:07:31 GMT
- References: <1992Jul19.070433.5896@afterlife.ncsc.mil> <arms.711645136@spedden> <1992Jul21.224019.6615@u.washington.edu>
- Sender: news@ai.mit.edu
- Organization: MIT/LCS Spoken Language Systems
- Lines: 35
-
- In article <1992Jul21.224019.6615@u.washington.edu>,
- davisd@milton.u.washington.edu (Daniel Davis) writes:
- |> I hope I can clear up this debate with a little specificity.
- |>
- |> A couple guys say that all we need is independent sampling, while
- |> someone else seems to think that one should not include the training
- |> data in the test set.
- |>
- |> Independent sampling is in fact all you need, but given the proper
- |> context, it is also proper to say that one should not include any of
- |> the training data in the test set.
- |>
- |> Suppose you take 10000 independent samples. You use 5000 as your
- |> training set. You would *not* select your test set from all 10000
- |> samples, but instead, only from the 5000 not included in the
- |> training set. If you selected test data from a random sampling of
- |> all 10000 samples, your test data and your training data would no
- |> longer be independent. Of the data you have, only the 5000
- |> previously unselected data correspond to data independent of your
- |> training set. In this sense, then, one should not include any of
- |> the training data in the test data.
- |>
- |> However, it is *not* a problem if it happens that some of the 5000
- |> previously unselected data are in fact repeats of the original
- |> training data, as it is assumed that the original 10000 were
- |> independent samples.
- |>
- |> Buy Buy -- Dan Davis
- |> Univ. of Washington, Dept. of EE, davisd@u.washington.edu
-
- Exactly. I should have been this specific in making my argument. It might
- have saved some bandwidth.
- Jeff
-
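For anyone who wants to see the point in code, here is a rough sketch of the split Dan describes. It is only an illustration: the function name split_holdout, the seeds, the test set size of 1000, and the toy data (integers in 0..99) are all assumptions of mine, not anything from the thread. The test set is drawn only from the samples left over after training, so no sample index appears in both sets, yet the same value can still turn up in both because the original draws were independent.

```python
import random


def split_holdout(samples, n_train, n_test, seed=0):
    """Split samples into a training set and a test set, drawing the
    test set only from samples that were not used for training."""
    if n_train + n_test > len(samples):
        raise ValueError("n_train + n_test exceeds the number of samples")
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    train_idx = indices[:n_train]          # samples used for training
    held_out = indices[n_train:]           # samples never seen in training
    test_idx = held_out[:n_test]           # test set comes from these only
    train = [samples[i] for i in train_idx]
    test = [samples[i] for i in test_idx]
    return train_idx, test_idx, train, test


# Hypothetical data: 10000 independent draws from a small discrete range,
# so repeated values are expected.
data_rng = random.Random(42)
samples = [data_rng.randint(0, 99) for _ in range(10000)]

train_idx, test_idx, train, test = split_holdout(samples, 5000, 1000)

# No sample is shared between the two sets...
overlap_idx = set(train_idx) & set(test_idx)
# ...but identical values may still occur in both, which is not a problem.
repeated_values = set(train) & set(test)
```

Drawing the test set from all 10000 indices instead of from held_out is the mistake the quoted article warns about: some test samples would then be the very samples the network was trained on.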