NetNews Usenet Archive 1992 #26


Text File | 1992-11-11 | 4.2 KB | 78 lines

Newsgroups: sci.math.stat
Path: sparky!uunet!destroyer!sol.ctr.columbia.edu!The-Star.honeywell.com!umn.edu!thompson
From: thompson@atlas.socsci.umn.edu (T. Scott Thompson)
Subject: Re: modelling distributions
Message-ID: <thompson.721517751@daphne.socsci.umn.edu>
Sender: news@news2.cis.umn.edu (Usenet News Administration)
Nntp-Posting-Host: daphne.socsci.umn.edu
Reply-To: thompson@atlas.socsci.umn.edu
Organization: Economics Department, University of Minnesota
References: <1dro0bINN8p9@agate.berkeley.edu>
Date: Wed, 11 Nov 1992 21:35:51 GMT
Lines: 64

mwande@graunt.qal.berkeley.edu (Mike Anderson) writes:

>Suppose I have data on two variables - age and income of a person. I would like
>to estimate income curves by age using spline regression, but my problem is
>this: to protect the identity of the individuals, incomes were topcoded, so
>that any person reporting incomes of > $100,000 gets coded at $100,000. I am
>interested in getting very accurate estimates, and given the long right tail
>of the income distribution, these truncated values may be throwing off my
>estimates quite a bit.
> So I would like to replace the truncated values with my own "tail". My
>question is, how do I go about modelling the income distribution and tacking
>on my own tail? Off the top of my head, I would precede thusly: If there are
>Nt truncated observations and Nu untruncated observations, N = Nt + Nu, I would
>first scale the Nu incomes to mean 0, generate N obs from something like a
>log-normal, lop off the top Nt quantiles, and compare the generated data to the
>observed data with a Q-Q plot, choosing that level of variance in the log-normal
>which gives me the straightest fit. Then I would take the Nt random values I
>lopped off from the generated data and randomly assign them to the Nt truncated
>values. Does this sound reasonable? I'm sure there is a better way to go about
>this, can someone tell me what it is or where to find it?
I assume that you apply this procedure separately for each age group.
I also assume that you scale the log of income (rather than income
itself) to have mean zero, since a log-normal variable cannot have
mean zero. I have no idea why you would want to do this.

Except for the bit about scaling the log of incomes to mean 0, the
procedure that you describe is approximately equivalent to what you
would do to fit a censored log-normal distribution to the data by
maximum-likelihood. An ML estimator would not insist that the
distribution predict the observed number of censored observations
exactly, nor would it require that the mean of the log of the
uncensored income variables be zero, since generally this value will
not give the best fit.

The ML estimate is optimal if the true uncensored distribution is
really log-normal, and may be arbitrarily bad otherwise, depending on
what that unobserved tail really looks like. On the other hand, if you
know that the uncensored distribution is log-normal then you don't
need to use splines to estimate this distribution. Catch-22.

The fact of the matter is that there is no "right" way to do this. You
simply have no information in your data about what the upper tail
looks like. Any attempt to "simulate" the upper tail is as arbitrary
as any other in the absence of additional information. Hopefully you
don't really need it for your ultimate purposes. For example, any
estimates of mean income produced from this data will be unreliable,
since mean income cannot be determined without knowing how much the
Donald Trumps of the world earned.

Note: I use the term "censored" here instead of "truncated."
Generally we say that a random variable Y is truncated if observations
in the original sample with Y > Ymax are simply thrown away. In your
case these observations are not thrown away. Rather, you observe the
"censored" variable min(Y,Ymax) instead of Y itself, but you observe
this for _every_ observation in the original sample.
So you know (approximately) how much probability there should be in
the upper tail. There is less information loss with censoring than
with truncation.

--
T. Scott Thompson              email: thompson@atlas.socsci.umn.edu
Department of Economics        phone: (612) 625-0119
University of Minnesota        fax:   (612) 624-0209
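
For readers who want to see what fitting a censored log-normal by maximum likelihood might look like in practice, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure from either post: it assumes incomes are positive, that the topcode is a single known value, and that log income really is normal (the very assumption the reply warns may be arbitrarily bad). The maximization is done with a plain EM iteration, which is one of several ways to reach the ML estimate. The names fit_censored_lognormal and simulate_topcoded are hypothetical, and the demo data are simulated, not real income data.

```python
import math
import random


def simulate_topcoded(n, mu, sigma, cap, seed=0):
    """Hypothetical helper: draw log-normal 'incomes' and topcode them at cap."""
    rng = random.Random(seed)
    return [min(rng.lognormvariate(mu, sigma), cap) for _ in range(n)]


def fit_censored_lognormal(incomes, cap, iters=500, tol=1e-10):
    """Estimate (mu, sigma) of log income when values >= cap are censored.

    Assumes log income is normal. Uses EM: censored observations are
    replaced by their conditional moments given that they exceed log(cap).
    Requires at least one uncensored observation.
    """
    logs = [math.log(y) for y in incomes if y < cap]   # uncensored part
    n_cens = sum(1 for y in incomes if y >= cap)       # topcoded count
    n = len(logs) + n_cens
    c = math.log(cap)
    s1 = sum(logs)
    s2 = sum(z * z for z in logs)

    # Starting values: moments of the uncensored logs (biased downward).
    mu = s1 / len(logs)
    sigma = math.sqrt(max(s2 / len(logs) - mu * mu, 1e-12))

    for _ in range(iters):
        a = (c - mu) / sigma
        pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
        sf = 0.5 * math.erfc(a / math.sqrt(2.0))       # P(Z > c)
        lam = pdf / sf                                 # hazard at the cap
        # E-step: first and second moments of Z given Z > c.
        e1 = mu + sigma * lam
        e2 = mu * mu + sigma * sigma + sigma * (c + mu) * lam
        # M-step: ordinary normal MLE on the completed moments.
        mu_new = (s1 + n_cens * e1) / n
        sigma_new = math.sqrt((s2 + n_cens * e2) / n - mu_new * mu_new)
        done = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return mu, sigma


# Demo on simulated data (parameter values chosen arbitrarily).
data = simulate_topcoded(5000, mu=10.5, sigma=0.8, cap=100000.0, seed=1)
mu_hat, sigma_hat = fit_censored_lognormal(data, cap=100000.0)
naive_mu = sum(math.log(y) for y in data) / len(data)
print("naive mean of log income:", round(naive_mu, 3))
print("censored ML estimate    :", round(mu_hat, 3), round(sigma_hat, 3))
```

The naive mean of the topcoded logs always falls below the censored estimate, because every topcoded value is treated as exactly the cap. Note that the fit recovers the simulated parameters only because the simulated tail really is log-normal; with real topcoded incomes there is nothing in the data to check that.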