NetNews Usenet Archive 1992 #26

Internet Message Format | 1992-11-11 | 4.3 KB

Path: sparky!uunet!think.com!ames!agate!qal.qal.berkeley.edu!mwande
From: mwande@qal.qal.berkeley.edu (Mike Anderson)
Newsgroups: sci.math.stat
Subject: RE: modelling distributions
Date: 12 Nov 1992 00:55:07 GMT
Organization: University of California, Berkeley
Lines: 70
Distribution: world
Message-ID: <1dsa1bINNdc7@agate.berkeley.edu>
NNTP-Posting-Host: qal.qal.berkeley.edu

mwande@graunt.qal.berkeley.edu (Mike Anderson) writes:
>>Suppose I have data on two variables - age and income of a person. I would like
>>to estimate income curves by age using spline regression, but my problem is
>>this: to protect the identity of the individuals, incomes were topcoded, so
>>that any person reporting incomes of > $100,000 gets coded at $100,000. I am
>>interested in getting very accurate estimates, and given the long right tail
>>of the income distribution, these truncated values may be throwing off my
>>estimates quite a bit.
>>  So I would like to replace the truncated values with my own "tail". My
>>question is, how do I go about modelling the income distribution and tacking
>>on my own tail? Off the top of my head, I would proceed thusly: If there are
>>(stuff deleted)

>I assume that you apply this procedure separately for each age group.
>I also assume that you scale the log of income (rather than income
>itself) to have mean zero, since a log-normal variable cannot have
>mean zero. I have no idea why you would want to do this.

Let me give a few more specifics about what I'm doing. Ignore the spline
for the moment - I'm using that to fit a curve to the average income at
each age across age. Suppose, for the purposes of the "tail-fitting",
that I am just trying to model the income at a particular age, to get an
average. In scaling the income to mean zero, I had in mind using any
number of distributions, not just a log-normal, to try against the
distributions.
I guess I could just set the mean of the model distribution equal to the
population mean without the censored points, point being that the
scaling is arbitrary; I was originally thinking of a normal dist, which
is why I was thinking mean zero, but then I realized a log normal might
have a better tail for income.

thompson@atlas.socsci.umn.edu writes:
>Except for the bit about scaling the log of incomes to mean 0, the
>procedure that you describe is approximately equivalent to what you
>would do to fit a censored log-normal distribution to the data by
>maximum-likelihood. An ML estimator would not insist that the
>distribution predict the observed number of censored observations
>exactly, nor would it require that the mean of the log of the
>uncensored income variables be zero, since generally this value will
>not give the best fit.

>The ML estimate is optimal if the true uncensored distribution is
>really log-normal, and may be arbitrarily bad otherwise, depending on
>what that unobserved tail really looks like. On the other hand, if
>you know that the uncensored distribution is log-normal then you don't
>need to use splines to estimate this distribution. Catch-22.

As above, I'm not using the spline on the distributions I'm modelling,
rather on the average at each age after adjusting the age-specific
distributions for the censored observations. The distribution across age
is nowhere near a log-normal, or any typical distribution I am aware of.

>The fact of the matter is that there is no "right" way to do this.
>You simply have no information in your data about what the upper tail
>looks like. Any attempt to "simulate" the upper tail is as arbitrary
>as any other in the absence of additional information. Hopefully you
>don't really need it for your ultimate purposes. For example, any
>estimates of mean income produced from this data will be unreliable,
>since mean income cannot be determined without knowing how much the
>Donald Trumps of the world earned.
Suppose I look at outside data sources to estimate the incomes of the
Donald Trumps. How could I incorporate this in the estimation of the
various distribution parameters?

--
Mike Anderson
Dept. of Demography
UC Berkeley
mwande@QAL.Berkeley.EDU

"And I would say to those out around the country. 'Take a hard look now.
Don't let that rabbit be pulled out of the hat by one hand and 25 other
rabbits dumped on you in another.'" - George Bush, 1/24/90
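
A rough sketch of the censored log-normal maximum-likelihood fit discussed
in the thread, for incomes at a single age. Everything here is an
assumption made for illustration and is not taken from the post: the
function names (fit_topcoded_lognormal, mean_above_cap) are hypothetical,
the likelihood is maximized with a simple EM iteration on the log scale
rather than any particular package, and the simulated data stand in for
real incomes. The tail mean it returns is only as good as the log-normal
assumption, which the topcoded data cannot check.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for pdf() and cdf()


def fit_topcoded_lognormal(incomes, cap, max_iter=500, tol=1e-10):
    """Estimate (mu, sigma) of log-income when values >= cap are coded as cap.

    Assumes a log-normal model and at least two distinct uncensored values.
    EM for a right-censored normal on the log scale; it climbs the censored
    log-likelihood, so the fixed point is the ML estimate.
    """
    logs = [math.log(x) for x in incomes if x < cap]   # uncensored log-incomes
    n_cens = sum(1 for x in incomes if x >= cap)       # topcoded count
    n = len(logs) + n_cens
    c = math.log(cap)
    sum_obs = sum(logs)
    sum_sq_obs = sum(y * y for y in logs)

    # Starting values: moments of the uncensored observations only.
    mu = sum_obs / len(logs)
    sigma = math.sqrt(sum((y - mu) ** 2 for y in logs) / len(logs))

    for _ in range(max_iter):
        # E-step: first two moments of log-income given it exceeds log(cap).
        a = (c - mu) / sigma
        surv = max(1.0 - STD_NORMAL.cdf(a), 1e-300)
        lam = STD_NORMAL.pdf(a) / surv                 # inverse Mills ratio
        e_y = mu + sigma * lam
        e_y2 = mu * mu + 2.0 * mu * sigma * lam + sigma * sigma * (1.0 + a * lam)

        # M-step: ordinary normal ML on the completed sufficient statistics.
        new_mu = (sum_obs + n_cens * e_y) / n
        new_sigma = math.sqrt((sum_sq_obs + n_cens * e_y2) / n - new_mu ** 2)

        converged = abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma


def mean_above_cap(mu, sigma, cap):
    """E[income | income > cap] under the fitted log-normal."""
    a = (math.log(cap) - mu) / sigma
    surv = 1.0 - STD_NORMAL.cdf(a)
    return math.exp(mu + 0.5 * sigma ** 2) * STD_NORMAL.cdf(sigma - a) / surv


if __name__ == "__main__":
    # Simulated incomes (illustrative parameters), topcoded at $100,000.
    random.seed(1)
    cap = 100000.0
    data = [min(random.lognormvariate(10.5, 0.7), cap) for _ in range(5000)]
    mu_hat, sigma_hat = fit_topcoded_lognormal(data, cap)
    print("mu, sigma:", mu_hat, sigma_hat)
    print("model mean of topcoded incomes:", mean_above_cap(mu_hat, sigma_hat, cap))
```

Under these assumptions, the age-specific average could be adjusted by
replacing each topcoded value with mean_above_cap(...) before averaging,
which is one arbitrary choice of "tail" among many.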