- Path: sparky!uunet!think.com!ames!agate!qal.qal.berkeley.edu!mwande
- From: mwande@qal.qal.berkeley.edu (Mike Anderson)
- Newsgroups: sci.math.stat
- Subject: RE: modelling distributions
- Date: 12 Nov 1992 00:55:07 GMT
- Organization: University of California, Berkeley
- Lines: 70
- Distribution: world
- Message-ID: <1dsa1bINNdc7@agate.berkeley.edu>
- NNTP-Posting-Host: qal.qal.berkeley.edu
-
- mwande@graunt.qal.berkeley.edu (Mike Anderson) writes:
-
- >>Suppose I have data on two variables - age and income of a person. I would like
- >>to estimate income curves by age using spline regression, but my problem is
- >>this: to protect the identity of the individuals, incomes were topcoded, so
- >>that any person reporting incomes of > $100,000 gets coded at $100,000. I am
- >>interested in getting very accurate estimates, and given the long right tail
- >>of the income distribution, these truncated values may be throwing off my
- >>estimates quite a bit.
- >> So I would like to replace the truncated values with my own "tail". My
- >>question is, how do I go about modelling the income distribution and tacking
- >>on my own tail? Off the top of my head, I would proceed thusly: If there are
-
- >>(stuff deleted)
-
- >I assume that you apply this procedure separately for each age group.
- >I also assume that you scale the log of income (rather than income
- >itself) to have mean zero, since a log-normal variable cannot have
- >mean zero. I have no idea why you would want to do this.
-
- Let me give a few more specifics about what I'm doing. Ignore the spline for
- the moment - I'm using that to fit a curve, across ages, to the average
- income at each age. Suppose, for the purposes of the "tail-fitting", that I am just
- trying to model the income at a particular age, to get an average. In scaling
- the income to mean zero, I had in mind trying any number of model distributions,
- not just a log-normal, against the observed income distribution. I guess
- I could just set the mean of the model distribution equal to the population
- mean without the censored points, the point being that the scaling is arbitrary;
- I was originally thinking of a normal dist, which is why I was thinking mean
- zero, but then I realized a log-normal might have a better tail for income.
-
- thompson@atlas.socsci.umn.edu writes:
- >Except for the bit about scaling the log of incomes to mean 0, the
- >procedure that you describe is approximately equivalent to what you
- >would do to fit a censored log-normal distribution to the data by
- >maximum-likelihood. An ML estimator would not insist that the
- >distribution predict the observed number of censored observations
- >exactly, nor would it require that the mean of the log of the
- >uncensored income variables be zero, since generally this value will
- >not give the best fit.
-
- >The ML estimate is optimal if the true uncensored distribution is
- >really log-normal, and may be arbitrarily bad otherwise, depending on
- >what that unobserved tail really looks like. On the other hand, if
- >you know that the uncensored distribution is log-normal then you don't
- >need to use splines to estimate this distribution. Catch-22.
-
- As above, I'm not using the spline on the distributions I'm modelling, but
- rather on the average at each age, after adjusting the age-specific
- distributions for the censored observations. The distribution across age is
- nowhere near a log-normal, or any typical distribution I am aware of.
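
For concreteness, here is a minimal sketch of the censored log-normal ML fit mentioned in the quoted text, applied to a single age group, followed by the "adjusted average" step: each topcoded value is replaced by the fitted conditional mean E[income | income > cap]. It assumes NumPy and SciPy are available. The function names (`fit_censored_lognormal`, `mean_above_cap`, `adjusted_mean`) are made up for illustration, and the log-normal itself is an assumption about the tail, not something the data can confirm.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def fit_censored_lognormal(income, cap):
    """ML fit of a log-normal to incomes right-censored (topcoded) at `cap`.

    Returns (mu, sigma), the mean and standard deviation of log income.
    """
    income = np.asarray(income, dtype=float)
    censored = income >= cap
    log_obs = np.log(income[~censored])
    log_cap = np.log(cap)
    n_cens = int(censored.sum())

    def neg_loglik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)  # optimise log(sigma) so sigma stays positive
        # Uncensored points contribute the normal density of log income
        # (the Jacobian term does not depend on mu or sigma, so it is dropped).
        ll = norm.logpdf(log_obs, loc=mu, scale=sigma).sum()
        # Each topcoded point contributes P(income > cap).
        ll += n_cens * norm.logsf(log_cap, loc=mu, scale=sigma)
        return -ll

    # Start from the moments of the uncensored log incomes.
    start = np.array([log_obs.mean(), np.log(log_obs.std())])
    result = minimize(neg_loglik, start, method="Nelder-Mead")
    return float(result.x[0]), float(np.exp(result.x[1]))


def mean_above_cap(mu, sigma, cap):
    """E[income | income > cap] under a log-normal(mu, sigma)."""
    z = (mu - np.log(cap)) / sigma
    return np.exp(mu + 0.5 * sigma**2) * norm.cdf(z + sigma) / norm.cdf(z)


def adjusted_mean(income, cap):
    """Average income after replacing topcoded values by the fitted tail mean."""
    income = np.asarray(income, dtype=float)
    mu, sigma = fit_censored_lognormal(income, cap)
    filled = np.where(income >= cap, mean_above_cap(mu, sigma, cap), income)
    return float(filled.mean())
```

Calling `adjusted_mean(incomes_at_one_age, 100_000)` for each age would give the per-age averages that the spline is then fitted to. Note that the fit does not force the mean of log income to zero, nor does it force the model to reproduce the observed number of censored points exactly.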
-
- >The fact of the matter is that there is no "right" way to do this.
- >You simply have no information in your data about what the upper tail
- >looks like. Any attempt to "simulate" the upper tail is as arbitrary
- >as any other in the absence of additional information. Hopefully you
- >don't really need it for your ultimate purposes. For example, any
- >estimates of mean income produced from this data will be unreliable,
- >since mean income cannot be determined without knowing how much the
- >Donald Trumps of the world earned.
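
To illustrate how much the answer depends on the assumed tail, here is a small sketch that fills in topcoded values under a Pareto tail, for which E[X | X > cap] = alpha * cap / (alpha - 1) when alpha > 1. The Pareto form, the function name `mean_with_pareto_tail`, the tail indices tried, and the five incomes are all made up for illustration; nothing in the data picks one alpha over another.

```python
def mean_with_pareto_tail(income, cap, alpha):
    """Mean income if topcoded values follow a Pareto tail with index alpha > 1."""
    if alpha <= 1:
        raise ValueError("the Pareto tail mean is infinite for alpha <= 1")
    tail_mean = alpha * cap / (alpha - 1)  # E[X | X > cap] for a Pareto tail
    filled = [tail_mean if x >= cap else x for x in income]
    return sum(filled) / len(filled)


# Made-up sample for one age group; the last value is topcoded at the cap.
incomes = [20_000, 35_000, 50_000, 80_000, 100_000]
for alpha in (1.5, 2.0, 3.0):
    print(alpha, mean_with_pareto_tail(incomes, 100_000, alpha))
```

With a single topcoded observation out of five, moving alpha from 3 to 1.5 raises the estimated mean from 67,000 to 97,000, even though the observed data are identical in each case.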
-
- Suppose I look at outside data sources to estimate the incomes of the
- Donald Trumps. How could I incorporate this in the estimation of the various
- distribution parameters?
-
- --
- Mike Anderson Dept. of Demography UC Berkeley mwande@QAL.Berkeley.EDU
- "And I would say to those out around the country. 'Take a hard look now.
- Don't let that rabbit be pulled out of the hat by one hand and 25 other
- rabbits dumped on you in another.'" - George Bush, 1/24/90
-