- Newsgroups: sci.math.stat
- Path: sparky!uunet!destroyer!sol.ctr.columbia.edu!The-Star.honeywell.com!umn.edu!thompson
- From: thompson@atlas.socsci.umn.edu (T. Scott Thompson)
- Subject: Re: modelling distributions
- Message-ID: <thompson.721517751@daphne.socsci.umn.edu>
- Sender: news@news2.cis.umn.edu (Usenet News Administration)
- Nntp-Posting-Host: daphne.socsci.umn.edu
- Reply-To: thompson@atlas.socsci.umn.edu
- Organization: Economics Department, University of Minnesota
- References: <1dro0bINN8p9@agate.berkeley.edu>
- Date: Wed, 11 Nov 1992 21:35:51 GMT
- Lines: 64
-
- mwande@graunt.qal.berkeley.edu (Mike Anderson) writes:
-
- >Suppose I have data on two variables - age and income of a person. I would like
- >to estimate income curves by age using spline regression, but my problem is
- >this: to protect the identity of the individuals, incomes were topcoded, so
- >that any person reporting incomes of > $100,000 gets coded at $100,000. I am
- >interested in getting very accurate estimates, and given the long right tail
- >of the income distribution, these truncated values may be throwing off my
- >estimates quite a bit.
- > So I would like to replace the truncated values with my own "tail". My
- >question is, how do I go about modelling the income distribution and tacking
- >on my own tail? Off the top of my head, I would proceed thusly: If there are
- >Nt truncated observations and Nu untruncated observations, N = Nt + Nu, I would
- >first scale the Nu incomes to mean 0, generate N obs from something like a
- >log-normal, lop off the top Nt quantiles, and compare the generated data to the
- >observed data with a Q-Q plot, choosing that level of variance in the log-normal
- >which gives me the straightest fit. Then I would take the Nt random values I
- >lopped off from the generated data and randomly assign them to the Nt truncated
- >values. Does this sound reasonable? I'm sure there is a better way to go about
- >this, can someone tell me what it is or where to find it?
-
- I assume that you apply this procedure separately for each age group.
- I also assume that you scale the log of income (rather than income
- itself) to have mean zero, since a log-normal variable cannot have
- mean zero. I have no idea why you would want to rescale at all.
-
- Except for the bit about scaling the log of incomes to mean 0, the
- procedure that you describe is approximately equivalent to what you
- would do to fit a censored log-normal distribution to the data by
- maximum-likelihood. An ML estimator would not insist that the
- distribution predict the observed number of censored observations
- exactly, nor would it require that the mean of the log of the
- uncensored income variables be zero, since generally this value will
- not give the best fit.
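
A minimal sketch of the censored log-normal fit described above, assuming a single age group and a known topcode. The function names (`fit_censored_lognormal`, `simulate_topcoded`) and the use of an EM iteration to reach the maximum-likelihood estimate are illustrative choices, not something prescribed in this thread. Note that nothing here forces the mean of log income to zero or the predicted number of censored observations to equal Nt.

```python
import math
import random
from statistics import NormalDist


def fit_censored_lognormal(incomes, cap, tol=1e-10, max_iter=1000):
    """Fit log(income) ~ Normal(mu, sigma) by maximum likelihood when
    incomes at or above `cap` are only known to be >= cap (right-censoring).
    Uses EM; assumes positive incomes and at least one uncensored value."""
    std = NormalDist()
    c = math.log(cap)
    z_obs = [math.log(y) for y in incomes if y < cap]   # uncensored logs
    n_cens = sum(1 for y in incomes if y >= cap)        # Nt in the post
    n = len(z_obs) + n_cens
    sum_obs = sum(z_obs)

    # Naive starting values: treat every topcoded income as exactly `cap`.
    mu = (sum_obs + n_cens * c) / n
    sigma = math.sqrt(
        (sum((z - mu) ** 2 for z in z_obs) + n_cens * (c - mu) ** 2) / n
    )

    for _ in range(max_iter):
        # E-step: mean and variance of a normal variable given that it
        # exceeds c, under the current (mu, sigma).
        a = (c - mu) / sigma
        lam = std.pdf(a) / (1.0 - std.cdf(a))           # inverse Mills ratio
        tail_mean = mu + sigma * lam
        tail_var = sigma ** 2 * (1.0 + a * lam - lam ** 2)

        # M-step: complete-data normal MLE using the expected tail moments.
        mu_new = (sum_obs + n_cens * tail_mean) / n
        ss = sum((z - mu_new) ** 2 for z in z_obs)
        ss += n_cens * (tail_var + (tail_mean - mu_new) ** 2)
        sigma_new = math.sqrt(ss / n)

        converged = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma


def simulate_topcoded(n, mu, sigma, cap, seed=0):
    """Draw n log-normal incomes and topcode them at `cap` (test data only)."""
    rng = random.Random(seed)
    return [min(rng.lognormvariate(mu, sigma), cap) for _ in range(n)]


if __name__ == "__main__":
    data = simulate_topcoded(5000, mu=10.5, sigma=0.8, cap=100_000.0)
    print(fit_censored_lognormal(data, cap=100_000.0))
```

On data that really are log-normal the estimates should land near the true parameters. On real income data the fit above the topcode is an extrapolation and nothing more.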
-
- The ML estimate is optimal if the true uncensored distribution is
- really log-normal, and may be arbitrarily bad otherwise, depending on
- what that unobserved tail really looks like. On the other hand, if
- you know that the uncensored distribution is log-normal then you don't
- need to use splines to estimate this distribution. Catch-22.
-
- The fact of the matter is that there is no "right" way to do this.
- You simply have no information in your data about what the upper tail
- looks like. Any attempt to "simulate" the upper tail is as arbitrary
- as any other in the absence of additional information. Hopefully you
- don't really need it for your ultimate purposes. For example, any
- estimates of mean income produced from this data will be unreliable,
- since mean income cannot be determined without knowing how much the
- Donald Trumps of the world earned.
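
To make the point about the mean concrete, here is a small sketch that holds the observable quantities fixed and varies only the assumed tail. The Pareto tail, the function names, and all of the numbers are hypothetical, chosen purely for illustration. Each tail is equally consistent with the topcoded data, because the data only say how many incomes exceed the cap.

```python
import math


def pareto_tail_mean(cap, alpha):
    """Mean of a Pareto tail starting at `cap` with shape `alpha`.
    The mean is infinite when alpha <= 1."""
    if alpha <= 1.0:
        return math.inf
    return alpha * cap / (alpha - 1.0)


def implied_mean(mean_below, frac_censored, cap, alpha):
    """Overall mean income implied by the observed below-cap mean, the
    observed censored fraction, and an ASSUMED Pareto tail above the cap."""
    return (1.0 - frac_censored) * mean_below + frac_censored * pareto_tail_mean(cap, alpha)


if __name__ == "__main__":
    # Hypothetical inputs: 5% topcoded at $100,000, mean $30,000 below the cap.
    for alpha in (1.5, 2.0, 3.0):
        print(alpha, implied_mean(30_000.0, 0.05, 100_000.0, alpha))
```

The three shapes give three different overall means from identical observed data, and a shape at or below 1 gives an infinite mean.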
-
-
- Note: I use the term "censored" here instead of "truncated."
- Generally we say that a random variable Y is truncated if observations
- in the original sample with Y > Ymax are simply thrown away. In your
- case these observations are not thrown away. Rather, you observe the
- "censored" variable min(Y,Ymax) instead of Y itself, but you observe
- this for _every_ observation in the original sample. So you know
- (approximately) how much probability there should be in the upper
- tail. There is less information loss with censoring than with
- truncation.
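
The distinction shows up directly in the log-likelihood. As a sketch, write f and F for the density and distribution function of Y under parameters theta (this notation is an assumption, not from the original question), and let Nu and Nt be the counts of uncensored and censored observations as above.

```latex
% Censoring: all N = N_u + N_t observations are kept; each topcoded one
% contributes the probability of exceeding Y_max.
\ell_{\mathrm{cens}}(\theta)
  = \sum_{i:\, y_i < Y_{\max}} \log f(y_i;\theta)
  + N_t \,\log\bigl[\,1 - F(Y_{\max};\theta)\,\bigr]

% Truncation: observations above Y_max are discarded, so only the N_u
% retained ones appear, each with density renormalised to [0, Y_max].
\ell_{\mathrm{trunc}}(\theta)
  = \sum_{i=1}^{N_u} \bigl[\,\log f(y_i;\theta) - \log F(Y_{\max};\theta)\,\bigr]
```

The Nt term in the censored case is the extra information: it ties the fitted upper-tail probability to the observed share of topcoded incomes. The truncated likelihood has no such term.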
- --
- T. Scott Thompson email: thompson@atlas.socsci.umn.edu
- Department of Economics phone: (612) 625-0119
- University of Minnesota fax: (612) 624-0209
-