Struggles with Survey Weighting and Regression Modeling
Struggles with Survey Weighting and Regression Modeling (DOI: http://dx.doi.org/10.1214/088342306000000691) was written by Andrew Gelman in 2007. It was published in Statistical Science (vol. 22, no. 2).
The author demonstrates that, under certain conditions, poststratification and survey weighting are the same procedure.
- Assume a large population, so that finite-population quantities of interest are the same as superpopulation quantities.
- Assume the population size of every stratum is known: $J$ cells with population sizes $N_j$ such that $\sum_j N_j = N$.
- Restrict weight computation to:
  - poststratification, because raking methods are not necessary when all cell sizes are known, as stated in the above assumption
  - flat nonresponse adjustments within strata
    - This is really a constraint on the available predictors of response propensity. If only the stratifying variables are used, then each stratum has a single predicted response propensity, and therefore a constant adjustment factor.
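- The poststratified estimate, with $\bar{y}_j$ the mean of the $n_j$ respondents in cell $j$, is:

  $$\hat{\theta}^{PS} = \frac{\sum_{j=1}^{J} N_j \bar{y}_j}{\sum_{j=1}^{J} N_j} = \frac{1}{N} \sum_{j=1}^{J} N_j \bar{y}_j$$

- The weighted estimate can be put in terms of unit weights ($w_i$) or cell weights ($W_j = n_j w_i$ for any unit $i$ in cell $j$, since unit weights are constant within a cell):

  $$\hat{\theta}^{w} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} = \frac{\sum_{j=1}^{J} W_j \bar{y}_j}{\sum_{j=1}^{J} W_j}$$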
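Choosing the unit weight $N_j / n_j$ for every respondent in cell $j$ makes the cell weight equal to $N_j$, so the weighted mean reproduces the poststratified estimate. The following is a minimal sketch of that equivalence, assuming made-up cell sizes and responses; the names `N_pop`, `sample`, `unit_w` and `cell_W` are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy data: J = 3 cells with known population sizes N_j.
N_pop = {"a": 500, "b": 300, "c": 200}
# (cell, response) pairs for n = 6 respondents; every cell is observed.
sample = [("a", 2), ("a", 4), ("b", 6), ("c", 1), ("c", 3), ("c", 5)]

# Group responses by cell to get n_j and the cell means.
by_cell = defaultdict(list)
for cell, y in sample:
    by_cell[cell].append(y)

N = sum(N_pop.values())
cell_mean = {j: Fraction(sum(ys), len(ys)) for j, ys in by_cell.items()}

# Poststratified estimate: population-size-weighted average of cell means.
theta_ps = sum(N_pop[j] * cell_mean[j] for j in N_pop) / N

# Unit weights w_i = N_j / n_j, constant within each cell.
unit_w = {j: Fraction(N_pop[j], len(by_cell[j])) for j in N_pop}
theta_unit = (sum(unit_w[cell] * y for cell, y in sample)
              / sum(unit_w[cell] for cell, _ in sample))

# Cell weights W_j = n_j * w_i, which collapse to N_j here.
cell_W = {j: len(by_cell[j]) * unit_w[j] for j in N_pop}
theta_cell = (sum(cell_W[j] * cell_mean[j] for j in N_pop)
              / sum(cell_W.values()))

print(float(theta_ps), float(theta_unit), float(theta_cell))
```

Exact fractions are used so that the three estimates can be compared for equality without floating-point tolerance.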
In contrast, the literature on weighted analysis is confused. In some circumstances, it is recommended not to use survey weights when fitting models to survey data. In addition, weighted standard errors are not trivial to estimate.
Survey weighting also runs into problems as the number of stratified cells grows, potentially to the point where a cell has no respondents.
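A rough sketch of how quickly this happens, assuming a hypothetical survey of 1000 respondents cross-classified on up to five variables. The category counts in `levels` are invented, and respondents are spread uniformly at random, which is kinder than most real samples.

```python
import random
from collections import Counter
from math import prod

random.seed(0)
n = 1000                     # hypothetical number of respondents
levels = [2, 4, 4, 5, 50]    # hypothetical category counts per variable

results = []
for m in range(1, len(levels) + 1):
    used = levels[:m]
    # Assign each respondent to a cell of the cross-classification.
    counts = Counter(
        tuple(random.randrange(size) for size in used) for _ in range(n)
    )
    total_cells = prod(used)
    empty_cells = total_cells - len(counts)
    results.append((m, total_cells, empty_cells))
    print(f"{m} variables: {total_cells} cells, {empty_cells} empty")
```

Once the number of cells exceeds the number of respondents, some cells must be empty, and a weight of the form $N_j / n_j$ is undefined for them.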
The author starts from poststratification and tries to reverse-engineer weighting methods.
First, using regression of the outcome on the stratifying variables:
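- $X$ is the $n \times k$ matrix of predictors
- $X^{pop}$ is the $J \times k$ matrix of predictors for the $J$ poststratification cells
- $N^{pop}$ is the $J$-long vector of the population sizes $N_j$
- the coefficients are estimated as $b = (X^T X)^{-1} X^T y$ (see here)
- the poststratified cell estimates are given as $X^{pop} b$, or $X^{pop} (X^T X)^{-1} X^T y$
- the poststratified estimate is:

  $$\hat{\theta}^{PS} = \frac{1}{N} (N^{pop})^T X^{pop} (X^T X)^{-1} X^T y$$

- to fit this into the rough formula of survey weighted estimates (the weights sum to $n$ when the regression includes a constant term):

  $$\hat{\theta}^{PS} = \frac{1}{n} \sum_{i=1}^{n} w_i y_i$$

- then the $n$-long vector of unit weights, $w$ (written as a row vector), must be:

  $$w = \frac{n}{N} (N^{pop})^T X^{pop} (X^T X)^{-1} X^T$$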
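The derivation above can be checked numerically. This is a minimal sketch, assuming a hypothetical design with two binary stratifying variables and a main-effects regression ($k = 3$, $J = 4$); the data, the helper functions `solve` and `dot`, and all variable names are illustrative and not from the paper.

```python
from fractions import Fraction


def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination, exactly, using Fractions."""
    k = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, rhs)]
    for col in range(k):
        # Swap in a row with a nonzero entry in this column, then normalize it.
        pivot = next(r for r in range(col, k) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [row[-1] for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical cells (a, b) in {0, 1}^2 with design rows [1, a, b].
X_pop = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
N_pop = [400, 300, 200, 100]
N = sum(N_pop)

# Hypothetical respondents: ((a, b), response).
sample = [((0, 0), 1), ((0, 0), 2), ((0, 1), 3), ((1, 0), 4),
          ((1, 0), 5), ((1, 0), 6), ((1, 1), 7), ((1, 1), 8)]
X = [[1, a, b] for (a, b), _ in sample]
y = [value for _, value in sample]
n, k = len(X), len(X[0])

XtX = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
Xty = [dot([row[p] for row in X], y) for p in range(k)]

# Regression route: b = (X^T X)^{-1} X^T y, then average the cell predictions.
b = solve(XtX, Xty)
theta_ps = dot(N_pop, [dot(row, b) for row in X_pop]) / N

# Weighting route: c = (X^T X)^{-1} (X_pop)^T N_pop, then w_i = (n / N) x_i . c
XpopT_N = [dot([row[p] for row in X_pop], N_pop) for p in range(k)]
c = solve(XtX, XpopT_N)
w = [Fraction(n, N) * dot(row, c) for row in X]
theta_w = dot(w, y) / n

print(float(theta_ps), float(theta_w), [float(wi) for wi in w])
```

Because `X` includes a constant column, the implied weights sum to `n`, so dividing by `n` and dividing by the sum of the weights give the same estimate.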
Reading notes
Note also that the author repeats their argument from here. This is, as before, applicable because of the constraint on weight computation.