= Struggles with Survey Weighting and Regression Modeling =

'''Struggles with Survey Weighting and Regression Modeling''' (DOI: http://dx.doi.org/10.1214/088342306000000691) was written by Andrew Gelman in 2007. It was published in ''Statistical Science'' (vol. 22, no. 2).

The author demonstrates that, under certain conditions, [[Statistics/PostStratification|poststratification]] and [[Statistics/SurveyWeights|survey weighting]] are the same procedure.
 * Assume a large population, so that finite population quantities of interest are the same as superpopulation quantities.
 * Assume the population size of all strata is known: ''J'' cells with population size ''N,,j,,'' such that ''Σ N,,j,, = N''.
 * Restrict weight computation to:
   * poststratification, because raking methods are not necessary when all cell sizes are known, as stated in the above assumption
   * flat nonresponse adjustments within strata
     * This really is a constraint on the available predictors for response propensity. If only the stratifying variables are used, then each stratum will have a single predicted response propensity, and therefore a constant adjustment factor.
 * The poststratified estimate is:

{{attachment:ps.svg}}

 * The weighted estimate can be put in terms of unit weights (''w,,i,,'') or cell weights (''W,,j,, = n,,j,,w,,i,,''):

{{attachment:wt.svg}}
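
The equivalence can be checked numerically. The following is a minimal sketch, not taken from the paper: the cell sizes and sample values are made up, and the unit weight is assumed to be ''w,,i,, = N,,j,,/n,,j,,'' (a flat adjustment within each cell), which makes the cell weight ''W,,j,, = n,,j,,w,,i,,'' equal to ''N,,j,,''.

```python
import numpy as np

# Hypothetical population cell sizes N_j and a toy sample (illustration only).
N_j = np.array([500.0, 300.0, 200.0])
cell = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])  # cell index of each respondent
y = np.array([1.0, 2.0, 3.0, 2.0, 5.0, 7.0, 9.0, 8.0, 10.0, 9.0])

# Sample size and sample mean within each cell.
n_j = np.bincount(cell, minlength=len(N_j))
ybar_j = np.bincount(cell, weights=y, minlength=len(N_j)) / n_j

# Poststratified estimate: cell means averaged with population shares.
theta_ps = np.sum(N_j * ybar_j) / np.sum(N_j)

# Weighted estimate with unit weights w_i = N_j / n_j (assumed form).
w_i = (N_j / n_j)[cell]
theta_w = np.sum(w_i * y) / np.sum(w_i)

# Cell weights W_j = n_j * w_i reduce to the population sizes N_j.
W_j = n_j * (N_j / n_j)

print(theta_ps, theta_w)
```

Both estimates agree because the unit weights sum to ''N,,j,,'' within each cell, so the weighted mean reproduces the population-share average of the cell means.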

In contrast, the literature on weighted analysis is confused. In some circumstances, it is [[SamplingWeightsAndRegressionAnalysis|recommended not to use survey weights when modeling survey data]]. In addition, weighted standard errors are not trivial to estimate.

Survey weighting also runs into problems as the number of stratified cells grows, potentially to the point where a cell has no respondents.

The author starts from poststratification and tries to reverse-engineer weighting methods.

First, a simple [[Statistics/OrdinaryLeastSquares|OLS fitting]] of the outcome on the stratifying variables:
 * '''''X''''' is the ''n'' by ''k'' matrix of predictors
 * '''''X^pop^''''' is the ''J'' by ''k'' matrix of predictor values for the ''J'' poststratification cells
 * '''''N^pop^''''' is the ''J''-long vector of the ''N,,j,,'' population sizes
 * the coefficients (true values ''β'') are estimated as ''b = ('''X'''^T^'''X''')^-1^'''X'''^T^y'' (see [[Statistics/OrdinaryLeastSquares/Multivariate|here]])
 * the poststratified cell estimates are given as '''''X^pop^'''b'', or '''''X^pop^'''('''X'''^T^'''X''')^-1^'''X'''^T^y''
 * the poststratified estimate is:

{{attachment:regress1.svg}}

 * to fit this into the rough formula of survey weighted estimates:

{{attachment:regress2.svg}}

 * then the ''n''-long vector of unit weights, ''w'' (with at most ''J'' distinct values, since units in the same cell share a weight), must be:

{{attachment:regress3.svg}}
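
As a hedged illustration of the steps above, the sketch below computes regression-implied weights with NumPy. The weight formula ''w = (n/N)('''N^pop^''')^T^'''X^pop^'''('''X'''^T^'''X''')^-1^'''X'''^T^'' is an assumption reconstructed from the definitions on this page, taking the weighted estimate to be ''(1/n) Σ w,,i,,y,,i,,''; the data are simulated, and the model is saturated (one indicator per cell) purely for simplicity. This yields one weight per respondent, with respondents in the same cell sharing a value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: J cells, known population sizes, unequal sample sizes.
J = 4
N_pop = np.array([400.0, 300.0, 200.0, 100.0])
n_j = np.array([10, 20, 30, 40])
N, n = N_pop.sum(), int(n_j.sum())

# Saturated design: one indicator column per cell (so k = J here).
X_pop = np.eye(J)
cell = np.repeat(np.arange(J), n_j)
X = X_pop[cell]
y = rng.normal(loc=cell.astype(float), scale=1.0)

# OLS coefficients, then the poststratified estimate from the cell predictions.
b = np.linalg.solve(X.T @ X, X.T @ y)
theta_ps = N_pop @ (X_pop @ b) / N

# Implied unit weights (assumed form, see lead-in), one per respondent.
w = (n / N) * (N_pop @ X_pop @ np.linalg.solve(X.T @ X, X.T))
theta_w = (w @ y) / n

print(theta_ps, theta_w)
```

With a saturated model these weights reduce to ''(n/N)(N,,j,,/n,,j,,)'', the classical poststratification weights scaled to average 1.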

Next, expand this to a [[Statistics/BayesianHierarchicalModel|hierarchical model]]. The author specifically uses batches of indicators for age ([[Statistics/Binning|binned]] into 4 levels), education (4 levels), and their interactions.

 * a generally noninformative prior is assumed:
   * coefficients are distributed ''β ~ N(0,'''Σ''',,β,,)''
   * most coefficients are assumed to be independent, such that the [[Statistics/CovarianceMatrices#Precision_Matrices|precision matrix]] '''''Σ''',,β,,^-1^'' is fully specified as a diagonal matrix with the estimated ''σ^-2^'' for the batched predictors, and 0s for all others
 * the coefficients are estimated as ''b = ('''X'''^T^'''Σ''',,y,,^-1^'''X''' + '''Σ''',,β,,^-1^)^-1^'''X'''^T^'''Σ''',,y,,^-1^y''
 * the poststratified estimate is:

{{attachment:hier1.svg}}

 * the ''n''-long vector of unit weights ''w'' (again with at most ''J'' distinct values when units in the same cell share a weight) is:

{{attachment:hier2.svg}}
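
The hierarchical case can be sketched the same way. This is an illustration under stated assumptions, not the author's code: the weight formula ''w = (n/N)('''N^pop^''')^T^'''X^pop^'''('''X'''^T^'''Σ''',,y,,^-1^'''X''' + '''Σ''',,β,,^-1^)^-1^'''X'''^T^'''Σ''',,y,,^-1^'' is reconstructed from the coefficient estimate above; '''''Σ''',,y,,'' is taken to be ''σ,,y,,^2^'''I'''''; the design has an unpenalized intercept plus one batch of cell indicators; and ''σ,,y,,'' and ''σ,,β,,'' are fixed by hand rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: J cells, known population sizes, unequal sample sizes.
J = 4
N_pop = np.array([400.0, 300.0, 200.0, 100.0])
n_j = np.array([10, 20, 30, 40])
N, n = N_pop.sum(), int(n_j.sum())

# Design: intercept plus one batch of J cell indicators.
X_pop = np.hstack([np.ones((J, 1)), np.eye(J)])
cell = np.repeat(np.arange(J), n_j)
X = X_pop[cell]
y = rng.normal(loc=cell.astype(float), scale=1.0)

sigma_y = 1.0  # assumed known residual sd (illustration only)


def fit(sigma_beta):
    """Return (b, w) for a given prior sd of the batched coefficients."""
    # Prior precision: 0 for the intercept, sigma_beta^-2 for the batch.
    prior_prec = np.diag([0.0] + [sigma_beta ** -2] * J)
    Xt_prec = X.T / sigma_y ** 2  # X^T Sigma_y^-1 with Sigma_y = sigma_y^2 I
    A = Xt_prec @ X + prior_prec
    b = np.linalg.solve(A, Xt_prec @ y)
    w = (n / N) * (N_pop @ X_pop @ np.linalg.solve(A, Xt_prec))
    return b, w


b, w = fit(sigma_beta=0.5)
theta_ps = N_pop @ (X_pop @ b) / N  # poststratified estimate
theta_w = (w @ y) / n               # same estimate via unit weights

# Limiting cases: a very diffuse prior and a very tight prior.
_, w_diffuse = fit(sigma_beta=1e3)
_, w_tight = fit(sigma_beta=1e-3)
print(theta_ps, theta_w)
```

In this sketch a very large ''σ,,β,,'' approximately recovers the classical poststratification weights, while a very small ''σ,,β,,'' pools the cells and pushes every weight toward 1.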

The author further considers special cases of the hierarchical model.

The posterior variance of ''y'' can be calculated as any of:

{{attachment:hier3.svg}}



== Reading notes ==

Note also that the author repeats their argument from [[ImprovingOnProbabilityWeightingForHouseholdSize|here]]. This is, as before, applicable because of the constraint on weight computation.



----
CategoryRicottone