= Struggles with Survey Weighting and Regression Modeling =

'''Struggles with Survey Weighting and Regression Modeling''' (DOI: http://dx.doi.org/10.1214/088342306000000691) was written by Andrew Gelman in 2007. It was published in ''Statistical Science'' (vol. 22, no. 2).

The author demonstrates that, under certain conditions, [[Statistics/PostStratification|poststratification]] and [[Statistics/SurveyWeights|survey weighting]] are the same procedure.

 * Assume a large population, so that finite population quantities of interest are the same as superpopulation quantities.
 * Assume the population size of every stratum is known: ''J'' cells with population sizes ''N,,j,,'' such that ''Σ N,,j,, = N''.
 * Restrict weight computation to:
   * poststratification, because raking methods are not necessary when all cell sizes are known, as stated in the above assumption
   * flat nonresponse adjustments within strata
     * This is really a constraint on the available predictors of response propensity. If only the stratifying variables are used, then each stratum has a single predicted response propensity, and therefore a constant adjustment factor.
 * The poststratified estimate is: {{attachment:ps.svg}}
 * The weighted estimate can be put in terms of unit weights (''w,,i,,'') or cell weights (''W,,j,, = n,,j,,w,,i,,'' for any unit ''i'' in cell ''j'', where ''n,,j,,'' is the number of respondents in that cell): {{attachment:wt.svg}}

In contrast, the literature on weighted analysis is confused. In some circumstances, it is [[SamplingWeightsAndRegressionAnalysis|recommended not to use survey weights when modeling survey data]]. In addition, weighted standard errors are not trivial to estimate. Survey weighting also runs into problems as the number of stratified cells grows, potentially to the point where a cell has no respondents.

The author starts from poststratification and tries to reverse-engineer weighting methods.

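
The equivalence between the poststratified and weighted estimates described above can be checked numerically. The following is a minimal sketch, not the author's code. It assumes the standard forms of the two estimates (cell means weighted by ''N,,j,,/N'', and unit weights proportional to ''N,,j,,/n,,j,,'', scaled to average 1), since the formulas themselves live in the attached images. The population counts, cell assignments, and outcomes are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population cell sizes N_j for J = 3 cells (illustrative only).
N_pop = np.array([5000.0, 3000.0, 2000.0])
N = N_pop.sum()
J = len(N_pop)

# Hypothetical sample: cell membership and outcome for each respondent.
cell = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y = rng.normal(loc=cell.astype(float), scale=1.0)
n = len(y)

# Sample size and sample mean within each cell.
n_j = np.bincount(cell, minlength=J)
ybar_j = np.bincount(cell, weights=y, minlength=J) / n_j

# Poststratified estimate: cell means weighted by known population shares.
theta_ps = np.sum(N_pop * ybar_j) / N

# Unit weights: proportional to N_j / n_j, scaled (by assumption) to sum to n.
w_i = (N_pop[cell] / n_j[cell]) * (n / N)
theta_w = np.sum(w_i * y) / np.sum(w_i)

# Cell weights W_j = n_j * w_i applied to cell means give the same estimate.
W_j = n_j * (N_pop / n_j) * (n / N)
theta_cell = np.sum(W_j * ybar_j) / np.sum(W_j)

print(theta_ps, theta_w, theta_cell)
```

With these cell counts (4, 3, and 5 respondents), the unit weights are 1.5, 1.2, and 0.48, and the three estimates agree up to floating-point error.
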
First, a simple [[Statistics/OrdinaryLeastSquares|OLS fit]] of the outcome on the stratifying variables:

 * '''''X''''' is the ''n'' by ''k'' matrix of predictors
 * '''''X^pop^''''' is the ''J'' by ''k'' matrix of predictor values for the ''J'' poststratification cells
 * '''''N^pop^''''' is the ''J''-long vector of the ''N,,j,,'' population sizes
 * the coefficients (true values ''β'') are estimated as ''b = ('''X'''^T^'''X''')^-1^'''X'''^T^y'' (see [[Statistics/OrdinaryLeastSquares/Multivariate|here]])
 * the poststratified cell estimates are given as '''''X^pop^'''b'', or '''''X^pop^'''('''X'''^T^'''X''')^-1^'''X'''^T^y''
 * the poststratified estimate is: {{attachment:regress1.svg}}
 * to fit this into the rough formula of survey weighted estimates: {{attachment:regress2.svg}}
 * then the ''n''-long vector of unit weights, ''w'', must be: {{attachment:regress3.svg}}

Next, expand this to a [[Statistics/BayesianHierarchicalModel|hierarchical model]]. The author specifically uses batches of indicators for age (i.e., [[Statistics/Binning|binned]] into 4 levels), education (i.e., 4 levels), and their interactions.

 * a generally noninformative prior is assumed:
   * coefficients are distributed ''β ~ N(0,'''Σ''',,β,,)''
   * coefficients are assumed to be independent a priori, such that the [[Statistics/CovarianceMatrices#Precision_Matrices|precision matrix]] '''''Σ''',,β,,^-1^'' is fully specified as a diagonal matrix with the estimated ''σ^-2^'' for the batched predictors, and 0s for all others
 * the coefficients are estimated as ''b = ('''X'''^T^'''Σ''',,y,,^-1^'''X''' + '''Σ''',,β,,^-1^)^-1^'''X'''^T^'''Σ''',,y,,^-1^y''
 * the poststratified estimate is: {{attachment:hier1.svg}}
 * the ''n''-long vector of unit weights ''w'' is: {{attachment:hier2.svg}}

The author further considers special cases of the hierarchical model.

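
The regression-implied weights can be sketched in code. This is a hedged illustration rather than the author's implementation. The weight expression is derived here from the coefficient formulas above by matching ''(1/N) '''N^pop^'''^T^'''X^pop^'''b'' to ''(1/n) Σ w,,i,,y,,i,,''; the ''1/n'' normalization and the choice '''''Σ''',,y,, = σ,,y,,^2^'''I''''' are assumptions, as are the helper names `regression_weights` and `prior_precision` and all of the toy data.

```python
import numpy as np

def regression_weights(X, X_pop, N_pop, prec_beta=None, sigma_y=1.0):
    """Unit weights w with (1/n) * sum(w * y) == (1/N) * N_pop^T X_pop b.

    Assumes Sigma_y = sigma_y**2 * I. With prec_beta=None (flat prior),
    b is the OLS estimate; otherwise b is the hierarchical estimate.
    """
    n, k = X.shape
    if prec_beta is None:
        prec_beta = np.zeros((k, k))
    lhs = X.T @ X / sigma_y**2 + prec_beta
    rhs = X.T / sigma_y**2
    A = np.linalg.solve(lhs, rhs)  # k x n matrix mapping y to b
    return (n / N_pop.sum()) * (N_pop @ X_pop @ A)

def prior_precision(tau, J):
    """Diagonal prior precision: 0 for the intercept, tau for each cell effect."""
    return np.diag([0.0] + [tau] * J)

# Hypothetical toy data (illustrative values only).
N_pop = np.array([5000.0, 3000.0, 2000.0])
cell = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
J, n = len(N_pop), len(cell)
rng = np.random.default_rng(1)
y = rng.normal(loc=cell.astype(float), scale=1.0)

# OLS on a full set of cell indicators.
X = np.eye(J)[cell]
X_pop = np.eye(J)
w_ols = regression_weights(X, X_pop, N_pop)
b = np.linalg.solve(X.T @ X, X.T @ y)
theta_ps = N_pop @ (X_pop @ b) / N_pop.sum()
theta_w = np.mean(w_ols * y)

# Classical poststratification weights, for comparison.
n_j = np.bincount(cell, minlength=J)
w_classic = (N_pop / n_j)[cell] * n / N_pop.sum()

# Hierarchical model: intercept with flat prior plus one batch of cell effects.
Xh = np.column_stack([np.ones(n), X])
Xh_pop = np.column_stack([np.ones(J), X_pop])
w_hier = regression_weights(Xh, Xh_pop, N_pop, prior_precision(1.0, J))
w_pooled = regression_weights(Xh, Xh_pop, N_pop, prior_precision(1e8, J))
w_unpooled = regression_weights(Xh, Xh_pop, N_pop, prior_precision(1e-6, J))
```

In this toy setup the OLS weights reproduce the classical poststratification weights. The hierarchical weights fall between two limits: a very large prior precision pools all cells, so every weight approaches 1, and a very small one approaches the classical weights.
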
The posterior variance of ''y'' can be calculated as any of: {{attachment:hier3.svg}}

== Reading notes ==

Note also that the author repeats their argument from [[ImprovingOnProbabilityWeightingForHouseholdSize|here]]. This is, as before, applicable because of the constraint on weight computation.

----
CategoryRicottone