Differences between revisions 7 and 8

Survey Samples

Frames

Original sources for contact data are often called frames. Common frames used for survey sampling are:

Census Bureau surveys are excellent sources of regional, hierarchical data (i.e. states > counties > tracts)
U.S. Postal Service Delivery Sequence files
Random digit dialing (RDD)

Sampling

There are principally two ways to generate a survey sample: probability and non-probability sampling. These design choices have effects on the weighting methods.

A closely related question is how the sample should be allocated across population groups.

Probability Sampling

All members of a population have a known, non-zero chance of being contacted for a survey. Traditional statistical inference relies on this assumption.

Examples:

Administrative surveys
Surveys with random recruitment (as by random digit dialing)

Non-probability Sampling

Some members of a population have no chance of being contacted, or their chance of being contacted is unknown.

Examples:

Panel surveys
River surveys (i.e. surveys with open recruitment, as by banner ads)

Stratified Allocation

Stratification is the process of dividing the population into discrete strata, and then sampling from each stratum.

Methods include:

Equal allocation takes the same number from each stratum.
Proportional allocation takes a number proportional to the size of that stratum.
Neyman allocation minimizes a key measure's margin of error for a given total sample size. It assumes a fixed cost per contact.
On the other hand, if cost per contact is assumed to vary by stratum, it becomes an optimal allocation.

Note that, if the key measure is equally varied in every stratum, a Neyman allocation is essentially the same as a proportional allocation.
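The allocation methods above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: the function names are made up for this page, the stratum standard deviations (`sds`) and per-contact costs (`costs`) are assumed to be known in advance, and allocations are returned unrounded.

```python
import math

def equal_allocation(n, sizes):
    """Split n evenly across strata; any remainder goes to the first strata."""
    base, extra = divmod(n, len(sizes))
    return [base + (1 if i < extra else 0) for i in range(len(sizes))]

def proportional_allocation(n, sizes):
    """Allocate n in proportion to each stratum's population size N_h."""
    total = sum(sizes)
    return [n * N_h / total for N_h in sizes]

def optimal_allocation(n, sizes, sds, costs):
    """Allocate n in proportion to N_h * S_h / sqrt(c_h)."""
    shares = [N_h * S_h / math.sqrt(c_h)
              for N_h, S_h, c_h in zip(sizes, sds, costs)]
    total = sum(shares)
    return [n * share / total for share in shares]

def neyman_allocation(n, sizes, sds):
    """Neyman allocation: the optimal allocation with equal costs."""
    return optimal_allocation(n, sizes, sds, [1] * len(sizes))
```

With equal standard deviations in every stratum, `neyman_allocation` returns the same numbers as `proportional_allocation`, which is the point of the note above.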

Selecting Domains

The key considerations for selecting the discrete strata, or domains:

are some splits more important than others?
- if studying military recruitment, then sex/gender is a strong split
what splits will be used for reporting?
what is the expected response rate?
- if too few responses are expected from a domain, then splits should be reconsidered
what is the desired margin of error?
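The last two considerations interact: the desired margin of error sets how many completed responses a domain needs, and the expected response rate sets how many contacts that requires. A rough sketch follows, assuming simple random sampling within the domain, a proportion as the key measure, and the normal approximation at 95% confidence (`z=1.96`). The function names are hypothetical.

```python
import math

def completes_needed(margin_of_error, p=0.5, z=1.96):
    """Completed responses needed to estimate a proportion p within
    the given margin of error (p=0.5 is the most conservative case)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

def contacts_needed(margin_of_error, response_rate, p=0.5, z=1.96):
    """Contacts to release so the expected completes meet the target."""
    completes = completes_needed(margin_of_error, p, z)
    return math.ceil(completes / response_rate)
```

For example, `completes_needed(0.05)` gives 385, and `contacts_needed(0.05, 0.25)` gives 1540. If a domain's frame cannot supply that many contacts, the splits should be reconsidered.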

Sampling Methods

Simple Random Sampling (SRS) is essentially sorting randomly and taking the first N cases.

Stratified Random Sampling (STSRS) is the above process applied to a stratified sample, using proportional allocation.

Systematic sampling is any form of sampling that takes every Nth case from a list. The key is then how the list is ordered.
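Simple random and systematic sampling can be sketched with the Python standard library. The `frame` here is just a list of cases; the function names and the `seed` parameter are illustrative assumptions.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Sort the frame randomly and take the first n cases."""
    rng = random.Random(seed)
    shuffled = list(frame)
    rng.shuffle(shuffled)
    return shuffled[:n]

def systematic_sample(frame, step, seed=None):
    """Take every step-th case, starting from a random offset.
    The result depends on how the frame is ordered."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return list(frame)[start::step]
```

Sorting the frame by a stratification variable before calling `systematic_sample` spreads the sample across that variable, which is why the ordering matters.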

Probability Proportionate to Size (PPS) ensures that chance to be contacted increases with the magnitude of some measure. For example, in a study of utility customers, the largest consumers of that utility should almost always be contacted.
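As an illustration of why the largest units are almost always contacted, the sketch below computes target inclusion probabilities for a PPS design without replacement. It assumes positive size measures and a sample size no larger than the frame; units whose proportional share would exceed 1 are treated as certainty selections and the rest are re-scaled. The function name is hypothetical, and this computes probabilities only, not the draw itself.

```python
def pps_inclusion_probabilities(sizes, n):
    """Inclusion probability proportional to size, capped at 1."""
    probs = [None] * len(sizes)
    remaining = set(range(len(sizes)))
    slots = n  # sample slots not yet taken by certainty units
    while True:
        total = sum(sizes[i] for i in remaining)
        # Units whose proportional share would reach or exceed 1.
        certain = [i for i in remaining if slots * sizes[i] >= total]
        if not certain:
            break
        for i in certain:
            probs[i] = 1.0
            remaining.discard(i)
        slots -= len(certain)
    for i in remaining:
        probs[i] = slots * sizes[i] / total
    return probs
```

With sizes `[900, 50, 30, 20]` and `n=2`, the largest unit is selected with certainty and the others share the one remaining slot in proportion to their size.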

Multi-Stage Methods

Multi-stage methods randomly select primary sampling units (PSU) like census tracts, then randomly select the actual targets (e.g. households) as secondary sampling units (SSU).

Cluster sampling is a two-stage method where all members of the sampled PSUs are contacted.

This is common for face-to-face interviews, due to the extraordinary costs of that mode.
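A two-stage selection can be sketched as follows. The `clusters` mapping (PSU identifier to a list of members) is an assumed data layout, both stages use simple random sampling for brevity, and the function name is illustrative. Leaving `n_ssu` unset takes every member of each selected PSU, which is the cluster sampling case.

```python
import random

def two_stage_sample(clusters, n_psu, n_ssu=None, seed=None):
    """Select n_psu PSUs, then n_ssu members within each selected PSU.
    With n_ssu=None, all members of the selected PSUs are taken."""
    rng = random.Random(seed)
    selected_psus = rng.sample(sorted(clusters), n_psu)
    sample = {}
    for psu in selected_psus:
        members = clusters[psu]
        if n_ssu is None or n_ssu >= len(members):
            sample[psu] = list(members)
        else:
            sample[psu] = rng.sample(members, n_ssu)
    return sample
```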

Multi-Phase Methods

Sample for a screener, then re-sample based on the information collected in the screener. In most cases, all respondents in the target group are re-contacted, while only a random sample of the other respondents is re-contacted.

Design Effects

Design effect is a measure of efficiency for a sample design. Specifically, it compares a sample design against SRS, which can be considered a 'baseline' design in terms of simplicity and administrative cost. As the design effect increases, either the sample size must increase or the estimated variances will increase.

DEFF_p(s)(x) = Var_p(s)(x) / Var_SRS(x)

The key input is the estimated sampling variance of some population estimator under some design (i.e. Var_p(s)(x)). For complex sample designs, this is computed using resampling methods.

Unequal Weighting Effects

Kish decomposed the design effect into an unequal weighting effect and a clustering effect:

DEFF_p(s)(x) = DEFF_{unequal weighting} * DEFF_clustering

The unequal weighting effect is the ratio of the actual sample size (n) to the effective sample size, also known as the effective base. The effective sample size is (∑_i w_i)² / ∑_i w_i². For a given sample measure (y):

DEFF_{unequal weighting}(y) = n ∑_i w_i² / (∑_i w_i)²

Note that this leaves finite population correction aside.
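Kish's approximation is simple to compute from the weights alone. In the sketch below (function names are illustrative), the effective sample size is (∑w)² / ∑w², and the unequal weighting effect is the actual sample size divided by that.

```python
def effective_sample_size(weights):
    """Kish's effective sample size: (sum of w)^2 / sum of w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def unequal_weighting_effect(weights):
    """Design effect due to unequal weighting: n / effective n."""
    return len(weights) / effective_sample_size(weights)
```

Equal weights give an effect of exactly 1. The more the weights vary, the smaller the effective sample size and the larger the effect.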

Variance Estimation

Sampling variance is how estimates would vary if samples were repeatedly drawn from the population. Of course, because the population descriptives are unknown, survey variance must be estimated.

The methods for variance estimation include:

Exact methods are mathematically convenient but impractical.
Taylor series linearization makes use of weights and sample design features (e.g. strata, finite population correction) to estimate variance.
Replication, or replicate weights, re-computes the estimate under a series of alternative weights and measures how much the estimates vary.

Note that it isn't possible to estimate variance within a stratum that contains a single sampled unit, or a singleton stratum.

Finite Population Correction

Finite population correction (FPC) encapsulates the fact that as the sampling rate increases, sampling variance decreases. (If the entire population is sampled, there is no sampling variance.) The correction is generally negligible if the sampling rate is below 5%.
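For the simplest case, the variance of a mean under simple random sampling without replacement, the correction is the factor 1 - n/N. A minimal sketch, with illustrative function names, where `s2` is the sample variance of the measure:

```python
def fpc(n, N):
    """Finite population correction factor for a sample of n from N."""
    return 1 - n / N

def srs_variance_of_mean(s2, n, N=None):
    """Estimated variance of a sample mean under SRS.
    The correction is applied only if the population size N is given."""
    variance = s2 / n
    if N is not None:
        variance *= fpc(n, N)
    return variance
```

At a 5% sampling rate the factor is 0.95, so ignoring it overstates the variance only slightly. At a 100% sampling rate the factor is 0.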

Analysis in Stata

When using Stata...

use the subpop() or over() options instead of if to restrict estimation to a subpopulation.
use the singleunit(centered) option of svyset to handle singleton strata.

Resampling Methods

Bootstrap Estimation

Bootstrapping is a method for estimating the sampling variance of an estimate. A random sample of equal size is drawn (i.e. cases are resampled non-exclusively, or with replacement, from the original sample). The descriptives in question are re-computed on each such resample, and the variation of those estimates across many resamples approximates the sampling variance.
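As used for variance estimation, the bootstrap can be sketched as below. This is the naive version, which assumes the original sample was a simple random sample. For a complex design, the resampling would instead need to respect strata and PSUs, which this sketch does not attempt. The function name and defaults are illustrative.

```python
import random
import statistics

def bootstrap_variance(values, statistic=statistics.mean,
                       replicates=1000, seed=None):
    """Estimate the sampling variance of a statistic by resampling
    the observed values with replacement."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(replicates):
        resample = rng.choices(values, k=n)  # same size, with replacement
        estimates.append(statistic(resample))
    return statistics.variance(estimates)
```

Passing a different `statistic` (for example `statistics.median`) estimates the variance of that statistic instead, which is the main appeal when no convenient formula exists.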

CategoryRicottone

-  ⇤ ← Revision 7 as of 2021-04-30 16:24:59 → 
  Size: 4557
  Editor: DominicRicottone
  Comment:
+   ← Revision 8 as of 2021-04-30 17:01:41 → ⇥
  Size: 6393
  Editor: DominicRicottone
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-## page was renamed from SurveySampling
= Survey Sampling =
+= Survey Samples =
-Line 10:
+Line 9:
-== Sample Frames ==
+== Frames ==
-Line 12:
+Line 11:
-Common frames used for survey sampling are:
+Original sources for contact data are often called '''frames'''. Common frames used for survey sampling are:
-Line 22:
+Line 21:
-== Sample Type ==
+== Sampling ==
-Line 24:
+Line 23:
-=== Propability sampling ===
+There are principally two ways to generate a survey sample: '''probability''' and '''non-probability''' sampling. These design choices have effects on the [[SurveyWeights|weighting methods]].

A closely-related question is how sample should be ''allocated'' across population groups.



=== Propability Sampling ===
-Line 33:
+Line 38:
-=== Non-probability sampling ===
+=== Non-probability Sampling ===
-Line 41:
+Line 47:
-----
+=== Stratified Allocation ===

Stratification is the process of dividing the population into discrete stratum, and then sampling from the strata.

Methods include:

 * '''Equal allocation''' takes the same number from each stratum.
 * '''Proportional allocation''' takes a number proportional to the size of that stratum.
 * '''Neyman allocation''' optimizes a key measure's margin of error against cost. It assumes a fixed cost per contact.
 * On the other hand, if cost is assumed variable, it becomes an '''optimal allocation'''.

Note that, if all measures are equally varied, proportional allocation is essentially the same as a Neyman allocation.
-Line 45:
+Line 64:
-== Survey Allocation ==
+=== Selecting Domains ===
-Line 47:
+Line 66:
-Allocation is the distribution of sample size across domains.


=== Designing Domains ===

The key considerations are:
+The key considerations for selecting the discrete strata, or '''domains''':
-Line 60:
+Line 74:
-=== Stratified Allocation ===

Stratification is the process of dividing the population into discrete stratum, and then sampling from the strata.

 * '''Equal allocation''' is taking the same number from each stratum.
 * '''Proportional allocation''' is taking a number proportional to the size of that stratum.
 * '''Neyman allocation''' is an optimization of a key measure's margin of error against cost. It assumes a fixed cost per contact.
 * On the other hand, if cost is assumed variable, it becomes an '''optimal allocation'''.

Note that, if all measures are equally varied, proportional allocation is essentially the same as a Neyman allocation.
-Line 105:
+Line 107:
-== Variance Estimation ==
+== Design Effects ==
-Line 107:
+Line 109:
-Sampling variance is how estimates would vary if samples are repeated drawn from the population. Sample design can affect the variance; stratified sampled have differing variances per stratum.
+'''Design effect''' is a measure of efficiency for a sample design. Specifically, it compares a sample design against SRS, which can be considered a 'baseline' design in terms of simplicity and administrative cost. As the design effect increases, either the sample size must increase or the estimated variances will increase.
-Line 109:
+Line 111:
-Of course, because the population descriptives are unknown, survey variance must be estimated.
+DEFF'',,p(s),,''(x) = Var'',,p(s),,''(x) / Var'',,SRS,,''(x)
-Line 111:
+Line 113:
-'''Exact methods''' are mathematically convenient but impractical.

'''Finite population correction''' ('''FPC''') encapsulates the fact that as sample rate increases, sampling variance decreases. (If the entire populatino is sampled, there is no variance.) This is generally inapplicable if the sampling rate is below 5%.

'''Taylor series linearization''' makes use of weights and sample design features (i.e. strata, finite population correction, etc.) to estimate variance.

'''Replication''' or '''replicate weights''' makes use of several hierarchical weights.

When using Stata, consider using `subpop` or `over` options instead of `if` for filtration.
+The key input is the estimated sampling variance of some population estimator under some design (i.e. Var,,p(s),,(x)). For complex sample designs, this is computed using '''resampling methods'''.
-Line 122:
+Line 116:
-=== Singleton Strata ===
-Line 124:
+Line 117:
-It isn't possible to estimate variance for a single unit.
+=== Unequal Weighting Effects ===
-Line 126:
+Line 119:
-When using Stata, consider using the `singleunit(centered)` option.
+Kish decomposed the above calculation into terms of unequal weighting effects:

DEFF'',,p(s),,''(x) = DEFF'',,unequal weighting,,'' * DEFF'',,clustering,,''

The '''unequal weighting effect''' is the inverse of the '''effective sample size''', also known as the '''effective base'''. For a given sample measure (y):

DEFF'',,unequal weighting,,''(y) = ∑'',,i,,'' w'',,i,,''^2^ / (∑'',,i,,'' w'',,i,,'')^2^

Note that this leaves '''finite population correction''' aside.



=== Variance Estimation ===

Sampling variance is how estimates would vary if samples are repeated drawn from the population. Of course, because the population descriptives are unknown, survey variance must be estimated.

The methods for variance estimation include:

 * '''Exact methods''' are mathematically convenient but impractical.
 * '''Taylor series linearization''' makes use of weights and sample design features (i.e. strata, finite population correction, etc.) to estimate variance.
 * '''Replication''' or '''replicate weights''' makes use of several hierarchical weights.

Note that it isn't possible to estimate variance for a single unit, or a '''singleton stratum'''.




=== Finite Population Correction ===

'''Finite population correction''' ('''FPC''') encapsulates the fact that as sample rate increases, sampling variance decreases. (If the entire population is sampled, there is no variance.) This is generally inapplicable if the sampling rate is below 5%.



=== Analysis in Stata ===

When using Stata...

 * use `subpop` or `over` options instead of `if` for filtration.
 * use the `singleunit(centered)` option.

----



== Resampling Methods ==

=== Bootstrap Estimation ===

'''Bootstrapping''' is a method for imputing values for missing data. A random sample of equal size is drawn (i.e. cases are resampled ''non-exclusively'' from the original sample). The descriptives in question are modeled using logistic regression on the second sample, and missing values of the first sample are computed based on that model.

Diff for "SurveySamples"