Survey Inference
Survey inference is an experimental design for estimating the parameters of a population from measurements taken on a sample of it.
Approaches to Survey Data
Model-based inference
- Build a mathematical model that describes a population.
- Draw a random sample from that population to produce estimates.
- Estimate how the error terms of those estimates would vary if repeated samples were drawn.
In other words, the key is how well the model describes the population.
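The three steps above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the normal model, its parameters (`MU`, `SIGMA`), the sample size `N`, and the helper `model_based_standard_error` are all assumptions made for the example.

```python
import random
import statistics

def model_based_standard_error(sample):
    """Standard error of the mean under an i.i.d. model: s / sqrt(n)."""
    return statistics.stdev(sample) / len(sample) ** 0.5

rng = random.Random(0)

# Step 1: a hypothetical model, values are i.i.d. Normal(MU, SIGMA).
MU, SIGMA, N = 50.0, 10.0, 100

# Step 2: draw one random sample and produce an estimate from it.
sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
estimate = statistics.fmean(sample)

# Step 3: the model says how the estimate would vary over repeated samples.
se = model_based_standard_error(sample)

# Check that claim by actually drawing repeated samples from the same model.
repeated_means = [
    statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(2000)
]
empirical_se = statistics.stdev(repeated_means)

print(f"estimate={estimate:.2f} model SE={se:.2f} empirical SE={empirical_se:.2f}")
```

The model-based standard error is only trustworthy to the extent that the assumed model matches the population, which is the point made above.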
Design-based inference
- Identify a population with fixed descriptives.
- Draw a sample from that population and collect measures from it.
- Estimate how the measures would vary if repeated samples were drawn.
In other words, the key is how well the sample fits the population. If the full population were contacted, the measures would carry no sampling error.
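A small simulation can illustrate the design-based view. It is only a sketch: the population of 1,000 records, the sample size of 50, and the helper `srs_mean` are hypothetical, and simple random sampling without replacement is assumed as the design.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical fixed population: 1,000 records, each with a fixed measure.
population = [rng.randint(18, 90) for _ in range(1000)]
true_mean = statistics.fmean(population)

def srs_mean(pop, n, rng):
    """Mean of a simple random sample of size n drawn without replacement."""
    return statistics.fmean(rng.sample(pop, n))

# Repeated samples: the population never changes, only the draw does.
sample_means = [srs_mean(population, 50, rng) for _ in range(2000)]
spread = statistics.stdev(sample_means)

# A "sample" of the whole population reproduces the population mean.
census_mean = srs_mean(population, len(population), rng)

print(f"true mean={true_mean:.2f} spread of sample means={spread:.2f}")
print(f"census mean={census_mean:.2f}")
```

Here all of the variation comes from which records happen to be drawn, not from any randomness in the population itself.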
Inferential statistics from complex survey data
This approach uses model-based inference while accounting for the survey design.
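One common way to account for a survey design is to weight each sampled record by the inverse of its selection probability. The sketch below assumes a stratified design with hypothetical stratum sizes and values; the function `design_weighted_mean` is an illustration, not a reference to any particular library.

```python
def design_weighted_mean(strata):
    """Weighted mean for a stratified sample.

    strata: list of (stratum_population_size, sampled_values) pairs.
    Each sampled record gets weight N_h / n_h, the number of population
    records it stands in for.
    """
    weighted_total = 0.0
    weight_sum = 0.0
    for pop_size, values in strata:
        weight = pop_size / len(values)
        weighted_total += weight * sum(values)
        weight_sum += weight * len(values)
    return weighted_total / weight_sum

# Hypothetical design: the small stratum is sampled at a much higher rate.
strata = [
    (900, [10, 12, 14, 16]),  # large stratum, 4 of 900 sampled
    (100, [30, 32, 34, 36]),  # small stratum, 4 of 100 sampled
]

all_values = [v for _, values in strata for v in values]
unweighted = sum(all_values) / len(all_values)
weighted = design_weighted_mean(strata)

print(f"unweighted mean={unweighted:.1f} design-weighted mean={weighted:.1f}")
```

Ignoring the design treats the over-sampled stratum as if it were half the population, which pulls the unweighted mean away from the weighted one.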
Population of Interest
Survey statistics are estimates of a population parameter calculated from measurements. The population of interest is sometimes referred to as a universe for such calculations.
Records of the population of interest form a frame.
Sampling Error
If a sample is a poor fit for the population, then it will be difficult or impossible to estimate population parameters.
Random sampling attempts to address this. But for random sampling to succeed, the population of interest needs to be completely specified.
If a frame contains records that are not in the population of interest, it features over-coverage. If it misses records in the population of interest, it features under-coverage.
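Over-coverage and under-coverage amount to set differences between the frame and the population of interest. The sketch below assumes records can be matched by an identifier; the IDs and the function name `coverage_report` are hypothetical.

```python
def coverage_report(frame_ids, population_ids):
    """Compare a frame against the population of interest by record ID."""
    frame = set(frame_ids)
    population = set(population_ids)
    return {
        # In the frame but not in the population of interest.
        "over_coverage": sorted(frame - population),
        # In the population of interest but missing from the frame.
        "under_coverage": sorted(population - frame),
        # Correctly covered records.
        "covered": sorted(frame & population),
    }

population_ids = [1, 2, 3, 4, 5, 6]
frame_ids = [2, 3, 4, 5, 7, 8]

report = coverage_report(frame_ids, population_ids)
print(report)
```

In practice the population of interest is rarely enumerated this cleanly, so a comparison like this is usually only possible against a benchmark or an auxiliary source.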
Inaccurate or out-of-date frame information contributes to non-response.
Auxiliary information can inform and guide sample design, so richness of a frame can also be a contributing factor to sampling error.
See the SurveySamples page for details on survey sampling.
Measurement
Survey interviews measure characteristics about sampled records and, whenever possible, measure the responses of those who are successfully contacted and then cooperate.
The mode of interview is often determined by the choice of frame, but each mode has advantages and disadvantages, especially in cost.
- Any mode requiring first contact by mail necessarily costs printing and postage.
- Paper mode necessarily costs printing, and mailed paper mode calls for a return envelope and pre-paid postage on it.
- Computer-assisted modes (i.e., Computer-Assisted Telephone Interview (CATI) and Computer-Assisted Personal Interview (CAPI)) push down labor costs.
- Automated modes (e.g., web or Interactive Voice Response (IVR)) push labor costs even lower.
Non-sampling Error
Specification error refers to a difference between the measures of interest and what is actually measured.
Measurement error refers to a variety of factors that interfere with the interview. For example, for a survey of self-reported political opinions, there may be an interviewer effect from the apparent race of an interviewer.
Non-response has the potential to introduce bias, especially if non-response is not random. For example, some populations are inherently difficult to contact, due to:
- political, legal, and ethical considerations (e.g., minors in the population)
- language and cultural barriers
- remoteness of location (e.g., rural Alaska)
This should be considered during sample design.
As another example, some populations are inherently difficult to contact by certain survey modes, due to:
- socioeconomic conditions (e.g., lack of telephone connectivity)
- low-accuracy contact data (e.g., outdated addresses)
This should be considered when selecting a frame in the first place.
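The effect of non-random non-response can be sketched with a simulation. Everything here is hypothetical: the population, the 30% share of hard-to-contact records, their higher values, and the response rates are assumptions chosen only to make the bias visible.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical population: 30% of records are hard to contact, and those
# records tend to have higher values of the measure of interest.
population = []
for _ in range(10000):
    hard = rng.random() < 0.3
    value = rng.gauss(60, 5) if hard else rng.gauss(40, 5)
    population.append((value, hard))

true_mean = statistics.fmean([value for value, _ in population])

def respondent_mean(pop, p_easy, p_hard, rng):
    """Mean among respondents, given response rates for each group."""
    respondents = [
        value
        for value, hard in pop
        if rng.random() < (p_hard if hard else p_easy)
    ]
    return statistics.fmean(respondents)

# Same response rate for everyone: non-response is random.
random_nonresponse = respondent_mean(population, 0.5, 0.5, rng)

# Hard-to-contact records respond far less often: non-response is not random.
nonrandom_nonresponse = respondent_mean(population, 0.7, 0.2, rng)

print(f"true mean={true_mean:.1f}")
print(f"random non-response={random_nonresponse:.1f}")
print(f"non-random non-response={nonrandom_nonresponse:.1f}")
```

Random non-response mostly costs precision, while non-response that is related to the measure shifts the estimate itself.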