Sampling-based vs. Design-based Uncertainty in Regression Analysis
Sampling-based vs. Design-based Uncertainty in Regression Analysis was written by Alberto Abadie, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge in 2017.
Sampling-based uncertainty is the variation in an estimate across possible samples such as the ones below, where R is the sample selection indicator.
Sample 1
N | Y | Z | R |
1 | y1 | z1 | 1 |
2 | ? | ? | 0 |
3 | ? | ? | 0 |
4 | ? | ? | 0 |
Sample 2
N | Y | Z | R |
1 | ? | ? | 0 |
2 | ? | ? | 0 |
3 | y3 | z3 | 1 |
4 | y4 | z4 | 1 |
Sample 3
N | Y | Z | R |
1 | ? | ? | 0 |
2 | ? | ? | 0 |
3 | y3 | z3 | 1 |
4 | ? | ? | 0 |
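To make the idea concrete, here is a minimal simulation sketch of sampling-based uncertainty. The population values in `POPULATION_Y` and the function name `draw_sample_means` are made up for illustration, and the sketch assumes a fixed sample size drawn without replacement, which is a simplification (the samples in the tables above vary in size).

```python
import random
import statistics

# Hypothetical fixed population: these Y values never change across samples.
POPULATION_Y = [2.0, 4.0, 6.0, 8.0]


def draw_sample_means(n_draws, sample_size, seed=0):
    """Repeatedly draw R (which units are sampled) and record the sample mean of Y."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_draws):
        # Only R is random: R_i = 1 for sampled units, 0 otherwise.
        sampled = rng.sample(range(len(POPULATION_Y)), sample_size)
        r = [1 if i in sampled else 0 for i in range(len(POPULATION_Y))]
        # Y is observed only where R_i = 1.
        observed = [y for y, r_i in zip(POPULATION_Y, r) if r_i == 1]
        means.append(statistics.mean(observed))
    return means


means_2 = draw_sample_means(n_draws=2000, sample_size=2)
# Spread of the estimate across possible samples.
spread_2 = statistics.pvariance(means_2)
print(spread_2)
```

The spread in the estimates comes entirely from which units happen to be selected; the Y values themselves are held fixed throughout.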
Design-based uncertainty is the variation in an estimate across possible treatment assignments such as the ones below, where X is the treatment assignment indicator.
Sample 1
N | Y(1) | Y(0) | X |
1 | y1(1) | ? | 1 |
2 | ? | y2(0) | 0 |
3 | ? | y3(0) | 0 |
4 | ? | y4(0) | 0 |
Sample 2
N | Y(1) | Y(0) | X |
1 | y1(1) | ? | 1 |
2 | ? | y2(0) | 0 |
3 | y3(1) | ? | 1 |
4 | ? | y4(0) | 0 |
Sample 3
N | Y(1) | Y(0) | X |
1 | ? | y1(0) | 0 |
2 | ? | y2(0) | 0 |
3 | y3(1) | ? | 1 |
4 | y4(1) | ? | 1 |
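The same kind of sketch can illustrate design-based uncertainty. Here every unit in the population is observed, and the only randomness is in X. The potential outcomes in `Y1` and `Y0` and the function name `draw_estimates` are hypothetical. The sketch assumes exactly two of the four units are treated in each assignment, a simplification relative to the tables above (where the number treated varies), and it uses a plain difference in means as the estimator.

```python
import random
import statistics

# Hypothetical fixed potential outcomes for the whole population of four units.
Y1 = [5.0, 7.0, 6.0, 9.0]  # outcome if treated
Y0 = [3.0, 4.0, 6.0, 5.0]  # outcome if not treated
TRUE_ATE = statistics.mean([y1 - y0 for y1, y0 in zip(Y1, Y0)])


def draw_estimates(n_draws, n_treated=2, seed=0):
    """Re-randomize X many times and record the difference in means each time."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_draws):
        treated = set(rng.sample(range(len(Y1)), n_treated))
        x = [1 if i in treated else 0 for i in range(len(Y1))]
        # Only one potential outcome per unit is revealed, depending on X.
        treated_obs = [Y1[i] for i in range(len(Y1)) if x[i] == 1]
        control_obs = [Y0[i] for i in range(len(Y0)) if x[i] == 0]
        estimates.append(statistics.mean(treated_obs) - statistics.mean(control_obs))
    return estimates


estimates = draw_estimates(n_draws=2000)
# Spread of the estimate across possible treatment assignments.
spread = statistics.pvariance(estimates)
print(TRUE_ATE, statistics.mean(estimates), spread)
```

Even though no unit is left unsampled, the estimate still varies from one assignment to the next, because each assignment reveals a different half of the potential outcomes.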
The authors propose a framework for uncertainty that covers both of these sources and drops the assumption of an infinitely large population.
R is of course known for all cases, but is generated by a stochastic process (i.e., varies across possible samples). Attributes including Y and Z are known only for sampled cases, but more importantly are fixed for all possible samples. They are independent of R.
X is another attribute generated by a stochastic process. The potential outcomes Y(0) and Y(1) are fixed, but which of them is observed depends on X, so the realized outcome is stochastic, too.
A descriptive estimand is a function of fixed attributes. These become known with certainty as the sample size approaches the fixed population size.
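As a rough check on the claim that a descriptive estimand becomes known with certainty as the sample approaches the population, the sketch below enumerates every possible sample of each size from a tiny population and compares the exact variance of the sample mean with the textbook finite population correction. The population values are made up, and the sketch assumes simple random sampling without replacement, which may differ from the sampling scheme in the paper. It follows the paper's notation, with n as the population size and N as the sample size.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

POPULATION_Y = [2.0, 4.0, 6.0, 8.0]  # hypothetical fixed attribute values
n = len(POPULATION_Y)                 # population size (the paper's n)


def exact_variance_of_sample_mean(N):
    """Variance of the sample mean over every equally likely sample of size N."""
    sample_means = [mean(s) for s in combinations(POPULATION_Y, N)]
    return pvariance(sample_means)


def fpc_formula(N):
    """Textbook variance under simple random sampling without replacement."""
    s2 = variance(POPULATION_Y)  # population variance with divisor n - 1
    return (1 - N / n) * s2 / N


exact = {N: exact_variance_of_sample_mean(N) for N in range(1, n + 1)}
formula = {N: fpc_formula(N) for N in range(1, n + 1)}
for N in range(1, n + 1):
    print(N, exact[N], formula[N])
```

The factor (1 - N/n) is what drives the variance to zero when N reaches n, which is the sense in which the estimand becomes known with certainty.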
A causal estimand is a function of stochastic processes and realized outcomes.
Reading notes
The authors adopted a notation that I'm quite unfamiliar with. For example, n is the finite population size and N is the sample size. I would have expected the reverse. Generally, in the math and statistics literature, I have come to expect that uppercase letters denote distributions or random variables, while subscripted lowercase letters denote actual values or measurements.