Sampling-based vs. Design-based Uncertainty in Regression Analysis

Sampling-based vs. Design-based Uncertainty in Regression Analysis (arXiv: https://arxiv.org/abs/1706.01778) was written by Alberto Abadie, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge. The first draft was published online in 2017. It was then published in Econometrica (vol. 88, no. 1) in 2020.

Sampling-based uncertainty is the variation across possible samples such as the following, where R is the sample selection indicator.

Sample 1

N	Y	Z	R
1	y₁	z₁	1
2	?	?	0
3	?	?	0
4	?	?	0

Sample 2

N	Y	Z	R
1	?	?	0
2	?	?	0
3	y₃	z₃	1
4	y₄	z₄	1

Sample 3

N	Y	Z	R
1	?	?	0
2	?	?	0
3	y₃	z₃	1
4	?	?	0
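The tables above can be mimicked with a small simulation. This is only a sketch under my own assumptions: the outcome values in `y` are made up, the function name `sampling_distribution` is hypothetical, and I assume simple random sampling of a fixed number of units, which the paper does not require. The point is that Y is held fixed while only R changes between draws.

```python
import random
import statistics

def sampling_distribution(y, sample_size, draws=2000, seed=0):
    """Sample means over repeated simple random samples from a fixed population."""
    rng = random.Random(seed)
    units = range(len(y))
    means = []
    for _ in range(draws):
        # R is the only random element: it marks which units are sampled.
        sampled = set(rng.sample(units, sample_size))
        r = [1 if i in sampled else 0 for i in units]
        # Y is fixed; it is merely unobserved wherever R is 0.
        observed = [y[i] for i in units if r[i] == 1]
        means.append(statistics.fmean(observed))
    return means

# Hypothetical fixed outcomes for the four units (not from the paper).
y = [3.0, 5.0, 8.0, 12.0]
means = sampling_distribution(y, sample_size=2)

# Spread of the estimate that comes purely from which units were sampled.
print(statistics.pvariance(means))
```

Calling the function with `sample_size=4` samples every unit on every draw, so all the sample means coincide and the spread is zero.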

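Design-based uncertainty is the variation across realizations such as the following, where X is the treatment assignment indicator. Every unit is observed in each realization; what changes is which potential outcome is revealed.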

Sample 1

N	Y(1)	Y(0)	X
1	y₁(1)	?	1
2	?	y₂(0)	0
3	?	y₃(0)	0
4	?	y₄(0)	0

Sample 2

N	Y(1)	Y(0)	X
1	y₁(1)	?	1
2	?	y₂(0)	0
3	y₃(1)	?	1
4	?	y₄(0)	0

Sample 3

N	Y(1)	Y(0)	X
1	?	y₁(0)	0
2	?	y₂(0)	0
3	y₃(1)	?	1
4	y₄(1)	?	1
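A similar sketch for the design-based case, again under my own assumptions: the potential outcomes in `y1` and `y0` are made up, `randomization_distribution` is a hypothetical name, and I fix the number of treated units at two even though the tables above let it vary. Here every unit is observed on every draw, and the estimate still moves because X decides which potential outcome is revealed.

```python
import random
import statistics

def randomization_distribution(y1, y0, n_treated, draws=2000, seed=0):
    """Difference-in-means estimates over repeated random treatment assignments."""
    rng = random.Random(seed)
    units = range(len(y1))
    estimates = []
    for _ in range(draws):
        # X is the only random element: it marks which units are treated.
        treated = set(rng.sample(units, n_treated))
        x = [1 if i in treated else 0 for i in units]
        # Realized outcome: Y(1) if treated, Y(0) otherwise.
        y = [y1[i] if x[i] == 1 else y0[i] for i in units]
        mean_treated = statistics.fmean(y[i] for i in units if x[i] == 1)
        mean_control = statistics.fmean(y[i] for i in units if x[i] == 0)
        estimates.append(mean_treated - mean_control)
    return estimates

# Hypothetical fixed potential outcomes for the four units (not from the paper).
y1 = [4.0, 6.0, 9.0, 13.0]
y0 = [3.0, 5.0, 8.0, 12.0]
estimates = randomization_distribution(y1, y0, n_treated=2)

# The average treatment effect is fixed; the estimates scatter around it.
true_ate = statistics.fmean(a - b for a, b in zip(y1, y0))
print(true_ate, statistics.fmean(estimates), statistics.pvariance(estimates))
```

No sampling happens anywhere in this sketch, so all of the spread in `estimates` comes from the assignment mechanism.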

The authors propose a framework for uncertainty that covers both of these sources and drops the assumption of an infinitely large population.

R is of course known for all cases, but is generated by a stochastic process (i.e., varies across possible samples). Attributes including Y and Z are known only for sampled cases, but more importantly are fixed for all possible samples. They are independent of R.

X is another attribute generated by a stochastic process. The potential outcomes Y(0) and Y(1) are fixed, but which of them is realized depends on X, so the realized outcome is stochastic, too.

A descriptive estimand is a function of fixed attributes, e.g. one that can be expressed in terms of (X, Y). Such an estimand becomes known with certainty as the sample size approaches the fixed population size.
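A standard textbook result makes this concrete. It is not taken from the paper, and it assumes simple random sampling without replacement. Using the paper's notation (n for the population size, N for the sample size), and writing S² for the population variance of Y with divisor n − 1 and Ȳ for the sample mean (both symbols are mine), the variance of the sample mean is:

```latex
\operatorname{Var}\left(\bar{Y}\right) = \frac{S^2}{N}\left(1 - \frac{N}{n}\right),
\qquad
S^2 = \frac{1}{n - 1}\sum_{i=1}^{n}\left(Y_i - \frac{1}{n}\sum_{j=1}^{n} Y_j\right)^2
```

The factor (1 − N/n) is the finite population correction. It goes to zero as N approaches n, so the sampling-based uncertainty about the population mean disappears once every unit is sampled.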

A causal estimand is a function of stochastic processes, realized outcomes, etc.; e.g. one that cannot be expressed in terms of (X, Y, R).

The authors note that there is an in-between category of estimands that can be expressed in terms of (X, Y, R) but not (X, Y), "although it is difficult to think of interesting ones".

TODO: I need to return to this one, not making any headway this week...

Reading notes

The authors adopted a notation that I'm quite unfamiliar with. For example, n is the finite population size and N is the sample size. I would have expected the reverse. Generally, in the math and statistics literature, I have come to expect that uppercase letters denote distributions or variables, while subscripted lowercase letters denote actual values or measurements.


SamplingBasedVsDesignBasedUncertaintyInRegressionAnalysis (last edited 2025-07-24 16:09:35 by DominicRicottone)