Statistical Paradises and Paradoxes in Big Data
Statistical Paradises and Paradoxes in Big Data (DOI: https://doi.org/10.1214/18-AOAS1161SF) was written by Xiao-Li Meng in 2018. It was published in The Annals of Applied Statistics (vol. 12, no. 2).
The full title is Statistical Paradises and Paradoxes in Big Data I: Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election. There is a part 2 that addresses making predictions with big data.
These notes incorporate some details from the slidedeck of the 2021 Joint Program in Survey Methodology Distinguished Lecture, Data Defect Correlation: A Unified Quality Metric for Probabilistic Sample and Non-Probabilistic Sample, also by Meng.
The author presents a motivating question: "Which one should we trust more, a 5% survey sample or an 80% administrative dataset?" The focus is then to quantify and characterize the tradeoffs between traditional survey statistics, underpinned by random samples, and larger non-random samples.
The population average G̅N is estimated by the sample average estimator G̅n: given a sample {Gj, j ∈ In}, where In is a size-n subset of {1, …, N}, and letting Rj carry the value 1 if population member j responded and 0 otherwise, the estimator is

G̅n = (Σj Rj Gj) / (Σj Rj), with the sums running over the whole population j = 1, …, N.
Note that 'responded' is considered a random mechanism in the traditional framework, but big data is often administrative. There is a parallel 'recorded' mechanism, but it is not random. "It is thus entirely possible that nothing in [this equation] is probabilistic." Nonetheless, so long as the index J is uniformly distributed on {1, …, N}, the error can be characterized as:

G̅n − G̅N = CovJ(RJ, GJ) / EJ[RJ]
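A minimal numerical check of this identity, on a hypothetical binary population (the population, the biased recording mechanism, and all parameters below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population: N units with a binary outcome G.
N = 100_000
G = rng.binomial(1, 0.52, size=N).astype(float)

# A non-random 'recorded' mechanism: units with G = 1 are slightly
# less likely to end up in the dataset, inducing selection bias.
R = rng.binomial(1, np.where(G == 1, 0.45, 0.55)).astype(float)

G_bar_N = G.mean()            # population average
G_bar_n = G[R == 1].mean()    # sample average over recorded units

# Identity: error = Cov_J(R_J, G_J) / E_J[R_J], with J uniform on
# {1, ..., N}, so both moments use the 1/N (population) normalization.
cov_RG = (R * G).mean() - R.mean() * G.mean()
print(G_bar_n - G_bar_N, cov_RG / R.mean())   # the two values agree
```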
The author decomposes this into the fundamental identity (a numerical check follows the component list below):

G̅n − G̅N = ρR,G × √[(N−n)/n] × σG

The three components here are:
data quality as Cor(RJ,GJ), also denoted as ρR,G (the 'data defect correlation' of the JPSM lecture title)
This is not calculable from the analytic sample, as all cases therein obviously have R=1.
data quantity as √[(N-n)/n]
Letting f = E[RJ] = n/N be the sampling rate, this can be rewritten as √[(1−f)/f]. The importance here is to reframe the error as depending on the fraction of the population covered by the analytic sample.
Note the similarity to the finite population correction (FPC).
problem difficulty as σG, the population standard deviation
Compare this to the standard error under SRS, σG/√n. The novel components are data quality and the numerator of data quantity.
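The same toy population as above can be used to confirm the three-factor decomposition; a self-contained sketch (illustrative setup, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative population and biased recording mechanism as before.
N = 100_000
G = rng.binomial(1, 0.52, size=N).astype(float)
R = rng.binomial(1, np.where(G == 1, 0.45, 0.55)).astype(float)

n = int(R.sum())

quality = np.corrcoef(R, G)[0, 1]    # data quality: Cor(R_J, G_J) = rho_{R,G}
quantity = np.sqrt((N - n) / n)      # data quantity: sqrt((N-n)/n) = sqrt((1-f)/f)
difficulty = G.std()                 # problem difficulty: sigma_G (population std)

error = G[R == 1].mean() - G.mean()
print(error, quality * quantity * difficulty)   # identical up to float error
```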
The author then uses this identity to compute the MSE of the estimator with respect to the distribution of R. (The MSE, not the variance, because without SRS we expect some selection bias.)

MSE_R(G̅n) = ER[ρ²R,G] × [(1−f)/f] × σ²G
This leads to two derivative measures:
The design effect (or 'lack-of-design' effect) is the ratio of this MSE over the MSE (variance) under SRS, which enables comparison of non-random samples to random samples. Since VarSRS(G̅n) = [(1−f)/f] × σ²G/(N−1), the ratio works out to ER[ρ²R,G] × (N−1). Note that the final component, problem difficulty, is constant across the two, so it is really the product of data quality and data quantity that matters.
Fitting this expected deviation into a statistical formula that works under SRS, e.g. the Z-score Z = (G̅n − G̅N)/√VarSRS(G̅n), reveals that the standardized error grows with N. Specifically, Z = ρ̂R,G √(N−1) (see the sketch below).
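To see this numerically, hold a seemingly negligible data defect correlation fixed and let the population grow; a sketch with an illustrative ρ̂ value (not a figure from the paper):

```python
import math

rho_hat = 0.005   # an apparently negligible Cor(R_J, G_J)

for N in (10_000, 1_000_000, 100_000_000):
    z = rho_hat * math.sqrt(N - 1)
    print(f"N = {N:>11,}  Z = {z:6.1f}")

# Z goes from 0.5 to 5.0 to 50.0: the same tiny defect looks ever more
# 'statistically significant' as the population grows.
```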
Therefore, the author describes a big data paradox: "[t]he bigger the data, the surer we fool ourselves."
The author also demonstrates that survey weighting is unlikely to be curative enough to overcome this penalty of large N. Thus, with a nonprobability sample, and also with a probability sample where nonresponse invalidates the randomization, weighted estimators suffer errors that accumulate as N grows.
Returning to the motivating question, compare a survey dataset and an administrative dataset: the former has a 1% sampling fraction (fS = 0.01) and a 60% response rate (r = 0.6), for an effective coverage of fS × r = 0.006 of the population, whereas the latter contains 80% of the population (fA = 0.8). Defining the dropout odds O(f) = (1−f)/f (the squared data quantity component), the relevant comparison is the ratio of data quantity components:

O(fS r)/O(fA) = (0.994/0.006)/(0.2/0.8) ≈ 165.7/0.25 ≈ 663

The administrative dataset is superior (lower MSE) if ER[ρ²A]/ER[ρ²S] < O(fS r)/O(fA) ≈ 663 holds. In this specific example, the inequality is rather likely to hold so long as the data quality component is of similar magnitude across the two sources. More generally, this demonstrates how even a relatively high response rate of 60% can severely lower the value of an SRS design, once nonresponse invalidates randomness.
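A quick computation of the dropout odds ratio for this example (numbers as given above):

```python
def dropout_odds(f: float) -> float:
    """O(f) = (1 - f) / f, the squared data-quantity component."""
    return (1 - f) / f

f_S = 0.01 * 0.60   # survey: 1% sampling fraction x 60% response rate
f_A = 0.80          # administrative dataset: covers 80% of the population

print(dropout_odds(f_S) / dropout_odds(f_A))   # ~662.7
```

So unless the administrative dataset's data defect correlation is worse by a factor of roughly √663 ≈ 26, it wins on MSE.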