Statistical Paradises and Paradoxes in Big Data

Statistical Paradises and Paradoxes in Big Data (DOI: https://doi.org/10.1214/18-AOAS1161SF) was written by Xiao-Li Meng in 2018. It was published in The Annals of Applied Statistics (vol. 12, no. 2).

The full title is Statistical Paradises and Paradoxes in Big Data I: Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election. There is a part 2 that addresses making predictions with big data.

These notes incorporate some details from the slide deck of the 2021 Joint Program in Survey Methodology Distinguished Lecture, Data Defect Correlation: A Unified Quality Metric for Probabilistic Sample and Non-Probabilistic Sample, also by Meng.

The author opens with a motivating question: "Which one should we trust more, a 5% survey sample or an 80% administrative dataset?" The focus is then to quantify and characterize the tradeoffs between traditional survey statistics, underpinned by random samples, and larger non-random samples.

The population average X̄N is estimated by the sample average estimator X̄n; given a sample {Xj, j ∈ In} where In is a size-n subset of {1, ..., N}, the sample average estimator is expressed as sampavg.svg, where Rj carries the value 1 if member j responded and 0 otherwise.

Note that 'responded' is treated as a random mechanism in the traditional framework, but big data is often administrative; there the parallel mechanism is 'recorded', and it is not random. "It is thus entirely possible that nothing in [this equation] is probabilistic." Nonetheless, so long as the index J is uniformly distributed over {1, ..., N}, the error can be characterized as:

err.svg

The author decomposes this into the actuarial equation:

err2.svg

The three components here are:

 * Data quality: the data defect correlation ρR,X, the population correlation between the response indicator Rj and the value Xj
 * Data quantity: √((N−n)/n), a function of the sampling fraction f = n/N
 * Problem difficulty: σ, the population standard deviation of the Xj

Compare this to the standard error under SRS: σ/√n. The novel components are the data quality term and the numerator of the data quantity term.
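This decomposition can be checked numerically. A minimal sketch, assuming a synthetic population and a value-dependent (hence biased, nonrandom) recording mechanism; the identity holds exactly for any realized R:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
X = rng.normal(10.0, 2.0, N)                  # hypothetical population values
# Biased recording: units with larger X are more likely to be recorded
R = (rng.random(N) < 0.05 + 0.10 * (X > 10)).astype(float)

n = R.sum()
error = (R @ X) / n - X.mean()                # actual estimation error

# Decomposition: data quality x data quantity x problem difficulty
rho = np.corrcoef(R, X)[0, 1]                 # data defect correlation
quantity = np.sqrt((N - n) / n)
sigma = X.std()                               # population (ddof=0) std dev

assert np.isclose(error, rho * quantity * sigma)
```

Note that the error here is driven entirely by how R covaries with X, not by sampling noise; with a truly random R, rho would shrink toward zero at rate 1/√N.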

The author then uses this to reformulate the MSE of the estimator with respect to the distribution of R. (Not the variance, because without SRS we expect some selection bias.)

mse.svg

This leads to two derivative measures:

 * The data defect index, DI = E[ρ²R,X], a measure of lack of representativeness
 * The effective sample size, neff, the size of an SRS that would achieve the same MSE; with even a small ρ, neff can be dramatically smaller than n

Therefore, the author describes a big data paradox: "[t]he bigger the data, the surer we fool ourselves."
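A sketch of how a tiny data defect correlation collapses the effective sample size; the figures below are illustrative assumptions loosely based on the paper's 2016 election discussion, not exact values from it:

```python
def effective_sample_size(n, N, rho):
    """Size of an SRS with the same MSE, from equating sigma**2 / n_eff
    with rho**2 * (1 - f) / f * sigma**2, where f = n / N."""
    f = n / N
    return f / ((1 - f) * rho ** 2)

# Assumed: ~2.3 million survey responses out of ~231 million eligible
# voters, with a data defect correlation of about -0.005.
n_eff = effective_sample_size(2_300_000, 231_000_000, -0.005)
print(round(n_eff))  # a few hundred, despite millions of responses
```

The paradox is visible in the formula itself: for fixed rho and fixed f, n_eff does not grow with N, so collecting more of a biased population buys essentially nothing.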

The author also demonstrates that survey weighting is unlikely to have a strong enough curative effect to overcome this negative effect of large N. Therefore, with a nonprobability sample (and likewise with a probability sample where nonresponse invalidates randomness), weighted estimators suffer errors that accumulate as N grows.

Returning to the motivating question, compare a survey dataset against an administrative dataset: if the former exhibits a 1% sampling fraction (fS) and a 60% response rate, whereas the latter covers 80% of the population (fA), then the dropout odds ratio is the ratio of the data quantity components:

odds.svg

The administrative dataset is superior if ineq.svg holds. In this specific example, the inequality is rather likely to hold so long as the data quality components are similar across the two sources. More generally, this demonstrates how even a relatively high response rate of 60% can severely lower the value of an SRS design once nonresponse invalidates randomness.
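Plugging the example's numbers into the comparison; this assumes the survey's effective recorded fraction is the sampling fraction times the response rate, with dropout odds defined as (1 − f)/f:

```python
def dropout_odds(f):
    """(1 - f) / f, where f is the recorded fraction of the population."""
    return (1 - f) / f

f_survey = 0.01 * 0.60  # 1% sampling fraction times 60% response rate
f_admin = 0.80          # administrative dataset covers 80% of the population

ratio = dropout_odds(f_survey) / dropout_odds(f_admin)
# The administrative dataset wins on MSE whenever its squared data defect
# correlation is less than `ratio` times the survey's.
print(ratio)
```

With these numbers the ratio is in the hundreds, so the administrative data's defect correlation can be more than an order of magnitude larger than the survey's and still yield a lower MSE.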


CategoryRicottone CategoryReadingNotes

StatisticalParadisesAndParadoxesInBigData (last edited 2025-08-02 15:23:37 by DominicRicottone)