Statistical Modeling: The Two Cultures
Statistical Modeling: The Two Cultures (DOI: https://doi.org/10.1214/ss/1009213726) was written by Leo Breiman and published in Statistical Science, vol. 16, no. 3 (2001).
The author describes the field of statistics as having two cultures of problem solving:
The data modeling culture, which matches a phenomenon to a stochastic data-generating model and then fits that model to measurements (i.e., estimates and then interprets the model's parameters)
- "Every article started with: Assume that the data are generated by the following model: ..."
The algorithmic modeling culture, which treats the data-generating mechanism as unknown, fits many candidate models, and optimizes for predictive accuracy
Broadly speaking, the author criticizes how ill-fitting data models are used to claim significance of findings inappropriately. Because it makes fewer assumptions about data generation, the latter 'culture' produces more robust predictions.
- Generally, the latter culture's models still assume the data are i.i.d.
- "At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme."
- TODO: sounds like a replicable experiment! (See the sketch after this list.)
- Residual analysis is not a plausible falsification test in more than two or three dimensions; residual plots cannot localize lack of fit across many predictors
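A minimal sketch of that experiment, assuming a particular form of nonlinearity and a particular omnibus goodness-of-fit test (Ramsey's RESET), neither of which Breiman specifies:

```python
# Simulated regression in seven dimensions with a controlled amount of
# nonlinearity (the quadratic term is an assumption; Breiman does not give
# his setup). Ramsey's RESET test stands in for a generic omnibus
# goodness-of-fit test of linearity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def reset_pvalue(X, y):
    """Ramsey RESET: F-test for adding yhat^2 and yhat^3 to a linear fit."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
    yhat = X1 @ b1
    rss0 = np.sum((y - yhat) ** 2)
    X2 = np.column_stack([X1, yhat ** 2, yhat ** 3])
    b2, *_ = np.linalg.lstsq(X2, y, rcond=None)
    rss1 = np.sum((y - X2 @ b2) ** 2)
    q, df = 2, n - X2.shape[1]
    F = ((rss0 - rss1) / q) / (rss1 / df)
    return stats.f.sf(F, q, df)

n, p = 500, 7
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
for c in [0.0, 0.2, 0.5, 1.0, 2.0]:  # c controls the amount of nonlinearity
    y = X @ beta + c * X[:, 0] ** 2 + rng.normal(size=n)
    print(f"nonlinearity c={c:.1f}  RESET p-value={reset_pvalue(X, y):.3f}")
```

If the quoted claim holds, the p-value should stay above conventional thresholds until c is large relative to the noise.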
The author predicts that complicated Bayesian models will become more popular as the former 'culture' runs into more problems that classical parametric data models cannot fit.
Rashomon effect: a crowd of 'good' models leads to instability. A common practice is stepwise selection or omission of predictors to produce a more interpretable or parsimonious parametric model. The selected model minimizes the residual sum of squares (equivalently, maximizes R2), but often many different models (i.e., different sets of retained predictors) come very close. As a result, a small change in the training data leads to the selection of a very different model without much change in stated significance. The latter 'culture' has solved this problem by aggregating models, e.g. bagging.
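A minimal sketch of this instability, on synthetic data of my own choosing (not an example from the paper): forward stepwise selection is rerun on bootstrap resamples of one training set, and the retained predictors change from resample to resample even though each fit reaches a similar residual sum of squares.

```python
# Correlated predictors make several subsets explain y about equally well,
# so greedy selection flips between them under small data perturbations.
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 10, 3

Z = rng.normal(size=(n, p))
X = Z + 0.9 * Z[:, [0]]               # give every column a shared component
y = X[:, :4].sum(axis=1) + rng.normal(size=n)

def forward_select(X, y, k):
    """Greedily add the predictor that most reduces RSS, k times."""
    chosen = []
    rss_final = np.inf
    for _ in range(k):
        best, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = X[:, chosen + [j]]
            b, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = np.sum((y - cols @ b) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        chosen.append(best)
        rss_final = best_rss
    return sorted(chosen), rss_final

for trial in range(5):
    idx = rng.integers(0, n, size=n)  # bootstrap resample of the training data
    subset, rss = forward_select(X[idx], y[idx], k)
    print(f"resample {trial}: kept predictors {subset}, RSS={rss:.1f}")
```

Bagging's fix is to keep all the refits and average their predictions rather than committing to any single selected model.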
The author argues that predictive power and interpretability are at natural odds.
Parametric models are not robust to high dimensionality, whereas several algorithmic models (e.g., support-vector machines) deliberately work in very high-dimensional feature spaces, treating added dimensions as a help to prediction rather than a curse.
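A minimal sketch of that contrast using scikit-learn (my illustration; the paper names support-vector machines but not this experiment): a linear classifier fails on data that are not linearly separable in the raw two dimensions, while an RBF-kernel SVM, which implicitly maps points into a far higher-dimensional feature space, separates them easily.

```python
# Two concentric rings: no linear separator exists in the original 2-D
# coordinates, but the RBF kernel's implicit feature expansion makes the
# classes separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear").fit(Xtr, ytr)  # stays in the original 2-D space
rbf = SVC(kernel="rbf").fit(Xtr, ytr)        # implicit high-dimensional features
print("linear-kernel accuracy:", linear.score(Xte, yte))  # near chance
print("RBF-kernel accuracy:   ", rbf.score(Xte, yte))     # near perfect
```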