Gradient Boosting
Gradient boosting is a method for estimating an ensemble model.
Description
Boosting is gradient descent in function space. Rather than estimating the parameter values that minimize loss, this method estimates the function (model) that minimizes loss.
A primary use case for boosting is creating an ensemble model out of weak learners: biased but low-variance models such as shallow classification trees (e.g., decision stumps). Many terms below are borrowed from that setting.
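As a rough sketch of the analogy (using the α and F notation of the formulation below; θ denotes an ordinary parameter vector): parameter-space gradient descent updates a parameter vector, while boosting updates the model itself by adding a weak learner fit to the negative gradient of the loss with respect to the model's outputs.

```latex
% Parameter space (ordinary gradient descent): step the parameter vector
% in the direction of the negative gradient of the loss L.
\[ \theta_m = \theta_{m-1} - \alpha \, \nabla_{\theta} L(\theta_{m-1}) \]

% Function space (boosting): step the model itself by adding a weak
% learner h_m that approximates the negative gradient of the loss
% with respect to the model's predictions.
\[ F_m(x) = F_{m-1}(x) + \alpha \, h_m(x), \qquad h_m \approx -\nabla_{F} L(F_{m-1}) \]
```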
Formulation
There is a weak model h that estimates the labels y = [y1, ..., yn] with ŷ = [h(x1), ..., h(xn)], which includes some error. A strong model H would instead satisfy y = H(x) = [H(x1), ..., H(xn)].
Start by estimating the labels using a first weak model h1, which will also be referred to as F1 for consistency. For every subsequent step m up to M:
Recursively extend the ensemble model as Fm = Fm-1 + α hm, where hm is a new weak model.
Fit hm so that Fm estimates the labels. This is equivalent to estimating the residual of the prior model, y - Fm-1(x), using hm (more generally, the negative gradient of the loss).
Stop if Fm has converged to H, i.e., if additional steps no longer reduce the loss.
As stated above, the primary use case is boosted classification trees; each estimation step fits a new classification tree to the current residuals, selecting (in practice greedily, rather than by enumerating every possible tree) the one that minimizes error.
The ensemble model is finally formulated as H = FM = F1 + α Σm hm, a weighted sum of the weak models.
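The loop above can be sketched in a few lines of Python. This is a minimal illustration under squared loss, where the negative gradient is exactly the residual; the function names are made up for this sketch, and scikit-learn's DecisionTreeRegressor stands in for the weak trees.

```python
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    """Build the ensemble F_m = F_{m-1} + learning_rate * h_m under squared loss."""
    # F_1: a single weak model fit directly to the labels.
    first = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
    models = [first]
    prediction = first.predict(X)
    for _ in range(1, n_rounds):
        # Fit the next weak model h_m to the residual of the current ensemble,
        # which for squared loss is also its negative gradient.
        residual = y - prediction
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        models.append(h)
        prediction = prediction + learning_rate * h.predict(X)
    return models

def predict(models, X, learning_rate=0.1):
    """H(x) = F_1(x) + learning_rate * sum of the remaining weak models."""
    prediction = models[0].predict(X)
    for h in models[1:]:
        prediction = prediction + learning_rate * h.predict(X)
    return prediction
```

A call like boost(X_train, y_train) followed by predict(models, X_test) then mirrors H = F1 + α Σm hm.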
Derivation
At each step, the weak model is chosen by minimizing the inner product of the gradient of the loss function (L) and the proposed model's predictions; this is the function-space analogue of stepping along the negative gradient.
The typical loss function to use is exponential loss:

L = Σi e^(-yi H(xi))
The gradient of this loss function in every dimension is:

∂L/∂H(xi) = -yi e^(-yi H(xi))
Note that the e^(-yi H(xi)) term is individual i's contribution to the overall loss. Normalizing it by the total loss gives a weight that sums to 1 across all individuals i; it is effectively the weighted contribution of individual i to overall loss, so it is rewritten as wi.
The minimization is then expressed as:

min_h -Σi wi yi h(xi)
Note that, with labels in {-1, +1}, yi h(xi) is always either +1 (accurate classification) or -1 (erroneous classification). It's simpler to split the summation accordingly:

-Σi wi yi h(xi) = -Σ(yi = h(xi)) wi + Σ(yi ≠ h(xi)) wi
After splitting the summation, it becomes clear that the latter component is the weighted sum of errors (ε). Because the weights sum to 1, the complementary term is necessarily 1 - ε, so the objective equals -(1 - ε) + ε = 2ε - 1. The problem therefore simplifies to a minimization of the weighted error ε.
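As a quick numeric sanity check of that last step (with arbitrary illustrative values, labels and predictions in {-1, +1}, and weights normalized to sum to 1), the objective -Σi wi yi h(xi) indeed equals 2ε - 1:

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])          # true labels y_i in {-1, +1}
h = np.array([1, -1, -1, 1, 1])          # weak model predictions h(x_i) in {-1, +1}
w = np.array([0.3, 0.2, 0.2, 0.2, 0.1])  # weights w_i, summing to 1

objective = -np.sum(w * y * h)   # -sum_i w_i y_i h(x_i)
epsilon = np.sum(w[y != h])      # weighted error: sum of w_i over misclassified i

assert np.isclose(objective, 2 * epsilon - 1)
print(objective, 2 * epsilon - 1)  # both -0.2 for these values
```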
Learning Rate
In the above formulation, α is used to mirror gradient descent. In the context of boosting specifically, however, this hyperparameter is more often called the learning rate or shrinkage, and is denoted by η (eta).
As in gradient descent, setting it to 0.1 or lower is the general best practice; a smaller learning rate is typically compensated for by running more boosting rounds.
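For instance, in scikit-learn's GradientBoostingClassifier the shrinkage is exposed as the learning_rate parameter; the dataset and parameter values below are only illustrative, with the lower rate paired with more boosting rounds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees, a small learning rate, and more boosting rounds to compensate.
clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=300, max_depth=2)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```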