Gradient Boosting

Gradient boosting is a method for estimating an ensemble model.


Description

Boosting is gradient descent in function space. Rather than estimating the parameter values that minimize loss, this method estimates the function (model) that minimizes loss.
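
As a sketch of this analogy (the notation below is illustrative and not taken from the original page; L is the loss and α is the step size), ordinary gradient descent updates a parameter vector, while boosting updates a function:

  \theta_m = \theta_{m-1} - \alpha \, \nabla_\theta L(\theta_{m-1})

  F_m = F_{m-1} - \alpha \, \nabla_F L(F_{m-1}) \approx F_{m-1} + \alpha \, h_m

Here h_m is a weak model fit to approximate the negative gradient, since the exact function-space gradient is not itself a usable model.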

A primary use case for boosting is creating an ensemble model out of individually unbiased but inaccurate models, such as deep classification trees. Many of the terms below are borrowed from the classification tree literature.

Formulation

There is a weak model h that estimates the labels y with ŷ = [h(x1), ..., h(xn)], where the estimates include some error. A strong model H would recover the labels exactly: y = H(x) = [H(x1), ..., H(xn)].

Start by estimating the labels using h, which will also be referred to as F1 for consistency. For every subsequent step m up to M:

  1. Recursively create an ensemble model as Fm = Fm-1 + α hm, where hm is a new weak model of the same family as h.

  2. Estimate the labels using Fm. This is equivalent to estimating the residuals of the prior model, y - Fm-1, using hm.

  3. Break if Fm converges to H.

As stated above, the primary use case is boosted classification trees; each estimation step involves calculating all possible classification tree models and selecting the one that minimizes loss or, more accurately, the one that minimizes the inner product of the model with the gradient of the loss.
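
A sketch of that selection criterion (the notation is illustrative and not from the original page; h ranges over the candidate tree models):

  h_m = \arg\min_{h} \, \langle \nabla_F L(F_{m-1}), h \rangle
      = \arg\min_{h} \, \sum_{i=1}^{n} \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} \, h(x_i)

In practice this amounts to fitting hm to the negative gradient values (the pseudo-residuals), which for squared-error loss are simply the residuals y - Fm-1(x).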

The ensemble model is finally formulated as H = FM = F1 + α Σm hm, summing the weak models fit at steps m = 2, ..., M.
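
A minimal sketch of this loop in Python, assuming squared-error loss (so the negative gradient equals the residual) and using scikit-learn's DecisionTreeRegressor as the weak model; the function names and hyperparameter values here are illustrative, not from this page:

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  def fit_gradient_boosting(X, y, n_steps=100, learning_rate=0.1, max_depth=2):
      """Fit an ensemble H = F1 + alpha * sum of h_m, with shallow regression trees."""
      y = np.asarray(y, dtype=float)
      # F1: the initial weak model, fit directly to the labels
      f1 = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
      ensemble = [f1]
      predictions = f1.predict(X)
      for m in range(2, n_steps + 1):
          # Pseudo-residuals: for squared-error loss, the negative gradient
          # at the current fit is simply y - F_{m-1}(x)
          residuals = y - predictions
          # Fit the next weak model h_m to those residuals
          h_m = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
          ensemble.append(h_m)
          # Update: F_m = F_{m-1} + alpha * h_m
          predictions = predictions + learning_rate * h_m.predict(X)
      return ensemble

  def predict_ensemble(ensemble, X, learning_rate=0.1):
      """Evaluate H(x) = F1(x) + alpha * sum of h_m(x)."""
      predictions = ensemble[0].predict(X)
      for h_m in ensemble[1:]:
          predictions = predictions + learning_rate * h_m.predict(X)
      return predictions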

Learning Rate

In the above formulation, α is used to mirror gradient descent. But in the context of boosting specifically, this hyperparameter is more often called the learning rate or shrinkage, and notated as η (i.e., eta).

As in gradient descent, setting this hyperparameter to 0.1 or lower is the general best practice; smaller values typically require more boosting steps to reach the same fit.
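
For example, scikit-learn's implementation exposes this hyperparameter as learning_rate (0.1 by default); the dataset and the other settings below are purely illustrative:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.model_selection import train_test_split

  # Illustrative synthetic dataset; any binary classification data would do
  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # learning_rate is the alpha/eta discussed above; 0.1 or lower is typical,
  # and lower values generally call for a larger n_estimators (more steps)
  model = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200, max_depth=3)
  model.fit(X_train, y_train)
  print(model.score(X_test, y_test))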


CategoryRicottone
