Gradient Boosting

Gradient boosting is a method for estimating an ensemble model.


Description

Boosting is gradient descent in function space. Rather than estimating the parameter values that minimize loss, this method estimates the function (model) that minimizes loss.
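
As a sketch of this analogy (the notation below is illustrative and not taken from the original page; L is the loss and α is the step size), ordinary gradient descent updates a parameter vector, while boosting updates a function:

  \theta_m = \theta_{m-1} - \alpha \, \nabla_\theta L(\theta_{m-1})

  F_m = F_{m-1} - \alpha \, \nabla_F L(F_{m-1}) \approx F_{m-1} + \alpha \, h_m

Here h_m is a weak model fit to approximate the negative gradient, since the exact function-space gradient is not itself a usable model.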

A primary use case for boosting is creating an ensemble model out of individually unbiased but inaccurate models, such as deep classification trees. Many of the terms below are borrowed from the classification tree literature.

Formulation

There is a weak model h that estimates the labels y with ŷ = [h(x1), ..., h(xn)], where the estimates include some error. A strong model H would recover the labels exactly: y = H(x) = [H(x1), ..., H(xn)].

Start by estimating the labels using h, which will also be referred to as F1 for consistency. For every subsequent step m up to M:

  1. Recursively create an ensemble model as Fm = Fm-1 + α hm, where hm is a new weak model of the same family as h.

  2. Estimate the labels using Fm. This is equivalent to estimating the residuals of the prior model, y - Fm-1, using hm.

  3. Break if Fm converges to H.

As stated above, the primary use case is boosted classification trees; each estimation step involves calculating all possible classification tree models and selecting the one that minimizes loss or, more accurately, the one that minimizes the inner product of the model with the gradient of the loss.
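
A sketch of that selection criterion (the notation is illustrative and not from the original page; h ranges over the candidate tree models):

  h_m = \arg\min_{h} \, \langle \nabla_F L(F_{m-1}), h \rangle
      = \arg\min_{h} \, \sum_{i=1}^{n} \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} \, h(x_i)

In practice this amounts to fitting hm to the negative gradient values (the pseudo-residuals), which for squared-error loss are simply the residuals y - Fm-1(x).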

The ensemble model is finally formulated as H = FM = F1 + α Σm hm, summing the weak models fit at steps m = 2, ..., M.
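
A minimal sketch of this loop in Python, assuming squared-error loss (so the negative gradient equals the residual) and using scikit-learn's DecisionTreeRegressor as the weak model; the function names and hyperparameter values here are illustrative, not from this page:

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  def fit_gradient_boosting(X, y, n_steps=100, learning_rate=0.1, max_depth=2):
      """Fit an ensemble H = F1 + alpha * sum of h_m, with shallow regression trees."""
      y = np.asarray(y, dtype=float)
      # F1: the initial weak model, fit directly to the labels
      f1 = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
      ensemble = [f1]
      predictions = f1.predict(X)
      for m in range(2, n_steps + 1):
          # Pseudo-residuals: for squared-error loss, the negative gradient
          # at the current fit is simply y - F_{m-1}(x)
          residuals = y - predictions
          # Fit the next weak model h_m to those residuals
          h_m = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
          ensemble.append(h_m)
          # Update: F_m = F_{m-1} + alpha * h_m
          predictions = predictions + learning_rate * h_m.predict(X)
      return ensemble

  def predict_ensemble(ensemble, X, learning_rate=0.1):
      """Evaluate H(x) = F1(x) + alpha * sum of h_m(x)."""
      predictions = ensemble[0].predict(X)
      for h_m in ensemble[1:]:
          predictions = predictions + learning_rate * h_m.predict(X)
      return predictions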

Learning Rate

In the above formulation, α is used to mirror gradient descent. But in the context of boosting specifically, this hyperparameter is more often called the learning rate or shrinkage, and notated as η (i.e., eta).

As in gradient descent, setting this hyperparameter to 0.1 or lower is the general best practice; smaller values typically require more boosting steps to reach the same fit.
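
For example, scikit-learn's implementation exposes this hyperparameter as learning_rate (0.1 by default); the dataset and the other settings below are purely illustrative:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.model_selection import train_test_split

  # Illustrative synthetic dataset; any binary classification data would do
  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # learning_rate is the alpha/eta discussed above; 0.1 or lower is typical,
  # and lower values generally call for a larger n_estimators (more steps)
  model = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200, max_depth=3)
  model.fit(X_train, y_train)
  print(model.score(X_test, y_test))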


CategoryRicottone
