= Gradient Boosting =

'''Gradient boosting''' is a method for estimating an ensemble model.

----

== Description ==

[[https://www.youtube.com/watch?v=dosOtgSdbnY|Boosting]] is [[Calculus/GradientDescent|gradient descent]] in function space. Rather than estimating the parameters that minimize loss, this method estimates the function (model) that minimizes loss.

A primary use case for boosting is creating an ensemble model out of unbiased but individually inaccurate models, such as deep [[Statistics/DecisionTrees#Classification_Trees|classification trees]]. Many terms are borrowed from that field.

=== Formulation ===

There is a '''weak model''' ''h'' that estimates the labels ''y'' with ''ŷ = [h(x,,1,,), ..., h(x,,n,,)]'', which includes some error. A '''strong model''' ''H'' would converge such that ''y = H(x) = [H(x,,1,,), ..., H(x,,n,,)]''.

Start by estimating the labels using ''h''; this first fit will also be referred to as ''F,,1,,'' for consistency. For every subsequent step ''m'' up to ''M'':

 1. Fit a new weak model ''h,,m,,'' to the residual of the prior ensemble, ''y - F,,m-1,,''.
 1. Recursively update the ensemble model as ''F,,m,, = F,,m-1,, + α h,,m,,'' and estimate the labels using ''F,,m,,''.
 1. Break if ''F,,m,,'' converges to ''H''.

As stated above, the primary use case is '''boosted classification trees''': each estimation step involves fitting the candidate classification tree models and selecting the one that minimizes error.

The ensemble model is finally formulated as ''H = F,,M,, = F,,1,, + Σ,,m,, (α h,,m,,)''.

=== Derivation ===

At each step, the selection of the weak model is derived from a minimization of the [[LinearAlgebra/VectorMultiplication#Inner_Product|inner product]] of the [[Calculus/Gradient|gradient]] of the loss function (''L'') and the proposed model. The typical loss function to use is '''exponential loss''':

{{attachment:exploss.svg}}

The gradient of this loss function in every dimension is:

{{attachment:gradloss.svg}}

Note that the ''e^-yH(x)^'' term is the contribution of an individual ''i'' to the overall loss. Normalized so that these contributions sum to 1 across all individuals, it is rewritten as the weight ''w,,i,,''. The minimization is then expressed as:

{{attachment:min1.svg}}

Note that ''y,,i,,h(x,,i,,)'' is always either +1 (accurate classification) or -1 (erroneous classification). It is simpler to split the summation accordingly:

{{attachment:min2.svg}}

After splitting the summation, it becomes clear that the latter component is the weighted sum of errors (''ε''). Necessarily, the complementary term is ''1-ε''. After rewriting, the problem simplifies to a minimization of errors.

=== Learning Rate ===

In the above formulation, ''α'' is used to mirror gradient descent. But in the context of boosting specifically, this hyperparameter is more often called the '''learning rate''' or '''shrinkage''', and is notated as ''η'' (i.e., eta). As in gradient descent, setting it to 0.1 or lower is the general best practice.

----

CategoryRicottone
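
== Example ==

The following is a minimal sketch of the formulation above, assuming squared-error loss (so that the residual ''y - F,,m-1,,'' is exactly the negative gradient) and using scikit-learn's `DecisionTreeRegressor` as the weak model ''h''. The function names (`boost_fit`, `boost_predict`) and the hyperparameter defaults are illustrative only, not taken from any particular library.

{{{#!python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_steps=100, learning_rate=0.1, max_depth=2):
    """Fit a boosted ensemble under squared-error loss; returns the weak models."""
    first = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)  # F_1
    models = [first]
    prediction = first.predict(X)                                 # F_1(x)
    for _ in range(1, n_steps):
        # Fit the next weak model h_m to the residual y - F_{m-1}(x).
        residual = y - prediction
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        models.append(h)
        # F_m = F_{m-1} + alpha * h_m
        prediction = prediction + learning_rate * h.predict(X)
    return models

def boost_predict(models, X, learning_rate=0.1):
    """Evaluate H(x) = F_1(x) + sum_m alpha * h_m(x)."""
    prediction = models[0].predict(X)
    for h in models[1:]:
        prediction = prediction + learning_rate * h.predict(X)
    return prediction

if __name__ == "__main__":
    # Toy regression problem: recover a noisy sine curve.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
    ensemble = boost_fit(X, y, n_steps=50)
    print(boost_predict(ensemble, X[:5]))
}}}

Each additional step shrinks the residual a little further; lowering `learning_rate` slows that progress but generally improves the final ensemble, at the cost of needing more steps.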