Gradient Descent
Gradient descent is an iterative method for approximating local minima of a differentiable objective function.
Description
By setting the gradient to zero, critical points can be calculated. For many differentiable functions, however, this equation cannot be solved in practice, which motivates an iterative approach.
Given a position x_n and an objective function f that is differentiable around x_n, the gradient ∇f(x_n) can be calculated. It follows that for a small enough α:
x_{n+1} = x_n − α ∇f(x_n)
f(x_{n+1}) < f(x_n)
will hold true.
The step size α must be small enough that it does not move x_{n+1} past the intended local minimum and into the proximity of another local minimum.
Note furthermore that the step size parameter does not hold the step distance constant. The actual step taken is the product of the step size parameter and the gradient vector, so the gradient's magnitude is also a component of the step distance.
By iteratively following the direction of steepest descent in small steps, a local minimum is approximated.
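A minimal sketch of this iteration in Python (the quadratic objective and all names below are illustrative assumptions, not part of the original page):

import numpy as np

def gradient_descent(grad, x0, alpha=0.1, steps=100):
    # Repeatedly step against the gradient: x_{n+1} = x_n - alpha * grad_f(x_n)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Example objective f(x) = (x - 3)^2, whose gradient is 2(x - 3).
grad_f = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(grad_f, x0=0.0))  # converges toward 3.0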
Convergence
As long as the step size satisfies α ≤ 1/L, where L is the Lipschitz constant of the gradient ∇f, this method will converge.
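As an illustration of this bound (a sketch with assumed values, not from the original page): for f(x) = (L/2)x² the gradient Lx is Lipschitz with constant L, so a step size at or below 1/L shrinks the error at every step, while a step size above 2/L makes the iterates grow.

def run(alpha, L=10.0, x=1.0, steps=50):
    # For f(x) = (L/2) x^2 the gradient is L*x; its Lipschitz constant is L.
    for _ in range(steps):
        x = x - alpha * (L * x)
    return x

print(run(alpha=0.05))  # alpha <= 1/L = 0.1: converges toward 0
print(run(alpha=0.30))  # alpha > 2/L = 0.2: the iterates diverge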
Usage
This method approximates a local minimum, not necessarily the global minimum. As a result, objective functions used in statistics are often required to be convex, so that the only local minimum is the global minimum.
In statistical contexts, the objective function that should be minimized is the average loss over N observations, e.g. mean squared error:
L(θ) = (1/N) Σ_{i=1}^{N} (ŷ_i(θ) − y_i)²
The gradient is then the average of the per-observation gradients:
∇L(θ) = (1/N) Σ_{i=1}^{N} ∇_θ (ŷ_i(θ) − y_i)²
For large N this usually presents as a prohibitive computation. Instead, the full gradient is estimated as the minibatch gradient, using a random sample of the observations.
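A sketch of the minibatch estimate in Python, assuming a linear model y ≈ Xθ with mean squared error (the function name, data shapes, and hyperparameters are illustrative assumptions):

import numpy as np

def minibatch_sgd(X, y, theta, alpha=0.01, batch_size=32, epochs=10, seed=0):
    # Minibatch gradient descent for mean squared error with a linear model.
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        perm = rng.permutation(n)  # random order of observations for this pass
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Minibatch estimate of the full gradient: average over the sampled rows only.
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)
            theta = theta - alpha * grad
    return theta

# Illustrative usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)
print(minibatch_sgd(X, y, theta=np.zeros(3)))  # approaches [1.0, -2.0, 0.5]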