Gradient Descent

Gradient descent is an iterative method for approximating a local minimum of an objective function.


Description

The critical points of a differentiable function can be found by setting its gradient to zero. For many functions, however, this equation cannot be solved in practice.

Given a position x_n and an objective function f that is differentiable around x_n, the gradient ∇f(x_n) can be calculated. It follows that, for a small enough step size α, the update

x_{n+1} = x_n - α ∇f(x_n)

satisfies

f(x_{n+1}) < f(x_n)

The step size α must be small enough that the update does not overshoot the intended local minimum and land in the basin of another local minimum.

Note furthermore that the step size parameter does not hold the step distance constant. The actual step taken is the step size times the gradient vector, so the gradient's magnitude is also a component of the step distance.

By iteratively following the direction of steepest descent in small steps, a local minimum is approximated.
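
As a minimal sketch of the iteration, consider minimizing f(x, y) = x² + y², whose gradient is (2x, 2y) and whose minimum is at the origin; the starting point, step size, and iteration count below are arbitrary illustrative choices.

    def grad_f(x, y):
        # Gradient of f(x, y) = x**2 + y**2, minimized at the origin.
        return 2.0 * x, 2.0 * y

    x, y = 3.0, -4.0    # arbitrary starting position
    alpha = 0.1         # step size, small enough for this particular f

    for _ in range(100):
        gx, gy = grad_f(x, y)
        x, y = x - alpha * gx, y - alpha * gy

    print(x, y)    # both coordinates approach 0.0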

Convergence

As long as the step size is less than or equal to 1/L, where L is the Lipschitz constant of the gradient ∇f, each step decreases f and the method converges to a stationary point.
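
As an illustrative check of this bound, consider f(x) = (L/2)x², whose gradient Lx is Lipschitz continuous with constant L; the particular value of L and the starting point below are arbitrary choices.

    L = 4.0    # illustrative Lipschitz constant of the gradient

    def grad(x):
        # Gradient of f(x) = (L / 2) * x**2, Lipschitz with constant L.
        return L * x

    def run(alpha, steps=50, x=1.0):
        # Iterate gradient descent from an arbitrary starting point.
        for _ in range(steps):
            x = x - alpha * grad(x)
        return x

    print(run(1.0 / L))    # within the bound: converges (here in one step)
    print(run(2.2 / L))    # beyond 2/L: the iterates diverge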


Usage

This method approximates local minima, not necessarily the global minimum. As a result, objective functions used in statistics are often required to be convex, which ensures that the only local minimum is the global minimum.

In statistical contexts, the objective function that should be minimized is the average loss, e.g. mean squared error:

MSE(θ) = (1/N) Σ_i (ŷ_i(θ) − y_i)²

where y_i are the N observed values and ŷ_i(θ) are the model's predictions under the parameters θ.

The gradient with respect to the parameters θ is then:

∇MSE(θ) = (1/N) Σ_i ∇(ŷ_i(θ) − y_i)² = (2/N) Σ_i (ŷ_i(θ) − y_i) ∇ŷ_i(θ)

i.e., the average of the per-observation gradients.
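
As a sketch of these formulas, the following computes the mean squared error and its full gradient for a simple linear model ŷ = wx + b; the synthetic data and the model itself are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=100)                             # hypothetical predictor
    Y = 2.0 * X + 1.0 + rng.normal(scale=0.1, size=100)  # hypothetical responses

    def mse_and_gradient(w, b):
        # Residuals of the assumed linear model yhat = w * x + b.
        residuals = (w * X + b) - Y
        loss = (residuals ** 2).mean()
        # The gradient of the average loss is the average of the
        # per-observation gradients (chain rule on each squared residual).
        dw = (2.0 * residuals * X).mean()
        db = (2.0 * residuals).mean()
        return loss, dw, db

    print(mse_and_gradient(0.0, 0.0))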

Computing this sum over every observation is usually prohibitively expensive. Instead, the 'full gradient' is estimated by a minibatch gradient, computed over a random sample of the observations.
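
Continuing the sketch above, a minibatch step estimates the same gradient from a random sample of the observations; the batch size and step size are again illustrative choices.

    def minibatch_step(w, b, alpha=0.05, batch_size=16):
        # Estimate the full gradient from a random sample of observations.
        idx = rng.choice(X.size, size=batch_size, replace=False)
        residuals = (w * X[idx] + b) - Y[idx]
        dw = (2.0 * residuals * X[idx]).mean()
        db = (2.0 * residuals).mean()
        return w - alpha * dw, b - alpha * db

    w, b = 0.0, 0.0
    for _ in range(500):
        w, b = minibatch_step(w, b)

    print(w, b)    # approaches the generating parameters (2.0, 1.0)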


CategoryRicottone