Gradient Descent

Gradient descent is an iterative method for approximating minima of an objective function.


Description

Critical points can be calculated by setting the gradient to zero. For many differentiable functions, however, solving that equation analytically is impractical.

Given a position xn and an objective function f that is differentiable around xn, the gradient ∇f(xn) can be calculated. For a 'small enough' step size α, the update

xn+1 = xn - α ∇f(xn)

guarantees a decrease:

f(xn+1) < f(xn)

The step size α must be small enough that the update does not overshoot the intended local minimum and land in the basin of another.

Note furthermore that the step size parameter does not hold the step distance constant. The actual step taken is the product of the step size and the gradient vector, so the gradient's magnitude is also a component in the distance moved.

By iteratively following the direction of steepest descent in small steps, a local minimum is approximated.
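The iteration can be sketched in Python; the quadratic objective, starting point, and step size below are illustrative choices, not part of the method itself:

```python
def gradient_descent(grad, x0, alpha=0.1, iterations=1000):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(iterations):
        x = x - alpha * grad(x)
    return x

# Example: f(x) = (x - 3)**2 has gradient 2*(x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each iteration moves x a fraction α of the way along the negative gradient, so the steps shrink as the gradient flattens near the minimum.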

Convergence

As long as the step size is at most 1/L, where L is the Lipschitz constant of the gradient ∇f, each step is guaranteed to decrease f and the method converges to a stationary point; for a convex objective, that is the global minimum.

When L is unknown, a conservative step size such as 0.1 or lower is a common starting point in practice.
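A sketch of this bound on an assumed quadratic objective, where the Lipschitz constant of the gradient is known exactly:

```python
# For f(x) = (x - 3)**2 the gradient 2*(x - 3) is Lipschitz with L = 2,
# so any step size up to 1/L = 0.5 is guaranteed to converge.
def step(x, alpha):
    return x - alpha * 2 * (x - 3)

x_good = x_bad = 0.0
for _ in range(50):
    x_good = step(x_good, 0.5)   # alpha = 1/L: converges
    x_bad = step(x_bad, 1.1)     # alpha too large: overshoots and diverges
```

With α = 1/L the iterates settle at x = 3, while the oversized step repeatedly jumps past the minimum and moves further away each time.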


Usage

This method approximates a local minimum, not necessarily the global minimum. As a result, objective functions used in statistics are often required to be convex, which ensures that any local minimum is the global minimum.

When the objective cannot be convex, local minima are avoided by repeating the method from multiple randomized 'starting points' and keeping the best result.
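A minimal sketch of such random restarts, using a hypothetical one-dimensional non-convex objective with two local minima:

```python
import random

def f(x):
    # Hypothetical non-convex objective with two local minima.
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, alpha=0.01, iterations=2000):
    for _ in range(iterations):
        x = x - alpha * grad_f(x)
    return x

random.seed(0)
starts = [random.uniform(-2.0, 2.0) for _ in range(20)]
best = min((descend(x0) for x0 in starts), key=f)  # lowest objective wins
```

Runs that start in the basin of the shallower minimum converge there; taking the minimum of f over all restarts recovers the deeper (global) one.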

Minibatch Gradient

In statistical contexts, the objective function that should be minimized is the average loss over the n observations, e.g. mean squared error:

L(θ) = (1/n) Σi (yi - ŷi)²

where ŷi is the model's prediction for observation i under parameters θ.

Because differentiation is linear, the gradient is then the average of the per-observation gradients:

∇L(θ) = (1/n) Σi ∇(yi - ŷi)²

Importantly, this requires calculating a gradient for every observation, which is generally impractical for large datasets.

Instead, the 'full gradient' is estimated as the minibatch gradient, using a random sample of the observations. As the size of the sample grows, the minibatch gradient converges to the full gradient. Regardless of the sample size, this is an unbiased estimator. Taken to the extreme, when the sample size is 1, this is called stochastic gradient descent (SGD).
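The minibatch estimator can be illustrated on synthetic data; the one-parameter least-squares model and the sample size of 256 below are hypothetical choices:

```python
import random

random.seed(1)

# Hypothetical noise-free data from y = 2*x; model y_hat = theta * x.
xs = [random.uniform(0.0, 1.0) for _ in range(10000)]
data = [(x, 2.0 * x) for x in xs]

def avg_grad(sample, theta):
    # Average gradient of the squared error (theta*x - y)**2 over the sample.
    return sum(2.0 * (theta * x - y) * x for x, y in sample) / len(sample)

theta = 0.0
full = avg_grad(data, theta)                     # full gradient: all n observations
mini = avg_grad(random.sample(data, 256), theta)  # minibatch estimate
```

The minibatch gradient fluctuates around the full gradient from sample to sample, but its expectation equals the full gradient at any sample size.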

SGD generally requires that step size decrease over iterations to guarantee convergence.
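A sketch of SGD with a decreasing step size, again on a hypothetical one-parameter least-squares model; the 1/(1+t) schedule is one common choice, not the only one:

```python
import random

random.seed(0)

# Hypothetical noise-free data from y = 2*x; model y_hat = theta * x.
xs = [random.uniform(0.0, 1.0) for _ in range(100)]
data = [(x, 2.0 * x) for x in xs]

def sgd(data, epochs=50):
    theta = 0.0
    t = 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:  # sample size 1: stochastic gradient descent
            t += 1
            alpha = 1.0 / (1 + t)  # decreasing step size schedule
            grad = 2.0 * (theta * x - y) * x  # gradient of the squared error
            theta -= alpha * grad
    return theta

theta = sgd(data)
```

The shrinking step size damps the noise of the single-observation gradients, letting theta settle near the minimizer rather than bouncing around it.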


CategoryRicottone

Calculus/GradientDescent (last edited 2025-03-27 19:58:18 by DominicRicottone)