Gradient Descent
Gradient descent is an iterative method for approximating local minima of a differentiable objective function.
Description
By setting the gradient to zero, critical points can be calculated. For many differentiable functions, however, this equation cannot be solved in practice, which motivates an iterative approach.
Given a position x_n and an objective function f that is differentiable around x_n, the gradient ∇f(x_n) can be calculated. It follows that for a small enough α:
x_{n+1} = x_n − α ∇f(x_n)
f(x_{n+1}) < f(x_n)
will hold true.
The step size α must be small enough that it does not move x_{n+1} past the intended local minimum and into the proximity of another local minimum.
Note furthermore that the step size parameter does not hold the step distance constant. The actual step taken is the product of the step size parameter and the gradient vector, so the gradient's magnitude is also a component of the step distance.
By iteratively following the direction of steepest descent in small steps, a local minimum is approximated.
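A minimal sketch of this iteration in Python (the quadratic objective and all names below are illustrative assumptions, not part of the original page):

import numpy as np

def gradient_descent(grad, x0, alpha=0.1, steps=100):
    # Repeatedly step against the gradient: x_{n+1} = x_n - alpha * grad_f(x_n)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Example objective f(x) = (x - 3)^2, whose gradient is 2(x - 3).
grad_f = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(grad_f, x0=0.0))  # converges toward 3.0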
Convergence
As long as the step size satisfies α ≤ 1/L, where L is the Lipschitz constant of the gradient ∇f, this method will converge.
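As an illustration of this bound (a sketch with assumed values, not from the original page): for f(x) = (L/2)x² the gradient Lx is Lipschitz with constant L, so a step size at or below 1/L shrinks the error at every step, while a step size above 2/L makes the iterates grow.

def run(alpha, L=10.0, x=1.0, steps=50):
    # For f(x) = (L/2) x^2 the gradient is L*x; its Lipschitz constant is L.
    for _ in range(steps):
        x = x - alpha * (L * x)
    return x

print(run(alpha=0.05))  # alpha <= 1/L = 0.1: converges toward 0
print(run(alpha=0.30))  # alpha > 2/L = 0.2: the iterates diverge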
Usage
This method approximates a local minimum, not necessarily the global minimum. As a result, objective functions used in statistics are often required to be convex, so that the only local minimum is the global minimum.
In statistical contexts, the objective function that should be minimized is the average loss over N observations, e.g. mean squared error:
L(θ) = (1/N) Σ_{i=1}^{N} (ŷ_i(θ) − y_i)²
The gradient is then the average of the per-observation gradients:
∇L(θ) = (1/N) Σ_{i=1}^{N} ∇_θ (ŷ_i(θ) − y_i)²
For large N this usually presents as a prohibitive computation. Instead, the full gradient is estimated as the minibatch gradient, using a random sample of the observations.
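A sketch of the minibatch estimate in Python, assuming a linear model y ≈ Xθ with mean squared error (the function name, data shapes, and hyperparameters are illustrative assumptions):

import numpy as np

def minibatch_sgd(X, y, theta, alpha=0.01, batch_size=32, epochs=10, seed=0):
    # Minibatch gradient descent for mean squared error with a linear model.
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        perm = rng.permutation(n)  # random order of observations for this pass
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Minibatch estimate of the full gradient: average over the sampled rows only.
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)
            theta = theta - alpha * grad
    return theta

# Illustrative usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)
print(minibatch_sgd(X, y, theta=np.zeros(3)))  # approaches [1.0, -2.0, 0.5]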