= Gradient Descent =

'''Gradient descent''' is an iterative method for approximating minima of an objective function.

<<TableOfContents>>

----

== Description ==

By setting a [[Calculus/Gradient|gradient]] to zero, critical points can be calculated analytically. Some differentiable functions cannot practically be solved this way, however; gradient descent approximates a minimum numerically instead.

Given a position ''x,,n,,'' and an objective function ''f'' that is differentiable around ''x,,n,,'', the gradient ''∇f(x,,n,,)'' can be calculated. It follows that, for a small enough ''α'', the update

''x,,n+1,, = x,,n,, - α ∇f(x,,n,,)''

satisfies ''f(x,,n+1,,) < f(x,,n,,)'', provided the gradient at ''x,,n,,'' is nonzero.

The '''step size''' ''α'' must be small enough that it does not move ''x,,n+1,,'' past the intended local minimum and into the proximity of another. Note furthermore that the step size parameter does ''not'' hold the step distance constant: the actual step taken is the product of the step size and the gradient vector, so the gradient's magnitude is also a component of the step distance.

By iteratively following the direction of steepest descent in small steps, a local minimum is approximated; see the sketches in the examples below.

=== Convergence ===

As long as the step size is less than or equal to ''1/L'', where ''L'' is the [[Calculus/LipschitzContinuity|Lipschitz constant]] of the gradient ''∇f'', this method will converge. Since ''L'' is rarely known in practice, setting ''α'' to 0.1 or lower is the general best practice.

----

== Usage ==

This method approximates local minima, not necessarily the global minimum. As a result, objective functions used in statistics are often required to be '''convex''', ensuring that the only local minimum is the global minimum. When the objective cannot be convex, the method is instead repeated from randomized starting points to avoid poor local minima, as sketched in the examples below.

=== Minibatch Gradient ===

In statistical contexts, the objective function that should be minimized is the average loss, e.g. mean squared error:

{{attachment:loss.svg}}

The gradient is then formulated as:

{{attachment:lossgradient.svg}}

Importantly, this requires calculating the gradient for every observation, which is generally impractical for large datasets. Instead, the full gradient is estimated by the '''minibatch gradient''', computed over a random sample of the observations. As the size of the sample grows, the minibatch gradient converges to the full gradient; regardless of the sample size, it is an unbiased estimator of the full gradient.

Taken to the extreme, when the sample size is 1, this is called '''stochastic gradient descent''' ('''SGD'''). SGD generally requires that the step size decrease over iterations to guarantee convergence.

----
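== Examples ==

The update rule above can be demonstrated in a few lines of Python. This is a minimal sketch, not a library implementation: the quadratic objective ''f(x) = x^2^ + 2x'' and its gradient ''2x + 2'' are hypothetical examples, chosen so that the exact minimum (''x = -1'') is known.

{{{#!python
def gradient_descent(grad, x0, alpha=0.1, iterations=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(iterations):
        x = x - alpha * grad(x)  # x_{n+1} = x_n - alpha * grad(x_n)
    return x

# Minimize the hypothetical objective f(x) = x^2 + 2x.
print(gradient_descent(grad=lambda x: 2 * x + 2, x0=5.0))  # ~ -1.0
}}}

The default step size of 0.1 satisfies the convergence condition here: the gradient ''2x + 2'' has Lipschitz constant ''L = 2'', so any ''α ≤ 0.5'' converges.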
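When the objective is not convex, the same routine can be repeated from randomized starting points, keeping the best candidate. The objective here, ''f(x) = x^4^ - 3x^2^ + x'', is again hypothetical; it has two local minima, only one of which is global.

{{{#!python
import random

random.seed(0)

f = lambda x: x**4 - 3 * x**2 + x        # hypothetical non-convex objective
grad_f = lambda x: 4 * x**3 - 6 * x + 1  # its gradient

# Run gradient_descent (defined above) from 20 random starting points,
# then keep the candidate with the lowest objective value.
candidates = [gradient_descent(grad_f, x0=random.uniform(-2, 2), alpha=0.01)
              for _ in range(20)]
print(min(candidates, key=f))  # ~ -1.30, the global minimum
}}}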
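Finally, a minimal sketch of the minibatch gradient, assuming a linear model ''y ≈ Xβ'' fit by mean squared error. The data and all names (''X'', ''y'', ''beta'', ''batch_size'') are illustrative, not part of any library API. The step size decays as ''α/√t'' to reflect the decreasing-step requirement noted above.

{{{#!python
import numpy as np

rng = np.random.default_rng(seed=0)

def minibatch_descent(X, y, alpha=0.1, batch_size=16, steps=500):
    """Minimize mean squared error using minibatch gradient estimates."""
    n, p = X.shape
    beta = np.zeros(p)
    for t in range(1, steps + 1):
        idx = rng.choice(n, size=batch_size, replace=False)  # random sample
        residuals = X[idx] @ beta - y[idx]
        # Unbiased estimate of the full MSE gradient, computed from the sample only
        grad = (2.0 / batch_size) * X[idx].T @ residuals
        beta -= (alpha / np.sqrt(t)) * grad  # decaying step size
    return beta

# Synthetic data with known coefficients (3, -2) plus noise.
X = rng.normal(size=(1000, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=1000)
print(minibatch_descent(X, y))  # ~ [3, -2]
}}}

Setting ''batch_size'' to 1 turns this into stochastic gradient descent; setting it to ''n'' recovers full-batch gradient descent, since sampling without replacement would then use every observation.

----

CategoryRicottone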