The goal of gradient descent is to find the minimizer $x^*$ of a function $f(x)$, i.e. $x^* = \arg\min_x f(x)$.

We start with a random initial point $x^{[0]}$ and improve it step by step.

For the direction to move in, we choose the negative gradient

$$d^{[t]} = -\nabla f(x^{[t]})$$

and compute further steps with

$$x^{[t+1]} - x^{[t]} = -\lambda \nabla f(x^{[t]}),$$

which means: the next x minus the current x points along the negative gradient, so we reduce the function value step by step. The step is equal to the stepsize $\lambda$ times the negative gradient.

Therefore, the next x is calculated as:

$$x^{[t+1]} = x^{[t]} - \lambda \nabla f(x^{[t]})$$

or alternatively:

$$x^{[t+1]} = x^{[t]} + \lambda d^{[t]} \quad \text{with } d^{[t]} = -\nabla f(x^{[t]})$$
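As a minimal sketch of this update rule in Python (the helper `gradient_descent`, the test function $f(x) = x^2$, and the chosen stepsize are illustrative assumptions, not part of the notes above):

```python
import numpy as np

def gradient_descent(grad_f, x0, stepsize=0.1, n_steps=100):
    """Iteratively apply x[t+1] = x[t] - stepsize * grad_f(x[t])."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - stepsize * grad_f(x)
    return x

# Example: f(x) = x^2 with gradient 2x; the minimum is at x = 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0, stepsize=0.1, n_steps=100)
print(x_min)  # very close to 0
```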

Understanding stepsize

Can we set the stepsize simply to 1?

Not in general. If the stepsize is too large, the function value at the next point may not decrease; the iterates can jump back and forth or even diverge, and the process gets stuck without making progress.

The stepsize can also change during the course of gradient descent (for example, via a decaying schedule or a line search).
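A small experiment, assuming $f(x) = x^2$ (gradient $2x$) as the test function, makes the stepsize question concrete: with stepsize 1 the iterate jumps between $5$ and $-5$ and the function value never decreases, while a smaller stepsize converges.

```python
def run_gd(stepsize, x0=5.0, n_steps=10):
    """Gradient descent on f(x) = x^2 (gradient 2x), recording f at each step."""
    x = x0
    values = []
    for _ in range(n_steps):
        x = x - stepsize * 2 * x
        values.append(x ** 2)
    return values

print(run_gd(stepsize=1.0))  # f stays at 25.0 forever: x jumps between 5 and -5
print(run_gd(stepsize=0.1))  # f decreases steadily toward 0
```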

How many minima can a function have?

Functions can have several local minima, i.e. points $\tilde{x}$ with $f(\tilde{x}) \le f(x)$ for all $x$ in a neighborhood of $\tilde{x}$.

A function is convex when the straight line connecting any two points on its graph never lies below the graph. This is true for functions like $f(x) = x^2$. Functions with more than one minimum whose minima have different function values are never convex. Yet a function can be convex and still have more than one minimum, provided all minima have the same function value.

  • For a convex function, all local minima have the same function value.
  • A twice-differentiable function is convex if its Hessian is positive semidefinite everywhere.
  • A matrix $A$ is positive semidefinite if $v^\top A v \ge 0$ for all vectors $v$.
  • Linear regression with the RSS/L2/quadratic loss is convex (see the check after this list).
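The last two points can be checked numerically. The sketch below assumes a small made-up design matrix X and uses the fact that the Hessian of the RSS loss $L(\beta) = \|y - X\beta\|^2$ is $2X^\top X$, which is positive semidefinite exactly when all its eigenvalues are non-negative:

```python
import numpy as np

# Hypothetical design matrix for a small linear-regression problem.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])

# For the RSS loss L(beta) = ||y - X beta||^2, the Hessian is 2 X^T X.
H = 2 * X.T @ X

# A symmetric matrix is positive semidefinite iff all eigenvalues are >= 0.
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)                    # all non-negative
print(np.all(eigenvalues >= -1e-10))  # True -> the RSS loss is convex in beta
```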

GD will find the global minimum of a convex function.

For non-convex functions, it might only find a local minimum, depending on the starting point.
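To illustrate the difference, here is a short sketch; the non-convex test function $f(x) = x^4 - 2x^2 + 0.5x$, the stepsize, and the two start points are illustrative assumptions:

```python
def gd(grad_f, x0, stepsize=0.01, n_steps=1000):
    """Plain gradient descent: x[t+1] = x[t] - stepsize * grad_f(x[t])."""
    x = x0
    for _ in range(n_steps):
        x = x - stepsize * grad_f(x)
    return x

# Non-convex example: f(x) = x^4 - 2x^2 + 0.5x has two minima with
# different function values (the left one is the global minimum).
f = lambda x: x**4 - 2 * x**2 + 0.5 * x
grad_f = lambda x: 4 * x**3 - 4 * x + 0.5

for x0 in (-2.0, 2.0):
    x_end = gd(grad_f, x0)
    print(f"start {x0:+.1f} -> x = {x_end:+.3f}, f(x) = {f(x_end):.3f}")
# Starting at -2 reaches the global minimum (near x ~ -1.06);
# starting at +2 only reaches the local minimum (near x ~ +0.93).
```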