The General Form

$$\theta_{t+1} = \theta_t + \Delta\theta_t$$

where $\Delta\theta_t$ is computed from the gradient $\nabla_\theta L(\theta_t)$.
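
A minimal sketch of this generic loop, assuming NumPy arrays and a hypothetical `compute_delta` callback standing in for whichever optimizer rule is used:

```python
import numpy as np

def training_step(theta, grad, compute_delta):
    """Generic update: theta_{t+1} = theta_t + delta_t,
    where delta_t is derived from the gradient by the optimizer rule."""
    delta = compute_delta(grad)          # optimizer-specific: SGD, momentum, Adam, ...
    return theta + delta

# Example: plain SGD as the delta rule (0.001 plays the role of the learning rate)
theta = np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])        # pretend this came from backprop
theta = training_step(theta, grad, lambda g: -0.001 * g)
```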

Stochastic Gradient Descent (SGD)

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

where:

  • $\eta$ = learning rate (hyperparameter, e.g., 0.001)
  • $\nabla_\theta L(\theta_t)$ = the gradient you computed
  • the sign is negative because the gradient points in the direction of greater loss, so we step the opposite way (see the sketch below)
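
A minimal NumPy sketch of one SGD step (the names `theta`, `grad`, `lr`, and `sgd_step` are illustrative):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.001):
    """One SGD update: step against the gradient, scaled by the learning rate."""
    return theta - lr * grad

theta = np.array([1.0, -2.0])
grad = np.array([0.3, -0.7])             # dL/dtheta at the current parameters
theta = sgd_step(theta, grad)            # each parameter moves opposite its gradient
```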

SGD with Momentum

Adds “velocity” to smooth updates:

$$v_t = \beta \, v_{t-1} + \nabla_\theta L(\theta_t)$$

$$\theta_{t+1} = \theta_t - \eta \, v_t$$

where:

  • $v_t$ = velocity (exponential moving average of gradients)
  • $\beta$ = momentum coefficient (e.g., 0.9; see the sketch below)
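
A minimal sketch of momentum SGD under the formulation above, keeping the velocity as state that persists between steps (names and the toy gradients are illustrative):

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.001, beta=0.9):
    """One momentum update: the velocity accumulates a decaying sum of past
    gradients, and the parameters move against that smoothed direction."""
    velocity = beta * velocity + grad    # decaying accumulation of gradients
    theta = theta - lr * velocity
    return theta, velocity

theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)          # start with zero velocity
for grad in [np.array([0.3, -0.7]), np.array([0.2, -0.6])]:
    theta, velocity = momentum_step(theta, grad, velocity)
```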

Adam

Combines momentum + adaptive learning rates:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \quad \text{(first moment, like momentum)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \quad \text{(second moment, squared gradients)}$$

where $g_t = \nabla_\theta L(\theta_t)$ is the gradient at step $t$.

Bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update:

$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where:

  • $\beta_1 = 0.9$ (first moment decay)
  • $\beta_2 = 0.999$ (second moment decay)
  • $\epsilon = 10^{-8}$ (numerical stability)
  • Adam adapts the learning rate per parameter based on gradient history (see the sketch below).
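
Putting the pieces together, a minimal sketch of one Adam step, with the state `m`, `v`, and the step counter `t` carried between calls (names and the toy gradients are illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-like first moment, squared-gradient second
    moment, bias correction, then a per-parameter scaled step."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (like momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t, grad in enumerate([np.array([0.3, -0.7]), np.array([0.2, -0.6])], start=1):
    theta, m, v = adam_step(theta, grad, m, v, t)
```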