The General Form

$$\theta_{t+1} = \theta_t + \Delta\theta_t$$

where $\Delta\theta_t$ is computed from the gradient $\nabla_\theta L(\theta_t)$.
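
A minimal sketch of this generic loop, assuming NumPy arrays and a hypothetical `compute_delta` callback standing in for whichever optimizer rule is used:

```python
import numpy as np

def training_step(theta, grad, compute_delta):
    """Generic update: theta_{t+1} = theta_t + delta_t,
    where delta_t is derived from the gradient by the optimizer rule."""
    delta = compute_delta(grad)          # optimizer-specific: SGD, momentum, Adam, ...
    return theta + delta

# Example: plain SGD as the delta rule (0.001 plays the role of the learning rate)
theta = np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])        # pretend this came from backprop
theta = training_step(theta, grad, lambda g: -0.001 * g)
```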

Stochastic Gradient Descent (SGD)

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

where:

  • $\eta$ = learning rate (hyperparameter, e.g., 0.001)
  • $\nabla_\theta L(\theta_t)$ = the gradient you computed
  • the sign is negative because the gradient points in the direction of greater loss, so we step the opposite way (see the sketch below)
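
A minimal NumPy sketch of one SGD step (the names `theta`, `grad`, `lr`, and `sgd_step` are illustrative):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.001):
    """One SGD update: step against the gradient, scaled by the learning rate."""
    return theta - lr * grad

theta = np.array([1.0, -2.0])
grad = np.array([0.3, -0.7])             # dL/dtheta at the current parameters
theta = sgd_step(theta, grad)            # each parameter moves opposite its gradient
```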

SGD with Momentum

Adds “velocity” to smooth updates:

$$v_t = \beta \, v_{t-1} + \nabla_\theta L(\theta_t)$$

$$\theta_{t+1} = \theta_t - \eta \, v_t$$

where:

  • $v_t$ = velocity (exponential moving average of gradients)
  • $\beta$ = momentum coefficient (e.g., 0.9; see the sketch below)
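
A minimal sketch of momentum SGD under the formulation above, keeping the velocity as state that persists between steps (names and the toy gradients are illustrative):

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.001, beta=0.9):
    """One momentum update: the velocity accumulates a decaying sum of past
    gradients, and the parameters move against that smoothed direction."""
    velocity = beta * velocity + grad    # decaying accumulation of gradients
    theta = theta - lr * velocity
    return theta, velocity

theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)          # start with zero velocity
for grad in [np.array([0.3, -0.7]), np.array([0.2, -0.6])]:
    theta, velocity = momentum_step(theta, grad, velocity)
```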

Adam

Combines momentum + adaptive learning rates:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \quad \text{(first moment, like momentum)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \quad \text{(second moment, squared gradients)}$$

where $g_t = \nabla_\theta L(\theta_t)$ is the gradient at step $t$.

Bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update:

$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where:

  • $\beta_1 = 0.9$ (first moment decay)
  • $\beta_2 = 0.999$ (second moment decay)
  • $\epsilon = 10^{-8}$ (numerical stability)
  • Adam adapts the learning rate per parameter based on gradient history (see the sketch below).
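
Putting the pieces together, a minimal sketch of one Adam step, with the state `m`, `v`, and the step counter `t` carried between calls (names and the toy gradients are illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-like first moment, squared-gradient second
    moment, bias correction, then a per-parameter scaled step."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (like momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t, grad in enumerate([np.array([0.3, -0.7]), np.array([0.2, -0.6])], start=1):
    theta, m, v = adam_step(theta, grad, m, v, t)
```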