The General Form
$$\theta_{t+1} = \theta_t + \Delta\theta_t$$

where the update $\Delta\theta_t$ is computed from the gradient $\nabla_\theta L(\theta_t)$.
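As a concrete sketch (using a toy quadratic loss $L(\theta) = \lVert\theta\rVert^2$ and helper names like `compute_update` that are illustrative, not from the original), every optimizer in this section differs only in how the update is computed from the gradient:

```python
import numpy as np

def compute_gradient(theta):
    # Gradient of the toy loss L(theta) = sum(theta**2).
    return 2 * theta

def train(theta, compute_update, num_steps=100):
    # Generic loop: each optimizer only changes how `compute_update`
    # turns the gradient into a parameter update (delta theta).
    for _ in range(num_steps):
        grad = compute_gradient(theta)
        theta = theta + compute_update(grad)
    return theta

# Plain gradient descent as the simplest choice of update rule.
theta = train(np.array([3.0, -1.5]), lambda grad: -0.1 * grad)
print(theta)  # close to the minimum at [0, 0]
```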
Stochastic Gradient Descent (SGD)
$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L(\theta_t)$$

where:
- $\eta$ = learning rate (hyperparameter, e.g., 0.001)
- $\nabla_\theta L(\theta_t)$ = the gradient you computed
- The sign is negative because the gradient points in the direction of increasing loss, so we step the opposite way.
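A minimal SGD step in NumPy (an illustrative sketch; the gradient is assumed to come from backpropagation elsewhere, and `sgd_step` is a hypothetical helper name):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.001):
    # Move against the gradient, since the gradient points toward higher loss.
    return theta - lr * grad

# Toy usage: one step on L(theta) = sum(theta**2), whose gradient is 2*theta.
theta = np.array([1.0, -2.0])
theta = sgd_step(theta, grad=2 * theta, lr=0.1)
print(theta)  # [0.8, -1.6]: each parameter moves toward the minimum at 0
```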
SGD with Momentum
Adds “velocity” to smooth updates:

$$v_t = \beta v_{t-1} + (1 - \beta)\,\nabla_\theta L(\theta_t)$$

$$\theta_{t+1} = \theta_t - \eta\, v_t$$
where:
- $v_t$ = velocity (exponential moving average of gradients)
- $\beta$ = momentum coefficient (e.g., 0.9)
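A sketch of the momentum update as a step function that carries the velocity between calls (it follows the exponential-moving-average form above; `momentum_step` is an illustrative name, not from the original):

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.001, beta=0.9):
    # Velocity: exponential moving average of gradients.
    velocity = beta * velocity + (1 - beta) * grad
    # Step along the (negated) velocity instead of the raw gradient.
    theta = theta - lr * velocity
    return theta, velocity

# Toy usage on L(theta) = sum(theta**2); velocity starts at zero.
theta, velocity = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, velocity = momentum_step(theta, 2 * theta, velocity, lr=0.1)
print(theta)  # converges toward the minimum at [0, 0]
```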
Adam
Combines momentum + adaptive learning rates:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \quad \text{(first moment, like momentum)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \quad \text{(second moment, squared gradients)}$$

where $g_t = \nabla_\theta L(\theta_t)$ is the gradient at step $t$.

Bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update:

$$\theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where:
- $\beta_1 = 0.9$ (first moment decay)
- $\beta_2 = 0.999$ (second moment decay)
- $\epsilon = 10^{-8}$ (numerical stability)
- Adam adapts the learning rate per parameter based on its gradient history.
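A sketch of one Adam step using the formulas above (assumes the step counter `t` starts at 1 so the bias correction is well defined; `adam_step` is an illustrative name):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of gradients (momentum-like).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction counteracts the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: a large second moment shrinks the effective step.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on L(theta) = sum(theta**2); m and v start at zero, t starts at 1.
theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
print(theta)  # approaches the minimum at [0, 0]
```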
