SGD zigzags: takes steps proportional to gradient magnitude — large gradients in steep directions cause oscillation across the valley
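A minimal NumPy sketch of that zigzag on a 2-D quadratic valley; the matrix A, lr, and step count are illustrative choices, not from these notes:

```python
import numpy as np

# Elongated quadratic f(w) = 0.5 * w @ A @ w: curvature is 50x larger
# along w[1] (across the valley) than along w[0] (down the valley).
A = np.diag([1.0, 50.0])
w = np.array([2.0, 1.0])
lr = 0.035  # near the stability limit for the steep w[1] direction

for step in range(5):
    grad = A @ w        # gradient of the quadratic
    w = w - lr * grad   # plain SGD: step proportional to gradient
    print(step, w)      # w[1] flips sign every step (zigzag); w[0] crawls
```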
Momentum smooths: accumulates velocity in consistent directions, dampens oscillation — like a ball rolling downhill
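Continuing the same toy valley, a heavy-ball momentum sketch (beta = 0.9 is a common default; this is the PyTorch-style SGD momentum form):

```python
import numpy as np

# Same valley: v accumulates along the consistent w[0] direction, while
# the alternating w[1] gradients partly cancel, damping the oscillation.
A = np.diag([1.0, 50.0])
w = np.array([2.0, 1.0])
v = np.zeros_like(w)
lr, beta = 0.01, 0.9

for step in range(50):
    grad = A @ w
    v = beta * v + grad   # velocity: decayed sum of past gradients
    w = w - lr * v        # step along the smoothed direction
print(w)                  # approaching the minimum at the origin
```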
AdaGrad adapts: divides each parameter's step by the square root of its accumulated squared gradients; good for sparse features that get rare but informative updates, but the accumulator only grows, so the effective learning rate eventually decays to zero
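A sketch of the AdaGrad update in the usual sqrt-of-sum form; the names (acc, eps) and the toy gradient are illustrative:

```python
import numpy as np

# AdaGrad update: per-parameter lifetime sum of squared gradients.
# Frequently updated parameters get shrinking steps; rarely updated
# (sparse) ones keep large steps. Since acc only grows, steps -> 0.
lr, eps = 0.1, 1e-8

def adagrad_step(w, acc, grad):
    acc = acc + grad ** 2                     # accumulate forever
    w = w - lr * grad / (np.sqrt(acc) + eps)  # per-parameter scaling
    return w, acc

w, acc = np.zeros(2), np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 0.01])  # dense vs. near-sparse coordinate
    w, acc = adagrad_step(w, acc, grad)
```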
RMSProp fixes AdaGrad: replaces the lifetime sum with an exponential moving average of squared gradients, so old gradients fade and the learning rate no longer decays to zero
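The same sketch with AdaGrad's sum swapped for the moving average; rho = 0.9 is a common default:

```python
import numpy as np

# RMSProp update: exponential moving average with decay rho, so old
# squared gradients fade and the step size stays useful indefinitely.
lr, rho, eps = 0.01, 0.9, 1e-8

def rmsprop_step(w, sq_avg, grad):
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2  # EMA of squared grads
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg
```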
Adam = Momentum + RMSProp: combines adaptive rates with momentum and bias correction — generally the best default optimizer
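A sketch of the Adam update using the default hyperparameters from the Adam paper:

```python
import numpy as np

# Adam update: first moment m (momentum) + second moment v (RMSProp),
# with bias correction so the zero-initialized moments don't shrink
# the earliest steps.
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

def adam_step(w, m, v, grad, t):
    m = beta1 * m + (1 - beta1) * grad         # EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction, t >= 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```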
AdamW decouples weight decay: standard Adam applies weight decay through the gradient (L2 regularization), so the decay term gets rescaled by the adaptive denominator and parameters with large gradient history are decayed less. AdamW subtracts the decay directly from the weights; better-behaved regularization, available as torch.optim.AdamW and the standard choice for most transformer training
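A sketch showing the one-line difference from the Adam step above; wd is an illustrative decay coefficient:

```python
import numpy as np

# AdamW update: identical to adam_step above except for the decay term.
# L2-style decay would fold wd * w into grad and get rescaled by the
# adaptive denominator; AdamW subtracts lr * wd * w from the weights
# directly, so decay strength is independent of gradient history.
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 0.01

def adamw_step(w, m, v, grad, t):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * wd * w                          # decoupled weight decay
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # unchanged Adam step
    return w, m, v
```

In PyTorch the equivalent is torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01).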