© 2026 Greg T. Chism · MIT License

Optimization Algorithms — Interactive Explorer

Watch SGD, Momentum, AdaGrad, RMSProp, Adam, and AdamW navigate the same loss landscape — see why adaptive methods converge faster


Optimizers
Training Parameters
η 0.010
Learning rate — scales every gradient step
Momentum
β 0.85
Fraction of previous velocity to carry forward
Adam
β₁ 0.90
1st-moment (mean) decay rate
β₂ 0.999
2nd-moment (variance) decay rate
AdamW
λ 0.010
Weight decay — applied directly to weights, not through gradient
Simulation
Speed Med
What's happening?
Select optimizers and press Play to watch each one navigate the loss landscape. Observe how their paths differ — especially in narrow valleys.
Key Concepts
SGD zigzags: takes steps proportional to gradient magnitude — large gradients in steep directions cause oscillation across the valley
Momentum smooths: accumulates velocity in consistent directions, dampens oscillation — like a ball rolling downhill
AdaGrad adapts: divides each parameter's step by the square root of its accumulated squared gradients, which helps with sparse gradients, but the effective learning rate eventually decays to zero
RMSProp fixes AdaGrad: uses exponential moving average instead of sum — prevents learning rate from decaying to zero
Adam = Momentum + RMSProp: combines adaptive rates with momentum and bias correction — generally the best default optimizer
AdamW decouples weight decay: standard Adam applies weight decay through the gradient (L2 regularization), so the decay term gets rescaled by the adaptive learning rates. AdamW applies the decay directly to the weights, giving better regularization; it is available in PyTorch as torch.optim.AdamW and is the standard choice in most transformer training (minimal sketches of all six update rules follow this list)
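A minimal sketch of the six update rules on a toy 2D quadratic valley, assuming plain NumPy. The hyperparameter names (lr, beta, beta1, beta2, wd, eps) mirror the sliders above, and the helpers (grad, loss, run) are illustrative, not this explorer's actual code.

    import numpy as np

    # Toy loss f(w) = 0.5 * (a*x^2 + b*y^2) with a >> b: a long narrow valley,
    # steep across (x) and shallow along (y), which is what makes SGD zigzag.
    a, b = 20.0, 1.0
    grad = lambda w: np.array([a * w[0], b * w[1]])
    loss = lambda w: 0.5 * (a * w[0]**2 + b * w[1]**2)

    lr, beta, beta1, beta2, eps, wd = 0.01, 0.85, 0.9, 0.999, 1e-8, 0.01

    def sgd(w, g, s, t):
        return w - lr * g                                        # step proportional to raw gradient

    def momentum(w, g, s, t):
        s['v'] = beta * s.get('v', 0.0) + g                      # accumulate velocity
        return w - lr * s['v']

    def adagrad(w, g, s, t):
        s['G'] = s.get('G', 0.0) + g**2                          # unbounded sum of squared grads
        return w - lr * g / (np.sqrt(s['G']) + eps)

    def rmsprop(w, g, s, t):
        s['G'] = beta2 * s.get('G', 0.0) + (1 - beta2) * g**2    # exponential moving average, not a sum
        return w - lr * g / (np.sqrt(s['G']) + eps)

    def adam(w, g, s, t, decoupled_wd=0.0):
        w = w * (1 - lr * decoupled_wd)                          # AdamW: decay weights directly (0 = plain Adam)
        s['m'] = beta1 * s.get('m', 0.0) + (1 - beta1) * g       # 1st moment (mean)
        s['v'] = beta2 * s.get('v', 0.0) + (1 - beta2) * g**2    # 2nd moment (variance)
        m_hat = s['m'] / (1 - beta1**t)                          # bias correction
        v_hat = s['v'] / (1 - beta2**t)
        return w - lr * m_hat / (np.sqrt(v_hat) + eps)

    adamw = lambda w, g, s, t: adam(w, g, s, t, decoupled_wd=wd)

    def run(update, steps=200, w0=(1.0, 1.0)):
        w, state = np.array(w0), {}
        for t in range(1, steps + 1):
            w = update(w, grad(w), state, t)
        return loss(w)

    for name, fn in [('SGD', sgd), ('Momentum', momentum), ('AdaGrad', adagrad),
                     ('RMSProp', rmsprop), ('Adam', adam), ('AdamW', adamw)]:
        print(f'{name:8s} loss after 200 steps: {run(fn):.6f}')

On this noise-free valley Momentum may actually reach the lowest loss; the adaptive methods shine most when gradients are noisy or sparse, which a deterministic quadratic does not capture.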
Step 0
Active 5
Loss Landscape — step 0
Loss landscape renders here: contour plot + optimizer trajectories
SGD Momentum AdaGrad RMSProp Adam AdamW
Convergence Curves — loss vs. step
Convergence curves render here: one line per optimizer, loss over steps
SGD Momentum AdaGrad RMSProp Adam AdamW
Optimizer Status
SGD idle
Momentum idle
AdaGrad idle
RMSProp idle
Adam idle
AdamW idle
Fastest
Slowest
Gradient Norm
Gradient norm renders here
Effective gradient magnitude per optimizer — adaptive methods shrink this over time.
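The downward trend for the adaptive methods can be sketched directly, assuming the panel tracks the adaptively rescaled gradient g / (sqrt(G) + eps), which is an assumption about this explorer's definition: even with a constant raw gradient, AdaGrad's accumulator keeps growing, so the effective gradient shrinks toward zero.

    import numpy as np

    g = np.array([1.0, 0.1])          # constant raw gradient (toy example)
    G, eps = np.zeros(2), 1e-8        # AdaGrad accumulator of squared gradients
    for t in range(1, 1001):
        G += g ** 2                   # unbounded sum, so G only grows
        if t in (1, 10, 100, 1000):
            eff = np.linalg.norm(g / (np.sqrt(G) + eps))
            print(f'step {t:4d}  effective gradient norm: {eff:.4f}')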