SGD zigzags: takes steps proportional to gradient magnitude — large gradients in steep directions cause oscillation across the valley
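A minimal NumPy sketch of that zigzag on a 2-D quadratic valley; the matrix A, lr, and step count are illustrative choices, not from these notes:

```python
import numpy as np

# Elongated quadratic f(w) = 0.5 * w @ A @ w: curvature is 50x larger
# along w[1] (across the valley) than along w[0] (down the valley).
A = np.diag([1.0, 50.0])
w = np.array([2.0, 1.0])
lr = 0.035  # near the stability limit for the steep w[1] direction

for step in range(5):
    grad = A @ w        # gradient of the quadratic
    w = w - lr * grad   # plain SGD: step proportional to gradient
    print(step, w)      # w[1] flips sign every step (zigzag); w[0] crawls
```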
Momentum smooths: accumulates velocity in consistent directions, dampens oscillation — like a ball rolling downhill
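Continuing the same toy valley, a heavy-ball momentum sketch (beta = 0.9 is a common default; this is the PyTorch-style SGD momentum form):

```python
import numpy as np

# Same valley: v accumulates along the consistent w[0] direction, while
# the alternating w[1] gradients partly cancel, damping the oscillation.
A = np.diag([1.0, 50.0])
w = np.array([2.0, 1.0])
v = np.zeros_like(w)
lr, beta = 0.01, 0.9

for step in range(50):
    grad = A @ w
    v = beta * v + grad   # velocity: decayed sum of past gradients
    w = w - lr * v        # step along the smoothed direction
print(w)                  # approaching the minimum at the origin
```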
AdaGrad adapts: divides each parameter's step by the square root of its accumulated squared gradients; good for sparse features that get rare but informative updates, but the accumulator only grows, so the effective learning rate eventually decays to zero
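A sketch of the AdaGrad update in the usual sqrt-of-sum form; the names (acc, eps) and the toy gradient are illustrative:

```python
import numpy as np

# AdaGrad update: per-parameter lifetime sum of squared gradients.
# Frequently updated parameters get shrinking steps; rarely updated
# (sparse) ones keep large steps. Since acc only grows, steps -> 0.
lr, eps = 0.1, 1e-8

def adagrad_step(w, acc, grad):
    acc = acc + grad ** 2                     # accumulate forever
    w = w - lr * grad / (np.sqrt(acc) + eps)  # per-parameter scaling
    return w, acc

w, acc = np.zeros(2), np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 0.01])  # dense vs. near-sparse coordinate
    w, acc = adagrad_step(w, acc, grad)
```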
RMSProp fixes AdaGrad: replaces the lifetime sum with an exponential moving average of squared gradients, so old gradients fade and the learning rate no longer decays to zero
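The same sketch with AdaGrad's sum swapped for the moving average; rho = 0.9 is a common default:

```python
import numpy as np

# RMSProp update: exponential moving average with decay rho, so old
# squared gradients fade and the step size stays useful indefinitely.
lr, rho, eps = 0.01, 0.9, 1e-8

def rmsprop_step(w, sq_avg, grad):
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2  # EMA of squared grads
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg
```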
Adam = Momentum + RMSProp: combines adaptive rates with momentum and bias correction — generally the best default optimizer
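A sketch of the Adam update using the default hyperparameters from the Adam paper:

```python
import numpy as np

# Adam update: first moment m (momentum) + second moment v (RMSProp),
# with bias correction so the zero-initialized moments don't shrink
# the earliest steps.
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

def adam_step(w, m, v, grad, t):
    m = beta1 * m + (1 - beta1) * grad         # EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction, t >= 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```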
AdamW decouples weight decay: standard Adam applies weight decay through the gradient (L2 regularization), so the decay term gets rescaled by the adaptive denominator and parameters with large gradient history are decayed less. AdamW subtracts the decay directly from the weights; better-behaved regularization, available as torch.optim.AdamW and the standard choice for most transformer training
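A sketch showing the one-line difference from the Adam step above; wd is an illustrative decay coefficient:

```python
import numpy as np

# AdamW update: identical to adam_step above except for the decay term.
# L2-style decay would fold wd * w into grad and get rescaled by the
# adaptive denominator; AdamW subtracts lr * wd * w from the weights
# directly, so decay strength is independent of gradient history.
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 0.01

def adamw_step(w, m, v, grad, t):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * wd * w                          # decoupled weight decay
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # unchanged Adam step
    return w, m, v
```

In PyTorch the equivalent is torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01).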