© 2026 Greg T. Chism · MIT License

Vanishing & Exploding Gradients — Interactive Explorer

Watch gradient magnitudes shrink to zero or explode as they propagate backward through deep networks — and see how modern fixes keep training stable


Network
Depth (hidden layers)
Activation Function
Parameters
W 1.00
Weight scale — <1 vanishes, >1 explodes
η 0.010
Learning rate
Simulation
Speed Med
What's happening?
Select depth and activation, then press Play to watch gradients travel backward through the network.
Key Concepts
Why gradients vanish: sigmoid and tanh derivatives max out at 0.25 and 1.0 respectively, and are far smaller away from zero; multiplying one per layer across N layers gives at best 0.25ᴺ for sigmoid. After 10 layers: 0.25¹⁰ ≈ 0.000001.
Why gradients explode: when weight_scale × derivative exceeds 1 at each layer, the product grows exponentially as it travels backward through the layers; each layer amplifies the signal instead of shrinking it.
The chain rule is the culprit: backprop multiplies one factor (weight × activation derivative) per layer, so the gradient at layer 1 is a product of N terms. If every term is below 1, the product shrinks geometrically: N terms of 0.25 gives 0.25ᴺ (see the sketch after this list).
ReLU mostly fixes vanishing: derivative is exactly 1 for positive inputs — doesn't shrink the gradient. But "dead ReLU" neurons (always negative inputs) have zero gradient and never recover.
Modern solutions: Batch Norm, skip connections (ResNets), careful initialization (Xavier/He), and gradient clipping all address the problem from different angles — normalize, bypass, scale correctly, or cap the damage.
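A minimal sketch of the per-layer product described above (the function name and the layer counts, weight scales, and derivative values passed to it are illustrative, not the explorer's internals):

```python
def gradient_at_input(n_layers, weight_scale, deriv):
    """Multiply one (weight_scale * activation derivative) factor per layer, as backprop does."""
    grad = 1.0                                # gradient magnitude seeded at the output layer
    for _ in range(n_layers):
        grad *= weight_scale * deriv
    return grad

# Sigmoid's derivative peaks at 0.25, so even the best case vanishes:
print(gradient_at_input(10, 1.0, 0.25))   # 0.25**10 ≈ 0.000001
# ReLU's derivative is exactly 1 for positive inputs, so the product holds steady:
print(gradient_at_input(10, 1.0, 1.0))    # 1.0
# A weight scale above 1 with derivative 1 explodes instead:
print(gradient_at_input(10, 1.5, 1.0))    # 1.5**10 ≈ 57.7
```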
Epoch 0
Step 0
Backpropagation — gradient magnitude per layer
Network gradient-flow diagram, with arrows colored by gradient magnitude. Layers run from Output through Layer 6, 5, 4, 3, 2 to Input; the output gradient is seeded at 1.000. Legend: Dead (≈0), Healthy, Large, Exploding.
Gradient magnitude over 10 training steps — the gap shows how much less early layers learn
Gradient magnitude history
Legend: Output layer (≈1.0, seed); Input layer (varies, shows learning speed)
Sigmoid — activation function and derivative
σ(x) and σ′(x) — x ∈ [−4, 4]
Activation curve — rendered by D3
Legend: σ(x), σ′(x)
Gradient magnitude per layer (log scale)
Per-layer gradient bars — rendered by D3
Legend: Healthy, Vanished, Exploding
Chain rule — backpropagated gradient product
Run the backward pass on Tab 1 to see the chain rule expansion.
Sigmoid max derivative ≈ 0.25 — each layer multiplies the gradient by at most 0.25. After 6 layers: 0.25⁶ ≈ 0.00024.
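In standard notation (the symbols hₖ, zₖ, wₖ, and N below are ours, not labels from the explorer), the scalar version of that product is:

```latex
\frac{\partial L}{\partial h_1}
  = \frac{\partial L}{\partial h_N} \prod_{k=2}^{N} w_k\, \sigma'(z_k),
\qquad
|\sigma'(z_k)| \le \tfrac{1}{4}
\;\Rightarrow\;
\left|\frac{\partial L}{\partial h_1}\right|
  \le \Bigl(\tfrac{|w|}{4}\Bigr)^{N-1} \left|\frac{\partial L}{\partial h_N}\right|
  \quad\text{for } |w_k| \le |w|.
```

With |w| ≈ 1 and six hidden layers the bound is (1/4)⁶ ≈ 0.00024, the figure quoted above.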
Same deep network — four different gradient stability fixes
Batch Normalization
Normalizes layer inputs to zero mean and unit variance — keeps activation magnitudes from drifting, stabilizing gradients throughout training.
Normalization
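A minimal PyTorch sketch of this fix, with an illustrative 64-unit hidden block (the explorer itself is a D3 visualization, not a PyTorch model):

```python
import torch.nn as nn

# One hidden block with Batch Norm between the linear layer and its activation.
block = nn.Sequential(
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),   # re-center and re-scale each feature to zero mean, unit variance per batch
    nn.Sigmoid(),         # inputs now stay near the steep region of the sigmoid, where σ′ is largest
)
```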
Skip Connections (ResNet)
Identity shortcuts let gradients bypass layers entirely — the gradient highway ensures every layer receives at least a clean copy of the upstream gradient.
Architecture
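A minimal PyTorch sketch of one residual block (the class name and width are illustrative):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # d(output)/dx = I + d(body)/dx: the identity term carries a clean copy of the
        # upstream gradient past self.body, no matter how small the body's own gradient gets.
        return x + self.body(x)
```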
Xavier / He Initialization
Scales initial weights by 1/√n (Xavier) or √(2/n) (He/ReLU) to keep activation variance stable across layers — prevents vanishing or exploding at the start.
Initialization
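A minimal PyTorch sketch of both schemes (layer width is illustrative):

```python
import torch.nn as nn

tanh_layer = nn.Linear(64, 64)
nn.init.xavier_uniform_(tanh_layer.weight)                        # weight std ∝ 1/√n, suited to tanh/sigmoid

relu_layer = nn.Linear(64, 64)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")   # weight std = √(2/n), suited to ReLU
```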
Gradient Clipping
Caps gradient norm at a threshold (e.g. 0.5) before the optimizer step — prevents individual parameter updates from becoming catastrophically large.
Clipping
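A minimal, self-contained PyTorch training step with clipping at the 0.5 threshold mentioned above (the toy model and random data are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Sigmoid(), nn.Linear(8, 1))   # toy stand-in network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(4, 8), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)   # cap the total gradient norm at 0.5
optimizer.step()
optimizer.zero_grad()
```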
Click a solution panel to see how it modifies the gradient signal at each layer.
Layer Gradients
Per-layer gradient magnitudes, from L6 (seeded at 1.000) down to L1.
Gradient Ratio
‖∇L‖ at input layer ÷ ‖∇L‖ at output layer
Ratio < 0.01 indicates vanishing. Ratio > 100 indicates exploding.
Learning Speed
Layer 6: Fast · Layer 5: Fast · Layer 4: Slow · Layer 3: Slow · Layer 2: Frozen · Layer 1: Frozen