Why gradients vanish: the sigmoid derivative peaks at 0.25 and the tanh derivative at 1.0 (and is usually well below that). Multiplying one such factor per layer across N layers gives at most 0.25ᴺ for sigmoid. After 10 layers: 0.25¹⁰ ≈ 0.000001.
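A quick numeric check of that claim, as a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

xs = np.linspace(-10, 10, 10001)
print(sigmoid_deriv(xs).max())    # ~0.25, the *best* case per sigmoid layer
print(0.25 ** 10)                 # ~9.5e-7: ten best-case sigmoid layers stacked
```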
Why gradients explode: if each layer's factor (weight magnitude times activation derivative) exceeds 1, the product grows exponentially backward through the layers; each layer amplifies rather than shrinks the signal.
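The mirror image of the vanishing case, with an assumed per-layer factor of 1.5 (purely illustrative):

```python
per_layer_factor = 1.5   # assumed |weight| * local derivative per layer, > 1
grad = 1.0               # gradient magnitude arriving at the last layer
for _ in range(10):
    grad *= per_layer_factor   # backprop through one more layer
print(grad)              # 1.5**10 ≈ 57.7; after 100 layers it would be ≈ 4e17
```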
The chain rule is the culprit: backprop multiplies one derivative factor per layer, so the gradient at layer 1 is a product of N terms. If the typical term is below 1, the product shrinks geometrically; N terms of 0.25 give 0.25ᴺ.
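In symbols, a scalar sketch of that product (hᵢ, wᵢ, and f are assumed notation for layer activations, weights, and the activation function):

```latex
% h_i = f(w_i h_{i-1}) for i = 2..N, with the loss L computed from h_N
\frac{\partial L}{\partial h_1}
  = \frac{\partial L}{\partial h_N}\prod_{i=2}^{N}\frac{\partial h_i}{\partial h_{i-1}}
  = \frac{\partial L}{\partial h_N}\prod_{i=2}^{N} w_i\, f'\!\left(w_i h_{i-1}\right)
% For sigmoid, f' \le 0.25, so the product is bounded by (0.25\,|w|)^{N-1}.
```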
ReLU mostly fixes vanishing: its derivative is exactly 1 for positive inputs, so it passes the gradient through unshrunk. But "dead ReLU" neurons (pre-activations that are always negative) output zero, receive zero gradient, and never recover.
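A small sketch contrasting the two cases, with hypothetical pre-activation values:

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)        # exactly 1 for positive pre-activations, 0 otherwise

healthy = np.array([0.3, 1.2, 0.7])     # positive pre-activations
dead    = np.array([-2.1, -0.4, -3.0])  # always-negative pre-activations

print(relu_grad(healthy))   # [1. 1. 1.] -> upstream gradient flows through untouched
print(relu_grad(dead))      # [0. 0. 0.] -> no gradient, so this unit's weights stop updating
```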
Modern solutions: Batch Norm, skip connections (ResNets), careful initialization (Xavier/He), and gradient clipping attack the problem from different angles, respectively: normalize activations, give gradients a bypass path, keep per-layer scale near 1, or cap the damage.
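Two of these are easy to sketch without a framework: He initialization (weight standard deviation of sqrt(2/fan_in), suited to ReLU layers) and clipping by global norm. A minimal NumPy sketch; the layer sizes and clipping threshold are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # Variance 2/fan_in keeps activation and gradient scale roughly constant across ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together if their combined L2 norm exceeds max_norm.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

W1 = he_init(784, 256)
W2 = he_init(256, 10)
grads = [rng.normal(0, 10, W1.shape), rng.normal(0, 10, W2.shape)]  # stand-in "exploded" grads
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # 5.0: capped at the threshold
```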