Why gradients vanish: the sigmoid derivative peaks at 0.25 and the tanh derivative at 1.0 (and is usually well below that). Multiplying one such factor per layer across N layers gives at most 0.25ᴺ for sigmoid. After 10 layers: 0.25¹⁰ ≈ 0.000001.
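A quick numeric check of that claim, as a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

xs = np.linspace(-10, 10, 10001)
print(sigmoid_deriv(xs).max())    # ~0.25, the *best* case per sigmoid layer
print(0.25 ** 10)                 # ~9.5e-7: ten best-case sigmoid layers stacked
```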
Why gradients explode: if each layer's factor (weight magnitude times activation derivative) exceeds 1, the product grows exponentially backward through the layers; each layer amplifies rather than shrinks the signal.
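The mirror image of the vanishing case, with an assumed per-layer factor of 1.5 (purely illustrative):

```python
per_layer_factor = 1.5   # assumed |weight| * local derivative per layer, > 1
grad = 1.0               # gradient magnitude arriving at the last layer
for _ in range(10):
    grad *= per_layer_factor   # backprop through one more layer
print(grad)              # 1.5**10 ≈ 57.7; after 100 layers it would be ≈ 4e17
```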
The chain rule is the culprit: backprop multiplies one derivative factor per layer, so the gradient at layer 1 is a product of N terms. If the typical term is below 1, the product shrinks geometrically; N terms of 0.25 give 0.25ᴺ.
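In symbols, a scalar sketch of that product (hᵢ, wᵢ, and f are assumed notation for layer activations, weights, and the activation function):

```latex
% h_i = f(w_i h_{i-1}) for i = 2..N, with the loss L computed from h_N
\frac{\partial L}{\partial h_1}
  = \frac{\partial L}{\partial h_N}\prod_{i=2}^{N}\frac{\partial h_i}{\partial h_{i-1}}
  = \frac{\partial L}{\partial h_N}\prod_{i=2}^{N} w_i\, f'\!\left(w_i h_{i-1}\right)
% For sigmoid, f' \le 0.25, so the product is bounded by (0.25\,|w|)^{N-1}.
```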
ReLU mostly fixes vanishing: its derivative is exactly 1 for positive inputs, so it passes the gradient through unshrunk. But "dead ReLU" neurons (pre-activations that are always negative) output zero, receive zero gradient, and never recover.
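A small sketch contrasting the two cases, with hypothetical pre-activation values:

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)        # exactly 1 for positive pre-activations, 0 otherwise

healthy = np.array([0.3, 1.2, 0.7])     # positive pre-activations
dead    = np.array([-2.1, -0.4, -3.0])  # always-negative pre-activations

print(relu_grad(healthy))   # [1. 1. 1.] -> upstream gradient flows through untouched
print(relu_grad(dead))      # [0. 0. 0.] -> no gradient, so this unit's weights stop updating
```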
Modern solutions: Batch Norm, skip connections (ResNets), careful initialization (Xavier/He), and gradient clipping attack the problem from different angles, respectively: normalize activations, give gradients a bypass path, keep per-layer scale near 1, or cap the damage.
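Two of these are easy to sketch without a framework: He initialization (weight standard deviation of sqrt(2/fan_in), suited to ReLU layers) and clipping by global norm. A minimal NumPy sketch; the layer sizes and clipping threshold are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # Variance 2/fan_in keeps activation and gradient scale roughly constant across ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together if their combined L2 norm exceeds max_norm.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

W1 = he_init(784, 256)
W2 = he_init(256, 10)
grads = [rng.normal(0, 10, W1.shape), rng.normal(0, 10, W2.shape)]  # stand-in "exploded" grads
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # 5.0: capped at the threshold
```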