Internal covariate shift: as training progresses, the distribution of each layer's inputs keeps changing, forcing later layers to constantly adapt to a moving target. BN addresses this by normalizing the activations that each layer receives as input.
Normalization step: for each feature, subtract the batch mean and divide by the batch standard deviation (a small ε is added to the variance for numerical stability); this centers and scales activations to approximately N(0, 1).
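In symbols, for a single feature over a mini-batch of size m (this is the standard transform; ε is the small stability constant mentioned above):

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$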
Learnable γ and β: after normalizing, scale by γ and shift by β; this lets the network learn the best distribution for each layer rather than being forced to keep N(0, 1). In particular, if the original activations were already ideal, the network can recover them by learning γ = √(σ²) and β = μ.
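A minimal NumPy sketch of the two steps above for a `(batch, features)` activation matrix; the names (`batchnorm_train`, `eps`) are illustrative rather than any particular library's API:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm on a (batch, features) array."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly N(0, 1)
    y = gamma * x_hat + beta               # learnable scale and shift
    return y, mu, var
```

For convolutional layers the same idea applies, but the statistics are computed per channel over the batch and spatial dimensions.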
Training vs inference: during training, BN normalizes with the current mini-batch statistics. During inference, it uses running (exponential moving) averages of the mean and variance accumulated during training, so the output for a given input no longer depends on what else is in the batch.
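Continuing the sketch above (and reusing `np` from it), one common way the running averages are maintained and then applied at inference; the momentum value and helper names here are assumptions, and frameworks differ in the exact convention:

```python
def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    """Exponential moving average of batch statistics, updated after each training step."""
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batchnorm_infer(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Inference-mode batch norm: uses stored running statistics, not the current batch."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```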
Why BN helps: more stable activation distributions permit higher learning rates and reduce sensitivity to weight initialization, and the noise introduced by mini-batch statistics acts as a mild regularizer.