© 2026 Greg T. Chism · MIT License

Batch Normalization — Interactive Explorer

See how batch norm stabilizes activation distributions and accelerates training by reducing internal covariate shift


Network Configuration
Network Depth
Batch Normalization
Compare activation distributions with and without batch norm
Batch Size
Training Parameters
η 0.010
Learning rate — try higher values to see BN's stabilizing effect
Simulation
Speed Med
What's happening?
Select a depth, set a learning rate, and press Play to watch batch normalization stabilize activation distributions layer by layer.
Key Concepts
Internal covariate shift: as training progresses, the distribution of each layer's inputs changes — forcing later layers to constantly adapt. BN counters this by normalizing each layer's output (see the numerical sketch after this list).
Normalization step: subtract the batch mean and divide by batch standard deviation — centers and scales activations to approximately N(0,1).
Learnable γ and β: after normalizing, scale by γ and shift by β — lets the network learn the optimal distribution for each layer rather than forcing N(0,1).
Training vs inference: during training, BN uses the current mini-batch statistics. During inference, it uses running averages accumulated during training.
Why BN helps: stable distributions allow higher learning rates, reduce sensitivity to initialization, and act as mild regularization.
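The first two concepts are easy to check numerically: as upstream weights change scale, the raw inputs to a layer drift, while the batch-normalized version stays near N(0,1). A minimal NumPy sketch (the fixed data batch and the growing weight scales are invented for illustration and are not taken from the explorer's simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))                  # a fixed mini-batch of inputs

for step, scale in enumerate([1.0, 2.0, 4.0]):  # pretend upstream weights keep growing
    W = scale * rng.normal(size=(16, 16)) / 4.0
    pre_act = x @ W                             # this layer's raw input drifts with scale
    normed = (pre_act - pre_act.mean(axis=0)) / pre_act.std(axis=0)
    print(f"step {step}: raw std {pre_act.std():.2f} -> normalized std {normed.std():.2f}")
# The raw spread grows with the upstream scale; the normalized spread stays ~1.
```

This is what the two histogram rows below visualize: the With BN row stays pinned near N(0,1) while the Without BN row drifts and widens.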
Epoch 0
Batch 0
Activation Distributions — With Batch Norm
With BN
Layer 1 histogram · Layer 2 histogram
With BN — activations stay near N(0,1). Both rows share the same x-axis [−6, +6], so you can compare the drift directly.
Activation Distributions — Without Batch Norm
Without BN
Layer 1 histogram · Layer 2 histogram
Without BN — distributions drift and widen over training
Step 1 of 4
Compute Batch Mean
μ_B = (1/m) Σ x_i = 0.7500
Average all m=8 activations in the mini-batch.
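As a concrete worked example of this step, here is a hypothetical mini-batch of eight activations whose mean happens to match the 0.7500 shown above (the values are invented; the explorer samples its own each batch):

```python
import numpy as np

# Hypothetical m = 8 activations for a single feature (illustrative values only).
x = np.array([0.2, 0.5, 0.9, 1.1, 0.6, 0.8, 1.0, 0.9])
mu_B = x.mean()     # (1/m) * sum(x_i)
print(mu_B)         # 0.75
```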
Step 1
Compute Batch Mean
μ_B = (1/m) Σ x_i
distribution + mean marker
Average activation across all m samples in the mini-batch for each feature dimension.
Step 2
Compute Variance
σ²_B = (1/m) Σ (x_i − μ_B)²
centered dist. + spread markers
Measure spread after centering. Captures how much activations deviate from the mean.
Step 3
Normalize
x̂_i = (x_i − μ_B) / √(σ²_B + ε)
N(0,1) shape (before γ, β)
Subtract the batch mean and divide by the standard deviation (ε keeps the division numerically stable). The result has zero mean and unit variance.
Step 4
Scale & Shift
y_i = γ · x̂_i + β
N(β, γ²) shape (learned γ, β)
Learnable γ and β restore representational power — the network can un-normalize if that's optimal.
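Chained together, the four steps above amount to just a few lines of array code. A minimal NumPy sketch (the name batchnorm_forward, the (batch, features) layout, and eps = 1e-5 are assumptions for illustration, not the explorer's actual implementation):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Apply batch norm to a mini-batch x of shape (m, features)."""
    mu = x.mean(axis=0)                      # Step 1: batch mean per feature
    var = x.var(axis=0)                      # Step 2: batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # Step 3: normalize to ~N(0, 1)
    y = gamma * x_hat + beta                 # Step 4: scale and shift
    return y, x_hat, mu, var

# m = 8 samples, 2 features; gamma and beta start at (1, 0)
x = np.random.randn(8, 2) * 3.0 + 1.5        # deliberately shifted, widened input
gamma, beta = np.ones(2), np.zeros(2)
y, x_hat, mu, var = batchnorm_forward(x, gamma, beta)
print(x_hat.mean(axis=0))                    # ~[0, 0]
print(x_hat.var(axis=0))                     # ~[1, 1]
```

With γ = 1 and β = 0 the output y equals x̂; once training moves them, each feature of y follows roughly N(β, γ²), as the Step 4 card notes.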
Training Loss vs. Epoch
Press Play to animate — With BN (green, solid) converges faster and more smoothly. Try Deep (8 layers) to see a dramatic gap.
With BN · Without BN
Gradient Norm vs. Epoch
With BN — stable · Without BN — spiky
Layer Statistics
Layer 1 (BN): μ, σ²
Layer 2 (BN): μ, σ²
BN Parameters (Layer 1)
γ (scale) = 1.00
β (shift) = 0.00
γ and β are learned — start at (1, 0) and adapt during training.
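Because γ and β sit in a plain affine expression, their gradients are simple, and a framework updates them like any other weight. A hedged sketch (dy stands in for the upstream gradient, eta matches the η = 0.010 slider, and all names and values here are illustrative):

```python
import numpy as np

def batchnorm_param_grads(dy, x_hat):
    """Gradients of the loss w.r.t. gamma and beta, from y = gamma * x_hat + beta."""
    dgamma = (dy * x_hat).sum(axis=0)   # sum over the batch dimension
    dbeta = dy.sum(axis=0)
    return dgamma, dbeta

eta = 0.01
gamma, beta = np.ones(2), np.zeros(2)   # start at (1, 0), as noted above
dy = np.random.randn(8, 2)              # stand-in upstream gradient
x_hat = np.random.randn(8, 2)           # stand-in normalized activations
dgamma, dbeta = batchnorm_param_grads(dy, x_hat)
gamma -= eta * dgamma
beta -= eta * dbeta
```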
Running Stats (Inference)
μ_run
σ²_run
momentum 0.1
Exponential moving average of batch statistics — used at inference time.
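A minimal sketch of how those running statistics are typically maintained during training and then used at inference (function names and the (m, features) layout are assumptions; the momentum value matches the 0.1 shown above):

```python
import numpy as np

def update_running_stats(mu_run, var_run, mu_batch, var_batch, momentum=0.1):
    """Exponential moving average of batch statistics, updated every training step."""
    mu_run = (1 - momentum) * mu_run + momentum * mu_batch
    var_run = (1 - momentum) * var_run + momentum * var_batch
    return mu_run, var_run

def batchnorm_inference(x, gamma, beta, mu_run, var_run, eps=1e-5):
    """At inference, normalize with the running averages instead of batch statistics."""
    x_hat = (x - mu_run) / np.sqrt(var_run + eps)
    return gamma * x_hat + beta
```

Normalizing with running averages keeps inference deterministic and independent of the batch size.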