Internal covariate shift: as training progresses, the distribution of each layer's inputs keeps changing, forcing later layers to constantly adapt to a moving target. BN addresses this by normalizing the activations that each layer receives as input.
Normalization step: for each feature, subtract the batch mean and divide by the batch standard deviation (a small ε is added to the variance for numerical stability); this centers and scales activations to approximately N(0, 1).
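In symbols, for a single feature over a mini-batch of size m (this is the standard transform; ε is the small stability constant mentioned above):

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$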
Learnable γ and β: after normalizing, scale by γ and shift by β; this lets the network learn the best distribution for each layer rather than being forced to keep N(0, 1). In particular, if the original activations were already ideal, the network can recover them by learning γ = √(σ²) and β = μ.
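A minimal NumPy sketch of the two steps above for a `(batch, features)` activation matrix; the names (`batchnorm_train`, `eps`) are illustrative rather than any particular library's API:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm on a (batch, features) array."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly N(0, 1)
    y = gamma * x_hat + beta               # learnable scale and shift
    return y, mu, var
```

For convolutional layers the same idea applies, but the statistics are computed per channel over the batch and spatial dimensions.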
Training vs inference: during training, BN normalizes with the current mini-batch statistics. During inference, it uses running (exponential moving) averages of the mean and variance accumulated during training, so the output for a given input no longer depends on what else is in the batch.
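Continuing the sketch above (and reusing `np` from it), one common way the running averages are maintained and then applied at inference; the momentum value and helper names here are assumptions, and frameworks differ in the exact convention:

```python
def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    """Exponential moving average of batch statistics, updated after each training step."""
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batchnorm_infer(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Inference-mode batch norm: uses stored running statistics, not the current batch."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```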
Why BN helps: more stable activation distributions permit higher learning rates and reduce sensitivity to weight initialization, and the noise introduced by mini-batch statistics acts as a mild regularizer.