Why learning rate matters: Set the learning rate (LR) too high and each update overshoots the minimum, so training diverges and the loss explodes. Set it too low and training converges very slowly or stalls in a suboptimal region. LR is often the single most impactful hyperparameter when training with a gradient-based optimizer.
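A toy illustration of the two failure modes, assuming plain gradient descent on f(x) = x**2. The function name `gradient_descent` and the three LR values are made up for this sketch and are not recommendations for real models.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the final |x|."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x       # derivative of x**2
        x = x - lr * grad    # each step multiplies x by (1 - 2 * lr)
    return abs(x)


# Illustrative values only (assumptions, not tuned settings).
too_high = gradient_descent(lr=1.1)     # |1 - 2.2| = 1.2 > 1, so |x| grows every step
too_low = gradient_descent(lr=0.001)    # factor 0.998, so x barely moves in 20 steps
reasonable = gradient_descent(lr=0.1)   # factor 0.8, so x shrinks steadily toward 0
```

On this quadratic the update is just a multiplication by (1 - 2 * lr), which makes the threshold between convergence and divergence easy to see. Real loss surfaces are not this simple, but the same overshoot mechanism applies.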
Step decay: Drops LR by a fixed factor every N epochs. It is simple and effective, but the abrupt changes can cause brief instability when the LR is suddenly cut to a half or a third of its value. The staircase pattern in the loss curve is its signature.
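Step decay can be sketched as a single expression. The name `step_decay` and the defaults (`factor=0.5`, `step_size=10`) are assumptions for illustration; `init_lr` stands in for initLR.

```python
def step_decay(epoch, init_lr=0.1, factor=0.5, step_size=10):
    """Return the LR for a given epoch: multiplied by `factor` every `step_size` epochs."""
    num_drops = epoch // step_size   # how many drops have happened so far
    return init_lr * factor ** num_drops


# LR stays flat within each block of 10 epochs, then drops abruptly.
schedule = [step_decay(e) for e in range(30)]
```

Plotting `schedule` against the epoch index shows the staircase shape: three flat segments with a sharp drop between them.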
Cosine annealing: Smoothly reduces LR along a cosine curve from initLR to near zero. It is widely used in modern deep learning (ResNets, ViTs). It often reaches a lower final loss than a constant LR or step decay, and tends to find flatter minima that generalize better.
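A minimal sketch of the cosine curve, assuming a single decay over the whole run with no restarts. The name `cosine_annealing` and the `min_lr` floor are assumptions for this example; `init_lr` stands in for initLR.

```python
import math


def cosine_annealing(epoch, total_epochs, init_lr=0.1, min_lr=0.0):
    """Return the LR at `epoch`, following half a cosine period from init_lr to min_lr."""
    progress = min(epoch, total_epochs) / total_epochs   # fraction of training done, in [0, 1]
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + (init_lr - min_lr) * cosine


schedule = [cosine_annealing(e, total_epochs=100) for e in range(101)]
```

The curve is nearly flat at the start and at the end and steepest in the middle, so LR changes gently at both ends of training instead of in sudden steps.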
Warmup: Start with a very small LR and increase it linearly to initLR over the first few epochs. This prevents large, unstable updates while the weights are still close to their random initialization. It is especially important for Transformers, where gradients through the attention layers can explode early in training.
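Linear warmup can be sketched as follows. The name `linear_warmup`, the starting value `start_lr`, and the five-epoch length are assumptions chosen for illustration; `init_lr` stands in for initLR.

```python
def linear_warmup(epoch, warmup_epochs=5, init_lr=0.1, start_lr=1e-4):
    """Return the LR at `epoch`: a linear ramp from start_lr to init_lr, then constant."""
    if epoch >= warmup_epochs:
        return init_lr                      # warmup finished, hand over to the main schedule
    fraction = epoch / warmup_epochs        # 0 at the first epoch, approaching 1
    return start_lr + (init_lr - start_lr) * fraction


schedule = [linear_warmup(e) for e in range(10)]
```

In practice the constant phase after warmup is usually replaced by a decaying schedule. This sketch holds LR constant only to keep the ramp itself easy to read.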
Cyclical LR: Periodically raises and lowers LR between a lower and an upper bound. The rises help the optimizer escape local minima and saddle points. Used in super-convergence training; the optimizer explores more of the loss landscape and can settle in a wider, flatter minimum that generalizes better.
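One common form of cyclical LR is a triangular wave. The sketch below assumes that shape; the name `triangular_lr`, the bounds `base_lr` and `max_lr`, and the `half_cycle` length are hypothetical values picked for the example.

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.01, half_cycle=100):
    """Return the LR at `step`: linear rise to max_lr, then linear fall back to base_lr."""
    cycle_pos = step % (2 * half_cycle)               # position inside the current cycle
    dist = abs(cycle_pos - half_cycle) / half_cycle   # 1 at the cycle ends, 0 at the peak
    return base_lr + (max_lr - base_lr) * (1.0 - dist)


# Two full cycles: LR rises for 100 steps, falls for 100 steps, then repeats.
schedule = [triangular_lr(s) for s in range(400)]
```

Here the schedule is indexed by optimizer step, not by epoch, because cycles are typically shorter than the full run. Other wave shapes are possible; the triangular one is used only because it is the simplest to write down.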