What is a loss function? A measure of how wrong the model's prediction is — the optimizer minimizes this value by adjusting weights via gradient descent. The choice of loss function shapes what the model learns to optimize.
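A minimal sketch of this loop for a hypothetical one-parameter model (all names and numbers here are illustrative, not from any library):

```python
def mse(y_true, y_pred):
    # squared error: the value the optimizer drives toward zero
    return (y_true - y_pred) ** 2

w = 0.0          # model: y_pred = w * x
x, y = 2.0, 6.0  # single training point; the true weight would be 3
lr = 0.1         # learning rate

for _ in range(50):
    y_pred = w * x
    grad = 2 * (y_pred - y) * x   # dL/dw by the chain rule
    w -= lr * grad                # gradient descent: step against the gradient

# w converges toward 3.0, the weight that minimizes the loss
```

Each step moves `w` in whatever direction lowers the loss, which is all "learning" means at this level.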
MSE vs MAE: MSE squares the error, so large mistakes are penalized much more than small ones — sensitive to outliers. MAE uses absolute error — robust to outliers, but its gradient has constant magnitude (±1) regardless of error size, so fixed-step updates can oscillate around the minimum instead of settling into it.
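A quick numeric comparison of the outlier sensitivity (the error values are made up for illustration):

```python
def mse(errors):
    # mean squared error over a list of residuals
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    # mean absolute error over the same residuals
    return sum(abs(e) for e in errors) / len(errors)

clean   = [1.0, -1.0, 0.5, -0.5]
outlier = [1.0, -1.0, 0.5, -0.5, 10.0]  # one large mistake added

# Adding the single outlier multiplies MSE by ~33x but MAE by only ~3.5x,
# because squaring amplifies the large residual.
```

This is why MAE is often preferred when the training data contains mislabeled or noisy targets.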
Why cross-entropy for classification? It penalizes confident wrong predictions extremely heavily — the loss is −log(p), which → ∞ as the probability p assigned to the true class → 0. This creates strong gradients that push the model away from confident mistakes, making it ideal for probability outputs.
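A small sketch showing how sharply the penalty grows (the function name is illustrative; this is the per-example negative log-likelihood, not a library API):

```python
import math

def cross_entropy(p_true_class):
    # negative log of the probability the model assigned to the correct class
    return -math.log(p_true_class)

confident_wrong = cross_entropy(0.01)  # model gave the true class only 1%
mildly_wrong    = cross_entropy(0.4)   # model gave the true class 40%

# confident_wrong ≈ 4.61, mildly_wrong ≈ 0.92: the confident mistake
# costs ~5x more, and its gradient (-1/p = -100 vs -2.5) is ~40x larger.
```

The steep −1/p gradient is what yanks the model away from confident mistakes fastest.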
What does the gradient tell us? The gradient ∂L/∂ŷ is the slope of the loss curve at the current prediction — it tells the optimizer which direction to move and how strongly. For a convex loss like MSE, large gradient = far from minimum, small gradient = near minimum.
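A sketch of this for MSE, whose gradient with respect to the prediction is proportional to the error (names are illustrative):

```python
def mse_grad(y_pred, y_true):
    # dL/dŷ for L = (ŷ − y)²
    return 2 * (y_pred - y_true)

far  = mse_grad(10.0, 0.0)  # prediction far off  → gradient of 20.0
near = mse_grad(0.1, 0.0)   # prediction close    → gradient of 0.2

# The optimizer takes big steps when far from the target and
# automatically smaller steps as the prediction approaches it.
```

This error-proportional gradient is exactly the property MAE lacks (its gradient stays ±1), which is why MSE settles more smoothly near the minimum.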
Huber loss: combines MSE (for small errors) and MAE (for large errors) using a threshold δ. Smooth gradients near the minimum, robust to outliers far away. Best of both worlds for noisy regression problems.
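A minimal per-element implementation of the idea (using the common 0.5-scaled form so the two pieces meet smoothly at δ; names are illustrative):

```python
def huber(error, delta=1.0):
    # quadratic (MSE-like) inside the threshold: smooth gradient near zero
    if abs(error) <= delta:
        return 0.5 * error ** 2
    # linear (MAE-like) outside the threshold: outliers grow the loss
    # only linearly, not quadratically
    return delta * (abs(error) - 0.5 * delta)

small = huber(0.5)   # 0.125 — same as 0.5 * MSE here
large = huber(10.0)  # 9.5   — vs 50.0 for the 0.5-scaled squared error
```

The δ threshold is a tunable knob: it decides which errors count as "inliers" to fit precisely and which count as outliers to merely tolerate.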