© 2026 Greg T. Chism · MIT License

Loss Functions — Interactive Explorer

See how MSE, MAE, Cross-Entropy, and Hinge losses respond to predictions, and why choosing the right loss function matters for your problem.


[Control panel]
Loss Function: MSE = (ŷ − y)², penalizes large errors quadratically (selected)
True Label y: task type and true value (y = 0.50)
Prediction ŷ: slider (ŷ = 0.50); drag to move the predicted value and watch the loss update
Outliers: slider (n = 0); add outliers to show robustness differences between MSE and MAE
Playback: play and speed controls
What's happening?
Select a loss function and drag the prediction slider to see how each loss responds. Try adding outliers to compare MSE vs MAE robustness.
Key Concepts
What is a loss function? A measure of how wrong the model's prediction is: the optimizer minimizes this value by adjusting weights via gradient descent. The choice of loss function shapes what the model learns to optimize.
MSE vs MAE: MSE squares the error, so large mistakes are penalized far more than small ones, which makes it sensitive to outliers. MAE uses the absolute error: robust to outliers, but its gradient has constant magnitude, which can make the optimizer oscillate around the minimum.
Why cross-entropy for classification? It penalizes confident wrong predictions extremely heavily: L = −log(p) → ∞ as p → 0. This creates strong gradients that push the model away from confident mistakes, making it ideal for probability outputs.
What does the gradient tell us? The gradient ∂L/∂ŷ is the slope of the loss curve at the current prediction — it tells the optimizer which direction and how far to move. Large gradient = far from minimum, small gradient = near minimum.
Huber loss: combines MSE (for small errors) and MAE (for large errors) using a threshold δ. Smooth gradients near the minimum, robust to outliers far away: the best of both worlds for noisy regression problems. All five losses are sketched in code below.
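The five losses above are short enough to write out directly. Here is a minimal Python sketch (the function names and the δ = 1.0 default are illustrative, not taken from the explorer's source):

    import math

    def mse(y_hat, y):
        # Squared error: smooth everywhere, penalizes large errors quadratically
        return (y_hat - y) ** 2

    def mae(y_hat, y):
        # Absolute error: robust to outliers, gradient magnitude is always 1
        return abs(y_hat - y)

    def binary_cross_entropy(p, y):
        # p is the predicted probability of class 1, y is 0 or 1
        eps = 1e-12  # clip so a confident wrong prediction never hits log(0)
        p = min(max(p, eps), 1 - eps)
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))

    def hinge(score, y):
        # Margin loss for y in {-1, +1}; zero once the margin y*score exceeds 1
        return max(0.0, 1 - y * score)

    def huber(y_hat, y, delta=1.0):
        # Quadratic like MSE for |error| <= delta, linear like MAE beyond it
        err = abs(y_hat - y)
        return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)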
Loss Landscape — MSE · ŷ vs L(ŷ)
Selected Loss: Loss vs Predicted Value
[Loss curve, D3 rendered. x-axis: ŷ from −0.5 to 1.5 · y-axis: L(ŷ) · markers: true label y, current ŷ, minimum ★]
All Four Loss Functions — Quick Comparison
[Four thumbnails, D3 rendered: MSE · MAE · Cross-Entropy · Hinge]
The loss curve shows L(ŷ) as a function of the predicted value. The orange dot marks your current prediction — drag the slider to move it. The minimum (★) is where the gradient is zero and loss is smallest.
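In code, the curve is just the loss evaluated on a grid of ŷ values. A small sketch, assuming the default MSE view with y = 0.5:

    y = 0.5
    grid = [i / 100 for i in range(-50, 151)]          # y_hat from -0.5 to 1.5
    curve = [(y_hat, (y_hat - y) ** 2) for y_hat in grid]
    best_y_hat, best_loss = min(curve, key=lambda point: point[1])
    gradient = 2 * (best_y_hat - y)                    # derivative of (y_hat - y)^2
    print(best_y_hat, best_loss, gradient)             # 0.5 0.0 0.0: flat at the minimum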
Optimization Effect — Gradient Descent Under Different Losses
[Scatter plot with gradient descent paths, D3 rendered. ~20 data points · colored paths show convergence under MSE, MAE, Huber, and Cross-Entropy (logistic)]
Loss over Gradient Descent Steps
[|∂L/∂ŷ| vs ŷ for each loss, D3 rendered]
Press Play to watch gradient descent update the prediction under each loss function simultaneously. Notice how MSE has steeper gradients far from the truth, while MAE has constant gradient magnitude regardless of distance.
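The playback loop amounts to repeated gradient steps on ŷ. A sketch under assumed settings (start at ŷ = 1.5, step size 0.1; not the explorer's actual parameters), comparing MSE and MAE:

    y, lr, steps = 0.5, 0.1, 12

    def grad_mse(y_hat):
        return 2 * (y_hat - y)                 # shrinks as y_hat approaches y

    def grad_mae(y_hat):
        return 1.0 if y_hat > y else -1.0      # constant magnitude everywhere

    for name, grad in [("MSE", grad_mse), ("MAE", grad_mae)]:
        y_hat, path = 1.5, [1.5]
        for _ in range(steps):
            y_hat -= lr * grad(y_hat)          # move opposite the gradient
            path.append(round(y_hat, 3))
        print(name, path)
    # MSE's steps shrink smoothly toward y = 0.5; MAE marches in fixed 0.1-sized
    # steps and then bounces between 0.5 and 0.6 once it reaches the minimum.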
When to Use Which Loss Function
Regression
Outlier Robustness
MSE is pulled toward outliers; MAE ignores them
[Legend: MSE fit, pulled toward the outlier · MAE fit, robust to the outlier · outlier point]
When to use: Use MSE when outliers are rare and you want smooth gradients. Use MAE when outliers are common and robustness matters. Use Huber for a balance between the two.
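The robustness gap has a classic closed form when fitting a single constant: the MSE minimizer is the mean and the MAE minimizer is the median. A quick check with made-up numbers:

    import statistics

    data = [1.0, 1.2, 0.9, 1.1, 1.0, 10.0]   # five inliers and one outlier
    print(statistics.mean(data))    # 2.533...: the MSE fit, dragged toward 10.0
    print(statistics.median(data))  # 1.05: the MAE fit, stays with the bulk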
Binary Classification
Decision Boundary
Cross-Entropy vs Hinge for two-class separation
[Legend: Class 0 · Class 1 · Cross-Entropy boundary · Hinge margin zone]
When to use: Use Cross-Entropy for probabilistic outputs (logistic regression, neural nets). Use Hinge for margin-based classifiers (SVM-style) when you want a hard decision boundary.
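A sketch of the difference, scoring the same raw model outputs under both losses (the scores are made up):

    import math

    def logistic_ce(score, y):
        # Sigmoid squashes the score to a probability, then binary cross-entropy
        p = 1 / (1 + math.exp(-score))
        return -math.log(p) if y == 1 else -math.log(1 - p)

    def hinge(score, y):
        # y is +1 or -1; loss is zero once the margin y*score exceeds 1
        return max(0.0, 1 - y * score)

    for score in [-2.0, 0.0, 0.5, 2.0]:        # true class is positive
        print(score, round(logistic_ce(score, 1), 3), round(hinge(score, 1), 3))
    # -2.0 2.127 3.0 · 0.0 0.693 1.0 · 0.5 0.474 0.5 · 2.0 0.127 0.0
    # Hinge hits exactly 0 past the margin; cross-entropy keeps shrinking
    # but never reaches 0, so it always rewards more confidence.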
Multi-Class
Softmax Output
Categorical Cross-Entropy over all class probabilities
[Legend: true class C0, where the loss is applied · other classes · L = −log(p_true)]
When to use: Categorical Cross-Entropy is the standard for multi-class neural networks — it rewards confident correct predictions and heavily penalizes confident wrong ones.
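As a sketch, with made-up logits for three classes and C0 as the true class:

    import math

    logits = [2.0, 0.5, -1.0]                    # raw scores for C0, C1, C2
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]        # softmax: positive, sums to 1
    loss = -math.log(probs[0])                   # only p_true enters the loss
    print([round(p, 3) for p in probs], round(loss, 3))
    # [0.786, 0.175, 0.039] 0.241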
Use the outlier slider (left panel) to see how MSE and MAE diverge as outliers are added. The blue fit line (MSE) gets pulled toward extremes while the green line (MAE) stays close to the bulk of the data.
[Readout panel]
Current Loss: L(ŷ) for the selected loss (MSE)
Gradient ∂L/∂ŷ: the slope of the loss at the current ŷ. Negative means the loss decreases as ŷ increases; the optimizer moves opposite the gradient direction.
Formula: MSE, L = (ŷ − y)². Smooth everywhere · derivative = 2(ŷ − y)
All Losses at ŷ
[Live table: L(ŷ) for MSE, MAE, C-Entropy, Hinge, and Huber at the current ŷ and y]