© 2026 Greg T. Chism · MIT License

Knowledge Distillation — Interactive Explorer

Watch a small student network learn from a large teacher's soft predictions — see how temperature scaling transfers richer knowledge than hard labels alone


Teacher Network
Architecture
Number of hidden layers
Student Network
Architecture
Number of hidden layers
Distillation Params
Temperature τ
τ 3.0
Higher τ = softer probability distributions
Loss weight α
α 0.50
α=0: hard labels only · α=1: soft labels only
Input Class
What's happening?
Select teacher and student architectures, set the temperature τ and loss weight α, then press Play to watch the student learn from the teacher's soft predictions.
Key Concepts
What is knowledge distillation? A technique where a small student network is trained to mimic a large teacher network's output distributions — not just the hard labels. The student learns the teacher's "dark knowledge" about how classes relate to each other.
What are soft labels? Instead of [1,0,0,0,0], the teacher outputs [0.80,0.12,0.05,0.02,0.01]. These soft probabilities encode class similarities — the teacher knows cat and dog are more similar to each other than to car, and that information rides in the small non-zero probabilities.
What does temperature do? Higher τ makes the teacher's distribution softer — probabilities become more uniform, making inter-class similarities more visible to the student. τ=1 preserves the teacher's original softmax; temperatures of roughly 2–10 are commonly used for distillation, amplifying the dark knowledge.
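To make the effect of τ concrete, here is a minimal NumPy sketch of a temperature-scaled softmax; the logit values are invented for illustration and are not the teacher's actual outputs.

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits divided by temperature tau (higher tau = softer)."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.5, 2.0, 1.0, 0.5])   # hypothetical teacher logits
for tau in (1.0, 3.0, 10.0):
    print(f"tau={tau}: {np.round(softmax_with_temperature(logits, tau), 3)}")
# tau=1 is sharply peaked on class 0; tau=10 is close to uniform
```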
What does alpha control? α balances the hard label loss (cross-entropy with the true one-hot labels) and the soft label loss (KL divergence from the teacher). α=0.5 is a common default — both contribute equally. α=0 ignores the teacher and trains on hard labels only; α=1 ignores the true labels and trains on the teacher's soft labels only.
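As a tiny worked example of this weighting convention (α=0: hard labels only, α=1: soft labels only), with made-up per-example loss values:

```python
def combined_loss(hard_ce, soft_kl, alpha=0.5):
    """alpha-weighted mix of hard-label cross-entropy and soft-label KL loss."""
    return (1.0 - alpha) * hard_ce + alpha * soft_kl

print(combined_loss(0.9, 0.4, alpha=0.0))   # 0.9  -> hard labels only, teacher ignored
print(combined_loss(0.9, 0.4, alpha=1.0))   # 0.4  -> soft labels only, true labels ignored
print(combined_loss(0.9, 0.4, alpha=0.5))   # 0.65 -> both contribute equally
```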
Why does distillation work? The teacher's soft labels contain more information per example than hard labels. A student trained with distillation typically reaches higher accuracy than one trained from scratch with the same architecture — same size model, significantly better performance.
Teacher vs Student — Small Teacher (3 layers) · Tiny Student (1 layer)
Teacher
3 hidden layers · ~120k params
Teacher network D3 node diagram
Output class probabilities
Teacher Soft Labels (τ = 3.0)
Hard · τ=1 · τ=3.0 (current)
C0: 0.62 · C1: 0.18 · C2: 0.12 · C3: 0.05 · C4: 0.03
Soft labels reveal class similarity — dark knowledge
Student
1 hidden layer · ~8k params
Student network D3 node diagram
Output class probabilities
The soft labels panel shows how temperature τ transforms the teacher's sharply peaked predictions into smoother probability distributions — exposing how similar the teacher finds each class to be.
Distillation Training — Loss curves over epochs
Loss curves over training epochs: hard label loss · soft label loss · combined loss
Hard label loss · Soft label loss (KL div) · Combined loss (α-weighted)
Epoch 0 / 100
Student → Teacher KL Divergence over Training
KL divergence vs epoch
Teacher soft probs (τ-scaled)
Teacher target distribution
Student probs at current epoch
Student learned distribution
During distillation, the student minimizes a weighted combination of two losses: cross-entropy with true labels (hard loss) and KL divergence from the teacher's soft predictions (soft loss). Watch both converge as training progresses.
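The explorer does not expose its training code, but a minimal PyTorch sketch of this combined objective could look like the following; the function name, the default values, and the τ² gradient-scaling factor are assumptions for illustration, not the page's actual implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, tau=3.0, alpha=0.5):
    """alpha-weighted sum of hard-label cross-entropy and teacher-student KL."""
    # Hard loss: cross-entropy between student logits and the true class labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft loss: KL(teacher ∥ student) on temperature-scaled distributions.
    # The tau**2 factor keeps soft-loss gradients comparable in scale across
    # temperatures (the scaling suggested in Hinton et al., 2015).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    return (1.0 - alpha) * hard_loss + alpha * soft_loss

# Example usage with random tensors (batch of 8 examples, 5 classes):
# s = torch.randn(8, 5); t = torch.randn(8, 5); y = torch.randint(0, 5, (8,))
# loss = distillation_loss(s, t, y, tau=3.0, alpha=0.5)
```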
Compression Results — Teacher · Student from Scratch · Student Distilled
Teacher
Params ~120k
Accuracy
Layers 3
Inference speed
Scratch
Params ~8k
Accuracy
Layers 1
Inference speed
Distilled
Params ~8k
Accuracy
Layers 1
Inference speed
Accuracy vs Model Size — distillation advantage
Accuracy vs Parameter Count: teacher · scratch student · distilled student
Teacher · Student (scratch) · Student (distilled)
The distilled student achieves accuracy close to the large teacher while keeping the student's small parameter count and fast inference speed — the core promise of knowledge distillation.
Prediction
True class
Teacher →
Student →
Teacher Probs
C0
C1
C2
C3
C4
Student Probs
C0
C1
C2
C3
C4
KL Divergence
KL(teacher ∥ student)
Lower = student distribution closer to teacher. Goal: KL → 0.
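For reference, a small NumPy sketch of the quantity reported here, using the τ = 3.0 teacher distribution from the soft-labels panel and a made-up early-epoch student distribution:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p ∥ q) in nats for two discrete distributions over the same classes."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

teacher = [0.62, 0.18, 0.12, 0.05, 0.03]        # teacher soft labels at tau = 3.0
student_early = [0.30, 0.25, 0.20, 0.15, 0.10]  # hypothetical early-epoch student
print(kl_divergence(teacher, student_early))     # clearly above 0
print(kl_divergence(teacher, teacher))           # ~0: student matches the teacher
```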
Temperature τ
3.0
current τ
scaling logits
At τ=1: the teacher's original softmax (typically sharply peaked). As τ→∞: the distribution approaches uniform.