What is knowledge distillation? A technique where a small student network is trained to mimic a large teacher network's output distributions — not just the hard labels. The student learns the teacher's "dark knowledge" about how classes relate to each other.
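A minimal sketch of that setup, assuming a PyTorch-style API; `teacher`, `student`, `x`, and `optimizer` are placeholder names, and the hard-label term and temperature are left out until they are introduced below:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, x, optimizer):
    teacher.eval()
    with torch.no_grad():                      # teacher is frozen, only provides targets
        teacher_probs = F.softmax(teacher(x), dim=1)
    student_log_probs = F.log_softmax(student(x), dim=1)
    # KL divergence pulls the student's output distribution toward the teacher's
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```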
What are soft labels? Instead of [1,0,0,0,0], the teacher outputs [0.80,0.12,0.05,0.02,0.01]. These soft probabilities encode class similarities — the teacher knows cat and dog are more similar to each other than to car, and that information rides in the small non-zero probabilities.
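A small illustration of that difference, using the numbers above; the class order (cat, dog, fox, car, truck) is hypothetical and only for labeling the positions:

```python
import torch

hard = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])            # one-hot: "cat"
soft = torch.tensor([0.80, 0.12, 0.05, 0.02, 0.01])        # teacher: cat, dog, fox, car, truck

# The hard label says nothing about the other classes; the soft label does.
print(soft[1] / soft[3])        # dog is ~6x more likely than car for this image

# Entropy quantifies the extra information carried per example
entropy = lambda p: -(p * torch.log2(p.clamp_min(1e-12))).sum()
print(entropy(hard))            # 0 bits
print(entropy(soft))            # ~1.0 bit of class-similarity structure
```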
What does temperature do? Higher τ makes the teacher's distribution softer — probabilities become more uniform, making inter-class similarities more visible to the student. τ=1 preserves the original distribution; τ=5–10 is typical for distillation, amplifying dark knowledge.
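A quick demo of the softening effect, assuming PyTorch; the logits are made-up example values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([6.0, 4.0, 3.0, 1.0, 0.0])   # example teacher logits

for tau in (1, 2, 5, 10):
    probs = F.softmax(logits / tau, dim=0)
    print(tau, [f"{p:.3f}" for p in probs.tolist()])
# tau=1  -> sharply peaked, close to one-hot
# tau=10 -> nearly uniform; the tail classes become visible to the student
```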
What does alpha control? α balances hard label loss (cross-entropy with true one-hot labels) and soft label loss (KL divergence from the teacher). α=0.5 is a common default — both contribute equally. α→0 ignores true labels; α→1 ignores the teacher.
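A hedged sketch of the combined objective, assuming PyTorch and the convention above where α weights the hard-label term; the function name is mine, and the τ² scaling follows Hinton et al.'s original formulation and is optional:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=5.0):
    # Hard-label term: standard cross-entropy with the true class indices
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    # The tau**2 factor keeps soft-target gradients comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau ** 2
    # alpha weights the hard-label term: alpha -> 0 ignores true labels,
    # alpha -> 1 ignores the teacher
    return alpha * hard_loss + (1 - alpha) * soft_loss
```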
Why does distillation work? The teacher's soft labels contain more information per example than hard labels. A student trained with distillation typically reaches higher accuracy than the same architecture trained from scratch on hard labels alone: same size model, significantly better performance.