What is knowledge distillation? A technique where a small student network is trained to mimic a large teacher network's output distributions — not just the hard labels. The student learns the teacher's "dark knowledge" about how classes relate to each other.
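A minimal sketch of that setup, assuming a PyTorch-style API; `teacher`, `student`, `x`, and `optimizer` are placeholder names, and the hard-label term and temperature are left out until they are introduced below:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, x, optimizer):
    teacher.eval()
    with torch.no_grad():                      # teacher is frozen, only provides targets
        teacher_probs = F.softmax(teacher(x), dim=1)
    student_log_probs = F.log_softmax(student(x), dim=1)
    # KL divergence pulls the student's output distribution toward the teacher's
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```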
What are soft labels? Instead of [1,0,0,0,0], the teacher outputs [0.80,0.12,0.05,0.02,0.01]. These soft probabilities encode class similarities — the teacher knows cat and dog are more similar to each other than to car, and that information rides in the small non-zero probabilities.
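A small illustration of that difference, using the numbers above; the class order (cat, dog, fox, car, truck) is hypothetical and only for labeling the positions:

```python
import torch

hard = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])            # one-hot: "cat"
soft = torch.tensor([0.80, 0.12, 0.05, 0.02, 0.01])        # teacher: cat, dog, fox, car, truck

# The hard label says nothing about the other classes; the soft label does.
print(soft[1] / soft[3])        # dog is ~6x more likely than car for this image

# Entropy quantifies the extra information carried per example
entropy = lambda p: -(p * torch.log2(p.clamp_min(1e-12))).sum()
print(entropy(hard))            # 0 bits
print(entropy(soft))            # ~1.0 bit of class-similarity structure
```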
What does temperature do? Higher τ makes the teacher's distribution softer — probabilities become more uniform, making inter-class similarities more visible to the student. τ=1 preserves the original distribution; τ=5–10 is typical for distillation, amplifying dark knowledge.
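A quick demo of the softening effect, assuming PyTorch; the logits are made-up example values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([6.0, 4.0, 3.0, 1.0, 0.0])   # example teacher logits

for tau in (1, 2, 5, 10):
    probs = F.softmax(logits / tau, dim=0)
    print(tau, [f"{p:.3f}" for p in probs.tolist()])
# tau=1  -> sharply peaked, close to one-hot
# tau=10 -> nearly uniform; the tail classes become visible to the student
```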
What does alpha control? α balances hard label loss (cross-entropy with true one-hot labels) and soft label loss (KL divergence from the teacher). α=0.5 is a common default — both contribute equally. α→0 ignores true labels; α→1 ignores the teacher.
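A hedged sketch of the combined objective, assuming PyTorch and the convention above where α weights the hard-label term; the function name is mine, and the τ² scaling follows Hinton et al.'s original formulation and is optional:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=5.0):
    # Hard-label term: standard cross-entropy with the true class indices
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    # The tau**2 factor keeps soft-target gradients comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau ** 2
    # alpha weights the hard-label term: alpha -> 0 ignores true labels,
    # alpha -> 1 ignores the teacher
    return alpha * hard_loss + (1 - alpha) * soft_loss
```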
Why does distillation work? The teacher's soft labels contain more information per example than hard labels. A student trained with distillation typically reaches higher accuracy than the same architecture trained from scratch on hard labels alone: same size model, significantly better performance.