What are Distilled Models?

Shaik Hamzah Shareef | Last Updated: 04 Mar, 2025

You’ve probably heard of DeepSeek, but have you also noticed DeepSeek’s distilled models on Ollama? Or, if you’ve tried Groq Cloud, you may have come across similar models there. But what exactly are these “distil” models? In this context, “distil” refers to distilled versions of the original models released by an organization. Distilled models are smaller, more efficient models designed to replicate the behavior of larger models while reducing resource requirements.

Benefits of Distilled Models

  • Reduced memory footprint and computation requirements
  • Lower energy consumption during inference and training
  • Faster processing times


How were Distilled Models Introduced?

Knowledge distillation is a form of model compression introduced by Geoffrey Hinton and his colleagues in their 2015 paper, “Distilling the Knowledge in a Neural Network.” The process aims to maintain performance while reducing the memory footprint and computation requirements.

Hinton raised the question: is it possible to train a large neural network and then compress its knowledge into a smaller one? In this view, the smaller network acts as a student, while the larger network serves as a teacher. The goal is for the student to reproduce the knowledge the teacher has learned, despite having far fewer parameters.

By analyzing the teacher’s behavior and its predictions, Hinton and his colleagues devised a training methodology that lets the smaller (student) network learn effectively from the larger one. The core idea is to minimize the error between the student’s output and two types of targets: the actual ground truth (the hard target) and the teacher’s predictions (the soft target).

Dual Loss Components

  • Hard Loss: This is the error measured against the true (ground truth) labels. It is what you would typically optimize in standard training, ensuring that the model learns the correct output.
  • Soft Loss: This is the error measured against the teacher’s predictions. While the teacher might not be perfect, its predictions contain valuable information about the relative probabilities of the output classes, which can guide the student model toward better generalization.

The training objective is to minimize the weighted sum of these two losses, where the weight assigned to the soft loss is denoted by λ:

L = (1 − λ) · L_hard + λ · L_soft

In this formulation, the parameter λ (soft weight) determines the balance between learning from the actual labels and mimicking the teacher’s output. Even though one might argue that the true labels should be sufficient for training, incorporating the teacher’s prediction (soft loss) can actually help accelerate training and enhance performance by guiding the student with nuanced information.
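
To make this concrete, here is a minimal PyTorch sketch of a weighted hard/soft loss. The function name, the choice of cross-entropy for the soft term, and the default weight are illustrative assumptions rather than details from the paper; temperature scaling is introduced in the next section.

```python
import torch
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels, soft_weight=0.5):
    """Weighted sum of the hard (ground-truth) and soft (teacher) losses.

    soft_weight plays the role of lambda in the text; the 0.5 default is
    arbitrary. Temperature scaling is added in the next section.
    """
    # Hard loss: standard cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft loss: cross-entropy between the teacher's predicted
    # distribution and the student's predicted distribution.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    soft_loss = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

    return (1 - soft_weight) * hard_loss + soft_weight * soft_loss
```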

The Softmax Function and Temperature

A key component in this methodology is the modification of the softmax function via a parameter called temperature (T). The softmax function, also known as the normalized exponential function, converts raw output scores (logits) from a neural network into probabilities. For a node i with value y_i, the standard softmax is defined as:

p_i = exp(y_i) / Σ_j exp(y_j)

Hinton introduced a new version of the softmax function that incorporates the temperature parameter:

p_i(T) = exp(y_i / T) / Σ_j exp(y_j / T)

  • When T=1: The function behaves like the standard softmax.
  • When T>1: The exponentials become less extreme, producing a “softer” probability distribution over classes. In other words, the probabilities become more evenly spread out, revealing more information about the relative likelihood of each class.
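
A minimal sketch of this temperature-scaled softmax (the function name and example logits are just for illustration):

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Standard softmax when T = 1; a softer, more even distribution when T > 1."""
    return torch.softmax(logits / T, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, T=1.0))  # roughly [0.659, 0.242, 0.099]
print(softmax_with_temperature(logits, T=4.0))  # roughly [0.417, 0.324, 0.259], noticeably flatter
```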

Adjusting the Loss with Temperature

Since applying a higher temperature produces a softer distribution, the gradients coming from the soft targets are scaled down during training. To compensate and keep the soft targets contributing effectively, the soft loss is multiplied by T². The updated overall loss function becomes:

L = (1 − λ) · L_hard + λ · T² · L_soft

This formulation ensures that both the hard loss (from the actual labels) and the temperature-adjusted soft loss (from the teacher’s predictions) contribute appropriately to the training of the student model.
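
Putting the pieces together, here is a hedged PyTorch sketch of the temperature-adjusted distillation loss. Using a KL divergence between the softened teacher and student distributions for the soft term is a common practical choice; the names and default values below are assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      soft_weight=0.5, T=2.0):
    """Hard loss plus temperature-adjusted soft loss, scaled by T^2."""
    # Hard loss against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft loss: KL divergence between the softened teacher and student
    # distributions. Multiplying by T^2 compensates for the smaller
    # gradients produced by the temperature-scaled soft targets.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    return (1 - soft_weight) * hard_loss + soft_weight * soft_loss
```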

Overview

  • Teacher-Student Dynamics: The student model learns by minimizing errors against both the true labels (hard loss) and the teacher’s predictions (soft loss).
  • Weighted Loss Function: The overall training loss is a weighted sum of hard and soft losses, controlled by the parameter λ.
  • Temperature-Adjusted Softmax: The introduction of the temperature T in the softmax function softens the probability distribution, and multiplying the soft loss by T^2 compensates for this effect during training.

By combining these elements, the distilled network is trained efficiently, harnessing both the precision of hard labels and the richer, more informative guidance provided by the teacher’s predictions. This process not only accelerates training but also helps the smaller network approximate the performance of its larger counterpart.

DistilBERT

DistilBERT adapts Hinton’s distillation method with one slight modification: it adds a cosine embedding loss that measures the distance between the student’s and teacher’s embedding vectors. Here’s a quick comparison:

  • DistilBERT: 6 layers, 66 million parameters
  • BERT-base: 12 layers, 110 million parameters

Both models were pretrained on the same corpus (English Wikipedia and the Toronto Book Corpus). On downstream evaluations:

  • GLUE benchmark: BERT-base averaged a score of 79.5 vs. DistilBERT’s 77.0.
  • SQuAD dataset: BERT-base scored an F1 of 88.5 compared to DistilBERT’s roughly 86.
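
DistilBERT’s objective therefore combines a hard/soft distillation loss with an extra cosine term between the teacher’s and student’s hidden states. A rough sketch of that cosine component (the tensor shapes and names are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def cosine_embedding_term(student_hidden, teacher_hidden):
    """Encourage the student's hidden states to align with the teacher's.

    Both tensors are assumed to have shape (batch, seq_len, dim), with the
    same hidden dimension for teacher and student (768 in DistilBERT's case).
    """
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    # A target of 1 tells the loss to maximise cosine similarity.
    target = torch.ones(s.size(0), device=s.device)
    return F.cosine_embedding_loss(s, t, target)
```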

DistilGPT2

For GPT-2, which was originally released in four sizes:

  • The smallest GPT-2 has 12 layers and approximately 117 million parameters (some reports note 124 million due to implementation differences).
  • DistilGPT2 is the distilled version, with 6 layers and 82 million parameters, retaining the same embedding size (768).

You can explore the model on Hugging Face.

Even though DistilGPT2 is roughly twice as fast as GPT-2, its perplexity on large text datasets is about 5 points higher. In NLP, lower perplexity indicates better performance; thus, the smallest GPT-2 still outperforms its distilled counterpart.
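
If you want to see this gap yourself, one rough way is to compare the two public checkpoints, gpt2 and distilgpt2, on the same passage with the transformers library. The helper below is a sketch, and perplexity on a single sentence is only a crude proxy for benchmark numbers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels equal to the inputs, the model returns the average
        # cross-entropy loss over the sequence; exponentiating gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

text = "Model distillation compresses a large network into a smaller one."
print("gpt2       :", perplexity("gpt2", text))
print("distilgpt2 :", perplexity("distilgpt2", text))
```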

Implementing LLM Distillation

Implementing Large Language Model (LLM) distillation involves several steps and the use of specialized frameworks and libraries. Below is an overview of the process:

Frameworks and Libraries

  • Hugging Face Transformers: The repository’s distillation research example includes a Distiller class that simplifies transferring knowledge from a teacher to a student model (a short loading sketch follows this list).
  • Other Libraries:
    • TensorFlow Model Optimization: Offers tools for model pruning, quantization, and distillation.
    • Distiller (Intel): A PyTorch-based library with utilities for compressing models, including distillation techniques.
    • DeepSpeed: Developed by Microsoft, it includes features for both model training and distillation.
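
For example, a teacher/student pair can be loaded with the Transformers library. The model IDs bert-base-uncased and distilbert-base-uncased are just one common pairing, and the two-label classification setup is an arbitrary example:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Teacher: the full-size pre-trained model; student: a smaller counterpart.
teacher = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
student = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# The two models share a compatible WordPiece vocabulary, so one tokenizer suffices.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The teacher is frozen; only the student is updated during distillation.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False
```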

Steps Involved

  1. Data Preparation: Prepare a dataset that is representative of the target tasks. Data augmentation techniques can further enhance the diversity of training examples.
  2. Teacher Model Selection: Choose a well-performing, pre-trained teacher model. The quality of the teacher directly influences the performance of the student.
  3. Distillation Process
    • Training Setup: Initialize the student model and configure training parameters (e.g., learning rate, batch size).
    • Knowledge Transfer: Use the teacher model to generate soft targets (probability distributions) alongside hard targets (ground truth labels).
    • Training Loop: Train the student model to minimize the combined loss between its predictions and the soft/hard targets (a condensed sketch follows this list).
  4. Evaluation Metrics: Common metrics to assess the distilled model include:
    • Accuracy: Percentage of correct predictions.
    • Inference Speed: Time required to process inputs.
    • Model Size: Reduction in size and computational efficiency.
    • Resource Utilization: Efficiency in terms of computational resource consumption during inference.
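
A condensed, illustrative training loop that ties these steps together. It reuses the distillation_loss sketch from earlier; the dataloader, optimizer, and device are placeholders you would supply yourself.

```python
import torch

def distill_one_epoch(student, teacher, dataloader, optimizer,
                      soft_weight=0.5, T=2.0, device="cpu"):
    """One epoch of knowledge transfer from a frozen teacher to the student."""
    teacher.eval()
    student.train()
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Knowledge transfer: the frozen teacher produces the soft targets.
        with torch.no_grad():
            teacher_logits = teacher(input_ids, attention_mask=attention_mask).logits

        student_logits = student(input_ids, attention_mask=attention_mask).logits

        # Combined hard + soft loss (see the distillation_loss sketch above).
        loss = distillation_loss(student_logits, teacher_logits, labels,
                                 soft_weight=soft_weight, T=T)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```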

Understanding Model Distillation

Key Components of Model Distillation

Picking Teacher and Student Model Architectures

The student model can either be a simplified or quantized version of the teacher, or it can have an entirely different, optimized architecture. The choice depends on the specific requirements of the deployment environment.

The Distillation Process Explained

At the core of this process is training the student model to mimic the teacher’s behavior. This is achieved by minimizing the difference between the student’s predictions and the teacher’s outputs—a supervised learning approach that forms the foundation of model distillation.

Challenges and Limitations

While distilled models offer clear benefits, there are some challenges to consider:

  • Trade-offs in Accuracy: Distilled models often experience a slight drop in performance compared to their larger counterparts.
  • Complexity of the Distillation Process: Configuring the right training environment and fine-tuning hyperparameters (like λ and temperature T) can be challenging.
  • Domain Adaptation: The effectiveness of distillation may vary depending on the specific domain or task for which the model is being used.

Future Directions in Model Distillation

The field of model distillation is rapidly evolving. Some promising areas include:

  • Advancements in Distillation Techniques: Ongoing research aims to close the performance gap between teacher and student models.
  • Automated Distillation Processes: New approaches are emerging to automate hyperparameter tuning, making distillation more accessible and efficient.
  • Broader Applications: Beyond NLP, model distillation is gaining traction in computer vision, reinforcement learning, and other areas, potentially transforming deployment in resource-constrained environments.

Real-World Applications

Distilled models are finding practical applications across various industries:

  • Mobile and Edge Computing: Their smaller size makes them ideal for deployment on devices with limited computational power, ensuring faster inference in mobile apps and IoT devices.
  • Energy Efficiency: In large-scale deployments, such as cloud services, reduced power consumption is critical. Distilled models help lower energy usage.
  • Rapid Prototyping: For startups and researchers, distilled models offer a balance between performance and resource efficiency, enabling faster development cycles.

Conclusion

Distilled models have transformed deep learning by achieving a delicate balance between high performance and computational efficiency. While they may sacrifice a bit of accuracy due to their smaller size and reliance on soft loss training, their faster processing and reduced resource demands make them especially valuable in resource-constrained settings.

Essentially, a distilled network emulates the behavior of its larger counterpart but can never exceed it in performance due to its limited capacity. This trade-off makes distilled models a smart choice when computing resources are limited or when their performance closely approximates that of the original model. Conversely, if the performance drop is significant or if computational power is readily available through methods like parallelization, opting for the original, larger model may be the better option.

