As a developer, you’re likely familiar with the power of large language models (LLMs) but also the challenges they bring—extensive computational requirements and high latency. Enter Small Language Models (SLMs)—compact, efficient versions of LLMs with fewer than 10 billion parameters. Designed for speed and resource efficiency, SLMs are tailor-made for scenarios like edge computing and real-time applications, delivering targeted performance without overwhelming your hardware. Whether you’re building a lightweight chatbot or enabling on-device AI, SLMs offer a practical solution to bring AI closer to your project’s needs.
This article explores the essentials of small language models (SLMs), highlighting their key features, applications, and creation from larger language models (LLMs). We’ll also walk you through implementing these models using Ollama on Google Colab and compare the results from different model variants, helping you understand their real-world performance and use cases.
Small Language Models (SLMs) are compact versions of their larger counterparts, designed to deliver high efficiency and performance while minimizing computational resources. With fewer parameters (typically under 10 billion), they dramatically reduce computational cost and energy usage. They are optimized for specific tasks and trained on smaller, more focused datasets, maintaining a balance between performance and resource efficiency. Unlike large-scale models such as GPT-4 or PaLM, which demand vast memory, compute power, and energy, SLMs are an ideal choice for edge devices, resource-constrained settings, and applications where speed and scalability are critical.
Small language models are typically created from larger models using compression techniques such as knowledge distillation (training a compact student model on the outputs of a larger teacher), pruning (removing redundant weights or neurons), and quantization (lowering the numerical precision of the model’s parameters); these techniques are covered in more detail in the FAQ at the end of this article.
Below is the comparison table of small language models and large language models:
| Aspect | Small Language Models (SLMs) | Large Language Models (LLMs) |
|---|---|---|
| Size | Much smaller, with fewer parameters (typically under 10 billion). | Much larger, with a far higher number of parameters. |
| Training Data & Time | Trained on smaller, more focused, context-specific datasets; training can typically be completed in weeks. | Trained on vast and varied datasets for general-purpose learning; training can take months. |
| Computing Resources | Need far fewer resources, making them a more sustainable option. | Owing to their large parameter counts and training data, LLMs need substantial computing resources to train and run. |
| Proficiency | Best at simpler, well-defined, specific tasks. | Expert at complex and generic tasks. |
| Inference | Can run locally on devices such as phones and a Raspberry Pi, without an internet connection. | Typically need GPUs or other specialized hardware to operate. |
| Response Time | Faster response times owing to their small size. | Depending on task complexity, responses can take much longer. |
| Control of Models | Users can run SLMs on their own servers, tune them, and even freeze them so they never change in the future. | Control rests with the model builders, which can lead to model drift and catastrophic forgetting if the model changes. |
| Cost | Lower overall cost, given the comparatively lower compute requirements. | Higher cost, owing to the large amount of compute needed to train and run LLMs. |
To know more, check out our article: SLMs vs LLMs: The Ultimate Comparison Guide!
In the rapidly evolving world of AI, small language models (SLMs) are setting new benchmarks for efficiency and versatility. Here’s a look at the most advanced SLMs, highlighting their unique features, capabilities, and applications.
Model Series Overview: The Phi 3.5 series includes advanced AI models with diverse specializations, such as Phi-3.5-mini-instruct for lightweight reasoning, Phi-3.5-MoE-instruct with a mixture-of-experts architecture, and Phi-3.5-vision-instruct for multimodal (image and text) tasks.
Context Window: All models support a 128,000-token context length, enabling tasks involving text, code, images, and videos.
Small language models (SLMs) excel in resource-constrained settings due to their computational efficiency and speed. They power edge computing by enabling real-time processing on devices like smartphones and IoT systems. SLMs are ideal for chatbots, virtual assistants, and content generation, offering quick responses and cost-effective solutions. They also support text summarization for concise overviews, text classification for tasks like sentiment analysis, and translation for lightweight language tasks. Additional applications include code generation, mathematical problem-solving, healthcare text processing, and personalized recommendations, making them versatile tools across industries.
Ollama is an advanced AI tool that allows users to easily set up and run large language models locally (on CPU or GPU). In the following steps, we will explore how to run small language models on Google Colab using Ollama.
!sudo apt update
!sudo apt install -y pciutils
!curl -fsSL https://ollama.com/install.sh | sh
!pip install langchain-ollama
import threading
import subprocess
import time
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown
def run_ollama_serve():
subprocess.Popen(["ollama", "serve"])
thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)
The run_ollama_serve() function is defined to launch an external process (ollama serve) using subprocess.Popen().
The threading module creates a new thread that runs the run_ollama_serve() function. The thread starts, enabling the Ollama service to run in the background. The main thread then sleeps for 5 seconds, as defined by the time.sleep(5) command, giving the server time to start up before any further actions proceed.
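A fixed five-second sleep usually works in Colab, but a more robust alternative is to poll the server until it responds. Below is a minimal sketch, assuming Ollama’s default local endpoint at http://localhost:11434 (adjust if your setup differs):
import time
import requests

def wait_for_ollama(url="http://localhost:11434", timeout=30):
    # Poll the Ollama server until it responds or the timeout expires
    start = time.time()
    while time.time() - start < timeout:
        try:
            if requests.get(url).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            pass  # server not up yet, retry after a short pause
        time.sleep(1)
    return False

wait_for_ollama()  # can be used in place of the fixed time.sleep(5)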
!ollama pull llama3.2
Running !ollama pull llama3.2 ensures that the Llama 3.2 language model is downloaded and ready to use. Other small language models can be pulled in the same way for experimentation or for comparing outputs, as shown below.
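For example, a few other compact models can be pulled with the commands below (the tags are assumptions based on the Ollama model library; check it for the exact names and sizes you want):
!ollama pull gemma2:2b   # Gemma 2, 2B parameters
!ollama pull phi3.5      # Phi 3.5 mini
!ollama pull qwen2.5:3b  # Qwen 2.5, 3B parameters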
template = """Question: {question}
Answer: Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3.2")
chain = prompt | model
display(Markdown(chain.invoke({"question": "What's the length of hypotenuse in a right angled triangle"})))
The above code creates a prompt template to format a question, feeds the question to the Llama 3.2 model, and outputs the response with step-by-step reasoning. In this case, it’s asking about the length of the hypotenuse in a right-angled triangle. The process involves defining a structured prompt, chaining it with a model, and then invoking the chain to get and display the response.
Understanding how small language models perform across different tasks is essential to determine their suitability for real-world applications. In this section, we compare outputs from various SLMs to highlight their strengths, limitations, and best use cases.
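One simple way to run such a comparison is to loop the same question through each pulled model and collect the responses side by side. Below is a minimal sketch that reuses the prompt template defined earlier; the model tags listed are assumptions, so substitute whichever models you actually pulled:
models_to_compare = ["llama3.2", "gemma2:2b", "phi3.5"]
question = "What's the length of hypotenuse in a right angled triangle"

for name in models_to_compare:
    slm = OllamaLLM(model=name)      # one chain per model, same prompt template
    comparison_chain = prompt | slm
    answer = comparison_chain.invoke({"question": question})
    print(f"\n===== {name} =====\n{answer}")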
Across the model variants tested, the outputs show distinct patterns:
- Delivers concise responses with strong reasoning, but struggles slightly with creative tasks.
- Offers fast responses with decent accuracy, but lacks depth in explanations.
- Excels in structured problem-solving, but sometimes over-generalizes in open-ended queries.
- Provides detailed and contextually rich answers, balancing accuracy and creativity.
- Handles complex queries effectively, but requires higher computational resources.
Even though all the models give accurate responses to the question, the Gemma 2 (2 billion) model gives the most comprehensive and easy-to-understand answer, at least for this question.
Small language models represent a powerful solution for scenarios that require efficiency, speed, and resource optimization without sacrificing performance. By leveraging reduced parameter sizes and efficient architectures, these models are well-suited for applications in resource-constrained environments, real-time processing, and edge computing. While they may not possess the broad capabilities of their larger counterparts, small language models excel in specific tasks such as code generation, question answering, and text summarization.
With advancements in training techniques, like knowledge distillation and pruning, these models are increasingly capable of delivering competitive performance in many practical use cases. Their ability to balance compactness with functionality makes them an invaluable tool for developers and businesses seeking scalable, cost-effective AI solutions.
Q. What are Small Language Models (SLMs)?
A. Small Language Models (SLMs) are language models with fewer parameters, typically under 10 billion, making them more resource-efficient. They are optimized for specific tasks and trained on smaller datasets, balancing performance and computational efficiency. These models are ideal for applications that require fast responses and minimal resource consumption.
Q. How are SLMs different from large models like GPT-4 or PaLM?
A. SLMs are designed to deliver high performance while using significantly less computational power and energy than larger models like GPT-4 or PaLM. Their compact size suits edge devices with limited memory, compute, and energy, enabling scalable, efficient applications.
Q. What is knowledge distillation?
A. Knowledge distillation involves training smaller models using insights from larger models, enabling compact variants like LLaMA 2 and Gemma 2 to inherit capabilities while remaining resource-efficient.
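As an illustration, a minimal distillation objective in PyTorch might combine a soft-target loss against the teacher’s softened output distribution with the usual hard-label loss. This is only a sketch; the temperature and weighting values below are illustrative assumptions, not settings used by any particular SLM:
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student mimics the teacher's softened output distribution
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Blend the two objectives; alpha controls how much weight the teacher gets
    return alpha * soft_loss + (1 - alpha) * hard_loss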
Q. How do pruning and quantization differ?
A. Pruning reduces model size by removing redundant weights or neurons with minimal impact on performance. This directly decreases the model’s complexity.
Quantization, on the other hand, reduces the precision of the model’s parameters, for instance, by using 8-bit integers instead of 32-bit floating-point numbers. This reduces memory usage and increases inference speed without altering the overall structure of the model.
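To make the distinction concrete, here is a small sketch using PyTorch’s built-in utilities: unstructured magnitude pruning zeroes out the smallest weights of a layer, while dynamic quantization stores linear-layer weights as 8-bit integers. The toy model and the 30% pruning ratio are illustrative assumptions only:
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

toy_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Pruning: zero out the 30% smallest-magnitude weights in the first linear layer
prune.l1_unstructured(toy_model[0], name="weight", amount=0.3)
prune.remove(toy_model[0], "weight")  # make the pruning permanent

# Quantization: store linear-layer weights as 8-bit integers for smaller, faster inference
quantized_model = torch.quantization.quantize_dynamic(
    toy_model, {nn.Linear}, dtype=torch.qint8
)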