Small language models (SLMs) are making a significant impact in AI, offering strong performance while remaining efficient and cost-effective. One standout example is Llama 3.2 3B, which performs well in Retrieval-Augmented Generation (RAG) tasks, cutting computational costs and memory usage while maintaining high accuracy. This article explores how to fine-tune the Llama 3.2 3B model and shows how smaller models can excel in RAG tasks, pushing the boundaries of what compact AI solutions can achieve.
The Llama 3.2 3B model, developed by Meta, is a multilingual SLM with 3 billion parameters, designed for tasks like question answering, summarization, and dialogue systems. It outperforms many open-source models on industry benchmarks and supports diverse languages. Available in various sizes, Llama 3.2 offers efficient computational performance and includes quantized versions for faster, memory-efficient deployment in mobile and edge environments.
Also Read: Top 13 Small Language Models (SLMs)
Fine-tuning is essential for adapting SLMs or LLMs to specific domains or tasks, such as medical, legal, or RAG applications. While pre-training enables language models to generate text across diverse topics, fine-tuning re-trains the model on domain-specific or task-specific data to improve relevance and performance. To address the high computational cost of fine-tuning all parameters, techniques like Parameter-Efficient Fine-Tuning (PEFT) train only a subset of the model's parameters, optimizing resource usage while maintaining performance.
One such PEFT method is Low-Rank Adaptation (LoRA).
In LoRA, the pre-trained weight matrix W of the SLM or LLM is kept frozen, and the weight update learned during fine-tuning is expressed as a product of two low-rank matrices:
ΔW = W_A * W_B
If W has m rows and n columns, then W_A has m rows and r columns, and W_B has r rows and n columns, where r is much smaller than m or n. So, rather than training m*n values, we only train r*(m+n) values. Here, r is called the rank, and it is the hyperparameter we can choose.
def lora_linear(x):
    h = x @ W                         # regular linear layer with the frozen weight W
    h += scale * (x @ W_A @ W_B)      # trainable low-rank update (scale = lora_alpha / r)
    return h
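As a rough illustration of the savings, the hypothetical dimensions below (chosen only to resemble a typical attention projection) compare full fine-tuning with the LoRA update:

m, n, r = 4096, 4096, 16                        # hypothetical weight shape and LoRA rank
full_params = m * n                             # values trained in full fine-tuning
lora_params = r * (m + n)                       # values trained with LoRA
print(full_params, lora_params, f"{100 * lora_params / full_params:.2f}%")  # 16777216 131072 0.78%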
Check out: Parameter-Efficient Fine-Tuning of Large Language Models with LoRA and QLoRA
Let’s implement LoRA on the Llama 3.2 3B model.
Start by installing the Unsloth library (for example, pip install unsloth); this also installs compatible PyTorch, Transformers, and NVIDIA GPU libraries. We can use Google Colab to access a GPU.
Let’s look at the implementation now!
from unsloth import FastLanguageModel, is_bfloat16_supported, train_on_responses_only
from datasets import load_dataset, Dataset
from trl import SFTTrainer, apply_chat_template
from transformers import TrainingArguments, DataCollatorForSeq2Seq, TextStreamer
import torch
max_seq_length = 2048
dtype = None # None for auto-detection.
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
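If you prefer to set the dtype explicitly instead of relying on auto-detection, the is_bfloat16_supported helper imported above can pick the precision; a minimal sketch:

dtype = torch.bfloat16 if is_bfloat16_supported() else torch.float16   # optional explicit choice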
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use if using gated models like meta-llama/Llama-3.2-11b
)
For other models supported by Unsloth, we can refer to this document.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                        # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],  # attention and MLP projections
    lora_alpha = 16,               # LoRA scaling factor
    lora_dropout = 0,              # dropout on the LoRA layers
    bias = "none",                 # do not train bias terms
    use_gradient_checkpointing = "unsloth",  # memory-efficient checkpointing
    random_state = 42,
    use_rslora = False,            # rank-stabilized LoRA disabled
    loftq_config = None,           # no LoftQ quantization config
)
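As a quick sanity check, we can confirm that only the LoRA matrices remain trainable after wrapping the model (with 4-bit quantization the base weights are stored packed, so the printed percentage is only indicative):

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")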
We will use a RAG dataset for fine-tuning. Download the data from Hugging Face:
dataset = load_dataset("neural-bridge/rag-dataset-1200", split = "train")
The dataset has three features, as follows:
Dataset({ features: ['context', 'question', 'answer'], num_rows: 960 })
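Printing one row shows how each record pairs a retrieved context with a question and its answer:

sample = dataset[0]
print(sample["context"][:200])   # retrieved passage (truncated for display)
print(sample["question"])
print(sample["answer"])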
The data needs to be in a specific format depending on the language model. Read more details here.
So, let’s convert the data into the required format:
def convert_dataset_to_dict(dataset):
    # Build prompt/completion message lists in the conversational format expected by TRL
    dataset_dict = {
        "prompt": [],
        "completion": []
    }
    for row in dataset:
        user_content = f"Context: {row['context']}\nQuestion: {row['question']}"
        assistant_content = row['answer']
        dataset_dict["prompt"].append([
            {"role": "user", "content": user_content}
        ])
        dataset_dict["completion"].append([
            {"role": "assistant", "content": assistant_content}
        ])
    return dataset_dict
converted_data = convert_dataset_to_dict(dataset)
dataset = Dataset.from_dict(converted_data)
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
After applying the chat template, each record's prompt and completion are stored as chat-formatted strings.
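To confirm the conversion, we can print one mapped record (this assumes the mapped dataset keeps the prompt and completion columns):

print(dataset[0]["prompt"])       # user turn rendered with the model's chat template
print(dataset[0]["completion"])   # assistant answer rendered as a string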
We can now initialize the trainer for fine-tuning the SLM:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 6, # using small number to test
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
Description of some of the parameters: per_device_train_batch_size and gradient_accumulation_steps together give an effective batch size of 2 × 4 = 8; max_steps = 6 limits the run to a handful of optimizer steps for a quick test, whereas num_train_epochs = 1 would run a full pass over the data; fp16/bf16 pick the mixed-precision mode based on GPU support; optim = "adamw_8bit" uses an 8-bit AdamW optimizer to save memory; and lr_scheduler_type = "linear" decays the learning rate linearly after the warmup_steps.
To train the model on the assistant responses only (masking the user turns out of the loss), specify the instruction and response markers:
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
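To verify the masking, we can decode one training example and inspect its labels; the assumption here is that train_on_responses_only adds a labels column in which the user-turn positions are set to -100, the value the loss ignores:

sample = trainer.train_dataset[0]
print(tokenizer.decode(sample["input_ids"]))   # full chat-formatted text
print(sample["labels"][:20])                   # instruction positions should be -100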
trainer_stats = trainer.train()
Training logs the loss at every step (logging_steps = 1) and returns the aggregate training statistics.
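Since trainer.train() returns a standard Hugging Face TrainOutput, its metrics can be inspected directly:

print(trainer_stats.metrics)   # e.g. train_runtime, train_samples_per_second, train_loss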
Let’s use the model for inference:
FastLanguageModel.for_inference(model)
messages = [
    {"role": "user", "content": "Context: The sky is typically clear during the day. Question: What color is the water?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
To save the trained model with the LoRA weights merged into the base weights, use the code below:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
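If you only need the lightweight LoRA adapters rather than merged 16-bit weights (for example, to reload them on top of the base model later), the standard PEFT-style save also works; a minimal sketch:

model.save_pretrained("lora_model")        # LoRA adapter weights only
tokenizer.save_pretrained("lora_model")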
Check out: Guide to Fine-Tuning Large Language Models
Fine-tuning Llama 3.2 3B for RAG tasks showcases the efficiency of smaller models in delivering high performance with reduced computational costs. Techniques like LoRA optimize resource usage while maintaining accuracy. This approach empowers domain-specific applications, making advanced AI more accessible, scalable, and cost-effective, driving innovation in retrieval-augmented generation and democratizing AI for real-world challenges.
Also Read: Getting Started With Meta Llama 3.2
Q. What is Retrieval-Augmented Generation (RAG)?
A. RAG combines retrieval systems with generative models to enhance responses by grounding them in external knowledge, making it ideal for tasks like question answering and summarization.
Q. Why use Llama 3.2 3B for RAG tasks?
A. Llama 3.2 3B offers a balance of performance, efficiency, and scalability, making it suitable for RAG tasks while reducing computational and memory requirements.
Q. What is LoRA and how does it help?
A. Low-Rank Adaptation (LoRA) minimizes resource usage by training only low-rank matrices instead of all model parameters, enabling efficient fine-tuning on constrained hardware.
Q. Which dataset is used for fine-tuning?
A. Hugging Face provides the RAG dataset, which contains context, questions, and answers, to fine-tune the Llama 3.2 3B model for better task performance.
Q. Can Llama 3.2 3B run on edge and mobile devices?
A. Yes, Llama 3.2 3B, especially in its quantized form, is optimized for memory-efficient deployment on edge and mobile environments.