Unlocking the power of domain-specific Large Language Models like Microsoft Phi-4 requires the ability to fine-tune these models for specialized tasks. Fine-tuning Phi-4 on custom datasets helps tailor the model to perform optimally in specific domains, such as customer support, medical advice, or technical documentation. By leveraging LoRA (Low-Rank Adaptation) adapters, this process becomes more efficient, allowing for faster training and reduced resource consumption. This guide will walk you through the essential steps to fine-tune Phi-4 using LoRA adapters, integrate the model with Hugging Face for easy sharing, and apply the latest techniques to get the most out of your custom LLM.
Before diving into fine-tuning Phi-4, ensure you have the necessary tools and environment configured. This includes installing Python 3.8+, PyTorch with CUDA support for GPU acceleration, and the unsloth library, along with Hugging Face Transformers and Datasets for seamless dataset handling and model integration. Having these prerequisites in place will ensure a smooth and efficient fine-tuning process.
Ensure you have the following installed: Python 3.8 or later, PyTorch with CUDA support for GPU acceleration, the unsloth library, and the Hugging Face Transformers and Datasets libraries.
Install the required libraries with:
pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
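Before going further, you can optionally confirm that PyTorch was installed with CUDA support. This is just a quick sanity check; the exact output depends on your setup:
import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # should print True if a CUDA GPU is visible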
This section covers all the essential steps involved in fine-tuning Microsoft Phi-4, from setting up the environment to pushing the fine-tuned model to Hugging Face. It includes configuring the model, preparing the dataset, training, monitoring GPU usage, generating responses, and saving/uploading the model.
Below, we import the dependencies and load the model:
LoRA adapters enable parameter-efficient fine-tuning by training only a small subset of model parameters.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048  # maximum context length used during fine-tuning
load_in_4bit = True    # load the base model in 4-bit precision to save GPU memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-4",
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: higher values train more parameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient checkpointing
    random_state=3407,
)
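LoRA's efficiency is easy to see by counting trainable versus total parameters with standard PyTorch attributes. This is a quick sketch; the exact percentage depends on the rank r and the base model size:
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} of {total_params:,} "
      f"({100 * trainable_params / total_params:.2f}%)")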
We use the FineTome-100k dataset in ShareGPT format. The unsloth library provides utilities to convert this format into Hugging Face’s generic format for multi-turn conversations.
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, get_chat_template
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
Hugging Face's datasets library loads the mlabonne/FineTome-100k dataset, and the split="train" argument ensures that only the training split is loaded.
dataset = standardize_sharegpt(dataset)
The standardize_sharegpt function from the unsloth.chat_templates module converts the ShareGPT-style entries ("from"/"value") into Hugging Face's generic "role"/"content" format, so the dataset adheres to the expected structure for multi-turn conversations.
tokenizer = get_chat_template(tokenizer, chat_template="phi-4")
The get_chat_template function customizes the tokenizer to use the “phi-4” chat template. This ensures the prompts and conversations align with Phi-4’s format.
def formatting_prompts_func(examples):
    # Render each conversation with the Phi-4 chat template into a single training string
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}
The formatting_prompts_func applies the Phi-4 chat template to every conversation in a batch and returns the rendered strings under a new "text" column.
dataset = dataset.map(formatting_prompts_func, batched=True)
The map function applies formatting_prompts_func to the entire dataset in batches. This efficiently preprocesses the dataset to prepare it for fine-tuning.
Let's look at how the conversations are structured for item 5:
dataset[5]["conversations"]
Fine-tuning the model involves training Phi-4 with the SFTTrainer from Hugging Face's TRL library, optimizing the process with custom settings and efficient data handling.
We use Hugging Face’s SFTTrainer to train the model. Below is a minimal setup for efficient training:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size of 8
        warmup_steps=5,
        max_steps=30,                    # short demo run; increase for real fine-tuning
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        output_dir="outputs",
        report_to="none",
    ),
)
Trainer initialization: SFTTrainer ties together the LoRA-wrapped model, the tokenizer, and the preprocessed dataset, reading the formatted conversations from the "text" field and batching them with DataCollatorForSeq2Seq.
Training arguments: a per-device batch size of 2 combined with 4 gradient accumulation steps gives an effective batch size of 8, while max_steps=30 keeps this demonstration run short. The 2e-4 learning rate, 8-bit AdamW optimizer, weight decay of 0.01, and automatic fp16/bf16 selection (via is_bfloat16_supported) keep training fast and memory-friendly.
Purpose: this setup efficiently fine-tunes a large model on a custom dataset by updating only the LoRA parameters rather than the full model, keeping GPU memory requirements modest.
We use Unsloth's train_on_responses_only utility to compute the loss only on the assistant outputs and ignore the loss on the user's inputs:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user<|im_sep|>",
    response_part="<|im_start|>assistant<|im_sep|>",
)
Let's verify that the masking is actually applied. Decoding the input_ids shows the full conversation, while in the labels the user tokens are set to -100 and ignored by the loss, so decoding them (with -100 replaced by a space) should show only the assistant responses:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
Check GPU memory usage before training starts, so you can compare it with peak usage after training:
import torch
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Generate responses using the fine-tuned model:
Defining the Input Messages:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="phi-4",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
The input is structured as a list of message dictionaries. Each dictionary specifies the role (e.g., "user") and the content (e.g., the user's query). This structure supports multi-turn conversations, aligning with the model's chat-based functionality.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # append the assistant header so the model knows to respond
    return_tensors="pt",
).to("cuda")
Parameters: tokenize=True converts the conversation directly into token IDs, add_generation_prompt=True appends the assistant prompt so the model knows it should respond, return_tensors="pt" returns PyTorch tensors, and .to("cuda") moves them onto the GPU.
Generating Text:
outputs = model.generate(
    input_ids=inputs, max_new_tokens=64, use_cache=True, temperature=1.5, min_p=0.1
)
Parameters: max_new_tokens=64 caps the length of the generated reply, use_cache=True reuses the key/value cache for faster decoding, and temperature=1.5 with min_p=0.1 controls the randomness of sampling.
Decoding and Displaying the Output:
print(tokenizer.batch_decode(outputs))
To stream the reply token by token instead of waiting for the full output, pair the same prompt with Hugging Face's TextStreamer:
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(
    input_ids=inputs, streamer=text_streamer, max_new_tokens=128,
    use_cache=True, temperature=1.5, min_p=0.1,
)
Save Locally or Push to Hugging Face:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
To upload to Hugging Face:
model.push_to_hub_merged("hf/model", tokenizer, save_method="lora", token="<your_hf_token>")
This call pushes the LoRA adapters and the associated tokenizer to the Hugging Face Hub. Replace "hf/model" with your own username and repository name, and supply a valid Hugging Face authentication token in place of <your_hf_token>.
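If you prefer to publish the fully merged weights rather than just the adapters, unsloth also exposes a merged save mode. This is a sketch with a hypothetical repository name you would replace with your own:
model.push_to_hub_merged(
    "your-username/phi-4-finetuned",  # hypothetical repo name -- replace with yours
    tokenizer,
    save_method="merged_16bit",       # merges the LoRA weights into the base model before upload
    token="<your_hf_token>",
)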
Fine-tuning Microsoft Phi-4 locally and pushing it to Hugging Face allows developers to create highly specialized models efficiently. With tools like Unsloth, LoRA Adapters, and Hugging Face, the process becomes accessible and scalable. Try it out with your dataset today!
Q. What is Microsoft Phi-4, and why fine-tune it on a custom dataset?
A. Microsoft Phi-4 is a large language model (LLM) optimized for language understanding and generation tasks. Fine-tuning it on a custom dataset enables domain-specific performance, tailoring the model to specialized applications such as customer service, technical documentation, or niche industries.
Q. What are LoRA adapters, and why use them for fine-tuning?
A. LoRA (Low-Rank Adaptation) adapters allow efficient fine-tuning by training only a subset of model parameters instead of the entire model. This reduces computational requirements and memory usage, making it ideal for large models like Phi-4.
Q. What are the prerequisites for fine-tuning Phi-4?
A. Key requirements include Python 3.8+, PyTorch with CUDA support, the unsloth library for streamlined workflows, and Hugging Face Transformers and Datasets for dataset handling and training.
Q. What kind of dataset should I use for fine-tuning?
A. Use a dataset like FineTome-100k in ShareGPT format. Convert and standardize the dataset using unsloth utilities to ensure compatibility with Hugging Face's multi-turn conversation template.
Q. How do I share my fine-tuned model on Hugging Face?
A. Save your fine-tuned model and tokenizer locally, then use the .push_to_hub_merged() method from unsloth to upload the model and tokenizer to Hugging Face with your authentication token.