Unlocking the power of domain-specific Large Language Models like Microsoft Phi-4 requires the ability to fine-tune these models for specialized tasks. Fine-tuning Phi-4 on custom datasets helps tailor the model to perform optimally in specific domains, such as customer support, medical advice, or technical documentation. By leveraging LoRA (Low-Rank Adaptation) adapters, this process becomes more efficient, allowing for faster training and reduced resource consumption. This guide will walk you through the essential steps to fine-tune Phi-4 using LoRA adapters, integrate the model with Hugging Face for easy sharing, and apply the latest techniques to get the most out of your custom LLM.
Before diving into fine-tuning Phi-4, ensure you have the necessary tools and environment configured. This includes installing Python 3.8+, PyTorch with CUDA support for GPU acceleration, and the unsloth library, along with Hugging Face Transformers and Datasets for seamless dataset handling and model integration. Having these prerequisites in place will ensure a smooth and efficient fine-tuning process.
Ensure you have the following installed: Python 3.8 or later, PyTorch with CUDA support for GPU acceleration, the unsloth library, and the Hugging Face Transformers and Datasets libraries.
Install the required libraries with:
pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
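Before going further, you can optionally confirm that PyTorch was installed with CUDA support. This is just a quick sanity check; the exact output depends on your setup:
import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # should print True if a CUDA GPU is visible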
This section covers all the essential steps involved in fine-tuning Microsoft Phi-4, from setting up the environment to pushing the fine-tuned model to Hugging Face. It includes configuring the model, preparing the dataset, training, monitoring GPU usage, generating responses, and saving/uploading the model.
Below, we import the dependencies and load the model:
LoRA adapters enable parameter-efficient fine-tuning by training only a small subset of model parameters.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048  # maximum context length used during fine-tuning
load_in_4bit = True    # load the base model in 4-bit precision to save GPU memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-4",
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: higher values train more parameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient checkpointing
    random_state=3407,
)
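LoRA's efficiency is easy to see by counting trainable versus total parameters with standard PyTorch attributes. This is a quick sketch; the exact percentage depends on the rank r and the base model size:
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} of {total_params:,} "
      f"({100 * trainable_params / total_params:.2f}%)")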
We use the FineTome-100k dataset in ShareGPT format. The unsloth library provides utilities to convert this format into Hugging Face’s generic format for multi-turn conversations.
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, get_chat_template
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
Hugging Face's datasets library loads the mlabonne/FineTome-100k dataset, and the split="train" argument ensures that only the training split is loaded.
dataset = standardize_sharegpt(dataset)
The standardize_sharegpt function from the unsloth.chat_templates module converts the ShareGPT-style entries ("from"/"value") into Hugging Face's generic "role"/"content" format, so the dataset adheres to the expected structure for multi-turn conversations.
tokenizer = get_chat_template(tokenizer, chat_template="phi-4")
The get_chat_template function customizes the tokenizer to use the “phi-4” chat template. This ensures the prompts and conversations align with Phi-4’s format.
def formatting_prompts_func(examples):
    # Render each conversation with the Phi-4 chat template into a single training string
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}
The formatting_prompts_func applies the Phi-4 chat template to every conversation in a batch and returns the rendered strings under a new "text" column.
dataset = dataset.map(formatting_prompts_func, batched=True)
The map function applies formatting_prompts_func to the entire dataset in batches. This efficiently preprocesses the dataset to prepare it for fine-tuning.
Let's look at how the conversations are structured for item 5:
dataset[5]["conversations"]
Fine-tuning the model involves training Phi-4 with the SFTTrainer from Hugging Face's TRL library, optimizing the process with custom settings and efficient data handling.
We use Hugging Face’s SFTTrainer to train the model. Below is a minimal setup for efficient training:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size of 8
        warmup_steps=5,
        max_steps=30,                    # short demo run; increase for real fine-tuning
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        output_dir="outputs",
        report_to="none",
    ),
)
Trainer initialization: SFTTrainer ties together the LoRA-wrapped model, the tokenizer, and the preprocessed dataset, reading the formatted conversations from the "text" field and batching them with DataCollatorForSeq2Seq.
Training arguments: a per-device batch size of 2 combined with 4 gradient accumulation steps gives an effective batch size of 8, while max_steps=30 keeps this demonstration run short. The 2e-4 learning rate, 8-bit AdamW optimizer, weight decay of 0.01, and automatic fp16/bf16 selection (via is_bfloat16_supported) keep training fast and memory-friendly.
Purpose: this setup efficiently fine-tunes a large model on a custom dataset by updating only the LoRA parameters rather than the full model, keeping GPU memory requirements modest.
We use Unsloth's train_on_responses_only utility to compute the loss only on the assistant outputs and ignore the loss on the user's inputs:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user<|im_sep|>",
    response_part="<|im_start|>assistant<|im_sep|>",
)
Let's verify that the masking is actually applied. Decoding the input_ids shows the full conversation, while in the labels the user tokens are set to -100 and ignored by the loss, so decoding them (with -100 replaced by a space) should show only the assistant responses:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
Check GPU memory usage before training starts, so you can compare it with peak usage after training:
import torch
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Generate responses using the fine-tuned model:
Defining the Input Messages:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="phi-4",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
The input is structured as a list of message dictionaries. Each dictionary specifies the role (e.g., "user") and the content (e.g., the user's query). This structure supports multi-turn conversations, aligning with the model's chat-based functionality.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # append the assistant header so the model knows to respond
    return_tensors="pt",
).to("cuda")
Parameters: tokenize=True converts the conversation directly into token IDs, add_generation_prompt=True appends the assistant prompt so the model knows it should respond, return_tensors="pt" returns PyTorch tensors, and .to("cuda") moves them onto the GPU.
Generating Text:
outputs = model.generate(
    input_ids=inputs, max_new_tokens=64, use_cache=True, temperature=1.5, min_p=0.1
)
Parameters: max_new_tokens=64 caps the length of the generated reply, use_cache=True reuses the key/value cache for faster decoding, and temperature=1.5 with min_p=0.1 controls the randomness of sampling.
Decoding and Displaying the Output:
print(tokenizer.batch_decode(outputs))
To stream the reply token by token instead of waiting for the full output, pair the same prompt with Hugging Face's TextStreamer:
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(
    input_ids=inputs, streamer=text_streamer, max_new_tokens=128,
    use_cache=True, temperature=1.5, min_p=0.1,
)
Save Locally or Push to Hugging Face:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
To upload to Hugging Face:
model.push_to_hub_merged("hf/model", tokenizer, save_method="lora", token="<your_hf_token>")
This call pushes the LoRA adapters and the associated tokenizer to the Hugging Face Hub. Replace "hf/model" with your own username and repository name, and supply a valid Hugging Face authentication token in place of <your_hf_token>.
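If you prefer to publish the fully merged weights rather than just the adapters, unsloth also exposes a merged save mode. This is a sketch with a hypothetical repository name you would replace with your own:
model.push_to_hub_merged(
    "your-username/phi-4-finetuned",  # hypothetical repo name -- replace with yours
    tokenizer,
    save_method="merged_16bit",       # merges the LoRA weights into the base model before upload
    token="<your_hf_token>",
)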
Fine-tuning Microsoft Phi-4 locally and pushing it to Hugging Face allows developers to create highly specialized models efficiently. With tools like Unsloth, LoRA Adapters, and Hugging Face, the process becomes accessible and scalable. Try it out with your dataset today!
Q. What is Microsoft Phi-4, and why fine-tune it on a custom dataset?
A. Microsoft Phi-4 is a large language model (LLM) optimized for language understanding and generation tasks. Fine-tuning it on a custom dataset enables domain-specific performance, tailoring the model to specialized applications such as customer service, technical documentation, or niche industries.
Q. What are LoRA adapters, and why use them for fine-tuning?
A. LoRA (Low-Rank Adaptation) adapters allow efficient fine-tuning by training only a subset of model parameters instead of the entire model. This reduces computational requirements and memory usage, making it ideal for large models like Phi-4.
Q. What are the prerequisites for fine-tuning Phi-4?
A. Key requirements include Python 3.8+, PyTorch with CUDA support, the unsloth library for streamlined workflows, and Hugging Face Transformers and Datasets for dataset handling and training.
Q. What kind of dataset should I use for fine-tuning?
A. Use a dataset like FineTome-100k in ShareGPT format. Convert and standardize the dataset using unsloth utilities to ensure compatibility with Hugging Face's multi-turn conversation template.
Q. How do I share my fine-tuned model on Hugging Face?
A. Save your fine-tuned model and tokenizer locally, then use the .push_to_hub_merged() method from unsloth to upload the model and tokenizer to Hugging Face with your authentication token.