Large Language Models have revolutionized productivity by enabling tasks like Q&A, dynamic code generation, and agentic systems. However, vanilla pre-trained models are often biased and can produce harmful content. To align their behavior with human preferences, techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) can be used. This article covers RLHF methods and the implementation of DPO using Unsloth, highlighting how these methods improve the quality and effectiveness of models across tasks.
Llama 3 is a family of open-source models recently released by Meta. The family consists of pre-trained and instruction-tuned chat models with 8B and 70B parameters. Since its release, it has been well received by the OSS community, performing strongly on benchmarks like MMLU, HumanEval, and MATH. The small 8B model in particular has outperformed many bigger models, which makes it ideal for personal use and edge deployment. However, many use cases require the models to be fine-tuned on a custom dataset to perform well. So, let’s understand what RLHF and DPO are, and then implement DPO.
RLHF is an alignment technique usually applied after supervised fine-tuning (SFT) to instill certain types of behavior in a base model. For example, the model can be trained to refuse to respond to harmful prompts or to avoid hate speech. This is an important step before releasing models to the public. Big companies like Google, Meta, and OpenAI spend enormous resources aligning their models before releasing them into the wild.
RLHF is a two-step process: first, a reward model is trained on preference data, and then the base model is fine-tuned with reinforcement learning. The preference dataset is a highly curated collection of accepted and rejected responses from foundational language models, with human annotators ranking the responses to capture human preference. The reward model is then trained or fine-tuned on this preference data; it can be the same model, a different language model, or even a traditional classification model.
The next step is to fine-tune the base model using RL. Traditionally in RLHF, the PPO (Proximal Policy Optimization) algorithm is used to update the parameters of the base model based on a reward function. In PPO, we have an initial language model, a policy model that will be fine-tuned, and the reward model from the previous step.
Prompts from the preference dataset are fed to the RL policy model to generate responses, which are also passed through the initial base model to calculate the relative KL penalty. The KL penalty measures how much one probability distribution diverges from another, ensuring the policy model doesn’t drift too far from the base model. The formula for calculating the KL penalty is given below.
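For reference, the KL penalty is based on the standard KL divergence between the policy model’s and the base model’s output distributions for a prompt x:

$$D_{\mathrm{KL}}\big(\pi_{\mathrm{RL}}(\cdot \mid x)\,\|\,\pi_{\mathrm{base}}(\cdot \mid x)\big) = \sum_{y} \pi_{\mathrm{RL}}(y \mid x)\,\log\frac{\pi_{\mathrm{RL}}(y \mid x)}{\pi_{\mathrm{base}}(y \mid x)}$$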
In the next step, the reward model assigns preference scores to the responses from the RL policy model. The parameters of the RL policy model are then updated by maximizing the reward function, which combines the preference score with the KL penalty; the penalty term is subtracted, so drifting away from the base model lowers the reward.
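Putting the two pieces together, the objective maximized by PPO is commonly sketched as follows (a standard formulation of RLHF, not specific to this article’s setup), where r_θ(x, y) is the reward model’s preference score and β controls the strength of the KL penalty:

$$R(x, y) = r_{\theta}(x, y) - \beta\, D_{\mathrm{KL}}\big(\pi_{\mathrm{RL}}(\cdot \mid x)\,\|\,\pi_{\mathrm{base}}(\cdot \mid x)\big)$$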
From here onwards the policy model can be updated iteratively.
While RLHF using PPO has upsides like greater flexibility to incorporate various types of feedback, the implementation can be unstable. Here are some of the pros and cons of the RLHF fine-tuning method.
Direct Preference Optimization is a fine-tuning technique that aims to address the shortcomings of PPO-based RLHF. DPO simplifies the pipeline by eliminating both the separate reward model and the RL-based optimization step; instead, it optimizes the language model directly on human preference data. Using pairwise comparisons of model outputs, human evaluators choose the preferred response for each prompt, and this feedback directly guides the training of the language model. We can also treat responses from stronger models as preferred and responses from weaker models as rejected to fine-tune base models.
Direct Preference Optimization uses a reference model instead of a reward model and trains the policy to assign a higher probability to preferred responses and a lower probability to rejected ones. This approach is more stable and efficient than PPO-based RLHF, as it bypasses reward model training and the RL fitting process.
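For reference, the DPO loss can be sketched as follows (the standard formulation from the DPO paper), where π_θ is the model being trained, π_ref is the frozen reference model, y_w and y_l are the chosen and rejected responses for a prompt x, σ is the sigmoid function, and β controls how closely the policy must stay to the reference:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

The beta = 0.1 passed to the DPOTrainer later in this article is exactly this β.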
Now, let’s look at the open-source tooling for implementing DPO; there are several libraries that support it.
So, let’s implement Direct Preference Optimization fine-tuning of the Llama 3 model using Unsloth, working through it step by step.
Before moving ahead, install the dependencies. We will install Unsloth from its Git repository, along with flash-attention, TRL, and Wandb for logging. Optionally, you can install DeepSpeed for distributed training across GPUs.
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
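Optionally, before moving on, a quick sanity check (a minimal sketch, assuming a CUDA GPU is present) confirms the GPU is visible and the key libraries import cleanly:

# Quick sanity check after installation
import torch, trl, peft, bitsandbytes
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))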
Now, set WANDB_API_KEY in your local environment.
import os
os.environ['WANDB_API_KEY'] = "your_api_key"
We will use the Orca DPO pairs dataset from Intel for alignment through DPO. As we learned before, a DPO dataset has a prompt column, a column for chosen (accepted) answers, and a column for rejected answers.
This is a small dataset; you can also use other DPO datasets, such as Argilla’s UltraFeedback preference data.
The data is well suited for DPO tuning. We can load it using Hugging Face’s datasets library. Rename the question column to prompt, as TRL’s DPOTrainer requires that name. We will also split the data into train and test sets.
from datasets import load_dataset
dataset = load_dataset("Intel/orca_dpo_pairs", split = "train")
dataset = dataset.rename_column('question','prompt')
dataset_dict = dataset.train_test_split(test_size=0.04)
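Before wiring up the trainer, it helps to confirm the dataset exposes the columns DPOTrainer expects (prompt, chosen, rejected) and to eyeball one example. A minimal check, assuming the split created above:

# Inspect the columns and one sample record
print(dataset_dict["train"].column_names)  # should include 'prompt', 'chosen', and 'rejected'
sample = dataset_dict["train"][0]
print(sample["prompt"][:200])
print(sample["chosen"][:200])
print(sample["rejected"][:200])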
We will now download the 4-bit quantized Llama 3 Instruct model from Unsloth. This will take a few moments; the quantized model is around 5.76 GB. The script below downloads the model and loads it on the GPU.
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096
dtype = None  # None for auto-detection; float16 for older GPUs (T4, V100), bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8B-instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
We can now attach the LoRA adapters to the Llama model. With these settings we only update roughly 1-10% of the total parameters (you can verify the exact count right after the block below). Setting gradient checkpointing to “unsloth” uses about 30% less memory and accommodates 2x larger batch sizes.
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
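To verify that only a small fraction of the weights will be updated, you can print the trainable parameter count. print_trainable_parameters() is the standard PEFT helper and should be available on the adapter-wrapped model returned above:

# Reports trainable parameters, total parameters, and the trainable percentage
model.print_trainable_parameters()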
Now, define the training arguments and hyperparameters for model training. Before that, patch the DPOTrainer. This is only needed if you are running in a notebook, where it improves model logging in Jupyter; skip this step if you are not in an IPython notebook.
from unsloth import PatchDPOTrainer
PatchDPOTrainer()
Log in to your Weights and Biases profile.
import wandb
wandb.login()
Now initialize a Weights and Biases run and define the training hyperparameters for the DPOTrainer.
from transformers import TrainingArguments
from trl import DPOTrainer
import wandb
project_name = "llama3"
entity = "wandb"
# os.environ["WANDB_LOG_MODEL"] = "checkpoint"
wandb.init(project=project_name, name = "llama-3-8b-instruct-DPO-1")
dpo_trainer = DPOTrainer(
    model = model,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 3,
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 5e-6,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        # max_steps = 20,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        report_to = "wandb",  # enable logging to W&B
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dataset_dict["train"],
    eval_dataset = dataset_dict["test"],
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
Here’s a quick breakdown of the key training arguments used above. per_device_train_batch_size and gradient_accumulation_steps together give an effective batch size of 6 per device. warmup_ratio and lr_scheduler_type warm the learning rate up over the first 10% of steps and then decay it linearly. learning_rate is a conservative 5e-6, typical for preference-tuning an already instruction-tuned model. fp16/bf16 pick mixed precision automatically based on whether the GPU supports bfloat16. optim = "adamw_8bit" uses an 8-bit AdamW optimizer to cut optimizer-state memory. beta = 0.1 is the DPO temperature that controls how strongly the policy is kept close to the reference model. Finally, max_length and max_prompt_length set the truncation limits for the full prompt-plus-response and for the prompt alone.
Now, start training.
dpo_trainer.train()
This will kick-start model fine-tuning. If you encounter an out-of-memory (OOM) error, try reducing the per-device batch size or the maximum sequence length (you can raise gradient accumulation to keep the effective batch size). You can visualize the training run in the Notebook or follow it from your Wandb profile.
Once the training is finished, save the LoRA adapter.
model.save_pretrained("lora_model")
You can now load the LoRA model and start asking questions.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 512,
    # dtype = dtype,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
We can define a transformers pipeline for inference.
import transformers

message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"},
]
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Stop generation at either the EOS token or Llama 3's end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=terminators,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'][len(prompt):])
You may also wrap it in a Gradio chat interface using the script below.
import gradio as gr

messages = []

def add_text(history, text):
    global messages  # messages (list) is defined globally
    history = history + [[text, '']]  # use a list so the bot reply can be filled in while streaming
    messages = messages + [{"role": 'user', 'content': text}]
    return history, ""

def generate(history):
    global messages
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    response_msg = outputs[0]["generated_text"][len(prompt):]
    messages.append({"role": "assistant", "content": response_msg})  # keep the assistant turn for multi-turn context
    # Stream the response character by character into the chat history
    for char in response_msg:
        history[-1][1] += char
        yield history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(value=[], elem_id="chatbot")
    with gr.Row():
        txt = gr.Textbox(
            show_label=False,
            placeholder="Enter text and press enter",
        )
    txt.submit(add_text, [chatbot, txt], [chatbot, txt], queue=False).then(
        generate, inputs=[chatbot,], outputs=chatbot,)

demo.queue()
demo.launch(debug=True)
Llama 3 from Meta has proven to be very capable, especially the small 8B model, which can run on cheaper hardware and be fine-tuned to adhere to particular use cases. To make it commercially viable, however, we often need to fine-tune it on custom data and align it with our preferences. This article discussed alignment techniques like RLHF and DPO, and walked through a DPO implementation using Unsloth. Here are the key takeaways from the article.
Q. What is direct preference optimization?
A. Direct preference optimization (DPO) directly optimizes a model based on user preferences or feedback, enhancing the model’s alignment with human expectations without intermediate reward models.
Q. What is the difference between PPO and DPO in LLMs?
A. In LLMs, PPO (Proximal Policy Optimization) is a reinforcement learning algorithm that updates policies to improve performance. DPO (Direct Preference Optimization) directly adjusts model parameters based on user feedback for better alignment with preferences.
Q. Is DPO more efficient than RLHF?
A. Direct preference optimization can be more efficient than Reinforcement Learning from Human Feedback (RLHF), as it simplifies the training process by directly optimizing based on user preferences, potentially achieving faster convergence.
Q. How does direct policy optimization differ from PPO?
A. PPO (Proximal Policy Optimization) uses a surrogate objective to ensure stable updates in reinforcement learning. Direct policy optimization directly modifies policy parameters based on performance feedback without using surrogate objectives.