Large Language Models go through multiple training stages before they perform well. Stages such as Supervised Fine-Tuning (SFT) and Preference Alignment are crucial for teaching the model new skills and aligning its responses with human preferences. However, each stage consumes a significant amount of time and compute. One solution is Odds Ratio Preference Optimization (ORPO), which combines SFT and Preference Tuning in a single step. This guide explores ORPO and its potential to reduce the time needed to train Large Language Models.
An LLM typically goes through several fine-tuning stages. Each stage consumes a lot of time, and the larger the dataset, the longer the training takes. In particular, Supervised Fine-Tuning and Preference Alignment, being performed as separate steps, account for much of the training time.
ORPO, or Odds Ratio Preference Optimization, aims to reduce both the training time and the resources required for Preference Optimization. It does this by combining Supervised Fine-Tuning and Preference Optimization in a single step. ORPO removes the need for a separate reference or reward model, which other preference algorithms like DPO and PPO rely on. The idea behind ORPO is that the SFT objective, combined with an odds-ratio penalty, is enough to steer the model toward chosen responses and away from rejected ones. The formula for the new loss can be seen below:
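Based on the ORPO paper, the combined objective has roughly the following form, where $y_w$ is the chosen response, $y_l$ the rejected one, $\sigma$ the sigmoid function, and $\lambda$ the weight on the odds-ratio term:

$$\mathcal{L}_{ORPO} = \mathbb{E}_{(x,\, y_w,\, y_l)}\big[\mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR}\big], \qquad \mathcal{L}_{OR} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right), \qquad \text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$$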
The odds term in ORPO measures how likely the model is to generate an output sequence y given an input sequence x: odds of n means the model is n times more likely to generate y than not to generate it. The odds ratio of the chosen response over the rejected response then measures how much more likely the model is to produce the chosen response than the rejected one.
The log of this odds ratio is taken because the raw ratio of chosen over rejected probabilities tends to be a very small value. A sigmoid is then applied to this log odds ratio, and the negative log of the result forms the odds-ratio loss. This loss is added to the SFT loss, weighted by a tunable hyperparameter lambda.
The ORPOTrainer minimizes this combined Negative Log Likelihood (SFT) loss and odds-ratio loss while supervised fine-tuning the Large Language Model. This pushes the model toward the chosen responses and away from the rejected ones, eliminating the need for an additional reward model. As a result, the compute required for preference and alignment tuning drops significantly, reducing the overall training and tuning time for Large Language Models.
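As a rough illustration of the loss described above, here is a minimal PyTorch sketch. It assumes the average token log-probabilities of the chosen and rejected responses have already been computed; the function name orpo_loss and the arguments log_p_chosen, log_p_rejected, nll_chosen, and lam are illustrative, not the actual trl implementation.

import torch
import torch.nn.functional as F

def orpo_loss(log_p_chosen, log_p_rejected, nll_chosen, lam=0.1):
    # log(odds) = log(p) - log(1 - p), computed from log-probabilities for numerical stability
    log_odds_chosen = log_p_chosen - torch.log1p(-torch.exp(log_p_chosen))
    log_odds_rejected = log_p_rejected - torch.log1p(-torch.exp(log_p_rejected))
    # log of the odds ratio of the chosen response over the rejected response
    log_odds_ratio = log_odds_chosen - log_odds_rejected
    # odds-ratio loss: negative log-sigmoid of the log odds ratio
    ratio_loss = -F.logsigmoid(log_odds_ratio)
    # combined objective: SFT loss on the chosen response plus the weighted odds-ratio term
    return nll_chosen + lam * ratio_loss

# Example with dummy per-sequence values
loss = orpo_loss(torch.tensor(-0.5), torch.tensor(-2.0), torch.tensor(0.5))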
We will now proceed with the steps for fine-tuning Llama 3 with ORPO.
In this section, we will fine-tune the newly launched Llama 3 with ORPO. For this, we will work in a Kaggle Notebook and start by installing the following libraries.
!pip install -U -q xformers --index-url https://download.pytorch.org/whl/cu121
!pip install -q "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q datasets trl transformers accelerate huggingface_hub wandb
To work with the Meta Model, first, we need to accept their terms and conditions. Go to this link, sign in with your HuggingFace account, and accept their agreement policy. After this, we will log in to our HuggingFace account through the huggingface-cli command.
We will start with dataset loading and data preprocessing part. First, we need to log in with our huggingface account so we can access and download Meta’s Llama 3 8B model and the tokenizer. For this, the code will be:
!huggingface-cli login --token $your_api_key
In the above command, provide your HuggingFace token. This token can be obtained from the HuggingFace website. Running this command will log us into our HuggingFace account.
Next, we will download the model. The code for this will be:
from transformers import AutoTokenizer
base_model = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
Running the code will download the Llama 3 tokenizer from the Meta HuggingFace repository. This tokenizer is needed to apply the Llama 3 chat format to the dataset that we will be working with and to tokenize it.
Now we will download the dataset that we will finetune our Llama 3 on. The code for this will be:
from datasets import load_dataset
dataset_name = "jondurbin/truthy-dpo-v0.1"
dataset = load_dataset(dataset_name)
Running this code will download the “truthy-dpo-v0.1” data from HuggingFace and store it in the variable dataset.
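For a quick look at the splits and a sample row, we can print a few fields (the split name "train" and the column names follow the dataset description below):

# Peek at the dataset structure and one example row
print(dataset)
sample = dataset["train"][0]
print(sample["system"])
print(sample["prompt"])
print(sample["chosen"])
print(sample["rejected"])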
We will be working with the four columns in the dataset. These are the system, prompt, chosen, and rejected columns. The system and the prompt columns contain the system message and the user prompt. The chosen column contains the chosen response and the rejected column contains the rejected response.
We need to create new chosen and rejected columns, where each column contains the system message, the user prompt, and the chosen or rejected response in the Llama 3 chat format. The code for this can be seen below:
def format_chat_template(row):
    # Chat-formatted conversation ending with the chosen response
    message_chosen = [{"role": "system", "content": row['system']},
                      {"role": "user", "content": row['prompt']},
                      {"role": "assistant", "content": row['chosen']}]
    # Same system message and prompt, but ending with the rejected response
    message_rejected = [{"role": "system", "content": row['system']},
                        {"role": "user", "content": row['prompt']},
                        {"role": "assistant", "content": row['rejected']}]
    # Combine the system message and the user prompt into a single prompt string
    prompt = row['system'] + '\n' + row['prompt']
    # Apply the Llama 3 chat template without tokenizing, so the columns stay as strings
    row["chosen"] = tokenizer.apply_chat_template(message_chosen, tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(message_rejected, tokenize=False)
    row['prompt'] = prompt
    return row
The provided code defines a function called format_chat_template that takes a row of data as input and returns a modified version of that row.
Inside the function, two message lists are created: one ending with the chosen response and one ending with the rejected response. Each list holds the system message, the user prompt, and the corresponding assistant response, and is converted to the Llama 3 chat format with tokenizer.apply_chat_template. The prompt column is also rebuilt by joining the system message and the user prompt.
Now, we will apply this function to the Dataset that we have just downloaded. For this, we work with the following code:
import os

dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),
)
Here, we map the function we just defined onto the dataset downloaded from HuggingFace. To do this, we call the map function of the dataset object and pass it the formatting function along with the CPU count, so that the rows are processed in parallel. Running this code modifies the data within the dataset into the formatting required for the training process.
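To confirm the formatting worked, we can print one transformed row as a quick sanity check (the exact template markers in the output depend on the tokenizer's chat template):

print(dataset["train"][0]["prompt"])
print(dataset["train"][0]["chosen"])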
Finally, we are done with the data pre-processing part. Next, we will download the Llama-3 8 Billion model and train it with this dataset.
In this section, we will download the model and start the training process.
First, we will begin with downloading the model. The code for this will be:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None          # None lets unsloth auto-detect the best dtype for the GPU
load_in_4bit = True   # load the model in 4-bit to fit in Kaggle GPU memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = secret_value_0,   # HuggingFace token, e.g. read from Kaggle Secrets earlier in the notebook
)
Running the above code will download the pre-quantized 4-bit Llama 3 8B model (unsloth/llama-3-8b-bnb-4bit) along with its tokenizer.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                   # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",   # unsloth's memory-efficient checkpointing
    random_state = 3407,
    use_rslora = False,                       # rank-stabilized LoRA disabled
    loftq_config = None,
)
Now we get the PEFT version of our model by calling the .get_peft_model() function of the FastLanguageModel class. To it, we pass the base model along with the LoRA rank r, the attention and MLP projection layers to attach adapters to (target_modules), the lora_alpha scaling factor, the LoRA dropout, the bias setting, unsloth's gradient checkpointing, and a fixed random seed.
Running this code will create the LoRA Adapters, which we will be training with the dataset that we have downloaded.
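As a quick sanity check, assuming the returned object behaves like a standard PEFT model (which exposes this helper), we can print how many parameters the LoRA adapters add relative to the frozen base model:

# Show trainable (LoRA) vs. total parameters
model.print_trainable_parameters()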
Let’s start by patching the DPOTrainer.
from unsloth import PatchDPOTrainer
PatchDPOTrainer()
The unsloth library has not yet released an official ORPO trainer of its own. To address this, we import PatchDPOTrainer, which patches the existing DPOTrainer and ORPOTrainer from the HuggingFace trl library to make them faster and more memory efficient.
from trl import ORPOConfig, ORPOTrainer

orpo_trainer = ORPOTrainer(
    model = model,
    args = ORPOConfig(
        output_dir = "/kaggle/working/model",
        max_prompt_length = 512,     # maximum length of the prompt part
        max_length = 1024,           # maximum length of prompt + response
        logging_steps = 1,
        per_device_train_batch_size = 2,
        remove_unused_columns = False,
        gradient_accumulation_steps = 2,
        optim = "paged_adamw_8bit",
        lr_scheduler_type = "cosine",
        gradient_checkpointing = True,
        beta = 0.1,                  # lambda weight on the odds-ratio part of the loss
        num_train_epochs = 1,
        fp16 = True,
        do_eval = False,
    ),
    train_dataset = dataset["train"],
    tokenizer = tokenizer,
)
We start by importing the ORPOTrainer and ORPOConfig from the trl library. Inside the ORPOConfig we set the training parameters: the output directory, the maximum prompt and total sequence lengths, the per-device batch size and gradient accumulation steps, the paged 8-bit AdamW optimizer with a cosine learning-rate schedule, gradient checkpointing, fp16 training, a single epoch, and beta, which weights the odds-ratio part of the loss.
We then pass this ORPOConfig, which holds the training arguments, to the ORPOTrainer along with the train split of the dataset and the tokenizer. Running this code creates the ORPOTrainer, which is ready to start the training step.
We will initiate the training with the following code.
orpo_trainer.train()
Calling .train() on the orpo_trainer starts the training process. In the training output we get metrics like the training loss, rewards/chosen, rewards/rejected, and so on. A total of 247 steps were taken to complete one epoch of training on the entire dataset, and as the number of steps increased, the training loss came down.
The logged odds_ratio fluctuates but increases overall with the number of steps. This indicates a growing probability of generating the chosen responses compared to the rejected ones, showing that alignment tuning of a Large Language Model works with ORPO, or Odds Ratio Preference Optimization.
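After training, we would typically save the resulting LoRA adapters and tokenizer so they can be reloaded later. A minimal sketch, reusing the Kaggle output directory set in the ORPOConfig above:

# Save the trained LoRA adapters and the tokenizer
orpo_trainer.save_model("/kaggle/working/model")
tokenizer.save_pretrained("/kaggle/working/model")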
Odds Ratio Preference Optimization (ORPO) presents a promising approach to efficiently fine-tune large language models like Llama 3 by combining Supervised Fine-Tuning and Preference Optimization in a single step. By introducing an odds ratio term in the training loss, ORPO effectively balances the selection of preferred outputs over rejected ones, all while eliminating the need for a separate reward model. This streamlined approach not only reduces the training time and computational resources required but also leads to a more coherent and efficient model. ORPO demonstrates its potential in aligning language models more closely with human preferences, optimizing their ability to generate high-quality, relevant responses in various applications.
Q. What is ORPO?
A. ORPO stands for Odds Ratio Preference Optimization, a method that combines supervised fine-tuning and preference optimization in a single step for efficient training.
Q. How does ORPO reduce training time and resources?
A. ORPO reduces both training time and computing resources by combining two fine-tuning steps, which streamlines the process and eliminates the need for a separate reward model.
Q. How does ORPO handle preference alignment without a reward model?
A. ORPO eliminates the need for a reward model and integrates the odds ratio into the training loss to steer models toward chosen responses and away from rejected ones.
Q. What is the main advantage of using ORPO?
A. The main advantage is the reduction in training time and computational resources needed, allowing more efficient fine-tuning of large language models.