Large Language Models go through multiple training stages before they perform well. Stages such as Supervised Fine-Tuning (SFT) and Preference Alignment are crucial for teaching the model new skills and aligning its responses with human preferences. However, each stage consumes a significant amount of time and compute. One solution is Odds Ratio Preference Optimization (ORPO), which combines SFT and Preference Tuning in a single step. This guide explores ORPO and its potential to reduce the time needed to train Large Language Models.
An LLM typically goes through several fine-tuning stages. Each stage consumes a lot of time, and the larger the dataset, the longer the training takes. In particular, Supervised Fine-Tuning and Preference Alignment, being performed as separate steps, account for much of the training time.
ORPO, or Odds Ratio Preference Optimization, aims to reduce both the training time and the resources required for Preference Optimization. It does this by combining Supervised Fine-Tuning and Preference Optimization in a single step. ORPO removes the need for a separate reference or reward model, which other preference algorithms like DPO and PPO rely on. The idea behind ORPO is that the SFT objective, combined with an odds-ratio penalty, is enough to steer the model toward chosen responses and away from rejected ones. The formula for the new loss can be seen below:
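Based on the ORPO paper, the combined objective has roughly the following form, where $y_w$ is the chosen response, $y_l$ the rejected one, $\sigma$ the sigmoid function, and $\lambda$ the weight on the odds-ratio term:

$$\mathcal{L}_{ORPO} = \mathbb{E}_{(x,\, y_w,\, y_l)}\big[\mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR}\big], \qquad \mathcal{L}_{OR} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right), \qquad \text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$$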
The odds term in ORPO measures how likely the model is to generate an output sequence y given an input sequence x: odds of n means the model is n times more likely to generate y than not to generate it. The odds ratio of the chosen response over the rejected response then measures how much more likely the model is to produce the chosen response than the rejected one.
The log of this odds ratio is taken because the raw ratio of chosen over rejected probabilities tends to be a very small value. A sigmoid is then applied to this log odds ratio, and the negative log of the result forms the odds-ratio loss. This loss is added to the SFT loss, weighted by a tunable hyperparameter lambda.
The ORPOTrainer minimizes this combined Negative Log Likelihood (SFT) loss and odds-ratio loss while supervised fine-tuning the Large Language Model. This pushes the model toward the chosen responses and away from the rejected ones, eliminating the need for an additional reward model. As a result, the compute required for preference and alignment tuning drops significantly, reducing the overall training and tuning time for Large Language Models.
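As a rough illustration of the loss described above, here is a minimal PyTorch sketch. It assumes the average token log-probabilities of the chosen and rejected responses have already been computed; the function name orpo_loss and the arguments log_p_chosen, log_p_rejected, nll_chosen, and lam are illustrative, not the actual trl implementation.

import torch
import torch.nn.functional as F

def orpo_loss(log_p_chosen, log_p_rejected, nll_chosen, lam=0.1):
    # log(odds) = log(p) - log(1 - p), computed from log-probabilities for numerical stability
    log_odds_chosen = log_p_chosen - torch.log1p(-torch.exp(log_p_chosen))
    log_odds_rejected = log_p_rejected - torch.log1p(-torch.exp(log_p_rejected))
    # log of the odds ratio of the chosen response over the rejected response
    log_odds_ratio = log_odds_chosen - log_odds_rejected
    # odds-ratio loss: negative log-sigmoid of the log odds ratio
    ratio_loss = -F.logsigmoid(log_odds_ratio)
    # combined objective: SFT loss on the chosen response plus the weighted odds-ratio term
    return nll_chosen + lam * ratio_loss

# Example with dummy per-sequence values
loss = orpo_loss(torch.tensor(-0.5), torch.tensor(-2.0), torch.tensor(0.5))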
We will now proceed with the steps for fine-tuning Llama 3 with ORPO.
In this section, we will fine-tune the newly launched Llama 3 with ORPO. For this, we will work in a Kaggle Notebook and start by installing the following libraries.
!pip install -U -q xformers --index-url https://download.pytorch.org/whl/cu121
!pip install -q "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q datasets trl transformers accelerate huggingface_hub wandb
To work with the Meta Model, first, we need to accept their terms and conditions. Go to this link, sign in with your HuggingFace account, and accept their agreement policy. After this, we will log in to our HuggingFace account through the huggingface-cli command.
We will start with dataset loading and data preprocessing part. First, we need to log in with our huggingface account so we can access and download Meta’s Llama 3 8B model and the tokenizer. For this, the code will be:
!huggingface-cli login --token $your_api_key
In the above command, provide your HuggingFace token. This token can be obtained from the HuggingFace website. Running this command will log us into our HuggingFace account.
Next, we will download the model. The code for this will be:
from transformers import AutoTokenizer
base_model = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
Running the code will download the Llama 3 tokenizer from the Meta HuggingFace repository. This tokenizer is needed to apply the Llama 3 chat format to the dataset that we will be working with and to tokenize it.
Now we will download the dataset that we will finetune our Llama 3 on. The code for this will be:
from datasets import load_dataset
dataset_name = "jondurbin/truthy-dpo-v0.1"
dataset = load_dataset(dataset_name)
Running this code will download the “truthy-dpo-v0.1” data from HuggingFace and store it in the variable dataset.
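For a quick look at the splits and a sample row, we can print a few fields (the split name "train" and the column names follow the dataset description below):

# Peek at the dataset structure and one example row
print(dataset)
sample = dataset["train"][0]
print(sample["system"])
print(sample["prompt"])
print(sample["chosen"])
print(sample["rejected"])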
We will be working with the four columns in the dataset. These are the system, prompt, chosen, and rejected columns. The system and the prompt columns contain the system message and the user prompt. The chosen column contains the chosen response and the rejected column contains the rejected response.
We need to create new chosen and rejected columns, where each column contains the system message, the user prompt, and the chosen or rejected response in the Llama 3 chat format. The code for this can be seen below:
def format_chat_template(row):
    # Chat-formatted conversation ending with the chosen response
    message_chosen = [{"role": "system", "content": row['system']},
                      {"role": "user", "content": row['prompt']},
                      {"role": "assistant", "content": row['chosen']}]
    # Same system message and prompt, but ending with the rejected response
    message_rejected = [{"role": "system", "content": row['system']},
                        {"role": "user", "content": row['prompt']},
                        {"role": "assistant", "content": row['rejected']}]
    # Combine the system message and the user prompt into a single prompt string
    prompt = row['system'] + '\n' + row['prompt']
    # Apply the Llama 3 chat template without tokenizing, so the columns stay as strings
    row["chosen"] = tokenizer.apply_chat_template(message_chosen, tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(message_rejected, tokenize=False)
    row['prompt'] = prompt
    return row
The provided code defines a function called format_chat_template that takes a row of data as input and returns a modified version of that row.
Inside the function, two message lists are created: one ending with the chosen response and one ending with the rejected response. Each list holds the system message, the user prompt, and the corresponding assistant response, and is converted to the Llama 3 chat format with tokenizer.apply_chat_template. The prompt column is also rebuilt by joining the system message and the user prompt.
Now, we will apply this function to the Dataset that we have just downloaded. For this, we work with the following code:
import os

dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),
)
Here, we map the function we just defined onto the dataset downloaded from HuggingFace. To do this, we call the map function of the dataset object and pass it the formatting function along with the CPU count, so that the rows are processed in parallel. Running this code modifies the data within the dataset into the formatting required for the training process.
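To confirm the formatting worked, we can print one transformed row as a quick sanity check (the exact template markers in the output depend on the tokenizer's chat template):

print(dataset["train"][0]["prompt"])
print(dataset["train"][0]["chosen"])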
Finally, we are done with the data pre-processing part. Next, we will download the Llama-3 8 Billion model and train it with this dataset.
In this section, we will download the model and start the training process.
First, we will begin with downloading the model. The code for this will be:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None          # None lets unsloth auto-detect the best dtype for the GPU
load_in_4bit = True   # load the model in 4-bit to fit in Kaggle GPU memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = secret_value_0,   # HuggingFace token, e.g. read from Kaggle Secrets earlier in the notebook
)
Running the above code will download the pre-quantized 4-bit Llama 3 8B model (unsloth/llama-3-8b-bnb-4bit) along with its tokenizer.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                   # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",   # unsloth's memory-efficient checkpointing
    random_state = 3407,
    use_rslora = False,                       # rank-stabilized LoRA disabled
    loftq_config = None,
)
Now we get the PEFT version of our model by calling the .get_peft_model() function of the FastLanguageModel class. To it, we pass the base model along with the LoRA rank r, the attention and MLP projection layers to attach adapters to (target_modules), the lora_alpha scaling factor, the LoRA dropout, the bias setting, unsloth's gradient checkpointing, and a fixed random seed.
Running this code will create the LoRA Adapters, which we will be training with the dataset that we have downloaded.
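As a quick sanity check, assuming the returned object behaves like a standard PEFT model (which exposes this helper), we can print how many parameters the LoRA adapters add relative to the frozen base model:

# Show trainable (LoRA) vs. total parameters
model.print_trainable_parameters()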
Let’s start by patching the DPOTrainer.
from unsloth import PatchDPOTrainer
PatchDPOTrainer()
The unsloth library has not yet released an official ORPO trainer of its own. To address this, we import PatchDPOTrainer, which patches the existing DPOTrainer and ORPOTrainer from the HuggingFace trl library to make them faster and more memory efficient.
from trl import ORPOConfig, ORPOTrainer

orpo_trainer = ORPOTrainer(
    model = model,
    args = ORPOConfig(
        output_dir = "/kaggle/working/model",
        max_prompt_length = 512,     # maximum length of the prompt part
        max_length = 1024,           # maximum length of prompt + response
        logging_steps = 1,
        per_device_train_batch_size = 2,
        remove_unused_columns = False,
        gradient_accumulation_steps = 2,
        optim = "paged_adamw_8bit",
        lr_scheduler_type = "cosine",
        gradient_checkpointing = True,
        beta = 0.1,                  # lambda weight on the odds-ratio part of the loss
        num_train_epochs = 1,
        fp16 = True,
        do_eval = False,
    ),
    train_dataset = dataset["train"],
    tokenizer = tokenizer,
)
We start by importing the ORPOTrainer and ORPOConfig from the trl library. Inside the ORPOConfig we set the training parameters: the output directory, the maximum prompt and total sequence lengths, the per-device batch size and gradient accumulation steps, the paged 8-bit AdamW optimizer with a cosine learning-rate schedule, gradient checkpointing, fp16 training, a single epoch, and beta, which weights the odds-ratio part of the loss.
We then pass this ORPOConfig, which holds the training arguments, to the ORPOTrainer along with the train split of the dataset and the tokenizer. Running this code creates the ORPOTrainer, which is ready to start the training step.
We will initiate the training with the following code.
orpo_trainer.train()
Calling .train() on the orpo_trainer starts the training process. In the training output we get metrics like the training loss, rewards/chosen, rewards/rejected, and so on. A total of 247 steps were taken to complete one epoch of training on the entire dataset, and as the number of steps increased, the training loss came down.
The logged odds_ratio fluctuates but increases overall with the number of steps. This indicates a growing probability of generating the chosen responses compared to the rejected ones, showing that alignment tuning of a Large Language Model works with ORPO, or Odds Ratio Preference Optimization.
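After training, we would typically save the resulting LoRA adapters and tokenizer so they can be reloaded later. A minimal sketch, reusing the Kaggle output directory set in the ORPOConfig above:

# Save the trained LoRA adapters and the tokenizer
orpo_trainer.save_model("/kaggle/working/model")
tokenizer.save_pretrained("/kaggle/working/model")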
Odds Ratio Preference Optimization (ORPO) presents a promising approach to efficiently fine-tune large language models like Llama 3 by combining Supervised Fine-Tuning and Preference Optimization in a single step. By introducing an odds ratio term in the training loss, ORPO effectively balances the selection of preferred outputs over rejected ones, all while eliminating the need for a separate reward model. This streamlined approach not only reduces the training time and computational resources required but also leads to a more coherent and efficient model. ORPO demonstrates its potential in aligning language models more closely with human preferences, optimizing their ability to generate high-quality, relevant responses in various applications.
Q. What is ORPO?
A. ORPO stands for Odds Ratio Preference Optimization, a method that combines supervised fine-tuning and preference optimization in a single step for efficient training.
Q. How does ORPO reduce training time and resources?
A. ORPO reduces both training time and computing resources by combining two fine-tuning steps, which streamlines the process and eliminates the need for a separate reward model.
Q. How does ORPO handle preference alignment without a reward model?
A. ORPO eliminates the need for a reward model and integrates the odds ratio into the training loss to steer models toward chosen responses and away from rejected ones.
Q. What is the main advantage of using ORPO?
A. The main advantage is the reduction in training time and computational resources needed, allowing more efficient fine-tuning of large language models.