Over the past few years, the landscape of natural language processing (NLP) has undergone a remarkable transformation, driven largely by the advent of large language models and our ability to fine-tune them. These sophisticated models have opened the doors to a wide array of applications, ranging from language translation to sentiment analysis and even the creation of intelligent chatbots.

What truly sets these models apart is their versatility: fine-tuning them to tackle specific tasks and domains has become standard practice, unlocking their true potential and elevating their performance to new heights. In this comprehensive guide, we’ll delve into the world of fine-tuning large language models, covering everything from the basics to advanced techniques such as instruction fine-tuning. It will also give you useful context for prompt engineering.
This article was published as a part of the Data Science Blogathon.
Pre-trained language models are big neural networks trained on tons of text from the internet. They learn by predicting missing words in sentences, helping them understand grammar and context. Fine-tuning is the next step, where these models get customized for specific tasks using particular datasets, making them even more effective.
Examples of popular pre-trained language models include BERT (Bidirectional Encoder Representations from Transformers), GPT-3 (Generative Pre-trained Transformer 3), RoBERTa (A Robustly Optimized BERT Pretraining Approach), and many more. These models are known for their ability to perform tasks such as text generation, sentiment classification, and language understanding at an impressive level of proficiency.
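To get a feel for what such a pre-trained model can do out of the box, here is a minimal sketch that asks a masked language model to fill in a missing word, illustrating the "predict the missing word" objective described above. The model name and example sentence are illustrative choices, not from the article.

from transformers import pipeline

# Ask a pre-trained masked language model to predict the missing word
# (model and sentence are illustrative; any BERT-style checkpoint works)
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))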
Let’s discuss one of the language models in detail.
GPT-3 (Generative Pre-trained Transformer 3) is a ground-breaking language model architecture that has transformed natural language generation and understanding. The Transformer architecture is the foundation of GPT-3, scaled up to an enormous number of parameters to produce exceptional performance.
GPT-3 is made up of a stack of Transformer decoder layers. Each layer contains multi-head self-attention mechanisms and feed-forward neural networks. The attention mechanism enables the model to recognize dependencies and relationships between words, while the feed-forward networks process and transform the encoded representations.
The main innovation of GPT-3 is its enormous size, which allows it to capture a huge amount of language knowledge thanks to its astounding 175 billion parameters.
You can use the OpenAI API to interact with OpenAI’s GPT-3 models. Here is an example of text generation using GPT-3 via the (now legacy) Completions endpoint.
import openai
# Set up your OpenAI API credentials
openai.api_key = 'YOUR_API_KEY'
# Define the prompt for text generation
prompt = "A quick brown fox jumps"
# Make a request to GPT-3 for text generation
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=100,
    temperature=0.6
)
# Retrieve the generated text from the API response
generated_text = response.choices[0].text
# Print the generated text
print(generated_text)
AI models are smart with words, but they need extra training to become experts at specific tasks like understanding feelings or translating languages. This is called fine-tuning, and it’s what makes these models really useful for different jobs.
Fine-tuning is like giving a final polish to versatile models. Think of it as helping a multi-talented friend focus on one specific skill for a special event. You would provide them with targeted training, just like we do with pre-trained language models during fine-tuning.
Fine-tuning large language models involves training the pre-trained model on a smaller, task-specific dataset. This new dataset is labeled with examples relevant to the target task. By exposing the model to these labeled examples, it can adjust its parameters and internal representations to become well-suited for the target task.
Pre-trained language models are impressive, but they aren’t task-specific by default. Fine-tuning adapts these models for specialized tasks like sentiment analysis or domain-specific question answering. It enhances their accuracy by helping the model understand the nuances of a particular task. Fine-tuning offers two key benefits: it saves time and resources by leveraging pre-existing knowledge from pre-training, and it improves performance on specific tasks by focusing on domain-specific details.
The LLM fine-tuning process typically involves feeding the task-specific dataset to the pre-trained model and adjusting its parameters through backpropagation. The goal is to minimize the loss function, which measures the difference between the model’s predictions and the ground-truth labels in the dataset. This fine-tuning process updates the model’s parameters, making it more specialized for your target task.
Here we will walk through the process of fine-tuning a large language model for sentiment analysis. We’ll use the Hugging Face Transformers library, which provides easy access to pre-trained models and utilities for LLM fine-tuning.
The first step is to load the pre-trained language model and its corresponding tokenizer. For this example, we’ll use the 'distilbert-base-uncased' model, a lighter, distilled version of BERT.
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
# Load the pre-trained tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# Load the pre-trained model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
We need a labeled dataset with text samples and corresponding sentiments for sentiment analysis. Let’s create a small dataset for illustration purposes:
texts = ["I loved the movie. It was great!",
"The food was terrible.",
"The weather is okay."]
sentiments = ["positive", "negative", "neutral"]
Next, we’ll use the tokenizer to convert the text samples into the token IDs and attention masks the model requires.
# Tokenize the text samples
encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
# Extract the input IDs and attention masks
input_ids = encoded_texts['input_ids']
attention_mask = encoded_texts['attention_mask']
# Convert the sentiment labels to numerical form
sentiment_labels = [sentiments.index(sentiment) for sentiment in sentiments]
The sequence-classification model we loaded comes with a default head configured for only two labels, so we attach our own classification head for three-class sentiment analysis. In this case, we’ll add a simple linear layer and tell the model how many classes it now predicts.
import torch.nn as nn

# Add a custom classification head on top of the pre-trained model
num_classes = len(set(sentiment_labels))
classification_head = nn.Linear(model.config.hidden_size, num_classes)

# Replace the pre-trained model's classification head with our custom head
model.classifier = classification_head

# Update the label count so the model's built-in loss uses three classes
model.num_labels = num_classes
model.config.num_labels = num_classes
With the custom classification head in place, we can now fine-tune the model on the sentiment analysis dataset. We’ll use the AdamW optimizer and CrossEntropyLoss as the loss function.
import torch
import torch.optim as optim

# Define the optimizer and loss function
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# Fine-tune the model
num_epochs = 3
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=torch.tensor(sentiment_labels))
    loss = outputs.loss
    loss.backward()
    optimizer.step()
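After the loop finishes, we can run a quick prediction to see the fine-tuned model in action. This is a minimal sketch; the test sentence is illustrative, and with only three training examples the output is for demonstration rather than accuracy.

# Quick sanity check with the fine-tuned model (illustrative only; the tiny
# three-example dataset above is far too small for reliable predictions)
model.eval()
test_text = "The movie was absolutely wonderful!"
inputs = tokenizer(test_text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits.argmax(dim=-1).item()
print("Predicted sentiment:", sentiments[predicted_id])  # index maps back to the label list above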
In machine learning, fine-tuning is the process of further training a previously learned model, such as LLaMA, on a particular task or dataset in order to enhance its performance. With this method, what the model learned from a broad, general-purpose dataset is tapped into and tailored to the specifics of a given problem. The process is especially effective when using open-source tools, which provide a flexible and collaborative environment for experimentation and improvement. Validation is also crucial during fine-tuning to ensure that the adjustments made to the model genuinely improve its performance on the targeted task. Fine-tuning brings several benefits:
Efficiency: Fine-tuning reuses the knowledge captured during pre-training, so it requires far less data, compute, and time than training a model from scratch.
Enhanced Performance: Adapting the model to task-specific examples teaches it the nuances of the domain, improving accuracy on the target task.
Data Scarcity: Because the model already understands general language, fine-tuning can deliver strong results even when only a small labeled dataset is available.
Instruction fine-tuning is a specialized technique to tailor large language models to perform specific tasks based on explicit instructions. While traditional LLM fine-tuning involves training a model on task-specific data, instruction fine-tuning goes further by incorporating high-level instructions or demonstrations to guide the model’s behavior.
This approach allows developers to specify desired outputs, encourage certain behaviors, or achieve better control over the model’s responses. In this comprehensive guide, we will explore the concept of instruction fine-tuning and its implementation step-by-step.
What if we could go beyond traditional fine-tuning and provide explicit instructions to guide the model’s behavior? Instruction fine-tuning does exactly that, offering a new level of control and precision over model outputs. Here we will explore the process of instruction fine-tuning large language models for sentiment analysis.
To begin, let’s load the pre-trained language model and its tokenizer. GPT-3’s weights are not publicly available, so for this example we’ll use GPT-2, an openly available model with the same decoder-only architecture.
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

# Load the pre-trained tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# GPT-2 has no padding token by default, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# Load the pre-trained model for sequence classification (three sentiment classes)
model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id
For instruction fine-tuning, we need to augment the sentiment analysis dataset with explicit instructions for the model. Let’s create a small dataset for demonstration:
texts = ["I loved the movie. It was great!",
"The food was terrible.",
"The weather is okay."]
sentiments = ["positive", "negative", "neutral"]
instructions = ["Analyze the sentiment of the text and identify if it is positive.",
"Analyze the sentiment of the text and identify if it is negative.",
"Analyze the sentiment of the text and identify if it is neutral."]
Next, let’s tokenize the texts and instructions using the tokenizer, and convert the sentiment labels to class IDs:

# Tokenize the texts and instructions
encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
encoded_instructions = tokenizer(instructions, padding=True, truncation=True, return_tensors='pt')

# Extract input IDs, attention masks, and instruction IDs
input_ids = encoded_texts['input_ids']
attention_mask = encoded_texts['attention_mask']
instruction_ids = encoded_instructions['input_ids']

# Convert the sentiment labels to numerical form
sentiment_labels = [sentiments.index(sentiment) for sentiment in sentiments]
To incorporate instructions during instruction fine-tuning, we don’t need to change the model architecture; we simply prepare combined inputs by concatenating the instruction IDs with the input IDs:
import torch
# Concatenate instruction IDs with input IDs and adjust attention mask
input_ids = torch.cat([instruction_ids, input_ids], dim=1)
attention_mask = torch.cat([torch.ones_like(instruction_ids), attention_mask], dim=1)
With the instructions incorporated, we can now fine-tune the GPT-2 model on the augmented dataset. During fine-tuning, the instructions will guide the model’s sentiment analysis behavior.
import torch.optim as optim
# Define the optimizer and loss function
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()
# Fine-tune the model
num_epochs = 3
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=torch.tensor(sentiment_labels))
    loss = outputs.loss
    loss.backward()
    optimizer.step()
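At inference time, the same kind of instruction is prepended to a new text so the model sees the format it was trained on. This is a minimal, illustrative sketch; the instruction and test sentence are example values, not from the training set.

# Illustrative inference with an instruction prefix
model.eval()
test_instruction = "Analyze the sentiment of the text and identify if it is positive."
test_text = "The service at this restaurant exceeded my expectations."

encoded = tokenizer(test_instruction + " " + test_text, return_tensors='pt')
with torch.no_grad():
    logits = model(**encoded).logits

predicted_id = logits.argmax(dim=-1).item()
print("Predicted sentiment:", sentiments[predicted_id])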
Instruction fine-tuning takes the power of traditional fine-tuning to the next level, allowing us to control the behavior of large language models precisely. By providing explicit instructions, we can guide the model’s output and achieve more accurate and tailored results.
Standard fine-tuning involves training a model on a labeled dataset, honing its abilities to perform specific tasks effectively. However, when it comes to fine-tuning large language models like GPT-3.5, if we want to provide explicit instructions to guide the model’s behavior, instruction fine-tuning comes into play. This approach offers unparalleled control and adaptability, allowing us to tailor the model’s responses to meet specific criteria or address nuanced requirements.
The critical difference lies in the training data and the degree of control: standard fine-tuning trains on plain input-label (or input-output) pairs, whereas instruction fine-tuning wraps each example in an explicit natural-language instruction, letting developers steer what the model should do and how it should respond, as illustrated in the sketch below.
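To make the contrast concrete, here is an illustrative sketch of how the same training example might be formatted for each approach. The field names and response text are assumptions for illustration, not a prescribed schema.

# A plain supervised fine-tuning example: input text paired with a label
standard_example = {
    "text": "The food was terrible.",
    "label": "negative",
}

# An instruction fine-tuning example: an explicit instruction wraps the same input,
# and the target is a natural-language response (format is illustrative)
instruction_example = {
    "instruction": "Analyze the sentiment of the text and identify if it is negative.",
    "input": "The food was terrible.",
    "output": "The sentiment of the text is negative.",
}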
Fine-tuning large language models often leads to “catastrophic forgetting,” where a model loses valuable pre-trained knowledge while learning a new task. This happens because, during fine-tuning, the model focuses on the new task and unintentionally forgets broader language structures it previously learned. It’s like a ship’s crew rearranging cargo; some containers of knowledge get emptied to make room for new ones, causing some important information to be lost in the process.
To navigate the waters of catastrophic forgetting, we need strategies to safeguard the valuable knowledge captured during pre-training. There are two possible approaches.
Here we freeze certain layers of the model during fine-tuning. By freezing the early layers, which are responsible for fundamental language understanding, we preserve the core knowledge while fine-tuning only the later layers for the specific task and use case.
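As a minimal sketch, assuming the DistilBERT sentiment model from the earlier walkthrough, the embeddings and the first few transformer blocks can be frozen while the later blocks and the classification head stay trainable. The cut-off of four blocks is an arbitrary illustrative choice.

# Freeze the embeddings and the first four of DistilBERT's six transformer blocks;
# the remaining blocks and the classification head keep requires_grad=True.
for param in model.distilbert.embeddings.parameters():
    param.requires_grad = False

for block in model.distilbert.transformer.layer[:4]:
    for param in block.parameters():
        param.requires_grad = False

# Only the unfrozen parameters are passed to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.AdamW(trainable, lr=2e-5)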
Full fine-tuning needs memory not only for the model itself but for several other training-related components. Even if your hardware can hold the hundreds of gigabytes of weights of the largest models, you must also allocate memory for optimizer states, gradients, forward activations, and temporary buffers throughout training. These extra components can be several times larger than the model and quickly outgrow the capabilities of consumer hardware.
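A rough back-of-the-envelope calculation illustrates the point for a 7-billion-parameter model (roughly the size of the OPT-6.7B model used below). The byte counts are common assumptions for mixed-precision training with Adam (fp16 weights and gradients, fp32 optimizer states and master weights), not exact figures from the article.

# Rough memory estimate for fully fine-tuning a 7B-parameter model with Adam
# (byte counts are typical assumptions, not exact figures)
params = 7e9

weights_fp16 = params * 2          # model weights in half precision
gradients_fp16 = params * 2        # one gradient value per weight
adam_states_fp32 = params * 4 * 2  # Adam keeps two fp32 moment estimates per weight
master_weights_fp32 = params * 4   # fp32 copy of weights for mixed-precision training

total_bytes = weights_fp16 + gradients_fp16 + adam_states_fp32 + master_weights_fp32
print(f"Weights alone: {weights_fp16 / 1e9:.0f} GB")
print(f"Full fine-tuning (rough): {total_bytes / 1e9:.0f} GB")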
Parameter-efficient fine-tuning (PEFT) techniques update only a small subset of parameters, in contrast to full fine-tuning, which updates every model weight during supervised learning. Some PEFT techniques fine-tune a portion of the existing model parameters, such as specific layers or components, while freezing the majority of the model weights. Other methods add a small number of new parameters or layers and fine-tune only those new components, leaving the original model weights untouched. Most, if not all, LLM weights are kept frozen with PEFT, so the number of trained parameters is a tiny fraction of the original LLM’s.
PEFT empowers parameter-efficient models with impressive performance, revolutionizing the landscape of NLP. Here are a few reasons why we use PEFT.
PEFT approaches fine-tune only a small number of model parameters while freezing most of the pre-trained LLM, significantly lowering computational and storage costs. This also mitigates the catastrophic forgetting observed during full fine-tuning of LLMs.
In low-data regimes, PEFT approaches have also been demonstrated to be superior to fine-tuning and to better generalize to out-of-domain scenarios.
Let’s load the opt-6.7b model here; its weights on the Hub take roughly 13 GB in half precision (float16). They will require about 7 GB of memory if we load them in 8-bit.
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    load_in_8bit=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
Let’s freeze all our layers and cast the layer norm in float32 for stability before applying some post-processing to the 8-bit model to enable training. We also cast the final layer’s output in float32 for the same reasons.
for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
To create a PeftModel with low-rank adapters (LoRA), we will use the get_peft_model utility function from PEFT.
The function below calculates and prints the number of trainable parameters and the total number of parameters in a given model, along with the percentage of trainable parameters, giving an overview of the model’s complexity and the resources required for training.
def print_trainable_parameters(model):
    # Prints the number of trainable parameters in the model.
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || "
        f"trainable%: {100 * trainable_params / all_param}"
    )
This uses the Peft library to create a LoRA model with specific configuration settings, including dropout, bias, and task type. It then obtains the trainable parameters of the model and prints the total number of trainable parameters and all parameters, along with the percentage of trainable parameters.
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
This uses the Hugging Face Transformers and Datasets libraries to train a language model on a given dataset. It utilizes the ‘transformers.Trainer’ class to define the training setup, including batch size, learning rate, and other training-related configurations and then trains the model on the specified dataset.
import transformers
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)
trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=200,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
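Once training finishes, a short generation run is a quick way to check that the adapted model still produces sensible text. This is a minimal sketch under the setup above; the prompt is an illustrative choice, and the cache disabled earlier is re-enabled for inference.

# Quick generation check after LoRA training (prompt is illustrative)
model.config.use_cache = True  # re-enable the cache that was disabled for training
model.eval()

batch = tokenizer("Two things are infinite: ", return_tensors='pt').to(model.device)
with torch.no_grad():
    output_tokens = model.generate(**batch, max_new_tokens=50)

print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))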
We will look closer at some exciting real-world use cases of fine-tuning large language models, where NLP advancements are transforming industries and empowering innovative solutions.
In the real world, fine-tuning large language models is widely used across industries. It empowers businesses and researchers to harness NLP capabilities for various tasks. This leads to enhanced efficiency, improved decision-making, and enriched user experiences.
RAG stands for Retrieval-Augmented Generation, a method that improves the performance of large language models (LLMs). Here is an explanation of how it works:
Large-scale text and code datasets are used to train LLMs. This enables them to accomplish amazing tasks like text generation, language translation, and composing creative content. However, they may struggle to maintain factual accuracy and an up-to-date knowledge base.
RAG combines an LLM with an information retrieval system. When a user submits a query, RAG first gathers pertinent materials from a trustworthy knowledge base (such as Wikipedia or an organization’s internal knowledge repository). The original query is then sent to the LLM along with these documents. With this additional context, the LLM can process the query more accurately.
Metrics play a crucial role in evaluating the performance of these models. Embedding techniques are employed to represent the documents and queries in a high-dimensional space, making the retrieval process efficient and relevant. Python is often used to implement these complex algorithms and manage the integration between the retrieval system and the LLM. Technologies like ChatGPT exemplify the practical applications of RAG, showcasing enhanced accuracy and context awareness in generating responses.
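To make the retrieve-then-generate flow concrete, here is a minimal sketch. The use of the sentence-transformers library for the retrieval step, the model name, and the documents and query are all illustrative assumptions; the article does not prescribe a specific retrieval stack.

# A minimal, illustrative RAG-style sketch: retrieve the most relevant document
# for a query with sentence embeddings, then build an augmented prompt for an LLM.
from sentence_transformers import SentenceTransformer, util

knowledge_base = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Photosynthesis converts sunlight, water, and carbon dioxide into glucose.",
    "The Transformer architecture was introduced in the paper 'Attention Is All You Need'.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(knowledge_base, convert_to_tensor=True)

query = "When was the Eiffel Tower built?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Pick the document most similar to the query
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_doc = knowledge_base[int(scores.argmax())]

# The augmented prompt would then be sent to the LLM of your choice
augmented_prompt = f"Context: {best_doc}\n\nQuestion: {query}\nAnswer:"
print(augmented_prompt)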
Fine-tuning large language models has emerged as a powerful technique to adapt these pre-trained models to specific tasks and domains. As the field of NLP advances, fine-tuning will remain crucial to developing cutting-edge language models and applications.
Hope you enjoyed this walkthrough of the process of fine-tuning large language models (LLMs). A hands-on fine-tuning tutorial like this one can help you master these techniques.
With fine-tuning, we navigate language with precision and creativity. This transforms how we interact with and understand text. Embrace the possibilities and unleash the full potential of language models through fine-tuning. The future of NLP is shaped with each finely tuned model.
Q. What is fine-tuning of large language models?
A. Fine-tuning large language models involves training a pre-trained model on a specific dataset to tailor its performance to a particular task or domain, enhancing its accuracy and relevance.
Q. What does it mean to fine-tune a model in machine learning?
A. In machine learning, fine-tuning a model means taking a pre-trained model and further training it on a new, smaller dataset specific to a task, improving its performance without training from scratch.
Q. What does fine-tuning an LLM involve?
A. Fine-tuning an LLM (large language model) involves additional training of a pre-trained language model on a domain-specific dataset, enabling the model to generate more accurate and relevant text for specific applications.
Q. What is the fine-tuning method?
A. The fine-tuning method consists of taking a pre-trained model and continuing its training on a new dataset, typically with a smaller learning rate, to adapt the model to new, specific tasks while preserving its previously learned knowledge.