Large Language Models are best known for their text-generation capabilities. During pre-training they are exposed to enormous amounts of text, which teaches them to understand language and produce meaningful tokens at generation time. Another common task in Natural Language Processing is Sequence Classification, where a given sequence is assigned to one of several categories. This can be done naively with Large Language Models through Prompt Engineering, but that approach does not always work reliably. Instead, we can modify the Large Language Model so that, for a given input, it outputs a probability for each category. This guide will show how to train such an LLM and how to work with the finetuned Llama 3 model.
For this guide, we will be working in Kaggle. The first step is to install the necessary libraries that we will require to finetune Llama 3 for Sequence Classification. Let us run the code below:
!pip install -q transformers accelerate trl bitsandbytes datasets evaluate huggingface_hub
!pip install -q peft scikit-learn
These commands install the following libraries: transformers, accelerate, trl, and bitsandbytes for loading, quantizing, and training the model; datasets and evaluate for data handling and evaluation; huggingface_hub for authenticating with the HuggingFace Hub; and peft and scikit-learn for LoRA finetuning and classification metrics.
After that, log in to the HuggingFace Hub. For this, we will work with the huggingface-cli tool. The code for this can be seen below:
!huggingface-cli login --token $YOUR_HF_TOKEN
Here, we call the login command of the huggingface-cli with the additional --token option, passing our HuggingFace token to log in to HuggingFace. To get the HuggingFace token, go to this link. As shown in the picture below, you can create an access token by clicking on New Token or use an existing token. Just copy that token and paste it in place of YOUR_HF_TOKEN.
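Alternatively, if you prefer to authenticate from Python instead of the CLI, the huggingface_hub library provides a login() function. Here is a minimal sketch, assuming the token is stored in an environment variable named HF_TOKEN (a hypothetical name; a Kaggle secret works just as well):

import os
from huggingface_hub import login

# Programmatic alternative to huggingface-cli; reads the token from an environment variable
login(token=os.environ["HF_TOKEN"])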
Next, we load the dataset for training. For this, we work with the below code:
from datasets import load_dataset
dataset = load_dataset("ag_news")
Running this downloads the ag_news dataset into the dataset variable. The ag_news dataset looks like the picture below:
It is a news classification dataset. The news articles are classified into four categories: World, Sports, Business, and Sci/Tech. Now, let us see whether the examples for each category are represented in equal numbers or whether there is any category imbalance.
import pandas as pd
df = pd.DataFrame(dataset['train'])
df.label.value_counts(normalize=True)
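The ag_news training split ships with 30,000 examples per class, so the normalized value counts should come out to roughly 0.25 per label (the exact formatting of the printout depends on your pandas version):

# Expected output (approximate):
# 0    0.25
# 1    0.25
# 2    0.25
# 3    0.25
# Name: label, dtype: float64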
As the output shows, all 4 labels have an equal proportion, which means each category has the same number of examples in the dataset. The dataset is huge and we only need a part of it, so we sample some data from this dataframe with the following code:
# Splitting the dataframe into 4 separate dataframes based on the labels
label_1_df = df[df['label'] == 0]
label_2_df = df[df['label'] == 1]
label_3_df = df[df['label'] == 2]
label_4_df = df[df['label'] == 3]
# Shuffle each label dataframe
label_1_df = label_1_df.sample(frac=1).reset_index(drop=True)
label_2_df = label_2_df.sample(frac=1).reset_index(drop=True)
label_3_df = label_3_df.sample(frac=1).reset_index(drop=True)
label_4_df = label_4_df.sample(frac=1).reset_index(drop=True)
# Splitting each label dataframe into train, test, and validation sets
label_1_train = label_1_df.iloc[:2000]
label_1_test = label_1_df.iloc[2000:2500]
label_1_val = label_1_df.iloc[2500:3000]
label_2_train = label_2_df.iloc[:2000]
label_2_test = label_2_df.iloc[2000:2500]
label_2_val = label_2_df.iloc[2500:3000]
label_3_train = label_3_df.iloc[:2000]
label_3_test = label_3_df.iloc[2000:2500]
label_3_val = label_3_df.iloc[2500:3000]
label_4_train = label_4_df.iloc[:2000]
label_4_test = label_4_df.iloc[2000:2500]
label_4_val = label_4_df.iloc[2500:3000]
# Concatenating the splits back together
train_df = pd.concat([label_1_train, label_2_train, label_3_train, label_4_train])
test_df = pd.concat([label_1_test, label_2_test, label_3_test, label_4_test])
val_df = pd.concat([label_1_val, label_2_val, label_3_val, label_4_val])
# Shuffle the dataframes to ensure randomness
train_df = train_df.sample(frac=1).reset_index(drop=True)
test_df = test_df.sample(frac=1).reset_index(drop=True)
val_df = val_df.sample(frac=1).reset_index(drop=True)
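As a side note, the same stratified 2000/500/500 split can be written more compactly with pandas groupby. The sketch below should produce equivalent train/test/val dataframes; the random_state is an arbitrary choice added here for reproducibility:

# Shuffle once, then slice each label group into train/test/val
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
grouped = shuffled.groupby('label', group_keys=False)

train_df = grouped.apply(lambda g: g.iloc[:2000]).sample(frac=1, random_state=42).reset_index(drop=True)
test_df = grouped.apply(lambda g: g.iloc[2000:2500]).sample(frac=1, random_state=42).reset_index(drop=True)
val_df = grouped.apply(lambda g: g.iloc[2500:3000]).sample(frac=1, random_state=42).reset_index(drop=True)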
For confirmation, let us check the value counts of each label in the training dataframe. The code for this will be:
train_df.label.value_counts()
We can confirm that the training dataframe has an equal number of examples (2000) for each of the four labels. Before training, we need to convert these Pandas DataFrames into a DatasetDict, which the HuggingFace training library accepts. For this, we work with the following code:
from datasets import DatasetDict, Dataset
# Converting pandas DataFrames into Hugging Face Dataset objects:
dataset_train = Dataset.from_pandas(train_df)
dataset_val = Dataset.from_pandas(val_df)
dataset_test = Dataset.from_pandas(test_df)
# Combine them into a single DatasetDict
dataset = DatasetDict({
    'train': dataset_train,
    'val': dataset_val,
    'test': dataset_test
})
dataset
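Printing the DatasetDict should show something close to the following (the row counts follow directly from the 2000/500/500 per-label splits above):

# Expected structure (approximate):
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 8000})
#     val: Dataset({features: ['text', 'label'], num_rows: 2000})
#     test: Dataset({features: ['text', 'label'], num_rows: 2000})
# })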
We can see that the DatasetDict contains three Datasets: train, test, and validation. Each of these datasets contains only two columns: text and label.
In our dataset, the proportion of each class is the same. In practice, this may not always be true. When the classes are imbalanced, we need to take proper measures so the LLM does not give more importance to the label with more examples. For this, we calculate class weights.
Class weights tell us how much importance to give to each class: the higher a class's weight, the more importance that class receives in the loss. With an imbalanced dataset, we assign a higher weight to the label with fewer examples, thus giving it more importance. To get these class weights, we can take the inverse of each class's proportion (normalized value counts) in the dataset. The code for this will be:
import torch
class_weights=(1/train_df.label.value_counts(normalize=True).sort_index()).tolist()
class_weights=torch.tensor(class_weights)
class_weights=class_weights/class_weights.sum()
class_weights
We can see from the output that the class weights are equal for all the classes; this is because all the classes have the same number of examples.
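To see what these weights look like when the classes are not balanced, here is a small hypothetical example with label proportions of 70%, 20%, and 10%; the rarest class ends up with the largest weight:

# Hypothetical imbalanced label proportions
imbalanced_props = torch.tensor([0.7, 0.2, 0.1])
weights = 1 / imbalanced_props     # inverse proportions: ~[1.43, 5.00, 10.00]
weights = weights / weights.sum()  # normalized: ~[0.09, 0.30, 0.61]
print(weights)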
In this section, we will download and prepare the model for training. The first step is to download the model. We cannot work with the full-precision model because we are dealing with a small GPU; hence, we will quantize it. The code for this will be:
from transformers import BitsAndBytesConfig, AutoModelForSequenceClassification
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = 'nf4',
    bnb_4bit_use_double_quant = True,
    bnb_4bit_compute_dtype = torch.bfloat16
)

model_name = "meta-llama/Meta-Llama-3-8B"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=4,
    device_map='auto'
)
Running the above downloads the Llama 3 8B Large Language Model from the HuggingFace Hub, quantizes it based on the quantization_config we provided, replaces the output head of the LLM with a linear head of 4 output neurons, and pushes the model to the GPU. Next, we will create a LoRA config so that only a subset of parameters is trained. The code for this will be:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
lora_config = LoraConfig(
    r = 16,
    lora_alpha = 8,
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout = 0.05,
    bias = 'none',
    task_type = 'SEQ_CLS'
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
Here, prepare_model_for_kbit_training readies the quantized model for training, and get_peft_model wraps the model with the LoRA configuration so that it can be trained with a PEFT method, LoRA in this case.
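To confirm that LoRA will only update a small fraction of the network, we can print the trainable parameter count; print_trainable_parameters() is available on any PEFT-wrapped model:

# With r=16 LoRA adapters on the attention projections, only a tiny
# percentage of the total parameters should be reported as trainable
model.print_trainable_parameters()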
In this section, we will test the Llama 3 model on the test data before the model has been trained. To do this, we will first download the tokenizer. The code for this will be:
from transformers import AutoTokenizer
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False
model.config.pretraining_tp = 1
Here, we load the tokenizer and set its pad token to the EOS token, since Llama 3 does not define a pad token by default. Next, we also edit the model configuration by setting the model's pad token ID to the tokenizer's pad token ID and disabling the cache. Now, we will give our test data to the model and collect the outputs:
sentences = test_df.text.tolist()
batch_size = 32
all_outputs = []
for i in range(0, len(sentences), batch_size):
    batch_sentences = sentences[i:i + batch_size]
    inputs = tokenizer(batch_sentences, return_tensors="pt",
                       padding=True, truncation=True, max_length=512)
    inputs = {k: v.to('cuda' if torch.cuda.is_available() else 'cpu') for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
        all_outputs.append(outputs['logits'])

final_outputs = torch.cat(all_outputs, dim=0)
test_df['predictions'] = final_outputs.argmax(axis=1).cpu().numpy()
Running this code stores the model outputs (logits) in final_outputs and adds the predictions to the test DataFrame in a new column. We take the argmax of each output, which gives us the label with the highest score for each example. Now, we need to evaluate the predictions generated by the LLM, which we can do through the code below:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import balanced_accuracy_score, classification_report
def get_metrics_result(test_df):
    y_test = test_df.label
    y_pred = test_df.predictions

    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Balanced Accuracy Score:", balanced_accuracy_score(y_test, y_pred))
    print("Accuracy Score:", accuracy_score(y_test, y_pred))

get_metrics_result(test_df)
Running this produces the following results. We get an accuracy of 0.23, which is very low; the model's precision, recall, and f1-score are very low too, none of them reaching even 50%. Evaluating again after training the model will show us how well it has been trained.
Before we start training, we need to preprocess the data that will be sent to the model. For this, we work with the following code:
def data_preprocessing(row):
    return tokenizer(row['text'], truncation=True, max_length=512)

tokenized_data = dataset.map(data_preprocessing, batched=True,
                             remove_columns=['text'])
tokenized_data.set_format("torch")
Now each Dataset in the DatasetDict contains three features/columns: label, input_ids, and attention_mask. The input_ids and attention_mask are produced for each text by the preprocessing function above. We will need a data collator for batching the data while training. For this, we work with the following code:
from transformers import DataCollatorWithPadding
collate_fn = DataCollatorWithPadding(tokenizer=tokenizer)
This ensures that all the inputs in a batch have the same length, which is required for batch processing. The collator pads shorter inputs to the longest sequence length in the batch using the pad token, thus allowing the whole batch to be processed at once.
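As a quick, optional sanity check, we can pass two tokenized examples of different lengths through the collator and confirm that they come back padded to the same length. This is just a sketch using the tokenized_data object defined above:

# Collate two examples into one batch; the shorter one gets padded
sample_batch = collate_fn([tokenized_data['train'][0], tokenized_data['train'][1]])
print(sample_batch['input_ids'].shape)       # (2, length of the longer example)
print(sample_batch['attention_mask'].shape)  # same shape; zeros mark the padded positions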
Before we start training, we need an evaluation metric. The default training objective for a Large Language Model is the negative log-likelihood (cross-entropy) loss, but because we are turning the LLM into a sequence classifier, we define the classification metrics we want to track during training:
import numpy as np

def compute_metrics(evaluations):
    predictions, labels = evaluations
    predictions = np.argmax(predictions, axis=1)
    # sklearn metrics expect (y_true, y_pred) in that order
    return {'balanced_accuracy': balanced_accuracy_score(labels, predictions),
            'accuracy': accuracy_score(labels, predictions)}
Along with the custom metric, we also define a custom Trainer for our LLM, which is needed because we want to apply class weights to the loss. The code for this will be:
import torch.nn.functional as F
from transformers import Trainer

class CustomTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        if class_weights is not None:
            self.class_weights = torch.tensor(class_weights,
                                              dtype=torch.float32).to(self.args.device)
        else:
            self.class_weights = None

    # extra keyword arguments (e.g. num_items_in_batch in newer transformers versions)
    # are accepted and ignored
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels").long()
        outputs = model(**inputs)
        logits = outputs.get('logits')
        if self.class_weights is not None:
            loss = F.cross_entropy(logits, labels, weight=self.class_weights)
        else:
            loss = F.cross_entropy(logits, labels)
        return (loss, outputs) if return_outputs else loss
Now, we will define our Training Arguments. The code for this is below:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = 'sentiment_classification',
    learning_rate = 1e-4,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 8,
    num_train_epochs = 1,
    logging_steps = 1,
    weight_decay = 0.01,
    evaluation_strategy = 'epoch',
    save_strategy = 'epoch',
    load_best_model_at_end = True,
    report_to = "none"
)
This creates our TrainingArguments object. Now, we are ready to pass it to the custom Trainer we created. The code for this is below:
trainer = CustomTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_data['train'],
    eval_dataset = tokenized_data['val'],
    tokenizer = tokenizer,
    data_collator = collate_fn,
    compute_metrics = compute_metrics,
    class_weights = class_weights,
)

train_result = trainer.train()
We have now created the trainer object and called its .train() function to start the training process, storing the results in train_result.
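Since this trainer only updates the LoRA adapter, it is worth saving it once training finishes so the classifier can be reloaded later. The sketch below is one way to do it; the folder name llama3-agnews-classifier is a hypothetical choice, and AutoPeftModelForSequenceClassification from the peft library is assumed to be available in your installed version:

# Save the trained LoRA adapter and the tokenizer
trainer.save_model("llama3-agnews-classifier")
tokenizer.save_pretrained("llama3-agnews-classifier")

# Reload later: loads the base model and applies the saved adapter on top of it
from peft import AutoPeftModelForSequenceClassification
reloaded_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "llama3-agnews-classifier",
    num_labels=4,
    quantization_config=quantization_config,
    device_map='auto'
)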
Now, let us try to perform evaluations to test the newly trained model on the test data:
def generate_predictions(model, df_test):
    sentences = df_test.text.tolist()
    batch_size = 32
    all_outputs = []

    for i in range(0, len(sentences), batch_size):
        batch_sentences = sentences[i:i + batch_size]
        inputs = tokenizer(batch_sentences, return_tensors="pt",
                           padding=True, truncation=True, max_length=512)
        inputs = {k: v.to('cuda' if torch.cuda.is_available() else 'cpu')
                  for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model(**inputs)
            all_outputs.append(outputs['logits'])

    final_outputs = torch.cat(all_outputs, dim=0)
    df_test['predictions'] = final_outputs.argmax(axis=1).cpu().numpy()

generate_predictions(model, test_df)
get_metrics_result(test_df)
Running this code generates the following results. We see a great boost in the model's accuracy, and other metrics like precision, recall, and f1-score have also increased from their initial values. The overall accuracy has increased from 0.23 before training to 0.93 after training, an improvement of 0.70, i.e. 70 percentage points. From this, we can see that Large Language Models are very capable of being employed as sequence classifiers.
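One last aside: the classification head returns raw logits, so if you want the per-class probabilities mentioned in the introduction rather than just the predicted label, apply a softmax. A minimal sketch on a single made-up headline:

# Per-class probabilities for one example (the headline text is hypothetical)
example = tokenizer(["Stocks rally as tech shares lead the market higher"],
                    return_tensors="pt", padding=True, truncation=True, max_length=512)
example = {k: v.to('cuda' if torch.cuda.is_available() else 'cpu') for k, v in example.items()}
with torch.no_grad():
    logits = model(**example).logits
probs = torch.softmax(logits.float(), dim=-1)  # shape (1, 4); each row sums to 1
print(probs)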
In conclusion, fine-tuning large language models such as Llama 3 for sequence classification involves several detailed steps, from preparing the dataset to quantizing the model for efficient training on limited hardware. By utilizing libraries from HuggingFace and applying techniques such as quantization and LoRA, it is possible to effectively train these models for specific tasks such as news classification. This guide has demonstrated the entire process, from initial setup and data preprocessing to model training and evaluation, highlighting the versatility and power of LLMs in natural language processing tasks.
Q1. What are Large Language Models?
A. Large Language Models are AI systems trained on vast amounts of text data to understand and generate human-like language.
Q2. What is Sequence Classification?
A. Sequence Classification is a task where a sequence of text is categorized into predefined classes or labels.
Q3. Why might Prompt Engineering not be enough for classification?
A. Prompt Engineering might not be reliable because it depends heavily on the prompt's structure and can lack consistency.
Q4. Why do we quantize the model?
A. Quantization reduces a model's memory footprint, making it feasible to run on hardware with limited resources, like low-RAM GPUs.
Q5. How do we handle class imbalance?
A. Class imbalance can be addressed by calculating and applying class weights, giving more importance to less frequent classes during training.