Fine-tuning a natural language processing (NLP) model means taking a pre-trained model and continuing to train it on a task-specific dataset to improve its performance on that task. Along the way you typically tune hyperparameters such as the learning rate and batch size, and sometimes adjust the architecture, for example the number of layers or the size of the embeddings. Fine-tuning can be time-consuming and demands a firm grasp of both the model and the task. This article looks at how to fine-tune a Hugging Face model.
Learning Objectives
This article was published as a part of the Data Science Blogathon.
Hugging Face is a company that provides a platform for training and deploying natural language processing (NLP) models. The platform hosts a model library suitable for various NLP tasks, including language translation, text generation, and question answering. These models are trained on extensive datasets and are designed to perform well across a wide range of NLP tasks.
The Hugging Face platform also includes tools for fine-tuning pre-trained models on specific datasets, which helps adapt models to particular domains or languages. It also offers APIs for accessing pre-trained models from applications, as well as tools for building custom models and deploying them to the cloud.
Using the Hugging Face library for natural language processing (NLP) tasks has various advantages:
Importing necessary libraries is analogous to constructing a toolkit for a particular programming and data analysis activity. These libraries, which are frequently pre-written collections of code, offer a wide range of functions and tools that help to speed development. Developers and data scientists can access new capabilities, increase productivity, and use existing solutions by importing the appropriate libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration
from torch.optim import AdamW  # transformers.AdamW is deprecated; the PyTorch optimizer is a drop-in here
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
pl.seed_everything(100)
import warnings
warnings.filterwarnings("ignore")
Importing a dataset is a crucial initial step in data-driven projects.
df = pd.read_csv("/kaggle/input/queestion-answer-dataset-qa/train.csv")
df.columns
df = df[['context','question', 'text']]
print("Number of records: ", df.shape[0])
The goal is to create a model capable of generating an answer from a given context and question.
For example,
Context = “Clustering groups of similar cases, for example, can find similar patients or use for customer segmentation in the banking field. The association technique is used for finding items or events that often co-occur, for example, grocery items that a particular customer usually buys together. Anomaly detection is used to discover abnormal and unusual cases; for example, credit card fraud detection.”
Question = “What is the example of Anomaly detection?”
Answer = ????????????????????????????????
df["context"] = df["context"].str.lower()
df["question"] = df["question"].str.lower()
df["text"] = df["text"].str.lower()
df.head()
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
INPUT_MAX_LEN = 512 # Input length
OUT_MAX_LEN = 128 # Output Length
TRAIN_BATCH_SIZE = 8 # Training Batch Size
VALID_BATCH_SIZE = 2 # Validation Batch Size
EPOCHS = 5 # Number of training epochs
The T5 model is based on the Transformer architecture, a neural network designed to handle sequential input data effectively. It comprises an encoder and a decoder, which include a sequence of interconnected “layers.”
The encoder and decoder layers combine “attention” mechanisms and “feedforward” networks. The attention mechanisms enable the model to focus on different sections of the input sequence at each step, while the feedforward networks transform the data using a set of learned weights and biases.
The T5 model also employs “self-attention,” which allows each element in the input sequence to pay attention to every other element. This allows the model to recognize links between words and phrases in the input data, which is critical for many NLP applications.
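As a rough illustration of self-attention, the sketch below computes scaled dot-product attention over three toy vectors in plain Python. It is a simplification under stated assumptions: a real Transformer first projects each token through learned query, key, and value matrices and uses multiple heads, whereas here the token vectors are used directly and the numbers are made up.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (made-up numbers, no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` shows how strongly one position attends to every position, including itself, which is the sense in which every element can "pay attention" to every other element.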
In addition to the encoder and decoder, the T5 model contains a “language model head,” which predicts the next token in a sequence based on the preceding tokens. This is critical for translation and text generation tasks, where the model must produce cohesive and natural-sounding output.
The T5 model represents a large and sophisticated neural network designed for highly efficient and accurate processing of sequential input. It has undergone extensive training on a diverse text dataset and can proficiently perform a broad spectrum of natural language processing tasks.
T5Tokenizer turns text into a list of tokens. Each token is a subword unit: a whole word, a fragment of a word, or a punctuation mark. The tokenizer maps each token to an integer ID and adds special tokens to the input.
T5Tokenizer is built on SentencePiece, a subword tokenization method. It splits the input text into pieces chosen according to how frequently character sequences occur in the tokenizer's training data. This lets the tokenizer handle out-of-vocabulary (OOV) terms, which do not occur in the training data but do appear at test time, by breaking them into smaller known pieces.
For T5, the special tokens are </s>, which marks the end of a sequence, <pad>, which indicates padding, and <unk>, which stands in for pieces the vocabulary cannot represent. Unlike some other tokenizers, T5Tokenizer does not add a start-of-sequence token such as <s>.
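To make the idea of subword splitting concrete, here is a toy sketch. It is not the SentencePiece algorithm that T5 uses: it swaps in a simple greedy longest-match rule over a tiny hypothetical vocabulary, and only the ids 0, 1, and 2 are chosen to mirror T5's `<pad>`, `</s>`, and `<unk>` tokens.

```python
# Hypothetical toy vocabulary; ids 0-2 mirror T5's <pad>, </s>, <unk>.
VOCAB = {"<pad>": 0, "</s>": 1, "<unk>": 2,
         "fraud": 3, "detect": 4, "ion": 5, "card": 6, "s": 7}

def split_word(word):
    """Greedy longest-match split of one word into known subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in VOCAB:
            end -= 1
        if end == start:          # no known piece starts here
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces

def encode(text, max_len):
    """Tokenize, append </s>, then truncate or pad to max_len ids."""
    pieces = [p for word in text.lower().split() for p in split_word(word)]
    ids = [VOCAB[p] for p in pieces][: max_len - 1] + [VOCAB["</s>"]]
    return ids + [VOCAB["<pad>"]] * (max_len - len(ids))

ids = encode("Cards fraud detection xyz", max_len=8)
```

In this sketch "detection" is not in the vocabulary but is still representable as "detect" + "ion", while "xyz" falls back to `<unk>`.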
MODEL_NAME = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME, model_max_length= INPUT_MAX_LEN)
print("eos_token: {} and id: {}".format(tokenizer.eos_token,
    tokenizer.eos_token_id)) # End-of-sequence token (eos_token)
print("unk_token: {} and id: {}".format(tokenizer.unk_token,
    tokenizer.unk_token_id)) # Unknown token (unk_token)
print("pad_token: {} and id: {}".format(tokenizer.pad_token,
    tokenizer.pad_token_id)) # Pad token (pad_token)
When working with PyTorch, you usually prepare data for the model with a dataset class. The dataset class is responsible for loading data from disk and carrying out the required preparation steps, such as tokenization and conversion to numeric IDs. The class should also implement the __getitem__ method, which returns a single item from the dataset by index.
The __init__ method stores the lists of contexts, questions, and target answers, along with the tokenizer and the maximum lengths. The __len__ method returns the number of samples in the dataset. The __getitem__ method accepts an index and returns the tokenized input and labels for that sample.
It is also customary to include preprocessing steps such as padding and truncating the tokenized inputs, and to return the labels as tensors.
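The padding, truncation, and label-masking steps can be sketched without any library. The helper names below are hypothetical and the token ids are made up; the two constants reflect T5's padding id of 0 and the value -100 that PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 0           # T5's padding id
IGNORE_INDEX = -100  # label value the loss function skips

def pad_or_truncate(ids, max_len):
    """Cut ids to max_len, right-pad with PAD_ID, and build the attention mask."""
    ids = ids[:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    return ids + [PAD_ID] * (max_len - len(ids)), mask

def mask_labels(label_ids):
    """Replace padding positions so they do not contribute to the loss."""
    return [IGNORE_INDEX if t == PAD_ID else t for t in label_ids]

# Made-up token ids for one input and one target.
input_ids, attention_mask = pad_or_truncate([37, 912, 5, 1], max_len=6)
labels = mask_labels(pad_or_truncate([2837, 1], max_len=4)[0])
```

The attention mask tells the model which input positions are real tokens, while the masked labels keep the padded target positions out of the loss.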
class T5Dataset:
    def __init__(self, context, question, target):
        self.context = context
        self.question = question
        self.target = target
        self.tokenizer = tokenizer
        self.input_max_len = INPUT_MAX_LEN
        self.out_max_len = OUT_MAX_LEN

    def __len__(self):
        return len(self.context)

    def __getitem__(self, item):
        context = str(self.context[item])
        context = " ".join(context.split())
        question = str(self.question[item])
        question = " ".join(question.split())
        target = str(self.target[item])
        target = " ".join(target.split())

        inputs_encoding = self.tokenizer(
            context,
            question,
            add_special_tokens=True,
            max_length=self.input_max_len,
            padding='max_length',
            truncation='only_first',
            return_attention_mask=True,
            return_tensors="pt"
        )

        output_encoding = self.tokenizer(
            target,
            None,
            add_special_tokens=True,
            max_length=self.out_max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt"
        )

        inputs_ids = inputs_encoding["input_ids"].flatten()
        attention_mask = inputs_encoding["attention_mask"].flatten()
        labels = output_encoding["input_ids"]
        labels[labels == 0] = -100 # As per T5 Documentation
        labels = labels.flatten()

        out = {
            "context": context,
            "question": question,
            "answer": target,
            "inputs_ids": inputs_ids,
            "attention_mask": attention_mask,
            "targets": labels
        }
        return out
The DataLoader class loads data in batches and can do so in parallel, which makes it possible to work with datasets that would otherwise be too large to hold in memory. You combine the DataLoader class with a dataset class that contains the data to be loaded.
When training a transformer model, the dataloader is in charge of iterating over the dataset and returning batches of data to the model for training or evaluation. The DataLoader class offers various parameters to control loading, including the batch size, the number of worker processes, and whether to shuffle the data before each epoch.
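The batching and shuffling behaviour can be approximated in a few lines of plain Python. This is only a sketch of the idea, not how `torch.utils.data.DataLoader` is implemented; `iterate_batches` and its `seed` parameter are hypothetical names, and there is no parallel loading or tensor collation here.

```python
import random

def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield lists of samples, optionally in a shuffled order."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        # The last batch may be smaller than batch_size.
        yield [dataset[i] for i in indices[start:start + batch_size]]

samples = [f"example-{i}" for i in range(10)]
batches = list(iterate_batches(samples, batch_size=4))
shuffled = list(iterate_batches(samples, batch_size=4, shuffle=True, seed=0))
```

Anything that supports `len()` and integer indexing works as `dataset` in this sketch, which is the same contract the dataset class in this article fulfils.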
class T5DatasetModule(pl.LightningDataModule):
    def __init__(self, df_train, df_valid):
        super().__init__()
        self.df_train = df_train
        self.df_valid = df_valid
        self.tokenizer = tokenizer
        self.input_max_len = INPUT_MAX_LEN
        self.out_max_len = OUT_MAX_LEN

    def setup(self, stage=None):
        self.train_dataset = T5Dataset(
            context=self.df_train.context.values,
            question=self.df_train.question.values,
            target=self.df_train.text.values
        )
        self.valid_dataset = T5Dataset(
            context=self.df_valid.context.values,
            question=self.df_valid.question.values,
            target=self.df_valid.text.values
        )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_dataset,
            batch_size=TRAIN_BATCH_SIZE,
            shuffle=True,
            num_workers=4
        )

    def val_dataloader(self):
        return torch.utils.data.DataLoader(
            self.valid_dataset,
            batch_size=VALID_BATCH_SIZE,
            num_workers=1
        )
When creating a transformer model in PyTorch, you usually begin by creating a new class that derives from torch.nn.Module. This class describes the model’s architecture, including the layers and the forward function. The class’s __init__ method defines the architecture, often by instantiating the model’s layers and assigning them as class attributes.
The forward method is in charge of passing data through the model in the forward direction. This method accepts input data and applies the model’s layers to create the output. The forward method should implement the model’s logic, such as passing input through a sequence of layers and returning the result.
In this article's T5Model class, the base class is pl.LightningModule, which itself extends torch.nn.Module. Its __init__ method loads the pre-trained T5ForConditionalGeneration model and assigns it as an attribute, rather than building embedding, transformer, and fully connected layers by hand. The forward method passes the input IDs, attention mask, and labels to that model and returns the loss and logits. Training a transformer model typically involves two phases per epoch: training and validation.
The training_step method defines the logic for a single training step. Here it unpacks the input IDs, attention mask, and labels from the batch, runs the forward pass to compute the loss, logs the loss, and returns it so that Lightning can backpropagate and update the weights.
The validation_step method, like training_step, runs the forward pass and logs the loss, but on a batch from the validation set and without updating the weights.
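The difference between the two kinds of step can be shown on a deliberately tiny problem. The sketch below is an assumption-laden stand-in, not Lightning code: the "model" is a single weight `w` in `y = w * x`, the data is made up, and the gradient is computed by hand instead of by autograd.

```python
def loss_and_grad(w, batch):
    """Mean squared error of y = w * x on a batch, plus its gradient w.r.t. w."""
    n = len(batch)
    loss = sum((w * x - y) ** 2 for x, y in batch) / n
    grad = sum(2 * (w * x - y) * x for x, y in batch) / n
    return loss, grad

def training_step(w, batch, lr=0.05):
    """Forward pass, gradient, and one weight update."""
    loss, grad = loss_and_grad(w, batch)
    return w - lr * grad, loss

def validation_step(w, batch):
    """Forward pass only: the weight is left untouched."""
    loss, _ = loss_and_grad(w, batch)
    return loss

train_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # made-up data, y = 2x
valid_batch = [(4.0, 8.0)]

w = 0.0
history = []
for epoch in range(20):
    w, train_loss = training_step(w, train_batch)
    history.append(validation_step(w, valid_batch))
```

The validation loss recorded in `history` falls as `w` approaches 2, even though only the training step ever changes the weight.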
class T5Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

    def forward(self, input_ids, attention_mask, labels=None):
        output = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        input_ids = batch["inputs_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["targets"]
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch["inputs_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["targets"]
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log("val_loss", loss, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=0.0001)
Training a transformer model usually means iterating over the dataset in batches, passing each batch through the model, and updating the model’s parameters based on the computed gradients and the optimizer’s update rule.
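For a sense of what the optimizer does with each gradient, here is a sketch of one AdamW update applied to a single scalar parameter. The function name and the `state` dictionary are hypothetical; the default values follow PyTorch's AdamW defaults except for `lr`, which is set to the 0.0001 used in this article, and the toy loss `w**2` is made up.

```python
import math

def adamw_step(param, grad, state, lr=1e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter."""
    state["t"] += 1
    b1, b2 = betas
    # Decoupled weight decay: shrink the parameter directly.
    param *= 1 - lr * weight_decay
    # Running averages of the gradient and its square.
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    # Bias correction for the early steps.
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
w = 1.0
for _ in range(3):
    grad = 2 * w          # gradient of the toy loss w**2
    w = adamw_step(w, grad, state)
```

Because the update divides by the running gradient magnitude, each early step moves the parameter by roughly `lr` regardless of how large the raw gradient is.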
def run():
    df_train, df_valid = train_test_split(
        df[0:10000], test_size=0.2, random_state=101
    )

    df_train = df_train.fillna("none")
    df_valid = df_valid.fillna("none")

    # Collapse repeated whitespace in every text column.
    df_train['context'] = df_train['context'].apply(lambda x: " ".join(x.split()))
    df_valid['context'] = df_valid['context'].apply(lambda x: " ".join(x.split()))
    df_train['text'] = df_train['text'].apply(lambda x: " ".join(x.split()))
    df_valid['text'] = df_valid['text'].apply(lambda x: " ".join(x.split()))
    df_train['question'] = df_train['question'].apply(lambda x: " ".join(x.split()))
    df_valid['question'] = df_valid['question'].apply(lambda x: " ".join(x.split()))

    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)

    dataModule = T5DatasetModule(df_train, df_valid)
    dataModule.setup()

    device = DEVICE
    models = T5Model()
    models.to(device)

    # Keep the two checkpoints with the lowest validation loss.
    checkpoint_callback = ModelCheckpoint(
        dirpath="/kaggle/working",
        filename="best_checkpoint",
        save_top_k=2,
        verbose=True,
        monitor="val_loss",
        mode="min"
    )

    trainer = pl.Trainer(
        callbacks=[checkpoint_callback],
        max_epochs=EPOCHS,
        accelerator="gpu",  # assumes a GPU is available, as on Kaggle
        devices=1           # replaces the deprecated gpus=1 argument
    )

    trainer.fit(models, dataModule)

run()
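The ModelCheckpoint callback in run() is configured with monitor="val_loss", mode="min", and save_top_k=2. The sketch below illustrates what that selection rule amounts to; it is not Lightning's implementation, `keep_top_k` is a hypothetical helper, and the loss values are made up.

```python
def keep_top_k(val_losses, k=2):
    """Return (epoch, val_loss) pairs for the k lowest validation losses."""
    kept = []
    for epoch, loss in enumerate(val_losses):
        kept.append((epoch, loss))
        # Lower is better, matching mode="min"; drop the worst once over k.
        kept.sort(key=lambda pair: pair[1])
        del kept[k:]
    return kept

losses = [0.92, 0.61, 0.55, 0.58, 0.70]   # made-up per-epoch val_loss values
kept = keep_top_k(losses, k=2)
```

With these numbers the checkpoints from epochs 2 and 3 survive, which shows that the last epoch is not necessarily the one worth loading for inference.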
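To make predictions with a fine-tuned NLP model like T5 on new input, load a saved checkpoint, freeze the model, tokenize the context and question, generate output IDs, and decode them back into text: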
train_model = T5Model.load_from_checkpoint("/kaggle/working/best_checkpoint-v1.ckpt")
train_model.freeze()

def generate_question(context, question):
    # Tokenize the (context, question) pair exactly as during training.
    inputs_encoding = tokenizer(
        context,
        question,
        add_special_tokens=True,
        max_length=INPUT_MAX_LEN,
        padding='max_length',
        truncation='only_first',
        return_attention_mask=True,
        return_tensors="pt"
    )

    # Move the inputs to whichever device the loaded model is on.
    generate_ids = train_model.model.generate(
        input_ids=inputs_encoding["input_ids"].to(train_model.device),
        attention_mask=inputs_encoding["attention_mask"].to(train_model.device),
        max_length=OUT_MAX_LEN,  # targets were capped at OUT_MAX_LEN during training
        num_beams=4,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        early_stopping=True,
    )

    # Decode the generated ids back into text, dropping </s> and <pad>.
    preds = [
        tokenizer.decode(gen_id,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True)
        for gen_id in generate_ids
    ]

    return "".join(preds)
Let’s generate a prediction using the fine-tuned T5 model with new input:
context = (
    "Clustering groups of similar cases, for example, "
    "can find similar patients, or use for customer segmentation in the "
    "banking field. Using association technique for finding items or events that "
    "often co-occur, for example, grocery items that are usually bought together "
    "by a particular customer. Using anomaly detection to discover abnormal "
    "and unusual cases, for example, credit card fraud detection."
)
que = "what is the example of Anomaly detection?"
print(generate_question(context, que))
context = (
    "Classification is used when your target is categorical, "
    "while regression is used when your target variable "
    "is continuous. Both classification and regression belong to the category "
    "of supervised machine learning algorithms."
)
que = "When is classification used?"
print(generate_question(context, que))
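The generate call above uses num_beams = 4, that is, beam search rather than greedy decoding. The sketch below shows the core idea on a hypothetical table of next-token probabilities. It is a simplified version under stated assumptions: there is no neural network, no length penalty, and no n-gram blocking, all of which the Hugging Face implementation handles differently.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    "<start>": {"credit": 0.6, "card": 0.4},
    "credit": {"card": 0.5, "fraud": 0.3, "</s>": 0.2},
    "card": {"fraud": 0.9, "</s>": 0.1},
    "fraud": {"</s>": 1.0},
}

def beam_search(num_beams=2, max_length=5):
    """Keep the num_beams highest-scoring partial sequences at each step."""
    beams = [(0.0, ["<start>"])]          # (log-probability, tokens)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:num_beams]:
            # Sequences ending in </s> are complete; the rest keep growing.
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        if not beams:
            break
    best = max(finished, key=lambda c: c[0])
    return best[1][1:-1]                  # drop <start> and </s>

answer = beam_search(num_beams=2)
greedy = beam_search(num_beams=1)
```

With one beam the search commits to "credit" because it is the likeliest first token, whereas two beams keep "card" alive long enough to find a sequence with a higher overall probability.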
In this article, we embarked on a journey to fine-tune a natural language processing (NLP) model, specifically the T5 model, for a question-answering task. Throughout this process, we delved into various NLP model development and deployment aspects.
Key takeaways:
Answer: Fine-tuning in NLP involves continuing to train a pre-trained model on a specific task or dataset, usually while tuning its hyperparameters, to optimize its performance for that task.
Answer: The Transformer architecture is a neural network architecture. It excels at handling sequential data and is the foundation for models like T5. It uses self-attention mechanisms for context understanding.
Answer: In sequence-to-sequence tasks in NLP, we use the encoder-decoder structure. The encoder processes input data, and the decoder generates output data.
Answer: Yes, you can apply fine-tuned models to various real-world NLP tasks, including text generation, translation, and question-answering.
Answer: To begin, you can explore libraries such as Hugging Face. These libraries offer pre-trained models and tools for fine-tuning them on your datasets. Learning NLP fundamentals and deep learning concepts is also crucial.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.