The current trend in NLP is to download and fine-tune pre-trained models with millions or even billions of parameters. However, storing and sharing such large trained models is slow and expensive. These constraints hinder the development of more multi-purpose and adaptable NLP techniques that can learn from and for multiple tasks. To address this, adapters were proposed: small, lightweight, parameter-efficient alternatives to full fine-tuning. They are small bottleneck layers that can be dynamically added to a pre-trained model for different tasks and languages. In this article, we will focus on sequence classification with the RoBERTa model.
In this article, we will train an adapter for the RoBERTa model on the Amazon polarity dataset for sequence classification tasks with the help of adapter-transformers, the AdapterHub adaptation of Hugging Face’s transformers library. Additionally, we will compare the performance of the adapter module to a fully fine-tuned RoBERTa model trained on the same dataset.
By the end of this article, you will have learned the following:
This article was published as a part of the Data Science Blogathon.
This project includes training a task adapter for the RoBERTa model on the Amazon polarity dataset for sequence classification tasks, specifically sentiment analysis. To train, we will use the RoBERTa base model from the Hugging Face hub and the AdapterHub adaptation of Hugging Face’s transformers library. Additionally, we will compare the performance of the adapter module to a fully fine-tuned RoBERTa model trained on the same dataset.
Adapters are lightweight alternatives to fully fine-tuned pre-trained models. Currently, adapters are implemented as small feedforward neural networks that are inserted between the layers of a pre-trained model. They provide a parameter-efficient, computationally efficient, and modular approach to transfer learning. The following image shows an added adapter.
Source: Adapterhub
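To make the bottleneck idea concrete, here is a minimal pure-Python sketch of a single adapter layer: a down-projection, a non-linearity, an up-projection, and a residual connection. The function names, the ReLU activation, the zero-initialized up-projection, and the toy dimensions are all assumptions chosen for illustration; the actual AdapterHub implementation runs on PyTorch tensors and may differ in detail.

```python
import random

def linear(weights, bias, x):
    """Compute weights @ x + bias for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def adapter_forward(hidden, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: down-project, ReLU, up-project, add the residual."""
    bottleneck = [max(0.0, v) for v in linear(w_down, b_down, hidden)]
    update = linear(w_up, b_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]

rng = random.Random(0)
hidden_size, bottleneck_size = 8, 2  # toy sizes (assumption); real models are far larger

w_down = [[rng.gauss(0, 0.1) for _ in range(hidden_size)]
          for _ in range(bottleneck_size)]
b_down = [0.0] * bottleneck_size
# With a zero up-projection the adapter starts out as an identity function,
# so inserting it does not disturb the pre-trained model before training.
w_up = [[0.0] * bottleneck_size for _ in range(hidden_size)]
b_up = [0.0] * hidden_size

hidden = [rng.gauss(0, 1) for _ in range(hidden_size)]
output = adapter_forward(hidden, w_down, b_down, w_up, b_up)
print(output == hidden)
```

Only `w_down`, `b_down`, `w_up`, and `b_up` would be trained in this sketch; the hidden state comes from the frozen pre-trained layers.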
During training, all the weights of the pre-trained model are frozen so that only the adapter weights are updated, resulting in modular knowledge representations. Adapters can be easily extracted, interchanged, independently distributed, and dynamically plugged into a language model. These properties highlight the potential of adapters to substantially advance the NLP field.
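A rough back-of-the-envelope count shows why freezing the base model makes adapters so cheap to store and share. The sketch below assumes one bottleneck adapter per transformer layer and a reduction factor of 16; both are illustrative assumptions rather than settings confirmed by this article. The hidden size (768) and layer count (12) are those of RoBERTa base, and the base model size is rounded to roughly 125 million parameters.

```python
HIDDEN_SIZE = 768                 # RoBERTa base hidden dimension
NUM_LAYERS = 12                   # RoBERTa base transformer layers
REDUCTION_FACTOR = 16             # assumed bottleneck ratio, for illustration only
BASE_MODEL_PARAMS = 125_000_000   # approximate size of RoBERTa base

bottleneck = HIDDEN_SIZE // REDUCTION_FACTOR

def adapter_params(hidden, bottleneck):
    """Weights and biases of one down-projection plus one up-projection."""
    down = hidden * bottleneck + bottleneck
    up = bottleneck * hidden + hidden
    return down + up

per_layer = adapter_params(HIDDEN_SIZE, bottleneck)
total = per_layer * NUM_LAYERS
share = total / BASE_MODEL_PARAMS
print(f"{total:,} adapter parameters, about {share:.2%} of the base model")
```

Under these assumptions the adapter amounts to well under one percent of the base model, not counting the classification head that is trained alongside it.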
The following are some important points regarding the significance of adapters in NLP transfer learning:
RoBERTa is a large pre-trained language model developed by Facebook AI and released in 2019. It shares the same architecture as the BERT model. It is a revised version of BERT with minor adjustments to the key hyperparameters and embeddings.
Except for the output layers, BERT’s pre-training and fine-tuning procedures use the same architecture. The pre-trained model parameters are used to initialize models for various downstream tasks, and during fine-tuning, all parameters are adjusted. The following figure shows the BERT architecture along with its pre-training and fine-tuning procedures.
In contrast, RoBERTa does not employ the next-sentence pretraining objective but uses much larger mini-batches and learning rates during training. RoBERTa also replaces BERT’s character-level BPE vocabulary with a byte-level BPE tokenizer (similar to GPT-2). Moreover, RoBERTa uses “dynamic masking”: a new masking pattern is generated each time a sequence is fed to the model, which helps the model learn more robust representations of the input text by forcing it to predict a diverse set of tokens rather than the same fixed subset of tokens.
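The difference between static and dynamic masking can be sketched in a few lines. This is a simplified illustration, not RoBERTa's actual preprocessing: the helper `mask_tokens` is hypothetical, it works on whole words rather than subword tokens, and it always substitutes the mask token, whereas BERT-style masking sometimes keeps the original token or swaps in a random one.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(42)

# Static masking: the pattern is sampled once and reused in every epoch.
static_view = mask_tokens(tokens, 0.15, rng)
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is sampled each time the sequence is seen.
dynamic_epochs = [mask_tokens(tokens, 0.15, rng) for _ in range(3)]

for epoch, (static, dynamic) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(epoch, " ".join(static), "|", " ".join(dynamic))
```

With static masking the model sees the same prediction targets in every epoch; with dynamic masking the targets can change from one pass to the next.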
In this article, we will train an adapter for the RoBERTa base model on a sequence classification task (more precisely, sentiment analysis). Simply put, sequence classification involves assigning a label or category to a sequence of words or tokens, such as a sentence or document.
We will use the Amazon Reviews Polarity dataset constructed by Xiang Zhang. This dataset was created by classifying reviews with scores of 1 and 2 as negative and reviews with scores of 4 and 5 as positive. Moreover, the samples with a score of 3 were ignored. Each class has 1,800,000 training samples and 200,000 testing samples.
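The labeling rule described above can be written out as a small function. The name `rating_to_polarity` is hypothetical, and the integer encoding (0 for negative, 1 for positive) is assumed to match the `id2label` mapping used later in this article.

```python
def rating_to_polarity(stars):
    """Map a 1-5 star rating to a polarity label, or None if the review is dropped."""
    if stars in (1, 2):
        return 0  # negative
    if stars in (4, 5):
        return 1  # positive
    if stars == 3:
        return None  # neutral reviews are excluded from the dataset
    raise ValueError(f"unexpected rating: {stars}")

ratings = [5, 1, 3, 4, 2]
labels = [rating_to_polarity(r) for r in ratings]
kept = [label for label in labels if label is not None]
print(kept)  # [1, 0, 1, 0]
```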
To start, we will install the libraries:
!pip install -U adapter-transformers datasets
Now, we will load the Amazon Reviews Polarity dataset using the Hugging Face datasets library:
from datasets import load_dataset
#Loading the dataset
dataset = load_dataset("amazon_polarity")
Now let’s see what our dataset consists of:
dataset
Output: DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 3600000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 400000
    })
})
So from the above output, we can see that the Amazon Reviews Polarity dataset consists of 3,600,000 training samples and 400,000 testing samples. Now let’s take a look at what a sample from the train set and test set looks like.
dataset["train"][0]
Output: {‘label’: 1, ‘title’: ‘Stunning even for the non-gamer’, ‘content’: ‘This soundtrack was beautiful! It paints the scenery in your mind so good I would recommend it even to people who hate video game music! I have played the game Chrono Cross, but out of all of the games I have ever played, it has the best music! It backs away and takes a fresher step with great guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^’}
dataset["test"][0]
Output: {‘label’: 1, ‘title’: ‘Great CD’, ‘content’: ‘My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and still LOVE IT. When I\’m in a good mood, it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. The vocals are just STUNNING, and the lyrics just kill. One of life\’s hidden gems. This is a desert island CD in my book. Why she never made it big is just beyond me. Every time I play this, no matter male or female, EVERYBODY says one thing “Who was that singing ?”’}
From the output of dataset, dataset["train"][0], and dataset["test"][0], we can see that the dataset consists of three columns, i.e., “label”, “title”, and “content”. Since we won’t need the “title” column to train the adapter, we will drop it.
#Removing the column "title" from the dataset
dataset = dataset.remove_columns("title")
Let’s check whether the column “title” has been dropped!
dataset
Below is a Screenshot showing the composition of the dataset after dropping the column “title”.
Output:
So clearly, the column “title” has been successfully dropped and no longer exists.
Now we will encode all the dataset samples. For this, we will use RobertaTokenizer and the dataset.map() function to encode the input data. Moreover, we will rename the target column from “label” to “labels”, since that is the name a transformer model expects. Furthermore, we will use the set_format() function to make the dataset format compatible with PyTorch.
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
#Encoding a batch of input data with the help of tokenizer
def encode_batch(batch):
    #Every review is truncated or padded to exactly 100 tokens
    return tokenizer(batch["content"], max_length=100, truncation=True, padding="max_length")
dataset = dataset.map(encode_batch, batched=True)
#Renaming the column "label" to "labels"
dataset = dataset.rename_column("label", "labels")
#Setting the dataset format to torch and mentioning the columns we want to format
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
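To see what `truncation=True` and `padding="max_length"` do to each review, here is a toy sketch of the idea. The helper `pad_or_truncate` and the token ids are made up for illustration. The pad id of 1 matches the `roberta-base` tokenizer to the best of my knowledge, and the real tokenizer additionally keeps the closing special token when it truncates, which this sketch does not attempt.

```python
PAD_ID = 1  # assumed pad token id of the roberta-base tokenizer

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return fixed-length input_ids and the matching attention_mask."""
    ids = token_ids[:max_length]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,
    }

# Made-up token ids: one sequence shorter and one longer than max_length.
short = pad_or_truncate([0, 100, 200, 2], max_length=6)
long = pad_or_truncate(list(range(10, 20)), max_length=6)
print(short)
print(long)
```

The attention mask tells the model which positions hold real tokens (1) and which are padding (0), so every example in a batch can share the same length.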
Now, we will use the RobertaModelWithHeads class, which is unique to adapter-transformers and allows us to easily add and configure prediction heads.
from transformers import RobertaConfig, RobertaModelWithHeads
#Defining the configuration for the model
config = RobertaConfig.from_pretrained("roberta-base", num_labels=2)
#Setting up the model
model = RobertaModelWithHeads.from_pretrained("roberta-base", config=config)
We will now add an adapter with the help of the add_adapter() method. For this, we will pass an adapter name; we passed “amazon_polarity”. Following this, we will also add a matching classification head. Lastly, we will activate the adapter and prediction head using train_adapter().
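Basically, the train_adapter() method does two main things: it freezes all the weights of the pre-trained model so that only the adapter weights are updated during training, and it activates the adapter so that it is used in every forward pass.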
#Adding an adapter to the RoBERTa model
model.add_adapter("amazon_polarity")
#Adding a matching classification head
model.add_classification_head(
    "amazon_polarity",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
)
#Activating the adapter
model.train_adapter("amazon_polarity")
We will configure the training process with the help of the TrainingArguments class. Following this, we will also write a function to calculate evaluation accuracy. Lastly, we will pass the arguments to the AdapterTrainer, a class optimized for only training adapters.
import numpy as np
from transformers import TrainingArguments, AdapterTrainer, EvalPrediction
training_args = TrainingArguments(
    learning_rate=3e-4,
    max_steps=80000,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=1000,
    output_dir="adapter-roberta-base-amazon-polarity",
    overwrite_output_dir=True,
    remove_unused_columns=False,
)
#Computing accuracy from the predicted logits and the true labels
def compute_accuracy(eval_pred: EvalPrediction):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {"acc": (preds == eval_pred.label_ids).mean()}
trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_accuracy,
)
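The accuracy function can be checked on its own before launching a long training run. The sketch below feeds it a handful of hypothetical logits through a `SimpleNamespace` stand-in that only mimics the two attributes the function reads (`predictions` and `label_ids`); the numbers are invented for illustration and are not model outputs.

```python
from types import SimpleNamespace

import numpy as np

def compute_accuracy(eval_pred):
    # The predicted class is the index of the largest logit in each row.
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {"acc": (preds == eval_pred.label_ids).mean()}

# Hypothetical logits for four reviews: column 0 = negative, column 1 = positive.
fake_eval = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]),
    label_ids=np.array([0, 1, 1, 1]),
)
result = compute_accuracy(fake_eval)
print(result["acc"])  # 0.75
```

Three of the four predictions match their labels, so the function reports an accuracy of 0.75.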
Let’s start training now!
trainer.train()
Output: TrainOutput(global_step=80000, training_loss=0.13133217878341674, metrics={‘train_runtime’: 7884.1676, ‘train_samples_per_second’: 324.701, ‘train_steps_per_second’: 10.147, ‘total_flos’: 1.33836672e+17, ‘train_loss’: 0.13133217878341674, ‘epoch’: 0.71})
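The reported metrics can be sanity-checked with a little arithmetic. Assuming training ran on a single device (so the effective batch size equals the per-device batch size of 32), 80,000 steps cover about 71% of the 3,600,000 training samples, which agrees with the reported epoch value.

```python
max_steps = 80_000
batch_size = 32            # per-device batch size; a single device is assumed
train_samples = 3_600_000
train_runtime = 7884.1676  # seconds, taken from the training output

samples_seen = max_steps * batch_size
epochs = samples_seen / train_samples
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(f"samples seen: {samples_seen:,}")
print(f"epochs: {epochs:.2f}")
print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s: {steps_per_second:.3f}")
```

In other words, the adapter was trained for a little over two hours and never saw the full training set once.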
Now let’s evaluate the adapter’s performance on the dataset’s test split.
trainer.evaluate()
We can use the trained model with the help of the Hugging Face pipeline to make quick predictions.
from transformers import TextClassificationPipeline
classifier = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=training_args.device.index,
)
classifier("I came across a lot of reviews stating that it is the best book out there.")
Output: [{‘label’: ‘positive’, ‘score’: 0.5589291453361511}]
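The score in the pipeline output is a probability, not a raw model output. Assuming the pipeline applies a softmax over the two class logits (its usual behavior for single-label classification, as far as I know), a score close to 0.5 means the model is only weakly leaning toward the predicted label. The logits below are hypothetical and not taken from the trained model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (negative, positive); a small gap gives a score near 0.5.
probs = softmax([-0.1, 0.1])
labels = ["negative", "positive"]
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": labels[best], "score": probs[best]})
```

A confident prediction would come from a much larger gap between the two logits, which pushes the winning probability toward 1.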
Finally, we can also extract the adapter from the trained model and save it for later use. save_adapter() writes the adapter weights and the adapter configuration to the given directory.
model.save_adapter("./final_adapter", "amazon_polarity")
!ls -lh final_adapter
Once the adapter is no longer needed, we can return the base model to its original form by deactivating and deleting the adapter.
#Deactivating the adapter
model.set_active_adapters(None)
#Deleting the added adapter
model.delete_adapter("amazon_polarity")
We can also push the trained model to the Hugging Face hub for later use. For this, we will log in to the hub, install git-lfs, and then push the model.
from huggingface_hub import notebook_login
notebook_login()
!apt install git-lfs
!git config --global credential.helper store
trainer.push_to_hub()
Link to the Model Card: https://huggingface.co/DrishtiSharma/adapter-roberta-base-amazon-polarity
Following are some of the potential applications of an Adapter trained on the Amazon Polarity dataset for sequence classification tasks:
Adapters have several advantages over traditional methods. Here are some of the advantages of adapters in NLP:
While adapters have several advantages, they have some disadvantages too. Here are some of the disadvantages of adapters:
Following are some of the potential research directions which can help in furthering the advanced development and usage of Adapters:
This article presents how we can train an adapter to adapt a given pre-trained model to the task at hand while the weights of the pre-trained model stay frozen. We also saw that once the task is complete, we can easily return the base model to its original form by deactivating and deleting the adapter.
To summarize, the key takeaways from this article are:
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.