Since its introduction in 2018, BERT has transformed Natural Language Processing. It performs well in tasks like sentiment analysis, question answering, and language inference. Using bidirectional training and transformer-based self-attention, BERT introduced a new way to understand relationships between words in text. However, despite its success, BERT has limitations. It struggles with computational efficiency, handling longer texts, and providing interpretability. This led to the development of ModernBERT, a model designed to address these challenges. ModernBERT improves processing speed, handles longer texts better, and offers more transparency for developers. In this article, we’ll explore how to use ModernBERT for sentiment analysis, highlighting its features and improvements over BERT.
BERT, which stands for Bidirectional Encoder Representations from Transformers, has been a game-changer since its introduction by Google in 2018. BERT introduced bidirectional training, allowing the model to understand context by looking at surrounding words in both directions. This led to significantly better performance on a number of NLP tasks, including question answering, sentiment analysis, and language inference. BERT’s architecture is based on encoder-only transformers, which use self-attention mechanisms to weigh the influence of different words in a sentence. Being encoder-only, the model understands and encodes input but does not generate output. This makes BERT excellent at capturing contextual relationships in text and one of the most powerful and widely adopted NLP models of recent years.
Despite its groundbreaking success, BERT has certain limitations. The table below summarizes them and shows how ModernBERT addresses each one:
| BERT | ModernBERT |
| --- | --- |
| Fixed positional embeddings | Rotary Positional Embeddings (RoPE) |
| Standard self-attention | Flash Attention for improved efficiency |
| Fixed-length context windows | Longer contexts via Local-Global Alternating Attention |
| Complex and less interpretable | Improved interpretability |
| Primarily trained on English text | Trained on English text and code data |
ModernBERT addresses these limitations by incorporating more efficient algorithms such as Flash Attention and Local-Global Alternating Attention, which optimize memory usage and improve processing speed. It also handles longer context lengths more effectively by integrating Rotary Positional Embeddings (RoPE).
It also aims to be more transparent and user-friendly, making it easier for developers to debug and adapt the model to specific tasks. Furthermore, ModernBERT incorporates advancements in common-sense reasoning, allowing it to better understand context, nuance, and logical relationships beyond the explicit information provided. It runs well on common GPUs such as the NVIDIA T4, A100, and RTX 4090.
ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It was trained on 2 trillion largely unique tokens, unlike the 20-40 repetitions of the same data common in previous encoders.
It is released in two sizes: ModernBERT-base (149 million parameters) and ModernBERT-large (395 million parameters).
Some of the unique features of ModernBERT are:
Flash Attention is a new algorithm developed to speed up the attention mechanism of transformer models in both time and memory usage. It accelerates the attention computation by rearranging the operations and using tiling and recomputation: tiling breaks large data into manageable chunks, and recomputation reduces memory usage by recalculating intermediate results as needed. This cuts the attention mechanism's quadratic memory usage down to linear, making it much more efficient for long sequences, and reduces computational overhead, running 2-4x faster than traditional attention mechanisms. Flash Attention is used to speed up both training and inference of transformer models.
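In the Hugging Face transformers library, Flash Attention can be requested when loading a model through the attn_implementation argument. Below is a minimal sketch; it assumes the flash-attn package is installed and that the GPU supports it, otherwise the argument can simply be omitted to fall back to the default attention.
import torch
from transformers import AutoModel
# Request Flash Attention 2 at load time (requires the flash-attn package and a supported GPU)
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)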
One of the most novel features of ModernBERT is alternating attention, used instead of full global attention in every layer. Layers alternate between global attention, where every token can attend to every other token in the sequence, and local attention, where each token attends only to a small sliding window of nearby tokens. This preserves fine-grained local context while still capturing long-range relationships, at a much lower computational cost.
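As a purely illustrative sketch (not ModernBERT's internal code), the two attention patterns can be pictured as boolean masks over token pairs; layers alternate between masks like these. The sequence length and window size below are arbitrary toy values.
import torch
seq_len, window = 8, 2  # toy values for illustration
# Global attention: every token may attend to every other token
global_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
# Local attention: each token attends only to tokens at most `window` positions away
idx = torch.arange(seq_len)
local_mask = (idx[:, None] - idx[None, :]).abs() <= window
print(global_mask.int())
print(local_mask.int())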
Rotary Positional Embeddings (RoPE) is a transformer technique that encodes the position of tokens in a sequence using rotation matrices. It captures both the absolute position of each token (via the rotation applied to it) and the relative positional information between tokens (their order and distance), adjusting the attention mechanism accordingly.
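A minimal NumPy sketch of the core idea (an illustration, not ModernBERT's implementation): one (even, odd) feature pair of a query vector is rotated by an angle that depends on the token's position.
import numpy as np
def rope_rotate_pair(x, position, pair_index, d_model, base=10000.0):
    # Rotate one (even, odd) feature pair of x by a position-dependent angle
    theta = position * base ** (-2.0 * pair_index / d_model)
    cos, sin = np.cos(theta), np.sin(theta)
    x0, x1 = x[2 * pair_index], x[2 * pair_index + 1]
    return np.array([x0 * cos - x1 * sin, x0 * sin + x1 * cos])
q = np.random.randn(8)  # a toy 8-dimensional query vector
print(rope_rotate_pair(q, position=5, pair_index=0, d_model=8))
Because the rotation angle grows with position, the dot product between a rotated query and a rotated key depends on their relative distance, which is what helps RoPE generalize to longer sequences.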
Unpadding and sequence packing are techniques designed to optimize memory and computational efficiency: unpadding removes the padding tokens added to equalize sequence lengths within a batch, and sequence packing concatenates the remaining real tokens from multiple sequences into densely packed batches so no compute is wasted on padding.
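A toy PyTorch sketch of the unpadding idea (illustrative only, not ModernBERT's internal routine): tokens flagged as padding by the attention mask are dropped, the remaining real tokens are packed into one flat tensor, and cumulative sequence lengths are recorded so attention stays within sequence boundaries.
import torch
# Two padded sequences (toy token ids; 0 stands in for padding) and their attention masks
input_ids = torch.tensor([[101, 2023, 3185, 0, 0],
                          [101, 2307, 2143, 2003, 102]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])
# Unpadding: keep only the real tokens and pack them into a single 1-D tensor
packed = input_ids[attention_mask.bool()]
# Cumulative sequence lengths mark where each packed sequence starts and ends
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long),
                        attention_mask.sum(dim=1).cumsum(dim=0)])
print(packed)      # tensor([ 101, 2023, 3185,  101, 2307, 2143, 2003,  102])
print(cu_seqlens)  # tensor([0, 3, 8])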
Let’s implement sentiment analysis using ModernBERT in practice. Sentiment analysis is a specific type of text classification task that aims to classify text (e.g., reviews) as positive or negative.
We will use the IMDb movie reviews dataset and classify each review as expressing a positive or negative sentiment.
Note: ModernBERT support requires transformers version 4.48.0 or later, which the installation step below takes care of.
Install the libraries needed to work with Hugging Face Transformers.
#install libraries
!pip install git+https://github.com/huggingface/transformers.git datasets accelerate scikit-learn -Uqq
!pip install -U "transformers>=4.48.0"
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
The command imdb["test"][0] prints the first sample in the test split of the IMDb movie reviews dataset, i.e., the first test review along with its associated label.
#Load the dataset
from datasets import load_dataset
imdb = load_dataset("imdb")
#print the first test sample
imdb["test"][0]
Tokenize the dataset using the pre-trained ModernBERT-base tokenizer. This converts the text into numerical inputs suitable for the model. The command tokenized_test_dataset[0] will print the first sample of the tokenized test dataset, including tokenized inputs such as input IDs and labels.
#initialize the tokenizer (the classification model is initialized after tokenization)
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
#define the tokenizer function
def tokenizer_function(example):
    return tokenizer(
        example["text"],
        padding="max_length",
        truncation=True,
        max_length=512,  ## max length can be modified
        return_tensors="pt"
    )
#tokenize the training and test datasets using the tokenizer function defined above
tokenized_train_dataset = imdb["train"].map(tokenizer_function, batched=True)
tokenized_test_dataset = imdb["test"].map(tokenizer_function, batched=True)
#print the tokenized output of first test sample
print(tokenized_test_dataset[0])
#initialize the model with a classification head (num_labels=2 for negative/positive)
#from_pretrained loads the pretrained ModernBERT weights; building the model from the config alone would start from random weights
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=2
)
Prepare the datasets by renaming the sentiment column from label to labels (the name the Trainer expects) and removing columns the model does not need, such as the raw text.
#data preparation step -
train_dataset = tokenized_train_dataset.remove_columns(['text']).rename_column('label', 'labels')
test_dataset = tokenized_test_dataset.remove_columns(['text']).rename_column('label', 'labels')
Let’s use the F1 score as the metric to evaluate our model. We will define a function that processes the evaluation predictions and calculates their F1 score, which lets us compare the model’s predictions against the true labels.
import numpy as np
from sklearn.metrics import f1_score
# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class ids
    predictions = np.argmax(predictions, axis=1)
    # Weighted F1 across both classes
    score = f1_score(labels, predictions, average="weighted")
    return {"f1": float(score)}
Define the hyperparameters and other configurations for fine-tuning the model using Hugging Face’s TrainingArguments. The key arguments here are the learning rate, the per-device training and evaluation batch sizes, the number of training epochs, the AdamW optimizer settings (betas, epsilon, and weight decay defined below), the logging, evaluation, and checkpoint strategies (all once per epoch), and bf16 mixed precision for faster training on supported GPUs.
#define training arguments
train_bsz, val_bsz = 32, 32
lr = 8e-5
betas = (0.9, 0.98)
n_epochs = 2
eps = 1e-6
wd = 8e-6
training_args = TrainingArguments(
    output_dir="fine_tuned_modern_bert",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    weight_decay=wd,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,               # requires a GPU with bfloat16 support
    bf16_full_eval=True,
    push_to_hub=False,
)
Use the Trainer class to perform the model training and evaluation process.
#Create a Trainer instance
trainer = Trainer(
    model=model,                      # The pre-trained model with a classification head
    args=training_args,               # Training arguments
    train_dataset=train_dataset,      # Tokenized training dataset
    eval_dataset=test_dataset,        # Tokenized test dataset
    compute_metrics=compute_metrics,  # Reports the F1 score during evaluation
)
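With the Trainer configured, start fine-tuning by calling trainer.train(); it runs the training loop for the number of epochs set above and evaluates at the end of each epoch.
#Train the model
trainer.train()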
Evaluate the trained model on the test dataset.
# Evaluate the model
evaluation_results = trainer.evaluate()
print("Evaluation Results:", evaluation_results)
Save the fine-tuned model and tokenizer for later reuse.
# Save the trained model
model.save_pretrained("./saved_model")
# Save the tokenizer
tokenizer.save_pretrained("./saved_model")
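To reuse the fine-tuned model later, both the model and the tokenizer can be reloaded from the same directory; a short sketch using the ./saved_model path from above:
# Reload the fine-tuned model and tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
loaded_tokenizer = AutoTokenizer.from_pretrained("./saved_model")
loaded_model = AutoModelForSequenceClassification.from_pretrained("./saved_model")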
Here, 0 indicates a negative review and 1 a positive review. For the new example below, the expected output is [0, 1]: “This movie is boring” expresses a negative opinion (0), while “Spectacular” expresses a positive one (1).
# Example input text
new_texts = ["This movie is boring", "Spectacular"]
# Tokenize the input
inputs = tokenizer(new_texts, padding=True, truncation=True, return_tensors="pt")
# Move inputs to the same device as the model
inputs = inputs.to(model.device)
# Put the model in evaluation mode
model.eval()
# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=1)
print("Predictions:", predictions.tolist())
While ModernBERT brings several improvements over traditional BERT, it still has some limitations:
ModernBERT takes BERT’s foundation and improves it with faster processing, better handling of long texts, and enhanced interpretability. While it still faces challenges like training data bias and complexity, it represents a significant leap in NLP. ModernBERT opens new possibilities for tasks like sentiment analysis and text classification, making advanced language understanding more efficient and accessible.
Q. What is an encoder-only architecture?
Ans. Encoder-only architectures process input sequences without generating output sequences, focusing on understanding and encoding the input.
Q. What are some limitations of BERT?
Ans. Some limitations of BERT include high computational resource requirements, a fixed context length, inefficiency, complexity, and a lack of common-sense reasoning.
Q. What is an attention mechanism?
Ans. An attention mechanism is a technique that allows the model to focus on specific parts of the input and determine which parts are more or less important.
Q. What is Local-Global Alternating Attention?
Ans. This mechanism alternates between focusing on local and global contexts within text sequences. Local attention highlights adjacent words or phrases, collecting fine-grained information, whereas global attention recognizes overall patterns and relationships across the text.
Q. What are Rotary Positional Embeddings (RoPE)?
Ans. In contrast to fixed positional embeddings, which only capture absolute positions, Rotary Positional Embeddings (RoPE) use rotation matrices to encode both absolute and relative positions. RoPE also performs better on extended sequences.
Q. What are some applications of ModernBERT?
Ans. ModernBERT can be applied to text classification, sentiment analysis, question answering, named-entity recognition, legal text analysis, code understanding, and more.
Q. What is Weights & Biases (W&B)?
Ans. Weights & Biases (W&B) is a platform for tracking, visualizing, and sharing ML experiments. It helps monitor metrics such as accuracy, visualize experiment data and training progress, tune hyperparameters, share results, and keep track of model versions.