Enhancing Sentiment Analysis with ModernBERT

Aditi V | Last Updated: 20 Jan, 2025
9 min read

Since its introduction in 2018, BERT has transformed Natural Language Processing. It performs well in tasks like sentiment analysis, question answering, and language inference. Using bidirectional training and transformer-based self-attention, BERT introduced a new way to understand relationships between words in text. However, despite its success, BERT has limitations. It struggles with computational efficiency, handling longer texts, and providing interpretability. This led to the development of ModernBERT, a model designed to address these challenges. ModernBERT improves processing speed, handles longer texts better, and offers more transparency for developers. In this article, we’ll explore how to use ModernBERT for sentiment analysis, highlighting its features and improvements over BERT.

Learning Objectives

  • Brief introduction to BERT and why ModernBERT came into existence
  • Understand the features of ModernBERT
  • Learn how to implement ModernBERT practically through a sentiment analysis example
  • Limitations of ModernBERT

This article was published as a part of the Data Science Blogathon.

What is BERT?

BERT, which stands for Bidirectional Encoder Representations from Transformers, has been a game-changer since its introduction by Google in 2018. BERT introduced the concept of bidirectional training, which allows the model to understand context by looking at surrounding words in all directions. This led to significantly better performance on a number of NLP tasks, including question answering, sentiment analysis, and language inference. BERT’s architecture is based on encoder-only transformers, which use self-attention mechanisms to weigh the influence of different words in a sentence. Being encoder-only, the model understands and encodes input but does not reconstruct or generate output. This makes BERT excellent at capturing contextual relationships in text and one of the most powerful and widely adopted NLP models in recent years.

What is ModernBERT?

Despite the groundbreaking success of BERT, it has certain limitations. Some of them are: 

  • Computational Resources: BERT is a computationally expensive, memory-intensive model, which is constraining for real-time applications or for setups without access to powerful computing infrastructure.
  • Context Length: BERT has a fixed-length context window, which becomes a limitation when handling long-range inputs like lengthy documents.
  • Interpretability: The model’s complexity makes it less interpretable than simpler models, leading to challenges in debugging and modifying the model.
  • Common Sense Reasoning: BERT lacks common sense reasoning and struggles to understand context, nuance, and logical reasoning beyond the given information.

BERT vs ModernBERT

| BERT | ModernBERT |
|---|---|
| Fixed positional embeddings | Rotary Positional Embeddings (RoPE) |
| Standard self-attention | Flash Attention for improved efficiency |
| Fixed-length context windows | Longer contexts supported with Local-Global Alternating Attention |
| Complex and less interpretable | Improved interpretability |
| Primarily trained on English text | Trained on English text and code data |

ModernBERT addresses these limitations by incorporating more efficient algorithms such as Flash Attention and Local-Global Alternating Attention, which optimize memory usage and improve processing speed. It also handles longer context lengths more effectively by integrating Rotary Positional Embeddings (RoPE).

ModernBERT also aims to be more transparent and user-friendly, making it easier for developers to debug and adapt the model to specific tasks. Furthermore, it incorporates advancements in common sense reasoning, allowing it to better understand context, nuance, and logical relationships beyond the explicit information provided. It runs on common GPUs such as the NVIDIA T4, A100, and RTX 4090.

ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion unique tokens, unlike previous encoders, which commonly repeated the same data 20-40 times.

It is released in the following sizes:

  • ModernBERT-base which has 22 layers and 149 million parameters
  • ModernBERT-large which has 28 layers and 395 million parameters
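As a quick sanity check of these sizes, the sketch below loads the base checkpoint (assumed to be answerdotai/ModernBERT-base on the Hugging Face Hub, the same one used later in this article) and counts its parameters. It assumes a recent version of the transformers library, which the install step later in this article covers.

from transformers import AutoModel

# Load the base checkpoint (assumption: "answerdotai/ModernBERT-base")
model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

# Count parameters; this should come out at roughly 149M for ModernBERT-base
n_params = sum(p.numel() for p in model.parameters())
print(f"ModernBERT-base parameters: {n_params / 1e6:.0f}M")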

Understanding the Features of ModernBERT

Some of the unique features of ModernBERT are:

Flash Attention

This is a new algorithm developed to speed up the attention mechanism of transformer models in terms of both time and memory usage. Attention is sped up by rearranging the operations and using tiling and recomputation. Tiling breaks large data down into manageable chunks, and recomputation reduces memory usage by recalculating intermediate results as needed. This cuts the quadratic memory usage down to linear, making it much more efficient for long sequences and reducing computational overhead: it is 2-4x faster than traditional attention mechanisms. Flash Attention is used to speed up both training and inference of transformer models.
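If you want to request this backend explicitly, recent versions of the transformers library accept an attn_implementation argument in from_pretrained. A minimal sketch, assuming the flash-attn package is installed and the GPU supports it (otherwise drop the argument and let the library pick a default backend):

import torch
from transformers import AutoModelForSequenceClassification

# Sketch: explicitly request the FlashAttention-2 kernel.
# Assumes flash-attn is installed and the GPU supports it;
# if not, remove attn_implementation to fall back to the default backend.
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)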

Local-Global Alternating Attention

One of the most novel features of ModernBERT is Local-Global Alternating Attention, which replaces full global attention in most layers.

  • Every third layer attends to the full input. This is global attention.
  • All other layers use a sliding window in which every token attends only to its nearest 128 tokens. This is local attention (a toy sketch follows below).
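The toy sketch below illustrates the idea by building boolean attention masks: every third layer allows full attention, while the remaining layers restrict each token to a small neighbourhood. This is a conceptual illustration only, not ModernBERT's internal implementation; the window size and layer pattern simply follow the description above.

import torch

def alternating_attention_mask(seq_len: int, layer_idx: int,
                               window: int = 128, global_every: int = 3) -> torch.Tensor:
    """Return a boolean mask where entry (i, j) is True if token i may attend to token j."""
    if layer_idx % global_every == 0:
        # Global layer: every token attends to every other token.
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Local layer: each token attends only to tokens within +/- window/2 positions,
    # i.e. roughly its nearest `window` tokens.
    positions = torch.arange(seq_len)
    distance = (positions[:, None] - positions[None, :]).abs()
    return distance <= window // 2

print(alternating_attention_mask(8, layer_idx=0, window=4).int())  # global layer
print(alternating_attention_mask(8, layer_idx=1, window=4).int())  # local layer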

Rotary Positional Embeddings (RoPE)

Rotary Positional Embeddings (RoPE) is a transformer technique that encodes the position of tokens in a sequence using rotation matrices. It captures both the absolute position of each token and the relative order and distance between tokens, adjusting the attention mechanism so the model understands how far apart tokens are.
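The sketch below shows the core rotation trick on a toy tensor. It is a simplified illustration of RoPE (the rotate-half formulation), not ModernBERT's exact implementation, and the tensor sizes are made up.

import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x has shape (seq_len, dim) with an even dim; each pair of dimensions
    # is rotated by an angle that depends on the token's position.
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per dimension pair, as in the RoPE paper.
    freqs = 1.0 / (base ** (torch.arange(half) / half))
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

queries = torch.randn(16, 64)   # 16 tokens, 64-dimensional vectors (toy sizes)
rotated = apply_rope(queries)
print(rotated.shape)            # torch.Size([16, 64])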

Unpadding and Sequence Packing

Unpadding and sequence packing are techniques designed to optimize memory and computational efficiency.

  • Padding normally extends every shorter sequence in a batch with meaningless padding tokens until it matches the longest sequence, which wastes computation on those tokens. Unpadding removes the unnecessary padding tokens, reducing wasted computation (see the sketch after this list).
  • Sequence Packing reorganizes batches of text into compact forms, grouping shorter sequences together to maximize hardware utilization.
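Here is a toy sketch of the unpadding idea: drop the padding tokens and keep track of where each sequence starts and ends so attention can still be restricted per sequence. The token IDs are made up, and real unpadding happens inside the model's optimized kernels rather than in user code.

import torch

# Toy batch: three sequences padded to the longest length (pad id = 0, made-up token ids)
input_ids = torch.tensor([
    [101, 2023, 3185,  102,    0,    0],
    [101, 2307,  102,    0,    0,    0],
    [101, 2019, 6581, 3185, 2001,  102],
])
attention_mask = (input_ids != 0).long()

# Unpadding: keep only the real tokens, flattened into one packed sequence,
# and record cumulative sequence boundaries so each sequence stays separate.
flat_tokens = input_ids[attention_mask.bool()]
seq_lens = attention_mask.sum(dim=1)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seq_lens.cumsum(0)])

print(flat_tokens)   # packed tokens with no padding
print(cu_seqlens)    # tensor([ 0,  4,  7, 13])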

Sentiment Analysis Using ModernBERT

Let’s implement sentiment analysis with ModernBERT. Sentiment analysis is a specific type of text classification task that aims to classify text (for example, reviews) as positive or negative.

We will use the IMDb movie reviews dataset and classify each review as expressing either a positive or a negative sentiment.


Step 1: Install Necessary Libraries

Install the libraries needed to work with Hugging Face Transformers.

#install libraries
!pip install git+https://github.com/huggingface/transformers.git datasets accelerate scikit-learn -Uqq
!pip install -U "transformers>=4.48.0"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoConfig
from datasets import load_dataset

Step 2: Load the IMDb Dataset Using load_dataset Function

The command imdb["test"][0] prints the first sample in the test split of the IMDb movie review dataset, i.e., the first test review along with its associated label.

#Load the dataset
from datasets import load_dataset
imdb = load_dataset("imdb")
#print the first test sample
imdb["test"][0]
Output: the first test review with its label.

Step 3: Tokenization

Tokenize the dataset using the pre-trained ModernBERT-base tokenizer. This process converts text into numerical inputs suitable for the model. The command tokenized_test_dataset[0] will print the first sample of the tokenized test dataset, including tokenized inputs such as input IDs and labels.

#initialize the tokenizer (the classification model itself is initialized in Step 4)
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

#define the tokenizer function
def tokenizer_function(example):
    return tokenizer(
        example["text"],
        padding="max_length",  
        truncation=True,       
        max_length=512,      ## max length can be modified
        return_tensors="pt"
    )

#tokenize training and testing data set based on above defined tokenizer function
tokenized_train_dataset = imdb["train"].map(tokenizer_function, batched=True)
tokenized_test_dataset = imdb["test"].map(tokenizer_function, batched=True)

#print the tokenized output of first test sample
print(tokenized_test_dataset[0])
Output: the first tokenized test sample (input IDs, attention mask, and label).

Step 4: Initialize the ModernBERT-base Model for Sentiment Classification

#initialize the model with a sequence classification head (2 labels: negative and positive)
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base", num_labels=2)

#from_pretrained loads the pretrained backbone and adds a freshly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", config=config)

Step 5: Prepare the Datasets

Prepare the datasets by renaming the sentiment label column from 'label' to 'labels' (the name the Trainer expects) and removing columns that are no longer needed.

#data preparation step - 
train_dataset = tokenized_train_dataset.remove_columns(['text']).rename_column('label', 'labels')
test_dataset = tokenized_test_dataset.remove_columns(['text']).rename_column('label', 'labels')
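If you want to confirm what the Trainer will receive, a quick check of the remaining columns (the exact names depend on what the tokenizer returns):

# Inspect the columns the Trainer will receive
print(train_dataset.column_names)   # e.g. ['labels', 'input_ids', 'attention_mask']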

Step 6: Define Compute Metrics

Let’s use the F1 score as the metric to evaluate our model. We will define a function that processes the evaluation predictions and calculates their F1 score. This lets us compare the model’s predictions against the true labels.

import numpy as np
from sklearn.metrics import f1_score
 
# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class indices
    predictions = np.argmax(predictions, axis=1)
    # Weighted F1 across both classes
    score = f1_score(labels, predictions, average="weighted")
    return {"f1": float(score)}
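As a quick sanity check, you can call the helper directly with dummy logits and labels (the numbers below are made up for illustration):

# Dummy logits for 3 examples and their true labels
dummy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]])
dummy_labels = np.array([0, 1, 1])
print(compute_metrics((dummy_logits, dummy_labels)))   # {'f1': 1.0}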

Step 7: Set the Training Arguments

Define the hyperparameters and other configurations for fine-tuning the model using Hugging Face’s TrainingArguments. Let us understand some arguments: 

  • train_bsz, val_bsz: Indicates batch size for training and validation. Batch size determines the number of samples processed before the model’s internal parameters are updated.
  • lr: Learning rate controls the adjustment of the model’s weights with respect to the loss gradient.
  • betas: These are the beta parameters for the Adam optimizer.
  • n_epochs: Number of epochs, indicating a complete pass through the entire training dataset.
  • eps: A small constant added to the denominator to improve numerical stability in the Adam optimizer.
  • wd: Stands for weight decay, a regularization technique to prevent overfitting by penalizing large weights.

#define training arguments 
train_bsz, val_bsz = 32, 32 
lr = 8e-5
betas = (0.9, 0.98)
n_epochs = 2
eps = 1e-6
wd = 8e-6

training_args = TrainingArguments(
    output_dir="fine_tuned_modern_bert",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,
    bf16_full_eval=True,
    push_to_hub=False,
)

Step 8: Model Training

Use the Trainer class to perform the model training and evaluation process.

#Create a Trainer instance
trainer = Trainer(
    model=model,                         # The pre-trained model
    args=training_args,                  # Training arguments
    train_dataset=train_dataset,         # Tokenized training dataset
    eval_dataset=test_dataset,           # Tokenized test dataset
    compute_metrics=compute_metrics,     # Without this, the F1 score is not reported
)

#Start fine-tuning
trainer.train()
Output: training and validation loss for each epoch (with the F1 score when compute_metrics is passed).

Step 9: Evaluation

Evaluate the trained model on testing dataset.

# Evaluate the model

evaluation_results = trainer.evaluate()

print("Evaluation Results:", evaluation_results)
Output: evaluation loss and F1 score on the test set.

Step 10: Save the Fine-tuned Model

Save the fine-tuned model and tokenizer for further re-use.

# Save the trained model 
model.save_pretrained("./saved_model")
# Save the tokenizer
tokenizer.save_pretrained("./saved_model")
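To reuse the fine-tuned model later, load it back from the same directory. A short sketch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Reload the fine-tuned model and tokenizer from the saved directory
loaded_model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
loaded_tokenizer = AutoTokenizer.from_pretrained("./saved_model")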

Step 11: Predict the Sentiment of the Review

Here, 0 indicates a negative review and 1 indicates a positive review. For the new examples below, the expected output is [0, 1]: "boring" signals a negative review (0), while "Spectacular" signals a positive one (1).

# Example input text
new_texts = ["This movie is boring", "Spectacular"] 

# Tokenize the input
inputs = tokenizer(new_texts, padding=True, truncation=True, return_tensors="pt")

# Move inputs to the same device as the model
inputs = inputs.to(model.device) 
# Put the model in evaluation mode
model.eval()

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=1)

print("Predictions:", predictions.tolist())
Output: the predicted labels for the new reviews.
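To make the output easier to read, you can map the numeric predictions back to label names. A small sketch (the label names are our own convention, not stored in the checkpoint):

# Map class indices to human-readable labels (0 = negative, 1 = positive)
label_names = {0: "negative", 1: "positive"}
for text, pred in zip(new_texts, predictions.tolist()):
    print(f"{text!r} -> {label_names[pred]}")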

Limitations of ModernBERT

While ModernBERT brings several improvements over traditional BERT, it still has some limitations:

  1. Training Data Bias: It is trained mainly on English and code data, so it may not perform as efficiently on other languages or non-code text.
  2. Complexity: The architectural enhancements and new techniques like Flash Attention and Rotary Positional Embeddings add complexity to the model, which can make it harder to implement and fine-tune for specific tasks.
  3. Inference Speed: While Flash Attention improves inference speed, using the full 8,192-token context window may still be slower than processing shorter sequences.

Conclusion

ModernBERT takes BERT’s foundation and improves it with faster processing, better handling of long texts, and enhanced interpretability. While it still faces challenges like training data bias and complexity, it represents a significant leap in NLP. ModernBERT opens new possibilities for tasks like sentiment analysis and text classification, making advanced language understanding more efficient and accessible.

Key Takeaways

  • ModernBERT improves on BERT by fixing issues like inefficiency and limited context handling.
  • It uses Flash Attention and Rotary Positional Embeddings for faster processing and longer text support.
  • ModernBERT is great for tasks like sentiment analysis and text classification.
  • It still has some limitations, like bias toward English and code data.
  • Tools like Hugging Face and wandb make it easy to implement and use.


The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What are encoder-only architectures?

Ans. Encoder-only architectures process input sequences without generating output sequences, focusing on understanding and encoding the input.

Q2. What are limitations of BERT?

Ans. Some limitations of BERT include high computational resources, fixed context length, inefficiency, complexity, and lack of common sense reasoning.

Q3. What is attention mechanism?

Ans. An attention mechanism is a technique that allows the model to focus on specific parts of the input and determine which parts are more or less important.

Q4. What is alternating attention?

Ans. This mechanism alternates between focusing on local and global contexts within text sequences. Local attention highlights adjacent words or phrases, collecting fine-grained information, whereas global attention recognises overall patterns and relationships across the text.

Q5. What are Rotary Positional Embeddings? How are they different from fixed positional embeddings?

Ans. In contrast to fixed positional embeddings, which only capture absolute positions, rotary positional embeddings (RoPE) use rotation matrices to encode both absolute and relative positions. RoPE performs better with extended sequences.

Q6. What are the potential applications of ModernBERT?

Ans. ModernBERT can be applied in areas such as text classification, sentiment analysis, question answering, named-entity recognition, legal text analysis, and code understanding.

Q7. What is the wandb API and why is it needed?

Ans. Weights & Biases (W&B) is a platform for tracking, visualizing, and sharing ML experiments. It helps track model metrics such as accuracy, visualize experiment data and training progress, tune hyperparameters, keep track of model versions, and share results.

Hello data enthusiasts! I am V Aditi, a rising and dedicated data science and artificial intelligence student embarking on a journey of exploration and learning in the world of data and machines. Join me as I navigate through the fascinating world of data science and artificial intelligence, unraveling mysteries and sharing insights along the way! 📊✨
