Enhancing Sentiment Analysis with ModernBERT

Aditi V | Last Updated: 20 Jan, 2025
9 min read

Since its introduction in 2018, BERT has transformed Natural Language Processing. It performs well in tasks like sentiment analysis, question answering, and language inference. Using bidirectional training and transformer-based self-attention, BERT introduced a new way to understand relationships between words in text. However, despite its success, BERT has limitations. It struggles with computational efficiency, handling longer texts, and providing interpretability. This led to the development of ModernBERT, a model designed to address these challenges. ModernBERT improves processing speed, handles longer texts better, and offers more transparency for developers. In this article, we’ll explore how to use ModernBERT for sentiment analysis, highlighting its features and improvements over BERT.

Learning Objectives

  • Brief introduction to BERT and why ModernBERT came into existence
  • Understand the features of ModernBERT
  • Learn how to implement ModernBERT practically through a sentiment analysis example
  • Limitations of ModernBERT

This article was published as a part of the Data Science Blogathon.

What is BERT?

BERT, which stands for Bidirectional Encoder Representations from Transformers, has been a game-changer since its introduction by Google in 2018. BERT introduced the concept of bidirectional training, which allows the model to understand context by looking at surrounding words in all directions. This led to significantly better performance on a number of NLP tasks, including question answering, sentiment analysis, and language inference. BERT’s architecture is based on encoder-only transformers, which use self-attention mechanisms to weigh the influence of different words in a sentence. Being encoder-only, the model understands and encodes input but does not reconstruct or generate output. This makes BERT excellent at capturing contextual relationships in text and one of the most powerful and widely adopted NLP models in recent years.

What is ModernBERT?

Despite the groundbreaking success of BERT, it has certain limitations. Some of them are: 

  • Computational Resources: BERT is a computationally expensive, memory-intensive model, which is constraining for real-time applications or for setups without access to powerful computing infrastructure.
  • Context Length: BERT has a fixed-length context window, which becomes a limitation when handling long-range inputs like lengthy documents.
  • Interpretability: The model’s complexity makes it less interpretable than simpler models, leading to challenges in debugging and modifying the model.
  • Common Sense Reasoning: BERT lacks common sense reasoning and struggles to understand context, nuance, and logical reasoning beyond the given information.

BERT vs ModernBERT

| BERT | ModernBERT |
|---|---|
| Fixed positional embeddings | Rotary Positional Embeddings (RoPE) |
| Standard self-attention | Flash Attention for improved efficiency |
| Fixed-length context windows | Longer contexts supported with Local-Global Alternating Attention |
| Complex and less interpretable | Improved interpretability |
| Primarily trained on English text | Trained on English text and code data |

ModernBERT addresses these limitations by incorporating more efficient algorithms such as Flash Attention and Local-Global Alternating Attention, which optimize memory usage and improve processing speed. It also handles longer context lengths more effectively by integrating Rotary Positional Embeddings (RoPE).

ModernBERT also aims to be more transparent and user-friendly, making it easier for developers to debug and adapt the model to specific tasks. Furthermore, it incorporates advancements in common sense reasoning, allowing it to better understand context, nuance, and logical relationships beyond the explicit information provided. It runs on common GPUs such as the NVIDIA T4, A100, and RTX 4090.

ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion unique tokens, unlike previous encoders, which commonly repeated the same data 20-40 times.

It is released in the following sizes:

  • ModernBERT-base which has 22 layers and 149 million parameters
  • ModernBERT-large which has 28 layers and 395 million parameters
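As a quick sanity check of these sizes, the sketch below loads the base checkpoint (assumed to be answerdotai/ModernBERT-base on the Hugging Face Hub, the same one used later in this article) and counts its parameters. It assumes a recent version of the transformers library, which the install step later in this article covers.

from transformers import AutoModel

# Load the base checkpoint (assumption: "answerdotai/ModernBERT-base")
model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

# Count parameters; this should come out at roughly 149M for ModernBERT-base
n_params = sum(p.numel() for p in model.parameters())
print(f"ModernBERT-base parameters: {n_params / 1e6:.0f}M")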

Understanding the Features of ModernBERT

Some of the unique features of ModernBERT are:

Flash Attention

This is a new algorithm developed to speed up the attention mechanism of transformer models in terms of both time and memory usage. Attention is sped up by rearranging the operations and using tiling and recomputation. Tiling breaks large data down into manageable chunks, and recomputation reduces memory usage by recalculating intermediate results as needed. This cuts the quadratic memory usage down to linear, making it much more efficient for long sequences and reducing computational overhead: it is 2-4x faster than traditional attention mechanisms. Flash Attention is used to speed up both training and inference of transformer models.
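If you want to request this backend explicitly, recent versions of the transformers library accept an attn_implementation argument in from_pretrained. A minimal sketch, assuming the flash-attn package is installed and the GPU supports it (otherwise drop the argument and let the library pick a default backend):

import torch
from transformers import AutoModelForSequenceClassification

# Sketch: explicitly request the FlashAttention-2 kernel.
# Assumes flash-attn is installed and the GPU supports it;
# if not, remove attn_implementation to fall back to the default backend.
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)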

Local-Global Alternating Attention

One of the most novel features of ModernBERT is Local-Global Alternating Attention, which replaces full global attention in most layers.

  • Every third layer attends to the full input. This is global attention.
  • All other layers use a sliding window in which every token attends only to its nearest 128 tokens. This is local attention (a toy sketch follows below).
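The toy sketch below illustrates the idea by building boolean attention masks: every third layer allows full attention, while the remaining layers restrict each token to a small neighbourhood. This is a conceptual illustration only, not ModernBERT's internal implementation; the window size and layer pattern simply follow the description above.

import torch

def alternating_attention_mask(seq_len: int, layer_idx: int,
                               window: int = 128, global_every: int = 3) -> torch.Tensor:
    """Return a boolean mask where entry (i, j) is True if token i may attend to token j."""
    if layer_idx % global_every == 0:
        # Global layer: every token attends to every other token.
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Local layer: each token attends only to tokens within +/- window/2 positions,
    # i.e. roughly its nearest `window` tokens.
    positions = torch.arange(seq_len)
    distance = (positions[:, None] - positions[None, :]).abs()
    return distance <= window // 2

print(alternating_attention_mask(8, layer_idx=0, window=4).int())  # global layer
print(alternating_attention_mask(8, layer_idx=1, window=4).int())  # local layer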

Rotary Positional Embeddings (RoPE)

Rotary Positional Embeddings (RoPE) is a transformer technique that encodes the position of tokens in a sequence using rotation matrices. It captures both the absolute position of each token and the relative order and distance between tokens, adjusting the attention mechanism so the model understands how far apart tokens are.
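The sketch below shows the core rotation trick on a toy tensor. It is a simplified illustration of RoPE (the rotate-half formulation), not ModernBERT's exact implementation, and the tensor sizes are made up.

import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x has shape (seq_len, dim) with an even dim; each pair of dimensions
    # is rotated by an angle that depends on the token's position.
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per dimension pair, as in the RoPE paper.
    freqs = 1.0 / (base ** (torch.arange(half) / half))
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

queries = torch.randn(16, 64)   # 16 tokens, 64-dimensional vectors (toy sizes)
rotated = apply_rope(queries)
print(rotated.shape)            # torch.Size([16, 64])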

Unpadding and Sequence Packing

Unpadding and sequence packing are techniques designed to optimize memory and computational efficiency.

  • Padding normally extends every shorter sequence in a batch with meaningless padding tokens until it matches the longest sequence, which wastes computation on those tokens. Unpadding removes the unnecessary padding tokens, reducing wasted computation (see the sketch after this list).
  • Sequence Packing reorganizes batches of text into compact forms, grouping shorter sequences together to maximize hardware utilization.
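Here is a toy sketch of the unpadding idea: drop the padding tokens and keep track of where each sequence starts and ends so attention can still be restricted per sequence. The token IDs are made up, and real unpadding happens inside the model's optimized kernels rather than in user code.

import torch

# Toy batch: three sequences padded to the longest length (pad id = 0, made-up token ids)
input_ids = torch.tensor([
    [101, 2023, 3185,  102,    0,    0],
    [101, 2307,  102,    0,    0,    0],
    [101, 2019, 6581, 3185, 2001,  102],
])
attention_mask = (input_ids != 0).long()

# Unpadding: keep only the real tokens, flattened into one packed sequence,
# and record cumulative sequence boundaries so each sequence stays separate.
flat_tokens = input_ids[attention_mask.bool()]
seq_lens = attention_mask.sum(dim=1)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seq_lens.cumsum(0)])

print(flat_tokens)   # packed tokens with no padding
print(cu_seqlens)    # tensor([ 0,  4,  7, 13])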

Sentiment Analysis Using ModernBERT

Let’s implement sentiment analysis with ModernBERT. Sentiment analysis is a specific type of text classification task that aims to classify text (for example, reviews) as positive or negative.

We will use the IMDb movie reviews dataset and classify each review as expressing either a positive or a negative sentiment.


Step 1: Install Necessary Libraries

Install the libraries needed to work with Hugging Face Transformers.

#install libraries
!pip install git+https://github.com/huggingface/transformers.git datasets accelerate scikit-learn -Uqq
!pip install -U "transformers>=4.48.0"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoConfig
from datasets import load_dataset

Step 2: Load the IMDb Dataset Using load_dataset Function

The command imdb["test"][0] prints the first sample in the test split of the IMDb movie review dataset, i.e., the first test review along with its associated label.

#Load the dataset
from datasets import load_dataset
imdb = load_dataset("imdb")
#print the first test sample
imdb["test"][0]
Output: the first test review with its label.

Step 3: Tokenization

Tokenize the dataset using the pre-trained ModernBERT-base tokenizer. This process converts text into numerical inputs suitable for the model. The command tokenized_test_dataset[0] will print the first sample of the tokenized test dataset, including tokenized inputs such as input IDs and labels.

#initialize the tokenizer (the classification model itself is initialized in Step 4)
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

#define the tokenizer function
def tokenizer_function(example):
    return tokenizer(
        example["text"],
        padding="max_length",  
        truncation=True,       
        max_length=512,      ## max length can be modified
        return_tensors="pt"
    )

#tokenize training and testing data set based on above defined tokenizer function
tokenized_train_dataset = imdb["train"].map(tokenizer_function, batched=True)
tokenized_test_dataset = imdb["test"].map(tokenizer_function, batched=True)

#print the tokenized output of first test sample
print(tokenized_test_dataset[0])
Output: the first tokenized test sample (input IDs, attention mask, and label).

Step 4: Initialize the ModernBERT-base Model for Sentiment Classification

#initialize the model with a sequence classification head (2 labels: negative and positive)
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base", num_labels=2)

#from_pretrained loads the pretrained backbone and adds a freshly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", config=config)

Step 5: Prepare the Datasets

Prepare the datasets by renaming the sentiment label column from 'label' to 'labels' (the name the Trainer expects) and removing columns that are no longer needed.

#data preparation step - 
train_dataset = tokenized_train_dataset.remove_columns(['text']).rename_column('label', 'labels')
test_dataset = tokenized_test_dataset.remove_columns(['text']).rename_column('label', 'labels')
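If you want to confirm what the Trainer will receive, a quick check of the remaining columns (the exact names depend on what the tokenizer returns):

# Inspect the columns the Trainer will receive
print(train_dataset.column_names)   # e.g. ['labels', 'input_ids', 'attention_mask']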

Step 6: Define Compute Metrics

Let’s use the F1 score as the metric to evaluate our model. We will define a function that processes the evaluation predictions and calculates their F1 score. This lets us compare the model’s predictions against the true labels.

import numpy as np
from sklearn.metrics import f1_score
 
# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class indices
    predictions = np.argmax(predictions, axis=1)
    # Weighted F1 across both classes
    score = f1_score(labels, predictions, average="weighted")
    return {"f1": float(score)}
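As a quick sanity check, you can call the helper directly with dummy logits and labels (the numbers below are made up for illustration):

# Dummy logits for 3 examples and their true labels
dummy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]])
dummy_labels = np.array([0, 1, 1])
print(compute_metrics((dummy_logits, dummy_labels)))   # {'f1': 1.0}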

Step 7: Set the Training Arguments

Define the hyperparameters and other configurations for fine-tuning the model using Hugging Face’s TrainingArguments. Let us understand some arguments: 

  • train_bsz, val_bsz: Indicates batch size for training and validation. Batch size determines the number of samples processed before the model’s internal parameters are updated.
  • lr: Learning rate controls the adjustment of the model’s weights with respect to the loss gradient.
  • betas: These are the beta parameters for the Adam optimizer.
  • n_epochs: Number of epochs, indicating a complete pass through the entire training dataset.
  • eps: A small constant added to the denominator to improve numerical stability in the Adam optimizer.
  • wd: Stands for weight decay, a regularization technique to prevent overfitting by penalizing large weights.

#define training arguments 
train_bsz, val_bsz = 32, 32 
lr = 8e-5
betas = (0.9, 0.98)
n_epochs = 2
eps = 1e-6
wd = 8e-6

training_args = TrainingArguments(
    output_dir="fine_tuned_modern_bert",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,
    bf16_full_eval=True,
    push_to_hub=False,
)

Step 8: Model Training

Use the Trainer class to perform the model training and evaluation process.

#Create a Trainer instance
trainer = Trainer(
    model=model,                         # The pre-trained model
    args=training_args,                  # Training arguments
    train_dataset=train_dataset,         # Tokenized training dataset
    eval_dataset=test_dataset,           # Tokenized test dataset
    compute_metrics=compute_metrics,     # Without this, the F1 score is not reported
)

#Start fine-tuning
trainer.train()
Output: training and validation loss for each epoch (with the F1 score when compute_metrics is passed).

Step 9: Evaluation

Evaluate the trained model on testing dataset.

# Evaluate the model

evaluation_results = trainer.evaluate()

print("Evaluation Results:", evaluation_results)
Output: evaluation loss and F1 score on the test set.

Step 10: Save the Fine-tuned Model

Save the fine-tuned model and tokenizer for further re-use.

# Save the trained model 
model.save_pretrained("./saved_model")
# Save the tokenizer
tokenizer.save_pretrained("./saved_model")
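To reuse the fine-tuned model later, load it back from the same directory. A short sketch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Reload the fine-tuned model and tokenizer from the saved directory
loaded_model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
loaded_tokenizer = AutoTokenizer.from_pretrained("./saved_model")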

Step 11: Predict the Sentiment of the Review

Here, 0 indicates a negative review and 1 indicates a positive review. For the new examples below, the expected output is [0, 1]: "boring" signals a negative review (0), while "Spectacular" signals a positive one (1).

# Example input text
new_texts = ["This movie is boring", "Spectacular"] 

# Tokenize the input
inputs = tokenizer(new_texts, padding=True, truncation=True, return_tensors="pt")

# Move inputs to the same device as the model
inputs = inputs.to(model.device) 
# Put the model in evaluation mode
model.eval()

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=1)

print("Predictions:", predictions.tolist())
Output: the predicted labels for the new reviews.
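To make the output easier to read, you can map the numeric predictions back to label names. A small sketch (the label names are our own convention, not stored in the checkpoint):

# Map class indices to human-readable labels (0 = negative, 1 = positive)
label_names = {0: "negative", 1: "positive"}
for text, pred in zip(new_texts, predictions.tolist()):
    print(f"{text!r} -> {label_names[pred]}")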

Limitations of ModernBERT

While ModernBERT brings several improvements over traditional BERT, it still has some limitations:

  1. Training Data Bias: It is trained mainly on English and code data, so it may not perform as efficiently on other languages or non-code text.
  2. Complexity: The architectural enhancements and new techniques like Flash Attention and Rotary Positional Embeddings add complexity to the model, which can make it harder to implement and fine-tune for specific tasks.
  3. Inference Speed: While Flash Attention improves inference speed, using the full 8,192-token context window may still be slower than processing shorter sequences.

Conclusion

ModernBERT takes BERT’s foundation and improves it with faster processing, better handling of long texts, and enhanced interpretability. While it still faces challenges like training data bias and complexity, it represents a significant leap in NLP. ModernBERT opens new possibilities for tasks like sentiment analysis and text classification, making advanced language understanding more efficient and accessible.

Key Takeaways

  • ModernBERT improves on BERT by fixing issues like inefficiency and limited context handling.
  • It uses Flash Attention and Rotary Positional Embeddings for faster processing and longer text support.
  • ModernBERT is great for tasks like sentiment analysis and text classification.
  • It still has some limitations, like bias toward English and code data.
  • Tools like Hugging Face and wandb make it easy to implement and use.


The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What are encoder-only architectures?

Ans. Encoder-only architectures process input sequences without generating output sequences, focusing on understanding and encoding the input.

Q2. What are limitations of BERT?

Ans. Some limitations of BERT include high computational resources, fixed context length, inefficiency, complexity, and lack of common sense reasoning.

Q3. What is attention mechanism?

Ans. An attention mechanism is a technique that allows the model to focus on specific parts of the input and determine which parts are more or less important.

Q4. What is alternating attention?

Ans. This mechanism alternates between focusing on local and global contexts within text sequences. Local attention highlights adjacent words or phrases, collecting fine-grained information, whereas global attention recognises overall patterns and relationships across the text.

Q5. What are Rotary Positional Embeddings? How are they different from fixed positional embeddings?

Ans. In contrast to fixed positional embeddings, which only capture absolute positions, rotary positional embeddings (RoPE) use rotation matrices to encode both absolute and relative positions. RoPE performs better with extended sequences.

Q6. What are the potential applications of ModernBERT?

Ans. ModernBERT can be applied in areas such as text classification, sentiment analysis, question answering, named-entity recognition, legal text analysis, and code understanding.

Q7. What is the wandb API and why is it needed?

Ans. Weights & Biases (W&B) is a platform for tracking, visualizing, and sharing ML experiments. It helps track model metrics such as accuracy, visualize experiment data and training progress, tune hyperparameters, keep track of model versions, and share results.

Hello data enthusiasts! I am V Aditi, a rising and dedicated data science and artificial intelligence student embarking on a journey of exploration and learning in the world of data and machines. Join me as I navigate through the fascinating world of data science and artificial intelligence, unraveling mysteries and sharing insights along the way! 📊✨
