ModernBERT is an advanced iteration of the original BERT model, designed to improve performance and efficiency in natural language processing (NLP) tasks. For developers working in machine learning, it introduces a set of modern architectural enhancements and training techniques that significantly broaden its applicability. With a native context length of 8,192 tokens, far exceeding the limits of traditional encoder models, ModernBERT lets you tackle complex tasks such as long-document retrieval and code understanding with high accuracy.
Its ability to process information rapidly while utilizing less memory makes it an essential tool for optimizing your NLP applications, whether you’re developing sophisticated search engines or enhancing AI-driven coding environments. Embracing ModernBERT not only streamlines your workflow but also positions you at the forefront of cutting-edge machine learning advancements.
ModernBERT is an advanced encoder model that builds upon the original BERT architecture, integrating various modern techniques to enhance performance and efficiency in natural language processing tasks.
Handling longer sequence lengths: ModernBERT supports a native sequence length of 8,192 tokens, far beyond BERT's limit of 512 tokens. This is critical, for instance, in RAG pipelines, where a small context window forces chunks that are too short to preserve semantic meaning.
ModernBERT is available in two sizes: ModernBERT-base (149 million parameters) and ModernBERT-large (395 million parameters).
ModernBERT replaces traditional positional encodings with RoPE, which improves the model’s ability to understand the relationships between words and allows it to scale effectively to longer sequence lengths of up to 8,192 tokens.
Transformers employ self-attention or cross-attention mechanisms that are agnostic to the order of tokens. This means the model perceives the input tokens as a set rather than a sequence. It thereby loses crucial information about the relationships between tokens based on their positions in the sequence. To mitigate this, positional encodings are utilized to embed information about the token positions directly into the model.
The challenge with absolute positional encoding is that the positional embedding table has a fixed number of rows, which means the model is bounded to a maximum input length.
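To make this concrete, here is a minimal sketch using PyTorch's nn.Embedding (illustrative only, not BERT's actual code): a learned absolute positional embedding is a lookup table with a fixed number of rows, so positions beyond that range simply have no embedding to look up.

import torch
import torch.nn as nn

max_len, d_model = 512, 768             # fixed table size, as in the original BERT
pos_embedding = nn.Embedding(max_len, d_model)

positions = torch.arange(600)           # a 600-token input exceeds the table
try:
    pos_embedding(positions)            # rows 512..599 do not exist
except IndexError as err:
    print("Position out of range:", err)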
In RoPE (Rotary Positional Encoding), positional information is incorporated directly into the Query (Q) and Key (K) vectors used in scaled dot-product attention. Each query and key is rotated by an angle proportional to its position in the sequence. Because the rotations of a query and a key partially cancel in the dot product, the resulting attention score depends only on the relative distance between the two tokens, and with the standard frequency schedule it tends to decay as that distance grows, so distant tokens contribute less than nearby ones.
For a 2D query vector q = (q1, q2) at position m, the rotated query vector that carries the positional information becomes

q_m = ( q1 cos(mθ) − q2 sin(mθ), q1 sin(mθ) + q2 cos(mθ) )

where θ is a preset non-zero constant (in the general higher-dimensional case, a vector of rotation frequencies, one per pair of dimensions).
The benefit over absolute positional encoding is that RoPE can generalize to sequence lengths unseen during training, since the only positional information it encodes is the relative pairwise distance between tokens.
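To see why, consider the following small NumPy sketch (purely illustrative, with an arbitrary base angle, not ModernBERT's implementation): rotating 2D queries and keys by angles proportional to their positions leaves the dot product dependent only on the relative offset between tokens.

import numpy as np

def rotate(vec, pos, theta=0.1):
    # Rotate a 2D vector by an angle proportional to its position in the sequence
    angle = pos * theta
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return rotation @ vec

q = np.array([1.0, 0.5])   # a 2D query vector
k = np.array([0.3, 0.8])   # a 2D key vector

# Both pairs are two positions apart, so the attention scores are identical
print(rotate(q, 5) @ rotate(k, 3))     # query at position 5, key at position 3
print(rotate(q, 12) @ rotate(k, 10))   # query at position 12, key at position 10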
The model utilizes GeGLU layers instead of the standard MLP layers found in older BERT architectures.
GeGLU activation function combines the capabilities of GLU (Gated Linear Unit) and GELU (Gaussian Error Linear Unit) activations, offering a unique mechanism for controlling the flow of information through the network.
In a Gated Linear Unit (GLU), the output is obtained by applying two linear transformations to the input and gating one with the sigmoid of the other:

GLU(x) = (xW + b) ⊗ sigmoid(xV + c)

where W and V are learned weight matrices and ⊗ denotes element-wise multiplication. This gating mechanism modulates the output based on the input, effectively controlling which parts of the input are passed through: when the sigmoid output is close to 1, more of the input passes through; when it is close to 0, less of it does.
The Gaussian Error Linear Unit (GELU) activation function smoothly weights inputs by their percentile under a standard Gaussian distribution: GELU(x) = x · Φ(x), where Φ is the standard Gaussian cumulative distribution function.
GELU provides a smoother transition around zero, which helps maintain gradients even for negative inputs, unlike ReLU, whose gradient is zero for negative inputs.
GeGLU is a gated variant that replaces the sigmoid gate of GLU with the GELU activation. It is defined as follows:

GeGLU(x) = GELU(xW + b) ⊗ (xV + c)

where W and V are learned weight matrices and ⊗ denotes element-wise multiplication. In practice, GELU is often computed with the tanh approximation GELU(x) ≈ 0.5 x (1 + tanh[sqrt(2/pi) (x + 0.044715 x³)]).
In summary, the mathematical structure of GeGLU, with its gating mechanism, enhanced non-linearity, smoothness, and probabilistic interpretation, contributes to its strong empirical performance, making it a valuable choice for modern neural networks.
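To make the gating concrete, below is a minimal PyTorch sketch of a GeGLU feed-forward block (the class name, hidden size, and bias-free linear layers are illustrative assumptions rather than ModernBERT's actual module):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    # Hypothetical module, for illustration only
    def __init__(self, d_model=768, d_hidden=2048):
        super().__init__()
        # A single input projection produces both the value and the gate halves
        self.wi = nn.Linear(d_model, 2 * d_hidden, bias=False)
        self.wo = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        value, gate = self.wi(x).chunk(2, dim=-1)
        # GeGLU: gate the value with GELU instead of the sigmoid used in plain GLU
        return self.wo(value * F.gelu(gate))

x = torch.randn(1, 16, 768)            # (batch, sequence length, d_model)
print(GeGLUFeedForward()(x).shape)     # torch.Size([1, 16, 768])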
ModernBERT employs an alternating attention pattern, where every third layer uses full global attention while the others focus on local context. This design balances efficiency and performance, allowing the model to process long inputs faster by reducing computational complexity.
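As a rough illustration of the idea (the layer count and window size below are assumptions for this sketch, not values taken from ModernBERT's configuration), the attention type for each layer could be laid out as follows:

num_layers = 22        # assumed encoder depth, for illustration only
window_size = 128      # assumed local sliding-window size, for illustration only

# Every third layer attends globally; the remaining layers use local attention
attention_pattern = [
    "global" if layer_idx % 3 == 0 else f"local(window={window_size})"
    for layer_idx in range(num_layers)
]
print(attention_pattern[:4])
# ['global', 'local(window=128)', 'local(window=128)', 'global']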
ModernBERT removes unnecessary bias terms from the architecture, allowing for more efficient use of parameters. This streamlining helps optimize performance without compromising accuracy.
An extra normalization layer is added after the embeddings, which stabilizes training and contributes to better convergence during the training process.
ModernBERT integrates Flash Attention 2, which enhances computational efficiency by reducing memory consumption and speeding up processing times for long sequences.
The model employs unpadding to eliminate unnecessary padding tokens during computation, further optimizing memory usage and accelerating operations.
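Assuming a supported NVIDIA GPU and the flash-attn package are available, the released answerdotai/ModernBERT-base checkpoint can be loaded with Flash Attention 2 through the standard transformers attn_implementation argument. The snippet below is a sketch; omit the argument to use the default attention implementation.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# attn_implementation="flash_attention_2" requires the flash-attn package and a supported GPU
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the model's prediction for the masked position
mask_positions = inputs["input_ids"][0] == tokenizer.mask_token_id
print(tokenizer.decode(logits[0, mask_positions].argmax(dim=-1)))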
The table below summarizes how ModernBERT differs from BERT:
| Feature | ModernBERT | BERT |
| --- | --- | --- |
| Context Length | 8,192 tokens | 512 tokens |
| Positional Embeddings | Rotary Positional Embeddings (RoPE), which enhance the model's ability to understand token positions and relationships | Traditional absolute positional embeddings |
| Activation Function | GeGLU, a gated variant of GELU that combines a gating mechanism with the Gaussian error linear unit | GELU, a smooth, differentiable activation function based on the Gaussian distribution |
| Training Data | A diverse dataset of over 2 trillion tokens, including web documents, code, and scientific literature | Primarily Wikipedia and BookCorpus |
| Model Sizes | Base (149 million parameters) and Large (395 million parameters) | Base (110 million parameters) and Large (340 million parameters) |
| Hardware Optimization | Designed to run efficiently on consumer-level GPUs such as the RTX 3090 and RTX 4090 | Not optimized for any particular hardware, which can lead to inefficiencies on consumer-grade GPUs |
| Speed and Efficiency | Up to 400% faster in training and inference compared to BERT | Requires more computational resources and processes long sequences more slowly |
ModernBERT's long context and efficiency make it well suited to practical applications such as long-document retrieval, semantic search, code understanding, and Retrieval Augmented Generation (RAG). Let us now walk through a hands-on Python implementation that uses ModernBERT embeddings to build a simple RAG-based system.
!pip install git+https://github.com/huggingface/transformers
!pip install sentence-transformers
!pip install datasets
!pip install -U weaviate-client
!pip install langchain-openai
We use a dataset of Indian financial news as the corpus to query. To access this dataset, you need a Hugging Face account with an authorization token. We select 100 rows from the dataset for the retrieval task.
from datasets import load_dataset

# Load the Indian Financial News dataset from the Hugging Face Hub
ds = load_dataset("kdave/Indian_Financial_News")

# Keep only the "Content" column from the train split
train_ds = ds["train"].select_columns(["Content"])

# Shuffle with a fixed seed and select the first 100 rows for the retrieval task
subset_ds = train_ds.shuffle(seed=42).select(range(100))
Generate text embeddings using the ModernBERT model and map them to the dataset for further processing.
from sentence_transformers import SentenceTransformer
# Load the SentenceTransformer model
model = SentenceTransformer("nomic-ai/modernbert-embed-base")
# Function to generate embeddings for a single text
def generate_embeddings(example):
example["embeddings"] = model.encode(example["Content"])
return example
# Apply the function to the dataset using map
embeddings_ds = subset_ds.map(generate_embeddings)
Convert the processed dataset into a Pandas DataFrame for easier manipulation and storage.
import pandas as pd
# Convert HF dataset to Pandas DF
df = embeddings_ds.to_pandas()
Weaviate is an open-source vector database that stores both objects and vectors. Embedded Weaviate lets us spin up a Weaviate instance directly from application code, without having to run a Docker container.
import weaviate
# Connect to Weaviate
client = weaviate.connect_to_embedded()
Create a Weaviate collection, define its schema, and insert the text embeddings along with their metadata.
import weaviate.classes as wvc
import weaviate.classes.config as wc

# Define the collection name
collection_name = "news_india"

# Delete the collection if it already exists
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

# Create the collection; vectors are supplied by us, so no built-in vectorizer is used
collection = client.collections.create(
    collection_name,
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    # Define properties (metadata) of the collection
    properties=[
        wc.Property(
            name="Content",
            data_type=wc.DataType.TEXT,
        )
    ],
)
# Insert the documents and their ModernBERT embeddings into the collection
objs = []
for i, d in enumerate(df["Content"]):
    objs.append(
        wvc.data.DataObject(
            properties={"Content": d},
            vector=df["embeddings"][i].tolist(),
        )
    )

collection.data.insert_many(objs)
top_n = 3

def retrieve(query):
    # Embed the query with the same ModernBERT model and search the collection
    query_embedding = model.encode(query)
    results = collection.query.near_vector(
        near_vector=query_embedding,
        limit=top_n,
    )
    # Return the content of the closest matching document as the context
    return results.objects[0].properties["content"]
import os
os.environ['OPENAI_API_KEY'] = 'Your_API_Key'
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# GPT-3.5 Turbo generates the final answer from the retrieved context
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

# Pull a standard RAG prompt from the LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Chain: retrieve context -> fill the prompt -> call the LLM -> parse to a string
rag_chain = (
    {"context": retrieve, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain.invoke("Which biscuits is Britannia Industries Ltd looking at reducing prices for?")
Output
"Britannia Industries Ltd is looking at reducing prices for its super-premium biscuits to attract more consumers and boost business. The super-premium biscuits sold under Pure Magic and Good Day brands cost Rs400 per kg or more. The company is focusing on premiumising the biscuit market and believes that lowering prices could lead to a significant upside in business."
As seen from the output above, the relevant answer is accurately fetched.
ModernBERT represents a significant leap forward in natural language processing, incorporating advanced techniques like Rotary Positional Encoding, GeGLU activations, and Flash Attention 2 to deliver enhanced performance and efficiency. Its ability to handle long sequences and its specialized training on diverse datasets, including code, make it a versatile tool for a wide range of applications—from long-document retrieval to contextual code analysis. By leveraging these innovations, ModernBERT provides developers with a powerful, scalable model for tackling complex NLP and code-related tasks.
Q1. What is ModernBERT?
A. ModernBERT is an advanced version of the BERT model, designed to enhance performance and efficiency in natural language processing tasks. It incorporates modern techniques such as Rotary Positional Encoding, GeGLU activations, and Flash Attention 2, allowing it to process longer sequences and perform more efficiently in various applications, including code retrieval and long-document analysis.
Q2. What sequence length does ModernBERT support?
A. ModernBERT supports a native sequence length of up to 8,192 tokens, significantly larger than BERT's limit of 512 tokens. Tasks like Retrieval Augmented Generation (RAG) systems and long-document retrieval particularly benefit from the extended length, as it maintains semantic understanding over extended contexts.
Q3. What is Rotary Positional Encoding (RoPE), and why does it matter?
A. RoPE replaces traditional positional encodings with a more scalable method that encodes relative distances between tokens in a sequence. This allows ModernBERT to efficiently handle long sequences and generalize to unseen sequence lengths, improving its ability to understand token relationships over extended contexts.
Q4. What is the GeGLU activation function?
A. The GeGLU activation function, which combines GLU (Gated Linear Unit) and GELU (Gaussian Error Linear Unit), enhances ModernBERT's ability to control the flow of information through the network. It provides improved non-linearity and smoothness in the learning process, contributing to better performance and gradient stability.