Unlocking RAG’s Potential with ModernBERT

Nibedita Dutta Last Updated : 20 Jan, 2025

ModernBERT is an advanced iteration of the original BERT model, meticulously crafted to elevate performance and efficiency in natural language processing (NLP) tasks. For developers working in machine learning, this model introduces a host of modern architectural enhancements and innovative training techniques that significantly broaden its applicability. With an impressive context length of 8,192 tokens—far exceeding the limitations of traditional models—ModernBERT empowers you to tackle complex tasks such as long-document retrieval and code understanding with unprecedented accuracy.

Its ability to process information rapidly while utilizing less memory makes it an essential tool for optimizing your NLP applications, whether you’re developing sophisticated search engines or enhancing AI-driven coding environments. Embracing ModernBERT not only streamlines your workflow but also positions you at the forefront of cutting-edge machine learning advancements.

Learning Objectives

  • Understand the architectural advancements and features of ModernBERT, including Rotary Positional Encoding (RoPE) and GeGLU activation.
  • Gain insights into the extended sequence length of ModernBERT, enabling long-document retrieval and code understanding.
  • Learn how ModernBERT improves computational efficiency with alternating attention and Flash Attention 2.
  • Discover practical applications of ModernBERT in code retrieval, hybrid semantic search, and Retrieval Augmented Generation (RAG) systems.
  • Implement ModernBERT embeddings in a hands-on Python example to create a simple RAG-based system.

This article was published as a part of the Data Science Blogathon.

Say Hello to ModernBERT

ModernBERT is an advanced encoder model that builds upon the original BERT architecture, integrating various modern techniques to enhance performance and efficiency in natural language processing tasks.

  • Handling Longer Sequence Length: ModernBERT supports a native sequence length of 8,192 tokens, significantly larger than BERT’s limit of 512 tokens. This is critical, for instance, in RAG pipelines, where a small context often makes chunks too small for semantic understanding.

  • Large & Diverse Training Data: It has been trained on 2 trillion tokens with diverse data sets that include code and scientific literature – enabling unique capabilities in tasks related to code retrieval and understanding.
  • Pareto improvement over BERT: ModernBERT is a new model series that is a Pareto improvement over BERT and its younger siblings across both speed and accuracy.
  • Code Retrieval: Since it has been trained on code as well, ModernBERT can work very well in code retrieval scenarios.

ModernBERT is available in two sizes –

  • ModernBERT-base: This model consists of 22 layers and has 149 million parameters.
  • ModernBERT-large: This version features 28 layers and contains 395 million parameters.

What Makes ModernBERT Stand Out?

Rotary Positional Embeddings

ModernBERT replaces traditional positional encodings with RoPE, which improves the model’s ability to understand the relationships between words and allows it to scale effectively to longer sequence lengths of up to 8,192 tokens.

Transformers employ self-attention or cross-attention mechanisms that are agnostic to the order of tokens. This means the model perceives the input tokens as a set rather than a sequence. It thereby loses crucial information about the relationships between tokens based on their positions in the sequence. To mitigate this, positional encodings are utilized to embed information about the token positions directly into the model.

Need For Rotary Positional Embedding

The challenge with absolute positional encoding is that the embedding table has a fixed number of rows, which means the model is bounded to a maximum input size.


In RoPE (Rotary Positional Encoding), positional information is incorporated directly into the Query (Q) and Key (K) vectors used in scaled dot-product attention. This is achieved by applying a rotational transformation to each query and key whose angle depends on its position in the sequence. The key concept is that the relative rotation between a query and a key grows with the distance between their positions, so their dot product depends only on that relative distance. This gradual misalignment between tokens reflects their relative positions, with greater distance typically resulting in more significant misalignment and a reduced dot product.

For a 2D query vector q = (q1, q2) at a single position m, the rotated query vector that carries the positional information becomes

q_rot(m) = [[cos(mθ), −sin(mθ)], [sin(mθ), cos(mθ)]] · (q1, q2)ᵀ

where θ is a preset non-zero constant (in the general higher-dimensional case, a vector of preset frequencies, one per 2D pair of coordinates).

The benefit over absolute positional encoding is that RoPE encodings can generalize to sequences of unseen lengths, since the only information they encode is the relative pairwise distance between two tokens.
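To make this property concrete, here is a minimal NumPy sketch (an illustration of the 2D rotation above, not ModernBERT’s actual implementation): when a query and a key are each rotated by an angle proportional to their positions, their dot product depends only on the distance between the positions.

import numpy as np

def rotate(q, m, theta=0.1):
    # Rotate a 2D vector q by the position-dependent angle m * theta
    c, s = np.cos(m * theta), np.sin(m * theta)
    R = np.array([[c, -s], [s, c]])
    return R @ q

q = np.array([1.0, 0.5])   # toy query vector
k = np.array([0.8, 0.3])   # toy key vector

# The dot product depends only on the relative distance between positions
print(rotate(q, 5) @ rotate(k, 3))    # positions 5 and 3  -> distance 2
print(rotate(q, 12) @ rotate(k, 10))  # positions 12 and 10 -> distance 2, same value
print(rotate(q, 12) @ rotate(k, 3))   # distance 9 -> a different value, reflecting the larger distance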

GeGLU Activation Function

The model utilizes GeGLU layers instead of the standard MLP layers found in older BERT architectures.


GeGLU activation function combines the capabilities of GLU (Gated Linear Unit) and GELU (Gaussian Error Linear Unit) activations, offering a unique mechanism for controlling the flow of information through the network.

In a Gated Linear Unit (GLU), the output is obtained by applying two linear transformations to the input and gating one of them through the sigmoid function –

GLU(x) = (xW + b) ⊗ σ(xV + c)

where σ is the sigmoid function and ⊗ denotes element-wise multiplication.

This gating mechanism modulates the output based on the set of inputs, effectively controlling which parts of the input are passed through. When the sigmoid output is close to 1, more of the input passes through; when it is close to 0, less of the input is allowed through

The Gaussian Error Linear Unit (GELU) activation smoothly weights inputs according to their percentile under a standard Gaussian distribution –

GELU(x) = x · Φ(x) ≈ 0.5x (1 + tanh[√(2/π) (x + 0.044715 x³)])

where Φ is the cumulative distribution function of the standard Gaussian.

GELU provides a smoother transition around zero, which helps maintain gradients even for negative inputs, unlike ReLU, which has zero gradient for negative inputs.

GeGLU is a combination of the GLU and GELU activation functions, defined as follows –

GeGLU(x) = GELU(xW + b) ⊗ (xV + c)

where the GELU branch, computed with the tanh approximation above, acts as the gate on the linear branch.

In summary, the mathematical structure of GeGLU—characterized by its gating mechanism, enhanced non-linearity, smoothness, probabilistic interpretation, and empirical effectiveness—contributes significantly to its superior performance in deep learning models, making it a valuable choice for modern neural networks.
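For reference, here is a minimal NumPy sketch of the GeGLU transform following the formula above. It is illustrative only; ModernBERT’s actual feed-forward layers also include an output projection and, as noted in the streamlined-architecture section below, drop the bias terms.

import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu(x, W, V, b, c):
    # GELU(xW + b) acts as the gate on the linear branch (xV + c)
    return gelu(x @ W + b) * (x @ V + c)

# Toy example: a batch of 2 inputs of dimension 4 projected to a hidden dimension of 8
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
b, c = np.zeros(8), np.zeros(8)
print(geglu(x, W, V, b, c).shape)  # (2, 8)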

Alternating Attention Mechanism

ModernBERT employs an alternating attention pattern, where every third layer uses full global attention while the others focus on local context. This design balances efficiency and performance, allowing the model to process long inputs faster by reducing computational complexity.
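A toy sketch of this layer pattern is shown below. The modulo rule and the tiny window size are illustrative simplifications, not the exact ModernBERT configuration.

import numpy as np

def attention_mask(seq_len, layer_idx, window=128):
    # Illustrative alternating pattern: every third layer is global, the rest are local
    if layer_idx % 3 == 0:
        return np.ones((seq_len, seq_len), dtype=bool)           # full global attention
    pos = np.arange(seq_len)
    return np.abs(pos[:, None] - pos[None, :]) <= window // 2    # sliding-window (local) attention

print(attention_mask(8, layer_idx=0, window=4).astype(int))  # global layer: all ones
print(attention_mask(8, layer_idx=1, window=4).astype(int))  # local layer: banded mask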

Streamlined Architecture

ModernBERT removes unnecessary bias terms from the architecture, allowing for more efficient use of parameters. This streamlining helps optimize performance without compromising accuracy.

Additional Normalization Layer

An extra normalization layer is added after the embeddings, which stabilizes training and contributes to better convergence during the training process.

Flash Attention 2 Integration

ModernBERT integrates Flash Attention 2, which enhances computational efficiency by reducing memory consumption and speeding up processing times for long sequences.
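For reference, the transformers library exposes a general switch for requesting Flash Attention 2 at load time. A minimal sketch, assuming the flash-attn package and a compatible GPU are available, would look like this:

import torch
from transformers import AutoModel, AutoTokenizer

# Requires the flash-attn package and a supported GPU; drop attn_implementation otherwise
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)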

Unpadding Technique

The model employs unpadding to eliminate unnecessary padding tokens during computation, further optimizing memory usage and accelerating operations.

How is ModernBERT different from BERT?

The table below summarizes the key differences between ModernBERT and BERT:

| Feature | ModernBERT | BERT |
|---|---|---|
| Context Length | 8,192 tokens | 512 tokens |
| Positional Embeddings | Rotary Positional Embeddings (RoPE), which enhance the model’s ability to understand token positions and relationships | Traditional absolute positional embeddings |
| Activation Function | GeGLU, a gated variant of GELU that combines the benefits of gating mechanisms with the Gaussian error function | GELU, a smooth and differentiable activation function that approximates the Gaussian distribution |
| Training Data | Trained on a diverse dataset of over 2 trillion tokens, including web documents, code, and scientific literature | Primarily trained on Wikipedia |
| Model Sizes | Two configurations: Base (149 million parameters) and Large (395 million parameters) | Two configurations: Base (110 million parameters) and Large (340 million parameters) |
| Hardware Optimization | Specifically designed for compatibility with consumer-level GPUs like the RTX 3090 and RTX 4090, ensuring optimal performance and accessibility for real-world applications | Not specifically optimized for any particular hardware, which can lead to inefficiencies on consumer-grade GPUs |
| Speed and Efficiency | Up to 400% faster in training and inference compared to BERT, making it significantly more efficient | Generally requires extensive computational resources and processes longer sequences more slowly |

Practical Applications of ModernBERT

Let us now understand the practical applications of ModernBERT below:

  • Long-Document Retrieval: ModernBERT processes sequences of up to 8,192 tokens, making it ideal for retrieving and analyzing long documents, such as legal texts or scientific papers.
  • Hybrid Semantic Search: ModernBERT can enhance search engines by providing semantic understanding for both text and code queries, enabling more accurate and contextually relevant search results.
  • Contextual Code Analysis: ModernBERT’s training on large code datasets allows it to perform contextual analysis of code snippets, aiding in tasks like bug detection and code optimization.
  • Code Retrieval: ModernBERT excels in code retrieval tasks, making it suitable for developing AI-powered Integrated Development Environments (IDEs) and enterprise-wide code indexing solutions. It is particularly effective on datasets like StackOverflow-QA.

Python Implementation: Using ModernBERT for a Simple RAG System

Let us now proceed with a hands-on Python implementation that uses ModernBERT embeddings to create a simple RAG-based system.

Step 1: Installing Necessary Libraries

!pip install git+https://github.com/huggingface/transformers
!pip install sentence-transformers
!pip install datasets
!pip install -U weaviate-client
!pip install langchain-openai

Step 2: Loading the Dataset

We use a dataset of Indian financial news as the corpus to query. To use this dataset, you need a Hugging Face account with an authorization token. We select 100 rows from the dataset for the retrieval task.

from datasets import load_dataset

# Load the dataset and keep only the "Content" column
ds = load_dataset("kdave/Indian_Financial_News")
train_ds = ds["train"].select_columns(["Content"])

# Shuffle the dataset with a fixed seed and select the first 100 rows
subset_ds = train_ds.shuffle(seed=42).select(range(100))
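As a quick sanity check, you can print one of the sampled articles (truncated here for display):

# Inspect the first sampled article
print(subset_ds[0]["Content"][:300])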

Step 3: Generating Embeddings with modernbert-embed-base

Generate text embeddings using the ModernBERT model and map them to the dataset for further processing.

from sentence_transformers import SentenceTransformer

# Load the SentenceTransformer model
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# Function to generate embeddings for a single text
def generate_embeddings(example):
    example["embeddings"] = model.encode(example["Content"])
    return example

# Apply the function to the dataset using map
embeddings_ds = subset_ds.map(generate_embeddings)
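One note on usage: the nomic-ai/modernbert-embed-base model card recommends task prefixes for retrieval ("search_document: " for passages and "search_query: " for queries). The walkthrough below keeps the simpler unprefixed version, but a prefixed variant would look roughly like this:

# Optional variant with the prefixes suggested on the model card
def generate_prefixed_embeddings(example):
    example["embeddings"] = model.encode("search_document: " + example["Content"])
    return example

# A query would then be encoded as: model.encode("search_query: " + query)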

Step 4: Converting the Hugging Face Dataset to a Pandas DataFrame

Convert the processed dataset into a Pandas DataFrame for easier manipulation and storage.

import pandas as pd
# Convert HF dataset to Pandas DF
df = embeddings_ds.to_pandas()

Step 5: Inserting the Embeddings into Weaviate

Weaviate is an open-source vector database that stores both objects and vectors. Embedded Weaviate lets us spin up a Weaviate instance directly from the application code, without having to run a Docker container.

import weaviate
# Connect to Weaviate
client = weaviate.connect_to_embedded()

Step 6: Creating a Weaviate Collection and Inserting the Embeddings

Create a Weaviate collection, define its schema, and insert the text embeddings along with their metadata.

import weaviate.classes as wvc
import weaviate.classes.config as wc

# Define the collection name
collection_name = "news_india"

# Delete the collection if it already exists
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

# Create the collection with no built-in vectorizer (we supply our own vectors)
collection = client.collections.create(
    collection_name,
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    # Define properties of the metadata
    properties=[
        wc.Property(
            name="Content",
            data_type=wc.DataType.TEXT
        )
    ]
)

# Insert data into the collection
objs = []
for i, content in enumerate(df["Content"]):
    objs.append(
        wvc.data.DataObject(
            properties={"Content": content},
            vector=df["embeddings"][i].tolist()
        )
    )

collection.data.insert_many(objs)
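Optionally, you can verify that all 100 objects landed in the collection using the aggregate call from the v4 weaviate-client:

# Optional check: count the objects stored in the collection
print(collection.aggregate.over_all(total_count=True).total_count)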

Step 7: Defining a Retrieval Function

top_n = 3

def retrieve(query):
    # Embed the query with the same ModernBERT model used for the documents
    query_embedding = model.encode(query)
    # Fetch the top_n nearest stored vectors
    results = collection.query.near_vector(
        near_vector=query_embedding,
        limit=top_n
    )
    # Return the content of the best-matching document
    return results.objects[0].properties['content']
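Before wiring the retriever into a chain, it can be tested directly. The query below is illustrative; note that the function returns only the top match even though limit is set to top_n.

# Quick test of the retriever on an illustrative query
print(retrieve("Which company is planning to reduce biscuit prices?"))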

Step 8: Defining the RAG Chain

import os
os.environ['OPENAI_API_KEY'] = 'Your_API_Key'

from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
prompt = hub.pull("rlm/rag-prompt")

# Chain: retrieve context, fill the RAG prompt, call the LLM, and parse the output
rag_chain = (
    {"context": retrieve, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("Which biscuits is Britannia Industries Ltd looking at reducing prices for?")

Output

"Britannia Industries Ltd is looking at reducing prices for its super-premium biscuits to attract more consumers and boost business. The super-premium biscuits sold under Pure Magic and Good Day brands cost Rs400 per kg or more. The company is focusing on premiumising the biscuit market and believes that lowering prices could lead to a significant upside in business."

As seen from the output above, the relevant answer is accurately fetched.
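When you are finished querying, it is good practice to release the embedded Weaviate instance:

# Shut down the embedded Weaviate instance
client.close()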

Conclusion

ModernBERT represents a significant leap forward in natural language processing, incorporating advanced techniques like Rotary Positional Encoding, GeGLU activations, and Flash Attention 2 to deliver enhanced performance and efficiency. Its ability to handle long sequences and its specialized training on diverse datasets, including code, make it a versatile tool for a wide range of applications—from long-document retrieval to contextual code analysis. By leveraging these innovations, ModernBERT provides developers with a powerful, scalable model for tackling complex NLP and code-related tasks.

Key Takeaways

  • ModernBERT can handle up to 8,192 tokens, far exceeding BERT’s 512-token limit, making it ideal for long-context tasks like Retrieval Augmented Generation (RAG) systems and long-document retrieval.
  • The use of Rotary Positional Encoding (RoPE) improves ModernBERT’s ability to understand token relationships in longer sequences and offers better scalability compared to traditional positional encodings.
  • ModernBERT incorporates the GeGLU activation function, which combines GLU and GELU activations to enhance information flow control and improve model performance, especially in deep learning applications.
  • The alternating attention pattern in ModernBERT optimizes computational efficiency by using full global attention every third layer and local attention in the others, speeding up processing for long inputs.
  • With training on diverse datasets, including code, ModernBERT excels in tasks like code retrieval and contextual code analysis, making it a powerful tool for applications in development environments and code indexing.

Frequently Asked Questions

Q1. What is ModernBERT and how is it different from the original BERT?

A. ModernBERT is an advanced version of the BERT model, designed to enhance performance and efficiency in natural language processing tasks. It incorporates modern techniques such as Rotary Positional Encoding, GeGLU activations, and Flash Attention 2, allowing it to process longer sequences and perform more efficiently in various applications, including code retrieval and long-document analysis.

Q2. What is the maximum sequence length ModernBERT can handle?

A. ModernBERT supports a native sequence length of up to 8,192 tokens, significantly larger than BERT’s limit of 512 tokens. Tasks like Retrieval Augmented Generation (RAG) systems and long-document retrieval particularly benefit from the extended length, as it maintains semantic understanding over extended contexts.

Q3. How does Rotary Positional Encoding (RoPE) improve ModernBERT?

A. RoPE replaces traditional positional encodings with a more scalable method that encodes relative distances between tokens in a sequence. This allows ModernBERT to efficiently handle long sequences and generalize to unseen sequence lengths, improving its ability to understand token relationships over extended contexts.

Q4. What are the key benefits of using GeGLU activation in ModernBERT?

A. The GeGLU activation function, which combines GLU (Gated Linear Unit) and GELU (Gaussian Error Linear Unit), enhances ModernBERT’s ability to control the flow of information through the network. It provides improved non-linearity and smoothness in the learning process, contributing to better performance and gradient stability.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
