ModernBERT is an advanced iteration of the original BERT model, designed to improve performance and efficiency in natural language processing (NLP) tasks. For developers working in machine learning, it introduces a set of modern architectural enhancements and training techniques that significantly broaden its applicability. With a native context length of 8,192 tokens, far exceeding the limits of traditional encoder models, ModernBERT lets you tackle complex tasks such as long-document retrieval and code understanding with high accuracy.
Its ability to process information rapidly while utilizing less memory makes it an essential tool for optimizing your NLP applications, whether you’re developing sophisticated search engines or enhancing AI-driven coding environments. Embracing ModernBERT not only streamlines your workflow but also positions you at the forefront of cutting-edge machine learning advancements.
ModernBERT is an advanced encoder model that builds upon the original BERT architecture, integrating various modern techniques to enhance performance and efficiency in natural language processing tasks.
Handling longer sequence lengths: ModernBERT supports a native sequence length of 8,192 tokens, far beyond BERT's limit of 512 tokens. This is critical, for instance, in RAG pipelines, where a small context window forces chunks that are too short to preserve semantic meaning.
ModernBERT is available in two sizes: ModernBERT-base (149 million parameters) and ModernBERT-large (395 million parameters).
ModernBERT replaces traditional positional encodings with RoPE, which improves the model’s ability to understand the relationships between words and allows it to scale effectively to longer sequence lengths of up to 8,192 tokens.
Transformers employ self-attention or cross-attention mechanisms that are agnostic to the order of tokens. This means the model perceives the input tokens as a set rather than a sequence. It thereby loses crucial information about the relationships between tokens based on their positions in the sequence. To mitigate this, positional encodings are utilized to embed information about the token positions directly into the model.
The challenge with absolute positional encoding is that the positional embedding table has a fixed number of rows, which means the model is bounded to a maximum input length.
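To make this concrete, here is a minimal sketch using PyTorch's nn.Embedding (illustrative only, not BERT's actual code): a learned absolute positional embedding is a lookup table with a fixed number of rows, so positions beyond that range simply have no embedding to look up.

import torch
import torch.nn as nn

max_len, d_model = 512, 768             # fixed table size, as in the original BERT
pos_embedding = nn.Embedding(max_len, d_model)

positions = torch.arange(600)           # a 600-token input exceeds the table
try:
    pos_embedding(positions)            # rows 512..599 do not exist
except IndexError as err:
    print("Position out of range:", err)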
In RoPE (Rotary Positional Encoding), positional information is incorporated directly into the Query (Q) and Key (K) vectors used in scaled dot-product attention. Each query and key is rotated by an angle proportional to its position in the sequence. Because the rotations of a query and a key partially cancel in the dot product, the resulting attention score depends only on the relative distance between the two tokens, and with the standard frequency schedule it tends to decay as that distance grows, so distant tokens contribute less than nearby ones.
For a 2D query vector q = (q1, q2) at position m, the rotated query vector that carries the positional information becomes

q_m = ( q1 cos(mθ) − q2 sin(mθ), q1 sin(mθ) + q2 cos(mθ) )

where θ is a preset non-zero constant (in the general higher-dimensional case, a vector of rotation frequencies, one per pair of dimensions).
The benefit over absolute positional encoding is that RoPE can generalize to sequence lengths unseen during training, since the only positional information it encodes is the relative pairwise distance between tokens.
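To see why, consider the following small NumPy sketch (purely illustrative, with an arbitrary base angle, not ModernBERT's implementation): rotating 2D queries and keys by angles proportional to their positions leaves the dot product dependent only on the relative offset between tokens.

import numpy as np

def rotate(vec, pos, theta=0.1):
    # Rotate a 2D vector by an angle proportional to its position in the sequence
    angle = pos * theta
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return rotation @ vec

q = np.array([1.0, 0.5])   # a 2D query vector
k = np.array([0.3, 0.8])   # a 2D key vector

# Both pairs are two positions apart, so the attention scores are identical
print(rotate(q, 5) @ rotate(k, 3))     # query at position 5, key at position 3
print(rotate(q, 12) @ rotate(k, 10))   # query at position 12, key at position 10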
The model utilizes GeGLU layers instead of the standard MLP layers found in older BERT architectures.
GeGLU activation function combines the capabilities of GLU (Gated Linear Unit) and GELU (Gaussian Error Linear Unit) activations, offering a unique mechanism for controlling the flow of information through the network.
In a Gated Linear Unit (GLU), the output is obtained by applying two linear transformations to the input and gating one with the sigmoid of the other:

GLU(x) = (xW + b) ⊗ sigmoid(xV + c)

where W and V are learned weight matrices and ⊗ denotes element-wise multiplication. This gating mechanism modulates the output based on the input, effectively controlling which parts of the input are passed through: when the sigmoid output is close to 1, more of the input passes through; when it is close to 0, less of it does.
The Gaussian Error Linear Unit (GELU) activation function smoothly weights inputs by their percentile under a standard Gaussian distribution: GELU(x) = x · Φ(x), where Φ is the standard Gaussian cumulative distribution function.
GELU provides a smoother transition around zero, which helps maintain gradients even for negative inputs, unlike ReLU, whose gradient is zero for negative inputs.
GeGLU is a gated variant that replaces the sigmoid gate of GLU with the GELU activation. It is defined as follows:

GeGLU(x) = GELU(xW + b) ⊗ (xV + c)

where W and V are learned weight matrices and ⊗ denotes element-wise multiplication. In practice, GELU is often computed with the tanh approximation GELU(x) ≈ 0.5 x (1 + tanh[sqrt(2/pi) (x + 0.044715 x³)]).
In summary, the mathematical structure of GeGLU, with its gating mechanism, enhanced non-linearity, smoothness, and probabilistic interpretation, contributes to its strong empirical performance, making it a valuable choice for modern neural networks.
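To make the gating concrete, below is a minimal PyTorch sketch of a GeGLU feed-forward block (the class name, hidden size, and bias-free linear layers are illustrative assumptions rather than ModernBERT's actual module):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    # Hypothetical module, for illustration only
    def __init__(self, d_model=768, d_hidden=2048):
        super().__init__()
        # A single input projection produces both the value and the gate halves
        self.wi = nn.Linear(d_model, 2 * d_hidden, bias=False)
        self.wo = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        value, gate = self.wi(x).chunk(2, dim=-1)
        # GeGLU: gate the value with GELU instead of the sigmoid used in plain GLU
        return self.wo(value * F.gelu(gate))

x = torch.randn(1, 16, 768)            # (batch, sequence length, d_model)
print(GeGLUFeedForward()(x).shape)     # torch.Size([1, 16, 768])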
ModernBERT employs an alternating attention pattern, where every third layer uses full global attention while the others focus on local context. This design balances efficiency and performance, allowing the model to process long inputs faster by reducing computational complexity.
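As a rough illustration of the idea (the layer count and window size below are assumptions for this sketch, not values taken from ModernBERT's configuration), the attention type for each layer could be laid out as follows:

num_layers = 22        # assumed encoder depth, for illustration only
window_size = 128      # assumed local sliding-window size, for illustration only

# Every third layer attends globally; the remaining layers use local attention
attention_pattern = [
    "global" if layer_idx % 3 == 0 else f"local(window={window_size})"
    for layer_idx in range(num_layers)
]
print(attention_pattern[:4])
# ['global', 'local(window=128)', 'local(window=128)', 'global']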
ModernBERT removes unnecessary bias terms from the architecture, allowing for more efficient use of parameters. This streamlining helps optimize performance without compromising accuracy.
An extra normalization layer is added after the embeddings, which stabilizes training and contributes to better convergence during the training process.
ModernBERT integrates Flash Attention 2, which enhances computational efficiency by reducing memory consumption and speeding up processing times for long sequences.
The model employs unpadding to eliminate unnecessary padding tokens during computation, further optimizing memory usage and accelerating operations.
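Assuming a supported NVIDIA GPU and the flash-attn package are available, the released answerdotai/ModernBERT-base checkpoint can be loaded with Flash Attention 2 through the standard transformers attn_implementation argument. The snippet below is a sketch; omit the argument to use the default attention implementation.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# attn_implementation="flash_attention_2" requires the flash-attn package and a supported GPU
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the model's prediction for the masked position
mask_positions = inputs["input_ids"][0] == tokenizer.mask_token_id
print(tokenizer.decode(logits[0, mask_positions].argmax(dim=-1)))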
The table below summarizes how ModernBERT differs from BERT:
| Feature | ModernBERT | BERT |
| --- | --- | --- |
| Context Length | 8,192 tokens | 512 tokens |
| Positional Embeddings | Rotary Positional Embeddings (RoPE), which enhance the model's ability to understand token positions and relationships | Traditional absolute positional embeddings |
| Activation Function | GeGLU, a gated variant of GELU that combines a gating mechanism with the Gaussian error linear unit | GELU, a smooth, differentiable activation function based on the Gaussian distribution |
| Training Data | A diverse dataset of over 2 trillion tokens, including web documents, code, and scientific literature | Primarily Wikipedia and BookCorpus |
| Model Sizes | Base (149 million parameters) and Large (395 million parameters) | Base (110 million parameters) and Large (340 million parameters) |
| Hardware Optimization | Designed to run efficiently on consumer-level GPUs such as the RTX 3090 and RTX 4090 | Not optimized for any particular hardware, which can lead to inefficiencies on consumer-grade GPUs |
| Speed and Efficiency | Up to 400% faster in training and inference compared to BERT | Requires more computational resources and processes long sequences more slowly |
ModernBERT's long context and efficiency make it well suited to practical applications such as long-document retrieval, semantic search, code understanding, and Retrieval Augmented Generation (RAG). Let us now walk through a hands-on Python implementation that uses ModernBERT embeddings to build a simple RAG-based system.
!pip install git+https://github.com/huggingface/transformers
!pip install sentence-transformers
!pip install datasets
!pip install -U weaviate-client
!pip install langchain-openai
We use a dataset of Indian financial news as the corpus to query. To access this dataset, you need a Hugging Face account with an authorization token. We select 100 rows from the dataset for the retrieval task.
from datasets import load_dataset

# Load the Indian Financial News dataset from the Hugging Face Hub
ds = load_dataset("kdave/Indian_Financial_News")

# Keep only the "Content" column from the train split
train_ds = ds["train"].select_columns(["Content"])

# Shuffle with a fixed seed and select the first 100 rows for the retrieval task
subset_ds = train_ds.shuffle(seed=42).select(range(100))
Generate text embeddings using the ModernBERT model and map them to the dataset for further processing.
from sentence_transformers import SentenceTransformer
# Load the SentenceTransformer model
model = SentenceTransformer("nomic-ai/modernbert-embed-base")
# Function to generate embeddings for a single text
def generate_embeddings(example):
example["embeddings"] = model.encode(example["Content"])
return example
# Apply the function to the dataset using map
embeddings_ds = subset_ds.map(generate_embeddings)
Convert the processed dataset into a Pandas DataFrame for easier manipulation and storage.
import pandas as pd
# Convert HF dataset to Pandas DF
df = embeddings_ds.to_pandas()
Weaviate is an open-source vector database that stores both objects and vectors. Embedded Weaviate lets us spin up a Weaviate instance directly from application code, without having to run a Docker container.
import weaviate
# Connect to Weaviate
client = weaviate.connect_to_embedded()
Create a Weaviate collection, define its schema, and insert the text embeddings along with their metadata.
import weaviate.classes as wvc
import weaviate.classes.config as wc

# Define the collection name
collection_name = "news_india"

# Delete the collection if it already exists
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

# Create the collection; vectors are supplied by us, so no built-in vectorizer is used
collection = client.collections.create(
    collection_name,
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    # Define properties (metadata) of the collection
    properties=[
        wc.Property(
            name="Content",
            data_type=wc.DataType.TEXT,
        )
    ],
)
# Insert the documents and their ModernBERT embeddings into the collection
objs = []
for i, d in enumerate(df["Content"]):
    objs.append(
        wvc.data.DataObject(
            properties={"Content": d},
            vector=df["embeddings"][i].tolist(),
        )
    )

collection.data.insert_many(objs)
top_n = 3

def retrieve(query):
    # Embed the query with the same ModernBERT model and search the collection
    query_embedding = model.encode(query)
    results = collection.query.near_vector(
        near_vector=query_embedding,
        limit=top_n,
    )
    # Return the content of the closest matching document as the context
    return results.objects[0].properties["content"]
import os
os.environ['OPENAI_API_KEY'] = 'Your_API_Key'
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# GPT-3.5 Turbo generates the final answer from the retrieved context
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

# Pull a standard RAG prompt from the LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Chain: retrieve context -> fill the prompt -> call the LLM -> parse to a string
rag_chain = (
    {"context": retrieve, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain.invoke("Which biscuits is Britannia Industries Ltd looking at reducing prices for?")
Output
"Britannia Industries Ltd is looking at reducing prices for its super-premium biscuits to attract more consumers and boost business. The super-premium biscuits sold under Pure Magic and Good Day brands cost Rs400 per kg or more. The company is focusing on premiumising the biscuit market and believes that lowering prices could lead to a significant upside in business."
As seen from the output above, the relevant answer is accurately fetched.
ModernBERT represents a significant leap forward in natural language processing, incorporating advanced techniques like Rotary Positional Encoding, GeGLU activations, and Flash Attention 2 to deliver enhanced performance and efficiency. Its ability to handle long sequences and its specialized training on diverse datasets, including code, make it a versatile tool for a wide range of applications—from long-document retrieval to contextual code analysis. By leveraging these innovations, ModernBERT provides developers with a powerful, scalable model for tackling complex NLP and code-related tasks.
Q1. What is ModernBERT?
A. ModernBERT is an advanced version of the BERT model, designed to enhance performance and efficiency in natural language processing tasks. It incorporates modern techniques such as Rotary Positional Encoding, GeGLU activations, and Flash Attention 2, allowing it to process longer sequences and perform more efficiently in various applications, including code retrieval and long-document analysis.
Q2. What sequence length does ModernBERT support?
A. ModernBERT supports a native sequence length of up to 8,192 tokens, significantly larger than BERT's limit of 512 tokens. Tasks like Retrieval Augmented Generation (RAG) systems and long-document retrieval particularly benefit from the extended length, as it maintains semantic understanding over extended contexts.
Q3. What is Rotary Positional Encoding (RoPE), and why does it matter?
A. RoPE replaces traditional positional encodings with a more scalable method that encodes relative distances between tokens in a sequence. This allows ModernBERT to efficiently handle long sequences and generalize to unseen sequence lengths, improving its ability to understand token relationships over extended contexts.
Q4. What is the GeGLU activation function?
A. The GeGLU activation function, which combines GLU (Gated Linear Unit) and GELU (Gaussian Error Linear Unit), enhances ModernBERT's ability to control the flow of information through the network. It provides improved non-linearity and smoothness in the learning process, contributing to better performance and gradient stability.