Imagine a journalist piecing together a story: not just relying on memory, but searching archives and verifying facts. That's how a Retrieval-Augmented Generation (RAG) model works, retrieving real-time knowledge for better accuracy. And just as the journalist needs strong research skills, a RAG model needs the right embedding model to retrieve and rank relevant information. The right embedding ensures precise, relevant retrieval and enhances the model's output; the optimal choice depends on domain specificity, retrieval accuracy requirements, and model architecture. In this blog, we'll walk through the steps involved in choosing embeddings for RAG models based on your specific application.
RAG models depend on high-quality text embeddings to retrieve relevant information efficiently. Text embeddings transform text into numerical vectors, enabling the model to process and compare text data. Selecting an appropriate embedding model is critical for improving retrieval accuracy, response relevance, and overall system performance.
Before jumping into the mainstream embedding models, let's begin by understanding the most important parameters that determine their efficiency. When comparing embedding models, the key factors to consider are the context window, cost, quality (measured by MTEB score), vocabulary size, tokenization unit, dimensionality, and type of training data. Together, these factors determine a model's efficiency, accuracy, and adaptability to different tasks.
Also Read: How to Find the Best Multilingual Embedding Model for Your RAG?
Let's understand each of these parameters, one by one.
A context window is the maximum number of tokens (words or subwords) a model can process in a single input. For instance, if a model has a context window of 512 tokens, it can only process 512 tokens at a time; longer texts must be truncated or split into smaller chunks. Some embedding models, like OpenAI's text-embedding-ada-002 (8,192 tokens), support much longer context windows, making them ideal for handling extensive documents in RAG applications.
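To see what this means in practice, here's a minimal sketch (using the tiktoken library and its cl100k_base encoding as an assumed tokenizer) that counts tokens and splits a long document into chunks that fit a 512-token window:

```python
import tiktoken

# A BPE encoding compatible with recent OpenAI embedding models.
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into chunks that each fit within the model's context window."""
    tokens = encoding.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append(encoding.decode(tokens[start:start + max_tokens]))
    return chunks

long_document = "retrieval augmented generation " * 2000  # placeholder for a long document
chunks = chunk_text(long_document, max_tokens=512)
print(f"{len(chunks)} chunks, largest has "
      f"{max(len(encoding.encode(c)) for c in chunks)} tokens")
```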
Tokenization is the process of breaking text down into smaller units (tokens) that the model can process. The tokenization unit refers to the method used to split text into tokens.
Let’s explore some common tokenization methods used in NLP and how they impact model performance.
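As a quick illustration, the sketch below (model names are just examples) shows how a WordPiece tokenizer and a byte-pair-encoding tokenizer split the same sentence differently:

```python
from transformers import AutoTokenizer

text = "Retrieval-Augmented Generation improves factual accuracy."

# WordPiece tokenization (used by BERT-style models)
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")
print(wordpiece.tokenize(text))

# Byte-pair encoding (used by GPT-style models and most OpenAI models)
bpe = AutoTokenizer.from_pretrained("gpt2")
print(bpe.tokenize(text))
```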
Dimensionality refers to the size of the embedding vector produced by the model. For example, a model with 768-dimensional embeddings outputs a vector of 768 numbers for each input text.
Example: OpenAI text-embedding-3-large produces 3072-dimensional embeddings, while Jina Embeddings v3 produces 1024-dimensional embeddings.
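If you are evaluating an open-source model, you can check its dimensionality directly; here's a small sketch using sentence-transformers with an example model:

```python
from sentence_transformers import SentenceTransformer

# Example open-source model; swap in the model you are evaluating.
model = SentenceTransformer("all-MiniLM-L6-v2")

embedding = model.encode("A single sentence to embed.")
print(embedding.shape)                           # (384,) for this particular model
print(model.get_sentence_embedding_dimension())  # same value, queried directly
```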
The vocabulary size is the number of unique tokens (words or subwords) that the tokenizer can recognize.
Example: Most modern models (e.g., BERT, OpenAI) have vocab sizes of 30,000–50,000 tokens.
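You can inspect a tokenizer's vocabulary size in the same way; the sketch below uses two example tokenizers from Hugging Face:

```python
from transformers import AutoTokenizer

for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.vocab_size)  # ~30,522 (WordPiece) vs 50,257 (byte-pair encoding)
```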
Training data refers to the dataset used to train the model. It determines the model’s knowledge and capabilities.
Let’s take a look at the different types of training data that influence a RAG model’s performance.
Cost refers to the financial and computational resources required to use an embedding model, including expenses related to infrastructure, API usage, and hardware acceleration.
Models can be of two types: API-based models and open-source models.
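The sketch below contrasts the two usage patterns, assuming an OPENAI_API_KEY environment variable for the API-based path and an example open-source model for the self-hosted path:

```python
# API-based: pay per token, no local GPU needed (requires OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do RAG models retrieve documents?",
)
api_embedding = response.data[0].embedding

# Open-source: no per-token fee, but runs on your own hardware.
from sentence_transformers import SentenceTransformer

local_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
local_embedding = local_model.encode("How do RAG models retrieve documents?")

print(len(api_embedding), local_embedding.shape)
```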
The MTEB (Massive Text Embedding Benchmark) score measures the performance of an embedding model across a wide range of tasks, including semantic search, classification, and clustering.
Example: OpenAI text-embedding-3-large has an MTEB score of ~64.6, while Jina Embeddings v3 scores ~59.5.
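If you want to reproduce such scores for a candidate model, the mteb library can run individual benchmark tasks. The sketch below is a minimal example; the task name is illustrative, and the exact API may differ slightly between mteb versions:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an .encode() method can be benchmarked.
model = SentenceTransformer("all-MiniLM-L6-v2")

evaluation = MTEB(tasks=["Banking77Classification"])  # one example task
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```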
Also Read: Enhancing RAG Systems with Nomic Embeddings
Now, let’s explore some of the most popular text embedding models for building RAG systems.
| Model | Context Window | Cost (per 1M tokens) | Quality (MTEB Score) | Vocab Size | Tokenization Unit | Dimensionality | Training Data |
|---|---|---|---|---|---|---|---|
| OpenAI text-embedding-ada-002 | 8,192 tokens | $0.10 | ~61.0 | Not publicly disclosed | Subword (Byte Pair) | 1536 | Not publicly disclosed by OpenAI. |
| NVIDIA NV-Embed-v2 | 32,768 tokens | Open-source | 72.31 | 50,000+ | Subword (Byte Pair) | 4096 | Trained using hard-negative mining, synthetic data generation, and existing publicly available datasets. |
| OpenAI text-embedding-3-large | 8,192 tokens | $0.13 | ~64.6 | Not publicly disclosed | Subword (Byte Pair) | 3072 | Not publicly disclosed by OpenAI. |
| OpenAI text-embedding-3-small | 8,192 tokens | $0.02 | ~62.3 | 50,257 | Subword (Byte Pair) | 1536 | Not publicly disclosed by OpenAI. |
| Gemini text-embedding-004 | 2,048 tokens | Not available | ~60.8 | 50,000+ | Subword (Byte Pair) | 768 | Not publicly disclosed. |
| Jina Embeddings v3 | 8,192 tokens | Open-source | ~59.5 | 50,000+ | Subword (Byte Pair) | 1024 | Trained on large-scale web data, books, and other text corpora. |
| Cohere embed-english-v3.0 | 512 tokens | $0.10 | ~64.5 | 50,000+ | Subword (Byte Pair) | 1024 | Trained on large-scale web data, books, and other text corpora. |
| voyage-3-large | 32,000 tokens | $0.06 | ~60.5 | 50,000+ | Subword (Byte Pair) | 2048 | Trained on diverse datasets across multiple domains, including large-scale web data, books, and other text corpora. |
| voyage-3-lite | 32,000 tokens | $0.02 | ~59.0 | 50,000+ | Subword (Byte Pair) | 512 | Trained on diverse datasets across multiple domains, including large-scale web data, books, and other text corpora. |
| Stella 400M v5 | 512 tokens | Open-source | ~58.5 | 50,000+ | Subword (Byte Pair) | 1024 | Trained on large-scale web data, books, and other text corpora. |
| Stella 1.5B v5 | 512 tokens | Open-source | ~59.8 | 50,000+ | Subword (Byte Pair) | 1024 | Trained on large-scale web data, books, and other text corpora. |
| ModernBERT Embed Base | 512 tokens | Open-source | ~57.5 | 30,000 | WordPiece | 768 | Trained on large-scale web data, books, and other text corpora. |
| ModernBERT Embed Large | 512 tokens | Open-source | ~58.2 | 30,000 | WordPiece | 1024 | Trained on large-scale web data, books, and other text corpora. |
| BAAI/bge-base-en-v1.5 | 512 tokens | Open-source | ~60.0 | 30,000 | WordPiece | 768 | Trained on large-scale web data, books, and other text corpora. |
| law-ai/LegalBERT | 512 tokens | Open-source | ~55.0 | 30,000 | WordPiece | 768 | Trained on legal documents, case law, and other legal text corpora. |
| GanjinZero/biobert-base | 512 tokens | Open-source | ~54.5 | 30,000 | WordPiece | 768 | Trained on biomedical and clinical text corpora. |
| allenai/specter | 512 tokens | Open-source | ~56.0 | 30,000 | WordPiece | 768 | Trained on scientific papers and citation graphs. |
| m3e-base | 512 tokens | Open-source | ~57.0 | 30,000 | WordPiece | 768 | Trained on Chinese and English text corpora. |
Using the text embedding models mentioned above, we will solve a specific problem statement by evaluating different embeddings against our requirements. At every step of the selection process, we will systematically eliminate models that do not align with our needs, so that by the end we can identify the best embedding model for our use case. In this example, I'll show you how to choose the most suitable model from the list above for building a semantic search system.
Let’s say we need to choose the best embedding model for a text-based retrieval system that performs semantic searches on a large dataset of scientific papers. The system must handle long documents (2,000 to 8,000 words). It should achieve high accuracy for retrieval, measured by a strong Massive Text Embedding Benchmark (MTEB) score, to ensure meaningful and relevant search results while remaining cost-effective and scalable, with a monthly budget of $300–$500.
Given the specific needs of the semantic search system, we will evaluate each embedding model based on factors such as domain relevance, context window size, cost-effectiveness, and performance to identify the best fit for the task.
Scientific papers are rich in technical terminology and intricate language, necessitating a model trained on academic, scientific, or technical texts. So, we need to eliminate models primarily tailored for legal or biomedical domains, as they may not generalize effectively to broader scientific literature.
Eliminated Models:
A typical research paper contains 2,000 to 8,000 words, which translates to roughly 2,660 to 10,640 tokens, assuming about 1.33 tokens per word. Setting the system's capacity to 8,192 tokens allows it to process papers of up to ~6,160 words (8,192 ÷ 1.33). This covers most research papers without truncation, capturing their full context, including the abstract, introduction, methodology, results, and conclusions.
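The conversion above is simple enough to sanity-check in a few lines; the 1.33 tokens-per-word ratio is only a rough heuristic for English text:

```python
TOKENS_PER_WORD = 1.33   # rough heuristic for English prose
CONTEXT_WINDOW = 8192    # tokens

def words_to_tokens(words: int) -> int:
    return round(words * TOKENS_PER_WORD)

print(words_to_tokens(2000), words_to_tokens(8000))  # ~2,660 to ~10,640 tokens
print(round(CONTEXT_WINDOW / TOKENS_PER_WORD))       # ~6,160 words fit in one window
```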
For our use case, models with a small context window (≤512 tokens) would be inadequate, so we should eliminate those with a context window of 512 tokens or fewer.
Eliminated Models:
With a monthly budget of $300–$500 and a preference for self-hosting to avoid recurring API expenses, it’s essential to evaluate the cost-effectiveness of each model. Let’s look at the models remaining on our list.
OpenAI Models:
Jina Embeddings v3:
Cost Analysis: Assuming an average document length of 8,000 tokens and processing 10,000 documents monthly, here’s how much the above embeddings would cost:
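Here's a minimal sketch of that calculation, using the per-million-token prices from the comparison table; self-hosted models show $0 in API fees, but their GPU and infrastructure costs are not included:

```python
DOCS_PER_MONTH = 10_000
TOKENS_PER_DOC = 8_000
monthly_tokens = DOCS_PER_MONTH * TOKENS_PER_DOC  # 80M tokens per month

# Prices per 1M tokens, taken from the comparison table above.
price_per_million = {
    "OpenAI text-embedding-ada-002": 0.10,
    "OpenAI text-embedding-3-large": 0.13,
    "OpenAI text-embedding-3-small": 0.02,
    "voyage-3-large": 0.06,
    "voyage-3-lite": 0.02,
    "Jina Embeddings v3 (self-hosted)": 0.00,   # open source; infrastructure cost not included
    "NVIDIA NV-Embed-v2 (self-hosted)": 0.00,   # open source; infrastructure cost not included
}

for model, price in price_per_million.items():
    cost = monthly_tokens / 1_000_000 * price
    print(f"{model}: ${cost:.2f}/month in API fees")
```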
Eliminated Models (Exceeding Budget):
The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across various tasks, providing a comprehensive performance metric.
Performance Insights:
Let's compare the performance of the few models we are left with.
Now, let’s evaluate all the aspects of these models to make our final choice.
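One way to make this trade-off explicit is a simple weighted score over the shortlisted models, as in the sketch below. The weights and normalizations are illustrative choices, not values from any benchmark, and the monthly costs are the API-fee estimates from the previous step:

```python
# Shortlisted models: MTEB scores from the table above, context window in tokens,
# and estimated monthly API fees (0 means self-hosted; infrastructure not included).
shortlist = {
    "NVIDIA NV-Embed-v2": {"mteb": 72.3, "context": 32768, "monthly_cost": 0.00},
    "voyage-3-large":     {"mteb": 60.5, "context": 32000, "monthly_cost": 4.80},
    "Jina Embeddings v3": {"mteb": 59.5, "context": 8192,  "monthly_cost": 0.00},
    "voyage-3-lite":      {"mteb": 59.0, "context": 32000, "monthly_cost": 1.60},
}

# Illustrative weights: retrieval quality matters most, then context length, then cost.
WEIGHTS = {"quality": 0.6, "context": 0.3, "cost": 0.1}

def score(spec: dict) -> float:
    quality = spec["mteb"] / 100                        # normalize MTEB to 0-1
    context = min(spec["context"] / 32768, 1.0)         # normalize context window
    cost = 1.0 - min(spec["monthly_cost"] / 500, 1.0)   # cheaper is better
    return (WEIGHTS["quality"] * quality
            + WEIGHTS["context"] * context
            + WEIGHTS["cost"] * cost)

for name, spec in sorted(shortlist.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{name}: {score(spec):.3f}")
```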
NVIDIA NV-Embed-v2 is the recommended model for high-performance, cost-effective, and long-context semantic search in a scientific paper retrieval system. If infrastructure costs are a concern, Jina Embeddings v3 and Voyage-3-large are strong alternatives.
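A minimal semantic-search sketch for the final pipeline is shown below. For illustration it loads a small open model; in production you would swap in the recommended nvidia/NV-Embed-v2 checkpoint (which requires trust_remote_code=True and substantial GPU memory) or one of the alternatives:

```python
from sentence_transformers import SentenceTransformer, util

# Small open model used here purely for illustration; replace with your chosen model.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

papers = [
    "A transformer-based approach to protein structure prediction.",
    "Graph neural networks for traffic forecasting in urban road networks.",
    "Contrastive pre-training of sentence embeddings for dense retrieval.",
]
query = "Which papers discuss embedding models for retrieval?"

# Normalized embeddings let cosine similarity act as the relevance score.
paper_vecs = model.encode(papers, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_vec, paper_vecs)[0]
best = scores.argmax().item()
print(f"Top match: {papers[best]} (score {scores[best].item():.3f})")
```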
Fine-tuning an embedding model is not always necessary. In many cases, an off-the-shelf model will perform well enough. However, if you need highly optimized results for your specific dataset, fine-tuning can help squeeze out the last bit of performance. That said, fine-tuning comes with significant computational and financial costs, which must be weighed carefully.
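If you do decide to fine-tune, a minimal sketch with sentence-transformers might look like the following; the training pairs and base model are placeholders for your own domain data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder (query, relevant passage) pairs drawn from your own domain.
train_examples = [
    InputExample(texts=["What is dense retrieval?",
                        "Dense retrieval encodes queries and documents into vectors ..."]),
    InputExample(texts=["How do transformers handle long documents?",
                        "Long-context transformers extend the attention window ..."]),
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_loader = DataLoader(train_examples, batch_size=2, shuffle=True)

# Treats the other passages in each batch as negatives (in-batch negatives).
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
model.save("my-finetuned-embedding-model")
```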
Choosing an appropriate embedding for your Retrieval-Augmented Generation (RAG) model is a key step in achieving effective and accurate retrieval of relevant documents. The decision depends on various factors, such as data modality, complexity of retrieval, computational capabilities, and available budget. While API-based models often offer high-quality embeddings, open-source alternatives provide greater flexibility and cost-effectiveness for self-hosted solutions.
By carefully evaluating embedding models based on context window size, semantic search capabilities, and benchmark performance, you can optimize your RAG system for your specific use case. Additionally, fine-tuning embeddings can further enhance performance in domain-specific applications, though it requires careful consideration of computational costs. Ultimately, a well-chosen embedding model lays the foundation for an effective RAG pipeline, improving response accuracy and overall system efficiency.
A. Embeddings convert words or sentences into numerical vectors, allowing for efficient comparison and retrieval. In semantic search, similar documents or terms are identified by comparing their embedding vectors. This process ensures that the retrieved documents are contextually relevant, even if they don’t share exact keywords.
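For a concrete sense of that comparison step, here is a tiny sketch of cosine similarity between two toy embedding vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings; real models produce hundreds or thousands of dimensions.
query_vec = np.array([0.2, 0.7, 0.1, 0.5])
doc_vec = np.array([0.25, 0.65, 0.05, 0.55])

print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 => semantically similar
```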
A. Yes, the model architecture influences how embeddings are generated. For instance, transformer-based models like BERT and GPT generate embeddings based on contextualized representations, meaning they understand the word in relation to the sentence. Older models like Word2Vec generate static embeddings that are not context-sensitive.
A. Yes, combining embeddings from different models can help capture different aspects of the text. For example, you could combine embeddings from a general-purpose model with domain-specific embeddings to get a more comprehensive representation of your data. This approach can improve retrieval accuracy and relevance.
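A simple way to combine models is to concatenate their normalized vectors; the sketch below uses two example models and is only one of several possible strategies:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

general_model = SentenceTransformer("all-MiniLM-L6-v2")  # general-purpose example
domain_model = SentenceTransformer("allenai/specter")    # scientific-paper example

text = "Self-supervised pre-training improves citation recommendation."
general_vec = general_model.encode(text)
domain_vec = domain_model.encode(text)

# Concatenation: the combined dimensionality is the sum of the two; normalize each
# part first so neither model dominates later similarity scores.
combined = np.concatenate([general_vec / np.linalg.norm(general_vec),
                           domain_vec / np.linalg.norm(domain_vec)])
print(combined.shape)
```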
A. The Massive Text Embedding Benchmark (MTEB) score measures a model’s performance on a range of tasks, such as semantic search, text classification, and sentiment analysis. A high MTEB score indicates better retrieval accuracy and overall performance.
A. API-based models are pay-per-use and offer ease of access, while open-source models are free to use but require computational resources (e.g., GPUs) for training or inference. Open-source models may have no per-token cost but could involve infrastructure expenses.