Imagine a journalist piecing together a story: not just relying on memory, but searching archives and verifying facts. That's how a Retrieval-Augmented Generation (RAG) model works, retrieving real-time knowledge for better accuracy. And just as the journalist needs strong research skills, a RAG model needs the right embedding model to retrieve and rank relevant information. The right embedding ensures precise, relevant retrieval and enhances the model's output; the optimal choice depends on domain specificity, retrieval accuracy requirements, and model architecture. In this blog, we'll walk through the steps involved in choosing embeddings for RAG models based on your specific application.
RAG models depend on high-quality text embeddings to retrieve relevant information efficiently. Text embeddings transform text into numerical vectors, enabling the model to process and compare text data. Selecting an appropriate embedding model is critical for improving retrieval accuracy, response relevance, and overall system performance.
Before jumping into the mainstream embedding models, let's begin by understanding the most important parameters that determine their efficiency. When comparing embedding models, the key factors to consider are the context window, cost, quality (measured by MTEB score), vocabulary size, tokenization unit, dimensionality, and type of training data. Together, these factors determine a model's efficiency, accuracy, and adaptability to different tasks.
Also Read: How to Find the Best Multilingual Embedding Model for Your RAG?
Let's understand each of these parameters, one by one.
A context window is the maximum number of tokens (words or subwords) a model can process in a single input. For instance, if a model has a context window of 512 tokens, it can only process 512 tokens at a time; longer texts must be truncated or split into smaller chunks. Some embedding models, like OpenAI's text-embedding-ada-002 (8,192 tokens), support much longer context windows, making them ideal for handling extensive documents in RAG applications.
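To see what this means in practice, here's a minimal sketch (using the tiktoken library and its cl100k_base encoding as an assumed tokenizer) that counts tokens and splits a long document into chunks that fit a 512-token window:

```python
import tiktoken

# A BPE encoding compatible with recent OpenAI embedding models.
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into chunks that each fit within the model's context window."""
    tokens = encoding.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append(encoding.decode(tokens[start:start + max_tokens]))
    return chunks

long_document = "retrieval augmented generation " * 2000  # placeholder for a long document
chunks = chunk_text(long_document, max_tokens=512)
print(f"{len(chunks)} chunks, largest has "
      f"{max(len(encoding.encode(c)) for c in chunks)} tokens")
```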
Tokenization is the process of breaking text down into smaller units (tokens) that the model can process. The tokenization unit refers to the method used to split text into tokens.
Let’s explore some common tokenization methods used in NLP and how they impact model performance.
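As a quick illustration, the sketch below (model names are just examples) shows how a WordPiece tokenizer and a byte-pair-encoding tokenizer split the same sentence differently:

```python
from transformers import AutoTokenizer

text = "Retrieval-Augmented Generation improves factual accuracy."

# WordPiece tokenization (used by BERT-style models)
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")
print(wordpiece.tokenize(text))

# Byte-pair encoding (used by GPT-style models and most OpenAI models)
bpe = AutoTokenizer.from_pretrained("gpt2")
print(bpe.tokenize(text))
```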
Dimensionality refers to the size of the embedding vector produced by the model. For example, a model with 768-dimensional embeddings outputs a vector of 768 numbers for each input text.
Example: OpenAI text-embedding-3-large produces 3072-dimensional embeddings, while Jina Embeddings v3 produces 1024-dimensional embeddings.
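If you are evaluating an open-source model, you can check its dimensionality directly; here's a small sketch using sentence-transformers with an example model:

```python
from sentence_transformers import SentenceTransformer

# Example open-source model; swap in the model you are evaluating.
model = SentenceTransformer("all-MiniLM-L6-v2")

embedding = model.encode("A single sentence to embed.")
print(embedding.shape)                           # (384,) for this particular model
print(model.get_sentence_embedding_dimension())  # same value, queried directly
```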
The vocabulary size is the number of unique tokens (words or subwords) that the tokenizer can recognize.
Example: Most modern models (e.g., BERT, OpenAI) have vocab sizes of 30,000–50,000 tokens.
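You can inspect a tokenizer's vocabulary size in the same way; the sketch below uses two example tokenizers from Hugging Face:

```python
from transformers import AutoTokenizer

for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.vocab_size)  # ~30,522 (WordPiece) vs 50,257 (byte-pair encoding)
```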
Training data refers to the dataset used to train the model. It determines the model’s knowledge and capabilities.
Let’s take a look at the different types of training data that influence a RAG model’s performance.
Cost refers to the financial and computational resources required to use an embedding model, including expenses related to infrastructure, API usage, and hardware acceleration.
Models can be of two types: API-based models and open-source models.
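The sketch below contrasts the two usage patterns, assuming an OPENAI_API_KEY environment variable for the API-based path and an example open-source model for the self-hosted path:

```python
# API-based: pay per token, no local GPU needed (requires OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do RAG models retrieve documents?",
)
api_embedding = response.data[0].embedding

# Open-source: no per-token fee, but runs on your own hardware.
from sentence_transformers import SentenceTransformer

local_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
local_embedding = local_model.encode("How do RAG models retrieve documents?")

print(len(api_embedding), local_embedding.shape)
```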
The MTEB (Massive Text Embedding Benchmark) score measures the performance of an embedding model across a wide range of tasks, including semantic search, classification, and clustering.
Example: OpenAI text-embedding-3-large has an MTEB score of ~64.6, while Jina Embeddings v3 scores ~59.5.
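If you want to reproduce such scores for a candidate model, the mteb library can run individual benchmark tasks. The sketch below is a minimal example; the task name is illustrative, and the exact API may differ slightly between mteb versions:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an .encode() method can be benchmarked.
model = SentenceTransformer("all-MiniLM-L6-v2")

evaluation = MTEB(tasks=["Banking77Classification"])  # one example task
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```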
Also Read: Enhancing RAG Systems with Nomic Embeddings
Now, let’s explore some of the most popular text embedding models for building RAG systems.
| Model | Context Window | Cost (per 1M tokens) | Quality (MTEB Score) | Vocab Size | Tokenization Unit | Dimensionality | Training Data |
|---|---|---|---|---|---|---|---|
| OpenAI text-embedding-ada-002 | 8,192 tokens | $0.10 | ~61.0 | Not publicly disclosed | Subword (Byte Pair) | 1536 | Not publicly disclosed by OpenAI. |
| NVIDIA NV-Embed-v2 | 32,768 tokens | Open-source | 72.31 | 50,000+ | Subword (Byte Pair) | 4096 | Trained using hard-negative mining, synthetic data generation, and existing publicly available datasets. |
| OpenAI text-embedding-3-large | 8,192 tokens | $0.13 | ~64.6 | Not publicly disclosed | Subword (Byte Pair) | 3072 | Not publicly disclosed by OpenAI. |
| OpenAI text-embedding-3-small | 8,192 tokens | $0.02 | ~62.3 | 50,257 | Subword (Byte Pair) | 1536 | Not publicly disclosed by OpenAI. |
| Gemini text-embedding-004 | 2,048 tokens | Not available | ~60.8 | 50,000+ | Subword (Byte Pair) | 768 | Not publicly disclosed. |
| Jina Embeddings v3 | 8,192 tokens | Open-source | ~59.5 | 50,000+ | Subword (Byte Pair) | 1024 | Trained on large-scale web data, books, and other text corpora. |
| Cohere embed-english-v3.0 | 512 tokens | $0.10 | ~64.5 | 50,000+ | Subword (Byte Pair) | 1024 | Trained on large-scale web data, books, and other text corpora. |
| voyage-3-large | 32,000 tokens | $0.06 | ~60.5 | 50,000+ | Subword (Byte Pair) | 2048 | Trained on diverse datasets across multiple domains, including large-scale web data, books, and other text corpora. |
| voyage-3-lite | 32,000 tokens | $0.02 | ~59.0 | 50,000+ | Subword (Byte Pair) | 512 | Trained on diverse datasets across multiple domains, including large-scale web data, books, and other text corpora. |
| Stella 400M v5 | 512 tokens | Open-source | ~58.5 | 50,000+ | Subword (Byte Pair) | 1024 | Trained on large-scale web data, books, and other text corpora. |
| Stella 1.5B v5 | 512 tokens | Open-source | ~59.8 | 50,000+ | Subword (Byte Pair) | 1024 | Trained on large-scale web data, books, and other text corpora. |
| ModernBERT Embed Base | 512 tokens | Open-source | ~57.5 | 30,000 | WordPiece | 768 | Trained on large-scale web data, books, and other text corpora. |
| ModernBERT Embed Large | 512 tokens | Open-source | ~58.2 | 30,000 | WordPiece | 1024 | Trained on large-scale web data, books, and other text corpora. |
| BAAI/bge-base-en-v1.5 | 512 tokens | Open-source | ~60.0 | 30,000 | WordPiece | 768 | Trained on large-scale web data, books, and other text corpora. |
| law-ai/LegalBERT | 512 tokens | Open-source | ~55.0 | 30,000 | WordPiece | 768 | Trained on legal documents, case law, and other legal text corpora. |
| GanjinZero/biobert-base | 512 tokens | Open-source | ~54.5 | 30,000 | WordPiece | 768 | Trained on biomedical and clinical text corpora. |
| allenai/specter | 512 tokens | Open-source | ~56.0 | 30,000 | WordPiece | 768 | Trained on scientific papers and citation graphs. |
| m3e-base | 512 tokens | Open-source | ~57.0 | 30,000 | WordPiece | 768 | Trained on Chinese and English text corpora. |
Using the text embedding models mentioned above, we will solve a specific problem statement by evaluating different embeddings against our requirements. At every step of the selection process, we will systematically eliminate models that do not align with our needs, so that by the end we can identify the best embedding model for our use case. In this example, I'll show you how to choose the most suitable model from the list above for building a semantic search system.
Let’s say we need to choose the best embedding model for a text-based retrieval system that performs semantic searches on a large dataset of scientific papers. The system must handle long documents (2,000 to 8,000 words). It should achieve high accuracy for retrieval, measured by a strong Massive Text Embedding Benchmark (MTEB) score, to ensure meaningful and relevant search results while remaining cost-effective and scalable, with a monthly budget of $300–$500.
Given the specific needs of the semantic search system, we will evaluate each embedding model based on factors such as domain relevance, context window size, cost-effectiveness, and performance to identify the best fit for the task.
Scientific papers are rich in technical terminology and intricate language, necessitating a model trained on academic, scientific, or technical texts. So, we need to eliminate models primarily tailored for legal or biomedical domains, as they may not generalize effectively to broader scientific literature.
Eliminated Models:
A typical research paper contains 2,000 to 8,000 words, which translates to roughly 2,660 to 10,640 tokens, assuming about 1.33 tokens per word. Setting the system's capacity to 8,192 tokens allows it to process papers of up to ~6,160 words (8,192 ÷ 1.33). This covers most research papers without truncation, capturing their full context, including the abstract, introduction, methodology, results, and conclusions.
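The conversion above is simple enough to sanity-check in a few lines; the 1.33 tokens-per-word ratio is only a rough heuristic for English text:

```python
TOKENS_PER_WORD = 1.33   # rough heuristic for English prose
CONTEXT_WINDOW = 8192    # tokens

def words_to_tokens(words: int) -> int:
    return round(words * TOKENS_PER_WORD)

print(words_to_tokens(2000), words_to_tokens(8000))  # ~2,660 to ~10,640 tokens
print(round(CONTEXT_WINDOW / TOKENS_PER_WORD))       # ~6,160 words fit in one window
```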
For our use case, models with a small context window (≤512 tokens) would be inadequate, so we should eliminate those with a context window of 512 tokens or fewer.
Eliminated Models:
With a monthly budget of $300–$500 and a preference for self-hosting to avoid recurring API expenses, it’s essential to evaluate the cost-effectiveness of each model. Let’s look at the models remaining on our list.
OpenAI Models:
Jina Embeddings v3:
Cost Analysis: Assuming an average document length of 8,000 tokens and processing 10,000 documents monthly, here’s how much the above embeddings would cost:
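Here's a minimal sketch of that calculation, using the per-million-token prices from the comparison table; self-hosted models show $0 in API fees, but their GPU and infrastructure costs are not included:

```python
DOCS_PER_MONTH = 10_000
TOKENS_PER_DOC = 8_000
monthly_tokens = DOCS_PER_MONTH * TOKENS_PER_DOC  # 80M tokens per month

# Prices per 1M tokens, taken from the comparison table above.
price_per_million = {
    "OpenAI text-embedding-ada-002": 0.10,
    "OpenAI text-embedding-3-large": 0.13,
    "OpenAI text-embedding-3-small": 0.02,
    "voyage-3-large": 0.06,
    "voyage-3-lite": 0.02,
    "Jina Embeddings v3 (self-hosted)": 0.00,   # open source; infrastructure cost not included
    "NVIDIA NV-Embed-v2 (self-hosted)": 0.00,   # open source; infrastructure cost not included
}

for model, price in price_per_million.items():
    cost = monthly_tokens / 1_000_000 * price
    print(f"{model}: ${cost:.2f}/month in API fees")
```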
Eliminated Models (Exceeding Budget):
The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across various tasks, providing a comprehensive performance metric.
Performance Insights:
Let's compare the performance of the few models we are left with.
Now, let’s evaluate all the aspects of these models to make our final choice.
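One way to make this trade-off explicit is a simple weighted score over the shortlisted models, as in the sketch below. The weights and normalizations are illustrative choices, not values from any benchmark, and the monthly costs are the API-fee estimates from the previous step:

```python
# Shortlisted models: MTEB scores from the table above, context window in tokens,
# and estimated monthly API fees (0 means self-hosted; infrastructure not included).
shortlist = {
    "NVIDIA NV-Embed-v2": {"mteb": 72.3, "context": 32768, "monthly_cost": 0.00},
    "voyage-3-large":     {"mteb": 60.5, "context": 32000, "monthly_cost": 4.80},
    "Jina Embeddings v3": {"mteb": 59.5, "context": 8192,  "monthly_cost": 0.00},
    "voyage-3-lite":      {"mteb": 59.0, "context": 32000, "monthly_cost": 1.60},
}

# Illustrative weights: retrieval quality matters most, then context length, then cost.
WEIGHTS = {"quality": 0.6, "context": 0.3, "cost": 0.1}

def score(spec: dict) -> float:
    quality = spec["mteb"] / 100                        # normalize MTEB to 0-1
    context = min(spec["context"] / 32768, 1.0)         # normalize context window
    cost = 1.0 - min(spec["monthly_cost"] / 500, 1.0)   # cheaper is better
    return (WEIGHTS["quality"] * quality
            + WEIGHTS["context"] * context
            + WEIGHTS["cost"] * cost)

for name, spec in sorted(shortlist.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{name}: {score(spec):.3f}")
```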
NVIDIA NV-Embed-v2 is the recommended model for high-performance, cost-effective, and long-context semantic search in a scientific paper retrieval system. If infrastructure costs are a concern, Jina Embeddings v3 and Voyage-3-large are strong alternatives.
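A minimal semantic-search sketch for the final pipeline is shown below. For illustration it loads a small open model; in production you would swap in the recommended nvidia/NV-Embed-v2 checkpoint (which requires trust_remote_code=True and substantial GPU memory) or one of the alternatives:

```python
from sentence_transformers import SentenceTransformer, util

# Small open model used here purely for illustration; replace with your chosen model.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

papers = [
    "A transformer-based approach to protein structure prediction.",
    "Graph neural networks for traffic forecasting in urban road networks.",
    "Contrastive pre-training of sentence embeddings for dense retrieval.",
]
query = "Which papers discuss embedding models for retrieval?"

# Normalized embeddings let cosine similarity act as the relevance score.
paper_vecs = model.encode(papers, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_vec, paper_vecs)[0]
best = scores.argmax().item()
print(f"Top match: {papers[best]} (score {scores[best].item():.3f})")
```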
Fine-tuning an embedding model is not always necessary. In many cases, an off-the-shelf model will perform well enough. However, if you need highly optimized results for your specific dataset, fine-tuning can help squeeze out the last bit of performance. That said, fine-tuning comes with significant computational and financial costs, which must be weighed carefully.
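If you do decide to fine-tune, a minimal sketch with sentence-transformers might look like the following; the training pairs and base model are placeholders for your own domain data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder (query, relevant passage) pairs drawn from your own domain.
train_examples = [
    InputExample(texts=["What is dense retrieval?",
                        "Dense retrieval encodes queries and documents into vectors ..."]),
    InputExample(texts=["How do transformers handle long documents?",
                        "Long-context transformers extend the attention window ..."]),
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_loader = DataLoader(train_examples, batch_size=2, shuffle=True)

# Treats the other passages in each batch as negatives (in-batch negatives).
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
model.save("my-finetuned-embedding-model")
```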
Choosing an appropriate embedding for your Retrieval-Augmented Generation (RAG) model is a key step in achieving effective and accurate retrieval of relevant documents. The decision depends on various factors, such as data modality, complexity of retrieval, computational capabilities, and available budget. While API-based models often offer high-quality embeddings, open-source alternatives provide greater flexibility and cost-effectiveness for self-hosted solutions.
By carefully evaluating embedding models based on context window size, semantic search capabilities, and benchmark performance, you can optimize your RAG system for your specific use case. Additionally, fine-tuning embeddings can further enhance performance in domain-specific applications, though it requires careful consideration of computational costs. Ultimately, a well-chosen embedding model lays the foundation for an effective RAG pipeline, improving response accuracy and overall system efficiency.
A. Embeddings convert words or sentences into numerical vectors, allowing for efficient comparison and retrieval. In semantic search, similar documents or terms are identified by comparing their embedding vectors. This process ensures that the retrieved documents are contextually relevant, even if they don’t share exact keywords.
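For a concrete sense of that comparison step, here is a tiny sketch of cosine similarity between two toy embedding vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings; real models produce hundreds or thousands of dimensions.
query_vec = np.array([0.2, 0.7, 0.1, 0.5])
doc_vec = np.array([0.25, 0.65, 0.05, 0.55])

print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 => semantically similar
```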
A. Yes, the model architecture influences how embeddings are generated. For instance, transformer-based models like BERT and GPT generate embeddings based on contextualized representations, meaning they understand the word in relation to the sentence. Older models like Word2Vec generate static embeddings that are not context-sensitive.
A. Yes, combining embeddings from different models can help capture different aspects of the text. For example, you could combine embeddings from a general-purpose model with domain-specific embeddings to get a more comprehensive representation of your data. This approach can improve retrieval accuracy and relevance.
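A simple way to combine models is to concatenate their normalized vectors; the sketch below uses two example models and is only one of several possible strategies:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

general_model = SentenceTransformer("all-MiniLM-L6-v2")  # general-purpose example
domain_model = SentenceTransformer("allenai/specter")    # scientific-paper example

text = "Self-supervised pre-training improves citation recommendation."
general_vec = general_model.encode(text)
domain_vec = domain_model.encode(text)

# Concatenation: the combined dimensionality is the sum of the two; normalize each
# part first so neither model dominates later similarity scores.
combined = np.concatenate([general_vec / np.linalg.norm(general_vec),
                           domain_vec / np.linalg.norm(domain_vec)])
print(combined.shape)
```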
A. The Massive Text Embedding Benchmark (MTEB) score measures a model’s performance on a range of tasks, such as semantic search, text classification, and sentiment analysis. A high MTEB score indicates better retrieval accuracy and overall performance.
A. API-based models are pay-per-use and offer ease of access, while open-source models are free to use but require computational resources (e.g., GPUs) for training or inference. Open-source models may have no per-token cost but could involve infrastructure expenses.