This article delves into Retrieval-Augmented Generation (RAG), an advanced AI technique that improves response accuracy by combining retrieval and generation capabilities. You’ll explore how RAG works by first retrieving relevant, up-to-date information from a knowledge base before generating responses, enabling it to provide more reliable and contextually relevant answers. The content covers the RAG workflow in detail, including the use of vector databases for efficient data retrieval, the role of distance metrics for similarity matching, and how RAG mitigates common AI pitfalls like hallucinations and confabulations. Additionally, it outlines practical steps to set up and implement RAG, making this a comprehensive guide for anyone looking to enhance AI-based knowledge retrieval.
This article was published as a part of the Data Science Blogathon.
RAG is an AI technique that improves the accuracy of answers by retrieving relevant information before generating a response. Instead of creating answers based only on what the AI model learned during training, RAG first searches for up-to-date or specific information from a database or knowledge source. It then uses that information to generate a better, more reliable answer. The RAG approach combines retrieval-based models with generation-based models to improve the quality and accuracy of generated content, particularly in natural language processing tasks.
Recommended Reading: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
The RAG (Retrieval-Augmented Generation) workflow involves two main stages: retrieval and generation. Below is an overview of how the RAG workflow operates, step by step.
A user query or questions like the one below would act as a prompt.
“What are the most recent developments in quantum computing?”
In the retrieval phase, the three steps below will happen.
The system returns a final response to the user that is more factually accurate and up-to-date than what a purely generative model could produce.
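The two stages described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-document knowledge base is made up, retrieval uses simple word overlap instead of embeddings, and the generation stage is represented only by the augmented prompt that would be sent to a model. The names `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re

# Hypothetical knowledge base (illustrative text only).
KNOWLEDGE_BASE = [
    "Quantum computing uses qubits that can exist in superposition.",
    "Vector databases store embeddings for similarity search.",
    "Error correction is an active research area in quantum computing.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Retrieval stage: rank documents by word overlap with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Input to the generation stage: the query augmented with retrieved context."""
    context_block = "\n".join(f"- {doc}" for doc in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}"


query = "What are the most recent developments in quantum computing?"
context = retrieve(query, KNOWLEDGE_BASE)
prompt = build_prompt(query, context)
print(prompt)
```

In a real system the word-overlap ranking would be replaced by a vector search, and the prompt would be passed to a language model to produce the final answer.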
Exploring AI with and without RAG reveals the transformative impact of Retrieval-Augmented Generation: while traditional models rely solely on pre-trained data, RAG enhances responses with real-time, relevant information retrieval, bridging the gap between static knowledge and dynamic, contextually aware outputs.
With RAG | Without RAG |
---|---|
Retrieves up-to-date information from external sources. | Relies solely on pre-trained knowledge (which may be outdated). |
Provides specific remediation steps (e.g., patch version, configuration changes). | Generates vague and generalized responses, often without actionable details. |
Minimizes risk of hallucination by grounding responses in real documents. | Higher risk of hallucination or incorrect details, especially for newer vulnerabilities. |
Ensures response includes the latest vendor advisories or security patches. | May not be aware of recent advisories, patches, or updates. |
Can combine both internal (organization-specific) and external (public database) information. | Lacks the ability to retrieve any new or organization-specific information. |
A vector database plays a critical role in the RAG (Retrieval-Augmented Generation) workflow by enabling efficient and accurate retrieval of relevant documents or data based on semantic similarity. In traditional keyword-based search systems, users retrieve information by matching exact terms, which can cause them to miss pertinent data that uses different wording. A vector database addresses this problem by representing text as vectors in a high-dimensional space, placing similar meanings close to each other and making it highly suitable for RAG-based systems. A vector database is a search engine or database that stores vectorized documents, enabling more accurate information retrieval for AI models. The structure of a vector database looks like the one below.
The example below shows how each vector is stored in a vector database.
{
  "id": 0,
  "vector": [0.01, -0.03, 0.15, ..., -0.08], // A list of floating-point numbers representing the vector
  "payload": {
    "company": "Apple Inc.",
    "ticker": "AAPL",
    "price": 175.50,
    "market_cap": "2.8T",
    "industry": "Technology",
    "pe_ratio": 28.5
  }
}
When you run the upsert function, Qdrant stores these components as part of a point in a collection. The collection (in this case, “top_stocks”) is designed to organize and manage these points based on the vectors, payloads, and IDs. Each vector has 384 dimensions in our example, although the accompanying diagram shows only three dimensions for demonstration purposes.
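To make the idea of a point (ID, vector, payload) and of upserting more tangible, here is a small in-memory stand-in written in plain Python. It is not Qdrant's API: `Point` and `ToyCollection` are hypothetical names, the vectors use three dimensions instead of 384, and the only behavior modeled is that upserting an existing ID replaces the stored point.

```python
from dataclasses import dataclass, field


@dataclass
class Point:
    """One stored record: an ID, an embedding, and free-form metadata."""
    id: int
    vector: list[float]
    payload: dict = field(default_factory=dict)


class ToyCollection:
    """Hypothetical in-memory stand-in for a collection (not Qdrant's API)."""

    def __init__(self, name: str, dimension: int) -> None:
        self.name = name
        self.dimension = dimension
        self._points: dict[int, Point] = {}

    def upsert(self, points: list[Point]) -> None:
        """Insert new points; overwrite any point whose ID already exists."""
        for point in points:
            if len(point.vector) != self.dimension:
                raise ValueError(
                    f"expected {self.dimension} dimensions, got {len(point.vector)}"
                )
            self._points[point.id] = point

    def count(self) -> int:
        """Return the number of stored points."""
        return len(self._points)

    def get(self, point_id: int) -> Point:
        """Return the point stored under the given ID."""
        return self._points[point_id]


collection = ToyCollection("top_stocks", dimension=3)
collection.upsert([
    Point(0, [0.01, -0.03, 0.15], {"company": "Apple Inc.", "ticker": "AAPL"}),
])
# Upserting the same ID again replaces the stored point instead of duplicating it.
collection.upsert([
    Point(0, [0.02, -0.01, 0.12],
          {"company": "Apple Inc.", "ticker": "AAPL", "price": 175.50}),
])
print(collection.count(), collection.get(0).payload["price"])
```

The dimension check mirrors the fact that a collection is created for one fixed vector size, so every vector stored in it must have that length.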
Vector databases, OLAP (Online Analytical Processing), and OLTP (Online Transaction Processing) serve different data storage and processing purposes. Here’s a comparison of these systems:
A vector database stores data as high-dimensional vectors or embeddings. Users typically use vector databases for tasks involving semantic search and machine learning applications. These databases perform fast similarity searches, which are essential for AI-based systems like RAG (Retrieval-Augmented Generation). They are also ideal for AI-driven applications requiring semantic search, image recognition, or natural language processing tasks (e.g., search recommendations and Retrieval-Augmented Generation). Examples include Qdrant, Pinecone, FAISS, and Milvus.
OLAP is designed for analytical queries, often over large datasets. OLAP databases support complex queries for data analysis, business intelligence, and reporting. They are best for analyzing large datasets to generate business insights, where complex queries, summarizations, and historical data analysis are necessary (e.g., business intelligence and reporting). Examples: Google BigQuery, Amazon Redshift, Snowflake.
OLTP databases efficiently handle high volumes of transactional workloads in real-time, including financial transactions, inventory management, and customer data processing. They excel in real-time, high-volume transactions that require consistent and fast read/write operations, making them ideal for banking systems, inventory management, and e-commerce transactions. Examples: MySQL, PostgreSQL, SQL Server, and Oracle.
In a vector database, distance metrics measure the similarity or dissimilarity between vectors (high-dimensional representations of data such as text, images, or other forms of unstructured data). These distance metrics are critical for tasks like semantic search and nearest neighbor search because they allow the system to find the most relevant vectors (e.g., documents, images) based on how “close” they are in the vector space to a given query. Common Distance Metrics in Vector Databases are given below:
Distance Metric | Function | Use Case |
---|---|---|
Euclidean Distance (L2 Norm) | Measures straight-line distance in vector space. | Image retrieval: Finds similar images; Document similarity: Compares document vectors. |
Cosine Similarity | Measures the cosine of the angle between vectors, focusing on direction rather than magnitude. | Text retrieval: Finds similar texts in NLP; Recommendations: Recommends items based on vector similarity. |
Manhattan Distance (L1 Norm) | Sum of absolute differences along vector axes. | Robotics/pathfinding: Used in grid maps; Sparse vectors: Suitable for high-dimensional sparse data. |
Inner Product (Dot Product) | Measures interaction or similarity by multiplying and summing vector components. | Recommendations: Calculates item-user similarity; Neural networks: Computes weighted sums between layers. |
Hamming Distance | Counts differing positions in binary vectors. | Error detection: Used in communication; Binary classification: Compares binary vectors in bioinformatics or security. |
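The metrics in the table can each be written in a line or two of plain Python. The sketch below uses only the standard library and two small made-up vectors; a vector database computes the same quantities with optimized, approximate index structures, so treat this as an illustration of the formulas rather than of how any particular database implements them.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dot(a, b):
    """Inner product: multiply matching components and sum them."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (ignores magnitude)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Two illustrative vectors pointing in the same direction; b is twice as long as a.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print("euclidean:", euclidean(a, b))
print("manhattan:", manhattan(a, b))
print("dot:", dot(a, b))
print("cosine:", cosine_similarity(a, b))
print("hamming:", hamming([1, 0, 1, 1], [1, 1, 0, 1]))
```

Because `b` is a scaled copy of `a`, the cosine similarity is (up to floating-point rounding) 1 even though the Euclidean and Manhattan distances are non-zero, which is why cosine similarity is a common choice for comparing text embeddings.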
Hallucinations in AI-generated content refer to instances when a language model generates plausible-sounding but incorrect or fabricated information. This happens because models like GPT, BERT, and other large language models (LLMs) are trained on vast datasets but cannot access real-time data, databases, or specific facts beyond what they saw during training. They rely on statistical patterns learned from the data, which means that when a prompt doesn’t closely match something the model “knows,” it may create information that fits linguistically but lacks factual grounding.
Example:
Hallucinations happen because the model tries to predict the next word or phrase based on learned patterns but doesn’t always have access to the correct information.
Confabulation is when a model generates plausible but incorrect or fabricated information, like hallucinations. These inaccuracies often arise when the model tries to fill in gaps in its knowledge, leading to outputs that may sound convincing but lack grounding in reality or facts.
Example:
In confabulation, the AI confidently gives a wrong answer and an incorrect justification, making it seem believable. Both hallucinations and confabulations are errors in AI-generated content, but they differ in nature and context.
To effectively use RAG in your applications, follow the steps below.
Below is the workflow for how data is pruned, embeddings are created, and the results are applied to an LLM/FM.
The example below uses Python 3.12 and related frameworks.
We recommend using IPython notebooks (interactive Python notebooks) and the Jupyter server for better productivity with any data-oriented programs.
Data can come from various sources, such as .csv, .json, and .xml files. The Pandas library can load such files and supports multiple data formats. We need to prune the data to make sure there are no missing values.
import pandas as pd
# Step 1: Load and Flatten the JSON Data
df = pd.read_json('../../stock_data.json')
# Normalize the nested JSON structure
df = pd.json_normalize(df['stocks'])
# Step 2: Print columns to verify the structure
print(df.columns)
# Step 3: Filter out any NaN values in 'company' or other fields (if needed)
df = df[df['company'].notna()]
# Step 4: Convert the DataFrame to a list of dictionaries
data = df.to_dict('records')
df
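The layout of stock_data.json is not shown in this article, so the following standard-library sketch uses a small made-up document to illustrate what the flattening and pruning steps above do. The nested `metrics` field and the `flatten` helper are assumptions for illustration; the dotted column names imitate the way `pd.json_normalize` names nested fields.

```python
import json

# Hypothetical input with one nested field and one record missing its company.
RAW_JSON = """
{"stocks": [
  {"company": "Apple Inc.", "ticker": "AAPL", "metrics": {"pe_ratio": 28.5}},
  {"company": null, "ticker": "XXXX", "metrics": {"pe_ratio": 10.0}}
]}
"""


def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Step 1: load the JSON and flatten every record under "stocks".
records = [flatten(record) for record in json.loads(RAW_JSON)["stocks"]]

# Step 2: prune records whose "company" value is missing.
clean_records = [r for r in records if r.get("company") is not None]

print(clean_records)
```

The result is a list of flat dictionaries, the same shape that `df.to_dict('records')` produces and that is later used as the payload of each point.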
We will use Qdrant, a vector database, to demonstrate the RAG. We will also use a sentence transformer to encode sentences into numerical representations (embeddings), allowing us to compare them using cosine similarity or other distance metrics.
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
# Initialize SentenceTransformer model
# Model to create embeddings
encoder = SentenceTransformer('all-MiniLM-L6-v2')
The above line loads the all-MiniLM-L6-v2 model from the sentence-transformers library, a pre-trained model designed for creating text embeddings. This model is lightweight and efficient for many NLP tasks. all-MiniLM-L6-v2 is a MiniLM model that has been fine-tuned for tasks like sentence embeddings, semantic search, and sentence similarity. It’s part of the Sentence Transformers library, which provides a simple API for generating dense vector representations (embeddings) for text. Initializing the SentenceTransformer object with the model name downloads the pre-trained model from Hugging Face’s model hub if it hasn’t already been downloaded, and then loads it into memory. The first time you run this line, you will typically see download progress output.
# Create the vector database client (In-Memory instance for demonstration)
qdrant = QdrantClient(":memory:")
The line above creates an in-memory instance of the Qdrant vector database. Qdrant is a vector search engine that helps store, search, and manage embeddings (vector representations of data) efficiently, typically used for tasks like semantic search, nearest neighbor search, and similarity matching. Below are the different options you can pass to QdrantClient:
qdrant = QdrantClient(":memory:")
This creates a temporary, in-memory instance of Qdrant where all data is lost once the program terminates. It’s ideal for prototyping, testing, or short-term use cases.
qdrant = QdrantClient("http://localhost:6333")
This connects to a locally running Qdrant instance. You’ll need to install and run the Qdrant server on your machine before connecting to it. The default port for Qdrant is 6333. You can change the port number if you’ve configured Qdrant to run on a different port.
qdrant = QdrantClient("http://<remote-server-ip>:<port>")
You can connect to a remote Qdrant server hosted on a different machine or cloud server by specifying the remote server’s IP address and port. If the remote instance requires authentication (API tokens or credentials), you can pass additional arguments for secure access.
A vector database collection is a specialized data structure that stores high-dimensional vector representations (embeddings) of data along with associated metadata. It allows for efficient similarity searches, which are essential for tasks like semantic search, recommendation systems, and content-based retrieval. Vector databases design collections to manage large-scale data efficiently and return highly relevant, similar items based on vector comparisons. You can create a collection in the following way.
# Create collection in Qdrant
qdrant.recreate_collection(
    collection_name="top_stocks",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size defined by the model
        distance=models.Distance.COSINE
    )
)
This snippet uses the QdrantClient to create (or recreate) a collection called “top_stocks” in the Qdrant vector database. Once the collection is created successfully, the call returns True.
The configuration of vectors in the collection is set using models.VectorParams, which defines the vector size (taken from the encoder’s embedding dimension, 384 in our example) and the distance metric used for comparison (cosine).
Iterate over the loaded data to create points, each with an ID, a vector, and a payload, and upload them to the collection. This can be done as shown below.
# Vectorize only valid entries with non-empty "company" values
valid_data = [doc for doc in data if isinstance(doc.get("company", ""), str) and doc["company"].strip()]
# Proceed to upload points to Qdrant
qdrant.upsert(
    collection_name="top_stocks",
    points=[
        models.PointStruct(
            id=idx,
            vector=encoder.encode(doc["company"]).tolist(),  # Encode the "company" name as the vector
            payload=doc
        ) for idx, doc in enumerate(valid_data)
    ]
)
# Check if the data is successfully uploaded to Qdrant
collection_info = qdrant.get_collection("top_stocks")
print(collection_info)
# Verify that the vectors are uploaded by inspecting the first few points
points = qdrant.scroll(
    collection_name="top_stocks",
    limit=5,
    with_payload=True
)
print(points)
The above code uploads points (vectors) to a collection in Qdrant using the upsert method. Each point comprises an ID, a vector (embedding), and an associated payload (metadata). Depending on the amount of data, encoding it and loading it into the vector database can take some time.
# Define the query
query_prompt = "Technology company with a high market cap"
# Step 1: Encode the query using the same encoder
query_vector = encoder.encode(query_prompt).tolist()
# Step 2: Search the Qdrant collection for the closest vectors
search_results = qdrant.search(
    collection_name="top_stocks",
    query_vector=query_vector,
    limit=2,  # Retrieve the top 2 most similar results
    with_payload=True  # Include the payload (metadata) in the search results
)
# Step 3: Print the search results
for result in search_results:
    print(f"Company: {result.payload['company']}")
    print(f"Ticker: {result.payload['ticker']}")
    print(f"Industry: {result.payload['industry']}")
    print(f"Market Cap: {result.payload['market_cap']}")
    print(f"Similarity Score: {result.score}")
    print("-" * 30)
Using the embedded query string, the above code performs a search query in the Qdrant vector database against the “top_stocks” collection. It retrieves the top 2 most similar vectors and prints each hit’s associated payload (metadata) and similarity score.
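Conceptually, a similarity search scores every stored vector against the query vector and keeps the best few. The brute-force sketch below shows that idea in plain Python. The three-dimensional vectors and the third company name are invented for illustration, and a real vector database such as Qdrant relies on index structures rather than scanning every point, so this is a model of the result, not of the implementation.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up three-dimensional "embeddings" standing in for real 384-dimensional ones.
POINTS = [
    {"id": 0, "vector": [0.9, 0.1, 0.0], "payload": {"company": "Apple Inc."}},
    {"id": 1, "vector": [0.0, 1.0, 0.0], "payload": {"company": "Example Foods Ltd."}},
    {"id": 2, "vector": [0.8, 0.3, 0.1], "payload": {"company": "NVIDIA Corporation"}},
]


def search(query_vector, points, limit):
    """Return the `limit` highest-scoring (score, point) pairs, best first."""
    scored = ((cosine_similarity(query_vector, p["vector"]), p) for p in points)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[0])


hits = search([1.0, 0.0, 0.0], POINTS, limit=2)
for score, point in hits:
    print(f"{point['payload']['company']}: {score:.3f}")
```

The `limit` argument plays the same role as the `limit` parameter in the Qdrant search call: it caps how many of the closest points are returned.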
search_results_payload = [result.payload for result in search_results]
print(search_results_payload)
This extracts the payload (metadata or additional information) from each of the search results (hits) returned by the Qdrant search and stores them in the list search_results_payload.
from openai import OpenAI
# Initialize the OpenAI client for the local API server
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",  # Local API server
    api_key="your api key"  # Placeholder API key for local server
)
# Create the completion request (chat)
completion = client.chat.completions.create(
    model="LLaMA_CPP",  # Using a local model
    messages=[
        {"role": "system", "content": "You are chatbot, stocks specialist. Your top priority is to help guide users into selecting stocks and guide them with their requests."},
        {"role": "user", "content": "What is the market cap of NVIDIA and its P/E ratio?"},
        {"role": "assistant", "content": str(search_results)}  # Providing search results in the assistant's message
    ]
)
# Print the assistant's generated message object
# (the message is an object, not a dict, so it cannot be indexed with ["content"])
print(completion.choices[0].message)
Output : ChatCompletionMessage(content= ‘The market cap of NVIDIA Corporation is 620B and its P/E ratio is 50.5.’)
Without RAG the output was:
ChatCompletionMessage(content= ‘As of 2021, NVIDIA had a market capitalization of approximately $500 billion and a P/E ratio of around 40’)
The above code uses the OpenAI Python client to interact with a local API server using its API key and generate a response using a locally deployed LLaMA_CPP model (a local version of an LLaMA model).
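The code above passes the retrieved results to the model in an assistant message. Another common way to ground the answer, shown here only as a sketch, is to place the retrieved payloads in the user message as explicit context and instruct the model to rely on it. The helper name `build_messages` and the wording of the system prompt are assumptions; the payload values are the ones from the RAG output shown above.

```python
import json


def build_messages(question: str, retrieved_payloads: list[dict]) -> list[dict]:
    """Build a chat message list that grounds the question in retrieved data."""
    context = json.dumps(retrieved_payloads, indent=2)
    return [
        {
            "role": "system",
            "content": (
                "You are a stocks specialist. Answer using only the provided "
                "context, and say so if the answer is not in it."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        },
    ]


# Payload values taken from the RAG output shown in this article.
payloads = [
    {"company": "NVIDIA Corporation", "market_cap": "620B", "pe_ratio": 50.5}
]
messages = build_messages(
    "What is the market cap of NVIDIA and its P/E ratio?", payloads
)
print(messages[1]["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`. Serializing only the payloads, rather than the full search result objects, keeps the prompt shorter and easier for the model to read.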
In an era where the accuracy and reliability of AI-generated content are paramount, Retrieval-Augmented Generation (RAG) emerges as a breakthrough technique that overcomes key limitations of traditional language models. By integrating real-time data retrieval from external knowledge sources, RAG enhances the factual correctness of AI responses, significantly reducing the risk of hallucinations, confabulations, and data inaccuracies. This approach empowers models to generate more contextually relevant and precise answers, especially in knowledge-intensive domains.
Moreover, vector databases are indispensable in the RAG workflow, enabling efficient semantic search through high-dimensional embeddings. This ensures that AI systems can retrieve and utilize the most relevant and up-to-date information for generation tasks. RAG represents a critical step forward in pursuing more trustworthy, actionable, and grounded AI outputs as AI evolves. The combination of retrieval and generation phases of RAG enhances the user experience and sets a new standard for AI-driven decision-making and content creation.
A. RAG (Retrieval-Augmented Generation) is a technique that combines retrieval of relevant information from a knowledge base with AI text generation. It’s important because it reduces AI hallucinations by grounding responses in verified data sources.
A. Unlike traditional LLMs that rely solely on their training data, RAG actively retrieves and references current, specific information from a maintained knowledge base before generating responses, ensuring higher accuracy and relevance.
A. Vector databases are specialized databases that store and retrieve data based on semantic similarity. They’re essential for RAG because they enable efficient storage and retrieval of text embeddings (numerical representations of text), allowing quick access to relevant information.
A. RAG systems can be configured to continuously update their knowledge base with new information. The vector database is updated with new embeddings as fresh data arrives, making it immediately available for retrieval.
A. Retrieval-Augmented Generation (RAG) enhances AI accuracy by retrieving real-time, relevant information before generating responses, effectively reducing hallucinations and ensuring more reliable and factually consistent outputs.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.