In this guide, I compare RAG with Agentic RAG and then walk through a hands-on implementation of each.
First, let’s understand what RAG is. It is not a piece of old cloth but Retrieval-Augmented Generation: a framework that lets an LLM draw on relevant, up-to-date, and context-specific information by combining retrieval and generation capabilities.
But can we see the limitations of LLMs without RAG? Absolutely! Here, I asked ChatGPT to tell me about Swarm by OpenAI using only its own knowledge, without any external search, and it could not give the right answer. This is because of its knowledge cutoff date, which is 2023; to answer correctly, the model has to be updated with new information or given access to an external source. Intriguing, right? So, can we augment LLMs with our own custom data to get the right response? Of course, we can do it with long-context LLMs and RAG. Today, we will be talking about RAG.
Instead of relying solely on the large language model’s (LLM) pre-trained knowledge, which may be outdated or incomplete, RAG dynamically retrieves the most relevant documents or information from an external knowledge base or database.
Let us understand this with an example: if we humans relied on only one source of information from birth while exploring the world around us, our understanding would remain severely limited. Similarly, a Large Language Model (LLM) on its own has a predefined training dataset that serves as its “internal knowledge.” When this is the model's only source of information, the result is outdated information, ungrounded hallucinations, nonsensical content, and more. While vast, this dataset needs to be updated or supplemented to handle real-time, context-specific queries. This is where RAG (Retrieval-Augmented Generation) steps in.
Here’s what RAG does:
Together, the RAG framework helps improve AI-generated content’s relevance, accuracy, and richness by grounding responses in retrieved and contextualised information.
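To make the retrieve-then-generate idea concrete, here is a minimal, library-free sketch. It is a toy under stated assumptions: the three documents are made-up sample data, `retrieve` ranks by simple word overlap instead of embeddings, and `build_prompt` only assembles the text that would be sent to an LLM (no model is called).

```python
import re

# Toy knowledge base (hypothetical sample documents, for illustration only).
KNOWLEDGE_BASE = [
    "Swarm is an experimental multi-agent orchestration framework released by OpenAI.",
    "The chi-square distribution is used in statistical significance tests.",
    "ResNet introduced residual connections for training deep networks.",
]

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    """Rank documents by word overlap with the query and return the top k."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Ground the question in the retrieved context before generation."""
    context = "\n".join(context_docs)
    return (
        "Answer using only this context.\n\n"
        f"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer:"
    )

query = "What is Swarm by OpenAI?"
top_docs = retrieve(query, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

A real pipeline swaps the word-overlap ranking for embedding similarity and passes `prompt` to a model, but the two stages stay the same.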
Here’s a comparison of LLM output with and without RAG:
| Category | Without RAG | With RAG |
|---|---|---|
| Accuracy | Susceptible to generating unverified or “hallucinated” content, not tied to reliable sources. | Responses are grounded with verifiable citations from external sources. |
| Timeliness | Relies on static pre-trained data, which may be outdated or irrelevant to current events. | Enhances static pre-trained data by incorporating real-time, up-to-date information from external sources. |
| Contextual Clarity | Often struggles to interpret ambiguous queries, leading to vague or incomplete answers. | Retrieves context-specific information, improving the clarity and specificity of responses. |
| Customisation | Cannot access or utilise user-specific datasets or private sources, resulting in generic responses. | Integrates public and private datasets, enabling highly tailored and relevant outputs. |
| Search Scope | Limited to the pre-trained knowledge base; cannot extend to new or external information. | Capable of broad, on-demand searches across multiple databases or online sources. |
| Reliability | High potential for errors due to reliance on static and pre-generated knowledge. | Ensures reliability by cross-referencing multiple trusted sources in real time. |
| Use Cases | Suitable for general-purpose tasks but less effective for dynamic or data-intensive applications. | Ideal for tasks requiring live updates, research, or custom data integration. |
| Transparency | No clear reference or citation for the provided information, making validation difficult. | Provides citations or links to sources, ensuring transparency and trustworthiness. |
RAG (Retrieval-Augmented Generation)
Without RAG (Retrieval-Augmented Generation)
This part focuses on preparing and managing the knowledge base during retrieval.
This describes the overall process of combining retrieval and generation to produce an answer:
Here’s how the system architecture looks when combined together:
Here are the challenges with RAG:
Traditional Retrieval-Augmented Generation (RAG) systems enhance AI by pairing Large Language Models (LLMs) with vector databases to overcome LLM limitations. While effective for basic tasks like Q&A or support bots, they struggle with complex use cases. These systems often fail to contextualize retrieved data, resulting in superficial responses that lack depth and nuance.
These challenges demonstrate why RAG systems require sophisticated mechanisms for retrieval, context understanding, and natural language generation to handle these nuanced use cases effectively. This is where Agentic RAG comes to the rescue.
I hope you now have a clear understanding of traditional RAG. We will now discuss a different version of RAG that uses agents: Agentic RAG.
Agentic RAG refers to a more intelligent and dynamic Retrieval-Augmented Generation system where an “agent” plays a key role in orchestrating processes. The agent intelligently determines which resources or databases are most relevant for a user’s query, making it capable of handling more complex, multi-tasking scenarios. It is an evolution from traditional RAG systems, offering greater adaptability and decision-making by incorporating additional logic or heuristics into the retrieval and response generation pipeline.
The process flow of an Agentic RAG System for handling user queries. Here’s a breakdown of each component:
It exemplifies a robust approach to Agentic RAG, demonstrating how modular and context-aware processing enables handling a wide range of tasks.
Let’s see how a Self-reflective Agentic RAG System works:
This iterative approach ensures accuracy and relevance by dynamically retrieving and refining the query as needed.
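The retrieve, grade, and refine loop described here can be sketched in a few lines. This is an illustration only: `grade` and `rewrite` are hypothetical stand-ins for what would normally be LLM calls, the two documents are sample data, and the abbreviation "LG" is an invented example of a query that needs rewriting.

```python
import re

DOCS = [
    "LangGraph is a library for building stateful agent workflows.",
    "Chroma is an open source vector store.",
]

def words(text):
    """Return the set of lowercase alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query):
    """Return the document sharing the most words with the query."""
    return max(DOCS, key=lambda d: len(words(query) & words(d)))

def grade(query, doc, min_overlap=2):
    """Stand-in for an LLM grader: relevant if enough words overlap."""
    return len(words(query) & words(doc)) >= min_overlap

def rewrite(query):
    """Stand-in for an LLM query rewriter: expand a known abbreviation."""
    return query.replace("LG", "LangGraph library")

def self_reflective_retrieve(query, max_attempts=3):
    """Retrieve, grade the result, and rewrite the query until it passes."""
    for attempt in range(1, max_attempts + 1):
        doc = retrieve(query)
        if grade(query, doc):
            return doc, query, attempt
        query = rewrite(query)
    return None, query, max_attempts

doc, final_query, attempts = self_reflective_retrieve("What is LG?")
print(attempts, final_query, "->", doc)
```

The first attempt fails the grade because the abbreviation matches nothing useful; the rewritten query succeeds on the second pass. Capping the loop with `max_attempts` keeps a query that can never be answered from retrying forever.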
Agents are the driving force behind the Agentic RAG framework, functioning as specialized units that streamline each stage of the retrieval and generation pipeline. They operate collaboratively to achieve tasks like understanding user queries, retrieving relevant information, generating responses, and managing the overall workflow.
By orchestrating these functions, agents ensure smooth, efficient, and intelligent handling of tasks. This modular and adaptive approach allows the system to tackle complex queries effectively while improving overall performance and system reliability.
An Agentic RAG system employs several types of agents, each with a specific purpose and methodology. Here’s a breakdown for clarity:
Purpose: Direct user queries to the most appropriate sources.
How They Work: Analyze queries using large language models (LLMs) to determine which parts of the RAG pipeline best handle the request.
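A routing agent can be sketched as a function that picks one tool per query. In the sketch below, `choose_route` is a keyword heuristic standing in for the LLM's decision, and the two tools are hypothetical placeholders that only return labelled strings.

```python
def vector_search_tool(query):
    """Placeholder for semantic search over a vector index."""
    return f"[vector search] top passages for: {query}"

def summary_tool(query):
    """Placeholder for a summarization pipeline over the whole document."""
    return f"[summary] overview for: {query}"

TOOLS = {"vector_search": vector_search_tool, "summary": summary_tool}

def choose_route(query):
    """Stand-in for an LLM classifier: return the name of the tool to use."""
    summary_cues = ("summarize", "summary", "overview", "main points")
    if any(cue in query.lower() for cue in summary_cues):
        return "summary"
    return "vector_search"

def route(query):
    """Dispatch the query to the chosen tool and return (tool name, result)."""
    name = choose_route(query)
    return name, TOOLS[name](query)

specific = route("What did the author do during his time in art school?")
broad = route("Give me a summary of the essay.")
print(specific[0], "|", broad[0])
```

Specific factual questions go to semantic search, while broad requests go to summarization. With a real LLM router, the same dispatch table stays in place and only `choose_route` changes.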
Here’s a hybrid approach combining Semantic Search and Summarization to answer a specific query: “What did the author do during his time in art school?”. Here’s a breakdown of how the system works:
Advantages:
Purpose: Handle complex or multi-faceted queries by breaking them into smaller, manageable components.
How They Work:
Example: retrieving and comparing revenue growth information for Uber and Lyft in 2021 from their financial documents (10-K filings).
Outcome: Results from each sub-query are synthesized into a complete, coherent response.
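The decompose, answer, and synthesize pattern for the Uber and Lyft example might look like the sketch below. Everything here is illustrative: `plan` and `answer_sub_query` stand in for LLM and retrieval calls, the answers are placeholders rather than real figures, and running the sub-queries in parallel threads is my own choice, not something the pattern requires.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(query, entities):
    """Stand-in for an LLM planner: create one sub-query per company."""
    return [f"What was the revenue growth of {e} in 2021?" for e in entities]

def answer_sub_query(sub_query):
    """Stand-in for retrieval + generation over one company's 10-K filing."""
    return f"<answer retrieved for: {sub_query}>"

def synthesize(query, partial_answers):
    """Merge the partial answers into one response for the original query."""
    lines = [f"- {a}" for a in partial_answers]
    return f"Question: {query}\n" + "\n".join(lines)

query = "Compare revenue growth of Uber and Lyft in 2021"
sub_queries = plan(query, ["Uber", "Lyft"])

# Sub-queries are independent, so they can run concurrently; map keeps order.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(answer_sub_query, sub_queries))

final = synthesize(query, partials)
print(final)
```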
Benefits:
Purpose: Adaptively combine reasoning and dynamic action to handle real-time queries and user interactions.
How They Work:
Why They Matter:
Purpose: Continuously adapt to evolving data and changing user requirements.
Key Areas of Focus:
How They Work:
By employing specialized agents with distinct functions, the RAG pipeline ensures:
These agents collectively enable RAG systems to deliver high-quality, contextually accurate, and timely responses to users, regardless of the complexity of their queries.
Agentic RAG frameworks are much more versatile than traditional RAG setups. In a traditional RAG system, the AI relies on a single tool—a vector database—for retrieving information to shape its responses. While effective for basic data retrieval, this approach is limited to working with static documents.
In contrast, agentic RAG systems go beyond simple data retrieval. These advanced frameworks can integrate multiple tools to handle a variety of tasks. For example, they can perform complex mathematical calculations, write emails, analyze data, or even make decisions based on contextual needs. This ability to incorporate different tools makes them far more flexible and capable.
Additionally, agentic RAG systems excel in multistep reasoning. They are context-aware, meaning they can decide when and how to use specific tools to solve problems or accomplish tasks. This ensures better accuracy and efficiency in handling more complex requirements.
Its ability to work collaboratively in multiagent systems sets agentic RAG apart. Multiple AI agents can work together, achieving results that are often far better than those of a single AI agent. This adaptability and scalability make agentic RAG a powerful choice for dynamic, real-world applications.
| Feature | RAG | Agentic RAG |
|---|---|---|
| Task Complexity | Handles simple query-based tasks but lacks advanced decision-making | Handles complex multi-step tasks with multiple tools and agents as needed for retrieval, reasoning, and more |
| Decision-Making | Limited, no autonomous decision-making involved | Agents autonomously decide what data to retrieve, how to retrieve, grade, reason, reflect, and generate responses |
| Multi-Step Reasoning | Limited to single-step queries and responses | Excels at multi-step reasoning, especially after retrieval, with grading, hallucination checks, and response evaluation |
| Key Role | Combines LLMs with external data retrieval to generate responses | Enhances RAG by using agents for intelligent retrieval, response generation, grading, critiquing, and more |
| Real-Time Data Retrieval | Not possible in native RAG | Designed for real-time data retrieval and integration |
| Integration with Retrieval Systems | Tied to static retrieval from pre-defined vector databases | Deeply integrated with diverse retrieval systems; agents control the process |
| Context-Awareness | Limited by the static vector database, no advanced or real-time context-awareness | High; agents adapt to the user query and retrieve context, including real-time data |
To understand RAG vs Agentic RAG in practice, let’s look at how each is implemented.
Necessary Libraries and Imports
!pip install langchain==0.3.4
!pip install langchain-openai==0.2.3
!pip install langchain-community==0.3.3
!pip install jq==1.8.0
!pip install pymupdf==1.24.12
!pip install langchain-chroma==0.1.4
from getpass import getpass
OPENAI_KEY = getpass('Enter Open AI API Key: ')
import os
os.environ['OPENAI_API_KEY'] = OPENAI_KEY
from langchain_openai import OpenAIEmbeddings
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')
Processes JSON documents into structured formats:
import json
from langchain_community.document_loaders import JSONLoader
from langchain_core.documents import Document

# Load JSON documents (one JSON object per line)
loader = JSONLoader(file_path='./rag_docs/wikidata_rag_demo.jsonl',
                    jq_schema='.',
                    text_content=False,
                    json_lines=True)
wiki_docs = loader.load()

# Process JSON documents: keep title and id as metadata, join paragraphs as text
wiki_docs_processed = []
for doc in wiki_docs:
    doc = json.loads(doc.page_content)
    metadata = {
        "title": doc['title'],
        "id": doc['id'],
        "source": "Wikipedia"
    }
    data = ' '.join(doc['paragraphs'])
    wiki_docs_processed.append(Document(page_content=data, metadata=metadata))
Output
Document(metadata={'title': 'Chi-square distribution', 'id': '71548',
'source': 'Wikipedia'}, page_content='In probability theory and statistics,
the chi-square distribution (also chi-squared or formula_1\xa0 distribution)
is one of the most widely used theoretical probability distributions. Chi-
square distribution with formula_2 degrees of freedom is written as
formula_3. It is a special case of gamma distribution. Chi-square
distribution is primarily used in statistical significance tests and
confidence intervals. It is useful, because it is relatively easy to show
that certain probability distributions come close to it, under certain
conditions. One of these conditions is that the null hypothesis must be
true. Another one is that the different random variables (or observations)
must be independent of each other.')
Splits PDF content into chunks for vector embedding:
from glob import glob
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_simple_chunks(file_path, chunk_size=3500, chunk_overlap=200):
    # Load every page of the PDF as a separate document
    print('Loading pages:', file_path)
    loader = PyMuPDFLoader(file_path)
    doc_pages = loader.load()
    # Split the pages into overlapping chunks sized for embedding
    print('Chunking pages:', file_path)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    doc_chunks = splitter.split_documents(doc_pages)
    print('Finished processing:', file_path)
    return doc_chunks

pdf_files = glob('./rag_docs/*.pdf')

# Process PDF files
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(create_simple_chunks(file_path=fp))
Output
Loading pages: ./rag_docs/cnn_paper.pdf
Chunking pages: ./rag_docs/cnn_paper.pdf
Finished processing: ./rag_docs/cnn_paper.pdf
Loading pages: ./rag_docs/attention_paper.pdf
Chunking pages: ./rag_docs/attention_paper.pdf
Finished processing: ./rag_docs/attention_paper.pdf
Loading pages: ./rag_docs/vision_transformer.pdf
Chunking pages: ./rag_docs/vision_transformer.pdf
Finished processing: ./rag_docs/vision_transformer.pdf
Loading pages: ./rag_docs/resnet_paper.pdf
Chunking pages: ./rag_docs/resnet_paper.pdf
Finished processing: ./rag_docs/resnet_paper.pdf
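To see what `chunk_size` and `chunk_overlap` actually control, here is a simplified, library-free splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to break on separators such as paragraph and line breaks); this sketch cuts at fixed character positions and only illustrates the sliding window.

```python
def chunk_text(text, chunk_size=3500, chunk_overlap=200):
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for c in chunks:
    print(len(c), c)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.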
Creates embeddings for documents using OpenAI’s model and stores them in a Chroma vector database:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Initialize embedding model
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

# Combine documents
total_docs = wiki_docs_processed + paper_docs

# Create and save vector database
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")
Load an existing vector database from disk:
chroma_db = Chroma(persist_directory="./my_db",
                   collection_name='my_db',
                   embedding_function=openai_embed_model)
Retrieves the top-k most relevant documents based on a query:
similarity_retriever = chroma_db.as_retriever(search_type="similarity", search_kwargs={"k": 5})
# Query for semantic similarity
query = "What is machine learning?"
top_docs = similarity_retriever.invoke(query)
# Display results
from IPython.display import display, Markdown
def display_docs(docs):
    # Show each document's metadata and the first 1,000 characters of its text
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()
display_docs(top_docs)
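The collection was created with `"hnsw:space": "cosine"`, so documents are ranked by the angle between their embedding and the query embedding. As a rough illustration of that ranking, the sketch below computes cosine similarity by hand over tiny made-up 3-dimensional vectors (real embeddings have far more dimensions, and the vector store uses an approximate index rather than this exhaustive scan).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), name) for name, v in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical embeddings, for illustration only
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
ranked = top_k([1.0, 0.1, 0.0], doc_vecs, k=2)
print(ranked)
```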
Combines retrieval with a generative AI model for Q&A:
from langchain_core.prompts import ChatPromptTemplate
rag_prompt = """You are an assistant who is an expert in question-answering tasks.
Answer the following question using only the following pieces of retrieved context.
If the answer is not in the context, do not make up answers, just say that you don't know.
Keep the answer detailed and well formatted based on the information from the context.
Question:
{question}
Context:
{context}
Answer:
"""
rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
# Initialize ChatGPT model
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
# Format documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Construct the RAG pipeline
qa_rag_chain = (
{
"context": (similarity_retriever | format_docs),
"question": RunnablePassthrough()
}
|
rag_prompt_template
|
chatgpt
)
Example Usage
query = "What is the difference between AI, ML, and DL?"
result = qa_rag_chain.invoke(query)
# Display the generated answer
from IPython.display import display, Markdown
display(Markdown(result.content))
query = "What is LangGraph?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))
Output
I don't know.
The chain answers this way because none of the indexed documents contain information about LangGraph, and the prompt instructs the model not to make up answers.
Here, we will create an Agentic RAG system that uses external information to discuss the 2024 US Open.
This involves creating the necessary infrastructure:
To link machine learning capabilities:
Install required libraries:
!pip install langchain | tail -n 1
!pip install langchain-ibm | tail -n 1
!pip install langchain-community | tail -n 1
!pip install ibm-watsonx-ai | tail -n 1
!pip install ibm_watson_machine_learning | tail -n 1
!pip install chromadb | tail -n 1
!pip install tiktoken | tail -n 1
!pip install python-dotenv | tail -n 1
!pip install bs4 | tail -n 1
import os
from dotenv import load_dotenv
from langchain_ibm import WatsonxEmbeddings, WatsonxLLM
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.prompts import PromptTemplate
from langchain.tools import tool
from langchain.tools.render import render_text_description_and_args
from langchain.agents.output_parsers import JSONAgentOutputParser
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.agents import AgentExecutor
from langchain.memory import ConversationBufferMemory
from langchain_core.runnables import RunnablePassthrough
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes
The Setup:
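# Assumption: credentials are read from a local .env file; the variable names
# WATSONX_URL, WATSONX_APIKEY and PROJECT_ID are illustrative, not from the original.
load_dotenv()
credentials = {
    "url": os.getenv("WATSONX_URL"),
    "apikey": os.getenv("WATSONX_APIKEY"),
}
project_id = os.getenv("PROJECT_ID")

llm = WatsonxLLM(
    model_id="ibm/granite-3-8b-instruct",
    url=credentials.get("url"),
    apikey=credentials.get("apikey"),
    project_id=project_id,
    params={
        GenParams.DECODING_METHOD: "greedy",
        GenParams.TEMPERATURE: 0,
        GenParams.MIN_NEW_TOKENS: 5,
        GenParams.MAX_NEW_TOKENS: 250,
        GenParams.STOP_SEQUENCES: ["Human:", "Observation"],
    },
)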
template = "Answer the {query} accurately. If you do not know the answer, simply say you do not know."
prompt = PromptTemplate.from_template(template)
agent = prompt | llm
agent.invoke({"query": "What sport is played at the US Open?"})
'\n\nThe sport played at the US Open is tennis.'
agent.invoke({"query": "Where was the 2024 US Open Tennis Championship?"})
' Do not make up an answer.\n\nThe 2024 US Open Tennis Championship has not
been officially announced yet, so the location is not confirmed. Therefore,
I do not know the answer to this question.'
This step enables the agent to retrieve specific contextual information.
urls = [
    "https://www.ibm.com/case-studies/us-open",
    "https://www.ibm.com/sports/usopen",
    "https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement",
    "https://newsroom.ibm.com/2024-08-15-ibm-and-the-usta-serve-up-new-and-enhanced-generative-ai-features-for-2024-us-open-digital-platforms",
]

# Load each page, then flatten the per-URL lists into one list of documents
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
docs_list[0]

# Split the documents into chunks of about 250 tokens with no overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
doc_splits = text_splitter.split_documents(docs_list)

# The embedding model that we are using is an IBM Slate™ model through the watsonx.ai embeddings service. Let's initialize it.
embeddings = WatsonxEmbeddings(
    model_id=EmbeddingTypes.IBM_SLATE_30M_ENG.value,
    url=credentials["url"],
    apikey=credentials["apikey"],
    project_id=project_id,
)

# In order to store our embedded documents, we will use Chroma DB, an open source vector store.
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="agentic-rag-chroma",
    embedding=embeddings,
)
Set up a retriever to access the information in the vector store and enable queries over this knowledge base.
retriever = vectorstore.as_retriever()
@tool
def get_IBM_US_Open_context(question: str):
    """Get context about IBM's involvement in the 2024 US Open Tennis Championship."""
    context = retriever.invoke(question)
    return context

tools = [get_IBM_US_Open_context]
system_prompt = """Respond to the human as helpfully and accurately as possible. You have access to the following tools: {tools}
Use a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).
Valid "action" values: "Final Answer" or {tool_names}
Provide only ONE action per $JSON_BLOB, as shown:"
```
{{
"action": $TOOL_NAME,
"action_input": $INPUT
}}
```
Follow this format:
Question: input question to answer
Thought: consider previous and subsequent steps
Action:
```
$JSON_BLOB
```
Observation: action result
... (repeat Thought/Action/Observation N times)
Thought: I know what to respond
Action:
```
{{
"action": "Final Answer",
"action_input": "Final response to human"
}}
Begin! Reminder to ALWAYS respond with a valid json blob of a single action.
Respond directly if appropriate. Format is Action:```$JSON_BLOB```then Observation"""
human_prompt = """{input}
{agent_scratchpad}
(reminder to always respond in a JSON blob)"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history", optional=True),
        ("human", human_prompt),
    ]
)
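An agent executor essentially runs a loop over this prompt format: ask the model for one JSON action, run the named tool, feed the result back as an Observation, and stop at "Final Answer". The sketch below shows that loop in plain Python. It is not LangChain's implementation: `fake_llm`, `get_context`, `parse_action`, and `run_agent` are hypothetical names, and the fake model is hard-coded to call the tool once and then finish.

```python
import json
import re

FENCE = "`" * 3  # the prompt asks the model to wrap its JSON blob in this fence

def get_context(question):
    """Hypothetical tool standing in for the retriever-backed tool."""
    return f"<retrieved passages for: {question}>"

TOOLS = {"get_context": get_context}

def fake_llm(prompt_text):
    """Stand-in for the model: call the tool first, then give a final answer."""
    if "Observation:" not in prompt_text:
        blob = {"action": "get_context", "action_input": "2024 US Open location"}
    else:
        blob = {"action": "Final Answer", "action_input": "Answer based on the observation."}
    return "Thought: deciding next step\nAction:\n" + FENCE + "\n" + json.dumps(blob) + "\n" + FENCE

def parse_action(llm_output):
    """Extract the fenced JSON blob and return (action, action_input)."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, llm_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON blob found in model output")
    blob = json.loads(match.group(1))
    return blob["action"], blob["action_input"]

def run_agent(question, llm, max_steps=5):
    """Alternate between model calls and tool calls until a final answer."""
    scratchpad = ""
    for _ in range(max_steps):
        output = llm(question + "\n" + scratchpad)
        action, action_input = parse_action(output)
        if action == "Final Answer":
            return action_input
        observation = TOOLS[action](action_input)
        scratchpad += f"{output}\nObservation: {observation}\nThought: "
    raise RuntimeError("agent did not finish within max_steps")

answer = run_agent("Where was the 2024 US Open?", fake_llm)
print(answer)
```

The scratchpad plays the role of `{agent_scratchpad}` in the prompt: it accumulates earlier actions and observations so the model can decide whether it already has enough context to answer.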
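# Assumed wiring (not part of the original listing, inferred from the imports):
# fill in the tool placeholders, attach memory, and wrap the chain in an AgentExecutor.
prompt = prompt.partial(
    tools=render_text_description_and_args(list(tools)),
    tool_names=", ".join([t.name for t in tools]),
)
memory = ConversationBufferMemory()
chain = (
    RunnablePassthrough.assign(
        agent_scratchpad=lambda x: format_log_to_str(x["intermediate_steps"]),
        chat_history=lambda x: memory.chat_memory.messages,
    )
    | prompt
    | llm
    | JSONAgentOutputParser()
)
agent_executor = AgentExecutor(
    agent=chain, tools=tools, handle_parsing_errors=True, verbose=True, memory=memory
)

agent_executor.invoke({"input": "Where was the 2024 US Open Tennis Championship?"})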
{'input': 'Where was the 2024 US Open Tennis Championship?',
'history': '',
'output': 'The 2024 US Open Tennis Championship was held at the USTA Billie
Jean King National Tennis Center in Flushing, Queens, New York.'}
Great! The agent used its available RAG tool to return the location of the
2024 US Open, per the user's query. We even get to see the exact document
that the agent is retrieving its information from. Now, let's try a slightly
more complex query. This time, the query will be about IBM's
involvement in the 2024 US Open.
agent_executor.invoke(
{"input": "How did IBM use watsonx at the 2024 US Open Tennis Championship?"}
)
> Finished chain.
{'input': 'How did IBM use watsonx at the 2024 US Open Tennis Championship?',
'history': 'Human: Where was the 2024 US Open Tennis Championship?\nAI: The
2024 US Open Tennis Championship was held at the USTA Billie Jean King
National Tennis Center in Flushing, Queens, New York.',
'output': 'IBM used watsonx at the 2024 US Open Tennis Championship to
create generative AI-powered features such as Match Reports, AI Commentary,
and SlamTracker. These features enhance the digital experience for fans and
scale the productivity of the USTA editorial team.'}
This structured system combines IBM’s watsonx.ai, LangChain, and machine learning to build a versatile, knowledge-augmented AI agent tailored for both general and domain-specific queries.
RAG (Retrieval-Augmented Generation) enhances LLMs by combining external data retrieval with generative capabilities, improving accuracy and relevance and reducing hallucinations. However, it struggles with complex, multi-step queries. Agentic RAG advances this by integrating intelligent agents that dynamically select tools, refine queries, and handle specialized tasks like code generation or visualizations. It supports multi-agent collaboration, ensuring adaptability, scalability, and precise context-aware responses. While traditional RAG suits basic Q&A and research, Agentic RAG excels in dynamic, data-intensive applications like real-time analysis and enterprise systems. Agentic RAG’s modularity and intelligence make it ideal for tackling complex tasks beyond the scope of traditional RAG systems.
I hope you find this guide helpful in understanding RAG vs Agentic RAG! If you have any questions regarding the article, comment below.
Ans. RAG focuses on integrating retrieval and generation capabilities to improve AI outputs by grounding responses with external knowledge. Agentic RAG, on the other hand, incorporates intelligent agents that can autonomously select tools, refine queries, and adapt to complex, multi-step tasks.
Ans. Agentic RAG enables decision-making and dynamic planning, allowing it to handle real-time data, multi-tool integration, and context-aware reasoning, making it ideal for sophisticated, task-specific applications.
Ans. Agentic RAG employs agents like routing agents to direct queries, query planning agents for breaking down multi-step tasks, and Re-Act agents for iterative reasoning and actions, ensuring precise and contextual responses.
Ans. Traditional RAG struggles with contextual understanding, synthesis, and scalability. Agentic RAG overcomes these by dynamically adapting to user inputs, integrating diverse data sources, and leveraging multi-agent collaboration for efficient task management.
Ans. Agentic RAG is ideal for applications requiring real-time updates, multi-step reasoning, and integration with multiple tools, such as enterprise systems, data analytics, and domain-specific AI systems. Traditional RAG suits simpler, static tasks like basic Q&A or static content retrieval.