RAG vs Agentic RAG: A Comprehensive Guide

Pankaj Singh Last Updated : 02 Jan, 2025

Today, I am discussing RAG vs Agentic RAG. In this guide, I will provide you with the comparison and then proceed to the hands-on part.

Firstly, let’s understand what RAG is. It is not a piece of old cloth but a framework that lets an LLM obtain relevant, up-to-date, and context-specific information by combining retrieval and generation capabilities.

But can we see the limitations of LLMs without RAG? Absolutely! Here, I asked ChatGPT to describe Swarm by OpenAI using only its own knowledge, without any external search, and it could not give the right answer. This is because its knowledge cutoff is in 2023; to answer correctly, it needs either updated training data or access to an external source. Intriguing, right? So, can we augment LLMs with our own custom data to get the right response? Of course: we can do it with long-context LLMs or with RAG. Today, we will be talking about RAG.

Chatgpt

Instead of relying solely on the large language model’s (LLM) pre-trained knowledge, which may be outdated or incomplete, RAG dynamically retrieves the most relevant documents or information from an external knowledge base or database.

Let us understand this with an example: if we humans relied on only one source of information after birth, our understanding of the world would remain severely limited. Similarly, a Large Language Model (LLM) on its own has a fixed training dataset that serves as its “internal knowledge.” When this is the model’s only source of information, the result can be outdated information, ungrounded hallucinations, nonsensical content, and more. While vast, this dataset is static and is not enough for real-time, context-specific queries. This is where RAG (Retrieval-Augmented Generation) steps in.

What Does RAG Do?

Here’s what RAG does:

Source: Author
  1. Retrieval (R): This involves searching for relevant data from external sources, databases, or knowledge repositories. The goal is to gather specific, accurate, and relevant information that can support or enhance the AI’s understanding of a particular topic or query.
  2. Augmentation (A): In this phase, the retrieved data is added to the prompt context. This means the information is combined with the input given to the AI, enriching the context it reasons over so that responses are better informed and context-aware.
  3. Generation (G): Finally, the AI uses the augmented context to generate outputs, such as text, explanations, or insights, based on the combined input and retrieved data. This step represents the output of generative AI tools like GPT models.

Together, the RAG framework helps improve AI-generated content’s relevance, accuracy, and richness by grounding responses in retrieved and contextualised information.

RAG framework
Source: Author

RAG vs Without RAG

Here’s the comparison of RAG and Without RAG:

| Category | Without RAG | With RAG |
| --- | --- | --- |
| Accuracy | Susceptible to generating unverified or “hallucinated” content, not tied to reliable sources. | Responses are grounded with verifiable citations from external sources. |
| Timeliness | Relies on static pre-trained data, which may be outdated or irrelevant to current events. | Enhances static pre-trained data by incorporating real-time, up-to-date information from external sources. |
| Contextual Clarity | Often struggles to interpret ambiguous queries, leading to vague or incomplete answers. | Retrieves context-specific information, improving the clarity and specificity of responses. |
| Customisation | Cannot access or utilise user-specific datasets or private sources, resulting in generic responses. | Integrates public and private datasets, enabling highly tailored and relevant outputs. |
| Search Scope | Limited to the pre-trained knowledge base; cannot extend to new or external information. | Capable of broad, on-demand searches across multiple databases or online sources. |
| Reliability | High potential for errors due to reliance on static and pre-generated knowledge. | Ensures reliability by cross-referencing multiple trusted sources in real time. |
| Use Cases | Suitable for general-purpose tasks but less effective for dynamic or data-intensive applications. | Ideal for tasks requiring live updates, research, or custom data integration. |
| Transparency | No clear reference or citation for the provided information, making validation difficult. | Provides citations or links to sources, ensuring transparency and trustworthiness. |

RAG (Retrieval-Augmented Generation)

LLM with RAG
Source: Author

Without RAG (Retrieval-Augmented Generation)

llm without RAG
Source: Author

Working of RAG

RAG System Architecture: Data Indexing

This part focuses on preparing and managing the knowledge base so that it can be searched at retrieval time.

Step 1: Load

  • The system ingests different types of data (e.g., text files, PDFs, URLs, and JSON files).
  • The data can come from diverse sources, ensuring a comprehensive knowledge base.

Step 2: Split

  • The data is divided into smaller, meaningful chunks.
  • This step ensures that retrieval works efficiently, allowing the system to fetch precise and relevant parts of documents instead of retrieving entire files.

Step 3: Embed

  • Each chunk of data is converted into vector representations using embedding models.
  • These embeddings capture the semantic meaning of the text, enabling the system to perform similarity-based searches.

Step 4: Store

  • The embeddings and corresponding data are stored in a vector database.
  • The vector database is optimised for quick and accurate similarity searches, which is crucial for retrieval.
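
The four indexing steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list `vector_store` stands in for a vector database, and the sample texts and file names are made up.

```python
import re
from collections import Counter

def split_text(text, chunk_size=60, overlap=10):
    """Split: cut text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text):
    """Embed: toy bag-of-words vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Load: in practice this text would come from files, PDFs, URLs, or JSON.
# (Hypothetical file names and contents, for illustration only.)
raw_docs = {
    "rag_notes.txt": "RAG retrieves relevant documents from a knowledge base "
                     "before the model generates an answer.",
    "db_notes.txt": "A vector database stores embeddings and supports "
                    "fast similarity search.",
}

# Store: keep each chunk next to its vector and its source.
vector_store = []
for source, text in raw_docs.items():
    for chunk in split_text(text):
        vector_store.append(
            {"source": source, "text": chunk, "vector": embed(chunk)}
        )

print(len(vector_store), "chunks indexed")
```

The overlap between neighbouring chunks is a common design choice: it reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.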

Also read: Vector Embeddings with Cohere and Hugging Face

RAG System Architecture: Search and Generation

This describes the overall process of combining retrieval and generation to produce an answer:

Step 1: Question Input

  • The user provides a query or question.
  • The system begins by analysing this question for context and intent.

Step 2: Retrieve

  • The system queries an indexed knowledge base (retrieval system) to gather the most relevant documents or pieces of information.
  • These documents serve as supporting evidence or context for generating an answer.

Step 3: Prompt Creation

  • The retrieved documents are structured into a prompt for the LLM.
  • The prompt includes the original question and the retrieved information, guiding the LLM in generating a context-aware response.

Step 4: Large Language Model (LLM)

  • The LLM processes the prompt, utilizing its generative capabilities to create a coherent and precise response.
  • The response combines insights from the retrieved documents with the LLM’s pre-trained knowledge.

Step 5: Answer Output

  • The final answer is presented to the user, blending the retrieved knowledge and the LLM’s generative capabilities.
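
The retrieve and prompt-creation steps above can be sketched as follows. This is a rough illustration only: the bag-of-words `embed` function and the three sample chunks are assumptions standing in for a real embedding model and an indexed knowledge base, and the final LLM call is left out because it depends on the provider.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical pre-indexed chunks.
chunks = [
    "RAG retrieves documents from a knowledge base before generating an answer.",
    "A vector database stores embeddings for fast similarity search.",
    "Agents can route a query to the most relevant tool.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=2):
    """Retrieve: rank chunks by similarity to the question, keep the top k."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, docs):
    """Prompt creation: combine the question with the retrieved context."""
    context = "\n\n".join(docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

question = "What does a vector database store?"
top_docs = retrieve(question)
prompt = build_prompt(question, top_docs)
print(prompt)  # this string is what would be sent to the LLM
```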

Here’s how the system architecture looks when combined together:

Also read: Build a RAG Pipeline With the LLama Index

Challenges with RAG

Here are the challenges with RAG:

  1. Contextual Understanding: RAG systems must understand the context and intent behind each query, especially when handling ambiguous or multi-part questions, and they often fall short here.
  2. Synthesis and Reasoning: Beyond retrieving relevant information, the system must synthesize data from multiple sources and generate coherent, actionable insights.
  3. Customization: Adhering to specific internal style guides or user-defined preferences adds another layer of complexity.
  4. Accuracy and Relevance: The retrieved and generated content must be accurate, relevant, and directly address the user’s query.
  5. Scalability: Managing a large volume of diverse queries across different domains or topics can strain the system’s ability to provide high-quality responses.

Traditional Retrieval-Augmented Generation (RAG) systems enhance AI by pairing Large Language Models (LLMs) with vector databases to overcome LLM limitations. While effective for basic tasks like Q&A or support bots, they struggle with complex use cases. These systems often fail to contextualize retrieved data, resulting in superficial responses that lack depth and nuance.

These challenges demonstrate why RAG systems require sophisticated mechanisms for retrieval, context understanding, and natural language generation to handle these nuanced use cases effectively. This is where Agentic RAG comes to the rescue.

I hope you now have a clear understanding of traditional RAG. We will now discuss a version of RAG that uses agents: Agentic RAG.

What is Agentic RAG?

Agentic RAG refers to a more intelligent and dynamic Retrieval-Augmented Generation system where an “agent” plays a key role in orchestrating processes. The agent intelligently determines which resources or databases are most relevant for a user’s query, making it capable of handling more complex, multi-tasking scenarios. It is an evolution from traditional RAG systems, offering greater adaptability and decision-making by incorporating additional logic or heuristics into the retrieval and response generation pipeline.

  • Agentic: The system works on its own, making decisions and taking actions depending on the situation.
  • RAG (Retrieval-Augmented Generation): It mixes information from a knowledge base with the AI’s ability to create responses.

Agentic RAG Workflow

Agentic RAG
Source: Author

The diagram above shows the process flow of an Agentic RAG system for handling user queries. Here’s a breakdown of each component:

  1. User Input and Initial Assessment:
    • The system receives a user query.
    • The query is assessed to determine whether it can be answered through retrieval (that is, whether it falls within the scope of one of the vector databases).
  2. Vector Database Selection:
    • The agent identifies the most relevant vector database for the query.
    • Multiple vector databases are available:
      • DB1: Contains data for generating code.
      • DB2: Contains other general data.
      • DB3: Contains data for generating charts.
    • If the query does not match any database, the process routes to a failsafe mechanism.
  3. Content Retrieval:
    • Once a database is selected, the relevant content is retrieved.
    • Retrieved content is integrated into the LLM prompt for further processing.
  4. Response Type Selection:
    • Based on the query and retrieved content, the system determines the appropriate response type:
      • Generate Code: If the query involves code-related tasks.
      • Generate Charts: If the query requires visualization.
      • Generate Text Response: For standard text-based answers.
  5. Final Output:
    • The system generates the appropriate response (text, code, or chart) and delivers the final output.
    • If no relevant data is found, the system defaults to a failsafe response, returning a message like:
      “Sorry, I don’t have the information you’re looking for.”
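
The workflow above can be sketched as a small router with a failsafe. This is a simplified illustration, not the actual system: a real agent would typically ask an LLM to choose the database, whereas here hypothetical keyword sets (`DATABASES`) stand in for that decision, and retrieval itself is reduced to a placeholder string.

```python
import re

# Hypothetical keyword profiles standing in for an LLM-based routing decision.
DATABASES = {
    "DB1": {"code", "function", "python", "script", "bug"},
    "DB2": {"policy", "history", "definition", "explain"},
    "DB3": {"chart", "plot", "graph", "visualize"},
}
# Response type produced for content from each database.
RESPONSE_TYPES = {"DB1": "code", "DB2": "text", "DB3": "chart"}
FAILSAFE = "Sorry, I don’t have the information you’re looking for."

def route(query):
    """Pick the database whose keywords overlap the query most, or None."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    scores = {name: len(words & keywords) for name, keywords in DATABASES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def answer(query):
    """Route the query, then choose a response type or fall back."""
    db = route(query)
    if db is None:
        return FAILSAFE  # no database matched: failsafe response
    return f"[{RESPONSE_TYPES[db]}] generated from content retrieved from {db}"

print(answer("Write a python function to sort a list"))
print(answer("Plot a bar chart of monthly sales"))
print(answer("What is the weather on Mars?"))
```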

Crucial Points:

  • Agent Role: The agent dynamically selects the most relevant database, enhancing flexibility and efficiency in handling diverse queries.
  • Failsafe Mechanism: Ensures the system gracefully handles unanswerable queries by returning a fallback response.
  • Task Specialization: Different vector databases are optimized for specific tasks (e.g., code generation, chart creation), improving performance and accuracy for complex scenarios.

This workflow exemplifies a robust approach to Agentic RAG, demonstrating how modular, context-aware processing makes it possible to handle a wide range of tasks.

Also read: How Agentic RAG Systems with CrewAI and LangChain Transform Tech?

Let’s see how a Self-reflective Agentic RAG System works:

LangGraph agentic RAG
Source: LangChain
  1. Agent (Node): Initiates the process and decides whether to retrieve documents by evaluating the query (via a function call).
  2. Should Retrieve (Conditional Edge): Determines if retrieval is necessary. If yes, the process continues; if no, it ends.
  3. Tool (Node): Executes a retrieval tool to fetch relevant documents or information.
  4. Check Relevance (Conditional Edge): Assesses if the retrieved documents are relevant. If yes, it moves to the next step; if no, it redirects to the rewrite process.
  5. Rewrite (Node): Reformulates the query and restarts the retrieval process if necessary.
  6. Generate (Node): If relevant documents are found, the system generates an answer and outputs it.

This iterative approach ensures accuracy and relevance by dynamically retrieving and refining the query as needed.
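
The retrieve, check, rewrite loop described above can be sketched in plain Python. This is a toy version under stated assumptions, not the LangGraph implementation: the two-entry `CORPUS`, the substring-based retriever, and the abbreviation-expanding `rewrite` are all hypothetical stand-ins for the LLM-driven nodes.

```python
# Hypothetical mini knowledge base, keyed by topic keyword.
CORPUS = {
    "langgraph": "LangGraph is a library for building stateful agent workflows as graphs.",
    "chroma": "Chroma is an open-source vector database.",
}

def should_retrieve(query):
    """Should-retrieve edge: in this toy, only questions trigger retrieval."""
    return "?" in query

def retrieve(query):
    """Tool node: return documents whose keyword appears in the query."""
    return [text for key, text in CORPUS.items() if key in query.lower()]

def is_relevant(docs):
    """Check-relevance edge: here, any hit counts as relevant."""
    return len(docs) > 0

def rewrite(query):
    """Rewrite node: toy reformulation that expands one known abbreviation."""
    return query.lower().replace("lg", "langgraph")

def run(query, max_rewrites=2):
    if not should_retrieve(query):
        return "No retrieval needed."
    for _ in range(max_rewrites + 1):
        docs = retrieve(query)
        if is_relevant(docs):
            return f"Answer based on: {docs[0]}"  # generate node
        query = rewrite(query)                    # try again with a new query
    return "No relevant documents found."

print(run("What is LG?"))
print(run("What is quantum foam?"))
```

Capping the number of rewrites (`max_rewrites`) is a design choice in this sketch: without a limit, a query that never matches anything would loop forever.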

Also read: A Comprehensive Guide to Building Agentic RAG Systems with LangGraph

Understanding Agents in RAG Systems

Agents are the driving force behind an Agentic RAG system, functioning as specialized units that streamline each stage of the retrieval and generation pipeline. They operate collaboratively to achieve tasks like understanding user queries, retrieving relevant information, generating responses, and managing the overall workflow.

By orchestrating these functions, agents ensure smooth, efficient, and intelligent handling of tasks. This modular and adaptive approach allows the system to tackle complex queries effectively while improving overall performance and system reliability.

Types of Agents in the RAG Pipeline

The RAG system employs several types of agents, each with a specific purpose and methodology. Here’s a breakdown for clarity:

1. Routing Agents

Routing Agents
Source: LlamaIndex

Purpose: Direct user queries to the most appropriate sources.

How They Work: Analyze queries using large language models (LLMs) to determine which parts of the RAG pipeline best handle the request.

The diagram shows a hybrid approach that combines semantic search and summarization to answer a specific query: “What did the author do during his time in art school?” Here’s how the system works:

  1. Router (Green Box):
    • The entry point where the query is received.
    • Decides how to process the query, directing it to the appropriate engines.
  2. Semantic Search + Summarization (Pink Area):
    • This is the main process to extract and summarize information relevant to the query.
  3. Vector Query Engine (Left Path):
    • Performs semantic search by comparing the query with document embeddings (vectorized representations of the content).
    • Retrieves the top-k relevant documents based on similarity scores.
  4. Summary Query Engine (Right Path):
    • Instead of ranking by relevance, this engine retrieves all potentially related documents.
    • Focuses on summarizing or extracting the exact answer from the retrieved data.
  5. Docs (Document Corpus):
    • Represents the database or collection of text/documents being queried.
  6. Final Output (Bottom):
    • After processing by the engines, a summarized response is generated: “During his time in art school, the author took foundation classes in fundamental subjects like drawing, color, and design.”

Advantages:

  • Enhance query accuracy by targeting relevant data sources.
  • Improve system efficiency by avoiding unnecessary processing.
  • Combine question answering and summarization in a single system.

Also read: Agentic RAG for Analyzing Customer Issues

2. Query Planning Agents

Query Planning Agents
Source: LlamaIndex

Purpose: Handle complex or multi-faceted queries by breaking them into smaller, manageable components.

How They Work:

  • Divide the main query into sub-queries.
  • Assign retrieval and generation tasks for each sub-query across the RAG pipelines.

This example walks through retrieving and comparing revenue growth information for Uber and Lyft in 2021 from their financial documents (10-K filings).

Process Overview:

  1. Query Decomposition:
    • The initial query (Compare revenue growth of Uber and Lyft in 2021) is split into two sub-queries:
      • Describe revenue growth of Lyft in 2021.
      • Describe revenue growth of Uber in 2021.
  2. Data Source:
    • The data is extracted from 10-K filings (annual financial reports) of Uber and Lyft.
    • These filings are stored in a document database where each report is split into smaller chunks for efficient retrieval.
  3. Retrieval (Top-2 Chunks):
    • For each sub-query:
      • The system identifies the most relevant chunks (top-2) from the respective 10-K filings.
      • For example:
        • Uber 10-K chunk 4 and Uber 10-K chunk 8 for the Uber sub-query.
        • Lyft 10-K chunk 4 and Lyft 10-K chunk 8 for the Lyft sub-query.
  4. Results Compilation:
    • After retrieving the relevant chunks, the system processes the content to generate responses for each sub-query.
    • Finally, the results for Lyft and Uber are combined to facilitate a comparison.

Key Insights:

  • Chunking: Large documents like 10-K filings are divided into smaller sections (chunks) for more efficient and targeted searches.
  • Relevance Ranking: The system uses a ranking mechanism (e.g., semantic similarity or keyword relevance) to select the top-2 chunks most likely to contain the required information.
  • Modular Query Handling: By decomposing the query into smaller parts, the system can handle complex, multi-entity questions more effectively.

Outcome: Results from each sub-query are synthesized into a complete, coherent response.

Benefits:

  • Streamline responses to intricate questions.
  • Leverage multiple data sources to provide comprehensive answers.

3. ReAct Agents (Reasoning and Action Agents)

Purpose: Adaptively combine reasoning and dynamic action to handle real-time queries and user interactions.

How They Work:

  • Select and execute tools or processes needed for specific tasks.
  • Retrieve data, process information, and store outputs incrementally.
  • Iterate the process, refining results until an accurate response is generated.

Why They Matter:

  • Handle dynamic queries requiring multiple steps and tool integrations.
  • Respond effectively to real-time changes in user input or query scope.
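
A ReAct-style loop alternates between choosing an action, running a tool, and recording the observation. The sketch below shows only that control flow and rests on several assumptions: the scripted `policy` function stands in for the LLM that would normally decide the next step, and the two tools (`lookup`, `multiply`) are hypothetical.

```python
# Hypothetical tools the agent can call.
FACTS = {"speed_of_light_km_s": "299792"}
TOOLS = {
    "lookup": lambda key: FACTS.get(key, "unknown"),
    "multiply": lambda a, b: str(float(a) * float(b)),
}

def policy(question, scratchpad):
    """Stand-in for the LLM: choose the next action from past observations."""
    if not scratchpad:
        return "lookup", ["speed_of_light_km_s"]
    if len(scratchpad) == 1:
        last_observation = scratchpad[-1][2]
        return "multiply", [last_observation, "60"]
    return "finish", [scratchpad[-1][2]]

def react(question, max_steps=5):
    """Iterate: act, observe, and store each step until the policy finishes."""
    scratchpad = []  # list of (action, args, observation)
    for _ in range(max_steps):
        action, args = policy(question, scratchpad)
        if action == "finish":
            return args[0], scratchpad
        observation = TOOLS[action](*args)
        scratchpad.append((action, args, observation))
    return None, scratchpad  # step budget exhausted

answer, steps = react("How many kilometres does light travel in one minute?")
for action, args, observation in steps:
    print(f"Action: {action}{args} -> Observation: {observation}")
print("Final answer:", answer)
```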

Also read: What is Agentic AI Planning Pattern?

4. Dynamic Planning and Execution Agents

Purpose: Continuously adapt to evolving data and changing user requirements.

Key Areas of Focus:

  • Long-Term Planning: Chart out strategies for sustained system performance.
  • Execution Insights: Monitor and refine real-time actions.
  • Efficiency: Minimize delays and optimize resource usage.

How They Work:

  • Separate overarching planning from granular, step-by-step execution.
  • Use computational graphs to map out comprehensive query solutions.
  • Incorporate a two-part system:
    • Planner: Designs strategies.
    • Executor: Implements these strategies effectively.

Components and Workflow

  1. User Input:
    • Example Query: “How much does Microsoft’s market cap need to increase to exceed Apple’s market cap?”
    • The system receives a natural language input from the user.
  2. LLM Planner:
    • Task Generation: The query is analyzed, and tasks are created as a Directed Acyclic Graph (DAG) with dependencies. For example:
      • Task 1: search(Microsoft Market Cap)
      • Task 2: search(Apple Market Cap)
      • Task 3: 1 – 2 (compute the difference after retrieving results for Tasks 1 and 2).
    • Tasks are arranged based on dependencies:
      • Tasks 1 and 2 are independent and can be executed in parallel.
      • Task 3 depends on the results of Tasks 1 and 2 and must wait for their completion.
  3. Task Fetching Unit:
    • Dependency Resolution:
      • This unit identifies the tasks ready for execution (those with resolved dependencies).
      • For example:
        • Initially, Tasks 1 and 2 are fetched for parallel execution.
        • Once Tasks 1 and 2 are completed, their results are fed into Task 3.
  4. Executor:
    • Executes tasks using tools or functions as needed.
    • Tools Available:
      • search: Used to retrieve information (e.g., market caps for Microsoft and Apple).
      • math: Performs calculations (e.g., subtracting one market cap from another).
    • Execution Workflow:
      • Fetches tasks from the Task Fetching Unit.
      • Utilizes tools and functions to perform the necessary operations.
      • Results are stored in memory for dependent tasks.
  5. Final Answer:
    • Once all tasks are executed and dependencies are resolved, the results are returned to the user.
    • For the example query, the final result would quantify how much Microsoft’s market cap needs to increase to exceed Apple’s.

Key Features:

  • Task Decomposition: Breaks down complex queries into manageable components.
  • Parallel Execution: Executes independent tasks simultaneously to optimize performance.
  • Dependency Management: Ensures tasks are executed in the correct sequence, based on their interdependencies.
  • Tool Integration: Supports multiple tools (e.g., search, math) to handle various task types.
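
The planner, task fetching unit, and executor can be sketched with Python’s standard library. This is a minimal sketch under stated assumptions: the task graph is written by hand instead of being produced by an LLM planner, the market-cap figures in `FAKE_SEARCH` are made-up placeholders (in billions), and the sketch subtracts Microsoft’s value from Apple’s, since that is the gap Microsoft would need to close.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Made-up placeholder values in billions; NOT real market data.
FAKE_SEARCH = {"Microsoft Market Cap": 3000, "Apple Market Cap": 3400}

def search(query):
    """search tool: look up a value (stubbed with placeholder data)."""
    return FAKE_SEARCH[query]

def subtract(a, b):
    """math tool: difference between two values."""
    return a - b

# Hand-written plan; "$n" refers to the result of task n.
tasks = {
    1: {"tool": search, "args": ["Microsoft Market Cap"], "deps": []},
    2: {"tool": search, "args": ["Apple Market Cap"], "deps": []},
    3: {"tool": subtract, "args": ["$2", "$1"], "deps": [1, 2]},
}

def resolve(arg, results):
    """Replace a "$n" placeholder with the stored result of task n."""
    if isinstance(arg, str) and arg.startswith("$"):
        return results[int(arg[1:])]
    return arg

def execute(tasks):
    sorter = TopologicalSorter({tid: t["deps"] for tid, t in tasks.items()})
    sorter.prepare()
    results = {}  # memory shared with dependent tasks
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # tasks whose dependencies are resolved
            futures = {
                tid: pool.submit(
                    tasks[tid]["tool"],
                    *[resolve(arg, results) for arg in tasks[tid]["args"]],
                )
                for tid in ready
            }
            for tid, future in futures.items():
                results[tid] = future.result()
                sorter.done(tid)
    return results

results = execute(tasks)
print("Required increase (billions, placeholder data):", results[3])
```

Tasks 1 and 2 are returned together by the first `get_ready()` call and run in parallel; Task 3 only becomes ready once both are marked done.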

Why Do These Agents Matter?

By employing specialized agents with distinct functions, the RAG pipeline ensures:

  • Accuracy: Queries are routed to the most relevant sources and processed with the appropriate tools.
  • Scalability: Complex tasks are divided and executed seamlessly.
  • Flexibility: Dynamic agents respond effectively to changing scenarios or unexpected inputs.
  • Efficiency: Redundant processes are avoided, ensuring faster, smarter results.

These agents collectively enable RAG systems to deliver high-quality, contextually accurate, and timely responses to users, regardless of the complexity of their queries.

RAG vs Agentic RAG

Agentic RAG frameworks are much more versatile than traditional RAG setups. In a traditional RAG system, the AI relies on a single tool—a vector database—for retrieving information to shape its responses. While effective for basic data retrieval, this approach is limited to working with static documents.

In contrast, agentic RAG systems go beyond simple data retrieval. These advanced frameworks can integrate multiple tools to handle a variety of tasks. For example, they can perform complex mathematical calculations, write emails, analyze data, or even make decisions based on contextual needs. This ability to incorporate different tools makes them far more flexible and capable.

Additionally, agentic RAG systems excel in multistep reasoning. They are context-aware, meaning they can decide when and how to use specific tools to solve problems or accomplish tasks. This ensures better accuracy and efficiency in handling more complex requirements.

Its ability to work collaboratively in multiagent systems sets agentic RAG apart. Multiple AI agents can work together, achieving results that are often far better than those of a single AI agent. This adaptability and scalability make agentic RAG a powerful choice for dynamic, real-world applications.

The Tabular Comparison of RAG vs Agentic RAG

| Feature | RAG | Agentic RAG |
| --- | --- | --- |
| Task Complexity | Handles simple query-based tasks but lacks advanced decision-making | Handles complex multi-step tasks with multiple tools and agents as needed for retrieval, reasoning, and more |
| Decision-Making | Limited, no autonomous decision-making involved | Agents autonomously decide what data to retrieve, how to retrieve, grade, reason, reflect, and generate responses |
| Multi-Step Reasoning | Limited to single-step queries and responses | Excels at multi-step reasoning, especially after retrieval with grading, hallucination, and response evaluation |
| Key Role | Combines LLMs with external data retrieval to generate responses | Enhances RAG by using agents for intelligent retrieval, response generation, grading, critiquing, and more |
| Real-Time Data Retrieval | Not possible in native RAG | Designed for real-time data retrieval and integration |
| Integration with Retrieval Systems | Tied to static retrieval from pre-defined vector databases | Deeply integrated with diverse retrieval systems, agents control the process |
| Context-Awareness | Limited by the static vector database, no advanced or real-time context-awareness | High, agents adapt to user query and retrieve context, including real-time data |

Also read: Evolution of RAG, Long Context LLMs to Agentic RAG

To understand the difference between RAG and Agentic RAG in practice, let’s look at how each is implemented.

Hands-On: Build a Simple RAG System

Necessary Libraries and Imports

!pip install langchain==0.3.4
!pip install langchain-openai==0.2.3
!pip install langchain-community==0.3.3
!pip install jq==1.8.0
!pip install pymupdf==1.24.12
!pip install langchain-chroma==0.1.4
from getpass import getpass
OPENAI_KEY = getpass('Enter Open AI API Key: ')
import os
os.environ['OPENAI_API_KEY'] = OPENAI_KEY
from langchain_openai import OpenAIEmbeddings
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

1. Core Functionalities

JSON Document Handling

Processes JSON documents into structured formats:

import json
from langchain_community.document_loaders import JSONLoader
from langchain_core.documents import Document
# Load JSON documents (the file holds one JSON object per line)
loader = JSONLoader(file_path='./rag_docs/wikidata_rag_demo.jsonl',
                    jq_schema='.',
                    text_content=False,
                    json_lines=True)
wiki_docs = loader.load()
# Convert each raw record into a Document with clean metadata
wiki_docs_processed = []
for raw_doc in wiki_docs:
    record = json.loads(raw_doc.page_content)
    metadata = {
        "title": record['title'],
        "id": record['id'],
        "source": "Wikipedia"
    }
    # Join the paragraphs into a single text body
    data = ' '.join(record['paragraphs'])
    wiki_docs_processed.append(Document(page_content=data, metadata=metadata))

Output

Document(metadata={'title': 'Chi-square distribution', 'id': '71548',
'source': 'Wikipedia'}, page_content='In probability theory and statistics,
the chi-square distribution (also chi-squared or formula_1\xa0 distribution)
is one of the most widely used theoretical probability distributions. Chi-
square distribution with formula_2 degrees of freedom is written as
formula_3. It is a special case of gamma distribution. Chi-square
distribution is primarily used in statistical significance tests and
confidence intervals. It is useful, because it is relatively easy to show
that certain probability distributions come close to it, under certain
conditions. One of these conditions is that the null hypothesis must be
true. Another one is that the different random variables (or observations)
must be independent of each other.')

PDF Document Handling

Splits PDF content into chunks for vector embedding:

from glob import glob
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
def create_simple_chunks(file_path, chunk_size=3500, chunk_overlap=200):
    # Load the PDF, one document per page
    print('Loading pages:', file_path)
    loader = PyMuPDFLoader(file_path)
    doc_pages = loader.load()
    # Split the pages into overlapping chunks sized for embedding
    print('Chunking pages:', file_path)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    doc_chunks = splitter.split_documents(doc_pages)
    print('Finished processing:', file_path)
    return doc_chunks
pdf_files = glob('./rag_docs/*.pdf')
# Process PDF files
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(create_simple_chunks(file_path=fp))

Output

Loading pages: ./rag_docs/cnn_paper.pdf

Chunking pages: ./rag_docs/cnn_paper.pdf

Finished processing: ./rag_docs/cnn_paper.pdf

Loading pages: ./rag_docs/attention_paper.pdf

Chunking pages: ./rag_docs/attention_paper.pdf

Finished processing: ./rag_docs/attention_paper.pdf

Loading pages: ./rag_docs/vision_transformer.pdf

Chunking pages: ./rag_docs/vision_transformer.pdf

Finished processing: ./rag_docs/vision_transformer.pdf

Loading pages: ./rag_docs/resnet_paper.pdf

Chunking pages: ./rag_docs/resnet_paper.pdf

Finished processing: ./rag_docs/resnet_paper.pdf

2. Embedding and Vector Storage

Creates embeddings for documents using OpenAI’s model and stores them in a Chroma vector database:

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Initialize embedding model
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')
# Combine documents
total_docs = wiki_docs_processed + paper_docs
# Create and save vector database
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")

Load an existing vector database from disk:

chroma_db = Chroma(persist_directory="./my_db",
                   collection_name='my_db',
                   embedding_function=openai_embed_model)

3. Semantic Retrieval

Retrieves the top-k most relevant documents based on a query:

similarity_retriever = chroma_db.as_retriever(search_type="similarity", search_kwargs={"k": 5})
# Query for semantic similarity
query = "What is machine learning?"
top_docs = similarity_retriever.invoke(query)
# Display results
from IPython.display import display, Markdown
def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()
display_docs(top_docs)

Output

4. RAG Pipeline

Combines retrieval with a generative AI model for Q&A:

Prompt Template

from langchain_core.prompts import ChatPromptTemplate
rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.
                Question:
                {question}
                Context:
                {context}
                Answer:
            """
rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

Pipeline Construction

from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
# Initialize ChatGPT model
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
# Format documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Construct the RAG pipeline
qa_rag_chain = (
    {
        "context": (similarity_retriever | format_docs),
        "question": RunnablePassthrough()
    }
      |
    rag_prompt_template
      |
    chatgpt
)

Example Usage

query = "What is the difference between AI, ML, and DL?"
result = qa_rag_chain.invoke(query)
# Display the generated answer
from IPython.display import display, Markdown
display(Markdown(result.content))

Output

query = "What is LangGraph?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Output

I don't know.

This is because the indexed documents contain no information about LangGraph, and the prompt instructs the model to say it doesn’t know instead of answering from outside the retrieved context.

Also read: A Comprehensive Guide to Building Multimodal RAG Systems

LangChain Agentic RAG System Using the IBM Granite-3.0-8B-Instruct model

Here, we will create an Agentic RAG system that uses external information to discuss the 2024 US Open.

1. Setting Up the Environment

This involves creating the necessary infrastructure:

  • Log in to watsonx.ai: Use your IBM Cloud credentials.
  • Create a watsonx.ai Project: Obtain the project ID for the configuration.
  • Set Up Jupyter Notebook: This can be done in the cloud environment or locally by uploading pre-built notebooks.

2. Configuring Watson Machine Learning (WML)

To link machine learning capabilities:

  • Create WML Instance: Select the region and Lite plan for a free option.
  • Generate API Key: Required for secure integration.
  • Link WML to watsonx.ai Project: Integrate the project for seamless use.

3. Installing Libraries and Setting Credentials

Install required libraries:

!pip install langchain | tail -n 1
!pip install langchain-ibm | tail -n 1
!pip install langchain-community | tail -n 1
!pip install ibm-watsonx-ai | tail -n 1
!pip install ibm_watson_machine_learning | tail -n 1
!pip install chromadb | tail -n 1
!pip install tiktoken | tail -n 1
!pip install python-dotenv | tail -n 1
!pip install bs4 | tail -n 1

import os
from dotenv import load_dotenv
from langchain_ibm import WatsonxEmbeddings, WatsonxLLM
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.prompts import PromptTemplate
from langchain.tools import tool
from langchain.tools.render import render_text_description_and_args
from langchain.agents.output_parsers import JSONAgentOutputParser
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.agents import AgentExecutor
from langchain.memory import ConversationBufferMemory
from langchain_core.runnables import RunnablePassthrough
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes
  • Import essential libraries (LangChain for agent framework, ibm-watsonx-ai, etc.).
  • Use .env to secure sensitive credentials like APIKEY and PROJECT_ID.
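
The model and embedding code in the following steps reads a `credentials` dictionary and a `project_id` value. Here is a minimal sketch of one way to populate them from environment variables. The variable names `APIKEY`, `PROJECT_ID`, and `WATSONX_URL`, the helper name `load_watsonx_settings`, and the default us-south endpoint are assumptions for illustration, so adjust them to match your own `.env` file and region.

```python
import os

try:
    # python-dotenv copies the entries of a local .env file into os.environ.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # fall back to variables already exported in the shell


def load_watsonx_settings(env=None):
    """Return (credentials, project_id) built from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in ("APIKEY", "PROJECT_ID") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    credentials = {
        # Assumed default: the us-south (Dallas) watsonx.ai endpoint.
        "url": env.get("WATSONX_URL", "https://us-south.ml.cloud.ibm.com"),
        "apikey": env["APIKEY"],
    }
    return credentials, env["PROJECT_ID"]
```

Calling `credentials, project_id = load_watsonx_settings()` once after the imports makes both names available to the later cells, and it fails early with a clear message if a key is missing.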

4. Initializing a Basic Agent

The Setup:

  • Model Parameters: Use IBM’s Granite-3.0-8B-Instruct LLM with defined decoding methods, temperature, token limits, and stop sequences.
  • Prompt Template: A reusable format to guide agent responses.
llm = WatsonxLLM(
    model_id= "ibm/granite-3-8b-instruct", 
    url=credentials.get("url"),
    apikey=credentials.get("apikey"),
    project_id=project_id,
    params={
        GenParams.DECODING_METHOD: "greedy",
        GenParams.TEMPERATURE: 0,
        GenParams.MIN_NEW_TOKENS: 5,
        GenParams.MAX_NEW_TOKENS: 250,
        GenParams.STOP_SEQUENCES: ["Human:", "Observation"],
    },
)
template = "Answer the {query} accurately. If you do not know the answer, simply say you do not know."
prompt = PromptTemplate.from_template(template)
agent = prompt | llm
agent.invoke({"query": "What sport is played at the US Open?"})
Output
'\n\nThe sport played at the US Open is tennis.'
agent.invoke({"query": "Where was the 2024 US Open Tennis Championship?"})
Output
' Do not make up an answer.\n\nThe 2024 US Open Tennis Championship has not
been officially announced yet, so the location is not confirmed. Therefore,
I do not know the answer to this question.'

5. Building a Knowledge Base

This step enables the agent to retrieve specific contextual information.

  1. Data Collection: Use URLs to fetch content via LangChain’s WebBaseLoader.
  2. Chunking: Split data into manageable pieces using RecursiveCharacterTextSplitter.
  3. Embedding: Convert documents into vector representations using IBM’s Slate model.
  4. Vector Store: Store embeddings in Chroma DB.
urls = [
    "https://www.ibm.com/case-studies/us-open",
    "https://www.ibm.com/sports/usopen",
    "https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement",
    "https://newsroom.ibm.com/2024-08-15-ibm-and-the-usta-serve-up-new-and-enhanced-generative-ai-features-for-2024-us-open-digital-platforms",
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
docs_list[0]
Output
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
doc_splits = text_splitter.split_documents(docs_list)

# The embedding model that we are using is an IBM Slate™ model through the watsonx.ai embeddings service. Let's initialize it.
embeddings = WatsonxEmbeddings(
    model_id=EmbeddingTypes.IBM_SLATE_30M_ENG.value,
    url=credentials["url"],
    apikey=credentials["apikey"],
    project_id=project_id,
)

# In order to store our embedded documents, we will use Chroma DB, an open source vector store.
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="agentic-rag-chroma",
    embedding=embeddings,
)

To query this knowledge base, we must set up a retriever that accesses the information in the vector store.

retriever = vectorstore.as_retriever()
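
To make the retrieval step concrete, here is a dependency-free sketch of what a vector-store retriever does conceptually: embed the query, score every stored chunk by cosine similarity, and return the best matches. The toy bag-of-words `embed` function, the sample chunks, and the `retrieve` helper are illustrative stand-ins, not the Slate model or Chroma's actual implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]


chunks = [
    "the us open is played in new york",
    "ibm builds ai features for the us open app",
    "chroma stores embedding vectors",
]
top = retrieve("where is the us open played", chunks, k=1)
print(top)
```

The real retriever performs the same kind of top-k lookup over dense embeddings. If you want more or fewer chunks per query, `vectorstore.as_retriever(search_kwargs={"k": 3})` is one way to set the count.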

6. Defining Tools

  • Create tools, like get_IBM_US_Open_context, for specialized queries.
  • Tools guide the agent to retrieve specific information from the vector store.
@tool
def get_IBM_US_Open_context(question: str):
    """Get context about IBM's involvement in the 2024 US Open Tennis Championship."""
    context = retriever.invoke(question)
    return context
tools = [get_IBM_US_Open_context]

7. Advanced Prompt Template

  • System Prompt: Guides the agent on formatting, tool usage, and decision-making logic.
  • Human Prompt: Handles user inputs and intermediary steps.
  • Combine these into a structured ChatPromptTemplate.
system_prompt = """Respond to the human as helpfully and accurately as possible. You have access to the following tools: {tools}
Use a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).
Valid "action" values: "Final Answer" or {tool_names}
Provide only ONE action per $JSON_BLOB, as shown:
```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Follow this format:
Question: input question to answer
Thought: consider previous and subsequent steps
Action:
```
$JSON_BLOB
```
Observation: action result
... (repeat Thought/Action/Observation N times)
Thought: I know what to respond
Action:
```
{{
  "action": "Final Answer",
  "action_input": "Final response to human"
}}
```
Begin! Reminder to ALWAYS respond with a valid json blob of a single action.
Respond directly if appropriate. Format is Action:```$JSON_BLOB```then Observation"""
human_prompt = """{input}
{agent_scratchpad}
(reminder to always respond in a JSON blob)"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history", optional=True),
        ("human", human_prompt),
    ]
)
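
The system prompt asks the model to wrap every action in a fenced JSON blob. The sketch below shows, using only the standard library, how such a reply could be turned into a tool name and a tool input. `parse_action` is a hypothetical helper written for illustration; LangChain's own JSONAgentOutputParser may differ in detail.

```python
import json
import re

# Triple backtick, built programmatically so the listing is easy to embed.
TICKS = "`" * 3
# Matches the first fenced block, with an optional "json" language tag.
FENCE = re.compile(re.escape(TICKS) + r"(?:json)?\s*(.*?)" + re.escape(TICKS), re.DOTALL)


def parse_action(reply):
    """Extract (action, action_input) from a model reply containing one JSON blob."""
    match = FENCE.search(reply)
    payload = match.group(1) if match else reply  # tolerate an unfenced blob
    blob = json.loads(payload.strip())
    return blob["action"], blob["action_input"]


reply = (
    "Thought: I need context about the tournament.\n"
    "Action:\n"
    + TICKS + "\n"
    + '{"action": "get_IBM_US_Open_context", "action_input": {"question": "venue"}}\n'
    + TICKS
)
action, action_input = parse_action(reply)
print(action, action_input)
```

An action of "Final Answer" means the agent is done; any other action names the tool to call with `action_input`.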

8. Adding Memory and Chains

  • Memory: Store historical interactions to refine responses using ConversationBufferMemory.
  • Agent Chain: Combine the prompt, LLM, tools, and memory into an AgentExecutor.
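
Step 9 calls an `agent_executor` object. Here is a minimal sketch of how the prompt, LLM, tools, and memory described above could be combined into one, using the classes imported in step 3. It assumes a LangChain release in which those import paths still resolve (roughly the 0.2/0.3 series). The wrapper name `build_agent_executor` is my own, and the imports sit inside it only so the function can be defined before the libraries are loaded.

```python
def build_agent_executor(llm, tools, prompt):
    """Combine prompt, LLM, tools, and memory into an AgentExecutor (sketch)."""
    from langchain.agents import AgentExecutor
    from langchain.agents.format_scratchpad import format_log_to_str
    from langchain.agents.output_parsers import JSONAgentOutputParser
    from langchain.memory import ConversationBufferMemory
    from langchain.tools.render import render_text_description_and_args
    from langchain_core.runnables import RunnablePassthrough

    # Fill the {tools} and {tool_names} slots of the system prompt.
    prompt = prompt.partial(
        tools=render_text_description_and_args(list(tools)),
        tool_names=", ".join(t.name for t in tools),
    )

    # Buffer memory keeps the running conversation under the "history" key.
    memory = ConversationBufferMemory()

    chain = (
        RunnablePassthrough.assign(
            # Earlier Thought/Action/Observation steps, rendered as text.
            agent_scratchpad=lambda x: format_log_to_str(x["intermediate_steps"]),
            # Prior turns, injected into the optional chat_history placeholder.
            chat_history=lambda x: memory.chat_memory.messages,
        )
        | prompt
        | llm
        | JSONAgentOutputParser()
    )

    return AgentExecutor(
        agent=chain,
        tools=tools,
        memory=memory,
        handle_parsing_errors=True,  # recover when the model emits malformed JSON
        verbose=True,                # print each intermediate step
    )
```

With the objects from the earlier steps, `agent_executor = build_agent_executor(llm, tools, prompt)` yields an executor that can be invoked as in step 9.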

9. Testing and Using the RAG System

  • Verify behavior for complex queries requiring tools (e.g., retrieving IBM’s US Open involvement).
  • Ensure fallback to basic knowledge for straightforward questions (e.g., “What is the capital of France?”).
agent_executor.invoke({"input": "Where was the 2024 US Open Tennis Championship?"})
Output
{'input': 'Where was the 2024 US Open Tennis Championship?',
 'history': '',
 'output': 'The 2024 US Open Tennis Championship was held at the USTA Billie
Jean King National Tennis Center in Flushing, Queens, New York.'}

Great! The agent used its available RAG tool to return the location of the 2024 US Open, as the user's query requested. We even get to see the exact document the agent retrieves its information from. Now, let's try a slightly more complex query. This time, the query will be about IBM's involvement in the 2024 US Open.
agent_executor.invoke(
    {"input": "How did IBM use watsonx at the 2024 US Open Tennis Championship?"}
)
Output
{'input': 'How did IBM use watsonx at the 2024 US Open Tennis Championship?',
 'history': 'Human: Where was the 2024 US Open Tennis Championship?\nAI: The
2024 US Open Tennis Championship was held at the USTA Billie Jean King
National Tennis Center in Flushing, Queens, New York.',
 'output': 'IBM used watsonx at the 2024 US Open Tennis Championship to
create generative AI-powered features such as Match Reports, AI Commentary,
and SlamTracker. These features enhance the digital experience for fans and
scale the productivity of the USTA editorial team.'}

How Does It Work in Practice?

  1. Query Processing: The agent parses the user’s query.
  2. Decision Making: Determines whether to use tools or respond directly.
  3. Tool Interaction: If necessary, invoke the tool (e.g., get_IBM_US_Open_context).
  4. Final Response: Combines retrieved data or knowledge base information to provide an accurate answer.
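
The four steps above can be sketched as a small loop. This is a simplified, dependency-free illustration, not LangChain's AgentExecutor: `scripted_model` is a hypothetical stand-in that replays canned JSON replies instead of calling an LLM, and `lookup_context` is a toy tool.

```python
import json


def lookup_context(question):
    """Toy tool: pretend to fetch context from a vector store."""
    return "The 2024 US Open was held in Flushing, Queens, New York."


TOOLS = {"lookup_context": lookup_context}


def scripted_model(transcript):
    """Stand-in for an LLM: request the tool first, then answer from the observation."""
    if "Observation:" not in transcript:
        return json.dumps({"action": "lookup_context", "action_input": "2024 US Open venue"})
    observation = transcript.rsplit("Observation: ", 1)[1]
    return json.dumps({"action": "Final Answer", "action_input": observation})


def run_agent(question, model, tools, max_steps=5):
    """Ask the model for an action, run the chosen tool, and stop at 'Final Answer'."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = json.loads(model(transcript))  # the model's decision for this turn
        if step["action"] == "Final Answer":
            return step["action_input"]       # final response to the user
        result = tools[step["action"]](step["action_input"])  # tool interaction
        transcript += f"\nObservation: {result}"  # feed the result back to the model
    raise RuntimeError("No final answer within the step limit")


answer = run_agent("Where was the 2024 US Open?", scripted_model, TOOLS)
print(answer)
```

The step limit guards against a model that keeps requesting tools without ever producing a final answer.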

This structured system combines IBM’s watsonx.ai, LangChain, and machine learning to build a versatile, knowledge-augmented AI agent tailored for both general and domain-specific queries.


Conclusion

RAG (Retrieval-Augmented Generation) enhances LLMs by combining external data retrieval with generative capabilities, improving accuracy and relevance and reducing hallucinations. However, it struggles with complex, multi-step queries. Agentic RAG advances this by integrating intelligent agents that dynamically select tools, refine queries, and handle specialized tasks like code generation or visualizations. It supports multi-agent collaboration, ensuring adaptability, scalability, and precise context-aware responses. While traditional RAG suits basic Q&A and research, Agentic RAG excels in dynamic, data-intensive applications like real-time analysis and enterprise systems. Agentic RAG’s modularity and intelligence make it ideal for tackling complex tasks beyond the scope of traditional RAG systems.

I hope you find this guide helpful in understanding RAG vs Agentic RAG! If you have any questions about the article, comment below.

Frequently Asked Questions

Q1. What is the primary difference between RAG and Agentic RAG?

Ans. RAG focuses on integrating retrieval and generation capabilities to improve AI outputs by grounding responses with external knowledge. Agentic RAG, on the other hand, incorporates intelligent agents that can autonomously select tools, refine queries, and adapt to complex, multi-step tasks.

Q2. Why is Agentic RAG considered more advanced than RAG?

Ans. Agentic RAG enables decision-making and dynamic planning, allowing it to handle real-time data, multi-tool integration, and context-aware reasoning, making it ideal for sophisticated, task-specific applications.

Q3. How does Agentic RAG improve the handling of ambiguous or complex queries?

Ans. Agentic RAG employs agents such as routing agents to direct queries, query planning agents to break down multi-step tasks, and ReAct agents for iterative reasoning and actions, ensuring precise and contextual responses.

Q4. What are the key challenges with traditional RAG, and how does Agentic RAG address them?

Ans. Traditional RAG struggles with contextual understanding, synthesis, and scalability. Agentic RAG overcomes these by dynamically adapting to user inputs, integrating diverse data sources, and leveraging multi-agent collaboration for efficient task management.

Q5. In what scenarios is Agentic RAG preferable over traditional RAG?

Ans. Agentic RAG is ideal for applications requiring real-time updates, multi-step reasoning, and integration with multiple tools, such as enterprise systems, data analytics, and domain-specific AI systems. Traditional RAG suits simpler, static tasks like basic Q&A or static content retrieval.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
