Building a RAG (Retrieval-Augmented Generation) application isn’t just about plugging in a few tools—it’s about choosing the right stack that makes retrieval and generation not just possible but efficient and scalable.
Let’s say you’re working on something like “Smart Chat with PDF”—an AI app that lets users interact with PDFs conversationally. It’s not as simple as just loading a file and asking questions. You need to extract and parse the PDF content, split it into chunks, embed those chunks, store them in a vector database, retrieve the most relevant pieces for each question, and then generate a grounded answer with an LLM.
Sounds like a lot? It is. Working across multiple tools, frameworks, and databases can get overwhelming fast.
That’s exactly why I created the RAG Developer’s Stack—a curated set of tools and frameworks designed to streamline this whole process. From smart data extractors to efficient vector databases and cost-effective generation models, it’s everything you need to build robust, production-ready RAG applications without reinventing the wheel every time.
Firstly, here is a brief on RAG: Retrieval-Augmented Generation (RAG) enhances the capabilities of large language models (LLMs) by integrating external information retrieval mechanisms. This approach allows LLMs to generate more accurate, contextually relevant, and factually grounded responses by supplementing their static training data with up-to-date or domain-specific information.
RAG operates in four key stages: indexing (ingesting, chunking, and embedding your data), retrieval (finding the chunks most relevant to a query), augmentation (adding the retrieved context to the prompt), and generation (producing the final answer with an LLM).
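To make those stages concrete, here is a tiny, dependency-free sketch; the bag-of-words "embedding" and the printed prompt are toy stand-ins for a real embedding model and LLM call:

```python
from collections import Counter

# Toy stand-ins for a real embedding model and similarity metric
def embed(text):
    return Counter(text.lower().split())

def similarity(a, b):
    return sum((a & b).values())  # word-overlap score

# 1. Indexing: collect document chunks and store their "embeddings"
docs = [
    "RAG combines retrieval with generation.",
    "Vector databases store embeddings for similarity search.",
]
store = {chunk: embed(chunk) for chunk in docs}

# 2. Retrieval: rank stored chunks by similarity to the query
query = "What do vector databases store?"
top_chunk = max(store, key=lambda c: similarity(store[c], embed(query)))

# 3. Augmentation: inject the retrieved chunk into the prompt
prompt = f"Context: {top_chunk}\n\nQuestion: {query}\nAnswer:"

# 4. Generation: in a real system, an LLM call would answer from this prompt
print(prompt)
```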
Now, why do you need a RAG developer stack? Because each of these stages relies on a different class of tools, and choosing them well determines how accurate, fast, and affordable your application ends up being. Here are the 9 components you should know to develop RAG projects:
LLMs are the brains of RAG systems, leveraging transformer-based architectures to generate coherent and contextually relevant text. These models come in two categories: proprietary (API-based) models and open-source models, both compared in the tables below.
I have already loaded the JSON documents using the JSONLoader; here is the pipeline showing how the LLM is used in RAG.
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = """You are an assistant who is an expert in question-answering tasks.
Answer the following question using only the following pieces of retrieved context.
If the answer is not in the context, do not make up answers, just say that you don't know.
Keep the answer detailed and well formatted based on the information from the context.
Question:
{question}
Context:
{context}
Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Initialize ChatGPT model
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

# Format retrieved documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Construct the RAG pipeline
# (similarity_retriever is the Chroma retriever created in the vector database section below)
qa_rag_chain = (
    {
        "context": (similarity_retriever | format_docs),
        "question": RunnablePassthrough()
    }
    | rag_prompt_template
    | chatgpt
)

query = "What is the difference between AI, ML, and DL?"
result = qa_rag_chain.invoke(query)

# Display the generated answer
from IPython.display import display, Markdown
display(Markdown(result.content))
In Retrieval-Augmented Generation (RAG) systems, the response generation LLM plays an important role as the final decision-maker — it takes the retrieved documents, user query, and context and synthesizes everything into a coherent, relevant, and often conversational response. While retrieval models bring in potentially useful information, the LLM can reason, summarize, and contextualize, which ensures the output feels intelligent and human-like.
A strong response model can filter noisy or partial information, infer unstated connections, and deliver answers that align with user intent. This is especially critical in applications like enterprise search, customer support, legal/medical assistants, and technical Q&A, where users expect precise, grounded, and trustworthy responses.
In a nutshell, without a capable generation model, even the best retrieval stack falls flat — making this component the core brain of any RAG pipeline.
Popular proprietary (API-based) models:

Model | Developer | Key Strengths | Common Use Cases |
---|---|---|---|
GPT-4.5 | OpenAI | Advanced text generation, summarization, conversational fluency | Chatbots, customer support, content creation |
Claude 3.7 Sonnet | Anthropic | Real-time conversations, strong reasoning, “extended thinking mode” | Business automation, customer service |
Gemini 2.0 Pro | Google DeepMind | Multimodal (text + image), high performance | Data analysis, enterprise automation, content generation |
Cohere Command R+ | Cohere | Retrieval-Augmented Generation (RAG), enterprise-grade design | Knowledge management, support automation, moderation |
DeepSeek | DeepSeek AI | On-premise deployment, secure data handling, high customizability | Finance, healthcare, privacy-sensitive industries |
Popular open-source models:

Model | Developer | Key Strengths | Common Use Cases |
---|---|---|---|
LLaMA 3 | Meta | Scalable (up to 405B params), multimodal capabilities | Conversational AI, research, content generation |
Mistral 7B | Mistral AI | Lightweight yet powerful, optimized for code and chat | Code generation, chatbots, content automation |
Falcon 180B | Technology Innovation Institute | Efficient, high-performance, open-access | Real-time applications, science/research bots |
DeepSeek R1 | DeepSeek AI | Strong logic/reasoning, 128K context window | Math tasks, summarization, complex reasoning |
Qwen2.5-72B-Instruct | Alibaba Cloud | 72.7B parameters, long contexts up to 128K tokens, strong coding, mathematical reasoning, and multilingual support | Structured outputs (e.g., JSON) and technical RAG workflows |
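Most frameworks let you swap any of these generation models with a one-line change. Here is a hedged sketch using LangChain's init_chat_model; the model identifiers are illustrative, and the corresponding provider packages and API keys are assumed to be set up:

```python
# Swap generation models behind a common interface
# (model names are illustrative; check each provider's current identifiers)
from langchain.chat_models import init_chat_model

openai_llm = init_chat_model("gpt-4o-mini", model_provider="openai")
claude_llm = init_chat_model("claude-3-7-sonnet-latest", model_provider="anthropic")
gemini_llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

# Any of these can be dropped into the RAG chains shown in this article
print(openai_llm.invoke("In one line, what is RAG?").content)
```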
Frameworks simplify the development of RAG applications by providing pre-built components.
Frameworks like LangChain, LangGraph, and LlamaIndex significantly streamline RAG (Retrieval-Augmented Generation) development by offering modular tools for integrating retrieval and generation processes. LangChain simplifies chaining LLM calls, managing prompts, and connecting to vector stores. LangGraph introduces graph-based flow control, enabling dynamic and multi-step RAG workflows. LlamaIndex focuses on data ingestion, indexing, and retrieval, making large datasets queryable by LLMs. Together, they abstract away complex infrastructure, allowing developers to focus on logic and data quality. These tools enable rapid prototyping and robust deployment of RAG applications for tasks like question answering, document search, and knowledge assistance.
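Since the hands-on example below uses LangChain and LangGraph, here is what the same idea looks like in LlamaIndex: a minimal, hedged sketch in which the directory path and query are illustrative and an OpenAI API key is assumed:

```python
# Minimal LlamaIndex sketch: ingest a folder of documents, index them, and query
# (requires `pip install llama-index`; "./rag_docs" is an illustrative path)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./rag_docs").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What is the attention mechanism?"))
```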
Let’s build a simple RAG using LangChain:
%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph
!pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")

from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embeddings)

import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")

# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

response = graph.invoke({"question": "What are Types of Memory?"})
print(response["answer"])
Output:

The types of memory include Sensory Memory, Short-Term Memory (STM), and Long-Term Memory (LTM). Sensory Memory retains impressions of sensory information for a few seconds, while Short-Term Memory holds currently relevant information for 20-30 seconds. Long-Term Memory can store information for days to decades and includes explicit (declarative) and implicit (procedural) memory.
If your data lives in other sources, dedicated data extraction tools work very well. RAG applications require robust tools for extracting structured and unstructured data from sources such as PDFs, web pages, and databases. Here is an example that loads a PDF with PyPDFLoader:
pip install -U langchain-community
%pip install langchain pypdf
from langchain.document_loaders import PyPDFLoader

# Define the path to your PDF file
pdf_path = "/content/Multimodal Agent Using Agno Framework.pdf"

# Initialize the PyPDFLoader
loader = PyPDFLoader(pdf_path)

# Load the PDF and split it into pages
documents = loader.load()

# Print the content of each page
for i, doc in enumerate(documents):
    print(f"Page {i + 1} Content:")
    print(doc.page_content)
    print("\n")
Text embeddings transform textual data into numerical vectors so that semantically similar pieces of text can be retrieved by comparing their vectors. Beyond text embeddings, multimodal embeddings (for images, audio, and video) can also be plugged into RAG pipelines.
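To make the idea of numerical vectors concrete, here is a minimal sketch (assuming an OpenAI API key is set) that embeds two sentences and scores how similar they are:

```python
# Embed two sentences and compare them with cosine similarity
import numpy as np
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-3-small")

v1 = np.array(embedder.embed_query("How do I reset my password?"))
v2 = np.array(embedder.embed_query("Steps to recover account access"))

cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"Cosine similarity: {cosine:.3f}")  # semantically related sentences score higher
```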
Several providers (OpenAI, Cohere, Hugging Face, and others) offer embedding models. Below, we chunk a set of PDFs and then embed them with OpenAI's text-embedding-3-small:
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_simple_chunks(file_path, chunk_size=3500, chunk_overlap=200):
    # Progress messages correspond to the output shown below
    print('Loading pages:', file_path)
    loader = PyMuPDFLoader(file_path)
    doc_pages = loader.load()
    print('Chunking pages:', file_path)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    doc_chunks = splitter.split_documents(doc_pages)
    print('Finished processing:', file_path)
    return doc_chunks

from glob import glob

pdf_files = glob('./rag_docs/*.pdf')

# Process PDF files
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(create_simple_chunks(file_path=fp))
Output:

Loading pages: ./rag_docs/cnn_paper.pdf
Chunking pages: ./rag_docs/cnn_paper.pdf
Finished processing: ./rag_docs/cnn_paper.pdf
Loading pages: ./rag_docs/attention_paper.pdf
Chunking pages: ./rag_docs/attention_paper.pdf
Finished processing: ./rag_docs/attention_paper.pdf
Loading pages: ./rag_docs/vision_transformer.pdf
Chunking pages: ./rag_docs/vision_transformer.pdf
Finished processing: ./rag_docs/vision_transformer.pdf
Loading pages: ./rag_docs/resnet_paper.pdf
Chunking pages: ./rag_docs/resnet_paper.pdf
Finished processing: ./rag_docs/resnet_paper.pdf
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Initialize embedding model
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

# Combine documents
# (wiki_docs_processed comes from a separate Wikipedia ingestion step not shown here)
total_docs = wiki_docs_processed + paper_docs

# Create and save vector database
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")
Vector databases store embeddings (numerical representations of text or other data), enabling efficient retrieval of semantically similar chunks. Examples include Chroma, Pinecone, Weaviate, Qdrant, Milvus, and FAISS.
Note: We already created the embeddings and stored them in a Chroma database above; now we load the persisted database and set up a retriever over it.
# Load the persisted vector database from disk
chroma_db = Chroma(persist_directory="./my_db",
                   collection_name='my_db',
                   embedding_function=openai_embed_model)

# Create a similarity retriever that returns the top 5 chunks
similarity_retriever = chroma_db.as_retriever(search_type="similarity", search_kwargs={"k": 5})

# Query for semantic similarity
query = "What is machine learning?"
top_docs = similarity_retriever.invoke(query)

# Display results
from IPython.display import display, Markdown

def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()

display_docs(top_docs)
Rerankers refine the retrieval process by improving the relevance of the retrieved documents. They operate in a two-stage retrieval pipeline: a fast first-stage retriever (for example, vector similarity search) pulls a broad set of candidate chunks, and the reranker, typically a cross-encoder, then scores each query-document pair and reorders the candidates so the most relevant ones reach the LLM. By integrating rerankers into the stack, developers can ensure higher-quality responses tailored to user queries while keeping retrieval efficient.
Also read: Comprehensive Guide on Reranker for RAG
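Before wiring up the Cohere reranker below, note that you can also rerank locally. Here is a hedged sketch with a sentence-transformers cross-encoder; the model name is a common public checkpoint and the candidate texts are illustrative:

```python
# Rerank retrieved chunks locally with a cross-encoder (pip install sentence-transformers)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is a vector database?"
candidates = [
    "Vector databases store embeddings for similarity search.",
    "The Transformer architecture relies on self-attention.",
    "Chroma persists collections to a local directory.",
]

# Score each (query, candidate) pair and keep the highest-scoring chunks first
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```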
%pip install --upgrade --quiet cohere langchain-cohere
Set up Cohere and the ContextualCompressionRetriever:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere
from langchain.chains import RetrievalQA

llm = Cohere(temperature=0)

# First-stage retriever: cast a wide net (here, the Chroma database built earlier)
retriever = chroma_db.as_retriever(search_type="similarity", search_kwargs={"k": 10})

# Second stage: Cohere's reranker re-orders the candidates by relevance
compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

chain = RetrievalQA.from_chain_type(
    llm=Cohere(temperature=0), retriever=compression_retriever
)
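A quick usage sketch (the question is illustrative); RetrievalQA expects its input under the "query" key and returns the answer under "result":

```python
# Ask a question through the reranked retrieval chain
result = chain.invoke({"query": "What is the attention mechanism?"})
print(result["result"])
```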
Evaluation ensures the accuracy and relevance of RAG systems. Frameworks such as DeepEval score a pipeline on metrics like answer relevancy, faithfulness, and contextual precision, recall, and relevancy. The end-to-end example below builds a small Qdrant-backed RAG pipeline and evaluates it with DeepEval:
from datasets import load_dataset
from qdrant_client import QdrantClient
from tqdm import tqdm
from langchain.docstore.document import Document as LangchainDocument
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
import deepeval
# Get your key from https://platform.openai.com/api-keys
OPENAI_API_KEY = "<OPENAI_API_KEY>"
# Get your Confident AI API key from https://app.confident-ai.com
CONFIDENT_AI_API_KEY = "<CONFIDENT_AI_API_KEY>"
# Get a FREE forever cluster at https://cloud.qdrant.io/
# More info: https://qdrant.tech/documentation/cloud/create-cluster/
QDRANT_URL = "<QDRANT_URL>"
QDRANT_API_KEY = "<QDRANT_API_KEY>"
COLLECTION_NAME = "qdrant-deepeval"
EVAL_SIZE = 10
RETRIEVAL_SIZE = 3
dataset = load_dataset("atitaarora/qdrant_doc", split="train")

langchain_docs = [
    LangchainDocument(
        page_content=doc["text"], metadata={"source": doc["source"]}
    )
    for doc in tqdm(dataset)
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])
client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

docs_contents, docs_metadatas = [], []

for doc in docs_processed:
    if hasattr(doc, "page_content") and hasattr(doc, "metadata"):
        docs_contents.append(doc.page_content)
        docs_metadatas.append(doc.metadata)
    else:
        print(
            "Warning: Some documents do not have 'page_content' or 'metadata' attributes."
        )

# Uses FastEmbed - https://qdrant.tech/documentation/fastembed/
# to generate embeddings for the documents
# The default model is `BAAI/bge-small-en-v1.5`
client.add(
    collection_name=COLLECTION_NAME,
    metadata=docs_metadatas,
    documents=docs_contents,
)
openai_client = OpenAI(api_key=OPENAI_API_KEY)

def query_with_context(query, limit):
    # Retrieve the most relevant chunks from Qdrant
    search_result = client.query(
        collection_name=COLLECTION_NAME, query_text=query, limit=limit
    )

    contexts = [
        "document: " + r.document + ",source: " + r.metadata["source"]
        for r in search_result
    ]

    prompt_start = """ You're assisting a user who has a question based on the documentation.
Your goal is to provide a clear and concise response that addresses their query while referencing relevant information
from the documentation.
Remember to:
Understand the user's question thoroughly.
If the user's query is general (e.g., "hi," "good morning"),
greet them normally and avoid using the context from the documentation.
If the user's query is specific and related to the documentation, locate and extract the pertinent information.
Craft a response that directly addresses the user's query and provides accurate information
referring the relevant source and page from the 'source' field of fetched context from the documentation to support your answer.
Use a friendly and professional tone in your response.
If you cannot find the answer in the provided context, do not pretend to know it.
Instead, respond with "I don't know".
Context:\n"""

    prompt_end = f"\n\nQuestion: {query}\nAnswer:"
    prompt = prompt_start + "\n\n---\n\n".join(contexts) + prompt_end

    res = openai_client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        temperature=0,
        max_tokens=636,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
    )
    return (contexts, res.choices[0].text)
qdrant_qna_dataset = load_dataset("atitaarora/qdrant_doc_qna", split="train")

def create_deepeval_dataset(dataset, eval_size, retrieval_window_size):
    test_cases = []
    for i in range(eval_size):
        entry = dataset[i]
        question = entry["question"]
        answer = entry["answer"]
        context, rag_response = query_with_context(
            question, retrieval_window_size
        )
        test_case = deepeval.test_case.LLMTestCase(
            input=question,
            actual_output=rag_response,
            expected_output=answer,
            retrieval_context=context,
        )
        test_cases.append(test_case)
    return test_cases

test_cases = create_deepeval_dataset(
    qdrant_qna_dataset, EVAL_SIZE, RETRIEVAL_SIZE
)

deepeval.login_with_confident_api_key(CONFIDENT_AI_API_KEY)

deepeval.evaluate(
    test_cases=test_cases,
    metrics=[
        deepeval.metrics.AnswerRelevancyMetric(),
        deepeval.metrics.FaithfulnessMetric(),
        deepeval.metrics.ContextualPrecisionMetric(),
        deepeval.metrics.ContextualRecallMetric(),
        deepeval.metrics.ContextualRelevancyMetric(),
    ],
)
Platforms enabling local or API-based access to open LLMs include Ollama, Hugging Face, Groq, and LM Studio. For example, to run models locally with Ollama, download it from the official site or install it with:
curl -fsSL https://ollama.com/install.sh | sh
After this, pull the DeepSeek R1:1.5b using:
ollama pull deepseek-r1:1.5b
!pip install langchain==0.3.11
!pip install langchain-openai==0.2.12
!pip install langchain-community==0.3.11
!pip install langchain-chroma==0.1.4
from langchain_openai import OpenAIEmbeddings
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('AgenticAI.pdf')
pages = loader.load_and_split()
texts = [doc.page_content for doc in pages]

from langchain_chroma import Chroma
chroma_db = Chroma.from_texts(
    texts=texts,
    collection_name='db_docs',
    collection_metadata={"hnsw:space": "cosine"},  # Set distance function to cosine
    embedding=openai_embed_model
)
from langchain_core.prompts import ChatPromptTemplate
prompt = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If no context is present or if you don't know the answer, just say that you don't know.
Do not make up the answer unless it is there in the provided context.
Keep the answer concise and to the point with regard to the question.
Question:
{question}
Context:
{context}
Answer:
"""
prompt_template = ChatPromptTemplate.from_template(prompt)
from langchain_community.llms import Ollama
deepseek = Ollama(model="deepseek-r1:1.5b")

# Build a retriever over the Chroma collection created above
# (the threshold values are illustrative)
similarity_threshold_retriever = chroma_db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.3}
)

from langchain.chains import RetrievalQA
rag_chain = RetrievalQA.from_chain_type(llm=deepseek,
                                        chain_type="stuff",
                                        retriever=similarity_threshold_retriever,
                                        chain_type_kwargs={"prompt": prompt_template})

query = "Tell the Leaders’ Perspectives on Agentic AI"
rag_chain.invoke(query)
Output (truncated):

{'query': 'Tell the Leaders’ Perspectives on Agentic AI', ...}
Building effective RAG applications isn’t just about plugging in a language model—it’s about choosing the right RAG Developer stack across the board, from frameworks and embeddings to vector databases and retrieval tools. When these components are thoughtfully integrated, they enable intelligent, scalable systems that can chat with PDFs, pull relevant facts in real time, and generate context-aware responses. As the ecosystem continues to evolve, staying agile with your tools and grounded in solid architecture will be key to building reliable, future-proof AI solutions.