A Comprehensive Guide to RAG Developer Stack

Pankaj Singh | Last Updated: 03 Apr, 2025
13 min read

Building a RAG (Retrieval-Augmented Generation) application isn’t just about plugging in a few tools—it’s about choosing the right stack that makes retrieval and generation not just possible but efficient and scalable.

Let’s say you’re working on something like “Smart Chat with PDF”—an AI app that lets users interact with PDFs conversationally. It’s not as simple as just loading a file and asking questions. You need to:

  1. Extract relevant content from the PDF
  2. Chunk the text into meaningful pieces
  3. Store those chunks in a vector database
  4. Then, when a user asks something, the app runs a similarity search, fetches the most relevant chunks, and passes them to the language model to generate a coherent and accurate response

Sounds like a lot? It is. Working across multiple tools, frameworks, and databases can get overwhelming fast.

That’s exactly why I created the RAG Developer’s Stack—a curated set of tools and frameworks designed to streamline this whole process. From smart data extractors to efficient vector databases and cost-effective generation models, it’s everything you need to build robust, production-ready RAG applications without reinventing the wheel every time.

Why You Need a RAG Developer Stack

RAG architecture (Source: Hugging Face)

First, a brief on RAG: Retrieval-Augmented Generation (RAG) enhances the capabilities of large language models (LLMs) by integrating external information retrieval mechanisms. This approach lets LLMs generate more accurate, contextually relevant, and factually grounded responses by supplementing their static training data with up-to-date or domain-specific information.

How does RAG work?

RAG operates in four key stages:

  1. Indexing: Data from external sources (e.g., documents, databases) is converted into vector representations (embeddings) and stored in a vector database. This enables efficient retrieval of relevant information.
  2. Retrieval: When a user submits a query, the system retrieves the most relevant data from the indexed sources using similarity-based search techniques.
  3. Augmentation: The retrieved information is combined with the user’s query through prompt engineering, effectively “augmenting” the input to the LLM.
  4. Generation: The LLM uses both its internal knowledge and the augmented prompt to produce a response. This process ensures that the output is informed by both pre-trained data and real-time, authoritative sources.
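
To make these stages concrete, here is a minimal end-to-end sketch using LangChain components (the specific classes, toy documents, and query are illustrative assumptions; it expects the langchain-openai package and an OPENAI_API_KEY):

# Minimal four-stage RAG sketch (illustrative; assumes langchain-core, langchain-openai, OPENAI_API_KEY)
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# 1. Indexing: embed documents and store them in a vector store
docs = [Document(page_content="RAG combines retrieval with generation."),
        Document(page_content="Embeddings map text to numerical vectors.")]
vector_store = InMemoryVectorStore(OpenAIEmbeddings(model="text-embedding-3-small"))
vector_store.add_documents(docs)

# 2. Retrieval: fetch the chunks most similar to the user query
query = "How does RAG work?"
retrieved = vector_store.similarity_search(query, k=2)

# 3. Augmentation: combine the query and the retrieved context in a prompt
context = "\n\n".join(d.page_content for d in retrieved)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# 4. Generation: the LLM produces a grounded response
llm = ChatOpenAI(model="gpt-4o-mini")
print(llm.invoke(prompt).content)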

Now, why do you need a RAG developer stack?

Why Do You Need a RAG Developer Stack?

  • Accelerate Development: Leverage pre-built, ready-to-integrate components to move from prototype to production faster.
  • Boost Accuracy: Retrieve real-time, context-relevant data to ground responses and reduce hallucinations.
  • Strengthen Deployment: Built-in tools enhance security, observability, and scalability, making production readiness a smoother ride.
  • Maximize Flexibility: Modular design lets you mix and match tools, adapting to the unique demands of different industries and use cases.
  • Customizable by Design: Developers can hand-pick components that fit their workflow, architecture, and performance goals.

RAG Developer Stack for Your Next Project

Here are the nine components you should know to build RAG projects:

1. Large Language Models (LLMs)


LLMs are the brains of RAG systems, leveraging transformer-based architectures to generate coherent and contextually relevant text. These models come in two categories:

  • Open-source LLMs: Examples include LLaMA, Falcon, Mistral, and more, which allow customization and local deployment.
  • Closed LLMs: Proprietary models like GPT-4 and Gemini offer advanced capabilities but are typically accessible only via APIs.

Example of LLM Usage in RAG

Assume the JSON documents have already been loaded with the JSONLoader; the pipeline below shows how the LLM is used in RAG.

Prompt Template

from langchain_core.prompts import ChatPromptTemplate
rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.
                Question:
                {question}
                Context:
                {context}
                Answer:
            """
rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

Pipeline Construction

from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
# Initialize ChatGPT model
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
# Format documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Construct the RAG pipeline
# (`similarity_retriever` is assumed to be a retriever built over the vector store,
#  e.g. chroma_db.as_retriever(search_type="similarity", search_kwargs={"k": 5}))
qa_rag_chain = (
    {
        "context": (similarity_retriever | format_docs),
        "question": RunnablePassthrough()
    }
      |
    rag_prompt_template
      |
    chatgpt
)

Example Usage

query = "What is the difference between AI, ML, and DL?"
result = qa_rag_chain.invoke(query)
# Display the generated answer
from IPython.display import display, Markdown
display(Markdown(result.content))

Output


2. LLMs Used in Response Generation for RAG

In Retrieval-Augmented Generation (RAG) systems, the response generation LLM plays an important role as the final decision-maker — it takes the retrieved documents, user query, and context and synthesizes everything into a coherent, relevant, and often conversational response. While retrieval models bring in potentially useful information, the LLM can reason, summarize, and contextualize, which ensures the output feels intelligent and human-like.
A strong response model can filter noisy or partial information, infer unstated connections, and deliver answers that align with user intent. This is especially critical in applications like enterprise search, customer support, legal/medical assistants, and technical Q&A, where users expect precise, grounded, and trustworthy responses.

In a nutshell, without a capable generation model, even the best retrieval stack falls flat — making this component the core brain of any RAG pipeline.

Commercial LLMs

| Model | Developer | Key Strengths | Common Use Cases |
| --- | --- | --- | --- |
| GPT-4.5 | OpenAI | Advanced text generation, summarization, conversational fluency | Chatbots, customer support, content creation |
| Claude 3.7 Sonnet | Anthropic | Real-time conversations, strong reasoning, “extended thinking mode” | Business automation, customer service |
| Gemini 2.0 Pro | Google DeepMind | Multimodal (text + image), high performance | Data analysis, enterprise automation, content generation |
| Cohere Command R+ | Cohere | Retrieval-Augmented Generation (RAG), enterprise-grade design | Knowledge management, support automation, moderation |
| DeepSeek | DeepSeek AI | On-premise deployment, secure data handling, high customizability | Finance, healthcare, privacy-sensitive industries |

Open-Source LLMs

| Model | Developer | Key Strengths | Common Use Cases |
| --- | --- | --- | --- |
| LLaMA 3 | Meta | Scalable (up to 405B params), multimodal capabilities | Conversational AI, research, content generation |
| Mistral 7B | Mistral AI | Lightweight yet powerful, optimized for code and chat | Code generation, chatbots, content automation |
| Falcon 180B | Technology Innovation Institute | Efficient, high-performance, open-access | Real-time applications, science/research bots |
| DeepSeek R1 | DeepSeek AI | Strong logic/reasoning, 128K context window | Math tasks, summarization, complex reasoning |
| Qwen2.5-72B-Instruct | Alibaba Cloud | 72.7B parameters, 128K-token context, strong coding, mathematical reasoning, and multilingual support; generates structured outputs like JSON | Technical applications in RAG workflows |

3. Frameworks


Frameworks simplify the development of RAG applications by providing pre-built components:

  • LangChain: Framework for LLM application development with modular architecture for prompt management, chaining, memory handling, and agent creation. Excels at building RAG pipelines with built-in support for document loaders, retrievers, and vector stores.
  • LlamaIndex: Specialized framework for data indexing and retrieval, connecting unstructured data with language models through custom indices. Optimized for ingesting, transforming, and querying large datasets for chatbots and knowledge management.
  • LangGraph: Integrates LLMs with graph-based structures, allowing developers to define application logic using nodes and edges. Ideal for complex workflows with multiple branches and feedback loops, especially in multi-agent systems.
  • RAGFlow: A framework built specifically for Retrieval-Augmented Generation systems, orchestrating retrievers, rankers, and generators into coherent pipelines. Enhances relevance when pulling from external data sources for search-driven interfaces and Q&A systems.

Frameworks like LangChain, LangGraph, and LlamaIndex significantly streamline RAG (Retrieval-Augmented Generation) development by offering modular tools for integrating retrieval and generation processes. LangChain simplifies chaining LLM calls, managing prompts, and connecting to vector stores. LangGraph introduces graph-based flow control, enabling dynamic and multi-step RAG workflows. LlamaIndex focuses on data ingestion, indexing, and retrieval, making large datasets queryable by LLMs. Together, they abstract away complex infrastructure, allowing developers to focus on logic and data quality. These tools enable rapid prototyping and robust deployment of RAG applications for tasks like question answering, document search, and knowledge assistance.
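
The walkthrough below uses LangChain and LangGraph. For comparison, here is a minimal LlamaIndex sketch (an illustrative assumption: it expects the llama-index package, an OPENAI_API_KEY, and a ./data folder containing documents):

# Minimal LlamaIndex sketch: load, index, and query local documents
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # read files from ./data
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, and index them
query_engine = index.as_query_engine()                 # retrieval + generation in one object
print(query_engine.query("What are the key points of these documents?"))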

Example of Frameworks for RAG Building

Let’s build a simple RAG using LangChain:

%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph
!pip install -qU "langchain[openai]"

Select chat model

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

Select embeddings model

from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

Select vector store

from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embeddings)

Creating the indexing pipeline

import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
response = graph.invoke({"question": "What are Types of Memory?"})
print(response["answer"])

Output

The types of memory include Sensory Memory, Short-Term Memory (STM), and
Long-Term Memory (LTM). Sensory Memory retains impressions of sensory
information for a few seconds, while Short-Term Memory holds currently
relevant information for 20-30 seconds. Long-Term Memory can store
information for days to decades and includes explicit (declarative) and
implicit (procedural) memory.

4. Data Extraction


RAG applications often draw on content from many different sources, so they require robust tools for extracting structured and unstructured data:

  • Websites, PDFs, Word documents, slides, etc.
  • Tools like BeautifulSoup or PyPDF2 can automate this process.

Example of Data Extraction for RAG Building

%pip install -U langchain-community
%pip install langchain pypdf

Let’s extract content from the PDF

# %pip install langchain pypdf

from langchain_community.document_loaders import PyPDFLoader

# Define the path to your PDF file
pdf_path = "/content/Multimodal Agent Using Agno Framework.pdf"

# Initialize the PyPDFLoader
loader = PyPDFLoader(pdf_path)

# Load the PDF and split it into pages
documents = loader.load()

# Print the content of each page
for i, doc in enumerate(documents):
    print(f"Page {i + 1} Content:")
    print(doc.page_content)
    print("\n")

Output

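The PDF example above uses PyPDFLoader; for websites, a minimal requests + BeautifulSoup sketch looks like this (the URL is the blog post already loaded in the frameworks example above, and the tag filtering is an illustrative assumption):

# Minimal web extraction sketch (assumes the requests and beautifulsoup4 packages)
import requests
from bs4 import BeautifulSoup

url = "https://lilianweng.github.io/posts/2023-06-23-agent/"
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Drop non-content tags before turning the page into plain text for chunking
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
print(text[:500])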

5. Embeddings


Text embeddings transform textual data into numerical vectors for similarity-based retrieval. Beyond text embeddings:

  • Image embeddings: Used in multimodal RAG applications.
  • Multi-modal embeddings: Combine text, image, and other data types for complex tasks.

Here are the embedding models across providers:

OpenAI Embeddings

  • Latest models: text-embedding-3-small (lower cost) and text-embedding-3-large (higher accuracy)
  • Features: Dynamic dimension adjustment (e.g., 256-3072 dim), multilingual support, optimized for search/RAG

Cohere Embed v3

  • Specializes in document quality ranking and noisy data handling
  • Models: English/multilingual variants (1024/384 dim), compression-aware training for cost efficiency

Nomic Embed v2

  • Open-source MoE architecture (305M active params) with Matryoshka embeddings
  • Multilingual (100+ languages), outperforms models 2x its size on MTEB/BEIR benchmarks

Gemini Embedding

  • Experimental model (gemini-embedding-exp-03-07) with 8K token input and 3K dimensions
  • MTEB leaderboard leader (68.32 mean score), supports 100+ languages

Ollama Embeddings

  • Hosts models like mxbai-embed-large and custom variants (e.g., suntray-embedding)
  • Designed for RAG workflows with local inference and ChromaDB integration

BGE (BAAI)

  • BERT-based models (large/base/small-en-v1.5) for retrieval/RAG
  • Open-source, supports instruction tuning (e.g., “Represent this sentence…”)

Mixedbread

  • The mxbai-embed-large-v1 model by Mixedbread AI is a state-of-the-art sentence embedding solution designed for multilingual and multimodal retrieval tasks.
  • It supports advanced techniques like Matryoshka Representation Learning (MRL) and binary quantization, enabling efficient memory usage and cost reduction at scale. With strong performance across diverse tasks, it rivals larger proprietary models while maintaining open-source accessibility
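
Before chunking and indexing, it helps to see what an embedding looks like in practice. Here is a minimal sketch using OpenAIEmbeddings; the sentences and the reduced dimensions value are illustrative assumptions:

# Embed two sentences and compare them with cosine similarity
# (assumes langchain-openai and numpy are installed and OPENAI_API_KEY is set)
import numpy as np
from langchain_openai import OpenAIEmbeddings

# text-embedding-3 models support shortening vectors via the `dimensions` parameter
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=256)

v1, v2 = embeddings.embed_documents([
    "RAG grounds LLM answers in retrieved context.",
    "Retrieval-augmented generation reduces hallucinations.",
])
v1, v2 = np.array(v1), np.array(v2)
cosine = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"Cosine similarity: {cosine:.3f}")  # semantically close sentences score high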

Splitting the PDF content into chunks

from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
def create_simple_chunks(file_path, chunk_size=3500, chunk_overlap=200):
    print('Loading pages:', file_path)
    loader = PyMuPDFLoader(file_path)
    doc_pages = loader.load()
    print('Chunking pages:', file_path)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    doc_chunks = splitter.split_documents(doc_pages)
    print('Finished processing:', file_path)
    return doc_chunks
from glob import glob
pdf_files = glob('./rag_docs/*.pdf')
# Process PDF files
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(create_simple_chunks(file_path=fp))

Output

Loading pages: ./rag_docs/cnn_paper.pdf
Chunking pages: ./rag_docs/cnn_paper.pdf
Finished processing: ./rag_docs/cnn_paper.pdf
Loading pages: ./rag_docs/attention_paper.pdf
Chunking pages: ./rag_docs/attention_paper.pdf
Finished processing: ./rag_docs/attention_paper.pdf
Loading pages: ./rag_docs/vision_transformer.pdf
Chunking pages: ./rag_docs/vision_transformer.pdf
Finished processing: ./rag_docs/vision_transformer.pdf
Loading pages: ./rag_docs/resnet_paper.pdf
Chunking pages: ./rag_docs/resnet_paper.pdf
Finished processing: ./rag_docs/resnet_paper.pdf

Creating the Embeddings

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Initialize embedding model
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')
# Combine documents
# (`wiki_docs_processed` is assumed to be a list of pre-chunked Wikipedia documents
#  from an earlier step; `paper_docs` comes from the PDF chunking above)
total_docs = wiki_docs_processed + paper_docs
# Create and save vector database
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")

6. Vector Databases


Vector databases store embeddings (numerical representations of text or other data), enabling efficient retrieval of semantically similar chunks. Examples include:

  • Pinecone: A managed vector database platform designed for high-performance and scalable applications, enabling efficient storage and retrieval of high-dimensional vector embeddings.
  • Chroma DB: An open-source AI-native embedding database that includes features like vector search, document storage, full-text search, and metadata filtering, facilitating seamless retrieval in AI applications.
  • Qdrant: An open-source vector database and search engine written in Rust, offering fast and scalable vector similarity search services with extended filtering support, suitable for neural-network or semantic-based matching.
  • Milvus DB: An open-source vector database built for scalable similarity search, capable of handling large-scale and dynamic vector data, and supporting various index types for efficient retrieval.
  • Weaviate: An open-source vector database that stores both objects and vectors, allowing for combining vector search with structured filtering, and is modular, cloud-native, and real-time.
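
The example below persists the embeddings with Chroma. As a point of comparison, here is a minimal in-memory Qdrant sketch via the LangChain community integration (the documents are illustrative, and it assumes the qdrant-client and langchain-community packages plus an OPENAI_API_KEY):

# Minimal in-memory Qdrant sketch (illustrative; ephemeral, in-process instance)
from langchain_community.vectorstores import Qdrant
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

docs = [Document(page_content="Vector databases store embeddings."),
        Document(page_content="Qdrant is a vector database written in Rust.")]

qdrant_store = Qdrant.from_documents(
    docs,
    OpenAIEmbeddings(model="text-embedding-3-small"),
    location=":memory:",          # spin up an ephemeral in-process instance
    collection_name="demo_docs",
)
print(qdrant_store.similarity_search("Which database is written in Rust?", k=1))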

Example of Vector Database for RAG Building

Note: We already created the embeddings above; now we store them in a persistent vector database.

Using Chroma db to store the embeddings

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Initialize embedding model
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')
# Combine documents
total_docs = wiki_docs_processed + paper_docs
# Create and save vector database
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")

Loading the Vector database

chroma_db = Chroma(persist_directory="./my_db",
                   collection_name='my_db',
                   embedding_function=openai_embed_model)

Retrieving the information and getting the output

similarity_retriever = chroma_db.as_retriever(search_type="similarity", search_kwargs={"k": 5})
# Query for semantic similarity
query = "What is machine learning?"
top_docs = similarity_retriever.invoke(query)
# Display results
from IPython.display import display, Markdown
def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()
display_docs(top_docs)

Output

7. Rerankers


Rerankers refine the retrieval process by improving the relevance of retrieved documents. They operate in a two-stage retrieval pipeline:

  1. Initial recall retrieves a broad set of candidates from the vector database.
  2. Rerankers then prioritize the most relevant candidates using additional scoring mechanisms such as semantic similarity or contextual relevance.

This two-stage approach significantly enhances the precision of RAG systems.

By integrating rerankers into the stack, developers can ensure higher-quality responses tailored to user queries while optimizing retrieval efficiency.


Also read: Comprehensive Guide on Reranker for RAG

Example of Rerankers for RAG Building

%pip install --upgrade --quiet  cohere

Set up Cohere and the ContextualCompressionRetriever

from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere
from langchain.chains import RetrievalQA

llm = Cohere(temperature=0)
compressor = CohereRerank(model="rerank-english-v3.0")
# `retriever` is assumed to be the base retriever built earlier over the vector store
# (a broad first-pass retriever, e.g. chroma_db.as_retriever(search_kwargs={"k": 20}))
compression_retriever = ContextualCompressionRetriever(
   base_compressor=compressor, base_retriever=retriever
)
chain = RetrievalQA.from_chain_type(
   llm=Cohere(temperature=0), retriever=compression_retriever
)
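
A quick invocation of the compression-backed chain would then look like this (the query is illustrative; the chain returns a dict with a "result" key):

# Ask a question; Cohere reranks the retrieved chunks before the LLM answers
result = chain.invoke({"query": "What is the difference between AI, ML, and DL?"})
print(result["result"])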

Output

8. Evaluation


Evaluation ensures the accuracy and relevance of RAG systems:

  • Giskard: A library for testing machine learning pipelines.
  • Ragas: Specifically designed to evaluate RAG pipelines by analyzing retrieval quality and generated outputs.
  • Arize Phoenix: An open-source observability library for evaluating, troubleshooting, and improving LLM outputs with features like model drift detection and cohort analysis.
  • Comet Opik: A fully open-source platform for evaluating, testing, and monitoring LLM applications, with tools for observability, automated scoring, and unit testing across the development lifecycle.
  • DeepEval: Offers three LLM evaluation metrics for evaluating retrieval (a short sketch follows this list):
    • ContextualPrecisionMetric: evaluates whether the reranker in your retriever ranks more relevant nodes in your retrieval context higher than irrelevant ones.
    • ContextualRecallMetric: evaluates whether the embedding model in your retriever is able to accurately capture and retrieve relevant information based on the context of the input.
    • ContextualRelevancyMetric: evaluates whether the text chunk size and top-K of your retriever are able to retrieve information without much irrelevancy.
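
As a minimal sketch of applying one of these metrics to a single test case (the strings are illustrative, and DeepEval's default judge model requires an OpenAI API key):

# Score one retrieval result with DeepEval's contextual relevancy metric
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric

test_case = LLMTestCase(
    input="What is RAG?",
    actual_output="RAG augments an LLM with retrieved external context.",
    retrieval_context=["RAG retrieves documents and passes them to the LLM as context."],
)

metric = ContextualRelevancyMetric(threshold=0.7)
metric.measure(test_case)          # uses an LLM judge under the hood
print(metric.score, metric.reason)

A full pipeline-level evaluation with Qdrant and several metrics follows below.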

Example of Evaluation for RAG Building

from tqdm import tqdm
from datasets import load_dataset
from qdrant_client import QdrantClient
from langchain.docstore.document import Document as LangchainDocument
from langchain.docstore.document import Document as LangchainDocument
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
import deepeval

# Get your key from https://platform.openai.com/api-keys
OPENAI_API_KEY = "<OPENAI_API_KEY>"

# Get your Confident AI API key from https://app.confident-ai.com
CONFIDENT_AI_API_KEY = "<CONFIDENT_AI_API_KEY>"

# Get a FREE forever cluster at https://cloud.qdrant.io/
# More info: https://qdrant.tech/documentation/cloud/create-cluster/
QDRANT_URL = "<QDRANT_URL>"
QDRANT_API_KEY = "<QDRANT_API_KEY>"
COLLECTION_NAME = "qdrant-deepeval"

EVAL_SIZE = 10
RETRIEVAL_SIZE = 3

dataset = load_dataset("atitaarora/qdrant_doc", split="train")

langchain_docs = [
    LangchainDocument(
        page_content=doc["text"], metadata={"source": doc["source"]}
    )
    for doc in tqdm(dataset)
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])

client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

docs_contents, docs_metadatas = [], []

for doc in docs_processed:
    if hasattr(doc, "page_content") and hasattr(doc, "metadata"):
        docs_contents.append(doc.page_content)
        docs_metadatas.append(doc.metadata)
    else:
        print(
            "Warning: Some documents do not have 'page_content' or 'metadata' attributes."
        )

# Uses FastEmbed - https://qdrant.tech/documentation/fastembed/
# To generate embeddings for the documents
# The default model is `BAAI/bge-small-en-v1.5`
client.add(
    collection_name=COLLECTION_NAME,
    metadata=docs_metadatas,
    documents=docs_contents,
)

openai_client = OpenAI(api_key=OPENAI_API_KEY)


def query_with_context(query, limit):

    search_result = client.query(
        collection_name=COLLECTION_NAME, query_text=query, limit=limit
    )

    contexts = [
        "document: " + r.document + ",source: " + r.metadata["source"]
        for r in search_result
    ]
    prompt_start = """ You're assisting a user who has a question based on the documentation.
        Your goal is to provide a clear and concise response that addresses their query while referencing relevant information
        from the documentation.
        Remember to:
        Understand the user's question thoroughly.
        If the user's query is general (e.g., "hi," "good morning"),
        greet them normally and avoid using the context from the documentation.
        If the user's query is specific and related to the documentation, locate and extract the pertinent information.
        Craft a response that directly addresses the user's query and provides accurate information
        referring the relevant source and page from the 'source' field of fetched context from the documentation to support your answer.
        Use a friendly and professional tone in your response.
        If you cannot find the answer in the provided context, do not pretend to know it.
        Instead, respond with "I don't know".

        Context:\n"""

    prompt_end = f"\n\nQuestion: {query}\nAnswer:"

    prompt = prompt_start + "\n\n---\n\n".join(contexts) + prompt_end

    res = openai_client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        temperature=0,
        max_tokens=636,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
    )

    return (contexts, res.choices[0].text)


qdrant_qna_dataset = load_dataset("atitaarora/qdrant_doc_qna", split="train")


def create_deepeval_dataset(dataset, eval_size, retrieval_window_size):
    test_cases = []
    for i in range(eval_size):
        entry = dataset[i]
        question = entry["question"]
        answer = entry["answer"]
        context, rag_response = query_with_context(
            question, retrieval_window_size
        )
        test_case = deepeval.test_case.LLMTestCase(
            input=question,
            actual_output=rag_response,
            expected_output=answer,
            retrieval_context=context,
        )
        test_cases.append(test_case)
    return test_cases


test_cases = create_deepeval_dataset(
    qdrant_qna_dataset, EVAL_SIZE, RETRIEVAL_SIZE
)

deepeval.login_with_confident_api_key(CONFIDENT_AI_API_KEY)

deepeval.evaluate(
    test_cases=test_cases,
    metrics=[
        deepeval.metrics.AnswerRelevancyMetric(),
        deepeval.metrics.FaithfulnessMetric(),
        deepeval.metrics.ContextualPrecisionMetric(),
        deepeval.metrics.ContextualRecallMetric(),
        deepeval.metrics.ContextualRelevancyMetric(),
    ],
)

9. Open LLMs Access


Platforms enabling local or API-based access to open LLMs include:

  • Ollama: Allows running open LLMs locally.
  • Groq, Hugging Face, Together AI: Provide API integrations for open LLMs.
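
The walkthrough below uses Ollama for local inference. For hosted access to open models, a minimal sketch with Groq's LangChain integration could look like this (an illustrative assumption: it expects the langchain-groq package, a GROQ_API_KEY, and the model name shown):

# Minimal hosted open-LLM call via Groq's LangChain integration
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
print(llm.invoke("Summarize what a RAG pipeline does in one sentence.").content)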

Example of Open LLMs Access for RAG Building

Download Ollama from https://ollama.com/download, or install it with:

curl -fsSL https://ollama.com/install.sh | sh

After this, pull the DeepSeek-R1 1.5B model using:

ollama pull deepseek-r1:1.5b

Install the required libraries

!pip install langchain==0.3.11
!pip install langchain-openai==0.2.12
!pip install langchain-community==0.3.11
!pip install langchain-chroma==0.1.4

OpenAI Embedding Models

from langchain_openai import OpenAIEmbeddings
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

Create a Vector DB and persist it on disk

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('AgenticAI.pdf')
pages = loader.load_and_split()
texts = [doc.page_content for doc in pages]

from langchain_chroma import Chroma
chroma_db = Chroma.from_texts(
    texts=texts,
    collection_name='db_docs',
    collection_metadata={"hnsw:space": "cosine"},  # Set distance function to cosine
    embedding=openai_embed_model,
    persist_directory="./db_docs"  # Persist the collection on disk
)

Build a RAG Chain

from langchain_core.prompts import ChatPromptTemplate
prompt = """You are an assistant for question-answering tasks.
            Use the following pieces of retrieved context to answer the question.
            If no context is present or if you don't know the answer, just say that you don't know.
            Do not make up the answer unless it is there in the provided context.
            Keep the answer concise and to the point with regard to the question.
            Question:
            {question}
            Context:
            {context}
            Answer:
         """
prompt_template = ChatPromptTemplate.from_template(prompt)

Load Connection to LLM

from langchain_community.llms import Ollama
deepseek = Ollama(model="deepseek-r1:1.5b")

LangChain Syntax for RAG Chain
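
The chain below expects a retriever. One way to build the similarity-threshold retriever it references on top of the Chroma store created above (the k and score_threshold values are illustrative assumptions):

# Only return chunks whose similarity score clears a minimum threshold
similarity_threshold_retriever = chroma_db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.3},
)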

from langchain.chains import RetrievalQA
rag_chain = RetrievalQA.from_chain_type(llm=deepseek,
                                        chain_type="stuff",
                                        retriever=similarity_threshold_retriever,
                                        chain_type_kwargs={"prompt": prompt_template})
query = "Tell the Leaders’ Perspectives on Agentic AI"
rag_chain.invoke(query)
Output

{'query': 'Tell the Leaders’ Perspectives on Agentic AI',

Conclusion

Building effective RAG applications isn’t just about plugging in a language model—it’s about choosing the right RAG Developer stack across the board, from frameworks and embeddings to vector databases and retrieval tools. When these components are thoughtfully integrated, they enable intelligent, scalable systems that can chat with PDFs, pull relevant facts in real time, and generate context-aware responses. As the ecosystem continues to evolve, staying agile with your tools and grounded in solid architecture will be key to building reliable, future-proof AI solutions.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
