Since the release of ChatGPT, the pace of progress in the AI space shows no signs of slowing down; new tools and technologies are released every day. That is great for businesses and the AI space in general, but as a programmer, do you need to learn all of them to build something? The answer is no. A more pragmatic approach is to learn only the things you need. Plenty of tools and technologies promise to make things easier, and to some extent they do, but at times we do not need them at all. Using large frameworks for simple use cases only turns your code into a bloated mess. So, in this article, we will build a CLI PDF chatbot without LangChain and understand why we do not always need AI frameworks.
Over recent months, frameworks such as LangChain and LlamaIndex have seen a remarkable surge in popularity, primarily thanks to their capacity to make LLM app development convenient for developers. But for a lot of use cases, these frameworks are overkill. It's like bringing a bazooka to a gunfight.
They ship with things you may not need in your project. Python environments are already infamous for being bloated; adding dependencies you hardly use only makes them messier. One such use case is document querying. If your project does not involve an AI agent or similarly complicated machinery, you can ditch LangChain and build the workflow from scratch, reducing unnecessary bloat. Besides this, frameworks like LangChain and LlamaIndex are under rapid development; any code refactoring on their side might break your build.
If you have a higher-order need, such as building an agent to automate complicated software, or a project that would take long engineering hours to build from scratch, it makes sense to use prebuilt solutions. Never reinvent the wheel, unless you need a better wheel. There are countless other examples where using ready-made solutions with minor tweaks makes absolute sense.
One of the most sought-after use cases of LLMs has been document question answering. And after OpenAI made their ChatGPT endpoints public, it has become much easier to build an interactive conversational bot over any text data source. In this article, we will build an LLM Q&A CLI app from scratch. So, how do we approach the problem? Before building it, let's understand what we need to do.
A typical workflow will involve extracting text from the PDF, splitting it into chunks, embedding and storing the chunks in a vector database, retrieving the most relevant chunks for a user query, and feeding them to an LLM to generate an answer.
All these things will require a user-facing interface. For this article, we will build a simple Command Line Interface with Python Argparse.
Here is a workflow diagram of our CLI chatbot:
Before going into the coding part, let's understand a thing or two about vector databases and indexes.
As the name suggests, vector databases store vectors, or embeddings. So, why do we need vector databases? Building any AI application requires embeddings of real-world data, as machine learning models cannot directly process raw data such as text, images, or audio. When you are dealing with a large amount of such data that will be used repeatedly, it needs to be stored somewhere. So, why can't we use a traditional database for this? Well, you can use traditional databases for your search needs, but vector databases offer a significant advantage: they can perform vector similarity search in addition to lexical search.
In our case, whenever a user sends a query, the vector DB performs a vector similarity search over all the stored embeddings and fetches the K nearest neighbors. The search mechanism is very fast, as it employs an algorithm called HNSW.
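To make vector similarity search concrete, here is a small illustrative sketch (not part of our app, and the toy data is made up) that finds the K nearest neighbors of a query embedding with plain NumPy and cosine similarity:

import numpy as np

def k_nearest(query, vectors, k):
    # Cosine similarity reduces to a dot product after L2-normalising both sides
    query = query / np.linalg.norm(query)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = vectors @ query
    # Return the indices of the k highest-scoring vectors, best first
    return np.argsort(scores)[::-1][:k]

# Toy data: 1000 embeddings of dimension 384
embeddings = np.random.rand(1000, 384).astype(np.float32)
query_vec = np.random.rand(384).astype(np.float32)
print(k_nearest(query_vec, embeddings, k=5))

This exhaustive scan is fine for small collections; indexes like HNSW exist precisely to avoid it at scale.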
HNSW stands for Hierarchical Navigable Small World. It is a graph-based algorithm and indexing method for Approximate Nearest Neighbor search (ANN). ANN is a type of search that finds the k most similar items to a given item.
HNSW works by building a graph of the data points. The nodes in the graph represent the data points, and the edges in the graph represent the similarity between the data points. The graph is then traversed to find the k most similar items to the given item.
The HNSW algorithm is fast, reliable, and scalable. Most vector databases use HNSW as their default search algorithm.
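ChromaDB builds and queries its HNSW index internally, so our app never touches it directly. Still, for intuition, here is a minimal sketch using the hnswlib package (an extra dependency chosen just for illustration; it is not required for this project):

import numpy as np
import hnswlib

dim = 384
data = np.random.rand(1000, dim).astype(np.float32)

# Build the graph index: M is the number of edges per node,
# ef_construction trades build time for graph quality
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=1000, ef_construction=200, M=16)
index.add_items(data, np.arange(1000))

# ef trades query speed for recall at search time
index.set_ef(50)
labels, distances = index.knn_query(data[0], k=5)
print(labels)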
Now, we are all set to delve into the code.
As with any Python project, start with creating a virtual environment. This keeps the development environment nice and tidy. Refer to this article for choosing the right Python environment for your project.
The project file structure is simple: we will have two Python files, one for defining the CLI and the other for processing, storing, and querying data. Also, create a .env file to store your OpenAI API key.
Here is the requirements.txt file; install the dependencies before getting started.

# requirements.txt
openai
chromadb
PyPDF2
python-dotenv
Now, import the necessary classes and functions.
import os
import openai
import PyPDF2
import re
from chromadb import Client, Settings
from chromadb.utils import embedding_functions
from PyPDF2 import PdfReader
from typing import List, Dict
from dotenv import load_dotenv
Load the OpenAI API key from the .env file.
load_dotenv()
key = os.environ.get('OPENAI_API_KEY')
openai.api_key = key
To store text embeddings and their metadata, we will create a collection with ChromaDB.
ef = embedding_functions.ONNXMiniLM_L6_V2()
client = Client(settings = Settings(persist_directory="./", is_persistent=True))
collection_ = client.get_or_create_collection(name="test", embedding_function=ef)
As the embedding model, we are using MiniLM-L6-v2 with the ONNX runtime. It is small yet capable, and on top of that, open source.
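As a quick sanity check, assuming a chromadb version where embedding functions are directly callable on a list of texts, you can inspect the embedding dimensionality:

vectors = ef(["Hello world"])  # "Hello world" is just a sample string
print(len(vectors), len(vectors[0]))  # 1 embedding of 384 dimensions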
Next, we will define a function to verify if a provided file path belongs to a valid PDF file.
def verify_pdf_path(file_path):
    try:
        # Attempt to open the PDF file in binary read mode
        with open(file_path, "rb") as pdf_file:
            # Create a PDF reader object using PyPDF2
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            # Check if the PDF has at least one page
            if len(pdf_reader.pages) > 0:
                # If it has pages, the PDF is not empty, so do nothing (pass)
                pass
            else:
                # If it has no pages, raise an exception indicating that the PDF is empty
                raise ValueError("PDF file is empty")
    except PyPDF2.errors.PdfReadError:
        # Handle the case where the PDF cannot be read (e.g., it's corrupted or not a valid PDF)
        raise PyPDF2.errors.PdfReadError("Invalid PDF file")
    except FileNotFoundError:
        # Handle the case where the specified file doesn't exist
        raise FileNotFoundError("File not found, check file address again")
    except Exception as e:
        # Handle other unexpected exceptions and display the error message
        raise Exception(f"Error: {e}")
One of the major parts of a PDF Q&A app is to get text chunks. So, we need to define a function that gets us the required chunks of text.
def get_text_chunks(text: str, word_limit: int) -> List[str]:
    """
    Divide a text into chunks with a specified word limit
    while ensuring each chunk contains complete sentences.

    Parameters:
        text (str): The entire text to be divided into chunks.
        word_limit (int): The desired word limit for each chunk.

    Returns:
        List[str]: A list containing the chunks of text with
        the specified word limit and complete sentences.
    """
    # Split the text into sentences, avoiding splits on abbreviations like "e.g."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    chunks = []
    current_chunk = []
    for sentence in sentences:
        words = sentence.split()
        # Compare word counts (not character counts) against the limit
        if len(current_chunk) + len(words) <= word_limit:
            current_chunk.extend(words)
        else:
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = words
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
We have defined a basic chunking algorithm. The idea is to let users decide how many words a single text chunk should contain, while every chunk ends with a complete sentence, even if that means breaching the limit. It is a simple algorithm; you may create something of your own.
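Here is a quick illustration with a made-up paragraph; the word limit is deliberately tiny so the splits are visible:

sample = ("Python is popular. It has a huge ecosystem. "
          "Some of that ecosystem is bloat. Choose dependencies carefully.")
for chunk in get_text_chunks(sample, word_limit=8):
    print(repr(chunk))
# Each printed chunk ends on a sentence boundary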
Now, we need a function to load texts from PDFs and create a dictionary to keep track of text chunks belonging to a single page.
def load_pdf(file: str, word: int) -> Dict[int, List[str]]:
    # Create a PdfReader object from the specified PDF file
    reader = PdfReader(file)
    # Initialize an empty dictionary to store the extracted text chunks
    documents = {}
    # Iterate through each page in the PDF
    for page_no in range(len(reader.pages)):
        # Get the current page
        page = reader.pages[page_no]
        # Extract text from the current page
        texts = page.extract_text()
        # Use the get_text_chunks function to split the extracted text into chunks of 'word' words
        text_chunks = get_text_chunks(texts, word)
        # Store the text chunks in the documents dictionary with the page number as the key
        documents[page_no] = text_chunks
    # Return the dictionary containing page numbers as keys and text chunks as values
    return documents
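A quick usage sketch (the file name here is a placeholder): the function returns a dictionary mapping each page number to its list of chunks:

docs = load_pdf("report.pdf", 200)  # 'report.pdf' is a hypothetical path
print(list(docs.keys())[:3])   # first few page numbers, e.g. [0, 1, 2]
print(docs[0][0][:80])         # start of the first chunk on page 0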
Now, we need to store the data in a ChromaDB collection.
def add_text_to_collection(file: str, word: int = 200) -> str:
    # Load the PDF file and extract text chunks
    docs = load_pdf(file, word)
    # Initialize empty lists to store data
    docs_strings = []  # List to store text chunks
    ids = []  # List to store unique IDs
    metadatas = []  # List to store metadata for each text chunk
    chunk_id = 0  # Initialize the running chunk ID
    # Iterate through each page and text chunk in the loaded PDF
    for page_no in docs.keys():
        for doc in docs[page_no]:
            # Append the text chunk to the docs_strings list
            docs_strings.append(doc)
            # Append metadata for the text chunk, including the page number
            metadatas.append({'page_no': page_no})
            # Append a unique ID for the text chunk
            ids.append(chunk_id)
            # Increment the ID
            chunk_id += 1
    # Add the collected data to the collection
    collection_.add(
        ids=[str(i) for i in ids],  # Convert IDs to strings
        documents=docs_strings,  # Text chunks
        metadatas=metadatas,  # Metadata
    )
    # Return a success message
    return "PDF embeddings successfully added to collection"
In ChromaDB, the metadata field stores additional information about the documents. In our case, the page number of a text chunk is its metadata. After attaching metadata to each text chunk, we store everything in the collection we created earlier. This step runs only when the user provides a valid file path to a PDF file.
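Because the page number lives in the metadata, you can also restrict retrieval to a particular page using ChromaDB's where filter. For example (the query text is made up):

result = collection_.query(
    query_texts=["What does the introduction cover?"],
    n_results=2,
    where={"page_no": 0},  # only consider chunks from page 0
)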
We will now define a function that processes user queries to fetch data from the database.
def query_collection(texts: str, n: int) -> List[str]:
    # Retrieve the n chunks most similar to the query text
    result = collection_.query(
        query_texts=texts,
        n_results=n,
    )
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    resulting_strings = []
    # Prefix each retrieved chunk with its page number from the metadata
    for page_no, text_chunk in zip(metadatas, documents):
        resulting_strings.append(f"Page {page_no['page_no']}: {text_chunk}")
    return resulting_strings
The above function uses the collection's query method to retrieve the "n" most relevant chunks from the database. We then format each chunk as a string that starts with the page number of the text chunk.
Now, the only major thing remaining is to feed the LLM with information.
def get_response(queried_texts: List[str]) -> str:
    global messages
    # System prompt instructing the model to quote the page number in its answers
    messages = [
        {"role": "system",
         "content": "You are a helpful assistant. "
                    "You will always answer the question asked in 'ques:' and "
                    "will quote the page number while answering any questions. "
                    "It is always at the start of the prompt in the format 'Page n'."},
        {"role": "user", "content": ''.join(queried_texts)}
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.2,
    )
    response_msg = response.choices[0].message.content
    # Append the assistant's reply to the message history
    messages = messages + [{"role": 'assistant', 'content': response_msg}]
    return response_msg
The global variable messages stores the conversation context. We have defined a system message so that the model quotes the page number from which it gets the answer.
Lastly, the ultimate utility function combines obtained text chunks with the user query, feeds it into the get_response() function, and returns the resulting answer string.
def get_answer(query: str, n: int):
    # Fetch the n most relevant chunks for the query
    queried_texts = query_collection(texts=query, n=n)
    # Join all retrieved chunks and append the user's question
    queried_string = " ".join(queried_texts) + f" ques: {query}"
    answer = get_response(queried_texts=queried_string)
    return answer
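One more helper: the CLI we build next imports a clear_coll function that empties the collection, but its body is not shown in this walkthrough. Here is a minimal sketch, assuming we simply delete the collection and recreate it under the same name:

def clear_coll():
    # Drop the existing collection and recreate an empty one with the same name
    global collection_
    client.delete_collection(name="test")
    collection_ = client.get_or_create_collection(name="test", embedding_function=ef)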
We are done with our utility functions. Let's move on to building the CLI.
To use the chatbot on demand, we need an interface. This could be a web app, a mobile app, or a CLI. In this article, we will build a CLI for our chatbot. If you want to build a nice-looking demo web app, you can use tools like Gradio or Streamlit. Check out this article on building a chatbot for PDFs: Build a ChatGPT for PDFs with Langchain.
To build the CLI, we will need the argparse library. Argparse is a powerful module that lets you create CLIs in Python. It has a simple and easy syntax for creating commands, sub-commands, and flags. So, before delving into it, here is a small primer on argparse.
The argparse module was first included in Python 3.2, providing a quick and convenient way to build CLI applications in Python without relying on third-party installations. It allows us to parse command-line arguments, create sub-commands, and more, making it a reliable tool for building CLIs.
Here's a small example of argparse in action:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--filename", help="The name of the file to read.")
parser.add_argument("-n", "--number", help="The number of lines to print.", type=int)
parser.add_argument("-s", "--sort", help="Sort the lines in the file.", action="store_true")
args = parser.parse_args()

with open(args.filename) as f:
    lines = f.readlines()
if args.sort:
    lines.sort()
if args.number:
    lines = lines[:args.number]  # Limit output to the requested number of lines
for line in lines:
    print(line, end="")
The add_argument method lets us define arguments with checks and balances. We can define the type of an argument, the action to take when a flag is provided, and a help parameter that explains the use of a particular argument. The --help flag will display all the flags and their use cases.
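Assuming the snippet above is saved as read_lines.py (a name chosen only for this illustration), a typical invocation looks like this:

python read_lines.py -f notes.txt -n 5 -s

And python read_lines.py --help prints the auto-generated usage text.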
On a similar note, we will define the arguments for the chatbot CLI.
Import Argparse and necessary utility functions.
import argparse
from utils import (
add_text_to_collection,
get_answer,
verify_pdf_path,
clear_coll
)
Define Argument parser and add arguments.
def main():
    # Create a command-line argument parser with a description
    parser = argparse.ArgumentParser(description="PDF Processing CLI Tool")
    # Define command-line arguments
    parser.add_argument("-f", "--file", help="Path to the input PDF file")
    parser.add_argument(
        "-c", "--count",
        default=200,
        type=int,
        help="Optional integer value for the number of words in a single chunk"
    )
    parser.add_argument(
        "-q", "--question",
        type=str,
        help="Ask a question"
    )
    parser.add_argument(
        "-cl", "--clear",
        action="store_true",
        help="Clear existing collection data"
    )
    parser.add_argument(
        "-n", "--number",
        type=int,
        default=1,
        help="Number of results to be fetched from the collection"
    )
    # Parse the command-line arguments
    args = parser.parse_args()
We have defined a few arguments, such as --file, --count, and --question. Note that we use action="store_true" for the --clear flag rather than type=bool, since argparse would treat any non-empty string (even "False") as True.
Now, we process the arguments:
    if args.file is not None:
        verify_pdf_path(args.file)
        confirmation = add_text_to_collection(file=args.file, word=args.count)
        print(confirmation)

    if args.question is not None:
        n = args.number if args.number else 1
        answer = get_answer(args.question, n=n)
        print("Answer:", answer)

    if args.clear:
        clear_coll()
        print("Current collection cleared successfully")
Putting everything together.
import argparse
from utils import (
    add_text_to_collection,
    get_answer,
    verify_pdf_path,
    clear_coll
)

def main():
    # Create a command-line argument parser with a description
    parser = argparse.ArgumentParser(description="PDF Processing CLI Tool")

    # Define command-line arguments
    parser.add_argument("-f", "--file", help="Path to the input PDF file")
    parser.add_argument(
        "-c", "--count",
        default=200,
        type=int,
        help="Optional integer value for the number of words in a single chunk"
    )
    parser.add_argument(
        "-q", "--question",
        type=str,
        help="Ask a question"
    )
    parser.add_argument(
        "-cl", "--clear",
        action="store_true",
        help="Clear existing collection data"
    )
    parser.add_argument(
        "-n", "--number",
        type=int,
        default=1,
        help="Number of results to be fetched from the collection"
    )

    # Parse the command-line arguments
    args = parser.parse_args()

    # Check if the '--file' argument is provided
    if args.file is not None:
        # Verify the PDF file path and add its text to the collection
        verify_pdf_path(args.file)
        confirmation = add_text_to_collection(file=args.file, word=args.count)
        print(confirmation)

    # Check if the '--question' argument is provided
    if args.question is not None:
        n = args.number if args.number else 1  # Set 'n' to the specified number or default to 1
        answer = get_answer(args.question, n=n)
        print("Answer:", answer)

    # Check if the '--clear' flag is provided
    if args.clear:
        clear_coll()
        print("Current collection cleared successfully")

if __name__ == "__main__":
    main()
Now open your terminal and run the command below.
python cli.py -f "path/to/file.pdf" -c 1000 -n 1 -q "query"
To clear the collection, run
python cli.py -cl
If the provided file path does not exist, the script raises a FileNotFoundError; if the file is not a valid PDF, it raises a PdfReadError.
The GitHub Repository: https://github.com/sunilkumardash9/pdf-cli-chatbot
A chatbot running as a CLI tool can be used in many real-world applications, such as:
Academic Research: Researchers often deal with numerous research papers and articles in PDF format. A CLI chatbot could help them extract relevant information, create bibliographies, and organize their references efficiently.
Language Translation: Language professionals can use the chatbot to extract text from PDFs, translate it, and then generate translated documents, all from the command line.
Educational Institutions: Teachers and educators can extract content from educational resources to create customized learning materials or to prepare course content. Students can extract useful information from large PDFs using the chatbot CLI.
Open Source Project Management: CLI chatbots can help open-source software projects manage documentation, extract code snippets, and generate release notes from PDF manuals.
So, this was all about building a PDF Q&A chatbot with a command-line interface, without using frameworks such as LangChain and LlamaIndex. Here is a quick summary of the things we covered.
Frameworks like LangChain and LlamaIndex are convenient, but they can be overkill for simple workflows such as document querying.
Vector databases store embeddings and support fast similarity search, typically via the HNSW index.
We extracted text from PDFs with PyPDF2, chunked it by sentence, and stored the chunks with page-number metadata in a ChromaDB collection.
User queries retrieve the most similar chunks, which are fed to GPT-3.5 Turbo along with the question.
Finally, we wrapped the whole workflow in a CLI built with Python's argparse.
Q. What is a PDF chatbot?
A. A PDF chatbot is an interactive bot specially designed to retrieve information from PDFs.
Q. What is LangChain used for?
A. LangChain is an open-source framework that simplifies the creation of applications using large language models. It can be used for a variety of tasks, including chatbots, document analysis, code analysis, question answering, and generative tasks.
Q. Are chatbots AI tools?
A. Yes, chatbots are AI tools. They use artificial intelligence (AI) and natural language processing (NLP) to simulate human conversation. Chatbots can be used to provide customer service, answer questions, and even generate creative content.
Q. What are chatbots for PDFs?
A. Chatbots for PDFs are tools that allow you to interact with PDF files using natural language. You can ask questions about the PDF, and the chatbot will try to answer them. You can also ask it to summarize the PDF or to extract specific information from it.
Q. Can I chat with a PDF?
A. Yes, with the advent of capable large language models and vector stores, it is possible to chat with PDFs.