The world of Natural Language Processing is expanding rapidly, especially with the advent of large language models, which have revolutionized the field and made it accessible to everyone. In this article, we will explore and implement some NLP techniques to build a chat assistant that can answer your questions based on a given article (or PDF) using open-source libraries, all without requiring an OpenAI API key.
The workflow of the application is as follows: the user provides a PDF file or a URL to an article and asks a question, and the application attempts to answer it based on the provided source.
We will extract the content using the PyPDF2 library (in the case of a PDF file) or BeautifulSoup (in the case of an article URL). Then, we will split it into chunks using the CharacterTextSplitter from the langchain library.
For each chunk, we calculate its embedding vector using the all-MiniLM-L6-v2 model, which maps sentences and paragraphs to a 384-dimensional dense vector space (a word embedding is simply a technique for representing a word or sentence as a vector). The same technique is applied to the user's question.
The vectors are given as input to the semantic search function provided by sentence_transformers, a Python framework for state-of-the-art sentence, text, and image embeddings.
This function returns the text chunks that may contain the answer, and the Question Answering model generates the final answer based on the output of the semantic search plus the user's question.
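Before diving into the code, here is a rough sketch of the whole pipeline in Python-like pseudocode; the helper names (extract_text, split_into_chunks, embed, find_best_chunks, answer) are illustrative placeholders for the steps implemented below, not real library functions.

def chat_assistant(source, user_question):
    text = extract_text(source)                  # PyPDF2 for PDFs, BeautifulSoup for article URLs
    chunks = split_into_chunks(text)             # langchain CharacterTextSplitter
    chunk_vectors = embed(chunks)                # all-MiniLM-L6-v2 -> 384-dimensional vectors
    question_vector = embed([user_question])
    best_chunks = find_best_chunks(question_vector, chunk_vectors)  # semantic search
    return answer(best_chunks, user_question)    # Question Answering model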
In this section, I will focus only on the implementation, while the details will be provided in the FAQ section.
We start by installing the dependencies and then importing them.
pip install -r requirements.txt
numpy
torch
sentence-transformers
requests
langchain
beautifulsoup4
PyPDF2
import torch
import numpy as np
from sentence_transformers import util
from langchain.text_splitter import CharacterTextSplitter
from bs4 import BeautifulSoup
from PyPDF2 import PdfReader
import requests
In the case of a PDF file:
try:
    # Read the PDF and concatenate the text of all its pages
    pdf = PdfReader(path_pdf_file)
    result = ''
    for page in pdf.pages:
        result += page.extract_text()
except:
    print("PDF file doesn't exist")
    exit(0)
In the case of an article, we attempt to extract the content between HTML tags such as h1, p, li, and h2 (these tags work fine for websites like Medium and may differ on others).
try:
    # Fetch the page and keep only the text inside the relevant tags
    request = requests.get(URL_LINK)
    request = BeautifulSoup(request.text, 'html.parser')
    request = request.find_all(['h1', 'p', 'li', 'h2'])
except:
    print('Bad URL link')
    exit(0)
result = [element.text for element in request]
# Join with newlines so the CharacterTextSplitter below can split on "\n"
result = '\n'.join(result)
Each chunk will contain roughly 1,000 characters, with a 200-character overlap to keep adjacent chunks related and avoid cutting context in half. (FAQ-Q2)
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_text(result)
You can download the all-MiniLM-L6-v2 model from Hugging Face, or you can simply access it through HTTP requests, since it is available via the Inference API. (FAQ-Q1)
Note: To access the Hugging Face APIs, you have to sign up (it's free) to obtain your access token.
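If you'd rather run the model locally instead of calling the API, a minimal sketch using the sentence-transformers package (already listed in requirements.txt) could look like this; it replaces the query function defined below.

from sentence_transformers import SentenceTransformer

# Downloads all-MiniLM-L6-v2 from the Hugging Face Hub on first use
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# encode() returns one 384-dimensional vector per input string
query_embeddings = model.encode(['Put your question here'], convert_to_tensor=True)
output = model.encode(chunks, convert_to_tensor=True)

These tensors can go straight into util.semantic_search, without the FloatTensor conversion needed for the API output.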
hf_token = 'Put here your huggingface access token'
api_url = ('https://api-inference.huggingface.co/pipeline/feature-extraction/'
           'sentence-transformers/all-MiniLM-L6-v2')
headers = {"Authorization": f"Bearer {hf_token}"}

def query(texts):
    # Send the texts to the Inference API and get back one embedding per text
    response = requests.post(api_url, headers=headers,
                             json={"inputs": texts, "options": {"wait_for_model": True}})
    return response.json()
user_question = 'Put your question here'

# Embed the question and the chunks with the same model
question = query([user_question])
query_embeddings = torch.FloatTensor(question)

output = query(chunks)
output = torch.from_numpy(np.array(output)).to(torch.float)
The query function returns the 384-dimensional dense vectors, and the conversion to a float tensor is necessary for the semantic_search function.
final will contain the two text chunks that are most likely to include the answer (I set top_k=2 to increase the probability of the QA model receiving the right context). (FAQ-Q4)
result = util.semantic_search(query_embeddings, output, top_k=2)
# Each hit carries the corpus_id of the matching chunk; map it back to the text
final = [chunks[result[0][i]['corpus_id']] for i in range(len(result[0]))]
Since you have the context (text chunks) and the question, you can use any model you want (take a quick look at the Hugging Face QA models to get an idea). I chose the AI21 Studio question answering model; you can sign up for free to get an access token.
AI21_api_key = 'AI21studio api key'
url = "https://api.ai21.com/studio/v1/answer"

payload = {
    "context": ' '.join(final),
    "question": user_question
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "Authorization": f"Bearer {AI21_api_key}"
}

response = requests.post(url, json=payload, headers=headers)
if response.json()['answerInContext']:
    print(response.json()['answer'])
else:
    print('The answer is not found in the document ⚠️, please reformulate your question.')
The model lets you verify whether the answer is actually present in the context (when using large language models, you may face the problem of the LLM answering a question from its own knowledge rather than from the provided context). (FAQ-Q3)
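If you prefer an open-source alternative to the AI21 API, a minimal sketch using an extractive QA pipeline from the transformers library could look like this; note that transformers is not in requirements.txt and the deepset/roberta-base-squad2 checkpoint is just one possible choice.

from transformers import pipeline

# Extractive QA: the model picks the answer span directly from the provided context
qa_model = pipeline('question-answering', model='deepset/roberta-base-squad2')
prediction = qa_model(question=user_question, context=' '.join(final))

# The confidence score can act as a rough substitute for AI21's answerInContext flag
# (the 0.3 threshold is arbitrary and should be tuned)
if prediction['score'] > 0.3:
    print(prediction['answer'])
else:
    print('The answer is not found in the document ⚠️, please reformulate your question.')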
You can extend this project to various source inputs (PowerPoint files, YouTube videos/audio, slides, audiobooks) at a relatively low cost, so feel free to adapt it to your use cases. Additionally, you can create a simple UI for this application and host it with Streamlit, as I did (the GitHub repo can be found here; don't forget to hit the star button).
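As an illustration only (this is not the code from the repo), a minimal Streamlit sketch could look like the following, assuming the steps above are wrapped in a hypothetical answer_question(source, question) helper.

import streamlit as st

st.title('Chat with your PDF or article')

url = st.text_input('Article URL')
pdf_file = st.file_uploader('...or upload a PDF', type='pdf')
question = st.text_input('Your question')

if st.button('Ask') and question:
    # answer_question is a hypothetical wrapper around the extraction,
    # chunking, embedding, semantic search, and QA steps shown above
    st.write(answer_question(pdf_file or url, question))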
In this article, we built a powerful chat assistant for your PDF files/articles.
Thank you for your time and attention. For further assistance:
LinkedIn : SAMY GHEBACHE
A. This model is the result of fine-tuning the nreimers/MiniLM-L6-H384-uncased model on a dataset of 1 billion sentence pairs. The base model was trained with a self-supervised technique: the model is given a phrase with a missing word and tries to predict it. You can think of the embedding vectors as being produced by the weights of this model, and its hidden size of 384 gives the number of dimensions in our case.
A. We could pass the entire extracted text directly to the question answering model without performing the semantic search step, but this would be very costly if you are using the OpenAI API (or any paid API): you pay per token, so it would be quite expensive. If you are using question answering models, they are limited in the number of input tokens, so you can't handle a PDF with many pages or even a long article. In addition, the models don't perform equally well on a chunk of 1,000 tokens and on a whole text of 10,000 tokens or more.
A. The idea behind this function is to project your sentences and paragraphs into an N-dimensional vector space. In our case, the embedding model transforms each chunk into V_chunk and the user question into V_question, where Dimension(V_chunk) = Dimension(V_question) = N (N = 384). Then, we compute Similarity(V_chunk, V_question) for each chunk and keep the chunks with the highest similarity values. The SentenceTransformers framework uses cosine similarity for this.
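As a toy illustration of that last step (the numbers below are made up and are not real embeddings), cosine similarity can be computed with the util module we already imported.

from sentence_transformers import util
import torch

# Toy 3-dimensional vectors just to illustrate the idea (real vectors have 384 dimensions)
v_question = torch.tensor([[0.2, 0.8, 0.1]])
v_chunks = torch.tensor([[0.1, 0.9, 0.0],   # close to the question
                         [0.9, 0.0, 0.4]])  # unrelated
print(util.cos_sim(v_question, v_chunks))   # the first chunk gets the higher score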