Data Extraction from Unstructured PDFs

Ashish Last Updated : 13 Dec, 2024

9 min read

Data extraction is the process of pulling data out of sources such as CSV files, web pages, and PDFs. In some formats, such as CSV, the data can be read directly, whereas unstructured PDFs require additional steps before the data can be extracted with Python.

Several Python libraries can extract data from PDFs. For example, the PyPDF2 library extracts text that is laid out sequentially or in a regular format, i.e. in lines or forms, and the Camelot library extracts tables. In all these cases the data is in a structured form: sequential text, forms, or tables.

However, in the real world, most data is not present in any of these forms and has no particular order; it is unstructured. In this case the libraries above are not a good fit, since they give ambiguous results. To analyze unstructured data, we first need to convert it to a structured form.

There is no single technique or procedure for extracting data from unstructured PDFs, since the data has no fixed layout and the approach depends on what type of data you want to extract.

Here, I will show you a technique that worked well for me, and a Python library with which you can extract data from bounding boxes in unstructured PDFs, clean the extracted data, and convert it to a structured form.

In this article, you will learn how to extract text from bounding boxes in an unstructured PDF with Python, clean it, and turn it into structured data. Libraries such as PyPDF2 and pdfplumber extract structured content, while PyMuPDF is used here to parse unstructured PDF content programmatically.

This article was published as a part of the Data Science Blogathon


PyMuPDF

I have used the PyMuPDF library for this purpose. This library supports many tasks, such as extracting images from PDFs, extracting text from different shapes, making annotations, and drawing bounding boxes around text, along with the features of libraries like PyPDF2.

Now, I will show you how I extracted data from the bounding boxes in a PDF with several pages.

Here are the PDF and the red bounding boxes from which we need to extract data.

I tried many Python libraries, including PyPDF2, PDFMiner, pikepdf, Camelot, and tabula-py. However, none of them except PyMuPDF worked for extracting this data from the PDF.

Before going into the code it’s important to understand the meaning of 2 important terms which would help in understanding the code.

Word: Sequence of characters without space. Ex – ash, 23, 2, 3.

Annots: An annotation associates an object such as a note, image, or bounding box with a location on a page of a PDF document, or provides a way to interact with the user using the mouse and keyboard. These objects are called annots. Because each annot has a location on the page, annots help identify exactly where the data of interest sits.

Please note that in our case the bounding boxes, annots, and rectangles are the same thing. Therefore, these terms will be used interchangeably.


First, we will extract text from one of the bounding boxes. Then we will use the same procedure to extract data from all the bounding boxes in the PDF.

Code

import fitz
import pandas as pd 
doc = fitz.open('Mansfield--70-21009048 - ConvertToExcel.pdf')
page1 = doc[0]
words = page1.get_text("words")

First, we import the fitz module of the PyMuPDF library and the pandas library. Then the PDF file is opened and stored in doc, and the first page of the PDF is stored in page1. The page.get_text("words") method extracts all the words from page 1. Each word is a tuple with 8 elements.

In the words variable, the first 4 elements of each tuple are the coordinates of the word, the 5th element is the word itself, and the 6th, 7th, and 8th elements are the block, line, and word numbers respectively.
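To make the tuple layout concrete, here is a minimal sketch that gives each position a name. The Word class and the sample values are made up for illustration and are not part of PyMuPDF; only the field order follows the description above.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Named view of one 8-element word tuple (illustrative helper)."""
    x0: float      # left edge of the word's rectangle
    y0: float      # top edge
    x1: float      # right edge
    y1: float      # bottom edge
    text: str      # the word itself
    block_no: int  # block number
    line_no: int   # line number
    word_no: int   # word number


# Hypothetical sample in the same layout; not taken from the real PDF.
raw = (72.0, 100.5, 110.2, 112.5, "LOCALITY", 0, 0, 0)
word = Word(*raw)
print(word.text, (word.x0, word.y0, word.x1, word.y1))
```

Naming the fields is optional; the rest of the article keeps indexing the raw tuples with w[0] to w[4].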

OUTPUT

Extract the coordinates of the first object

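# make_text() is defined in the next snippet; run that definition first.
first_annots = []

# first_annot is None when the page has no annotations
rec = page1.first_annot.rect
rec

# Information of words in first object is stored in mywords
mywords = [w for w in words if fitz.Rect(w[:4]) in rec]
ann = make_text(mywords)
first_annots.append(ann)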

This function sorts the words contained in the box and returns them as a string

def make_text(words):
    line_dict = {}
    words.sort(key=lambda w: w[0])      # sort words left to right by x0
    for w in words:
        y1 = round(w[3], 1)             # bottom coordinate identifies the line
        word = w[4]                     # the text of the word
        line = line_dict.get(y1, [])    # words collected so far for this line
        line.append(word)
        line_dict[y1] = line
    lines = list(line_dict.items())
    lines.sort()                        # sort lines top to bottom by y1
    return "\n".join([" ".join(line[1]) for line in lines])

OUTPUT

page.first_annot gives the first annot, i.e. the first bounding box on the page.

.rect gives the coordinates of a rectangle.

Now we have the coordinates of the rectangle and all the words on the page. We then filter the words that lie inside our bounding box and store them in the mywords variable.
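The containment test behind that filter can be sketched in plain Python. The inside() helper, the box, and the word tuples below are hypothetical and only mimic the 8-element layout; the article's own code relies on fitz.Rect for this check.

```python
def inside(word, box):
    """True if the word's rectangle lies entirely within box = (x0, y0, x1, y1)."""
    wx0, wy0, wx1, wy1 = word[:4]
    bx0, by0, bx1, by1 = box
    return bx0 <= wx0 and by0 <= wy0 and wx1 <= bx1 and wy1 <= by1


# Made-up words: two inside the box, one far below it, one crossing its right edge.
words = [
    (10, 10, 40, 20, "CRASH", 0, 0, 0),
    (45, 10, 90, 20, "SEVERITY", 0, 0, 1),
    (10, 200, 40, 210, "FOOTER", 1, 0, 0),
    (85, 10, 130, 20, "CLIPPED", 0, 0, 2),
]
box = (5, 5, 100, 30)

mywords = [w for w in words if inside(w, box)]
print([w[4] for w in mywords])  # ['CRASH', 'SEVERITY']
```

A word that crosses the edge of the box is dropped by this strict test. If boxes are drawn tightly around the text, an overlap test may be a better fit than full containment.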

We now have all the words in the rectangle with their coordinates. However, these words are in random order. Since the text only makes sense when read in sequence, we use the function make_text(), which first sorts the words from left to right and then arranges the lines from top to bottom. It returns the text as a string.
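The effect of that ordering can be seen on a few made-up word tuples. The function below is a compact variant of make_text() written for this illustration, and the sample coordinates are hypothetical.

```python
def make_text(words):
    """Group words into lines by bottom coordinate, then join lines top to bottom."""
    line_dict = {}
    for w in sorted(words, key=lambda w: w[0]):      # left to right by x0
        y1 = round(w[3], 1)                          # bottom edge identifies the line
        line_dict.setdefault(y1, []).append(w[4])
    return "\n".join(" ".join(line_dict[y]) for y in sorted(line_dict))


# Words deliberately listed out of reading order.
sample = [
    (60, 30, 90, 40, "TOWNSHIP", 0, 1, 2),
    (10, 10, 50, 20, "LOCALITY", 0, 0, 0),
    (10, 30, 20, 40, "2", 0, 1, 0),
    (25, 30, 55, 40, "-", 0, 1, 1),
]
text = make_text(sample)
print(text)
```

The first line of the result is the box label and the second line holds its value, which is the shape the cleaning steps later rely on. Words on the same visual line must share the same rounded bottom coordinate for this grouping to work.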

Hurrah! We have extracted data from one annot. Our next task is to extract data from all annots of the PDF, which is done with the same approach.

Extracting each page of the document and all the annots/rectangles

all_annots = []                          # text of every annot in the PDF

for pageno in range(0, len(doc)):        # visit every page, including the last
    page = doc[pageno]
    words = page.get_text("words")
    for annot in page.annots():
        if annot is not None:
            rec = annot.rect
            mywords = [w for w in words if fitz.Rect(w[:4]) in rec]
            ann = make_text(mywords)
            all_annots.append(ann)

all_annots is a list initialized to store the text of all annots in the PDF.

The outer loop goes through each page of the PDF, while the inner loop goes through all annots of the page and adds their text to the all_annots list as discussed earlier.
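The same two loops can be wrapped in a function that also copes with pages that have no annots and with boxes that contain no words. This is a sketch under assumptions: collect_annot_texts is a hypothetical helper name, it only needs objects that offer get_text("words"), annots(), and a rect with x0, y0, x1, y1 attributes, and it compares coordinates directly instead of using fitz.Rect.

```python
def collect_annot_texts(doc, make_text):
    """Return the text found inside every annotation rectangle, page by page."""
    texts = []
    for page in doc:                                  # iterating visits every page
        words = page.get_text("words")
        for annot in page.annots():                   # yields nothing if no annots
            r = annot.rect
            inside = [
                w for w in words
                if r.x0 <= w[0] and r.y0 <= w[1] and w[2] <= r.x1 and w[3] <= r.y1
            ]
            if inside:                                # skip empty boxes
                texts.append(make_text(inside))
    return texts
```

With a document opened as in the article, the call would look like all_annots = collect_annot_texts(doc, make_text). Skipping empty boxes is a design choice of this sketch; the loop above appends an empty string for them instead.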

Printing all_annots gives the text of all annots of the PDF, which you can see below.

OUTPUT

Finally, we have extracted the texts from all the annots/ bounding boxes.

It's time to clean the data and bring it into an understandable form.

Data Cleaning and Data Processing

Splitting to form column name and its values:

cont = []
for i in range(0, len(all_annots)):
    # split at the first newline: column name before it, values after it
    cont.append(all_annots[i].split('\n', 1))

Removing unnecessary symbols *, #, :

liss = []
for i in range(0, len(cont)):
    lis = []
    for j in cont[i]:
        j = j.replace('*', '')
        j = j.replace('#', '')
        j = j.replace(':', '')
        j = j.strip()
        lis.append(j)
    liss.append(lis)

Splitting into keys and values and removing spaces in the values which only contain digits

keys = []
values = []
for i in liss:
    keys.append(i[0])
    # an annot without a newline has no value part, so fall back to ''
    values.append(i[1] if len(i) > 1 else '')

for i in range(0, len(values)):
    # a value with no capital letter is treated as digits only, so drop its spaces
    if not any('A' <= ch <= 'Z' for ch in values[i]):
        values[i] = values[i].replace(' ', '')

We split each string at the first newline (\n) character to separate the column name from its values. Unnecessary symbols such as *, #, and : are then removed, and spaces are removed from values that contain only digits.
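The three cleaning steps can also be applied to one annot string at a time. The helper name split_and_clean and the sample strings are assumptions made for this sketch; the rules mirror the ones described above.

```python
def split_and_clean(annot_text):
    """Split one annot string into (column name, value) and clean both parts."""
    parts = annot_text.split("\n", 1)          # name before the first newline
    if len(parts) == 1:
        parts.append("")                       # no newline means no value part
    drop = str.maketrans("", "", "*#:")        # characters to delete
    key = parts[0].translate(drop).strip()
    value = parts[1].translate(drop).strip()
    # Values without capital letters are treated as digit strings: remove spaces.
    if not any("A" <= ch <= "Z" for ch in value):
        value = value.replace(" ", "")
    return key, value


print(split_and_clean("*VEHICLE YEAR:\n2 0 1 8"))
print(split_and_clean("CRASH SEVERITY#\n1 - FATAL"))
```

Handling one string per call keeps the name and the value together, so an annot without a value cannot shift the later pairing of keys and values.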

With the key-value pairs, we create a dictionary which is shown below:

Converting to dictionary

report = dict(zip(keys, values))
report['VEHICLE IDENTIFICATION'] = report['VEHICLE IDENTIFICATION'].replace(' ', '')

# For these three fields, keep the first segment whose leading character occurs twice
dic = [report['LOCALITY'], report['MANNER OF CRASH COLLISION/IMPACT'], report['CRASH SEVERITY']]
val_after = []
for local in dic:
    li = []        # segments of the string, each cut in front of a digit
    lii = []       # leading characters seen so far
    k = ''         # first leading character that repeats
    extract = ''
    l = 0
    for i in range(0, len(local) - 1):
        if local[i+1] >= '0' and local[i+1] <= '9':
            li.append(local[l:i+1])
            l = i + 1
    li.append(local[l:])
    print(li)
    for i in li:
        if i[0] in lii:
            k = i[0]
            break
        lii.append(i[0])
    for i in li:
        if i[0] == k:
            extract = i
            val_after.append(extract)
            break
report['LOCALITY'] = val_after[0]
report['MANNER OF CRASH COLLISION/IMPACT'] = val_after[1]
report['CRASH SEVERITY'] = val_after[2]
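The loop over LOCALITY, MANNER OF CRASH COLLISION/IMPACT, and CRASH SEVERITY is easier to follow on a small example. The sketch below restates its logic as two functions; the function names and the sample string are assumptions, since the real field contents are not shown here. It cuts the string in front of every digit and returns the first segment whose leading character appears a second time.

```python
def split_at_digits(text):
    """Cut text in front of every digit that is not the first character."""
    segments, start = [], 0
    for i in range(1, len(text)):
        if text[i].isdigit():
            segments.append(text[start:i])
            start = i
    segments.append(text[start:])
    return segments


def pick_marked_option(text):
    """Return the first segment whose leading character occurs twice, else None."""
    segments = [s for s in split_at_digits(text) if s]
    seen = set()
    for seg in segments:
        if seg[0] in seen:
            # the repeated leading character points back at the chosen option
            return next(s for s in segments if s[0] == seg[0]).strip()
        seen.add(seg[0])
    return None


# Hypothetical field: three numbered options followed by the selected number.
print(pick_marked_option("1 - CITY 2 - TOWNSHIP 3 - VILLAGE 2"))  # 2 - TOWNSHIP
```

Like the loop above, this only looks at the first character of each segment, so it assumes single-digit option numbers.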

OUTPUT:

Lastly, the dictionary is converted to a DataFrame with the help of pandas.

Converting to DataFrame and exporting to CSV

# report maps each column name to a single value, so wrap it in a list to get one row
data = pd.DataFrame([report])
data.to_csv('final.csv', index=False)

OUTPUT


Now we can perform analysis on our structured data or export it to Excel.

How to Extract Data from Unstructured PDF Files with Python?

To extract data from unstructured PDF files using Python, you can use a combination of libraries such as PyPDF2 and NLTK (Natural Language Toolkit). Here’s a general approach:

Install the required libraries by running the following command:

pip install PyPDF2 nltk

Import the necessary modules in your Python script:

import nltk
from PyPDF2 import PdfReader
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# One-time download of the tokenizer and stopword data used below
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')

Load the PDF file using PyPDF2:

def extract_text_from_pdf(pdf_path):
    # PdfReader replaces the PdfFileReader class removed in PyPDF2 3.0
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # fall back to "" if a page yields no text
        text += page.extract_text() or ""
    return text

Preprocess the extracted text by removing stopwords and non-alphanumeric tokens:

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    cleaned_text = ' '.join(cleaned_tokens)
    return cleaned_text

Call the functions and extract the data from the PDF:

pdf_path = "path/to/your/pdf/file.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
preprocessed_text = preprocess_text(extracted_text)

# Process the preprocessed text further as per your specific requirements
# such as information extraction, entity recognition, etc.

This is a basic example to get you started. Depending on the structure and content of your PDF files, you may need to apply additional techniques for more accurate and specific data extraction.
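As one example of such an additional technique, labelled fields can be pulled out of the extracted text with regular expressions. This is a minimal sketch: extract_fields is a hypothetical helper, and the labels and sample text are invented for illustration.

```python
import re


def extract_fields(text, labels):
    """Return {label: value} for each 'Label: value' line found in text."""
    fields = {}
    for label in labels:
        pattern = rf"^{re.escape(label)}\s*:\s*(.+)$"
        match = re.search(pattern, text, flags=re.MULTILINE | re.IGNORECASE)
        if match:
            fields[label] = match.group(1).strip()
    return fields


# Invented sample standing in for text returned by extract_text_from_pdf().
sample_text = "Invoice No: INV-1042\nDate: 13 Dec 2024\nTotal: 1,250.00\n"
fields = extract_fields(sample_text, ["Invoice No", "Date", "Total", "Due"])
print(fields)
```

This kind of matching should run on the raw extracted text rather than on the preprocessed text, because the preprocessing step removes the colons and line breaks that the pattern depends on.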

How to Extract Information from a PDF with Python?

To extract information from PDF files using Python, several libraries can be utilized, each with its own strengths and methods. Below are some popular libraries and examples of how to use them effectively.

Libraries for PDF Extraction

PyPDF2: A widely used library for reading and manipulating PDF files. It allows you to extract text and metadata from PDFs.
PDFMiner: This library excels in extracting structured data from PDFs, making it suitable for complex layouts.
PyMuPDF (fitz): A powerful library that supports various file formats and provides efficient text extraction and manipulation capabilities.
PDFQuery: This library uses CSS-like selectors to extract data from PDFs, making it user-friendly for specific data retrieval.
IronPDF: A commercial library that simplifies PDF data extraction and manipulation, including interactive forms.

Example Code Snippets

Using PyPDF2

from PyPDF2 import PdfReader

# Create a PDF reader object
reader = PdfReader('example.pdf')

# Print the number of pages
print(len(reader.pages))

# Extract text from the first page
text = reader.pages[0].extract_text()
print(text)

Conclusion

Data extraction from PDFs, especially unstructured ones, can be challenging, but Python offers powerful libraries to simplify this process. While libraries like PyPDF2 and Camelot work well for structured data, PyMuPDF handles unstructured PDFs by extracting data from bounding boxes. This approach, combined with data cleaning and processing, converts unstructured data into a structured form suitable for analysis. With these tools, you can extract, clean, and use data from a variety of PDF layouts in your data science projects.

I hope you liked this article on how to extract data from PDFs using Python. With these libraries, you can pull unstructured text out of a PDF and turn it into structured information.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Frequently Asked Questions

Q1. How do I extract specific data from a PDF?

A. To extract specific data from a PDF, you can use Python libraries like PyMuPDF or PyPDF2. These libraries allow you to search for and extract text or other elements from specific locations within the PDF.

Q2. How do I extract form data from a PDF?

A. You can extract form data from a PDF using the PyPDF2 or pdfplumber libraries in Python. These libraries can read and extract values from form fields in a PDF.

Q3. Can I extract data from PDF to Excel?

A. Yes, you can extract data from a PDF to Excel using libraries like PyMuPDF, Camelot, or tabula-py to read the PDF and extract the data, then use the pandas library to organize and export the data to an Excel file.

Q4. How to automatically extract data from PDF?

A. To automatically extract data from a PDF, you can use a combination of Python libraries such as PyPDF2 for text extraction and pandas for data processing. You can write a script that reads the PDF, processes the data, and saves it in the desired format, such as CSV or Excel. This script can be scheduled to run automatically using tools like cron jobs or task schedulers.
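A script of that kind could be organized as below. This is a sketch under assumptions: batch_extract and its parameters are hypothetical names, extract_fn stands for whichever extraction routine you use and is expected to return a dict of fields for one PDF, and the standard csv module is used here in place of pandas to keep the example dependency-free.

```python
import csv
from pathlib import Path


def batch_extract(pdf_dir, out_csv, extract_fn):
    """Run extract_fn on every PDF in pdf_dir and write one CSV row per file."""
    rows = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        record = {"file": pdf_path.name}
        record.update(extract_fn(pdf_path))      # dict of extracted fields
        rows.append(record)

    # Collect column names in first-seen order across all rows.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A scheduler such as cron can then call a small script that invokes batch_extract on the folder where new PDFs arrive.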

Mohammed Nawaz Ahmed

hey ashish excellent article. however i struct somewhere in my code. how could i contact u

Jackson

thankyou for this post but I would appreciate it if you can share a link for the pdf used for the data extraction as that would make it easier to try this exercise.

D S BALA KRISHNA SARAN

I have tried to execute /practice the same below code : import fitz import pandas as pd doc = fitz.open('FlipKart_Invoice_2022_3.pdf') page1 = doc[0] words = page1.get_text("words") first_annots=[] rec=page1.first_annot.rect mywords = [w for w in words if fitz.Rect(w[:4]) in rec] ann= make_text(mywords) first_annots.append(ann) but, I am facing the below error : rec=page1.first_annot.rect AttributeError: 'NoneType' object has no attribute 'rect' Could you help me here ?
