How to Extract tabular data from PDF document using Camelot in Python

Guest Blog Last Updated : 14 Aug, 2020

4 min read

Introduction

PDF or Portable Document File format is one of the most common file formats in today’s time. It is widely used across every industry such as in government offices, healthcare, and even in personal work. As a result, there is a large unstructured data that exists in PDF format and extracting this data to generate meaningful insights is a common work among data scientists.

There are several Python libraries dedicated to working with PDF documents such as PYPDF2 etc. In this tutorial, I will be using Camelot.

camelot

Why Camelot?

You are in control: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
Export to multiple formats, including JSON, Excel, HTML, and Sqlite.

Let’s Begin

Before installing Camelot libraries we have to install ghost script , once we install the ghost script lets install camelot-py.

Run below commands :

pip install "camelot-py[cv]"

Once you have installed camelot-py library we are all set to go. We are trying to extract a state-wise GST revenue table from this pdf doc.

import camelot

If you have camelot, Python will not print an error message, and if not, you will see an ImportError.

# Syntax of the camelot.read_pdf function 
camelot.read_pdf(
    filepath,
    pages='1',
    password=None,
    flavor='lattice',
    suppress_stdout=False,
    layout_kwargs={},
    **kwargs,
)

If you have to extract a table from different pages you have to give the page number.

tables2=camelot.read_pdf('gst-revenue-collection-march2020.pdf', flavor='stream', pages='0-3')
tables2

This will give you a total Table list that is there in a pdf doc. we can select a table passing the index.

tables2[2]  # 2 is the index

tables2[2].parsing_report

The above code will give you the details such as accuracy and page no. Note that there are 2 pages.

The following code will extract the table from the pdf document.

df2=tables2[2].df
df2

In this case, because the table is split into two different pages. So we can do a workaround.

tables2[3]
tables2[3].parsing_report

Here you can notice, we extract the table from page no 3.

df3=tables2[3].df
df3

The following is the code to append df2 and df3.

df4=df2.append(df3)
df4

df5=df4[1:]
df5.head()
new_header = df5.iloc[0]df5 = df5[1:]df5.columns = new_header

Here you go, we have extracted a table from pdf, now we can export this data in any format to the local system.

Conclusion

Extracting tabular data from pdf with help of camelot library is really easy. Moreover, we know there is a huge amount of unstructured data in pdf formats and after extracting the tables we can do lots of analysis and visualization based on your business need.

I hope this article will help you and save a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

About the Author

Author

Prabhat Kumar – Associate Analyst

I am an engineer currently working in Top MNCs as an Associate Analyst and Innovation Enthusiast, I love learning new things, I Believe Every data has a story and I love reading the stories.

Prabhat Pathak (Linkedin profile) is an Associate Analyst.

Guest Blog

Beginner Python Python Structured Data Technique

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

How to Extract tabular data from PDF document using Camelot in Python

Introduction

Why Camelot?

Let’s Begin

Conclusion

About the Author

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS