PDF stands for Portable Document Format and has a .pdf file extension. Users predominantly utilize this format for document sharing because it preserves the original formatting, ensuring that documents appear consistent across various platforms, regardless of the hardware, software, or operating system used. This consistency makes PDFs the format of choice for distributing, viewing, and ensuring the integrity of documents on a global scale.
Originally developed by Adobe, PDF has transcended its proprietary origins to become an open standard, governed by the International Organization for Standardization (ISO). This transition to an ISO standard has further cemented PDF’s position as a cornerstone of digital document management, facilitating its adoption in a wide range of applications from academic publishing to business communications.
In this tutorial, we will learn how to work with PDF files in Python. The following topics will be covered:
This article was published as a part of the Data Science Blogathon.
There are many libraries available freely for working with PDFs:
PyPDF2 is a comprehensive Python library designed for the manipulation of PDF files. It enables users to create, modify, and extract content from PDF documents. Built entirely in Python, PyPDF2 does not rely on any external modules, making it an accessible tool for Python developers.
The library is built around a few classes. PdfFileReader opens an existing document and exposes its pages and metadata, PdfFileWriter assembles pages into a new file, and PdfFileMerger concatenates whole documents. Together they cover most everyday tasks, from combining reports to pulling individual pages out of books or magazines, with very little code.
Because PyPDF2 is written in pure Python rather than native C code, it is not the fastest option for parsing very large files, but it installs anywhere Python runs and keeps a simple, Pythonic interface. This makes it a lightweight choice for developers who need to manage PDF documents in their projects.
PyPDF2’s flexibility makes it a good choice for integrating PDF processing into your workflow or Python projects. Below are some practical applications where PyPDF2 is useful:
Traditionally, converting PDFs into Word or other file formats requires specialized software for each conversion type, which can be inefficient, especially when handling multiple documents. PyPDF2 does not convert PDFs to Word by itself, but it lets you extract text and pages from within a Python script, so the extraction step of a conversion pipeline can be automated and the result handed to another library that writes the desired format.
Whether you’re compiling reports, combining chapters of a book, or consolidating financial statements, PyPDF2 simplifies the process of merging multiple PDF files into a single document. This capability is invaluable for creating cohesive documents from disparate sources, enhancing organization and accessibility.
PyPDF2’s functionality extends beyond basic file manipulation, allowing for detailed modifications within PDF documents. Users can add or remove pages, extract text for analysis, rotate pages, and overlay one page on another, for example to stamp a watermark onto an existing PDF. This level of control makes PyPDF2 a versatile tool for tailoring documents to specific requirements.
Large PDF documents can be unwieldy, making them difficult to share or process. PyPDF2 addresses this by letting you copy any subset of pages into a new file, so a single large document can be split into smaller, more manageable files, whether by page number or at regular intervals (every n pages).
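To install PyPDF2, run the following command in the command prompt. The code in this tutorial uses the legacy class and method names (PdfFileReader, getPage, and so on), which PyPDF2 3.0 removed, so install a release earlier than 3.0:
pip install "PyPDF2<3.0"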
PyPDF2 exposes the metadata stored in a PDF document. Fields such as the author, title, subject, creator, and producer are available directly.
To extract the above information, run the following code:
from PyPDF2 import PdfFileReader

pdf_path = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"

with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()

print(information)
The output of the above code is as follows:
Let us format the output:
# format() also copes with fields that are missing (None)
print("Author: {}".format(information.author))
print("Creator: {}".format(information.creator))
print("Producer: {}".format(information.producer))
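To list every metadata field rather than three hand-picked ones, a small helper can walk the whole object. The sketch below assumes the object returned by getDocumentInfo() behaves like a dictionary whose keys start with a slash (such as '/Author'); the function names and the sample dictionary are illustrative and not part of PyPDF2.

```python
def summarize_metadata(info):
    """Return a plain dict of metadata with the leading '/' removed from keys."""
    summary = {}
    for key, value in dict(info or {}).items():
        name = key[1:] if key.startswith("/") else key
        # Missing values become empty strings so they can be printed safely.
        summary[name] = "" if value is None else str(value)
    return summary


def print_metadata(info):
    """Print one 'Field: value' line per entry, sorted by field name."""
    for name, value in sorted(summarize_metadata(info).items()):
        print("{}: {}".format(name, value or "(not set)"))


# A stand-in dictionary, shaped like the object returned by getDocumentInfo().
sample_info = {"/Author": "Jane Doe", "/Producer": "ExampleWriter", "/Title": None}
print_metadata(sample_info)
```

With a real file, print_metadata(information) could replace the three print calls above.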
Extracting text from PDFs with PyPDF2 can be challenging due to its restricted capabilities in text extraction. The output generated by the code might not be well-formatted, often resulting in an output cluttered with line break characters, a consequence of PyPDF2’s constrained text extraction support.
To extract text, we will read the file and create a PDF object of the file.
# creating a pdf file object
pdfFileObject = open(pdf_path, 'rb')
Then we will create a PdfFileReader object and pass the file object to it.
# creating a pdf reader object (PdfFileReader was imported earlier)
pdfReader = PdfFileReader(pdfFileObject)
Finally, we will extract each page and concatenate the text of all pages.
text = ''
for i in range(pdfReader.numPages):
    # creating a page object
    pageObj = pdfReader.getPage(i)
    # extracting text from page
    text = text + pageObj.extractText()
print(text)
# closing the file once all pages have been read
pdfFileObject.close()
The output text is as follows:
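Since the extracted text is often cluttered with line breaks, a light clean-up pass can make it easier to read or analyze. The following is a minimal sketch that uses only the standard library; tidy_extracted_text is a hypothetical helper, and its rules (re-joining hyphenated words, treating blank lines as paragraph breaks) are heuristics that will not suit every document.

```python
import re


def tidy_extracted_text(text):
    """Collapse stray line breaks and repeated spaces in extracted text."""
    # Re-join words split by a hyphen at a line end ("exam-\nple").
    # Heuristic: this also joins genuinely hyphenated words broken across lines.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Mark blank lines (paragraph breaks) before flattening single breaks.
    text = re.sub(r"\n\s*\n", "\x00", text)
    text = text.replace("\n", " ")
    text = re.sub(r"[ \t]+", " ", text)
    # Restore paragraph breaks and drop empty paragraphs.
    paragraphs = [part.strip() for part in text.split("\x00")]
    return "\n\n".join(part for part in paragraphs if part)


raw = "Hello\nworld,  this is an exam-\nple.\n\nSecond\nparagraph."
print(tidy_extracted_text(raw))
```

Applied to the tutorial's own variable, the call would be tidy_extracted_text(text).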
To rotate a page of a PDF file and save it to another file, copy the following code and run it.
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_read = PdfFileReader(r"C:\Users\Dell\Desktop\story.pdf")
pdf_write = PdfFileWriter()
# Rotate the first page 90 degrees to the right
page1 = pdf_read.getPage(0).rotateClockwise(90)
pdf_write.addPage(page1)
with open(r'C:\Users\Dell\Desktop\rotate_pages.pdf', 'wb') as fh:
    pdf_write.write(fh)
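The snippet above rotates only the first page. To rotate a whole document, the same calls can be wrapped in a loop. This is a sketch under the assumption that the legacy method names used in this tutorial (getNumPages, getPage, rotateClockwise, addPage) are available; rotate_all_pages and rotate_pdf are hypothetical helper names, not PyPDF2 functions.

```python
def rotate_all_pages(reader, writer, angle=90):
    """Copy every page from reader to writer, rotated clockwise by angle."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        page.rotateClockwise(angle)
        writer.addPage(page)
    return writer


def rotate_pdf(src_path, dst_path, angle=90):
    """Rotate every page of src_path and save the result to dst_path."""
    # Imported here so rotate_all_pages can be used and tested on its own.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as src:
        writer = rotate_all_pages(PdfFileReader(src), PdfFileWriter(), angle)
        # Write while the source is still open: pages are read lazily.
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

A call such as rotate_pdf("story.pdf", "story_rotated.pdf", 180) would turn every page upside down; both file names are placeholders.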
We can also merge two or more PDF files using the following commands:
from PyPDF2 import PdfFileMerger

# Files to merge, in the order they should appear in the output
pdf_files = [r"C:\Users\Dell\Desktop\story.pdf",
             r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"]
pdf_merger = PdfFileMerger()
for path in pdf_files:
    pdf_merger.append(path)
# merged.pdf is a placeholder name for the output file
with open(r'C:\Users\Dell\Desktop\merged.pdf', 'wb') as fh:
    pdf_merger.write(fh)
pdf_merger.close()
The output PDF is shown below:
We can split a PDF into separate pages and save them again as PDFs.
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf = PdfFileReader(pdf_path)
fname = os.path.splitext(os.path.basename(pdf_path))[0]
for page in range(pdf.getNumPages()):
    # one writer per page, so each page ends up in its own file
    pdfwrite = PdfFileWriter()
    pdfwrite.addPage(pdf.getPage(page))
    outputfilename = '{}_page_{}.pdf'.format(fname, page+1)
    with open(outputfilename, 'wb') as out:
        pdfwrite.write(out)
    print('Created: {}'.format(outputfilename))
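Splitting at regular intervals (every n pages) follows the same pattern, with the page indexes grouped first. The sketch below keeps the grouping logic separate so it can be checked without a PDF at hand; chunk_page_ranges, split_every_n, and the "chunk" file prefix are illustrative names, and the PyPDF2 calls assume the legacy API used throughout this tutorial.

```python
def chunk_page_ranges(num_pages, pages_per_file):
    """Return a list of ranges of page indexes, one range per output file."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        range(start, min(start + pages_per_file, num_pages))
        for start in range(0, num_pages, pages_per_file)
    ]


def split_every_n(pdf_path, pages_per_file, prefix="chunk"):
    """Write consecutive groups of pages to prefix_1.pdf, prefix_2.pdf, ..."""
    # Imported here so chunk_page_ranges can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    created = []
    with open(pdf_path, "rb") as src:
        reader = PdfFileReader(src)
        ranges = chunk_page_ranges(reader.getNumPages(), pages_per_file)
        for number, pages in enumerate(ranges, start=1):
            writer = PdfFileWriter()
            for index in pages:
                writer.addPage(reader.getPage(index))
            name = "{}_{}.pdf".format(prefix, number)
            with open(name, "wb") as out:
                writer.write(out)
            created.append(name)
    return created
```

For a 7-page document and pages_per_file=3, the ranges cover pages 0 to 2, 3 to 5, and 6, so three files would be written.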
Encrypting a PDF file means adding a password to it. Each time the file is opened, the viewer prompts for the password, so the content is password protected. The following popup comes up:
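We can use the following code to encrypt a file:
from PyPDF2 import PdfFileReader, PdfFileWriter

password = "your-password"   # placeholder: choose your own password
outputpdf = "encrypted.pdf"  # placeholder name for the output file
pdf = PdfFileReader(pdf_path)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfwrite.addPage(pdf.getPage(page))
pdfwrite.encrypt(user_pwd=password, owner_pwd=None, use_128bit=True)
with open(outputpdf, 'wb') as fh:
    pdfwrite.write(fh)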
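Reading a protected file back requires the reverse step: the reader has to be given the password before any page can be accessed. Here is a minimal sketch, assuming the legacy reader exposes an isEncrypted attribute and a decrypt() method that returns 0 when the password does not match; unlock is a hypothetical helper name.

```python
def unlock(reader, password):
    """Return True if the reader's pages can be read after trying password."""
    if not reader.isEncrypted:
        # Nothing to do for files without a password.
        return True
    # The legacy decrypt() returns 0 when the password does not match.
    return reader.decrypt(password) != 0


# Usage with PyPDF2 (file name and password are placeholders):
# from PyPDF2 import PdfFileReader
# reader = PdfFileReader("encrypted.pdf")
# if unlock(reader, "your-password"):
#     print(reader.getNumPages())
```

Checking the return value before calling getPage avoids an error on the first page access when the password is wrong.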
A watermark is an identifying image or pattern that appears on each page. It can be a company logo or any other important information that should appear on every page. To add a watermark to each page of the PDF, copy the following code and run it.
from PyPDF2 import PdfFileReader, PdfFileWriter

originalfile = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
watermark = r"C:\Users\Dell\Desktop\Testing Tesseract\watermark.pdf"
watermarkedfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermarkedfile.pdf"
# the watermark is the first page of the watermark PDF
watermark = PdfFileReader(watermark)
watermarkpage = watermark.getPage(0)
pdf = PdfFileReader(originalfile)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfpage = pdf.getPage(page)
    # overlay the watermark on the current page
    pdfpage.mergePage(watermarkpage)
    pdfwrite.addPage(pdfpage)
with open(watermarkedfile, 'wb') as fh:
    pdfwrite.write(fh)
The above code reads two files: the input file and the watermark. It then overlays the watermark on each page of the input file and saves the result as a new file in the same folder.
PyPDF2 stands out as a highly accessible solution for PDF file conversion, celebrated for its open-source nature and integration capabilities. Its comprehensive online documentation, hosted on GitHub, ensures that even those pressed for time can quickly find their way through setup and execution, streamlining the learning curve with well-organized docs and examples. For those seeking further assistance or looking to contribute, the PyPDF2 community on GitHub welcomes inquiries and contributions, fostering an environment of support and continuous improvement.
This library is not only user-friendly but also designed with automation and integration in mind, making it a go-to choice for developers looking to incorporate PDF manipulation into their workflows or applications. Since PyPDF2 is available on PyPI, installing it is straightforward for any Python project.
With no dependencies other than Python, PyPDF2 is portable across operating systems, so developers can deploy it in diverse environments with few compatibility issues. The BSD-style license under which PyPDF2 is released allows developers to include it in commercial software packages, provided the license terms are followed.
In essence, PyPDF2 serves as an invaluable tool for Python developers interested in automating PDF manipulation, providing an optimal blend of ease of use, efficiency, and adaptability. Whether you’re generating reports, converting documents, or integrating PDF functionalities into larger systems, PyPDF2’s robust feature set and supportive community make it a highly recommended resource.
A. Yes, Python 3 supports various libraries for PDF manipulation, such as PyPDF2, PDFMiner, and pdflib. These libraries allow you to perform operations like extracting text, merging, splitting, and encrypting PDFs in a Python 3 environment.
A. PyPDF refers to libraries like PyPDF2 and PyPDF4, which are Python libraries that allow users to work with PDF files. They provide functionalities for extracting information, merging, splitting, encrypting, and decrypting PDF documents.
A. Appending text directly to a PDF is complex due to the format’s nature. Instead, you can use Python to add text as annotations or by creating a new PDF with the text and then merging it with the original PDF using PyPDF2.
A. Yes, PyPDF2 allows you to decrypt PDF files, provided you have the necessary permissions and the password.
A. To work with Excel files and PDFs, you can use libraries like Pandas to manipulate Excel data and then use ReportLab or PyPDF2 to generate or manipulate PDFs based on that data.
A. On Linux, ensure you have dependencies installed for libraries like PyPDF2 or PDFMiner. Use the Linux package manager to install any required system libraries for advanced operations like OCR.