Data is everywhere: almost every action we take generates some, though not always in a structured form. Beginners in data analysis usually start with standard formats such as CSV or plain-text files, which are easy to load with pandas or basic Python file handling. Real-world data, however, also arrives in other formats, including Word documents. During an internship assignment, for example, I had to analyze data stored in a Doc file, which meant extracting tabular data from the document first. In this article, I’ll walk through the ETL process for Doc files: the difference between the Doc and Docx formats, how to convert Doc to Docx using Python, how to extract tabular data from the resulting file, and how I created interactive plots from that data.
This article was published as a part of the Data Science Blogathon
While dealing with Word files, you will come across two extensions: ‘.doc’ and ‘.docx’. Both are used for Microsoft Word documents, which can be created with Microsoft Word or another word processor. The difference is historical: until Word 2007, “.doc” was the default extension.
With Word 2007, Microsoft introduced a new default extension, “.docx”, which stands for Microsoft Word Open XML Format Document. This format makes files smaller, easier to store, and less prone to corruption. It also opened the door to online tools such as Google Docs, which can readily handle Docx files.
Today, files are created with the Docx extension by default, but many old files still use the Doc extension. Docx is the better format for storing and sharing data, but we can’t ignore the data stored in Doc files; it might be of great value. To retrieve that data with the tools used in this article, we first convert the Doc file to Docx. The conversion method depends on the platform: Windows or Linux.
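The difference between the two formats shows up in the first few bytes of a file: a legacy Doc file is an OLE2 compound binary, while a Docx file is a ZIP archive of XML parts. The sketch below is one possible way to check which format a file really is before deciding whether it needs conversion. The function name `sniff_word_format` is made up for illustration, and the check only looks at the file signature, so it cannot tell a Docx from other ZIP-based files such as Xlsx.

```python
from pathlib import Path

# File signatures ("magic bytes") of the two container formats
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary container used by .doc
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header, used by .docx


def sniff_word_format(path):
    """Guess the container format from the first bytes of the file.

    Returns "doc", "docx" or "unknown". This is a heuristic: any OLE2 file
    (for example .xls) reports "doc" and any ZIP file reports "docx".
    """
    header = Path(path).read_bytes()[:8]
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

A check like this is useful when a folder mixes old and new files, or when a file carries the wrong extension.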
To do this manually, open the file in Word and use “Save As” to save it with the “.docx” extension.
For Windows
We will perform this task using Python. Windows’ Component Object Model (COM) allows one application to control another. pywin32 is the Python package that exposes COM to Python, so it can automate Windows applications such as Word. The implementation looks like this:
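import os
from win32com import client as wc
# Word's COM interface resolves relative paths unreliably, so pass absolute paths
w = wc.Dispatch('Word.Application')
doc = w.Documents.Open(os.path.abspath("file_name.doc"))
doc.SaveAs(os.path.abspath("file_name.docx"), 16)  # 16 = default Docx format
doc.Close()
w.Quit()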
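Breakdown of the code: Dispatch('Word.Application') starts a Word instance through COM, Documents.Open loads the Doc file, and SaveAs writes it back out, where the file format code 16 selects Word’s default Docx format.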
For Linux
We can directly use LibreOffice’s built-in converter:
lowriter --convert-to docx testdoc.doc
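The same LibreOffice conversion can be driven from Python, which is handy when many files need converting. The following is a minimal sketch, assuming LibreOffice is installed and its `soffice` (or `lowriter`) executable is on the PATH. The helper names `build_convert_command` and `convert_doc_to_docx` are hypothetical and not part of any library.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, binary="soffice"):
    """Build the LibreOffice command line for a headless Doc -> Docx conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir="."):
    """Convert one Doc file and return the expected path of the Docx output."""
    # Assumption: LibreOffice is installed and one of its launchers is on PATH
    binary = shutil.which("soffice") or shutil.which("lowriter")
    if binary is None:
        raise RuntimeError("LibreOffice executable not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir, binary), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

Calling `convert_doc_to_docx("testdoc.doc")` would then write `testdoc.docx` to the current directory. Building the command as a list rather than a single string avoids shell-quoting problems with file names that contain spaces.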
Python has a module for reading and manipulating Docx files called “python-docx”, which already implements the essential functions. You can install it via pip:
pip install python-docx
I won’t go into detail about how a Docx document is structured, but at an abstract level it has three parts: Run, Paragraph, and Document objects. For this tutorial, we will be dealing with Paragraph and Document objects. Before moving to the actual code, let us look at the data we will be extracting:
Data in new Docx file
The new Docx file contains a patient’s glucose level at successive intervals. Each data row has an ID, a timestamp, a type, and a glucose level reading. To maintain anonymity, I have blurred out the patient’s name. The procedure to extract this data:
import docx
# Load the converted document
Text = docx.Document('file_name.docx')
data = {}
paragraphs = Text.paragraphs
# Start at index 2 to skip the first two paragraphs, then split each row on tabs
for i in range(2, len(paragraphs)):
    data[i] = tuple(paragraphs[i].text.split('\t'))
Here I split the text on “\t” because, if you look at one of the rows, the fields are separated by tabs.
data_values = list(data.values())
These values are now a list of tuples that we can pass to a pandas DataFrame. For my use case, I had to follow some additional steps, such as dropping unnecessary columns and converting timestamps. Here is the final pandas DataFrame I got from the initial Doc file:
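As a rough illustration of those steps, the sketch below builds a DataFrame from a few tab-split rows. The rows are made up, and the column names `ID`, `Time`, and `Record Type` are assumptions for the example; only `Historic Glucose (mg/dL)` appears in the original data.

```python
import pandas as pd

# Made-up rows in the shape produced by splitting each paragraph on tabs
data_values = [
    ("1", "2021-03-01 08:00", "0", "110"),
    ("2", "2021-03-01 08:15", "0", "118"),
    ("3", "2021-03-01 08:30", "0", "125"),
]

# Assumed column names; adjust them to the header row of your own document
columns = ["ID", "Time", "Record Type", "Historic Glucose (mg/dL)"]

doc_data = pd.DataFrame(data_values, columns=columns)

# Everything extracted from the document is text, so convert types explicitly
doc_data["Time"] = pd.to_datetime(doc_data["Time"])
doc_data["Historic Glucose (mg/dL)"] = pd.to_numeric(doc_data["Historic Glucose (mg/dL)"])

# Drop the columns that are not needed and index the readings by timestamp
doc_data = doc_data.drop(columns=["ID", "Record Type"]).set_index("Time")
```

Converting the types up front matters because numeric operations such as a rolling mean fail or misbehave on string columns.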
The python-docx module can do much more than load files. You can also create a Docx file with it and add headings, paragraphs, bold or italic text, images, tables, and much more. The module’s full documentation covers all of these features.
The main aim of this article was to show you how to extract tabular data from a Doc file into a pandas DataFrame. Let’s complete the ETL cycle and turn this data into visualizations using the Plotly library. If you haven’t used it before, Plotly is a visualization library for creating interactive plots.
These plots take little effort, and most aspects of them can be customized. There are many articles on Analytics Vidhya describing the usage of this library. For my use case, here is the configuration for the plot:
import plotly.graph_objects as go

# doc_data is the pandas DataFrame obtained from the Docx file
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=doc_data.index,
    # 5-point rolling mean to smooth the readings
    y=doc_data['Historic Glucose (mg/dL)'].rolling(5).mean(),
    mode='lines',
    marker=dict(
        size=20,
        line_width=2,
        colorscale='Rainbow',
        showscale=True,
    ),
    name='Historic Glucose (mg/dL)'
))
fig.update_layout(
    xaxis_tickangle=-45,
    font=dict(size=15),
    yaxis={'visible': True},
    xaxis_title='Dates',
    yaxis_title='Glucose',
    template='plotly_dark',
    title='Glucose Level Over Time'
)
fig.update_layout(hovermode="x")
fig.show()
In this article, I explained what Doc files are, the difference between the Doc and Docx extensions, how to convert Doc files to Docx, how to load and manipulate Docx files, and finally how to load the tabular data into a pandas DataFrame.
Use the pywin32 library to automate Microsoft Word for conversion, or use unoconv or LibreOffice for an open-source solution.
Use the python-docx library to create a DOCX file and add the text content from the TXT file programmatically.
Yes, Python can parse Word documents using libraries like python-docx for DOCX files or pywin32 for older DOC files.