Excel is the organization’s go-to tool for one or the other tasks in the workflow. It is the most popular choice among stakeholders. Whether accept it or not, I have seen people rejecting Tableau visuals for Excel. Even in conversations with my current organization’s CEO, he acknowledged that Excel is the only tool that will never vanish. There are various workarounds for adding extensible workflows to enhance the productivity of using excel automation. VBAs, macros, and tools like data-snipper are just the starting point. Openpyxl is an open-source library that brings Excel operations to Python. It provides implementation to manipulate existing as well as create new Excel files on the fly.
Although the use case and scenario where this library can be integrated will be far beyond the scope of this article. Here one can expect an implementation guide starting with the basic I/O of files to build up an automation process of generating reports in a predefined template.
Learning Objectives:
This article was published as a part of the Data Science Blogathon.
Most people in the data science field use the Pandas library. The library has a “style” attribute for data frames, use them to beautify the Excel files. It comes in handy when the user doesn’t expect to leverage a lot of Excel functionalities. You can read more about styling data frames from this blog post Style your Pandas DataFrame and Make it Stunning.
By sticking to Pandas alone, a lot of use cases can be achieved using Styler. In fact, Pandas offer openpyxl as an engine while exporting data frames to Excel files. Using standalone Openpyxl opens the door to data not stored natively in Pandas data frames. Getting data connected to sources such as MongoDB, Databricks, or CloudWatch can be directly wrangled and dumped to Excel files in a structured way without Pandas as an intermediate. Add Pivot tables, charts, passwords, and whatnot using Openpyxl!
To set up the library in the local environment, one can use a simple pip command, as with any other Python package. Install the library in a separate environment as a best practice to avoid any dependency conflicts.
pip install openpyxl
There are two ways to start working with Openpyxl:
Let’s start with the first way. Below is the preview of the file which we will refer:
The syntax to load a file in openpyxl is very similar to Pandas.
from openpyxl import load_workbook
wb = load_workbook("basic_sample.xlsx",
read_only=False,
keep_vba=False,
data_only=False,
keep_links=True)
The load_workbook function from the openpyxl package can take 4 additional arguments along with the file name to load.
All the option’s default value is set in the above code. Excel file is a collection of different worksheets, it is important to select the right worksheet to access the desired values. This can be achieved in two ways:
And to access the cells of these worksheets, the dictionary access method would work and combined with the value property will give the actual value of that specific cell. Let’s try to get the value of the “A1” cell using both worksheet access methods.
wb.active['A1'].value
# Kaustubh
wb['Sheet1']['A1'].value
# Kaustubh
To change the values of the cells, a simple assignment operator works. Let’s change the currently accessed name to something else.
wb['Sheet1']['A1'] = "Funky"
wb.save("new_basic.xlsx")
The save function saves the file in its current state. This is the output of the above code.
Excel automation often has different worksheets that are helpful in segregating analysis into isolated places so that these can be used as building blocks in cumulative dashboards. These worksheets can contain new data, pivots, charts, and much more. To add new worksheets to the existing workbooks, the create_sheet function can be used.
Let’s create a new sheet named “aggregated”. To verify that the sheet has been created, the workbook’s “sheetnames” property returns all the sheets of a workbook. And to make this change permanent, the workbook needs to be saved using the save function. The code implementation goes as follows:
wb.create_sheet("aggregated")
print(wb.sheetnames)
# Output: ['Sheet1', 'aggregated']
wb.save("new_basic.xlsx")
It’s time to wrangle the tabular data loaded from the sample file and add some additional values to the existing values. We will discuss the two methods of accessing the cell values and achieve multi-cell values access, let’s look at the 3 possible methods. Before discussing them, let’s understand how the cells are located in Excel.
In Excel, the cells are located in “CharacterNumber” formatting and in rows columns fashion. Some examples are “A1”: A column 1st row, “C67”: C column 67th row, “Z1”: Z column 1st row, etc. It means that every cell has a fixed location. These locations are used in every Excel formula, referencing, and whatnot. Let’s take a look at each one of the access methods.
Use the above methods for looping. In the below code snippet, all 3 methods return the same result:
# Scratch Method
for row in range(1, 10):
for col in range(0, 6):
char = chr(col+65)
print(wb['Sheet1'][char + str(row)].value, end=' ')
print('\n')
# Slicing Method
for row in wb['Sheet1']['A1':'F9']:
for cell in row:
print(cell.value, end=' ')
print('\n')
# In-build Method
for row in wb['Sheet1'].iter_rows(min_row=1, max_col=6, max_row=9):
for cell in row:
print(cell.value, end=' ')
print('\n')
Let’s try to modify the values of the last column to add the prefix “test_”. The code below is implemented using the slicing method:
for row in wb['Sheet1']['A1':'F9']:
for cell in row:
if 'College' in str(cell.value):
cell.value = 'test_' + cell.value
print(cell.value, end=' ')
print('\n')
There are many instances where new rows and columns need to be added to the existing data. For instance, the currently loaded data misses the headers for the rows. There is also a possibility of adding a new column showing Score %. Let’s make these changes.
To add a new row to the worksheet, the insert_rows function can be used. It takes the index location and the number of rows to be inserted as parameters (by default 1 row). This makes everything moves below by the number of rows inserted. Similarly, the insert_cols function is used for inserting columns at the provided index location. The code below inserts two rows from the start of the sheet and one column at the penultimate location :
wb['Sheet1'].insert_rows(0,2)
wb['Sheet1'].insert_cols(6,1)
Now let’s add the header values to the empty rows.
headers = [
'Name',
'Mathematics',
'Computer Organization',
'Data Structures and Algorithms',
'Web Development',
'Score %',
'College'
]
for row in wb['Sheet1']['A2':'G2']:
idx = 0
for cell in row:
cell.value = headers[idx]
idx += 1
Similarly, computations on the existing data can be done to add new data. The following code is calculating the Score % for all the rows and appends those scores in the Score % column.
for row in wb['Sheet1']['B3':'F11']:
score = 0
for cell in row[:4]:
score += int(cell.value)
row[4].value = round((score/400)*100, 2)
The resultant Excel after all the operations applied so far:
In the above section, the Score % calculation was done in a very ad-hoc manner and this can be clearly avoided by adding Excel formulas. There are more than 450+ Excel formulas that help to find valuable insights from the data. To calculate Score % in the current case, the average function can be used. This is how the above Score % can be calculated using Excel formulas:
for row in wb['Sheet1']['B3':'H11']:
row[6].value = f'=AVERAGE({row[0].coordinate}:{row[3].coordinate})'
In the above code, the coordinate property of cells will retrieve the locations of the cells. The results match the calculations done via Python.
Excel allows users to merge cells to form a bigger cell out of existing cells. This makes sure that the sheet formatting is consistent and that bigger cells occupy fixed space. These mergings are useful in case users want to display headers and create subsections and format sheets. Openpyxl offers two straightforward functions for this functionality: merge_cells and unmerge_cells. Both functions take in the cell range as the parameter and are functions of worksheet clasobject. Two checklist items related to these functions:
Continuing the Excel file used in previous sections, let’s merge the cells above the subjects rows and display that combined row as “Subjects”.
wb['Sheet1'].merge_cells('B1:E1')
wb['Sheet1']['B1'].value = 'Subjects'
Excel filters are one of the most commonly used features for quick summaries on a subset of data rather than working with whole data. With openpyxl, the Excel files can have the filters already added to the data. One caution is that the filters will not filter the data once added. It will just be present, and the user must interact with the filter once so that it is activated. Let’s add a filter for college:
from openpyxl.worksheet.filters import (
FilterColumn,
CustomFilter,
CustomFilters,
DateGroupItem,
Filters,
)
filters = wb['Sheet1'].auto_filter
filters.ref = "A2:G11"
col = FilterColumn(colId=6)
col.filters = Filters(filter=["test_College2"])
filters.filterColumn.append(col)
In the above code:
Once the Excel is opened and the filter initial interaction is done, the data is filtered on the selected column filter.
Pivots and charts are essentially the first steps toward summarizing data. One drawback of Openpyxl is it does not support the creation of Pivot tables from the code. It allows loading an existing Pivot table and making changes to that. For instance, if the reference data for Pivot tables are updated, the Pivot tables numbers will be updated once the user opens the updated Excel workbook. Openpyxl does support adding various types of charts such as Area, Bar, Line, Scatter, Pie, etc provided the table range and the other required parameters.
Let’s add a bar chart showing all subjects marks by students.
from openpyxl.chart import BarChart, Series, Reference
chart1 = BarChart()
chart1.type = "col"
chart1.style = 10
chart1.title = "Bar Chart"
chart1.y_axis.title = 'Marks'
chart1.x_axis.title = 'Names'
data = Reference(wb['Sheet1'], min_col=2, min_row=2, max_row=11, max_col=5)
cats = Reference(wb['Sheet1'], min_col=1, min_row=3, max_row=11)
chart1.add_data(data, titles_from_data=True)
chart1.set_categories(cats)
chart1.shape = 4
wb['Sheet1'].add_chart(chart1, "B13")
Here is the breakdown of the code:
The resultant graph:
Apart from adding graphs and charts, Excel supports adding graphics in the form of stock images. These can be useful in formatting the dashboards and tailoring them as per end-user needs. One thing to note here is that Openpyxl does not retain the images when an existing workbook is loaded. The images added via code do retain in Excel. To add images, the Image class from Openpyxl drawing can be used. See an example below to add images:
from openpyxl.drawing.image import Image
excel_logo = Image('Demo_Picture.png')
excel_logo.anchor = 'J4'
excel_logo.width = 130
excel_logo.height = 150
wb['Sheet1'].add_image(excel_logo)
The Image class is just a wrapper on the PIL (Pillow) library. Therefore, all attributes that can be configured in PIL are supported. In the above code, the anchor decides the location of the image, width, and height control the dimensions. The result looks like this:
Almost every formatting option is available in Excel. Whether it’s the range of Fonts, formatting such as Bold, italics, filling color patterns, cell alignment, or adding borders. Openpxl supports all these features with its easy-to-use API over Excel files. So far the demo Excel has undergone a lot of changes and now it’s time to beautify it. The following things are missing from the current Excel:
See the code implementation below:
from openpyxl.styles import Border, Alignment, Font, Side
font_bold = Font(bold=True)
font_red = Font(color='ffff2e2e')
font_green = Font(color='ff00ff00')
border = Border(left=Side(border_style="thin", color='FF000000'),
right=Side(border_style="thin", color='FF000000'),
top=Side(border_style="thin", color='FF000000'),
bottom=Side(border_style="thin", color='FF000000'))
alignment = Alignment(horizontal='center')
wb['Sheet1']['B1'].alignment = alignment
wb['Sheet1']['B1'].font = font_bold
for row in wb['Sheet1']['A2':'G2']:
for cell in row:
cell.font = font_bold
for row in wb['Sheet1']['F3':'F11']:
for cell in row:
cell.border = border
for row in wb['Sheet1']['B3':'F11']:
for cell in row[:4]:
if int(cell.value) > 30:
cell.font = font_green
else:
cell.font = font_red#import csv
The above code modifies the cell border, font, and alignment properties. See the below output:
Excel has a feature where anyone can leave comments on the cells. These comments are helpful when multiple people are working on the same type of analysis. Openpyxl allows adding comments to the cells using the comment property. To illustrate this, let’s add a “need to review” comment on cells where the marks are less than 30.
from openpyxl.comments import Comment
comment = Comment("Need to review", "auto grader")
for row in wb['Sheet1']['B3':'F11']:
for cell in row[:4]:
if int(cell.value) < 30:
cell.comment = comment
This will add a comment on all the cells in the range where the value is less than 30. This can be verified by hovering over the cells where the pinned triangle is added on the top right:
Till here, all the essential topics in Excel automation are covered. Here is the repo link to the Excel used and the notebook with all the code covered in this article: Excel automation. To apply all the concepts, here is a small toy example to get started.
Assume you are a Data Analyst working for a University. The University analyses every college assessment in the form of Excel charts. You have decided the automate the whole process. You have prepared one template Excel file that can be used to prefill the data and the charts get auto-updated based on the Pivot table created. The template file looks like this:
Your job is to come up with a script that can
The above-mentioned points are just the starting point. There are endless possibilities to work on this example and make the table more attractive and maybe create an Excel dashboard out of this! The template file can be found in the same repository mentioned above.
Excel automation has its own limitations but ideally, we should aim for shifting from manual work to automation. Openpyxl still has a lot of unexplored options that can help in building more robust workflows. Features such as streaming files from the backend, creating custom filters, and adding password protection are just a few examples. This guide covers the process of reading and manipulating Excel files using Python data structures, and then adding Excel functionalities.
Some Key Takeaways:
If you want to read/explore every article of mine, then head over to my master article list.
Hope you liked my article on Excel automation using python. For any doubts, queries, or potential opportunities, you can reach out to me via LinkedIn — in/kaustubh-gupta/
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.