Efficiency has become a key ingredient for the timely completion of work. No one is expected to spend more than a reasonable amount of time getting things done, especially when the task involves basic coding. One area where data scientists are expected to be fastest is in using the Pandas library in Python.
Pandas is an open-source package for data analysis and data manipulation in Python. It provides fast and flexible data structures that make it easy to work with relational and structured data.
If you’re new to Pandas, go ahead and enroll in this free course. It will guide you through all the ins and outs of this wonderful Python library and set you up for your data analysis journey. This is the sixth part of my Data Science hacks, tips, and tricks series. I highly recommend going through the previous articles to become a more efficient data scientist or analyst.
I have also converted my learning into a free course that you can check out:
Also, if you have your own Data Science hacks, tips, and tricks, you can share them with the open community on this GitHub repository: Data Science hacks, tips and tricks on GitHub.
To begin with, data exploration is an integral step in finding out the properties of a dataset. Pandas provides a quick and easy way to perform all sorts of analysis. One important analysis is the conditional selection of rows, also known as filtering.
The selection can be based on a single condition or on multiple conditions combined in a single statement with the logical operators & (and), | (or), and ~ (not).
For example, I’m taking up a dataset on loan prediction. You can check out the dataset here.
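We are going to select the rows of customers who haven’t graduated and have an income of 5400 or less. Let us see how to do it.
Note: Remember to put each condition inside parentheses. The & operator binds more tightly than comparison operators such as == and <=, so without the parentheses the expression is evaluated in the wrong order and you’ll set yourself up for an error.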
The following code performs this selection:
import pandas as pd
# Load the loan prediction dataset
data = pd.read_csv('loan_train.csv')
print(data[['Education', 'ApplicantIncome']].head())
print('\n\nConditional Selection of Rows\n\n')
# Wrap each condition in parentheses and combine them with & (element-wise AND)
data_2 = data.loc[(data['Education'] == 'Not Graduate') & (data['ApplicantIncome'] <= 5400)]
print('\n\nFiltered Data\n\n')
print(data_2[['Education', 'ApplicantIncome']].head())
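The same pattern extends to the other logical operators. Here is a minimal, self-contained sketch that shows | (or) and ~ (not) alongside &. It uses a small made-up DataFrame: the rows are illustrative and not taken from the loan dataset.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror the loan dataset
df = pd.DataFrame({
    'Education': ['Graduate', 'Not Graduate', 'Not Graduate', 'Graduate'],
    'ApplicantIncome': [5849, 3000, 6000, 2583],
})

# Each condition is a boolean Series with one True/False per row
not_graduate = df['Education'] == 'Not Graduate'
low_income = df['ApplicantIncome'] <= 5400

both = df.loc[not_graduate & low_income]         # AND: both conditions hold
either = df.loc[not_graduate | low_income]       # OR: at least one condition holds
neither = df.loc[~(not_graduate | low_income)]   # NOT: neither condition holds

print(both)
print(either)
print(neither)
```

Storing each condition in a named variable is one way to sidestep the parentheses problem, because every comparison is evaluated before the conditions are combined.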
Data can be of two types, continuous and categorical, depending on the requirements of our analysis. Sometimes we do not need the exact value of a continuous variable, only the group it belongs to. This is where binning comes into play.
For instance, you have a continuous variable in your data: age. But your analysis requires an age group such as child, teenager, adult, or senior citizen. Binning is perfect for this problem.
To perform binning, we use the cut() function. This is useful for going from a continuous variable to a categorical variable.
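As a rough sketch of the age example, the snippet below bins a handful of made-up ages with pd.cut(). The bin edges (12, 19, 59, 120) are assumptions chosen for illustration, not cut-offs from any dataset.

```python
import pandas as pd

# Hypothetical ages for illustration
ages = pd.DataFrame({'age': [5, 12, 15, 34, 70]})

# Assumed bin edges. With the default right=True each bin includes its right edge,
# so the intervals are (0, 12], (12, 19], (19, 59], (59, 120]
bins = [0, 12, 19, 59, 120]
labels = ['child', 'teenager', 'adult', 'senior citizen']

# pd.cut needs one more edge than labels: 5 edges define 4 bins
ages['age_group'] = pd.cut(ages['age'], bins=bins, labels=labels)
print(ages)
```

The resulting age_group column has a categorical dtype, and any age outside the outermost edges would come back as NaN.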
Grouping is an operation performed frequently in the daily work of data scientists and analysts. Pandas provides an essential function for grouping data: groupby().
The Groupby operation involves the splitting of an object based on certain conditions, applying a function, and then combining the results.
Let us again take the loan prediction dataset. Say I want to look at the average loan amount given to people from different property areas such as Rural, Semiurban, and Urban. Take a moment to understand this problem statement and think about how you can solve it.
Well, pandas groupby can solve this problem very efficiently. First, we split the data according to the property area. Second, we apply the mean() function to each of the categories. Finally, we combine it all together and print it as a new dataframe.
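The split, apply, and combine steps can be sketched as follows. The rows are made up for illustration, and the column names Property_Area and LoanAmount are assumed to match the loan prediction dataset.

```python
import pandas as pd

# Hypothetical loan records; column names are assumed to match the dataset
loans = pd.DataFrame({
    'Property_Area': ['Rural', 'Urban', 'Semiurban', 'Rural', 'Urban', 'Semiurban', 'Urban'],
    'LoanAmount': [100, 200, 150, 120, 100, 130, 150],
})

avg_loan = (
    loans.groupby('Property_Area')['LoanAmount']  # split by property area
    .mean()                                       # apply mean() to each group
    .reset_index(name='AverageLoanAmount')        # combine into a new dataframe
)
print(avg_loan)
```

Selecting the LoanAmount column before calling mean() keeps the aggregation to the one column of interest, which also avoids errors from non-numeric columns in the full dataset.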
Mapping is yet another important operation that offers high flexibility and many practical applications.
Pandas map() is used to map each value in a Series to some other value according to an input correspondence. This input may be a Series, a dictionary, or even a function.
Let us take up an interesting example. We have a dummy employee dataset. This dataset consists of the following columns: name, age, profession, city. Now you want to add another column stating the corresponding state. How would you do it? If the dataset has only ten rows, you might do it manually, but what if you have thousands of rows? It would be far more practical to use the pandas map().
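Note – The map() described here is Series.map(), which is called on a single column. Since pandas 2.1, DataFrame also has a map() method, but it applies a function elementwise and does not take a dictionary or Series as the correspondence.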
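A minimal sketch of the employee example is shown below. The employee rows and the city_to_state dictionary are hypothetical and stand in for the dummy dataset.

```python
import pandas as pd

# Hypothetical employee records standing in for the dummy dataset
employees = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meera', 'John'],
    'age': [28, 35, 41, 30],
    'profession': ['Analyst', 'Engineer', 'Manager', 'Designer'],
    'city': ['Mumbai', 'Bengaluru', 'Pune', 'Chennai'],
})

# The input correspondence: each city is looked up in this dictionary.
# Chennai is left out on purpose to show what happens to unmapped values.
city_to_state = {
    'Mumbai': 'Maharashtra',
    'Pune': 'Maharashtra',
    'Bengaluru': 'Karnataka',
}

# map() is called on the 'city' Series, not on the whole dataframe
employees['state'] = employees['city'].map(city_to_state)
print(employees)
```

Cities that are missing from the dictionary come back as NaN, so a quick check of employees['state'].isna() is a handy way to spot gaps in the lookup.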
Conditional formatting is one of my favorite Pandas hacks. It gives me the power to visually pinpoint the data that meets a certain condition.
You can use the Pandas style property to apply conditional formatting to your dataframe. Conditional formatting is the operation in which you apply visual styling to the dataframe based on some condition.
While Pandas provides a large number of styling operations, I’m going to show you a simple one here. For example, we have sales data for each salesperson. I want to highlight in green the sales values that are higher than 80.
Note – To apply a style function elementwise, use the Styler’s applymap() method, which was renamed to map() in pandas 2.1.
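The sales example might look like the sketch below. The sales figures, the helper names highlight_high and style_sales, and the choice of a green background (rather than green text) are assumptions for illustration. Note that the style property needs the optional jinja2 package to be installed.

```python
import pandas as pd

# Hypothetical sales figures for illustration
sales = pd.DataFrame({
    'salesperson': ['A', 'B', 'C'],
    'sales': [75, 92, 81],
})

def highlight_high(value, threshold=80):
    """Return a CSS rule that turns the cell green when value exceeds threshold."""
    return 'background-color: green' if value > threshold else ''

def style_sales(df):
    """Build a Styler that applies highlight_high to each cell of the 'sales' column."""
    styler = df.style  # requires jinja2
    # Styler.map exists from pandas 2.1; older versions call it Styler.applymap
    elementwise = getattr(styler, 'map', None) or styler.applymap
    return elementwise(highlight_high, subset=['sales'])
```

In a Jupyter notebook, evaluating style_sales(sales) as the last expression in a cell renders the styled table. The subset argument limits the styling to the numeric column so that the comparison is never run on the salesperson names.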
To summarize, in this article we covered five useful Pandas hacks, tips, and tricks across various pandas modules and functions. I hope these hacks will help you with day-to-day niche tasks and save you a lot of time. In case you are completely new to Python, I highly recommend this free course.
Let me know your Data Science hacks, tips, and tricks in the comments section below!
The data frame styling is interesting. Is this part of pandas or does it only apply within a Jupyter notebook? Can I use it to work with styles in an Excel spreadsheet?
The explanation about pandas is very valuable and understandable. Thanks!
Nice work! Can be better with embedding the code snippet in the post like the conditional selection trick.