DataFrame in Python
Performing Data Cleaning Operations on the Pandas DataFrame
Undoubtedly, the DataFrame is the most important structure in Python for storing data: in practically every project, the dataset we use to build our models ends up in one. It is defined in the Pandas library of Python. When analysing a dataset with Python, the very next step after importing the required libraries is to create a data frame, which is mostly done by reading the data file containing our data set into Python. Since the data set then lives in this structure, all subsequent operations are performed on the data frame, i.e. on its constituent rows and columns. That makes it essential to learn the various operations we perform on a data frame as part of Data Cleaning and hence the Data Preparation process.
In this article we will learn a few of these operations; links to the more advanced and frequently used operations, especially those that are part of the EDA process, are given within the article itself.
Before moving on to the DataFrame, it would be helpful to first understand some of the basic data structures in Python, such as the Series and the built-in structures. You will get all of this knowledge from just one article: A Beginners’ Guide to Data Structures in Python.
Introducing the Dataset
Importing the Python Libraries
Reading data into a DataFrame
Subsetting a DataFrame
Renaming the Variables
Re-ordering the Variables
Creating Calculated Columns
Dropping a Variable
Filtering the Data in a DataFrame
Sorting the Data
Grouping and Binning
Creating Summaries
For this article, we will be using the Iris dataset which can be downloaded from here. We will use this data set to learn how these operations are actually performed on some actual data.
Let’s import all the Python libraries we will need for operating on a DataFrame, namely NumPy and Pandas.
import numpy as np
import pandas as pd
Before getting to the operations we first need to create a DataFrame. Here we will read the data from a CSV (comma-separated values) file into a Pandas DataFrame, which we name df.
df=pd.read_csv('C:/Users/ACER/Desktop/Iris.csv')
With this our data frame df is created, and to have a basic look at the data we can give the command:
df.head()
By subsetting a DataFrame we mean selecting particular columns from the table. It is one of the frequently used operations. There are various ways to subset the data. We will discuss each of them one by one.
Let’s determine the column names first!
df.columns
Let’s start subsetting!
df.SepalLengthCm
Here we use the column name as an attribute of df; with this method we can get the data of a single column only.
df['Species']
Using this method we can subset one or more columns on the basis of the column names.
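To pull out more than one column at a time, we can pass a list of column names inside the brackets. A quick sketch on the same df (the column names assume the standard Iris CSV):
#subset multiple columns by passing a list of column names
df[['SepalLengthCm', 'Species']]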
df.iloc[:,1]
This gives us all the rows and the column at index 1, i.e. the second column only, so the column is selected using its default (positional) index. As the slicers suggest, multiple columns can be taken out at the same time, as shown below.
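For instance, a slice of column positions pulls out several columns at once. A small illustration, assuming the usual Kaggle Iris column order where Id comes first:
#all rows, the columns at positions 1 and 2 (SepalLengthCm and SepalWidthCm)
df.iloc[:, 1:3]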
df.loc[:,['PetalLengthCm','PetalWidthCm']]
Here we get all the rows and two columns, namely PetalLengthCm and PetalWidthCm.
While there is no dedicated method to reorder the variables in the original data frame, we have two options. First, we can view the columns of a data frame in whatever order we wish by subsetting the data in that order. Second, we can overwrite the original data frame with the data subsetted in the first option.
To view the data with the column names in a specific order we can do the following:
df.loc[:,['Species','SepalLengthCm', 'PetalWidthCm', 'PetalLengthCm', 'SepalWidthCm','Id']]
However, do remember that it does not lead to any permanent change in df.
To overwrite df simply command:
df=df.loc[:,['Species','SepalLengthCm', 'PetalWidthCm', 'PetalLengthCm', 'SepalWidthCm','Id']]
Also known as derived columns, calculated columns take their values from existing columns or a combination of them. We can either replace an existing column or create a new one, both of which are permanent changes to the table.
Let’s take a scenario!
I want to get the area of the Sepal for some kind of analysis. How can I do so?
df['SepalArea'] = df.SepalLengthCm * df.SepalWidthCm
df
A new column SepalArea is created at the end of the table. However, it makes more sense to have the area column beside the columns it is derived from. That can be done using the insert method.
df.insert(5, 'PetalArea', df.PetalLengthCm * df.PetalWidthCm)
df.head()
A new column PetalArea is created at the sixth position.
In both of the cases above the derived column was added to df. However, to preview the output without making a permanent change to df, we can use the assign method.
df.assign(Ratio=df.PetalArea/df.SepalArea)
The Ratio column is displayed in the output only and is not added to df.
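We can verify this with a quick check:
#assign returned a new DataFrame, so df itself has no Ratio column
'Ratio' in df.columns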
The rename method comes as a saviour when we get a data set with misspelt column names, or when the variables are not self-explanatory and give us no idea about the data they store.
For example, I wish the Species variable to be called NameOfSpecies.
df.rename(columns={'Species': 'NameOfSpecies'}, inplace=True)
df.tail()
An extremely important step in the Data Cleaning process is to remove the unnecessary variables in our data, i.e. those that do not affect our analysis in any way and do not relate to the business problem we are trying to solve.
Let me show you how it is done by dropping the variables we created above!
df.drop(columns=['PetalArea', 'SepalArea'], inplace=True)
df.head()
Filtering a data set essentially means filtering the rows which in turn refers to selecting particular rows from the data frame. This selection can be done both manually and conditionally. Let’s try filtering our data by both methods one by one!
Manual Filtering
You might have noticed that we have already filtered our data in some of the steps above. Recall! Yes, using .head() and .tail().
#display the first 4 rows of df
df.head(4)
#display the last 3 rows of df
df.tail(3)
There are other ways too by which filtering can be done.
Using [ ] we can slice the data. Giving a slicer in the first argument gives us the required rows on the basis of their default index.
df[:4]
Using .iloc[ ] we can extract rows on the basis of their default index, i.e. their default positions. Slicing as .iloc[start:end] takes out the rows with index from start to end - 1.
df.iloc[:2]
We get the rows with the default index 0 and 1, i.e. the first two rows of df.
Using .loc[ ] we can extract rows on the basis of their user-defined index, i.e. the row labels. Slicing as .loc[start:end] takes out the rows with index from start to end, both ends included.
df.loc[:5]
We get the rows with the user-defined index in (0, 1, 2, 3, 4, 5), i.e. the first six rows of df.
Notice that in this case the user-defined index is the same as the default index.
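To see the two kinds of index behave differently, here is a small sketch that uses the Id column as the row index (assuming the usual Kaggle Iris file where Id runs from 1 to 150); we work on a copy, so df itself stays untouched:
#use Id as the user-defined index instead of the default positions
tmp = df.set_index('Id')
tmp.loc[3:5]    #label based: the rows whose Id is 3, 4 or 5 (end inclusive)
tmp.iloc[3:5]   #position based: the 4th and 5th rows, i.e. Id 4 and 5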
Suppose we want to extract only some specific rows, not necessarily consecutive ones. How can that be done? Just mention the individual indices and we are done!
df.iloc[[3,0,12,5,9]]
Conditional Filtering
Unlike manual filtering, where we mentioned the row indices explicitly, in conditional filtering we filter the rows by boolean indexing, i.e. by checking conditions on the data. This can be done using [ ] and .loc[ ] on df, but not with .iloc[ ]. Let’s take a different approach to learn indexing by considering some scenarios.
Task 1: Get details for virginica species.
df[df.NameOfSpecies=='Iris-virginica'].head()
Task 2 : Get details for virginica and setosa species.
Although the above method can also be used, let’s try a different approach here where we will be using .isin
names = ['Iris-setosa', 'Iris-virginica']
df[df.NameOfSpecies.isin(names)]
By this we get all the records where the NameOfSpecies value is Iris-setosa or Iris-virginica.
Task 3 : Get the records for which the petal length is greater than the average petal length.
df.loc[df.PetalLengthCm > df.PetalLengthCm.mean()]
df.PetalLengthCm.mean() gives the average petal length ~ 3.75.
So we get the records where the petal length is greater than approximately 3.75.
We can combine task 2 and task 3 to get all those records where the species is virginica or setosa and petal length is more than the overall average petal length.
df.loc[(df.PetalLengthCm > df.PetalLengthCm.mean()) & ((df.NameOfSpecies=='Iris-virginica')| (df.NameOfSpecies=='Iris-setosa'))]
And we get a long data frame in this case! Let me show you a few rows and columns from it.
df.loc[(df.PetalLengthCm > df.PetalLengthCm.mean()) & ((df.NameOfSpecies=='Iris-virginica')| (df.NameOfSpecies=='Iris-setosa')),["NameOfSpecies","PetalLengthCm"]].head()
And now comes an interesting operation: sorting. The primary purpose of sorting the data in a data frame is to arrange it in an order that makes it easier to read. To sort the values of a particular column we use the sort_values method.
Let’s take a scenario where we want the sepal lengths to be in ascending order!
df.sort_values('SepalLengthCm', inplace=True)
df
What do we see?
The individual records have been sorted according to the sepal length values. (Check the row indices !)
Now, to sort the data by sepal width from highest to lowest value we can simply write the command as:
df.sort_values(by='SepalWidthCm', ascending=False, inplace=True)
df
The data frame is changed which is evident from the jumbled row indices!
But what if I am not happy with the indices being jumbled and would rather have them start from 0, while the records stay sorted by sepal width from highest to lowest? We can simply pass another argument to the above method!
df.sort_values(by='SepalWidthCm', ascending=False, inplace=True, ignore_index=True)
df
By this, we are just resetting the index to the default index from the user-defined index we obtained on sorting initially.
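If the data has already been sorted without ignore_index, an equivalent approach (a small sketch) is to reset the index afterwards:
#drop the jumbled labels and go back to the default 0..n-1 index
df = df.reset_index(drop=True)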
The next thing I am going to do is combine the two examples above. We can sort by sepal length in ascending order and, within that, sort by sepal width in descending order with the following command:
df.sort_values(by=['SepalLengthCm','SepalWidthCm'],ascending=[True,False],ignore_index=True)
We just learned about derived columns and it’s time to introduce another kind. Depending on our business problem, the values in an existing column can be grouped or binned to make a new column, known as a grouped/binned column. Why is it even done? To convert continuous variables into categorical variables.
Both of these fall into the category of derived columns, yet they differ: binning is done only on continuous variables, whereas grouping can be performed on categorical variables too. This is because bins are numeric intervals, which only make sense for continuous values.
But why do we even want these columns? They help us reduce the cardinality of a column, i.e. the number of distinct values it holds, as the quick check below shows.
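A quick way to see this reduction, using pd.cut which is introduced just below, is to compare the number of distinct values before and after binning:
#distinct raw measurements vs distinct bins after cutting into 4 intervals
df.SepalLengthCm.nunique()              #a few dozen distinct lengths
pd.cut(df.SepalLengthCm, 4).nunique()   #only 4 distinct bins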
Let’s try grouping and binning the variables in our dataset!
To create groups in Python we have three main methods: two are defined in the Pandas library and one comes from the NumPy library.
Method 1 : pd.cut()
This is used to group the values of a single continuous variable only.
Task: Group the Petal length values into groups!
pd.cut(df.PetalLengthCm , [0,2,5,8])
Here we are creating the user-defined groups (0, 2], (2, 5] and (5, 8]. We get the class intervals in ascending order.
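Instead of the interval notation, pd.cut can also attach readable labels to the groups. A small sketch, with label names made up purely for illustration:
#same cut points, but with custom labels for the three groups
pd.cut(df.PetalLengthCm, [0, 2, 5, 8], labels=['short', 'medium', 'long'])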
Method 2 : pd.qcut()
Just like pd.cut() it is used to group the values of a single continuous variable only. But it divides the values into groups having equal frequencies i.e. each group has an equal number of values.
Task: Group the Petal width into three equal parts!
pd.qcut(df.PetalWidthCm,3)
In this case the values in the PetalWidthCm column are first sorted, then the data is divided into 3 (roughly) equal parts, and hence we get the groups.
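We can confirm that the three groups hold roughly equal numbers of records with value_counts:
#each group should contain roughly a third of the 150 records
pd.qcut(df.PetalWidthCm, 3).value_counts()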
Method 3 : np.where()
Unlike the previous 2 methods, it can be used for one or multiple columns for any type of variable.
Task: Create a column ‘grouped’ that puts some records in one category and the rest in another.
#np.where(df.NameOfSpecies.isin(['Iris-virginica']), 'Major', 'Minor')
df['grouped'] = pd.Series(np.where(df.NameOfSpecies.isin(['Iris-virginica']), 'Major', 'Minor'))
df
To create bins we use the pd.cut() method!
Creating 4 bins of equal class interval
pd.cut(df.SepalLengthCm , 4)
There is yet another way, where instead of the number of bins we pass the bin edges directly!
pd.cut(df.SepalLengthCm,range(0,10,2))
To summarize the data in Python and create tables we have three ways.
Method 1 : Using .groupby()
Task: Determine species wise total Sepal Length.
df.groupby('NameOfSpecies').SepalLengthCm.sum()
df.groupby(['NameOfSpecies','grouped']).SepalLengthCm.sum()
Task: Determine species wise total Sepal Length and average Sepal Length.
df.groupby('NameOfSpecies').SepalLengthCm.agg([np.sum,np.mean])
df.groupby('NameOfSpecies')[['SepalLengthCm', 'SepalWidthCm']].agg([np.sum, np.mean])
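If we also want friendlier column names in the summary, pandas supports named aggregation as an alternative to passing a list of functions. A small sketch (the output names are arbitrary):
#name each aggregate explicitly in the output
df.groupby('NameOfSpecies').agg(
    total_sepal_length=('SepalLengthCm', 'sum'),
    avg_sepal_length=('SepalLengthCm', 'mean')
)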
So we have grouped the data successfully and created a summary. Now let’s learn a bit about tables. There are three kinds of tables we come across: vertical tables have their first row as the header, horizontal tables have their first column as the header, and cross tables have headers on both the rows and the columns.
To create a cross table on top of the summarized data we use the .pivot() method.
But these are two different steps. Instead, we can use just one method and do all the operations: group the data, aggregate it and create the table on top of the summarized data. This can be done using .pivot_table().
Method 2 : Using .pivot_table()
To create a cross table we can give the following command:
df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='sum')
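On our dataset, for example, a cross table of total sepal length by species and the ‘grouped’ column created earlier could look like this (a sketch, assuming the grouped column is still present in df):
#species down the side, Major/Minor across the top, total sepal length in the cells
df.pivot_table(index='NameOfSpecies', columns='grouped', values='SepalLengthCm', aggfunc='sum')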
Method 3 : Using pd.crosstab()
With this method only cross tables can be created, and it is typically used to create frequency tables.
I will mention the syntax here to create a frequency table:
pd.crosstab(index=df.col1, columns=df.col2, values=df.col3, aggfunc='count')
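On the Iris data, a frequency table of species against the grouped column could be built like this (again a sketch assuming the grouped column exists):
#count of records for every species / group combination
pd.crosstab(index=df.NameOfSpecies, columns=df.grouped)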
Now we can move on to performing much more advanced operations on the data frame like detecting the duplicates, dropping the duplicates, performing Outlier and Missing Value Treatment etc. All of these steps play a very important role in preparing our data for further analysis and predictive modelling. I have explained these topics in detail in this article: Exploratory Data Analysis in Python.
Finally, we have come to the end of this article. In it we performed various operations on a Pandas DataFrame in Python, the kind typically done while cleaning the data, manipulating it and preparing it for analysis. However, this is not all: many more operations can be performed on a data frame, such as dealing with duplicates, outliers and missing values, followed by their treatment. These are really important steps in EDA and hence should not be missed.
I strongly recommend you to read this article on Exploratory Data Analysis in Python which will help you understand much more crucial operations performed on a DataFrame.
You can connect with me on LinkedIn.