Operating on the Pandas DataFrame in Python

Ayushi | Last Updated: 25 Mar, 2022

Overview

  • DataFrame in Python

  • Performing Data Cleaning Operations on the Pandas DataFrame

Introduction

Undoubtedly, the DataFrame is the most important structure in Python for storing data: in practically every case, the data set we use to build our models ends up stored in one. It is defined in the Pandas library. When analysing a data set with Python, the step right after importing the required libraries is to create a DataFrame, which is mostly done by reading the data file into Python. From then on, every operation is performed on this DataFrame, which makes it essential to learn the various operations we can perform on its rows and columns as part of the Data Cleaning, and hence the Data Preparation, process.

In this article we will learn a few of these operations. For the more advanced and frequently used ones, especially those that form part of the EDA process, I have left links for you within this article.

Before moving on to the DataFrame, it would be helpful to first understand some of the basic data structures defined in Python, like the Series and the built-in data structures. You can get all of this knowledge from a single article: A Beginners’ Guide to Data Structures in Python.

Table of Contents

  • Introducing the Dataset

  • Importing the Python Libraries

  • Reading data into a DataFrame

  • Subsetting a DataFrame

  • Re-ordering the Variables

  • Creating Calculated Columns

  • Renaming the Variables

  • Dropping a Variable

  • Filtering the Data in a DataFrame

  • Sorting the Data

  • Grouping and Binning

  • Creating Data Summaries

Introducing the Dataset

For this article, we will be using the Iris dataset, which can be downloaded from here. We will use it to see how these operations are performed on actual data.

Importing the Python Libraries

Let’s import all the Python libraries we will need for operating on a DataFrame, namely NumPy and Pandas.

import numpy as np
import pandas as pd

Reading Data into a DataFrame

Before moving on to the operations, we first need to create a DataFrame. Here we read the data from a CSV (comma-separated values) file into a Pandas DataFrame, naming it df.

df = pd.read_csv('C:/Users/ACER/Desktop/Iris.csv')

This creates our DataFrame df. To take a first look at the data, we can give the command:

df.head()

Subsetting a DataFrame

By subsetting a DataFrame we mean selecting particular columns from the table. It is one of the most frequently used operations. There are various ways to subset the data, and we will discuss them one by one.

Let’s determine the column names first!

df.columns

Let’s start subsetting!

df.SepalLengthCm

Here we use the column name as an attribute of df; with this method we can get the data of a single column only.

df['Species']

Using square brackets we can subset one or more columns on the basis of the column names.
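For instance, passing a list of names inside the brackets pulls out several columns at once; a minimal sketch using two of the Iris columns:

# subset multiple columns by passing a list of column names
df[['SepalLengthCm', 'SepalWidthCm']]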

df.iloc[:,1]

This gives us all the rows and the column at index 1, i.e. the second column only; the column is taken out using the default (positional) index. As the slicers suggest, multiple columns can be taken out at the same time.
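For example, a slice of positions selects several adjacent columns at once (a minimal sketch):

# all rows, and the columns at positions 1 and 2 (the end position is excluded)
df.iloc[:, 1:3]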

df.loc[:,['PetalLengthCm','PetalWidthCm']]

Here we get all the rows and two columns, namely PetalLengthCm and PetalWidthCm.

Re-ordering the Variables

While there is no dedicated method to reorder the variables of the original data frame, we have two options. First, we can view the columns of a DataFrame in any order we wish by subsetting the data in that same order. Second, we can overwrite the original data frame with the subset from the first option.

To view the data with the column names in a specific order we can do the following:

df.loc[:,['Species','SepalLengthCm', 'PetalWidthCm', 'PetalLengthCm', 'SepalWidthCm','Id']]

However, do remember that it does not lead to any permanent change in df.

To overwrite df, simply run:

df=df.loc[:,['Species','SepalLengthCm', 'PetalWidthCm', 'PetalLengthCm', 'SepalWidthCm','Id']]

Creating Calculated Columns

Also known as derived columns, calculated columns take their values from existing columns or a combination of them. We can either replace an existing column or create a new one; both are permanent changes to the table.

Let’s take a scenario!

I want to get the area of the Sepal for some kind of analysis. How can I do so?

df['SepalArea']=df.SepalLengthCm*df.SepalWidthCm
df

A new column SepalArea is created at the end. However, it makes more sense to have an area column beside the columns it is derived from. That can be done too, using the insert method.

df.insert(5,'PetalArea',df.PetalLengthCm*df.PetalWidthCm)
df.head()

A new column PetalArea is created at the sixth position.

In both of the cases above, the derived column was added to df. However, to first view the output before making a permanent change to df, we can go for the assign method.

df.assign(Ratio=df.PetalArea/df.SepalArea)

The ratio column is displayed in the output only and not added to df.
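If we did want to keep the column, we could assign the result back (shown here only as a sketch; we leave df unchanged):

# persist the derived column by assigning the result back to df
df = df.assign(Ratio=df.PetalArea/df.SepalArea)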

Renaming the Variables

The rename method comes as a saviour when we get a data set with misspelt column names, or when the variables are not self-explanatory and give us no idea about the data they store.

For example, I wish the Species variable to be called NameOfSpecies.

df.rename(columns={'Species':'NameOfSpecies'},inplace=True)
df.tail()

Dropping a Variable

An extremely important step in the Data Cleaning process is removing unnecessary variables: usually those that do not affect our analysis in any way and do not relate to the business problem we are trying to solve.

Let me show you how it is done by dropping the variables we created above!

df.drop(columns=['PetalArea','SepalArea'],inplace=True)
df.head()

Filtering the Data in a DataFrame

Filtering a data set essentially means filtering the rows, which in turn means selecting particular rows from the data frame. This selection can be done both manually and conditionally. Let’s try filtering our data by both methods, one by one!

Manual Filtering

You might have noticed that we have already filtered our data in some of the steps above! Recall? Yes: using .head() and .tail().

#display the first 4 rows of df
df.head(4)

#display the last 3 rows of df
df.tail(3)

There are other ways too by which filtering can be done.

Using [ ] we can slice the data. Giving a slicer in the first argument gives us the required rows on the basis of their default index.

df[:4]

Using .iloc[ ] we can extract rows on the basis of their default index, i.e. their position. Slicing as .iloc[start:end] takes out the rows from start to end - 1.

df.iloc[:2]

We get the rows with the default index as 0 and 1 i.e the first two rows of df.

Using .loc[ ] we can extract rows on the basis of their user-defined index, i.e. the row labels. Slicing as .loc[start:end] takes out the rows from start to end, inclusive of the end label.

df.loc[:5]

We get the rows with the user-defined index values 0 to 5, i.e. the first six rows of df.

Notice that in this case the user-defined index (UDI) is the same as the default index (DI).
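When the labels differ from the positions, the two behave differently. A quick toy sketch (the frame demo is hypothetical and not part of our dataset):

# a small frame whose row labels are not 0, 1, 2, ...
demo = pd.DataFrame({'x': [10, 20, 30, 40]}, index=[3, 5, 7, 9])
demo.loc[3:7]    # labels 3, 5 and 7 -- the end label is included
demo.iloc[0:2]   # positions 0 and 1 -- the end position is excluded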

Suppose we want to extract only some specific rows, not necessarily consecutive ones. How can that be done? Just mention the individual indices and we are done!

df.iloc[[3,0,12,5,9]]

Conditional Filtering

Unlike manual filtering, where we mentioned the row indices ourselves, in conditional filtering we filter the rows by boolean indexing, i.e. by checking conditions on the data. This can be done using [ ] and .loc[ ] on df, but not with .iloc[ ]. Let’s take a different approach to learning indexing by considering some scenarios.

Task 1: Get details for virginica species.

df[df.NameOfSpecies=='Iris-virginica'].head()

Task 2 : Get details for virginica and setosa species.

Although the above method could also be used, let’s try a different approach here using .isin.

names = ['Iris-setosa','Iris-virginica']
df[df.NameOfSpecies.isin(names)]

By this we get all the records where the NameOfSpecies value is Iris-setosa or Iris-virginica.
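As an aside, the same filter can be inverted with ~ to get all the remaining species:

# the complement: records whose species is not in the list
df[~df.NameOfSpecies.isin(names)]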

Task 3 : Get the records for which the petal length is greater than the average petal length.

df.loc[df.PetalLengthCm > df.PetalLengthCm.mean()]

df.PetalLengthCm.mean() gives the average petal length ~ 3.75.

So we get the records where the petal length is greater than 3.75 (approximately).

We can combine task 2 and task 3 to get all those records where the species is virginica or setosa and petal length is more than the overall average petal length.

df.loc[(df.PetalLengthCm > df.PetalLengthCm.mean()) & ((df.NameOfSpecies=='Iris-virginica')| (df.NameOfSpecies=='Iris-setosa'))]

And we get a long data frame in this case! Let me show you a few rows and columns from it.

df.loc[(df.PetalLengthCm > df.PetalLengthCm.mean()) & ((df.NameOfSpecies=='Iris-virginica')| (df.NameOfSpecies=='Iris-setosa')),["NameOfSpecies","PetalLengthCm"]].head()

Sorting the Data

And now comes an interesting operation: sorting. Our primary purpose in sorting the data in a data frame is to arrange it for better readability. To sort by the values of a particular column we use the sort_values method.

Let’s take a scenario where we want the sepal lengths to be in ascending order!

df.sort_values('SepalLengthCm',inplace=True)
df

What do we see?

The individual records have been sorted according to the sepal length values. (Check the row indices!)

Now, to sort the data by sepal width from highest to lowest value we can simply write the command as:

df.sort_values(by='SepalWidthCm', ascending=False, inplace=True)
df

The data frame has changed, which is evident from the jumbled row indices!

But what if I am not happy with the indices being this way, and would rather have them start from 0 while the records stay sorted by sepal width from highest to lowest? We can simply give another argument in the above method!

df.sort_values(by='SepalWidthCm', ascending=False, inplace=True, ignore_index=True)
df

By this, we are just resetting the index to the default index from the user-defined index we obtained on sorting initially.
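For reference, an equivalent two-step alternative is to sort first and then reset the index:

# sort, then reset the row labels to the default 0, 1, 2, ...
df.sort_values(by='SepalWidthCm', ascending=False, inplace=True)
df.reset_index(drop=True, inplace=True)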

The next thing I am going to do is combine the two examples above. We can sort sepal length in ascending order and, within that, sort sepal width in descending order with a single command!

df.sort_values(by=['SepalLengthCm','SepalWidthCm'], ascending=[True,False], ignore_index=True)

Note that without inplace=True this sorted result is only displayed; df itself remains unchanged.

Grouping and Binning

We just learned about derived columns, and it’s time to introduce another kind of them. Depending on our business problem, the values in an existing column can be grouped or binned to make a new column, known as a grouped/binned column. Why is this even done? To convert continuous variables to categorical variables.

Both of these fall in the category of derived columns; however, they differ in one way. While binning is done only on continuous variables, grouping can be performed on categorical variables too. This is because bins are class intervals over numeric values (in our example below, intervals of equal width), which only makes sense for continuous data.

But why do we even want these columns? They help us reduce the cardinality of the columns.

Let’s try grouping and binning the variables in our dataset!

Grouping 

To create groups in Python we have 3 main methods: two of them, pd.cut() and pd.qcut(), are defined in the Pandas library, and one, np.where(), comes from the NumPy library.

Method 1 : pd.cut()

This is used to group the values of a single continuous variable only.

Task: Group the Petal length values into groups!

pd.cut(df.PetalLengthCm, [0, 2, 5, 8])

Here we are creating the user-defined groups (0,2], (2,5] and (5,8]. We get the class intervals in ascending order.
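The groups can also be given readable names through the labels parameter (the label names below are just an illustration):

# attach labels to the intervals instead of showing (0, 2], (2, 5], (5, 8]
pd.cut(df.PetalLengthCm, [0, 2, 5, 8], labels=['short', 'medium', 'long'])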

Method 2 : pd.qcut()

Just like pd.cut(), it is used to group the values of a single continuous variable only. But it divides the values into groups of equal frequencies, i.e. each group has (roughly) an equal number of values.

Task: Group the Petal width into three equal parts!

pd.qcut(df.PetalWidthCm,3)

In this case, the values of the PetalWidthCm column are first sorted, then the data is divided into 3 equal parts, and hence we get the groups.
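We can verify this with value_counts; the three quantile groups should have (roughly) equal counts:

# frequency of each quantile group
pd.qcut(df.PetalWidthCm, 3).value_counts()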

Method 3 : np.where()

Unlike the previous 2 methods, it can be used on one or several columns and on any type of variable.

Task: Create a column ‘grouped’ with one species in one category and the rest in the other.

# np.where returns 'Major' where the condition holds and 'Minor' elsewhere
df['grouped'] = np.where(df.NameOfSpecies.isin(['Iris-virginica']), 'Major', 'Minor')
df
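For more than two categories we could switch to np.select; a sketch with hypothetical labels:

# np.select picks the first label whose condition is true, else the default
conditions = [df.NameOfSpecies == 'Iris-virginica', df.NameOfSpecies == 'Iris-setosa']
df['grouped3'] = np.select(conditions, ['Major', 'Medium'], default='Minor')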

Binning 

To create bins we again use the pd.cut() method!

Creating 4 bins with equal class intervals:

pd.cut(df.SepalLengthCm, 4)

There is yet another way: instead of giving the number of bins, we supply the bin edges, here generated with range(0, 10, 2), i.e. edges at 0, 2, 4, 6 and 8.

pd.cut(df.SepalLengthCm, range(0, 10, 2))

Creating Data Summaries

To summarize the data in Python and create tables we have three ways.

Method 1 : Using .groupby()

Task: Determine the species-wise total sepal length.

df.groupby('NameOfSpecies').SepalLengthCm.sum()

We can also group the data by more than one column at a time:

df.groupby(['NameOfSpecies','grouped']).SepalLengthCm.sum()

Task: Determine the species-wise total and average sepal length.

df.groupby('NameOfSpecies').SepalLengthCm.agg(['sum','mean'])

To aggregate more than one column at a time, pass a list of column names:

df.groupby('NameOfSpecies')[['SepalLengthCm','SepalWidthCm']].agg(['sum','mean'])

So we have grouped the data successfully and created a summary. Now let’s learn a bit about tables. There are three kinds we come across: vertical tables have their first row as the header, horizontal tables have their first column as the header, and cross tables have headers in both rows and columns.

To create a cross table on top of the summarized data we use the .pivot() method.

But these are two different steps. Instead, we can use just one method and do all the operations: group the data, aggregate it and create the table on top of the summarized data. This can be done using .pivot_table().

Method 2 : Using .pivot_table()

To create a cross table we can give the following command:

df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='sum')
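As a concrete sketch on our dataset, using the grouped column we created earlier (the column choices here are just an illustration):

# total sepal length for each species (rows) by group (columns)
df.pivot_table(index='NameOfSpecies', columns='grouped', values='SepalLengthCm', aggfunc='sum')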

Method 3 : Using pd.crosstab()

With this method only cross tables can be created, and it is typically used to create frequency tables.

I will mention the syntax here to create a frequency table:

pd.crosstab(index=df.col1, columns=df.col2, values=df.col3, aggfunc='count')
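On our dataset, a frequency table of species against the grouped column could look like this (pd.crosstab counts by default, so values and aggfunc can be omitted):

# number of records per species and group
pd.crosstab(index=df.NameOfSpecies, columns=df.grouped)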

Now we can move on to performing much more advanced operations on the data frame, like detecting and dropping duplicates, and performing outlier and missing value treatment. All of these steps play a very important role in preparing our data for further analysis and predictive modelling. I have explained these topics in detail in this article: Exploratory Data Analysis in Python.

EndNotes

Finally, we have come to the end of this article. We performed various operations on a Pandas DataFrame in Python of the kind typically done while cleaning the data, manipulating it and preparing it for analysis. However, this is not all: many more operations can be performed on a data frame, like dealing with duplicates, outliers and missing values, followed by their treatment. These are really important steps in the EDA process and hence should not be missed.

I strongly recommend reading this article on Exploratory Data Analysis in Python, which will help you understand many more crucial operations performed on a DataFrame.

You can connect with me on LinkedIn.
