Top Rarely Used Pandas Function In 2023 One Should Know

Shilpi Mazumdar Last Updated : 09 Feb, 2023

12 min read

Introduction

When it comes to data preparation using Python, the term which comes to our mind is Pandas. No, not the one whom you see happily munching away on bamboo and lazily somersaulting.

Well, a library for prepping up the data for further analysis. Be it finding out which columns need to be removed or which features are to be created from the existing features. Pandas help us with all these questions. Everything and anything which requires your data to be prepped up for future analyzing steps, Pandas comes up as a handy tool.

Pandas

All of the functions we’ll discuss now fall under the data preprocessing step. The preprocessing step is an essential step, be it from the view of data preprocessing or the building of a machine learning model over the data. It helps clean, transform and prepare the data for further analysis. In many cases, the raw(un-processed) data might be incomplete and have some errors, which is well addressed when data preprocessing is done.

Here, in this article, we’ll quickly see some of the basic Pandas functions, which are not much used but can be very useful and convenient, especially if you’re a newbie or noob 😛

Let’s dive in!

Learning Objectives

Understanding some rarely used python data analysis functions.
Illustrating how to use these pandas functions effectively with examples of real world-scenarios and understanding it better with code snippets.
To show the benefits of these pandas functions and how they help in data preprocessing and analysis.
Hands-on examples and code snippets for the readers to practice and implement.

Ffill and Bfill Function
Shift Function
Select_Dtypes Function
Clip Function
Query Function
Melt Function
Where Function
Iat Function
Conclusion

Ffill and Bbfill Function

There are many instances where data in certain places don’t get captured, resulting in missing values. We need to treat the missing values so that it doesn’t lead to incorrect results in the long run when building a machine learning model on the same data. So, we would substitute the missing values which is called Missing Value Imputation. There are many ways to impute the missing values, but we’ll be discussing the functions ffill(forward fill) and bfill(backward fill). Let’s quickly see them. But what’s the basis on which we are imputing the values? In general, it’s a good idea to use forward fill (ffill) or backward fill (bfill) when the data has a temporal ordering and you want to propagate the last known value forward or backward in time, respectively, to fill in missing values.

Important to Note:
Missing Values Imputation using ffill() and bfill() is useful when the data has a temporal ordering, or if there is a trend or pattern which we wish to maintain. If our data is not of the above type, then it may be more appropriate to use the mean, median, or mode values to fill in missing values. Still, this might cause the loss of some information. It’s also important to note that, in any case, the choice of imputation method will depend on the specific characteristics of the dataset, so it’s often best to try different methods and see which one gives the best results.

The ffill method example,
We have the following dataframe. And we want to fill in the missing values.

Here, we’ll be taking an example of a time series dataset and imputing the missing values with the ffill function. If we observe carefully, we can see an increasing trend in the dataset. Let’s have a quick look at the dataset.

'ffill' and 'bfill' function

We’ll be imputing the following missing values using the forward fill function.

'ffill' and 'bfill' function

After applying the following code, the values will get imputed in such a way that the previous value to the missing value would get forwarded.

df1 = df1.ffill(axis=0)
df1

After applying the code, we get the following output. If we observe carefully, then we can see that only one observation has not been replaced. Why? Because we don’t have a value preceding the first observation.

'ffill' and 'bfill' function

Notice how the value in the 4th row, 2nd column has a value that it has copied from its previous cell. Forward cells are filled with the value of its previous observation.

Although there are many ways to impute this particular value(0th row, 1st column), here we’ll be using the bfill method to impute this value.

Let’s take the same dataset and perform the bfill function.

'ffill' and 'bfill' function

df1 = df1.bfill(axis=0)
df1

'ffill' and 'bfill' function

Observe the above output, where the missing values are imputed with the value following it. The value in the 1st observation(0th row) and 2nd column has been imputed with the value of the 2nd observation(1st row) and 2nd column and the same has happened in the following rows.

The ffill (forward fill) and bfill (backward fill) functions are commonly used in real-life scenarios for data pre-processing and handling missing values. Some use cases are:

Financial Data Analysis: Financial data often contain missing values due to various reasons, such as stock market holidays or delayed data releases. bfill can be used to backward-fill missing values with the next available value.
Climate Data Analysis: Climate data often contains missing values, such as when a weather station is not functioning or data is lost. ffill and bfill can be used to fill missing values with the nearest available data, allowing for continuous analysis.
Energy Consumption Data Analysis: Energy consumption data can contain missing values due to meter reading errors or equipment failure. Both ffill and bfill can be used to fill in missing values in energy consumption data, allowing for accurate analysis of energy consumption patterns.

Shift Function

This pandas function shifts the element to a desired location as per the desired number of periods we enter as a parameter. This function can work on both columns and also on rows.

Let’s see an example, where we can see the working of the shift function. Let’s say we have a weather dataset, and we wish to create new columns which contain the weather measurements from the previous day. We have the following weather data with us.

'shift' function

And now, we wish to create a new feature, such as the previous day’s temperature. These kinds of features would, in turn, be helpful as input features for the machine learning model to predict future weather.

So in order to fill the new features with values, we would shift the original features such as the temperature column by one 1 day.

df[‘prev_day_temp’] = df[‘temperature’].shift(1)
# we’ll remove the first row as it has a null value
df = df.dropna()
df

'shift' function

Some real-life scenarios where we use the shift function are as follows.

Seasonality Analysis: The shift function can be used in seasonality analysis to shift data by one seasonal cycle. This allows for the comparison of data across different seasons and can be useful for identifying seasonal trends and patterns.
Time-series Data Analysis: In time-series data, the shift function can be used to shift data by a specified number of periods, allowing for data to be compared across different time frames. This can be useful for analyzing trends and patterns in data.
Data Alignment: The shift function can be used to align data from different sources, such as data collected at different times or locations. This can be useful for ensuring that data is comparable and suitable for analysis.

Select_Dtypes Function

This function includes or excludes columns of specific datatypes. Let’s take the example of the students’ health dataset, which contains information(name, age, class roll no) about the college students and their bmi.

'select_dtypes' function

Now, let’s include only integer datatypes from our dataset.

We run the following code:

df.select_dtypes(include='int')

'select_dtypes' function

Let’s exclude the object(string) datatype from our dataset.

df.select_dtypes(exclude='object')

'select_dtypes' function

A few examples where we can get to see the use of the select_dtypes function are as follows:

A dataframe containing information about employees in a company, with columns like “employee_name”, “hire_date”, “salary”, and “department”. You want to analyze the salary data, so you use select_dtypes to filter the dataframe and select only the “salary” column, which is of float data type.
A dataframe containing customer reviews for different products, with columns like “reviewer_name”, “review_date”, “star_rating”, and “review_text”. You want to analyze the star rating data, so you use select_dtypes to select only the “star_rating” column, which is of int data type.
A large dataframe containing information about various customers and their purchases. The dataframe has columns like “customer_name”, “purchase_date”, “purchase_amount”, and “product_category”. You want to analyze only the numerical columns, like “purchase_amount”, to understand spending patterns.

Clip Function

In this pandas function, we trim the values when it reaches a certain level. This pandas function applies lower and upper bounds as parameters to have the required set of observations. Suppose we only want those observations whose range lies between 20.0 to 50.0. The highlighted values in the red box would be trimmed from the given range. 'clip' function
We’ll be running the following code.

sub_data.clip(20.0, 50.0)

'clip' function
If we look at the output, we can see that all the values in the Weight column are greater than 50.0, which is the upper bound. Hence, all the values are replaced by the upper bound limit.

A few examples of the clip function, where we can see its real-life usage.

A dataframe containing employee salaries, with a column “salary”. The company has a policy of capping the maximum salary at a certain value. You can use the clip function to set all salaries above the cap to the cap value.
A dataframe containing energy consumption data for different households, with a column “energy_consumption”. You want to clip the values below a certain minimum to 0, to represent that the minimum amount of energy used in a household is 0.
A dataframe containing stock prices for different companies, with a column “stock_price”. You want to clip the stock prices so that any value above a certain threshold is set to the threshold value. You can use the clip function to achieve this.

Query Function

As the name suggests, the query method helps in querying the dataset. This method takes a query expression as a string parameter which evaluates either True or False. The return type of this pandas function is a Dataframe. A dataframe is returned if the results turn out to be true depending on the given query expression.
Let’s take an example of a dataset and understand how this works. So, we have a dataset where we have details about customers visiting a hotel, their total_bill, tip_given, and time of ordering(lunch or dinner). I have used the tips.csv file over here. You can download the same from this link.

import pandas as pd
df = pd.read_csv('tips.csv')
df.head(10)

'query' function | pandas
Let’s say I only want to view those observations which have details of female customers who have ordered Lunch or Dinner and have sizes more than or equal to 2.

We can use the query function to get those observations.

food_time = ['Lunch','Dinner'] # the query result should include both dinner and lunch
sub_df = df.query('sex=="Female" & size>=2 & time==@food_time') # Using @ would take the variable
sub_df.tail(10)

'query' function

Here are some real-life examples to better understand the usage of the query function.

A dataframe containing information about different countries, with columns like “country_name”, “population”, “area”, and “continent”. You want to analyze the data for countries with a population greater than 100 million. Based on this condition, you can use the query function to filter the dataframe.
A dataframe containing information about different sports teams, with columns like “team_name”, “sport”, “wins”, and “losses”. You want to analyze the data for teams with a win-loss ratio greater than 0.5. You can use the query function to filter the dataframe based on this condition.
A dataframe containing information about different movies, with columns like “movie_title”, “release_date”, “rating”, and “genres”. You want to analyze the data for movies with a rating greater than 8.0. You can use the query function to filter the dataframe based on this condition.

Melt Function

To put it more easily, in this pandas function, the dataset is changed from being broader to being taller. The columns are transposed into rows. Let’s understand this with an example. We have the following dataset, where students’ marks(in percentage) in their two semesters have been given. Pandas I want the dataset to be in a long-form instead of being broader. Pandas
So, we will use the melt function.

df_mark=df_mark.melt(id_vars='Name', var_name='Sem_Month', value_name = 'Percentage')
# id_vars - the column's name to be used as identifier variables.
# var_name - the name to be used for the 'variable' column
# value_name - the name to be used for the values under the previous columns
df_mark

Pandas
The above output seems a bit different than what we had expected. Just observe the Sem_Month column. Let’s just sort the values based on the Name and Sem_Month columns.

df_mark = df_mark.sort_values(['Name','Sem_Month'])
df_mark

Pandas
We’ll just reset the index of our dataset to make it look better.

df_mark.reset_index(drop=True)

Pandas

Real-life examples where we use the melt function.

A dataframe containing exam scores for different students in different subjects, with columns like “Math”, “Science”, “English”, and so on. You want to analyze the exam scores, but the current format makes it difficult to do so. You can use the melt function to reshape the dataframe into a “long” format, with columns for “subject” and “score”, making it easier to perform analysis.
A dataframe containing the number of apples and bananas sold at a fruit stand, with columns like “Apples Sold” and “Bananas Sold”. You want to analyze the sales data, but the current format makes it difficult to do so. You can use the melt function to reshape the dataframe into a “long” format, with columns for “fruit” and “sold”, making it easier to perform analysis.
A dataframe containing the number of books read by different people, with columns like “Person A”, “Person B”, “Person C”, and so on. You want to analyze the book reading data, but the current format makes it difficult to do so. You can use the melt function to reshape the dataframe into a “long” format, with columns for “person” and “books_read”, making it easier to perform analysis.

Where Function

This function comes in handy when we wish to replace certain values in our dataframe. This checks one or more conditions and returns the result accordingly. In the following example, we have certain values as ‘Select,’ and we wish to replace them with the NaN values. Pandas

df_sub=df.where(cond=(df!='Select'), other='NaN')
#keep the dataframe values as it is, if not equal to ‘Select’, otherwise replace it with ‘NaN’
df_sub

So what exactly is happening in the above code? We use the cond parameter to filter the dataframe or series based on a given boolean condition. Here, the condition is that replace the ‘Select’ string values with ‘NaN’ values, and if there’s anything other than the ‘Select’ value, then keep it as it is. For example, in the 2nd row, BMI column, the condition is true (20 is not equal to ‘Select’), this value won’t be replaced.

Note that, by default, even if we don’t specify the value to be replaced with, then it would replace it by NaN itself. Otherwise, if we want to replace the existing value with some value as per our needs, then that needs to be mentioned in the other parameter.

Once the above code is run, we get the following output. Notice how the values which were ‘Select’ earlier have been replaced with ‘NaN’ values.

Pandas

A few real-life scenarios where the Pandas function can be used are as follows:

A DataFrame with customer data, and you want to replace all customers ages less than 18 with the value “minor”.
A dataframe having employee data. You want to highlight all employees who have been with the company for less than two years in red.
Analyzing sales data and want to identify all sales that are below the average sale amount.

Iat Function

Last but not least, let’s just quickly discuss the iat function. So let’s say that in the above Pandas function we replaced all occurrences of Select with NaN. But now, let’s say we wish to replace a particular occurrence of Select with some different value. So, here, comes the need for the iat

function.Let’s understand this with the same dataset. Pandas We’ll replace only a single occurrence(marked by the red box) of the Select value with NaN in the above dataset.
We’ll run the following code.

df.iat[3, 1] = ‘NaN’
# replacing the 3rd row, 1st column value by NaN
df

Pandas

The iat function is specific to the Pandas library in Python and is used to access a scalar value in a DataFrame or Series.

Conclusion

In this article, we learned about some of the Pandas functions used while working with Pandas to make our work easier and more efficient. These functions are some which save a lot of our time. These functions might not be as widely used as compared to other functions, but they should also be known to anyone, be it even a beginner.

Recap of key takeaways: So in this article, we learned about the importance of preprocessing. How is it important to prepare the data for further analysis? There might be various other functions that might help preprocess the data, but the functions discussed are some that are not known much but can be very useful.
Usefulness: The data on which we work is obtained from real-world scenarios, hence would contain missing values and would be dirty in most cases. Working with unprocessed or dirty data might bring inaccurate results, which can be dangerous in the long run when decisions are made based on that. Hence, preprocessing is important.
We understood the functions with code snippets, which would better our understanding and be very useful to someone starting off in their data analysis journey.

Shilpi Mazumdar

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

Top Rarely Used Pandas Function In 2023 One Should Know

Introduction

Table of Contents

Ffill and Bbfill Function

Shift Function

Select_Dtypes Function

Clip Function

Query Function

Melt Function

Where Function

Iat Function

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme