Pandas Groupby for Data Aggregation

Aniruddha Bhandari Last Updated : 24 Feb, 2025

Pandas GroupBy is a powerful functionality in the Pandas library, widely used for data manipulation and analysis in Python. It allows you to split data into groups based on specific criteria, apply functions to each group, and combine the results. This is particularly useful for aggregating, transforming, or filtering data efficiently. With GroupBy, you can easily perform operations like summing, averaging, or counting data within categories. It simplifies complex data tasks, making it a go-to tool for data scientists and analysts. In this article, you will get to know all about Pandas GroupBy.

What is the Pandas groupby Function?

The groupby function in Pandas is a tool that helps you organize data into groups based on certain criteria, like the values in a column. This makes it easier to analyze and summarize your data.

Let me take an example to elaborate on this. Let’s say we are trying to analyze the weight of people in a city. We can easily get a fair idea of a typical weight by determining the mean weight of all the city dwellers. But here’s a question – would the weight be affected by the gender of a person?

We can group the city dwellers into different gender groups and compute their mean weight. This would give us a better insight into the weight of a person living in the city. But we can probably get an even better picture if we further separate these gender groups into different age groups and then take their mean weight (because a teenage boy’s weight could differ from that of an adult male)!

GroupBy example
  • Separating people into groups and applying statistical values helps in better analysis compared to looking at the entire population’s statistics.
  • This is the key advantage of using GroupBy.
  • GroupBy enables grouping data based on different features.
  • It provides a more accurate understanding of the data.
  • It serves as a one-stop solution for gaining deep insights from data.
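To make this concrete, here is a minimal, runnable sketch of the idea, using a small invented weight table (the column names and numbers are illustrative, not from any real dataset):

```python
import pandas as pd

# Hypothetical weight data for a handful of city dwellers (invented values)
people = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'AgeGroup': ['Teen', 'Teen', 'Adult', 'Adult', 'Adult', 'Teen'],
    'Weight': [60, 55, 80, 62, 85, 52],
})

# A single overall mean hides the differences between groups
overall = people['Weight'].mean()

# Grouping by gender, and then by gender and age group, is more informative
by_gender = people.groupby('Gender')['Weight'].mean()
by_gender_age = people.groupby(['Gender', 'AgeGroup'])['Weight'].mean()

print(overall)
print(by_gender)
print(by_gender_age)
```

The finer the grouping, the closer each group mean is to the individuals inside it – exactly the intuition behind the teenage boy vs. adult male example above.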

Understanding the Dataset & Problem Statement

We will be working with the Big Mart Sales dataset from our DataHack platform. It contains attributes related to the products sold at various stores of BigMart. The aim is to find out the sales of each product at a particular store.

Right, let’s import the libraries and explore the data:

import pandas as pd
import numpy as np

df = pd.read_csv('train_v9rqX0R.csv')

We have some NaN values in our dataset, mostly in the Item_Weight and Outlet_Size columns. I will handle the missing values for Outlet_Size right now, but we’ll handle the missing values for Item_Weight later in the article using the GroupBy function!

First Look at Pandas GroupBy

Let’s group the dataset based on the outlet location type using GroupBy. The syntax is simple – we just call groupby on the pandas DataFrame:

df.groupby('Outlet_Location_Type')

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018C4E098288>

GroupBy has conveniently returned a DataFrameGroupBy object. It has split the data into separate groups. However, it won’t do anything unless it is explicitly told to. So, let’s find the count of different outlet location types:

df.groupby('Outlet_Location_Type').count()

We did not tell GroupBy which column we wanted the aggregation function applied to, so it was applied to all the relevant columns and the combined output was returned.

But fortunately, the GroupBy object supports column indexing just like a pandas DataFrame!

So let’s find out the total sales for each location type:

df.groupby('Outlet_Location_Type')['Item_Outlet_Sales']

Output:

GroupBy Series Object

Here, GroupBy has returned a SeriesGroupBy object. No computation will be done until we specify the agg function:

df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].sum()

Output:

GroupBy: Aggregation

Awesome! Now, let’s understand the work behind the GroupBy function in Pandas.
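Since the BigMart CSV is not bundled with this article, here is a self-contained sketch of the same pattern on a tiny stand-in frame (the column names mirror the article’s, but the numbers are made up):

```python
import pandas as pd

# Toy stand-in for the BigMart data (values are invented for illustration)
sales = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 2', 'Tier 1', 'Tier 3', 'Tier 2'],
    'Item_Outlet_Sales': [3735.14, 443.42, 2097.27, 732.38, 994.71],
})

# Column indexing on the GroupBy object gives a SeriesGroupBy;
# nothing is computed at this point
grouped = sales.groupby('Outlet_Location_Type')['Item_Outlet_Sales']

# The aggregation call is what triggers the actual computation
total_sales = grouped.sum()
print(total_sales)
```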


The Split-Apply-Combine Strategy

You just saw how quickly you can get an insight into grouped data using the Pandas GroupBy function. But, behind the scenes, a lot is taking place, which is important to understand to gauge the true power of GroupBy.

  • GroupBy Strategy:
    • Uses the Split-Apply-Combine approach, introduced by Hadley Wickham in 2011.
    • Breaks large problems into smaller, manageable parts.
    • Performs operations on individual parts.
    • Combines results to answer specific questions.
  • Example:
    • Goal: Calculate the average height for males and females in a dataset.
    • Steps:
      1. Split the data by gender (male and female).
      2. Apply the average height calculation to each group.
      3. Combine the results to get the final averages.
  • Dataset:
    • A sample dataset is created to demonstrate the process.


data = {
    'Gender': ['m', 'f', 'f', 'm', 'f', 'm', 'm'],
    'Height': [172, 171, 169, 173, 170, 175, 178]
}
df_sample = pd.DataFrame(data)
df_sample

Output:

Gender Height dataset

Splitting the data into separate groups:

f_filter = df_sample['Gender'] == 'f'
print(df_sample[f_filter])

m_filter = df_sample['Gender'] == 'm'
print(df_sample[m_filter])

Applying the operation that we need to perform (average in this case):

f_avg = df_sample[f_filter]['Height'].mean()

m_avg = df_sample[m_filter]['Height'].mean()

print(f_avg,m_avg)

Output:

170.0 174.5

Finally, combining the result to output a DataFrame:

df_output = pd.DataFrame({'Gender':['f','m'],'Height':[f_avg,m_avg]})

df_output

Output:

Grouped average

All three of these steps can be achieved using GroupBy with just a single line of code! Here’s how:

df_sample.groupby('Gender').mean()

Output:

GroupBy: Mean

Now that is smart! Have a look at how GroupBy did that in the image below:

Split Apply Combine

You can see how GroupBy simplifies our task by doing all the work behind the scenes without us having to worry about a thing!
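To see the equivalence concretely, here is a runnable check (using the same sample heights as above) that the one-liner reproduces the manual split-apply-combine result. The numeric column is selected explicitly, which is a good habit whenever other non-numeric columns are present:

```python
import pandas as pd

df_sample = pd.DataFrame({
    'Gender': ['m', 'f', 'f', 'm', 'f', 'm', 'm'],
    'Height': [172, 171, 169, 173, 170, 175, 178],
})

# Manual split-apply-combine
f_avg = df_sample[df_sample['Gender'] == 'f']['Height'].mean()
m_avg = df_sample[df_sample['Gender'] == 'm']['Height'].mean()

# GroupBy performs the same three steps in one line
auto = df_sample.groupby('Gender')['Height'].mean()

print(f_avg, m_avg)
print(auto)
```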

Now that you understand the Split-Apply-Combine strategy, let’s dive deeper into the GroupBy function and unlock its full potential.

Loop Over GroupBy Groups

Remember the GroupBy object we created at the beginning of this article? Don’t worry, we’ll create it again:

obj = df.groupby('Outlet_Location_Type')
obj

Output:

We can display the indices in each group by accessing the groups attribute of the GroupBy object:

obj.groups

Output:

GroupBy: Groups

We can even iterate over all of the groups:

for name,group in obj:
    print(name,'contains',group.shape[0],'rows')

Output:

GroupBy: Group iteration

But what if you want to get a specific group out of all the groups? Well, don’t worry. Pandas has a solution for that too.

Just provide the specific group name when calling get_group on the group object. Here, I want to check out the features for the ‘Tier 1’ group of locations only:

obj.get_group('Tier 1')

Output:

GroupBy: get_group

Now isn’t that wonderful! You have the entire Tier 1 group to work with and can derive wonderful insights! But wait, didn’t I say that GroupBy is lazy and doesn’t do anything unless explicitly told to? Alright then, let’s see GroupBy in action with the aggregate functions.
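The same groups, iteration, and get_group calls can be tried on a small stand-in frame (made-up rows, same column names as the article):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 2', 'Tier 1', 'Tier 3'],
    'Item_Outlet_Sales': [100.0, 200.0, 300.0, 400.0],
})
obj = df_toy.groupby('Outlet_Location_Type')

# Mapping of group label -> row indices belonging to that group
print(obj.groups)

# Iterate over (name, sub-DataFrame) pairs
for name, group in obj:
    print(name, 'contains', group.shape[0], 'rows')

# Pull out a single group as a regular DataFrame
tier1 = obj.get_group('Tier 1')
print(tier1)
```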

Applying Functions to GroupBy Groups

The apply step is unequivocally the most important step of a Pandas GroupBy operation, where we can perform a variety of operations using aggregation, transformation, filtration, or even our own custom functions!

Let’s have a look at these in detail.

Aggregation

We have looked at some aggregation functions in the article so far, such as count, sum, and mean. These perform statistical operations on a set of data. Have a glance at all the aggregate functions in the Pandas package:

  • count() – Number of non-null observations
  • sum() – Sum of values
  • mean() – Mean of values
  • median() – Arithmetic median of values
  • min() – Minimum
  • max() – Maximum
  • mode() – Mode
  • std() – Standard deviation
  • var() – Variance

But the agg() function in Pandas gives us the flexibility to perform several statistical computations all at once! Here is how it works (recent pandas versions prefer string names like 'mean' over NumPy functions such as np.mean):

df.groupby('Outlet_Location_Type').agg(['mean', 'median'])

Output:

GroupBy: Aggregate functions

We can even run GroupBy with multiple indexes to get better insights from our data:

df.groupby(['Outlet_Location_Type', 'Outlet_Establishment_Year'], as_index=False).agg(
    {'Outlet_Size': pd.Series.mode,
     'Item_Outlet_Sales': 'mean'
     }
)

Notice that I have used different aggregation functions for different column names by passing them in a dictionary with the corresponding operation to be performed. This allowed me to group and apply computations on nominal and numeric features simultaneously.

Also, I have changed the value of the as_index parameter to False. This way, the grouped index would not be output as an index.

We can even rename the aggregated columns to improve their comprehensibility, and we get a multi-index dataframe:

df.groupby(['Outlet_Type', 'Item_Type']).agg(
    mean_MRP=('Item_MRP', 'mean'),
    mean_Sales=('Item_Outlet_Sales', 'mean')
)

It is amazing how a name change can improve the understandability of the output!
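Both styles can be tried end-to-end on a toy frame (values invented for illustration; string names like 'mean' are used since recent pandas versions prefer them over np.mean):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Outlet_Type': ['Grocery', 'Grocery', 'Super', 'Super'],
    'Item_MRP': [100.0, 200.0, 50.0, 150.0],
    'Item_Outlet_Sales': [1000.0, 3000.0, 500.0, 1500.0],
})

# Dictionary style: one aggregation function per column
per_col = df_toy.groupby('Outlet_Type', as_index=False).agg(
    {'Item_MRP': 'mean', 'Item_Outlet_Sales': 'mean'}
)

# Named-aggregation style: output columns get readable names
named = df_toy.groupby('Outlet_Type').agg(
    mean_MRP=('Item_MRP', 'mean'),
    mean_Sales=('Item_Outlet_Sales', 'mean'),
)

print(per_col)
print(named)
```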

Transformation

Transformation allows us to perform some computation on the groups as a whole and then return the combined DataFrame. This is done using the transform() function.

We will try to compute the null values in the Item_Weight column using the transform() function.

The Item_Fat_Content and Item_Type will affect the Item_Weight, don’t you think? So, let’s group the DataFrame by these columns and handle the missing weights using the mean of these groups:

df['Item_Weight'] = df.groupby(['Item_Fat_Content', 'Item_Type'])['Item_Weight'].transform(
    lambda x: x.fillna(x.mean())
)

In short, transform() calls a function on each group and returns a result with the same shape as the original column, so the computed values line up row-for-row with the DataFrame.

You can read more about the transform() function in this article.
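Here is the same fill-by-group-mean pattern as a self-contained sketch with invented weights, so the effect of transform() is easy to verify:

```python
import numpy as np
import pandas as pd

# Toy frame with missing weights (values invented for illustration)
df_toy = pd.DataFrame({
    'Item_Type': ['Dairy', 'Dairy', 'Dairy', 'Snacks', 'Snacks'],
    'Item_Weight': [10.0, np.nan, 14.0, 8.0, np.nan],
})

# Fill each NaN with the mean weight of its Item_Type group
df_toy['Item_Weight'] = df_toy.groupby('Item_Type')['Item_Weight'].transform(
    lambda x: x.fillna(x.mean())
)
print(df_toy)
```

Because transform() returns a like-indexed result, the assignment drops straight back into the original column with no reindexing needed.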

Filtration

Filtration allows us to discard certain values based on computation and return only a subset of the group. We can do this using the filter() function in Pandas.

Let’s take a look at the number of rows in our DataFrame presently:

df.shape

Output:

(8523, 12)

If I wanted to keep only those groups whose item weights have a standard deviation below 3, I could use the filter function to do the job:

def filter_func(x):
    return x['Item_Weight'].std() < 3

df_filter = df.groupby(['Item_Weight']).filter(filter_func)
df_filter.shape

Output:

(8510, 12)

GroupBy has conveniently returned a DataFrame with only those groups whose Item_Weight standard deviation is less than 3.
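A self-contained sketch of the same filtering idea (invented data, grouped by a hypothetical Item_Type column so the per-group standard deviations are easy to see):

```python
import pandas as pd

# Toy frame: one group with nearly constant weights, one with a wide spread
df_toy = pd.DataFrame({
    'Item_Type': ['Dairy'] * 3 + ['Snacks'] * 3,
    'Item_Weight': [10.0, 10.5, 11.0, 5.0, 15.0, 25.0],
})

def filter_func(x):
    # Keep only groups whose weights have a standard deviation below 3
    return x['Item_Weight'].std() < 3

kept = df_toy.groupby('Item_Type').filter(filter_func)
print(kept)
```

The Dairy weights have a standard deviation of 0.5, so those rows survive; the Snacks spread is far wider, so that whole group is dropped.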

Applying Our Own Functions

Pandas’ apply() function applies a function along an axis of the DataFrame. When using it with the GroupBy function, we can apply any function to the grouped result.

For example, if I wanted to center the Item_MRP values on the mean of their establishment year group, I could use the apply() function to do just that:

df_apply = df.groupby(['Outlet_Establishment_Year'])['Item_MRP'].apply(lambda x: x - x.mean())
df_apply

Output:

Apply function

Here, the values have been centered, and you can check whether the item was sold at an MRP above or below the mean MRP for that year.
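The centering pattern can be verified on a toy frame (invented numbers): each MRP ends up expressed as its distance from its year group’s mean. (transform() would also work here and is often the more idiomatic choice for like-indexed results.)

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Outlet_Establishment_Year': [1999, 1999, 2004, 2004],
    'Item_MRP': [100.0, 200.0, 50.0, 150.0],
})

# Center each MRP on its establishment-year group mean
centered = df_toy.groupby('Outlet_Establishment_Year')['Item_MRP'].apply(
    lambda x: x - x.mean()
)
print(centered)
```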

Conclusion

I’m sure you can see how amazing the Pandas GroupBy function is and how useful it can be for analyzing your data. I hope this article helped you understand the function better! But practice makes perfect, so start with the super impressive datasets on our very own DataHack platform. Moving forward, you can read about how you can analyze your data using a pivot table in Pandas.

I hope you liked the article and now have a clear understanding of grouping and aggregating data with Pandas GroupBy.

Key Takeaways

  • Groupby() is a powerful function in pandas that allows you to group data based on a single column or more.
  • You can apply many operations to a groupby object, including aggregation functions like sum(), mean(), and count(), as well as lambda function and other custom functions using apply().
  • The resulting output of a groupby() operation can be a pandas Series or dataframe, depending on the operation and data structure.

Frequently Asked Questions

Q1. Can we use groupby without aggregate function in pandas?

A. Yes, we can use groupby without an aggregate function in pandas. In this case, groupby will return a GroupBy object that can be used to perform further operations.

Q2. What is the difference between groupby and groupby agg in pandas?

A. Groupby and groupby agg are both methods in pandas that allow us to group a DataFrame by one or more columns and perform operations on the resulting groups. However, there are some important differences between the two methods. Groupby returns a GroupBy object, which can be used to perform a variety of operations on the groups, such as applying functions, resetting index, or filtering. Whereas groupby agg is a method specifically for performing aggregation operations on a grouped DataFrame. It allows us to specify one or more aggregation functions to apply to each group and returns a DataFrame containing the results.

Q3. Is groupby slow in pandas?

A. Yes, it can be slow for large datasets. Optimize performance with smaller data types, chunk processing, or faster libraries like Dask or PySpark.

Q4. What is the role of categorical data in groupby operations?

A. Categorical data plays a significant role in groupby operations as it allows for efficient grouping and aggregation. When using categorical data, pandas can perform the groupby operation faster and use less memory. You can convert a column to a categorical type using astype('category') before grouping.
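A small sketch of the conversion (invented data; observed=True keeps only the categories actually present in the result):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'city': ['a', 'b', 'a', 'b', 'a'],
    'sales': [1, 2, 3, 4, 5],
})

# Convert the grouping column to a categorical dtype before grouping
df_toy['city'] = df_toy['city'].astype('category')

totals = df_toy.groupby('city', observed=True)['sales'].sum()
print(totals)
```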

Q5. How can we work with unique values in groupby operations?

A. To count the distinct values within each group, use the nunique method, which returns the number of unique values per group; to see the values themselves, use unique().
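A quick sketch with invented data, showing both nunique() for the counts and unique() for the values themselves:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'store': ['s1', 's1', 's2', 's2', 's2'],
    'item': ['apple', 'apple', 'apple', 'pear', 'milk'],
})

# Number of distinct items per store
counts = df_toy.groupby('store')['item'].nunique()
print(counts)

# The distinct values themselves
values = df_toy.groupby('store')['item'].unique()
print(values)
```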

Q6. How can we use dictionaries in groupby operations?

A. Dictionaries can be used in groupby operations to specify different aggregation functions for different columns. You can pass a dict to the agg method where keys are column names and values are aggregation functions.

I am on a journey to becoming a data scientist. I love to unravel trends in data, visualize it and predict the future with ML algorithms! But the most satisfying part of this journey is sharing my learnings, from the challenges that I face, with the community to make the world a better place!
