How to Use Pandas fillna() for Data Imputation?

Yashashwy Alok Last Updated : 24 Nov, 2024

Handling missing data is one of the most common challenges in data analysis and machine learning. Missing values can arise for various reasons, such as errors in data collection, manual omissions, or even the natural absence of information. Regardless of the cause, these gaps can significantly reduce the quality and accuracy of your analysis or predictive models.

Pandas, one of the most popular Python libraries for data manipulation, provides robust tools to deal with missing values effectively. Among these, the fillna() method stands out as a versatile and efficient way to handle missing data through imputation. It lets you replace missing values with a fixed value, with a statistic such as the mean, median, or mode, or with a neighboring value through forward or backward fill, so that your dataset is complete and analysis-ready.

What is Data Imputation?
Why is Data Imputation Important?
Understanding fillna() in Pandas
- Syntax of fillna() in Pandas
Using fillna() for Different Data Imputation Techniques
Conclusion
Frequently Asked Questions

What is Data Imputation?

Data imputation is the process of filling in missing or incomplete data in a dataset. When data is missing, it can create problems in analysis, as many algorithms and statistical techniques require a complete dataset to function properly. Data imputation addresses this issue by estimating and replacing the missing values with plausible ones, based on the existing data in the dataset.

Why is Data Imputation Important?

Here’s why:

Distorted Dataset

Missing data can skew the distribution of variables, altering the dataset’s integrity. This distortion may lead to anomalies, change the relative importance of categories, and produce misleading results.
For example, a high number of missing values in a particular demographic group could cause incorrect weighting in a survey analysis.

Limitations with Machine Learning Libraries

Many estimators in machine learning libraries such as scikit-learn assume that datasets are complete. Missing values can cause errors or prevent the successful execution of algorithms, because most of these estimators do not handle NaN on their own.
Developers must preprocess the data to address missing values before feeding it into these models.

Impact on Model Performance

Missing data can introduce bias, leading to inaccurate predictions and unreliable insights. A model trained on incomplete or improperly handled data might fail to generalize effectively.
For instance, if income data is missing predominantly for a specific group, the model may fail to capture key trends related to that group.

Desire to Restore Dataset Completeness

In cases where data is critical or datasets are small, losing even a small portion can significantly impact the analysis. Imputation becomes essential to retain all available information while mitigating the effects of missing data.
For example, a small medical study dataset might lose statistical significance if rows with missing values are removed.

Understanding fillna() in Pandas

The fillna() method replaces missing values (NaN) in a DataFrame or Series with specified values or computed ones. Missing values can arise due to various reasons, such as incomplete data entry or data extraction errors. Addressing these missing values ensures the integrity and reliability of your analysis or model.

Syntax of fillna() in Pandas

There are some important parameters available in fillna():

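value: Scalar, dictionary, Series, or DataFrame used to fill the missing values.
method: Fill method that propagates existing values. Deprecated since pandas 2.1 in favor of the dedicated ffill() and bfill() methods. Can be:
- ‘ffill’ (forward fill): Replaces NaN with the last valid value along the axis.
- ‘bfill’ (backward fill): Replaces NaN with the next valid value.
axis: Axis along which to fill (0 or 'index', 1 or 'columns').
inplace: If True, modifies the original object and returns None instead of a new object.
limit: Maximum number of missing values to fill. With forward or backward fill, this is the maximum number of consecutive NaNs to fill.
downcast: Attempts to downcast the resulting data to a smaller data type. Deprecated since pandas 2.2.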
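
The value and limit parameters described above can be sketched as follows. This is a minimal illustration, not an exhaustive tour of the API; the column names temp and city and the placeholder values are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one text column with gaps
df = pd.DataFrame({
    "temp": [20.5, np.nan, np.nan, 23.0],
    "city": ["Pune", None, "Delhi", None],
})

# Scalar value: every missing entry in the Series becomes 0.0
temp_zero = df["temp"].fillna(0.0)

# Dictionary value: a different replacement for each column
filled = df.fillna({"temp": -1.0, "city": "unknown"})

# limit: fill at most one missing entry and leave the rest as NaN
temp_limited = df["temp"].fillna(0.0, limit=1)

print(filled)
print(temp_limited)
```

Because inplace defaults to False, each call returns a new object and df itself still contains its original NaN values.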

Using fillna() for Different Data Imputation Techniques

There are several data imputation techniques, all of which aim to preserve the dataset’s structure and statistical properties while minimizing bias. These methods range from simple statistical approaches to advanced machine learning-based strategies, each suited to specific types of data and missingness patterns.

We will see some of these techniques which can be implemented with fillna():

1. Next or Previous Value

For time-series or ordered data, imputation methods often leverage the natural order of the dataset, assuming that nearby values are more similar than distant ones. A common approach replaces missing values with either the next or previous value in the sequence. This technique works well for both nominal and numerical data.

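import pandas as pd

data = {'Time': [1, 2, 3, 4, 5], 'Value': [10, None, None, 25, 30]}
df = pd.DataFrame(data)

# Forward fill: carry the last valid value forward.
# fillna(method='ffill') is deprecated since pandas 2.1; ffill() is the replacement.
df_ffill = df.ffill()

# Backward fill: pull the next valid value backward.
# bfill() likewise replaces the deprecated fillna(method='bfill').
df_bfill = df.bfill()

print(df_ffill)
print(df_bfill)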
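
Two practical details of directional filling are worth a short sketch: long gaps and gaps at the start of the data. The example below assumes a made-up Series and uses the limit argument of ffill() to cap how far a value is carried, then chains bfill() to cover a leading NaN that forward fill cannot reach.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with a leading gap and a run of three gaps
s = pd.Series([np.nan, 10.0, np.nan, np.nan, np.nan, 30.0])

# Carry a value forward across at most 2 consecutive missing entries
ffill_limited = s.ffill(limit=2)

# A leading NaN has no previous value, so forward fill leaves it alone;
# a backward fill afterwards covers it
ffill_then_bfill = s.ffill().bfill()

print(ffill_limited)
print(ffill_then_bfill)
```

Capping the fill distance is one way to avoid stretching a single stale observation across a long gap.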

2. Maximum or Minimum Value

When the data is known to fall within a specific range, missing values can be imputed using either the maximum or minimum boundary of that range. This method is particularly useful when data collection instruments saturate at a limit. For example, if a price cap is reached in a financial market, the missing price can be replaced with the maximum allowable value.

import pandas as pd

data = {'Time': [1, 2, 3, 4, 5], 'Value': [10, None, None, 25, 30]}
df = pd.DataFrame(data)

# Impute missing values with the minimum value of the column
df_min = df.fillna(df.min())

# Impute missing values with the maximum value of the column
df_max = df.fillna(df.max())

print(df_min)
print(df_max)
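
When the boundary is known in advance rather than observed in the data, it can be passed to fillna() directly. The sketch below assumes a hypothetical sensor that saturates at 100 and a hypothetical saturated flag column; neither comes from the example above. Passing a Series as the fill value fills only the rows where that Series has a value.

```python
import numpy as np
import pandas as pd

# Assumption: the instrument cannot report readings above this limit
SENSOR_MAX = 100.0

readings = pd.DataFrame({
    "reading": [72.0, np.nan, 98.5, np.nan],
    "saturated": [False, True, False, False],
})

# Build a fill Series that holds SENSOR_MAX only on saturated rows
fill_values = pd.Series(SENSOR_MAX, index=readings.index).where(readings["saturated"])

# fillna aligns on the index; rows whose fill value is NaN stay missing
readings["reading_filled"] = readings["reading"].fillna(fill_values)

print(readings)
```

The last row stays missing because nothing indicates that its reading was lost to saturation.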

3. Mean Imputation

Mean Imputation involves replacing missing values with the mean (average) value of the available data in the column. This is a straightforward approach that works well when the data is relatively symmetrical and free of outliers. The mean represents the central tendency of the data, making it a reasonable choice for imputation when the dataset has a normal distribution. However, the major drawback of using the mean is that it is sensitive to outliers. Extreme values can skew the mean, leading to an imputation that may not reflect the true distribution of the data. Therefore, it is not ideal for datasets with significant outliers or skewed distributions.

import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5, np.nan, 7],
        'B': [10, np.nan, 30, 40, np.nan, 60, 70]}
df = pd.DataFrame(data)

# Mean Imputation
df['A_mean'] = df['A'].fillna(df['A'].mean())

print("Dataset after Imputation:")
print(df)
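
The same idea extends to several columns at once. As a minimal sketch using the same sample values, df.mean() returns one mean per column and fillna() matches each mean to its column by label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5, np.nan, 7],
    "B": [10, np.nan, 30, 40, np.nan, 60, 70],
})

# One mean per column, computed from the non-missing values only
col_means = df.mean()

# fillna aligns the Series of means with the DataFrame's column labels
df_mean = df.fillna(col_means)

print(col_means)
print(df_mean)
```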

4. Median Imputation

Median Imputation replaces missing values with the median value, which is the middle value when the data is ordered. This method is especially useful when the data contains outliers or is skewed. Unlike the mean, the median is not affected by extreme values, making it a more robust choice in such cases. When the data has a high variance or contains outliers that could distort the mean, the median provides a better measure of central tendency. However, one downside is that it may not capture the full variability in the data, especially in datasets that follow a normal distribution. Thus, in such cases, the mean would generally provide a more accurate representation of the data’s true central value.

import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5, np.nan, 7],
        'B': [10, np.nan, 30, 40, np.nan, 60, 70]}
df = pd.DataFrame(data)

# Median Imputation
df['A_median'] = df['A'].fillna(df['A'].median())

print("Dataset after Imputation:")
print(df)
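
The robustness of the median is easiest to see next to the mean on data with an outlier. The values below are made up for illustration, with 1000 planted as a deliberate outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical values; 1000 is a deliberate outlier
s = pd.Series([30, 32, 35, np.nan, 31, 1000])

mean_filled = s.fillna(s.mean())      # mean of the 5 known values is 225.6
median_filled = s.fillna(s.median())  # median of the 5 known values is 32.0

print(mean_filled[3], median_filled[3])
```

The mean-imputed value lands far from every typical observation, while the median-imputed value sits among them.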

5. Moving Average Imputation

The Moving Average Imputation method calculates the average of a specified number of surrounding values, known as a “window,” and uses this average to impute missing data. This method is particularly valuable for time-series data or datasets where observations are related to previous or subsequent ones. The moving average helps smooth out fluctuations, providing a more contextual estimate for missing values. It is commonly used to handle gaps in time-series data, where the assumption is that nearby values are likely to be more similar. The major disadvantage is that it can introduce bias if the data has large gaps or irregular patterns, and it can also be computationally intensive for large datasets or complex moving averages. However, it is highly effective in capturing temporal relationships within the data.

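import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5, np.nan, 7],
        'B': [10, np.nan, 30, 40, np.nan, 60, 70]}
df = pd.DataFrame(data)

# Moving Average Imputation (trailing window of 2: the current row and the one before it).
# min_periods=1 lets the window produce a mean even when one of its values is NaN,
# so for an isolated missing row this window reduces to the previous value.
df['A_moving_avg'] = df['A'].fillna(df['A'].rolling(window=2, min_periods=1).mean())

print("Dataset after Imputation:")
print(df)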
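
To average values on both sides of a gap rather than only the preceding ones, the rolling window can be centered. This is a sketch of one possible variant, assuming a window of 3 and the same sample values as column A.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, 5, np.nan, 7], dtype="float64")

# Centered window of 3: the row itself plus one neighbor on each side.
# NaN values inside the window are skipped when the mean is computed.
centered_avg = s.rolling(window=3, center=True, min_periods=1).mean()

# Use the centered average only where the original value is missing
filled = s.fillna(centered_avg)

print(filled)
```

Each missing value is replaced by the average of its two neighbors, so the gap between 2 and 4 becomes 3 and the gap between 5 and 7 becomes 6.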

6. Rounded Mean Imputation

The Rounded Mean Imputation technique involves replacing missing values with the rounded mean value. This method is often applied when the data has a specific precision or scale requirement, such as when dealing with discrete values or data that should be rounded to a certain decimal place. For instance, if a dataset contains values with two decimal places, rounding the mean to two decimal places ensures that the imputed values are consistent with the rest of the data. This approach makes the data more interpretable and aligns the imputation with the precision level of the dataset. However, a downside is that rounding can lead to a loss of precision, especially in datasets where fine-grained values are crucial for analysis.

import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5, np.nan, 7],
        'B': [10, np.nan, 30, 40, np.nan, 60, 70]}
df = pd.DataFrame(data)

# Rounded Mean Imputation (round() without a digits argument rounds to the nearest integer)
df['A_rounded_mean'] = df['A'].fillna(round(df['A'].mean()))

print("Dataset after Imputation:")
print(df)
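
The two-decimal case mentioned above can be sketched by passing a digits argument to round(). The prices and counts below are hypothetical values chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical prices recorded to two decimal places
prices = pd.Series([19.99, 24.50, np.nan, 21.25])

# Round the mean to the same precision as the data before filling
prices_filled = prices.fillna(round(prices.mean(), 2))

# Hypothetical discrete counts: round to a whole number, then restore an integer dtype
counts = pd.Series([3, 4, np.nan, 6])
counts_filled = counts.fillna(round(counts.mean())).astype("int64")

print(prices_filled)
print(counts_filled)
```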

7. Fixed Value Imputation

Fixed value imputation is a simple and versatile technique for handling missing data by replacing missing values with a predetermined value, chosen based on the context of the dataset. For categorical data, this might involve substituting missing responses with placeholders like “not answered” or “unknown,” while numerical data might use 0 or another fixed value that is logically meaningful. This approach ensures consistency and is easy to implement, making it suitable for quick preprocessing. However, it may introduce bias if the fixed value does not reflect the data’s distribution, potentially reducing variability and impacting model performance. To mitigate these issues, it is important to choose contextually meaningful values, document the imputed values clearly, and analyze the extent of missingness to assess the imputation’s impact.

import pandas as pd

# Sample dataset with missing values
data = {
    'Age': [25, None, 30, None],
    'Survey_Response': ['Yes', None, 'No', None]
}
df = pd.DataFrame(data)

# Fixed value imputation
# For numerical data (e.g., Age), replace missing values with a fixed number, such as 0
df['Age'] = df['Age'].fillna(0)

# For categorical data (e.g., Survey_Response), replace missing values with "Not Answered"
df['Survey_Response'] = df['Survey_Response'].fillna('Not Answered')

print("\nDataFrame after Fixed Value Imputation:")
print(df)
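
One simple way to document which values were imputed is to record a flag before filling. The sketch below assumes a hypothetical indicator column named Age_was_missing; the name is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, None, 30, None]})

# Record which rows were missing BEFORE filling, so the information is not lost
df["Age_was_missing"] = df["Age"].isna()

# Then apply the fixed value
df["Age"] = df["Age"].fillna(0)

print(df)
```

With the flag in place, a later analysis can tell a genuine 0 apart from a placeholder, or exclude imputed rows when needed.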

Conclusion

Handling missing data effectively is crucial for maintaining the integrity of datasets and ensuring the accuracy of analyses and machine learning models. The Pandas fillna() method offers a flexible and efficient approach to data imputation, accommodating a variety of techniques tailored to different data types and contexts.

From simple methods like replacing missing values with fixed values or statistical measures (mean, median, mode) to more sophisticated techniques like forward/backward filling and moving averages, each strategy has its strengths and is suited to specific scenarios. By choosing the appropriate imputation technique, practitioners can mitigate the impact of missing data, minimize bias, and preserve the dataset’s statistical properties.

Ultimately, selecting the right imputation method requires understanding the nature of the dataset, the pattern of missingness, and the goals of the analysis. With tools like fillna(), data scientists and analysts are equipped to handle missing data efficiently, enabling robust and reliable results in their workflows.

Frequently Asked Questions

Q1. What does fillna() do in pandas?

Ans. The fillna() method in Pandas is used to replace missing values (NaN) in a DataFrame or Series with a specified value, method, or computation. It allows filling with a fixed value, propagating the previous or next valid value using methods like ffill (forward fill) or bfill (backward fill), or applying different strategies column-wise with dictionaries. This function is essential for handling missing data and ensuring datasets are complete for analysis.
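
As a rough sketch of the column-wise dictionary form, the example below fills a numeric column with its median and a text column with its mode (most frequent value). The column names income and city and their values are assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40.0, np.nan, 55.0, 48.0, np.nan],
    "city": ["Pune", "Delhi", None, "Pune", None],
})

# One strategy per column, keyed by column name
fill_values = {
    "income": df["income"].median(),      # middle of the known incomes
    "city": df["city"].mode().iloc[0],    # most frequent known city
}

df_filled = df.fillna(fill_values)

print(df_filled)
```

mode() returns a Series because several values can tie for most frequent, which is why the first entry is selected with iloc[0].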

Q2. What is the difference between dropna() and fillna() in pandas?

Ans. The primary difference between dropna() and fillna() in Pandas lies in how they handle missing values (NaN). dropna() removes rows or columns containing missing values, effectively reducing the size of the DataFrame or Series. In contrast, fillna() replaces missing values with specified data, such as a fixed value, a computed value, or by propagating nearby values, without altering the DataFrame’s dimensions. Use dropna() when you want to exclude incomplete data and fillna() when you want to retain the dataset’s structure by filling gaps.
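
The effect on the DataFrame’s dimensions can be checked directly. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [10.0, 20.0, np.nan, 40.0],
})

# dropna() removes every row that contains at least one missing value
dropped = df.dropna()

# fillna() keeps every row and replaces the missing values instead
filled = df.fillna(0)

print(df.shape, dropped.shape, filled.shape)
```

Here two of the four rows contain a gap, so dropna() keeps only half of the data while fillna() keeps all of it.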

Q3. What’s the difference between interpolate() and fillna() in Pandas?

Ans. In Pandas, both fillna() and interpolate() handle missing values but differ in approach. fillna() replaces NaNs with specified values (e.g., constants, mean, median) or propagates existing values (e.g., ffill, bfill). In contrast, interpolate() estimates missing values using surrounding data, making it ideal for numerical data with logical trends. Essentially, fillna() applies explicit replacements, while interpolate() infers values based on data patterns.
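
A short comparison on a made-up Series with a steady trend illustrates the difference. interpolate() is called with its default linear method, and forward fill stands in for the propagation style of filling.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Propagation: repeat the last valid value across the gap
forward_filled = s.ffill()

# Linear interpolation: space the missing values evenly between 10 and 40
interpolated = s.interpolate()

print(forward_filled.tolist())
print(interpolated.tolist())
```

Forward fill holds the series flat at 10 until the next observation, while interpolation follows the upward trend through 20 and 30.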
