Introduction to Data Imputation

Shashank Last Updated : 04 Apr, 2025

Imagine trying to solve a puzzle with missing pieces—it’s frustrating, incomplete, and ultimately, inaccurate. That’s exactly what working with datasets full of missing values feels like in the world of data science. Whether you’re building predictive models or uncovering trends, incomplete data can throw off your entire analysis. But don’t worry—this article is your missing piece! We’ll dive into practical and powerful data imputation techniques that help fill in the gaps, including Complete Case Analysis (CCA), Arbitrary Value Imputation, and Frequent Category Imputation. By the end, you’ll be equipped to handle missing data like a pro and build models that don’t just guess—they know.

Learning Objectives

Understand the significance of data imputation in data science and its impact on analysis quality.
Learn about various data imputation methods, including Complete Case Analysis (CCA), Arbitrary Value Imputation, and Frequent Category Imputation.
Identify the assumptions, benefits, and limitations of different imputation techniques.
Gain practical skills in applying imputation methods to manage missing data effectively.
Ensure robust and accurate analyses by mitigating issues related to missing data.

This article was published as a part of the Data Science Blogathon.

What is Data Imputation?
Why Data Imputation is Important?
Data Imputation Techniques
Conclusion
Frequently Asked Questions

What is Data Imputation?

Imputation is a technique for replacing missing data with substitute values so that most of the information in the dataset is retained. It is used because dropping records every time a value is missing is often not feasible: it can shrink the dataset to a large extent, which not only risks biasing the dataset but can also lead to incorrect analysis.

Not sure what missing data is, how it occurs, or what types it comes in? Have a look HERE to know more about it.

Let’s understand the concept of imputation from the figure above (Fig 1). The left table shows the missing data (marked in red). In the right table, imputation techniques have filled the missing values (marked in yellow) without reducing the actual size of the dataset. Notice also that the column size has increased here, which imputation allows (for example, by adding a “Missing” category).
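The idea in the figure can be sketched on a small table. This is a minimal illustration, assuming pandas is available; the column names, the values, and the choice of median and “Missing” as fill values are hypothetical and not taken from the figure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; None/NaN mark the missing cells.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, 40.0],
    "City": ["Pune", "Delhi", None, "Delhi"],
})

imputed = df.copy()
# Numeric column: fill with the median of the observed values.
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
# Categorical column: fill with an explicit "Missing" category.
imputed["City"] = imputed["City"].fillna("Missing")

print(df.shape, imputed.shape)       # same number of rows and columns
print(imputed.isnull().sum().sum())  # no missing cells remain
```

The table keeps all of its rows, and the City column gains one extra category.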

Why Data Imputation is Important?

Now that we know the definition of imputation, the next questions are: why should we use it, and what happens if we don’t?

We use imputation because missing data can cause the following issues:

Incompatible with most of the Python libraries used in Machine Learning: Most estimators in common ML libraries (sklearn being the most widely used) do not handle missing data automatically and raise errors when they encounter it.
Distortion in Dataset: A large amount of missing data can distort a variable’s distribution, i.e. it can inflate or deflate the share of a particular category in the dataset.
Affects the Final Model: Missing data can bias the dataset and lead the model to a faulty analysis.
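To make the distortion point concrete, here is a small sketch, assuming pandas is available. The column and its values are invented for illustration. It compares the share of a category among the observed values with its share after the gaps are filled naively with the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 6 "Yes", 2 "No", 4 missing.
s = pd.Series(["Yes"] * 6 + ["No"] * 2 + [np.nan] * 4)

# Share of each category among the observed values only.
before = s.value_counts(normalize=True)

# Fill every gap with the most frequent category, then recompute the shares.
after = s.fillna(s.mode()[0]).value_counts(normalize=True)

print(before["Yes"], after["Yes"])  # the "Yes" share grows after filling
```

The more values are missing, the larger this shift becomes, which is why the amount of missing data matters when choosing a technique.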

Another, and the most important, reason is that we want to restore the complete dataset. This is mostly the case when we do not want to lose any (more) data because all of it is important, or when the dataset is not very big and removing part of it can have a significant impact on the final model.

Great! We now have the basic concepts of missing data and imputation. Next, let’s look at the different imputation techniques and compare them. But before we jump in, we have to know the types of data in our dataset.

Sounds strange? Don’t worry. Most data falls into 4 types: Numeric, Categorical, Date-time and Mixed. These names are quite self-explanatory, so I will not describe them in depth.

Figure: Types of data
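As a rough sketch of how these types can be told apart in practice, the snippet below groups the columns of a pandas DataFrame by dtype. The frame and its column names are hypothetical, and treating a value such as "C85" as “mixed” (letters and digits together) is an assumption about what the term means here.

```python
import pandas as pd

# Hypothetical frame with one column of each kind discussed above.
df = pd.DataFrame({
    "LoanAmount": [120.0, 85.5, 200.0],        # numeric
    "Gender": ["Male", "Female", "Male"],      # categorical
    "ApplicationDate": pd.to_datetime(
        ["2021-01-05", "2021-02-11", "2021-03-20"]
    ),                                          # date-time
    "Cabin": ["C85", "E46", "B28"],            # mixed: letters and digits
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
datetime_cols = df.select_dtypes(include="datetime").columns.tolist()
# Everything else is text-like. pandas cannot tell categorical from mixed,
# so that split needs a look at the values themselves.
other_cols = [c for c in df.columns if c not in numeric_cols + datetime_cols]

print(numeric_cols, datetime_cols, other_cols)
```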

Data Imputation Techniques

Moving on to the main highlight of this article – techniques used in data imputation.

Figure: Imputation techniques

Note: I will be focusing only on Mixed, Numerical and Categorical imputation here. Date-Time will be part of the next article.

Complete Case Analysis (CCA)

This is a quite straightforward method of handling missing data: it removes every row that has a missing value, so only the rows with complete data are considered. This method is also popularly known as “Listwise deletion”.

Assumptions:
- Data is Missing Completely At Random (MCAR), i.e. the rows with missing values are not systematically different from the complete ones. Under weaker conditions, deleting rows can bias the dataset.
- Rows with missing data are completely removed from the table.
Advantages:
- Easy to implement.
- No data manipulation required.
Limitations:
- Deleted data can be informative.
- Can lead to the deletion of a large part of the data.
- Can create a bias in the dataset if a large amount of a particular type of observation is deleted from it.
- The production model will not know what to do with missing data.
When to Use:
- Data is Missing Completely At Random (MCAR).
- Good for Mixed, Numerical, and Categorical data.
- Missing data is not more than 5% to 6% of the dataset.
- The rows to be deleted don’t contain much information, and removing them will not bias the dataset.
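The “when to use” guideline above can be turned into a small guard. This is only a sketch, assuming pandas is available: `complete_case_analysis` and its `max_incomplete` parameter are hypothetical names, and the 5% default simply mirrors the rule of thumb in the list.

```python
import numpy as np
import pandas as pd

def complete_case_analysis(df, max_incomplete=0.05):
    """Drop incomplete rows only if they are a small share of the data.

    `max_incomplete` is a hypothetical parameter mirroring the 5% guideline.
    """
    # Fraction of rows that have at least one missing value.
    incomplete_share = df.isnull().any(axis=1).mean()
    if incomplete_share > max_incomplete:
        raise ValueError(
            f"{incomplete_share:.1%} of rows are incomplete; "
            "consider an imputation technique instead of deletion."
        )
    return df.dropna(axis=0)

# Toy data: 1 incomplete row out of 25 (4%).
toy = pd.DataFrame({"a": np.arange(25, dtype=float), "b": ["x"] * 25})
toy.loc[3, "a"] = np.nan
print(complete_case_analysis(toy).shape)  # (24, 2)
```

Raising an error rather than silently dropping rows is a design choice for this sketch: it forces a decision whenever deletion would remove more data than the guideline suggests.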

Code Implementation

## train_df is assumed to be a pandas DataFrame that has already been loaded
## To check the shape of original dataset
train_df.shape

## Output (614 rows & 13 columns)
(614, 13)

## Finding the columns that have Null Values (Missing Data)
## A list comprehension keeps every column whose fraction of null values is greater than 0
na_variables = [var for var in train_df.columns if train_df[var].isnull().mean() > 0]

## Output of column names with null values
['Gender', 'Married', 'Dependents', 'Self_Employed', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']

## We can also see the fraction of Null values present in these columns {Shown in image below}
data_na = train_df[na_variables].isnull().mean()

## Implementing the CCA techniques to remove Missing Data
data_cca = train_df.dropna(axis=0)  ### axis=0 is used for specifying rows

## Verifying the final shape of the remaining dataset
data_cca.shape

## Output (480 rows & 13 columns)
(480, 13)

Output

Figure: Fraction of missing values in each column (data_na)

Here we can see that the dataset initially had 614 rows and 13 columns, of which 7 columns had missing data (na_variables); the fraction of missing rows in each is given by data_na. Apart from Credit_History and Self_Employed, every column has less than 5% missing. Following CCA, we dropped the rows with missing data, which left a dataset of only 480 rows. That is a reduction of around 22%, which can cause many issues going ahead.
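The size of the reduction follows directly from the two shapes reported above. As a quick check in plain Python:

```python
rows_before, rows_after = 614, 480  # shapes reported above

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(rows_dropped)            # 134
print(f"{share_dropped:.1%}")  # 21.8%
```

Although most individual columns have only a small fraction of missing values, the gaps fall on different rows, so the share of rows lost is much larger than any single column's missing rate.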

Arbitrary Value Imputation

This is an important imputation technique because it can handle both numerical and categorical variables. The idea is to group the missing values in a column and assign them a new value that lies far outside the range of that column. Typical choices are values such as 99999999 or -9999999 for numerical variables and “Missing” or “Not defined” for categorical variables.

Assumptions:
- Data is not Missing At Random.
- The missing data is imputed with an arbitrary value that is not part of the dataset and is not the Mean/Median/Mode of the data.
Advantages:
- Easy to implement.
- We can use it in production.
- It retains the importance of “missing values” if it exists.
Disadvantages:
- Can distort original variable distribution.
- Arbitrary values can create outliers.
- Extra caution required in selecting the arbitrary value.
When to Use:
- When data is not MAR (Missing At Random).
- Suitable for all data types.

Code Implementation

## Finding the columns that have Null Values (Missing Data)
## A list comprehension keeps every column whose fraction of null values is greater than 0

na_variables = [var for var in train_df.columns if train_df[var].isnull().mean() > 0]

## Output of column names with null values
['Gender','Married','Dependents','Self_Employed','LoanAmount','Loan_Amount_Term','Credit_History']

## Use Gender column to find the unique values in the column
train_df['Gender'].unique()

## Output
array(['Male','Female',nan])

## Here nan represents Missing Data
## Using Arbitrary Imputation technique, we will impute missing Gender with "Missing"  {You can use any other value also}
arb_impute = train_df['Gender'].fillna('Missing')
arb_impute.unique()

## Output
array(['Male','Female','Missing'])

Output

We can see here that the column Gender had 2 unique values {‘Male’, ‘Female’} and a few missing values {nan}. Using Arbitrary Imputation, we filled the {nan} values in this column with ‘Missing’, giving the variable ‘Gender’ 3 unique values.
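The same idea applies to a numerical column. The sketch below assumes pandas is available and uses invented loan amounts; the sentinel -999999 is only an illustrative choice and should be checked against the real range of the column.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with two gaps.
loan_amount = pd.Series(
    [128.0, np.nan, 66.0, 120.0, np.nan, 141.0], name="LoanAmount"
)

# Choose a sentinel that cannot be mistaken for a real observation;
# -999999 is an illustrative choice, not a recommended constant.
ARBITRARY_VALUE = -999999
assert not (loan_amount == ARBITRARY_VALUE).any(), "sentinel collides with real data"

arb_numeric = loan_amount.fillna(ARBITRARY_VALUE)

print(arb_numeric.tolist())
print(loan_amount.mean(), arb_numeric.mean())  # the mean shifts sharply
```

The shift in the mean shows the disadvantage listed above: the arbitrary value behaves like an outlier and distorts the original distribution.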

Frequent Category Imputation

This technique replaces the missing values with the most frequent value in the column or, in simple words, with the Mode of that column. This technique is also referred to as Mode Imputation.

Assumptions:
- Data is missing at random.
- There is a high probability that the missing data looks like the majority of the data.
Advantages:
- Implementation is easy.
- We can obtain a complete dataset in very little time.
- We can use this technique in the production model.
Disadvantages:
- The higher the percentage of missing values, the higher will be the distortion.
- May lead to over-representation of a particular category.
- Can distort original variable distribution.
When to Use:
- Data is Missing At Random (MAR).
- Missing data is not more than 5% – 6% of the dataset.

Code Implementation

## finding the count of each unique value in Gender
train_df['Gender'].value_counts()

## Output (489 Male & 112 Female)
Male 489
Female 112

## Male has the highest frequency. We can also confirm it by checking the mode
train_df['Gender'].mode()[0]

## Output
'Male'

## Using Frequent Category Imputer
frq_impute = train_df['Gender'].fillna('Male')
frq_impute.unique()

## Output
array(['Male','Female'])

Output

Here we noticed that “Male” was the most frequent category, so we used it to replace the missing data. Now we are left with only 2 categories, i.e. Male & Female.
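For use in a production model, one possible approach is to learn the most frequent category from the training data and reuse it on incoming data. The following is a minimal sketch assuming pandas is available; the two small frames are hypothetical stand-ins for a training set and new records.

```python
import numpy as np
import pandas as pd

# Hypothetical training and incoming data.
train = pd.DataFrame({"Gender": ["Male", "Female", "Male", np.nan, "Male"]})
new_data = pd.DataFrame({"Gender": [np.nan, "Female", np.nan]})

# Learn the most frequent category from the training data only.
most_frequent = train["Gender"].mode()[0]

# Reuse the learned value for both training and incoming data.
train_imputed = train["Gender"].fillna(most_frequent)
new_imputed = new_data["Gender"].fillna(most_frequent)

print(most_frequent)         # Male
print(new_imputed.tolist())  # ['Male', 'Female', 'Male']
```

Taking the mode from the training data, rather than recomputing it on each new batch, keeps the imputed value consistent between training and prediction.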

Thus, we can see that every technique has its advantages and disadvantages, and which one to use depends on the dataset and the situation.

Conclusion

Handling missing data is crucial for maintaining analysis integrity in data science. This article has covered three key imputation methods (Complete Case Analysis (CCA), Arbitrary Value Imputation, and Frequent Category Imputation) and highlighted their assumptions, benefits, and drawbacks. Each method has specific applications and limitations, which is why context matters when choosing the right technique. Applying these techniques carefully helps keep datasets robust and accurate, which in turn supports more reliable models and more meaningful insights from your data.

Key Takeaways

Data imputation is essential for maintaining dataset integrity and preventing biases in analysis.
Missing data is incompatible with many machine learning libraries and can lead to errors and faulty analyses.
Complete Case Analysis (CCA) is straightforward but can result in significant data loss and potential bias.
Arbitrary Value Imputation is versatile but requires careful selection of arbitrary values to avoid distorting the dataset.
Frequent Category Imputation is useful for categorical data but can over-represent certain categories if not used carefully.
Each imputation technique has its own set of advantages and limitations, making the choice context-dependent.
Proper data imputation ensures that AI models are trained on accurate, unbiased, and comprehensive datasets.

Frequently Asked Questions

Q1. What are the different types of single imputation?

A. The different types of single imputation include Mean Imputation, Median Imputation, Mode Imputation, and Arbitrary Value Imputation. Each method replaces missing values with a single, substituted value.

Q2. What is data imputation with mean?

A. Data imputation with mean involves replacing the missing values in a variable with the mean of its available values. This leaves the mean of the variable unchanged, but it reduces the variable’s variance.
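A small sketch of mean imputation, assuming pandas is available and using invented values, shows both effects:

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan])  # hypothetical

# Replace each gap with the mean of the observed values.
mean_imputed = values.fillna(values.mean())

print(values.mean(), mean_imputed.mean())  # 20.0 20.0
print(values.std(), mean_imputed.std())    # spread shrinks after imputation
```

The mean stays the same because every filled cell equals it, while the standard deviation drops because those cells add no spread.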

Q3. When should you impute data?

A. Data should be imputed when missing values are present and removing these values could lead to significant data loss, potential bias, or distortion in the analysis.

Q4. What are the benefits of data imputation?

A. The benefits of data imputation include maintaining dataset integrity, reducing biases, preventing analysis distortion, and ensuring compatibility with machine learning libraries, leading to more accurate and reliable models.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
