Understanding how to use tools like NumPy, pandas, and scikit-learn is essential in data science for building thorough machine learning models. Data cleaning is a vital part of the process, and handling missing data is one of its most common tasks. This article shows how to handle missing values in a real dataset using Python: you will learn how to find, manage, and substitute missing (NaN) values so that your analysis stays robust and accurate.
We will cover techniques such as deleting columns or rows with missing data, filling NaNs with summary statistics, and imputing values with models. Mastering these methods for handling null and NaN values in Python datasets will help you clean up your data and get better results.
Learning Objectives

- Understand why missing values must be handled before training most machine learning models.
- Learn to detect missing values with pandas functions such as info() and isnull().
- Compare strategies for handling missing data: dropping columns or rows, filling with statistics such as the mean or median, and model-based imputation.
It is necessary to fill in missing values in a dataset, as most machine learning models will raise an error if you pass NaN values to them. The easiest way to handle missing data in Python is simply to fill the gaps with 0, but note that this approach can reduce your model accuracy significantly. For example:
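A minimal sketch, using a hypothetical toy DataFrame since we have not loaded the real dataset yet:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Age": [22.0, np.nan, 24.0]})  # toy data for illustration
print(df.fillna(0))  # every NaN is replaced with 0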
Many methods are available for filling missing values. To choose the best one, you need to understand the type of missing value and its significance before you start filling or deleting data.
See that the data contains many columns, like PassengerId, Name, Age, etc. We won't be working with all of them, so let's delete the columns we don't need.
Import the required libraries, numpy and pandas, with import numpy and import pandas. We will then use the pandas read_csv function to read the dataset, as shown below.
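For example, assuming the Titanic training data is saved locally as train.csv (the file name is an assumption; adjust it to wherever your copy lives):

import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # "train.csv" is an assumed local path
print(df.head())               # peek at the first few rows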
# drop the columns we will not be using
df.drop(["Name", "Ticket", "PassengerId", "Cabin", "Embarked"], axis=1, inplace=True)
See that there are also categorical values in the dataset; for these, you need to use label encoding or one-hot encoding.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])  # encode 'male'/'female' as integers

# keep a copy that still contains Survived and the raw Age values for later sections
newdf = df.copy()

# splitting the data into X and y
y = df['Survived']
df.drop("Survived", axis=1, inplace=True)
Missing value treatment in Python: missing values are usually represented as NaN, null, or None in the dataset.
The df.info() function gives an overview of the dataset and is one of the most used functions in data analysis. It lists the column names, the number of non-null values in each column, and each column's data type. From this we can see which columns contain null values, and the data types give us an idea of what value to replace those nulls with.
Sometimes, though, null values are present not as np.nan but as empty strings or other placeholder values, so we must be careful to make sure that all the null values in our dataset are converted to np.nan, for example:
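A minimal sketch of this cleanup step; the list of placeholder strings below is an assumption, so adjust it to whatever markers actually appear in your data:

import numpy as np

# convert common placeholder strings to proper np.nan values
df = df.replace(["", "NA", "null", "None"], np.nan)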
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null int64
2 Age 714 non-null float64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
dtypes: float64(2), int64(4)
memory usage: 41.9 KB
See that there are null values in the Age column.
The second way of finding out whether we have null values in the data is by using the isnull() function.
print(df.isnull().sum())
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
dtype: int64
See that all the null values in the dataset are in the Age column.
Let’s try fitting the data using logistic regression.
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(df,y,test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
---------------------------------------------------------------------------
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
See that the logistic regression model does not work, as we have NaN values in the dataset. Only a few machine learning algorithms can work with missing data directly; for example, some k-NN and gradient-boosting implementations handle NaN values natively.
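As an illustration, scikit-learn's HistGradientBoostingClassifier (available in recent scikit-learn versions) supports NaN inputs natively. A minimal sketch, assuming the same X_train/X_test split as above:

from sklearn.ensemble import HistGradientBoostingClassifier

# this estimator handles NaN inputs natively, so no imputation is required
hgb = HistGradientBoostingClassifier()
hgb.fit(X_train, y_train)         # works even though Age contains NaNs
print(hgb.score(X_test, y_test))  # mean accuracy on the test split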
Let’s now look at the different methods that you can use to deal with the missing data.
In this case, let's delete the Age column, then fit the model and check the accuracy.
But this is an extreme case and should only be used when there are many null values in the column.
updated_df = df.dropna(axis=1)
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null int64
2 SibSp 891 non-null int64
3 Parch 891 non-null int64
4 Fare 891 non-null float64
dtypes: float64(1), int64(4)
memory usage: 34.9 KB
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(updated_df,y,test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred,y_test))
0.7947761194029851
See that we can achieve an accuracy of about 79.5%.
The problem with this method is that we may lose valuable information on that feature, as we have deleted it completely due to some null values.
It should only be used if there are too many null values.
If a particular row has missing data, you can instead delete that entire row. In dropna(), axis=1 drops the columns that contain NaN values, while axis=0 drops the rows that contain NaN values.
updated_df = newdf.dropna(axis=0)
y1 = updated_df['Survived']
updated_df.drop("Survived",axis=1,inplace=True)
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 714 non-null int64
1 Sex 714 non-null int64
2 Age 714 non-null float64
3 SibSp 714 non-null int64
4 Parch 714 non-null int64
5 Fare 714 non-null float64
dtypes: float64(2), int64(4)
memory usage: 39.0 KB
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(updated_df,y1,test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred,y_test))
0.8232558139534883
In this case, see that we achieve better accuracy than before. This may be because the Age column carries more valuable information than we expected, so keeping it (and dropping only the incomplete rows) helps the model.
Another option is to fill the missing values with a specific value, such as the column's mean, median, or mode. You can use the fillna() function to fill the null values in the dataset.
updated_df = newdf.copy()  # fresh copy that still contains Survived and the unfilled Age values
updated_df['Age'] = updated_df['Age'].fillna(updated_df['Age'].mean())
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null int64
3 Age 891 non-null float64
4 SibSp 891 non-null int64
5 Parch 891 non-null int64
6 Fare 891 non-null float64
dtypes: float64(2), int64(5)
memory usage: 48.9 KB
y1 = updated_df['Survived']
updated_df.drop("Survived",axis=1,inplace=True)
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(updated_df,y1,test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred,y_test))
0.7798507462686567
The accuracy value comes out to about 78%, a reduction compared with the previous cases. This will not happen in general; here it suggests that the column mean did not fill the null values well.
Just like fillna, there is another function called interpolate; by default it uses linear interpolation, estimating each unknown value from the known data points around it. We can also use the bfill function, which backfills each unknown value with the value from the next row, as the sketch below shows.
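A minimal sketch of both functions on a toy Series (hypothetical data, for illustration only):

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan])

print(s.interpolate())  # linear interpolation -> 1.0, 2.0, 3.0, 3.0
print(s.bfill())        # backward fill        -> 1.0, 3.0, 3.0, NaN (no later value to copy)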
You can also use the SimpleImputer() class from the sklearn module to impute the values. Pass the strategy as an argument; it can be 'mean', 'median', or 'most_frequent' (the mode), as in the sketch below.
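A short sketch of the three strategies; note that scikit-learn spells the mode strategy 'most_frequent':

from sklearn.impute import SimpleImputer

mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')
mode_imputer = SimpleImputer(strategy='most_frequent')  # scikit-learn's name for mode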
The problem with the previous model is that it cannot tell whether a value came from the original data or was imputed. To give the model this information, we add an Ageismissing column, which is True where Age was null and False where it was not.
updated_df = df.copy()
updated_df['Ageismissing'] = updated_df['Age'].isnull()  # flag rows where Age was missing
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer(strategy='median')
# fit_transform returns a NumPy array, so write the imputed column back into the DataFrame
updated_df['Age'] = my_imputer.fit_transform(updated_df[['Age']]).ravel()
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null int64
2 Age 891 non-null float64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
6 Ageismissing 891 non-null bool
dtypes: bool(1), float64(2), int64(4)
memory usage: 42.8 KB
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(updated_df,y1,test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred,y_test))
0.7649253731343284
In this approach, the null values in one column are filled by fitting a regression model on the other columns in the dataset. That is, the linear regression model here uses every column except Age as the features (X) and Age as the target (y). After filling in the Age column, we will again use logistic regression to calculate accuracy.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

# use newdf, which still contains Survived and the unfilled Age values
testdf = newdf[newdf['Age'].isnull()].copy()
traindf = newdf[newdf['Age'].notnull()].copy()
y = traindf['Age']
traindf.drop("Age",axis=1,inplace=True)
lr.fit(traindf,y)
testdf.drop("Age",axis=1,inplace=True)
pred = lr.predict(testdf)
testdf['Age']= pred
traindf['Age']=y
y = traindf['Survived']
traindf.drop("Survived",axis=1,inplace=True)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(traindf,y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
y_test = testdf['Survived']
testdf.drop("Survived",axis=1,inplace=True)
pred = lr.predict(testdf)
print(metrics.accuracy_score(pred,y_test))
0.8361581920903954
See that this model produces higher accuracy than the previous ones, as we used a dedicated regression model to fill in the missing values.
We can also use models such as k-NN for filling in the missing values. But sometimes, using models for imputation can result in overfitting the data.
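A minimal sketch using scikit-learn's KNNImputer (available since scikit-learn 0.22), which fills each NaN from the feature values of the most similar rows:

import pandas as pd
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5)      # each NaN is estimated from the 5 nearest rows
imputed = knn_imputer.fit_transform(df)      # returns a NumPy array
df_knn = pd.DataFrame(imputed, columns=df.columns)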
Imputing missing values using the regression model allowed us to improve our model compared to dropping those columns.
But you have to understand that there is no perfect way of filling the missing values in a dataset.
Each of the methods may work well with different types of datasets. You have to experiment with different techniques to check which approach works best for handling missing data in Python within your dataset. Understanding why data are missing is crucial for appropriately managing the remaining data. If values are missing completely at random, the data sample is likely still representative of the population. However, if the values are missing systematically, the analysis may be biased, emphasizing the importance of practical techniques for addressing missing data in Python.
We hope this article cleared up your doubts about how to handle missing values in Python and how to fill missing values in a dataset. We walked through five approaches for dealing with missing values, including how to handle NaN values in Python.
Q. What is the best way to fill missing values in pandas?
A. There is no single “best” way to fill missing values in pandas per se; however, fillna() is the most widely used function for filling NaN values in a DataFrame. With it, you can fill a column with its mean, median, or mode, as sketched below.
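For example, on a hypothetical numeric column named 'col':

df['col'] = df['col'].fillna(df['col'].mean())     # mean
df['col'] = df['col'].fillna(df['col'].median())   # median
df['col'] = df['col'].fillna(df['col'].mode()[0])  # mode() returns a Series, so take its first value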
Q. Why do we need to handle missing values?
A. Missing values can bias the results of machine learning models and reduce their accuracy, so they must be handled correctly before the data is used for training.
Q. Which pandas functions can be used to handle missing values?
A. Pandas has several functions for handling missing values, including fillna, bfill, and interpolate.
In NumPy:

- Find missing values: np.isnan(arr) returns True for missing (NaN) values.
- Drop them: NumPy arrays have no dropna() method; instead, use boolean masking, e.g. arr[~np.isnan(arr)], to keep only the non-NaN entries.
- Impute them: fill with a constant via arr[mask] = constant_value (e.g. the mean of the observed values).
- Interpolation (for sequential data): np.interp for 1-D arrays, scipy.interpolate for more complex cases.

Consider pandas for more advanced missing-data handling. A short sketch of these operations follows.
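A minimal NumPy sketch on a toy array:

import numpy as np

arr = np.array([1.0, np.nan, 3.0, np.nan, 5.0])  # toy data

mask = np.isnan(arr)            # True where values are missing
dropped = arr[~mask]            # "drop": keep only the non-NaN entries

filled = arr.copy()
filled[mask] = np.nanmean(arr)  # impute with the mean of the observed values (3.0)

# 1-D linear interpolation at the missing positions
x = np.arange(arr.size)
interp = arr.copy()
interp[mask] = np.interp(x[mask], x[~mask], arr[~mask])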