This article was published as a part of the Data Science Blogathon.
Missing data in machine learning refers to entries in a dataset that are recorded as “None” or “NaN”. These entries need to be handled before most machine learning algorithms can be trained. Missing values can be filled in with basic Python programming, with the pandas library, or with SimpleImputer, a class from the scikit-learn library. Of these options, SimpleImputer is often the easiest and most convenient.
In simple words, SimpleImputer is a scikit-learn class used to fill in the missing values in a dataset. As the name suggests, the class performs simple imputations: it replaces the missing data with another value based on a given strategy.
The basic syntax or structure of a SimpleImputer initialization is:
SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
missing_values: The placeholder that marks a missing value in the dataset. The default is np.nan, which means every np.nan entry is treated as a missing value.
strategy: The method used to fill in the missing values. It can be “mean”, “median”, “most_frequent”, or “constant”; the default is “mean”.
fill_value: Used only when the strategy is “constant”. Every missing value is replaced by the value passed in fill_value. If it is left as None, 0 is used for numerical data and “missing_value” for strings or object data.
verbose: Controls the verbosity of the SimpleImputer. The default is 0. Note that this parameter was deprecated in scikit-learn 1.1 and removed in 1.3, so newer versions no longer accept it.
copy: If True, a copy of the dataset is created before imputing. If False, imputation is done in place whenever possible.
add_indicator: If True, a MissingIndicator transform is stacked onto the output of the imputer’s transform, adding columns that flag which values were missing.
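To make the parameters above more concrete, here is a minimal sketch that combines missing_values, strategy, and add_indicator. The toy array and the choice of -1 as the missing-value marker are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data in which -1 (not NaN) marks a missing entry; this encoding is an assumption.
X = np.array([[1.0, -1.0],
              [3.0, 4.0],
              [-1.0, 6.0]])

# Treat -1 as missing, fill with the column mean, and append indicator columns.
imputer = SimpleImputer(missing_values=-1, strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

print(imputer.statistics_)  # per-column fill values learned during fit
print(X_out)                # two imputed columns followed by two indicator columns
```

The first two columns of the result hold the imputed data, and the last two hold 1 wherever the corresponding original value was missing.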
To start with SimpleImputer, we must first install scikit-learn and then import the class from it.
To install scikit-learn, use the command below:
pip install scikit-learn
Once the library is installed on the machine, it should be imported in the Python environment you are using. Use the code below to import it:
# importing sklearn
import sklearn
# importing numpy, which provides np.nan
import numpy as np
# importing SimpleImputer
from sklearn.impute import SimpleImputer
Using the strategy “mean” in SimpleImputer allows us to replace each missing value with the mean of the column it belongs to. This strategy can only be used on numerical data.
Let’s suppose we have a numerical column named “age” in our dataset in which some of the values are missing. Using the mean strategy will fill in the missing values in that column with the mean of all the available age values.
Code:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# SimpleImputer expects 2D input, so the column is selected with double brackets
df['age'] = imputer.fit_transform(df[['age']]).ravel()
Example:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
age = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(age))
The output of this code is:
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
Outliers should be considered when working with the mean strategy. The strategy computes the mean of the available values and uses it to fill in the missing ones, but a single outlier can pull the mean to one side. The imputed value is then biased and no longer represents a typical entry in the column.
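The effect of an outlier can be seen in a small sketch. The ages below are made up for illustration; one of them (300) is deliberately extreme so that the mean and median fills can be compared.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages with one extreme value (300) and one missing entry at the end.
ages = np.array([[22.0], [25.0], [27.0], [30.0], [300.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(ages)
median_filled = SimpleImputer(strategy="median").fit_transform(ages)

print(mean_filled[-1, 0])    # about 80.8, pulled upward by the outlier
print(median_filled[-1, 0])  # 27.0, unaffected by the outlier
```

In this toy case the mean fill is larger than every ordinary age in the column, while the median fill stays among the typical values.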
Using the strategy “median” in SimpleImputer allows us to replace each missing value with the median of the column it belongs to. This strategy can only be used on numerical data.
Let’s suppose the column “age” in our dataset has missing values. Using the median strategy will fill them in with the median of the available values in that column.
Code:
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
# SimpleImputer expects 2D input, so the column is selected with double brackets
df['age'] = imputer.fit_transform(df[['age']]).ravel()
Example:
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imp_median.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
age = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_median.transform(age))
The output of this code is:
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
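For a self-contained version of the column-based usage, the sketch below builds a small DataFrame and imputes its age column with the median. The DataFrame, its column names, and its values are hypothetical and exist only to make the snippet runnable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical DataFrame; the column names and values are made up for illustration.
df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "age": [20.0, np.nan, 40.0, 30.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
# Double brackets select a 2D, one-column DataFrame, which is what the imputer expects.
# ravel() flattens the (n, 1) result back to one value per row.
df["age"] = imputer.fit_transform(df[["age"]]).ravel()

print(df["age"].tolist())  # [20.0, 30.0, 40.0, 30.0]
```

Passing df["age"] with single brackets would hand the imputer a 1D Series, which scikit-learn rejects, so the double brackets matter here.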
Most frequent imputation is a technique typically used for handling missing categorical data, although the strategy also works on numerical columns.
Applying it to a categorical column fills the missing values with the value that occurs most often in that column.
Code:
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# SimpleImputer expects 2D input, so the column is selected with double brackets
df['category'] = imputer.fit_transform(df[['category']]).ravel()
Example:
imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mf.fit([['one', 'two', 'three'], ['four', np.nan, 'six'], ['two', 'five', 'two']])
category = [[np.nan, 'two', 'two'], ['four', np.nan, 'six'], ['ten', np.nan, 'nine']]
print(imp_mf.transform(category))
Every value in the fitted columns occurs exactly once, and when several values are tied scikit-learn fills with the smallest one (for strings, the first in alphabetical order). The output of this code is therefore:
[['four' 'two' 'two']
 ['four' 'five' 'six']
 ['ten' 'five' 'nine']]
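A column with one clearly dominant value makes the most frequent strategy easier to follow. The sketch below uses a made-up object array in which the first column has a clear winner and the second column has a tie; the values are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical data: column 0 has a clear mode ("Pune"),
# column 1 has a tie between "S" and "L".
X = np.array([["Pune", "S"],
              ["Delhi", "L"],
              ["Pune", np.nan],
              [np.nan, np.nan]], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(X)

print(imputer.statistics_)  # the fill value chosen for each column
print(filled)
```

The first column should be filled with "Pune". For the tied second column, the fill should be "L", since ties resolve to the smallest value.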
Constant imputation is a SimpleImputer technique that fills the missing values with any value we choose. It can be used on both string and numerical data.
The desired value is passed to the fill_value parameter, and every missing value in the dataset is then replaced by it.
Code:
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=20)
# SimpleImputer expects 2D input, so the column is selected with double brackets
df['age'] = imputer.fit_transform(df[['age']]).ravel()
Example:
imp_constant = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=20)
imp_constant.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
age = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_constant.transform(age))
The output of this code is:
[[20.  2.  3.]
 [ 4. 20.  6.]
 [10. 20.  9.]]
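Since the constant strategy is also meant to work on strings, a string-based sketch may help. The color values and the "unknown" placeholder below are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column stored as an object array.
colors = np.array([["red"], [np.nan], ["blue"], [np.nan]], dtype=object)

# Replace every missing entry with the placeholder string "unknown".
imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="unknown")
filled = imputer.fit_transform(colors)

print(filled.ravel().tolist())  # ['red', 'unknown', 'blue', 'unknown']
```

A dedicated placeholder such as this keeps the missing entries distinguishable from the real categories, which can be preferable to filling with an existing category.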
In this article, the handling of missing data with the class SimpleImputer is discussed in detail. A total of 4 strategies (mean, median, most_frequent, and constant) can be used to fill in the missing values, and each is covered in the code examples above.
Some key takeaways from this article are:
1. We should consider outliers while working with the mean strategy, as they can distort the imputed values and may result in a less accurate model with unexpected behavior (avoid using the mean strategy when outliers are present).
2. The mean and median strategies can only be used on numerical data, while the most frequent strategy is typically used on categorical data but also works on numerical data. They are among the easiest and least computationally expensive methods.
3. The constant strategy can be used when we have a good understanding of the dataset and already know the impact of replacing the missing values with our chosen number or string. It can be used on both string and numerical data.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.