Machine learning models follow the garbage-in, garbage-out principle, so it is essential to address any missing data before feeding your dataset to your model.
Missing data in your dataset could be due to multiple reasons, such as:
1) The data was not available.
2) The information was not recorded due to a data entry error, and so on.
It is essential to handle them appropriately instead of completely ignoring them, as they might represent crucial information and affect your model’s performance.
We will look into different imputation techniques from basic to advanced as part of this article. Let’s begin!
Tree-based models like LightGBM and XGBoost can work with NA values out of the box; you could try a baseline model with your missing data and check the performance metric.
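As a quick illustration (a minimal sketch rather than the competition code), you could fit such a baseline directly on data containing NaNs; here train, FEATURES, and TARGET are assumed placeholders for your DataFrame, feature list, and target column.

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the raw training data, leaving NaNs untouched
X_tr, X_val, y_tr, y_val = train_test_split(
    train[FEATURES], train[TARGET], test_size=0.2, random_state=42
)

# LightGBM handles missing values natively
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_tr, y_tr)

preds = model.predict(X_val)
print("Baseline RMSE:", np.sqrt(mean_squared_error(y_val, preds)))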
Another approach would be to drop the rows with missing values; you can do this when the records with missing values are a small percentage of your entire dataset. Otherwise, you would be throwing away too much information.
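For completeness, a minimal sketch of dropping such rows with pandas, assuming train is your training DataFrame; the column name in the commented-out alternative is only illustrative.

rows_before = len(train)

# Drop rows containing any missing value
train_dropped = train.dropna()

# Or restrict the check to specific columns
# train_dropped = train.dropna(subset=["energy_star_rating"])

print(f"Dropped {rows_before - len(train_dropped)} rows")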
During the EDA stage, try to visualize the percentage of missing data and make an informed choice.
The plot of missing data is a part of the work I am currently doing in the WIDS competition on Kaggle.
import pandas as pd
import matplotlib.pyplot as plt

# Percentage of missing values per feature in train and test
ncounts = pd.DataFrame([train.isna().mean(), test.isna().mean()]).T
ncounts = ncounts.rename(columns={0: "train_missing", 1: "test_missing"})

ncounts.query("train_missing > 0").plot(
    kind="barh", figsize=(8, 5), title="% of Values Missing"
)
plt.show()
ncounts.query("train_missing > 0")
As you can see, for some features, almost 50% of the records in the train data and 80% of the records in the test data are missing. In such cases, you may want to remove the feature altogether, as it may not provide any meaningful contribution to the predicted target.
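If you do decide to drop such features, a hedged sketch reusing the ncounts table from above could look like this; the 50% cutoff is only illustrative and should be tuned to your dataset.

DROP_THRESHOLD = 0.5  # illustrative cutoff

# Columns whose missing fraction in train exceeds the threshold
cols_to_drop = ncounts.query("train_missing > @DROP_THRESHOLD").index.tolist()

train = train.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop, errors="ignore")  # ignore columns absent from test
print("Dropped columns:", cols_to_drop)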
You can also visualize the number of missing features per record as below.
tt["n_missing"] = tt[nacols].isna().sum(axis=1) train["n_missing"] = train[nacols].isna().sum(axis=1) test["n_missing"] = test[nacols].isna().sum(axis=1) tt["n_missing"].value_counts().plot( kind="bar", title="Number of Missing Values per Sample" )
We can see that we have almost 30,000 records with four missing features and some 824 records with six missing features. Such an analysis helps you make an informed choice about which imputation technique to use, or whether to use one at all.
You can impute the missing data by replacing it with a constant value. In our case, we have some missing data in the column year_built; since the competition is running in the current year, i.e. 2022, a reasonable choice for replacing the missing values of this feature would be 2022. The same can be done in Python as below.
import numpy as np

train['year_built'] = train['year_built'].replace(np.nan, 2022)
test['year_built'] = test['year_built'].replace(np.nan, 2022)
Another approach would be to replace the missing values with the mean of the non-empty records in that feature. As the mean is susceptible to outliers, you can also use the median or mode as the replacement strategy.
Code examples are shown below.
# Mean imputation
test['energy_star_rating'] = test['energy_star_rating'].replace(np.nan, test['energy_star_rating'].mean())

# Alternatively, median imputation
test['energy_star_rating'] = test['energy_star_rating'].replace(np.nan, test['energy_star_rating'].median())
Another interesting strategy is imputing a feature's missing values based on another feature.
For example, suppose you want to fill in the missing values for max wind speed based on the building class that the record belongs to. You would first group by the feature building class, calculate the mean of max wind speed within each group, and then impute the missing values in max wind speed according to each record's building class, as shown in the sketch below.
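A minimal sketch of such group-wise imputation, assuming the hypothetical column names max_wind_speed and building_class from the example:

# Mean of max_wind_speed within each building_class group, broadcast back to the original row order
group_means = train.groupby("building_class")["max_wind_speed"].transform("mean")

# Fill missing max_wind_speed values with the mean of the record's group
train["max_wind_speed"] = train["max_wind_speed"].fillna(group_means)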
Sklearn provides a similar strategy to what we discussed above, imputing missing values with a constant or an average value.
It is always good to know alternate methods to perform the same task, which is why we look at the below code with SimpleImputer in action.
from sklearn.impute import SimpleImputer

imptr = SimpleImputer(strategy="mean")
tr_imp = imptr.fit_transform(train[FEATURES])
test_imp = imptr.transform(test[FEATURES])
Based on your use case, you can set the strategy parameter to "mean", "median", "most_frequent" (i.e. the mode), or "constant".
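As a quick reference, the other strategies can be configured as below; the fill_value of 0 is only illustrative.

from sklearn.impute import SimpleImputer

median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")  # mode of each column
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)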
Under this topic, we will look at the below imputation techniques: the Iterative Imputer, the KNN Imputer, and the LGBM Imputer.
Under the hood, the Iterative Imputer imputes missing values by modelling each feature with missing values as a function of the other features in a round-robin fashion.
You can also think of it simply as follows: the feature with missing values is treated as the target, and the remaining features are used to predict its missing values.
By default, it uses the Bayesian Ridge algorithm internally.
Let’s see the same in action with the below python code.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

it_imputer = IterativeImputer(max_iter=10)

train_iterimp = it_imputer.fit_transform(X[FEATURES])
test_iterimp = it_imputer.transform(X_test[FEATURES])

# Create train/test imputed dataframes
X_df = pd.DataFrame(train_iterimp, columns=FEATURES)
X_test_df = pd.DataFrame(test_iterimp, columns=FEATURES)
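If you want to experiment beyond the default, IterativeImputer also exposes an estimator parameter; a sketch swapping in a RandomForestRegressor (chosen purely as an example) would look like this.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Replace the default BayesianRidge estimator with a random forest
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
)
train_rfimp = rf_imputer.fit_transform(X[FEATURES])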
The KNN Imputer works on the same principle as the k-nearest neighbours algorithm; it uses KNN to impute missing values, and two records are considered neighbours if the features that are not missing are close to each other.
Logically, it does make sense to impute values based on the nearest neighbours. You can give it a try and check whether your cross-validation score improves on your dataset.
Below is the code to get started with the KNN imputer
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)
The n_neighbors parameter specifies the number of neighbours to be considered for imputation.
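Since fit_transform returns a NumPy array, a small follow-up sketch for getting a DataFrame back, assuming X holds the FEATURES columns:

import pandas as pd

# Wrap the imputed array back into a DataFrame with the original column names
X_knn = pd.DataFrame(imputer.fit_transform(X[FEATURES]), columns=FEATURES)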
The LGBM Imputer uses LightGBM to impute missing values in features; you can refer to the entire implementation of the imputer by Hiroshi Yoshihara here.
!git clone https://github.com/analokmaus/kuma_utils.git

import sys
sys.path.append("kuma_utils/")
from kuma_utils.preprocessing.imputer import LGBMImputer

lgbm_imtr = LGBMImputer(n_iter=100, verbose=True)

train_lgbmimp = lgbm_imtr.fit_transform(train[FEATURES])
test_lgbmimp = lgbm_imtr.transform(test[FEATURES])

tt_lgbmimp = lgbm_imtr.fit_transform(tt[FEATURES])
tt_imp = pd.DataFrame(tt_lgbmimp, columns=FEATURES)

# Create LGBM train/test imputed dataframe
lgbm_imp_df = pd.DataFrame(tt_imp, columns=FEATURES)
Code reference: here
We have discussed multiple techniques for imputing missing values as part of this article, and I hope you have learned something new from it. There is no one-size-fits-all mechanism for imputation; you may have to try different approaches and see which works best for your cross-validation score. A general guideline is to start with a baseline model with mean imputation and build up from there.
If you have any questions or feedback, you can share them in the comments below. I write these articles to improve my understanding of applied machine learning. You can connect with me on LinkedIn or read about me here. I hope you liked my article on imputation techniques; share your thoughts in the comments below.