Classification problems are common in machine learning. In a classification problem, we predict a class label from the input data or predictors, where the target or output variable is categorical. If you have dealt with classification problems before, you have probably encountered datasets in which one target class label occurs far less often than the others. This type of dataset is called an imbalanced class dataset, often referred to as data imbalance, and it is common in practical classification scenarios. The usual approach to such a machine-learning problem often yields misleading results.
In this article, you will learn how to handle imbalanced datasets effectively: why they are a problem for prediction, and how to deal with such data more efficiently than with the traditional approach.
Imbalanced data refers to datasets where the target classes have an uneven distribution of observations: one class label has a very high number of observations, while the other has very few.
We can understand imbalanced data better with an example.
Let’s assume that XYZ is a bank that issues credit cards to its customers. The bank is concerned that fraudulent transactions are taking place, and when it checks its data, it finds that for every 2,000 transactions, only 30 frauds are recorded. So fraud occurs in under 2% of transactions (30/2,000 = 1.5%), and more than 98% of transactions are “No Fraud.” Here, the class “No Fraud” is called the majority class, and the much smaller “Fraud” class is called the minority class.
More such examples of imbalanced datasets include credit-card fraud detection, rare-disease diagnosis, and spam filtering.
Some degree of class imbalance is normal in classification problems. But in some cases, the imbalance is quite acute, and the majority class dwarfs the minority class.
Put simply, the main problem with imbalanced dataset prediction is how accurately we predict both the majority and the minority class. Consider a disease-diagnosis example: in a dataset where only five out of every 100 patients are diagnosed with the disease, the majority class (“no disease”) makes up 95% of the records and the minority class (“disease”) only 5%. Now assume our model predicts that all 100 out of 100 patients have no disease.
When the records of one class far outnumber those of another, a classifier may become biased toward the dominant class. The confusion matrix for a classification problem shows how well the model classifies each target class, and the model’s accuracy is derived from it: the total number of correct predictions divided by the total number of predictions. In the case above, that is (0 + 95) / (0 + 95 + 0 + 5) = 0.95, or 95%. The model fails to identify a single minority-class patient, yet its accuracy score is 95%.
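To make this concrete, here is a minimal sketch with made-up labels mirroring the 95/5 split above; the naive “always predict no disease” model still scores 95% accuracy:
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
# hypothetical labels mirroring the example: 95 healthy (0), 5 diseased (1)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a model that always predicts "no disease"
print(confusion_matrix(y_true, y_pred))  # [[95  0]
                                         #  [ 5  0]]
print(accuracy_score(y_true, y_pred))    # 0.95, yet every diseased patient is missed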
Thus, our traditional approach to classifying and calculating model accuracy is ineffective in the case of an imbalanced dataset.
An imbalanced dataset is a problem because it can lead to biased models and inaccurate predictions.
In use cases like fraud detection or disease prediction, it is vital to identify the minority class correctly. So the model should not be biased toward the majority class; it should give equal weight or importance to the minority class, too. Below, I discuss some techniques for handling the imbalanced dataset problem. There is no single right or wrong method; different techniques work well for different problems.
The first technique for handling imbalanced data is choosing a proper evaluation metric. The accuracy of a classifier is the total number of correct predictions divided by the total number of predictions. This may be good enough for a well-balanced dataset but not for an imbalanced class problem. Other metrics are more informative: precision measures how accurate the classifier’s predictions for a specific class are, and recall measures the classifier’s ability to identify instances of a class.
For an imbalanced class dataset, the F1 score is a more appropriate metric. It is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
So, if the classifier predicts the minority class but the predictions are erroneous and false positives increase, precision will be low, and so will the F1 score. And if the classifier identifies the minority class poorly, i.e., more of this class is wrongly predicted as the majority class, false negatives will increase, so recall and the F1 score will be low. The F1 score only increases if both the number and the quality of predictions improve.
The F1 score thus keeps the balance between precision and recall and improves only if the classifier identifies more of a given class correctly.
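Continuing the hypothetical disease example, these metrics can be computed directly with sklearn (zero_division=0 is passed because the naive model makes no positive predictions at all):
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # naive "always majority" predictions
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0, despite 95% accuracy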
The second technique for handling imbalanced data is resampling: upsampling the minority class or downsampling the majority class. With an imbalanced dataset, we can oversample the minority class by sampling with replacement; this is called oversampling. Similarly, we can randomly delete rows from the majority class to match the minority class; this is called undersampling. After resampling, we get a balanced dataset with a similar number of records in both classes, so we can expect the classifier to give equal importance to each.
An example of this technique using the sklearn library is shown below for illustration. Here, Is_Lead is our target variable. Let’s look at the distribution of the classes in the target.
It has been observed that our target class is imbalanced. So, we’ll upsample the data so that the minority class matches the majority class.
import pandas as pd
from sklearn.utils import resample
# create two separate dataframes for the majority and minority class
df_majority = df_train[df_train['Is_Lead'] == 0]
df_minority = df_train[df_train['Is_Lead'] == 1]
# upsample the minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,      # sample with replacement
                                 n_samples=131177,  # to match the majority class count
                                 random_state=42)   # reproducible results
# combine the majority class with the upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
After upsampling, the distribution of class is balanced as below –
sklearn.utils.resample can be used both to undersample the majority class and to oversample the minority class, as shown in the sketch below.
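For the undersampling direction, here is a hedged sketch with the same utility, reusing df_majority and df_minority from above (replace=False since we draw without replacement from the larger class):
import pandas as pd
from sklearn.utils import resample
# downsample the majority class to the size of the minority class
df_majority_downsampled = resample(df_majority,
                                   replace=False,               # sample without replacement
                                   n_samples=len(df_minority),  # to match the minority class
                                   random_state=42)             # reproducible results
# combine the downsampled majority class with the minority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])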
The third technique for handling imbalanced data is the Synthetic Minority Oversampling Technique (SMOTE), another way to oversample the minority class. Simply duplicating minority-class records often doesn’t add any new information to the model. In SMOTE, new instances are synthesized from the existing data: SMOTE looks at minority class instances, uses k-nearest neighbors to select a random nearby neighbor, and creates a synthetic instance at a random point between them in feature space.
A code sample is shown below:
from imblearn.over_sampling import SMOTE
# resample the minority class; the strategy can be changed as required
sm = SMOTE(sampling_strategy='minority', random_state=42)
# fit the sampler and generate the resampled data
# (fit_resample replaces the deprecated fit_sample in current imblearn versions)
oversampled_X, oversampled_Y = sm.fit_resample(df_train.drop('Is_Lead', axis=1), df_train['Is_Lead'])
oversampled = pd.concat([pd.DataFrame(oversampled_Y), pd.DataFrame(oversampled_X)], axis=1)
Now the classes are balanced, as shown below.
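A quick way to confirm the balance (assuming the Is_Lead column name carried through from the original dataframe) is a simple count:
print(oversampled['Is_Lead'].value_counts())  # both classes should now have equal counts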
When we use an ordinary classifier on an imbalanced dataset, the model favors the majority class due to its larger presence in the data. A BalancedBaggingClassifier behaves like a regular sklearn bagging classifier but with additional balancing: it includes an extra step that balances the training set at fit time using a given sampler. This classifier takes two notable parameters, “sampling_strategy” and “replacement”. The sampling_strategy decides the type of resampling required (e.g., ‘majority’ resamples only the majority class, ‘all’ resamples all classes), and replacement decides whether the sampling is done with replacement or not.
An illustrative example is given below:
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# create an instance
# (note: base_estimator has been renamed to estimator in newer imblearn versions)
classifier = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                       sampling_strategy='not majority',
                                       replacement=False,
                                       random_state=42)
classifier.fit(X_train, y_train)
preds = classifier.predict(X_test)
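To see whether the balanced bagging actually helped, it is worth scoring preds with the imbalance-aware metrics from earlier rather than plain accuracy; a minimal sketch:
from sklearn.metrics import classification_report, f1_score
print(f1_score(y_test, preds))               # single imbalance-aware summary score
print(classification_report(y_test, preds))  # per-class precision, recall, and F1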
Many classifiers actually predict the probability of class membership. We then assign each prediction to a class based on a threshold, usually 0.5: if the probability is below 0.5, the observation belongs to one class; otherwise, it belongs to the other.
For imbalanced class problems, this default threshold may not work well. We need to move the threshold to the optimum value so that it separates the two classes efficiently. We can use ROC curves and precision-recall curves to find the optimal threshold (a ROC-based sketch follows the grid-search example below), or search over a set of candidate values.
In this method, we first find the probabilities for the class label, then we find the optimum threshold to map the probabilities to the proper class label. The predicted probabilities can be obtained from a classifier using the predict_proba() method from sklearn.
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_model.predict_proba(X_test)  # probability of each class label
Output:
array([[0.97, 0.03],
[0.94, 0.06],
[0.78, 0.22],
...,
[0.95, 0.05],
[0.11, 0.89],
[0.72, 0.28]])
After getting the probability we can check for the optimum value.
from sklearn.metrics import roc_auc_score

step_factor = 0.05
threshold_value = 0.2
roc_score = 0
predicted_proba = rf_model.predict_proba(X_test)  # probability of prediction
while threshold_value <= 0.8:  # check candidate thresholds up to 0.8
    temp_thresh = threshold_value
    # move the class boundary for prediction
    predicted = (predicted_proba[:, 1] >= temp_thresh).astype('int')
    print('Threshold', temp_thresh, '--', roc_auc_score(y_test, predicted))
    if roc_score < roc_auc_score(y_test, predicted):  # keep the best threshold
        roc_score = roc_auc_score(y_test, predicted)
        thrsh_score = threshold_value
    threshold_value = threshold_value + step_factor
print('---Optimum Threshold ---', thrsh_score, '--ROC--', roc_score)
Output:
Here, we get the optimal threshold at 0.3 instead of the default 0.5.
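Alternatively, a similar optimum can be read directly off a ROC curve; a minimal sketch (reusing rf_model, X_test, and y_test from above) that picks the threshold maximizing Youden’s J statistic (TPR minus FPR):
import numpy as np
from sklearn.metrics import roc_curve
probs = rf_model.predict_proba(X_test)[:, 1]       # positive-class probabilities
fpr, tpr, thresholds = roc_curve(y_test, probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]  # maximize Youden's J = TPR - FPR
print(best_threshold)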
Dealing with data imbalance in classification problems poses significant challenges that traditional approaches often fail to address. The skewed class distribution can lead to biased models, inaccurate predictions, and poor generalization to new data. Moreover, the misleading nature of traditional evaluation metrics like accuracy exacerbates these issues, making it crucial to adopt alternative metrics such as precision, recall, F1 score, or AUC-ROC.
To overcome these challenges, various techniques can be employed: proper selection of evaluation metrics, resampling methods like oversampling and undersampling, synthetic-sample generation such as SMOTE, ensemble methods like BalancedBaggingClassifier, and adjustment of the classification threshold. Each technique offers unique advantages, and the best choice depends on the specific characteristics of the dataset and the problem at hand.
By understanding the complexities of imbalanced datasets and implementing appropriate strategies for handling them, machine learning practitioners can improve the performance and reliability of their models. This will ultimately lead to more accurate predictions and better decision-making in real-world applications.
We hope you found this article on handling imbalanced datasets for classification useful. Techniques for dealing with imbalanced data include resampling methods, cost-sensitive learning, and algorithm-level adjustments, and choosing among them carefully is crucial for effective classification on imbalanced data.
For those looking to enhance their analytics skills and dive deeper into data science, consider enrolling in Analytics Vidhya’s Program, a comprehensive learning platform for aspiring data scientists.
A. Three ways to handle an imbalanced data set are:
a) Resampling: Over-sampling the minority class, under-sampling the majority class, or generating synthetic samples.
b) Using different evaluation metrics: F1-score, AUC-ROC, or precision-recall.
c) Algorithm selection: Choose algorithms designed for imbalance, like SMOTE or ensemble methods.
A. Several algorithms are capable of handling imbalanced data effectively. Random Forest, for instance, can manage class imbalance through bagging and feature selection. SVM can be adjusted by assigning class weights to penalize errors in the minority class. SMOTE generates synthetic samples for the minority class, aiding in balancing the dataset and improving model performance.
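As an illustration of the class-weight adjustment mentioned above, most sklearn classifiers accept class_weight='balanced'; a minimal sketch (X_train and y_train as in the earlier examples):
from sklearn.svm import SVC
# 'balanced' sets weights inversely proportional to class frequencies,
# so errors on the minority class are penalized more heavily
clf = SVC(class_weight='balanced')
clf.fit(X_train, y_train)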
A. When a dataset is imbalanced, several issues may arise. Models may exhibit bias toward the majority class, resulting in poor predictions for the minority class. Accuracy as an evaluation metric can be misleading, as it may appear high while the model’s performance on the minority class is lacking. In real-world applications, dealing with imbalanced data can pose significant challenges, potentially affecting decision-making, particularly in critical domains where accurate predictions are essential.