10 Techniques to Solve Imbalanced Classes in Machine Learning

Himanshi Singh Last Updated : 18 Nov, 2024


While working as a data scientist, some of the most frequently occurring problem statements are related to binary classification. A common problem when solving these problem statements is imbalanced classification. When the number of observations in one class is much higher than in the other classes, a class imbalance exists.

Example: detecting fraudulent credit card transactions. As shown in the graph below, there are around 400 fraudulent transactions compared to around 90000 non-fraudulent transactions.

[Figure: Class Imbalance: Introduction]

Class imbalance is a common problem in machine learning, especially in classification problems. Imbalanced data can seriously hurt model performance. It appears in many domains, including fraud detection, spam filtering, disease screening, SaaS subscription churn, advertising click-throughs, etc. In this article, you will learn what class imbalance is, how oversampling and undersampling work, and how to deal with class imbalance in classification.

Learning Objectives

Get familiar with class imbalance in ML through coding tutorials in this article.
Understand various techniques for handling imbalanced data, such as Random under-sampling, Random over-sampling, and NearMiss.

The Problem With Class Imbalance in Machine Learning
Credit Card Fraud Detection Example
The Metric Trap
Resampling Techniques to Solve Class Imbalance
How to Balance Data With the Imbalanced-Learn Python Module?
How to deal with class imbalance in classification?
Undersampling and Oversampling
Advantages and Disadvantages of Under-Sampling
Advantages and Disadvantages of Over-Sampling
Conclusion
- Key Takeaways

The Problem With Class Imbalance in Machine Learning

Most machine learning algorithms work best when the number of samples in each class is about equal. This is because most algorithms are designed to maximize accuracy and reduce errors.

However, if the dataset has imbalanced classes, you can get a pretty high accuracy just by predicting the majority class, but you fail to capture the minority class, which is most often the point of creating the model in the first place. For example, if 99% of the data belongs to the majority class, then a basic classification model such as logistic regression or a decision tree may fail to identify the minority class data points.

Credit Card Fraud Detection Example

Let’s say we have a dataset from a credit card company, and we have to find out whether each credit card transaction was fraudulent or not.

But here’s the catch: fraudulent transactions are relatively rare. Only 6% of the transactions are fraudulent.

Now, before you even start, do you see how the problem might break? Imagine if you didn’t bother training a model at all. Instead, what if you just wrote a single line of code that always predicts ‘no fraudulent transaction’:

def transaction(transaction_data):
    return 'No fraudulent transaction'

Well, guess what? Your “solution” would have 94% accuracy!

Unfortunately, that accuracy is misleading.

For all those non-fraudulent transactions, you’d have 100% accuracy.
For those transactions which are fraudulent, you’d have 0% accuracy.
Your overall accuracy would be high simply because most of the transactions are not fraudulent (not because your model is any good).
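
To make the numbers above concrete, here is a minimal sketch in plain Python. The 100 labels are made up to match the 6% fraud rate of the example; they are not taken from the real dataset.

```python
# Hypothetical labels: 1 = fraudulent, 0 = non-fraudulent (6% fraud, as in the example)
y_true = [1] * 6 + [0] * 94

# The "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Overall accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: frauds caught / all frauds
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(y_true)

print("accuracy:", accuracy)
print("fraud recall:", fraud_recall)
```

The accuracy comes out at 0.94 while the recall on the fraud class is 0.0, which is exactly the gap that a single accuracy number hides.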

This is clearly a problem because many machine learning algorithms are designed to maximize overall accuracy. In this article, we will see different techniques to handle imbalanced data.

Sample Dataset

We will use a credit card fraud detection dataset for this article. You can find the dataset here.

After loading the data, display the first five rows of the dataset.

You can clearly see that there is a huge difference between the classes: 9000 non-fraudulent transactions and 492 fraudulent ones.

The Metric Trap

One of the major traps that new users fall into when dealing with unbalanced datasets relates to the metrics used to evaluate their machine learning model. Using simpler metrics like accuracy score can be misleading. In a dataset with highly unbalanced classes, a classifier that always predicts the most common class, without performing any analysis of the features, will have a high accuracy rate, but obviously not a meaningful one.

Let’s do this experiment using the simple XGBClassifier and no feature engineering:

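# import libraries
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# x_train, x_test, y_train, y_test are assumed to come from an earlier train/test split
xgb_model = XGBClassifier().fit(x_train, y_train)

# predict
xgb_y_predict = xgb_model.predict(x_test)

# accuracy score (true labels first, then predictions)
xgb_score = accuracy_score(y_test, xgb_y_predict)

print('Accuracy score is:', xgb_score)

OUTPUT
Accuracy score is: 0.992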

We can see 99% accuracy. We are getting very high accuracy because the model is mostly predicting the majority class, 0 (non-fraudulent).

Resampling Techniques to Solve Class Imbalance

One of the most widely adopted techniques for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).

[Figure: Class Imbalance: Undersampling and Oversampling]

Despite the advantage of balancing classes, these techniques also have their weaknesses (there is no free lunch).

The simplest implementation of over-sampling is to duplicate random records from the minority class, which can cause overfitting.

In under-sampling, the simplest technique involves removing random records from the majority class, which can cause a loss of information.

Let’s implement this with the credit card fraud detection example.

We will start by separating the data into class 0 and class 1.

# class count (value_counts sorts by frequency, so the majority class comes first)
class_count_0, class_count_1 = data['Class'].value_counts()

# separate the classes
class_0 = data[data['Class'] == 0]
class_1 = data[data['Class'] == 1]

# print the shape of each class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)

1. Random Under-Sampling

Undersampling can be defined as removing some observations of the majority class. This is done until the majority and minority classes are balanced out.

Undersampling can be a good choice when you have a ton of data (think millions of rows). But a drawback to undersampling is that we are removing information that may be valuable.

import pandas as pd

# randomly keep as many majority-class rows as there are minority-class rows
class_0_under = class_0.sample(class_count_1)

test_under = pd.concat([class_0_under, class_1], axis=0)

print("total class of 1 and 0:", test_under['Class'].value_counts())

# plot the count after under-sampling
test_under['Class'].value_counts().plot(kind='bar', title='count (target)')

[Figure: Class Imbalance: Random Under-Sampling]

2. Random Over-Sampling

Oversampling can be defined as adding more copies to the minority class. Oversampling can be a good choice when you don’t have a ton of data to work with.

A con to consider when oversampling is that it can cause overfitting and poor generalization to your test set.

# sample the minority class with replacement up to the majority-class count
class_1_over = class_1.sample(class_count_0, replace=True)

test_over = pd.concat([class_1_over, class_0], axis=0)

print("total class of 1 and 0:", test_over['Class'].value_counts())

# plot the count after over-sampling
test_over['Class'].value_counts().plot(kind='bar', title='count (target)')

How to Balance Data With the Imbalanced-Learn Python Module?

A number of more sophisticated resampling techniques have been proposed in the scientific literature.

For example, we can cluster the records of the majority class and do the under-sampling by removing records from each cluster, thus seeking to preserve information. In over-sampling, instead of creating exact copies of the minority class records, we can introduce small variations into those copies, creating more diverse synthetic samples.

Let’s apply some of these resampling techniques using the Python library imbalanced-learn. It is compatible with scikit-learn and is part of scikit-learn-contrib projects.

import imblearn

3. Random Under-Sampling With Imblearn

You may have heard about pandas, numpy, matplotlib, etc. while learning data science. There is another library, imblearn, which is used to resample imbalanced datasets and can improve your model performance.

RandomUnderSampler is a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes. Under-sample the majority class(es) by randomly picking samples with or without replacement.

# import libraries
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42, replacement=True)

# fit predictor and target variable
x_rus, y_rus = rus.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resample dataset shape:', Counter(y_rus))


4. Random Over-Sampling With imblearn

One way to fight imbalanced data is to generate new samples in the minority classes. The most naive strategy is to generate new samples by randomly sampling, with replacement, from the currently available samples. The RandomOverSampler offers such a scheme.

# import library
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

# fit predictor and target variable
x_ros, y_ros = ros.fit_resample(x, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))

5. Under-Sampling: Tomek Links

Tomek links are pairs of very close instances that belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, facilitating the classification process.

A Tomek link exists if two samples from different classes are each other’s nearest neighbors.

In the code below, we’ll use sampling_strategy='majority' to resample only the majority class.

# import library
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='majority')

# fit predictor and target variable
x_tl, y_tl = tl.fit_resample(x, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_tl))
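
To see what a Tomek link actually is, here is a small NumPy sketch on made-up one-dimensional data. The function name find_tomek_links and the toy points are illustrative assumptions; this is not how imblearn implements it internally, just the definition written out.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a point must not be its own neighbor
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy data (hypothetical): class 0 is the majority, class 1 the minority
X = np.array([[0.0], [1.0], [2.0], [2.9], [3.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

links = find_tomek_links(X, y)

# Drop only the majority-class member of each link
majority_in_links = {i if y[i] == 0 else j for i, j in links}
keep = [k for k in range(len(y)) if k not in majority_in_links]

print("Tomek links:", links)
print("kept indices:", keep)
```

Here the points at 2.9 (class 0) and 3.0 (class 1) are each other’s nearest neighbors, so they form the only link, and the majority point at 2.9 is removed. Note that Tomek links only clean the class boundary, so the classes are usually still imbalanced afterwards.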

6. Synthetic Minority Oversampling Technique (SMOTE)

This technique generates synthetic data for the minority class.

SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.

The SMOTE algorithm works in four simple steps:

Choose a minority class sample as the input vector.
Find its k nearest neighbors (k_neighbors is specified as an argument in the SMOTE() function).
Choose one of these neighbors and place a synthetic point anywhere on the line joining the point under consideration and its chosen neighbor.
Repeat the steps until the data is balanced.

# import library
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(x, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_smote))
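
The four SMOTE steps can also be written out by hand. The following is a rough NumPy sketch of the interpolation idea only, on made-up minority points; smote_sketch is a hypothetical helper, and the real imblearn implementation handles many more details.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Distances between minority samples only; a point is not its own neighbor
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest minority neighbors of each minority sample
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):                     # step 4: repeat until enough points exist
        i = rng.integers(len(X_min))           # step 1: pick a minority sample
        j = rng.choice(neighbors[i])           # steps 2-3: pick one of its k neighbors
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical minority-class points
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
X_new = smote_sketch(X_min, n_new=6)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already covered by the minority class instead of being exact duplicates.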

7. NearMiss

NearMiss is an under-sampling technique. Instead of resampling the minority class, it uses distances between samples to choose which majority-class samples to keep, until the majority class is the same size as the minority class.

from imblearn.under_sampling import NearMiss

nm = NearMiss()

x_nm, y_nm = nm.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resample dataset shape:', Counter(y_nm))
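
As a rough illustration of the distance idea, here is a NumPy sketch of one common variant (often called NearMiss-1): keep the majority samples whose average distance to their k closest minority samples is smallest. The helper nearmiss1_sketch and the toy data are assumptions made for this example, not the imblearn code.

```python
import numpy as np

def nearmiss1_sketch(X, y, majority=0, minority=1, k=3):
    """Keep the majority samples with the smallest mean distance to their k nearest minority samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y == minority)
    # Distances from every majority sample to every minority sample
    dist = np.linalg.norm(X[maj_idx][:, None, :] - X[min_idx][None, :, :], axis=2)
    k = min(k, len(min_idx))
    mean_k_nearest = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    # Keep as many majority samples as there are minority samples
    chosen = maj_idx[np.argsort(mean_k_nearest)[: len(min_idx)]]
    keep = np.sort(np.concatenate([chosen, min_idx]))
    return X[keep], y[keep]

# Toy data (hypothetical): five majority points and two minority points
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0], [11.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1])

X_nm, y_nm = nearmiss1_sketch(X, y, k=2)
print(X_nm.ravel(), y_nm)
```

On this toy data the two majority points closest to the minority class (3.0 and 4.0) are kept, so the result has two samples per class.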

8. Change the Performance Metric

Accuracy is not the best metric to use when evaluating imbalanced datasets, as it can be misleading.

Metrics that can provide better insight are:

Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.
Recall: the number of true positives divided by the number of positive values in the test data. The recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.
F1 Score: the harmonic mean of precision and recall.
Area Under ROC Curve (AUROC): AUROC represents the likelihood of your model distinguishing observations from two classes.
In other words, if you randomly select one observation from each class, what’s the probability that your model will be able to “rank” them correctly?
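
To connect these definitions to numbers, here is a small worked example in NumPy. The labels and scores are made up, and the 0.5 threshold is an assumption; in practice scikit-learn’s confusion_matrix, precision_score, recall_score, f1_score, and roc_auc_score compute the same quantities.

```python
import numpy as np

# Hypothetical labels (1 = minority/positive class) and model scores
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.55, 0.8, 0.9, 0.05])
y_pred = (y_score >= 0.5).astype(int)   # assumed decision threshold of 0.5

# Confusion matrix entries
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# AUROC as a ranking probability: the share of (positive, negative) pairs
# in which the positive gets the higher score (ties count as half)
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]
auroc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("precision:", precision, "recall:", recall, "F1:", round(f1, 3))
print("AUROC:", round(auroc, 3))
```

With these numbers, precision is 0.75 (one false positive among four positive predictions), recall is 1.0, and the AUROC is 20/21 because only one of the 21 positive/negative pairs is ranked the wrong way round.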

9. Penalize Algorithms (Cost-Sensitive Training)

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes in the minority class.

A popular algorithm for this technique is Penalized-SVM.

During training, we can use the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

We also want to include the argument probability=True if we want to enable probability estimates for SVM algorithms.

Let’s train a model using Penalized-SVM on the original imbalanced dataset:

# load libraries
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.svm import SVC

# class_weight='balanced' penalizes mistakes on the minority class more heavily
svc_model = SVC(class_weight='balanced', probability=True)

svc_model.fit(x_train, y_train)

svc_predict = svc_model.predict(x_test)

# check performance
print('ROCAUC score:', roc_auc_score(y_test, svc_predict))
print('Accuracy score:', accuracy_score(y_test, svc_predict))
print('F1 score:', f1_score(y_test, svc_predict))
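
It can help to see what class_weight='balanced' turns into. scikit-learn documents the heuristic as n_samples / (n_classes * np.bincount(y)); the sketch below reproduces that formula on made-up labels (95 majority, 5 minority), which are an assumption for illustration.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 majority (0) and 5 minority (1) samples
y = np.array([0] * 95 + [1] * 5)

classes, counts = np.unique(y, return_counts=True)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = len(y) / (len(classes) * counts)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

The minority class gets a weight of 10.0 and the majority class roughly 0.53, so each mistake on a minority sample costs about 19 times as much as a mistake on a majority sample, mirroring the 95:5 ratio.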

10. Change the Algorithm

While it’s a good rule of thumb to try a variety of algorithms in every machine learning problem, it can be especially beneficial with imbalanced datasets.

Decision trees frequently perform well on imbalanced data. In modern machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform singular decision trees, so we’ll jump right into those:

Tree-based algorithms work by learning a hierarchy of if/else questions. This can force both classes to be addressed.

# load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rfc = RandomForestClassifier()

# fit the predictor and target
rfc.fit(x_train, y_train)

# predict
rfc_predict = rfc.predict(x_test)

# check performance
print('ROCAUC score:', roc_auc_score(y_test, rfc_predict))
print('Accuracy score:', accuracy_score(y_test, rfc_predict))
print('F1 score:', f1_score(y_test, rfc_predict))

How to deal with class imbalance in classification?

Steps to deal with class imbalance in classification:

Understanding the Issue:

Determine which classes are in the minority and which are in the majority in your dataset.
Assess the severity of the imbalance by looking at the ratio between minority and majority instances.

Exploring the Data:

Examine the distribution of the features and the target variable.
Use visualizations to reveal possible trends or biases in the data.

Selecting the Appropriate Metrics:

Choose suitable evaluation metrics such as precision, recall, F1-score, or AUC-ROC.
Don’t rely on accuracy alone.

Resampling Techniques:

Oversampling: increase the number of instances in the minority class to balance the class distribution. Options include Random Over-Sampling and SMOTE.
Undersampling: reduce the number of instances in the majority class by randomly selecting a subset of it. Methods include Random Under-Sampling and cluster-based under-sampling.

Cost-Sensitive Learning:

Assign different misclassification costs based on the importance of each class.
Increase the penalty for misclassifying the minority class.

Choosing an Algorithm:

Select algorithms that are well suited to imbalanced data, such as decision trees, random forests, and gradient boosting.

Data Augmentation:

Generate synthetic data for the underrepresented class in order to balance the dataset.
Relevant for image, text, or time-series data.

Anomaly Detection:

Consider treating the minority class as anomalies if necessary.
Use anomaly detection methods to identify them.

Combining Approaches:

Combine two or more methods for the best results.
For instance, use oversampling together with cost-sensitive learning.

Evaluating the Model:

Evaluate the model’s performance using the selected metrics.
Iterate and improve the approach based on the results.

Further Considerations:

Try out various combinations of methods.
Take into account the computational cost and the size of the dataset.
Use cross-validation to ensure reliable performance evaluation.
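
Here is one way the resampling and cross-validation steps could fit together, as a sketch rather than a recipe. It assumes scikit-learn is installed, uses synthetic data from make_classification as a stand-in for a real dataset, and picks logistic regression and F1 arbitrarily. The point it illustrates is that oversampling is applied to the training folds only, so duplicated minority rows never end up in the fold used for evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (hypothetical): roughly 5% minority class
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

rng = np.random.default_rng(42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, test_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Random over-sampling of the minority class, on the training fold only
    min_idx = np.flatnonzero(y_tr == 1)
    maj_idx = np.flatnonzero(y_tr == 0)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    idx = np.concatenate([maj_idx, min_idx, extra])

    model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])

    # The held-out fold keeps its original, imbalanced distribution
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

print("F1 per fold:", np.round(scores, 3))
```

StratifiedKFold keeps the class ratio roughly the same in every fold, which matters when the minority class is small enough that a plain random split could leave a fold with very few positive samples.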

Undersampling and Oversampling

Undersampling and oversampling are techniques used to address class imbalance in machine learning. Class imbalance occurs when a dataset has a significant difference in the number of samples between different categories, or classes. Here are the advantages and disadvantages of each:

Advantages and Disadvantages of Under-Sampling

Advantage:

It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.

Disadvantages:

It can discard potentially useful information which could be important for building rule classifiers.
The sample chosen by random under-sampling may be biased and may not accurately represent the population, which can lead to inaccurate results on the actual test dataset.

Advantages and Disadvantages of Over-Sampling

Advantages:

Unlike under-sampling, this method leads to no information loss.
It often outperforms under-sampling.

Disadvantages:

It increases the likelihood of overfitting since it replicates the minority class events.

Conclusion

To summarize, in this article, we have seen various techniques to handle imbalanced classification in a dataset. There are actually many methods to try when dealing with imbalanced data. You can check the implementation of this code in my GitHub repository here.

I hope you liked the article and that it gave you a clear understanding of class imbalance in machine learning, of what undersampling and oversampling are, and of how they differ.

Key Takeaways

In this article, we learned about the different techniques that we can perform to handle class imbalance in machine learning.
Some of the most widely used techniques are SMOTE, imblearn oversampling, and under sampling.
There is no “best” method for handling imbalance; it depends on your use case.
Understanding class imbalance also helps you recognize where imbalanced classification applies in practice.

Q1. What are class imbalances?

A. Class imbalances in ML happen when the categories in your dataset are not evenly represented. For example, in a medical dataset, you might have many more healthy patients than sick ones. This can make it hard for a model to learn to recognize the less common category (the sick patients in this case).

Q2. What ratio is class imbalance?

A. Class imbalance is when a dataset has more examples of one class than others. It’s often expressed as a ratio (e.g., 1:10). This can make models biased towards the majority class. Techniques like oversampling, undersampling, and class weighting can help.

Q3. How to solve class imbalance problem?

A. There are several ways to address class imbalance:
Resampling: You can oversample the minority class or undersample the majority class to balance the dataset.
Synthetic Data: Generate new samples for the minority class using techniques like SMOTE (Synthetic Minority Over-sampling Technique).
Class Weighting: Adjust the weights of the classes in your loss function to give more importance to the minority class.
Anomaly Detection Models: Sometimes, models designed to detect anomalies can work well for imbalanced datasets.

Q4. Which loss is best for class imbalance?

A. One commonly used loss function for handling class imbalance in ML is Focal Loss. It reduces the weight of well-classified examples and focuses more on hard-to-classify examples, which helps the model to learn better from the minority class.
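
As a rough sketch of how Focal Loss down-weights easy examples, here is the binary form, -alpha_t * (1 - p_t)**gamma * log(p_t), written in NumPy. The function name binary_focal_loss, the defaults gamma=2.0 and alpha=0.25 (commonly used values), and the three example predictions are assumptions for illustration.

```python
import numpy as np

def binary_focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Per-sample binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    # p_t is the probability the model assigns to the true class
    p_t = np.where(y_true == 1, p, 1 - p)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

y_true = np.array([1, 1, 0])
p = np.array([0.9, 0.1, 0.1])   # predicted probability of class 1

focal = binary_focal_loss(y_true, p)
print(focal)
```

The first sample is a confident, correct prediction and the second is a confident, wrong one. The factor (1 - p_t)**gamma shrinks the first loss to almost nothing while leaving the second large, which is how the loss keeps training focused on hard examples.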

Himanshi Singh

I’m a data lover who enjoys finding hidden patterns and turning them into useful insights. As the Manager - Content and Growth at Analytics Vidhya, I help data enthusiasts learn, share, and grow together.
