How to Use SMOTE for an Imbalanced Dataset

Jun 10, 2022•5 min read

Languages, frameworks, tools, and trends

Classification problems are a major machine learning issue. When we try to label a dataset into a category based on an input dataset, there are numerous challenges we have to deal with in the data. An imbalanced dataset is one such challenge. It contains two classes, where one class is excessively high compared to the other. There are different techniques to deal with it, but SMOTE - Synthetic Minority Oversampling Technique - is one of the more popular.

This article will explore imbalanced data and how it can be handled using the SMOTE algorithm.

What is an imbalanced dataset?

An imbalanced dataset contains two different observations, where one observation is in the majority class and the other is in the minority class.

Let’s understand this with the help of an example:

Suppose we want to build a model that will help us identify whether a given patient has a tumor or not. There are 1000 patients, with 900 being non-cancer patients and the other 100 being cancer patients. Since the non-cancer patients are high in number, they will belong to the majority class while the rest will belong to the minority class.

As the purpose of our model is to predict whether someone has cancer or not, the focus is primarily on the minority class. In this case, however, the majority is nine times bigger than the minority class. This is an imbalanced dataset because the model will deliver high accuracy in predicting non-cancer patients and will be more inclined to the majority class - even though this is not the main objective of building our system.

Why is an imbalanced dataset a problem?

The algorithm in our example tends to incline towards the majority class, though our job is to build a model that helps predict cancer patients. Even if the majority class is just twice or thrice the minority class, we still consider it an imbalanced dataset.

If we assume that for all 1000 records the model will predict that all are non-cancer patients, then there is a fallacy in the model. It will have 90% accuracy since there are 900 non-cancer patients. Thus, despite the high accuracy of the model, it will not give the best results.

An imbalanced dataset is a common classification problem. A well-known example is when we have to identify whether email is spam or not.

Here are more examples of imbalanced datasets:

Fraudulent transactions occurring in banks.
Theft and pilferage of electricity.
Identification of rare diseases such as cancer, tumors, and so on.
Natural disasters.
Customer churn rate.

Undersampling and oversampling of imbalanced datasets

Before learning about SMOTE’s functionality, it’s important to understand two important terms: undersampling and oversampling.

Undersampling

The purpose of undersampling is to reduce the majority class. We perform it by removing some observations of the said class. There are two ways of doing so: in the first method, we randomly remove some records of the majority class, which is known as random undersampling. In the second method, we use statistical methods to remove the majority class, known as informed undersampling.

These undersampling methods also use data clearing techniques to further refine the majority of classes. Undersampling methods are generally not preferred because there is a chance of losing valuable information. This also leads to bias since we are removing data to ensure that the proportion of the two classes remains the same.

Oversampling

Oversampling is the opposite of undersampling. The objective is to increase the samples of the minority class so that the observations of both major and minor classes become equal. Unlike undersampling, where we remove datasets, we add new data to the dataset in oversampling. It can be achieved in two ways: random oversampling and synthetic oversampling.

In case of random oversampling, we replicate the existing minority class and add it to our dataset to increase the minority class. Meanwhile, synthetic oversampling technique is the process of generating artificial samples for the minority class. New samples are created in such a way that they add relevant information to the minority class while avoiding misclassification. The only downfall of oversampling is that it can lead to overfitting due to duplication of the same information.

The SMOTE algorithm

An oversampling method, SMOTE creates new, synthetic observations from present samples of the minority class. Not only does it duplicate the existing data, it also creates new data that contains values that are close to the minority class with the help of data augmentation. These new synthetic training records are made randomly by selecting one or more K-nearest neighbors for each of the minority classes. After completing oversampling, the problem of an imbalanced dataset is resolved and we are ready to test different classification models.

Below are the steps to implement the SMOTE algorithm:

Draw a random set from the minority class.
For all the observations for the sample, locate the K-nearest neighbors. To obtain the distance between the neighbors, find the Euclidean distance.
The next step is to find the vector between the current data point and the selected neighbor.
Next, multiply a vector between 0 and 1.
To obtain the new dataset, add new samples to the current data point.

Implementation of SMOTE in Python

1. The first step is to import all the necessary libraries. We will also install the imbalanced learned package and Pandas and NumPy - two important libraries.

# install the libraries
pip install imblearn

import numpy as np
import pandas as pd

Python

2. The next task is to load the dataset. read_csv is used to load a CSV file as a Pandas dataframe.

# read csv data- salary.csv
df = pd.read_csv('salary.csv')

Python

3. After the data is successfully loaded, it’s time to analyze class distribution.

Emp_inf = df['Class'].value_counts()

print(emp_inf)
print(“\n\tClass 0: {:0.2f}%”.format(20 * salary_status[0] / (emp_inf[0] + emp_inf[1])))
print(“\n\tClass 1: {:0.2f}%”.format(20 * salary_status[1] / (emp_inf[0]+emp_inf[1])))

Here,
Class 0: 98.83%
Class 1: 0.18%

Python

4. Once the analysis is done, we will split the data into train and test sets.

X = df.drop(columns = ['Emp_Cal','Class'])
y = df['Class']

X_diver, x_runner, y_diver, y_runner = train_test_split(X,y,random_state = 100,test_size =
 0.5,stratify = y)

Python

5. Finally, we will evaluate the results without SMOTE as well as with SMOTE.

First_model = LogisticRegression()
First_model.fit(x_diver,y_diver)
Deter = First_model.predict(x_runner)

print(“Prediction_Score”,accuracy_score(y_runner,deter))
print(classification_report(y_runner,deter))

sns.heatmap(confusion_matrix(y_runner,deter), annot = True,fmt ='.2g')

Python

Now, we will use the SMOTE module from imblearn.

print("\n\t Pre Oversampling, counts of label '1': {}".format(sum(y_runner == 1))) 
print("\n\t Pre Oversampling, counts of label '0': 
{}".format(sum(y_runner == 0))) 
  
# import SMOTE for sampling
from imblearn.over_sampling import SMOTE 

sm = SMOTE(sampling_strategy = 0.3, k_neighbors = 5, random_state = 100) 
X_diver_res, y_diver_res = sm.fit_sample(x_diver, y_diver.ravel()) 
  
# Print the oversampling results
print(“\n\t Post OverSampling, the shape of  diver_X: {}”.format(X_diver_res.shape)) 
print(“\n\t Post OverSampling, the shape of diver_y: 
{}”.format(y_diver_res.shape)) 
  
print("Post OverSampling, label count '1': {}".format(sum(y_diver_res == 1))) 
print("Post OverSampling, label count '0': {}".format(sum(y_diver_res == 0)))

Python

Results of SMOTE:

Inr = LogisticRegression() 
lnr.fit(X_diver_res, y_diver_res.ravel()) 
predictions = lr.predict(x_runner) 

# print the outcomes of SMOTE  
print(“Predicting_Score”,accuracy_score(y_runner,predictions))
print(classification_report(y_runner, predictions)) 
sns.heatmap(confusion_matrix(y_runner,predictions), annot=True, fmt='.2g')

Python

As discussed, imbalanced datasets contain two classes where one class (known as the majority class) has an excessively higher number of observations than the other class (minority class). Due to this, the model doesn’t yield expected results. Thus, imbalanced data needs to be dealt with to ensure that the machine learning model is effective. With the help of SMOTE, we can increase the number of observations of the minority class in a balanced way, helping the model to become more effective. Note that the technique also has a disadvantage as valuable information can be lost owing to duplication.

How to Use SMOTE for an Imbalanced Dataset

What is an imbalanced dataset?

Why is an imbalanced dataset a problem?

Undersampling and oversampling of imbalanced datasets

Undersampling

Oversampling

The SMOTE algorithm

Implementation of SMOTE in Python

Share this post

Share