Synthetic Minority Over-sampling TEchnique (SMOTE)
Synthetic Minority Over-sampling TEchnique, or SMOTE for short, is a preprocessing technique used to address class imbalance in a dataset.
In the real world, we often end up trying to train a model on a dataset with very few examples of a given class (e.g. rare disease diagnosis, manufacturing defects, fraudulent transactions), which results in poor performance. Because such occurrences are so rare, it’s not always realistic to go out and acquire more data. One way of addressing this issue is to under-sample the majority class. That is to say, we exclude rows belonging to the majority class so that there are roughly the same number of rows for the majority and minority classes. However, in doing so, we throw away data that could have been used to train our model, risking a higher-bias model. The other option is to over-sample the minority class. In other words, we randomly duplicate observations of the minority class. The problem with this approach is that it leads to overfitting, because the model keeps learning from the exact same examples.
This is where SMOTE comes in. At a high level, the SMOTE algorithm can be described as follows:
- Take the difference between a sample and one of its nearest neighbors
- Multiply the difference by a random number between 0 and 1
- Add this difference to the sample to generate a new synthetic example in feature space
- Repeat with the next nearest neighbor, up to a user-defined number of synthetic samples
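In code, the core step looks something like this (a minimal sketch with made-up points, purely for illustration):
import numpy as np

x = np.array([2.0, 3.0])               # a minority-class sample (illustrative values)
neighbor = np.array([4.0, 1.0])        # one of its nearest neighbors
gap = np.random.uniform(0, 1)          # random number between 0 and 1
synthetic = x + gap * (neighbor - x)   # a new point somewhere on the segment between the two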
Let’s examine the algorithm in closer detail. Suppose we had an imbalanced dataset, i.e. one with far more rows of one class than the other. Imagine plotting a subset of the rows belonging to the minority class using only 2 of the many features; you could picture the subsequent steps taking place across all of the dimensions in the dataset.
We consider the first row (or a random row in the case N < 100), and compute its k nearest neighbors. We then select a random nearest neighbor out of the k nearest neighbors.
We compute the difference between the two points, multiply it by a random number between 0 and 1, and add the result to the original point. This gives us a synthetic example along the line segment between the two points.
We repeat the process N/100 times for each row. In other words, if the amount of over-sampling needed is 300%, we set N = 300, so 300/100 = 3 neighbors are chosen from the k = 5 nearest neighbors and one synthetic sample is generated in the direction of each.
We then move on to the next row, compute its k nearest neighbors and select N/100 = 300/100 = 3 of its nearest neighbors at random to use in generating new synthetic examples.
SMOTE in Python
Let’s walk through an example of using SMOTE in Python.
We begin by importing the required libraries.
from random import randrange, uniform
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score
In this example, we will make use of the Credit Card Fraud Detection dataset on Kaggle to train a model to determine whether a given transaction is fraudulent. We read the CSV file and store its contents in a Pandas DataFrame as follows:
df = pd.read_csv("creditcard.csv")
Unfortunately, due to confidentiality issues, the original features are not provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’.
df.head(5)
As we can see, there are significantly more negative samples than positive samples.
df['Class'].value_counts()
Out:
0 284315
1 492
Name: Class, dtype: int64
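To put the imbalance in perspective, we can look at the relative class frequencies; only about 0.17% (492 out of 284,807) of the transactions are fraudulent:
df['Class'].value_counts(normalize=True)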
For simplicity, we remove the time dimension.
df = df.drop(['Time'], axis=1)
We split the dataset into features and labels.
X = df.drop(['Class'], axis=1)
y = df['Class']
In order to evaluate the performance of our model, we split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Next, we initialize an instance of the RandomForestClassifier class.
rf = RandomForestClassifier(random_state=42)
We fit our model to the training set.
rf.fit(X_train, y_train)
Finally, we use our model to predict whether a transaction is fraudulent given what it has learnt.
y_pred = rf.predict(X_test)
Suppose our dataset contained 100 examples of fraudulent and 9,900 examples of regular transactions. If we used accuracy to measure the model’s performance, it could obtain an accuracy of 99% by simply predicting false every time. It’s for this reason that we use a confusion matrix to evaluate the model’s performance instead. As we can see below, our model classified 23 samples as non-fraudulent when, in fact, they were fraudulent.
confusion_matrix(y_test, y_pred)
Out:
array([[56862, 2],
[ 23, 75]])
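To make the point about accuracy concrete: even though the model misses 23 of the 98 fraudulent transactions in the test set, its accuracy works out to (56862 + 75) / 56962 ≈ 0.9996, which tells us very little about how well it catches fraud.
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)  # ≈ 0.9996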
For ease of comparison, if we wanted a single number to gauge the model’s performance, we could use recall. Recall (dad joke) that recall is equal to the number of true positives divided by the sum of true positives and false negatives.
recall_score(y_test, y_pred)
Out: 0.7653061224489796
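To tie the definition back to the confusion matrix above, we can compute recall by hand: 75 true positives divided by (75 true positives + 23 false negatives) gives roughly 0.765, matching recall_score.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tp / (tp + fn)  # 75 / (75 + 23) ≈ 0.765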
SMOTE from scratch
SMOTE was first described by Nitesh Chawla, et al. in their 2002 paper named for the technique, “SMOTE: Synthetic Minority Over-sampling Technique.” In said paper, they provide pseudocode for the algorithm, which we proceed to implement in Python.
def SMOTE(sample: pd.DataFrame, N: int, k: int) -> np.ndarray:
    T, num_attrs = sample.shape

    # If N is less than 100%, randomize the minority class samples as only a random percent of them will be SMOTEd
    if N < 100:
        T = round(N / 100 * T)
        N = 100

    # The amount of SMOTE is assumed to be in integral multiples of 100
    N = int(N / 100)

    synthetic = np.zeros([T * N, num_attrs])
    new_index = 0

    nbrs = NearestNeighbors(n_neighbors=k+1).fit(sample.values)

    def populate(N, i, nnarray):
        nonlocal new_index
        nonlocal synthetic
        nonlocal sample

        while N != 0:
            nn = randrange(1, k+1)
            for attr in range(num_attrs):
                dif = sample.iloc[nnarray[nn]][attr] - sample.iloc[i][attr]
                gap = uniform(0, 1)
                synthetic[new_index][attr] = sample.iloc[i][attr] + gap * dif
            new_index += 1
            N = N - 1

    for i in range(T):
        nnarray = nbrs.kneighbors(sample.iloc[i].values.reshape(1, -1), return_distance=False)[0]
        populate(N, i, nnarray)

    return synthetic
If you didn’t understand it at first glance, don’t fret. We’ll walk through the code step by step.
The algorithm assumes that if N > 100, then it’s a multiple of 100. In the event N is less than 100, we select a subset of the samples. For example, if N were 50 (i.e. 50%), we’d want the length of our synthetic array to be 50 / 100 * T = 0.5T, where T is the number of rows in the original array belonging to the minority class. We also set N to 100 here so that it becomes 1 in the subsequent line, since we only want to generate one synthetic sample using 1 of the nearest neighbors of each point in the subset.
if N < 100:
    T = round(N / 100 * T)
    N = 100
N = int(N / 100)
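As a quick sanity check on this bookkeeping (illustrative values, not part of the implementation): with T = 492 minority rows, N = 300 keeps all 492 rows and generates 3 synthetic samples per row, whereas N = 50 keeps only 246 rows and generates 1 synthetic sample for each.
T, N = 492, 300
print(T, int(N / 100))    # 492 rows, 3 synthetic samples each

T, N = 492, 50
if N < 100:
    T = round(N / 100 * T)
    N = 100
print(T, int(N / 100))    # 246 rows, 1 synthetic sample each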
We use k+1 here because, when queried with a point it was fitted on, NearestNeighbors returns the point itself as one of the neighbors. In other words, we want the k nearest neighbors excluding the point itself.
nbrs = NearestNeighbors(n_neighbors=k+1).fit(sample.values)
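A quick way to convince yourself of this behavior (a toy example, separate from the implementation):
pts = np.array([[0.0], [1.0], [3.0]])
nn_demo = NearestNeighbors(n_neighbors=2).fit(pts)
nn_demo.kneighbors([[0.0]], return_distance=False)  # array([[0, 1]]) -- index 0 is the query point itself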
In Python, the nonlocal keyword ensures that a name refers to the variable of the same name in the nearest enclosing function scope (in this case, the enclosing SMOTE function) rather than creating a new local variable.
nonlocal new_index
nonlocal synthetic
nonlocal sample
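A minimal illustration of the keyword, unrelated to SMOTE itself:
def outer():
    count = 0
    def inner():
        nonlocal count  # rebind the enclosing function's variable instead of creating a local one
        count += 1
    inner()
    return count

outer()  # 1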
We select a nearest neighbor at random by drawing an index between 1 and k (inclusive), skipping index 0 because, as mentioned before, NearestNeighbors returns the point itself as its own nearest neighbor.
nn = randrange(1, k+1)
We loop through the individual attributes in order to stay true to the pseudocode described above. However, it’s worth noting that there is a more efficient way of doing this using the NumPy APIs (see the sketch further below).
for attr in range(num_attrs):
We create a new synthetic example by taking the difference between the random neighbor and the point being considered, multiplying it by a number between 0 and 1, and adding the result to the original point.
dif = sample.iloc[nnarray[nn]][attr] - sample.iloc[i][attr]
gap = uniform(0, 1)
synthetic[new_index][attr] = sample.iloc[i][attr] + gap * dif
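As noted above, the attribute loop can be collapsed into NumPy array operations. One caveat: the paper’s pseudocode draws a fresh gap for every attribute, while the sketch below (our simplification, and a common variant) draws a single gap per synthetic sample, which places the new point exactly on the line segment between the two points:
# drop-in replacement for the attribute loop, using the same variables as above
dif = sample.iloc[nnarray[nn]].values - sample.iloc[i].values
synthetic[new_index] = sample.iloc[i].values + uniform(0, 1) * dif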
We move on to the next available index in our array and decrement N to indicate that we have generated one of the N/100 synthetic samples for this point.
new_index += 1
N = N - 1
We obtain the nearest neighbors for each point in the original array and call the populate function.
for i in range(T):
    nnarray = nbrs.kneighbors(sample.iloc[i].values.reshape(1, -1), return_distance=False)[0]
    populate(N, i, nnarray)
Prior to running the algorithm, we select all the fraudulent rows within our dataset.
minority = df[df['Class'] == 1].drop(['Class'], axis=1)
We set k to 5, meaning that each synthetic sample is generated using one of a row’s 5 nearest neighbors, chosen at random (assuming N ≥ 100). We set N to 200, meaning that we want to generate 200% more fraudulent examples, i.e. 2 synthetic samples per existing fraudulent row.
synthetic = SMOTE(minority, N=200, k=5)
As we can see, the array of synthetic examples has twice as many rows as the original minority set (2 × 492 = 984).
synthetic.shape
Out:
(984, 29)
Next, we concatenate the original samples with the samples we just generated and set the label to 1 for a total of 984 + 492 = 1476 samples.
synthetic_df = pd.DataFrame(synthetic, columns=minority.columns)
combined_minority_df = pd.concat([minority, synthetic_df])
combined_minority_df["Class"] = 1
Finally, we combine the fraudulent and non-fraudulent samples into a single DataFrame.
new_df = pd.concat([combined_minority_df, df[df['Class'] == 0]])
As we did before, we split the data into training and testing datasets, train the model and classify the rows in the testing dataset.
X = new_df.drop(['Class'], axis=1)
y = new_df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
If we look at the confusion matrix, we can see that the model produces a similar number of false negatives (24) despite the fact that there are now 3 times as many positive samples.
confusion_matrix(y_test, y_pred)
Out:
array([[56844, 0],
[ 24, 291]])
The recall score is much higher than that of the model trained on the dataset without the use of SMOTE.
recall_score(y_test, y_pred)
Out: 0.9238095238095239
SMOTE using the imbalanced-learn library
A ready-made Python implementation of SMOTE comes in its own library, imbalanced-learn (separate from scikit-learn), which can be installed as follows:
pip install imbalanced-learn
We can then import the SMOTE class.
from imblearn.over_sampling import SMOTE
To avoid any confusion, we read the CSV file again and rebuild the features and labels.
df = pd.read_csv("creditcard.csv")
df = df.drop(['Time'], axis=1)
X = df.drop(['Class'], axis=1)
y = df['Class']
We create an instance of the SMOTE class. It’s worth noting that, by default, it will over-sample the minority class until there are as many positive samples as negative samples.
sm = SMOTE(random_state=42, k_neighbors=5)
We apply the SMOTE algorithm to the dataset as follows:
X_res, y_res = sm.fit_resample(X, y)
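We can verify that the classes are now balanced. Depending on the imbalanced-learn version, y_res comes back as a pandas Series or a NumPy array, so we wrap it in a Series to be safe:
pd.Series(y_res).value_counts()
# Expected: 284315 samples of each class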
Again, we split the dataset, train the model and predict whether the samples in the testing dataset should be considered fraudulent or not.
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
If we look at the confusion matrix, we can see that there are roughly equal numbers of positive and negative samples in the test set and that the model didn’t produce any false negatives. Ergo, the recall is 1. Keep in mind that, because we over-sampled before splitting, many of the positive samples in the test set are themselves synthetic, which flatters the score.
confusion_matrix(y_test, y_pred)
Out:
array([[56737, 13],
[ 0, 56976]])
recall_score(y_test, y_pred)
Out: 1.0
Conclusion
When a machine learning model is trained on an imbalanced dataset, it tends to perform poorly on the minority class. When acquiring more data isn’t an option, we have to resort to under-sampling or over-sampling. Under-sampling is problematic because it discards samples that could otherwise have been used to train the model. Over-sampling by simply duplicating rows is less than ideal since it causes our model to overfit. SMOTE is a technique for over-sampling the minority class while avoiding overfitting. It does this by generating new synthetic examples close to other points belonging to the minority class in feature space.