Optimizing Our Predictive Model: SMOTE

SMOTE stands for Synthetic Minority Over-sampling Technique. The main difference between SMOTE and simple resampling is that SMOTE not only increases the size of the training set, it also increases the variety of training examples. Plain oversampling grows the training data by repeating original examples; SMOTE creates new training examples from the originals. If two minority-class examples sit near each other, SMOTE synthesizes a third example somewhere along the line between them.
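As a rough sketch of that idea (the two points and the interpolation step below are made up for illustration; imblearn's real implementation picks the neighbor from the k nearest minority-class neighbors), the synthesis looks something like this:

import numpy as np

# two hypothetical minority-class examples that sit near each other
minority_example = np.array([2.0, 3.0])
nearest_neighbor = np.array([4.0, 5.0])

# pick a random spot on the line segment between them
lam = np.random.rand()
synthetic_example = minority_example + lam * (nearest_neighbor - minority_example)

print(synthetic_example)  # a brand-new example lying between the two originals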

We import SMOTE from the imbalanced-learn library.

from imblearn.over_sampling import SMOTE

In our previous blog, we had to determine which was the minority class and which was the majority class. We learned that the positive class was the minority class, with 286 examples. We then manually resampled it by separating the classes of our predicted variable and upsampling the minority class until it matched the majority class count. SMOTE, on the other hand, oversamples the minority class automatically, without that manual separation.
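For contrast, that manual upsampling looked roughly like the sketch below (the DataFrame name df, the target column name, and the 0/1 class labels are assumptions for illustration):

import pandas as pd
from sklearn.utils import resample

# split the data by class of the predicted variable
df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]

# repeat minority examples (sampling with replacement) until the counts match
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=3)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])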

With SMOTE, after we split our data into training and testing data, we use imblearn's library to create an sm variable holding a SMOTE instance. Again, SMOTE synthesizes training data, so we only .fit_resample (resample) the X_train and y_train data, never the test data.

from sklearn.model_selection import train_test_split

# split before resampling so the test set stays untouched
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

sm = SMOTE(random_state=3)
X_train, y_train = sm.fit_resample(X_train, y_train)  # named fit_sample in older imblearn versions
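If you want to confirm that the training classes are now balanced, a quick count of the labels does the trick (this uses only the variables we already have):

from collections import Counter

# both classes should now have the same number of training examples
print(Counter(y_train))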

Finally, with the new synthetic examples added to our training data, we run our original model. We use sklearn's LogisticRegression, .fit (train) it on the X_train and y_train data to create a logistic regression predictive model, and then use that model to .predict on X_test.

from sklearn.linear_model import LogisticRegression

smote_lr = LogisticRegression(solver='liblinear')
smote_lr.fit(X_train, y_train)
smote_pred = smote_lr.predict(X_test)



from sklearn.metrics import accuracy_score, f1_score

# checking accuracy
print('Test Accuracy score: ', accuracy_score(y_test, smote_pred))

# checking F1 score
print('Test F1 score: ', f1_score(y_test, smote_pred))

Test Accuracy score: 0.752
Test F1 score: 0.7019

SMOTE returned an F1 score of 0.7019, which is better than our initial F1 score of 0.5935 but still not better than our upsample F1 score of 0.7299.
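If a single F1 number feels too coarse, sklearn's classification_report is a handy follow-up check, showing precision and recall for each class:

from sklearn.metrics import classification_report

# per-class precision, recall, and F1 for the SMOTE-trained model
print(classification_report(y_test, smote_pred))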
