Optimizing Our Predictive Model With Resampling

We built a predictive model using Logistic Regression, and it yielded an F1 Score of 0.5935.

One of the things we can take a look at to optimize our model is…our data. Do we have a class imbalance problem? Are there more negative outcomes than positive outcomes or vice versa?
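A quick way to check, assuming y is the resolution_outcome target Series from the last blog:

# class balance on the full target column
print(y.value_counts())
print(y.value_counts(normalize=True))  # same counts, shown as proportions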

We can use sampling techniques such as oversampling the minority class or undersampling the majority class. These techniques rebalance the dataset the learning algorithm is trained on. Even so, it is important to keep a test set drawn from the original data untouched, so we can fairly judge how well the model performs overall.
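For reference, the imbalanced-learn package wraps this idea in one step. We won't use it in this post, but here is a minimal sketch of the oversampling version, assuming imblearn is installed and X and y are the features and target from the last blog:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# split first so the test set stays untouched, then oversample only the training rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
X_train_bal, y_train_bal = RandomOverSampler(random_state=23).fit_resample(X_train, y_train)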

In our last blog, we used train_test_split to separate our data into a training set and a testing set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

Here we take our X_train and y_train, join them together with pd.concat, and name the resulting DataFrame training.

training = pd.concat([X_train, y_train], axis=1)

Now we split the training data into the rows where resolution_outcome equals 0 (negative) and the rows where it equals 1 (positive).

# separate minority and majority classes
negative = training[training.resolution_outcome==0]
positive = training[training.resolution_outcome==1]

print('negative outcomes count: '+ str(len(negative)))
print('positive outcomes count: '+ str(len(positive)))

Here is our data:
negative outcomes count: 464
positive outcomes count: 286

Upsample/Oversample

Now, using sklearn's resample, we will oversample (upsample) the positive class. Since positive is the minority class (286 rows), we oversample it with replacement until it matches the negative class, which has 464 rows.

from sklearn.utils import resample

# upsample minority
positive_upsampled = resample(positive,
                          replace=True, # sample with replacement
                          n_samples=len(negative), # match number in majority class
                          random_state=23) # reproducible results

We now join the negative and positive_upsampled data and name the DataFrame upsampled.

# combine majority and upsampled minority
upsampled = pd.concat([negative, positive_upsampled])

# check new class counts
upsampled.resolution_outcome.value_counts()

0: 464
1: 464

Using our upsampled data, we set new training X and y values. We are predicting a positive or negative outcome based on the features of our data, so y is resolution_outcome, which is either 0 (negative) or 1 (positive), and X is every other feature, which we get by dropping resolution_outcome.


from sklearn.linear_model import LogisticRegression

# trying logistic regression again with the balanced dataset
y_train = upsampled.resolution_outcome
X_train = upsampled.drop('resolution_outcome', axis=1)

upsampled_lr = LogisticRegression(solver='liblinear')
upsampled_lr.fit(X_train, y_train)

upsampled_pred = upsampled_lr.predict(X_test)

We use sklearn’s LogisticRegression and .fit (train) it on our new X_train and y_train data. After the model is trained, we use it to .predict on our X_test data. Now we can check our accuracy and F1 Score and see whether they have improved on our initial F1 Score of 0.5935.

from sklearn.metrics import accuracy_score, f1_score

# checking accuracy
print('Test Accuracy score: ', accuracy_score(y_test, upsampled_pred))

# checking F1 score
print('Test F1 score: ', f1_score(y_test, upsampled_pred))

Using upsampling has increased the F1 Score to 0.7299!

Downsample

Now we will downsample and see if the F1 Score is better than 0.7299. Downsampling or undersampling is done to the majority class.

print('negative outcomes count: '+ str(len(negative)))
print('positive outcomes count: '+ str(len(positive)))

Here is our data:
negative outcomes count: 464
positive outcomes count: 286

We will downsample the negative outcomes (the majority class, with 464 rows) to match the number of positive outcomes (286 rows).

# downsample majority
negative_downsampled = resample(negative,
                                replace = False, # sample without replacement
                                n_samples = len(positive), # match minority n
                                random_state = 23) # reproducible results

# combine minority and downsampled majority
downsampled = pd.concat([negative_downsampled, positive])

# checking counts
downsampled.resolution_outcome.value_counts()

Both classes should now sit at 286 rows each. As before, we retrain Logistic Regression on the balanced data, predict on the original X_test, and check the scores.

# trying logistic regression again with the balanced dataset
y_train = downsampled.resolution_outcome
X_train = downsampled.drop('resolution_outcome', axis=1)

downsampled_lr = LogisticRegression(solver='liblinear')
downsampled_lr.fit(X_train, y_train)

downsampled_pred = downsampled_lr.predict(X_test)

# checking accuracy
print('Test Accuracy score: ', accuracy_score(y_test, downsampled_pred))

# checking F1 score
print('Test F1 score: ', f1_score(y_test, downsampled_pred))

Downsampling returned an F1 Score of 0.7207, which is better than our initial F1 Score of 0.5935 but not better than our upsampled F1 Score of 0.7299.
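If we want more detail than a single F1 number, sklearn's classification_report shows per-class precision and recall. Here is a quick follow-up sketch using the predictions we already computed:

from sklearn.metrics import classification_report

# per-class precision, recall and F1 for the two resampled models
print(classification_report(y_test, upsampled_pred))
print(classification_report(y_test, downsampled_pred))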
