K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression. KNN predicts based on the distance between points and assumes that the smaller the distance between two points, the more similar they are. To make a prediction, KNN finds the K training points closest to the prediction point and checks which class each of those neighbors belongs to. Whichever class holds the majority among the neighbors is what KNN predicts for the prediction point. Common evaluation metrics for KNN classification are precision, recall, accuracy, and F1-Score.
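The idea can be sketched in a few lines of code. The helper name `knn_predict` and the toy points below are made up purely for illustration, not taken from our dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Labels of the k closest points
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # Majority vote among those labels
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]

# Toy data: class 0 clustered near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # 1
```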
Again we are using the sklearn library, so we import the classes we need (including train_test_split, which we use below to split the data).
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
We again set our X and y values. We will be predicting a positive or negative outcome based on the features of our data, so we set y to resolution_outcome, which is either 0 (negative) or 1 (positive). X is every feature of our data except resolution_outcome, so we set X by dropping that column.
X = aug.drop('resolution_outcome', axis=1)
y = aug.resolution_outcome
Since KNN predicts based on the distance between points, it is greatly affected by outliers and by features measured on different scales, for example inches vs. centimeters. Scaling the data makes it unit independent, so the model is not dominated by the magnitude of any one variable. Note that we fit the scaler on the training data only, then apply it to both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 3)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
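As a quick illustration of what StandardScaler does, here is a toy example (the numbers below are made up): after fitting, each transformed column has mean 0 and standard deviation 1, regardless of the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with columns on very different scales
train = np.array([[70.0, 178.0],
                  [65.0, 165.0],
                  [72.0, 183.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(train)

# Each column now has mean 0 and standard deviation 1
print(np.allclose(scaled.mean(axis=0), 0))  # True
print(np.allclose(scaled.std(axis=0), 1))   # True
```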
We use sklearn’s KNeighborsClassifier and .fit our X_train and y_train data. We then use the model to predict on our X_test data.
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
test_preds = clf.predict(X_test)
Finally, we take our predictions stored in test_preds and compare them to y_test. Using sklearn’s metrics library, we can easily calculate the precision, recall, and accuracy scores, and, most importantly, the F1 Score.
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds)))
    print("Recall Score: {}".format(recall_score(labels, preds)))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds)))
print_metrics(y_test, test_preds)
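As a side note, the F1 Score is the harmonic mean of precision and recall, so it only rewards models that do well on both. A small sketch with made-up labels (purely illustrative, not our data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels and predictions for illustration only
labels = [1, 1, 1, 0, 0, 0, 1, 0]
preds  = [1, 0, 1, 0, 1, 0, 1, 0]

p = precision_score(labels, preds)  # TP=3, FP=1 -> 0.75
r = recall_score(labels, preds)     # TP=3, FN=1 -> 0.75
f1 = f1_score(labels, preds)

# F1 is the harmonic mean of precision and recall
print(round(f1, 4))                     # 0.75
print(round(2 * p * r / (p + r), 4))    # 0.75
```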

Our initial KNN model returned an F1 Score of 0.6067, which is better than our initial logistic regression model’s F1 Score of 0.5935. We can optimize the F1 Score by changing the value of K. In the following image, if we set K = 3, the new point will be classified as red, because we only consider the points inside the solid-line circle. However, if we set K = 5, we look at everything within the dashed-line circle instead, and the point would be classified as blue.
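The scenario described above can be sketched with sklearn on toy 2D points (the coordinates below are made up so that the two closest neighbors of the query are red, but blue dominates the next ring):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy points around a query at the origin (label 0 = red, 1 = blue)
X = np.array([[1, 0], [0, 1], [1.2, 0], [0, 1.5], [1.5, 0], [3, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
query = [[0, 0]]

knn3 = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X, y)

print(knn3.predict(query))  # [0] -> red wins 2-1 among the 3 nearest
print(knn5.predict(query))  # [1] -> blue wins 3-2 among the 5 nearest
```

The same query point flips class purely because of the choice of K, which is why tuning K matters.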

Here is a function that will help us find the best k value by repeatedly refitting the model with different values of k and keeping the one that produces the best F1 Score.
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    # Step by 2 so we only try odd values of k, avoiding tied votes
    for k in range(min_k, max_k + 1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))
find_best_k(X_train, y_train, X_test, y_test)
Best Value for k: 3
F1-Score: 0.6837
We have optimized our F1 Score from 0.6067 to 0.6837. Our best F1 Score overall still comes from the upsampled logistic regression model, which returned an F1 Score of 0.7299.