Predictive Modeling Continued: Random Forests

Random Forest is an ensemble method built upon Decision Trees. You can read about Decision Trees in my previous article here. An ensemble method combines multiple predictive models to achieve better predictive performance than any single model on its own.

Ensemble methods work off of the idea of the “Wisdom of the Crowd”. This phrase refers to the phenomenon that the average of many estimates is typically closer to the true value than most of the individual estimates.

The high variance of the individual trees actually works to the ensemble’s advantage. If the errors of the individual predictions are roughly symmetric, there will be about as many overestimates as underestimates, and they largely cancel each other out when averaged. This pulls the average closer to the actual value.
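
As a quick illustration (a toy simulation I’m adding here, not part of the original modeling code), we can average many noisy estimates of the same true value and see that the mean lands much closer to the target than a typical single estimate:

import numpy as np

rng = np.random.default_rng(3)

true_value = 10.0
# 500 noisy "predictions", each off by a random, roughly symmetric error
predictions = true_value + rng.normal(loc=0.0, scale=2.0, size=500)

print('typical single-prediction error:', np.abs(predictions - true_value).mean())
print('error of the averaged prediction:', abs(predictions.mean() - true_value))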

The decision trees in our Random Forest are again split based on the Gini index. Gini impurity ranges from 0 up to a maximum that approaches 1 as the number of classes grows: 0 means all elements in a node belong to a single class, while higher values mean the elements are spread across several classes. For a binary problem the maximum is 0.5, which occurs when the elements are split evenly between the two classes.
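
As a small worked example (helper code I’m adding here, not from the original walkthrough), Gini impurity for a node can be computed directly from the class counts:

import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples in each class."""
    proportions = np.asarray(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([50, 0]))       # pure node -> 0.0
print(gini_impurity([25, 25]))      # even binary split -> 0.5
print(gini_impurity([20, 20, 20]))  # even three-way split -> ~0.67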

Let’s get started. The first thing we do is import the libraries we are going to be using. From sklearn we import train_test_split to split our data, the RandomForestClassifier, and the accuracy and F1 scoring functions we will use to evaluate it.

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

We will be predicting the resolution outcome, so we set it as our y. We set all our other features as X by dropping resolution_outcome from the dataframe. From there we train_test_split our data and set up our classifier.

X = aug.drop('resolution_outcome', axis=1)
y = aug['resolution_outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)

rfc = RandomForestClassifier(random_state=3, max_depth=8)
rfc

Displaying rfc shows several parameters that we will be able to tune later with cross-validation. n_estimators is how many decision trees will be built for the random forest. max_depth is the maximum depth of each tree; if it is set to None, the nodes will expand until all leaves are pure or contain fewer samples than min_samples_split. max_features is the number of features the forest will consider when looking for the best split.
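
For example (just a quick inspection, not a required step, and rfc_explicit is a hypothetical name I’m introducing here), you can print the full set of parameters and their current values, or set the ones mentioned above explicitly when constructing the classifier:

# see every tunable parameter and its current value
print(rfc.get_params())

# the same kind of classifier with the parameters above set explicitly
rfc_explicit = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_depth=8,          # maximum depth of each tree
    max_features='sqrt',  # features considered at each split
    random_state=3,
)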

From there we fit (or train) our rfc classifier on our X_train and y_train data, then predict on our X_test data.

rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)

# checking accuracy
print('Test Accuracy score: ', accuracy_score(y_test, rfc_pred))

# checking F1 score
print('Test F1 score: ', f1_score(y_test, rfc_pred))

Our initial model returned an accuracy score of 0.736 and an F1 score of 0.529. Our next step is to tune the parameters in our classifier.

param_grid = {
    'n_estimators': [200, 300, 400],              # number of trees in the forest
    'max_features': [0.25, 0.33, 0.5],            # fraction of features considered at each split
    'max_depth': [5, 6, 7, 8, 9],                 # maximum depth of each tree
    'min_samples_leaf': [0.03, 0.04, 0.05, 0.06]  # minimum fraction of samples required at a leaf
}

from sklearn.model_selection import GridSearchCV

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, verbose=5, n_jobs=-1, cv=5)
CV_rfc.fit(X_train, y_train)

GridSearchCV will take our random forest classifier and fit a model for every combination of the parameter values listed in param_grid, scoring each combination with cross-validation. As the number of parameter values to test grows, the time and computational cost grow with it. Setting n_jobs=-1 makes use of all available processors, and cv=5 sets the cross-validation splitting strategy to 5-fold cross-validation.
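
To get a feel for the cost (a quick back-of-the-envelope calculation I’m adding, not part of the original code), the grid above contains 3 × 3 × 5 × 4 = 180 parameter combinations, and with 5-fold cross-validation each combination is fit 5 times:

from itertools import product

n_combinations = len(list(product(*param_grid.values())))
n_folds = 5

print('parameter combinations:', n_combinations)      # 180
print('total model fits:', n_combinations * n_folds)  # 900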

CV_rfc.best_params_

CV_rfc.best_params_ will then return the combination of parameters that produced the best cross-validation score. We would then re-set our classifier with those parameters.
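
One way to do that (a sketch of the refit step; the exact winning parameters will depend on your data) is to unpack best_params_ into a new classifier, or simply take the already-refit best_estimator_ that GridSearchCV keeps for you, and score it on the held-out test set:

# option 1: rebuild the classifier from the winning parameters
best_rfc = RandomForestClassifier(random_state=3, **CV_rfc.best_params_)
best_rfc.fit(X_train, y_train)

# option 2: use the estimator GridSearchCV already refit on the full training set
best_rfc = CV_rfc.best_estimator_

tuned_pred = best_rfc.predict(X_test)
print('Tuned Test Accuracy score: ', accuracy_score(y_test, tuned_pred))
print('Tuned Test F1 score: ', f1_score(y_test, tuned_pred))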
