Gradient boosting starts from a predictive model that performs only slightly better than random chance, called a weak learner. Boosting takes this weak learner, figures out what it got wrong, and builds another model on those errors in an attempt to improve the predictions. It is an ensemble technique in which the predictors are built sequentially and iteratively, each new model learning from the mistakes of the previous one, until a stopping criterion is reached.
Because boosting is built from weak learners, it tends to resist overfitting: a weak learner is too simple to overfit on its own, and each subsequent model is fit to the mistakes of the previous one, so the models end up focusing on different parts of the data.
Finally, a boosting algorithm aggregates the predictions of its learners using a system of weights that determines how much each learner contributes to the final prediction.
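To make the intuition concrete, here is a minimal, hand-rolled sketch of the boosting loop for a regression problem. This is not XGBoost itself, and the toy data, n_rounds, and learning_rate values are illustrative assumptions: each weak learner is a shallow tree fit to the residuals of the current ensemble, and its prediction is shrunk by a learning rate before being added in.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data, assumed for illustration: y = x^2 plus noise
rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = X_toy.ravel() ** 2 + rng.normal(scale=0.5, size=200)

n_rounds = 50          # stopping criterion: number of boosting rounds
learning_rate = 0.1    # how much each weak learner contributes

# Start from a constant prediction (the mean), then repeatedly fit a
# shallow tree -- the weak learner -- to what the ensemble currently gets wrong.
pred = np.full_like(y_toy, y_toy.mean())
trees = []
for _ in range(n_rounds):
    residuals = y_toy - pred                     # the previous models' mistakes
    tree = DecisionTreeRegressor(max_depth=2)    # deliberately weak
    tree.fit(X_toy, residuals)
    pred += learning_rate * tree.predict(X_toy)  # shrink and add the correction
    trees.append(tree)

print("Training MSE after boosting:", np.mean((y_toy - pred) ** 2))
Each tree only needs to be slightly useful; the shrinkage by learning_rate is what aggregates the weak learners into a strong ensemble.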
In our example below, we use XGBoost, short for eXtreme Gradient Boosting. It is a standalone library whose scikit-learn-style API makes it feel like any other sklearn estimator in Python.
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
We will be predicting the resolution outcome, so we set that column as our target y. Every other feature goes into X by dropping resolution_outcome, and from there we split the data with train_test_split.
X = aug.drop('resolution_outcome', axis=1)
y = aug.resolution_outcome
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 3)
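Since we will be scoring the model with plain accuracy, a quick sanity check on the split sizes and the class balance of the target is worthwhile (an extra step we are assuming here, not part of the original data preparation):
# Assumed sanity check: shapes of the split and class balance of the target
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))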
Instantiating an XGBClassifier lets us take a look at the different parameters we can tweak.
xgb.XGBClassifier()
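As a sketch of what tweaking those parameters looks like, we could instantiate the classifier with a few of the commonly tuned ones set explicitly; the values below are arbitrary assumptions, not tuned for this dataset.
# Illustrative values only -- these are not tuned for our data
example_clf = xgb.XGBClassifier(
    n_estimators=200,    # number of boosted trees
    max_depth=4,         # depth of each weak learner
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    subsample=0.8,       # fraction of rows sampled for each tree
)
print(example_clf.get_params())  # full dictionary of tweakable parameters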

From there we .fit (or train) the model on X_train and y_train. We assess accuracy by predicting on the training data (training_preds) and then on data the model has not seen yet, X_test (val_preds). Comparing the two tells us whether the model is overfit to the training data.
xg_clf = xgb.XGBClassifier()
xg_clf.fit(X_train, y_train)
training_preds = xg_clf.predict(X_train)
val_preds = xg_clf.predict(X_test)
training_accuracy = accuracy_score(y_train, training_preds)
val_accuracy = accuracy_score(y_test, val_preds)
print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
print("Validation accuracy: {:.4}%".format(val_accuracy * 100))
Training Accuracy: 80.9%
Validation accuracy: 63.25%
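A single train/validation split can be noisy, so before tuning it can help to look at a k-fold cross-validated accuracy as well (an extra check we are assuming here, not part of the original walkthrough):
from sklearn.model_selection import cross_val_score

# Assumed check: 5-fold cross-validated accuracy on the training set
cv_scores = cross_val_score(xgb.XGBClassifier(), X_train, y_train, cv=5, scoring='accuracy')
print("Cross-validated accuracy: {:.4}%".format(cv_scores.mean() * 100))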
Now we can use GridSearchCV to tune our parameters.
param_test = {
    'max_depth': range(3, 10, 1),
    'min_child_weight': range(1, 6, 2),
    'alpha': range(10, 50, 10),
    'n_estimators': (100, 400, 25),
    'learning_rate': (0.1, 0.5, 0.1)
}

grid_clf = GridSearchCV(xg_clf, param_test, scoring='accuracy', cv=None)
grid_clf.fit(X_train, y_train)
best_parameters=grid_clf.best_params_
print("Grid Search found the following optimal parameters: ")
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
training_preds = grid_clf.predict(X_train)
val_preds = grid_clf.predict(X_test)
training_accuracy = accuracy_score(y_train, training_preds)
val_accuracy = accuracy_score(y_test, val_preds)
print("")
print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
print("Validation accuracy: {:.4}%".format(val_accuracy * 100))
Grid Search found the following optimal parameters:
learning_rate: 0.1
max_depth: 6
min_child_weight: 10
n_estimators: 30
subsample: 0.7
Training Accuracy: 75.73%
Validation accuracy: 77.0%
Using GridSearchCV, we found the optimal parameters and refit the model with them. The new training accuracy of about 75.7% now sits close to the validation accuracy, which tells us the model is no longer overfitting to the training data, and validation accuracy has jumped from 63.25% to 77%.
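Because GridSearchCV refits the best parameter combination on the full training set by default (refit=True), the tuned model can be pulled straight out of the search object; the variable names below are just illustrative.
# grid_clf.best_estimator_ is already trained with the best parameters found
best_model = grid_clf.best_estimator_
final_val_preds = best_model.predict(X_test)
print("Tuned validation accuracy: {:.4}%".format(accuracy_score(y_test, final_val_preds) * 100))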