XGBoost Python Example
XGBoost is short for Extreme Gradient Boosting (I wrote an article that provides the gist of gradient boosting here). Unlike Gradient Boosting, XGBoost makes use of regularization parameters that help prevent overfitting.
Suppose we wanted to construct a model to predict the price of a house given its square footage.
We start with an arbitrary initial prediction. This could be the average in the case of regression and 0.5 in the case of classification.
For every sample, we calculate the residual with the following formula.
residual = actual value - predicted value
Suppose, after applying the formula, we end up with the following residuals, with the samples ordered from left to right.
Next, we use a linear scan to decide the best split along the given feature (Square Footage). By linear scan, we mean that we select a threshold between the first pair of points (their average), then select a threshold between the next pair of points (their average) and so on until we’ve explored all possibilities.
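To make the linear scan concrete, here's a minimal sketch in code. The square footage values are placeholders for illustration (the actual samples appear in the figures); the candidate thresholds are the averages of consecutive sorted values.
# Placeholder square footage values, sorted in ascending order
square_footage = sorted([300, 700, 1300, 1900])
# Candidate thresholds: the average of every consecutive pair of points
thresholds = [(a + b) / 2 for a, b in zip(square_footage, square_footage[1:])]
print(thresholds)  # [500.0, 1000.0, 1600.0]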
In our example, we start off by selecting a threshold of 500.
The corresponding tree is:
Notice how the values in each leaf are the residuals. That is, they are the differences between the actual house prices and the predicted values, not the house prices of the samples themselves.
In order to compare splits, we introduce the concept of gain. Gain is the improvement in accuracy brought about by the split. The gain is calculated as follows.
Gain = Similarity(left leaf) + Similarity(right leaf) - Similarity(root) - Gamma
where
Similarity = (sum of residuals)² / (number of residuals + Lambda)
Lambda and Gamma are both hyperparameters. Lambda is a regularization parameter that reduces the prediction’s sensitivity to individual observations, whereas Gamma is the minimum loss reduction required to make a further partition on a leaf node of the tree.
Say, we arbitrarily set Lambda and Gamma to the following.
We can proceed to compute the gain for the initial split.
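Here's a minimal sketch of that calculation in code. The residuals, Lambda, and Gamma values are placeholders for illustration; the actual numbers in the example come from the figures.
def similarity(residuals, lam):
    # Similarity score: (sum of residuals)^2 / (number of residuals + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam, gamma):
    # Gain: left similarity + right similarity - root similarity - gamma
    return similarity(left, lam) + similarity(right, lam) - similarity(left + right, lam) - gamma

# Placeholder residuals produced by the split at Sq Ft < 500 (Lambda = 0, Gamma = 0)
left_residuals = [-3.0]
right_residuals = [-8.0, 7.0, 9.0]
print(gain(left_residuals, right_residuals, lam=0.0, gamma=0.0))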
We continue and compute the gains corresponding to the remaining candidate thresholds.
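As a sketch, the scan over all candidate thresholds can be written as a loop, reusing the gain function and the placeholder values from above.
# Placeholder samples: (square footage, residual) pairs
samples = [(300, -3.0), (700, -8.0), (1300, 7.0), (1900, 9.0)]

best_threshold, best_gain = None, float("-inf")
for threshold in thresholds:
    left = [r for sqft, r in samples if sqft < threshold]
    right = [r for sqft, r in samples if sqft >= threshold]
    g = gain(left, right, lam=0.0, gamma=0.0)
    if g > best_gain:
        best_threshold, best_gain = threshold, g

print(best_threshold)  # 1000.0 for these placeholder values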
Then, we use the threshold that resulted in the maximum gain. In this case, the optimal threshold is Sq Ft < 1000. Thus, we end up with the following tree.
We repeat the process for each of the leaves. That is to say, we select a threshold for splitting the samples within each leaf and compute the gain of the resulting split.
When the gain is negative, the split does not yield better results than leaving the tree as it was.
We still need to check that a different threshold used in splitting the leaf doesn’t improve the model’s accuracy.
The gain is positive. Therefore, we still benefit from splitting the tree further. In doing so, we end up with the following tree.
We examine whether it would be beneficial to split the leaf whose samples have a square footage between 1,000 and 1,600.
The gain is negative. Therefore, we leave the tree as it is.
We still need to check whether we should split the leaf on the left (square footage < 1000).
Again, the gain is negative. Therefore, the final decision tree is:
When presented with a sample, the decision tree must return a single scalar value. Therefore, we use the following formula, which takes into account all of the residuals in a leaf node.
output value = (sum of residuals) / (number of residuals + Lambda)
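Here's a minimal sketch of that output value calculation, again with placeholder residuals.
def leaf_output(residuals, lam):
    # Output value of a leaf: sum of residuals / (number of residuals + lambda)
    return sum(residuals) / (len(residuals) + lam)

print(leaf_output([7.0, 9.0], lam=0.0))  # 8.0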
The first prediction is the sum of the initial prediction and the prediction made by the tree multiplied by the learning rate.
Assuming a learning rate of 0.5, the model makes the following predictions.
The new residuals are:
We then use these residuals to construct another decision tree, and repeat the process until we’ve reached the maximum number of estimators (default of 100). Once we’ve finished training the model, the predictions made by the XGBoost model as a whole are the sum of the initial prediction and the predictions made by each individual decision tree multiplied by the learning rate.
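To illustrate the overall loop, here's a minimal gradient boosting sketch. It uses Scikit-Learn's DecisionTreeRegressor as a stand-in for XGBoost's gain-based trees (so it omits Lambda and Gamma), but the update rule is the same: start from the initial prediction and add each tree's output multiplied by the learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit_predict(X, y, n_estimators=100, learning_rate=0.5, max_depth=3):
    # Initial prediction: the average of the target (regression case)
    prediction = np.full(len(y), y.mean(), dtype=float)
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction                     # residual = actual - predicted
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                         # fit the next tree to the residuals
        prediction += learning_rate * tree.predict(X)  # add the scaled tree output
        trees.append(tree)
    return prediction, trees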
Python Code
Unlike other machine learning models, XGBoost isn't included in the Scikit-Learn package. Therefore, it must be installed separately.
The XGBoost library has a lot of dependencies that can make installing it a nightmare. Lucky for you, I went through that process so you don’t have to. By far, the simplest way to install XGBoost is to install Anaconda (if you haven’t already) and run the following commands.
conda install -c conda-forge xgboost
conda install -c anaconda py-xgboost
Once we have XGBoost installed, we can proceed and import the desired libraries.
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
Just like in the example from above, we'll be using an XGBoost model to predict house prices. We use the Scikit-Learn API to load the Boston house prices dataset into our notebook.
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target)
We use the head function to examine the data.
X.head()
Here’s the list of the different features and their acronyms.
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000’s
In order to evaluate the performance of our model, we split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y)
Next, we initialize an instance of the XGBRegressor class. We can select the value of Lambda and Gamma, as well as the number of estimators and maximum tree depth.
regressor = xgb.XGBRegressor(
n_estimators=100,
reg_lambda=1,
gamma=0,
max_depth=3
)
We fit our model to the training set.
regressor.fit(X_train, y_train)
We can examine the relative importance attributed to each feature in determining the house price.
pd.DataFrame(regressor.feature_importances_.reshape(1, -1), columns=boston.feature_names)
As we can see, the percentage of the lower-status population (LSTAT) is the strongest predictor of house price.
Finally, we use our model to predict the price of a house in Boston given what it has learnt.
y_pred = regressor.predict(X_test)
We use the mean squared error to evaluate the model's performance. The mean squared error is the average of the squared differences between the predictions and the actual values.
mean_squared_error(y_test, y_pred)
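The same quantity can be computed by hand, to make the formula explicit.
import numpy as np
np.mean((np.array(y_test) - y_pred) ** 2)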