In 2001, Jerome H. Friedman wrote up a seminal paper – Greedy function approximation: A gradient boosting machine. Little did he know that was going to evolve into a class of methods which threatens Wolpert’s No Free Lunch theorem in the tabular world. Gradient Boosting and its cousins(XGBoost and LightGBM) have conquered the world by giving excellent performances in classification as well as regression problems in the realm of tabular data.

Not Really!

Today, I am starting a new blog series on the Gradient Boosting Machines and all its cousins. The outline for the blog series are as follows: (This will be updated with links as and when each of them gets published)

  1. Part I – Gradient Boosting Algorithm
  2. Part II – Regularized Greedy Forest
  3. Part III – XGBoost
  4. Part IV – LightGBM
  5. Part V – CatBoost
  6. Part VI – NGBoost
  7. Part VII – The Battle of the Boosters

In the first part, let’s understand the classic Gradient Boosting methodology put forth by Friedman. Even though this is math heavy, it’s not that difficult. And wherever possible, I have tried to provide intuition to what’s happening.

Problem Setup

Let there be a dataset D with n samples. Each sample has m set of features in a vector x, and a real valued target, y. Formally, it is written as

Dataset\; \mathcal{D} = {x_i, y_i} ( |\mathcal{D}|=n,\; x_i\;\epsilon\;\mathbb{R}^m,\; y_i\; \epsilon\; \mathbb{R})

Now, Gradient Boosting algorithms are an ensemble method which takes an additive form. The intuition is that a complex function, which we are trying to estimate, can be made up of smaller and simpler functions, added together.

Suppose the function we are trying to approximate is

F(x) = 25+x^2+cos(x)

We can break this function as :

f_1(x) = 25, f_2(x) = x^2, f_3(x)=cos(x)

F(x) = f_1(x)+f_2(x)+f_3(x)

This is the assumption we are taking when we choose an additive ensemble model and the Tree ensemble that we usually talk about when talking about Gradient Boosting can be written as below:

y_i = \phi(x_i) = \sum_{m=1}^M f_m(x_i),\; f_m\;\epsilon\;\mathcal{F}

where M is the number of base learners and F is the space of regression trees.

Loss Function

\mathcal{L}(\phi)=\sum_i l(\hat{y}_i,y_i)

where l is the differentiable convex loss function

Since we are looking at an additive functional form for f(x), we can replace y_i with

\hat{y}_i^t = \hat{y}_i^{t-1}+f_t(x_i)

So, the loss function will become:

\mathcal{L} = \sum_{i=1}^n l(y_i, \hat{y}_i^{t-1}+f_t(x_i))

Algorithm

  1. Initialize the model with a constant value by minimizing the loss function
    • F_0 (x) = \underset{\rho}{\arg\min} \sum_{i=1}^n L(y_i,b_0)
    • b_0 is the prediction of the model which minimizes the loss function at 0th iteration
    • For the Squared Error loss, it works out to be the average over all training samples
    • For the Least Absolute Deviation loss, it works out to be the median over all training samples
  2. for m=1 to M:
    1. Compute r_{im} = -[\frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}] \; \; for \; \; i = 1,\cdots, n
      • r_{im} is nothing but the derivative of the Loss Function(between true and the output from the last iteration) w.r.t. F(x) from the last iteration
      • For the Squared Error loss, this works out to be the residual(Observed – Predicted)
      • It is also called the pseudo residual because it acts like the residual and it is the residual for the Squared Error loss function
      • We calculate the r_{im} for all the n samples
    2. Fit a Regression Tree to the r_{im} values using Gini Impurity or Entropy (the usual way)
      • Each leaf of the tree is denoted by R_{jm} for j = 1 \cdots J_m, where J_m is the number of leaves in the tree created in iteration m
    3. For j = 1 \cdots J_m compute \rho_{m}= \underset{\rho}{\arg\min} \sum_{i=1}^n L(y_i,F_{m-1}(x_i)+ \rho_m \sum_{j=1}^{J} b_{jm})
      • b_{jm} is the basis function or the least squared coefficients. This conveniently works out to be the average over all the samples in any leaf for the Squared Error loss and median for the Least Absolute Deviation loss
      • \rho_m is the scaling factor or leaf weights.
      • The inner summation over b can be ignored because of the disjoint nature of the Regression Trees. One particular sample will only be in one of those leaves. So the equation simplifies to:
      • \rho_{jm}= \underset{\rho}{\arg\min} \sum_{i=1}^n L(y_i,F_{m-1}(x_i)+ \rho_m \cdot b_m)
      • So, for each of the leaves, R_{jm}, we calculate the optimum value \rho, when added to the prediction of the last iteration minimizes the loss for the samples that reside in the leaf
      • For known loss functions like the Squared Error loss and Least Absolute Deviation loss, the scaling factor is 1. And because of that, the standard GBM implementation ignores the scaling factor.
      • For some losses like the Huber loss, \rho is estimated using a line search to find the minimum loss.
    4. Update F_m(x) = F_{m-1}(x) + \eta \cdot \rho_m \cdot\sum_{j=1}^{J_m} b_{jm}
      • Now we add the latest, optimized tree to the result from previous iteration.
      • \eta is the shrinkage or learning rate
      • The summation in the equation is only useful in the off chance that a particular sample ends up in multiple nodes. Otherwise, it’s just the optimized regression tree score b

Regularization

In the standard implementation (Sci-kit Learn), the regularization term in the objective function is not implemented. The only regularization that is implemented there are the following:

  • Regularization by Shrinkage – In the additive formulation, each new weak learner is “shrunk” by a factor, \eta. This shrinkage is also called learning rate in some implementation because it resembles the learning rate in neural networks.
  • Row Subsampling – Each of the candidate trees in the ensemble uses a subset of samples. This has a regularizing effect.
  • Column Subsampling – Each of the candidate trees in the ensemble uses a subset of features. This also has a regularizing effect and is usually more effective. This also helps in parallelization.

Gradient Boosting & Gradient Descent

The Similarity

We know that Gradient Boosting is an additive model and can be shows as below:

F_m(X) = F_{m-1}(X)+ \eta \cdot f_m(X)

where F is the ensemble model, f is the weak learner, η is the learning rate and X is the input vector.

Substituting F with \hat{y}, we get the familiar equation,

\hat{Y}^m = \hat{Y}^{m-1}+\eta \cdot f_m(X)

Now since f_m(X) is obtained at each iteration by minimizing the loss function which is a function of the first and second order gradients (derivatives), we can intuitively think about that as a directional vector pointing towards the steepest descent. Let’s call this directional vector as r_{m-1}. The subscript is m-1 because the vector has been trained on stage m-1 of the iteration. Or, intuitively the residual y- F_{m-1}(X). So the equation now becomes:

\hat{Y}^m = \hat{Y}^{m-1}+\eta r_{m-1}

Flipping the signs, we get:

\hat{Y}^m = \hat{Y}^{m-1}-\eta (-r_{m-1})

Now let’s look at the standard Gradient Descent equation:

x_t = x_{t-1} - \eta \Delta f(x_{t-1})

We can clearly see the similarity. And this result is what enables us to use any differentiable loss function.

The Difference

When we train a neural network using Gradient Descent, it tries to find the optimum parameters(weights and biases), w, that minimizes the loss function. And this is done using the gradients of the loss with respect to the parameters.

But in Gradient Boosting, the gradient only adjusts the way the ensemble is created and not the parameters of the underlying base learner.

While in neural network, the gradient directly gives us the direction vector of the loss function, in Boosting, we only get the approximation of that direction vector from the weak learner. Consequently, the loss of a GBM is only likely to reduce monotonically. It is entirely possible that the loss jumps around a bit as the iterations proceed.

Implementation

GradientBoostingClassifier and GradientBoostingRegressor in Sci-kit Learn are one of the earliest implementations in the python ecosystem. It is a straight forward implementation, faithful to the original paper. I follows pretty much the discussion we had till now. And it has implemented for a variety of loss functions for which the Greedy function approximation: A gradient boosting machine[1] by Friedman had derived algorithms.

  • Regression Losses
    • ‘ls’ → Least Squares
    • ‘lad’ → Least Absolute Deviation
    • ‘huber’ → Huber Loss
    • ‘quantile’ → Quantile Loss
  • Classification Losses
    • ‘deviance’ → Logistic Regression loss
    • ‘exponential’ → Exponential Loss

In the next part of the series, let’s take a look at one of the lesser known cousin of Gradient Boosting – Regularized Greedy Forest

Stay tuned!

References

  1. Friedman, Jerome H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 (2001), no. 5, 1189–1232.

3 thoughts on “The Gradient Boosters I: The Good Old Gradient Boosting

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s