So, I present to you, the Battle of the Boosters.

I have chosen a few regression datasets from Kaggle Datasets, mainly because they are easy to set up and run in Google Colab. Another reason is that I don't need to spend a lot of time on data preprocessing; instead I can pick one of the public kernels and get cracking. I'll also share one EDA kernel for each of the datasets we choose. All notebooks will be shared, with links at the bottom of the blog.

- **AutoMPG** – Technical specifications of cars from the UCI Machine Learning Repository. Shape: (398, 9), EDA Kernel, Data Volume: Low
- **House Prices: Advanced Regression Techniques** – The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. Shape: (1460, 81), EDA Kernel, Data Volume: Medium
- **Electric Motor Temperature** – Several streams of sensor data collected from a permanent magnet synchronous motor (PMSM) deployed on a test bench. The PMSM represents a German OEM's prototype model. Shape: (998k, 13), EDA Kernel, Data Volume: High

- XGBoost
- LightGBM
- Regularized Greedy Forest
- NGBoost

Nothing fancy here. I just did some basic data cleansing and scaling; most of the code is from a random public kernel. The only point worth noting is that the same preprocessing is applied to all the algorithms.

I have chosen cross validation to make sure the comparison between different algorithms is more generalized than specific to one particular split of the data. I have chosen a simple K-Fold with 5 folds for this exercise.
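As a sketch of this setup, the same 5-fold splitter can be shared across every algorithm so the comparison runs over identical splits (the toy data and estimator here are stand-ins for the real datasets and boosters):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Toy data standing in for any of the three datasets
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# The same 5-fold splitter is reused for every algorithm,
# so each one sees exactly the same splits
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(GradientBoostingRegressor(random_state=42),
                         X, y, scoring='neg_mean_squared_error', cv=cv)
print(np.sqrt(-scores.mean()))  # mean CV RMSE
```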

**Evaluation Metric:** Mean Squared Error

To have a standard evaluation for all the algorithms (thankfully all of them expose a scikit-learn API), I defined a couple of functions.

**Default Parameters:** First fit the CV splits with default parameters of the model. We record the mean and standard deviation of the CV scores and then fit the entire train split to predict on the test split.

```python
import math
import numpy as np
import sklearn.metrics as sklm
import sklearn.model_selection as ms

def eval_algo_sklearn(alg, x_train, y_train, x_test, y_test, cv):
    MSEs = ms.cross_val_score(alg, x_train, y_train,
                              scoring='neg_mean_squared_error', cv=cv)
    meanMSE = np.mean(MSEs)
    stdMSE = np.std(MSEs)
    alg = alg.fit(x_train, y_train)
    pred = alg.predict(x_test)
    rmse_train = math.sqrt(-meanMSE)
    rmse_test = math.sqrt(sklm.mean_squared_error(y_test, pred))
    return rmse_train, stdMSE, rmse_test
```

**With Hyperparameter Tuning**: Very similar to the previous one, but with an additional step of GridSearch to find best parameters.

```python
import math
import sklearn.metrics as sklm
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV

def tune_eval_algo_sklearn(alg, param_grid, x_train, y_train, x_test, y_test, cv):
    grid = GridSearchCV(alg, param_grid=param_grid, cv=cv,
                        scoring='neg_mean_squared_error', n_jobs=-1)
    grid.fit(x_train, y_train)
    print(grid.best_estimator_)
    best_params = grid.best_estimator_.get_params()
    alg = clone(alg).set_params(**best_params)
    alg = alg.fit(x_train, y_train)
    pred = alg.predict(x_test)
    rmse_train = math.sqrt(-grid.cv_results_['mean_test_score'][grid.best_index_])
    stdMSE = grid.cv_results_['std_test_score'][grid.best_index_]
    rmse_test = math.sqrt(sklm.mean_squared_error(y_test, pred))
    return rmse_train, stdMSE, rmse_test, alg
```

Hyperparameter tuning is in no way exhaustive, but is fairly decent.

The grids over which we run GridSearch for the different algorithms are:

**XGBoost**:

```python
{
    "learning_rate": [0.01, 0.1, 0.5],
    "n_estimators": [100, 250, 500],
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 5],
    "colsample_bytree": [0.5, 0.7, 1],
}
```

**LightGBM**:

```python
{
    "learning_rate": [0.01, 0.1, 0.5],
    "n_estimators": [100, 250, 500],
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 5],
    "colsample_bytree": [0.5, 0.7, 1],
}
```

**RGF**:

```python
{
    "learning_rate": [0.01, 0.1, 0.5],
    "max_leaf": [1000, 2500, 5000],
    "algorithm": ["RGF", "RGF_Opt", "RGF_Sib"],
    "l2": [1.0, 0.1, 0.01],
}
```

**NGBoost**:

Because NGBoost is quite slow, instead of defining a standard grid for all experiments, I searched along each parameter independently and decided on a grid based on the intuitions from that experiment.

I’ve tabulated the means and standard deviations of the RMSEs on the train CV splits for all three datasets. For Electric Motors, I did not tune the models, as it was computationally expensive.

Disclaimer: These experiments are in no way complete. One would need a much larger scale experiment to arrive at a conclusion on which algorithm is doing better. And then there is the No Free Lunch Theorem to keep in mind.

Right off the bat, NGBoost seems like a strong contender in this space. On the AutoMPG and House Prices datasets, NGBoost performs the best among all the boosters, both on mean RMSE and on the standard deviation of the CV scores, and by a large margin. NGBoost also shows quite a large gap between the default and tuned versions, which suggests either that the default parameters are not well chosen, or that tuning for each dataset is a key element in getting good performance out of NGBoost. But the Achilles heel of the algorithm is its runtime. With those huge bars towering over the others, we can see that the runtime really is on a different scale compared to the other boosters. On large datasets like the Electric Motor Temperature dataset, the runtime is prohibitively large, which is also why I didn't tune the algorithm there. It comes last among the boosters in this respect.

Another standout algorithm is Regularized Greedy Forest, which performs as well as or even better than XGBoost. In the low and medium data settings, its runtime is also comparable to that of the reigning king, XGBoost.

In the low data setting, popular algorithms like XGBoost and LightGBM do not perform well. The standard deviation of their CV scores is also higher, especially for XGBoost, which suggests overfitting; XGBoost has this problem in all three examples. On runtime, LightGBM reigns supreme (although I haven't tuned for computational performance), beating XGBoost in all three examples. In the high data setting, it blew everything else out of the water with a much lower RMSE and runtime than the rest of the competitors.

We have come far and wide in the world of gradient boosting, and I hope that at least for some of you, Gradient Boosting no longer means just XGBoost. There are so many algorithms, each with its own quirks, and a lot of them perform on par with or better than XGBoost. Another exciting area is Probabilistic Regression. I hope NGBoost becomes more efficient and steps over that hurdle of computational efficiency; once that happens, NGBoost will be a very strong candidate in the probabilistic regression space.

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI(A) – Natural Gradient
- Part VI(B) – NGBoost
- Part VII – The Battle of the Boosters

If you've not read the previous parts of the series, I strongly advise you to read up, at least the first one where I talk about the Gradient Boosting algorithm, because I am going to take it as a given that you already know what Gradient Boosting is. I would also strongly suggest reading Part VI(A) so that you have a better understanding of what Natural Gradients are.

The key innovation in NGBoost is the use of Natural Gradients instead of regular gradients in the boosting algorithm. And by adopting this probabilistic route, it models a full probability distribution over the outcome space, conditioned on the covariates.

The paper modularizes the approach into three components:

- Base Learner
- Parametric Distribution
- Scoring Rule

As in any boosting technique, there are base learners which are combined together to get a complete model. NGBoost doesn't make any assumptions here and states that the base learner can be any simple model. The implementation supports Decision Trees and Ridge Regression as base learners out of the box, but you can replace them with any other scikit-learn style model just as easily.

Here, we are not training a model to predict the outcome as a point estimate; instead we are predicting a full probability distribution. Every distribution is parametrized by a few parameters. For example, the normal distribution is parametrized by its mean and standard deviation – you don't need anything else to define it. So, if we train the model to predict these parameters instead of the point estimate, we will have a full probability distribution as the prediction.

Any machine learning system works on a learning objective, and more often than not, it is the task of minimizing some loss. In point prediction, the predictions are compared with the data using a loss function. The scoring rule is the analogue from the probabilistic regression world: it compares the estimated probability distribution with the observed data.

A proper scoring rule, $S$, takes as input a forecasted probability distribution and one observation *(outcome)*, and assigns a score to the forecast such that the true distribution of the outcomes gets the best score in expectation.

The most commonly used proper scoring rule is the logarithmic score $\mathcal{L}(\theta, y) = -\log P_\theta(y)$, which, when minimized, gives us the MLE. This is nothing but the (negative) log likelihood that we have seen in so many places. The scoring rule is parametrized by $\theta$ because that is what we are predicting as part of the machine learning model.
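To make the log score concrete, here it is for a Normal distribution evaluated at a single observation, a minimal sketch using only the standard library:

```python
import math

def log_score_normal(mu, sigma, y):
    """Negative log likelihood of y under N(mu, sigma^2).

    Minimizing this over (mu, sigma) is exactly maximum
    likelihood estimation."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# A forecast centred on the observation scores better (lower)
# than one that is far away
print(log_score_normal(mu=0.0, sigma=1.0, y=0.0))  # ~0.919
print(log_score_normal(mu=3.0, sigma=1.0, y=0.0))  # ~5.419
```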

Another example is CRPS (Continuous Ranked Probability Score). While the logarithmic score, or log likelihood, generalizes Mean Squared Error to the probabilistic space, CRPS does the same for Mean Absolute Error.

In the last part of the series, we saw what the Natural Gradient was. In that discussion, we talked about KL Divergences, because traditionally, Natural Gradients were defined on the MLE scoring rule. But the paper proposes a generalization of the concept and provides a way to extend it to the CRPS scoring rule as well: they generalized the KL Divergence to a generic divergence and provided derivations for the CRPS scoring rule.

Now that we have seen the major components, let's take a look at how all of this works together. NGBoost is a supervised learning method for probabilistic prediction that uses boosting to estimate the parameters $\theta$ of a conditional probability distribution $P_\theta(y|x)$. As we saw earlier, we need to choose three modular components upfront:

- Base learner ($f$)
- Parametric Probability Distribution ($P_\theta$)
- Proper scoring rule ($S$)

A prediction on a new input *x* is made in the form of a conditional distribution $P_\theta(y|x)$, whose parameters $\theta$ are obtained by an additive combination of *M* base learner outputs and an initial $\theta^{(0)}$. Let's denote the combined function learned by the *m*-th set of base learners for all parameters by $f^{(m)}$. There will be a separate set of base learners for each parameter of the chosen probability distribution; for example, for the normal distribution there will be $f^{(m)}_{\mu}$ and $f^{(m)}_{\log\sigma}$.

The predicted outputs are also scaled with a stage-specific scaling factor $\rho^{(m)}$ and a common learning rate $\eta$:

$$\theta = \theta^{(0)} - \eta \sum_{m=1}^{M} \rho^{(m)} \cdot f^{(m)}(x)$$

One thing to note here is that even if you have *n* parameters for your probability distribution, $\rho^{(m)}$ is still a single scalar.

Let us consider a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, boosting iterations *M*, learning rate $\eta$, a probability distribution $P_\theta$ with parameters $\theta$, a proper scoring rule $S$, and a base learner $f$.

- Initialize $\theta^{(0)}$ to the marginal. This is just estimating the parameters of the distribution without conditioning on any covariates, similar to initializing to the mean in regular boosting. Mathematically, we solve: $\theta^{(0)} = \arg\min_{\theta} \sum_{i=1}^{n} S(\theta, y_i)$
- For each iteration $m$ in $1, \dots, M$:
    - Calculate the natural gradients $g_i^{(m)}$ of the scoring rule $S$ with respect to the parameters from the previous stage, $\theta_i^{(m-1)}$, for all *n* examples in the dataset.
    - Fit a set of base learners $f^{(m)}$ for that iteration to predict the corresponding components of the natural gradients $g_i^{(m)}$. This output can be thought of as the projection of the natural gradient onto the range of the base learner class, because we are training the base learners to predict the natural gradient at the current stage.
    - Scale this projected gradient by a scaling factor $\rho^{(m)}$. This is because natural gradients rely on local approximations (as we saw in the earlier post), and these local approximations won't hold far away from the current parameter position. In practice, we use a line search to find the scaling factor which minimizes the overall scoring rule; in the implementation, they found that halving the scaling factor at each step of the line search works well.
    - Once the scaling factor is estimated, update the parameters by adding the negative scaled projected gradient to the outputs of the previous stage, after further scaling by the learning rate: $\theta_i^{(m)} = \theta_i^{(m-1)} - \eta \, \rho^{(m)} f^{(m)}(x_i)$

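The loop above can be sketched end to end for a concrete case. Below is a simplified, self-contained sketch for a Normal distribution with $\theta = (\mu, \log\sigma)$ under the log score, written against scikit-learn trees. The line-searched scaling factor $\rho^{(m)}$ is folded into a fixed learning rate for brevity, so this illustrates the idea rather than reproducing the actual NGBoost implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ngboost_normal_fit(X, y, M=100, lr=0.1):
    """Sketch of NGBoost for Y ~ N(mu, sigma^2), theta = (mu, log_sigma).

    For this parametrization the Fisher information is
    diag(1/sigma^2, 2), so the natural gradient of the log score
    works out to (mu - y, (1 - (y - mu)^2 / sigma^2) / 2)."""
    n = len(y)
    # Step 1: initialize theta to the marginal distribution of y
    mu = np.full(n, y.mean())
    log_sigma = np.full(n, np.log(y.std()))
    learners = []
    for _ in range(M):
        sigma2 = np.exp(2 * log_sigma)
        # Step 2: natural gradients of the log score at the current stage
        g_mu = mu - y
        g_ls = (1 - (y - mu) ** 2 / sigma2) / 2
        # Step 3: one base learner per distribution parameter,
        # fit to predict (project) the natural gradients
        f_mu = DecisionTreeRegressor(max_depth=3).fit(X, g_mu)
        f_ls = DecisionTreeRegressor(max_depth=3).fit(X, g_ls)
        # Step 4: step against the scaled projected gradient
        # (the line-searched rho is folded into the fixed lr here)
        mu = mu - lr * f_mu.predict(X)
        log_sigma = log_sigma - lr * f_ls.predict(X)
        learners.append((f_mu, f_ls))
    return mu, np.exp(log_sigma)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=500)
mu, sigma = ngboost_normal_fit(X, y)
print(np.abs(mu - np.sin(X[:, 0])).mean())  # small on the training data
```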

The algorithm has a ready-to-use scikit-learn style implementation at https://github.com/stanfordmlgroup/ngboost. Let's take a look at the key parameters to tune in the model.

- **Dist**: Sets the distribution of the output. Currently, the library supports *Normal*, *LogNormal*, and *Exponential* distributions for regression, and *k_categorical* and *Bernoulli* for classification. *Default: Normal*
- **Score**: Specifies the scoring rule. Currently the options are *LogScore* or *CRPScore*. *Default: LogScore*
- **Base**: Specifies the base learner. This can be any scikit-learn estimator. *Default: a depth-3 Decision Tree*
- **n_estimators**: The number of boosting iterations. *Default: 500*
- **learning_rate**: The learning rate. *Default: 0.01*
- **minibatch_frac**: The fraction of rows subsampled in each boosting iteration. This is more of a performance hack than performance tuning; when the dataset is huge, this parameter can considerably speed things up.

Although a considerable amount of caution is needed before using feature importances from machine learning models, NGBoost also offers them. It has a separate set of importances for each parameter it estimates.

Even better, SHAP values are also readily available for the model; you just need to use TreeExplainer. (To know more about SHAP and other interpretability techniques, check out my other blog series – Interpretability: Cracking open the black box.)

The paper also looks at how the algorithm performs when compared to other popular algorithms. There were two separate types of evaluation – probabilistic and point estimation.

On a variety of datasets from the UCI Machine Learning Repository, NGBoost was compared with other major probabilistic regression algorithms, like Monte-Carlo Dropout, Deep Ensembles, Concrete Dropout, Gaussian Processes, Generalized Additive Models for Location, Scale and Shape (GAMLSS), and Distributional Forests.

They also evaluated the algorithm on the point estimate use case against other regression algorithms like Elastic Net, Random Forest (scikit-learn), and Gradient Boosting (scikit-learn).

NGBoost performs as well as or better than the existing algorithms, with the additional advantage of providing a probabilistic forecast. And the formulation and implementation are flexible and modular enough to make it easy to use.

But one drawback is the computational performance of the algorithm. The time complexity increases linearly with each additional parameter we have to estimate. And the efficiency hacks and changes which have made their way into popular Gradient Boosting packages like LightGBM or XGBoost are not present in the current implementation. Maybe they will be ported over soon enough; the repo is under active development and lists this as one of the action items they are targeting. But until that happens, the implementation is quite slow, especially on large data. One way out is to use the *minibatch_frac* parameter to make the calculation of natural gradients faster.

Now that we have seen all the major Gradient Boosters, let's pick a sample dataset and see how they perform in the next part of the series.


- Amari, Shun-ichi. Natural Gradient Works Efficiently in Learning. Neural Computation, Volume 10, Issue 2, p.251-276.
- Duan, Tony. NGBoost: Natural Gradient Boosting for Probabilistic Prediction, arXiv:1910.03225v4 [cs.LG] 9 Jun 2020
- NGBoost documentation, https://stanfordmlgroup.github.io/ngboost

But among all the negativity, there was a sliver of light shining through. When faced with a common enemy, mankind united across borders (for the most part; there are always bad apples) to help each other tide over the assault. Scientists, the heroes of the day, doubled down to find a cure, a vaccine, and a million other things that help in the battle against COVID-19. Alongside them, Data Scientists were also called to action to help in any way they could. A lot of people tried their best at forecasting the progression of the disease so that governments could plan better. Many more dedicated their time to analysing the data coming out of a multitude of sources to prepare dashboards, network diagrams, etc. to help understand the progression of the disease. And yet another set of people tried to apply AI techniques to things like identifying the risk level of a patient, or helping diagnose the disease from X-rays.

While following these developments, one particular area where a lot of people made attempts is chest-radiography-based identification of COVID-19 cases. One of those early attempts received a lot of attention, volunteers, and funding, along with a lot of flak for the positioning the research took (you can read more about it here). TL;DR: a PhD candidate out of Australia used a pretrained model (ResNet50) trained on 50 images, messed up the code with a train-validation leak, and claimed 97% accuracy on COVID-19 case identification. Some others even got 100% accuracy (it turned out the model was evaluated on the same dataset it was trained on).

Amid this noise, an arXiv preprint came out of the University of Waterloo, Canada, by Linda Wang and Alexander Wong, titled COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images. In the paper, they propose a new CNN architecture trained from scratch on a dataset of 5941 posteroanterior chest radiography images. To generate the dataset, they combined two publicly available datasets – the COVID chest X-ray dataset and the Kaggle Chest X-ray Images (Pneumonia) dataset – and divided it into four classes: COVID-19, Viral, Bacterial, and Normal. The below bar chart shows the class distribution of the train and test splits. This was a decently sized dataset, although the COVID cases were on the lower side. They reported 100% Recall and 80% Precision for the model.

This, COVIDx, was the first dataset of decent size, and it got me interested. And since they shared the trained model and evaluation code in a GitHub repo, it was prime for an analysis.

I feel I need to state a disclaimer right about here. Whatever follows is a purely academic exercise and not at all an attempt to suggest this as a verifiable and valid way of testing for COVID-19. First things first: I personally do not endorse this attempt at identifying COVID-19 using any of these models. I have very little knowledge about medicine, and absolutely zero about reading an X-ray. Unless it has been verified and vetted by a medical professional, this is nothing better than a model trained on a competition dataset.

There are also a few problems with the dataset:

- The COVID-19 cases and the other cases come from different data sources, and it's doubtful whether the model is identifying the data source or actual visual indicators of COVID-19. I've tried looking at Grad-CAM results, but being an absolute zero at reading an X-ray, I don't know if the model is looking at the right indicators.
- It is also unclear at what stage of the disease a patient was when the X-ray was taken. If it was taken too late in the progression of the disease, this method does not hold its ground.

The first thought I had when I saw the model and the dataset was this – why not Transfer Learning? The dataset is quite small, especially the class that we are interested in. Training a model from scratch and trying to properly capture the complex representation of the different classes, especially the COVID-19 class, was a bit of a stretch for me.

But playing the Devil's advocate, why would a CNN trained on animals and food (the most popular classes in ImageNet) do better on X-rays? A network trained on natural, colourful images might have learnt feature representations totally different from what is necessary to handle monochromatic X-rays.

As a rational human being and a staunch believer in the process of Science, I decided to look up the existing literature. Surprisingly, the research community is divided on the issue. Let me make it clear: there is no debate as to whether pretraining or Transfer Learning works for medical images. The debate is about whether pretraining on ImageNet specifically has any benefit. There were papers which claimed ImageNet pretraining helped medical image classification and segmentation, and there were papers that pushed for random initialisation of the weights.

There was a recent paper by Veronika Cheplygina[1] which reviewed the literature from the perspective of whether or not ImageNet pretraining is beneficial for medical images. The conclusion was – "it depends". Another recent paper from Google Brain (which was accepted into NeurIPS 2019)[2] deep-dives into this issue. Although the general conclusion was that transfer learning with ImageNet weights is not beneficial for medical imaging, they do point out a few other interesting behaviours:

- When the sample size is small, as it is in most medical imaging problems, Transfer Learning, even if it is ImageNet-based, is beneficial.
- When they looked at the convergence rates of the different models, the pretrained ones converged faster.
- The most interesting result: they tried initializing the networks with random weights whose mean and standard deviation were derived from the pretrained weights, and found that this too provided the convergence speedup that pretrained models had.

The bottom line was that the study didn't show worse performance for ImageNet-pretrained models, and they converged faster. Even though the large ImageNet models may be over-parametrized for the problem, pretraining does offer some benefit if you want to get a model working as fast as possible.

With the literature review done, it was time to test my hypothesis. I gathered the dataset, wrote up a training script, and tested out a few pretrained models.

| Model | # of Parameters | GFLOPS |
| --- | --- | --- |
| DenseNet 121 | 8,010,116 | ~3 |
| Xception | 22,915,884 | – |
| ResNeXt 101 32x4d | 44,237,636 | ~7.8 |

I've used the fastai library (a wrapper around PyTorch), which is very easy to use, especially for Transfer Learning with its simple "freeze" and "unfreeze" functions. Most of the experiments were run either on my laptop with a meagre GTX 1650 or on Google Colab. Apart from torchvision, I've used the amazing pretrainedmodels library by Cadene as a source of pretrained models.

As our training dataset is relatively small, and because it mixes two different sources of X-rays, I've used a few transformations as data augmentation. This both increases the effective number of samples and gives the model better generalization. The transforms used are:

- Horizontal Flip
- Rotate
- Zoom
- Brightness
- Warp
- Cutout

Below are the basic steps I’ve used for the training of these models. Full code is published on GitHub.

- Import the models and create a learner in fastai. fastai has inbuilt mechanisms to "cut" and "split" pretrained models so that we can use a custom head and apply discriminative learning rates easily. For models in torchvision, the cut and split points are predefined in fastai, but for models loaded from elsewhere, we need to define them ourselves. "cut" tells fastai where to make the separation between the feature-extractor part of the CNN and the classifier part, so that the latter can be replaced with a custom head. "split" tells fastai how to split the CNN into blocks so that each block can have a different learning rate.
- Split the Train into Train and Validation using a StratifiedShuffleSplit
- Kept the loss as a standard CrossEntropy
- Freeze the feature extractor part of the CNN and train the model. I used the One-Cycle Learning Rate Scheduler proposed by Leslie Smith[3]. It is heavily advocated by Jeremy Howard in his fastai courses and is implemented in the fastai library.
- After the learning saturates, unfreeze the rest of the model and finetune the model. Whether to use One-Cycle scheduler or not and whether to use differential learning rates or not, was decided empirically by looking at the loss curves.
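The stratified split in the steps above can be sketched as follows (the toy labels here are illustrative stand-ins for the four X-ray classes):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels standing in for the four X-ray classes
labels = np.array(['normal'] * 50 + ['bacterial'] * 30 +
                  ['viral'] * 15 + ['covid'] * 5)
indices = np.arange(len(labels))

# A stratified split keeps the (imbalanced) class proportions
# the same in train and validation
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(sss.split(indices, labels))

for name, idx in [('train', train_idx), ('valid', valid_idx)]:
    values, counts = np.unique(labels[idx], return_counts=True)
    print(name, dict(zip(values, counts)))
```

This matters here because a plain random split could easily leave the tiny COVID-19 class with zero validation examples.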

Mixup[4] is a form of data augmentation where we generate new examples by weighted linear interpolation of two existing examples.

$\lambda$ is between 0 and 1. In practice, it is sampled from a Beta distribution parametrised by $\alpha$. Typically, $\alpha$ is between 0.1 and 0.4, so that the effect of mixup is not so strong that it leads to underfitting.
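A minimal NumPy sketch of how a mixed batch is formed (the array shapes are illustrative):

```python
import numpy as np

def mixup_batch(x, y, alpha=0.4, rng=None):
    """Mix each example with a randomly chosen partner.

    Returns x_tilde = lam * x_i + (1 - lam) * x_j, plus both label
    sets and the mixing weight, as in Zhang et al. [4]."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # one weight per batch
    perm = rng.permutation(len(x))        # random partners
    x_tilde = lam * x + (1 - lam) * x[perm]
    return x_tilde, (y, y[perm], lam)

rng = np.random.default_rng(0)
images = rng.random((8, 3, 32, 32))  # a toy batch of 8 RGB images
labels = np.arange(8)
mixed, (y_a, y_b, lam) = mixup_batch(images, labels, rng=rng)
print(mixed.shape, round(float(lam), 3))
```

The loss is then computed as `lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b)`.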

For DenseNet 121, I tried Progressive Resizing as well, just to see if it got me better results. Progressive Resizing is when we start training the network on small images, then use the learned weights as the starting point for training on larger images, moving up through the resolutions in stages. I tried it in three stages – 64×64, 128×128, and 224×224.

Without further ado, let’s take a look at the results.

The best DenseNet model was obtained by progressive resizing (64×64 –> 128×128) and using mixup during training.

The best Xception model was trained using mixup and finetuned after initial pretraining with frozen weights.

The best ResNeXt model was trained without mixup (did not try) and without finetuning (finetuning was giving worse performance for some reason).

Let’s summarize these results in a table and place them alongside the results from the COVID-Net paper.

We can see right off the bat that all the models have better accuracy than COVID-Net. But accuracy isn't the right metric to evaluate here. The Xception model, with the highest F1 score, seems to be the best-performing model of the lot. But if we look at Precision and Recall separately, we can see that COVID-Net has high recall, especially for the COVID-19 cases, whereas our models have high precision. DenseNet 121 has perfect recall, but its precision is bad. The Xception model has high precision and a not-too-bad recall.

We have seen that DenseNet was a high-recall model and Xception a high-precision model. Would the performance be better if we averaged the predictions of the two?

Not much different from before. Our ensemble still doesn't have better recall. Let's try a weighted ensemble that gives more weight to DenseNet, which has perfect recall on COVID-19. To determine the optimum weight, I used the predictions on the validation set and tried different weights.
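The weight search itself is only a few lines; the probability arrays below are random stand-ins for the two models' validation predictions, and the grid/metric are my assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_ensemble_weight(p_dense, p_xcept, y_valid, grid=np.linspace(0, 1, 21)):
    """Search a weight w for p = w * p_dense + (1 - w) * p_xcept,
    keeping the w with the best macro F1 on the validation set."""
    best_w, best_f1 = 0.0, -1.0
    for w in grid:
        preds = (w * p_dense + (1 - w) * p_xcept).argmax(axis=1)
        f1 = f1_score(y_valid, preds, average='macro')
        if f1 > best_f1:
            best_w, best_f1 = w, f1
    return best_w, best_f1

# Toy 3-class probabilities for 6 validation examples
rng = np.random.default_rng(1)
p1 = rng.dirichlet(np.ones(3), size=6)
p2 = rng.dirichlet(np.ones(3), size=6)
y = np.array([0, 1, 2, 0, 1, 2])
w, f1 = best_ensemble_weight(p1, p2, y)
print(w, f1)
```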

Let’s add these ensembles also to the earlier table for comparison.

Finally, we have a model which has a balance between Precision and Recall and also beats the COVID-Net scores across all the metrics. Let’s take a look at the Confusion Matrix for the ensemble.

When we are thinking about the usability of the model, we should also keep in mind model complexity and inference time. The below table shows the number of parameters as a proxy for model complexity, and the inference time on my machine (GTX 1650).

N.B. – I was not able to run inference with COVID-Net on my laptop (which has a terrible relationship with TensorFlow), and therefore do not know the inference time for that model. But judging from the # of parameters, it should be more than the other models.

N.B. – The # of parameters and Inference time for the ensemble is taken as the summation of the constituents.

Class Activation Maps (CAM) were introduced way back in 2015 by Zhou, Bolei et al.[5] as a way to understand what a CNN is looking at while making predictions. It is a technique for identifying the regions of the image a CNN focuses on while making predictions, achieved by projecting the weights of the output layer back onto the output of the convolutional layers.

Grad-CAM is a generalization of CAM to many end use cases beyond classification, and it achieves this by using the gradients w.r.t. the class at the last output of the convolutional layers. The authors of the paper[6] say:

Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.

Let's see a few examples of our predictions with their activations overlaid as a heatmap. I can't tell whether the network is looking at the right places; if somebody reading this knows how to, reach out to me and let me know.

We can also take a look at how good the feature representations that come out of these networks are. Since the output from the convolutional layers is high-dimensional, we'll have to use a dimensionality reduction technique to plot it in two dimensions.

A popular method for exploring high-dimensional data is t-SNE, introduced by van der Maaten and Hinton in 2008[7]. t-SNE, unlike something like PCA, isn't a linear projection. It uses the local relationships between points to create a low-dimensional mapping, which allows it to capture non-linear structure.
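The embedding step itself is short with scikit-learn; the feature matrix here is a random stand-in for the penultimate-layer activations:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for 200 penultimate-layer feature vectors of dimension 512
features = rng.normal(size=(200, 512))

# Perplexity roughly controls how many neighbours each point
# "considers"; it must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=50,
                 init='pca', random_state=0).fit_transform(features)
print(embedding.shape)  # (200, 2)
```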

Let's take a look at the t-SNE visualizations, with perplexity 50, for the three models – COVID-Net, Xception, and DenseNet.

It also appears that our ImageNet-pretrained models (Xception and DenseNet) have better feature representations than COVID-Net. The t-SNE of COVID-Net is quite spread out, with a lot of interspersion between the different classes, whereas the Xception and DenseNet representations show a much better degree of separation. The COVID-19 cases (green) show separation in all three plots, but because the dataset is so small, we need to take that inference with a grain of salt.

We've seen that the ImageNet-pretrained models performed better than the COVID-Net model. The best Xception model had better Precision, and the best DenseNet model had better Recall. In this particular scenario, Recall is what matters more, because it is better to be safe than sorry. If you classify a few non-COVID-19 cases as positive, they will just be directed to a proper medical test; the other kind of error is not that forgiving. So going purely by that, our DenseNet model is the better choice. But we also need to keep in mind that this has been trained on a limited dataset, and that the number of COVID-19 cases was just around 60. It is highly likely that the model has memorised or overfit to those 60 – a prime example being the case where a model used the label printed on the X-ray to classify it as COVID-19. The Grad-CAM examination was also not very helpful: for some examples the model seemed to be looking at the right places, but for others, the heatmap lit up almost all of the X-ray.

But after examining the Grad-CAM and the t-SNE, I think the Xception model has learned a much better representation of the cases. The problem of low Recall is something that can be dealt with.

On the larger question with which we started this whole exercise, I think we can safely say that Imagenet pretraining does help in the classification of chest radiography images of COVID-19 (I did try to train DenseNet without pretrained weights on the same dataset, without much success).

There are a lot of unexplored dimensions to this problem, and I’ll mention them in case any of my readers want to take them up.

- Collecting more data, especially COVID-19 cases, and retraining the models
- Dealing with the Class Imbalance
- Pretraining on Grayscale ImageNet[9] and subsequent Transfer Learning on Grayscale X-Rays
- Using the CheXpert dataset[8] as a bridge between Imagenet and Chest Radiography images, by fine-tuning Imagenet models on the CheXpert dataset and then applying them to the problem at hand

If you are a medical professional who thinks this is a worthwhile direction of research, do reach out to me. I want to be convinced that this is effective, but currently I am not.

If you are an ML researcher who wants to collaborate on publishing a paper, or to continue this line of research, do reach out to me.

**Update** – After I cloned the COVID-Net repo, there was an update, both to the dataset (a few more COVID-19 cases were added) and in the form of a larger model. The performance of our ensemble is still better than that of the large model.

- Cheplygina, Veronika, “Cats or CAT scans: transfer learning from natural or medical image source datasets?”, arXiv:1810.05444 [cs.CV], Jan. 2019.
- Raghu, Maithra et al., “Transfusion: Understanding Transfer Learning for Medical Imaging”, arXiv:1902.07208 [cs.CV], Feb. 2019.
- Smith, Leslie N., “Cyclical Learning Rates for Training Neural Networks”, arXiv:1506.01186 [cs.CV], June 2015.
- Zhang, Hongyi, Cissé, Moustapha, Dauphin, Yann N., and Lopez-Paz, David, “mixup: Beyond Empirical Risk Minimization”, ICLR (Poster) 2018.
- Zhou, Bolei et al., “Learning Deep Features for Discriminative Localization”, arXiv:1512.04150 [cs.CV], Dec. 2015.
- Selvaraju, Ramprasaath R. et al., “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization”, arXiv:1610.02391 [cs.CV], Oct. 2016.
- van der Maaten, Laurens and Hinton, Geoffrey, “Visualizing Data using t-SNE”, Journal of Machine Learning Research, 2008.
- Irvin, Jeremy et al., “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison”, arXiv:1901.07031 [cs.CV], Jan. 2019.
- Xie, Yiting and Richmond, David, “Pre-training on Grayscale ImageNet Improves Medical Image Classification”, ECCV 2018 Workshops, Springer, 2019, doi:10.1007/978-3-030-11024-6_37.

Pre-reads: I will be talking about KL Divergence, so if you are unfamiliar with the concept, take a pause and catch up. I have another article where I give the maths and intuition behind Entropy, Cross Entropy and KL Divergence. I also assume basic familiarity with Gradient Descent; I’m sure you’ll find a million articles about it online.

Sometimes the explanation will take on some maths-heavy portions. But wherever possible, I’ll also provide the intuition, so that if you are averse to maths and equations, you can skip right to it.

There are a million articles on the internet explaining what Gradient Descent is, and this is not the million-and-first. We will just cover enough of Gradient Descent to make the discussion ahead relevant.

The problem setup is: given a function **f(x)**, we want to find its minimum. In machine learning, this function is the loss function and **x** the parameters of the model. The algorithm is:

- We pick an initial value for **x**, mostly at random
- We calculate the gradient of the loss function w.r.t. the parameters, **x**
- We adjust the parameters such that $x_{t+1} = x_t - \alpha \nabla_x f(x_t)$, where $\alpha$ is the learning rate
- Repeat steps 2 and 3 until we are satisfied with the loss value.
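The loop above can be sketched in a few lines of Python; the quadratic loss and the step size here are illustrative choices, not anything prescribed by the algorithm.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Minimise a function given its gradient function, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)  # step against the gradient
    return x

# f(x) = (x - 3)^2 has gradient 2(x - 3) and its minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With this learning rate the error shrinks by a constant factor each step, so `x_min` lands essentially on 3.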

There are two main ingredients to the algorithm – the gradient and the learning rate.

**Gradient** is nothing but the first derivative of the loss function w.r.t. the parameters, **x**.

** Learning Rate** is a necessary scaling which is applied to the gradient update every time. It is just a small quantity by which we restrain the parameter update based on the gradient to make sure our algorithm converges. We will discuss why we need this very soon.

The first pitfall of Gradient Descent is the step size. We adjust the parameters, **x**, using the gradient of the loss function (scaled by the learning rate).

The gradient is the first derivative of the loss function and, by definition, it only knows the slope at the point at which it was calculated. It is myopic: it does not know the slope even at a point an infinitesimally small distance away. In the diagram above, we can see two points: the one on the steep drop has a higher gradient value and the one on the almost flat region has a smaller one. But does that mean we should be taking larger or smaller steps respectively? Because of this, we can only take the direction the gradient gives us as the absolute truth, and manage the step size using a hyperparameter, the **learning rate**. This hyperparameter makes sure we do not jump over the minima (if the step size is too high) or never reach the minima (if the step size is too small).

This is a pitfall you can find in almost all first-order optimization algorithms. One way to solve it is to use the second derivative, which also tells you the curvature of the function, and make the update step based on that – which is what the Newton-Raphson method of optimization does (see the Appendix in one of the previous blog posts). But second-order optimization methods come with their own baggage – computational and analytical complexity.

Another pitfall is that this update treats all parameters the same, scaling everything by a single learning rate. Some parameters may have a larger influence on the loss function than others, and by restricting the update for such parameters we make the algorithm take longer to converge.

What’s the alternative? The constant learning rate update is like a safety cushion we are giving the algorithm so that it doesn’t blindly dash from one point to the other and miss the minima altogether. But there is another way we can implement this safety cushion.

Instead of fixing the Euclidean distance each parameter moves (distance in the parameter space), we can fix the distance in the distribution space of the target output. That is, instead of changing all the parameters within an epsilon distance, we constrain the output distribution of the model to be within an epsilon distance of the distribution from the previous step.

Now how do we measure the distance between two distributions? Kullback-Leibler Divergence. Although technically not a distance (because it is not symmetric), it can be treated as one in the locality where it is defined. This works out well for us because we are concerned with how the output distribution changes when we make small steps in the parameter space. In essence, we are no longer moving in the Euclidean parameter space as in normal gradient descent, but in the distribution space with KL Divergence as the metric.

I’ll skip to the end and tell you that there is a magical matrix called the Fisher Information Matrix which, if we include it in the regular gradient descent formula, gives us Natural Gradient Descent, which has the property of constraining the output distribution in each update step[1].
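Written in symbols (with $\theta$ for the parameters, $\mathcal{L}$ for the loss, and $F$ for the Fisher Information Matrix), the update is:

```latex
\theta_{t+1} = \theta_t - \alpha_t\, F^{-1} \nabla_\theta \mathcal{L}(\theta_t)
```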

This is exactly the gradient descent equation, with a few changes:

- $\alpha$, the learning rate, is replaced with $\alpha_t$ to make it clear that the step size may change in each iteration
- An additional term, $F^{-1}$ (the inverse of the Fisher Information Matrix), has been added to the normal gradient.

When the normal gradient is scaled with the inverse of the Fisher’s matrix, we call it the Natural Gradient.

Now, for those who can accept the hand-wave that the Fisher Matrix is a magical quantity which makes the normal gradient natural, skip to the next section. For the brave souls who stick around, a little maths is coming your way.

I assume everyone knows what KL divergence is. We are going to start with KL Divergence and see how it translates to the Fisher Matrix, and what the Fisher Matrix is.

As we saw in the Deep Learning and Information Theory, KL Divergence was defined as:
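For discrete distributions P and Q, that definition is:

```latex
D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
```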

That was when we were talking about a discrete variable, like a categorical outcome. In a more general sense, we need to replace the summations with integration. Let’s also switch the notation to suit the gradient descent framework that we are working with.

Instead of P and Q, let the two distributions be $p(x|\theta)$ and $p(x|\theta+\delta)$, where x is our input features or covariates, $\theta$ the parameters of the loss function (e.g. the weights and biases in a neural network), and $\delta$ the small change in parameters that we make in a gradient update step. Under the new notation, the KL Divergence is defined as:
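Writing the two distributions as $p(x|\theta)$ and $p(x|\theta+\delta)$, with integrals in place of sums:

```latex
D_{KL}\left[\, p(x|\theta) \,\big\|\, p(x|\theta+\delta) \,\right]
= \int p(x|\theta) \log p(x|\theta)\, dx \;-\; \int p(x|\theta) \log p(x|\theta+\delta)\, dx
```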

Let’s rewrite the second term of the equation in terms of $\delta$ using a second-order Taylor Expansion (an approximation using the derivatives at a particular point). The Taylor Expansion in its general form is:
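For a function f expanded around $\theta$, the second-order form is:

```latex
f(\theta + \delta) \;\approx\; f(\theta) \;+\; \nabla_\theta f(\theta)^{T} \delta \;+\; \frac{1}{2}\, \delta^{T}\, \nabla_\theta^{2} f(\theta)\, \delta
```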

Rewriting the second term in this form, we get:

Let’s whip out our trusty Chain rule from high school calculus and apply it to the first term. The derivative of log x is 1/x. The Taylor Expansion of the second term becomes:

Plugging this back into the KL Divergence equation, we get:

Rearranging the terms we have:

The first term is the KL Divergence between the same distribution and that is going to be zero. Another way you can think about it is that log 1 = 0 and hence the first term becomes zero.

The second term will also be zero; let’s see why below. The key property we use to infer zero is that the integral of a probability distribution P(x) over x is 1 (just as the area under the curve of a probability density sums to 1).

Now, that leaves us with the Hessian of $\log p(x|\theta)$. A little bit of mathematics (*hand-wave*) gets us:

Let’s put this back into the KL Divergence equation.

The first term becomes zero because, as we saw earlier, the integral of a probability distribution P(x) over x is 1, and the first and second derivatives of 1 are 0. The integral in the second term is called the Fisher Matrix; let’s call it **F**. So the KL Divergence equation becomes:
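Collecting the pieces, with **F** as just defined, this is the standard result (notation as introduced earlier):

```latex
F = -\,\mathbb{E}_{x \sim p(x|\theta)}\!\left[ \nabla_\theta^{2} \log p(x|\theta) \right],
\qquad
D_{KL}\left[\, p(x|\theta) \,\big\|\, p(x|\theta+\delta) \,\right] \;\approx\; \frac{1}{2}\, \delta^{T} F\, \delta
```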

The integral over x in the Fisher Matrix can be interpreted as an Expected Value, which makes the Fisher Matrix the negative Expected Hessian of $\log p(x|\theta)$. And $\log p(x|\theta)$ is nothing but the log likelihood. We know that the first derivative gives us the slope and the second derivative (Hessian) gives us the curvature. So the Fisher Matrix can be seen as the curvature of the log likelihood function.

The Fisher Matrix is also the covariance of the score function. In our case, the score function is the gradient of the log likelihood, which measures the goodness of our prediction.

We looked at what the Fisher Matrix is, but we still haven’t linked it to Gradient Descent. We now know that the KL Divergence is a function of the Fisher Matrix and the delta change in parameters between the two distributions. As we discussed earlier, our aim is to minimize the loss subject to keeping the KL Divergence within a constant **c**.

Formally, it can be written as:
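In symbols (using the notation from above), the constrained problem is:

```latex
\delta^{*} = \operatorname*{arg\,min}_{\delta}\; \mathcal{L}(\theta + \delta)
\quad \text{subject to} \quad
D_{KL}\left[\, p(x|\theta) \,\big\|\, p(x|\theta+\delta) \,\right] \le c
```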

Let’s take the Lagrangian relaxation of the problem and use our trusted first-order Taylor approximation for the loss term:

To minimize the function above, we set the derivative to zero and solve. The derivative of the above function is:

Setting it to zero and solving for $\delta$, we get:
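Up to the Lagrange multiplier $\lambda$ introduced by the relaxation, the solution has the standard form:

```latex
\delta^{*} \;=\; -\frac{1}{\lambda}\, F^{-1} \nabla_\theta \mathcal{L}(\theta)
```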

What this means is that, up to a factor of $\frac{1}{\lambda}$ (the error tolerance we accepted with the Lagrangian relaxation), we get the optimal descent direction, taking into account the curvature of the log likelihood at that point. We can fold this constant factor of relaxation into the learning rate and consider it part of the same constant.

And with that final piece of mathematical trickery, we have the Natural Gradient as,
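That is:

```latex
\tilde{\nabla}_\theta \mathcal{L} \;=\; F^{-1} \nabla_\theta \mathcal{L},
\qquad
\theta_{t+1} = \theta_t - \alpha_t\, \tilde{\nabla}_\theta \mathcal{L}(\theta_t)
```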

We’ve talked a lot about Gradients and Natural Gradients. But to understand how Natural Gradient Descent differs from Gradient Descent, it is critical to understand why the Natural Gradient is different from the normal gradient.

At the center of it all is a loss function which measures the difference between the predicted output and the ground truth. And how do we change the loss? By changing the parameters, which changes the predicted output and thereby the loss. So in a normal gradient, we take the derivative of the loss function w.r.t. the parameters. The derivative will be small if the predicted probability distribution is close to the true distribution, and vice versa. It represents the amount your loss would change if you moved each parameter by one unit. So when we apply the learning rate, we scale the gradient update to these parameters by a fixed amount.

In the Natural Gradient world, we are no longer restricting the movement of the parameters in the parameter space. Instead, we arrest the movement of the output probability distribution at each step. And how do we measure the probability distribution? Log Likelihood. And Fisher Matrix gives you the curvature of the Log Likelihood.

As we saw earlier, the normal gradient has no idea about the curvature of the loss function, because it is a first-order optimization method. But when we include the Fisher Matrix in the gradient, what we are doing is scaling the parameter updates with the curvature of the log likelihood function. So in places in the distribution space where the log likelihood changes fast w.r.t. a parameter, the update for that parameter will be smaller than in a flat plain of the distribution space.

Apart from the obvious benefit of adjusting our updates using curvature information, the Natural Gradient also gives us a way to directly control the movement of the model in the prediction space. In normal gradient descent, movement is strictly in the parameter space, and we restrict it with a learning rate in the hope that movement in the prediction space is also restricted. In the Natural Gradient update, we directly restrict movement in the prediction space by stipulating that the model move only a fixed distance in KL Divergence terms.

The obvious question in a lot of your minds would be: if Natural Gradient Descent is so awesome, and clearly better than Gradient Descent, why isn’t it the de facto standard in Neural Networks?

This is one of those areas where pragmatism won over theory. Theoretically, the idea of using Natural Gradients is beautiful, and it also works as expected. But the catch is that calculating the Fisher Matrix and its inverse becomes intractable when the number of parameters is huge, as in a typical Deep Neural Network: for N parameters, the matrix has $O(N^2)$ entries and its inversion costs $O(N^3)$.

One other reason there wasn’t a lot of focus on Natural Gradients is that Deep Learning researchers and practitioners figured out clever tricks and heuristics to approximate the information in a second derivative without calculating it. Deep Learning optimizers have come a long way from SGD, and a lot of that progress has been in using such tricks to get better gradient updates. Momentum, RMSProp and Adam are all variations of SGD which use a running mean and/or running variance of the gradients to approximate second-order information and use it in the gradient updates.

These heuristics are much less computationally intensive than calculating a second-order derivative or the natural gradient, and they have enabled Deep Learning to be scaled to the level it is at currently.

That being said, the Natural Gradient still finds use in cases where the parameters to be estimated are relatively few, where the expected distribution is relatively standard (like a Gaussian), and in some areas of Reinforcement Learning. Recently, it has also been used in a form of Gradient Boosting, which we will cover in the next part of the series.

In the next part of our series, let’s look at the new kid on the block – **NgBoost**

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI(A) – Natural Gradient
- Part VI(B) – NGBoost
- Part VII – The Battle of the Boosters

- Amari, Shun-ichi. Natural Gradient Works Efficiently in Learning. Neural Computation, Volume 10, Issue 2, p.251-276.
- It’s Only Natural: An Excessively Deep Dive into Natural Gradient Optimization, https://towardsdatascience.com/its-only-natural-an-excessively-deep-dive-into-natural-gradient-optimization-75d464b89dbb
- Ratliff, Nathan ,Information Geometry and Natural Gradients, https://ipvs.informatik.uni-stuttgart.de/mlr/wp-content/uploads/2015/01/mathematics_for_intelligent_systems_lecture12_notes_I.pdf
- Natural Gradient Descent, https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/
- Fisher Information Matrix, https://wiseodd.github.io/techblog/2018/03/11/fisher-information/
- What is the natural gradient, and how does it work?, http://kvfrans.com/what-is-the-natural-gradient-and-where-does-it-appear-in-trust-region-policy-optimization/

Let’s take a look at what made it different:

Let’s take a look at the innovation which gave the algorithm its name – CatBoost. Unlike XGBoost, CatBoost deals with categorical variables in a native way. Many studies have shown that one-hot encoding high-cardinality categorical features is not the best way to go, especially in tree-based algorithms, and the other popular alternatives all come under the umbrella of Target Statistics – Target Mean Encoding, Leave-One-Out Encoding, etc.

The basic idea of Target Statistics is simple: we replace a categorical value by the mean of the targets over all training samples with the same categorical value. For example, say we have a categorical feature called weather, which has four values – sunny, rainy, cloudy, and snow. The most naive method, called Greedy Target Statistics, replaces “sunny” with the average target value over all training samples where the weather was “sunny”.

If M is the categorical feature we are encoding, $m_i$ a specific value of M, and n the number of training samples with M = $m_i$, then the greedy statistic is simply the mean of the target over those n samples.
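In symbols (with $m_i$ the specific categorical value and $y_j$ the target of sample j), the greedy statistic for a sample taking value $m_i$ is:

```latex
\hat{x}_i \;=\; \frac{\sum_{j=1}^{N} \mathbb{1}_{\{x_j = m_i\}}\; y_j}{\sum_{j=1}^{N} \mathbb{1}_{\{x_j = m_i\}}}
```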

But this is unstable when the number of samples with M = $m_i$ is too low, or zero. Therefore we use the Laplace smoothing familiar from the Naive Bayes classifier to make the statistic more robust.
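With smoothing, the statistic takes the standard form (with $m_i$ the categorical value, $y_j$ the targets, a the smoothing weight and p the prior):

```latex
\hat{x}_i \;=\; \frac{\sum_{j=1}^{N} \mathbb{1}_{\{x_j = m_i\}}\; y_j \;+\; a\,p}{\sum_{j=1}^{N} \mathbb{1}_{\{x_j = m_i\}} \;+\; a}
```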

where *a* > 0 is a parameter. A common setting for *p* (prior) is the average target value in the dataset.

But these methods usually suffer from something called Target Leakage, because we use the targets to calculate a representation for the categorical variable and then use that representation to predict the very same targets. Leave-One-Out Encoding tries to reduce this by excluding the sample for which the representation is being calculated, but it is not foolproof.

The CatBoost authors propose another idea here, which they call Ordered Target Statistics. It is inspired by online learning algorithms, which receive training examples sequentially in time; in that setting, the target statistics can rely only on the training examples seen in the past. To adapt this idea to the standard offline training paradigm, they imagine an artificial time, obtained by randomly permuting the dataset and treating the permuted order as sequential.

Then the target statistic for each sample is calculated using only the samples which occurred before it in the artificial time. It is important to note that a single permutation as the artificial time would not be very stable, so they do this encoding with multiple permutations.
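The idea under a single artificial-time permutation can be sketched as below; the function name and the smoothing choice are mine, and CatBoost’s actual implementation differs in detail (multiple permutations, efficient updates).

```python
from collections import defaultdict

def ordered_target_statistics(categories, targets, prior, a=1.0):
    """Encode a categorical column using only the 'past' of each sample
    under one artificial-time permutation (a sketch of the idea)."""
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running count per category
    encoded = []
    for cat, y in zip(categories, targets):
        # smoothed mean over the samples seen so far (prior p, weight a)
        encoded.append((sums[cat] + a * prior) / (counts[cat] + a))
        sums[cat] += y          # only now does this sample enter the history
        counts[cat] += 1
    return encoded

values = ordered_target_statistics(["sunny", "sunny", "rainy"], [1.0, 0.0, 1.0], prior=0.5)
```

Note that the first occurrence of each category falls back to the prior, and no sample ever sees its own target – which is exactly the leakage the scheme is designed to avoid.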

The main motivation for the CatBoost algorithm, as argued by the authors of the paper, is the target leakage, which they call Prediction Shift, inherent in traditional Gradient Boosting models. The high-level idea is quite simple. As we know, any Gradient Boosting model works iteratively, building base learners over base learners in an additive fashion. But since each base learner is built on the same dataset, the authors argue that there is a bit of target leakage which affects the generalization capability of the model. Empirically, we know that Gradient Boosted Trees have an overwhelming tendency to overfit the data. The only countermeasures against this leakage are features like subsampling, which, they argue, are heuristics that only alleviate the problem and do not completely remove it.

The authors formalize the proposed target leakage and mathematically show that it is present. Another interesting observation is that the target shift, or bias, is inversely proportional to the size of the dataset: if the dataset is small, the target leak is much more pronounced. This agrees with our empirical observation that Gradient Boosted Trees tend to overfit on small datasets.

To combat this issue, they propose a new variant of Gradient Boosting called Ordered Boosting. The idea, at its heart, is quite intuitive. The main problem with previous Gradient Boosting was the reuse of the same dataset in each iteration; if we had a fresh dataset for every iteration, the leakage would disappear. But since no dataset is infinite, this idea, applied purely, is not feasible. So the authors propose a practical implementation of the concept.

It starts out by creating *s+1* permutations of the dataset; these permutations are the artificial time the algorithm takes into account. Call them $\sigma_0, \sigma_1, \ldots, \sigma_s$. The permutations $\sigma_1, \ldots, \sigma_s$ are used for constructing the tree splits, and $\sigma_0$ is used to choose the leaf values. With a single permutation, the training samples with a short “history” would have high variance, and having multiple permutations eases out that defect.

We saw the way CatBoost deals with categorical variables earlier, and we mentioned that multiple permutations are used to calculate the target statistics. This is implemented as part of the boosting algorithm, which samples a permutation from $\sigma_1, \ldots, \sigma_s$ in each iteration. Both the gradient statistics required for the tree splits and the target statistics required for the categorical encoding are calculated using the sampled permutation.

And once all the trees are built, the leaf values of the final model F are calculated by the standard gradient boosting procedure (that we saw in the previous articles) using the permutation $\sigma_0$. When the final model F is applied to new examples from the test set, the target statistics are calculated on the entire training data.

One important thing to note is that CatBoost also supports traditional Gradient Boosting, apart from Ordered Boosting (boosting_type = ‘Plain’ or ‘Ordered’). If it is ‘Plain’ and there are categorical features, the permutations are still created for the target statistics, but the tree building and boosting are done without them.

CatBoost also differs from the rest of the flock in another key aspect – the kind of trees built in its ensemble. By default, CatBoost builds Symmetric Trees, or Oblivious Trees: trees in which the same feature split is used to partition instances into the left and right branches across an entire level of the tree.

This has a two-fold effect in the algorithm –

- Regularization: Since we are restricting the tree building process to one feature split per level, we are reducing the complexity of the algorithm and thereby adding regularization.
- Computational Performance: One of the most time-consuming parts of any tree-based algorithm is the search for the optimal split at each node. Because we restrict each level to a single feature split, we only have to search for one split instead of k, where k is the number of nodes in the level. These trees also make inference lightning fast – CatBoost was shown to be 8x faster than XGBoost at inference.
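To see why inference is so fast: evaluating a symmetric tree needs no per-node branching, just one comparison per level packed into a leaf index. The function and numbers below are an illustrative sketch, not CatBoost’s actual code.

```python
def predict_oblivious(x, feature_ids, thresholds, leaf_values):
    """Evaluate a symmetric (oblivious) tree of depth d: the same
    (feature, threshold) pair is shared by an entire level, so the leaf
    index is just d binary comparisons packed into an integer."""
    index = 0
    for feature, threshold in zip(feature_ids, thresholds):
        index = (index << 1) | (x[feature] > threshold)
    return leaf_values[index]

# depth-2 toy tree over features 0 and 1
pred = predict_oblivious(
    x=[2.0, 5.0],
    feature_ids=[0, 1],
    thresholds=[1.0, 10.0],
    leaf_values=[10.0, 11.0, 12.0, 13.0],
)
```

Here x[0] > 1.0 is true and x[1] > 10.0 is false, giving leaf index 0b10 = 2 and prediction 12.0.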

Although the default option is “*SymmetricTree*”, there is also the option to switch to “*Depthwise*” (XGBoost-style) or “*Lossguide*” (LightGBM-style) using the parameter “*grow_policy*”.

Another important detail of CatBoost is that it implicitly considers combinations of categorical variables in the tree building process, which helps it capture the joint information of multiple categorical features. But since the total number of possible combinations can explode quickly, a greedy approach is taken: for each split in the current tree, CatBoost combines the categorical features already used in the tree with each of the remaining categorical features, and the target statistics for these combinations are calculated on the fly.

Another interesting feature of CatBoost is the inbuilt Overfitting Detector: CatBoost can stop training earlier than the set number of iterations if it detects overfitting. There are two overfitting detectors implemented in CatBoost –

- IncToDec
- Iter

*Iter* is the equivalent of early stopping: the algorithm waits *n* iterations after the last improvement in the validation loss before stopping.

*IncToDec* is slightly more involved. It keeps track of the improvement of the metric from iteration to iteration, smooths that progression using an approach similar to exponential smoothing, and stops training when the smoothed value falls below a threshold.

Following XGBoost’s footsteps, CatBoost also deals with missing values separately. There are two ways of handling missing values in CatBoost – Min and Max.

If you select “Min”, the missing values are processed as the minimum value for the feature; if you select “Max”, they are processed as the maximum value. In both cases, it is guaranteed that a split separating missing values from the rest is considered in every tree split.

If LightGBM had a lot of hyperparameters, CatBoost has even more. With so many hyperparameters to tune, GridSearch stops being feasible, and getting the right combination for a given problem becomes more of an art. Still, I’ll try to summarize a few key parameters you should keep in mind.

- *one_hot_max_size* – The maximum number of unique values in a categorical feature below which it will be one-hot encoded instead of using target statistics. It is recommended that you do not do your own one-hot encoding before feeding in the feature set, because that hurts both accuracy and performance.
- *iterations* – The number of trees to be built in the ensemble. This has to be tuned with CV, or one of the overfitting detection methods should be employed to stop at the ideal iteration.
- *od_type, od_pval, od_wait* – These three parameters configure the overfitting detector. *od_type* is the type of detector. *od_pval* is the threshold for IncToDec (recommended range: [10e-10, 10e-2]); the larger the value, the earlier overfitting is detected. *od_wait* has a different meaning depending on *od_type*: for *IncToDec* it is the number of iterations to run before the detector kicks in; for *Iter* it is the number of iterations to wait without an improvement of the metric before stopping.
- *learning_rate* – The usual meaning. CatBoost automatically sets the learning rate based on the dataset properties and the number of iterations set.
- *depth* – The depth of the tree. Optimal values range from 4 to 10. Default: 6, and 16 if *grow_policy* is *Lossguide*.
- *l2_leaf_reg* – The regularization along the leaves. Any positive value is allowed; increase it to increase the regularization effect.
- *has_time* – We have already seen that an artificial time is used to accomplish ordered boosting. But what if your data actually has a temporal order? In such cases, set has_time = True to avoid random permutations in ordered boosting and instead use the order in which the data was provided as the one and only permutation.
- *grow_policy* – As discussed earlier, CatBoost builds “*SymmetricTree*” by default, but sometimes “*Depthwise*” and “*Lossguide*” might give better results.
- *min_data_in_leaf* – The usual parameter to control the minimum number of training samples in each leaf. Can only be used with *Lossguide* and *Depthwise*.
- *max_leaves* – The maximum number of leaves in any given tree. Can only be used with *Lossguide*. Values greater than 64 are not recommended, as they significantly slow down training.
- *rsm (colsample_bylevel)* – The fraction of features used in each split selection. This helps control overfitting; values range over (0, 1].
- *nan_mode* – Can take the values “Forbidden”, “Min”, and “Max”. “Forbidden” does not allow missing values and throws an error; Min and Max we have discussed earlier.
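To make the list concrete, here is a sketch of a starting configuration; the keys follow the CatBoost parameter list, but the values are illustrative and should be tuned per problem.

```python
# Illustrative CatBoost parameter dictionary; the keys are real CatBoost
# parameters, but the values are example starting points, not recommendations.
params = {
    "iterations": 1000,          # tune with CV or rely on the overfitting detector
    "learning_rate": 0.05,       # CatBoost can also set this automatically
    "depth": 6,                  # optimal values usually lie in [4, 10]
    "l2_leaf_reg": 3.0,          # increase for stronger regularization
    "one_hot_max_size": 10,      # low-cardinality categoricals get one-hot encoded
    "od_type": "Iter",           # early-stopping-style overfitting detector
    "od_wait": 50,               # iterations to wait without metric improvement
    "grow_policy": "SymmetricTree",
    "nan_mode": "Min",           # missing values treated as the feature minimum
}
```

Such a dictionary would typically be passed to the model constructor, e.g. `CatBoostRegressor(**params)`.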

In the next part of our series, let’s look at the new kid on the block – **NgBoost**

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI(A) – Natural Gradient
- Part VI(B) – NGBoost
- Part VII – The Battle of the Boosters

- Friedman, Jerome H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 (2001), no. 5, 1189–1232.
- Prokhorenkova, Liudmila, Gusev, Gleb et.al. (2018). CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems
- CatBoost Parameters. https://catboost.ai/docs/concepts/python-reference_parameters-list.html

The starting point for LightGBM was XGBoost. Essentially, they took XGBoost and optimized it; therefore LightGBM has (more or less) all the innovations XGBoost had, plus some of its own. Let’s take a look at the incremental improvements LightGBM made:

One of the main changes from all the other GBMs, like XGBoost, is the way the trees are constructed: LightGBM adopts a leaf-wise tree growth strategy.

All the other popular GBM implementations follow something called level-wise tree growth, where you find the best possible nodes to split and split them one level down at a time. This strategy results in symmetric trees, where every node in a level has child nodes, adding one layer of depth at a time.

In LightGBM, leaf-wise tree growth finds the leaf which will reduce the loss the most, and splits only that leaf, without bothering with the rest of the leaves in the same level. This results in an asymmetric tree, where subsequent splitting can very well happen on only one side of the tree.

The leaf-wise growth strategy tends to achieve lower loss than the level-wise strategy, but it also tends to overfit, especially on small datasets. On small datasets, level-wise growth acts like a regularizer, restricting the complexity of the tree, whereas leaf-wise growth is greedy.

Subsampling, or downsampling, is one of the ways we introduce variety and speed up the training process in an ensemble. It is also a form of regularization, as it stops the model from fitting the complete training data. Usually, this subsampling is done by taking a random sample of the training dataset and building a tree on that subset. What LightGBM introduced is an intelligent way of doing this downsampling.

The core of the idea is that the gradient of each sample indicates how big a role it plays in the tree building process. Instances with larger gradients (i.e. under-trained instances) contribute much more to the tree building than instances with small gradients. So when we downsample, we should strive to keep the instances with large gradients so that the tree building stays efficient.

The most straightforward idea is to discard the instances with low gradients and build the tree on just the large-gradient instances. But this would change the distribution of the data, which in turn would hurt the accuracy of the model. Hence the GOSS (Gradient-based One-Side Sampling) method.

The algorithm is pretty straightforward:

- Keep all the instances with large gradients
- Perform random sampling on the instances with small gradients
- Introduce a constant multiplier for the small-gradient instances when computing the information gain during tree building: if we keep the top *a* fraction of instances with large gradients and randomly sample a *b* fraction of the instances with small gradients, we amplify the sampled small-gradient instances by *(1 − a) / b*, which keeps the overall gradient statistics approximately unbiased.
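A minimal numpy sketch of the sampling step (the function name `goss_sample` and its interface are mine for illustration, not LightGBM's API):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Gradient-based One-Side Sampling (sketch).

    Keeps the top-a fraction of instances by |gradient|, randomly samples a
    b fraction of the rest, and up-weights the sampled small-gradient
    instances by (1 - a) / b to keep gradient statistics roughly unbiased."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(gradients)
    top_n, rand_n = int(a * n), int(b * n)
    order = np.argsort(np.abs(gradients))[::-1]     # largest |gradient| first
    top_idx = order[:top_n]                         # always kept
    sampled_idx = rng.choice(order[top_n:], size=rand_n, replace=False)
    idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(idx))
    weights[top_n:] = (1 - a) / b                   # amplify small-gradient sample
    return idx, weights
```

The returned indices and weights would then feed the histogram/gain computation in place of the full dataset.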

The motivation behind EFB is a theme common to LightGBM and XGBoost. In many real world problems, although there are a lot of features, most of them are really sparse, like one-hot encoded categorical variables. The way LightGBM tackles this problem is slightly different.

The crux of the idea lies in the fact that many of these sparse features are exclusive, i.e. they do not take non-zero values simultaneously. And we can efficiently bundle these features and treat them as one. But finding the optimal feature bundles is an NP-Hard problem.

To this end, the paper proposes a Greedy Approximation to the problem, which is the Exclusive Feature Bundling algorithm. The algorithm is also slightly fuzzy in nature, as it will allow bundling features which are not 100% mutually exclusive, but it tries to maintain the balance between accuracy and efficiency when selecting the bundles.

The algorithm, on a high level, is:

- Construct a graph with all the features, with edge weights representing the total conflicts between features
- Sort the features by their degrees in the graph in descending order
- Check each feature and either assign it to an existing bundle with a small conflict or create a new bundle.
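A toy version of the greedy bundling, assuming a dense numpy matrix and using non-zero counts as the sort key (a common simplification of the graph degree); `bundle_features` is a hypothetical helper, not LightGBM's API:

```python
import numpy as np

def bundle_features(X, max_conflict=0):
    """Greedy Exclusive Feature Bundling (sketch).

    Two features 'conflict' on a row where both are non-zero. Each feature is
    assigned to the first bundle where adding it keeps the conflict count
    within `max_conflict`; otherwise it starts a new bundle."""
    n_rows, n_feats = X.shape
    nonzero = [set(np.flatnonzero(X[:, j])) for j in range(n_feats)]
    # sort features by their number of non-zero rows, descending
    order = sorted(range(n_feats), key=lambda j: len(nonzero[j]), reverse=True)
    bundles, bundle_rows = [], []
    for j in order:
        for k, rows in enumerate(bundle_rows):
            if len(rows & nonzero[j]) <= max_conflict:
                bundles[k].append(j)       # fits in an existing bundle
                rows |= nonzero[j]
                break
        else:
            bundles.append([j])            # start a new bundle
            bundle_rows.append(set(nonzero[j]))
    return bundles
```

On a one-hot encoded block, where columns are mutually exclusive by construction, this collapses all the columns into a single bundle.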

The amount of time it takes to build a tree is proportional to the number of splits that have to be evaluated. And when you have continuous features or categorical features with high cardinality, this time increases drastically. But most of the splits that can be made for a feature offer only minuscule changes in performance. This is why a histogram-based method is applied to tree building.

The core idea is to group feature values into a set of bins and perform splits based on those bins. This reduces the time complexity from *O(#data)* to *O(#bins)*.

In another innovation, similar to XGBoost, LightGBM ignores zero feature values while creating the histograms. This reduces the cost of building the histogram from *O(#data)* to *O(#non-zero data)*.
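Here is a rough numpy sketch of histogram-based split finding, using the gradient-statistics gain formula common to modern GBM implementations; `histogram_best_split` is an illustrative helper, not library code:

```python
import numpy as np

def histogram_best_split(x, grad, hess, n_bins=16, lam=1.0):
    """Histogram-based split finding (sketch).

    Buckets a feature into `n_bins` bins, accumulates gradient/hessian sums
    per bin, then scans bin boundaries instead of raw values: O(#bins)
    candidate splits instead of O(#data)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)
    G, H = g_hist.sum(), h_hist.sum()
    best_gain, best_bin = 0.0, None
    gl = hl = 0.0
    for b in range(n_bins - 1):          # candidate split after bin b
        gl += g_hist[b]; hl += h_hist[b]
        gr, hr = G - gl, H - hl
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin
```

Note that building the two histograms is a single pass over the data; only the cheap boundary scan depends on the number of bins.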

In many real-world datasets, categorical features have a big presence, and it becomes essential to deal with them appropriately. The most common approach is to represent a categorical feature by its one-hot encoding, but this is sub-optimal for tree learners. If you have high-cardinality categorical features, your tree needs to be very deep to achieve accuracy.

LightGBM takes in a list of categorical features as an input so that it can deal with them more optimally. It takes inspiration from “On Grouping for Maximum Homogeneity” by Fisher, Walter D. and uses the following methodology for finding the best split for categorical features.

- Sort the histogram on accumulated gradient statistics
- Find the best split on the sorted histogram
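The two steps above can be sketched as follows; `best_categorical_split` is a hypothetical illustration of the idea, not LightGBM's internal routine:

```python
import numpy as np

def best_categorical_split(cats, grad, hess, lam=1.0):
    """Categorical split a la Fisher (sketch): sort categories by their
    accumulated gradient statistic sum(grad)/sum(hess), then scan prefix
    subsets of that ordering instead of all 2^k partitions."""
    uniq = np.unique(cats)
    g = np.array([grad[cats == c].sum() for c in uniq])
    h = np.array([hess[cats == c].sum() for c in uniq])
    order = np.argsort(g / h)            # sorted "histogram" of categories
    G, H = g.sum(), h.sum()
    best_gain, best_set = 0.0, None
    gl = hl = 0.0
    for i in range(len(uniq) - 1):       # split after position i of the ordering
        gl += g[order[i]]; hl += h[order[i]]
        gain = gl**2 / (hl + lam) + (G - gl)**2 / (H - hl + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_set = gain, set(uniq[order[: i + 1]])
    return best_gain, best_set
```

Sorting first means only *k − 1* candidate partitions need to be scanned for *k* categories, instead of the exponential number of arbitrary subsets.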

There are a few hyperparameters which help you tune the way the categorical features are dealt with [4]:

- `cat_l2`, default = `10.0`, type = double, constraints: `cat_l2 >= 0.0` — L2 regularization in categorical split
- `cat_smooth`, default = `10.0`, type = double, constraints: `cat_smooth >= 0.0` — reduces the effect of noise in categorical features, especially for categories with little data
- `max_cat_to_onehot`, default = `4`, type = int, constraints: `max_cat_to_onehot > 0` — when the number of categories of a feature is smaller than or equal to `max_cat_to_onehot`, the one-vs-other split algorithm is used
The majority of the incremental performance improvements were made through GOSS and EFB.

xgb_exa is the original XGBoost, xgb_his is the histogram-based version (which came out later), lgb_baseline is LightGBM without EFB and GOSS, and LightGBM is the version with EFB and GOSS. It is quite evident that the improvement from GOSS and EFB is phenomenal compared to lgb_baseline.

The rest of the improvements in performance is derived from the ability to parallelize the learning. There are two main ways of parallelizing the learning process:

Feature Parallel tries to parallelize the “find the best split” part in a distributed manner. Evaluating different splits is done in parallel across multiple workers, and then they communicate with each other to decide among themselves who has the best split.

Data Parallel tries to parallelize the whole decision learning. In this, we typically split the data and send different parts of the data to different workers who calculate the histograms based on the section of the data they receive. Then they communicate to merge the histogram at a global level and this global level histogram is what is used in the tree learning process.

Voting Parallel is a special case of Data Parallel, where the communication cost in Data Parallel is capped to a constant.

LightGBM is one of those algorithms which has a lot, and I mean a lot, of hyperparameters. It is so flexible that it is intimidating for the beginner. But there is a way to use the algorithm and still not tune like 80% of those parameters. Let’s look at a few parameters that you can start tuning and then build up confidence and start tweaking the rest.

- `objective`, default = `regression`, type = enum, options: `regression`, `regression_l1`, `huber`, `fair`, `poisson`, `quantile`, `mape`, `gamma`, `tweedie`, `binary`, `multiclass`, `multiclassova`, `cross_entropy`, `cross_entropy_lambda`, `lambdarank`, `rank_xendcg`; aliases: `objective_type`, `app`, `application`
- `boosting`, default = `gbdt`, type = enum, options: `gbdt`, `rf`, `dart`, `goss`; aliases: `boosting_type`, `boost`
    - `gbdt`: traditional Gradient Boosting Decision Tree, alias: `gbrt`
    - `rf`: Random Forest, alias: `random_forest`
    - `dart`: Dropouts meet Multiple Additive Regression Trees
    - `goss`: Gradient-based One-Side Sampling
- `learning_rate`, default = `0.1`, type = double, aliases: `shrinkage_rate`, `eta`; constraints: `learning_rate > 0.0` — the shrinkage rate; in `dart`, it also affects the normalization weights of dropped trees
- `num_leaves`, default = `31`, type = int, aliases: `num_leaf`, `max_leaves`, `max_leaf`; constraints: `1 < num_leaves <= 131072` — max number of leaves in one tree
- `max_depth`, default = `-1`, type = int — limits the max depth of the tree model. This is used to deal with over-fitting when `#data` is small; the tree still grows leaf-wise. `<= 0` means no limit
- `min_data_in_leaf`, default = `20`, type = int, aliases: `min_data_per_leaf`, `min_data`, `min_child_samples`; constraints: `min_data_in_leaf >= 0` — minimal number of data points in one leaf. Can be used to deal with over-fitting
- `min_sum_hessian_in_leaf`, default = `1e-3`, type = double, aliases: `min_sum_hessian_per_leaf`, `min_sum_hessian`, `min_hessian`, `min_child_weight`; constraints: `min_sum_hessian_in_leaf >= 0.0` — minimal sum of the hessian in one leaf. Like `min_data_in_leaf`, it can be used to deal with over-fitting
- `lambda_l1`, default = `0.0`, type = double, alias: `reg_alpha`; constraints: `lambda_l1 >= 0.0` — L1 regularization
- `lambda_l2`, default = `0.0`, type = double, aliases: `reg_lambda`, `lambda`; constraints: `lambda_l2 >= 0.0` — L2 regularization
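As a starting point, here is a hedged sketch of what a minimal parameter set might look like. The values are illustrative defaults to tune from, not recommendations for any particular dataset; the dictionary can be passed to LightGBM's scikit-learn API (e.g. `lightgbm.LGBMRegressor(**params)`).

```python
# Illustrative starter configuration; the names match the parameter list above.
params = {
    "objective": "regression",
    "boosting": "gbdt",
    "learning_rate": 0.05,   # below the 0.1 default; pair with more trees
    "num_leaves": 31,        # the main complexity control for leaf-wise growth
    "max_depth": -1,         # no depth limit; rely on num_leaves instead
    "min_data_in_leaf": 20,
    "lambda_l1": 0.0,
    "lambda_l2": 0.0,
}
# Rule of thumb: if you do cap max_depth, keep num_leaves well below
# 2 ** max_depth, otherwise a leaf-wise tree can become far more complex
# than a level-wise tree of the same depth.
```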

In the next part of our series, let’s look at the one who tread a path less taken – **CatBoost**

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI(A) – Natural Gradient
- Part VI(B) – NGBoost
- Part VII – The Battle of the Boosters

- Friedman, Jerome H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 (2001), no. 5, 1189–1232.
- Ke, Guolin et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, pages 3149-3157
- Walter D. Fisher. “On Grouping for Maximum Homogeneity.” Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
- LightGBM Parameters. https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst#core-parameters

There were a few key innovations that made XGBoost so effective:

Similar to Regularized Greedy Forest, XGBoost also has an explicit regularization term in the objective function.

The regularization term is $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$, where $\gamma T$ penalizes *T*, the number of leaves in the tree, and $\frac{1}{2}\lambda \lVert w \rVert^2$ penalizes *w*, the weights of the different leaves.

This is a much simpler regularization term than some of the ways we saw in Regularized Greedy Forest.

One of the key ingredients of Gradient Boosting algorithms is the gradients or derivatives of the objective function. And all the implementations that we saw earlier used pre-calculated gradient formulae for specific loss functions, thereby, restricting the objectives which can be used in the algorithm to a set which is already implemented in the library.

XGBoost uses the Newton-Raphson method we discussed in a previous part of the series to approximate the loss function.

Now, the complex recursive function made up of tree structures can be approximated using Taylor’s approximation into a differentiable form. In the case of Gradient Boosting, we take the second order approximation, meaning we use two terms – first order derivative and second order derivative- to approximate the function.

Let $g_i = \partial_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}^{(t-1)})$ be the first and second order derivatives of the loss.

Approximated loss function:

$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$

The first term, the loss, is constant at tree building stage *t*, and because of that it doesn’t add any value to the optimization objective. So removing it and simplifying we get:

$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$

Substituting the Ω with the regularization term, we get:

$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$

The $f_t(x)$ we are talking about is essentially a tree with leaf weights *w*. So if we define $I_j = \{ i \mid q(x_i) = j \}$ as the instance set in leaf *j*, we can substitute the tree structure directly into the equation and simplify as:

$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \tfrac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$

Setting the derivative of this equation with respect to $w_j$ to zero, we can find the optimum value for $w_j$ as:

$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$

Putting this back into the loss function and simplifying we get:

$\tilde{\mathcal{L}}^{(t)}(q) = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$

What all of this enables us to do is to separate out the objective function from the core working of the algorithm. And by adopting this formulation, the only requirement from an objective function/loss function is that it needs to be differentiable. To be very specific, the loss function should return the first and second order derivatives.
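To make that concrete, here is a sketch of a custom objective in the (grad, hess) form, together with the resulting optimal leaf weight; the function names are mine for illustration, though XGBoost's scikit-learn API accepts custom objectives of this general shape:

```python
import numpy as np

def squared_error_objective(preds, labels):
    """A custom objective in the (grad, hess) form the algorithm needs.

    For l(y, p) = 0.5 * (p - y)^2:
        g_i = p_i - y_i   (first derivative w.r.t. the prediction)
        h_i = 1           (second derivative)
    """
    grad = preds - labels
    hess = np.ones_like(preds)
    return grad, hess

def leaf_weight(grad, hess, lam=1.0):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda) for one leaf."""
    return -grad.sum() / (hess.sum() + lam)
```

Swapping in any other twice-differentiable loss only requires changing the grad/hess formulas; the tree-building machinery stays untouched.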

See here for a list of all the objective functions that are pre-built into XGBoost.

The tree building strategy lies somewhat in between classical Gradient Boosting and regularized Greedy Forests. While the classical Gradient Boosting takes the tree at each stage as a black box, Regularized Greedy Forest operates at a leaf level by updating any part of the forest at each step. XGBoost takes a middle ground and considers a tree that is made in a previous iteration sacrosanct, but while making a tree for any iteration, it doesn’t use the standard impurity measures, but the gradient based Loss function we derived in the last section in the tree building process. While in classic Gradient Boosting the optimization of the loss function happens after the tree is built, XGBoost gets that optimization during the tree building process as well.

Normally it’s impossible to enumerate all possible tree structures. Therefore, a greedy algorithm is used that starts from a single leaf and iteratively adds branches to the tree based on the simplified loss function.

One of the key problems in tree learning is finding the best split. Usually, we have to enumerate all the possible splits over all the features and then use the impurity criteria to choose the best one. This is called the *exact greedy algorithm*. It is computationally demanding, especially for continuous and high-cardinality categorical features, and it is also not feasible when the data doesn’t fit into memory.

To overcome these inefficiencies, the paper proposes an approximate algorithm. It first proposes candidate splitting points according to the percentiles of the features. On a high level, the algorithm is:

- Propose candidate splitting points for all the features using the percentiles of the features
- Map the continuous features to buckets split by the candidate splitting points
- Aggregate the gradient statistics for all the features based on the candidate splitting points
- Find the best solution among the proposals based on the aggregated statistics
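The first three steps can be sketched in numpy (the names `propose_candidates` and `aggregate_stats` are mine); the best split is then found by scanning the bucketed statistics, exactly as in the exact greedy algorithm but over far fewer candidates:

```python
import numpy as np

def propose_candidates(x, n_quantiles=4):
    """Step 1: candidate split points from feature percentiles."""
    qs = np.linspace(0, 100, n_quantiles + 1)[1:-1]
    return np.percentile(x, qs)

def aggregate_stats(x, grad, hess, candidates):
    """Steps 2-3: bucket instances by the candidate points and aggregate
    gradient/hessian sums per bucket."""
    buckets = np.searchsorted(candidates, x)
    n = len(candidates) + 1
    g = np.bincount(buckets, weights=grad, minlength=n)
    h = np.bincount(buckets, weights=hess, minlength=n)
    return g, h
```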

One of the important steps in the algorithm discussed above is the proposal of candidate splits. Usually, percentiles of a feature are used to make the candidates distribute evenly over the data. A set of algorithms which does that in a distributed manner, with speed and efficiency, is called quantile sketching algorithms. But here the problem is slightly more complicated, because the need is for a weighted quantile sketching algorithm which weighs the instances based on the gradient (so that we can learn the most from instances with the most error). So the authors proposed a new data structure which has a provable theoretical guarantee.

This is another key innovation in XGBoost, and it came from the realization that real-world datasets are sparse. This sparsity can come from multiple causes:

- Presence of missing values in the data
- Frequent zero values
- Artifacts of feature engineering, such as One-Hot Encoding

And for this reason, the authors decided to make the algorithm aware of the sparsity so that it can be dealt with intelligently. The way they did it is deceptively simple.

They gave each tree node a default direction: if a value is missing or zero, that instance flows down the default direction at that branch. And the optimal default direction is learned from the data.

This innovation has a two-fold benefit:

- it helps group the missing values and learn an optimal way of handling them
- it boosts performance by making sure no computing power is wasted trying to find gradient statistics for missing values.
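A toy illustration of learning the default direction (the helper names are mine; real XGBoost does this inside its split-finding loop):

```python
import numpy as np

def split_gain(gl, hl, gr, hr, lam=1.0):
    """Gain of a split from left/right gradient and hessian sums."""
    return gl**2 / (hl + lam) + gr**2 / (hr + lam) - (gl + gr)**2 / (hl + hr + lam)

def default_direction(x, grad, hess, threshold, lam=1.0):
    """Learn the default direction for missing values (NaN here) at one split.

    Only non-missing values are partitioned by the threshold; the missing
    instances are tried on each side in turn, and the side that yields the
    higher gain becomes the learned default direction."""
    present = ~np.isnan(x)
    go_left = np.zeros(len(x), dtype=bool)
    go_left[present] = x[present] < threshold
    go_right = present & ~go_left
    gl, hl = grad[go_left].sum(), hess[go_left].sum()
    gr, hr = grad[go_right].sum(), hess[go_right].sum()
    gm, hm = grad[~present].sum(), hess[~present].sum()   # missing-value stats
    gain_left = split_gain(gl + gm, hl + hm, gr, hr, lam)
    gain_right = split_gain(gl, hl, gr + gm, hr + hm, lam)
    return "left" if gain_left >= gain_right else "right"
```

Note that only the non-missing values are ever scanned, which is where the performance benefit comes from.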

One of the main drawbacks of all the implementations of the Gradient Boosting algorithm was that they were quite slow. While forest creation in a Random Forest is parallel out of the box, Gradient Boosting is a sequential process which builds new base learners on top of old ones. One of XGBoost’s claims to fame was how blazingly fast it was: at least 10 times faster than the existing implementations, and able to work with large datasets because of its out-of-core learning capabilities. The key innovations in performance improvement were:

The most time-consuming part of tree learning is getting the data sorted. The authors of the paper proposed to store the data in in-memory units called blocks. Data in each block is stored in the compressed column (CSC) format, with each column sorted by feature value. This input data layout is computed once before training and reused thereafter. By handling the data in this sorted format, the tree split algorithm is reduced to a linear scan over the sorted columns.

While the proposed block structure helps optimize the computation complexity of split finding, it requires indirect fetches of gradient statistics by row index. To get over the slow write and read operations in the process, the authors implemented an internal buffer for each thread and accumulate the gradient statistics in minibatches. This helps in reducing the runtime overhead when the rows are large.

One of the goals of the algorithm is to fully utilize the machine’s resources. While the CPUs are utilized by parallelization of the process, the available disk space is utilized by the out-of-core learning. These blocks that we saw earlier are stored on disk and a separate prefetch thread keeps fetching the data into memory just in time for the computation to continue. They use two techniques to make the I/O operations from disk faster –

- Block Compression, where each block is compressed before storage and uncompressed on the fly while reading.
- Block Sharding, where a block is broken down into multiple pieces and stored across multiple disks(if any). And the prefetcher reads the block by alternating between the two disks, thereby increasing throughput of the disk reading.

XGBoost has so many articles and blogs about it covering the hyperparameters and how to tune them that I’m not even going to attempt it.

The single source of truth for any hyperparameter is the official documentation. It might be intimidating to look at the long list of hyperparameters there, but you won’t end up touching the majority of them in a normal use case.

The major ones are:

**eta [default=0.3, alias: learning_rate]** - Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.

- range: [0,1]

**gamma [default=0, alias: min_split_loss]**- Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.

- range: [0,∞]

**max_depth [default=6]**- Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in lossguided growing policy when tree_method is set as hist and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree.

- range: [0,∞] (0 is only accepted in lossguided growing policy when tree_method is set as hist)

**min_child_weight [default=1]**- Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression task, this simply corresponds to minimum number of instances needed to be in each node. The larger min_child_weight is, the more conservative the algorithm will be.

- range: [0,∞]

**subsample [default=1]** - Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, and this will prevent overfitting. Subsampling will occur once in every boosting iteration.

- range: (0,1]

**colsample_bytree, colsample_bylevel, colsample_bynode [default=1]**- This is a family of parameters for subsampling of columns.

- All colsample_by* parameters have a range of (0, 1], the default value of 1, and specify the fraction of columns to be subsampled.

- colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.

- colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.

- colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.

- colsample_by* parameters work cumulatively. For instance, the combination {‘colsample_bytree’:0.5, ‘colsample_bylevel’:0.5, ‘colsample_bynode’:0.5} with 64 features will leave 8 features to choose from at each split.

**lambda [default=1, alias: reg_lambda]**- L2 regularization term on weights. Increasing this value will make model more conservative.

**alpha [default=0, alias: reg_alpha]**- L1 regularization term on weights. Increasing this value will make model more conservative.

After publishing this, I came to realize I hadn’t talked about some of the later developments in XGBoost, like leaf-wise tree growth and how tuning the parameters is slightly different for the new, faster implementation.

LightGBM, which we will be talking about in the next blog in the series, implemented leaf-wise tree growth, and that led to a huge performance improvement. XGBoost played catch-up and implemented the leaf-wise strategy in its histogram-based tree splitting method.

The leaf-wise growth policy, while faster, also overfits faster if the data is small. Therefore it is quite important to use regularization or cap tree complexity through hyperparameters. But if we just keep tuning *max_depth* as before to control complexity, it won’t work; *num_leaves* (which controls the number of leaves in a tree) also needs to be tuned together with it. This is because, at the same depth, a leaf-wise growing algorithm can make a much more complex tree.

You can enable leaf-wise tree building in XGBoost by setting the *tree_method* parameter to “hist” and the *grow_policy* parameter to “lossguide”.

In the next part of our series, let’s look at the one who challenged the king – **LightGBM**

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI – NGBoost
- Part VII – The Battle of the Boosters

- Friedman, Jerome H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 (2001), no. 5, 1189–1232.
- Chen, Tianqi & Guestrin, Carlos. (2016). XGBoost: A Scalable Tree Boosting System. 785-794. 10.1145/2939672.2939785.

The implementation(both the original and a faster multi-core version) is on the Github Page : https://github.com/RGF-team/rgf

The key modifications to the core GBDT algorithm they suggested are as follows:

According to Friedman [1], one of the disadvantages of standard Gradient Boosting is that the shrinkage/learning rate needs to be small to achieve convergence; in fact, he argued for an infinitesimal step size. The authors suggested a modification which makes the shrinkage parameter unnecessary.

In standard Gradient Boosting, the algorithm does a partial corrective step in each iteration: it only optimizes the base learner of the current iteration and ignores all the previous ones. It creates the best regression tree for the current step and adds it to the ensemble. The authors proposed instead that at every iteration we update the whole forest (*m* base learners at iteration *m*) and readjust the scaling factors at each iteration.

While the fully-corrective greedy update means that the algorithm converges faster, it also means it overfits faster. Therefore, Structured Sparsity Regularization was adopted to combat this problem. The general idea of structured sparsity is that, in a situation where a sparse solution is assumed, one can take advantage of the sparsity structure underlying the task. In this specific setting, it was implemented as a sparse search of decision rules in the forest structure.

In addition to the Structured Sparsity Regularization, they also included an explicit Regularization term to the loss function.

where *l* is the differentiable convex loss function, Ω is the regularisation term penalising the complexity of the tree structure, and *F* is the forest structure.

The paper introduces three types of Regularization options:

where *F* is the forest structure, *λ* is the constant for controlling the strength of regularization, *α_v* is the weight of node *v* (which is restricted here to leaf nodes), *L(T)* is the set of leaves of the tree *T*, and the sum runs over all trees *T* in the forest.

hyperparameter in implementation: *l2*

The minimum penalty regularization penalises the depth of the trees. This is a regularization that acts on all nodes and not just the leaves of the trees. This uses the principle that any leaf node can be written in terms of its ancestor nodes. The intuition behind the regularization is that it penalises depth, which is conceptually a complex decision rule.

The exact formula is beyond our scope, but the key hyperparameters there are *l2*, which governs the overall strength of regularization, and *reg_depth*, which controls how severely you penalise the depth of a tree. Suggested values for *l2* are 1, 0.1, and 0.01, and *reg_depth* should be a value greater than 1.

This is very similar to Minimum-penalty regularization, but with an added condition that the weights of sibling nodes should sum to zero. The intuition behind the sum-to-zero constraint is that less redundant models are preferable and that the models are least redundant when branches at internal nodes lead to completely opposite actions, like adding ‘x’ to versus subtracting ‘x’ from the output value. So this penalises two things- depth of the tree and redundancy of the tree. There are no additional hyperparameters here.

**Note:** An interesting tidbit to note here is that all the benchmarks in the paper and the competitions used only simple L2 regularization.

The general concept is still similar to Gradient Boosting, but the key differences are in the tree updates in each iteration. Also, the easy and convenient closed-form leaf values (such as the mean or median of the gradients) no longer work, because of the regularization term in the objective function.

Let’s look at the new algorithm, albeit at a high level.

- repeat
- Find *F\**, the optimum forest that minimizes the loss among all the forests that can be obtained by applying one step of the structure-changing operation to the current forest
- Optimize the leaf weights in *F\** to minimize the loss
- until some exit criterion is met
- return the forest

There is one key difference in the way the trees of the forest are built in Regularized Greedy Forest. In classical Gradient Boosting, at each stage a new tree is built with a specific stopping criterion, like depth or number of leaves. Once you pass a stage, you do not touch that tree or the weights associated with it. On the contrary, in RGF, at any iteration *m*, the option to update any of the previously created *m−1* trees or to start a new tree is open. The exact update to the forest is determined by the action which minimizes the loss function.

Let’s look at an example to understand the fundamental difference in Tree Building between Gradient Boosting and Regularized Greedy Forest

Standard Gradient Boosting builds successive trees and sum those up into an additive function which approximates the desired function.

RGF takes a slightly different route. For each step change in the structure of the tree, it evaluates the possibility of growing a new leaf in an existing tree vs starting a new tree with the help of the loss function, and then takes the greedy approach of taking the route with least loss. So in the diagram above, we can choose to grow leaves in T1, T2, or T3, or we can start a new tree T4 depending on which one gives you the most reduction in loss.

But practically, it is computationally challenging to do this, as the number of possible splits to evaluate grows exponentially as we move deeper and deeper into the forest. So there is a hyperparameter in the implementation, *n_tree_search*, which restricts the retrospective update to only that many of the latest trees. The default value is 1, so that the update always looks at one previously created tree. In our example, this reduces the possibilities to growing leaves in T3 or growing a new tree T4.

Conceptually, this makes the model an additive function over the leaves of the forest rather than an additive function over the trees of a forest. Consequently, there is no max_depth parameter in RGF, as the depth of a tree is automatically restrained by the incremental updates to the tree structure.

The next step is to set the weight of the new leaf that was selected as the best structural change. This is an optimization problem and can be solved with any of several methods, like Gradient Descent or the Newton-Raphson method. Since the optimization we are looking at here is simple, the paper uses a Newton step, which is much more accurate than a gradient step, to get an approximately optimal weight for the new leaf. Refer to Appendix A if you are interested in how and why we need a Newton step to optimize such functions.

With the base learner or basis function fixed, we need to optimize the weights of all the leaves in the forest. This is again an optimization problem, and it is solved using Coordinate Descent, which iteratively goes through each of the leaves and updates its weight with a Newton step of small step size.

Since the weights set when a leaf is created are approximately optimal, we do not need to re-optimize all the weights every iteration; it would be computationally expensive if we did. The interval between these full weight optimizations is another hyperparameter in the implementation, called *opt_interval*. Empirically, it was observed that unless *opt_interval* takes an extreme value, the choice is not critical. For all the competitions they won, the authors simply set it to 100.

Below is a list of the key hyperparameters that the authors of the paper suggest. It is almost directly taken from their Github page, but adapted to the Python wrapper.

**Parameters to control training**

- `algorithm=`: `RGF` | `RGF_Opt` | `RGF_Sib` (Default: `RGF`)
    - `RGF`: RGF with regularization on leaf-only models
    - `RGF_Opt`: RGF with min-penalty regularization
    - `RGF_Sib`: RGF with min-penalty regularization with the sum-to-zero sibling constraints
- `loss=`: Loss function. `LS` | `Expo` | `Log` (Default: `LS`)
    - `LS`: square loss; `Expo`: exponential loss; `Log`: logistic loss
- `max_leaf=`: Training will be terminated when the number of leaf nodes in the forest reaches this value. It should be large enough that a good model can be obtained at some point of training, whereas a smaller value makes training time shorter. Appropriate values are data-dependent and in [2] varied from 1000 to 10000. (Default: 10000)
- `normalize=`: If turned on, training targets are normalized so that the average becomes zero. It was turned on in all the regression experiments in [2].
- `l2=`: Used to control the degree of regularization. Crucial for good performance. Appropriate values are data-dependent. Either 1.0, 0.1, or 0.01 often produces good results, though with exponential loss (`loss=Expo`) and logistic loss (`loss=Log`) some data requires smaller values such as 1e-10 or 1e-20.
- `sl2=`: Overrides the `l2` regularization parameter for the process of growing the forest. That is, if specified, the weight correction process uses `l2` and the forest growing process uses `sl2`. If omitted, no override takes place and `l2` is used throughout training. On some data, this override works well.
- `reg_depth=`: Must be no smaller than 1. Meant for use with `algorithm=RGF_Opt|RGF_Sib`. A larger value penalizes deeper nodes more severely. (Default: 1)
- `test_interval=`: Test interval in terms of the number of leaf nodes. For example, if `test_interval=500`, every time 500 leaf nodes are newly added to the forest, end of training is simulated and the model is tested or saved for later testing. For efficiency, it must be either a multiple or a divisor of the optimization interval (`opt_interval`: default 100). If not, it may be changed by the system automatically. (Default: 500)
- `n_tree_search=`: Number of trees to be searched for the nodes to split. The most recently-grown trees are searched first. (Default: 1)

The Youtube channel 3Blue1Brown (which I strongly recommend if you want fundamental intuitions about math) has yet another brilliant video explaining Taylor expansions/approximations. Be sure to check out at least the first 6 minutes of the video.

Taylor’s approximation lets us approximate a function close to a point by using the derivatives of that function:

$f(x) \approx f(a) + f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2 + \cdots$

Suppose we are taking a second order approximation and finding a local minimum; we can do that by setting the derivative to zero:

$\frac{d}{dx}\left[ f(a) + f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2 \right] = f'(a) + f''(a)(x-a)$

Setting it to zero, we get:

$(x-a) = -\frac{f'(a)}{f''(a)}$

This (x − a) is the optimum step to minimize the approximation at that point. So this minimum is more like the **step direction** towards the minimum than the actual minimum.

Because a general function is not exactly quadratic, we need to take multiple steps in the step direction until we are relatively satisfied with the loss, or technically until the loss is below our tolerance. This is called the **Newton-Raphson method** of optimization.
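The iteration above can be sketched in a few lines of Python. This is a minimal sketch for a one-dimensional, twice-differentiable function; `newton_minimize` and the quadratic test function are my illustrative choices, not anything from the post:

```python
def newton_minimize(df, d2f, x, tol=1e-8, max_iter=100):
    """Repeatedly take the Newton step -f'(x)/f''(x) until the gradient is tiny."""
    for _ in range(max_iter):
        x = x - df(x) / d2f(x)   # the optimum (x - a) step derived above
        if abs(df(x)) < tol:     # stop once the function is locally flat
            break
    return x

# Minimize f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2.
# For a quadratic, the very first Newton step lands exactly on the minimum, x = 3.
x_min = newton_minimize(df=lambda x: 2 * (x - 3), d2f=lambda x: 2.0, x=0.0)
```

For any non-quadratic function the same loop simply takes several steps, which is exactly the iterative behaviour described above.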

In the next part of the series, let’s take a look at the giant – **XGBoost**.

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI – NGBoost
- Part VII – The Battle of the Boosters

Stay tuned!

- Johnson, Rie & Zhang, Tong, “Learning Nonlinear Functions Using Regularized Greedy Forest”, *IEEE Transactions on Pattern Analysis and Machine Intelligence*, Volume 36, Issue 5, May 2014.

Today, I am starting a new blog series on Gradient Boosting Machines and all their cousins. The outline for the blog series is as follows (this will be updated with links as and when each part gets published):

- Part I – Gradient Boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI – NGBoost
- Part VII – The Battle of the Boosters

In the first part, let’s understand the classic Gradient Boosting methodology put forth by Friedman. Even though this is math heavy, it’s not that difficult. And wherever possible, I have tried to provide intuition to what’s happening.

Let there be a dataset *D* with *n* samples. Each sample has a set of *m* features in a vector *x*, and a real-valued target, *y*. Formally, it is written as

$$ D = \{(x_i, y_i)\} \quad \left(|D| = n,\; x_i \in \mathbb{R}^m,\; y_i \in \mathbb{R}\right) $$

Now, Gradient Boosting algorithms are an ensemble method which takes an additive form. The intuition is that a complex function, which we are trying to estimate, can be made up of smaller and simpler functions, added together.

Suppose the function we are trying to approximate is

We can break this function as :

This is the assumption we are taking when we choose an additive ensemble model, and the Tree ensemble that we usually talk about when talking about Gradient Boosting can be written as below:

$$ \hat{y}_i = \sum_{m=1}^{M} f_m(x_i), \quad f_m \in \mathcal{F} $$

where M is the number of base learners and F is the space of regression trees.

The objective to minimize is

$$ L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) $$

where *l* is the differentiable convex loss function

Since we are looking at an additive functional form for *F*, we can replace $\hat{y}_i$ with the prediction from the previous iteration plus the new learner:

$$ \hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + f_m(x_i) $$

So, the loss function will become:

$$ L = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(m-1)} + f_m(x_i)\right) $$

- Initialize the model with a constant value by minimizing the loss function
  $$ F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} l(y_i, \gamma) $$
- $F_0(x)$ is the prediction of the model which minimizes the loss function at the 0th iteration
- For the Squared Error loss, it works out to be the average over all training samples
- For the Least Absolute Deviation loss, it works out to be the median over all training samples

- For m = 1 to M:
  - Compute the pseudo residuals
    $$ r_{im} = -\left[\frac{\partial\, l(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} $$
  - $r_{im}$ is nothing but the negative of the derivative of the loss function (between the true value and the output from the last iteration) w.r.t. *F(x)* from the last iteration
  - For the Squared Error loss, this works out to be the residual *(Observed – Predicted)*
  - It is also called the pseudo residual because it acts like the residual, and it is exactly the residual for the Squared Error loss function
  - We calculate $r_{im}$ for all the *n* samples
- Fit a Regression Tree to the $r_{im}$ values using the usual variance-reduction (squared error) splitting criterion
  - Each leaf of the tree is denoted by $R_{jm}$ for $j = 1 \ldots J_m$, where $J_m$ is the number of leaves in the tree created in iteration *m*
- For $j = 1 \ldots J_m$, compute
  $$ \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} l\big(y_i,\, F_{m-1}(x_i) + \gamma\big) $$
  - $\gamma_{jm}$ is the basis function or the least-squares coefficient. This conveniently works out to be the average over all the samples in a leaf for the Squared Error loss and the median for the Least Absolute Deviation loss
  - $\rho$ is the scaling factor or leaf weight.
  - The inner summation over the leaves can be ignored because of the disjoint nature of the Regression Trees: one particular sample will only be in one of those leaves. So the equation simplifies to the per-leaf minimization above.
  - So, for each of the leaves $R_{jm}$, we calculate the optimum value $\gamma_{jm}$ which, when added to the prediction of the last iteration, minimizes the loss for the samples that reside in the leaf
  - For known loss functions like the Squared Error loss and the Least Absolute Deviation loss, the scaling factor $\rho$ is 1. And because of that, the standard GBM implementation ignores the scaling factor.
  - For some losses like the Huber loss, $\rho$ is estimated using a line search to find the minimum loss.

- Update
  $$ F_m(x) = F_{m-1}(x) + \eta \sum_{j=1}^{J_m} \gamma_{jm}\, \mathbb{1}(x \in R_{jm}) $$
  - Now we add the latest, optimized tree to the result from the previous iteration.
  - $\eta$ is the shrinkage or learning rate
  - The summation in the equation is only useful in the off chance that a particular sample ends up in multiple leaves. Otherwise, it’s just the optimized regression tree score

- Output the final model $\hat{F}(x) = F_M(x)$
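The steps above can be sketched from scratch in a few lines. This is a minimal sketch for the Squared Error loss, using scikit-learn’s DecisionTreeRegressor as the weak learner; the helper names (`gbm_fit`, `gbm_predict`), the hyperparameters, and the toy data are my illustrative choices, not any library’s API:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, M=50, lr=0.1, max_depth=2):
    """Gradient boosting for the Squared Error loss: fit each tree to the residuals."""
    F0 = y.mean()                      # step 1: constant model (the mean for squared error)
    F = np.full(len(y), F0)
    trees = []
    for _ in range(M):                 # step 2: for m = 1 to M
        r = y - F                      # pseudo residuals (= plain residuals for squared error)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        F = F + lr * tree.predict(X)   # update: each leaf mean is the optimal gamma_jm here
        trees.append(tree)
    return F0, trees

def gbm_predict(X, F0, trees, lr=0.1):
    return F0 + lr * sum(tree.predict(X) for tree in trees)

# toy data: a noisy sine wave
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F0, trees = gbm_fit(X, y)
mse = np.mean((gbm_predict(X, F0, trees) - y) ** 2)   # well below the variance of y
```

Because the loss is Squared Error, the tree’s own leaf averages already solve the per-leaf minimization, which is why no explicit line search appears in the loop.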

In the standard implementation (Sci-kit Learn), the regularization term in the objective function is not implemented. The only forms of regularization implemented there are the following:

- **Regularization by Shrinkage** – In the additive formulation, each new weak learner is “shrunk” by a factor $\eta$. This shrinkage is also called the learning rate in some implementations because it resembles the learning rate in neural networks.
- **Row Subsampling** – Each of the candidate trees in the ensemble uses a subset of samples. This has a regularizing effect.
- **Column Subsampling** – Each of the candidate trees in the ensemble uses a subset of features. This also has a regularizing effect and is usually more effective. It also helps in parallelization.

We know that Gradient Boosting is an additive model and can be shown as below:

$$ F(X) = \sum_{m=1}^{M} \eta\, f_m(X) $$

where *F* is the ensemble model, *f* is the weak learner, *η* is the learning rate and *X* is the input vector.

Substituting *F* with its stage-wise expansion, we get the familiar equation,

$$ F_m(X) = F_{m-1}(X) + \eta\, f_m(X) $$

Now since $f_m$ is obtained at each iteration by minimizing the loss function, which is a function of the first and second order gradients (derivatives), we can intuitively think of it as a directional vector pointing towards the steepest descent. Let’s call this directional vector $r_{m-1}$. The subscript is m−1 because the vector has been trained on stage m−1 of the iteration. Or, intuitively, it is the residual: $f_m(X) \approx r_{m-1}$. So the equation now becomes:

$$ F_m(X) = F_{m-1}(X) + \eta\, r_{m-1} $$

Flipping the signs, we get:

$$ F_m(X) = F_{m-1}(X) - \eta\, (-r_{m-1}) $$

where $-r_{m-1}$ is the gradient of the loss w.r.t. the predictions of stage m−1.

Now let’s look at the standard Gradient Descent equation:

$$ w_t = w_{t-1} - \eta\, \nabla_w L $$

We can clearly see the similarity. And this result is what enables us to use any differentiable loss function.

When we train a neural network using Gradient Descent, it tries to find the optimum parameters (weights and biases), *w*, that minimize the loss function. And this is done using the gradients of the loss with respect to the parameters.

But in Gradient Boosting, the gradient only adjusts the way the ensemble is created and not the parameters of the underlying base learner.

While in a neural network the gradient directly gives us the direction vector of the loss function, in Boosting we only get an approximation of that direction vector from the weak learner. Consequently, the loss of a GBM is not guaranteed to reduce monotonically. It is entirely possible that the loss jumps around a bit as the iterations proceed.

GradientBoostingClassifier and GradientBoostingRegressor in Sci-kit Learn are among the earliest implementations in the Python ecosystem. They are straightforward implementations, faithful to the original paper, and follow pretty much the discussion we had till now. They also implement a variety of loss functions for which *Greedy function approximation: A gradient boosting machine*[1] by Friedman had derived algorithms.

- Regression Losses
- ‘ls’ → Least Squares
- ‘lad’ → Least Absolute Deviation
- ‘huber’ → Huber Loss
- ‘quantile’ → Quantile Loss

- Classification Losses
- ‘deviance’ → Logistic Regression loss
- ‘exponential’ → Exponential Loss
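A short usage sketch of the scikit-learn implementation. Note that recent scikit-learn versions renamed the regression loss strings (‘ls’ → ‘squared_error’, ‘lad’ → ‘absolute_error’), so this sketch sticks to the version-safe default squared-error loss; the synthetic dataset and hyperparameters are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# shrinkage (learning_rate) and row subsampling (subsample) are the two
# regularizers discussed above; the default loss is squared error
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                  subsample=0.8, random_state=0)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)   # R^2 on held-out data
```

Setting `subsample` below 1.0 switches the algorithm to stochastic gradient boosting, the row-subsampling regularization described earlier.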

In the next part of the series, let’s take a look at one of the lesser known cousins of Gradient Boosting – **Regularized Greedy Forest**.

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI – NGBoost
- Part VII – The Battle of the Boosters

Stay tuned!

If you are a beginner in Machine Learning, you might not have made an effort to go deep and understand the mathematics behind the “*.fit()*“, but as you mature and stumble across more and more complex problems, it becomes essential to understand the math or at least the intuition behind the maths to effectively apply the right technique at the right place.

When I was starting out, I was also guilty of the same. I would see “Categorical Cross Entropy” as a loss function in a Neural Network and take it for granted – that it is some magical loss function that works with multi-class labels. I would see “entropy” as one of the splitting criteria in Decision Trees and just experiment with it without understanding what it is. But as I matured, I decided to spend more time understanding the basics, and it helped me immensely in getting my intuitions right. This also helped in understanding the different ways the popular Deep Learning Frameworks, PyTorch and Tensorflow, have implemented the different loss functions, and in deciding when to use what.

This blog is me summarising my understanding of the underlying concepts of Information Theory and how the implementations differ across the different Deep Learning Frameworks. Each of the concepts that I’ve tried to explain starts off with an introduction and a way to reinforce the intuition, and then provide the mathematics associated with it as a bonus. I’ve always found the maths to clear the understanding like nothing else can.

In the mid-20th century, computer scientists and mathematicians around the world were faced with a problem: how to quantify information? Let’s consider the two sentences below:

- It is going to rain tomorrow
- It is going to rain heavily tomorrow

We as humans intuitively understand that the two sentences transmit different amounts of information. The second one is much more informative than the first. But how do you quantify it? How do you express that in the language of mathematics?

Enter Claude E. Shannon with his seminal paper “A Mathematical Theory of Communication”[1]. Shannon introduced the qualitative and quantitative model of communication as a statistical process. Among many other ideas and concepts, he introduced Information Entropy and the concept of the ‘bit’ – the fundamental unit of measurement of information.

Information Theory is quite vast, but I’ll try to summarise key bits of information(pun not intended) in a short glossary.

- *Information* is considered to be a sequence of symbols to be transmitted through a medium, called a *channel*.
- *Entropy* is the amount of uncertainty in a string of symbols given some knowledge of the distribution of symbols.
- *Bit* is a unit of information, or a sequence of symbols.
- To transfer 1 bit of information means reducing the uncertainty of the recipient by a factor of 2.
- For example, in a coin toss, before the coin comes to rest there are two possible outcomes. Once the coin rests and we find that it is heads, the possible outcomes become just 1. So in this situation, the transfer of information by the coin toss is 1 bit.

Let’s setup an example which we will use to make the intuition behind Entropy clear. The ubiquitous “balls in a box” example is as good as any.

There is a box with 100 balls of four different colours – Blue, Red, Green and Yellow. Let us assume there are 25 balls of each colour in the box. A *transmitter* picks up a ball from the container at random and transmits that information to a *receiver*. For our example, let’s assume that the transmitter is in India and the receiver is in Australia. And let’s also assume that we are in the early 20th century, when telegraphy was one of the primary modes of long distance communication. The thing about telegram messages is that they are charged by the word, so you need to be careful about what you are sending if you are on a budget (this might not seem important right now, but I assure you it will be). Adding just one more restriction to the formulation – you cannot send the actual word “blue” through the telegram. You do not have the 26 symbols of the English language, but instead just two symbols – 0 and 1.

Now let’s look at how we will be coding the four responses. I’m sure all of you know the binary numerical system. So if we have four outcomes, we can get unique codes using a code length of two. We use that to assign a code for each of our four outcomes. This is fixed length encoding.

Another way to think about this is in terms of reduction in uncertainty. We know that all the four outcomes are equally likely(each has a probability of 1/4). And when we transmit the information about the colour of the picked ball, we reduce the uncertainty by 4. We know that 1 bit can reduce uncertainty by 2 and to reduce uncertainty by 4 we need two bits.

Mathematically, if we have M equally likely outcomes to encode, we would need $\log_2 M$ bits to encode the information.
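As a quick check of that formula, here is a tiny sketch; the helper name is my own, not any library’s:

```python
import math

def fixed_length_bits(n_outcomes):
    """Bits needed to give each of n equally likely outcomes a unique fixed-length code."""
    return math.ceil(math.log2(n_outcomes))

bits = fixed_length_bits(4)   # four ball colours -> 2-bit codes, as in the example
```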

What would be the average number of bits we would be using to send the information?

What is an average? In the world of probability, the average or mean is the expected value of a probability distribution:

$$ E[L] = \sum_x P(x)\, L(x) $$

where L(x) is the length of the code used for outcome x.

Expected Value can also be thought of this way: If we are to pick a ball at random from the box for 1000 times and record the length of bits that was required to encode that message and take an average of all those 1000 entries, you would get the expected value of the length of bits.

If all outcomes are equally likely, P(x) = 1/N, where N is the number of possible outcomes. And in that case, the expected value becomes a simple mean.

Let’s slightly change the setup of our example. Instead of having an equal number of coloured balls, now we have 25 blue balls, 50 red balls, 13 green balls and 12 yellow balls. This example is much better for explaining the rest of the concepts, and since now you know what an expected value of a probability distribution is, we can follow that convention.

The expected value does not change because no matter which ball you choose, the number of bits you use is still 2.

But, is the current coding scheme the most optimal? This is where Variable Length Coding comes into the picture. Let’s take a look at three different variable coding schemes.

Now that we know how to calculate expected value of the length of bits, let’s calculate it for the three schemes. The one with the lowest length of bits should be the most economical one, right?

We are using 1 bit for Blue and Red and 2 bits each for Green and Yellow.

We are using 1 bit for Blue, 2 bits for Red and 3 bits each for Green and Yellow.

We are using 2 bits for Blue, 1 bit for Red and 3 bits each for Green and Yellow.

Coding Scheme 1 is the obvious choice, right? This is where Variable Length Coding gets tricky. Suppose we pick a ball from the box and it is blue, so we send ‘0’ as a message to Australia. And before Australia gets a chance to read the message, we pick another ball and it is green, so we send ’10’. Now when Australia looks at the message queue, they will see ‘010’ there. If it was a fixed length code, we would know that there is a break every *n* symbols. But in the absence of that, ‘010’ can be interpreted as blue, green or blue, red, blue. That is why a code should be *uniquely decodable*. A code is said to be *uniquely decodable* if two distinct strings of symbols never give rise to the same encoded bit string. This results in a scenario where you have to let go of a few codes for every extra symbol you add. Chris Olah has a great blog[2] explaining the concept.

That leaves us with Coding Schemes 2 and 3. Both of them are *uniquely decodable*. The only difference between them is that in Scheme 2 we are using 1 bit for blue and 2 bits for red, while Scheme 3 is the reverse. And we know that getting a red ball from the box is much more likely than a blue ball. So it makes sense to use the smaller length for the red ball, and that is why the expected value of the length of bits is lower for Scheme 3 than for Scheme 2.

Now you would be wondering why we are so concerned about the expected length of bits. This expected length of bits under the best possible coding scheme is what is called the *Shannon Entropy*, or just *Entropy*.

So what is the best possible code length for a single outcome? The easy answer is $\log_2 \frac{1}{P(x)}$ bits for an outcome with probability P(x).

And extending it to the whole probability distribution *P*, we take the expected value:

$$ H(P) = \sum_x P(x) \log_2 \frac{1}{P(x)} = -\sum_x P(x) \log_2 P(x) $$

In our example, it works out to be:

$$ H = 0.25 \times 2 + 0.5 \times 1 + 0.13 \times \log_2 \tfrac{1}{0.13} + 0.12 \times \log_2 \tfrac{1}{0.12} \approx 1.75 \text{ bits} $$
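That number is easy to verify in code. A small sketch; the `entropy` helper is mine, not a library function:

```python
import math

def entropy(probs):
    """Shannon entropy: the expected code length log2(1/p) under the best scheme."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# 25 blue, 50 red, 13 green and 12 yellow balls
H = entropy([0.25, 0.50, 0.13, 0.12])   # ~1.75 bits
```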

For those who are interested to know how we arrived at the formula, read along.

While the intuition is useful to get the idea, you can’t really scale it. Every time you get a new problem, you can’t sit down with a paper and pen, figure out the best coding scheme and then calculate the entropy. That’s where the maths comes in.

One of the properties of uniquely decodable codes is something called a prefix property. No codeword should be the prefix of another codeword. This means that every time you choose a code with a shorter length, you are letting go of all the possible codes with that code as the prefix. If we take 01 as a code, we cannot use 011 or 01001, etc. So there is a cost incurred for selecting each code. Quoting from Chris Olah’s blog:

The cost of buying a codeword of length 0 is 1, all possible codewords – if you want to have a codeword of length 0, you can’t have any other codeword. The cost of a codeword of length 1, like “0”, is 1/2 because half of possible codewords start with “0”. The cost of a codeword of length 2, like “01”, is 1/4 because a quarter of all possible codewords start with “01”. In general, the cost of codewords decreases *exponentially* with the length of the codeword.

– Chris Olah’s blog[2]

And that cost can be quantified as

$$ \text{cost} = \frac{1}{2^L} $$

where L is the length of the codeword. Inverting the equation, we get:

$$ L = \log_2 \frac{1}{\text{cost}} $$

Now what is the cost? Our spends are proportional to how often a particular outcome needs to be encoded. So we spend P(x) on a particular outcome x, and therefore the cost = P(x). Substituting, the optimal code length works out to $L = \log_2 \frac{1}{P(x)}$. (For better intuition on this cost, read Chris Olah’s blogpost[2].)

In our example, now Australia also has a box with the same four coloured balls. But the distribution of the balls are different, as shown below(which is denoted by *q*)

Now Australia wants to do the same kind of experiment India did and send back the colour of the ball they picked. They decided to use the same coding scheme that was set earlier to carry out the communication.

Since the coding scheme is derived from the source distribution, the length of bits for each of the outcomes is the same as before.

But what has changed is the frequency in which the different messages are used. So, now the colour of the ball which Australia draws is going to be Green 50% of the time, and so on. The coding scheme that we have derived for the first use case(India to Australia) is not optimised for the 2nd use case(Australia to India).

This is the Cross Entropy, which can be formally defined as:

$$ H(q, p) = \sum_x q(x) \log_2 \frac{1}{p(x)} $$

where, p(x) is the probability of the distribution which was used to come up with the encoding scheme and q(x) is the probability of the distribution which is using them.

The Cross Entropy is always equal to or greater than the Entropy. In the best case scenario, the source and destination distributions are exactly the same, and in that case the Cross Entropy becomes Entropy, because we can substitute q(x) with p(x). In Machine Learning terms, we can say that the predicted distribution is p(x) and the ground truth distribution is q(x). So the more similar the predicted and true distributions are, the closer the Entropy and Cross Entropy will be.
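A short sketch makes the inequality concrete. Since Australia’s distribution table is not reproduced above, the numbers used for *q* here are an assumption, chosen only to match the post’s statement that green is drawn 50% of the time:

```python
import math

def cross_entropy(q, p):
    """Expected bits when messages drawn from q use codes of length log2(1/p)."""
    return sum(qi * math.log2(1 / pi) for qi, pi in zip(q, p) if qi > 0)

p = [0.25, 0.50, 0.13, 0.12]   # India's box: the coding scheme was derived from this
q = [0.25, 0.13, 0.50, 0.12]   # Australia's box (assumed): green now drawn 50% of the time

H_q = cross_entropy(q, q)      # entropy of q: the best achievable, ~1.75 bits
H_qp = cross_entropy(q, p)     # cost of reusing p's scheme: ~2.47 bits
```

Reusing the mismatched scheme costs roughly 0.7 extra bits per message, which is exactly the inefficiency the next section names.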

Now that we have understood Entropy and Cross Entropy, we are in a position to understand another concept with a scary name. The name is so scary that even practitioners call it KL Divergence instead of the actual Kullback-Leibler Divergence. This monster from the depths of information theory is a pretty simple and easy concept to understand.

KL Divergence measures how much one distribution diverges from another. In our example, we used the coding scheme devised for communication from India to Australia to do the reverse communication as well. In machine learning, this can be thought of as trying to approximate a true distribution (p) with a predicted distribution (q). But by doing this, we are spending more bits than necessary to send the message. Or, we are incurring some loss because we are approximating p with q. This loss of information is called the KL Divergence.

Two key properties of KL Divergence are worth noting:

- KL Divergence is non-negative. i.e. It is always zero or a positive number.
- KL Divergence is non-symmetric. i.e. KL Divergence from p to q is not equal to KL Divergence of q to p. And because of this, it is not strictly a distance.

KL(P||Q) can be interpreted in the following ways:

- The divergence from Q to P
- The relative entropy of P with respect to Q
- How well Q approximates P
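Both properties are easy to see numerically. A short sketch; the second distribution below is an arbitrary illustrative choice, not from the post:

```python
import math

def kl_divergence(p, q):
    """KL(P||Q) = sum p(x) * log2(p(x)/q(x)): extra bits paid for approximating P with Q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.25, 0.50, 0.13, 0.12]
q = [0.10, 0.20, 0.50, 0.20]   # an arbitrary second distribution

forward = kl_divergence(p, q)
backward = kl_divergence(q, p)
# both values are non-negative, and they differ: KL is not symmetric
```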

Let’s take a look at how we arrive at the formula, which will enhance our understanding of KL Divergence.

We know from our intuition that KL(P||Q) is the information lost when approximating P with Q. What would that be?

Just to recap the analogy of our example to an ML system to help you connect the dots better. Let’s use the situation where Australia is sending messages to India using the same coding scheme as was devised for messages from India to Australia.

- The box with balls(in India and Australia) are probability distributions
- When we prepare a machine learning model(classification or regression), what we are doing is approximating a probability distribution and in most cases output the expected value of the probability distribution.
- Devising a coding scheme based on a probability distribution is like preparing a model to output that distribution.
- The box with balls in India is the predicted distribution, because this is the distribution which we assume when we devise the coding scheme.
- The box with the balls in Australia is the true distribution, because this is where the coding scheme is used.

We know the Expected Value of Bits when we use the coding scheme in Australia. It’s the Cross Entropy of *Q* w.r.t. *P* (where P is the predicted distribution and Q is the true distribution).

Now, there is some inefficiency as we are using a coding scheme devised for another probability distribution. What is the ideal case here? The predicted distribution is equal to the true distribution, which means we use a coding scheme devised for Q itself to send the messages. The expected number of bits then is nothing but the Entropy of Q.

Since we have the actual Cross Entropy and the ideal Entropy, the information lost due to the coding scheme is

$$ KL = H(Q, P) - H(Q) = \sum_x q(x) \log_2 \frac{q(x)}{p(x)} $$

This should also give you some intuition towards why this metric is not symmetric.

*Note: We have discussed the entire blog assuming X, the random variable, is a discrete variable. If it is a continuous variable, just replace the $\sum$ with $\int$ and the formulae would work again.*

Loss functions are at the heart of a Deep Learning system. The gradients that we propagate to adjust the weights originate from this loss function. The most popular loss functions in classification problems are derived from Information Theory, specifically Entropy.

In an N-way classification problem, the neural network typically has N output nodes (with an exception for binary classification, which has just one node). These nodes’ outputs are also called the logits. *Logits are the real valued output of a Neural Network before we apply an activation*. If we pass these logits through a softmax layer, the outputs will be transformed into something resembling the probability (statisticians will be turning in their graves now) that the sample is that particular class.

Essentially, a softmax converts the raw logits to probabilities. And since we now have probabilities, we can calculate the Cross Entropy as we have reviewed earlier.

In Machine Learning terms, the Cross Entropy formula is:

$$ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic}\, \log \hat{y}_{ic} $$

where N is the number of samples, C is the number of classes, y is the true probability and y_hat is the predicted probability. In a typical case, y would be the one-hot representation of the target label(with zeroes everywhere and one for the correct label).
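A numpy sketch of that formula; the helper name and the toy arrays are my own illustrative choices:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over N samples of -sum_c y_ic * log(y_hat_ic); y_true is one-hot."""
    y_pred = np.clip(y_pred, eps, 1.0)   # guard against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# three samples, three classes; each row of y_pred sums to 1
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])

loss = categorical_cross_entropy(y_true, y_pred)
```

Because y_true is one-hot, the inner sum keeps only the log-probability of the correct class, exactly as the derivation below will show.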

Likelihood, in non-technical parlance, is used interchangeably with probability. But in statistics, and by extension Machine Learning, it has a different meaning. Put curtly: Probability is for outcomes; Likelihood is for hypotheses.

Let’s take an example of a normal distribution, with mean 30 and standard deviation 2.5. We can find the probability that the drawn value from the distribution is 32.

Now let’s reverse the situation, and consider that we do not know the underlying distribution. We just have the observations drawn from the distribution. In such a case, we can still calculate the probability like before assuming a particular distribution.

Now we are finding the likelihood that a particular observation is described by the assumed parameters. And when we maximise this likelihood, we would arrive at the parameters which best fit the drawn samples. This is the likelihood.

To summarize:

And in machine learning, since we try to estimate the underlying distribution from data, we always try to maximise the Likelihood and this is called **Maximum Likelihood Estimation**. In practice we maximise the log likelihood rather than the likelihood.

$$ \mathcal{L}(\theta) = \prod_{i=1}^{N} f(y_i \mid x_i, \theta) $$

where N is the number of samples, and f is a function which gives the probability of y given covariates x and parameters $\theta$.

There are two difficulties when we deal with the above product term:

- From a maths standpoint, it’s easier to work with integrations and derivatives when its a summation
- From a computer science standpoint, multiplying a lot of small numbers would result in really small numbers and, consequently, arithmetic underflow

This is where Log Likelihood comes in. If we apply the log function, the product turns into a summation. And since log is a monotonic transformation, maximising log likelihood would also maximise the likelihood. So, as a matter of convenience, we almost always maximise log likelihood instead of the pure likelihood.
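Both points are easy to demonstrate. As an illustrative setup (my choice, not from the post), take 2000 observations of a fair coin, each with probability 0.5:

```python
import math

probs = [0.5] * 2000   # likelihood factors for 2000 fair-coin observations

likelihood = math.prod(probs)                       # underflows to exactly 0.0
log_likelihood = sum(math.log(p) for p in probs)    # the sum stays well-behaved
```

The raw product is far below the smallest representable float and collapses to zero, while the log-likelihood is simply 2000 · log(0.5), a perfectly ordinary number.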

Let’s consider *q* as the true distribution and *p* as the predicted distribution. *y* is the target for a sample and *x* is the input features.

The ground truth distribution, typically, is a one-hot representation over the possible labels.

For a sample (x, y), the Cross Entropy is:

$$ H(q, p) = -\sum_{y' \in Y} q(y' \mid x)\, \log p(y' \mid x) $$

where *Y* is the set of all labels

The term $q(y' \mid x)$ is zero for all the elements of Y except the correct label, for which it is 1. So the term reduces to:

$$ -\log p(c \mid x) $$

where *c* is the correct label for the sample.

Now summing up over all N samples,

$$ -\sum_{i=1}^{N} \log p(c_i \mid x_i) $$

which is nothing but the negative of the log likelihood. So maximising log likelihood is equivalent to minimising the Cross Entropy.

In a binary classification, the neural network only has a single output, typically passed through a sigmoid layer. The sigmoid layer, $\sigma(z) = \frac{1}{1 + e^{-z}}$, squashes the logits to a value between 0 and 1.

Therefore the output of a sigmoid layer acts like a probability for the event. So if we have two classes, 0 and 1, as the confidence of the network increases about the sample being ‘1’, the output of the sigmoid layer also becomes closer to 1, and vice versa.

Binary Cross Entropy (BCE) is a loss function specifically tailored for binary classification. Let’s start with the formula and then try to draw parallels to what we have learnt so far.

$$ BCE = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] $$

Although the formula might look unfamiliar at first glance, this is just a special case of the Cross Entropy that we reviewed earlier. Unlike the multi-class output, we have just one output which is between 0 and 1. Therefore, to apply the Cross Entropy formula, we synthetically create two output nodes, one with *y* and another with *1-y* (we know from the laws of probability that in a binary outcome, the probabilities sum to 1), and sum the Cross Entropy of all the outcomes.
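A numpy sketch of BCE applied to raw logits; the helper names and the toy numbers are illustrative choices of mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """-mean of y*log(p) + (1-y)*log(1-p): the two synthetic output nodes y and 1-y."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # keep log() finite
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.5, 0.5, -3.0])   # raw network outputs

loss = binary_cross_entropy(y_true, sigmoid(logits))
```

Note that only one of the two terms is active per sample, mirroring the one-hot argument from the multi-class case.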

We know KL Divergence is not symmetric. If *p* is the predicted distribution and *q* is the true distribution, there are two ways you can calculate KL Divergence.

The first one, $KL(q \| p)$, is called the Forward KL Divergence. It gives you how much the predicted distribution is diverging from the true distribution. The second one, $KL(p \| q)$, is called the Backward KL Divergence. It gives you how much the true distribution is diverging from the predicted distribution.

In supervised learning, we use the Forward KL Divergence because it can be shown that minimising the forward KL Divergence is equivalent to minimising the Cross Entropy (which we will detail in the maths section), and Cross Entropy is much easier to calculate. And for this reason, Cross Entropy is preferred over KL Divergence in most simple use cases. Variational Autoencoders and GANs are a few areas where KL Divergence becomes useful again.

Backward KL Divergence is used in Reinforcement Learning and encourages the optimisation to find the mode of the distribution, while Forward KL does the same for the mean. For more details on the Forward vs Backward KL Divergence, read the blogpost by Dibya Ghosh[3].

We know that KL Divergence is the difference between Cross Entropy and Entropy.

$$ KL(q \| p) = H(q, p) - H(q) $$

Therefore, our Cross Entropy Loss over N samples is:

$$ L = H(q, p) = KL(q \| p) + H(q) $$

Now the optimisation objective is to minimise this loss by changing the parameters which parameterize the predicted distribution *p*. Since H(q) is the entropy of the true distribution, independent of the parameters, it can be considered a constant. And while optimising, we know a constant is immaterial and can be ignored. So the Loss becomes:

$$ L = KL(q \| p) $$

So, to summarise, we started with the Cross Entropy loss and proved that minimising the Cross Entropy is equivalent to minimising the KL Divergence.

The way these losses are implemented in the popular Deep Learning Frameworks like PyTorch and Tensorflow are a little confusing. This is especially true for PyTorch.

- binary_cross_entropy – This expects the logits to be passed through a sigmoid layer before computing the loss
- binary_cross_entropy_with_logits – This expects the raw outputs or the logits to be passed to the loss. The loss implementation applies a sigmoid internally
- cross_entropy – This expects the logits as the input and it applies a log softmax internally before calculating the negative log likelihood loss
- nll_loss – This is the plain negative log likelihood loss. It expects the logits to be passed through a *log* softmax layer before computing the loss
- kl_div – This expects the inputs to be passed through a *log* softmax layer before computing the loss (the targets are plain probabilities by default)

All the Tensorflow 2.0 losses expect probabilities as the input by default, i.e. logits passed through Sigmoid or Softmax before feeding to the losses. But they provide a parameter *from_logits* which, when set to *True*, accepts logits and does the conversion to probabilities internally.

- BinaryCrossentropy – Calculates the BCE loss for a *y_true* and *y_pred*. *y_true* and *y_pred* are single-dimensional tensors – a single value for each sample.
- CategoricalCrossentropy – Calculates the Cross Entropy for a *y_true* and *y_pred*. *y_true* is a one-hot representation of labels and *y_pred* is a multi-dimensional tensor – # of classes values for each sample.
- SparseCategoricalCrossentropy – Calculates the Cross Entropy for a *y_true* and *y_pred*. *y_true* is a single-dimensional tensor – a single value for each sample – and *y_pred* is a multi-dimensional tensor – # of classes values for each sample.
- KLDivergence – Calculates KL Divergence from *y_pred* to *y_true*. This is an exception in the sense that it always expects probabilities as the input.
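To see the logits-versus-probabilities distinction without depending on either framework, here is a framework-free numpy sketch: feeding probabilities (a softmax output) into a plain negative-log loss gives the same number as a `from_logits`-style path that works directly on logits via a log-softmax. The `softmax` and `log_softmax` helpers here are hand-rolled, not the frameworks’ own functions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum()

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

logits = np.array([2.0, 0.5, -1.0])       # raw outputs for a 3-class problem
true_class = 0

loss_probs = -np.log(softmax(logits)[true_class])   # "probabilities in" path
loss_logits = -log_softmax(logits)[true_class]      # "from_logits=True" style path
# the two paths agree; the log-softmax path avoids an intermediate exp/log round trip
```

The log-softmax route is what the `*_with_logits` / `from_logits=True` variants do internally, and it is numerically safer for large logits.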

We have come to the end of the blog and I hope by now you have a much better understanding and intuition about the magical loss functions which make Deep Learning possible.

- Shannon, C.E. (1948), “A Mathematical Theory of Communication”, *Bell System Technical Journal*, 27, pp. 379–423 & 623–656, July & October 1948.
- Chris Olah, “Visual Information Theory”, colah.github.io
- Dibya Ghosh, “KL Divergence for Machine Learning”, http://www.dibyaghosh.com