The starting point for LightGBM was XGBoost. Essentially, the authors took XGBoost and optimized it, so it has (more or less) all the innovations XGBoost had, plus some additional ones. Let’s take a look at the incremental improvements that LightGBM made:
One of the main changes from the other GBMs, like XGBoost, is the way the tree is constructed. In LightGBM, a leaf-wise tree growth strategy is adopted.
All the other popular GBM implementations follow something called level-wise tree growth, where you find the best possible split for the nodes in a level and split them one level down. This strategy results in symmetric trees, where every node in a level has child nodes, resulting in an additional layer of depth.
In LightGBM, the leaf-wise tree growth finds the leaf whose split reduces the loss the most, splits only that leaf, and does not bother with the rest of the leaves in the same level. This results in an asymmetrical tree where subsequent splitting can very well happen only on one side of the tree.
The leaf-wise tree growth strategy tends to achieve lower loss than the level-wise growth strategy, but it also tends to overfit, especially on small datasets. On small datasets, the level-wise growth acts like a regularization that restricts the complexity of the tree, whereas the leaf-wise growth tends to be greedy.
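To make the difference concrete, here is a toy sketch of the two growth strategies. It assumes the gain of every possible split is already known and stored heap-style in a list (the `GAINS` values and both function names are made up for illustration); a real implementation computes gains from the data as it goes.

```python
import heapq

# Hypothetical split gains for a complete binary tree stored heap-style:
# the children of node i are nodes 2*i + 1 and 2*i + 2.
GAINS = [10.0, 1.0, 8.0, 0.5, 0.4, 6.0, 5.0]

def leaf_wise(gains, num_leaves):
    """Repeatedly split the single leaf with the largest gain."""
    frontier = [(-gains[0], 0)]          # max-heap via negated gains
    split_nodes, leaves = [], 1
    while frontier and leaves < num_leaves:
        _, node = heapq.heappop(frontier)
        split_nodes.append(node)
        leaves += 1                      # one leaf becomes two
        for child in (2 * node + 1, 2 * node + 2):
            if child < len(gains):
                heapq.heappush(frontier, (-gains[child], child))
    return split_nodes

def level_wise(gains, max_depth):
    """Split every node of each level down to max_depth."""
    return [i for i in range(len(gains)) if i < 2 ** max_depth - 1]

leaf_nodes = leaf_wise(GAINS, num_leaves=4)
level_nodes = level_wise(GAINS, max_depth=2)
print(leaf_nodes, sum(GAINS[i] for i in leaf_nodes))
print(level_nodes, sum(GAINS[i] for i in level_nodes))
```

Both trees end up with four leaves, but the leaf-wise one spends all of its splits on one side of the tree, which is where the lower loss and the higher risk of overfitting both come from.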
Subsampling or downsampling is one of the ways in which we introduce variety and speed up the training process in an ensemble. It is also a form of regularization, as it restricts the model from fitting to the complete training data. Usually, this subsampling is done by taking a random sample from the training dataset and building a tree on that subset. But what LightGBM introduced was an intelligent way of doing this downsampling.
The core of the idea is that the gradient of a sample is an indicator of how big a role it plays in the tree building process. The instances with larger gradients (under-trained) contribute a lot more to the tree building process than instances with small gradients. So, when we downsample, we should strive to keep the instances with large gradients so that the tree building is the most efficient.
The most straightforward idea is to discard the instances with low gradients and build the tree on just the large gradient instances. But this would change the distribution of the data, which in turn would hurt the accuracy of the model. And hence, the Gradient-based One-Side Sampling (GOSS) method.
The algorithm is pretty straightforward:
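A minimal sketch of GOSS as described in the LightGBM paper: keep the top `a` fraction of instances by absolute gradient, randomly sample a `b` fraction of all instances from the rest, and up-weight the sampled small-gradient instances by `(1 - a) / b` so that the data distribution is roughly preserved. The function name, the defaults, and the seed handling are assumptions for illustration, not LightGBM’s internal code.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    top_k = int(a * n)                       # large-gradient instances kept
    rand_k = int(b * n)                      # small-gradient instances sampled
    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top_idx, rest_idx = order[:top_k], order[top_k:]
    sampled_idx = rng.choice(rest_idx, size=rand_k, replace=False)
    weights = np.ones(top_k + rand_k)
    weights[top_k:] = (1.0 - a) / b          # compensate for the downsampling
    return np.concatenate([top_idx, sampled_idx]), weights
```

The returned weights would multiply the gradients (and hessians) of the selected instances when the split gains are computed.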
The motivation behind Exclusive Feature Bundling (EFB) is a common theme between LightGBM and XGBoost. In many real world problems, although there are a lot of features, most of them are really sparse, like one-hot encoded categorical variables. The way LightGBM tackles this problem is slightly different.
The crux of the idea lies in the fact that many of these sparse features are exclusive, i.e. they do not take non-zero values simultaneously. And we can efficiently bundle these features and treat them as one. But finding the optimal feature bundles is an NP-Hard problem.
To this end, the paper proposes a Greedy Approximation to the problem, which is the Exclusive Feature Bundling algorithm. The algorithm is also slightly fuzzy in nature, as it will allow bundling features which are not 100% mutually exclusive, but it tries to maintain the balance between accuracy and efficiency when selecting the bundles.
The algorithm, on a high level, is:
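A rough sketch of the greedy bundling idea: visit features one at a time, put each into the first bundle where it conflicts (is non-zero on the same rows) at most `max_conflicts` times, and otherwise open a new bundle. Ordering features by their non-zero count is a simplification of the graph-degree ordering in the paper, and the merge step assumes non-negative feature values; both function names are hypothetical.

```python
import numpy as np

def greedy_bundles(X, max_conflicts=0):
    """Group columns of X so that bundled features are (almost) never non-zero together."""
    nonzero = X != 0
    # Simplification: visit features with the most non-zero rows first.
    order = np.argsort(-nonzero.sum(axis=0), kind="stable")
    bundles, occupied, conflicts = [], [], []
    for f in order:
        for b in range(len(bundles)):
            c = int(np.sum(occupied[b] & nonzero[:, f]))
            if conflicts[b] + c <= max_conflicts:
                bundles[b].append(int(f))
                occupied[b] |= nonzero[:, f]
                conflicts[b] += c
                break
        else:  # no existing bundle could take this feature
            bundles.append([int(f)])
            occupied.append(nonzero[:, f].copy())
            conflicts.append(0)
    return bundles

def merge_bundle(X, bundle):
    """Merge bundled features into one column by offsetting each feature's value range."""
    merged = np.zeros(X.shape[0])
    offset = 0.0
    for f in bundle:
        col = X[:, f]
        mask = col != 0
        merged[mask] = col[mask] + offset
        offset += col.max()
    return merged
```

Setting `max_conflicts` above zero gives the "fuzzy" behaviour, trading a little accuracy for fewer bundles.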
The amount of time it takes to build a tree is proportional to the number of splits that have to be evaluated. And when you have continuous or categorical features with high cardinality, this time increases drastically. But most of the splits that can be made for a feature only offer minuscule changes in performance. And this concept is why a histogram based method is applied to tree building.
The core idea is to group the values of a feature into a set of bins and perform splits based on these bins. This reduces the time complexity of split finding from O(#data) to O(#bins).
In another innovation, similar to XGBoost, LightGBM ignores the zero feature values while creating the histograms. And this reduces the cost of building the histogram from O(#data) to O(#non-zero data).
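A minimal sketch of histogram-based split finding for one feature. The quantile binning, the gain formula (the second-order gain used by XGBoost and LightGBM, without the constant factor), and the names are assumptions for illustration. Building the histogram touches every row once, but the split search itself only loops over the bins.

```python
import numpy as np

def best_split_histogram(x, grad, hess, n_bins=16, lam=1.0):
    """Return (threshold, gain) of the best split found on binned values of x."""
    # Interior bin edges taken from quantiles of the feature.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x, side="right")       # bin id in 0..n_bins-1
    G = np.bincount(bins, weights=grad, minlength=n_bins)
    H = np.bincount(bins, weights=hess, minlength=n_bins)
    G_tot, H_tot = G.sum(), H.sum()
    GL, HL = np.cumsum(G)[:-1], np.cumsum(H)[:-1]         # left side of each edge
    GR, HR = G_tot - GL, H_tot - HL
    gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G_tot**2 / (H_tot + lam)
    k = int(np.argmax(gain))
    return float(edges[k]), float(gain[k])                # rows with x < threshold go left
```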
In many real world datasets, categorical features make a big presence, and thereby it becomes essential to deal with them appropriately. The most common approach is to represent a categorical feature as its one-hot representation, but this is sub-optimal for tree learners. If you have high-cardinality categorical features, your tree needs to be very deep to achieve accuracy.
LightGBM takes in a list of categorical features as an input so that it can deal with them more optimally. It takes inspiration from “On Grouping for Maximum Homogeneity” by Walter D. Fisher and uses the following methodology for finding the best split for categorical features.
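The LightGBM documentation describes the approach as sorting the per-category histogram by its accumulated `sum_gradient / sum_hessian` and then finding the best split on the sorted histogram. Below is a simplified sketch of that idea; adding `cat_smooth` to the denominator and the exact gain formula are assumptions here, and the real implementation has more safeguards.

```python
import numpy as np

def best_categorical_split(cats, grad, hess, cat_smooth=10.0, lam=1.0):
    """Return (set of categories sent left, gain) for an integer-coded feature."""
    n_cat = int(cats.max()) + 1
    G = np.bincount(cats, weights=grad, minlength=n_cat)
    H = np.bincount(cats, weights=hess, minlength=n_cat)
    # Sort categories by their smoothed gradient statistic.
    order = np.argsort(G / (H + cat_smooth), kind="stable")
    GL, HL = np.cumsum(G[order])[:-1], np.cumsum(H[order])[:-1]
    GR, HR = G.sum() - GL, H.sum() - HL
    gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G.sum()**2 / (H.sum() + lam)
    k = int(np.argmax(gain))
    return set(order[: k + 1].tolist()), float(gain[k])
```

After sorting, only k - 1 split points need to be checked for k categories instead of every possible subset of categories.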
There are a few hyperparameters which help you tune the way the categorical features are dealt with[4]:
cat_l2, default = 10.0, type = double, constraints: cat_l2 >= 0.0
cat_smooth, default = 10.0, type = double, constraints: cat_smooth >= 0.0
max_cat_to_onehot, default = 4, type = int, constraints: max_cat_to_onehot > 0
    when the number of categories of a feature is smaller than or equal to max_cat_to_onehot, the one-vs-other split algorithm will be used
The majority of the incremental performance improvements were made through GOSS and EFB.
xgb_exa is the original XGBoost, xgb_his is the histogram-based version (which came out later), lgb_baseline is LightGBM without EFB and GOSS, and LightGBM is with EFB and GOSS. It is quite evident that the improvement from GOSS and EFB is phenomenal compared to lgb_baseline.
The rest of the improvements in performance are derived from the ability to parallelize the learning. There are two main ways of parallelizing the learning process:
Feature Parallel tries to parallelize the “Find the best split” part in a distributed manner. Different splits are evaluated in parallel across multiple workers, which then communicate with each other to decide who has the best split.
Data Parallel tries to parallelize the whole decision tree learning. In this, we typically split the data and send different parts of it to different workers, who calculate histograms based on the section of the data they receive. Then they communicate to merge the histograms at a global level, and this global histogram is what is used in the tree learning process.
Voting Parallel is a special case of Data Parallel, where the communication cost in Data Parallel is capped to a constant.
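The reason Data Parallel works is that gradient histograms are additive: summing the histograms built on each shard gives exactly the histogram of the full data. A small single-process sketch of that property (the shard count, bin count, and names are arbitrary choices for illustration; real workers would exchange the histograms over the network):

```python
import numpy as np

def local_histogram(bins, grad, n_bins):
    """Sum of gradients per bin for the rows a worker holds."""
    return np.bincount(bins, weights=grad, minlength=n_bins)

rng = np.random.default_rng(0)
n_rows, n_bins, n_workers = 1000, 8, 4
bins = rng.integers(0, n_bins, size=n_rows)   # pre-binned feature values
grad = rng.normal(size=n_rows)

# Each "worker" sees only its own shard of the rows.
shards = np.array_split(np.arange(n_rows), n_workers)
worker_hists = [local_histogram(bins[s], grad[s], n_bins) for s in shards]

# Merging is just an element-wise sum.
global_hist = np.sum(worker_hists, axis=0)
```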
LightGBM is one of those algorithms which has a lot, and I mean a lot, of hyperparameters. It is so flexible that it is intimidating for the beginner. But there is a way to use the algorithm and still not tune like 80% of those parameters. Let’s look at a few parameters that you can start tuning and then build up confidence and start tweaking the rest.
objective, default = regression, type = enum, options: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, cross_entropy, cross_entropy_lambda, lambdarank, rank_xendcg, aliases: objective_type, app, application
boosting, default = gbdt, type = enum, options: gbdt, rf, dart, goss, aliases: boosting_type, boost
    gbdt, traditional Gradient Boosting Decision Tree, aliases: gbrt
    rf, Random Forest, aliases: random_forest
    dart, Dropouts meet Multiple Additive Regression Trees
    goss, Gradient-based One-Side Sampling
learning_rate, default = 0.1, type = double, aliases: shrinkage_rate, eta, constraints: learning_rate > 0.0
    in dart, it also affects the normalization weights of dropped trees
num_leaves, default = 31, type = int, aliases: num_leaf, max_leaves, max_leaf, constraints: 1 < num_leaves <= 131072
max_depth, default = -1, type = int
    limits the max depth of the tree model; used to deal with over-fitting when #data is small. Tree still grows leaf-wise
    <= 0 means no limit
min_data_in_leaf, default = 20, type = int, aliases: min_data_per_leaf, min_data, min_child_samples, constraints: min_data_in_leaf >= 0
min_sum_hessian_in_leaf, default = 1e-3, type = double, aliases: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight, constraints: min_sum_hessian_in_leaf >= 0.0
    like min_data_in_leaf, it can be used to deal with over-fitting
lambda_l1, default = 0.0, type = double, aliases: reg_alpha, constraints: lambda_l1 >= 0.0
lambda_l2, default = 0.0, type = double, aliases: reg_lambda, lambda, constraints: lambda_l2 >= 0.0
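As a starting point, the parameters above can be collected in a plain dictionary. The values below are illustrative assumptions, not recommendations from the documentation, and `check_constraints` is a hypothetical helper that only encodes the constraints listed above.

```python
# Illustrative starting values (assumptions, tune them for your data).
params = {
    "objective": "binary",
    "boosting": "gbdt",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "max_depth": -1,                  # <= 0 means no limit
    "min_data_in_leaf": 20,
    "min_sum_hessian_in_leaf": 1e-3,
    "lambda_l1": 0.0,
    "lambda_l2": 1.0,
}

def check_constraints(p):
    """Return the names of parameters that violate the constraints listed above."""
    rules = {
        "learning_rate": lambda v: v > 0.0,
        "num_leaves": lambda v: 1 < v <= 131072,
        "min_data_in_leaf": lambda v: v >= 0,
        "min_sum_hessian_in_leaf": lambda v: v >= 0.0,
        "lambda_l1": lambda v: v >= 0.0,
        "lambda_l2": lambda v: v >= 0.0,
    }
    return [name for name, ok in rules.items() if name in p and not ok(p[name])]

print(check_constraints(params))
```

A dictionary like this is what you would pass as the `params` argument of `lightgbm.train`. If you do set `max_depth` to a positive value d, keep in mind that a tree of depth d can have at most 2^d leaves, so a larger `num_leaves` is never reached.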
In the next part of our series, let’s look at the one who took the path less taken – CatBoost
There were a few key innovations that made XGBoost so effective:
Similar to Regularized Greedy Forest, XGBoost also has an explicit regularization term in the objective function.
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$

where γ is the regularization term which penalizes T, the number of leaves in the tree, and λ is the regularization term which penalizes w, the weights of the different leaves.
This is a much simpler regularization term than some of the ways we saw in Regularized Greedy Forest.
One of the key ingredients of Gradient Boosting algorithms is the gradients or derivatives of the objective function. And all the implementations that we saw earlier used pre-calculated gradient formulae for specific loss functions, thereby, restricting the objectives which can be used in the algorithm to a set which is already implemented in the library.
XGBoost uses the Newton-Raphson method we discussed in a previous part of the series to approximate the loss function.
Now, the complex recursive function made up of tree structures can be approximated using Taylor’s approximation into a differentiable form. In the case of Gradient Boosting, we take the second order approximation, meaning we use two terms, the first order derivative and the second order derivative, to approximate the function.
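Let,

$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$

Approximated Loss function:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

The first term, the loss, is constant at a tree building stage, t, and because of that it doesn’t add any value to the optimization objective. So removing it and simplifying we get,

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

Substituting the Ω with the regularization term (γT for the number of leaves and ½λ‖w‖² for the leaf weights), we get:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

The f(x) we are talking about is essentially a tree with leaf weights, w. So if we define $I_j$ as the instance set in leaf j, we can substitute the tree structure directly into the equation and simplify as:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$

Setting the derivative of this equation with respect to $w_j$ to zero, we can find the optimum value for $w_j$ as:

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

Putting this back into the loss function and simplifying we get:

$$\tilde{\mathcal{L}}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$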
What all of this enables us to do is to separate out the objective function from the core working of the algorithm. And by adopting this formulation, the only requirement from an objective function/loss function is that it needs to be differentiable. To be very specific, the loss function should return the first and second order derivatives.
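To see what "return the first and second order derivatives" means in practice, here is a small sketch for the logistic loss, followed by the optimal leaf weight -G / (H + λ) and the corresponding leaf score -G² / (2(H + λ)) from the XGBoost paper, where G and H are the sums of first and second derivatives of the instances in the leaf. The function names are my own; only the formulas come from the paper.

```python
import numpy as np

def logistic_grad_hess(y_true, raw_score):
    """First and second derivative of the log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y_true, p * (1.0 - p)

def optimal_leaf_weight(g, h, lam=1.0):
    """w* = -G / (H + lambda) for the instances that fall in one leaf."""
    return -g.sum() / (h.sum() + lam)

def leaf_score(g, h, lam=1.0):
    """Contribution of one leaf to the simplified loss: -G^2 / (2 (H + lambda))."""
    return -0.5 * g.sum() ** 2 / (h.sum() + lam)

y = np.array([1.0, 1.0, 0.0, 1.0])
g, h = logistic_grad_hess(y, raw_score=np.zeros(4))
print(optimal_leaf_weight(g, h), leaf_score(g, h))
```

Swapping in a different loss only means swapping `logistic_grad_hess`; the leaf weight and score functions stay untouched.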
See the official XGBoost documentation for a list of all the objective functions that are pre-built into XGBoost.
The tree building strategy lies somewhere in between classical Gradient Boosting and Regularized Greedy Forest. While classical Gradient Boosting treats the tree at each stage as a black box, Regularized Greedy Forest operates at a leaf level by updating any part of the forest at each step. XGBoost takes a middle ground: it considers a tree that was made in a previous iteration sacrosanct, but while making a tree for any iteration, it doesn’t use the standard impurity measures. Instead, it uses the gradient based loss function we derived in the last section in the tree building process. While in classic Gradient Boosting the optimization of the loss function happens after the tree is built, XGBoost gets that optimization during the tree building process as well.
Normally it’s impossible to enumerate all possible tree structures. Therefore a greedy algorithm is implemented that starts from a single leaf and iteratively adds branches to the tree based on the simplified loss function.
One of the key problems in tree learning is to find the best split. Usually, we have to enumerate all the possible splits over all the features and then use the impurity criteria to choose the best split. This is called the exact greedy algorithm. It is computationally demanding, especially for continuous and high cardinality categorical features. And it is also not feasible when the data doesn’t fit into memory.
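A minimal sketch of the exact greedy search on a single feature, using the split gain from the XGBoost paper, ½[G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ)] − γ. It sorts the values once and scans every boundary between distinct values; the function name and the midpoint thresholds are assumptions for illustration.

```python
import numpy as np

def exact_greedy_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan all split points of one feature; return (threshold, gain) or (None, 0.0)."""
    order = np.argsort(x, kind="stable")
    xs, gs, hs = x[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()
    best_thr, best_gain = None, 0.0
    GL = HL = 0.0
    for i in range(len(xs) - 1):
        GL += gs[i]
        HL += hs[i]
        if xs[i] == xs[i + 1]:
            continue                      # cannot split between equal values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_thr, best_gain = 0.5 * (xs[i] + xs[i + 1]), gain
    return best_thr, best_gain
```

Because only splits with positive gain are accepted, a large γ makes the function return no split at all, which is how γ prunes the tree.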
To overcome these inefficiencies, the paper proposes an Approximate Algorithm. It first proposes candidate splitting points according to the percentiles of the features. On a high level, the algorithm is:
One of the important steps in the algorithm discussed above is the proposal of candidate splits. Usually, percentiles of a feature are used to make the candidates distribute evenly over the data. A set of algorithms which do that in a distributed manner, with speed and efficiency, are called quantile sketching algorithms. But here, the problem is slightly more complicated, because the need is for a weighted quantile sketching algorithm which weighs the instances based on their gradient statistics (in the paper, the second order gradient, i.e. the hessian), so that we can learn the most from the instances that matter the most. So the authors proposed a new data structure which has a provable theoretical guarantee.
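To illustrate what "weighted" means here, the sketch below proposes candidate splits so that roughly an `eps` fraction of the total hessian weight falls between consecutive candidates. This is a plain in-memory version for intuition only; it is not the mergeable sketch data structure from the paper, and the function name and the rank definition are simplifications.

```python
import numpy as np

def weighted_quantile_candidates(x, hess, eps=0.25):
    """Candidate split values spaced evenly in cumulative hessian weight."""
    order = np.argsort(x, kind="stable")
    xs, hs = x[order], hess[order]
    rank = np.cumsum(hs) / hs.sum()          # weighted rank in (0, 1]
    targets = np.arange(eps, 1.0, eps)
    idx = np.searchsorted(rank, targets, side="left")
    idx = np.minimum(idx, len(xs) - 1)       # guard against rounding at the top end
    return np.unique(xs[idx])
```

With equal weights this reduces to ordinary percentiles; when a few instances carry most of the weight, the candidates move towards them.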
This is another key innovation in XGBoost, and it came from the realization that real-world datasets are sparse. This sparsity can come from multiple causes,
And for this reason, the authors decided to make the algorithm aware of the sparsity so that it can be dealt with intelligently. And the way they did that is deceptively simple.
They gave a default direction to each tree node, i.e. if a value is missing or zero, that instance flows down a default direction in the branch. And the optimal default direction is learned from the data.
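"Learned from the data" can be sketched as follows: for a given split, compute the gain once with all the missing instances sent left and once with them sent right, and keep the better direction. This is a simplified version for a single fixed threshold, with missing values encoded as NaN; the names are assumptions, and the real algorithm does this while scanning only the non-missing entries.

```python
import numpy as np

def split_gain(GL, HL, GR, HR, lam=1.0):
    """Second-order split gain (without the gamma penalty)."""
    G, H = GL + GR, HL + HR
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam))

def best_default_direction(x, g, h, threshold, lam=1.0):
    """Return ("left" or "right", gain) for routing the missing values of x."""
    missing = np.isnan(x)
    filled = np.nan_to_num(x)                # placeholder; masked out below
    left = ~missing & (filled < threshold)
    right = ~missing & (filled >= threshold)
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[right].sum(), h[right].sum()
    Gm, Hm = g[missing].sum(), h[missing].sum()
    gain_left = split_gain(GL + Gm, HL + Hm, GR, HR, lam)
    gain_right = split_gain(GL, HL, GR + Gm, HR + Hm, lam)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right
```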
This innovation has a twofold benefit –
One of the main drawbacks of all the implementations of the Gradient Boosting algorithm was that they were quite slow. While the forest creation in a Random Forest was parallel out of the box, Gradient Boosting was a sequential process which builds new base learners on old ones. One of XGBoost’s claims to fame was how blazingly fast it was. It was at least 10 times faster than the existing implementations, and it was able to work with large datasets because of its out-of-core learning capabilities. The key innovations in performance improvement were:
The most time consuming part of tree learning is getting the data into sorted order. The authors of the paper proposed to store the data in in-memory units, called blocks. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. This input data layout is computed once before training and reused thereafter. By handling the data in this sorted format, the tree split algorithm is reduced to a linear scan over the sorted columns.
While the proposed block structure helps optimize the computational complexity of split finding, it requires indirect fetches of gradient statistics by row index. To get over the slow read and write operations in the process, the authors implemented an internal buffer for each thread and accumulated the gradient statistics in mini-batches. This helps in reducing the runtime overhead when the number of rows is large.
One of the goals of the algorithm is to fully utilize the machine’s resources. While the CPUs are utilized by parallelization of the process, the available disk space is utilized by the out-of-core learning. These blocks that we saw earlier are stored on disk and a separate prefetch thread keeps fetching the data into memory just in time for the computation to continue. They use two techniques to make the I/O operations from disk faster –
XGBoost has so many articles and blogs about it covering the hyperparameters and how to tune them that I’m not even going to attempt it.
The single source of truth for any hyperparameter is the official documentation. It might be intimidating to look at the long list of hyperparameters there, but you won’t end up touching the majority of them in a normal use case.
The major ones are:
For example, the combination {'colsample_bytree': 0.5, 'colsample_bylevel': 0.5, 'colsample_bynode': 0.5} with 64 features will leave 8 features to choose from at each split.
After publishing this, I came to realize I haven’t talked about some of the later developments in XGBoost, like leaf-wise tree growth, and how tuning the parameters is slightly different for the new, faster implementation.
LightGBM, which we will be talking about in the next blog in the series, implemented leaf-wise tree growth, and that led to a huge performance improvement. XGBoost also played catch-up and implemented the leaf-wise strategy in a histogram based tree splitting strategy.
The leaf-wise growth policy, while faster, also overfits faster if the data is small. Therefore it is quite important to use regularization or cap tree complexity through hyperparameters. But if we just keep tuning max_depth as before to control complexity, it won’t work. max_leaves (the XGBoost counterpart of LightGBM’s num_leaves, which controls the number of leaves in a tree) also needs to be tuned together with it. This is because, with the same depth, a leaf-wise tree growing algorithm can make a more complex tree.
You can enable leaf-wise tree building in XGBoost by setting the tree_method parameter to “hist” and the grow_policy parameter to “lossguide”.
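As a configuration sketch, the parameters mentioned above could be put together like this. The values for `max_depth`, `max_leaves`, and `eta` are illustrative assumptions; the dictionary is what you would pass as `params` to `xgboost.train`.

```python
# Leaf-wise ("lossguide") growth with the histogram tree method.
params = {
    "tree_method": "hist",
    "grow_policy": "lossguide",
    "max_depth": 6,        # illustrative value
    "max_leaves": 31,      # illustrative value
    "eta": 0.1,            # illustrative value
}

def max_leaves_at_depth(depth):
    """A binary tree of the given depth has at most 2**depth leaves."""
    return 2 ** depth

# With max_depth=6 a tree could have up to 64 leaves, so max_leaves=31
# is the constraint that actually limits the complexity here.
print(max_leaves_at_depth(params["max_depth"]), params["max_leaves"])
```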
In the next part of our series, let’s look at the one who challenged the king – LightGBM
The implementation (both the original and a faster multi-core version) is on the GitHub page: https://github.com/RGF-team/rgf
The key modifications to the core GBDT algorithm they suggested are as follows:
According to Friedman[1], one of the disadvantages of standard Gradient Boosting is that the shrinkage/learning rate needs to be small to achieve convergence. In fact, he argued for an infinitesimal step size. The authors of RGF suggested a modification which made the shrinkage parameter unnecessary.
In standard Gradient Boosting, the algorithm does a partial corrective step in each iteration. The algorithm only optimizes the base learner in the current iteration and ignores all the previous ones. It creates the best regression tree for the current timestep and adds it to the ensemble. But they proposed that at every iteration, we update the whole forest (m base learners for iteration m) and readjust the scaling factor at each iteration.
While the fully corrective greedy update means that the algorithm will converge faster, it also means that it overfits faster. Therefore Structured Sparsity Regularization was adopted to combat this problem. The general idea of structured sparsity is that in a situation where a sparse solution is assumed, one can take advantage of the sparsity structure underlying the task. In this specific setting, it was implemented as a sparse search of decision rules in the forest structure.
In addition to the Structured Sparsity Regularization, they also included an explicit regularization term in the loss function.

$$\mathcal{Q}(\mathcal{F}) = l\left(\mathcal{F}(X), Y\right) + \Omega(\mathcal{F})$$

where l is the differentiable convex loss function, Ω is the regularisation term penalising the complexity of the tree structure, and $\mathcal{F}$ is the forest structure.
The paper introduces three types of Regularization options:
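$$\Omega(\mathcal{F}) = \lambda \sum_{T \in \mathcal{T}} \sum_{v \in L_T} \frac{\alpha_v^2}{2}$$

where $\mathcal{F}$ is the forest structure, λ is the constant for controlling the strength of regularization, $\alpha_v$ are the weights of the node v (which is restricted to leaf nodes), $L_T$ is the set of leaves of the tree T, and $\mathcal{T}$ is the set of all trees in the forest.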
hyperparameter in implementation: l2
The minimum penalty regularization penalises the depth of the trees. This is a regularization that acts on all nodes and not just the leaves of the trees. This uses the principle that any leaf node can be written in terms of its ancestor nodes. The intuition behind the regularization is that it penalises depth, which is conceptually a complex decision rule.
The exact formula is beyond our scope, but the key hyperparameters there are l2, which governs the overall strength of regularization, and reg_depth, which controls how severely you penalise the depth of a tree. Suggested values for l2 are 1, 0.1, and 0.01, and reg_depth should be a value greater than 1.
This is very similar to Minimum-penalty regularization, but with an added condition that the weights of sibling nodes should sum to zero. The intuition behind the sum-to-zero constraint is that less redundant models are preferable and that the models are least redundant when branches at internal nodes lead to completely opposite actions, like adding ‘x’ to versus subtracting ‘x’ from the output value. So this penalises two things- depth of the tree and redundancy of the tree. There are no additional hyperparameters here.
Note: An interesting tidbit here is that all the benchmarks in the paper and the competitions only used simple L2 regularization.
The general concept is still similar to Gradient Boosting, but the key differences are in the tree updates in each iteration. Also, the convenient closed-form results, where the optimal leaf value turns out to be the mean or the median, do not work anymore because of the regularization term in the objective function.
Let’s look at the new algorithm, albeit at a high level.
There is one key difference in the way the trees are built in the forest in Regularized Greedy Forest. In classical Gradient Boosting, at each stage a new tree is built, with a specific stopping criterion like depth or number of leaves. Once you pass a stage, you do not touch that tree or the weights associated with it. On the contrary, in RGF, at any iteration m, the option to update any of the previously created m-1 trees or to start a new tree is open. The exact update to the forest is determined by the action which minimizes the loss function.
Let’s look at an example to understand the fundamental difference in tree building between Gradient Boosting and Regularized Greedy Forest.
Standard Gradient Boosting builds successive trees and sums those up into an additive function which approximates the desired function.
RGF takes a slightly different route. For each step change in the structure of the tree, it evaluates the possibility of growing a new leaf in an existing tree vs starting a new tree with the help of the loss function, and then takes the greedy approach of taking the route with the least loss. So in the diagram above, we can choose to grow leaves in T1, T2, or T3, or we can start a new tree T4, depending on which one gives the most reduction in loss.
But practically, it is computationally challenging to do this, as the number of possible splits to evaluate grows exponentially as we move deeper and deeper into the forest. So, there is a hyperparameter, n_tree_search, in the implementation which restricts the retrospective update of trees to that many of the latest trees only. The default value is set as 1 so that the update always looks at one previously created tree. In our example, this reduces the possibilities to growing leaves in T3 or growing a new tree T4.
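The effect of n_tree_search on the set of candidate actions can be sketched with a toy function. It assumes we already know, for each existing tree, the best loss reachable by splitting one of its leaves, as well as the loss reachable by starting a new tree; the function name, the action labels, and the numbers are made up for illustration and are not part of the RGF package.

```python
def choose_action(leaf_split_losses, new_tree_loss, n_tree_search=1):
    """Pick the structural change with the lowest loss.

    leaf_split_losses[t] is the best loss achievable by growing a leaf in
    tree t + 1 (oldest tree first); only the latest n_tree_search trees
    are considered, plus the option of starting a new tree.
    """
    n_trees = len(leaf_split_losses)
    candidates = {"new_tree": new_tree_loss}
    for t in range(max(0, n_trees - n_tree_search), n_trees):
        candidates[f"grow_T{t + 1}"] = leaf_split_losses[t]
    return min(candidates, key=candidates.get)

# Three existing trees T1..T3; a new tree would be T4.
losses = [0.50, 0.90, 0.80]
print(choose_action(losses, new_tree_loss=0.85, n_tree_search=1))
print(choose_action(losses, new_tree_loss=0.85, n_tree_search=3))
```

With the default of 1, only T3 and the new tree compete; searching all three trees would have picked T1 in this made-up example, at the price of evaluating many more candidate splits.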
Conceptually, this becomes an additive function over the leaves of the forest rather than an additive function of trees in a forest, and consequently, there is no max_depth parameter in RGF, as the depth of the tree is automatically restrained by the incremental updates to the tree structure.
The next step is to set the weight of the new leaf that was selected as the best structural change. This is an optimization problem and can be solved using any of multiple methods, like Gradient Descent or the Newton-Raphson method. Since the optimization that we are looking at is simpler, the paper uses a Newton step, which is much more accurate than Gradient Descent, to get an approximately optimal weight for the new leaf. Refer to Appendix A if you are interested in how and why we need a Newton step to optimize such functions.
With the base learner or the basis function fixed, we need to optimize the weights of all the leaves in the forest. This is again an optimization problem, and it is solved using Coordinate Descent, which iteratively goes through each of the leaves and updates its weight by a Newton step with a small step size.
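A minimal sketch of this weight correction for the square loss with L2 regularization. The forest is represented by a hypothetical 0/1 matrix `Z`, where `Z[i, j] = 1` if instance i falls into leaf j (an instance hits one leaf per tree). The step size, the number of sweeps, and the names are assumptions; the RGF implementation differs in its details.

```python
import numpy as np

def optimize_leaf_weights(Z, y, w, lam=0.1, step=0.5, n_sweeps=10):
    """Coordinate descent over leaf weights, one damped Newton step per leaf."""
    w = np.asarray(w, dtype=float).copy()
    pred = Z @ w
    for _ in range(n_sweeps):
        for j in range(Z.shape[1]):
            z = Z[:, j]
            grad = -(z * (y - pred)).sum() + lam * w[j]   # dLoss / dw_j
            hess = (z * z).sum() + lam                    # d2Loss / dw_j^2
            delta = -step * grad / hess
            w[j] += delta
            pred += delta * z                             # keep predictions in sync
    return w

def regularized_loss(Z, y, w, lam=0.1):
    return 0.5 * ((y - Z @ w) ** 2).sum() + 0.5 * lam * (w ** 2).sum()
```

Since only one weight changes at a time, the predictions can be updated incrementally, which keeps each step cheap.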
Since the initial weights that are already set are approximately optimal, we do not need to re-optimize the weights at every iteration. It would be computationally expensive if we did that. This is controlled by another hyperparameter in the implementation called opt_interval. Empirically, it was observed that unless opt_interval is an extreme value, the choice of opt_interval is not critical. For all the competitions they won, they had simply set the value as 100.
Below is a list of key hyperparameters that the authors of the paper suggest. It is almost directly taken from their GitHub page, but adapted to the Python wrapper.
Parameters to control training | |
algorithm= | RGF, RGF_Opt, or RGF_Sib (Default: RGF). RGF: RGF with L2 regularization on leaf-only models. RGF_Opt: RGF with min-penalty regularization. RGF_Sib: RGF with min-penalty regularization with the sum-to-zero sibling constraints. |
loss= | Loss function. LS, Expo, or Log (Default: LS). LS: square loss. Expo: exponential loss. Log: logistic loss. |
max_leaf= | Training will be terminated when the number of leaf nodes in the forest reaches this value. It should be large enough so that a good model can be obtained at some point of training, whereas a smaller value makes training time shorter. Appropriate values are data-dependent and in [2] varied from 1000 to 10000. (Default: 10000) |
normalize | If turned on, training targets are normalized so that the average becomes zero. It was turned on in all the regression experiments in [2]. |
l2= | λ. Used to control the degree of L2 regularization. Crucial for good performance. Appropriate values are data-dependent. Either 1.0, 0.1, or 0.01 often produces good results, though with exponential loss (loss=Expo) and logistic loss (loss=Log) some data requires smaller values such as 1e-10 or 1e-20. |
sl2= | λ_g. Override regularization parameter λ for the process of growing the forest. That is, if specified, the weight correction process uses λ and the forest growing process uses λ_g. If omitted, no override takes place and λ is used throughout training. On some data, λ/100 works well. |
reg_depth= | Must be no smaller than 1. Meant for being used with algorithm=RGF_Opt or algorithm=RGF_Sib. A larger value penalizes deeper nodes more severely. (Default: 1) |
test_interval= | Test interval in terms of the number of leaf nodes. For example, if test_interval=500, every time 500 leaf nodes are newly added to the forest, end of training is simulated and the model is tested or saved for later testing. For efficiency, it must be either a multiple or a divisor of the optimization interval (opt_interval: default 100). If not, it may be changed by the system automatically. (Default: 500) |
n_tree_search= | Number of trees to be searched for the nodes to split. The most recently-grown trees are searched first. (Default: 1) |
The YouTube channel 3Blue1Brown (which I recommend strongly if you want fundamental intuitions about math) has yet another brilliant video explaining Taylor expansions/approximations. Be sure to check out at least the first 6 minutes of the video.
Taylor’s approximation lets us approximate a function close to a point by using the derivatives of that function.
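$$f(x) \approx f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2 + \dots$$

Suppose we are taking a second order approximation and finding a local minimum; we can do that by setting the derivative to zero.

$$\frac{d}{dx}\left[ f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2 \right] = f'(a) + f''(a)(x - a)$$

Setting it to zero, we get:

$$x - a = -\frac{f'(a)}{f''(a)}$$

This (x-a) is the optimum step to minimize the function at that point. So, this minimum is more like the step direction towards the minimum than the actual minimum.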
To optimize a function that is not itself quadratic, we need to take multiple steps in the step direction until we are relatively satisfied with the loss, or technically, until the loss is below our tolerance. This is called the Newton-Raphson method of optimization.
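A minimal sketch of the Newton-Raphson iteration, using the step -f'(x) / f''(x) repeatedly. The example function f(x) = e^x - 2x, whose minimum is at x = ln 2, and the stopping rule on the step size are my own choices for illustration.

```python
import math

def newton_minimize(f_prime, f_double_prime, x0, tol=1e-8, max_iter=50):
    """Repeat the Newton step x <- x - f'(x) / f''(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = -f_prime(x) / f_double_prime(x)
        x += step
        if abs(step) < tol:
            break
    return x

# f(x) = exp(x) - 2x  ->  f'(x) = exp(x) - 2,  f''(x) = exp(x)
x_min = newton_minimize(lambda x: math.exp(x) - 2.0, math.exp, x0=0.0)
print(x_min, math.log(2.0))
```

For a function that is exactly quadratic, the very first step already lands on the minimum, which is why a single Newton step gives a good approximate leaf weight.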
In the next part of the series, let’s take a look at the giant – XGBoost
Stay tuned!
Today, I am starting a new blog series on Gradient Boosting Machines and all their cousins. The outline for the blog series is as follows (this will be updated with links as and when each part gets published):
In the first part, let’s understand the classic Gradient Boosting methodology put forth by Friedman. Even though this is math heavy, it’s not that difficult. And wherever possible, I have tried to provide intuition to what’s happening.
Let there be a dataset D with n samples. Each sample has a set of m features in a vector x, and a real valued target, y. Formally, it is written as

$$D = \{(x_i, y_i)\} \quad \left(|D| = n,\; x_i \in \mathbb{R}^m,\; y_i \in \mathbb{R}\right)$$
Now, Gradient Boosting algorithms are an ensemble method which takes an additive form. The intuition is that a complex function, which we are trying to estimate, can be made up of smaller and simpler functions, added together.
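Suppose the function we are trying to approximate is

$$\hat{y} = F(x)$$

We can break this function as:

$$F(x) = f_1(x) + f_2(x) + \dots + f_M(x)$$

This is the assumption we are taking when we choose an additive ensemble model, and the tree ensemble that we usually talk about when talking about Gradient Boosting can be written as below:

$$\hat{y}_i = F(x_i) = \sum_{k=1}^{M} f_k(x_i), \quad f_k \in \mathcal{F}$$

where M is the number of base learners and $\mathcal{F}$ is the space of regression trees.

The objective function to minimize is:

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{M} \Omega(f_k)$$

where l is the differentiable convex loss function and Ω is the regularization term on each base learner.

Since we are looking at an additive functional form for $\hat{y}$, we can replace $\hat{y}_i$ at stage t with $\hat{y}_i^{(t-1)} + f_t(x_i)$.

So, the loss function will become:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$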
In the standard implementation (scikit-learn), the regularization term in the objective function is not implemented. The only forms of regularization implemented there are the following:
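We know that Gradient Boosting is an additive model and can be shown as below:

$$F_m(X) = F_{m-1}(X) + \eta f_m(X)$$

where F is the ensemble model, f is the weak learner, η is the learning rate and X is the input vector.

Substituting F with $\hat{y}$, we get the familiar equation,

$$\hat{y}_m = \hat{y}_{m-1} + \eta f_m(X)$$

Now since $f_m$ is obtained at each iteration by minimizing the loss function, which is a function of the first and second order gradients (derivatives), we can intuitively think about it as a directional vector pointing towards the steepest descent. Let’s call this directional vector $r_{m-1}$. The subscript is m-1 because the vector has been trained on stage m-1 of the iteration. Or, intuitively, the residual $y - \hat{y}_{m-1}$. So the equation now becomes:

$$\hat{y}_m = \hat{y}_{m-1} + \eta\, r_{m-1}$$

Flipping the signs (the direction of steepest descent is the negative of the gradient of the loss with respect to $\hat{y}_{m-1}$), we get:

$$\hat{y}_m = \hat{y}_{m-1} - \eta \frac{\partial L}{\partial \hat{y}_{m-1}}$$

Now let’s look at the standard Gradient Descent equation:

$$w_m = w_{m-1} - \eta \frac{\partial L}{\partial w_{m-1}}$$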
We can clearly see the similarity. And this result is what enables us to use any differentiable loss function.
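To make the "gradient descent in function space" idea concrete, here is a bare-bones sketch of gradient boosting on a single feature, with decision stumps as the weak learners. The loss enters only through a function that returns the negative gradient (for the square loss, the residual), so any differentiable loss could be plugged in. All names here are my own, and this is not how scikit-learn implements it.

```python
import numpy as np

def fit_stump(x, target):
    """Least-squares regression stump on a 1-D feature."""
    best = None
    for thr in np.unique(x)[:-1]:               # keeps both sides non-empty
        left = x <= thr
        lv, rv = target[left].mean(), target[~left].mean()
        err = ((target - np.where(left, lv, rv)) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    _, thr, lv, rv = best
    return lambda z: np.where(z <= thr, lv, rv)

def gradient_boost(x, y, neg_gradient, n_rounds=50, eta=0.1):
    """Each round fits a stump to the negative gradient and takes a step of size eta."""
    pred = np.full(len(y), y.mean(), dtype=float)
    learners = []
    for _ in range(n_rounds):
        r = neg_gradient(y, pred)               # direction of steepest descent
        f = fit_stump(x, r)                     # weak learner approximates it
        pred = pred + eta * f(x)
        learners.append(f)
    return pred, learners

def squared_loss_neg_gradient(y, pred):
    return y - pred                             # the residual

x = np.linspace(0, 6, 60)
y = np.sin(x)
pred, _ = gradient_boost(x, y, squared_loss_neg_gradient)
print(np.mean((y - y.mean()) ** 2), np.mean((y - pred) ** 2))
```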
When we train a neural network using Gradient Descent, it tries to find the optimum parameters(weights and biases), w, that minimizes the loss function. And this is done using the gradients of the loss with respect to the parameters.
But in Gradient Boosting, the gradient only adjusts the way the ensemble is created and not the parameters of the underlying base learner.
While in a neural network the gradient directly gives us the direction vector of the loss function, in Boosting we only get an approximation of that direction vector from the weak learner. Consequently, the loss of a GBM is only likely, and not guaranteed, to reduce monotonically. It is entirely possible that the loss jumps around a bit as the iterations proceed.
GradientBoostingClassifier and GradientBoostingRegressor in scikit-learn are among the earliest implementations in the Python ecosystem. It is a straightforward implementation, faithful to the original paper. It follows pretty much the discussion we have had till now. And it is implemented for a variety of loss functions for which Friedman derived algorithms in Greedy function approximation: A gradient boosting machine[1].
In the next part of the series, let’s take a look at one of the lesser known cousins of Gradient Boosting – Regularized Greedy Forest
Stay tuned!
If you are a beginner in Machine Learning, you might not have made an effort to go deep and understand the mathematics behind the “.fit()“, but as you mature and stumble across more and more complex problems, it becomes essential to understand the math or at least the intuition behind the maths to effectively apply the right technique at the right place.
When I was starting out, I was also guilty of the same. I would see “Categorical Cross Entropy” as a loss function in a Neural Network and take it for granted, as some magical loss function that works with multi-class labels. I would see “entropy” as one of the splitting criteria in Decision Trees and just experiment with it without understanding what it is. But as I matured, I decided to spend more time understanding the basics, and it helped me immensely in getting my intuitions right. This also helped in understanding the different ways the popular Deep Learning frameworks, PyTorch and TensorFlow, have implemented the different loss functions, and in deciding when to use what.
This blog is me summarising my understanding of the underlying concepts of Information Theory and how the implementations differ across the different Deep Learning Frameworks. Each of the concepts that I’ve tried to explain starts off with an introduction and a way to reinforce the intuition, and then provide the mathematics associated with it as a bonus. I’ve always found the maths to clear the understanding like nothing else can.
In the early 20th century, computer scientists and mathematicians around the world were faced with a problem. How to quantify information? Let’s consider the two sentences below:
We as humans intuitively understand that the two sentences transmit different amounts of information. The second one is much more informative than the first. But how do you quantify it? How do you express that in the language of mathematics?
Enter Claude E. Shannon with his seminal paper “A Mathematical Theory of Communication”[1]. Shannon introduced the qualitative and quantitative model of communication as a statistical process. Among many other ideas and concepts, he introduced Information Entropy and the concept of the ‘bit’, the fundamental unit of measurement of Information.
Information Theory is quite vast, but I’ll try to summarise key bits of information (pun not intended) in a short glossary.
Let’s set up an example which we will use to make the intuition behind Entropy clear. The ubiquitous “balls in a box” example is as good as any.
There is a box with 100 balls which have four different colours – Blue, Red, Green and Yellow. Let us assume there are 25 balls of each colour in the box. A transmitter picks up a ball from the box at random and transmits that information to a receiver. For our example, let’s assume that the transmitter is in India and the receiver is in Australia. And let’s also assume that we are in the early 20th century, when Telegraphy was one of the primary modes of long distance communication. The thing about telegram messages is that they are charged by the word, so you need to be careful about what you are sending if you are on a budget (this might not seem important right now, but I assure you it will). Adding just one more restriction to the formulation: you cannot send the actual word “blue” through the telegram. You do not have the 26 symbols of the English language, but instead just two symbols – 0 and 1.
Now let’s look at how we will be coding the four responses. I’m sure all of you know the binary numerical system. So if we have four outcomes, we can get unique codes using a code length of two. We use that to assign a code for each of our four outcomes. This is fixed length encoding.
Another way to think about this is in terms of reduction in uncertainty. We know that all the four outcomes are equally likely (each has a probability of 1/4). And when we transmit the information about the colour of the picked ball, we reduce the uncertainty by a factor of 4. We know that 1 bit can reduce uncertainty by a factor of 2, and to reduce uncertainty by a factor of 4 we need two bits.
Mathematically, if we have M symbols in the code we are using, we would need $\log_2 M$ bits to encode the information.
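To make the fixed length idea concrete, here is a minimal sketch in Python. The helper name `bits_needed` and the particular assignment of codes to colours are my own assumptions for illustration; any one-to-one assignment would do.

```python
import math

def bits_needed(num_outcomes: int) -> int:
    """Smallest fixed code length (in bits) that gives every outcome a unique code."""
    return math.ceil(math.log2(num_outcomes))

colours = ["blue", "red", "green", "yellow"]
width = bits_needed(len(colours))

# Assign each colour a fixed-width binary code (the ordering is arbitrary).
codes = {colour: format(i, f"0{width}b") for i, colour in enumerate(colours)}
print(width, codes)
```

When the number of outcomes is not a power of two, the ceiling rounds up, so five colours would already need three bits.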
What would be the average number of bits we would be using to send the information?
What is an average? In the world of probability, average or mean is the expected value of a probability distribution: $E[X] = \sum_x x \, P(x)$.
Expected Value can also be thought of this way: if we were to pick a ball at random from the box 1000 times, record the length of bits that was required to encode each message, and take an average of all those 1000 entries, we would get the expected value of the length of bits.
If all outcomes are equally likely, P(x) = 1/N, where N is the number of possible outcomes. And in that case, the expected value becomes a simple mean.
Let’s slightly change the setup of our example. Instead of having equal number of coloured balls, now we have 25 blue balls, 50 red balls, 13 green balls and 12 yellow balls. This example is much better to explain the rest of the concepts and since now you know what an expected value of a probability distribution is, we can follow that convention.
The expected value does not change because no matter which ball you choose, the number of bits you use is still 2.
But is the current coding scheme optimal? This is where Variable Length Coding comes into the picture. Let’s take a look at three different variable length coding schemes.
Now that we know how to calculate expected value of the length of bits, let’s calculate it for the three schemes. The one with the lowest length of bits should be the most economical one, right?
We are using 1 bit for Blue and Red and 2 bits each for Green and Yellow.
We are using 1 bit for Blue, 2 bits for Red and 3 bits each for Green and Yellow.
We are using 2 bits for Blue, 1 bit for Red and 3 bits each for Green and Yellow.
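The expected length of bits for the three schemes can be worked out with a few lines of Python. This is just a sketch that uses the code lengths described above and the 25/50/13/12 split of balls; the dictionary and function names are mine.

```python
# Probability of drawing each colour (25 blue, 50 red, 13 green, 12 yellow).
p = {"blue": 0.25, "red": 0.50, "green": 0.13, "yellow": 0.12}

# Code length in bits per colour for each scheme, as described in the text.
schemes = {
    "scheme_1": {"blue": 1, "red": 1, "green": 2, "yellow": 2},
    "scheme_2": {"blue": 1, "red": 2, "green": 3, "yellow": 3},
    "scheme_3": {"blue": 2, "red": 1, "green": 3, "yellow": 3},
}

def expected_length(prob, lengths):
    """Expected number of bits per message: sum of P(x) * length(x)."""
    return sum(prob[colour] * lengths[colour] for colour in prob)

for name, lengths in schemes.items():
    print(name, round(expected_length(p, lengths), 2))
# scheme_1 1.25
# scheme_2 2.0
# scheme_3 1.75
```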
Coding Scheme 1 is the obvious choice, right? This is where Variable Length Coding gets tricky. Suppose we pick a ball from the box and it is blue, so we send ‘0’ as a message to Australia. Before Australia gets a chance to read the message, we pick another ball and it is green, so we send ‘10’ to Australia. Now when Australia looks at the message queue, they will see ‘010’ there. If it was a fixed length code, we would know that there is a break every n symbols. But in the absence of that, ‘010’ can be interpreted as blue, green or as blue, red, blue. That is why a code should be uniquely decodable. A code is said to be uniquely decodable if two distinct strings of symbols never give rise to the same encoded bit string. This results in a scenario where you have to let go of a few codes for every extra symbol you add. Chris Olah has a great blog[2] explaining the concept.
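Here is a small sketch that enumerates every way a bit string can be decoded under a given codebook. The codes for blue (‘0’), red (‘1’) and green (‘10’) follow from the example above; the code for yellow and the whole second codebook are assumptions I made up so that the example runs.

```python
def all_decodings(bits, codebook):
    """Return every way to split `bits` into codewords from `codebook`."""
    if not bits:
        return [[]]
    results = []
    for symbol, code in codebook.items():
        if bits.startswith(code):
            for rest in all_decodings(bits[len(code):], codebook):
                results.append([symbol] + rest)
    return results

# Scheme 1 style codebook; the yellow codeword is assumed.
scheme_1 = {"blue": "0", "red": "1", "green": "10", "yellow": "11"}

# A hypothetical codebook where no codeword is a prefix of another.
prefix_free = {"red": "0", "blue": "10", "green": "110", "yellow": "111"}

print(all_decodings("010", scheme_1))       # two possible readings
print(all_decodings("10110", prefix_free))  # exactly one reading
```

With the first codebook the receiver cannot tell the two readings apart, while the second codebook leaves only one way to split the message.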
That leaves us with Coding Schemes 2 and 3. Both of them are uniquely decodable. The only difference between them is that in Scheme 2 we are using 1 bit for blue and 2 bits for red. Scheme 3 is the reverse. And we know that getting a red ball from the box is much more likely than getting a blue ball. So it makes sense to use the smaller length for the red ball, and that is why the expected value of the length of bits is lower for Scheme 3 than for Scheme 2.
Now you would be wondering why we are so concerned about the expected length of bits. This expected length of bits of the best possible coding scheme is what is called the Shannon Entropy or just Entropy. There is just one part of this which is incomplete. How do you calculate the optimum number of bits for a given problem?
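The easy answer is $\log_2 \frac{1}{P(x)}$ bits for an outcome x with probability P(x).
And extending it to the whole probability distribution P, we take the expected value: $H(P) = \sum_x P(x) \log_2 \frac{1}{P(x)} = -\sum_x P(x) \log_2 P(x)$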
In our example, it works out to be:
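Entropy, as the expected value of $\log_2 \frac{1}{P(x)}$, is easy to compute directly. The sketch below does it for both versions of our box; the function name `entropy` is my own.

```python
import math

def entropy(prob):
    """Shannon entropy in bits: -sum of P(x) * log2(P(x))."""
    return -sum(px * math.log2(px) for px in prob.values() if px > 0)

uniform = {"blue": 0.25, "red": 0.25, "green": 0.25, "yellow": 0.25}
skewed = {"blue": 0.25, "red": 0.50, "green": 0.13, "yellow": 0.12}

print(entropy(uniform))           # 2.0
print(round(entropy(skewed), 2))  # 1.75
```

The equal split needs the full 2 bits on average, while the skewed box can get by with fewer.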
For those who are interested to know how we arrived at the formula, read along.
While the intuition is useful to get the idea, you can’t really scale it. Every time you get a new problem, you can’t sit down with a paper and pen, figure out the best coding scheme and then calculate the entropy. That’s where the maths comes in.
One of the properties of uniquely decodable codes is something called a prefix property. No codeword should be the prefix of another codeword. This means that every time you choose a code with a shorter length, you are letting go of all the possible codes with that code as the prefix. If we take 01 as a code, we cannot use 011 or 01001, etc. So there is a cost incurred for selecting each code. Quoting from Chris Olah’s blog:
The cost of buying a codeword of length 0 is 1, all possible codewords – if you want to have a codeword of length 0, you can’t have any other codeword. The cost of a codeword of length 1, like “0”, is 1/2 because half of possible codewords start with “0”. The cost of a codeword of length 2, like “01”, is 1/4 because a quarter of all possible codewords start with “01”. In general, the cost of codewords decreases exponentially with the length of the codeword.
Chris Olah’s blog[2]
And that cost can be quantified as $\text{cost} = \frac{1}{2^L}$,
where L is the length of the message. Inverting the equation, we get: $L = \log_2 \frac{1}{\text{cost}}$
Now what is the cost? Our spend should be proportional to how often a particular outcome needs to be encoded. So we spend P(x) for a particular outcome x, and therefore the cost = P(x), which gives $L(x) = \log_2 \frac{1}{P(x)}$. (For better intuition on this cost read Chris Olah’s Blogpost[2])
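The cost idea can be checked numerically. In the sketch below, a codeword of length L costs $1/2^L$ and the total budget is 1; this budget constraint is what is usually known as the Kraft inequality. The function name is an assumption of mine, and the lengths are the ones from the three schemes discussed earlier.

```python
import math

def codeword_budget(lengths):
    """Total cost of a set of codeword lengths, where a length-L codeword costs 1 / 2**L."""
    return sum(2 ** -length for length in lengths)

print(codeword_budget([1, 1, 2, 2]))  # 1.5 -> over budget, cannot be uniquely decodable
print(codeword_budget([1, 2, 3, 3]))  # 1.0 -> exactly uses the budget
print(codeword_budget([2, 1, 3, 3]))  # 1.0 -> exactly uses the budget

# Spending cost = P(x) on each outcome gives the ideal (possibly fractional) lengths.
p = {"blue": 0.25, "red": 0.50, "green": 0.13, "yellow": 0.12}
ideal_lengths = {colour: math.log2(1 / px) for colour, px in p.items()}
print(ideal_lengths)
```

The ideal lengths are not always whole numbers, which is why a real code can only approach the entropy rather than always hit it exactly.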
In our example, now Australia also has a box with the same four coloured balls. But the distribution of the balls is different, as shown below (which is denoted by q)
Now Australia wants to do the same kind of experiment India did and send back the colour of the ball they picked. They decided to use the same coding scheme that was set earlier to carry out the communication.
Since the coding scheme is derived from the source distribution, the length of bits for each of the outcomes is the same as before. For example:
But what has changed is the frequency with which the different messages are used. So, now the colour of the ball which Australia draws is going to be Green 50% of the time, and so on. The coding scheme that we derived for the first use case (India to Australia) is not optimised for the second use case (Australia to India).
This is the Cross Entropy which can be formally defined as: $H(q, p) = -\sum_x q(x) \log_2 p(x)$
where p(x) is the probability of the distribution which was used to come up with the encoding scheme and q(x) is the probability of the distribution which is using them.
The Cross Entropy is always equal to or greater than the Entropy. In the best case scenario, the source and destination distributions are exactly the same and in that case, the Cross Entropy becomes Entropy because we can substitute q(x) with p(x). In Machine Learning terms, we can say that the predicted distribution is p(x) and the ground truth distribution is q(x). So the more similar the predicted and true distributions are, the closer the Entropy and Cross Entropy will be.
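A quick numerical sketch of the Cross Entropy. The distribution p is India’s box from earlier. For Australia’s box q, only the 50% share of green comes from the text; the other three values are placeholders I assumed so that the numbers add up to 1.

```python
import math

def cross_entropy(true_dist, coding_dist):
    """Expected bits when outcomes follow `true_dist` but code lengths were chosen for `coding_dist`."""
    return -sum(
        true_dist[x] * math.log2(coding_dist[x])
        for x in true_dist
        if true_dist[x] > 0
    )

p = {"blue": 0.25, "red": 0.50, "green": 0.13, "yellow": 0.12}   # India (coding scheme)
q = {"blue": 0.25, "red": 0.125, "green": 0.50, "yellow": 0.125}  # Australia (assumed values)

print(round(cross_entropy(q, p), 3))  # bits spent using India's scheme in Australia
print(round(cross_entropy(q, q), 3))  # entropy of q: the best Australia could do
```

The first number is larger than the second, which is the extra price of using a coding scheme built for a different box.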
Now that we have understood Entropy and Cross Entropy, we are in a position to understand another concept with a scary name. The name is so scary that even practitioners call it KL Divergence instead of the actual Kullback-Leibler Divergence. This monster from the depths of information theory is a pretty simple and easy concept to understand.
KL Divergence measures how much one distribution diverges from another. In our example, we use the coding scheme we used for communication from India to Australia to do the reverse communication also. In machine learning, this can be thought of as trying to approximate a true distribution(p) with a predicted distribution(q). But in doing this we are spending more bits than necessary for sending the message. Or, we are having some loss because we are approximating p with q. This loss of information is called KL Divergence.
Two key properties of KL Divergence are worth noting:
KL(P||Q) can be interpreted in the following ways:
Let’s take a look at how we arrive at the formula, which will enhance our understanding of KL Divergence.
We know that from our intuition that KL(P||Q) is the information lost when approximating P with Q. What would that be?
Just to recap the analogy of our example to an ML system to help you connect the dots better. Let’s use the situation where Australia is sending messages to India using the same coding scheme as was devised for messages from India to Australia.
We know the Expected Value of Bits when we use the coding scheme in Australia. It’s the Cross Entropy of Q w.r.t. P (where P is the predicted distribution and Q is the true distribution).
Now, there is some inefficiency as we are using a coding scheme devised for another probability distribution. What is the ideal case here? The predicted distribution is equal to the true distribution. That means using a coding scheme devised for Q to send the messages, and the expected number of bits in that case is nothing but the Entropy of Q.
Since we have the actual Cross Entropy and the ideal Entropy, the information lost due to the coding scheme is their difference, Cross Entropy minus Entropy: $-\sum_x q(x) \log_2 p(x) + \sum_x q(x) \log_2 q(x) = \sum_x q(x) \log_2 \frac{q(x)}{p(x)}$
This should also give you some intuition towards why this metric is not symmetric.
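The lack of symmetry is easy to see with numbers. In this sketch the first argument of `kl_divergence` is the distribution the data really follows and the second is the one used as an approximation. As before, p is India’s box, and q is Australia’s box with assumed values (only the 50% green is from the example).

```python
import math

def kl_divergence(true_dist, approx_dist):
    """Extra bits paid for coding `true_dist` with a scheme built for `approx_dist`."""
    return sum(
        t * math.log2(t / approx_dist[x])
        for x, t in true_dist.items()
        if t > 0
    )

p = {"blue": 0.25, "red": 0.50, "green": 0.13, "yellow": 0.12}
q = {"blue": 0.25, "red": 0.125, "green": 0.50, "yellow": 0.125}  # assumed values

print(round(kl_divergence(q, p), 3))  # Australia using India's scheme
print(round(kl_divergence(p, q), 3))  # India using Australia's scheme
print(kl_divergence(p, p))            # 0.0, no loss when the distributions match
```

The two directions give different values because the expectation is taken over a different distribution each time.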
Note: We have discussed the entire blog assuming X, the random variable, is a discrete variable. If it is a continuous variable, just replace the summation ($\sum$) with an integral ($\int$) and the formulae would work again.
Loss functions are at the heart of a Deep Learning system. The gradients that we propagate to adjust the weights originate from this loss function. The most popular loss functions in classification problems are derived from Information Theory, specifically Entropy.
In an N-way classification problem, the neural network, typically, has N output nodes(with an exception for a binary classification, which has just one node). These nodes’ outputs are also called the logits. Logits are the real valued output of a Neural Network before we apply an activation. If we pass these logits through a softmax layer, the outputs will be transformed into something resembling the probability(statisticians will be turning in their graves now) that the sample is that particular class.
Essentially, a softmax converts the raw logits to probabilities. And since we now have probabilities, we can calculate the Cross Entropy as we have reviewed earlier.
In Machine Learning terms, the Cross Entropy formula is: $CE = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$
where N is the number of samples, C is the number of classes, y is the true probability and y_hat ($\hat{y}$) is the predicted probability. In a typical case, y would be the one-hot representation of the target label (with zeroes everywhere and one for the correct label).
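Here is a minimal pure Python sketch of the softmax followed by the Cross Entropy, summed over samples and classes. The logits and labels are made up, and the function names are mine; real frameworks do the same thing on tensors and usually average over the batch.

```python
import math

def softmax(logits):
    """Convert raw logits to values that are positive and sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(y_true, y_pred):
    """Sum over samples and classes of -y * log(y_hat)."""
    return -sum(
        y * math.log(y_hat)
        for row_true, row_pred in zip(y_true, y_pred)
        for y, y_hat in zip(row_true, row_pred)
        if y > 0
    )

logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]  # made-up network outputs
y_true = [[1, 0, 0], [0, 1, 0]]              # one-hot targets

probs = [softmax(row) for row in logits]
print(probs)
print(cross_entropy_loss(y_true, probs))
```

Because the targets are one-hot, only the predicted probability of the correct class contributes to the loss for each sample.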
Likelihood, in non-technical parlance, is used interchangeably with probability. But in statistics, and by extension Machine Learning, it has a different meaning. Put very curtly, Probability is when we are talking about outcomes, and Likelihood is for hypotheses.
Let’s take an example of a normal distribution, with mean 30 and standard deviation 2.5. We can find the probability that the drawn value from the distribution is 32.
Now let’s reverse the situation, and consider that we do not know the underlying distribution. We just have the observations drawn from the distribution. In such a case, we can still calculate the probability like before assuming a particular distribution.
Now we are finding the likelihood that a particular observation is described by the assumed parameters. And when we maximise this likelihood, we would arrive at the parameters which best fit the drawn samples. This is the likelihood.
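A small sketch of the difference, using the normal distribution from the example. The observations and the candidate parameters are made up for illustration. Note that for a continuous distribution the value at a single point such as 32 is strictly a density rather than a probability.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Probability view: parameters are known, we ask about an outcome.
print(normal_pdf(32, mean=30, std=2.5))

# Likelihood view: observations are known, we ask about candidate parameters.
observations = [29.1, 31.4, 30.2, 32.0, 28.7]  # made-up samples

def likelihood(obs, mean, std):
    """Joint density of independent observations under the assumed parameters."""
    return math.prod(normal_pdf(x, mean, std) for x in obs)

candidates = [(25, 2.5), (30, 2.5), (35, 2.5)]  # hypothetical (mean, std) guesses
best = max(candidates, key=lambda params: likelihood(observations, *params))
print(best)  # (30, 2.5)
```

Among the three guesses, the one whose mean sits closest to the observations gets the highest likelihood.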
To summarize:
And in machine learning, since we try to estimate the underlying distribution from data, we always try to maximise the Likelihood and this is called Maximum Likelihood Estimation. In practice we maximise the log likelihood rather than the likelihood. The likelihood is: $\mathcal{L}(\theta) = \prod_{i=1}^{N} f(y_i \mid x_i; \theta)$
where N is the number of samples, and f is a function which gives the probability of y given covariates x and parameters $\theta$.
There are two difficulties when we deal with the above product term:
This is where Log Likelihood comes in. If we apply the log function, the product turns into a summation. And since log is a monotonic transformation, maximising log likelihood would also maximise the likelihood. So, as a matter of convenience, we almost always maximise log likelihood instead of the pure likelihood.
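A minimal sketch of why the log helps numerically, assuming 200 made-up per-sample probabilities of 0.01 each. The product is too small for a floating point number, while the sum of logs is perfectly manageable.

```python
import math

probabilities = [0.01] * 200  # made-up per-sample probabilities

product = math.prod(probabilities)                    # underflows to 0.0
log_likelihood = sum(math.log(p) for p in probabilities)

print(product)         # 0.0
print(log_likelihood)  # roughly -921
```

Once the product has collapsed to zero, every set of parameters looks equally bad, so there is nothing left to maximise.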
Let’s consider q as the true distribution and p as the predicted distribution. y is the target for a sample and x is the input features.
The ground truth distribution, typically, is a one-hot representation over the possible labels.
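For a sample $(x_i, y_i)$, the Cross Entropy is: $H_i = -\sum_{y \in Y} q(y \mid x_i) \log p(y \mid x_i)$
where Y is the set of all labels
The term $q(y \mid x_i)$ is zero for all the elements of Y except $y_i$ and 1 for $y_i$. So the term reduces to: $H_i = -\log p(y_i \mid x_i)$
where $y_i$ is the correct label for the sample.
Now summing up over all N samples, $H = -\sum_{i=1}^{N} \log p(y_i \mid x_i)$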
which is nothing but the negative of the log likelihood. So maximising log likelihood is equivalent to minimising the Cross Entropy.
In a binary classification, the neural network only has a single output, typically passed through a sigmoid layer. The sigmoid layer, as shown below, squashes the logits to a value between 0 and 1.
Therefore the output of a sigmoid layer acts like a probability for the event. So if we have two classes, 0 and 1, as the confidence of the network increases about the sample being ‘1’, the output of the sigmoid layer also becomes closer to 1, and vice versa.
Binary Cross Entropy (BCE) is a loss function specifically tailored for binary classification. Let’s start with the formula and then try to draw parallels to what we have learnt so far: $BCE = -\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$
Although the formula might look unfamiliar at first glance, this is just a special case of the Cross Entropy that we reviewed earlier. Unlike the multi-class output, we have just one output which is between 0 and 1. Therefore, to apply the Cross Entropy formula, we synthetically create two output nodes, one with y and another with 1-y (we know from the laws of probability that in a binary outcome, the probabilities sum to 1), and sum the Cross Entropy of all the outcomes.
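The special case can be verified with a short sketch. One function computes BCE directly from the single output, the other builds the two synthetic output nodes and applies the general Cross Entropy. The logits and labels are made up and the function names are mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_pred):
    """BCE summed over samples, using the single sigmoid output."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_pred)
    )

def two_node_cross_entropy(y_true, y_pred):
    """General Cross Entropy on two synthetic nodes: [p, 1 - p] against [y, 1 - y]."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        true_dist, pred_dist = [y, 1 - y], [p, 1 - p]
        total -= sum(t * math.log(q) for t, q in zip(true_dist, pred_dist))
    return total

logits = [2.0, -1.0, 0.5]  # made-up network outputs
y_true = [1, 0, 1]
y_pred = [sigmoid(z) for z in logits]

print(binary_cross_entropy(y_true, y_pred))
print(two_node_cross_entropy(y_true, y_pred))
```

Both functions print the same number, which is all the "special case" claim amounts to.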
We know KL Divergence is not symmetric. If p is the predicted distribution and q is the true distribution, there are two ways you can calculate KL Divergence: $D_{KL}(q \| p)$ and $D_{KL}(p \| q)$.
The first one is called the Forward KL Divergence. It gives you how much the predicted is diverging from the true distribution. The second one is called Backward KL Divergence. It gives you how much the true is diverging from the predicted distribution.
In supervised learning, we use the Forward KL Divergence because it can be shown that minimising the forward KL Divergence is equivalent to minimising the Cross Entropy (which we will detail in the maths section), and Cross Entropy is much easier to calculate. And for this reason, Cross Entropy is preferred over KL Divergence in most simple use cases. Variational Autoencoders and GANs are a few areas where KL Divergence becomes useful again.
Backward KL Divergence is used in Reinforcement Learning and encourages the optimisation to find the mode of the distribution, while Forward KL does the same for the mean. For more details on the Forward vs Backward KL Divergence, read the blogpost by Dibya Ghosh[3].
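We know that KL Divergence is the difference between Cross Entropy and Entropy: $D_{KL}(q \| p) = H(q, p) - H(q)$, where $H(q, p)$ is the Cross Entropy between the true distribution q and the predicted distribution p.
Therefore, our Cross Entropy Loss over N samples is: $\text{Loss} = \sum_{i=1}^{N} H(q_i, p_i) = \sum_{i=1}^{N} \left[ D_{KL}(q_i \| p_i) + H(q_i) \right]$
Now the optimisation objective is to minimise this loss by changing the parameters which parameterize the predicted distribution p. Since H(q) is the entropy of the true distribution, independent of the parameters, it can be considered as a constant. And while optimising, we know a constant is immaterial and can be ignored. So the Loss becomes: $\text{Loss} = \sum_{i=1}^{N} D_{KL}(q_i \| p_i)$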
So, to summarise, we started with the Cross Entropy loss and proved that minimising the Cross Entropy is equivalent to minimising the KL Divergence.
The way these losses are implemented in the popular Deep Learning Frameworks like PyTorch and Tensorflow is a little confusing. This is especially true for PyTorch.
All the Tensorflow 2.0 losses expect probabilities as the input by default, i.e. logits have to be passed through a Sigmoid or Softmax before being fed to the losses. But they provide a parameter, from_logits, which when set to True makes the loss accept logits and do the conversion to probabilities inside.
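To see what the two input conventions mean, here is an illustrative pure Python sketch for the binary case. It is not TensorFlow’s actual code; both function names are hypothetical. One function expects a probability, the other expects a raw logit and folds the sigmoid into the loss using a numerically stable rearrangement.

```python
import math

def bce_from_probability(y, p):
    """Loss when the caller has already applied the sigmoid."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(y, z):
    """Loss computed straight from the logit z: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z, y = 1.2, 1
p = 1.0 / (1.0 + math.exp(-z))
print(bce_from_probability(y, p), bce_from_logit(y, z))  # same value, two routes

# With an extreme logit, sigmoid(40) rounds to exactly 1.0 in floating point,
# so the probability route would hit log(0) for y = 0; the logit route is fine.
print(bce_from_logit(0, 40.0))  # 40.0
```

Feeding logits to a loss that expects probabilities (or the other way round) does not raise an error in most cases, it just silently gives the wrong number, which is why it pays to check which convention a loss uses.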
We have come to the end of the blog and I hope by now you have a much better understanding and intuition about the magical loss functions which make Deep Learning possible.
Now before writing about this topic, I did a quick Google Search to see how much of this is already covered and quickly observed a phenomenon that I see increasingly in the field – Data Science = Modelling, at best, Modelling + Data Processing. Open a MOOC, they talk about the different models and architectures, go to a bootcamp, they will make you write code to fit and train a machine learning model. While I understand why the MOOCs and bootcamps take this route (because these machine learning models are at the heart of data science), they sure have made it seem like machine learning models are the only thing in Data Science. But Data Science in practice is radically different. There are no curated datasets or crisply formatted notebooks, only a deluge of unorganized, unclean data, and complex processes. And to effectively practice Data Science there, you need to be a good programmer. Period.
Not even Jon Skeet – the Chuck Norris of Programming. It is an inevitable and irritating part of writing code. There is nothing like a long and obscure error message to bring down the high of coding up a particularly complex task. And therefore, being able to hunt down and deal with errors is an essential skill in a Data Scientist’s toolbox.
Machine Learning is slightly different from conventional programming. Traditional programming tries to solve a problem, and there will be a set of steps/logic which has to be translated into code and voila! it works. But Machine Learning is slightly frustrating in that aspect. You can do everything right, but still fail to achieve your intended result. And because of that, debugging a machine learning program is doubly hard.
An error in a machine learning code can come from two very distinct sources –
And as Andrej Karpathy alluded, the majority of errors are usually because of number 1. So before you delve deep into the mathematics, do a basic sanity check to see if you have coded it up right. This doesn’t depend on how good you are at programming or how experienced you are. Andrej Karpathy has a PhD in Computer Science from Stanford, went on to have a stint at OpenAI, and is now the Sr. Director of AI at Tesla. Now that’s an impressive profile, isn’t it? And he has written numerous blogs about a lot of topics which show a deep understanding of those. He even had a tweet pointing out common pitfalls in training neural networks. Guess what one of those pointers was? – “you forgot to .zero_grad()”
To summarize, even if you are Andrej Karpathy you will make errors and need to know how to debug to become effective. So in the rest of the blog post, I’m gonna cover pointers on how to debug a machine learning model, both from the programmatic side as well as the machine learning side. Before getting into the specifics, I’d like to discuss a couple of ‘mindsets’ which will help you go a long way.
In Dijkstra’s classic paper “On the Cruelty of Really Teaching Computing Science”, he argues the case for calling bugs errors, because it puts the blame squarely where it should reside – with the programmer and not on a gremlin who creeps up when we are sleeping and deletes a line, or an indentation. This change in vocabulary has a profound impact on how you approach a problem with your code. Before, the program was “almost correct” with some unexplained bugs which the hero programmer will find and fix. But once you start calling them errors, the program is just wrong and the programmer, who made the error, should find a way to correct it and themselves in the process. It went from a “me-against-the-world” action movie to a thoughtful and introspective drama about a man/woman who brings a profound change in their character through the course of the movie.
In one of his lectures, Jeremy Howard mentioned something profound, and it derives directly from mindset #1.
We all have this habit… when we find a bug, we go, “Uh! I’m an idiot!” So, don’t wait to find out. I already know I’m an idiot. Let’s start working on that assumption when you are debugging.
Jeremy Howard
If something in our code isn’t working, it means that something we thought worked in a particular way isn’t working that way. So you have to start with the understanding that you are wrong about something, which could be quite hard for some people. Experienced programmers go right back to the start and check every single step. But new programmers have a tendency to overestimate their confidence about a particular block of code and discard it from the check. So they will fly by blocks of code which they feel pretty confident about and zero in on a block of code by declaring – “I think the problem is x“. But debugging is never about “I think the problem is..”, but about starting with “I don’t know what the problem is, cause I made a mistake.”
How many times have you seen a traceback like the one above and thought – Read through all of that? Hell, No!. But what you would be letting go there is a gold mine of information and a damn good starting point for debugging.
I know the traceback looks scary (and I intentionally posted a non-formatted, multi-threaded traceback on a black screen for effect), but it doesn’t have to be. If you are using a jupyter notebook or any IDE worth its salt, the traceback will be formatted and will be a million times easier to read. And what if I told you that you don’t need to read the entire traceback and can still get good information from it?
Let’s break down the traceback a bit. Chad Hansen @ RealPython has done a good job of explaining the traceback and I urge you to take a look at that and skip the next paragraph (which is an excerpt from the original article). And for those of you who just want a summary, read ahead.
The first and foremost rule in reading a traceback is that you do it from bottom to top.
Blue box: The last line of the traceback is the error message line. This is your first clue as to what kind of error it is. Python has many type of built-in errors which lead to different kind of issues.
Green box: After the exception name is the error message. Pay very very close attention to this part because this is practically the answer to why it didn’t work. More often than not, the developers try to put in meaningful messages which can intuitively lead you to the problem.
Yellow box: Further up the traceback are the various function calls moving from bottom to top, most recent to least recent.
Red underline: The second line for these calls contains the actual code that was executed.
Chad Hansen @ RealPython
The Yellow box and the Red underline are key to localizing your error. But this particular aspect is what intimidates a lot of people, because this can be long, especially if you are using a library like scikit-learn or Tensorflow. But what is great about this part is that it gives you the filenames in which the error has occurred. Now to start with, you should be focusing on the part of the error that arose from your script, and consider the ones from the library as a consequence of a mistake that you have made (there are exceptions to this where there is actually a problem in the library, but the starting point of your analysis should always be introspection).
To summarize:
You should consider yourself lucky if the error in your code throws an exception and a helpful traceback. But many times, the error is not so superficial. It either does not throw an error, or manifests itself in a totally different form and raises an unexplained exception. It is such errors which are the hardest to debug.
These errors typically manifest themselves in the data processing pipeline in a data science project. And, since we do not have an exact location or line number where the error occurred, we need to find that place. Now, how do you find the error in a humongous piece of code? Reading from start to finish is a sure-shot way, but it’s a hugely inefficient process.
Computer Engineers will instantly tell you that when you go from Linear Search to a Binary Search, the time complexity goes from O(N) to O(log N). The irony is that these same engineers, who know this like the back of their hand, will resort to a linear search while debugging.
The process is pretty simple. Let’s take an example of a simple bug. The dtype of a column in a pandas dataframe is getting changed somewhere in the code and this is messing up the data processing pipeline.
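The idea can be sketched in code. This is a toy illustration under a few assumptions of mine: the pipeline is a list of step functions that each return a new object, the data is fine before any step and broken after the last one, and once the property breaks it stays broken. A plain list stands in for the dataframe column; with pandas, the check would instead look at something like `df[col].dtype`.

```python
def first_bad_step(steps, initial_state, is_ok):
    """Bisect over pipeline steps and return the index of the step that breaks `is_ok`."""
    def state_after(n):
        state = initial_state
        for step in steps[:n]:
            state = step(state)
        return state

    lo, hi = 0, len(steps)  # ok after `lo` steps, broken after `hi` steps
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_ok(state_after(mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1

# Hypothetical five-step pipeline on a column of integers.
steps = [
    lambda rows: [r + 1 for r in rows],
    lambda rows: [r * 2 for r in rows],
    lambda rows: [str(r) for r in rows],  # the culprit: int silently becomes str
    lambda rows: rows[:],
    lambda rows: rows[::-1],
]

def is_int_column(rows):
    return all(isinstance(r, int) for r in rows)

print(first_bad_step(steps, [1, 2, 3], is_int_column))  # 2
```

In practice you would do the same thing by hand: check the dtype halfway through the code, then halfway through whichever half is broken, and so on.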
Now that I’ve made a blasphemous statement and caught your attention, let me clarify. Jupyter Notebooks is a brilliant tool and I use it all the time, but for quick prototyping. Once you’ve made substantial progress in the coding process and you have a long block of code it becomes unwieldy.
Just imagine doing the binary search I described earlier in a notebook. If your code is split across different cells, you would be executing cell after cell, torturing your keyboard. And if your code is merged into a single cell, then you will be splitting cells, creating new cells to check, etc. It’s a major headache.
There is a fine line between debugging data science and debugging code and both of these processes require slightly different tools. If you are figuring out a kink in your model, where you have to iterate a lot and try different things, there is no better tool than Jupyter Notebooks. But if you have a hidden bug in something like a data pipeline, or feature engineering pipeline, ditch Jupyter Notebooks and use an IDE of your choice. My go-to tools are VSCode and Spyder, depending on specific use cases.
I’ve had many data scientists draw a blank face when I tell them to “step through” the code and debug it. And I notice their eyes widening when I tell them about this magical thing called the debugger, which lets you put break points, watch certain variables, step through the code line by line, and even put a conditional break point. And this debugger is there in almost all IDEs, and the functionalities remain the same on a high level. If you are one of those Data Scientists, I urge you to immediately check out debuggers, be it in VSCode, Spyder, PyCharm, or any other IDE that supports it.
In machine learning, an error can be because of programmatic logic or mathematics and it is important to be able to quickly diagnose and isolate the error source so that you don’t waste all day chasing an error.
I’m gonna take an example to make my point clear. I was training an LSTM for time series prediction and tinkering with the architecture to accommodate a few exogenous variables. Soon enough, I understood that the network was not training properly; the loss curve was not looking right. There could be many things that could have gone wrong here – the data processing for the LSTMs (which is always a honey trap for data leakage), the standardization that I was doing for the input matrices, the tinkering I was doing with the LSTM architecture. So, I just created a dummy time series as a function of the exogenous variables, turned off the standardization, and ran my data processing. When I checked the formatted arrays, they were alright. So I ruled out the data processing. Then I ran the data through the LSTM and watched the training loss go down. Perfect! The tinkering was also not the problem. Now that I had isolated the source of the error to the standardization part, I placed a few breakpoints in the standardization function, stepped through the function and figured out the dumb mistake that I made.
Have you ever run a model and hit 90% accuracy on a difficult problem and felt elated that you achieved such a stellar result without putting in much effort? But a little voice in your head is nagging you, telling you this is not possible. I’m here to hand an amplifier to that voice. Listen to it. More often than not, that voice got it right.
You must have stumbled on to the monster in data science called “data leakage”. Data Leakage is, as you all might know, when the data you are using to train a machine learning algorithm happens to have the information you are trying to predict, resulting in overstated performance. There are a lot of ways data leakage can happen in your model, some of which I’m listing below to help you kick start your train of thought-
If you are working on a classification problem with a high class imbalance, it has its own pitfalls.
One of the most common mistakes I see people make is in the choice of metric, especially when there is high class imbalance. I have sat through too many interviews where the candidate talks on and on about how they got a 90% accuracy in an xyz fraud detection type problem. Even after quite a bit of prodding and clue-dropping, most of these people don’t seem to make the connection to the Accuracy Paradox. When I remind them of Precision or Recall, I get a textbook answer or the formulae. At this point, I drop any further discussion, such as the bias of F1 scores to the majority class, etc.
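The Accuracy Paradox fits in a few lines. This is a made-up example: 10% of the transactions are fraud, and the "model" simply predicts "not fraud" for everything.

```python
# Made-up labels: 100 fraud cases (1) and 900 legitimate ones (0).
y_true = [1] * 100 + [0] * 900
y_pred = [0] * 1000  # a "model" that never flags anything

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy, precision, recall)  # 0.9 0.0 0.0
```

A 90% accuracy here says nothing more than "90% of the data is not fraud", while recall exposes that not a single fraud case was caught.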
Over- or undersampling is one of the ways to handle training on an imbalanced class (which is not ideal, because you are changing the inherent distribution of the problem). But here is the kicker: if you over- or undersample before you do a validation split, and measure the performance of your model on the resampled dataset, you are going down a dark and winding path towards model failure.
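A small sketch of why the order matters, using only the standard library. The labels, the split fraction, and both helper functions are hypothetical; the oversampler is a naive "duplicate minority rows at random" stand-in, not a specific library’s method.

```python
import random


def split_indices(n_rows, valid_fraction, seed):
    """Shuffle row positions and return (train, valid) position lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_valid = int(n_rows * valid_fraction)
    return indices[n_valid:], indices[:n_valid]


def oversample_minority(indices, labels, seed):
    """Duplicate minority (label 1) rows at random until the classes balance."""
    rng = random.Random(seed)
    minority = [i for i in indices if labels[i] == 1]
    majority = [i for i in indices if labels[i] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra


# Hypothetical imbalanced labels: 20 positives, 180 negatives.
labels = [1] * 20 + [0] * 180

# Wrong order: oversample first, then split the resampled rows.
resampled = oversample_minority(list(range(len(labels))), labels, seed=0)
train_pos, valid_pos = split_indices(len(resampled), valid_fraction=0.25, seed=0)
leaky_train = {resampled[p] for p in train_pos}
leaky_valid = {resampled[p] for p in valid_pos}
leaked_rows = leaky_train & leaky_valid  # original rows present on both sides

# Right order: split first, then oversample only the training part.
train_idx, valid_idx = split_indices(len(labels), valid_fraction=0.25, seed=0)
balanced_train = oversample_minority(train_idx, labels, seed=0)
clean_overlap = set(balanced_train) & set(valid_idx)

print(len(leaked_rows), len(clean_overlap))
```

In the wrong order, copies of the same original row land in both train and validation, so the validation score is measured partly on rows the model has already seen.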
The errors in your model will tell you exactly the story you need to make your model perform better. And the process of extracting this story from your results is Error Analysis. When I say Error Analysis, I mean two parts: an ablation study to identify the errors/benefits from each of the components of the system (preprocessing, dimensionality reduction, modelling, etc.), and the analysis of the results and the errors in them, like Andrew Ng tells us to do.
You quickly identify the impact of the components of your machine learning pipeline by turning them on or off, or substituting them with the ground truth. This is also a very quick way to identify a feature with leakage. If you see that the model is relying on one single feature too much, and dropping that feature affects the performance drastically, it should be a clue to start investigating data leakage for that variable. For a review of the state of the art methods in feature attribution, check my previous blog series on interpretability.
The other part of Error Analysis is what Andrew Ng teaches in his course on Coursera. It’s all about manually going through the misclassified cases, trying to identify a pattern, and eventually devising a plan of action to mitigate such a pattern. In the case of tabular data, this process can be carried out more efficiently by checking the performance across different categorical splits. Is your error concentrated on a particular type of sample? Maybe the model needs a new feature to help it learn those cases.
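As a rough illustration of slicing errors by a categorical column, here is a sketch with made-up validation results. The column name `occupation`, the labels, and the function name are all assumptions for the example.

```python
from collections import defaultdict


def error_rate_by_group(groups, y_true, y_pred):
    """Return the misclassification rate for each value of a categorical column."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        totals[group] += 1
        errors[group] += int(truth != pred)
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical validation results, sliced by a categorical column.
occupation = ["sales", "sales", "tech", "tech", "tech", "farming", "farming", "farming"]
y_true = [1, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]

rates = error_rate_by_group(occupation, y_true, y_pred)
# Sort so the slice with the highest error rate comes first.
worst_first = sorted(rates.items(), key=lambda item: item[1], reverse=True)
print(worst_first)
```

A slice that stands out at the top of this list is where I would start reading individual misclassified rows.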
In the case of regression problems, you can look at the error across different categorical splits, or plot a prediction vs. ground truth scatter chart or a residual plot to further analyze where your errors are arising from.
Let’s do a short tour of the different visualizations provided by the excellent library Yellowbrick.
We use confusion matrices to understand which classes are most easily confused, and not because we are ourselves confused by it. So, we identify the classes which are most easily confused and analyze them to find out why. For example, in a computer vision task, we look at the samples from these classes and check if a human can easily identify the difference between the two.
A Receiver Operating Characteristic/Area Under the Curve plot allows the user to visualize the tradeoff between the classifier’s sensitivity and specificity.
This plot is a twist on the conventional Confusion Matrix and can be used for the same purpose. For some reason, I always find this plot much more intuitive than the confusion matrix.
The residuals plot shows the difference between residuals on the vertical axis and the dependent variable on the horizontal axis, allowing you to detect regions within the target that may be susceptible to more or less error. This plot also shows the relation between the predicted value and how well we are able to predict that value. It also shows how the train residuals differ from the test residuals. Analyzing this plot will give you a lot of insights as to why a model is failing, or where it is failing.
A prediction error plot shows the actual targets from the dataset against the predicted values generated by our model. We can diagnose regression models using this plot by comparing against the 45 degree line, where the prediction exactly matches the actual target.
This is by no means an exhaustive list of things that can go wrong, nor is it meant to be. The intention of this article is just to get your thought process started, organize your debugging efforts to be more efficient, and put in place a mindset which is conducive to debugging. If I have done any of the above for even a single person reading the article, I consider this effort worthwhile.
Go forth and Debug!
As the name suggests, this is a model-agnostic technique to generate local explanations for the model. The core idea behind the technique is quite intuitive. Suppose we have a complex classifier with a highly non-linear decision boundary. If we zoom in and look at a single prediction, the behaviour of the model in that locality can be explained by a simple interpretable model (mostly linear).
LIME [2] uses a local surrogate model trained on perturbations of the data point we are investigating. This ensures that even though the explanation does not have global fidelity (faithfulness to the original model), it has local fidelity. The paper [2] also recognizes that there is an interpretability vs. fidelity trade-off and proposes a formal framework to express it.
$$\xi(x) = \underset{g \in G}{\operatorname{argmin}} \; L(f, g, \pi_x) + \Omega(g)$$
Here $\xi(x)$ is the explanation, $L(f, g, \pi_x)$ is the inverse of local fidelity (or how unfaithful g is in approximating f in the locality defined by the proximity kernel $\pi_x$), and $\Omega(g)$ is the complexity of the local model, g. In order to ensure both local fidelity and interpretability, we need to minimize the unfaithfulness (or maximize the local fidelity), keeping in mind that the complexity should be low enough for humans to understand.
Even though we can use any interpretable model as the local surrogate, the paper uses a Lasso regression to induce sparsity in explanations. The authors of the paper have restricted their explorations to the fidelity of the model and kept the complexity as a user input. In the case of a Lasso regression, it is the number of features to which the explanation should be attributed.
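To get a feel for the "perturb, weight by proximity, fit a simple model" idea, here is a toy sketch with a single feature. It is not the LIME library’s implementation: the black box is a made-up function, the surrogate is a plain weighted least-squares line rather than a Lasso, and the sampling scale and kernel width are arbitrary assumptions.

```python
import math
import random


def black_box(x):
    """Stand-in for a complex model: a non-linear function of one feature."""
    return x ** 2


def local_linear_explanation(f, x0, n_samples=2000, sample_scale=0.5,
                             kernel_width=0.75, seed=0):
    rng = random.Random(seed)
    # 1. Perturb the instance we want to explain.
    xs = [x0 + rng.gauss(0.0, sample_scale) for _ in range(n_samples)]
    # 2. Ask the black box for predictions on the perturbed points.
    ys = [f(x) for x in xs]
    # 3. Weight each perturbed point by its proximity to x0 (exponential kernel).
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]
    # 4. Fit a weighted least-squares line: this is the local surrogate g.
    w_sum = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = local_linear_explanation(black_box, x0=3.0)
print(f"local slope near x=3: {slope:.2f}")
```

The global function is a parabola, yet around x = 3 the surrogate’s slope lands close to the local gradient of 6, which is the "local fidelity" the paper talks about.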
One additional aspect they have explored and proposed a solution (one which has not got a lot of popularity) to is the challenge of providing a global explanation using a set of individual instances. They call it “Submodular Pick for Explaining Models”. It is essentially a greedy optimization which tries to pick a few instances from the whole lot which maximizes something they call “non-redundant coverage”. Non redundant coverage makes sure that the optimization is not picking instances with similar explanations.
The advantages of the technique are:
The implementation by the paper’s authors is available on GitHub as well as an installable package on pip. Before we take a look at how to use it, let’s discuss a few quirks in the implementation which you should know before running with it (focusing on the tabular explainer).
The main steps are as follows
The key things you need to keep in mind are:
Now, let’s continue with the same dataset we have been working with in the last part and see LIME in action.
    import lime
    import lime.lime_tabular

    # Creating the Lime Explainer
    # Be very careful in setting the order of the class names
    lime_explainer = lime.lime_tabular.LimeTabularExplainer(
        X_train.values,
        training_labels=y_train.values,
        feature_names=X_train.columns.tolist(),
        feature_selection="lasso_path",
        class_names=["<50k", ">50k"],
        discretize_continuous=True,
        discretizer="entropy",
    )

    # Now let's pick a sample case from our test set.
    row = 345
    exp = lime_explainer.explain_instance(X_test.iloc[row], rf.predict_proba, num_features=5)
    exp.show_in_notebook(show_table=True)
For variety, let’s look at another example. One which the model mis-classified.
Example 1
Example 2
As mentioned earlier, there is another technique mentioned in the paper called “submodular pick” to find a handful of explanations which try to explain most of the cases. Let’s try to get that as well. This particular part of the Python library is not so stable, and the example notebooks provided were giving me errors. But after spending some time reading through the source code, I figured a way out of the errors.
    from lime import submodular_pick

    sp_obj = submodular_pick.SubmodularPick(
        lime_explainer,
        X_train.values,
        rf.predict_proba,
        sample_size=500,
        num_features=10,
        num_exps_desired=5,
    )

    # Plot the 5 explanations
    [exp.as_pyplot_figure(label=exp.available_labels()[0]) for exp in sp_obj.sp_explanations];

    # Make it into a dataframe
    W_pick = pd.DataFrame(
        [dict(this.as_list(this.available_labels()[0])) for this in sp_obj.sp_explanations]
    ).fillna(0)
    W_pick['prediction'] = [this.available_labels()[0] for this in sp_obj.sp_explanations]

    # Making a dataframe of all the explanations of sampled points
    W = pd.DataFrame(
        [dict(this.as_list(this.available_labels()[0])) for this in sp_obj.explanations]
    ).fillna(0)
    W['prediction'] = [this.available_labels()[0] for this in sp_obj.explanations]

    # Plotting the aggregate importances
    np.abs(W.drop("prediction", axis=1)).mean(axis=0).sort_values(ascending=False).head(
        25
    ).sort_values(ascending=True).iplot(kind="barh")

    # Aggregate importances split by classes
    grped_coeff = W.groupby("prediction").mean()
    grped_coeff = grped_coeff.T
    grped_coeff["abs"] = np.abs(grped_coeff.iloc[:, 0])
    grped_coeff.sort_values("abs", inplace=True, ascending=False)
    grped_coeff.head(25).sort_values("abs", ascending=True).drop("abs", axis=1).iplot(
        kind="barh", bargap=0.5
    )
There are two charts where we have aggregated the explanations across the 500 points we sampled from our test set (we could run it on all the test data points, but chose to sample only because of the computation involved).
The first chart aggregates the effect of the feature across >50k and <50k cases and ignores the sign when calculating the mean. This gives you an idea of what features were important in the larger sense.
The second chart splits the inference across the two labels and looks at them separately. This chart lets us understand which feature was more important in predicting a particular class.
Along with these, the submodular pick also returns (in fact, this is the main purpose of the module) a set of n data points from the dataset which best explain the model. We can look at it as a representative sample of the different cases in the dataset. So if we need to explain a few cases from the model to someone, this gives us the cases which will cover the most ground.
From the looks of it, this looks like a very good technique, doesn’t it? But it is not without its problems.
The biggest problem here is the correct definition of the neighbourhood, especially in tabular data. For images and text, it is more straightforward. Since the authors of the paper left the kernel width as a hyperparameter, choosing the right one is left to the user. But how do you tune the parameter when you don’t have a ground truth? You’ll just have to try different widths, look at the explanations, and see if they make sense. Tweak them again. But at what point are we crossing the line into tweaking the parameters to get the explanations we want?
Another main problem is similar to the problem we have with permutation importance (Part II). When sampling for the points in the locality, the current implementation of LIME uses a Gaussian distribution, which ignores the relationship between the features. This can create the same ‘unlikely’ data points on which the explanation is learnt.
And finally, the choice of a linear interpretable model for local explanations may not hold true for all the cases. If the decision boundary is too non-linear, the linear model might not explain it well (local fidelity may be low).
Before we discuss how Shapley values can be used for machine learning model explanation, let’s try to understand what they are. And for that, we have to take a brief detour into Game Theory.
Game Theory is one of the most fascinating branches of mathematics, dealing with mathematical models of strategic interaction among rational decision-makers. When we say game, we do not mean just chess or, for that matter, Monopoly. A game can be generalized to any situation where two or more players/parties are involved in a decision or series of decisions to better their position. When you look at it that way, its application extends to war strategies, economic strategies, poker, pricing strategies, negotiations and contracts; the list is endless.
But since our topic of focus is not Game Theory, we will just go over some major terms so that you’ll be able to follow the discussion. The parties participating in a game are called players. The different actions these players can take are called choices. If there is a finite set of choices for each player, there is also a finite set of combinations of choices across the players. If each player plays a choice, it results in an outcome, and if we quantify those outcomes, it’s called a payoff. And if we list all the combinations and the payoffs associated with them, it’s called a payoff matrix.
There are two paradigms in Game Theory: non-cooperative and cooperative games. Shapley values are an important concept in cooperative games. Let’s try to understand them through an example.
Alice, Bob, and Celine share a meal. The bill came to be 90, but they didn’t want to go dutch. So to figure out what they each owe, they went to the same restaurant multiple times in different combinations and recorded how much the bill was.
Now with this information, we do a small mental experiment. Suppose A goes to the restaurant, then B shows up, and then C shows up. For each person who joins, we can work out the extra cash (marginal contribution) that person has to put in. We start with 80 (which is what A would have paid eating alone). Now when B joins, we look at the payoff when A and B ate together, which is also 80. So the additional contribution B brought to the coalition is 0. And when C joins, the total payoff is 90. That makes the marginal contribution of C 10. So, the contributions when A, B, and C join in that order are (80, 0, 10). Now we repeat this experiment for all orderings of the three friends.
Now that we have all possible orders of arrival, we have the marginal contributions of all the players in all situations. The expected marginal contribution of each player is the average of their marginal contributions across all orderings. For example, the marginal contribution of A would be (80+80+56+16+5+70)/6 = 51.17. And if we calculate the expected marginal contributions of each of the players and add them together, we will get 90, which is the total payoff if all three ate together.
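The "average the marginal contributions over every arrival order" recipe can be sketched in a few lines. Only three of the bills below come from the text (A alone = 80, A+B = 80, A+B+C = 90); the other amounts are placeholders I made up so the example runs, so the resulting shares are illustrative and will not match the figures quoted above.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all arrival orders."""
    totals = {player: 0.0 for player in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for player in order:
            joined = coalition | {player}
            # Marginal contribution: how much the bill grows when `player` joins.
            totals[player] += value[joined] - value[coalition]
            coalition = joined
    return {player: total / len(orders) for player, total in totals.items()}


# Bill for every coalition. A, AB and ABC are from the text; the rest are
# made-up placeholders for illustration only.
bills = {
    frozenset(): 0,
    frozenset("A"): 80,
    frozenset("B"): 56,
    frozenset("C"): 70,
    frozenset("AB"): 80,
    frozenset("AC"): 85,
    frozenset("BC"): 72,
    frozenset("ABC"): 90,
}

shares = shapley_values(["A", "B", "C"], bills)
print(shares, sum(shares.values()))
```

Whatever placeholder bills you plug in, the shares always add up to the bill of the full group, which is the Efficiency property.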
You must be wondering what all this has to do with machine learning and interpretability. A lot. If we think about it, a machine learning prediction is like a game, where the different features (players) play together to bring about an outcome (prediction). And since the features work together, with interactions between them, to make the prediction, this becomes a case of cooperative games. This is right up the alley of Shapley values.
But there is just one problem. Calculating all possible coalitions and their outcomes quickly becomes infeasible as the number of features increases. Therefore, in 2013, Erik Štrumbelj et al. proposed an approximation using Monte-Carlo sampling. In this construct, the payoff is modelled as the difference in predictions of different Monte-Carlo samples from the mean prediction.
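$$\hat{\phi}_j = \frac{1}{M} \sum_{m=1}^{M} \left( f(x^{m}_{+j}) - f(x^{m}_{-j}) \right)$$
where f is the black-box machine learning model we are trying to explain, x is the instance we are trying to explain, j is the feature for which we are trying to find the expected marginal contribution, $x^{m}_{+j}$ and $x^{m}_{-j}$ are two instances of x which we have permuted randomly by sampling another point from the dataset itself (one keeps the value of feature j from x, the other takes it from the sampled point), and M is the number of samples we draw from the training set.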
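Here is a minimal sketch of that sampling procedure as I understand it, in plain Python. The function name, the toy linear model, and the background rows are all assumptions for illustration; a real use would pass a trained model’s prediction function and rows from the training set.

```python
import random


def sampled_shapley(f, x, j, background, n_samples=1000, seed=0):
    """Monte-Carlo estimate of the Shapley value of feature j for instance x."""
    rng = random.Random(seed)
    n_features = len(x)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice(background)       # random instance from the data
        order = list(range(n_features))
        rng.shuffle(order)               # random feature ordering
        position = order.index(j)
        x_plus = list(z)
        x_minus = list(z)
        # Features that come before j in the ordering take their value from x
        # in both instances; features after j keep the value from z.
        for k in order[:position]:
            x_plus[k] = x[k]
            x_minus[k] = x[k]
        # The two instances differ only in feature j.
        x_plus[j] = x[j]
        total += f(x_plus) - f(x_minus)
    return total / n_samples


def model(v):
    """Hypothetical black box: a simple linear model."""
    return 2.0 * v[0] + 3.0 * v[1]


background = [[1.0, 1.0], [3.0, 0.0], [0.0, 2.0]]
instance = [4.0, 2.0]
phi = [sampled_shapley(model, instance, j, background) for j in range(2)]
print(phi)
```

For a linear model each feature’s estimate hovers around its coefficient times the distance of the feature value from the background average, which is a handy sanity check before pointing this at a real model.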
Let’s look at a few desirable mathematical properties of Shapley values which make them very attractive for interpretability applications. Shapley values are the only attribution method which satisfies the properties Efficiency, Symmetry, Dummy, and Additivity. And satisfying these together is considered to be the definition of a fair payout.
While all of the properties make this a desirable way of feature attribution, one in particular has a far-reaching effect: Additivity. For an ensemble model like a Random Forest or Gradient Boosting, this property guarantees that if we calculate the Shapley values of the features for each tree individually and average them, we get the Shapley values for the ensemble. This property can be extended to other ensemble techniques like model stacking or model averaging as well.
We will not be reviewing the algorithm or the implementation of Shapley values for two reasons:
SHAP (SHapley Additive exPlanations) puts forward a unified approach to interpreting model predictions. Scott Lundberg et al. propose a framework which unifies six previously existing feature attribution methods (including LIME and DeepLIFT), and they present their framework as an additive feature attribution model:
$$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i$$
where $z' \in \{0, 1\}^M$ is the simplified (binary) representation of the input, M is the number of simplified input features, and $\phi_i$ is the attribution of feature i.
They show that each of these methods can be formulated as the equation above and the Shapley values can be calculated easily, which brings with it a few guarantees. Even though the paper mentions slightly different properties than the ones we listed for Shapley values, in principle they are the same. This provides a strong theoretical foundation to the techniques (like LIME) once adapted to this framework of estimating the Shapley values. In the paper, the authors have proposed a novel model-agnostic way of approximating the Shapley values called Kernel SHAP (LIME + Shapley values) and some model-specific methods like DeepSHAP (which is the adaptation of DeepLIFT, a method for estimating feature importance for neural networks). In addition, they have also shown that for linear models, the Shapley values can be approximated directly from the model’s weight coefficients, if we assume feature independence. And in 2018, Scott Lundberg et al. [6] proposed another extension to the framework which accurately calculates the Shapley values of tree-based ensembles like Random Forest or Gradient Boosting.
Even though it’s not super intuitive from the equation below, LIME is also an additive feature attribution method. And for an additive feature explanation method, Scott Lundberg et al. showed that the only solution that satisfies the desired properties is the Shapley values. That solution depends on the loss function $L$, the weighting kernel $\pi_x$, and the regularization term $\Omega$.
$$\xi = \underset{g \in G}{\operatorname{argmin}} \; L(f, g, \pi_x) + \Omega(g)$$
If you remember, when we discussed LIME, I mentioned that one of the disadvantages is that it left the kernel function and kernel distance as hyperparameters, and they are chosen using a heuristic. Kernel SHAP does away with that uncertainty by proposing a Shapley kernel and a corresponding loss function which make sure the solution to the equation above results in Shapley values and enjoys the mathematical guarantees that come with them.
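For the curious, the Shapley kernel can be written down in a couple of lines. This is my reading of the weighting in the Kernel SHAP paper, where a coalition with |z'| of the M features switched on gets the weight (M - 1) / (C(M, |z'|) · |z'| · (M - |z'|)); the function name is my own and the snippet is only a sketch of the weights, not of the full regression.

```python
from math import comb


def shapley_kernel_weight(n_features, subset_size):
    """Weight of a coalition with `subset_size` of `n_features` features present."""
    if subset_size == 0 or subset_size == n_features:
        # The empty and the full coalition get infinite weight in theory;
        # implementations typically handle them as constraints instead.
        return float("inf")
    return (n_features - 1) / (
        comb(n_features, subset_size) * subset_size * (n_features - subset_size)
    )


# Weights for a model with 4 features, by coalition size.
weights = {size: shapley_kernel_weight(4, size) for size in range(1, 4)}
print(weights)
```

Notice that the weight depends only on the coalition size and is largest for very small and very large coalitions, which are the ones that tell us the most about an individual feature’s effect.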
Tree SHAP, as mentioned before [6], is a fast algorithm which computes the exact Shapley values for decision tree based models. In comparison, Kernel SHAP only approximates the Shapley values and is much more expensive to compute.
Let’s try to get some intuition of how it is calculated, without going into a lot of mathematics(Those of you who are mathematically inclined, the paper is linked in the references, Have a blast!).
We will talk about how the algorithm works for a single tree first. If you remember the discussion about Shapley values, you will remember that to calculate them accurately we need the predictions conditioned on all subsets of the feature vector of an instance. So let the feature vector of the instance we are trying to explain be x and the subset of features for which we need the expected prediction be S.
Below is an artificial Decision Tree which uses just three features, age, hours_worked, and marital_status to make the prediction about earning potential.
And now that you have the expected predictions of all subsets in one Decision Tree, you repeat that for all the trees in an ensemble. And remember the additivity property of Shapley values? It lets you aggregate them across the trees in an ensemble by calculating an average of the Shapley values across all the trees.
But now the problem is that these expected values have to be calculated for all possible subsets of features, in all the trees, and for all features. The authors of the paper proposed an algorithm which pushes all possible subsets of features down the tree at the same time. The algorithm is quite complicated, and I refer you to the paper linked in the references for the details.
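The building block here, the expected prediction of one tree when only the features in S are known, can be sketched with a short recursion: follow the split when the feature is in S, otherwise average both children weighted by how many training samples went each way. The tree below is a made-up two-feature example (thresholds, leaf values, and sample counts are all hypothetical), and this is the slow, per-subset version, not the paper’s fast algorithm.

```python
def expected_value(node, x, known):
    """Expected prediction of a tree for x when only features in `known` are set."""
    if "value" in node:
        return node["value"]  # leaf
    left, right = node["left"], node["right"]
    if node["feature"] in known:
        # We know this feature, so follow the decision path.
        child = left if x[node["feature"]] <= node["threshold"] else right
        return expected_value(child, x, known)
    # Feature unknown: average both branches, weighted by training sample counts.
    total = left["cover"] + right["cover"]
    return (
        left["cover"] * expected_value(left, x, known)
        + right["cover"] * expected_value(right, x, known)
    ) / total


# Hypothetical tree; "cover" is the number of training samples at that node.
tree = {
    "feature": "age", "threshold": 30, "cover": 100,
    "left": {"value": 0.1, "cover": 40},
    "right": {
        "feature": "hours_worked", "threshold": 40, "cover": 60,
        "left": {"value": 0.3, "cover": 20},
        "right": {"value": 0.8, "cover": 40},
    },
}

x = {"age": 45, "hours_worked": 50}
for known in [set(), {"age"}, {"hours_worked"}, {"age", "hours_worked"}]:
    print(sorted(known), round(expected_value(tree, x, known), 4))
```

With no features known we get the average prediction of the tree, and with all features known we get the ordinary prediction; the subsets in between are what the Shapley calculation needs.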
We will only be looking at TreeSHAP in this section for two reasons:
    import shap

    # load JS visualization code to notebook
    shap.initjs()
    explainer = shap.TreeExplainer(model=rf, model_output='margin')
    shap_values = explainer.shap_values(X_test)
These lines of code calculate the Shapley values. Even though the algorithm is fast, this will still take some time.
Now let’s look at individual explanations. We will take the same cases as LIME. There are multiple ways of plotting the individual explanations in SHAP library – Force Plots and Decision Plots. Both are very intuitive to understand the different features playing together to arrive at the prediction. If the number of features is too large, Decision Plots hold a slight advantage in interpreting.
    shap.force_plot(
        base_value=explainer.expected_value[1],
        shap_values=shap_values[1][row],
        features=X_test.iloc[row],
        feature_names=X_test.columns,
        link="identity",
        out_names=">50k",
    )

    # We provide new_base_value as the cutoff probability for the classification model
    # This is done to increase the interpretability of the plot
    shap.decision_plot(
        base_value=explainer.expected_value[1],
        shap_values=shap_values[1][row],
        features=X_test.iloc[row],
        feature_names=X_test.columns.tolist(),
        link="identity",
        new_base_value=0.5,
    )
Now, we will check the second example.
Example 1
Example 2
The SHAP library also provides easy ways to aggregate and plot the Shapley values for a set of points (in our case, the test set) to get a global explanation for the model.
    # Summary Plot as a bar chart
    shap.summary_plot(shap_values=shap_values[1], features=X_test, max_display=20, plot_type='bar')

    # Summary Plot as a dot chart
    shap.summary_plot(shap_values=shap_values[1], features=X_test, max_display=20, plot_type='dot')

    # Dependence Plots (analogous to PDP)
    # create a SHAP dependence plot to show the effect of a single feature across the whole dataset
    shap.dependence_plot("education_num", shap_values=shap_values[1], features=X_test)
    shap.dependence_plot("age", shap_values=shap_values[1], features=X_test)
As always, there are disadvantages that we should be aware of to effectively use the technique. If you were following along to find the perfect technique for explainability, I’m sorry to disappoint you. Nothing in life is perfect. So, let’s dive into the shortcomings.
Some of the techniques we discussed are also applicable to text and image data. Although we will not be going deep, I’ll link to some notebooks which show you how to do it.
We have come to the end of our journey through the world of explainability. Explainability and Interpretability are catalysts for business adoption of Machine Learning (including Deep Learning), and the onus is on us practitioners to make sure these aspects get addressed with reasonable effectiveness. It’ll be a long time before humans trust machines blindly, and till then we will have to supplement the excellent performance with some kind of explainability to develop trust.
If this series of blog posts enabled you to answer at least one question about your model I consider my endeavor a success.
Full Code is available in my Github
Blog Series
For this exercise, I have chosen the Adult dataset, a.k.a. the Census Income dataset. Census Income is a pretty popular dataset which has demographic information like age and occupation, along with a column which tells us if the income of the particular person is >50k or not. We are using this column to run a binary classification using Random Forest. The reasons for choosing Random Forest are twofold:
Now let’s look at techniques to do post-hoc interpretation to understand our black box models. All through the rest of the blog, the discussion will be based on machine learning models (and not deep learning) and on structured data. While many of the methods here are model agnostic, since there are a lot of specific ways to interpret deep learning models, especially on unstructured data, we leave that out of our current scope. (Maybe another blog, another day.)
A Random Forest algorithm was tuned and trained on the data with 83.58% performance. It is a decent score considering the best scores vary from 78-86% depending on the way you model and the test set you use. But for our purposes, the model performance is more than enough.
This is by far the most popular way of explaining a tree-based model and its ensembles. A lot of it is because of scikit-learn and its easy-to-use implementation of the same. Fitting a Random Forest or a Gradient Boosting Model and plotting the “feature importance” has become the most used and abused technique among Data Scientists.
The mean decrease in impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within any ensemble Decision Tree method(Random Forest, Gradient Boosting, etc.).
The advantages of the technique are:
scikit-learn implements this by default in the “feature importance” of tree-based models. So retrieving them and plotting the top 25 features is very simple.
    feat_imp = pd.DataFrame(
        {
            'features': X_train.columns.tolist(),
            "mean_decrease_impurity": rf.feature_importances_,
        }
    ).sort_values('mean_decrease_impurity', ascending=False)
    feat_imp = feat_imp.head(25)
    feat_imp.iplot(
        kind='bar',
        y='mean_decrease_impurity',
        x='features',
        yTitle='Mean Decrease Impurity',
        xTitle='Features',
        title='Mean Decrease Impurity',
    )
We can also retrieve and plot the mean decrease in impurity of each tree as a box plot.
    # get the feature importances from each tree and then visualize the
    # distributions as boxplots
    all_feat_imp_df = pd.DataFrame(
        data=[tree.feature_importances_ for tree in rf],
        columns=X_train.columns,
    )
    order_column = all_feat_imp_df.mean(axis=0).sort_values(ascending=False).index.tolist()
    all_feat_imp_df[order_column[:25]].iplot(
        kind='box', xTitle='Features', yTitle='Mean Decrease Impurity'
    )
Let’s take a look at what fnlwgt and random are.
Now, surely, these features cannot be more important than other features like occupation, work_class, sex etc. If that is the case, then something is wrong.
Of course… there is. The mean decrease in impurity measure is a biased measure of feature importance. It favours continuous features and features with high cardinality. In 2007, Strobl et al. [1] also pointed out in Bias in random forest variable importance measures: Illustrations, sources and a solution that “the variable importance measures of Breiman’s original Random Forest method … are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.”
Let’s try to understand why it is biased. Remember how the mean decrease in impurity is calculated? Each time a node is split on a feature, the decrease in Gini index is recorded. And when a feature is continuous, or has high cardinality, the feature may be split many more times than other features. This inflates the contribution of that particular feature. And what do our two culprit features have in common? They are both continuous variables.
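To see what gets recorded at each split, here is a bare-bones sketch of the Gini decrease for a single node, with toy labels I made up. An actual ensemble implementation sums these decreases per feature over all the nodes and trees (with its own weighting and normalization, which I am not reproducing here).

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 0, 1, 1, 1]
perfect = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])  # clean separation
weak = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])     # barely helps
print(perfect, round(weak, 4))
```

A feature that offers many candidate thresholds gets many more chances to collect even small decreases like the second one, and those add up across hundreds of trees.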
Drop Column feature importance is another intuitive way of looking at feature importance. As the name suggests, it’s a way of iteratively removing a feature and calculating the difference in performance.
The advantages of the technique are:
    import numpy as np
    import pandas as pd
    from sklearn.base import clone
    from sklearn.model_selection import cross_val_score

    def dropcol_importances(rf, X_train, y_train, cv=3):
        # Baseline: cross-validated accuracy with all the features
        rf_ = clone(rf)
        rf_.random_state = 42
        baseline = cross_val_score(rf_, X_train, y_train, scoring='accuracy', cv=cv)
        imp = []
        for col in X_train.columns:
            # Retrain without this column and record the drop in accuracy per fold
            X = X_train.drop(col, axis=1)
            rf_ = clone(rf)
            rf_.random_state = 42
            score = cross_val_score(rf_, X, y_train, scoring='accuracy', cv=cv)
            imp.append(baseline - score)
        imp = np.array(imp)
        importance = pd.DataFrame(imp, index=X_train.columns)
        importance.columns = ["cv_{}".format(i) for i in range(cv)]
        return importance
Let’s do a 50-fold cross validation to estimate our score. (I know it’s excessive, but let’s keep it to increase the samples for our boxplot.) Like before, we are plotting the mean decrease in accuracy as well as the boxplot to understand the distribution across cross validation trials.
    drop_col_imp = dropcol_importances(rf, X_train, y_train, cv=50)
    drop_col_importance = pd.DataFrame(
        {
            'features': X_train.columns.tolist(),
            "drop_col_importance": drop_col_imp.mean(axis=1).values,
        }
    ).sort_values('drop_col_importance', ascending=False)
    drop_col_importance = drop_col_importance.head(25)
    drop_col_importance.iplot(
        kind='bar',
        y='drop_col_importance',
        x='features',
        yTitle='Drop Column Importance',
        xTitle='Features',
        title='Drop Column Importances',
    )

    all_feat_imp_df = drop_col_imp.T
    order_column = all_feat_imp_df.mean(axis=0).sort_values(ascending=False).index.tolist()
    all_feat_imp_df[order_column[:25]].iplot(
        kind='box', xTitle='Features', yTitle='Drop Column Importance'
    )
As expected, fnlwgt was much less important than we were led to believe from the Mean Decrease in Impurity importance. The high position of random perplexed me a little bit, so I re-ran the importance calculation considering all the one-hot features of a variable as one, i.e. dropping all the occupation columns together and checking the predictive power of occupation. When I do that, I can see random and fnlwgt rank lower than occupation and workclass. At the risk of making the post bigger than it already is, let’s leave that investigation for another day.
So, have we got the perfect solution? The results are aligned with the Mean Decrease in Impurity, they make coherent sense, and they can be applied to any model.
The kicker here is the computation involved. To carry out this kind of importance calculation, you have to train a model multiple times, once for each feature you have, and repeat that for the number of cross validation loops you want to do. Even if you have a model which trains in under a minute, the time required to calculate this explodes as you have more features. To give you an idea, it took me 2 hr 44 mins to calculate the feature importance with 36 features and 50 cross validation loops (which, of course, can be improved with parallel processing, but you get the point). And if you have a large model which takes two days to train, then you can forget about this technique.
Another concern I have with this method is that since we are retraining the model every time with a new set of features, we are not doing a fair comparison. When we remove one column and train the model again, it will find another way to derive the same information if it can, and this gets amplified when there are collinear features. So we are mixing two things when we investigate: the predictive power of the feature and the way the model configures itself.
The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled [2]. This technique measures the difference in performance if you permute or shuffle a feature vector. The key idea is that a feature is important, if the model performance drops if that feature is shuffled.
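The idea fits in a few lines of plain Python. This is a bare-bones sketch on made-up data with a toy prediction function standing in for a trained model; the function names and the choice of accuracy as the score are my assumptions, and it is not the implementation from any of the libraries mentioned in this post.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_rounds=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_rounds):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(y, predict(X_perm)))
        importances.append(sum(drops) / n_rounds)
    return importances


def toy_predict(X):
    """Hypothetical model: looks only at the first feature."""
    return [int(row[0] > 0.5) for row in X]


data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = toy_predict(X)  # labels generated by the same rule, so baseline accuracy is 1.0

importances = permutation_importance(toy_predict, X, y)
print(importances)
```

Shuffling the feature the model relies on wrecks the accuracy, while shuffling the ignored feature changes nothing, which is exactly the signal this technique is built on.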
The advantages of this technique are:
The permutation importance is implemented in at least three libraries in Python: ELI5, mlxtend, and a development branch of scikit-learn. I’ve picked the mlxtend version for no reason other than convenience. According to Strobl et al. [3], “the raw [permutation] importance… has better statistical properties.” as opposed to normalizing the importance values by dividing by the standard deviation. I have checked the source code for mlxtend and scikit-learn, and they do not normalize them.
    from mlxtend.evaluate import feature_importance_permutation

    # This takes some time. You can reduce this number to make the process faster
    num_rounds = 50
    imp_vals, all_trials = feature_importance_permutation(
        predict_method=rf.predict,
        X=X_test.values,
        y=y_test.values,
        metric='accuracy',
        num_rounds=num_rounds,
        seed=1,
    )
    permutation_importance = pd.DataFrame(
        {
            'features': X_train.columns.tolist(),
            "permutation_importance": imp_vals,
        }
    ).sort_values('permutation_importance', ascending=False)
    permutation_importance = permutation_importance.head(25)
    permutation_importance.iplot(
        kind='bar',
        y='permutation_importance',
        x='features',
        yTitle='Permutation Importance',
        xTitle='Features',
        title='Permutation Importances',
    )
We also plot a box plot of all trials to get a sense of the deviation.
all_feat_imp_df = pd.DataFrame(data=np.transpose(all_trials),
                               columns=X_train.columns,
                               index=range(0, num_rounds))
order_column = all_feat_imp_df.mean(axis=0).sort_values(ascending=False).index.tolist()
all_feat_imp_df[order_column[:25]].iplot(kind='box',
                                         xTitle='Features',
                                         yTitle='Permutation Importance')
Everything is hunky-dory in feature importance land? Have we got the best way of explaining what features the model is using for predictions?
We know from life that nothing is perfect, and neither is this technique. Its Achilles’ heel is correlation between features. Just like drop column importance, this technique is also affected by correlation between features. Strobl et al., in Conditional variable importance for random forests [3], showed that “permutation importance over-estimates the importance of correlated predictor variables.” Especially in an ensemble of trees, if you have two correlated variables, some of the trees might have picked feature A and some others might have picked feature B. And while doing this analysis, in the absence of feature A, the trees which picked feature B would work well and keep the performance high, and vice versa. What this will result in is that both the correlated features A and B will have inflated importance.
Another drawback of the technique is that the core idea in the technique is about permuting a feature. But that is essentially a randomness which is not in our control. And because of this the results may vary greatly. We don’t see it here, but if the box plot shows a large variation in importance for a feature across trials, I’ll be wary in my interpretation.
There is yet another drawback to this technique, which in my opinion is the most concerning. Giles Hooker et al. [6] say, “When features in the training set exhibit statistical dependence, permutation methods can be highly misleading when applied to the original model.”
Let’s consider occupation and education. We can understand this from two perspectives:
Giles Hooker et al. [6] suggest alternative methodologies which combine LOOC and Permutation methods, but all the alternatives are computationally more intensive and do not have a strong theoretical guarantee of better statistical properties.
After identifying the highly correlated features, there are two ways of dealing with correlated features.
N.B. The second method is the same method that I would suggest to deal with one-hot variables.
During the discussion of both Drop Column importance and Permutation importance, one question should have come to your mind. We passed the test/validation set to the methods to calculate the importance. Why not train set?
This is a grey area in the application of some of these methods. There is no right or wrong here because there are arguments for and against both. In Interpretable Machine Learning, Christoph Molnar argues a case for both train and validation sets.
The case for test/validation data is a no-brainer. For the same reason why we do not judge a model by the error in the training set, we cannot judge the feature importance on the performance on the training set (especially since the importance is intrinsically linked to the error).
The case for train data is counter-intuitive. But if you think about it, you’ll see that what we want to measure is how the model is using the features. And what better data to judge that than the training set on which the model was trained? Another trivial issue is also that we would ideally train the model on all available data and in such an ideal scenario, there will not be a test or validation data to check performances on. In Interpretable Machine Learning[5], section 5.5.2 discusses this issue at length and even with a synthetic example of an overfitting SVM.
It all comes down to whether you want to know what features the model relies on to make predictions, or the predictive power of each feature on unseen data. For example, if you are evaluating feature importance in the context of feature selection, do not use test data under any circumstances (there you are overfitting your feature selection to fit the test data).
All the techniques we reviewed till now looked at the relative importance of different features. Now let’s move in a slightly different direction and look at a few techniques which explore how a particular feature interacts with the target variable.
Partial Dependence Plots and Individual Conditional Expectation plots help us understand the functional relationship between the features and the target. They are graphical visualizations of the marginal effect of a given variable(or multiple variables) on an outcome. Friedman(2001) introduced this technique in his seminal paper Greedy function approximation: A gradient boosting machine[8].
Partial Dependence (PD) plots show the average effect of a feature, whereas ICE plots show the functional relationship for individual observations, and therefore the dispersion or heterogeneity of that effect.
The advantages of this technique are:
Let’s consider a simple situation where we plot the PD plot for a single feature x, with unique values . The PD plot can be constructed as follows:
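A minimal sketch of one common way to do this construction, assuming a model exposed through a `predict` function and a NumPy feature matrix: for each unique value of the feature, overwrite that feature for every row, predict, and keep the per-row predictions (the ICE curves) as well as their average (the PD curve). The helper `ice_and_pdp` and the toy model are hypothetical and are not part of PDPbox.

```python
import numpy as np

def ice_and_pdp(predict, X, feature_idx, grid):
    """Return ICE curves (n_samples x n_grid) and their column-wise mean (the PD curve)."""
    ice = np.empty((X.shape[0], len(grid)))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value   # force every row to the grid value
        ice[:, k] = predict(X_mod)
    return ice, ice.mean(axis=0)

# Hypothetical model: prediction = 2 * x0 + x1
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1]

X_toy = np.array([[0.0, 1.0],
                  [1.0, 3.0],
                  [2.0, 5.0]])
grid = np.unique(X_toy[:, 0])           # unique values of the feature
ice, pd_curve = ice_and_pdp(toy_predict, X_toy, 0, grid)
print(pd_curve)  # [3. 5. 7.]
```

Each row of `ice` is one line of an ICE plot, and `pd_curve` is the single line of the PD plot.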
I have found the PD plots implemented in PDPbox, skater and scikit-learn, and the ICE plots in PDPbox, pyCEbox, and skater. Out of all of these, I found PDPbox to be the most polished. It also supports two-variable PDP plots.
from pdpbox import pdp, info_plots

pdp_age = pdp.pdp_isolate(
    model=rf, dataset=X_train, model_features=X_train.columns, feature='age'
)

# PDP Plot
fig, axes = pdp.pdp_plot(pdp_age, 'Age', plot_lines=False, center=False, frac_to_plot=0.5,
                         plot_pts_dist=True, x_quantile=True, show_percentile=True)

# ICE Plot
fig, axes = pdp.pdp_plot(pdp_age, 'Age', plot_lines=True, center=False, frac_to_plot=0.5,
                         plot_pts_dist=True, x_quantile=True, show_percentile=True)
Let me take some time to explain the plot. On the x axis, you can find the values of the feature you are trying to understand, i.e. age. On the y axis you find the prediction. In the case of classification it is the prediction probability, and in the case of regression it is the real-valued prediction. The bar at the bottom represents the distribution of training data points in different quantiles. It helps us gauge the goodness of the inference. In the parts where there are very few points, the model has seen very few examples and the interpretation can be tricky. The single line in the PD plot shows the average functional relationship between the feature and the target. The lines in the ICE plot show the heterogeneity in the training data, i.e. how the predictions for all the observations in the training data vary with the different values of age.
Now, let’s also take an example with a categorical feature, like occupation. PDPbox has a very nice feature where it lets you pass a list of features as an input, and it will calculate the PDP for them, treating them as a single categorical feature.
# All the one-hot variables for the occupation feature
occupation_features = ['occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces',
                       'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing',
                       'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service',
                       'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv',
                       'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving']

# Notice we are passing the list of features as a list with the feature parameter
pdp_occupation = pdp.pdp_isolate(
    model=rf, dataset=X_train, model_features=X_train.columns,
    feature=occupation_features
)

# PDP
fig, axes = pdp.pdp_plot(pdp_occupation, 'Occupation', center=False, plot_pts_dist=True)

# Processing the plot for aesthetics
_ = axes['pdp_ax']['_pdp_ax'].set_xticklabels([col.replace("occupation_", "") for col in occupation_features])
axes['pdp_ax']['_pdp_ax'].tick_params(axis='x', rotation=45)
bounds = axes['pdp_ax']['_count_ax'].get_position().bounds
axes['pdp_ax']['_count_ax'].set_position([bounds[0], 0, bounds[2], bounds[3]])
_ = axes['pdp_ax']['_count_ax'].set_xticklabels([])
PD plots can theoretically be drawn for any number of features to show their interaction effect as well. But practically, we can only do it for two, or at most three. Let’s take a look at an interaction plot between two continuous features, age and education (education and age are not truly continuous, but for lack of a better example we are choosing them).
There are two ways you can plot a PD plot between two features. There are three dimensions here, feature value 1, feature value 2, and the target prediction. Either, we can plot a 3-D plot or a 2-D plot with the 3rd dimension depicted as color. I prefer the 2-D plot because I think it conveys the information in a much more crisp manner than a 3-D plot where you have to look at the 3-D shape to infer the relationship. PDPbox implements the 2-D interaction plots, both as a contour plot and grid. Contour works best for continuous features and grid for categorical features
# Age and Education
inter1 = pdp.pdp_interact(
    model=rf, dataset=X_train, model_features=X_train.columns, features=['age', 'education_num']
)
fig, axes = pdp.pdp_interact_plot(
    pdp_interact_out=inter1, feature_names=['age', 'education_num'],
    plot_type='contour', x_quantile=False, plot_pdp=False
)
axes['pdp_inter_ax'].set_yticklabels([edu_map.get(col) for col in axes['pdp_inter_ax'].get_yticks()])
This is also a very useful technique to investigate bias (the ethical kind) in your algorithms. Suppose we want to look at the algorithmic bias in the sex dimension.
# PDP Sex
pdp_sex = pdp.pdp_isolate(
    model=rf, dataset=X_train, model_features=X_train.columns, feature='sex'
)
fig, axes = pdp.pdp_plot(pdp_sex, 'Sex', center=False, plot_pts_dist=True)
_ = axes['pdp_ax']['_pdp_ax'].set_xticklabels(sex_le.inverse_transform(axes['pdp_ax']['_pdp_ax'].get_xticks()))

# marital_status and sex
inter1 = pdp.pdp_interact(
    model=rf, dataset=X_train, model_features=X_train.columns, features=['marital_status', 'sex']
)
fig, axes = pdp.pdp_interact_plot(
    pdp_interact_out=inter1, feature_names=['marital_status', 'sex'],
    plot_type='grid', x_quantile=False, plot_pdp=False
)
axes['pdp_inter_ax'].set_xticklabels(marital_le.inverse_transform(axes['pdp_inter_ax'].get_xticks()))
axes['pdp_inter_ax'].set_yticklabels(sex_le.inverse_transform(axes['pdp_inter_ax'].get_yticks()))
The assumption of independence between the features is the biggest flaw in this approach. The same flaw that is present in LOOC importance and Permutation Importance also applies to PDP and ICE plots. Accumulated Local Effects (ALE) plots are a solution to this problem: they calculate differences in predictions instead of averages, and they do so based on the conditional distribution of the features.
To summarize how each type of plot (PDP,ALE) calculates the effect of a feature at a certain grid value v:
Partial Dependence Plots: “Let me show you what the model predicts on average when each data instance has the value v for that feature. I ignore whether the value v makes sense for all data instances.”
ALE plots: “Let me show you how the model predictions change in a small “window” of the feature around v for data instances in that window.”
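The “window” idea can be sketched in a few lines of NumPy. This is a simplified first-order ALE for one numeric feature, written under my own assumptions (quantile-based windows, count-weighted centring); it is not the ALEpython implementation, and `ale_1d` and the toy model are hypothetical names.

```python
import numpy as np

def ale_1d(predict, X, feature_idx, n_bins=5):
    """Simplified first-order ALE of one numeric feature."""
    x = X[:, feature_idx]
    # Quantile-based window edges so each window holds a similar number of rows
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    # Assign each row to a window; clip so the minimum lands in the first one
    bin_idx = np.clip(np.searchsorted(edges, x, side="left") - 1, 0, len(edges) - 2)
    local_effects = np.zeros(len(edges) - 1)
    counts = np.zeros(len(edges) - 1)
    for b in range(len(edges) - 1):
        rows = X[bin_idx == b]
        if len(rows) == 0:
            continue
        lower, upper = rows.copy(), rows.copy()
        lower[:, feature_idx] = edges[b]
        upper[:, feature_idx] = edges[b + 1]
        # Prediction difference across the window, only for rows inside it
        local_effects[b] = np.mean(predict(upper) - predict(lower))
        counts[b] = len(rows)
    ale = np.concatenate([[0.0], np.cumsum(local_effects)])
    # Centre so the count-weighted mean effect is zero
    centre = np.sum((ale[:-1] + ale[1:]) / 2 * counts) / counts.sum()
    return edges, ale - centre

# Two strongly correlated toy features and a hypothetical linear model
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = x0 + 0.1 * rng.normal(size=200)
X_toy = np.column_stack([x0, x1])

def toy_predict(X):
    return 3.0 * X[:, 0] + X[:, 1]

edges, ale = ale_1d(toy_predict, X_toy, 0)
print(np.diff(ale) / np.diff(edges))  # slope of about 3 in every window
```

Because each window only moves the rows that actually live in it, the feature is never pushed to values that are unrealistic given the correlated feature, which is exactly what the PD plot ignores.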
In the Python environment, there is no good and stable library for ALE. I’ve only found one, ALEpython, which is still very much in development. I managed to get an ALE plot of age, which you can find below, but got an error when I tried an interaction plot. It’s also not developed for categorical features.
This is where we break off again and push the rest of the stuff to the next blog post. In the next part we take a look at LIME, SHAP, Anchors, and more.
Full Code is available in my Github
Interpretability is the degree to which a human can understand the cause of a decision – Miller, Tim[1]
Explainable AI (XAI) is a sub-field of AI which has been gaining ground in the recent past. And as a machine learning practitioner dealing with customers day in and day out, I can see why. I’ve been an analytics practitioner for more than 5 years and I swear, the hardest part of a machine learning project is not creating the perfect model which beats all the benchmarks. It’s the part where you convince the customer why and how it works.
Humans always had a dichotomy when faced with the unknown. Some of us deal with it using faith and worship it, like our ancestors who worshipped fire, the skies, etc. And some of us turn to distrust. Likewise, in Machine Learning, there are people who are satisfied by rigorous testing of the model (i.e. the performance of the model) and those who want to know why and how a model is doing what it is doing. And there is no right or wrong here.
Yann LeCun, Turing Award winner and Facebook’s Chief AI Scientist, and Cassie Kozyrkov, Google’s Chief Decision Intelligence Engineer, are strong proponents of the line of thought that you can infer a model’s reasoning by observing its actions (i.e. predictions in a supervised learning framework). On the other hand, Microsoft Research’s Rich Caruana and a few others have insisted that models should be inherently interpretable, rather than having their interpretability merely inferred from the performance of the model.
We can spend years debating the topic, but for the widespread adoption of AI, explainable AI is essential and is increasingly demanded by the industry. So, here I am attempting to explain and demonstrate a few interpretability techniques which have been useful for me, both in explaining a model to a customer and in investigating a model to make it more reliable.
Interpretability is the degree to which a human can understand the cause of a decision. In the artificial intelligence domain, it is the degree to which a person can understand the how and why of an algorithm and its predictions. There are two major ways of looking at this: Transparency and Post-hoc Interpretation.
Transparency addresses how well the model can be understood. This is inherently specific to the model that we use.
One of the key aspects of such transparency is simulatability. Simulatability denotes the ability of a model of being simulated or thought about strictly by a human[3]. Complexity of the model plays a big part in defining this characteristic. While a simple linear model or a single layer perceptron is simple enough to think about, it becomes increasingly difficult to think about a decision tree with a depth of, say, 5. It also becomes harder to think about a model which has a lot of features. Therefore it follows that a sparse linear model(Regularized Linear Model) is more interpretable than a dense one.
Decomposability is another major tenet of transparency. It stands for the ability to explain each of the parts of a model(input, parameter and calculation)[3]. It requires everything from input(no complex features) to output to be explained without the need for another tool.
The third tenet of transparency is Algorithmic Transparency. This deals with the inherent simplicity of the algorithm. It deals with the ability of a human to fully understand the process an algorithm takes to convert inputs to output.
Post-hoc interpretation is useful when the model itself is not transparent. So in the absence of clarity on how the model is working, we resort to explaining the model and its predictions using a multitude of ways. Arrieta, Alejandro Barredo et al. have compiled and categorized them into 6 major buckets. We will be talking about some of these here.
Since this is a wide topic and covering all of it would make for a humongous blog post, I’ve split it into multiple parts. We will cover the interpretable models and their ‘gotchas’ in the current part and leave the post-hoc analysis for the next one.
Occam’s Razor states that simple solutions are more likely to be correct than complex ones. In data science, Occam’s Razor is usually stated in conjunction with overfitting. But I believe it is equally applicable in the explainability context. If you can get the performance that you desire with a transparent model, look no further in your search for the perfect algorithm.
Arrieta, Alejandro Barredo et al. have summarised the ML models and categorised them in a nice table.
Since Logistic Regression is also a linear regression in some way, we just focus on Linear Regression. Let’s take a small data set (auto-mpg) to investigate the model. The data concerns city-cycle fuel consumption in miles per gallon along with different attributes of the car like:
After loading the data, the first step is to run pandas_profiling.
import pandas as pd
import numpy as np
import pandas_profiling
import pathlib
import cufflinks as cf

# We set all the charts as public
cf.set_config_file(sharing='public', theme='pearl', offline=False)
cf.go_offline()

cwd = pathlib.Path.cwd()
data = pd.read_csv(cwd/'mpg_dataset'/'auto-mpg.csv')
report = data.profile_report()
report.to_file(output_file="auto-mpg.html")
Just one line of code and this brilliant library does your preliminary EDA for you.
In the Python world, Linear Regression is available in scikit-learn and Statsmodels. Both of them give the same results, but Statsmodels leans more towards statisticians and scikit-learn towards ML practitioners. Let’s use Statsmodels because of the beautiful summary it provides out of the box.
import statsmodels.api as sm

X = data.drop(['mpg'], axis=1)
y = data['mpg']

## let's add an intercept (beta_0) to our model
# Statsmodels does not do that by default
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

# Plot the coefficients (except intercept)
model.params[1:].iplot(kind='bar')
The interpretation is really straightforward here.
Now coming to the feature importance, looks like origin, and model year are the major features which drive the model, right?
Nope. Let’s look at it in detail. To make my point clear, let’s look at a few rows of the data.
origin has values like 1, 2, etc., model_year has values like 70,80, etc., weight has values like 3504, 3449 etc., and mpg(our dependent variable) has values like 15,16,etc. You see the problem here? To make an equation which outputs 15 or 16, the equation needs to have a small coefficient for weight, and a large coefficient for origin.
So, what do we do?
Enter, Standardized Regression Coefficients.
We multiply each of the coefficients with the ratio of standard deviation of the independent variable to standard deviation of the dependent variable. Standardized coefficients refer to how many standard deviations a dependent variable will change, per standard deviation increase in the predictor variable.
# Standardizing the regression coefficients
# Copy first so that model.params itself is not modified in place
std_coeff = model.params.copy()
for col in X.columns:
    std_coeff[col] = (std_coeff[col] * np.std(X[col])) / np.std(y)
std_coeff[1:].round(4).iplot(kind='bar')
The picture really changed, didn’t it? Weight of the car, whose coefficient was almost zero, turned out to be the biggest driver in determining mileage. If you want to get more intuition/math behind the standardization, I suggest you check out this stackoverflow answer.
Another way you can get similar results is by standardizing the input variables before fitting the linear regression and then examining the coefficients.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
lm = LinearRegression()
lm.fit(X_std, y)
r2 = lm.score(X_std, y)
adj_r2 = 1 - (1 - r2) * (len(X) - 1) / (len(X) - len(X.columns) - 1)
print("R2 Score: {:.2f}% | Adj R2 Score: {:.2f}%".format(r2 * 100, adj_r2 * 100))

params = pd.Series({'const': lm.intercept_})
for i, col in enumerate(X.columns):
    params[col] = lm.coef_[i]
params[1:].round(4).iplot(kind='bar')
Even though the actual coefficients are different, the relative importance between the features remains the same.
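The link between the two routes can be checked on synthetic data. The sketch below is an illustration under my own assumptions: the data is made up (it is not auto-mpg), and unlike the scikit-learn snippet above it z-scores the target as well as the inputs, which is what makes the two sets of coefficients match exactly rather than only in relative terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical features on very different scales (think weight vs. origin)
X = np.column_stack([rng.normal(3000, 800, n),
                     rng.integers(1, 4, n).astype(float)])
y = 40 - 0.006 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 1, n)

def ols(X, y):
    """Least-squares fit with an intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

# Route 1: fit on raw data, then rescale each coefficient by sd(x_j) / sd(y)
_, raw_coef = ols(X, y)
std_coef_rescaled = raw_coef * X.std(axis=0) / y.std()

# Route 2: z-score both X and y, then fit
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
_, std_coef_refit = ols(Xz, yz)

print(raw_coef)           # the large-scale feature has the tiny raw coefficient
print(std_coef_rescaled)  # ...but the larger standardized one
print(std_coef_refit)     # same values as the rescaled coefficients
```

If only the inputs are scaled, the coefficients differ from the rescaled ones by the constant factor sd(y), so the ranking of the features is unchanged.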
The final ‘gotcha’ in Linear Regression is around multicollinearity and OLS in general. Linear Regression is solved using OLS, which is an unbiased estimator. Even though that sounds like a good thing, it is not necessarily so. Strictly, ‘unbiased’ means that the expected value of the estimated coefficients equals the true coefficients; in practice, it also means that the solving procedure applies no penalty or preference to any independent variable and simply strives for the coefficients which minimize the Residual Sum of Squares (RSS). But do we really want a model which just minimizes the RSS? Hint: RSS is computed on the training set.
In the Bias vs Variance tradeoff, there exists a sweet spot where you get optimal model complexity which avoids overfitting. And usually since it is quite difficult to estimate bias and variance to do an analytical reasoning and arrive at the optimal point, we employ cross validation based strategies to achieve the same. But, if you think about it, there is no real hyperparameter to tweak in a Linear Regression.
And, since the estimator is unbiased, it will allocate a fraction of the contribution to every feature available to it. This becomes more of a problem when there is multicollinearity. While this doesn’t affect the predictive power of the model much, it does affect its interpretability. When one feature is highly correlated with another feature or a combination of features, the marginal contribution of that feature is influenced by the other features. So, if there is strong multicollinearity in your dataset, the regression coefficients will be misleading.
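A small synthetic experiment illustrates the point. The setup is an assumption of mine, not taken from the auto-mpg data: `x2` is almost a copy of `x1`, and only `x1` truly drives the target. Refitting OLS on fresh samples shows that the two individual coefficients swing wildly, while their sum (and hence the prediction) stays stable.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept fitted but not returned)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

rng = np.random.default_rng(1)
n = 500
coefs = []
for _ in range(5):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)   # near-duplicate of x1
    y = 3.0 * x1 + rng.normal(size=n)     # only x1 truly drives y
    coefs.append(fit_ols(np.column_stack([x1, x2]), y))
coefs = np.array(coefs)

print(coefs.round(2))              # individual coefficients vary a lot between fits
print(coefs.sum(axis=1).round(2))  # their sum stays close to 3
```

Reading either coefficient on its own as “the effect of that feature” would be misleading here, even though every one of these fits predicts about equally well.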
Enter Regularization.
At the heart of almost any machine learning algorithm is an optimization problem which minimizes a cost function. In the case of Linear Regression, that cost function is Residual Sum of Squares, which is nothing but the squared error between the prediction and the ground truth parametrized by the coefficients.
To add regularization, we introduce an additional term to the cost function of the optimization. The cost function now becomes:
In Ridge Regression we add the sum of all squared coefficients to the cost function, and in Lasso Regression we add the sum of the absolute coefficients. In addition to those, we also introduce a parameter, a hyperparameter we can tune to arrive at the optimum model complexity. And by virtue of the mathematical properties of L1 and L2 regularization, the effect on the coefficients is slightly different.
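For reference, the two penalized cost functions described above can be written out in standard textbook notation. The symbols are my assumption, since the original equation is not reproduced in the text: \lambda is the tunable hyperparameter, \beta_j are the coefficients, and the intercept \beta_0 is conventionally left unpenalized.

```latex
\text{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2

\text{Ridge (L2):}\quad J(\beta) = \text{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2

\text{Lasso (L1):}\quad J(\beta) = \text{RSS}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|
```

Setting \lambda = 0 recovers plain OLS, and larger values of \lambda trade training-set fit for smaller coefficients.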
Elements of Statistical Learning by Hastie, Tibshirani and Friedman gives the following guideline: when you have many small/medium sized effects, you should go with Ridge. If you have only a few variables with a medium/large effect, go with Lasso. You can check out this medium blog, which explains Regularization in great detail. The author has also given a succinct summary which I’m borrowing here.
from sklearn.linear_model import RidgeCV, LassoCV

# Ridge Regression
lm = RidgeCV()
lm.fit(X, y)
r2 = lm.score(X, y)
adj_r2 = 1 - (1 - r2) * (len(X) - 1) / (len(X) - len(X.columns) - 1)
print("R2 Score: {:.2f}% | Adj R2 Score: {:.2f}%".format(r2 * 100, adj_r2 * 100))
params = pd.Series({'const': lm.intercept_})
for i, col in enumerate(X.columns):
    params[col] = lm.coef_[i]
ridge_params = params.copy()

# Lasso Regression
lm = LassoCV()
lm.fit(X, y)
r2 = lm.score(X, y)
adj_r2 = 1 - (1 - r2) * (len(X) - 1) / (len(X) - len(X.columns) - 1)
print("R2 Score: {:.2f}% | Adj R2 Score: {:.2f}%".format(r2 * 100, adj_r2 * 100))
params = pd.Series({'const': lm.intercept_})
for i, col in enumerate(X.columns):
    params[col] = lm.coef_[i]
lasso_params = params.copy()

# Compare the two sets of coefficients side by side
ridge_params.to_frame().join(lasso_params.to_frame(), lsuffix='_ridge', rsuffix='_lasso')
We just ran Ridge and Lasso Regression on the same data. Ridge Regression gave the exact same R2 and Adjusted R2 scores as the original regression(82.08% and 81.72%, respectively), but with slightly shrunk coefficients. And Lasso gave a lower R2 and Adjusted R2 score(76.67% and 76.19%, respectively) with considerable shrinkage.
If you look at the coefficients carefully, you can see that Ridge Regression did not shrink the coefficients much. The only place where it really shrunk is for displacement and origin. There may be two reasons for this:
But if you look at how Lasso has shrunk the coefficients, you’ll see it is quite aggressive.
As a rule of thumb, I would suggest always using some kind of regularization.
Let’s pick another dataset for this exercise – the world famous Iris dataset. For those who have been living under a rock, the Iris dataset is a dataset of measurements taken for three species of flowers. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
We dropped the ‘Id’ column, encoded the Species column to make it the target, and trained a Decision Tree Classifier on it.
Let’s take a look at the “feature importance” (we will go into detail about feature importance and its interpretation in the next part of the blog series) from the Decision Tree.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(min_samples_split=4)
clf.fit(X, y)
feat_imp = pd.DataFrame({'features': X.columns.tolist(),
                         "mean_decrease_impurity": clf.feature_importances_}).sort_values('mean_decrease_impurity', ascending=False)
feat_imp = feat_imp.head(25)
feat_imp.iplot(kind='bar',
               y='mean_decrease_impurity',
               x='features',
               yTitle='Mean Decrease Impurity',
               xTitle='Features',
               title='Mean Decrease Impurity',
               )
Out of the four features, the classifier only used Petal Length and Petal Width to separate the three classes.
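To get a feel for why a tree prefers one feature over another, here is a rough sketch of the quantity behind “mean decrease impurity” for a single split: the drop in Gini impurity, weighted by the size of each child node. The helper names and the toy values are hypothetical (they are not the real Iris measurements), and scikit-learn’s actual importance additionally aggregates and normalizes these decreases over all splits in the tree.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(x, y, threshold):
    """Weighted drop in Gini impurity from splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    weighted_child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - weighted_child

# Hypothetical toy values, not the real Iris measurements
petal_length = np.array([1.4, 1.5, 1.3, 4.5, 4.7, 5.9, 6.1, 5.6])
sepal_width = np.array([3.5, 3.0, 3.2, 3.2, 2.8, 3.0, 3.4, 3.1])
species = np.array([0, 0, 0, 1, 1, 2, 2, 2])

d_petal = impurity_decrease(petal_length, species, 2.5)
d_sepal = impurity_decrease(sepal_width, species, 3.1)
print(d_petal, d_sepal)  # the petal split removes far more impurity
```

A feature that never yields a worthwhile decrease is simply never chosen for a split, which is how it ends up with an importance of zero.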
Let’s visualize the Decision Tree using the brilliant library dtreeviz and see how the model has learned the rules.
from dtreeviz.trees import *

viz = dtreeviz(clf,
               X,
               y,
               target_name='Species',
               class_names=["setosa", "versicolor", "virginica"],
               feature_names=X.columns)
viz
It’s very clear how the model is doing what it is doing. Let’s go a step ahead and visualize how a particular prediction is made.
# random sample from training
x = X.iloc[np.random.randint(0, len(X)), :]
viz = dtreeviz(clf,
               X,
               y,
               target_name='Species',
               class_names=["setosa", "versicolor", "virginica"],
               feature_names=X.columns,
               X=x)
viz
If we rerun the classification with just the two features which the Decision Tree selected, it will give you the same prediction. But the same cannot be said of an algorithm like Linear Regression. If we remove the variables which don’t meet the p-value cutoff, the performance of the model may also go down by some amount.
Interpreting a Decision tree is a lot more straightforward than a Linear Regression, with all its quirks and assumptions. Statistical Modelling: The Two Cultures by Leo Breiman is a must read to understand the problems in interpreting Linear Regression and it also argues a case for Decision Trees or even Random Forests over Linear Regression, both in terms of performance and interpretability. (Disclaimer: If it has not struck you already, Leo Breiman co-invented Random Forest)
This is where we break off Part I of the blog series. Stay tuned for the next part, where we explore post-hoc interpretation techniques like permutation importance, Shapley Values, PDP and more.
Full Code is available in my Github
References