– *i*th element of the time series

– The index of timeseries

– The index of non-zero demand

– The inter-demand interval (*Q*), i.e. the gap between two non-zero demands.

– The demand size (*M*) at a non-zero demand point.

Traditionally, there is a class of algorithms which take a slightly different path to forecasting intermittent time series. These algorithms consider the intermittent demand in two parts – Demand Size and Inter-demand Interval – and model them separately.

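Croston proposed to apply single exponential smoothing separately to both *M* and *Q*: each new estimate is the smoothing parameter times the latest observed value, plus one minus the smoothing parameter times the previous estimate.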

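After getting these estimates, the final forecast is the ratio of the two: the smoothed Demand Size divided by the smoothed Inter-demand Interval (*M*/*Q*).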

This is a one-step ahead forecast. If we have to extend it to multiple timesteps, we are left with a flat forecast with the same value.
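To make the procedure concrete, here is a minimal sketch of Croston's method in Python. The function name `croston`, the default `alpha`, and the choice to initialise both estimates with the first observed values are my assumptions for illustration; real implementations differ in how they initialise and tune the smoothing parameter.

```python
import numpy as np


def croston(y, alpha=0.1):
    """One-step-ahead Croston forecast for a 1-D demand history."""
    y = np.asarray(y, dtype=float)
    nonzero_idx = np.flatnonzero(y)
    if nonzero_idx.size == 0:
        return 0.0  # no demand observed at all

    m = y[nonzero_idx]  # demand sizes M
    # Inter-demand intervals Q; the first interval is counted from the
    # start of the series (an assumption of this sketch).
    q = np.diff(np.concatenate(([-1], nonzero_idx)))

    # Initialise with the first observed values (assumption).
    m_hat, q_hat = m[0], q[0]
    for m_n, q_n in zip(m[1:], q[1:]):
        m_hat = alpha * m_n + (1 - alpha) * m_hat
        q_hat = alpha * q_n + (1 - alpha) * q_hat

    return m_hat / q_hat


history = [0, 4, 0, 0, 2, 0]
one_step = croston(history, alpha=0.5)
flat_forecast = np.full(5, one_step)  # the same value repeated for 5 steps
```

The last line shows the flat multi-step forecast: the one-step value is simply repeated over the horizon.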

Syntetos and Boylan (2005) showed that the Croston forecast is biased on intermittent demand and proposed a correction that uses the smoothing parameter from the inter-demand interval estimation. This is known as the Syntetos-Boylan Approximation (SBA).

Shale, Boylan, and Johnston (2006) derived the expected bias when the arrivals follow a Poisson process (the SBJ correction).
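As a rough illustration, both corrections can be written as a multiplicative factor on the Croston ratio. The factors below, 1 - β/2 for SBA and 1 - β/(2 - β) for SBJ, are the forms I have usually seen quoted in the literature, with β being the smoothing parameter of the inter-demand interval; treat them as an assumption and check the original papers before relying on them. The function names are hypothetical.

```python
def croston_ratio(m_hat, q_hat):
    """Plain Croston forecast from the two smoothed estimates."""
    return m_hat / q_hat


def sba(m_hat, q_hat, beta):
    """Syntetos-Boylan Approximation: deflate the Croston ratio by (1 - beta/2)."""
    return (1 - beta / 2) * croston_ratio(m_hat, q_hat)


def sbj(m_hat, q_hat, beta):
    """Shale-Boylan-Johnston: deflate the Croston ratio by (1 - beta/(2 - beta))."""
    return (1 - beta / (2 - beta)) * croston_ratio(m_hat, q_hat)


# Smoothed demand size 3.0, smoothed interval 2.5, beta = 0.5
print(croston_ratio(3.0, 2.5), sba(3.0, 2.5, 0.5), sbj(3.0, 2.5, 0.5))
```

Both corrections shrink the forecast, and the shrinkage grows with β.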

A renewal process is an arrival process in which the interarrival intervals are positive, independent and identically distributed (IID) random variables (rv’s). This formulation generalizes the Poisson process to arbitrary holding times. In a Poisson process, the inter-demand intervals are exponentially distributed, whereas a renewal process only requires i.i.d. inter-demand times with a finite mean.

Turkmen et al. (2019) cast Croston and its variants into a renewal process mold. The random variables, *M* and *Q*, both defined on positive integers fully define the

Once Croston forecasting was cast as a renewal process, Turkmen et al. proposed to estimate “Demand Size” and “Inter-demand Interval” with a recurrent neural network (RNN) that has a separate output layer for each.

where

This means that we have a single RNN, which takes both *M* and *Q* as input and encodes that information into a hidden state (*h*). We then put two separate NN layers on top of this hidden state to estimate the probability distributions of *M* and *Q*. For both *M* and *Q*, the Negative Binomial distribution is the choice suggested by the paper.

Negative Binomial Distribution is a discrete probability distribution which is commonly used to model count data. For example, the number of units of an SKU sold, number of people who visited a website, or number of service calls a call center receives.

The distribution derives from a sequence of Bernoulli Trials, which says that there are only two outcomes for each experiment. A classic example is a coin toss, which can either be heads or tails. So the probability of success is *p* and failure is *1-p* (in a fair coin toss, this is 0.5 each). So now if we keep on performing this experiment until we have seen *r* successes, the number of failures we see will have a negative binomial distribution.

The semantic meaning of success and failure need not hold true when we apply this, but what matters is that there are only two types of outcome.
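The Bernoulli-trial story can be checked with a small simulation. This sketch counts failures before the *r*-th success and compares the result with NumPy's built-in sampler and the theoretical mean r(1-p)/p. The values of `r`, `p`, the seed, and the sample size are arbitrary choices for illustration.

```python
import numpy as np


def failures_before_r_successes(r, p, rng):
    """Run Bernoulli(p) trials until r successes; return the number of failures."""
    failures = 0
    successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


rng = np.random.default_rng(0)
r, p = 3, 0.4

simulated = np.array([failures_before_r_successes(r, p, rng) for _ in range(20000)])
direct = rng.negative_binomial(r, p, size=20000)  # failures before r successes
theoretical_mean = r * (1 - p) / p

print(simulated.mean(), direct.mean(), theoretical_mean)
```

The two empirical means should land close to the theoretical value, which shows that the "count the failures" construction and the library distribution describe the same thing.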

The paper talks only about one-step ahead forecasts, which is also what you will find in a lot of the Intermittent Demand Forecasting literature. But in the real world, we need a longer horizon than that to plan properly. Whether it is Croston or Deep Renewal Process, the way we generate an n-step ahead forecast is the same – a flat forecast of Demand Size(M)/Inter-demand Time(Q).

We have introduced two new methods of decoding the output – Exact and Hybrid – in addition to the existing method, Flat. Suppose we trained the model with a prediction length of 5.

The raw output from the model would be:

**Flat**

Under flat decoding, we would just pick the first set of outputs (M=22 and Q=2) and generate a one-step ahead forecast and extend the same forecast for all 5 timesteps.

**Exact**

Exact decoding is a more confident version of decoding. Here we predict a demand of size M after every inter-demand time of Q and make the rest of the forecast zero.

**Hybrid**

In hybrid decoding, we combine these two to generate a forecast which also takes into account long-term changes in the model’s expectation. We use the M/Q value for the forecast, but we update the M/Q value based on the next steps. For example, in the example we have, we will forecast 11 (which is 22/2) for the first 2 timesteps, then forecast 33 (which is 33/1) for the next timestep, etc.
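To make the three decoding schemes concrete, here is a minimal sketch of how I read the descriptions above. Only the first two (M, Q) pairs, (22, 2) and (33, 1), come from the example; the remaining three pairs and the function names are made up for illustration, and this is not the code used in the library.

```python
import numpy as np


def decode_flat(m, q, horizon):
    """Repeat the first M/Q ratio over the whole horizon."""
    return np.full(horizon, m[0] / q[0], dtype=float)


def decode_exact(m, q, horizon):
    """Place each demand size M after its inter-demand time Q; zeros elsewhere."""
    forecast = np.zeros(horizon)
    t = 0
    for size, gap in zip(m, q):
        t += int(gap)
        if t > horizon:
            break
        forecast[t - 1] = size
    return forecast


def decode_hybrid(m, q, horizon):
    """Forecast M/Q for Q timesteps, then move on to the next (M, Q) pair."""
    forecast = np.zeros(horizon)
    t = 0
    for size, gap in zip(m, q):
        if t >= horizon:
            break
        end = min(t + int(gap), horizon)
        forecast[t:end] = size / gap
        t = end
    return forecast


m = [22, 33, 12, 8, 20]  # only 22 and 33 are from the text, the rest is hypothetical
q = [2, 1, 2, 4, 1]      # only 2 and 1 are from the text, the rest is hypothetical

print(decode_flat(m, q, 5))    # [11. 11. 11. 11. 11.]
print(decode_exact(m, q, 5))   # [ 0. 22. 33.  0. 12.]
print(decode_hybrid(m, q, 5))  # [11. 11. 33.  6.  6.]
```

Flat ignores everything after the first pair, Exact commits to specific demand dates, and Hybrid spreads each predicted demand over its interval.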

I have implemented the algorithm using GluonTS, which is a framework for Neural Time Series forecasting, built on top of MXNet. AWS Labs is behind the open source project and some of the algorithms like DeepAR are used internally by Amazon to produce forecasts.

The paper talks about two variants of the model – Discrete Time DRP(Deep Renewal Process) and Continuous Time DRP. In this library, we have only implemented the discrete time DRP, as it is the more popular use case.

The package is uploaded on pypi and can be installed using:

    pip install deeprenewal

**Recommended Python Version: 3.6**

**Source Code:** https://github.com/manujosephv/deeprenewalprocess

If you are working on Windows and need to use your GPU (which I recommend), you first need to install a build of MXNet 1.6.0 that supports GPU (see the MXNet Official Installation Page).

And if you are facing difficulties installing the GPU version, you can try(depending on the CUDA version you have)

    pip install mxnet-cu101==1.6.0 -f https://dist.mxnet.io/python/all

Relevant Github Issue

    usage: deeprenewal [-h] [--use-cuda USE_CUDA]
                       [--datasource {retail_dataset}]
                       [--regenerate-datasource REGENERATE_DATASOURCE]
                       [--model-save-dir MODEL_SAVE_DIR]
                       [--point-forecast {median,mean}]
                       [--calculate-spec CALCULATE_SPEC]
                       [--batch_size BATCH_SIZE]
                       [--learning-rate LEARNING_RATE]
                       [--max-epochs MAX_EPOCHS]
                       [--number-of-batches-per-epoch NUMBER_OF_BATCHES_PER_EPOCH]
                       [--clip-gradient CLIP_GRADIENT]
                       [--weight-decay WEIGHT_DECAY]
                       [--context-length-multiplier CONTEXT_LENGTH_MULTIPLIER]
                       [--num-layers NUM_LAYERS] [--num-cells NUM_CELLS]
                       [--cell-type CELL_TYPE] [--dropout-rate DROPOUT_RATE]
                       [--use-feat-dynamic-real USE_FEAT_DYNAMIC_REAL]
                       [--use-feat-static-cat USE_FEAT_STATIC_CAT]
                       [--use-feat-static-real USE_FEAT_STATIC_REAL]
                       [--scaling SCALING]
                       [--num-parallel-samples NUM_PARALLEL_SAMPLES]
                       [--num-lags NUM_LAGS] [--forecast-type FORECAST_TYPE]

    GluonTS implementation of paper 'Intermittent Demand Forecasting with Deep
    Renewal Processes'

    optional arguments:
      -h, --help            show this help message and exit
      --use-cuda USE_CUDA
      --datasource {retail_dataset}
      --regenerate-datasource REGENERATE_DATASOURCE
                            Whether to discard locally saved dataset and
                            regenerate from source
      --model-save-dir MODEL_SAVE_DIR
                            Folder to save models
      --point-forecast {median,mean}
                            How to estimate point forecast? Mean or Median
      --calculate-spec CALCULATE_SPEC
                            Whether to calculate SPEC. It is computationally
                            expensive and therefore False by default
      --batch_size BATCH_SIZE
      --learning-rate LEARNING_RATE
      --max-epochs MAX_EPOCHS
      --number-of-batches-per-epoch NUMBER_OF_BATCHES_PER_EPOCH
      --clip-gradient CLIP_GRADIENT
      --weight-decay WEIGHT_DECAY
      --context-length-multiplier CONTEXT_LENGTH_MULTIPLIER
                            If context multipler is 2, context available to hte
                            RNN is 2*prediction length
      --num-layers NUM_LAYERS
      --num-cells NUM_CELLS
      --cell-type CELL_TYPE
      --dropout-rate DROPOUT_RATE
      --use-feat-dynamic-real USE_FEAT_DYNAMIC_REAL
      --use-feat-static-cat USE_FEAT_STATIC_CAT
      --use-feat-static-real USE_FEAT_STATIC_REAL
      --scaling SCALING     Whether to scale targets or not
      --num-parallel-samples NUM_PARALLEL_SAMPLES
      --num-lags NUM_LAGS   Number of lags to be included as feature
      --forecast-type FORECAST_TYPE
                            Defines how the forecast is decoded. For details look
                            at the documentation

There is also a notebook in the examples folder which shows how to use the model. The relevant excerpt is below:

    import ast

    import mxnet as mx
    import numpy as np
    from gluonts.evaluation.backtest import make_evaluation_predictions

    # get_dataset, Trainer, DeepRenewalEstimator and IntermittentEvaluator, as well
    # as `args` (the parsed CLI arguments) and `is_gpu`, are defined or imported
    # earlier in the notebook.

    dataset = get_dataset(args.datasource, regenerate=False)
    prediction_length = dataset.metadata.prediction_length
    freq = dataset.metadata.freq
    cardinality = ast.literal_eval(dataset.metadata.feat_static_cat[0].cardinality)
    train_ds = dataset.train
    test_ds = dataset.test

    trainer = Trainer(
        ctx=mx.context.gpu() if is_gpu & args.use_cuda else mx.context.cpu(),
        batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        epochs=20,
        num_batches_per_epoch=args.number_of_batches_per_epoch,
        clip_gradient=args.clip_gradient,
        weight_decay=args.weight_decay,
        hybridize=True,  # set hybridize=False during development
    )

    estimator = DeepRenewalEstimator(
        prediction_length=prediction_length,
        context_length=prediction_length * args.context_length_multiplier,
        num_layers=args.num_layers,
        num_cells=args.num_cells,
        cell_type=args.cell_type,
        dropout_rate=args.dropout_rate,
        scaling=True,
        lags_seq=np.arange(1, args.num_lags + 1).tolist(),
        freq=freq,
        use_feat_dynamic_real=args.use_feat_dynamic_real,
        use_feat_static_cat=args.use_feat_static_cat,
        use_feat_static_real=args.use_feat_static_real,
        cardinality=cardinality if args.use_feat_static_cat else None,
        trainer=trainer,
    )
    predictor = estimator.train(train_ds, test_ds)

    # Sample forecasts on the test set
    deep_renewal_flat_forecast_it, ts_it = make_evaluation_predictions(
        dataset=test_ds, predictor=predictor, num_samples=100
    )

    # Point and quantile metrics suited to intermittent demand
    evaluator = IntermittentEvaluator(
        quantiles=[0.25, 0.5, 0.75],
        median=True,
        calculate_spec=False,
        round_integer=True,
    )
    agg_metrics, item_metrics = evaluator(
        ts_it, deep_renewal_flat_forecast_it, num_series=len(test_ds)
    )

The paper evaluates the model on two datasets – Parts dataset and UCI Retail Dataset. And for evaluation of the probabilistic forecast, they use Quantile Loss.

The authors used a single hidden layer with 10 hidden units and used the softplus activation to map the LSTM embedding to distribution parameters. They have used a global RNN, i.e. LSTM parameters are shared across all the timeseries. And they evaluated the one-step ahead forecast.

We did not recreate the experiment, rather expanded the scope. The dataset chosen was the UCI Retail Dataset, and instead of one step ahead forecast, we took 39 days ahead forecast. This is more in line with a real world application where you need more than one-step ahead forecast to plan. And instead of just comparing with Croston and its variants, we do the comparison also with ARIMA, ETS, NPTS, and Deep AR (which is mentioned as the next steps in the paper).

UCI Retail dataset is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

**Columns:**

- *InvoiceNo*: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
- *StockCode*: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- *Description*: Product (item) name. Nominal.
- *Quantity*: The quantities of each product (item) per transaction. Numeric.
- *InvoiceDate*: Invoice date and time. Numeric, the day and time when each transaction was generated.
- *UnitPrice*: Unit price. Numeric, product price per unit in sterling.
- *CustomerID*: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- *Country*: Country name. Nominal, the name of the country where each customer resides.

**Preprocessing:**

- Group by *StockCode*, *Country*, *InvoiceDate* –> Sum of *Quantity*, and Mean of *UnitPrice*
- Filled in zeros to make timeseries continuous
- Clip lower value of Quantity to 0 (removing negatives)
- Took only time series which had length greater than 52 days.
- Train Test Split Date: *2011-11-01*
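The preprocessing steps above could look roughly like the pandas sketch below. The column names come from the dataset description, but the function name, the decision to aggregate at a daily level, and the way the length filter is applied are my assumptions; this is not the exact code used to prepare the data, and the train/test split is left out.

```python
import pandas as pd


def to_daily_series(df, min_length_days=52):
    """Turn raw transactions into one zero-filled daily Quantity series per item."""
    df = df.copy()
    # Drop the time of day so that transactions aggregate per calendar day.
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"]).dt.normalize()

    daily = (
        df.groupby(["StockCode", "Country", "InvoiceDate"])
        .agg(Quantity=("Quantity", "sum"), UnitPrice=("UnitPrice", "mean"))
        .reset_index()
    )

    series = {}
    for (stock, country), grp in daily.groupby(["StockCode", "Country"]):
        s = grp.set_index("InvoiceDate")["Quantity"]
        full_index = pd.date_range(s.index.min(), s.index.max(), freq="D")
        # Fill missing days with zero demand, then remove negatives (returns).
        s = s.reindex(full_index, fill_value=0).clip(lower=0)
        if len(s) > min_length_days:
            series[(stock, country)] = s
    return series


# Tiny made-up example
raw = pd.DataFrame(
    {
        "StockCode": ["A", "A", "A", "A"],
        "Country": ["UK", "UK", "UK", "UK"],
        "InvoiceDate": [
            "2011-01-01 10:00",
            "2011-01-01 12:00",
            "2011-01-03 09:00",
            "2011-01-05 09:00",
        ],
        "Quantity": [5, 2, -4, 3],
        "UnitPrice": [1.0, 1.0, 1.0, 1.0],
    }
)
example = to_daily_series(raw, min_length_days=2)
print(example[("A", "UK")])
```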

**Stats:**

- # of Timeseries: 3828. After filtering: 3671
- Quantity: Mean = 3.76, Max = 12540, Min = 0, Median = 0
- Heavily Skewed towards zero

**Time Series Segmentation**

Using the same segmentation – Intermittent, Lumpy, Smooth and Erratic- we discussed earlier, I’ve divided the dataset into four.

We can see that almost 98% of the timeseries in the dataset are either Intermittent or Lumpy, which is perfect for our use case.

The baseline we have chosen, as in the paper, is the Croston forecast. We have also added slight modifications of Croston, namely SBA and SBJ. So let’s first do a comparison against these baselines, for both one-step ahead and n-step ahead forecasts. We will also evaluate both the point forecast (using MSE, MAPE, and MAAPE) and the probabilistic forecast (using Quantile Loss).

We can see that the DRP models have considerably outperformed the baseline methods across both point as well as probabilistic forecasts.

Now let’s take a look at n-step ahead forecasts.

Here, we see a different picture. On the point forecasts, Croston is doing much better on MSE. DRPs are doing well on MAPE, but we know that MAPE favors under forecasting and is not very reliable when we look at intermittent demand patterns. So from a point forecast standpoint, I wouldn’t say that DRPs outperformed Croston on long term forecasts. What we do notice is that within the DRPs, Hybrid decoding is working much better than Flat decoding, in both MAPE and MSE.

For an extended comparison, let’s also include results from other popular forecasting techniques. We have added ETS and ARIMA from the point estimation stable and DeepAR and NPTS from the probabilistic forecast stable.

On MSE, ETS takes the top spot, even though on MAPE and MAAPE the DRPs retain their position. On the probabilistic forecast side, DeepAR has extraordinarily high Quantile Losses, but when we look at a weighted Quantile Loss (weighted by volume) we see it emerging in the top position. This may be because DeepAR is forecasting zero (or close to zero) most of the time and only forecasting good numbers for high volume SKUs. Across all three Quantile Losses, NPTS seems to be outperforming the DRPs by a small margin.

When we look at a long term forecast, we see that ARIMA and ETS are doing considerably better on the point estimates (MSE). And on the probabilistic side, DeepAR turned it around and managed to become the best probabilistic model.

**Intermittent**

**Stable (less Intermittent)**

Ali Caner Turkmen, Yuyang Wang, Tim Januschowski. “Intermittent Demand Forecasting with Deep Renewal Processes”. arXiv:1911.10416 [cs.LG] (2019)

Casually, we call a series intermittent when it has a lot of periods with no demand, i.e. sporadic demand. Syntetos and Boylan (2005) proposed a more formal way of categorizing time series. They used two parameters of the time series for this classification – Average Demand Interval and Square of Coefficient of Variation.

Average Demand Interval (ADI) is the average interval, in time periods, between two non-zero demands, i.e. if the ADI for a time series is 1.9, it means that on average we see a non-zero demand every 1.9 time periods.

ADI is a measure of intermittency; the higher it is, the more intermittent the series is.

Coefficient of Variation is the standardized standard deviation. We calculate the standard deviation, but then scale it with the mean of the series to guard against scale dependency.

This shows the variability of a time series. If the squared Coefficient of Variation (CV²) is high, the variability in the series is also high.

Based on these two demand characteristics, Syntetos and Boylan theoretically derived cutoff values which define a marked change in the type of behaviour. They defined the ADI (intermittency) cutoff as 1.32 and the CV² cutoff as 0.49. Using these cutoffs, they defined highs and lows and then put both together into a grid which classifies timeseries into Smooth, Erratic, Intermittent, and Lumpy.
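A small sketch of this classification is below. The cutoffs 1.32 and 0.49 are the ones quoted above. The rest is my assumption: ADI is approximated as the number of periods divided by the number of non-zero demands, and CV² is computed on the non-zero demand sizes, which is a common convention (computing it on the whole series is a possible variant). The function name is hypothetical.

```python
import numpy as np


def classify_demand(y, adi_cutoff=1.32, cv2_cutoff=0.49):
    """Return (label, ADI, CV^2) for a 1-D demand series."""
    y = np.asarray(y, dtype=float)
    nonzero = y[y > 0]
    if nonzero.size == 0:
        raise ValueError("series has no non-zero demand")

    adi = y.size / nonzero.size                      # average demand interval
    cv2 = (nonzero.std() / nonzero.mean()) ** 2      # squared coefficient of variation

    if adi < adi_cutoff:
        label = "Smooth" if cv2 < cv2_cutoff else "Erratic"
    else:
        label = "Intermittent" if cv2 < cv2_cutoff else "Lumpy"
    return label, adi, cv2


print(classify_demand([1, 2, 1, 2]))                   # low ADI, low CV^2
print(classify_demand([0, 0, 5, 0, 0, 5, 0, 0, 5]))    # high ADI, low CV^2
print(classify_demand([0, 0, 1, 0, 0, 20, 0, 0, 1]))   # high ADI, high CV^2
```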

The forecast measures we have discussed were all, predominantly, designed to handle Smooth and Erratic timeseries. But in the real world, there are a lot more Intermittent and Lumpy timeseries. Typical examples are Spare Parts sales, the Long Tail of Retail Sales, etc.

The single defining characteristic of Intermittent and Lumpy series is the number of periods with zero demand. And this wreaks havoc with a lot of the measures we have seen so far. All the percent errors (for eg. **MAPE**) become unstable because of the division by zero, which is now almost a certainty. Similarly, the Relative Errors (for eg. **MRAE**), where we use a reference forecast to scale the errors, also become unstable, especially when using the Naïve Forecast as a reference. This happens because there would be multiple periods with zero demand, which creates a zero reference error and hence an undefined ratio.

**sMAPE** is designed to guard against this division by zero, but even sMAPE has problems when the number of zero demands increases. And we know from our previous explorations that sMAPE has problems when the forecast is much higher than the actuals or vice versa. In the case of intermittent demand, such cases are galore. If there is zero demand and we have forecasted something, or the other way around, we have such a situation. For eg. if, for a zero demand, one method forecasts 1 and another forecasts 10, the outcome is 200% regardless.
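The 200% saturation is easy to verify. The sketch below assumes the usual per-period sMAPE definition, |F - A| divided by the average of |A| and |F|, expressed in percent; returning 0 when both actual and forecast are zero is my own convention for the sketch.

```python
def smape_term(actual, forecast):
    """Per-period sMAPE in percent: |F - A| / ((|A| + |F|) / 2) * 100."""
    denom = (abs(actual) + abs(forecast)) / 2
    if denom == 0:
        return 0.0  # both zero: treated as a perfect forecast (assumption)
    return 100 * abs(forecast - actual) / denom


print(smape_term(0, 1))    # 200.0
print(smape_term(0, 10))   # 200.0
print(smape_term(10, 0))   # 200.0
```

Whenever exactly one of the two values is zero, the term is 200% no matter how far off the other value is, so sMAPE cannot tell a forecast of 1 from a forecast of 10 in a zero-demand period.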

**Cumulative Forecast Error** **(CFE, CFE Min, CFE Max)**

We have already seen Cumulative Forecast Error (a.k.a. Forecast Bias) earlier. It is just the signed error summed over the entire horizon, so that the negative and positive errors cancel each other out. This has direct implications for over or under stocking in a Supply Chain. Peter Wallström[1] also advocates the use of CFE Max and CFE Min. A CFE of zero can happen by chance as well, and over a large horizon we miss out on a lot of detail in between. So he proposes to look at CFE in conjunction with CFE Max and CFE Min, which are the maximum and the minimum values of CFE in the horizon.
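Here is a small sketch of CFE together with CFE Min and CFE Max. I am assuming the error is defined as actual minus forecast, so that a positive value means under forecasting; the function name and the numbers are made up for illustration.

```python
import numpy as np


def cfe_path(actual, forecast):
    """Running cumulative forecast error, with error = actual - forecast."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.cumsum(actual - forecast)


actual = [0, 0, 6, 0]
forecast = [2, 2, 2, 0]

path = cfe_path(actual, forecast)   # [-2, -4, 0, 0]
cfe = path[-1]                      # CFE over the whole horizon
cfe_min, cfe_max = path.min(), path.max()
print(cfe, cfe_min, cfe_max)
```

In this example the CFE at the end of the horizon is zero, which looks unbiased, while CFE Min reveals that the forecast ran four units ahead of demand in between.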

**Percent Better(PBMAE, PBRMSE, etc.)**

We have already seen Percent Better. This is also a pretty decent measure to use for Intermittent demand. It does not have the problem of numerical instability and is defined everywhere. But it only counts how often a forecast is better; it does not measure the magnitude of the errors.

**Number of Shortages (NOS and NOSp)**

Generally, to trace whether a forecast is biased or not, we use the tracking signal (which is CFE/MAD). But the limits that are set to trigger warnings (+/- 4) are derived on the assumption that the demand is normally distributed. Intermittent demand is not normally distributed, and because of that this trigger raises a lot of false positives.

An alternative to this is the Number of Shortages measure, more commonly represented as Percentage of Number of Shortages. It just counts the number of instances where the cumulative forecast error was greater than zero, resulting in a shortage. A very high number or a very low number indicates bias in either direction.
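A minimal sketch of NOS and NOSp follows, again assuming the cumulative forecast error is built from actual minus forecast so that a positive value means a shortage. The function name and the data are hypothetical.

```python
import numpy as np


def number_of_shortages(actual, forecast):
    """Return (NOS, NOSp): count and percentage of periods with CFE > 0."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    cfe = np.cumsum(actual - forecast)
    nos = int(np.sum(cfe > 0))
    return nos, 100 * nos / len(cfe)


# CFE path is [2, 1, 0, 2]: three of the four periods end in a shortage.
print(number_of_shortages([3, 0, 0, 3], [1, 1, 1, 1]))
```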

**Periods in Stock (PIS)**

NOS does not identify systematic errors because it doesn’t consider the temporal dimension of stock carry over. PIS goes one step further and measures the total number of periods the forecasted items have spent in stock, or the number of stock out periods.

To understand how PIS works, let’s take an example.

Let’s say there is a forecast of one unit every day in a three day horizon. In the beginning of the first period the one item is delivered to the fictitious stock (this is a simplification compared to reality). If there has been no demand during the first day, the result is plus one PIS. When a demand occurs, the demand is subtracted from the forecast. A demand of one in period 1 results in zero PIS in period 1 and CFE of -1. If the demand is equal to zero during three periods, PIS in period 3 is equal to plus six. The item from day one has spent three days in stock, the item from the second day have spent two days in stock and the last item has spent one day in stock

Excerpt from Evaluation of Forecasting Techniques and Forecast Errors with a focus on Intermittent Demand

A positive number indicates over forecast and a negative number shows under forecast of demand. It can easily be calculated as the cumulative sum of the CFE (with the sign flipped when a positive CFE means under forecasting), i.e. the area under the bar chart in the diagram.
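The example from the excerpt can be reproduced in a few lines. This sketch assumes CFE is the running sum of actual minus forecast, so PIS is the negated running sum of the CFE; the function name is my own.

```python
import numpy as np


def periods_in_stock(actual, forecast):
    """Running PIS: negated cumulative sum of the running CFE."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    cfe = np.cumsum(actual - forecast)
    return -np.cumsum(cfe)


# Forecast of one unit per day, no demand for three days
print(periods_in_stock([0, 0, 0], [1, 1, 1]))  # [1. 3. 6.]

# Same forecast, but a demand of one in the first period
print(periods_in_stock([1, 0, 0], [1, 1, 1]))  # [0. 1. 3.]
```

With no demand, PIS reaches plus six in period 3, matching the excerpt: the three items spent three, two and one days in stock.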

**Stock-keeping-oriented Prediction Error Costs(SPEC)**

SPEC is a newer metric (Martin et al. 2020[4]) which tries to take the same route as Periods in Stock, but is slightly more sophisticated.

Although it looks intimidating at first, we can understand it intuitively. The crux of the calculation is handled by two inner min terms – Opportunity cost and Stock Keeping costs. These are the two costs which we need to balance in a Supply Chain from an inventory management perspective.

The left term measures the opportunity cost which arises from under forecasting. This is the sales which we could have made if there was enough stock. For example, if the demand was 10 and we only forecasted 5, we have an opportunity loss of 5. Now, let’s suppose we have been forecasting 5 for the last three time periods with no demand, and then a demand of 10 comes in. So we have 15 in stock and we fulfill 10. Here, there is no opportunity cost. We can also say that the opportunity cost for a time period will not be greater than the demand at that time period. Combining these conditions, we get the first term of the equation, which measures the opportunity cost.

Using the same logic as before, but inverting it, we can derive the similar equation for Stock Keeping costs(where we over forecast). That is taken care by the right term in the equation.

SPEC for a timestep actually looks at all the previous timesteps, calculates the opportunity costs and stock keeping costs for each of them, and adds them up to arrive at a single number. At any timestep, there will be either an opportunity cost or a stock keeping cost, which in turn depends on the cumulative forecast and actuals till that timestep.

And SPEC for a horizon of timeseries forecast is the average across all the timesteps.

Now there are two weights, one on each term, which let us apply different weightages to opportunity costs and stock keeping costs, and depending upon the strategy of the organization we can choose the right kind of weightages. The recommendation is to keep the sum of the weights at 1, and a weight of 0.75 on opportunity cost with 0.25 on stock keeping cost is a common choice in a retail setting.

One disadvantage of this is the time complexity. We need nested loops to calculate this metric, which makes it slow to compute.
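The nested loops are easier to see in code. The sketch below follows the formula as I understand it from the paper: for every timestep t and every earlier timestep i, take the larger of the weighted opportunity cost and the weighted stock keeping cost (floored at zero), multiply it by the number of periods the cost has been carried (t - i + 1), and average over the horizon. The default weights `a1=0.75` and `a2=0.25` are an assumption of this sketch; for real use, prefer the reference implementation linked below.

```python
import numpy as np


def spec(y_true, y_pred, a1=0.75, a2=0.25):
    """Stock-keeping-oriented Prediction Error Costs (sketch).

    a1 weights the opportunity cost, a2 weights the stock keeping cost.
    """
    y = np.asarray(y_true, dtype=float)
    f = np.asarray(y_pred, dtype=float)
    cum_y, cum_f = np.cumsum(y), np.cumsum(f)
    n = len(y)

    total = 0.0
    for t in range(n):            # outer loop over timesteps
        for i in range(t + 1):    # inner loop over all earlier timesteps
            opportunity = a1 * min(y[i], cum_y[i] - cum_f[t])
            stock_keeping = a2 * min(f[i], cum_f[i] - cum_y[t])
            total += max(0.0, opportunity, stock_keeping) * (t - i + 1)
    return total / n


# Forecasting 1 per period while demand only arrives in period 3
print(spec([0, 0, 3], [1, 1, 1], a1=0.5, a2=0.5))  # 2/3
```

The two loops make the cost quadratic in the length of the horizon, which is the time complexity problem mentioned above.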

The implementation is available here – https://github.com/DominikMartin/spec_metric

**Mean Arctangent Absolute Percent Error (MAAPE)**

This is a clever trick on the MAPE formula which avoids one of the main problems with it – undefined at zero. And while addressing this concern, this change also makes it symmetric.

The idea is simple. We know that the Absolute Percent Error is APE = |A − F| / |A|, where A is the actual and F is the forecast.

So, if we consider a right triangle with adjacent and opposite sides equal to A and |A − F| respectively, the Absolute Percent Error is nothing but the slope of the hypotenuse.

Slope can be measured as a ratio, ranging from 0 to infinity, and also as an angle, ranging from 0 to 90 degrees. The slope as a ratio is the traditional Absolute Percent Error that is quite popular. So the paper presents slope as an angle as a stable alternative. Those of you who remember your trigonometry would remember that the angle is the arctangent of the ratio of the opposite side to the adjacent side.

The paper christens it Arctangent Absolute Percent Error (AAPE = arctan(APE)) and defines the Mean Arctangent Absolute Percent Error (MAAPE) as the mean of the AAPE over all the observations.

arctan is defined at all real values from negative infinity to infinity, and it approaches π/2 as its argument goes to infinity. So by extension, while APE ranges over [0, ∞), AAPE ranges over [0, π/2]. This makes it defined everywhere and robust that way.
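A minimal NumPy sketch of MAAPE is below. The division by a zero actual yields infinity, which arctan maps to π/2. Treating the case where both actual and forecast are zero as an error of 0 is my assumption for the sketch; check how your library of choice handles it.

```python
import numpy as np


def maape(actual, forecast):
    """Mean Arctangent Absolute Percent Error, in radians (0 to pi/2)."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        ape = np.abs(a - f) / np.abs(a)      # inf when a == 0 and f != 0
    ape = np.where(np.isnan(ape), 0.0, ape)  # 0/0: treated as no error (assumption)
    return float(np.mean(np.arctan(ape)))


print(maape([0, 10, 0], [5, 5, 0]))
```

The first period (zero demand, forecast of 5) contributes π/2 instead of blowing up, the second contributes arctan(0.5), and the third contributes 0.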

The symmetry test we saw earlier gives us the below results (from the paper).

We can see that the asymmetry that we saw in APE is not as evident here. If we compare the complementary plots of AAPE with the ones we saw earlier for APE, AAPE is in much better shape.

We can see that AAPE still favours under forecasting, but not as much as APE and for that reason might be more useful.

**Relative Mean Absolute Error & Relative Mean Squared Error (RelMAE & RelMSE)**

These are relative measures which compare the error of the forecast with the error of a reference forecast, in most use cases a naïve forecast or, more formally, a random walk forecast.

**Scaled Error(MASE)**

We’ve seen MASE earlier as well and know how it’s defined. We scale the error by the average MAE of the reference forecast. Davydenko and Fildes, 2013[3], have shown that MASE is nothing but a weighted mean of Relative MAE, the weights being the number of error terms. This means that including both MASE and RelMAE may be redundant. But let’s check them out anyway.

Let’s pick a real dataset, run ARIMA, ETS, and Croston, with Zero Forecast as a baseline, and calculate all these measures (using GluonTS).

I’ve chosen the Retail Dataset from UCI Machine Learning Repository. It is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

**Columns:**

- *InvoiceNo*: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
- *StockCode*: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- *Description*: Product (item) name. Nominal.
- *Quantity*: The quantities of each product (item) per transaction. Numeric.
- *InvoiceDate*: Invoice date and time. Numeric, the day and time when each transaction was generated.
- *UnitPrice*: Unit price. Numeric, product price per unit in sterling.
- *CustomerID*: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- *Country*: Country name. Nominal, the name of the country where each customer resides.

**Preprocessing:**

- Group by *StockCode*, *Country*, *InvoiceDate* –> Sum of *Quantity*, and Mean of *UnitPrice*
- Filled in zeros to make timeseries continuous
- Clip lower value of Quantity to 0 (removing negatives)
- Took only time series which had length greater than 52 days.
- Train Test Split Date: *2011-11-01*

**Stats:**

- # of Timeseries: 3828. After filtering: 3671
- Quantity: Mean = 3.76, Max = 12540, Min = 0, Median = 0
- Heavily Skewed towards zero

**Time Series Segmentation**

Using the same segmentation – Intermittent, Lumpy, Smooth and Erratic- we discussed earlier, I’ve divided the dataset into four.

We can see that almost 98% of the timeseries in the dataset are either Intermittent or Lumpy, which is perfect for our use case.

We have included Zero Forecast as kind of a litmus test which will tell us which forecast metrics we should be wary of when using it with Intermittent Demand.

We can see that sMAPE, RelMAE, MAE, MAPE, MASE and ND (which is the volume weighted MAPE) all favor the zero forecast and rank it the best forecasting method. But when we look at the inventory related metrics (like CFE, PIS etc., which measure systematic bias in the forecast), Zero Forecast is the worst performing.

MASE, which was supposed to perform better on Intermittent Demand, also falls flat and rates the Zero Forecast the best. The danger with choosing a forecasting methodology based on these measures is that we end up forecasting way too low, and that will wreak havoc in the downstream planning tasks.

Surprisingly, ETS and ARIMA fare quite well (better than Croston) and rank 1st and 2nd when we look at metrics like PIS, MSE, CFE, NOSp etc.

Croston fares well only when we look at MAAPE, MRAE, and CFE_min.

We have ranked the different forecast methods based on all these different metrics. And if a set of metrics are measuring the same thing, then these rankings would also show good correlation. So let’s calculate the Spearman’s Rank correlation for these ranks and see which metrics agree with each other.
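As an illustration of the idea, Spearman's rank correlation between metrics can be obtained by ranking the methods under each metric and then taking the ordinary Pearson correlation of the ranks. The scores in this sketch are made up purely to show the mechanics and are not results from the experiment; the column and index names are hypothetical.

```python
import pandas as pd

# Hypothetical aggregate scores: rows are forecasting methods, columns are metrics.
scores = pd.DataFrame(
    {
        "MAE": [1.0, 2.0, 3.0, 4.0],
        "MAPE": [10.0, 20.0, 30.0, 40.0],
        "abs_PIS": [400.0, 300.0, 200.0, 100.0],
    },
    index=["Zero", "Croston", "ETS", "ARIMA"],
)

# Rank the methods under each metric (1 = lowest score).
ranks = scores.rank(axis=0)

# Pearson correlation of ranks is Spearman's rank correlation.
rank_corr = ranks.corr(method="pearson")
print(rank_corr)
```

Metrics that rank the methods in the same order get a correlation of 1, and metrics that rank them in exactly the opposite order get -1.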

We can see two large groups of metrics which are positively correlated within each group and negatively correlated between groups. MAE, MRAE, MASE, MAPE, RelMAPE, ND, and sMAPE fall into one group and MSE, RMSE, CFE, PIS, SPEC_0.75, SPEC_0.5, SPEC_0.25, NOSp, PBMAE, RelRMSE, and NRMSE into the other. MAAPE and CFE_min fall into the second group as well, but are only lightly correlated.

Are these two groups measuring different characteristics of the forecast?

Let’s look at the same agreement between metrics at an item level now, instead of the aggregate one. For eg. for each item that we are forecasting we rank the forecasting methods based on these different metrics and run a Spearman’s Rank Correlation on those ranks.

Similar to the aggregate level view, here also we can find two groups of metrics, but contrary to the aggregate level, we cannot find a strong negative correlation across the two groups. SPEC_0.5 (where we give equal weightage to both opportunity cost and stock keeping cost) and PIS show a high correlation, mostly because they are conceptually the same.

Another way to visualize and understand the similarity of different metrics is to use the item level metrics and run a PCA with two dimensions. And plot the direction of the original features which point towards the two components we have extracted using PCA. It shows how the original variables contribute to creating the principal component. So if we assume the two PCA components are the main “attributes” that we measure when we talk about “accuracy” of a forecast, the Loading Plot shows you how these different features(metrics) contribute to it, both magnitude and direction wise.
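For readers who want to see what goes into a loading plot, here is a NumPy-only sketch. It standardizes the item-level metrics, runs PCA through an SVD, and scales each principal direction by the standard deviation of its component, which is one common definition of loadings. Whether the plots here were produced with exactly this scaling is not something I can confirm, and the function name and the synthetic data are mine.

```python
import numpy as np


def pca_loadings(X, n_components=2):
    """Loadings (n_features x n_components) of standardized data.

    Assumes every column of X has non-zero variance.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize each metric
    _, s, vt = np.linalg.svd(Z, full_matrices=False)   # rows of vt: principal directions
    component_std = s[:n_components] / np.sqrt(len(Z))
    return vt[:n_components].T * component_std


# Synthetic "metrics": the second is a rescaled copy of the first,
# the third is its mirror image, the fourth is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 1, -a, b])

loadings = pca_loadings(X)
print(loadings.round(2))
```

Each row of `loadings` is the arrow drawn for one metric: metrics that measure the same thing get the same arrow, and a metric with a flipped sign points in exactly the opposite direction.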

Here, we can see the relationship more crystalized. Most of the metrics are clustered together around the two components. MAE, MSE, RMSE, CFE, CFE_max, and the SPEC metrics all occupy similar space in the loading plot, and it looks like it is the component for “forecast bias” as CFE and SPEC metrics dominate this component. PIS is on the other side, almost at 180 degrees to CFE, because of the sign of PIS.

The other component might be the “accuracy” component. This is dominated by RelRMSE, MASE, MAAPE, MRAE etc. MAPE seems to straddle between the two components, and so is MAAPE.

We can also see that sMAPE might be measuring something totally different, like NOSp and CFE_min.

PIS is at a 180 degrees from CFE, SPEC_0.5, and SPEC_0.75 because of the sign, but they are measuring the same thing. SPEC_0.25(where we give 0.25 weight to opportunity cost) shows more similarity to the other group, probably because it favors under forecasting because of the heavy penalty on stock keeping costs.

We’ve not done a lot of experiments in this short blog post (not as many as Peter Wallström’s thesis[1]), but whatever we have done has shown us quite a bit already. We know not to rely on metrics like sMAPE, RelMAE, MAE, MAPE, MASE because they were giving a zero forecast the best ranking. We also know that there is no single metric that would tell you the whole story. If we look at something like MAPE, we are not measuring the structural bias in the forecast. And if we just look at CFE, it might show a rosy picture, when it is not the case.

Let me quickly summarize the findings from Peter Wallström’s thesis (along with a few of my own conclusions).

- MSE and CFE, even though they appear in the same place in the loading plots, do not show that relationship consistently across different forecasting methods. We can see the same in our loading plots as well. CFE is away from the second component for Croston and ETS.
- MAE and MSE are strongly related and they measure the same variability. And since MAE has shown an affinity to zero forecast, it is preferable to use MSE based error metrics.
- CFE on its own is not very reliable to measure Forecast Bias. And it should be paired with metrics like PIS or SPEC to have a complete picture. CFE can conceal the bias tendency when a time in point is considered. If the CFE value is low in absolute terms the sign do not reveal any bias information. A positive CFE (underestimating) might just be a random figure for a method that is overestimating the demand when the other measures are checked. The low CFE is the result of fulfilling the demand afterwards the demand has occurred which is not traceable by CFE.
- Peter also recommends not to use CFE_max and CFE_min in favor of metrics like PIS and NOSp.

- Apart from these, SPEC scores and MAAPE(which were not reviewed in the thesis) are also suitable measures.

GitHub Repository – https://github.com/manujosephv/forecast_metrics

**Checkout the rest of the articles in the series**

- Forecast Error Measures: Understanding them through experiments
- Forecast Error Measures: Scaled, Relative, and other Errors
- Forecast Error Measures: Intermittent Demand

- Peter Wallström, 2009, Evaluation of Forecasting Techniques and Forecast Errors with a focus on Intermittent Demand
- Kim et al., 2016. A new metric of absolute percentage error for intermittent demand forecasts
- Davidenko & Fildes. 2013, Measuring Forecast Accuracy: The Case Of Judgmental Adjustments To Sku-Level Demand Forecasts
- Martin et al. 2013, A New Metric for Lumpy and Intermittent Demand Forecasts: Stock-keeping-oriented Prediction Error Costs

Both Scaled Error and Relative Error are extrinsic error measures. They depend on another reference forecast to evaluate themselves, and more often than not, in practice, the reference forecast is a Naïve Forecast or a Seasonal Naïve Forecast. In addition to these errors, we will also look at measures like Percent Better, Cumulative Forecast Error, Tracking Signal etc.

When we say Relative Error, there are two main ways of calculating it, and Shcherbakov et al. call them Relative Errors and Relative Measures.

Relative Error is when we use the forecast from a reference model as a base to compare the errors and Relative Measures is when we use some forecast measure from a reference base model to calculate the errors.

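Relative Error is calculated as below:

$$RE_t = \frac{e_t}{e_t^*}$$

where $e_t$ is the error of our forecast at timestep $t$ and $e_t^*$ is the error of the reference forecast at the same timestep. Taking the absolute value of this ratio gives the Relative Absolute Error.

Similarly, Relative Measures are calculated as below:

$$RelMAE = \frac{MAE}{MAE^*}$$

where $MAE$ is the Mean Absolute Error on the forecast and $MAE^*$ is the MAE of the reference forecast. This measure can be anything really, and not just MAE.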
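To make the distinction concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are made up for illustration, and the reference forecast is assumed to be the one-step naïve forecast. The relative absolute error is computed per timestep, while the relative MAE is a ratio of two aggregate measures.

```python
def naive_forecast(series):
    """One-step naive forecast: the forecast for step t is the actual at t-1."""
    return series[:-1]


def relative_absolute_errors(actuals, forecast, reference):
    """Per-timestep |e_t| / |e*_t|; None where the reference error is zero."""
    out = []
    for a, f, r in zip(actuals, forecast, reference):
        ref_err = abs(a - r)
        out.append(abs(a - f) / ref_err if ref_err != 0 else None)
    return out


def mae(actuals, forecast):
    return sum(abs(a - f) for a, f in zip(actuals, forecast)) / len(actuals)


def rel_mae(actuals, forecast, reference):
    """Relative measure: MAE of our forecast over MAE of the reference."""
    return mae(actuals, forecast) / mae(actuals, reference)


# Hypothetical toy data
y = [10.0, 12.0, 12.0, 15.0, 14.0]
actuals = y[1:]                      # what we evaluate against
reference = naive_forecast(y)        # previous actual as the forecast
forecast = [11.0, 13.0, 14.0, 14.0]  # our (made-up) forecast

rae = relative_absolute_errors(actuals, forecast, reference)
relmae_value = rel_mae(actuals, forecast, reference)
print(rae)           # the second entry is None: the naive forecast was exact there
print(relmae_value)  # 0.5
```

Note how the second timestep has no defined relative error because two consecutive actuals are equal, while the relative MAE is still well defined.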

Relative Error is based on a reference forecast. Most commonly we use the Naïve Forecast, but it does not have to be that every time. For instance, we can use Relative Measures if we have an existing forecast we are trying to beat, or we can use the baseline forecast we define during the development cycle, etc.

One disadvantage we can see right away is that it will be undefined when the reference forecast is equal to ground truth. And this can be the case for either very stable time series or intermittent ones where we can have the same ground truth repeated, which makes the naïve forecast equal to the ground truth.

Scaled Error was proposed by Hyndman and Koehler in 2006. They proposed to scale the errors based on the in-sample MAE from the naïve forecasting method. So instead of using the ground truth from the previous timestep as the scaling factor, we use the average absolute error across the entire series as the scaling factor.

$$SE_t = \frac{e_t}{\frac{1}{n-l}\sum_{i=l+1}^{n}\left|y_i - y_{i-l}\right|}$$

where $e_t$ is the error at timestep $t$, $n$ is the length of the timeseries, $y_t$ is the ground truth at timestep $t$, and $l$ is the offset. $l$ is 1 for naïve forecasting. Another alternative that is popularly used is to set $l$ to the seasonal period. For eg. $l=12$, for a seasonality of 12 months.

Here in-sample MAE is chosen because it is always available and more reliable to estimate the scale as opposed to the out of sample ones.
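As a rough sketch of the scaling described above (the function name, argument names, and numbers are mine, not from the paper), MASE can be computed by dividing the out-of-sample MAE by the in-sample MAE of the naïve forecast with offset `l`:

```python
def mase(train, actuals, forecast, l=1):
    """Mean Absolute Scaled Error.

    train    : in-sample history used to compute the scaling factor
    actuals  : out-of-sample ground truth
    forecast : out-of-sample forecast
    l        : offset of the naive method (1 = naive, 12 = monthly seasonal naive)
    """
    n = len(train)
    if n <= l:
        raise ValueError("history must be longer than the offset l")
    # In-sample MAE of the (seasonal) naive forecast
    scale = sum(abs(train[i] - train[i - l]) for i in range(l, n)) / (n - l)
    if scale == 0:
        raise ValueError("history is constant; the scaled error is undefined")
    errors = [abs(a - f) for a, f in zip(actuals, forecast)]
    return (sum(errors) / len(errors)) / scale


# Hypothetical toy data
history = [10.0, 12.0, 11.0, 13.0, 12.0]  # naive in-sample MAE = 1.5
future = [14.0, 13.0]
predicted = [13.0, 13.5]                  # out-of-sample MAE = 0.75

mase_value = mase(history, future, predicted)
print(mase_value)  # 0.5
```

A value below 1 means the forecast did better than the in-sample naïve forecast did on average, and a value above 1 means it did worse.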

In our previous blog, we checked Scale Dependency, Symmetricity, Loss Curves, Over and Under Forecasting, and Impact of Outliers. But this time, we are dealing with relative errors. Plotting loss curves is therefore not easy anymore, because there are three inputs (ground truth, forecast, and reference forecast) and the value of the measure can vary with each of these. Over and Under Forecasting and Impact of Outliers we can still check.

The loss curves are plotted as a contour map to accommodate the three dimensions – Error, Reference Forecast and the measure value.

We can see that the errors are symmetric around the Error axis. If we keep the Reference Forecast constant and vary the error, the measures are symmetric on both sides of the errors. Not surprising since all these errors have their base in absolute error, which we saw was symmetric.

But the interesting thing here is the dependency on the reference forecast. The same error leads to different Relative Absolute Error values depending on the Reference Forecast.

We can see the same asymmetry in the 3D plot of the curve as well. But Scaled Error is different here because it is not directly dependent on the Reference Forecast, but rather on the mean absolute error of the reference forecast. And therefore it has the good symmetry of absolute error and very little dependency on the reference forecast.

For the Over and Under Forecasting experiment, we repeated the same setup from last time\*, but for these four error measures – Mean Relative Absolute Error(MRAE), Mean Absolute Scaled Error(MASE), Relative Mean Absolute Error(RMAE), and Relative Root Mean Squared Error(RRMSE).

*\* With one small change: we also add random noise smaller than 1 to make sure consecutive actuals are not the same, because in such cases the Relative Measures are undefined.*

We can see that these scaled and relative errors do not have that problem of favoring over or under forecasting. The error bars of the low forecast and the high forecast are equally bad. Even in cases where the base error was favoring one of these (for eg. MAPE), the relative error measure (RMAPE) reduces that “favor” and makes the error measure more robust.

One other thing we notice is that the Mean Relative Absolute Error has a huge spread (I’ve actually zoomed in to make the plot legible). For eg. the median *baseline_rmae* is 2.79 and the maximum *baseline_mrae* is 42k. This large spread shows us that the Mean Relative Absolute Error has low reliability. Depending on the sample, the errors vary wildly. This may be partly because of the way we use the Reference Forecast. If the Ground Truth is too close to the Reference Forecast (in this case the Naïve Forecast), the denominator is close to zero and the errors are going to be much higher. This disadvantage is partly resolved by using the Median Relative Absolute Error(MdRAE).

For checking the outlier impact also, we repeated the same experiment from previous blog post for MRAE, MASE, RMAE, and RRMSE.

Apart from these standard error measures, there are a few more tailored to tackle aspects of the forecast which are not properly covered by the measures we have seen so far.

Out of all the measures we’ve seen so far, only MAPE is what I would call interpretable for non-technical folks. But as we saw, MAPE does not have the best of properties. None of the other measures intuitively conveys how good or bad the forecast is. Percent Better is another attempt at getting that kind of interpretability.

Percent Better(PB) also relies on a reference forecast and measures our forecast by counting the number of instances where our forecast error measure was better than reference forecast error.

For eg.

$$PB(MAE) = \frac{1}{N}\sum_{i=1}^{N} I_i$$

where $I_i$ = 0 when MAE>MAE* and 1 when MAE<MAE* for instance $i$, and N is the number of instances.

Similarly, we can extend this to any other error measure. This gives us an intuitive understanding of how much better we are doing compared to the reference forecast. It is also pretty resistant to outliers because it only counts the instances instead of measuring or quantifying the error.

That is also a key disadvantage. We are only counting the number of times we are better. It doesn’t measure how much better or how much worse we are doing. Whether our error is 50% less than the reference error or 1% less, the impact on the Percent Better score is the same.
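A small sketch makes both the calculation and the disadvantage visible. The function name and the numbers are hypothetical, and the score is expressed as a percentage:

```python
def percent_better(errors, reference_errors):
    """Share of instances (in %) where our error beats the reference error."""
    wins = sum(1 for e, r in zip(errors, reference_errors) if e < r)
    return 100.0 * wins / len(errors)


# Hypothetical MAE values per series
mae_reference = [2.0, 2.5, 0.4, 4.1]
mae_slightly_better = [1.9, 2.4, 0.5, 4.0]  # wins by a hair on three series
mae_much_better = [0.1, 0.1, 0.5, 0.1]      # wins by a lot on the same three

pb_slight = percent_better(mae_slightly_better, mae_reference)
pb_much = percent_better(mae_much_better, mae_reference)
print(pb_slight, pb_much)  # 75.0 75.0
```

Both forecasts get the same score, even though one of them is far more accurate than the other.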

Normalized RMSE was proposed to neutralize the scale dependency of RMSE. The general idea is to divide RMSE by a scalar, like the maximum value in all the timeseries, the difference between the maximum and the minimum, or the mean value of all the ground truths.

Since dividing by the maximum or by the difference between maximum and minimum is prone to impact from outliers, the popular use of nRMSE is normalizing with the mean.

$$nRMSE = \frac{RMSE}{mean(y)}$$
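Here is a minimal sketch of the mean-normalized variant, with made-up numbers. Multiplying both the actuals and the forecast by 10 changes the RMSE tenfold but leaves the nRMSE untouched:

```python
import math


def rmse(actuals, forecast):
    return math.sqrt(
        sum((a - f) ** 2 for a, f in zip(actuals, forecast)) / len(actuals)
    )


def nrmse(actuals, forecast):
    """RMSE normalized by the mean of the ground truth."""
    return rmse(actuals, forecast) / (sum(actuals) / len(actuals))


# Hypothetical series and the same series at 10x the scale
actuals_small, forecast_small = [100.0, 200.0], [110.0, 190.0]
actuals_large = [10 * a for a in actuals_small]
forecast_large = [10 * f for f in forecast_small]

print(rmse(actuals_small, forecast_small), rmse(actuals_large, forecast_large))
print(nrmse(actuals_small, forecast_small), nrmse(actuals_large, forecast_large))
```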

All the errors we’ve seen so far focus on penalizing errors, no matter whether they are positive or negative. We use an absolute or squared term to make sure the errors do not cancel each other out and paint a rosier picture than the real one.

But by doing this, we are also becoming blind to structural problems with the forecast. If we are consistently over forecasting or under forecasting, that is something we should be aware of and take corrective actions. But none of the measures we’ve seen so far looks at this perspective.

This is where Forecast Bias comes in.

$$Forecast\ Bias = \frac{\sum_t y_t - \sum_t \hat{y}_t}{\sum_t y_t} \times 100\%$$

where $y_t$ is the actual and $\hat{y}_t$ is the forecast at timestep $t$.

Although it looks like the Percent Error formula, the key here is the absence of the absolute term. Without the absolute term, we are cumulating the actuals and the forecast and measuring the difference between them as a percentage. This gives an intuitive explanation. If we see a bias of 5%, we can infer that overall, we are under-forecasting by 5%. Depending on whether we use Actuals – Forecast or Forecast – Actuals, the sign flips and the interpretation changes accordingly, but in spirit it is the same.

If we are calculating across timeseries, then also we cumulate the actuals and forecast at whatever cut of the data we are measuring and calculate the Forecast Bias.
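A quick sketch, assuming the Actuals – Forecast convention (so a positive value means under-forecasting) and using made-up numbers, shows what Forecast Bias picks up that an absolute measure does not:

```python
def forecast_bias(actuals, forecast):
    """Cumulative bias in %, Actuals - Forecast convention (positive = under-forecast)."""
    return 100.0 * (sum(actuals) - sum(forecast)) / sum(actuals)


def mae(actuals, forecast):
    return sum(abs(a - f) for a, f in zip(actuals, forecast)) / len(actuals)


actuals = [100.0, 120.0, 80.0, 100.0]
mostly_low = [90.0, 110.0, 90.0, 90.0]   # under-forecasts three times out of four
balanced = [110.0, 110.0, 90.0, 90.0]    # misses by 10 in both directions

print(mae(actuals, mostly_low), forecast_bias(actuals, mostly_low))  # 10.0 5.0
print(mae(actuals, balanced), forecast_bias(actuals, balanced))      # 10.0 0.0
```

Both forecasts have the same MAE, but only the first one has a structural tendency to under-forecast. To calculate the bias across several timeseries, the same function can be applied to the concatenated actuals and forecasts of whatever cut of the data we are looking at.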

Let’s add the error measures we saw now to the summary table we made last time.

Again we see that there is *no one ring to rule them all*. There may be different choices depending on the situation and we need to pick and choose for specific purposes.

We have already seen that it is not easy to just pick one forecast metric and use it everywhere. Each of them has its own advantages and disadvantages and our choice should be cognizant of all of those.

That being said, there are rules of thumb you can apply to help you along the process:

- If every timeseries is on the same scale, use MAE, RMSE etc.
- If there are large changes in the timeseries (i.e. in the horizon we are measuring, there is a huge shift in timeseries levels), then something like Percent Better or Relative Absolute Error can be used.
- When summarizing across timeseries, for metrics like Percent Better or APE, we can use Arithmetic Means (eg. MAPE). For relative errors, it has been empirically shown that Geometric Means have better properties. But at the same time, they are also vulnerable to outliers. A few ways we can control for outliers are:
  - Trimming the outliers or discarding them from the aggregate calculation.
  - Using the Median for aggregation (MdAPE), which is another extreme measure for controlling outliers.
  - Winsorizing (replacing the outliers with the cutoff value), which is another way to deal with such huge individual errors.
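The aggregation options in the list above can be sketched with the standard library alone. The helper names, the choice of cutting `k` values from each end, and the sample of relative errors are all assumptions made for illustration:

```python
import math
import statistics


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def trimmed_mean(values, k):
    """Drop the k smallest and k largest values, then average the rest."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])


def winsorized_mean(values, k):
    """Clip values to the k-th smallest and k-th largest cutoffs, then average."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return statistics.mean([min(max(v, low), high) for v in values])


# Hypothetical relative errors for six series, one of them wildly off
relative_errors = [0.8, 0.9, 1.0, 1.1, 1.2, 40.0]

summary = {
    "arithmetic_mean": statistics.mean(relative_errors),
    "geometric_mean": geometric_mean(relative_errors),
    "median": statistics.median(relative_errors),
    "trimmed_mean": trimmed_mean(relative_errors, k=1),
    "winsorized_mean": winsorized_mean(relative_errors, k=1),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

On this sample the arithmetic mean is dominated by the single bad series, the geometric mean is pulled up much less, and the median, trimmed mean, and winsorized mean all stay close to 1.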

Armstrong et al. 1992 carried out an extensive study on these forecast metrics, using the M competition to draw 5 subsamples totaling 90 annual and 101 quarterly series, along with their forecasts. They then calculated the error measures on this sample and carried out a study to examine them.

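The key dimensions they examined the different measures for were reliability, construct validity, sensitivity to small changes, protection against outliers, and relationship to decision making.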

Reliability is about whether repeated application of the measure produces similar results. To measure this, they first calculated the error measures for different forecasting methods on all 5 subsamples (aggregate level), and ranked the methods in order of performance. They did this for 1 step ahead and 6 steps ahead forecasts, for Annual and Quarterly series.

They then calculated the Spearman’s rank-order correlation coefficients for each pair of subsamples and averaged them. E.g. they took the rankings from subsample 1 and compared them with subsample 2, then subsample 1 with subsample 3, etc., until all the pairs were covered, and then averaged the coefficients.
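The pairwise averaging can be sketched as below. The rankings are invented for illustration (three subsamples ranking four forecasting methods), and the shortcut formula used for Spearman’s coefficient assumes there are no ties in the rankings:

```python
from itertools import combinations


def spearman_from_ranks(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rank of each of four methods under one error measure,
# computed separately on three subsamples
rankings = {
    "subsample_1": [1, 2, 3, 4],
    "subsample_2": [1, 3, 2, 4],
    "subsample_3": [2, 1, 3, 4],
}

pairwise = {
    (a, b): spearman_from_ranks(rankings[a], rankings[b])
    for a, b in combinations(rankings, 2)
}
reliability = sum(pairwise.values()) / len(pairwise)
print(pairwise)
print(round(reliability, 3))  # 0.667
```

The closer the averaged coefficient is to 1, the more consistently the error measure ranks the methods from one subsample to the next.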

The rankings based on RMSE were the least reliable, with very low correlation coefficients. They state that RMSE can overcome this reliability issue only when there is a high number of time series in the mix, which might neutralize the effect.

They also found that Relative Measures like Percent Better and MdRAE have much higher reliability than their peers. They also tried to calculate the number of series required to achieve the same statistical significance as Percent Better – 18 series for GMRAE, 19 using MdRAE, 49 using MAPE, 55 using MdAPE, and 170 using RMSE.

While reliability measures consistency, construct validity asks whether a measure does, in fact, measure what it intends to measure. This shows us the extent to which the various measures assess the “accuracy” of forecasting methods. To compare this, they examined the rankings of the forecast methods as before, but this time they compared the rankings between pairs of error measures. For eg., how much agreement is there between the ranking based on RMSE and the ranking based on MAPE?

These correlations are influenced by both Construct Validity and Reliability. To account for the effect of Reliability, the authors derived the same table using a larger number of samples and found that, as expected, the average correlations increased from 0.34 to 0.68, showing that these measures are, in fact, measuring what they are supposed to.

As a final test of validity, they constructed a consensus ranking by averaging the rankings from each of the error measures for the full sample of 90 annual series and 101 quarterly series, and then examined the correlations of each individual error measure ranking with the consensus ranking.

RMSE had the lowest correlation with the consensus. This is probably because of the low reliability. It can also be because of RMSE’s emphasis on higher errors.

Percent Better also shows low correlation(even though it had high reliability). This is probably because Percent better is the only measure which does not measure the magnitude of the error.

It is desirable to have error measures which are sensitive to effects of changes, especially for parameter calibration or tuning. The measure should indicate the effect on “accuracy” when a change is made in the parameters of the model.

Median error measures are not sensitive and neither is Percent Better. Median aggregation hides the change by focusing on the middle value and will only change slowly. Percent Better is not sensitive because once the series is performing better than the reference, it stops making any more change in the metric. It also does not measure if we improve an extremely bad forecast to a point where it is almost as accurate as a naïve forecast.

The paper makes it very clear that none of the measures they evaluated are ideal for decision making. They propose RMSE as a good enough measure and frown upon percent based errors under the argument that actual business impact occurs in dollars and not in percent errors. But I disagree with the point because when we are objectively evaluating a forecast to convey how good or bad it is doing, RMSE just does not make the cut. If I walk up to the top management and say that the financial forecast had an RMSE of 22343 that would fall flat. But instead if I say that the accuracy was 90% everybody is happy.

Both the paper and I agree on one thing: the relative error measures are not that relevant to decision making.

To help with the selection of errors, the paper also rates the different measures on the dimensions they identified.

For calibration or parameter tuning, the paper suggests using one of the measures rated high in sensitivity – RMSE, MAPE, and **GMRAE**. And because of the low reliability of RMSE and MAPE’s tendency to favor low forecasts, they suggest GMRAE(Geometric Mean Relative Absolute Error). **MASE** was proposed well after the release of this paper and hence does not factor in this analysis. But if you think about it, MASE is also sensitive and immune to the problems that we see for RMSE or MAPE, and can be a good candidate for calibration.

To select between forecast methods, the primary criteria are reliability, construct validity, protection against outliers, and relationship to decision making. Sensitivity is not that important in this context.

The paper dismisses RMSE right away because of its low reliability and its lack of protection against outliers. When the number of series is low, they suggest MdRAE, which is as reliable as GMRAE but offers additional protection from outliers. Given a moderate number of series, reliability becomes less of an issue, and in such cases MdAPE would be an appropriate choice because of its closer relationship to decision making.

Over the two blog posts, we’ve seen a lot of forecast measures and understood the advantages and disadvantages of each of them. And finally we arrived at a few rules of thumb to go by when choosing forecast measures. Although not conclusive, I hope it gives you a direction when going about these decisions.

But all this discussion was made under the assumption that the time series we are forecasting are stable and smooth. In real-world business cases, there are also a lot of series which are intermittent or sporadic. We see long periods of zero demand before a non-zero demand. In such cases, almost all of the error measures (with the possible exception of MASE) fail. In the next blog post, let’s take a look at a few different measures which are suited to intermittent demand.

Github Link for the Experiments: https://github.com/manujosephv/forecast_metrics

**Update – 04-10-2020**

Upon further reading, I stumbled upon a few criticisms of MASE as well, and thought I should mention them here.

- There is some criticism of the fact that we use the average MAE of the reference forecast as the scaling term. Davidenko and Fildes(2013)[3] claim that this introduces a bias towards overrating the accuracy of the reference forecast. In other words, the penalty for bad forecasting becomes larger than the reward for good forecasting.
- Another criticism derives from the fact that the mean is not a very stable estimate and can be swayed by a couple of large values.

Another interesting fact that Davidenko and Fildes[3] show is that MASE is equivalent to the weighted arithmetic mean of relative MAE, where the number of available error values is the weight.

- Shcherbakov et al. 2013, A Survey of Forecast Error Measures
- Armstrong et al. 1992, Error Measures for Generalizing About Forecasting Methods: Empirical Comparisons
- Davidenko & Fildes. 2013, Measuring Forecast Accuracy: The Case Of Judgmental Adjustments To Sku-Level Demand Forecasts

Measurement is the first step that leads to control and eventually improvement.

H. James Harrington

In many business applications, the ability to plan ahead is paramount, and in a majority of such scenarios we use forecasts to help us plan ahead. For eg., if I run a retail store, *how many boxes of that shampoo should I order today?* Look at the forecast. *Will I achieve my financial targets by the end of the year?* Let’s forecast and make adjustments if necessary. If I run a bike rental firm, *how many bikes do I need to keep at a metro station tomorrow at 4pm?*

If for all of these scenarios, we are taking actions based on the forecast, we should also have an idea about how good those forecasts are. In classical statistics or machine learning, we have a few general loss functions, like the squared error or the absolute error. But because of the way Time Series Forecasting has evolved, there are a lot more ways to assess your performance.

In this blog post, let’s explore the different Forecast Error measures through experiments and understand the drawbacks and advantages of each of them.

There are a few key points which make the metrics in Time Series Forecasting stand out from the regular metrics in Machine Learning.

As the name suggests, Time Series Forecasting has the temporal aspect built into it, and there are metrics like Cumulative Forecast Error or Forecast Bias which take this temporal aspect into account as well.

In most business use-cases, we would not be forecasting a single time series, rather a set of time series, related or unrelated. And the higher management would not want to look at each of these time series individually, but rather an aggregate measure which tells them directionally how well we are doing the forecasting job. Even for practitioners, this aggregate measure helps them to get an overall sense of the progress they make in modelling.

Another key aspect in forecasting is the concept of over and under forecasting. We would not want the forecasting model to have structural biases which make it always over or under forecast. And to combat these, we would want metrics which don’t favor either over-forecasting or under-forecasting.

The final aspect is interpretability. Because these metrics are also used by non-analytics business functions, they need to be interpretable.

Because of these different use cases, there are a lot of metrics that are used in this space, and here we try to unify them under some structure and also critically examine them.

We can classify the different forecast metrics, broadly, into two buckets – **Intrinsic and Extrinsic**. Intrinsic measures are the measures which just take the generated forecast and ground truth to compute the metric. Extrinsic measures are measures which use an external reference forecast in addition to the generated forecast and ground truth to compute the metric.

Let’s stick with the intrinsic measures for now (Extrinsic ones require a whole different take on these metrics). There are four major ways in which we calculate errors – Absolute Error, Squared Error, Percent Error and Symmetric Error. All the metrics that come under these are just different aggregations of these fundamental errors. So, without loss of generality, we can discuss these broad sections and the discussion would apply to all the metrics under these heads as well.

This group of error measurement uses the absolute value of the error as the foundation.

Instead of taking the absolute value, we square the errors to make them positive, and this is the foundation for these metrics.

In this group of error measurement, we scale the absolute error by the ground truth to convert it into a percentage term.

Symmetric Error was proposed as an alternative to Percent Error, where we take the average of forecast and ground truth as the base on which to scale the absolute error.
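The four building blocks can be written down in a few lines. This is a sketch with function names of my own choosing, and the symmetric error here uses the plain average of actual and forecast as the base, which assumes non-negative values:

```python
def absolute_error(actual, forecast):
    return abs(actual - forecast)


def squared_error(actual, forecast):
    return (actual - forecast) ** 2


def percent_error(actual, forecast):
    # Undefined (division by zero) when the actual is 0
    return abs(actual - forecast) / actual


def symmetric_error(actual, forecast):
    # Base is the average of actual and forecast instead of the actual alone
    return abs(actual - forecast) / ((actual + forecast) / 2)


def aggregate(error_fn, actuals, forecasts):
    """Mean of a pointwise error, e.g. MAE, MSE, MAPE or sMAPE."""
    values = [error_fn(a, f) for a, f in zip(actuals, forecasts)]
    return sum(values) / len(values)


# Hypothetical actuals and forecasts
actuals, forecasts = [8.0, 10.0, 12.0], [10.0, 10.0, 9.0]
print("MAE  ", aggregate(absolute_error, actuals, forecasts))
print("MSE  ", aggregate(squared_error, actuals, forecasts))
print("MAPE ", aggregate(percent_error, actuals, forecasts))
print("sMAPE", aggregate(symmetric_error, actuals, forecasts))
```

Metrics like RMSE or the median-based variants only change the aggregation step, not the pointwise error underneath.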

Instead of just saying that these are the drawbacks and advantages of such and such metrics, let’s design a few experiments and see for ourselves what those advantages and disadvantages are.

In this experiment, we try to figure out the impact of the scale of the timeseries on aggregated measures. For this experiment, we

- Generate 10000 synthetic time series at different scales, but with the same error.
- Split these series into 10 histogram bins
- Sample Size = 5000; Iterate over each bin
  - Sample 50% from the current bin and the rest, equally distributed, from the other bins.
  - Calculate the aggregate measures on this set of time series
  - Record against the bin lower edge
- Plot the aggregate measures against the bin edges.

The error measure should be symmetric to the inputs, i.e. Forecast and Ground Truth. If we interchange the forecast and actuals, ideally the error metric should return the same value.

To test this, let’s make a grid of 0 to 10 for both actuals and forecast and calculate the error metrics on that grid.
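A stripped-down version of this check, without the heatmap, could look like the sketch below. It is not the code from the repository. The grid starts at 1 instead of 0 to sidestep the division by zero in Percent Error, and the metric functions are redefined here so the snippet stands on its own:

```python
def absolute_error(actual, forecast):
    return abs(actual - forecast)


def percent_error(actual, forecast):
    return abs(actual - forecast) / actual


def symmetric_error(actual, forecast):
    return abs(actual - forecast) / ((actual + forecast) / 2)


grid = range(1, 11)  # assumption: skip 0 so percent error stays defined


def is_symmetric(metric):
    """True if swapping actual and forecast never changes the metric on the grid."""
    return all(
        abs(metric(a, f) - metric(f, a)) < 1e-12
        for a in grid
        for f in grid
    )


symmetry = {
    metric.__name__: is_symmetric(metric)
    for metric in (absolute_error, percent_error, symmetric_error)
}
print(symmetry)
```

Only Percent Error changes when actuals and forecast are interchanged, because it is the only one of the three whose base depends on the actuals alone.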

In this experiment, we take complementary pairs of ground truths and forecasts which add up to a constant quantity and measure the performance at each point. Specifically, we use the same setup as we did the Symmetricity experiment, and calculate the points along the cross diagonal where ground truth + forecast always adds up to 10.

Our metrics depend on two entities – forecast and ground truth. If we fix one and vary the other using a symmetric range of errors (for eg. -10 to 10), then we expect the metric to behave the same way on both sides of that range. In our experiment, we chose to fix the Ground Truth because in reality, that is the fixed quantity, and we are measuring the forecast against the ground truth.

In this experiment we generate 4 random time series – ground truth, baseline forecast, low forecast and high forecast. These are just random numbers generated within a range. Ground Truth and Baseline Forecast are random numbers generated between 2 and 4. Low Forecast is a random number generated between 0 and 3, and High Forecast is a random number generated between 3 and 6. In this setup, the Baseline Forecast should act as a baseline for us, Low Forecast is a forecast where we continuously under-forecast, and High Forecast is a forecast where we continuously over-forecast. Now let’s calculate the MAPE for these three forecasts and repeat the experiment 1000 times.

To check the impact of outliers, we set up the below experiment.
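Here is a minimal re-creation of this setup in plain Python. It is a sketch rather than the code from the repository: the series length of 100, the uniform distribution, and the seed are my assumptions, while the ranges and the 1000 repetitions follow the description above.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def mape(actuals, forecast):
    return sum(abs(a - f) / a for a, f in zip(actuals, forecast)) / len(actuals)


def run_once(n=100):
    """One repetition: MAPE of the baseline, low, and high forecasts."""
    truth = [rng.uniform(2, 4) for _ in range(n)]
    baseline = [rng.uniform(2, 4) for _ in range(n)]
    low = [rng.uniform(0, 3) for _ in range(n)]
    high = [rng.uniform(3, 6) for _ in range(n)]
    return mape(truth, baseline), mape(truth, low), mape(truth, high)


results = [run_once() for _ in range(1000)]
avg_baseline, avg_low, avg_high = (sum(col) / len(col) for col in zip(*results))

print(f"baseline: {avg_baseline:.3f}  low: {avg_low:.3f}  high: {avg_high:.3f}")
```

Even though the low and high forecasts are built to miss by a similar amount on average, the low forecast ends up with the smaller MAPE.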

We want to check the relative impact of outliers on two axes – number of outliers, scale of outliers. So we define a grid – number of outliers [0%-40%] and scale of outliers [0 to 2]. Then we picked a synthetic time series at random, and iteratively introduced outliers according to the parameters of the grid we defined earlier and recorded the error measures.

That’s a nice symmetric heatmap. We see zero errors along the diagonal, and higher errors spanning away from it in a nice symmetric pattern.

Again symmetric. MAE varies equally if we go on both sides of the curve.

Again good news. If we vary forecast, keeping actuals constant, and vice versa the variation in the metric is also symmetric.

As expected, over or under forecasting doesn’t make much of a difference in MAE. Both are equally penalized.

This is the Achilles heel of MAE. Here, as we increase the base level of the time series, we can see that the MAE increases linearly. This means that when we are comparing performances across timeseries, this is not the measure you want to use. For eg., when comparing two timeseries, one with a level of 5 and another with a level of 100, using MAE would always assign a higher error to the timeseries with level 100. Another example is when you want to compare different sub-sections of your set of timeseries to see where the error is higher (for eg. different product categories, etc.). Using MAE would always tell you that the sub-section which has higher average sales also has a higher MAE, but that doesn’t mean that sub-section is not doing well.

Squared Error also shows the symmetry we are looking for. But one additional point we can see here is that the errors are skewed towards higher errors. The distribution of color from the diagonal is not as uniform as we saw in Absolute Error. **This is because the squared error (because of the square term) assigns higher impact to higher errors than to lower errors.** This is also why Squared Errors are, typically, more prone to distortion due to outliers.
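The level 5 versus level 100 comparison can be reproduced with a tiny sketch. The assumption here is that both series miss by the same relative amounts (the noise pattern is made up), so the forecasts are equally good in relative terms:

```python
def mae(actuals, forecast):
    return sum(abs(a - f) for a, f in zip(actuals, forecast)) / len(actuals)


def mape(actuals, forecast):
    return sum(abs(a - f) / a for a, f in zip(actuals, forecast)) / len(actuals)


relative_noise = [0.1, -0.1, 0.2, -0.2]  # hypothetical relative deviations


def make_series(level):
    """Actuals fluctuate around the level; the forecast is the level itself."""
    actuals = [level * (1 + r) for r in relative_noise]
    forecast = [float(level)] * len(relative_noise)
    return actuals, forecast


actuals_5, forecast_5 = make_series(5)
actuals_100, forecast_100 = make_series(100)

print(mae(actuals_5, forecast_5), mae(actuals_100, forecast_100))    # MAE grows with the level
print(mape(actuals_5, forecast_5), mape(actuals_100, forecast_100))  # MAPE stays the same
```

The MAE of the second series is 20 times larger purely because of its level, while a scale independent measure like MAPE rates both forecasts the same.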

*Side Note:* Since squared error and absolute error are also used as loss functions in many machine learning algorithms, this also has the implications on the training of such algorithms. If we choose squared error loss, we are less sensitive to smaller errors and more to higher ones. And if we choose absolute error, we penalize higher and lower errors equally and therefore a single outlier will not influence the total loss that much.
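To put a number on that, consider ten errors where nine are of size 1 and one outlier is of size 10 (a made-up example), and look at how much of the total loss the outlier is responsible for under each loss:

```python
errors = [1.0] * 9 + [10.0]  # nine small errors and one outlier

total_absolute = sum(abs(e) for e in errors)  # 19.0
total_squared = sum(e ** 2 for e in errors)   # 109.0

outlier_share_absolute = abs(errors[-1]) / total_absolute
outlier_share_squared = errors[-1] ** 2 / total_squared

print(f"share of absolute loss: {outlier_share_absolute:.1%}")  # about 52.6%
print(f"share of squared loss:  {outlier_share_squared:.1%}")   # about 91.7%
```

Under the squared loss the single outlier accounts for over 90% of the total, so a model trained on it will bend much more towards fitting that one point.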

We can see the same pattern here as well. It is symmetric around the origin, but because of the quadratic form, higher errors are penalized disproportionately more than lower ones.

Similar to MAE, because of the symmetry, Over and Under Forecasting has pretty much the same impact.

Similar to MAE, RMSE also has the scale dependency problem, which means that all the disadvantages we discussed for MAE apply here as well, but worse. We can see that RMSE scales quadratically when we increase the scale.

Percent Error is **the** most popular error measure used in the industry. A couple of reasons why it is hugely popular are:

- Scale Independent – As we saw in the scale dependency plots earlier, the MAPE line is flat as we increase the scale of the timeseries.
- Interpretability – Since the error is represented as a percentage term, which is quite popular and interpretable, the error measure also instantly becomes interpretable. If we say the RMSE is 32, it doesn’t mean anything in isolation. But on the other hand, if we say the MAPE is 20%, we instantly know how good or bad the forecast is.

Now that doesn’t look right, does it? Percent Error, the most popular of them all, doesn’t look symmetric at all. In fact, we can see that the errors peak when the actuals are close to zero and tend to infinity when the actuals are zero (the colorless band at the bottom is where the error is infinite because of division by zero).

We can see two shortcomings of the percent error here:

- It is undefined when ground truth is zero(because of division by zero)
- It assigns higher error when ground truth value is lower(top right corner)

Let’s look at the Loss Curves and Complementary Pairs plots to understand more.

Suddenly, the asymmetry we are seeing is no more. If we keep the ground truth fixed, Percent Error is symmetric around the origin.

But when we look at complementary pairs, we see the asymmetry we were seeing earlier in the heatmap. When the actuals are low, an error has a much higher Percent Error than the same error when the forecast is low.

All of this is because of the base which we take for scaling it. Even if we have the same magnitude of error, if the ground truth is low, the percent error will be high and vice versa. For example, let’s review two cases:

- F = 8, A = 2 –> Absolute Percent Error = |2 – 8| / 2 = 300%
- F = 2, A = 8 –> Absolute Percent Error = |8 – 2| / 8 = 75%

There are countless papers and blogs which claim the asymmetry of percent error to be a deal breaker. The popular claim is that absolute percent error penalizes over-forecasting more than under-forecasting, or in other words, it incentivizes under-forecasting.

One argument against this point is that this asymmetry is only there because we change the ground truth. An error of 6 for a time series which has an expected value of 2 is much more serious than an error of 6 for a time series which has an expected value of 8. So according to that intuition, the percent error is doing what it is supposed to do, isn’t it?

Not exactly. On some levels the criticism of percent error is rightly justified. Here we see that the forecast where we were under-forecasting has a consistently lower MAPE than the ones where we were over-forecasting. The spread of the low MAPE is also considerably lower than the others. But does that mean that the forecast which always predicts on the lower side is the better forecast as far as the business is concerned? Absolutely not. In a Supply Chain, that leads to stock outs, which is not where you want to be if you want to stay competitive in the market.

Symmetric Error was proposed as a better alternative to Percent Error. There were two key disadvantages for Percent Error – being undefined when the Ground Truth is zero, and asymmetry. Symmetric Error proposed to solve both by using the average of ground truth and forecast as the base over which we calculate the percent error.

Right off the bat, we can see that this is symmetric around the diagonal, almost similar to Absolute Error in case of symmetry. And the bottom bar which was empty, now has colors(which means they are not undefined). But a closer look reveals something more. It is not symmetric around the second diagonal. We see the errors are higher when both actuals and forecast are low.

This is further evident in the Loss Curves. We can see the asymmetry as we increase errors on both sides of the origin. And contrary to the name, Symmetric error penalizes under forecasting more than over forecasting.
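A two-line calculation shows where this comes from. Keeping the ground truth fixed at 10 (a made-up value) and missing by 2 in either direction, the base of the symmetric error shrinks when we under-forecast and grows when we over-forecast:

```python
def symmetric_error(actual, forecast):
    return abs(actual - forecast) / ((actual + forecast) / 2)


actual = 10.0
over_forecast_error = symmetric_error(actual, 12.0)   # 2 / 11
under_forecast_error = symmetric_error(actual, 8.0)   # 2 / 9

print(f"over-forecast by 2:  {over_forecast_error:.4f}")
print(f"under-forecast by 2: {under_forecast_error:.4f}")
```

The same absolute miss costs more when it is an under-forecast, which is the asymmetry we see in the loss curve.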

But when we look at complementary pairs, we can see it is perfectly symmetrical. This is probably because of the base, which we are keeping constant.

We can see the same here as well. The over forecasting series has a consistently lower error as compared to the under forecasting series. So in the effort to normalize the bias towards under forecasting of Percent Error, Symmetric Error shot the other way and is biased towards over forecasting.

In addition to the above experiments, we also ran an experiment to check the impact of outliers (single predictions which are wildly off) on the aggregate metrics.

All four error measures behave similarly when it comes to outliers. The number of outliers has a much higher impact than the scale of the outliers.

Among the four, RMSE is impacted the most by outliers. We can see the contour lines are spaced far apart, showing the rate of change is high when we introduce outliers. On the other end of the spectrum, we have sMAPE, which is impacted the least by outliers. It is evident from the flat and closely spaced contour lines. MAE and MAPE behave almost the same, with MAPE probably a tad better.
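A toy calculation shows why RMSE reacts more strongly than MAE. This is not the experiment behind the contour plots; the 100 points, the base error of 1 and the outlier size of 20 are made-up numbers for illustration.

```python
import math


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def with_outliers(n_points, base_error, n_outliers, outlier_error):
    """Error vector where n_outliers points are replaced by a large error."""
    return [outlier_error] * n_outliers + [base_error] * (n_points - n_outliers)


clean = with_outliers(100, 1.0, 0, 0.0)
dirty = with_outliers(100, 1.0, 5, 20.0)

# How much each aggregate grows once 5% of the points are wildly off.
mae_ratio = mae(dirty) / mae(clean)
rmse_ratio = rmse(dirty) / rmse(clean)
print(round(mae_ratio, 2), round(rmse_ratio, 2))
```

Squaring the errors before averaging lets a handful of large misses dominate the sum, which is why the RMSE ratio comes out well above the MAE ratio.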

To close off, there is no one metric which satisfies all the *desiderata* of an error measure, and depending on the use case we need to pick and choose. Out of the four intrinsic measures (and all their aggregations like MAPE, MAE, etc.), if we are not concerned by Interpretability and Scale Dependency, we should choose Absolute Error measures (this is a general statement; there are concerns with Reliability for both Absolute and Squared Error measures). And when we are looking for scale-independent measures, Percent Error is the best we have (even with all of its shortcomings). Extrinsic Error measures like Scaled Error offer a much better alternative in such cases (maybe I’ll cover those in another blog post).

All the code to recreate the experiments are at my github repository:

https://github.com/manujosephv/forecast_metrics/tree/master

**Checkout the rest of the articles in the series**

- Forecast Error Measures: Understanding them through experiments
- Forecast Error Measures: Scaled, Relative, and other Errors
- Forecast Error Measures: Intermittent Demand


**Further Reading**

- Shcherbakov et al. 2013, A Survey of Forecast Error Measures
- Goodwin & Lawton, 1999, On the asymmetry of symmetric MAPE

**Edited**

*Fixed a mislabeling in the Contour Maps*

So, I present to you, the Battle of the Boosters.

I have chosen a few regression datasets from Kaggle Datasets, mainly because they are easy to set up and run in Google Colab. Another reason is that I do not need to spend a lot of time on data preprocessing; instead I can pick one of the public kernels and get cracking. I’ll also share one kernel for EDA of each dataset we choose. All notebooks will also be shared, with links at the bottom of the blog.

- AutoMPG – The data is technical spec of cars from UCI Machine Learning Repository.
  Shape: (398, 9), EDA Kernel, Data Volume: Low
- House Prices: Advanced Regression Techniques – The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
  Shape: (1460, 81), EDA Kernel, Data Volume: Medium
- Electric Motor Temperature – The data set comprises several sensor data collected from a permanent magnet synchronous motor (PMSM) deployed on a test bench. The PMSM represents a German OEM’s prototype model.
  Shape: (998k, 13), EDA Kernel, Data Volume: High

- XGBoost
- LightGBM
- Regularized Greedy Forest
- NGBoost

Nothing fancy here. I just did some basic data cleansing and scaling, with most of the code taken from public kernels. The only point that matters is that the same preprocessing is applied for all algorithms.

I have chosen cross validation to make sure the comparison between different algorithms is more generalized than specific to one particular split of the data. I have chosen a simple K-Fold with 5 folds for this exercise.

**Evaluation Metric :** Mean Squared Error

To have a standard evaluation for all the algorithms (thankfully all of them expose a scikit-learn API), I defined a couple of functions.

**Default Parameters:** First fit the CV splits with default parameters of the model. We record the mean and standard deviation of the CV scores and then fit the entire train split to predict on the test split.

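    import math

    import numpy as np
    import sklearn.metrics as sklm
    import sklearn.model_selection as ms


    def eval_algo_sklearn(alg, x_train, y_train, x_test, y_test, cv):
        # Negated MSE per CV fold (scikit-learn scorers are "higher is better")
        MSEs = ms.cross_val_score(alg, x_train, y_train,
                                  scoring='neg_mean_squared_error', cv=cv)
        meanMSE = np.mean(MSEs)
        stdMSE = np.std(MSEs)  # spread of the fold-level MSEs
        # Refit on the full train split and predict on the test split
        alg = alg.fit(x_train, y_train)
        pred = alg.predict(x_test)
        rmse_train = math.sqrt(-meanMSE)  # RMSE from the mean CV MSE
        rmse_test = math.sqrt(sklm.mean_squared_error(y_test, pred))
        return rmse_train, stdMSE, rmse_test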

**With Hyperparameter Tuning**: Very similar to the previous one, but with an additional step of GridSearch to find best parameters.

    import math

    import sklearn.metrics as sklm
    from sklearn.base import clone
    from sklearn.model_selection import GridSearchCV


    def tune_eval_algo_sklearn(alg, param_grid, x_train, y_train, x_test, y_test, cv):
        # Exhaustive search over the grid, scored by negated MSE
        grid = GridSearchCV(alg, param_grid=param_grid, cv=cv,
                            scoring='neg_mean_squared_error', n_jobs=-1)
        grid.fit(x_train, y_train)
        print(grid.best_estimator_)
        # Rebuild a fresh estimator with the best parameters and refit it
        best_params = grid.best_estimator_.get_params()
        alg = clone(alg).set_params(**best_params)
        alg = alg.fit(x_train, y_train)
        pred = alg.predict(x_test)
        # CV statistics of the best parameter combination
        rmse_train = math.sqrt(-grid.cv_results_['mean_test_score'][grid.best_index_])
        stdMSE = grid.cv_results_['std_test_score'][grid.best_index_]
        rmse_test = math.sqrt(sklm.mean_squared_error(y_test, pred))
        return rmse_train, stdMSE, rmse_test, alg

Hyperparameter tuning is in no way exhaustive, but is fairly decent.

The grids over which we run GridSearch for the different algorithms are

**XGBoost**:

{ "learning_rate": [0.01, 0.1, 0.5], "n_estimators": [100, 250, 500], "max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5], "colsample_bytree": [0.5, 0.7, 1], }

**LightGBM**:

{ "learning_rate": [0.01, 0.1, 0.5], "n_estimators": [100, 250, 500], "max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5], "colsample_bytree": [0.5, 0.7, 1], }

**RGF**:

{ "learning_rate": [0.01, 0.1, 0.5], "max_leaf": [1000, 2500, 5000], "algorithm": ["RGF", "RGF_Opt", "RGF_Sib"], "l2": [1.0, 0.1, 0.01], }

**NGBoost**:

Because NGBoost is rather slow, instead of defining a standard grid for all experiments, I searched along each parameter independently and decided on a grid based on the intuitions from those experiments.

I’ve tabulated the Mean and Standard Deviations of RMSEs for the Train CV splits for all three datasets. For Electric-Motors, I did not tune the hyperparameters, as it was computationally expensive.

Disclaimer: These experiments are in no way complete. One would need a much larger scale experiment to arrive at a conclusion on which algorithm is doing better. And then there is the No Free Lunch Theorem to keep in mind.

Right off the bat, NGBoost seems like a strong contender in this space. On the AutoMPG and House Prices datasets, NGBoost performs the best among all the boosters, both on mean RMSE and on the Standard Deviation of the CV scores, and by a large margin. NGBoost also shows quite a large gap between the default and tuned versions. This shows that either the default parameters are not well set, or that tuning for each dataset is a key element in getting good performance from NGBoost. But the Achilles heel of the algorithm is the runtime. With those huge bars towering over the others, we can see that the runtime really is on a different scale compared to the other boosters. Especially on large datasets like the Electric Motor Temperature dataset, the runtime is prohibitively large, and because of that I didn’t tune the algorithm there. On that dataset, it fares last among the boosters.

Another standout algorithm is Regularized Greedy Forest, which performs as well as or even better than XGBoost. In the low and medium data settings, the runtime is also comparable to the reigning king, XGBoost.

In the low data setting, popular algorithms like XGBoost and LightGBM do not perform well. The standard deviation of the CV scores is also higher, especially for XGBoost, showing that it overfits. XGBoost has this problem in all three examples. In the matter of runtime, LightGBM reigns as king (although I haven’t tuned for computational performance), beating XGBoost in all three examples. In the high data setting, it blew everything else out of the water with a much lower RMSE and runtime than the rest of the competitors.

We have come far and wide in the world of gradient boosting, and I hope that, at least for some of you, Gradient Boosting no longer just means XGBoost. There are so many algorithms in this world, each with its own quirks, and a lot of them perform on par with or better than XGBoost. Another exciting area is Probabilistic Regression. I hope NGBoost becomes more efficient and gets over the hurdle of computational efficiency. Once that happens, NGBoost is a very strong candidate in the probabilistic regression space.

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI(A) – Natural Gradient
- Part VI(B) – NGBoost
- Part VII – The Battle of the Boosters

If you’ve not read the previous parts of the series, I strongly advise you to read up, at least the first one where I talk about the Gradient Boosting algorithm, because I am going to take it as a given that you already know what Gradient Boosting is. I would also strongly suggest reading Part VI(A) so that you have a better understanding of what Natural Gradients are.

The key innovation in NGBoost is the use of Natural Gradients instead of regular gradients in the boosting algorithm. And by adopting this probabilistic route, it models a full probability distribution over the outcome space, conditioned on the covariates.

The paper modularizes their approach into three components –

- Base Learner
- Parametric Distribution
- Scoring Rule

As in any boosting technique, there are base learners which are combined together to get a complete model. NGBoost doesn’t make any assumptions here and states that the base learners can be any simple model. The implementation supports a Decision Tree and Ridge Regression as base learners out of the box, but you can replace them with any other scikit-learn style model just as easily.

Here, we are not training a model to predict the outcome as a point estimate, instead we are predicting a full probability distribution. And every distribution is parametrized by a few parameters. For eg., the normal distribution is parametrized by its mean and standard deviation. You don’t need anything else to define a normal distribution. So, if we train the model to predict these parameters, instead of the point estimate, we will have a full probability distribution as the prediction.
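To illustrate what predicting the parameters buys us, here is a small sketch using the standard library’s `statistics.NormalDist`. The predicted mean and standard deviation are hypothetical numbers standing in for a model’s output for one input.

```python
from statistics import NormalDist

# Hypothetical model output for a single input: the two Normal parameters.
predicted_mean, predicted_std = 24.0, 3.0
predictive = NormalDist(mu=predicted_mean, sigma=predicted_std)

# The point estimate is still available ...
point_estimate = predictive.mean
# ... but so are intervals and tail probabilities.
lower, upper = predictive.inv_cdf(0.025), predictive.inv_cdf(0.975)
prob_above_30 = 1.0 - predictive.cdf(30.0)

print(point_estimate, round(lower, 2), round(upper, 2), round(prob_above_30, 4))
```

Two numbers per prediction are enough to answer questions a point estimate cannot, such as how likely the outcome is to exceed a threshold.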

Any machine learning system works on a learning objective, and more often than not, it is the task of minimizing some loss. In point prediction, the predictions are compared with data with a loss function. Scoring rule is the analogue from the probabilistic regression world. The scoring rule compares the estimated probability distribution with the observed data.

A proper scoring rule, $S$, takes as input a forecasted probability distribution $P$ and one observation $y$ *(outcome)*, and assigns a score $S(P, y)$ to the forecast such that the true distribution of the outcomes gets the best score in expectation.

The most commonly used proper scoring rule is the logarithmic score $\mathcal{L}$, which, when minimized, gives us the MLE:

$$\mathcal{L}(\theta, y) = -\log P_\theta(y)$$

which is nothing but the (negative) log likelihood that we have seen in so many places. And the scoring rule is parametrized by $\theta$ because that is what we are predicting as part of the machine learning model.
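As a quick sanity check of the "true distribution gets the best score in expectation" property, here is a sketch of the logarithmic score for a Normal distribution. The sample size, the seed and the competing parameter values are arbitrary choices for illustration.

```python
import math
import random


def log_score(mu, sigma, y):
    """Negative log-likelihood of y under Normal(mu, sigma)."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + 0.5 * z * z


# Observations drawn from the "true" distribution Normal(10, 2).
rng = random.Random(0)
observations = [rng.gauss(10.0, 2.0) for _ in range(5000)]


def mean_score(mu, sigma):
    return sum(log_score(mu, sigma, y) for y in observations) / len(observations)


score_true = mean_score(10.0, 2.0)
score_shifted = mean_score(12.0, 2.0)        # wrong location
score_overconfident = mean_score(10.0, 0.5)  # right location, too narrow

print(round(score_true, 3), round(score_shifted, 3), round(score_overconfident, 3))
```

Both the shifted and the overconfident forecasts get a worse (higher) average score than the true parameters, so minimizing the score pushes the model toward the right location and the right spread.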

Another example is CRPS(Continuous Ranked Probability Score). While the logarithmic score or the log likelihood generalizes Mean Squared Error to a probabilistic space, CRPS does the same to Mean Absolute Error.
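For a Normal forecast, CRPS has a well-known closed form, which makes the link to absolute error easy to see. The sketch below is a standalone illustration using the standard library, not the library’s implementation. As the predicted standard deviation shrinks towards zero, the CRPS approaches the absolute error of the mean.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a Normal(mu, sigma) forecast for observation y."""
    z = (y - mu) / sigma
    return sigma * (
        z * (2 * _STD_NORMAL.cdf(z) - 1)
        + 2 * _STD_NORMAL.pdf(z)
        - 1 / math.sqrt(math.pi)
    )


abs_error = abs(12.0 - 10.0)
crps_sharp = crps_normal(10.0, 1e-6, 12.0)  # almost a point forecast
crps_wide = crps_normal(10.0, 2.0, 12.0)    # admits its uncertainty

print(round(crps_sharp, 4), round(crps_wide, 4), abs_error)
```

In this example the wider forecast scores better than the near-point forecast, because it places some probability near the observed value.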

In the last part of the series, we saw what the Natural Gradient is. In that discussion, we talked about KL Divergence, because traditionally, Natural Gradients were defined on the MLE scoring rule. But the paper proposes a generalization of the concept and provides a way to extend it to the CRPS scoring rule as well. They generalized KL Divergence to a general Divergence and provided derivations for the CRPS scoring rule.

Now that we have seen the major components, let’s take a look at how all of this works together. NGBoost is a supervised learning method for probabilistic prediction that uses boosting to estimate the parameters of a conditional probability distribution $P_\theta(y|x)$. As we saw earlier, we need to choose three modular components upfront:

- Base learner ($f$)
- Parametric Probability Distribution ($P_\theta$)
- Proper scoring rule ($S$)

A prediction $y|x$ on a new input *x* is made in the form of a conditional distribution $P_\theta$, whose parameters $\theta$ are obtained by an additive combination of *M* base learner outputs and an initial $\theta^{(0)}$. Let’s denote the combined function learned by the base learners at stage $m$, for all parameters, by $f^{(m)}$. There will be a separate set of base learners for each parameter in the chosen probability distribution. For example, in the normal distribution, there will be $f^{(m)}_\mu$ and $f^{(m)}_{\log \sigma}$.

The predicted outputs are also scaled with a stage-specific scaling factor ($\rho^{(m)}$), and a common learning rate $\eta$:

$$y|x \sim P_\theta(x), \qquad \theta = \theta^{(0)} - \eta \sum_{m=1}^{M} \rho^{(m)} \cdot f^{(m)}(x)$$

One thing to note here is that even if you have *n* parameters for your probability distribution, $\rho^{(m)}$ is still a single scalar.

Let us consider a Dataset $D = \{x_i, y_i\}_{i=1}^{n}$, Boosting Iterations *M*, Learning rate $\eta$, Probability Distribution with parameters $\theta$, Proper scoring rule $S$, and Base learner $f$.

- Initialize $\theta^{(0)}$ to the marginal. This is just estimating the parameters of the distribution without conditioning on any covariates; similar to initializing to the mean. Mathematically, we solve this equation:

  $$\theta^{(0)} = \arg\min_\theta \sum_{i=1}^{n} S(\theta, y_i)$$

- For each iteration $m$ in 1 to *M*:
  - Calculate the Natural gradients $g_i^{(m)}$ of the Scoring rule $S$ with respect to the parameters at the previous stage, $\theta_i^{(m-1)}$, for all *n* examples in the dataset.
  - A set of base learners for that iteration, $f^{(m)}$, is fit to predict the corresponding components of the natural gradients, $g_i^{(m)}$. This output can be thought of as the projection of the natural gradient onto the range of the base learner class, because we are training the base learners to predict the natural gradient at the current stage.
  - This projected gradient is then scaled by a scaling factor $\rho^{(m)}$. This is because Natural Gradients rely on local approximations (as we saw in the earlier post) and these local approximations won’t hold good far away from the current parameter position. In practice, we use a line search to get the best scaling factor, the one which minimizes the overall scoring rule. In the implementation, they found that successively halving the scaling factor in the line search works well.
  - Once the scaling parameter is estimated, we update the parameters by adding the negative scaled projected gradient to the outputs of the previous stage, after further scaling by the learning rate: $\theta_i^{(m)} = \theta_i^{(m-1)} - \eta \, \rho^{(m)} f^{(m)}(x_i)$.
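The steps above can be sketched in a few dozen lines for the simplest case: a Normal distribution parametrized by $(\mu, \log \sigma)$ with the logarithmic score. This is a toy illustration under several assumptions of mine and not the library’s implementation: the base learner is a plain least-squares line on a single feature (the hypothetical `fit_line`), the line search just halves $\rho$ starting from 1, and everything is evaluated in-sample.

```python
import math
import random


def log_score(mu, log_sigma, y):
    """Negative log-likelihood of y under Normal(mu, exp(log_sigma))."""
    z = (y - mu) / math.exp(log_sigma)
    return 0.5 * math.log(2 * math.pi) + log_sigma + 0.5 * z * z


def natural_gradient(mu, log_sigma, y):
    """Natural gradient of the log score w.r.t. (mu, log_sigma).

    Ordinary gradient: (-(y - mu) / var, 1 - (y - mu)**2 / var).
    Fisher information for this parametrization: diag(1 / var, 2).
    """
    var = math.exp(2 * log_sigma)
    return -(y - mu), 0.5 * (1 - (y - mu) ** 2 / var)


def fit_line(xs, targets):
    """Least-squares line, a hypothetical stand-in for the base learner."""
    mean_x = sum(xs) / len(xs)
    mean_t = sum(targets) / len(targets)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, targets))
    slope = sxt / sxx
    return lambda x: mean_t + slope * (x - mean_x)


def total_score(params, ys):
    return sum(log_score(mu, s, y) for (mu, s), y in zip(params, ys))


def step(params, directions, scale):
    """Move every example's parameters against its projected gradient."""
    return [(mu - scale * d_mu, s - scale * d_s)
            for (mu, s), (d_mu, d_s) in zip(params, directions)]


def ngboost_sketch(xs, ys, n_iterations=100, learning_rate=0.1):
    # Initialise to the marginal: the Normal MLE ignoring the covariates.
    mean_y = sum(ys) / len(ys)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / len(ys))
    params = [(mean_y, math.log(std_y))] * len(ys)
    history = [total_score(params, ys)]
    for _ in range(n_iterations):
        grads = [natural_gradient(mu, s, y) for (mu, s), y in zip(params, ys)]
        # One base learner per distribution parameter.
        f_mu = fit_line(xs, [g[0] for g in grads])
        f_log_sigma = fit_line(xs, [g[1] for g in grads])
        directions = [(f_mu(x), f_log_sigma(x)) for x in xs]
        # Line search: halve a single scalar rho until the score improves.
        rho = 1.0
        while rho > 1e-8 and total_score(step(params, directions, rho), ys) >= history[-1]:
            rho /= 2
        # Update with the scaled projected gradient, damped by the learning rate.
        params = step(params, directions, learning_rate * rho)
        history.append(total_score(params, ys))
    return params, history


# Toy data: y = 3x + Normal(0, 0.5) noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
params, history = ngboost_sketch(xs, ys)

mean_y = sum(ys) / len(ys)
initial_sigma = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / len(ys))
mean_sigma = sum(math.exp(s) for _, s in params) / len(params)
print(round(history[0], 2), round(history[-1], 2), round(mean_sigma, 2))
```

Note how the natural gradient for the mean reduces to the plain residual, and how a single `rho` scales the steps for both parameters. The predicted spread starts at the marginal standard deviation and shrinks as the mean starts to explain the data.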

The algorithm has a ready-to-use scikit-learn style implementation at https://github.com/stanfordmlgroup/ngboost. Let’s take a look at the key parameters to tune in the model.

- *Dist*: This parameter sets the distribution of the output. Currently, the library supports *Normal, LogNormal, and Exponential* distributions for regression, and *k_categorical and Bernoulli* for classification. *Default: Normal*
- *Score*: This specifies the scoring rule. Currently the options are *LogScore* or *CRPScore*. *Default: LogScore*
- *Base*: This specifies the base learner. This can be any scikit-learn estimator. *Default is a 3-depth Decision Tree*
- *n_estimators*: The number of boosting iterations. *Default: 500*
- *learning_rate*: The learning rate. *Default: 0.01*
- *minibatch_frac*: The percent subsample of rows to use in each boosting iteration. This is more of a speed hack than a way to tune accuracy. When the data set is huge, this parameter can considerably speed things up.

Although the importances from machine learning models should be used with a considerable amount of caution, NGBoost also offers feature importances. It has a separate set of importances for each parameter it estimates.

But the best part is that SHAP is also readily available for the model. You just need to use TreeExplainer to get the values. (To know more about SHAP and other interpretability techniques, check out my other blog series – Interpretability: Cracking open the black box.)

The paper also looks at how the algorithm performs when compared to other popular algorithms. There were two separate types of evaluation – Probabilistic and Point Estimation.

On a variety of datasets from the UCI Machine Learning Repository, NGBoost was compared with other major probabilistic regression algorithms, like Monte-Carlo Dropout, Deep Ensembles, Concrete Dropout, Gaussian Process, Generalized Additive Model for Location, Scale and Shape(GAMLSS), Distributional Forests.

They also evaluated the algorithm on the point estimate use case against other regression algorithms like, Elastic Net, Random Forest(Sci-kit Learn), Gradient Boosting(Sci-kit Learn).

NGBoost performs as well as or better than the existing algorithms, and has the additional advantage of providing us with a probabilistic forecast. The formulation and implementation are flexible and modular enough to make it easy to use.

But one drawback here is the computational performance of the algorithm. The time complexity increases linearly with each additional parameter we have to estimate. And all the efficiency hacks/changes which have made their way into popular Gradient Boosting packages like LightGBM or XGBoost are not present in the current implementation. Maybe they will be ported over soon enough, because I see the repo is under active development and this is one of the action items they are targeting. But until that happens, this is quite slow, especially with large data. One way out is to use the *minibatch_frac* parameter to make the calculation of natural gradients faster.

Now that we have seen all the major Gradient Boosters, let’s pick a sample dataset and see how they perform in the next part of the series.


- Amari, Shun-ichi. Natural Gradient Works Efficiently in Learning. Neural Computation, Volume 10, Issue 2, p.251-276.
- Duan, Tony. NGBoost: Natural Gradient Boosting for Probabilistic Prediction, arXiv:1910.03225v4 [cs.LG] 9 Jun 2020
- NGBoost documentation, https://stanfordmlgroup.github.io/ngboost

But among all the negativity, there was a sliver of light shining through. When faced with a common enemy, mankind united across borders(for the most part; there are bad apples always) to help each other tide over the current assault. Scientists, who are the heroes of the day, doubled down to find a cure, vaccine, and a million other things which helps in the battle against COVID-19. And along with the real heroes, Data Scientists were also called to action to help in any way they could. A lot of people tried their best at forecasting the progression of the disease, so that the Governments can plan better. A lot more dedicated their time in analysing the data coming out of a multitude of sources to prepare dashboards, or network diagrams, etc. to help understand the progression of the disease. And yet another set of people tried to apply the techniques of AI to things like identifying the risk of a patient, or help diagnose the disease with X-Rays, etc.

While following these developments, one particular area that saw a lot of attempts was Chest Radiography based identification of COVID-19 cases. One of those early attempts received a lot of attention, volunteers, funding, etc., along with a lot of flak for the positioning the research took (You can read more about it here). TLDR; a PhD candidate out of Australia used a pretrained model (Resnet50), trained on 50 images, messed up the code because of a train-validation leak, and claimed 97% accuracy on COVID-19 case identification. Some others even got 100% accuracy (it turned out the model was evaluated on the same dataset it was trained on).

Along with this noise, an arXiv preprint came out of the University of Waterloo, Canada, by Linda Wang and Alexander Wong, titled COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images. In the paper, they propose a new CNN architecture which was trained from scratch on a dataset of 5941 posteroanterior chest radiography images. To generate the dataset, they combined two publicly available datasets – the COVID chest X-ray dataset and the Kaggle Chest X-ray images (pneumonia) dataset. In the paper, they divided this dataset into four classes – COVID-19, Viral, Bacterial, and Normal. The below bar chart shows the class distribution of the train and test splits. This was a decently sized dataset, although the COVID cases were on the lower side. They reported 100% Recall and 80% Precision for the model.

This was the first dataset of decent size on COVIDx and it got me interested. And since they shared the trained model and code to evaluate in a Github Repo, this was prime for an analysis.

I feel I need to state a disclaimer right about here. Whatever follows is a purely academic exercise and not at all an attempt to suggest this as a verifiable and valid way of testing for COVID-19. First things first. I personally do not endorse this attempt at identifying COVID-19 using any of these models. I have very little knowledge about medicine, and absolutely zero about reading an X-ray. Unless this has been verified and vetted by a medical professional, this is nothing better than a model trained on a competition data set.

There are also a few problems with the dataset.

- COVID-19 cases and the other cases come from different data sources, and it’s doubtful whether the model is identifying the data source or actual visual indicators representing COVID-19. I’ve tried to look at GradCAM results, but with me being an absolute zero in reading an X-ray, I don’t know if the model is looking at the right indicators.
- It is also unclear what stage of the disease a patient was in when the X-ray was taken. If it was taken too late in the disease, this method does not hold its ground.

The first thought I had when I saw the model and the dataset was this – Why not Transfer Learning? The dataset is quite small, especially the class that we are interested in. Training a model from scratch and trying to properly capture the complex representation of the different classes, especially the COVID-19 class was a little bit of a stretch for me.

But playing the Devil’s advocate, why would a CNN trained on animals and food (the most popular classes in ImageNet) do better in X-rays? The network trained on natural and colourful images might have learnt totally different feature representations necessary than what is necessary to handle monochromatic X-rays.

As a rational human being and a staunch believer of the process of Science, I decided to look up the existing literature on it. Surprisingly, the research community is divided about the issue. Let me make it clear. There is no debate as to whether pretraining or Transfer Learning works for medical images. But the debate is about whether pretraining on Imagenet has any benefit. There were papers which claimed Imagenet pretraining helped Medical Image Classification and segmentation. And there were papers who pushed for random initialisation for the weights.

There was a recent paper by Veronika Cheplygina[1] which did a review of the literature in the perspective of whether or not Imagenet pretraining is beneficial for medical images. The conclusion was – “It depends”. Another recent paper from Google Brain(which was accepted into NeurIPS 2019)[2] deep dives into this issue. Although the general conclusion was that transfer learning with Imagenet weights is not beneficial for medical imaging, they do point out a few other interesting behaviour:

- When the sample size is small, as is most cases in medical imaging, Transfer Learning, even if it is Imagenet based, is beneficial
- When they look at the convergence rates of the different models, pretrained ones converged faster
- Most interesting result was that they tried initializing the networks with random weights, but derived the mean and standard deviation of the random initialization based on pretrained weights and found that it too provided the convergence speedup that pretrained models had.

Bottom line was that the study didn’t show worse performance for Imagenet trained models and had faster convergence. Even though the large Imagenet models may be over-parametrized for the problem, it does offer some benefit if you want to get a model working as fast as possible.

Now that I’ve done the literature review, it was time to test out my hypothesis. I gathered the dataset, wrote up a training script, and tested out a few Pretrained models.

| Model | # of Parameters | GFLOPS |
|---|---|---|
| DenseNet 121 | 8,010,116 | ~3 |
| Xception | 22,915,884 | – |
| ResNeXt 101 32x4d | 44,237,636 | ~7.8 |

I’ve used the FastAI library (a wrapper around PyTorch), which is very easy to use, especially if you are doing Transfer Learning, with its easy “freeze” and “unfreeze” functions. Most of the experiments were run either on my laptop with a meagre GTX 1650 or on Google Colab. Apart from torchvision, I’ve used the amazing library pretrainedmodels by Cadene as a source of my pretrained models.

As our training dataset is relatively small and because it has two different sources of X-rays, I’ve used a few transformations as data augmentation. It both increases the number of samples in the dataset and gives better generalization capabilities to the model. The transforms used are:

- Horizontal Flip
- Rotate
- Zoom
- Brightness
- Warp
- Cutout

Below are the basic steps I’ve used for the training of these models. Full code is published on GitHub.

- Import the models and create a learner from fastai. fastai has a few inbuilt mechanisms to cut and split pretrained models so that we can use a custom head and apply discriminative learning rates easily. For the models in torchvision, the cut and split are predefined in fastai. But for models that are loaded from outside torchvision, we need to define those as well. “cut” tells fastai where to make the separation between the feature extractor part of the CNN and the classifier part, so that it can replace the latter with a custom head. “split” tells fastai how to split the different blocks of the CNN so that each block can have a different learning rate.
- Split the Train set into Train and Validation using a StratifiedShuffleSplit.
- Keep the loss as a standard CrossEntropy.
- Freeze the feature extractor part of the CNN and train the model. I used the One-Cycle Learning Rate Scheduler proposed by Leslie Smith[3]. It is heavily advocated by Jeremy Howard in his fastai courses and is implemented in the fastai library.
- After the learning saturates, unfreeze the rest of the model and finetune it. Whether to use the One-Cycle scheduler or not, and whether to use differential learning rates or not, was decided empirically by looking at the loss curves.

Mixup[4] is a form of data augmentation where we generate new examples by weighted linear interpolation of two existing examples:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$

$\lambda$ is between 0 and 1. In practice, it is sampled from a Beta distribution which is parametrised by $\alpha$. Typically, $\alpha$ is between 0.1 and 0.4, where the effect of mixup is not so strong that it leads to underfitting.
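Here is a minimal sketch of the idea in plain Python, with flat lists standing in for images and one-hot labels. The `mixup` function and the toy inputs are illustrative assumptions of mine; in the experiments, the fastai implementation of mixup was used rather than anything like this.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.4, rng=random):
    """Blend two examples with a mixing weight drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


rng = random.Random(0)
# Toy "images" (flattened pixels) with one-hot labels for two classes.
image_a, label_a = [0.0, 0.2, 0.4, 0.6], [1.0, 0.0]
image_b, label_b = [1.0, 0.8, 0.6, 0.4], [0.0, 1.0]

mixed_image, mixed_label, lam = mixup(image_a, label_a, image_b, label_b,
                                      alpha=0.4, rng=rng)
print(mixed_image, mixed_label)
```

With a small `alpha`, the Beta distribution puts most of its mass near 0 and 1, so most mixed examples stay close to one of the two originals.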

For DenseNet 121, I tried doing Progressive Resizing as well, just to see if it gets me better results. Progressive Resizing is when we start training the network with a small image size and then use the weights learned from the smaller size image and start training on a bigger size image and in stages we move to higher resolution image sizes. I tried it in three stages – 64×64, 128×128, and 224×224.

Without further ado, let’s take a look at the results.

The best DenseNet model was obtained by progressive resizing (64×64 –> 128×128) and using mixup during training.

The best Xception model was trained using mixup and finetuned after initial pretraining with frozen weights.

The best ResNeXt model was trained without mixup(did not try), and without finetuning(finetuning was giving me worse performance for some reason).

Let’s summarize these results in a table and place them alongside the results from the COVID-Net paper.

We can see right off the bat that all the models have a better accuracy than COVID-Net. But Accuracy isn’t the right metric to evaluate here. The Xception model, with the highest F1 score, seems to be the best performing model among the lot. But if we look at Precision and Recall separately, we can see that COVID-Net has high recall, especially for the COVID-19 cases, whereas our models have high Precision. DenseNet 121 has a perfect recall, but its Precision is bad. The Xception model has high precision and a not too bad recall.
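For reference, this is how Precision, Recall and F1 relate for a single class. The counts below are made up for illustration and are not the confusion matrix of any of the models discussed here.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Per-class precision, recall and F1 from confusion-matrix counts."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: every positive case is found, with two false alarms.
precision, recall, f1 = precision_recall_f1(true_positive=8,
                                            false_positive=2,
                                            false_negative=0)
print(precision, recall, round(f1, 4))
```

A model can reach perfect recall simply by flagging more cases as positive, at the cost of precision. F1, the harmonic mean of the two, only rewards a model that does well on both.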

We have seen that DenseNet was a high recall model and Xception was a high precision model. Would the performance be better if we average the predictions across both these models?

Not much different from before. Our ensemble still doesn’t have better recall. Let’s try a Weighted Ensemble that gives more weight to DenseNet, which has a perfect recall on COVID-19. To determine the optimum weight, I used the predictions on the validation set and tried different weights.
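The weight search can be sketched as below. Everything here is a toy assumption: the validation probabilities are invented, there are only two classes, and accuracy is used as the selection metric purely for simplicity (the text above does not say which metric was used).

```python
def weighted_average(probs_a, probs_b, weight_a):
    """Blend two models' class probabilities, example by example."""
    return [
        [weight_a * pa + (1 - weight_a) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]


def accuracy(probs, labels):
    predictions = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical validation-set probabilities for two classes.
probs_a = [[0.25, 0.75], [0.4, 0.6], [0.45, 0.55], [0.9, 0.1]]  # leans positive
probs_b = [[0.8, 0.2], [0.3, 0.7], [0.9, 0.1], [0.6, 0.4]]      # leans negative
labels = [1, 1, 0, 0]

# Try weights 0.0, 0.1, ..., 1.0 for model A and keep the best one.
weights = [i / 10 for i in range(11)]
best_weight = max(
    weights,
    key=lambda w: accuracy(weighted_average(probs_a, probs_b, w), labels),
)
best_accuracy = accuracy(weighted_average(probs_a, probs_b, best_weight), labels)
print(best_weight, best_accuracy)
```

Each model alone gets one validation example wrong in this toy setup, while an intermediate weight gets all four right. The weight is chosen on the validation set so that the test set stays untouched for the final comparison.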

Let’s add these ensembles also to the earlier table for comparison.

Finally, we have a model which has a balance between Precision and Recall and also beats the COVID-Net scores across all the metrics. Let’s take a look at the Confusion Matrix for the ensemble.

When we are thinking about the usability of the model, we should also keep in mind the model complexity and inference time. The below table shows the number of parameters as a proxy for model complexity and inference time on my machine(GTX 1650).

N.B. – I was not able to run inference on COVID-Net on my laptop (which has a terrible relationship with Tensorflow) and therefore do not know the inference time for the model. But judging from the # of parameters, it should be more than the other models.

N.B. – The # of parameters and Inference time for the ensemble is taken as the summation of the constituents.

Class Activation Maps were introduced way back in 2015 by Zhou, Bolei et al[5] as a way to understand what a CNN is looking at while making predictions. It is a technique to understand the regions of the image a CNN focuses on while making predictions. They achieve this by projecting the weights of the output layer back onto the output of the convolutional part of the network.

Grad CAM is a generalization of CAM to many end use cases apart from classification, and they achieve this by using the gradients w.r.t. the class at the last output from the Convolutional Layers. The authors of the paper[6] say:

Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
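In code, the core of Grad-CAM is small: average the gradients of each feature map to get one weight per channel, take the weighted sum of the feature maps, and keep only the positive part. The sketch below uses nested lists and made-up 2×2 activations. In a real network, the feature maps and gradients would come from hooks on the final convolutional layer.

```python
def grad_cam(feature_maps, gradients):
    """Coarse localization map from K feature maps and their gradients.

    Both arguments are K x H x W nested lists: the activations of the final
    convolutional layer and the gradients of the target score w.r.t. them.
    """
    # Channel importance: global average of each channel's gradients.
    channel_weights = []
    for grads in gradients:
        flat = [g for row in grads for g in row]
        channel_weights.append(sum(flat) / len(flat))

    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(channel_weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]

    # ReLU: keep only regions that push the target score up.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Two toy channels: one fires top-left, the other bottom-right.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
# The target score increases with channel 0 and decreases with channel 1.
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]

heatmap = grad_cam(feature_maps, gradients)
print(heatmap)  # [[1.0, 0.0], [0.0, 0.0]]
```

Only the top-left region lights up, because the bottom-right activation belongs to a channel that counts against the target class.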

Let’s see a few examples of our predictions with their activations overlaid as a heatmap. I can’t tell whether the network is looking at the right places, so if somebody reading this knows how to, please reach out and let me know.

We can also take a look at how good the feature representations that come out of these networks are. Since the output from the Convolutional Layers are high dimensional, we’ll have to use a dimensionality reduction technique to plot it in two dimensions.

A popular method for exploring high-dimensional data is t-SNE, introduced by van der Maaten and Hinton in 2008[7]. t-SNE, unlike something like a PCA, isn;t a linear projection.It uses the local relationships between points to create a low-dimensional mapping. This allows it to capture non-linear structure.

Let’s take a look at the t-SNE vizualizations, with Perplexity 50 for the three models – COVID-Net, Xception, DenseNet.

It also appears that our Imagenet pretrained models (Xception and DenseNet) have a better feature representation than COVID-Net. The t-SNE of COVID-Net is quite spread out and there is a lot of interspersion between the different classes. But the Xception and DenseNet feature representations show a much better degree of separation between the different classes. The COVID-19 cases (Green) show separation in all three cases, but because the dataset is so small, we need to take that inference with a grain of salt.

We’ve seen that the Imagenet pretrained models performed better than the COVID-Net model. The best Xception model had better Precision and the best DenseNet model had better Recall. In this particular scenario, Recall matters more because it is better to be safe than sorry. Even if you classify a few non COVID-19 cases as positive, they will just be directed to a proper medical test. But the other kind of error is not that forgiving. So going purely by that, our DenseNet model is the better one. But we also need to keep in mind that this has been trained on a limited data set, and that the number of COVID-19 cases was just around 60. It is highly likely that the model has memorised or overfit to those 60. A prime example is the case where the model used the label on the X-ray to classify it as COVID-19. The GradCAM examination was also not very helpful: for some of the examples it seemed like the model was looking at the right places, but for others the heat map lit up almost all of the X-ray.

But after examining the GradCAM and the t-SNE, I think that the Xception model has learned a much better representation for the cases. The problem of having low Recall is something that can be dealt with.

On the larger point, with which we started this whole exercise, I think we can safely say that Imagenet pretraining does help in the classification of Chest Radiography images of COVID-19 (I did try to train DenseNet without pretrained weights on the same dataset, without much success).

There are a lot of unexplored dimensions to this problem and I’m gonna mention those in case any of my readers want to take those up.

- Collecting more data, especially COVID-19 cases, and retraining the models
- Dealing with the Class Imbalance
- Pretraining on Grayscale ImageNet[9] and subsequent Transfer Learning on Grayscale X-Rays
- Using the CheXpert dataset[8] as a bridge between Imagenet and Chest Radiography images by fine-tuning Imagenet models on CheXpert dataset and then apply to the problem at hand

If you are a medical professional who thinks this is a worthwhile direction of research, do reach out to me. I want to be convinced that this is effective, but currently am not.

If you are an ML researcher who wants to collaborate on publishing a paper, or continue this line of research, do reach out to me.

**Update** – After I cloned the COVID-Net repo, there has been an update, both in the data set, which added a few more cases of COVID-19, and in the addition of a larger model. The performance of our ensemble is still better than the large model.

- Cheplygina, Veronika , “Cats or CAT scans: transfer learning from natural or medical image source datasets?,” arXiv:1810.05444 [cs.CV], Jan. 2019.
- Raghu, Maithra et.al, “Transfusion: Understanding Transfer Learning for Medical Imaging”, arXiv:1902.07208 [cs.CV], Feb. 2019
- Smith, Leslie N., “Cyclical Learning Rates for Training Neural Networks”, arXiv:1506.01186 [cs.CV], June 2015
- Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, David Lopez-Paz, “mixup: Beyond Empirical Risk Minimization”. ICLR (Poster) 2018
- Zhou, Bolei et al, “Learning Deep Features for Discriminative Localization”, arXiv:1512.04150, Dec. 2015
- Selvaraju, Ramaprasaath R. et al, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization”. arXiv:1610.02391 [cs.CV], Oct. 2016
- Laurens van der Maaten, Geoffrey Hinton, “Visualizing Data using t-SNE”. 2008
- Irvin, Jeremy et al. “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison”, arXiv:1901.07031 [cs.CV], Jan, 2019
- Xie, Yiting & Richmond, David. (2019). Pre-training on Grayscale ImageNet Improves Medical Image Classification: Munich, Germany, September 8-14, 2018, Proceedings, Part VI. 10.1007/978-3-030-11024-6_37.

Pre-reads: I will be talking about KL Divergence, and if you are unfamiliar with the concept, take a pause and catch up. I have another article where I give the maths and intuition behind Entropy, Cross Entropy and KL Divergence. I also assume basic familiarity with Gradient Descent. I’m sure you’ll find a million articles about Gradient Descent online.

Sometimes, during the explanation, we might take on some maths heavy portions. But wherever possible, I’ll also provide the intuition so that if you are averse to maths and equations, you can skip right to the intuition.

There are a million articles on the internet explaining what Gradient Descent is and this is not the million and one article. Briefly, we will just cover enough Gradient Descent to make the discussion ahead relevant.

The problem setup is: given a function **f(x)**, we want to find its minimum. In machine learning, this **f(x)** is typically the loss function and **x** is the set of parameters of the model. The algorithm goes like this:

1. We pick an initial value for **x**, mostly at random.
2. We calculate the gradient of the loss function w.r.t. the parameters **x**.
3. We adjust the parameters **x**, such that $x \leftarrow x - \alpha \nabla f(x)$, where $\alpha$ is the learning rate.
4. Repeat 2 and 3 until we are satisfied with the loss value.
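The four steps above can be sketched in a few lines of Python. This is a minimal illustration on a toy one-dimensional function; the function name `gradient_descent`, the fixed number of steps (instead of a stopping criterion on the loss) and the toy loss are all assumptions made for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Minimise a function given its gradient, starting from x0."""
    x = x0                                   # step 1: initial value
    for _ in range(n_steps):                 # step 4: repeat
        g = grad(x)                          # step 2: gradient at the current x
        x = x - learning_rate * g            # step 3: scaled update
    return x


# Toy loss f(x) = (x - 3)^2, whose gradient is 2 * (x - 3) and minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 in every step, so `x_min` ends up very close to 3.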

There are two main ingredients to the algorithm – the gradient and the learning rate.

**Gradient** is nothing but the first derivative of the loss function w.r.t. **x**.

**Learning Rate** is a necessary scaling which is applied to the gradient update every time. It is just a small quantity by which we restrain the parameter update based on the gradient to make sure our algorithm converges. We will discuss why we need this very soon.

The first pitfall of Gradient Descent is the step size. We adjust the parameters, **x**, using the gradient of the loss function (scaled by the learning rate).

The gradient is the first derivative of the loss function and, by definition, it only knows the slope at the point at which it was calculated. It is blind to what the slope is even at a point an infinitesimally small distance away. In the diagram above, we can see two points: the one on the steep drop has a higher gradient value and the one on the almost flat region has a smaller gradient. But does that mean we should be taking larger or smaller steps respectively? Because of this, we can only take the direction the gradient gives us as the absolute truth and manage the step size using a hyperparameter, the **learning rate**. This hyperparameter makes sure we do not jump over the minima (if the step size is too high) or never reach the minima (if the step size is too small).
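A quick experiment shows both failure modes. The sketch below runs plain gradient descent on the toy loss $(x - 3)^2$ with three learning rates; the loss, the helper name `run` and the specific values are illustrative assumptions.

```python
def run(learning_rate, n_steps=20, x0=0.0):
    """Run gradient descent on f(x) = (x - 3)^2 and return the final x."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * 2 * (x - 3)   # gradient of (x - 3)^2 is 2 * (x - 3)
    return x


too_small = run(0.001)    # barely moves away from the starting point
reasonable = run(0.1)     # ends up close to the minimum at 3
too_large = run(1.1)      # every step overshoots further than the last one
```

The gradient is identical in all three runs; only the scaling differs, and that alone decides whether the algorithm crawls, converges or diverges.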

This is a pitfall you can find in almost all first order optimization algorithms. One way to solve it is to use the second derivative which also tells you the curvature of the function and make the update step based on that, which is what Newton Raphson method of optimization does (Appendix in one of the previous blog post). But second order optimization methods come with their own baggage – computational and analytical complexity.

Another pitfall is the fact that this update treats all the parameters the same by scaling everything by a single learning rate. There might be some parameters which have a larger influence on the loss function than others, and by restricting the update for such parameters, we are making the algorithm take longer to converge.

What’s the alternative? The constant learning rate update is like a safety cushion we are giving the algorithm so that it doesn’t blindly dash from one point to the other and miss the minima altogether. But there is another way we can implement this safety cushion.

Instead of fixing the euclidean distance each parameter moves(distance in the parameter space), we can fix the distance in the distribution space of the target output. i.e. Instead of changing all the parameters within an epsilon distance, we constrain the output distribution of the model to be within an epsilon distance from the distribution from the previous step.

Now how do we measure the distance between two distributions? Kullback-Leibler Divergence. Although technically not a distance (because it is not symmetric), it can be considered a distance in the locality in which it is defined. This works out well for us because we are also concerned about how the output distribution changes when we make small steps in the parameter space. In essence, we are not moving in the Euclidean parameter space like in normal gradient descent, but in the distribution space with KL Divergence as the metric.

I’ll skip to the end and tell you that there is this magical matrix called the Fisher Information Matrix, which, if we include it in the regular gradient descent formula, gives us Natural Gradient Descent, which has the property of constraining the output distribution in each update step[1].
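Both halves of that statement, that KL Divergence is not symmetric and that it behaves like a distance locally, can be checked numerically. The sketch below uses small discrete distributions picked purely for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two distributions that are far apart: the two directions clearly disagree.
p = [0.9, 0.1]
q = [0.5, 0.5]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)

# Two distributions that are very close: the two directions almost coincide.
q_near = [0.51, 0.49]
near_forward = kl_divergence(q, q_near)
near_reverse = kl_divergence(q_near, q)
```

For the far-apart pair the two directions differ noticeably, while for the nearby pair they agree to several decimal places, which is the regime a small gradient step lives in.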
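One compact way to write that update is given below. The symbols are a notational assumption: $x_t$ for the parameters at step $t$, $\alpha_t$ for the step size, $f$ for the loss and $F$ for the Fisher Information Matrix.

$$x_{t+1} = x_t - \alpha_t \, F^{-1} \, \nabla f(x_t)$$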

This is exactly the gradient descent equation, with a few changes:

- $\alpha$, the learning rate, is replaced with $\alpha_t$ to make it clear that the step size may change in each iteration
- An additional term, $F^{-1}$ (the inverse of the Fisher Information Matrix), has been added to the normal gradient.

When the normal gradient is scaled with the inverse of the Fisher’s matrix, we call it the Natural Gradient.

Now for those who can accept the hand-wave that Fisher Matrix is a magical quantity which makes the normal gradient natural, skip to the next section. For the brave souls who stick around, a little maths is on your way.

I assume everyone knows what KL divergence is. We are going to start with KL Divergence and see how that translates to Fisher Matrix and what is the Fisher Matrix.

As we saw in the Deep Learning and Information Theory, KL Divergence was defined as:

That was when we were talking about a discrete variable, like a categorical outcome. In a more general sense, we need to replace the summations with integration. Let’s also switch the notation to suit the gradient descent framework that we are working with.

Instead of P and Q as two distributions, let’s say $p(x|\theta)$ and $p(x|\theta + \delta\theta)$ are the two distributions, where $x$ is our input features or covariates, $\theta$ is the parameters of the loss function (for eg. the weights and biases in a neural network), and $\delta\theta$ is the small change in parameters that we make in a gradient update step. So, under the new notation, the KL Divergence is defined as:

Let’s rewrite the second term of the equation, $\log p(x|\theta + \delta\theta)$, around $\theta$ using a second-order Taylor Expansion (which is an approximation using the derivatives at a particular point). Taylor Expansion in its general form is:
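Written out with the symbols assumed here ($\theta$ for the parameters and $\delta\theta$ for the small change in them), the definition reads:

$$KL\big(p(x|\theta) \,\|\, p(x|\theta + \delta\theta)\big) = \int p(x|\theta) \log p(x|\theta) \, dx - \int p(x|\theta) \log p(x|\theta + \delta\theta) \, dx$$

The second integral is the "second term" that the derivation expands next.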

Rewriting the second term in this form, we get:

Let’s whip out our trusty Chain rule from high school calculus and apply to the first term. Derivative of Log x is 1/x. The Taylor Expansion of the second term becomes:

Plugging this back into the KL Divergence equation, we get:

Rearranging the terms we have:

The first term is the KL Divergence between the same distribution and that is going to be zero. Another way you can think about it is that log 1 = 0 and hence the first term becomes zero.

The second term will also be zero. Let’s take a look at that below. Key property which we use to infer zero is that the integration of a probability distribution P(x) with x is 1 (just like the summation of the area under the curve of a probability distribution is 1).

Now, that leaves us with the Hessian of $\log p(x|\theta)$. A little bit of mathematics (*hand-wave*) gets us:

Let’s put this back into the KL Divergence equation.

The first term becomes zero because as we saw earlier, the integral of a probability distribution P(x) over x is 1. And the first and second derivative of 1 is 0. In the second term, the integral is called the Fisher’s matrix. Let’s call it **F**. So the KL Divergence equation becomes:
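Under the notation assumed here ($\theta$ for the parameters, $\delta\theta$ for the step and $p(x|\theta)$ for the model distribution), the result of the derivation can be summarised as:

$$KL\big(p(x|\theta) \,\|\, p(x|\theta + \delta\theta)\big) \approx \frac{1}{2} \, \delta\theta^{T} F \, \delta\theta$$

$$F = \int p(x|\theta) \, \nabla_\theta \log p(x|\theta) \, \nabla_\theta \log p(x|\theta)^{T} \, dx = -\,\mathbb{E}_{p(x|\theta)}\big[\nabla^2_\theta \log p(x|\theta)\big]$$

So, to second order, the KL Divergence between the distributions before and after the step is a quadratic form in the step, with the Fisher Matrix as the metric.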

The integral over x in the Fisher matrix can be interpreted as the Expected Value. And what that makes the Fisher Matrix is the negative Expected Hessian of $\log p(x|\theta)$. And $\log p(x|\theta)$ is nothing but the log likelihood. We know that the first derivative gives us the slope and the second derivative (Hessian) gives us the curvature. So, the Fisher Matrix can be seen as the curvature of the log likelihood function.

The Fisher Matrix is also the covariance of the score function. In our case the score function is the gradient of the log likelihood, which measures how sensitive the goodness of our prediction is to the parameters.

We looked at what the Fisher Matrix is, but we still haven’t linked it to Gradient Descent. Now, we know that KL Divergence is a function of the Fisher Matrix and the delta change in parameters between the two distributions. As we discussed earlier, our aim is to make sure that we minimize the loss subject to keeping the KL Divergence within a constant **c**.
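The two views of the Fisher Matrix, covariance of the score and negative expected Hessian, can be checked on a model small enough to enumerate. The sketch below assumes a Bernoulli distribution with a single logit parameter `theta`, chosen only because its score and Hessian have simple closed forms; the function name is hypothetical.

```python
import math


def bernoulli_fisher(theta):
    """Compare the two definitions of Fisher information for a Bernoulli logit."""
    p = 1.0 / (1.0 + math.exp(-theta))         # P(y = 1)
    probs = {1: p, 0: 1.0 - p}

    # log p(y | theta) = y * theta - log(1 + exp(theta))
    score = {y: y - p for y in probs}           # first derivative w.r.t. theta
    hessian = {y: -p * (1.0 - p) for y in probs}  # second derivative w.r.t. theta

    mean_score = sum(probs[y] * score[y] for y in probs)
    cov_score = sum(probs[y] * (score[y] - mean_score) ** 2 for y in probs)
    neg_expected_hessian = -sum(probs[y] * hessian[y] for y in probs)
    return mean_score, cov_score, neg_expected_hessian


mean_score, cov_score, neg_expected_hessian = bernoulli_fisher(0.7)
```

The expected score is zero, and the variance of the score matches the negative expected Hessian; both equal `p * (1 - p)` for this model.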

Formally, it can be written as:

Let’s take the Lagrangian relaxation of the problem and use our trusted first-order Taylor expansion for the loss at $\theta + \delta\theta$:

To minimize the function above, we set the derivative to zero and solve. The derivative of the above function is:

Setting it to zero and solving for $\delta\theta$, we get,

What this means is that, up to a factor of $\frac{1}{\lambda}$ (this is the error tolerance we accepted with the Lagrangian Relaxation), we get the optimal descent direction taking into account the curvature of the log likelihood at that point. We can take this constant factor of relaxation into the learning rate and consider it part of the same constant.

And with that final piece of mathematical trickery, we have the Natural Gradient as $\tilde{\nabla} L(\theta) = F^{-1} \nabla L(\theta)$.
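For reference, the chain of steps in this section can be written out in one place. The notation is an assumption: $L(\theta)$ is the loss, $\delta\theta$ the update, $F$ the Fisher Matrix, $c$ the constant bounding the KL Divergence and $\lambda$ the Lagrange multiplier.

$$\min_{\delta\theta} \; L(\theta + \delta\theta) \quad \text{subject to} \quad \frac{1}{2} \, \delta\theta^{T} F \, \delta\theta = c$$

$$L(\theta + \delta\theta) + \lambda\Big(\frac{1}{2} \, \delta\theta^{T} F \, \delta\theta - c\Big) \approx L(\theta) + \nabla L(\theta)^{T} \delta\theta + \lambda\Big(\frac{1}{2} \, \delta\theta^{T} F \, \delta\theta - c\Big)$$

$$\frac{\partial}{\partial \, \delta\theta}: \quad \nabla L(\theta) + \lambda F \, \delta\theta = 0 \quad \Longrightarrow \quad \delta\theta = -\frac{1}{\lambda} F^{-1} \nabla L(\theta)$$

The direction of the step is $F^{-1} \nabla L(\theta)$; the scalar $\frac{1}{\lambda}$ only changes its length.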

We’ve talked a lot about Gradients and Natural Gradients. But it is critical to understand why Natural Gradient is different from normal gradient to understand how Natural Gradient Descent is different from Gradient Descent.

At the center of it all is a Loss Function which measures the difference between the predicted output and the ground truth. And how do we change the loss? By changing the parameters, which changes the predicted output and thereby the loss. So, in a normal Gradient, we take the derivative of the loss function w.r.t. the parameters. The derivative will be small if the predicted probability distribution is closer to the true distribution and vice versa. This represents the amount your loss would change if you moved each parameter by one unit. So, when we apply the learning rate, we are scaling the gradient update to these parameters by a fixed amount.

In the Natural Gradient world, we are no longer restricting the movement of the parameters in the parameter space. Instead, we arrest the movement of the output probability distribution at each step. And how do we measure the probability distribution? Log Likelihood. And Fisher Matrix gives you the curvature of the Log Likelihood.

As we saw earlier, the normal gradient has no idea about the curvature of the loss function because it is a first order optimization method. But when we include the Fisher matrix to the gradient, what we are doing is scaling the parameter updates with the curvature of the log likelihood function. So, places in the distribution space where the log likelihood changes fast w.r.t. a parameter, the update for the parameter would be less as opposed to a flat plain in the distribution space.

Apart from the obvious benefit of adjusting our updates using the curvature information, Natural Gradient also allows a way for us to directly control the movement of your model in the predicted space. In normal gradient, your movement is strictly in the parameter space and we are restricting the movement in that space with a learning rate in the hope that the movement in the prediction space is also restricted. But in Natural Gradient update, we are directly restricting the movement in the prediction space by stipulating that the model only move a fixed distance in KL Divergence terms.
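A tiny example makes the difference tangible. The sketch below fits the mean of a Gaussian with known standard deviation `sigma`, a model chosen only because its Fisher information for the mean is simply `1 / sigma**2`; the function name and the numbers are illustrative assumptions.

```python
def fit_mean(x_bar, sigma, natural, mu0=0.0, learning_rate=0.5, n_steps=20):
    """Fit the mean of N(mu, sigma^2) to a sample mean x_bar."""
    mu = mu0
    for _ in range(n_steps):
        # Gradient of the average negative log likelihood w.r.t. mu.
        grad = (mu - x_bar) / sigma ** 2
        if natural:
            fisher = 1.0 / sigma ** 2     # Fisher information of mu
            grad = grad / fisher          # F^{-1} times the normal gradient
        mu -= learning_rate * grad
    return mu


# Same data, same learning rate, two very different noise scales.
plain_wide = fit_mean(5.0, sigma=10.0, natural=False)     # crawls
plain_narrow = fit_mean(5.0, sigma=0.1, natural=False)    # blows up
natural_wide = fit_mean(5.0, sigma=10.0, natural=True)    # converges
natural_narrow = fit_mean(5.0, sigma=0.1, natural=True)   # converges equally fast
```

The normal gradient step depends on how sharply the log likelihood is curved, so one learning rate cannot suit both cases. The natural gradient divides that curvature out, and the same learning rate moves the predicted distribution by a comparable amount in both.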

The obvious question on a lot of your minds would be: if Natural Gradient Descent is so awesome, and clearly better than Gradient Descent, why isn’t it the de facto standard in Neural Networks?

This is one of those areas where pragmatism won over theory. Theoretically, the idea of using Natural Gradients is beautiful and it also works as expected. But the catch is that the calculation of the Fisher Matrix and its inverse becomes an intractable problem when the number of parameters is huge, like in a typical Deep Neural Network. For $n$ parameters, the Fisher Matrix alone has $n \times n$ entries, so this calculation exists in $O(n^2)$ space.

One other reason why there wasn’t a lot of focus on Natural Gradients was that Deep Learning researchers and practitioners figured out some clever tricks/heuristics to approximate the information in a second derivative without calculating it. The Deep Learning optimizers have come a long way from SGD and a lot of that progress has been in using such tricks to get better gradient updates. Momentum, RMSProp and Adam are all variations of SGD which use a running mean and/or running variance of the gradients to approximate the second order derivative and use that information to do the gradient updates.

These heuristics are much less computationally intensive than calculating a second order derivative or the natural gradient and have enabled Deep Learning to be scaled to the level it is at currently.
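As an illustration of such a heuristic, here is a minimal scalar sketch of the Adam update, which keeps a running mean of the gradients and a running mean of the squared gradients. The hyperparameter values are the commonly used defaults apart from the learning rate, and the toy loss is an assumption made for the example.

```python
import math


def adam(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
    """Minimise a one-dimensional function with the Adam update rule."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of the gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of the squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy loss f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_adam = adam(lambda x: 2 * (x - 3), x0=0.0)
```

Dividing by the square root of the running second moment rescales each parameter's step by the typical size of its own gradient, which is a cheap, diagonal stand-in for curvature information.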

That being said, the Natural Gradient still finds use in some cases where the parameters to be estimated are relatively small, or the expected distribution is relatively standard like a Gaussian distribution, or in some areas of Reinforcement Learning. Recently, it has also been used in a form of Gradient Boosting, which we will cover in the next part of the series.

In the next part of our series, let’s look at the new kid on the block – **NgBoost**

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI(A) – Natural Gradient
- Part VI(B) – NGBoost
- Part VII – The Battle of the Boosters

- Amari, Shun-ichi. Natural Gradient Works Efficiently in Learning. Neural Computation, Volume 10, Issue 2, p.251-276.
- It’s Only Natural: An Excessively Deep Dive into Natural Gradient Optimization, https://towardsdatascience.com/its-only-natural-an-excessively-deep-dive-into-natural-gradient-optimization-75d464b89dbb
- Ratliff, Nathan ,Information Geometry and Natural Gradients, https://ipvs.informatik.uni-stuttgart.de/mlr/wp-content/uploads/2015/01/mathematics_for_intelligent_systems_lecture12_notes_I.pdf
- Natural Gradient Descent, https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/
- Fisher Information Matrix, https://wiseodd.github.io/techblog/2018/03/11/fisher-information/
- What is the natural gradient, and how does it work?, http://kvfrans.com/what-is-the-natural-gradient-and-where-does-it-appear-in-trust-region-policy-optimization/

Let’s take a look at what made it different:

Let’s take a look at the innovation which gave the algorithm its name – CatBoost. Unlike XGBoost, CatBoost deals with Categorical variables in a native way. Many studies have shown that One-Hot encoding high cardinality categorical features is not the best way to go, especially in tree based algorithms. And other popular alternatives all come under the umbrella of Target Statistics – Target Mean Encoding, Leave-One-Out Encoding, etc.

The basic idea of Target Statistics is simple. We replace a categorical value by the mean of all the targets for the training samples with the same categorical value. For example, we have a Categorical value called weather, which has four values – sunny, rainy, cloudy, and snow. The most naive method is something called Greedy Target Statistics, where we replace “sunny” with the average of the target value for all the training samples where weather was “sunny”.

If M is the categorical feature we are encoding, $m_i$ is the specific value in M, and n is the number of training samples with $M = m_i$,

But this is unstable when the number of samples with $M = m_i$ is too low or zero. Therefore we use the Laplace Smoothing used in the Naive Bayes Classifier to make the statistics much more robust.

where *a* > 0 is a parameter. A common setting for *p* (prior) is the average target value in the dataset.
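A minimal sketch of the smoothed Greedy Target Statistic is shown below. It encodes each category as `(sum of targets + a * p) / (n + a)`, with the prior `p` set to the average target value; the function name and the toy weather data are made up for illustration.

```python
from collections import defaultdict


def greedy_target_statistics(categories, targets, a=1.0):
    """Map each category to a smoothed mean of the targets observed with it."""
    prior = sum(targets) / len(targets)          # p: average target in the dataset
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    return {c: (sums[c] + a * prior) / (counts[c] + a) for c in counts}


weather = ["sunny", "sunny", "rainy", "sunny", "snow"]
target = [1, 1, 0, 0, 1]
encoding = greedy_target_statistics(weather, target)
```

With `a = 1` and a prior of 0.6, "snow", which appears only once, is pulled from 1.0 towards the prior (0.8), while "sunny", which appears three times, stays close to its raw mean (0.65 against 0.667).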

But these methods usually suffer from something called Target Leakage because we are using our targets to calculate a representation for the categorical variables and then using those features to predict the target. Leave-One-Out Encoding tries to reduce this by excluding the sample for which it is calculating the representation, but is not fool proof.

CatBoost authors propose another idea here, which they call Ordered Target Statistics. This is inspired by Online Learning algorithms, which get the training examples sequentially in time. In such cases, the Target Statistics will only rely on the training examples in the past. To adapt this idea to a standard offline training paradigm, they imagine a concept of artificial time, created by randomly permuting the dataset and considering the samples sequential in nature.

Then they calculate the target statistics using only the samples which occurred before that particular sample in the artificial time. It is important to note that if we use just one permutation as the artificial time, it would not be very stable, and to this end they do this encoding with multiple permutations.

The main motivation for the CatBoost algorithm is, as argued by the authors of the paper, the target leakage, which they call Prediction Shift, inherent in the traditional Gradient Boosting models. The high-level idea is quite simple. As we know, any Gradient Boosting model works iteratively by building base learners over base learners in an additive fashion. But since each base learner is built on the same dataset, the authors argue that there is a bit of target leakage which affects the generalization capabilities of the model. Empirically, we know that Gradient Boosted Trees have an overwhelming tendency to overfit the data. The only countermeasures against this leakage are features like subsampling, which they argue is a heuristic way of handling the problem that only alleviates it and does not completely remove it.
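The idea can be sketched for a single permutation as follows. This is a simplified illustration rather than CatBoost's implementation: the function name is hypothetical, the permutation is passed in explicitly so the result is reproducible, and the prior is taken as the average target of the dataset.

```python
from collections import defaultdict


def ordered_target_statistics(categories, targets, order, a=1.0):
    """Encode each sample using only the samples that precede it in `order`."""
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = [0.0] * len(targets)
    for idx in order:                     # walk through the artificial time
        c = categories[idx]
        # Statistic from the "past" only: the sample's own target is not used.
        encoded[idx] = (sums[c] + a * prior) / (counts[c] + a)
        # Only now does the current sample become part of the history.
        sums[c] += targets[idx]
        counts[c] += 1
    return encoded


weather = ["sunny", "sunny", "rainy", "sunny", "snow"]
target = [1, 1, 0, 0, 1]
encoded = ordered_target_statistics(weather, target, order=[0, 1, 2, 3, 4])
```

The first sample of every category has no history and falls back to the prior, which is exactly why a single permutation is noisy. A random permutation can be produced with `random.Random(seed).shuffle(order)`.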

The authors formalize the proposed target leakage and mathematically show us that it is present. Another interesting observation that they had is that the target shift or the bias is inversely proportional to the size of the dataset, i.e. if the dataset is small, the target leak is much more pronounced. This observation also agrees with our empirical observation that Gradient Boosted Trees tend to overfit to a small dataset.

To combat this issue, they propose a new variant of Gradient Boosting, called Ordered Boosting. The idea, at its heart, is quite intuitive. The main problem with previous Gradient Boosting was the reuse of the same dataset for each iteration. So, if we have a different dataset for each iteration, we would be solving the problem of leakage. But since no dataset is infinite, this idea, purely applied, is not feasible. So, the authors have proposed a practical implementation of the above concept.

It starts out with creating *s+1* permutations of the dataset. These permutations are the artificial time that the algorithm takes into account. Let’s call them $\sigma_0, \sigma_1, \dots, \sigma_s$. The permutations $\sigma_1, \dots, \sigma_s$ are used for constructing the tree splits and $\sigma_0$ is used to choose the leaf values $b_j$. In the absence of multiple permutations, the training samples with short “history” will have high variance, and having multiple permutations eases out that defect.

We saw the way CatBoost deals with Categorical variables earlier, and we mentioned that we use multiple permutations there to calculate the target statistics. This is implemented as part of the boosting algorithm, which uses a particular permutation from $\sigma_1, \dots, \sigma_s$ in any iteration. The gradient statistics required for the tree splits and the target statistics required for the categorical encoding are calculated using the sampled permutation.

And once all the trees are built, the leaf values of the final model F are calculated by the standard gradient boosting procedure (that we saw in the previous articles) using permutation $\sigma_0$. When the final model F is applied to new examples from the test set, the target statistics are calculated on the entire training data.

One important thing to note is that CatBoost also supports traditional Gradient Boosting, apart from Ordered Boosting (boosting_type = ‘Plain’ or ‘Ordered’). If it is ‘Plain’, and there are categorical features, the permutations are still created for the target statistics, but the tree building and boosting is done without the permutations.

CatBoost also differs from the rest of the flock in another key aspect – the kind of trees that are built in its ensemble. CatBoost, by default, builds Symmetric Trees or Oblivious Trees. These are trees in which the same feature and split are responsible for splitting learning instances into the left and the right partitions across an entire level of the tree.

This has a two-fold effect in the algorithm –

- Regularization: Since we are restricting the tree building process to have only one feature split per level, we are essentially reducing the complexity of the algorithm, and that acts as regularization.
- Computational Performance: One of the most time consuming parts of any tree-based algorithm is the search for the optimal split at each node. But because we are restricting the feature splits per level to one, we only have to search for a single feature split instead of k splits, where k is the number of nodes in the level. These trees also make inference lightning fast. It was shown to be 8X faster than XGBoost in inference.

Although the default option is “*SymmetricTree*”, there is also the option to switch to “*Depthwise*” (XGBoost) or “*Lossguide*” (LightGBM) using the parameter “*grow_policy*”.
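The inference speed-up comes from the fact that a symmetric tree needs no branching structure at all: each level contributes one bit, and the bits together index the leaf. Below is a minimal sketch of that idea; the data layout (a list of `(feature_index, threshold)` pairs and a flat list of leaf values) is an assumption made for illustration, not CatBoost's internal format.

```python
def predict_oblivious_tree(x, splits, leaf_values):
    """Predict with a symmetric tree that has one (feature, threshold) per level."""
    leaf_index = 0
    for feature_index, threshold in splits:
        bit = 1 if x[feature_index] > threshold else 0
        leaf_index = (leaf_index << 1) | bit     # append this level's decision
    return leaf_values[leaf_index]


# A depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4]               # 2^depth leaves

pred_a = predict_oblivious_tree([0.7, 3.0], splits, leaf_values)   # bits 1, 0 -> leaf 2
pred_b = predict_oblivious_tree([0.2, 20.0], splits, leaf_values)  # bits 0, 1 -> leaf 1
```

A tree of depth d always has exactly 2^d leaves and needs exactly d comparisons per sample, regardless of which path the sample takes.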

Another important detail of CatBoost is that it considers combinations of categorical variables implicitly in the tree building process. This helps it consider joint information of multiple categorical features. But since the total number of combinations possible can explode quickly, a greedy approach is undertaken in the tree building process. For each split in the current tree, CatBoost concatenates all previously used Categorical Features in the leaf with all the rest of the categorical features as combinations and target statistics are calculated on the fly.

Another interesting feature in CatBoost is the inbuilt Overfitting Detector. CatBoost can stop training earlier than the number of iterations we set, if it detects overfitting. There are two overfitting detectors implemented in CatBoost –

- IncToDec
- Iter

*Iter* is the equivalent of early stopping, where the algorithm waits for *n* iterations since the last improvement in the validation loss before stopping the iterations.

*IncToDec* is slightly more involved. It keeps track of the improvement of the metric iteration after iteration, smooths the progression using an approach similar to exponential smoothing, and stops training whenever that smoothed value falls below a set threshold.
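The *Iter* behaviour can be sketched as a patience counter over the validation losses. This is a generic early-stopping illustration under that description, not CatBoost's code; the function name is hypothetical and `od_wait` is used here simply as the number of iterations to wait.

```python
def iter_overfitting_detector(validation_losses, od_wait):
    """Return (stop_iteration, best_iteration); stop_iteration is None if never triggered."""
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= od_wait:
            # No improvement for od_wait iterations in a row: stop here.
            return iteration, best_iteration
    return None, best_iteration


losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
stopped_at, best_at = iter_overfitting_detector(losses, od_wait=3)
```

With a wait of 3, training stops at iteration 5 and never sees the later improvement at iteration 6, which is the usual trade-off when picking the waiting period.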

Following XGBoost’s footsteps, CatBoost also deals with missing values separately. There are two ways of handling missing values in CatBoost – Min and Max.

If you select “Min”, the missing values are processed as the minimum value for the feature. And if you select “Max”, the missing values are processed as the maximum value for the feature. In both cases, it is guaranteed that the split between missing values and the others is considered in every tree split.

If LightGBM had a lot of hyperparameters, CatBoost has even more. With so many hyperparameters to tune, GridSearch stops being feasible. It becomes more of an art to get the right combination of parameters for any given problem. But still I’ll try to summarize a few key parameters which you have to keep in mind.

- *one_hot_max_size* – This sets the maximum number of unique values in a categorical feature below which it will be one-hot encoded instead of using Target Statistics. It is recommended that you do not do your own one-hot encoding before you feed in the feature set, because it will hurt both the accuracy and the performance of the algorithm.
- *iterations* – The number of trees to be built in the ensemble. This has to be tuned with a cv, or one of the overfitting detection methods should be employed to make the training stop at the ideal iteration.
- *od_type, od_pval, od_wait* – These three parameters configure the overfitting detector.
  - *od_type* is the type of overfitting detector.
  - *od_pval* is the threshold for IncToDec (Recommended Range: $[10^{-10}, 10^{-2}]$). The larger the value, the earlier overfitting is detected.
  - *od_wait* has a different meaning depending on the *od_type*. If it is *IncToDec*, *od_wait* is the number of iterations it has to run before the overfitting detector kicks in. If it is *Iter*, *od_wait* is the number of iterations it will wait without an improvement of the metric before it stops training.
- *learning_rate* – The usual meaning. But CatBoost automatically sets the learning rate based on the dataset properties and the number of iterations set.
- *depth* – This is the depth of the tree. Optimal values range from 4 to 10. Default Value: 6, and 16 if *grow_policy* is *Lossguide*.
- *l2_leaf_reg* – This is the regularization along the leaves. Any positive value is allowed. Increase this value to increase the regularization effect.
- *has_time* – We have already seen that there is an artificial time which is used to accomplish ordered boosting. But what if your data actually has a temporal order? In such cases, set has_time = True to avoid using permutations in ordered boosting and instead use the order in which the data was provided as the one and only permutation.
- *grow_policy* – As discussed earlier, CatBoost builds “*SymmetricTree*” by default. But sometimes “*Depthwise*” and “*Lossguide*” might give better results.
  - *min_data_in_leaf* is the usual parameter to control the minimum number of training samples in each leaf. This can only be used in *Lossguide* and *Depthwise*.
  - *max_leaves* is the maximum number of leaves in any given tree. This can only be used in *Lossguide*. It is not recommended to have values greater than 64 here as it significantly slows down the training process.
- *rsm or colsample_bylevel* – The percentage of features to be used in each split selection. This helps us control overfitting, and the values range from (0, 1].
- *nan_mode* – Can take the values “Forbidden”, “Min” and “Max”. “Forbidden” does not allow missing values and will throw an error. Min and Max we have discussed earlier.
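To see how these fit together, here is a sketch of a parameter dictionary that uses the names discussed above. The values are purely illustrative starting points, not recommendations, and the dictionary is kept as plain Python so it runs without CatBoost installed.

```python
# Illustrative values only; tune them for your own problem.
catboost_params = {
    "iterations": 1000,               # upper bound on the number of trees
    "learning_rate": 0.05,
    "depth": 6,
    "l2_leaf_reg": 3.0,
    "one_hot_max_size": 4,            # one-hot encode categoricals with few unique values
    "boosting_type": "Ordered",       # or "Plain"
    "grow_policy": "SymmetricTree",   # or "Depthwise" / "Lossguide"
    "od_type": "Iter",                # overfitting detector type
    "od_wait": 50,                    # iterations to wait without improvement
    "nan_mode": "Min",                # treat missing values as the feature minimum
    "rsm": 0.8,                       # fraction of features considered at each split selection
    "has_time": False,                # set True if the row order is a real temporal order
}
```

Assuming the `catboost` package is available, a dictionary like this can be unpacked into the estimator, for example `CatBoostClassifier(**catboost_params)`, with an evaluation set passed to `fit` so that the overfitting detector has a validation metric to watch.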


- Friedman, Jerome H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 (2001), no. 5, 1189–1232.
- Prokhorenkova, Liudmila, Gusev, Gleb et.al. (2018). CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems
- CatBoost Parameters. https://catboost.ai/docs/concepts/python-reference_parameters-list.html

The starting point for LightGBM was XGBoost. Essentially, its authors took XGBoost and optimized it, so it has (more or less) all the innovations XGBoost had, plus some additional ones. Let’s take a look at the incremental improvements that LightGBM made:

One of the main changes from all the other GBMs, like XGBoost, is the way the tree is constructed. In LightGBM, a leaf-wise tree growth strategy is adopted.

All the other popular GBM implementations follow something called level-wise tree growth, where you find the best possible split for every node in the current level and grow the tree one level down. This strategy results in symmetric trees, where every node in a level gets child nodes, adding one more layer of depth.

In LightGBM, leaf-wise tree growth finds the leaf whose split will reduce the loss the most and splits only that leaf, without bothering with the rest of the leaves in the same level. This results in an asymmetrical tree, where subsequent splitting can very well happen on only one side of the tree.

The leaf-wise tree growth strategy tends to achieve lower loss than the level-wise growth strategy, but it also tends to overfit, especially on small datasets. For small datasets, level-wise growth acts like a regularizer that restricts the complexity of the tree, whereas leaf-wise growth tends to be greedy.
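To make the best-first behaviour concrete, here is a toy sketch of leaf-wise growth. It is not LightGBM's implementation: it assumes a single feature whose samples are already sorted, treats the targets as the quantity to fit with a squared-error gain, and the helper names `best_split` and `grow_leaf_wise` are made up for this illustration.

```python
import heapq


def best_split(ys):
    """Best split position for one leaf; returns (gain, position) or (0.0, None)."""
    n, total = len(ys), sum(ys)
    best_gain, best_pos, left = 0.0, None, 0.0
    for i in range(1, n):
        left += ys[i - 1]
        # Squared-error gain of splitting into ys[:i] and ys[i:].
        gain = left ** 2 / i + (total - left) ** 2 / (n - i) - total ** 2 / n
        if gain > best_gain:
            best_gain, best_pos = gain, i
    return best_gain, best_pos


def grow_leaf_wise(ys, num_leaves):
    """Grow a tree best-first; returns the leaves as (start, end) index ranges."""
    final_leaves = []  # leaves that cannot be split any further
    heap = []          # splittable leaves, ordered by their best gain

    def push(start, end):
        gain, pos = best_split(ys[start:end])
        if pos is None:
            final_leaves.append((start, end))
        else:
            heapq.heappush(heap, (-gain, start, end, start + pos))

    push(0, len(ys))
    while heap and len(final_leaves) + len(heap) < num_leaves:
        # Split only the single leaf with the largest gain, wherever it sits.
        _, start, end, pos = heapq.heappop(heap)
        push(start, pos)
        push(pos, end)
    final_leaves.extend((start, end) for _, start, end, _ in heap)
    return sorted(final_leaves)


targets = [0, 0, 0, 0, 1, 1, 5, 9]  # already ordered by the feature value
leaves = grow_leaf_wise(targets, num_leaves=3)
print(leaves)
```

With a budget of three leaves, both splits land on the right-hand side of the data, and the six samples on the left stay in a single leaf. A level-wise strategy would have split both children of the root instead.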

Subsampling or downsampling is one of the ways in which we introduce variety and speed up the training process in an ensemble. It is also a form of regularization, as it keeps each tree from fitting the complete training data. Usually, this subsampling is done by taking a random sample from the training dataset and building a tree on that subset. But what LightGBM introduced was an intelligent way of doing this downsampling.

The core of the idea is that the gradient of a sample is an indicator of how big a role that sample plays in the tree building process. Instances with larger gradients (under-trained) contribute a lot more to the tree building process than instances with small gradients. So, when we downsample, we should strive to keep the instances with large gradients so that the tree building is as efficient as possible.

The most straightforward idea is to discard the instances with low gradients and build the tree on just the large gradient instances. But this would change the distribution of the data, which in turn would hurt the accuracy of the model. Hence the Gradient-based One-Side Sampling (GOSS) method.

The algorithm is pretty straightforward:

- Keep all the instances with large gradients
- Perform random sampling on instances with small gradients
- Introduce a constant multiplier for the data instances with small gradients when computing the information gain in the tree building process.
  - If we keep the top *a* × 100% of instances with large gradients and randomly sample *b* × 100% of the instances from the rest (those with small gradients), we amplify the sampled data by $\frac{1-a}{b}$
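The steps above can be sketched in a few lines of plain Python. This is only an illustration of the sampling idea, not LightGBM's code: the function name `goss_sample` and the toy gradients are made up, and `a` and `b` are taken to be fractions of the full dataset, as in the LightGBM paper.

```python
import random


def goss_sample(gradients, a, b, seed=0):
    """Return (indices, weights) picked by a GOSS-style sampler.

    a: fraction of instances kept because they have the largest |gradient|
    b: fraction of instances drawn at random from the remaining ones
    """
    n = len(gradients)
    top_n = int(round(a * n))
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:top_n], order[top_n:]
    rand_n = min(int(round(b * n)), len(rest))
    sampled = random.Random(seed).sample(rest, rand_n)
    # Small-gradient instances are up-weighted to compensate for the sampling.
    amplify = (1.0 - a) / b
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


grads = [0.05, -2.1, 0.3, 1.7, -0.02, 0.4, -0.1, 0.08, 0.6, -0.01]
idx, w = goss_sample(grads, a=0.2, b=0.3)
print(idx, w)
```

Note that the weights of the selected instances add up to (roughly) the original number of instances, which is how the amplification keeps the data distribution approximately intact.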

The motivation behind EFB is a common theme between LightGBM and XGBoost. In many real-world problems, although there are a lot of features, most of them are really sparse, like one-hot encoded categorical variables. The way LightGBM tackles this problem is slightly different.

The crux of the idea lies in the fact that many of these sparse features are exclusive, i.e. they do not take non-zero values simultaneously. And we can efficiently bundle these features and treat them as one. But finding the optimal feature bundles is an NP-Hard problem.

To this end, the paper proposes a Greedy Approximation to the problem, which is the Exclusive Feature Bundling algorithm. The algorithm is also slightly fuzzy in nature, as it will allow bundling features which are not 100% mutually exclusive, but it tries to maintain the balance between accuracy and efficiency when selecting the bundles.

The algorithm, on a high level, is:

- Construct a graph with the features as nodes and weighted edges between them, where each edge weight represents the total conflicts between the two features
- Sort the features by their degrees in the graph in descending order
- Check each feature and either assign it to an existing bundle with a small conflict or create a new bundle.
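A rough sketch of these three steps is given below. It is a simplification under stated assumptions: features are plain Python lists, a conflict is a row where two features are both non-zero, a bundle's conflict count is approximated by summing pairwise conflicts, and the names `count_conflicts`, `greedy_bundle` and `max_conflicts` are made up for the example.

```python
def count_conflicts(col_a, col_b):
    """Number of rows in which both features are non-zero."""
    return sum(1 for x, y in zip(col_a, col_b) if x != 0 and y != 0)


def greedy_bundle(columns, max_conflicts=0):
    """Greedily group feature columns; returns bundles as lists of feature indices."""
    n = len(columns)
    # Step 1: conflict graph, stored as a dense matrix of edge weights.
    conflict = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            conflict[i][j] = conflict[j][i] = count_conflicts(columns[i], columns[j])
    # Step 2: visit features by (weighted) degree, highest first.
    order = sorted(range(n), key=lambda i: sum(conflict[i]), reverse=True)
    # Step 3: put each feature into the first bundle it barely conflicts with.
    bundles, bundle_conflicts = [], []
    for f in order:
        for k, bundle in enumerate(bundles):
            added = sum(conflict[f][g] for g in bundle)
            if bundle_conflicts[k] + added <= max_conflicts:
                bundle.append(f)
                bundle_conflicts[k] += added
                break
        else:
            # No existing bundle could take the feature, so start a new one.
            bundles.append([f])
            bundle_conflicts.append(0)
    return bundles


features = [
    [1, 0, 0, 0, 0, 0],  # feature 0
    [0, 2, 0, 0, 0, 0],  # feature 1
    [0, 0, 3, 0, 0, 0],  # feature 2
    [1, 1, 1, 0, 0, 1],  # feature 3 overlaps with all of the above
]
bundles = greedy_bundle(features)
print(bundles)
```

Raising `max_conflicts` above zero gives the "fuzzy" behaviour described earlier, where features that are not perfectly exclusive may still share a bundle.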

The amount of time it takes to build a tree is proportional to the number of splits that have to be evaluated. When you have continuous features, or categorical features with high cardinality, this time increases drastically. But most of the splits that can be made for a feature only offer minuscule changes in performance. This is why a histogram-based method is applied to tree building.

The core idea is to group the values of a feature into a set of bins and evaluate splits only at the bin boundaries. This reduces the time complexity of finding a split from *O(#data)* to *O(#bins)*.

In another innovation, similar to XGBoost, LightGBM ignores the zero feature values while creating the histograms. This reduces the cost of building the histogram from *O(#data)* to *O(#non-zero data)*.
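The histogram idea can be sketched as follows. This is a minimal illustration rather than LightGBM's actual binning: it assumes equal-width bins, unit hessians (so the gain is the usual squared-error gain), and the function names `build_histogram` and `best_split_on_histogram` are made up.

```python
def build_histogram(values, gradients, n_bins, lo, hi):
    """Accumulate gradient sums and sample counts per equal-width bin."""
    width = (hi - lo) / n_bins
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split_on_histogram(grad_sum, count):
    """Scan the bin boundaries only; returns (gain, last_bin_on_the_left)."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_bin = 0.0, None
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin


values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
grad_sum, count = build_histogram(values, gradients, n_bins=4, lo=0.0, hi=1.0)
gain, split_bin = best_split_on_histogram(grad_sum, count)
print(grad_sum, count, gain, split_bin)
```

Building the histogram still touches every sample once, but the split search afterwards loops over the bins only, no matter how many samples there are.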

In many real-world datasets, categorical features are a big presence, and it becomes essential to deal with them appropriately. The most common approach is to represent a categorical feature by its one-hot representation, but this is sub-optimal for tree learners. If you have high-cardinality categorical features, your tree needs to be very deep to achieve accuracy.

LightGBM takes in a list of categorical features as an input so that it can deal with them more optimally. It takes inspiration from “On Grouping for Maximum Homogeneity” by Walter D. Fisher and uses the following methodology for finding the best split for categorical features.

- Sort the histogram on accumulated gradient statistics
- Find the best split on the sorted histogram
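The two steps can be sketched like this. It is a simplified illustration, not LightGBM's code: categories are ordered by their accumulated gradient divided by their accumulated hessian, and putting `cat_smooth` in the denominator is an assumption about how that smoothing enters. The function name and the toy data are made up.

```python
def best_categorical_split(categories, gradients, hessians, cat_smooth=10.0):
    """Returns (gain, set_of_categories_sent_left) for one categorical feature."""
    # Histogram: accumulated gradient and hessian per category.
    stats = {}
    for c, g, h in zip(categories, gradients, hessians):
        sum_g, sum_h = stats.get(c, (0.0, 0.0))
        stats[c] = (sum_g + g, sum_h + h)
    # Step 1: sort the categories by their (smoothed) gradient statistic.
    ordered = sorted(stats, key=lambda c: stats[c][0] / (stats[c][1] + cat_smooth))
    # Step 2: treat the sorted categories like ordered bins and scan the cut points.
    total_g = sum(s[0] for s in stats.values())
    total_h = sum(s[1] for s in stats.values())
    best_gain, best_left = 0.0, None
    left_g = left_h = 0.0
    for k in range(len(ordered) - 1):
        left_g += stats[ordered[k]][0]
        left_h += stats[ordered[k]][1]
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = left_g ** 2 / left_h + right_g ** 2 / right_h - total_g ** 2 / total_h
        if gain > best_gain:
            best_gain, best_left = gain, set(ordered[: k + 1])
    return best_gain, best_left


cats = ["a", "b", "c", "d", "a", "b", "c", "d"]
grads = [-1.0, 0.9, -1.1, 1.0, -0.9, 1.1, -1.0, 1.0]
hess = [1.0] * len(cats)  # unit hessians, as for squared error
gain, left = best_categorical_split(cats, grads, hess)
print(gain, left)
```

With *k* categories, the scan looks at only *k − 1* candidate partitions instead of all possible subsets, and a single split can still send several categories to one side.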

There are a few hyperparameters which help you tune the way the categorical features are dealt with[4]:

- `cat_l2`, default = `10.0`, type = double, constraints: `cat_l2 >= 0.0`
  - used for the categorical features
  - L2 regularization in categorical split
- `cat_smooth`, default = `10.0`, type = double, constraints: `cat_smooth >= 0.0`
  - used for the categorical features
  - this can reduce the effect of noises in categorical features, especially for categories with few data
- `max_cat_to_onehot`, default = `4`, type = int, constraints: `max_cat_to_onehot > 0`
  - when number of categories of one feature smaller than or equal to `max_cat_to_onehot`, one-vs-other split algorithm will be used

The majority of the incremental performance improvements were made through GOSS and EFB.

xgb_exa is the original XGBoost, xgb_his is the histogram-based version (which came out later), lgb_baseline is LightGBM without EFB and GOSS, and LightGBM is with EFB and GOSS. It is quite evident that the improvement from GOSS and EFB is phenomenal compared to lgb_baseline.

The rest of the improvements in performance are derived from the ability to parallelize the learning. There are two main ways of parallelizing the learning process:

Feature Parallel tries to parallelize the “Find the best split” part in a distributed manner. Different splits are evaluated in parallel across multiple workers, which then communicate with each other to decide who has the best split.

Data Parallel tries to parallelize the whole decision tree learning. Here, we typically split the data and send different parts of it to different workers, who calculate histograms on the section of the data they receive. Then they communicate to merge the histograms at a global level, and this global histogram is what is used in the tree learning process.
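The reason this works is that histograms are additive: summing the local histograms bin by bin gives exactly the histogram of the full data. Here is a small sketch of that merge step, with the "workers" simulated by two slices of a list. The function names are made up, and real implementations exchange the histograms over a network rather than in one process.

```python
def local_histogram(bin_ids, gradients, n_bins):
    """Gradient sum per bin for the rows that one worker holds."""
    hist = [0.0] * n_bins
    for b, g in zip(bin_ids, gradients):
        hist[b] += g
    return hist


def merge_histograms(histograms):
    """Global histogram: the bin-wise sum of all the local histograms."""
    return [sum(per_bin) for per_bin in zip(*histograms)]


bin_ids = [0, 1, 1, 2, 0, 2, 1, 0]  # pre-computed bin of each row
grads = [0.5, -1.0, 0.25, 2.0, -0.5, 1.0, 0.75, 1.5]

# Two simulated workers, each holding half of the rows.
shards = [(bin_ids[:4], grads[:4]), (bin_ids[4:], grads[4:])]
local = [local_histogram(b, g, n_bins=3) for b, g in shards]
global_hist = merge_histograms(local)
print(local, global_hist)
```

Only the histograms travel between workers, so the communication cost depends on the number of bins and features rather than on the number of rows.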

Voting Parallel is a special case of Data Parallel, where the communication cost in Data Parallel is capped to a constant.

LightGBM is one of those algorithms which has a lot, and I mean a lot, of hyperparameters. It is so flexible that it is intimidating for the beginner. But there is a way to use the algorithm and still not tune like 80% of those parameters. Let’s look at a few parameters that you can start tuning and then build up confidence and start tweaking the rest.

- `objective`, default = `regression`, type = enum, options: `regression`, `regression_l1`, `huber`, `fair`, `poisson`, `quantile`, `mape`, `gamma`, `tweedie`, `binary`, `multiclass`, `multiclassova`, `cross_entropy`, `cross_entropy_lambda`, `lambdarank`, `rank_xendcg`, aliases: `objective_type`, `app`, `application`

- `boosting`, default = `gbdt`, type = enum, options: `gbdt`, `rf`, `dart`, `goss`, aliases: `boosting_type`, `boost`
  - `gbdt`, traditional Gradient Boosting Decision Tree, aliases: `gbrt`
  - `rf`, Random Forest, aliases: `random_forest`
  - `dart`, Dropouts meet Multiple Additive Regression Trees
  - `goss`, Gradient-based One-Side Sampling

- `learning_rate`, default = `0.1`, type = double, aliases: `shrinkage_rate`, `eta`, constraints: `learning_rate > 0.0`
  - shrinkage rate
  - in `dart`, it also affects the normalization weights of dropped trees

- `num_leaves`, default = `31`, type = int, aliases: `num_leaf`, `max_leaves`, `max_leaf`, constraints: `1 < num_leaves <= 131072`
  - max number of leaves in one tree

- `max_depth`, default = `-1`, type = int
  - limit the max depth for tree model. This is used to deal with over-fitting when `#data` is small. Tree still grows leaf-wise
  - `<= 0` means no limit
- `min_data_in_leaf`, default = `20`, type = int, aliases: `min_data_per_leaf`, `min_data`, `min_child_samples`, constraints: `min_data_in_leaf >= 0`
  - minimal number of data in one leaf. Can be used to deal with over-fitting

- `min_sum_hessian_in_leaf`, default = `1e-3`, type = double, aliases: `min_sum_hessian_per_leaf`, `min_sum_hessian`, `min_hessian`, `min_child_weight`, constraints: `min_sum_hessian_in_leaf >= 0.0`
  - minimal sum hessian in one leaf. Like `min_data_in_leaf`, it can be used to deal with over-fitting
- `lambda_l1`, default = `0.0`, type = double, aliases: `reg_alpha`, constraints: `lambda_l1 >= 0.0`
  - L1 regularization

- `lambda_l2`, default = `0.0`, type = double, aliases: `reg_lambda`, `lambda`, constraints: `lambda_l2 >= 0.0`
  - L2 regularization
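As a starting point, the parameters above can be collected into a plain dictionary. The values below are illustrative only (mostly the documented defaults, with a smaller learning rate and some L2 regularization), and `check_params` is a made-up helper that re-checks the constraints quoted in the list, plus the fact that a tree of depth *d* cannot have more than 2^*d* leaves.

```python
params = {
    "objective": "regression",
    "boosting": "gbdt",
    "learning_rate": 0.05,            # documented default is 0.1
    "num_leaves": 31,
    "max_depth": -1,                  # <= 0 means no limit
    "min_data_in_leaf": 20,
    "min_sum_hessian_in_leaf": 1e-3,
    "lambda_l1": 0.0,
    "lambda_l2": 1.0,                 # documented default is 0.0
}


def check_params(p):
    """Return a list of violated constraints (empty if everything is fine)."""
    problems = []
    if not p["learning_rate"] > 0.0:
        problems.append("learning_rate must be > 0.0")
    if not 1 < p["num_leaves"] <= 131072:
        problems.append("num_leaves must satisfy 1 < num_leaves <= 131072")
    if p["min_data_in_leaf"] < 0:
        problems.append("min_data_in_leaf must be >= 0")
    if p["min_sum_hessian_in_leaf"] < 0.0:
        problems.append("min_sum_hessian_in_leaf must be >= 0.0")
    if p["lambda_l1"] < 0.0 or p["lambda_l2"] < 0.0:
        problems.append("lambda_l1 and lambda_l2 must be >= 0.0")
    # A tree limited to depth d can hold at most 2**d leaves.
    if p["max_depth"] > 0 and p["num_leaves"] > 2 ** p["max_depth"]:
        problems.append("num_leaves is larger than max_depth allows")
    return problems


print(check_params(params))
# With LightGBM installed, a dictionary like this is what gets passed
# as the first argument of lightgbm.train (not executed here).
```

Because the tree grows leaf-wise, `num_leaves` is the main handle on model complexity, with `max_depth` and `min_data_in_leaf` acting as additional guards against over-fitting.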

In the next part of our series, let’s look at the one who tread a path less taken – **CatBoost**

- Part I – Gradient boosting Algorithm
- Part II – Regularized Greedy Forest
- Part III – XGBoost
- Part IV – LightGBM
- Part V – CatBoost
- Part VI(A) – Natural Gradient
- Part VI(B) – NGBoost
- Part VII – The Battle of the Boosters

- Friedman, Jerome H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 (2001), no. 5, 1189–1232.
- Ke, Guolin et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, pages 3149–3157.
- Walter D. Fisher. “On Grouping for Maximum Homogeneity.” Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
- LightGBM Parameters. https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst#core-parameters