Suppose you are the manager of the Google Play Store and the Google Pixel 5a launch is coming up. HQ sends you a forecast from their ML model and says they expect 100 units to be sold in the first week of launch. But you know from experience that the HQ predictions are not always right, and you want to hedge against a stockout by procuring more than 100. But how much more? How do you know how wrong the prediction from the ML model will be? In other words, how confident is this prediction of 100 units? This information about the model's confidence is crucial to the decision, and the ML model from HQ does not give it to you. But what if, in addition to the forecast of 100 units, the model also gave you a measure of uncertainty – like the standard deviation of the expected probability distribution? Now you can make an informed decision, based on your risk appetite, about how much to overstock.
But how do we do that? Classification problems have an advantage here: the logistic function we slap on at the top gives us some idea of the model's confidence (although it is not technically the same thing). When it comes to regression, our traditional models give us only a point estimate.
There are two major kinds of uncertainty – Epistemic and Aleatoric Uncertainty (phew, that was quite a mouthful).
A typical supervised machine learning problem can be written as:

y = f(x; θ) + ε

Here the epistemic uncertainty is derived from the model parameters θ and the aleatoric uncertainty from the noise term ε. Typically, high epistemic uncertainty is found in the parts of the feature space that are sparsely populated with data examples. In such an n-dimensional space, there may be many parameter settings that can explain the given data points, and this leads to uncertainty.
The key idea is this:
In a vanilla neural network regression, we have a single neuron at the last layer, trained to predict the value we are interested in. What if we predict the parameters of a probability distribution instead? For example, a Gaussian distribution is parametrized by its mean (μ) and its standard deviation (σ). So instead of a single neuron in the last layer, we have two, which predict the mean and the standard deviation of the Gaussian distribution.
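Here is a minimal sketch of what such a two-headed network could look like in PyTorch (the layer sizes, names, and the softplus used to keep σ positive are my own illustrative choices, not the article's final recipe – activation choices for the variance are discussed further down):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianRegressor(nn.Module):
    """Regression net with two output heads: the mean and the standard
    deviation of a Gaussian over the target."""

    def __init__(self, in_features, hidden=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.sigma_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.backbone(x)
        mu = self.mean_head(h)
        # softplus keeps sigma strictly positive (one of several options)
        sigma = F.softplus(self.sigma_head(h)) + 1e-6
        return mu, sigma
```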
Awesome! When properly trained, we have the mean and the standard deviation, which means we have the entire probability distribution, and therefore, by extension, an estimate of the uncertainty.
But there is a problem. When we train the model to predict the parameters of a distribution, we are imposing a strong inductive bias on the model. For all we know, the target variable we are trying to model might not follow any parametric distribution.
Let’s look at the bimodal example. It looks like two Gaussian distributions squished together. And if we do a thought experiment and extend this “squishing of two Gaussian distributions” to “squishing of N Gaussian distributions”, the resulting mixture can model a wide variety of probability distributions. This is exactly the core idea of a Mixture Density Network. You have a number of Gaussian components (each with a mean and standard deviation) comprising the last layer of the network, and another learned parameter (a latent representation) that decides how to mix these Gaussian components.
Mixture Density Networks are built from two components – a Neural Network and a Mixture Model.
The Neural Network can be any valid architecture which takes in the input and converts it into a set of learned features (we can think of it as an encoder or backbone).
Now, let’s take a look at the Mixture Model. The Mixture Model, as we discussed before, models a probability distribution as a weighted sum of simpler distributions. More formally, it models a probability density function (pdf) as a mixture of m pdfs indexed by j, with mixing weights π_j:

p(y|x) = Σ_{j=1}^{m} π_j(x) φ_j(y|x)

where φ_j is a component pdf with parameters θ_j describing the shape and location of the distribution.
In his paper [1], Bishop uses the Gaussian kernel and explains that any probability density function can be approximated to arbitrary accuracy, provided the mixing coefficients and the Gaussian parameters are correctly chosen. Using the Gaussian kernel, the above equation becomes:

p(y|x) = Σ_{j=1}^{m} π_j(x) (1/(√(2π) σ_j(x))) exp(−(y − μ_j(x))² / (2 σ_j(x)²))
Bishop proposed a few restrictions and ways to implement the MDNs as well.
The network is trained end-to-end using standard backpropagation. The loss function we minimize is the Negative Log-Likelihood; minimizing it is equivalent to Maximum Likelihood Estimation.
We already know what p(y|x) is, so it’s just a matter of computing it and minimizing the negative log-likelihood.
Now that we know the theory behind the model, let’s look at how we can implement it (and a few tricks and failure modes). Implementing and training an MDN is notoriously hard because there are just a lot of things that can go wrong. But through adopting prior research and a lot of experimentation, I have settled on a few tricks which make the training relatively stable. The implementation is in PyTorch, using a library that I developed – PyTorch Tabular (a highly flexible framework for deep learning with tabular data).
If you remember the pdf formula for the Gaussian mixture, it involves a summation of all the component pdfs weighted by the mixing coefficients. So the Negative Log-Likelihood would be:

E = −ln Σ_{j=1}^{m} π_j(x) φ_j(y|x)

and we know that

φ_j(y|x) = (1/(√(2π) σ_j(x))) exp(−(y − μ_j(x))² / (2 σ_j(x)²))
If you examine the equation, we see an exponential function that is multiplied by the mixing coefficients, and then we take the log of the sum. This exponential followed by a log can produce very small numbers and lead to numerical underflow. So, as suggested by Axel Brando Guillaumes [3], we use the Log-Sum-Exp trick and carry out the calculations in the log domain.
So the negative log-likelihood becomes:

E = −logsumexp_j [ ln π_j(x) − ln(√(2π) σ_j(x)) − (y − μ_j(x))² / (2 σ_j(x)²) ]
Using torch.logsumexp, we calculate the negative log-likelihood of all the samples in the batch and then take the mean. This helps us avoid a lot of numerical instability during training.
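A minimal sketch of this loss in PyTorch (tensor shapes and names are my own; PyTorch Tabular's actual implementation differs in details):

```python
import math
import torch

def mdn_negative_log_likelihood(pi_logits, mu, sigma, y):
    """Negative log-likelihood of a Gaussian mixture, in the log domain.

    pi_logits, mu, sigma: (batch, m) tensors for m components; y: (batch, 1).
    """
    log_pi = torch.log_softmax(pi_logits, dim=-1)
    # log N(y | mu_j, sigma_j) for every component j
    log_component = (-0.5 * math.log(2 * math.pi)
                     - torch.log(sigma)
                     - 0.5 * ((y - mu) / sigma) ** 2)
    # log sum_j pi_j * N(y | mu_j, sigma_j), computed stably with logsumexp
    log_mix = torch.logsumexp(log_pi + log_component, dim=-1)
    return -log_mix.mean()
```

With a single component, this reduces to the ordinary Gaussian negative log-likelihood.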
Bishop [1] suggested an exponential activation function for the variance parameter. The choice has its strengths: the exponential function always produces a positive output, and on the lower side it never quite reaches zero. But in practice it has problems. The exponential function grows very large very fast, and on datasets with high variance, training becomes unstable.
Axel Brando Guillaumes [3] proposes an alternative, which is what we have used in our implementation: a modified version of the ELU activation.
The ELU function retains the exponential behavior and reverts to linear behavior for higher values. The only problem is that the exponential behavior occurs when x is negative, where ELU approaches −1. But if we shift the function upwards by 1, we get a function that approximates the exponential behavior while staying non-negative. So Axel Brando Guillaumes [3] proposed ELU(x) + 1 as the activation function for the variance parameter. Since technically this can still become zero, we also add an epsilon to the modified ELU to ensure stability. So the final activation used is ELU(x) + 1 + ε.
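In code, the modified activation is tiny (a sketch; PyTorch Tabular's implementation may differ in details):

```python
import torch
import torch.nn as nn

class PositiveELU(nn.Module):
    """ELU(x) + 1 + eps: approximately exponential for negative inputs,
    linear for positive ones, and strictly positive everywhere."""

    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.elu = nn.ELU()

    def forward(self, x):
        return self.elu(x) + 1 + self.eps
```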
When we use multiple Gaussian components and train the model with backpropagation, the network tends to ignore all but one component and train that single component to explain the data. For example, with two components, one mixing coefficient will tend to zero and the other to one during training, and only the mean and variance of the component whose coefficient went to one will be trained to explain the data.
Axel Brando Guillaumes [3] suggests clipping the mixing coefficients to a lower limit, but this approach had its own problems.
So, through experimentation, we found another set of tricks to make the network behave (not guaranteed, but they certainly encourage the network away from mode collapse).
Bishop [1] suggested a Softmax layer to convert the raw logits of the mixing coefficients (π) into probabilities. We have instead used a Gumbel-Softmax, which produces a much sharper probability distribution. This is desirable because we want the model to be able to efficiently factor out one or more components when they are not needed. Softmax would still assign them a small probability, while Gumbel-Softmax makes that probability even smaller.
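A quick illustration with PyTorch's built-in F.gumbel_softmax (the logits and temperature are arbitrary; note the output is stochastic because of the injected Gumbel noise):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.1, -1.0]])  # raw mixing-coefficient logits

soft = F.softmax(logits, dim=-1)
# Gumbel-Softmax perturbs the logits with Gumbel noise and divides by a
# temperature tau; a small tau pushes the output toward a near-one-hot vector
sharp = F.gumbel_softmax(logits, tau=0.1, hard=False, dim=-1)
```

Both outputs are valid probability vectors; the Gumbel-Softmax one concentrates its mass on far fewer components.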
Below is a sample of the implementation in PyTorch Tabular. Head over to the repo to see the full implementation (as a bonus, you also get access to implementations of other models like NODE, AutoInt, etc.).
Taking inspiration from this excellent Colab Notebook by Dr. Oliver Borchers, I have designed my own set of experiments using the implementation. There is also a tutorial on how to use MDNs in PyTorch Tabular here.
This is a simple linear function, but with Gaussian noise added as a function of the input x. A regular feedforward network will approximate this function with a straight line through the middle. But if we use an MDN with a single component, it approximates both the mean and the uncertainty of the function.
This is a nonlinear function with a twist. For each value of x, there are two values of y – one on the positive side and another on the negative side. Let’s take a look at what happens if we fit a normal feedforward network on this data.
Not very good, is it? If you think of the maximum likelihood estimate, it is bound to fail here because both the positive and negative points are equally likely. But when you train an MDN on this data, the two components are properly learned, and if we plot them out, it looks like this:
Here we have two Gaussian components mixed in a ratio, all of which is parameterized by x.
The histogram also shows two small bumps on top, instead of the single smooth one of a normal Gaussian.
Let’s look at how the standard FeedForward network captures this function.
Pretty well, except for the part where there is a transition. And, as always, no information about the uncertainty. (Note: it’s really awesome how the network bends itself to approximate the step function.)
Now let’s train an MDN on this data.
Now, isn’t that something? The two components are well captured – one component captures the first part well and the other captures the second. The uncertainty is also estimated pretty well. Let’s also take a look at the mixing coefficients.
Component 1 (the pink one in the previous plot) has a high mixing coefficient until the transition point and drops to zero pretty quickly after that.
We have seen how important uncertainty is to business decisions and explored one way of estimating it using Mixture Density Networks. The implementation, with all the tricks discussed in this article, is available in a very user-friendly form in PyTorch Tabular, along with a few other state-of-the-art algorithms for tabular data. Check it out here:
As always for any query/questions reach out to me on LinkedIn. For any issues you face with the library, raise an issue at Github.
Intuitively, this is strange, isn’t it? Neural networks are universal approximators, and ideally they should be able to approximate the function even in the tabular data domain. Or maybe they can, but need a humongous amount of data to properly learn that function? And how do Gradient Boosted Trees do it so well? Maybe the inductive bias of a decision tree is well suited to the tabular data domain?
If they are so good, why don’t we use decision trees in neural networks? Just like we have Convolution Operation for images, and Recurrent Networks for Text, why can’t we use Decision Trees as a basic building block for Tabular data?
The answer is pretty straightforward: trees are not differentiable, and without gradients flowing through the network, backprop bombs. But this is where researchers started to bang their heads: how do we make Decision Trees differentiable?
In 2015, Kontschieder et al. presented Deep Neural Decision Forests [1], which had a decision-tree-like structure but was differentiable.
Let’s take a step back and think about a Decision Tree.
A typical Decision Tree looks something like the picture above. Simplistically, it is a collection of decision nodes and leaf nodes which together act as a function:

y = T(x; Θ)

where T is the decision tree, parametrized by Θ, which maps input x to output y.
Let’s look at the leaf nodes first, because they are easier. In traditional Decision Trees, a leaf node typically holds a distribution over the class labels. This is right up the alley of a Sigmoid or Softmax activation. So we can replace the leaf nodes with a Softmax layer and make them differentiable.
Now, let’s take a deeper look at a decision node. The core purpose of a decision node is to decide whether to route a sample to the left or to the right. Let’s call these decisions d_i. Each decision uses a particular feature f_i and a threshold t_i; these are the parameters of the node.
In traditional Decision Trees, this decision is binary: it’s either right or left, 0 or 1. But this is deterministic and not differentiable. Now, what if we relax this and make the routing stochastic? Instead of a sharp 1 or 0, it becomes a number between 0 and 1. This feels like familiar territory, doesn’t it? The sigmoid function.
That’s exactly what Kontschieder et al. proposed. If we relax the strict 0-1 decision into a stochastic one with a sigmoid function, the node becomes differentiable.
Now we know how a single node (decision or leaf) works. Let’s put them all together. The red path in the diagram above is one of the paths through the decision tree. In the deterministic version, a sample either goes through this route or it doesn’t. Thinking about the same process in probabilistic terms, the sample reaches the leaf node at the end of the path only if it is routed the right way at every node along that path.
In the probabilistic paradigm, we compute the probability that a sample goes left or right at each node (d_i or 1 − d_i) and multiply these along the path to get the probability that the sample reaches a given leaf node. The probability of the sample reaching the highlighted leaf node is the product of the routing probabilities of the nodes along the red path.
Now, we just need to take the expected value of all the leaf nodes using the probabilities of each of the decision paths to get the prediction for a sample.
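To make this concrete, here is a toy soft decision tree of depth 2 in PyTorch (the structure and names are my own sketch, not Kontschieder et al.'s implementation): routing probabilities come from sigmoids, and the prediction is the expected value of the leaves under the path probabilities.

```python
import torch

def soft_tree_predict(x, w, b, leaf_values):
    """Toy soft binary tree of depth 2: nodes 0 (root), 1, 2; four leaves.

    w: (3, features) and b: (3,) parametrize the sigmoid routing at each node;
    leaf_values: (4,) holds the scalar prediction stored at each leaf.
    """
    d = torch.sigmoid(x @ w.t() + b)  # (batch, 3): p(route left) at each node
    d0, d1, d2 = d[:, 0], d[:, 1], d[:, 2]
    # probability of reaching each leaf = product of routing probs on its path
    leaf_probs = torch.stack([
        d0 * d1,                # left,  left
        d0 * (1 - d1),          # left,  right
        (1 - d0) * d2,          # right, left
        (1 - d0) * (1 - d2),    # right, right
    ], dim=1)
    # prediction = expected value of the leaves under the path probabilities
    return leaf_probs @ leaf_values
```

Because the four path probabilities always sum to one, setting every leaf value to 1 yields a prediction of exactly 1 for any input, which is a handy sanity check.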
Now that you’ve got an intuition about how decision-tree-like structures can be used in neural networks, let’s talk about the NODE model [3].
An Oblivious Tree is a decision tree that is grown symmetrically: at each level of the tree, the same feature and threshold are used to split the instances into the left and right partitions. CatBoost, a prominent gradient boosting implementation, uses oblivious trees. Oblivious Trees are particularly interesting because they can be reduced to a Decision Table with 2^d cells, where d is the depth of the tree. This simplifies things pretty neatly.
Each Oblivious Decision Tree (ODT) outputs one of 2^d responses, where d is the depth of the tree. This is done using d feature-threshold combinations, which are the parameters of the ODT.
Formally, the ODT can be defined as:

h(x) = R[H(f_1(x) − b_1), ..., H(f_d(x) − b_d)]

where H denotes the Heaviside function (a step function which is 0 for negative inputs and 1 for positive inputs), f_i are the splitting features, b_i the thresholds, and R the 2^d-cell response table.
Now, to make the tree output differentiable, we need to replace the splitting feature choice f_i and the comparison against the threshold b_i with their continuous counterparts.
In traditional trees, the choice of the feature to split a node on is a deterministic decision. For differentiability, we choose a softer approach: a weighted sum of the features, where the weights are learned. Normally, we would think of a Softmax over the features, but we want sparse feature selection, i.e. we want the decision to be based on only a handful of (preferably one) features. To that effect, NODE uses the entmax transformation (Peters et al., 2019) over a learnable feature selection matrix F.
Similarly, we relax the Heaviside function as a two-class entmax. As different features can have different characteristic scales, we scale its input with a learnable parameter:

c_i(x) = two-class-entmax((f_i(x) − b_i) / τ_i)

where b_i and τ_i are learnable parameters for the thresholds and scales respectively.
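NODE uses the α=1.5 entmax (available in the `entmax` PyPI package). As an illustration of the same idea, here is sparsemax, the α=2 member of the entmax family, which has a short closed form; like entmax, it can assign exactly zero weight to irrelevant inputs (this sketch is mine, not NODE's code):

```python
import torch

def sparsemax(z):
    """Sparsemax over the last dimension (the alpha=2 member of the entmax
    family): a sparse alternative to softmax that can output exact zeros."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.shape[-1] + 1, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1)
    support = 1 + k * z_sorted > cumsum       # entries that stay nonzero
    k_z = support.sum(dim=-1, keepdim=True)   # size of the support
    tau = (torch.gather(cumsum, -1, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)      # project onto the simplex
```

For logits like [2.0, 1.0, -1.0], softmax gives every entry some mass, while sparsemax puts all of it on the first entry.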
We know that a tree branch has two sides, and c_i(x) defines only one of them; the other side is 1 − c_i(x). So, to complete the tree, we stack the two, and define a “choice” tensor C as the outer product over all d levels:

C(x) = [c_1(x), 1 − c_1(x)] ⊗ ... ⊗ [c_d(x), 1 − c_d(x)]

This gives us the choice weights, or intuitively the probabilities of each of the 2^d outputs in the Response tensor R. The output then reduces to a weighted sum of the Response tensor, weighted by the Choice tensor.
The entire setup looks like the below diagram:
The jump from an individual tree to a “forest” is pretty simple. If we have m trees in the ensemble, the final output is the concatenation of the m individual tree outputs.
In addition to the core module (the NODE layer), they also propose a deep version, where multiple NODE layers are stacked with residual connections: the input features and the outputs of all previous layers are concatenated and fed into the next NODE layer, and so on. Finally, the outputs from all the layers are averaged (similar to a Random Forest).
In all the experiments in the paper [3], they transformed each feature to follow a normal distribution using a quantile transformation. This step was important for stable training and faster convergence.
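In scikit-learn this is QuantileTransformer(output_distribution='normal'); a simplified rank-based version for a single feature might look like this (my own sketch, not the paper's code):

```python
import numpy as np
from statistics import NormalDist

def quantile_to_normal(x):
    """Map a 1-D feature to an approximately standard normal distribution
    via its ranks (a simplified quantile transformation)."""
    ranks = np.argsort(np.argsort(x))          # 0 .. n-1
    quantiles = (ranks + 0.5) / len(x)         # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf             # Gaussian quantile function
    return np.array([inv_cdf(q) for q in quantiles])
```

The transform is monotonic, so the ordering of the feature values is preserved while the marginal distribution becomes roughly Gaussian.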
Before training the network, they propose a data-aware initialization to get good initial parameters. They initialize the feature selection matrix F uniformly, while the thresholds b are initialized with random feature values. The scales τ are initialized in such a way that all the samples in the first batch fall in the linear region of the two-sided entmax and hence receive nonzero gradients. Finally, the response tensor is initialized from a standard normal distribution.
The paper performs experiments with 6 datasets – Epsilon, YearPrediction, Higgs, Microsoft, Yahoo, and Click. They compared NODE with CatBoost, XGBoost and FCNN.
First, they compared all the algorithms with default hyperparameters. The default architecture of NODE was a single layer of 2048 trees of depth 6; these parameters were inherited from the CatBoost defaults.
Then they tuned all the algorithms and compared again.
The authors have made the implementation available as a ready-to-use PyTorch module here.
It is also implemented in the new library I released, PyTorch Tabular, along with a few other State of the Art algorithms for Tabular data. Check it out here:
The unreasonable effectiveness of deep learning displayed in many other modalities, like text and images, has not been demonstrated in tabular data. But lately, the deep learning revolution has shifted some focus to the tabular world, and as a result we are seeing new architectures and models designed specifically for the tabular data modality. Many of them are proving equivalent to, or even slightly better than, well-tuned Gradient Boosting models.
PyTorch Tabular is a framework/ wrapper library which aims to make Deep Learning with Tabular data easy and accessible to realworld cases and research alike. The core principles behind the design of the library are:
Instead of starting from scratch, the framework has been built on the shoulders of giants like PyTorch (obviously) and PyTorch Lightning.
It also comes with stateoftheart deep learning models that can be easily trained using pandas dataframes.
The BaseModel class provides an easy-to-extend abstract class for implementing custom models while still leveraging the rest of the machinery packaged with the library. PyTorch Tabular aims to reduce the barrier to entry for both industry application and research of deep learning for tabular data. As things stand now, working with neural networks is not that easy; at least not as easy as traditional ML models with scikit-learn.
PyTorch Tabular attempts to make the “software engineering” part of working with neural networks as easy and effortless as possible, and lets you focus on the model. It also hopes to unify the different developments in the tabular space into a single framework with an API that works across different state-of-the-art models.
Right now, most of the developments in tabular deep learning are scattered across individual Github repos. And apart from fastai (which I love and hate), no framework has really paid attention to tabular data. This is the need that led to PyTorch Tabular.
First things first – let’s look at how we can install the library.
Although the installation includes PyTorch, the best and recommended way is to first install PyTorch from here, picking up the right CUDA version for your machine. (PyTorch Version >1.3)
Once you have PyTorch installed, just use:
pip install pytorch_tabular[all]
to install the complete library with extra dependencies (Weights & Biases for experiment tracking).
And:
pip install pytorch_tabular
for the bare essentials.
The sources for pytorch_tabular can be downloaded from the Github repo.
You can either clone the public repository:
git clone git://github.com/manujosephv/pytorch_tabular
Once you have a copy of the source, you can install it with:
python setup.py install
There are four configs that you need to provide(most of them have intelligent default values), which will drive the rest of the process.
data_config = DataConfig(
    target=['target'],  # target should always be a list. Multi-targets are only supported for regression. Multi-task classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LR Finder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=100,
    gpus=1,  # index of the GPU to use. 0 means CPU
)
optimizer_config = OptimizerConfig()
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="1024-512-512",  # Number of nodes in each layer
    activation="LeakyReLU",  # Activation between each layer
    learning_rate=1e-3,
)
Now that we have the configs defined, we need to initialize the model using these configs and call the fit method.
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
tabular_model.fit(train=train, validation=val)
That’s it. The model will be trained for the specified number of epochs.
To implement new models, see the How to implement new models tutorial. It covers basic as well as advanced architectures.
To evaluate the model on new data, using the same metrics/loss that were used during training, we can use the evaluate method.
result = tabular_model.evaluate(test)

DATALOADER:0 TEST RESULTS
{'test_accuracy': tensor(0.6924, device='cuda:0'),
'train_accuracy': tensor(0.6051, device='cuda:0'),
'train_loss': tensor(0.6258, device='cuda:0'),
'valid_accuracy': tensor(0.7440, device='cuda:0'),
'valid_loss': tensor(0.5769, device='cuda:0')}

To get the predictions as a dataframe, we can use the predict method. This will add the predictions to the same dataframe that was passed in. For classification problems, we get both the probabilities and the final prediction, taking 0.5 as the threshold.
pred_df = tabular_model.predict(test)
We can also save a model and load it later for inference.
tabular_model.save_model("examples/basic")
loaded_model = TabularModel.load_from_checkpoint("examples/basic")
result = loaded_model.evaluate(test)
The code for the framework is available at PyTorch Tabular: A standard framework for modelling Deep Learning Models for tabular data (github.com).
Documentation and tutorials can be found at PyTorch Tabular (pytorch-tabular.readthedocs.io)
Contributions are more than welcome, and details about how to contribute are also laid out here.
fastai is the closest thing to PyTorch Tabular; both are built on PyTorch. But PyTorch Tabular differentiates itself from fastai with its modular and decoupled nature, and its use of standard PyTorch and PyTorch Lightning components, which makes adopting it, adding new models, and hacking the code much easier than with fastai.
[1] Sergei Popov, Stanislav Morozov, Artem Babenko. “Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data”. arXiv:1909.06312 [cs.LG] (2019)
[2] Sercan O. Arik, Tomas Pfister;. “TabNet: Attentive Interpretable Tabular Learning”. arXiv:1908.07442 (2019).
I will continue to write separate blog posts about the different models that have been implemented in PyTorch Tabular. Watch out for them.
Good quotes help make us stronger. What is truly inspiring about quotes is not their tone or contentedness but how those who share them reflect life experiences that really serve others.
I didn’t write the above quote about quotes (quoteception), but an AI model I trained did. And it says it better than I would have. Quotes mean different things to different people. Sometimes they inspire us or motivate us. Other times they make us think about life and religion, and sometimes they just make us laugh.
So, can we train an AI model to generate more quotes that make us think, laugh, or feel inspired? That was the motivation behind starting this journey.
I have fine-tuned a GPT-2 model on quotes with personas like Motivational/Inspirational, Funny, and Serious/Philosophical, and deployed it on a ready-to-use website: AI Quotes Generator.
On June 3, 2020, OpenAI released GPT-3, a mammoth language model trained on 570GB of internet text. People have put the multi-talented model to use in all kinds of applications, from creating app designs to websites to Excel functions that do nothing short of magic. But there was just one problem: the model weights were never released. The only way to access them was through a paywalled API.
So let’s time travel to November 2019, when GPT-2 was released. GPT-2, although not as powerful as GPT-3, was a revolution in text generation. I remember my jaw dropping to the floor while reading the demo text generated by the model; the coherence and grammatical syntax were near perfect. What I wanted from the model was not magic, but the ability to generate perfectly structured English sentences. And for that, GPT-2 was more than sufficient.
First, I needed a dataset. Scraping the web for quotes was one option, but before that I wanted to see if somebody had already done it. And bingo! Quotes500k is a dataset of almost 500k quotes scraped from the web, along with tags like knowledge, love, friendship, etc.
Now I wanted the model to be able to generate quotes for specific themes. Since I was planning to use a pretrained model, conditional text generation was not easy to do. PPLM, released by Uber, is a model – or rather, a way of using pretrained models – that attempts to do just this (a very interesting paper; be sure to check it out), but in the end I went another way: I decided to train three models, each fine-tuned to a specific genre of quotes.
There were three genres I considered – Motivational, Serious, and Funny. I used the tags in the Quotes500k dataset to separate the quotes into these three buckets. For the Motivational dataset, I used tags such as love, life, inspirational, motivational, life lessons, and dreams. For Serious, I went with tags like philosophy, life, god, time, faith, and fear. And for Funny, I just took tags like humour and funny.
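A sketch of how that bucketing might look with pandas (the column names, the exact tag sets, and the priority order for quotes carrying tags from multiple buckets are all my assumptions):

```python
import pandas as pd

# toy stand-in for Quotes500k rows: quote text plus a list of tags
df = pd.DataFrame({
    "quote": ["Dream big.", "Time is an illusion.", "Knock knock."],
    "tags": [["motivational", "dreams"], ["philosophy", "time"], ["humour"]],
})

MOTIVATIONAL = {"love", "inspirational", "motivational", "life lessons", "dreams"}
SERIOUS = {"philosophy", "god", "time", "faith", "fear"}
FUNNY = {"humour", "funny"}

def bucket(tags):
    """Assign a genre by tag overlap; the priority order is an arbitrary
    choice for quotes whose tags fall into more than one bucket."""
    tagset = set(tags)
    if tagset & FUNNY:
        return "funny"
    if tagset & SERIOUS:
        return "serious"
    if tagset & MOTIVATIONAL:
        return "motivational"
    return None

df["genre"] = df["tags"].apply(bucket)
```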
Nothing fancy here; just basic text hygiene.
You might have questions like – What about stop words? Lemmatization? Where are all of those?
Removing stop words and lemmatization are not mandatory steps for every single NLP task. It really depends on the task. If we were doing text classification with a TF-IDF kind of model, then yes, it makes sense to do all of that. But for text classification with a model that uses context, like a neural model or an N-gram model, less so. And if you are doing text generation or machine translation, removing stop words and lemmatizing may actually hurt performance, because we lose valuable context from the corpus.
I also didn’t want the quotes to be too long. So I took a look at the dataset, plotted the frequencies of quote lengths, and decided on an appropriate cutoff. In the motivational corpus, I discarded all quotes longer than 100 words.
Another important aspect of keeping the quotes short is the model’s ability to predict an end-of-sentence token. So we wrap each quote with a beginning-of-sentence (bos) token and an end-of-sentence (eos) token.
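For example (the literal token strings here are placeholders; they should match whatever special tokens your tokenizer is configured with):

```python
BOS, EOS = "<BOS>", "<EOS>"  # placeholder token strings; match your tokenizer

def wrap_quote(quote):
    """Wrap a quote with begin/end tokens so the model learns where to stop."""
    return f"{BOS}{quote.strip()}{EOS}"
```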
Now that we have the dataset prepared, let’s start training the model. Transformers from Hugging Face is the obvious choice, with its easy-to-use API and an amazing collection of models with pretrained weights. Thanks to Hugging Face, training or fine-tuning a model is a breezy affair.
We start off by loading the model and the tokenizer for GPT2.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelWithLMHead.from_pretrained('gpt2')
That’s it. You have the entire power of a huge pretrained model at your fingertips while you code.
Now, let’s define a function to load the dataset.
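One possible version, using the TextDataset helper that shipped with transformers at the time (it has since been deprecated in favor of the datasets library; the function name and block size are my choices):

```python
from transformers import TextDataset

def load_dataset(file_path, tokenizer, block_size=128):
    """Build a causal-LM dataset from a text file of wrapped quotes.
    TextDataset tokenizes the file and chunks it into block_size sequences."""
    return TextDataset(tokenizer=tokenizer, file_path=file_path,
                       block_size=block_size)
```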
Now that we have the datasets, let’s create the Trainer. This is the core class which handles all the training process.
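A hedged sketch of that setup (the epoch count, batch size, and save_steps below are placeholders, not the values I actually used):

```python
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def build_trainer(model, tokenizer, train_dataset, output_dir="./output"):
    """Wire up the HF Trainer for causal-LM fine-tuning."""
    # mlm=False: plain causal language modelling, not masked LM
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                    mlm=False)
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=50,
        per_device_train_batch_size=8,
        save_steps=5000,
    )
    return Trainer(model=model, args=args, data_collator=data_collator,
                   train_dataset=train_dataset)
```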
By this point of time, we have all the ingredients necessary to start training – the Model, Tokenizer, and the Trainer. All that is left is to do is train the model. And once the training is done, we just need to save the model and tokenizer for our use.
trainer.train()
trainer.save_model("./storage/gpt2motivational_v6")
tokenizer.save_pretrained("./storage/gpt2motivational_v6")
I trained each of the three models for ~50 epochs on a single P5000, for a total of ~12 hours (excluding all the test runs, and runs which had some problems).
We have trained the models and saved them. Now what? For inference, we use another brilliant feature from Hugging Face: pipelines. This awesome feature lets you put a model into production in as little as a few lines of code.
The gen_kwargs configures the text generation. I have used a hybrid approach of top_k sampling with k=50 and top_p sampling with p=0.95. To avoid repetition in the generated text, I used no_repeat_ngram_size=3 and repetition_penalty=1.2.
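Collected as a dict (max_length and do_sample are my additions for illustration; the rest are the values mentioned above), these settings would be passed to the pipeline on each call:

```python
gen_kwargs = {
    "do_sample": True,               # sample instead of greedy decoding (assumption)
    "top_k": 50,                     # sample only from the 50 most likely tokens
    "top_p": 0.95,                   # ...further restricted to the 95% nucleus
    "no_repeat_ngram_size": 3,       # never repeat any 3-gram
    "repetition_penalty": 1.2,       # penalize repeated tokens
    "max_length": 100,               # cap quote length (placeholder value)
}
# usage: generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
#        generator(prompt, **gen_kwargs)
```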
Now that we have the core models trained, we need a way to interact with them. Writing code every time we need something is just not user-friendly. So we need a UI, and I chose to whip up a simple one with ReactJS.
Although I can’t make the entire project available on Github, I’ll post the key parts as Github Gists.
For the purposes of the UI, I have dubbed the three models with a persona – Inspiratobot(Motivational), Aristobot(Serious), and FunnyBot.
The key features of the UI are:
The folder structure of the UI Project is as follows:

+- public
|    favicon.ico
|    index.html
|    logo.png
|    logo.svg
|    manifest.json
+- src
|    App.js
|    customRatings.js   # The heart rating component
|    debug.log
|    index.js
|    quotes.css         # The CSS for the page
|    quotes.js          # The core page
|    theme.js
+- .gitignore
+- package-lock.json
+- package.json
+- README.md
The quotes.js and customRatings.js can be found in my Github Gists.
For hosting and serving the model, we need a server. And for this I chose FastAPI, a no-nonsense web framework specifically designed for building APIs.
I highly recommend FastAPI because of the sheer simplicity. A very simple API example(from the docs) is below:
from typing import Optional

from fastapi import FastAPI

app = FastAPI()


@app.get("/")
def read_root():
    return {"Hello": "World"}


@app.get("/items/{item_id}")
def read_item(item_id: int, q: Optional[str] = None):
    return {"item_id": item_id, "q": q}
Less than 10 lines of code to get a web server for your API sounds too good to be true, but that is just what it is.
Without going into details, the key part of the API is the generation of the quotes and below is the code for that.
The folder structure of the project:
+ app
    api.py
    db_utils.py
    enums.py
    timer.py
    __init__.py
+ data
    funny_quotes.txt
    motivational_quotes.txt
    serious_quotes.txt
+ models
    + gpt2-funny
    + gpt2-motivational
    + gpt2-serious
app_config.yml
logger.py
logging_config.json
main.py
memory_profiler.log
pipeline.log
requirements.txt
It would have been all too easy to dockerize the application, spin up an EC2 instance, and host this. But I wanted an option that was cheap, if not free. Another challenge was that the models are about 500MB each, and loading them into memory made the RAM consumption hover around 1GB. So an EC2 instance would have been costly. In addition to that, I also needed to store the three models in the cloud, which would add storage costs. And to top it all, I also needed a DB to store the ratings that users give.
With these specifications, I went hunting for products/services. To make things easier, I had separated the app into two parts – a backend API server and a frontend UI which calls the backend API internally.
During my search, I stumbled upon this awesome list of free services that developers can avail. And after evaluating a few of those options, I finally zeroed in on the below options which would make the total cost of deployment as low as possible:
“KintoHub is an all-in-one deployment platform designed for full stack developers by full stack developers”, reads the home page. If cost was not a concern, I could have deployed the entire app on KintoHub, because that is what they offer – a backend server which can scale well, a frontend server to serve static HTML files, and a database. In the background, they containerize the app using a very easy-to-use web interface and deploy it on a machine of your chosen specification. And the great thing is that all of this can be done from Github. You check in your code to a Github repository (public or private) and deploy the app directly by pointing KintoHub at the repo.
The bare minimum setup that has to be done fits in a single page.
That’s it. Of course there are more settings available, like the memory that should be available for the app to run, etc. Although KintoHub has a free tier, I soon realized that because of the RAM consumption of the models, I need a minimum of 1.5GB to make it run without crashing. So I moved to a Pay-as-you-go tier, where they generously award you a $5 credit each month. The cost calculator shows me the app hosting is going to cost $5.50, and I’m fine with that (still waiting till the end of the month to see the actual cost).
Firebase is a lot of things. It says on the homepage that it is the complete app development platform, and it really is. But we are only interested in the Firebase Hosting, where they let you host single page websites for free. And it was a breeze to deploy the app. All you have to do is:
A very short tutorial is all you need to get this done.
MongoDB Atlas is a cloud DBaaS which provides MongoDB on the cloud, and the good thing is that they offer a free tier with 512MB of storage. For our use case, it is more than enough.
Creating a new cluster is straightforward and once you sort out the access issues, we can use a python wrapper like pymongo to implement our DB connection.
The result of all this is a website where you can interact with the model, generate quotes, and pass them off as yours :D.
Although the model isn’t perfect, it still churns out some really good quotes that make you think. And that’s what you want from any quote, right? Have fun playing with the model.
Y_i – ith element of the time series
i – the index of the timeseries
j – the index of nonzero demand
Q_j – the inter-demand interval, i.e. the gap between two nonzero demands
M_j – the demand size at the jth nonzero demand point
Traditionally, there is a class of algorithms which take a slightly different path to forecasting the intermittent time series. This set of algorithms considered the intermittent demand in two parts – Demand Size and Interdemand Interval – and modelled them separately.
Croston proposed to apply single exponential smoothing separately to both M and Q, as below (the estimates are updated only at the points where a nonzero demand occurs):

M̂_j = α·M_j + (1 − α)·M̂_{j−1}
Q̂_j = α·Q_j + (1 − α)·Q̂_{j−1}

After getting these estimates, the final forecast is the ratio of the two:

Ŷ = M̂ / Q̂

And this is a one-step ahead forecast; if we have to extend it to multiple timesteps, we are left with a flat forecast with the same value.
Syntetos and Boylan, 2005, showed that Croston forecasting was biased on intermittent demand and proposed a correction using the smoothing parameter α from the inter-demand interval estimation:

Ŷ_SBA = (1 − α/2)·(M̂ / Q̂)
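Croston and the SBA correction are simple enough to sketch in a few lines of pure Python (my own illustrative implementation, not taken from any library):

```python
def croston(series, alpha=0.1, variant="croston"):
    """One-step-ahead Croston forecast for an intermittent demand series.

    Smooths demand size (M) and inter-demand interval (Q) separately,
    updating only at nonzero-demand points. variant="sba" applies the
    Syntetos-Boylan (1 - alpha/2) bias correction.
    """
    m_hat = q_hat = None
    interval = 1                       # periods since the last nonzero demand
    for y in series:
        if y > 0:
            if m_hat is None:          # initialize with the first demand seen
                m_hat, q_hat = y, interval
            else:                      # exponential smoothing updates
                m_hat = alpha * y + (1 - alpha) * m_hat
                q_hat = alpha * interval + (1 - alpha) * q_hat
            interval = 1
        else:
            interval += 1
    if m_hat is None:                  # no demand at all in the history
        return 0.0
    forecast = m_hat / q_hat
    if variant == "sba":
        forecast *= 1 - alpha / 2      # SBA bias correction
    return forecast
```

Since the forecast is flat, an n-step-ahead forecast simply repeats this value.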
Shale, Boylan, and Johnston (2006) derived the expected bias when the arrival follow a Poisson process.
A renewal process is an arrival process in which the inter-arrival intervals are positive, independent and identically distributed (IID) random variables. This formulation generalizes the Poisson process: in a Poisson process the inter-demand intervals are exponentially distributed, while a renewal process only requires i.i.d. inter-demand times with a finite mean.
Turkmen et al., 2019, cast Croston and its variants into a renewal process mold. The two random variables, M and Q, both defined on the positive integers, fully define the renewal process.
Once Croston forecasting was cast as a renewal process, Turkmen et al. proposed to estimate the two components – Demand Size (M) and Inter-demand Interval (Q) – with a recurrent neural network:

h_j = RNN(h_{j−1}, [M_j, Q_j])
M_{j+1} ~ NegBin( θ_M(h_j) )
Q_{j+1} ~ NegBin( θ_Q(h_j) )

This means that we have a single RNN which takes in as input both M and Q and encodes that information into a hidden state (h). And then we put two separate NN layers on top of this hidden state to estimate the probability distributions of both M and Q. For both M and Q, the Negative Binomial distribution is the choice suggested by the paper.
Negative Binomial Distribution is a discrete probability distribution which is commonly used to model count data. For example, the number of units of an SKU sold, number of people who visited a website, or number of service calls a call center receives.
The distribution derives from a sequence of Bernoulli trials, where each experiment has only two outcomes. A classic example is a coin toss, which can either be heads or tails. So the probability of success is p and of failure is 1 − p (in a fair coin toss, 0.5 each). Now, if we keep performing this experiment until we have seen r successes, the number of failures we see will have a negative binomial distribution.
The semantic meaning of success and failure need not hold true when we apply this; what matters is that there are only two types of outcome.
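As a quick pure-Python sketch (my own illustrative code), the negative binomial pmf in the "number of failures before the r-th success" parametrization is:

```python
from math import comb

def negbin_pmf(k, r, p):
    """P(K = k): probability of seeing exactly k failures before the
    r-th success, where each Bernoulli trial succeeds with probability p."""
    return comb(k + r - 1, k) * (p ** r) * ((1 - p) ** k)

def negbin_mean(r, p):
    """Expected number of failures before the r-th success."""
    return r * (1 - p) / p

def negbin_var(r, p):
    """Variance exceeds the mean -> handles overdispersed count data."""
    return r * (1 - p) / p ** 2
```

Because the variance is larger than the mean, the distribution copes with over-dispersed counts (like sporadic sales) better than a Poisson would.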
The paper only talks about one-step ahead forecasts, which is also what you will find in a lot of Intermittent Demand Forecasting literature. But in the real world, we would need longer horizons than that to plan properly. Whether it is Croston or the Deep Renewal Process, the way we generate an n-step ahead forecast is the same – a flat forecast of Demand Size (M) / Inter-demand Time (Q).
We have introduced two new methods of decoding the output – Exact and Hybrid – in addition to the existing method, Flat. Suppose we trained the model with a prediction length of 5.
The raw output from the model would be a sequence of (M, Q) pairs, one for each timestep (e.g. M=22, Q=2 for the first step).
Flat
Under flat decoding, we just pick the first set of outputs (M=22 and Q=2), generate a one-step ahead forecast (22/2 = 11), and extend the same forecast to all 5 timesteps.
Exact
Exact decoding is a more confident version of decoding. Here we predict a demand of size M every Q timesteps (the inter-demand time) and make the rest of the forecast zero.
Hybrid
In hybrid decoding, we combine these two to generate a forecast which also takes into account long-term changes in the model’s expectation. We use the M/Q value as the forecast, but we update the M/Q value based on the subsequent outputs. For eg., in the example we have, we will forecast 11 (which is 22/2) for the first 2 timesteps, and then forecast 33 (which is 33/1) for the next timestep, etc.
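Here is a small pure-Python sketch of the three decoding schemes as described (my own illustrative code; the library's actual implementation may differ):

```python
def flat_decode(mq_pairs, horizon):
    """Repeat the first one-step-ahead Croston forecast (M/Q) for all steps."""
    m, q = mq_pairs[0]
    return [m / q] * horizon

def exact_decode(mq_pairs, horizon):
    """Predict a demand of M every Q steps; zero everywhere else."""
    m, q = mq_pairs[0]
    out = [0] * horizon
    t = q - 1                          # first demand lands after Q periods
    while t < horizon:
        out[t] = m
        t += q
    return out

def hybrid_decode(mq_pairs, horizon):
    """Forecast M/Q, but refresh (M, Q) from the next raw output after Q steps."""
    out, t, i = [], 0, 0
    while t < horizon and i < len(mq_pairs):
        m, q = mq_pairs[i]
        out.extend([m / q] * min(q, horizon - t))
        t += q
        i += 1
    return out[:horizon]
```

With a raw output starting [(22, 2), (33, 1), ...], hybrid decoding yields 11, 11, 33, ... as in the example above.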
I have implemented the algorithm using GluonTS, which is a framework for Neural Time Series forecasting, built on top of MXNet. AWS Labs is behind the open source project and some of the algorithms like DeepAR are used internally by Amazon to produce forecasts.
The paper talks about two variants of the model – Discrete Time DRP(Deep Renewal Process) and Continuous Time DRP. In this library, we have only implemented the discrete time DRP, as it is the more popular use case.
The package is uploaded on pypi and can be installed using:
pip install deeprenewal
Recommended Python Version: 3.6
Source Code: https://github.com/manujosephv/deeprenewalprocess
If you are working on Windows and need to use your GPU (which I recommend), you need to first install the MXNet==1.6.0 version which supports GPU – see the MXNet Official Installation Page.
And if you are facing difficulties installing the GPU version, you can try (depending on the CUDA version you have):
pip install mxnet-cu101==1.6.0 -f https://dist.mxnet.io/python/all
Relevant Github Issue
usage: deeprenewal [-h] [--use-cuda USE_CUDA]
                   [--datasource {retail_dataset}]
                   [--regenerate-datasource REGENERATE_DATASOURCE]
                   [--model-save-dir MODEL_SAVE_DIR]
                   [--point-forecast {median,mean}]
                   [--calculate-spec CALCULATE_SPEC]
                   [--batch_size BATCH_SIZE]
                   [--learning-rate LEARNING_RATE]
                   [--max-epochs MAX_EPOCHS]
                   [--number-of-batches-per-epoch NUMBER_OF_BATCHES_PER_EPOCH]
                   [--clip-gradient CLIP_GRADIENT]
                   [--weight-decay WEIGHT_DECAY]
                   [--context-length-multiplier CONTEXT_LENGTH_MULTIPLIER]
                   [--num-layers NUM_LAYERS]
                   [--num-cells NUM_CELLS]
                   [--cell-type CELL_TYPE]
                   [--dropout-rate DROPOUT_RATE]
                   [--use-feat-dynamic-real USE_FEAT_DYNAMIC_REAL]
                   [--use-feat-static-cat USE_FEAT_STATIC_CAT]
                   [--use-feat-static-real USE_FEAT_STATIC_REAL]
                   [--scaling SCALING]
                   [--num-parallel-samples NUM_PARALLEL_SAMPLES]
                   [--num-lags NUM_LAGS]
                   [--forecast-type FORECAST_TYPE]

GluonTS implementation of paper 'Intermittent Demand Forecasting with Deep
Renewal Processes'

optional arguments:
  -h, --help            show this help message and exit
  --use-cuda USE_CUDA
  --datasource {retail_dataset}
  --regenerate-datasource REGENERATE_DATASOURCE
                        Whether to discard the locally saved dataset and
                        regenerate it from source
  --model-save-dir MODEL_SAVE_DIR
                        Folder to save models
  --point-forecast {median,mean}
                        How to estimate the point forecast? Mean or Median
  --calculate-spec CALCULATE_SPEC
                        Whether to calculate SPEC. It is computationally
                        expensive and therefore False by default
  --batch_size BATCH_SIZE
  --learning-rate LEARNING_RATE
  --max-epochs MAX_EPOCHS
  --number-of-batches-per-epoch NUMBER_OF_BATCHES_PER_EPOCH
  --clip-gradient CLIP_GRADIENT
  --weight-decay WEIGHT_DECAY
  --context-length-multiplier CONTEXT_LENGTH_MULTIPLIER
                        If the context multiplier is 2, the context available
                        to the RNN is 2 * prediction length
  --num-layers NUM_LAYERS
  --num-cells NUM_CELLS
  --cell-type CELL_TYPE
  --dropout-rate DROPOUT_RATE
  --use-feat-dynamic-real USE_FEAT_DYNAMIC_REAL
  --use-feat-static-cat USE_FEAT_STATIC_CAT
  --use-feat-static-real USE_FEAT_STATIC_REAL
  --scaling SCALING     Whether to scale targets or not
  --num-parallel-samples NUM_PARALLEL_SAMPLES
  --num-lags NUM_LAGS   Number of lags to be included as features
  --forecast-type FORECAST_TYPE
                        Defines how the forecast is decoded. For details look
                        at the documentation
There is also a notebook in the examples folder which shows how to use the model. A relevant excerpt is below (the import lines here are indicative, added for readability; check the notebook for the exact ones):

import ast

import mxnet as mx
import numpy as np
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.trainer import Trainer

# DeepRenewalEstimator, IntermittentEvaluator, and get_dataset come
# from the deeprenewal package itself

dataset = get_dataset(args.datasource, regenerate=False)
prediction_length = dataset.metadata.prediction_length
freq = dataset.metadata.freq
cardinality = ast.literal_eval(dataset.metadata.feat_static_cat[0].cardinality)
train_ds = dataset.train
test_ds = dataset.test
trainer = Trainer(ctx=mx.context.gpu() if is_gpu&args.use_cuda else mx.context.cpu(),
batch_size=args.batch_size,
learning_rate=args.learning_rate,
epochs=20,
num_batches_per_epoch=args.number_of_batches_per_epoch,
clip_gradient=args.clip_gradient,
weight_decay=args.weight_decay,
hybridize=True) #hybridize false for development
estimator = DeepRenewalEstimator(
prediction_length=prediction_length,
context_length=prediction_length*args.context_length_multiplier,
num_layers=args.num_layers,
num_cells=args.num_cells,
cell_type=args.cell_type,
dropout_rate=args.dropout_rate,
scaling=True,
lags_seq=np.arange(1,args.num_lags+1).tolist(),
freq=freq,
use_feat_dynamic_real=args.use_feat_dynamic_real,
use_feat_static_cat=args.use_feat_static_cat,
use_feat_static_real=args.use_feat_static_real,
cardinality=cardinality if args.use_feat_static_cat else None,
trainer=trainer,
)
predictor = estimator.train(train_ds, test_ds)
deep_renewal_flat_forecast_it, ts_it = make_evaluation_predictions(
dataset=test_ds, predictor=predictor, num_samples=100
)
evaluator = IntermittentEvaluator(quantiles=[0.25,0.5,0.75], median=True, calculate_spec=False, round_integer=True)
#DeepAR
agg_metrics, item_metrics = evaluator(
ts_it, deep_renewal_flat_forecast_it, num_series=len(test_ds)
)
The paper evaluates the model on two datasets – the Parts dataset and the UCI Retail dataset. And for the evaluation of the probabilistic forecast, they use Quantile Loss.
The authors used a single hidden layer with 10 hidden units and the softplus activation to map the LSTM embedding to the distribution parameters. They used a global RNN, i.e. the LSTM parameters are shared across all the timeseries. And they evaluated the one-step ahead forecast.
We did not recreate the experiment exactly; rather, we expanded the scope. The dataset chosen was the UCI Retail dataset, and instead of a one-step ahead forecast, we took a 39-days-ahead forecast. This is more in line with a real-world application, where you need more than a one-step ahead forecast to plan. And instead of just comparing with Croston and its variants, we also compare with ARIMA, ETS, NPTS, and DeepAR (which is mentioned as the next step in the paper).
The UCI Retail dataset is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail business. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
Columns:
Preprocessing:
Stats:
Time Series Segmentation
Using the same segmentation – Intermittent, Lumpy, Smooth and Erratic we discussed earlier, I’ve divided the dataset into four.
We can see that almost 98% of the timeseries in the dataset are either Intermittent or Lumpy, which is perfect for our use case.
The baseline we have chosen, as in the paper, is the Croston forecast. We have also added slight modifications of Croston, namely SBA and SBJ. So let’s first do a comparison against these baselines, for both one-step ahead and n-step ahead forecasts. We will also evaluate both the point forecast (using MSE, MAPE, and MAAPE) and the probabilistic forecast (using Quantile Loss).
We can see that the DRP models have considerably outperformed the baseline methods across both point as well as probabilistic forecasts.
Now let’s take a look at nstep ahead forecasts.
Here, we see a different picture. On the point forecasts, Croston is doing much better on MSE. DRPs are doing well on MAPE, but we know that MAPE favors under-forecasting and is not very reliable when we look at intermittent demand patterns. So from a point forecast standpoint, I wouldn’t say that DRPs outperformed Croston on long-term forecasts. What we do notice is that within the DRPs, Hybrid decoding is working much better than Flat decoding, on both MAPE and MSE.
For an extended comparison of the results, let’s also include results from other popular forecasting techniques as well. We have added ETS and ARIMA from the point estimation stable and DeepAR and NPTS from the probabilistic forecast stable.
On MSE, ETS takes the top spot, even though on MAPE and MAAPE the DRPs retain their position. On the probabilistic forecast side, DeepAR has extraordinarily high Quantile Losses, but when we look at a weighted Quantile Loss (weighted by volume) we see it emerging in the top position. This may be because DeepAR forecasts zero (or close to zero) most of the time and only forecasts good numbers for high-volume SKUs. Across all three Quantile Losses, NPTS seems to be outperforming the DRPs by a small margin.
When we look at a long-term forecast, we see that ARIMA and ETS are doing considerably better on the point estimates (MSE). And on the probabilistic side, DeepAR turned it around and managed to become the best probabilistic model.
Intermittent
Stable (less Intermittent)
Ali Caner Turkmen, Yuyang Wang, Tim Januschowski. “Intermittent Demand Forecasting with Deep Renewal Processes”. arXiv:1911.10416 [cs.LG] (2019)
Casually, we call series with a lot of zero-demand periods, i.e. sporadic demand, intermittent series. Syntetos and Boylan (2005) proposed a more formal way of categorizing such time series. They used two parameters of the time series for this classification – the Average Demand Interval (ADI) and the squared Coefficient of Variation (CV²).
Average Demand Interval is the average interval in time periods between two nonzero demand. i.e. if the ADI for a time series is 1.9, it means that on an average we see a nonzero demand every 1.9 time periods.
ADI is a measure of intermittency; the higher it is, the more intermittent the series is.
The Coefficient of Variation is the standardized standard deviation: we calculate the standard deviation, but then scale it by the mean of the series to guard against scale dependency.
This shows the variability of a time series. If CV² is high, the variability in the series is also high.
Based on these two demand characteristics, Syntetos and Boylan theoretically derived cutoff values which define a marked change in the type of behaviour. They defined the ADI cutoff as 1.32 and the CV² cutoff as 0.49. Using these cutoffs, they defined highs and lows, and then, putting both together, a grid which classifies timeseries into Smooth, Erratic, Intermittent, and Lumpy.
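A small pure-Python sketch of this classification (my own illustrative code, using the 1.32 / 0.49 cutoffs above; note that approximating ADI as series length divided by the number of nonzero periods is a simplification):

```python
from statistics import mean, pstdev

def classify_series(series):
    """Classify a demand series as smooth / erratic / intermittent / lumpy
    using the Syntetos-Boylan cutoffs (ADI = 1.32, CV^2 = 0.49)."""
    nonzero = [y for y in series if y > 0]
    # ADI approximated as average periods per nonzero demand
    adi = len(series) / len(nonzero)
    # squared coefficient of variation of the nonzero demand sizes
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2
    if adi >= 1.32:
        return "lumpy" if cv2 >= 0.49 else "intermittent"
    return "erratic" if cv2 >= 0.49 else "smooth"
```

High ADI with high size variability gives Lumpy; high ADI alone gives Intermittent; variability alone gives Erratic.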
The forecast measures we have discussed were all, predominantly, designed to handle Smooth and Erratic timeseries. But in the real world, there are a lot more Intermittent and Lumpy timeseries. Typical examples are spare parts sales, the long tail of retail sales, etc.
The single defining characteristic of Intermittent and Lumpy series is the number of periods with zero demand. And this wreaks havoc with a lot of the measures we have seen so far. All the percent errors (for eg. MAPE) become unstable because of division by zero, which is now an almost certainty. Similarly, the Relative Errors (for eg. MRAE), where we use a reference forecast to scale the errors, also become unstable, especially when using the Naïve Forecast as a reference. This happens because there would be multiple periods with zero demand, which creates a zero reference error and hence an undefined measure.
sMAPE is designed to avoid this division by zero, but even sMAPE has problems when the number of zero demands increases. And we know from our previous explorations that sMAPE has problems when either the forecast is much higher than the actuals or vice versa. In intermittent demand, such cases are galore: if there is zero demand and we have forecasted something, or the other way around, we have such a situation. For eg., for a zero demand, whether one method forecasts 1 or another forecasts 10, the outcome is 200% regardless.
Cumulative Forecast Error (CFE, CFE Min, CFE Max)
We have already seen Cumulative Forecast Error (a.k.a. Forecast Bias) earlier. It is just the signed error summed over the entire horizon, so that the negative and positive errors cancel each other out. This has direct implications for over- or under-stocking in a Supply Chain. Peter Wallström[1] also advocates the use of CFE Max and CFE Min. A CFE of zero can happen by chance, and over a large horizon we miss out on a lot of detail in between. So he proposes to look at CFE in conjunction with CFE Max and CFE Min, which are the maximum and minimum values of CFE over the horizon.
Percent Better (PBMAE, PBRMSE, etc.)
We have already seen Percent Better. This is also a pretty decent measure to use for intermittent demand. It does not have the problem of numerical instability and is defined everywhere. But it measures the count of errors rather than their magnitude.
Number of Shortages (NOS and NOSp)
Generally, to trace whether a forecast is biased or not, we use the tracking signal (which is CFE/MAD). But the limits that are set to trigger warnings (±4) are derived on the assumption that the demand is normally distributed. Intermittent demand is not normally distributed, and because of that this trigger calls out a lot of false positives.
An alternative to this is the Number of Shortages measure, more commonly represented as the Percentage of Number of Shortages (NOSp). It just counts the number of instances where the cumulative forecast error was greater than zero, resulting in a shortage. A very high or a very low number indicates bias in one direction or the other.
Periods in Stock (PIS)
NOS does not identify systematic errors because it doesn’t consider the temporal dimension of stock carry-over. PIS goes one step further and measures the total number of periods the forecasted items have spent in stock, or the number of stock-out periods.
To understand how PIS works, let’s take an example.
Let’s say there is a forecast of one unit every day in a three day horizon. In the beginning of the first period the one item is delivered to the fictitious stock (this is a simplification compared to reality). If there has been no demand during the first day, the result is plus one PIS. When a demand occurs, the demand is subtracted from the forecast. A demand of one in period 1 results in zero PIS in period 1 and CFE of 1. If the demand is equal to zero during three periods, PIS in period 3 is equal to plus six. The item from day one has spent three days in stock, the item from the second day have spent two days in stock and the last item has spent one day in stock
Excerpt from Evaluation of Forecasting Techniques and Forecast Errors with a focus on Intermittent Demand
A positive number indicates over forecast and a negative number shows under forecast of demand. It can easily be calculated as the cumulative sum of the CFE, i.e. the area under the bar chart in the diagram
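Following the worked example above, PIS can be sketched in a few lines of pure Python (my own illustrative code; note that sign conventions differ between authors – here positive means over-forecast):

```python
from itertools import accumulate

def pis(forecast, actual):
    """Periods In Stock: cumulative sum of the running CFE (forecast - actual).

    Each over-forecasted unit contributes +1 for every period it sits in the
    fictitious stock; each under-forecasted unit contributes -1 per period.
    """
    cfe = list(accumulate(f - a for f, a in zip(forecast, actual)))
    return sum(cfe)

# Example from the excerpt: forecast one unit/day, zero demand for 3 days
# -> the day-1 unit sits 3 days, the day-2 unit 2 days, the day-3 unit 1 day
```

This reproduces the excerpt's arithmetic: three over-forecasted units accumulate 3 + 2 + 1 = 6 stock-periods.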
Stock-keeping-oriented Prediction Error Costs (SPEC)
SPEC is a newer metric (Martin et al. 2020[4]) which takes the same route as Periods in Stock, but is slightly more sophisticated.
Although it looks intimidating at first, we can understand it intuitively. The crux of the calculation is handled by the two inner min terms – the Opportunity cost and the Stock-keeping cost. These are the two costs which we need to balance in a Supply Chain from an inventory management perspective.
The left term measures the opportunity cost, which arises from under-forecasting. This is the sales we could have made if there was enough stock. For eg., if the demand was 10 and we only forecasted 5, we have an opportunity loss of 5. Now, suppose we had been forecasting 5 for the last three time periods with no demand, and then a demand of 10 comes in. We have 15 in stock and we fulfill the 10, so here there is no opportunity cost. And we can also say that the opportunity cost for a time period will never be greater than the demand at that time period. Combining these conditions, we get the first term of the equation, which measures the opportunity cost.
Using the same logic, but inverting it, we can derive a similar expression for the Stock-keeping cost (where we over-forecast). That is taken care of by the right term in the equation.
SPEC for a timestep actually looks at all the previous timesteps, calculates the opportunity cost or stock-keeping cost for each of them, and adds them up to arrive at a single number. At any timestep, there will be either an opportunity cost or a stock-keeping cost, each of which in turn looks at the cumulative forecast and actuals till that timestep.
And SPEC for a horizon of a timeseries forecast is the average across all the timesteps.
There are also two weight terms which let us apply different weightages to the opportunity cost and the stock-keeping cost, and depending on the strategy of the organization we can choose the right weightages. The recommendation is to keep the sum of the weights at 1, and weighting the opportunity cost higher is a common choice in a retail setting.
One disadvantage of this is the time complexity. We need nested loops to calculate this metric, which makes it slow to compute.
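As I read the formula, a direct (unoptimized, nested-loop) pure-Python sketch would look like this – my own illustrative implementation, so verify it against the reference implementation linked below before relying on it:

```python
def spec(actual, forecast, a1=0.75, a2=0.25):
    """Stock-keeping-oriented Prediction Error Costs (nested-loop sketch).

    a1 weighs opportunity costs (under-forecast), a2 stock-keeping costs
    (over-forecast). a1=0.75 is a hypothetical default; pick weights
    summing to 1 per your inventory strategy.
    """
    n = len(actual)
    total = 0.0
    for t in range(1, n + 1):
        for i in range(1, t + 1):
            cum_y_i = sum(actual[:i])      # cumulative demand up to i
            cum_f_i = sum(forecast[:i])    # cumulative forecast up to i
            cum_y_t = sum(actual[:t])      # cumulative demand up to t
            cum_f_t = sum(forecast[:t])    # cumulative forecast up to t
            # unmet demand, capped by the demand at period i
            opportunity = max(0.0, min(actual[i - 1], cum_y_i - cum_f_t))
            # excess stock, capped by the forecast at period i
            stock = max(0.0, min(forecast[i - 1], cum_f_i - cum_y_t))
            total += (a1 * opportunity + a2 * stock) * (t - i + 1)
    return total / n
```

The double loop over (t, i) is exactly the quadratic cost mentioned above; a perfect forecast drives both min terms to zero.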
The implementation is available here – https://github.com/DominikMartin/spec_metric
Mean Arctangent Absolute Percent Error (MAAPE)
This is a clever trick on the MAPE formula which avoids one of its main problems – being undefined at zero. And while addressing this concern, the change also makes it more symmetric.
The idea is simple. We know that, for an angle θ in a right triangle, tan(θ) = opposite / adjacent.
So, if we consider a triangle with the adjacent and opposite sides equal to A and |A − F| respectively, the Absolute Percent Error is nothing but the slope of the hypotenuse.
A slope can be measured as a ratio, ranging from 0 to infinity, or as an angle, ranging from 0° to 90°. The slope as a ratio is the traditional Absolute Percent Error that is quite popular. So the paper presents the slope as an angle as a stable alternative. Those of you who remember your Trigonometry will recall that θ = arctan(opposite / adjacent).
The paper christens this the Arctangent Absolute Percent Error (AAPE) and defines the Mean Arctangent Absolute Percent Error as:

AAPE_t = arctan( |(A_t − F_t) / A_t| )
MAAPE = (1/N) · Σ AAPE_t

where A_t is the actual and F_t is the forecast at timestep t.
arctan is defined at all real values from negative infinity to infinity, and arctan(∞) = π/2. So, by extension, while APE ranges over [0, ∞), AAPE ranges over [0, π/2]. This makes it defined everywhere and robust in that way.
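The definition above is only a few lines of pure Python (my own illustrative code; the zero-actual/zero-forecast corner is treated as zero error by convention):

```python
from math import atan, pi

def maape(actual, forecast):
    """Mean Arctangent Absolute Percent Error.

    Maps each absolute percent error through arctan, so a zero actual
    yields pi/2 instead of an undefined / infinite APE.
    """
    def aape(a, f):
        if a == 0:
            # convention: a perfect zero forecast of a zero actual is 0 error;
            # otherwise arctan(inf) = pi/2
            return 0.0 if f == 0 else pi / 2
        return atan(abs((a - f) / a))
    return sum(aape(a, f) for a, f in zip(actual, forecast)) / len(actual)
```

For example, a 100% APE maps to arctan(1) = π/4, and the zero-demand case that breaks MAPE simply saturates at π/2.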
The symmetricity test we saw earlier gives us the below results(from the paper)
We can see that the asymmetry we saw in APE is not as evident here. And if we compare the complementary plots of AAPE to those of APE from earlier, AAPE is in much better shape.
We can see that AAPE still favours under-forecasting, but not as much as APE, and for that reason it might be more useful.
Relative Mean Absolute Error & Relative Mean Squared Error (RelMAE & RelMSE)
These are relative measures which compare the error of the forecast with the error of a reference forecast – in most use cases a naïve forecast or, more formally, a random walk forecast.
Scaled Error (MASE)
We’ve also seen MASE earlier and know how it’s defined: we scale the error by the average MAE of the reference forecast. Davydenko and Fildes, 2013[3], have shown that MASE is nothing but a weighted mean of the Relative MAE, the weights being the number of error terms. This means that including both MASE and RelMAE may be redundant. But let’s check them out anyway.
Let’s pick a real dataset, run ARIMA, ETS, and Crostons, with Zero Forecast as a baseline and calculate all these measures(using GluonTS).
I’ve chosen the Retail Dataset from the UCI Machine Learning Repository. It is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail business. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
Columns:
Preprocessing:
Stats:
Time Series Segmentation
Using the same segmentation – Intermittent, Lumpy, Smooth and Erratic we discussed earlier, I’ve divided the dataset into four.
We can see that almost 98% of the timeseries in the dataset are either Intermittent or Lumpy, which is perfect for our use case.
We have included Zero Forecast as kind of a litmus test which will tell us which forecast metrics we should be wary of when using it with Intermittent Demand.
We can see that sMAPE, RelMAE, MAE, MAPE, MASE and ND (which is the volume-weighted MAPE) all favor the zero forecast and rank it as the best forecasting method. But when we look at the inventory-related metrics (like CFE, PIS, etc., which measure systematic bias in the forecast), the Zero Forecast is the worst performing.
MASE, which was supposed to perform better on Intermittent Demand, also falls flat and rates the Zero Forecast the best. The danger of choosing a forecasting methodology based on these measures is that we end up forecasting way too low, and that will wreak havoc in the downstream planning tasks.
Surprisingly, ETS and ARIMA fare quite well (better than Croston) and rank 1st and 2nd when we look at metrics like PIS, MSE, CFE, NOSp, etc.
Croston fares well only when we look at MAAPE, MRAE, and CFE_min.
We have ranked the different forecast methods based on all these different metrics. And if a set of metrics are measuring the same thing, then these rankings would also show good correlation. So let’s calculate the Spearman’s Rank correlation for these ranks and see which metrics agree with each other.
We can see two large groups of metrics which are positively correlated among each other and negatively correlated between groups. MAE, MRAE, MASE, MAPE, RelMAPE, ND, and sMAPE fall into one group, and MSE, RMSE, CFE, PIS, SPEC_0.75, SPEC_0.5, SPEC_0.25, NOSp, PBMAE, RelRMSE, and NRMSE into the other. MAAPE and CFE_min fall into the second group as well, but are only lightly correlated.
Are these two groups measuring different characteristics of the forecast?
Let’s now look at the same agreement between metrics at an item level, instead of the aggregate one, i.e. for each item that we are forecasting, we rank the forecasting methods based on these different metrics and run a Spearman’s Rank Correlation on those ranks.
Similar to the aggregate-level view, here also we can find two groups of metrics, but contrary to the aggregate level, we cannot find a strong negative correlation between the two groups. SPEC_0.5 (where we give equal weightage to both opportunity cost and stock-keeping cost) and PIS show a high correlation, mostly because they are conceptually the same.
Another way to visualize and understand the similarity of the different metrics is to take the item-level metrics, run a PCA with two dimensions, and plot the directions of the original features with respect to the two components we have extracted – a loading plot. It shows how the original variables contribute to creating each principal component. So if we assume the two PCA components are the main “attributes” we measure when we talk about the “accuracy” of a forecast, the Loading Plot shows how these different features (metrics) contribute to them, in both magnitude and direction.
Here, we can see the relationship more crystalized. Most of the metrics are clustered together around the two components. MAE, MSE, RMSE, CFE, CFE_max, and the SPEC metrics all occupy similar space in the loading plot, and it looks like it is the component for “forecast bias” as CFE and SPEC metrics dominate this component. PIS is on the other side, almost at 180 degrees to CFE, because of the sign of PIS.
The other component might be the “accuracy” component. This is dominated by RelRMSE, MASE, MAAPE, MRAE, etc. MAPE seems to straddle the two components, and so does MAAPE.
We can also see that sMAPE might be measuring something totally different, like NOSp and CFE_min.
PIS is at 180 degrees from CFE, SPEC_0.5, and SPEC_0.75 because of the sign, but they are measuring the same thing. SPEC_0.25 (where we give 0.25 weight to the opportunity cost) shows more similarity to the other group, probably because it favors under-forecasting due to the heavy penalty on stock-keeping costs.
We’ve not done a lot of experiments in this short blog post (not as many as Peter Wallström’s thesis[1]), but what we have done has shown us quite a bit already. We know not to rely on metrics like sMAPE, RelMAE, MAE, MAPE, and MASE, because they were giving the zero forecast the best ranking. We also know that there is no single metric that will tell you the whole story. If we look only at something like MAPE, we are not measuring the structural bias in the forecast. And if we just look at CFE, it might show a rosy picture when that is not the case.
Let me quickly summarize the findings from Peter Wallström’s thesis (along with a few of my own conclusions).
GitHub Repository – https://github.com/manujosephv/forecast_metrics
Checkout the rest of the articles in the series
Both Scaled Error and Relative Error are extrinsic error measures. They depend on another reference forecast to evaluate itself, and more often than not, in practice, the reference forecast is a Naïve Forecast or a Seasonal Naïve Forecast. In addition to these errors, we will also look at measures like Percent better, cumulative Forecast Error, Tracking Signal etc.
When we say Relative Error, there are two main ways of calculating it, and Shcherbakov et al. call them Relative Errors and Relative Measures.
Relative Error is when we use the forecast from a reference model as a base to compare the errors and Relative Measures is when we use some forecast measure from a reference base model to calculate the errors.
Relative Error is calculated as below:

RAE_t = |y_t – f_t| / |y_t – f*_t|

where y_t is the ground truth, f_t is our forecast, and f*_t is the reference forecast at timestep t. Aggregations like MRAE simply average these over the horizon.
Similarly, Relative Measures are calculated as below:

RelMAE = MAE / MAE*

where MAE is the Mean Absolute Error of our forecast and MAE* is the MAE of the reference forecast. The measure used can be anything really, and not just MAE.
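To make the distinction concrete, here is a minimal numpy sketch of both flavors (the function names and the toy series are my own):

```python
import numpy as np

def mrae(y, f, f_ref):
    # Relative Error: per-timestep ratio of our absolute error to the
    # reference forecast's absolute error, averaged over the horizon
    return np.mean(np.abs(y - f) / np.abs(y - f_ref))

def rel_mae(y, f, f_ref):
    # Relative Measure: ratio of aggregate measures (here MAE vs MAE*)
    return np.mean(np.abs(y - f)) / np.mean(np.abs(y - f_ref))

y = np.array([10.0, 12.0, 11.0, 13.0])       # ground truth
f = np.array([11.0, 11.0, 12.0, 12.0])       # our forecast
f_naive = np.array([9.0, 10.0, 12.0, 11.0])  # reference (naive) forecast

print(mrae(y, f, f_naive))     # 0.75
print(rel_mae(y, f, f_naive))  # ~0.667
```

Note how the two can disagree: MRAE averages the per-point ratios, while RelMAE takes a ratio of the averages.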
Relative Error is based on a reference forecast; although most commonly we use a Naïve Forecast, that is not necessarily always the case. For instance, we can use Relative Measures when we have an existing forecast we are trying to beat, or we can use a baseline forecast we defined during the development cycle, etc.
One disadvantage we can see right away is that it will be undefined when the reference forecast is equal to the ground truth. And this can happen for either very stable time series or intermittent ones, where the same ground truth can repeat, which makes the naïve forecast equal to the ground truth.
Scaled Error was proposed by Hyndman and Koehler in 2006. They proposed to scale the errors based on the in-sample MAE from the naïve forecasting method. So instead of using the ground truth from the previous timestep as the scaling factor, we use the average absolute error of the naïve forecast across the entire series as the scaling factor:

q_t = e_t / ( (1 / (n – l)) × Σ_{i=l+1..n} |y_i – y_{i–l}| )

where e_t is the error at timestep t, n is the length of the timeseries, y_t is the ground truth at timestep t, and l is the offset. l is 1 for naïve forecasting. Another popularly used alternative is the seasonal offset, for eg. l = 12 for a seasonality of 12 months.
Here the in-sample MAE is chosen because it is always available and gives a more reliable estimate of the scale than the out-of-sample errors.
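A sketch of MASE under these definitions (function name and toy numbers are my own; it assumes a train/test split, with the scale coming from the in-sample naive errors):

```python
import numpy as np

def mase(y_train, y_test, f_test, l=1):
    # scale: in-sample MAE of the (seasonal) naive forecast with offset l
    scale = np.mean(np.abs(y_train[l:] - y_train[:-l]))
    # out-of-sample MAE of our forecast, scaled by the in-sample naive MAE
    return np.mean(np.abs(y_test - f_test)) / scale

y_train = np.array([10.0, 12.0, 11.0, 13.0])
y_test = np.array([14.0, 12.0])
f_test = np.array([13.0, 13.0])
print(mase(y_train, y_test, f_test))  # 0.6
```

A value below 1 means the forecast is, on average, better than the in-sample naive errors; above 1 means worse.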
In our previous blog, we checked Scale Dependency, Symmetricity, Loss Curves, Over and Under Forecasting, and the Impact of Outliers. But this time, we are dealing with relative errors, and therefore plotting loss curves is not easy anymore: there are three inputs (ground truth, forecast, and reference forecast), and the value of the measure can vary with each of these. Over and Under Forecasting and the Impact of Outliers we can still check.
The loss curves are plotted as a contour map to accommodate the three dimensions – Error, Reference Forecast and the measure value.
We can see that the errors are symmetric around the Error axis. If we keep the Reference Forecast constant and vary the error, the measures are symmetric on both sides of the errors. Not surprising since all these errors have their base in absolute error, which we saw was symmetric.
But the interesting thing here is the dependency on the reference forecast. The same error leads to different Relative Absolute Error values depending on the Reference Forecast.
We can see the same asymmetry in the 3D plot of the curve as well. But Scaled Error is different here because it is not directly dependent on the Reference Forecast, but rather on the mean absolute error of the reference forecast. And therefore it has the good symmetry of absolute error and very little dependency on the reference forecast.
For the Over and Under Forecasting experiment, we repeated the same setup from last time*, but for these four error measures – Mean Relative Absolute Error (MRAE), Mean Absolute Scaled Error (MASE), Relative Mean Absolute Error (RMAE), and Relative Root Mean Squared Error (RRMSE).
* – With one small change: we also add random noise less than 1 to make sure consecutive actuals are not the same, because in such cases the Relative Measures are undefined.
We can see that these scaled and relative errors do not have the problem of favoring over- or under-forecasting. The error bars of the low forecast and the high forecast are equally bad. Even in cases where the base error was favoring one of these (for eg. MAPE), the relative error measure (RMAPE) reduces that “favor” and makes the error measure more robust.
One other thing we notice is that the Mean Relative Absolute Error has a huge spread (I’ve actually zoomed in to make the plot legible). For eg., the median baseline_rmae is 2.79 and the maximum baseline_mrae is 42k. This large spread shows us that the Mean Relative Absolute Error has low reliability: depending on the samples, the errors vary wildly. This may be partly because of the way we use the Reference Forecast. If the Ground Truth is too close to the Reference Forecast (in this case the Naïve Forecast), the errors are going to be much higher. This disadvantage is partly resolved by using the Median Relative Absolute Error (MdRAE).
For checking the outlier impact, we also repeated the same experiment from the previous blog post for MRAE, MASE, RMAE, and RRMSE.
Apart from these standard error measures, there are a few more tailored to tackle aspects of the forecast which are not properly covered by the measures we have seen so far.
Out of all the measures we’ve seen so far, only MAPE is what I would call interpretable for non-technical folks. But as we saw, MAPE does not have the best of properties. None of the other measures intuitively conveys how good or bad the forecast is. Percent Better is another attempt at getting that kind of interpretability.
Percent Better (PB) also relies on a reference forecast, and measures our forecast by counting the number of instances where our forecast error was better than the reference forecast error.
For eg.,

PB(MAE) = (1 / N) × Σ I

where I = 0 when MAE > MAE* and 1 when MAE < MAE*, and N is the number of instances.
Similarly, we can extend this to any other error measure. This gives us an intuitive understanding of how much better we are doing as compared to the reference forecast. It is also pretty resistant to outliers, because it only counts the instances instead of quantifying the error.
That is also a key disadvantage. We are only counting the times we are better, not measuring how much better or worse we are doing. Whether our error is 50% less than the reference error or 1% less, the impact on the Percent Better score is the same.
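A minimal sketch of Percent Better on absolute errors (function name and series are my own); the last two calls demonstrate the magnitude-blindness described above:

```python
import numpy as np

def percent_better(y, f, f_ref):
    # count (not measure) the instances where our absolute error
    # beats the reference forecast's absolute error
    return np.mean(np.abs(y - f) < np.abs(y - f_ref)) * 100

y = np.array([10.0, 12.0, 11.0, 13.0])
f_ref = np.array([9.0, 10.0, 12.0, 11.0])   # reference errors: 1, 2, 1, 2

f_a = np.array([10.5, 11.0, 11.5, 12.0])    # clearly smaller errors
f_b = np.array([10.9, 10.1, 11.9, 11.1])    # barely smaller errors
print(percent_better(y, f_a, f_ref))  # 100.0
print(percent_better(y, f_b, f_ref))  # 100.0 -> size of the win is invisible
```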
Normalized RMSE was proposed to neutralize the scale dependency of RMSE. The general idea is to divide RMSE by a scalar, like the maximum value of the timeseries, the difference between the maximum and minimum, or the mean of all the ground truths, etc.
Since dividing by the maximum or by the difference between maximum and minimum is prone to impact from outliers, the popular form of nRMSE normalizes by the mean:
nRMSE = RMSE / mean(y)
All the errors we’ve seen so far focus on penalizing errors, whether positive or negative. We use an absolute or squared term to make sure the errors do not cancel each other out and paint a rosier picture than reality.
But by doing this, we are also becoming blind to structural problems with the forecast. If we are consistently over forecasting or under forecasting, that is something we should be aware of and take corrective actions. But none of the measures we’ve seen so far looks at this perspective.
This is where Forecast Bias comes in.
Although it looks like the Percent Error formula, the key here is the absence of the absolute term. Without the absolute term, we are cumulating the actuals and forecasts and measuring the difference between them as a percentage. This gives an intuitive explanation: if we see a bias of 5% (using Actuals – Forecast), we can infer that, overall, we are under-forecasting by 5%. Depending on whether we use Actuals – Forecast or Forecast – Actuals, the interpretation differs, but the spirit is the same.
When calculating across timeseries too, we cumulate the actuals and forecasts at whatever cut of the data we are measuring and calculate the Forecast Bias.
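A minimal numpy sketch, using the Actuals – Forecast convention described above, so a positive bias means under-forecasting (series values are my own):

```python
import numpy as np

def forecast_bias(y, f):
    # no absolute term: over- and under-forecasts cancel out, exposing
    # any structural bias as a percentage of the cumulative actuals
    return (np.sum(y) - np.sum(f)) / np.sum(y) * 100

y = np.array([100.0, 110.0, 90.0, 100.0])  # cumulative actuals = 400
f = np.array([95.0, 105.0, 85.0, 95.0])    # cumulative forecast = 380
print(forecast_bias(y, f))  # 5.0 -> under-forecasting by 5% overall
```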
Let’s add the error measures we saw now to the summary table we made last time.
Again, we see that there is no one ring to rule them all. There may be different choices depending on the situation, and we need to pick and choose for specific purposes.
We have already seen that it is not easy to just pick one forecast metric and use it everywhere. Each of them has its own advantages and disadvantages and our choice should be cognizant of all of those.
That being said, there are thumb rules you can apply to help you along the process:
Armstrong et al. (1992) carried out an extensive study on these forecast metrics using the M-competition data, sampling 5 subsamples totaling 90 annual and 101 quarterly series along with their forecasts. They then calculated the error measures on this sample and carried out a study to examine them.
The key dimensions they examined the different measures for were:
Reliability talks about whether repeated application of the measure produces similar results. To measure this, they first calculated the error measures for different forecasting methods on all 5 subsamples (aggregate level) and ranked the methods in order of performance. They did this for 1-step-ahead and 6-step-ahead forecasts for the Annual and Quarterly series.
They then calculated the pairwise Spearman’s rank-order correlation coefficients for each subsample and averaged them. E.g., they took the rankings from subsample 1 and compared them with subsample 2, then subsample 1 with subsample 3, etc., until all the pairs were covered, and then averaged the coefficients.
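That averaging can be sketched like this (the rankings are made-up; spearman here is just Pearson correlation on the ranks, which is valid when there are no ties):

```python
import numpy as np
from itertools import combinations

def spearman(a, b):
    # rank-order correlation = Pearson correlation of the ranks (no ties)
    rank = lambda x: np.argsort(np.argsort(x))
    return np.corrcoef(rank(a), rank(b))[0, 1]

# hypothetical performance rankings of 4 methods on 3 subsamples
rankings = [np.array([1, 2, 3, 4]),
            np.array([1, 2, 3, 4]),
            np.array([2, 1, 3, 4])]

# reliability: average Spearman correlation over all subsample pairs
reliability = np.mean([spearman(a, b) for a, b in combinations(rankings, 2)])
print(round(reliability, 3))  # 0.867
```

A measure whose rankings barely change between subsamples scores close to 1; an unreliable one scores much lower.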
The rankings based on RMSE were the least reliable, with very low correlation coefficients. They state that the use of RMSE can overcome this reliability issue only when there is a high number of time series in the mix, which might neutralize the effect.
They also found that Relative Measures like Percent Better and MdRAE have much higher reliability than their peers. They also calculated the number of samples required to achieve the same statistical significance as Percent Better – 18 series using GMRAE, 19 using MdRAE, 49 using MAPE, 55 using MdAPE, and 170 using RMSE.
While reliability measures consistency, construct validity asks whether a measure does, in fact, measure what it intends to measure. This shows us the extent to which the various measures assess the “accuracy” of forecasting methods. To compare this, they examined the rankings of the forecast methods as before, but this time they compared the rankings between pairs of error measures. For eg., how much agreement is there between a ranking based on RMSE and a ranking based on MAPE?
These correlations are influenced by both Construct Validity and Reliability. To account for the change in Reliability, the authors derived the same table using a larger number of samples and found that, as expected, the average correlations increased from 0.34 to 0.68, showing that these measures are, in fact, measuring what they are supposed to.
As a final test of validity, they constructed a consensus ranking by averaging the rankings from each of the error measures for the full sample of 90 annual and 101 quarterly series, and then examined the correlation of each individual error measure’s ranking with the consensus ranking.
RMSE had the lowest correlation with the consensus. This is probably because of the low reliability. It can also be because of RMSE’s emphasis on higher errors.
Percent Better also shows low correlation (even though it had high reliability). This is probably because Percent Better is the only measure which does not measure the magnitude of the error.
It is desirable to have error measures which are sensitive to the effects of changes, especially for parameter calibration or tuning. The measure should indicate the effect on “accuracy” when a change is made in the parameters of the model.
Median error measures are not sensitive, and neither is Percent Better. Median aggregation hides changes by focusing on the middle value and will only change slowly. Percent Better is not sensitive because once a series performs better than the reference, further improvement makes no change in the metric. It also does not register any improvement when an extremely bad forecast gets better, so long as it remains less accurate than the naïve forecast.
The paper makes it very clear that none of the measures they evaluated are ideal for decision making. They propose RMSE as a good-enough measure and frown upon percent-based errors, under the argument that actual business impact occurs in dollars and not in percent errors. But I disagree with that point, because when we are objectively evaluating a forecast to convey how good or bad it is doing, RMSE just does not make the cut. If I walk up to top management and say that the financial forecast had an RMSE of 22343, that would fall flat. But if I instead say that the accuracy was 90%, everybody is happy.
The paper and I agree on one thing, though: the relative error measures are not that relevant to decision making.
To help with the selection of errors, the paper also rates the different measures on the dimensions they identified.
For calibration or parameter tuning, the paper suggests using one of the measures rated high on sensitivity – RMSE, MAPE, and GMRAE. And because of the low reliability of RMSE and the low-forecast-favoring issue of MAPE, they suggest GMRAE (Geometric Mean Relative Absolute Error). MASE was proposed well after the release of this paper, and hence it does not factor into this analysis. But if you think about it, MASE is also sensitive, immune to the problems we see for RMSE or MAPE, and can be a good candidate for calibration.
To select between forecast methods, the primary criteria are reliability, construct validity, protection against outliers, and relationship to decision making. Sensitivity is not that important in this context.
The paper right away dismisses RMSE because of its low reliability and lack of protection against outliers. When the number of series is low, they suggest MdRAE, which is as reliable as GMRAE but offers additional protection from outliers. Given a moderate number of series, reliability becomes less of an issue, and in such cases MdAPE would be an appropriate choice because of its closer relationship to decision making.
Over these two blog posts, we’ve seen a lot of forecast measures, understood the advantages and disadvantages of each of them, and finally arrived at a few thumb rules to go by when choosing forecast measures. Although not conclusive, I hope this gives you a direction when going about these decisions.
But all this discussion was made under the assumption that the timeseries we are forecasting are stable and smooth. In real-world business cases, there are also a lot of series which are intermittent or sporadic: we see long periods of zero demand before a non-zero demand. In such cases, almost all of the error measures (with the possible exception of MASE) fail. In the next blog post, let’s take a look at a few different measures which are suited to intermittent demand.
Github Link for the Experiments: https://github.com/manujosephv/forecast_metrics
Update – 04-10-2020
Upon further reading, I stumbled upon a few criticisms of MASE as well, and thought I should mention them here.
Another interesting fact that Davidenko and Fildes[3] show is that MASE is equivalent to the weighted arithmetic mean of relative MAEs, where the number of available error values is the weight.
Measurement is the first step that leads to control and eventually improvement.
H. James Harrington
In many business applications, the ability to plan ahead is paramount, and in a majority of such scenarios we use forecasts to help us plan. For eg., if I run a retail store, how many boxes of that shampoo should I order today? Look at the forecast. Will I achieve my financial targets by the end of the year? Let’s forecast and make adjustments if necessary. If I run a bike rental firm, how many bikes do I need to keep at a metro station tomorrow at 4pm?
If, in all of these scenarios, we are taking actions based on the forecast, we should also have an idea of how good those forecasts are. In classical statistics or machine learning, we have a few general loss functions, like the squared error or the absolute error. But because of the way Time Series Forecasting has evolved, there are a lot more ways to assess your performance.
In this blog post, let’s explore the different Forecast Error measures through experiments and understand the drawbacks and advantages of each of them.
There are a few key points which make the metrics in Time Series Forecasting stand out from the regular metrics in Machine Learning.
As the name suggests, Time Series Forecasting has the temporal aspect built into it, and there are metrics like Cumulative Forecast Error or Forecast Bias which take this temporal aspect into account.
In most business use cases, we would not be forecasting a single time series, but rather a set of time series, related or unrelated. And higher management would not want to look at each of these time series individually, but rather at an aggregate measure which tells them directionally how well we are doing the forecasting job. Even for practitioners, this aggregate measure helps to get an overall sense of the progress made in modelling.
Another key aspect in forecasting is the concept of over- and under-forecasting. We would not want the forecasting model to have structural biases that always over- or under-forecast. And to combat these, we want metrics which don’t favor either over-forecasting or under-forecasting.
The final aspect is interpretability. Because these metrics are also used by non-analytics business functions, they need to be interpretable.
Because of these different use cases, a lot of metrics are used in this space, and here we try to unify them under some structure and also critically examine them.
We can classify the different forecast metrics, broadly, into two buckets – Intrinsic and Extrinsic. Intrinsic measures just take the generated forecast and the ground truth to compute the metric. Extrinsic measures use an external reference forecast in addition to the generated forecast and ground truth to compute the metric.
Let’s stick with the intrinsic measures for now (extrinsic ones require a whole different take on these metrics). There are four major ways in which we calculate errors – Absolute Error, Squared Error, Percent Error, and Symmetric Error. All the metrics that come under these are just different aggregations of these fundamental errors. So, without loss of generality, we can discuss these broad sections, and the discussion applies to all the metrics under these heads as well.
This group of error measurement uses the absolute value of the error as the foundation.
Instead of taking the absolute value, we square the errors to make them positive, and this is the foundation for these metrics.
In this group of error measurement, we scale the absolute error by the ground truth to convert it into a percentage term.
Symmetric Error was proposed as an alternative to Percent Error, where we take the average of forecast and ground truth as the base on which to scale the absolute error.
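The four fundamental errors can be written down in a few lines (function name is my own; the percentage variants are scaled to 100 here):

```python
import numpy as np

def fundamental_errors(y, f):
    e = y - f
    return {
        "absolute": np.abs(e),                                         # basis of MAE
        "squared": e ** 2,                                             # basis of MSE/RMSE
        "percent": np.abs(e) / np.abs(y) * 100,                        # basis of MAPE
        "symmetric": np.abs(e) / ((np.abs(y) + np.abs(f)) / 2) * 100,  # basis of sMAPE
    }

errs = fundamental_errors(np.array([4.0]), np.array([2.0]))
print(errs["percent"][0], errs["symmetric"][0])  # 50.0 vs ~66.7
```

Even on this single point, percent and symmetric error disagree, which is exactly what the experiments below probe.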
Instead of just saying that these are the drawbacks and advantages of such and such metrics, let’s design a few experiments and see for ourselves what those advantages and disadvantages are.
In this experiment, we try and figure out the impact of the scale of timeseries in aggregated measures. For this experiment, we
The error measure should be symmetric to the inputs, i.e. Forecast and Ground Truth. If we interchange the forecast and actuals, ideally the error metric should return the same value.
To test this, let’s make a grid of 0 to 10 for both actuals and forecast and calculate the error metrics on that grid.
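That grid check can be sketched as below; for absolute error, the test passes trivially, since swapping actuals and forecast leaves it unchanged:

```python
import numpy as np

grid = np.linspace(0, 10, 21)               # 0 to 10 in steps of 0.5
actuals, forecast = np.meshgrid(grid, grid)

abs_err = np.abs(actuals - forecast)
swapped = np.abs(forecast - actuals)        # roles of the inputs exchanged
print(np.allclose(abs_err, swapped))  # True -> symmetric to its inputs
```

Running the same check with percent error (dividing by actuals vs dividing by forecast) is what produces the asymmetric heatmaps discussed later.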
In this experiment, we take complementary pairs of ground truths and forecasts which add up to a constant quantity and measure the performance at each point. Specifically, we use the same setup as in the Symmetricity experiment, and calculate the points along the cross diagonal, where ground truth + forecast always adds up to 10.
Our metrics depend on two entities – forecast and ground truth. If we fix one and vary the other using a symmetric range of errors (for eg., -10 to 10), then we expect the metric to behave the same way on both sides of that range. In our experiment, we chose to fix the Ground Truth because, in reality, that is the fixed quantity, and we are measuring the forecast against the ground truth.
In this experiment we generate 4 random time series – ground truth, baseline forecast, low forecast, and high forecast. These are just random numbers generated within a range. Ground Truth and Baseline Forecast are random numbers between 2 and 4. Low Forecast is a random number between 0 and 3, and High Forecast is a random number between 3 and 6. In this setup, the Baseline Forecast acts as a baseline for us, Low Forecast is a forecast where we continuously under-forecast, and High Forecast is a forecast where we continuously over-forecast. Now let’s calculate the MAPE for these three forecasts and repeat the experiment 1000 times.
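The experiment can be reproduced in a few lines (the seed and series length are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n, runs = 100, 1000
mapes = {"baseline": [], "low": [], "high": []}
for _ in range(runs):
    y = rng.uniform(2, 4, n)  # ground truth
    for name, (lo, hi) in {"baseline": (2, 4), "low": (0, 3), "high": (3, 6)}.items():
        f = rng.uniform(lo, hi, n)
        mapes[name].append(np.mean(np.abs(y - f) / y) * 100)

# the consistently low forecast ends up with a lower MAPE than the high one
print({k: round(np.mean(v), 1) for k, v in mapes.items()})
```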
To check the impact of outliers, we set up the below experiment.
We want to check the relative impact of outliers on two axes – the number of outliers and the scale of outliers. So we define a grid – number of outliers [0%–40%] and scale of outliers [0 to 2]. Then we picked a synthetic time series at random, iteratively introduced outliers according to the parameters of the grid, and recorded the error measures.
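The injection loop can be sketched as below (the synthetic series and exact grid values are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(2, 4, 100)        # synthetic ground truth
f = y + rng.normal(0, 0.2, 100)   # a decent forecast to corrupt

results = {}
for frac in [0.0, 0.1, 0.2, 0.3, 0.4]:    # number of outliers axis
    for scale in [0.5, 1.0, 2.0]:         # scale of outliers axis
        f_bad = f.copy()
        idx = rng.choice(len(f), int(frac * len(f)), replace=False)
        f_bad[idx] *= (1 + scale)         # wildly-off predictions
        results[(frac, scale)] = {
            "MAE": np.mean(np.abs(y - f_bad)),
            "RMSE": np.sqrt(np.mean((y - f_bad) ** 2)),
        }
```

Plotting `results` over the grid as contours gives the heatmaps discussed below.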
That’s a nice symmetric heatmap. We see zero errors along the diagonal, and higher errors spanning away from it in a nice symmetric pattern.
Again symmetric. MAE varies equally if we go on both sides of the curve.
Again good news. If we vary forecast, keeping actuals constant, and vice versa the variation in the metric is also symmetric.
As expected, over or under forecasting doesn’t make much of a difference in MAE. Both are equally penalized.
This is the Achilles heel of MAE. Here, as we increase the base level of the timeseries, we can see that the MAE increases linearly. This means that when we are comparing performances across timeseries, this is not the measure you want to use. For eg., when comparing two timeseries, one with a level of 5 and another with a level of 100, using MAE would always assign a higher error to the timeseries with level 100. Another example is when you want to compare different subsections of your set of timeseries to see where the error is higher (for eg., different product categories). Using MAE would always tell you that the subsection with higher average sales also has a higher MAE, but that doesn’t mean that subsection is not doing well.
Squared Error also shows the symmetry we are looking for. But one additional point we can see here is that the errors are skewed towards higher errors. The distribution of color away from the diagonal is not as uniform as we saw for Absolute Error. This is because squared error (because of the square term) assigns higher impact to higher errors than lower errors. This is also why Squared Errors are, typically, more prone to distortion due to outliers.
Side Note: Since squared error and absolute error are also used as loss functions in many machine learning algorithms, this also has implications for the training of such algorithms. If we choose a squared error loss, we are less sensitive to smaller errors and more to higher ones. And if we choose absolute error, we penalize higher and lower errors equally, and therefore a single outlier will not influence the total loss that much.
We can see the same pattern here as well. It is symmetric around the origin, but because of the quadratic form, higher errors get disproportionately larger penalties compared to lower ones.
Similar to MAE, because of the symmetry, Over and Under Forecasting has pretty much the same impact.
Similar to MAE, RMSE also has the scale dependency problem, which means that all the disadvantages we discussed for MAE apply here as well, but worse. We can see that RMSE scales quadratically when we increase the scale.
Percent Error is the most popular error measure used in the industry. A couple of reasons why it is hugely popular are:
Now that doesn’t look right, does it? Percent Error, the most popular of them all, doesn’t look symmetric at all. In fact, we can see that the errors peak when actuals are close to zero, tending to infinity when actuals are zero (the colorless band at the bottom is where the error is infinite because of division by zero).
We can see two shortcomings of the percent error here:
Let’s look at the Loss Curves and Complementary Pairs plots to understand more.
Suddenly, the asymmetry we were seeing is no more. If we keep the ground truth fixed, Percent Error is symmetric around the origin.
But when we look at complementary pairs, we see the asymmetry from the heatmap again. When the actuals are low, the same error produces a much higher Percent Error than when the forecast is low.
All of this is because of the base we use for scaling. Even with the same magnitude of error, if the ground truth is low, the percent error will be high, and vice versa. For example, let’s review two cases:
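A quick reconstruction of such a pair of cases, with numbers of my own choosing: the absolute error is 2 in both, but the roles of actuals and forecast are exchanged.

```python
# Same absolute error of 2 in both cases:
case_over = abs(4 - 6) / 4 * 100   # actuals = 4, forecast = 6 (over-forecast)
case_under = abs(6 - 4) / 6 * 100  # actuals = 6, forecast = 4 (under-forecast)
print(case_over, case_under)  # 50.0 vs ~33.3 -> over-forecasting penalized more
```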
There are countless papers and blogs which claim the asymmetry of percent error to be a deal breaker. The popular claim is that absolute percent error penalizes overforecasting more than underforecasting, or in other words, it incentivizes underforecasting.
One argument against this point is that this asymmetry is only there because we change the ground truth. An error of 6 for a time series which has an expected value of 2 is much more serious than an error of 2 for a time series which has an expected value of 6. So according to that intuition, the percent error is doing what it is supposed to do, isn’t it?
Not exactly. On some levels, the criticism of percent error is rightly justified. Here we see that the forecast where we were under-forecasting has a consistently lower MAPE than the one where we were over-forecasting. The spread of the low MAPE is also considerably smaller than the others. But does that mean that the forecast which always predicts on the lower side is the better forecast as far as the business is concerned? Absolutely not. In a Supply Chain, that leads to stock-outs, which is not where you want to be if you want to stay competitive in the market.
Symmetric Error was proposed as a better alternative to Percent Error. There were two key disadvantages of Percent Error – being undefined when the Ground Truth is zero, and asymmetry. Symmetric Error proposed to solve both by using the average of ground truth and forecast as the base over which we calculate the percent error.
Right off the bat, we can see that this is symmetric around the diagonal, almost similar to Absolute Error. And the bottom bar, which was empty, now has colors (which means the values are no longer undefined). But a closer look reveals something more: it is not symmetric around the second diagonal. We see that the errors are higher when both actuals and forecast are low.
This is further evident in the Loss Curves. We can see the asymmetry as we increase errors on both sides of the origin. And contrary to the name, Symmetric error penalizes under forecasting more than over forecasting.
But when we look at complementary pairs, we can see it is perfectly symmetrical. This is probably because of the base, which we are keeping constant.
We can see the same here as well. The over-forecasting series has a consistently lower error as compared to the under-forecasting series. So in the effort to fix Percent Error’s bias towards under-forecasting, Symmetric Error shot the other way and is biased towards over-forecasting.
In addition to the above experiments, we also ran an experiment to check the impact of outliers (single predictions which are wildly off) on the aggregate metrics.
All four error measures behave similarly when it comes to outliers. The number of outliers has a much higher impact than the scale of outliers.
Among the four, RMSE takes the biggest impact from outliers. We can see the contour lines spaced far apart, showing that the rate of change is high when we introduce outliers. On the other end of the spectrum, we have sMAPE, which has the least impact from outliers, evident from the flat and closely spaced contour lines. MAE and MAPE behave almost similarly, with MAPE probably a tad better.
To close off, there is no one metric which satisfies all the desiderata of an error measure, and depending on the use case, we need to pick and choose. Out of the four intrinsic measures (and all their aggregations like MAPE, MAE, etc.), if we are not concerned with Interpretability and Scale Dependency, we should choose Absolute Error measures (that is also a general statement; there are concerns with Reliability for Absolute and Squared Error measures). And when we are looking for scale-independent measures, Percent Error is the best we have (even with all of its shortcomings). Extrinsic error measures like Scaled Error offer a much better alternative in such cases (maybe in another blog post I’ll cover those as well).
All the code to recreate the experiments is at my github repository:
https://github.com/manujosephv/forecast_metrics/tree/master
So, I present to you, the Battle of the Boosters.
I have chosen a few datasets for regression from Kaggle Datasets, mainly because they are easy to set up and run in Google Colab. Another reason is that I do not need to spend a lot of time on data preprocessing; instead I can pick one of the public kernels and get cracking. I’ll also share one kernel for EDA of the datasets we choose. All notebooks will also be shared and linked at the bottom of the blog.
Nothing fancy here. Just some basic data cleansing and scaling. Most of the code is from some random kernel. The only point to note is that the same preprocessing is applied to all the algorithms.
I have chosen cross validation to make sure the comparison between different algorithms is more generalized than specific to one particular split of the data. I have chosen a simple KFold with 5 folds for this exercise.
Evaluation Metric: Mean Squared Error
To have a standard evaluation for all the algorithms (thankfully all of them have the Scikit-Learn API), I defined a couple of functions.
Default Parameters: First fit the CV splits with default parameters of the model. We record the mean and standard deviation of the CV scores and then fit the entire train split to predict on the test split.
import math
import numpy as np
import sklearn.model_selection as ms
import sklearn.metrics as sklm

def eval_algo_sklearn(alg, x_train, y_train, x_test, y_test, cv):
    # cross_val_score returns negated MSEs ('neg_mean_squared_error'),
    # so flip the sign before taking the mean and the square root
    MSEs = ms.cross_val_score(alg, x_train, y_train, scoring='neg_mean_squared_error', cv=cv)
    meanMSE = np.mean(-MSEs)
    stdMSE = np.std(MSEs)
    alg = alg.fit(x_train, y_train)
    pred = alg.predict(x_test)
    rmse_train = math.sqrt(meanMSE)
    rmse_test = math.sqrt(sklm.mean_squared_error(y_test, pred))
    return rmse_train, stdMSE, rmse_test
With Hyperparameter Tuning: Very similar to the previous one, but with an additional step of GridSearch to find best parameters.
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV

def tune_eval_algo_sklearn(alg, param_grid, x_train, y_train, x_test, y_test, cv):
    grid = GridSearchCV(alg, param_grid=param_grid, cv=cv, scoring='neg_mean_squared_error', n_jobs=1)
    grid.fit(x_train, y_train)
    print(grid.best_estimator_)
    best_params = grid.best_estimator_.get_params()
    # refit a fresh clone with the best parameters on the full train split
    alg = clone(alg).set_params(**best_params)
    alg = alg.fit(x_train, y_train)
    pred = alg.predict(x_test)
    # the grid's mean_test_score is a negated MSE; flip the sign before sqrt
    rmse_train = math.sqrt(-grid.cv_results_['mean_test_score'][grid.best_index_])
    stdMSE = grid.cv_results_['std_test_score'][grid.best_index_]
    rmse_test = math.sqrt(sklm.mean_squared_error(y_test, pred))
    return rmse_train, stdMSE, rmse_test, alg
The hyperparameter tuning is in no way exhaustive, but it is fairly decent.
The grids over which we run GridSearch for the different algorithms are:
XGBoost:
{
"learning_rate": [0.01, 0.1, 0.5],
"n_estimators": [100, 250, 500],
"max_depth": [3, 5, 7],
"min_child_weight": [1, 3, 5],
"colsample_bytree": [0.5, 0.7, 1],
}
LightGBM:
{
"learning_rate": [0.01, 0.1, 0.5],
"n_estimators": [100, 250, 500],
"max_depth": [3, 5, 7],
"min_child_weight": [1, 3, 5],
"colsample_bytree": [0.5, 0.7, 1],
}
RGF:
{
"learning_rate": [0.01, 0.1, 0.5],
"max_leaf": [1000, 2500, 5000],
"algorithm": ["RGF", "RGF_Opt", "RGF_Sib"],
"l2": [1.0, 0.1, 0.01],
}
NGBoost:
Because NGBoost is kinda slow, instead of defining a standard grid for all experiments, I searched along each parameter independently, and decided on a grid based on the intuitions from that experiment.
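That coordinate-wise search can be sketched as below. The `search_one_param` helper is my own, not a library function, and to keep the sketch cheap to run I demonstrate it with Ridge; for the actual experiments you would pass an NGBoost regressor and its parameter names (e.g. `n_estimators`, `learning_rate`) instead:

```python
import sklearn.model_selection as ms
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

def search_one_param(estimator, name, values, X, y, cv):
    """Grid-search a single hyperparameter, holding all others at defaults."""
    grid = ms.GridSearchCV(estimator, {name: values}, cv=cv,
                           scoring='neg_mean_squared_error')
    grid.fit(X, y)
    return grid.best_params_[name], -grid.best_score_

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
cv = ms.KFold(n_splits=5, shuffle=True, random_state=0)
# One 1-D search per parameter, instead of a full cartesian grid
best_alpha, best_mse = search_one_param(Ridge(), 'alpha',
                                        [0.01, 0.1, 1.0, 10.0], X, y, cv)
```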
I’ve tabulated the mean and standard deviation of the RMSEs for the train CV splits for all three datasets. For ElectricMotors, I did not tune the model, as it was computationally expensive.
Disclaimer: These experiments are in no way complete. One would need a much larger-scale experiment to arrive at a conclusion on which algorithm does better. And then there is the No Free Lunch Theorem to keep in mind.
Right off the bat, NGBoost seems like a strong contender in this space. On the AutoMPG and Housing Prices datasets, NGBoost performs the best among all the boosters, both on mean RMSE and on the standard deviation of the CV scores, and by a large margin. NGBoost also shows quite a large gap between the default and tuned versions, which suggests either that the default parameters are not well set, or that tuning for each dataset is a key element in getting good performance from NGBoost. But the Achilles heel of the algorithm is the runtime. With those huge bars towering over the others, we can see that the runtime really is on a different scale compared to the other boosters. Especially on large datasets like the Electric Motors Temperature dataset, the runtime is prohibitively large, which is also why I didn’t tune the algorithm there. It fares last among the boosters in the competition.
Another standout algorithm is Regularized Greedy Forest, which performs as well as or even better than XGBoost. In the low- and medium-data settings, its runtime is also comparable to the reigning king, XGBoost.
In the low-data setting, popular algorithms like XGBoost and LightGBM do not perform well, and the standard deviation of their CV scores is higher – especially for XGBoost, showing that it overfits, a problem it has in all three examples. On runtime, LightGBM reigns king (although I haven’t tuned for computational performance), beating XGBoost in all three examples. In the high-data setting, it blew everything else out of the water, with a much lower RMSE and runtime than the rest of the competitors.
We have come far and wide in the world of gradient boosting, and I hope that at least for some of you, Gradient Boosting no longer just means XGBoost. There are so many algorithms, each with its own quirks, and a lot of them perform on par with or better than XGBoost. Another exciting area is probabilistic regression. I hope NGBoost becomes more efficient and steps over that hurdle of computational cost; once that happens, it will be a very strong candidate in the probabilistic regression space.
If you’ve not read the previous parts of the series, I strongly advise you to read them – at least the first one, where I talk about the Gradient Boosting algorithm, because I am going to take it as a given that you already know what Gradient Boosting is. I would also strongly suggest reading Part VI(A), so that you have a better understanding of what Natural Gradients are.
The key innovation in NGBoost is the use of Natural Gradients instead of regular gradients in the boosting algorithm. And by adopting this probabilistic route, it models a full probability distribution over the outcome space, conditioned on the covariates.
The paper modularizes the approach into three components –
As in any boosting technique, there are base learners which are combined to form the complete model. NGBoost doesn’t make any assumptions here; the base learners can be any simple model. The implementation supports Decision Trees and Ridge Regression as base learners out of the box, but you can replace them with any other scikit-learn style model just as easily.
Here, we are not training a model to predict the outcome as a point estimate; instead we are predicting a full probability distribution. Every distribution is parametrized by a few parameters. For example, the normal distribution is parametrized by its mean and standard deviation – you don’t need anything else to define it. So, if we train the model to predict these parameters instead of the point estimate, we will have a full probability distribution as the prediction.
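To make that concrete: once a model emits a mean and a standard deviation, you have a full distribution and can read off whatever quantile your decision needs (a sketch with made-up numbers echoing the Pixel example from the introduction):

```python
from scipy import stats

mu, sigma = 100.0, 15.0            # hypothetical predicted parameters
dist = stats.norm(loc=mu, scale=sigma)

# Stock to the 95th percentile to hedge against a stock-out
order_qty = dist.ppf(0.95)         # roughly 125 units
prob_stockout = 1 - dist.cdf(120)  # chance demand exceeds 120 units
```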
Any machine learning system works towards a learning objective, and more often than not, that is the task of minimizing some loss. In point prediction, the predictions are compared with the data using a loss function. The scoring rule is the analogue from the probabilistic regression world: it compares the estimated probability distribution with the observed data.
A proper scoring rule S takes as input a forecasted probability distribution P and one observation y (the outcome), and assigns a score S(P, y) to the forecast, such that the true distribution of the outcomes gets the best score in expectation.
The most commonly used proper scoring rule is the logarithmic score L(θ, y) = −log P_θ(y), which is nothing but the negative log likelihood that we have seen in so many places; minimizing it gives us the MLE. The scoring rule is parametrized by θ, because that is what we are predicting as part of the machine learning model.
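In code, the logarithmic score of a Gaussian forecast is just the negative log likelihood, and in μ it is minimized at the sample mean – the MLE (a small sketch; `log_score` is my own helper):

```python
import numpy as np
from scipy import stats

y = np.array([2.1, 1.9, 2.4, 2.0, 1.8])

def log_score(mu, sigma, y):
    # negative log likelihood of N(mu, sigma) over the observations
    return -stats.norm(mu, sigma).logpdf(y).sum()

best = log_score(y.mean(), y.std(), y)          # score at the MLE
worse = log_score(y.mean() + 1.0, y.std(), y)   # any other mu scores worse
```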
Another example is the CRPS (Continuous Ranked Probability Score). While the logarithmic score generalizes the Mean Squared Error to the probabilistic space, the CRPS does the same for the Mean Absolute Error.
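For a Gaussian forecast, the CRPS has a well-known closed form, which makes it cheap to compute; `crps_gaussian` here is my own helper, not a library function:

```python
import math
from scipy import stats

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma) against outcome y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z)
                    - 1 / math.sqrt(math.pi))

# A sharper forecast centered on the truth gets a better (lower) score
loose = crps_gaussian(0.0, 1.0, 0.0)
sharp = crps_gaussian(0.0, 0.5, 0.0)
```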
In the last part of the series, we saw what the Natural Gradient was. In that discussion we talked about KL divergences, because traditionally, Natural Gradients were defined for the MLE scoring rule. The paper proposes a generalization of the concept, extending it to the CRPS scoring rule as well: it generalizes the KL divergence to a generic divergence and provides the corresponding derivations for CRPS.
Now that we have seen the major components, let’s take a look at how all of this works together. NGBoost is a supervised learning method for probabilistic prediction that uses boosting to estimate the parameters of a conditional probability distribution P_θ(y|x). As we saw earlier, we need to choose the three modular components upfront: the base learner, the probability distribution, and the proper scoring rule.
A prediction on a new input x is made in the form of a conditional distribution P_θ(y|x), whose parameters θ are obtained by an additive combination of M base learner outputs and an initial θ(0). Let’s denote the combined output of the base learners at stage m, across all parameters, by f(m). There will be a separate set of base learners for each parameter of the chosen probability distribution – for example, for the normal distribution there will be one set for μ and one for log σ.
The predicted outputs are also scaled with a stage-specific scaling factor ρ(m) and a common learning rate η:

θ = θ(0) − η · Σ_{m=1..M} ρ(m) · f(m)(x)
One thing to note here is that even if you have n parameters for your probability distribution, ρ(m) is still a single scalar per stage.
Let us consider a dataset D = {(xᵢ, yᵢ)}, i = 1..n, boosting iterations M, learning rate η, a probability distribution with parameters θ, a proper scoring rule S, and a base learner f.
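Putting those ingredients together, here is a bare-bones sketch of the fitting loop for a Normal distribution under the log score. It is a simplification, not the paper's algorithm verbatim: the scaling factor ρ(m) is fixed at 1 instead of being found by line search, and the base learners are plain depth-3 trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=500)

# theta = (mu, log_sigma) per example, initialised from the marginal fit
theta = np.tile([y.mean(), np.log(y.std())], (len(y), 1))
lr, M = 0.1, 100

def nll(th):  # mean log score (negative log likelihood, up to a constant)
    mu, log_s = th[:, 0], th[:, 1]
    return np.mean(log_s + (y - mu) ** 2 / (2 * np.exp(2 * log_s)))

nll_start = nll(theta)
for m in range(M):
    mu, sigma = theta[:, 0], np.exp(theta[:, 1])
    # natural gradients of the log score for N(mu, sigma): the ordinary
    # gradients pre-multiplied by the inverse Fisher information
    nat_grads = [mu - y, (1 - ((y - mu) / sigma) ** 2) / 2]
    for j, g in enumerate(nat_grads):
        tree = DecisionTreeRegressor(max_depth=3).fit(X, g)  # one learner per parameter
        theta[:, j] -= lr * tree.predict(X)  # rho(m) fixed at 1 in this sketch
nll_end = nll(theta)
```

Even this stripped-down loop drives the log score down on the training data, and it makes the structure visible: one set of base learners per distribution parameter, all trained on (natural) gradients of the scoring rule.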
The algorithm has a ready-to-use Scikit-Learn style implementation at https://github.com/stanfordmlgroup/ngboost. Let’s take a look at the key parameters to tune in the model.
Although importances from machine learning models should be used with a considerable amount of caution, NGBoost also offers feature importances – a separate set for each parameter it estimates.
But the best part is that SHAP is also readily available for the model; you just need to use TreeExplainer to get the values. (To know more about SHAP and other interpretability techniques, check out my other blog series – Interpretability: Cracking open the black box.)
The paper also looks at how the algorithm performs compared to other popular algorithms. There were two separate types of evaluation – probabilistic and point estimation.
On a variety of datasets from the UCI Machine Learning Repository, NGBoost was compared with other major probabilistic regression algorithms, like Monte Carlo Dropout, Deep Ensembles, Concrete Dropout, Gaussian Processes, Generalized Additive Models for Location, Scale and Shape (GAMLSS), and Distributional Forests.
They also evaluated the algorithm on the point estimation use case against other regression algorithms like Elastic Net, Random Forest (Scikit-Learn), and Gradient Boosting (Scikit-Learn).
NGBoost performs as well as or better than the existing algorithms, with the additional advantage of providing a probabilistic forecast. The formulation and implementation are also flexible and modular enough to make it easy to use.
But one drawback is performance: training time increases linearly with each additional distribution parameter to estimate. And all the efficiency hacks that have made their way into popular gradient boosting packages like LightGBM or XGBoost are not present in the current implementation. Maybe they will be ported over soon enough – the repo is under active development, and I see this as one of the action items they are targeting. But until that happens, the algorithm is quite slow, especially with large data. One way out is to use the minibatch_frac parameter to make the calculation of natural gradients faster.
Now that we have seen all the major Gradient Boosters, let’s pick a sample dataset and see how they perform in the next part of the series.