Forecast Error Measures: Understanding them through experiments

Measurement is the first step that leads to control and eventually improvement.
H. James Harrington

In many business applications, the ability to plan ahead is paramount, and in most such scenarios we use forecasts to help us plan. For example, if I run a retail store, how many boxes of that shampoo should I order today? Look at the forecast. Will I achieve my financial targets by the end of the year? Let’s forecast and make adjustments if necessary. If I run a bike rental firm, how many bikes do I need to keep at a metro station tomorrow at 4pm?

If we are taking actions based on the forecast in all of these scenarios, we should also have an idea of how good those forecasts are. In classical statistics or machine learning, we have a few general loss functions, like the squared error or the absolute error. But because of the way Time Series Forecasting has evolved, there are a lot more ways to assess forecast performance.

In this blog post, let’s explore the different Forecast Error measures through experiments and understand the drawbacks and advantages of each of them.

Metrics in Time Series Forecasting

There are a few key points that make the metrics in Time Series Forecasting stand out from the regular metrics in Machine Learning.

1. Temporal Relevance

As the name suggests, Time Series Forecasting has the temporal aspect built into it, and there are metrics like Cumulative Forecast Error or Forecast Bias which take this temporal aspect into account as well.

2. Aggregate Metrics

In most business use-cases, we would not be forecasting a single time series, rather a set of time series, related or unrelated. And the higher management would not want to look at each of these time series individually, but rather an aggregate measure which tells them directionally how well we are doing the forecasting job. Even for practitioners, this aggregate measure helps them to get an overall sense of the progress they make in modelling.

3. Over or Under Forecasting

Another key aspect in forecasting is the concept of over- and under-forecasting. We would not want the forecasting model to have structural biases that make it always over- or under-forecast. To combat these, we want metrics which don’t favor either over-forecasting or under-forecasting.

4. Interpretability

The final aspect is interpretability. Because these metrics are also used by non-analytics business functions, they need to be interpretable.

Because of these different use cases, there are a lot of metrics that are used in this space, and here we try to unify them under some structure and also critically examine them.

Taxonomy of Forecast Metrics

We can broadly classify the different forecast metrics into two buckets – Intrinsic and Extrinsic. Intrinsic measures use just the generated forecast and the ground truth to compute the metric. Extrinsic measures use an external reference forecast in addition to the generated forecast and the ground truth.

Let’s stick with the intrinsic measures for now (extrinsic ones require a whole different take on these metrics). There are four major ways in which we calculate errors – Absolute Error, Squared Error, Percent Error and Symmetric Error. All the metrics that come under these are just different aggregations of these fundamental errors. So, without loss of generality, we can discuss these broad groups, and the discussion applies to all the metrics under these heads as well.

Absolute Error

This group of error measures uses the absolute value of the error as its foundation.

$e_t = |y_t - f_t|, \text{where } y_t \text{ is the ground truth at time t \& } f_t \text{ is the forecast for time t}$

Squared Error

Instead of taking the absolute value, we square the errors to make them positive, and this is the foundation for these metrics.

$e_t = (y_t - f_t)^2, \text{where } y_t \text{ is the ground truth at time t \& } f_t \text{ is the forecast for time t}$

Percent Error

In this group of error measures, we scale the absolute error by the ground truth to convert it into a percentage term.

$e_t = \frac{|y_t - f_t|}{y_t}, \text{where } y_t \text{ is the ground truth at time t \& } f_t \text{ is the forecast for time t}$

Symmetric Error

Symmetric Error was proposed as an alternative to Percent Error, where we take the average of forecast and ground truth as the base on which to scale the absolute error.

$e_t = 2 \times \frac{|y_t - f_t|}{y_t+f_t}, \text{where } y_t \text{ is the ground truth at time t \& } f_t \text{ is the forecast for time t}$
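To make the four definitions concrete, here is a minimal sketch that computes each base error for a single point. The function names and the example values (a ground truth of 100 and a forecast of 110) are illustrative assumptions, not taken from any library.

```python
# Base errors for a single time step: y is the ground truth, f is the forecast.
# Function names and example values are hypothetical, chosen for illustration.

def absolute_error(y, f):
    return abs(y - f)

def squared_error(y, f):
    return (y - f) ** 2

def percent_error(y, f):
    # Undefined when the ground truth is zero.
    return abs(y - f) / y

def symmetric_error(y, f):
    # Undefined when ground truth and forecast sum to zero.
    return 2 * abs(y - f) / (y + f)

y_t, f_t = 100.0, 110.0
print(absolute_error(y_t, f_t))    # 10.0
print(squared_error(y_t, f_t))     # 100.0
print(percent_error(y_t, f_t))     # 0.1
print(symmetric_error(y_t, f_t))   # 20 / 210, roughly 0.095
```

Averaging these point-wise errors over time gives the familiar aggregate forms (for instance, the mean of the absolute errors is the MAE and the mean of the percent errors is the MAPE).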

Experiments

Instead of just saying that these are the drawbacks and advantages of such and such metrics, let’s design a few experiments and see for ourselves what those advantages and disadvantages are.

Scale Dependency

In this experiment, we try to figure out the impact of the scale of the time series on aggregated measures. For this experiment, we:

Generate 10000 synthetic time series at different scales, but with the same error.
Split these series into 10 histogram bins.
Set the sample size to 5000 and iterate over each bin:
1. Sample 50% from the current bin and the rest, equally distributed, from the other bins.
2. Calculate the aggregate measures on this set of time series.
3. Record them against the bin’s lower edge.
Plot the aggregate measures against the bin edges.
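The steps above can be sketched roughly as follows. This is only one possible reading of the setup: it assumes "same error" means additive noise drawn from the same distribution, uses constant-level series of length 10, samples with replacement, and aggregates by averaging per-series MAE and MAPE. All of these choices, and every name in the code, are assumptions for illustration; a relative (multiplicative) error would be an equally valid reading and would behave differently.

```python
import random

random.seed(0)
N_SERIES, LENGTH, N_BINS, SAMPLE_SIZE = 10_000, 10, 10, 5_000

# Each synthetic series is a constant level (its scale); the forecast is the
# level plus noise from the same distribution for every series (assumption).
series = []
for _ in range(N_SERIES):
    level = random.uniform(10, 1000)
    errors = [random.gauss(0, 1) for _ in range(LENGTH)]
    mae = sum(abs(e) for e in errors) / LENGTH
    mape = sum(abs(e) / level for e in errors) / LENGTH
    series.append((level, mae, mape))

# Split the series into equal-width histogram bins by scale.
lo = min(s[0] for s in series)
hi = max(s[0] for s in series)
width = (hi - lo) / N_BINS
bins = [[] for _ in range(N_BINS)]
for s in series:
    idx = min(int((s[0] - lo) / width), N_BINS - 1)
    bins[idx].append(s)

# For each bin: 50% of the sample from that bin, the rest spread equally over
# the other bins (integer division leaves the sample slightly under 5000).
per_other = (SAMPLE_SIZE // 2) // (N_BINS - 1)
results = []
for i, current in enumerate(bins):
    sample = random.choices(current, k=SAMPLE_SIZE // 2)
    for j, other in enumerate(bins):
        if j != i:
            sample += random.choices(other, k=per_other)
    agg_mae = sum(s[1] for s in sample) / len(sample)
    agg_mape = sum(s[2] for s in sample) / len(sample)
    results.append((lo + i * width, agg_mae, agg_mape))

# Print instead of plotting: aggregate measures against the bin lower edge.
for edge, agg_mae, agg_mape in results:
    print(f"{edge:8.1f}  MAE={agg_mae:.3f}  MAPE={agg_mape:.5f}")
```

A metric that is independent of scale should give a roughly flat line across the bin edges; any trend is a sign of scale dependency under the chosen error model.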

Symmetricity

The error measure should be symmetric in its inputs, i.e. Forecast and Ground Truth. If we interchange the forecast and actuals, ideally the error metric should return the same value.

To test this, let’s make a grid of 0 to 10 for both actuals and forecast and calculate the error metrics on that grid.
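A minimal sketch of this check, assuming an integer grid and treating points where a metric is undefined (a zero denominator) as NaN and skipping them. The helper names are hypothetical.

```python
import math

def absolute_error(y, f):
    return abs(y - f)

def squared_error(y, f):
    return (y - f) ** 2

def percent_error(y, f):
    return abs(y - f) / y if y != 0 else math.nan

def symmetric_error(y, f):
    return 2 * abs(y - f) / (y + f) if (y + f) != 0 else math.nan

grid = range(0, 11)  # 0 to 10 for both actuals and forecast

def is_symmetric(metric):
    """True if swapping actual and forecast never changes the metric."""
    for y in grid:
        for f in grid:
            a, b = metric(y, f), metric(f, y)
            if math.isnan(a) or math.isnan(b):
                continue  # skip points where the metric is undefined
            if not math.isclose(a, b):
                return False
    return True

metrics = {
    "Absolute Error": absolute_error,
    "Squared Error": squared_error,
    "Percent Error": percent_error,
    "Symmetric Error": symmetric_error,
}
symmetry = {name: is_symmetric(fn) for name, fn in metrics.items()}
print(symmetry)
# {'Absolute Error': True, 'Squared Error': True,
#  'Percent Error': False, 'Symmetric Error': True}
```

Percent Error fails the check because the denominator changes when the inputs are swapped: an actual of 1 with a forecast of 2 gives 1.0, while an actual of 2 with a forecast of 1 gives 0.5.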

Complementary Pairs

In this experiment, we take complementary pairs of ground truths and forecasts which add up to a constant quantity and measure the performance at each point. Specifically, we use the same setup as in the Symmetricity experiment, and calculate the metrics at the points along the cross diagonal, where ground truth + forecast always adds up to 10.
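As a small sketch of the cross diagonal, assuming the same integer grid of 0 to 10, we can walk over the pairs that sum to 10 and evaluate each base error. Undefined values (ground truth of zero for Percent Error) are reported as NaN; the names are illustrative.

```python
import math

TOTAL = 10  # ground truth + forecast along the cross diagonal

rows = []
for y in range(0, TOTAL + 1):
    f = TOTAL - y
    abs_err = abs(y - f)
    sq_err = (y - f) ** 2
    pct_err = abs_err / y if y != 0 else math.nan  # undefined at y = 0
    sym_err = 2 * abs_err / (y + f)                # denominator is always 10
    rows.append((y, f, abs_err, sq_err, pct_err, sym_err))

print(" y   f  abs   sq     pct   sym")
for y, f, abs_err, sq_err, pct_err, sym_err in rows:
    print(f"{y:2d}  {f:2d}  {abs_err:3d}  {sq_err:3d}  {pct_err:6.2f}  {sym_err:4.2f}")
```

Comparing a pair with its mirror image, for example (2, 8) against (8, 2), shows which metrics treat the two the same way: the absolute error is 6 in both cases, while the percent error is 3.0 for the first and 0.75 for the second.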

Loss Curves

Our metrics depend on two entities – forecast and ground truth. If we fix one and vary the other using a symmetric range of errors (for example, -10 to 10), then we expect the metric to behave the same way on both sides of that range. In our experiment, we chose to fix the Ground Truth because, in reality, that is the fixed quantity, and we are measuring the forecast against the ground truth.
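Here is a minimal sketch of the idea, assuming a fixed ground truth of 10 (an arbitrary illustrative value) and integer errors from -10 to 10. It checks whether each base error gives the same value for an error of +e and -e.

```python
import math

Y = 10  # fixed ground truth (hypothetical value for illustration)
errors = range(-10, 11)

metrics = {
    "Absolute Error": lambda y, f: abs(y - f),
    "Squared Error": lambda y, f: (y - f) ** 2,
    "Percent Error": lambda y, f: abs(y - f) / y,
    "Symmetric Error": lambda y, f: 2 * abs(y - f) / (y + f),
}

# Loss curve per metric: error e -> metric value at forecast = Y + e.
curves = {
    name: {e: fn(Y, Y + e) for e in errors} for name, fn in metrics.items()
}

# A curve is mirrored if over- and under-forecasting by e cost the same.
mirrored = {
    name: all(math.isclose(c[e], c[-e]) for e in range(1, 11))
    for name, c in curves.items()
}
print(mirrored)
# {'Absolute Error': True, 'Squared Error': True,
#  'Percent Error': True, 'Symmetric Error': False}

print(curves["Symmetric Error"][-5], curves["Symmetric Error"][5])
# under-forecast by 5: 10/15, about 0.667; over-forecast by 5: 10/25 = 0.4
```

With the ground truth fixed, the Symmetric Error is the only one of the four whose denominator moves with the forecast, which is why its curve is not mirrored around zero.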

Over & Under Forecasting Experiment

In this experiment we generate 4 random time series – ground truth, baseline forecast, low forecast and high forecast. These are just random numbers generated within a range. Ground Truth and Baseline Forecast are random numbers generated between 2 and 4. Low Forecast is a random number generated between 0 and 3, and High Forecast is a random number generated between 3 and 6. In this setup, the Baseline Forecast acts as a baseline for us, Low Forecast is a forecast where we continuously under-forecast, and High Forecast is a forecast where we continuously over-forecast. Now let’s calculate the MAPE for these three forecasts and repeat the experiment 1000 times.
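The experiment can be sketched along these lines. The series length of 100 and the use of uniform random numbers are assumptions (the setup above only fixes the ranges and the 1000 repetitions), and the helper names are hypothetical.

```python
import random

random.seed(42)
N_RUNS, LENGTH = 1000, 100  # LENGTH is an assumed series length

def mape(actual, forecast):
    """Mean absolute percent error over one series."""
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

def uniform_series(low, high):
    return [random.uniform(low, high) for _ in range(LENGTH)]

runs = {"baseline": [], "low": [], "high": []}
for _ in range(N_RUNS):
    truth = uniform_series(2, 4)
    runs["baseline"].append(mape(truth, uniform_series(2, 4)))
    runs["low"].append(mape(truth, uniform_series(0, 3)))   # under-forecast
    runs["high"].append(mape(truth, uniform_series(3, 6)))  # over-forecast

mean_mape = {name: sum(vals) / len(vals) for name, vals in runs.items()}
for name, value in mean_mape.items():
    print(f"{name:9s} mean MAPE over {N_RUNS} runs: {value:.3f}")
```

Note that the low and high forecasts are both centred 1.5 units away from the centre of the ground truth range, so any gap between their MAPE values comes from how the metric treats the direction of the miss rather than its size.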

Outlier Impact

To check the impact of outliers, we set up the experiment below.

We want to check the relative impact of outliers on two axes – number of outliers and scale of outliers. So we define a grid – number of outliers [0%-40%] and scale of outliers [0 to 2]. Then we pick a synthetic time series at random, iteratively introduce outliers according to the parameters of the grid we defined earlier, and record the error measures.
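A rough sketch of such a grid follows. How the outliers are injected is not specified above, so this example makes an explicit assumption: an outlier multiplies the ground truth at that time step by (1 + scale), so a scale of 0 leaves the series untouched. The synthetic series, the grid steps and all names are likewise illustrative.

```python
import random

random.seed(0)
N = 200
y = [random.uniform(8, 12) for _ in range(N)]   # synthetic ground truth
f = [v + random.gauss(0, 0.5) for v in y]       # a reasonably good forecast

def measures(actual, forecast):
    abs_err = [abs(a - p) for a, p in zip(actual, forecast)]
    n = len(abs_err)
    return {
        "MAE": sum(abs_err) / n,
        "MSE": sum(e * e for e in abs_err) / n,
        "MAPE": sum(e / a for e, a in zip(abs_err, actual)) / n,
        "sMAPE": sum(
            2 * e / (a + p) for e, a, p in zip(abs_err, actual, forecast)
        ) / n,
    }

fractions = [0.0, 0.1, 0.2, 0.3, 0.4]   # share of points turned into outliers
scales = [0.0, 0.5, 1.0, 1.5, 2.0]      # outlier scale
order = random.sample(range(N), N)      # fixed order keeps outlier sets nested

results = {}
for frac in fractions:
    outlier_idx = set(order[: int(frac * N)])
    for scale in scales:
        # Assumed injection rule: multiply the actual value by (1 + scale).
        y_out = [
            v * (1 + scale) if i in outlier_idx else v for i, v in enumerate(y)
        ]
        results[(frac, scale)] = measures(y_out, f)

base = results[(0.0, 0.0)]
for (frac, scale), m in sorted(results.items()):
    ratios = "  ".join(f"{k}x{m[k] / base[k]:7.1f}" for k in m)
    print(f"frac={frac:.1f} scale={scale:.1f}  {ratios}")
```

Reporting each measure as a ratio to its outlier-free value puts the four metrics on a comparable footing, which makes it easier to see which of them reacts most strongly along each axis of the grid.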