DeepAR Forecasting Algorithm
To this day, forecasting remains one of the most valuable applications of machine learning. For instance, we could use a model to predict the demand for a product and then use that prediction to optimize the supply chain, keeping just enough stock on hand to meet demand and freeing up capital for other purposes.
Suppose we were working on an e-commerce site that sells a plethora of products. Traditionally, we would end up building a separate model for every individual time series. There are two major problems with this approach:
- Even with automation, maintaining separate models for thousands of time series adds significant complexity and consumes a lot of engineering hours.
- We don’t capture the relationships between the different time series. For example, the demand for one product (e.g. hotdogs) would affect the demand for its substitute(s) (e.g. hamburgers).
This is where DeepAR enters the picture. DeepAR is an LSTM-based recurrent neural network that is trained on the historical data of ALL time series in the dataset. By training on multiple time series simultaneously, the DeepAR model learns the complex, group-dependent behavior shared across the time series, which often leads to better performance than standard ARIMA and ETS methods.
There are a few issues that can arise when using a single model for multiple time series, the most notable of which are:
- The time units across different time series can differ (e.g. hours, days, months).
- The starting point t = 1 can refer to a different absolute point in time for each time series. For example, the time series of product a could have started in March 1996 whereas the time series of product b could have started in February 1998 (when it was first released).
There are also problems inherent to modeling time series in general, specifically:
- The presence of missing values. In the context of demand forecasting, an item may be out of stock at a certain time, in which case its demand cannot be observed. Not explicitly accounting for such missing observations (e.g. by assuming that the observed sales correspond to the demand even when an item is out of stock) can, in the best case, lead to a systematic under-bias in the forecasts.
- The presence of anomalies/outliers. The quality of the forecasts depends on how frequently such anomalies appear in the training data; the approach works best when known anomalies are removed or masked.
Fortunately for us, the PyTorch Forecasting library provides a TimeSeriesDataSet class which takes care of the preceding issues (e.g. missing values, varying history lengths, outliers) for us under the hood. We install the library with pip:
pip install pytorch-forecasting
To begin, open a Jupyter Notebook and import the following libraries:
import pandas as pd
import torch
import pytorch_lightning as pl
import matplotlib.pyplot as plt
from pytorch_forecasting import Baseline, DeepAR, TimeSeriesDataSet
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_forecasting.metrics import SMAPE, MultivariateNormalDistributionLoss
We download and load a dataset from Kaggle containing item-level sales data at different store locations.
train_df = pd.read_csv('demand-forecasting-kernels-only/train.csv')
The dataset has 4 columns: the date, the store number, the item number and the sales volume.
train_df.head()
The library expects the target to be of type float, so we cast the sales column accordingly. Additionally, the store and item columns should be interpreted as categorical variables (i.e. strings) rather than integers, and the date column should be parsed as a proper datetime.
train_df['date'] = pd.to_datetime(train_df['date'], errors='coerce')
train_df[['store', 'item']] = train_df[['store', 'item']].astype(str)
train_df['sales'] = pd.to_numeric(train_df['sales'], downcast='float')
We need to add a time index that is incremented by one for each additional time step. We first compute the minimum date, then use it to obtain the relative number of days that have elapsed since the beginning of the dataset.
min_date = train_df['date'].min()
train_df["time_idx"] = train_df["date"].map(lambda current_date: (current_date - min_date).days)
We will use the data from the past 60 days in order to make a prediction about the next 20 days.
max_encoder_length = 60 # days
max_prediction_length = 20 # 20 days
training_cutoff = train_df["time_idx"].max() - max_prediction_length
We create an instance of the TimeSeriesDataSet class.
training = TimeSeriesDataSet(
train_df[lambda x: x.time_idx <= training_cutoff],
time_idx="time_idx",
target="sales",
group_ids=["store", "item"], # list of column names identifying a time series.
max_encoder_length=max_encoder_length,
max_prediction_length=max_prediction_length,
static_categoricals=["store", "item"], # categorical variables that do not change over time (e.g. product categories)
time_varying_unknown_reals=[
"sales"
],
)
We set aside a portion of the data for validation.
validation = TimeSeriesDataSet.from_dataset(training, train_df, min_prediction_idx=training_cutoff + 1)
While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce overfitting, and use Python’s multiprocessing to speed up data retrieval. The DataLoader abstracts this complexity for us behind an easy API.
batch_size = 128
train_dataloader = training.to_dataloader(
train=True, batch_size=batch_size, num_workers=0
)
val_dataloader = validation.to_dataloader(
train=False, batch_size=batch_size, num_workers=0
)
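Before training, it helps to have a point of reference. The Baseline model we imported simply repeats the last observed value over the prediction horizon; a minimal sketch of evaluating it on the validation dataloader could look like this:
# collect the actual target values from the validation dataloader
actuals = torch.cat([y[0] for x, y in iter(val_dataloader)])
# the Baseline model repeats the last observed value for every future step
baseline_predictions = Baseline().predict(val_dataloader)
# symmetric mean absolute percentage error of the naive baseline
print(SMAPE()(baseline_predictions, actuals))
This gives us a score that the trained DeepAR model should comfortably beat.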
We set the random seed to ensure the results are reproducible.
pl.seed_everything(42)
We create an instance of the Trainer class and tell it to use the GPU at index 0.
trainer = pl.Trainer(
gpus=[0],
gradient_clip_val=0.1,
)
We create a DeepAR network with 2 RNN layers and a hidden size of 30. We use a multivariate normal distribution loss (with a low-rank approximation of rank 30) so that the model can learn correlations between the different time series.
net = DeepAR.from_dataset(
training,
learning_rate=3e-2,
hidden_size=30,
rnn_layers=2,
loss=MultivariateNormalDistributionLoss(rank=30)
)
We then obtain the recommended learning rate.
res = trainer.tuner.lr_find(
net,
train_dataloaders=train_dataloader,
val_dataloaders=val_dataloader,
min_lr=1e-5,
max_lr=1e0,
early_stop_threshold=100,
)
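Assuming the standard Lightning learning-rate finder API, the suggested value can be read off the result object:
# print the learning rate suggested by the finder
print(f"suggested learning rate: {res.suggestion()}")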
suggested learning rate: 0.7079457843841377
The resulting object also has a plot function that can be used to visualize the loss at different learning rates, along with the suggested value.
fig = res.plot(show=True, suggest=True)
fig.show()
While training the model, we want to stop early if the validation loss has not improved by at least 0.0001 for 10 consecutive validation checks.
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
We define another instance of the Trainer class.
trainer = pl.Trainer(
max_epochs=30,
gpus=[0],
enable_model_summary=True,
gradient_clip_val=0.1,
callbacks=[early_stop_callback],
limit_train_batches=50,
enable_checkpointing=True,
)
We define another DeepAR network, this time specifying a learning rate close to the one suggested earlier.
net = DeepAR.from_dataset(
training,
learning_rate=0.7,
log_interval=10,
log_val_interval=1,
hidden_size=30,
rnn_layers=2,
loss=MultivariateNormalDistributionLoss(rank=30),
)
Finally, we train the model.
trainer.fit(
net,
train_dataloaders=train_dataloader,
val_dataloaders=val_dataloader,
)
We load the model with the best performance from the checkpoint.
best_model_path = trainer.checkpoint_callback.best_model_path
best_model = DeepAR.load_from_checkpoint(best_model_path)
We use the model to predict on the validation dataset and plot the results for 2 of the time series.
# predict raw distribution parameters and return the network inputs (x) alongside the predictions
raw_predictions, x = best_model.predict(val_dataloader, mode="raw", return_x=True)
for idx in range(2):
best_model.plot_prediction(
x,
raw_predictions,
idx=idx,
add_loss_to_title=SMAPE()
)
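To put a number on the improvement over the naive baseline, we can also score the model's point predictions against the actual values in the validation set; a rough sketch, mirroring the baseline comparison above:
# actual target values from the validation dataloader
actuals = torch.cat([y[0] for x, y in iter(val_dataloader)])
# point predictions from the trained DeepAR model
predictions = best_model.predict(val_dataloader)
# symmetric mean absolute percentage error of the trained model
print(SMAPE()(predictions, actuals))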