"router/client/src/v3/sharded_client.rs" did not exist on "fec0167a123eddc60891320dd263e974671ad1c9"
Commit e3f7f7b3 authored by chenzk's avatar chenzk
Browse files

v1.0

parents
Pipeline #956 failed with stages
in 0 seconds
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "30ebf8f9",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "43832dcf-0452-4503-aace-8f38e5e23723",
"metadata": {},
"source": [
"# 🧠 NeuralForecast\n",
"\n",
"> **NeuralForecast** offers a large collection of neural forecasting models focused on their usability, and robustness. The models range from classic networks like `MLP`, `RNN`s to novel proven contributions like `NBEATS`, `NHITS`, `TFT` and other architectures."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9bdd2aa5",
"metadata": {},
"source": [
"## 🎊 Features\n",
"\n",
"* **Exogenous Variables**: Static, historic and future exogenous support.\n",
"* **Forecast Interpretability**: Plot trend, seasonality and exogenous `NBEATS`, `NHITS`, `TFT`, `ESRNN` prediction components.\n",
"* **Probabilistic Forecasting**: Simple model adapters for quantile losses and parametric distributions.\n",
"* **Train and Evaluation Losses** Scale-dependent, percentage and scale independent errors, and parametric likelihoods.\n",
"* **Automatic Model Selection** Parallelized automatic hyperparameter tuning, that efficiently searches best validation configuration.\n",
"* **Simple Interface** Unified SKLearn Interface for `StatsForecast` and `MLForecast` compatibility.\n",
"* **Model Collection**: Out of the box implementation of `MLP`, `LSTM`, `RNN`, `TCN`, `DilatedRNN`, `NBEATS`, `NHITS`, `ESRNN`, `Informer`, `TFT`, `PatchTST`, `VanillaTransformer`, `StemGNN` and `HINT`. See the entire [collection here](https://nixtla.github.io/neuralforecast/models.html)."
]
},
{
"cell_type": "markdown",
"id": "5cebf377",
"metadata": {},
"source": [
"## Why?\n",
"\n",
"There is a shared belief in Neural forecasting methods' capacity to improve our pipeline's accuracy and efficiency.\n",
"\n",
"Unfortunately, available implementations and published research are yet to realize neural networks' potential. They are hard to use and continuously fail to improve over statistical methods while being computationally prohibitive. For this reason, we created `NeuralForecast`, a library favoring proven accurate and efficient models focusing on their usability."
]
},
{
"cell_type": "markdown",
"id": "50d28044",
"metadata": {},
"source": [
"## 💻 Installation\n",
"\n",
"\n",
"### PyPI\n",
"\n",
"You can install `NeuralForecast`'s *released version* from the Python package index [pip](https://pypi.org/project/neuralforecast/) with:\n",
"\n",
"```python\n",
"pip install neuralforecast\n",
"```\n",
"\n",
"(Installing inside a python virtualenvironment or a conda environment is recommended.)\n",
"\n",
"\n",
"### Conda\n",
"\n",
"Also you can install `NeuralForecast`'s *released version* from [conda](https://anaconda.org/conda-forge/neuralforecast) with:\n",
"\n",
"```python\n",
"conda install -c conda-forge neuralforecast\n",
"```\n",
"\n",
"(Installing inside a python virtualenvironment or a conda environment is recommended.)\n",
"\n",
"### Dev Mode\n",
"If you want to make some modifications to the code and see the effects in real time (without reinstalling), follow the steps below:\n",
"\n",
"```bash\n",
"git clone https://github.com/Nixtla/neuralforecast.git\n",
"cd neuralforecast\n",
"pip install -e .\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "d3aeb537",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "401a785d",
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"import numpy as np\n",
"import pandas as pd\n",
"from IPython.display import display, Markdown\n",
"\n",
"import matplotlib.pyplot as plt\n",
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.models import NBEATS, NHITS\n",
"from neuralforecast.utils import AirPassengersDF\n",
"\n",
"# Split data and declare panel dataset\n",
"Y_df = AirPassengersDF\n",
"Y_train_df = Y_df[Y_df.ds<='1959-12-31'] # 132 train\n",
"Y_test_df = Y_df[Y_df.ds>'1959-12-31'] # 12 test\n",
"\n",
"# Fit and predict with NBEATS and NHITS models\n",
"horizon = len(Y_test_df)\n",
"models = [NBEATS(input_size=2 * horizon, h=horizon, max_steps=50),\n",
" NHITS(input_size=2 * horizon, h=horizon, max_steps=50)]\n",
"nf = NeuralForecast(models=models, freq='M')\n",
"nf.fit(df=Y_train_df)\n",
"Y_hat_df = nf.predict().reset_index()\n",
"\n",
"# Plot predictions\n",
"fig, ax = plt.subplots(1, 1, figsize = (20, 7))\n",
"Y_hat_df = Y_test_df.merge(Y_hat_df, how='left', on=['unique_id', 'ds'])\n",
"plot_df = pd.concat([Y_train_df, Y_hat_df]).set_index('ds')\n",
"\n",
"plot_df[['y', 'NBEATS', 'NHITS']].plot(ax=ax, linewidth=2)\n",
"\n",
"ax.set_title('AirPassengers Forecast', fontsize=22)\n",
"ax.set_ylabel('Monthly Passengers', fontsize=20)\n",
"ax.set_xlabel('Timestamp [t]', fontsize=20)\n",
"ax.legend(prop={'size': 15})\n",
"ax.grid()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7cb3919a",
"metadata": {},
"source": [
"## 🙏 How to Cite"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "2c669ec7",
"metadata": {},
"source": [
"If you enjoy or benefit from using these Python implementations, a citation to the repository will be greatly appreciated."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0fed17d7",
"metadata": {},
"source": [
"```\n",
"@misc{olivares2022library_neuralforecast,\n",
" author={Kin G. Olivares and\n",
" Cristian Challú and\n",
" Federico Garza and\n",
" Max Mergenthaler Canseco and\n",
" Artur Dubrawski},\n",
" title = {{NeuralForecast}: User friendly state-of-the-art neural forecasting models.},\n",
" year={2022},\n",
" howpublished={{PyCon} Salt Lake City, Utah, US 2022},\n",
" url={https://github.com/Nixtla/neuralforecast}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "617b9dca",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp losses.numpy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NumPy Evaluation\n",
"\n",
"> NeuralForecast contains a collection NumPy loss functions aimed to be used during the models' evaluation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most important train signal is the forecast error, which is the difference between the observed value $y_{\\tau}$ and the prediction $\\hat{y}_{\\tau}$, at time $y_{\\tau}$:\n",
"\n",
"$$e_{\\tau} = y_{\\tau}-\\hat{y}_{\\tau} \\qquad \\qquad \\tau \\in \\{t+1,\\dots,t+H \\}$$\n",
"\n",
"The train loss summarizes the forecast errors in different evaluation metrics."
]
},
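{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, a minimal numeric sketch (the numbers below are made up for illustration): if the observed values over a 3-step horizon are $y=(112, 118, 132)$ and the predictions are $\\hat{y}=(110, 120, 125)$, the forecast errors are $e=(2, -2, 7)$. The metrics in this module summarize such error vectors into a single number."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: forecast errors for a toy 3-step horizon\n",
"import numpy as np\n",
"\n",
"y = np.array([112.0, 118.0, 132.0])      # observed values\n",
"y_hat = np.array([110.0, 120.0, 125.0])  # predictions\n",
"y - y_hat                                # forecast errors e_tau"
]
},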
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"from typing import Optional, Union\n",
"\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from IPython.display import Image\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"WIDTH = 600\n",
"HEIGHT = 300"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def _divide_no_nan(a: np.ndarray, b: np.ndarray) -> np.ndarray:\n",
" \"\"\"\n",
" Auxiliary funtion to handle divide by 0\n",
" \"\"\"\n",
" div = a / b\n",
" div[div != div] = 0.0\n",
" div[div == float('inf')] = 0.0\n",
" return div"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def _metric_protections(y: np.ndarray, y_hat: np.ndarray, \n",
" weights: Optional[np.ndarray]) -> None:\n",
" assert (weights is None) or (np.sum(weights) > 0), 'Sum of weights cannot be 0'\n",
" assert (weights is None) or (weights.shape == y.shape),\\\n",
" f'Wrong weight dimension weights.shape {weights.shape}, y.shape {y.shape}'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Scale-dependent Errors\n",
"\n",
"These metrics are on the same scale as the data."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mean Absolute Error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def mae(y: np.ndarray, y_hat: np.ndarray,\n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\"Mean Absolute Error\n",
"\n",
" Calculates Mean Absolute Error between\n",
" `y` and `y_hat`. MAE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the\n",
" deviation of the prediction and the true\n",
" value at a given time and averages these devations\n",
" over the length of the series.\n",
"\n",
" $$ \\mathrm{MAE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} |y_{\\\\tau} - \\hat{y}_{\\\\tau}| $$\n",
"\n",
" **Parameters:**<br>\n",
" `y`: numpy array, Actual values.<br>\n",
" `y_hat`: numpy array, Predicted values.<br>\n",
" `mask`: numpy array, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mae`: numpy array, (single value). \n",
" \"\"\"\n",
" _metric_protections(y, y_hat, weights)\n",
" \n",
" delta_y = np.abs(y - y_hat)\n",
" if weights is not None:\n",
" mae = np.average(delta_y[~np.isnan(delta_y)], \n",
" weights=weights[~np.isnan(delta_y)],\n",
" axis=axis)\n",
" else:\n",
" mae = np.nanmean(delta_y, axis=axis)\n",
" \n",
" return mae"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(mae, title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_losses/mae_loss.png)"
]
},
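{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick usage sketch of `mae` with toy arrays (values are illustrative only): the absolute errors are averaged over all entries, or along a chosen `axis` to obtain one value per series."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative usage of `mae` on toy data\n",
"y = np.array([[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]])\n",
"y_hat = np.array([[1.5, 2.0, 2.0], [2.0, 3.0, 2.0]])\n",
"\n",
"print('MAE over all entries:', mae(y, y_hat))\n",
"print('MAE per series (axis=1):', mae(y, y_hat, axis=1))"
]
},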
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mean Squared Error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def mse(y: np.ndarray, y_hat: np.ndarray, \n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\" Mean Squared Error\n",
"\n",
" Calculates Mean Squared Error between\n",
" `y` and `y_hat`. MSE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the \n",
" squared deviation of the prediction and the true\n",
" value at a given time, and averages these devations\n",
" over the length of the series.\n",
"\n",
" $$ \\mathrm{MSE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} (y_{\\\\tau} - \\hat{y}_{\\\\tau})^{2} $$\n",
"\n",
" **Parameters:**<br>\n",
" `y`: numpy array, Actual values.<br>\n",
" `y_hat`: numpy array, Predicted values.<br>\n",
" `mask`: numpy array, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mse`: numpy array, (single value).\n",
" \"\"\"\n",
" _metric_protections(y, y_hat, weights)\n",
"\n",
" delta_y = np.square(y - y_hat)\n",
" if weights is not None:\n",
" mse = np.average(delta_y[~np.isnan(delta_y)],\n",
" weights=weights[~np.isnan(delta_y)],\n",
" axis=axis)\n",
" else:\n",
" mse = np.nanmean(delta_y, axis=axis)\n",
"\n",
" return mse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(mse, title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_losses/mse_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Root Mean Squared Error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def rmse(y: np.ndarray, y_hat: np.ndarray,\n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\" Root Mean Squared Error\n",
"\n",
" Calculates Root Mean Squared Error between\n",
" `y` and `y_hat`. RMSE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the squared deviation\n",
" of the prediction and the observed value at a given time and\n",
" averages these devations over the length of the series.\n",
" Finally the RMSE will be in the same scale\n",
" as the original time series so its comparison with other\n",
" series is possible only if they share a common scale.\n",
" RMSE has a direct connection to the L2 norm.\n",
"\n",
" $$ \\mathrm{RMSE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\sqrt{\\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} (y_{\\\\tau} - \\hat{y}_{\\\\tau})^{2}} $$\n",
"\n",
" **Parameters:**<br>\n",
" `y`: numpy array, Actual values.<br>\n",
" `y_hat`: numpy array, Predicted values.<br>\n",
" `mask`: numpy array, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `rmse`: numpy array, (single value).\n",
" \"\"\"\n",
" return np.sqrt(mse(y, y_hat, weights, axis))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(rmse, title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_losses/rmse_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Percentage errors\n",
"\n",
"These metrics are unit-free, suitable for comparisons across series."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mean Absolute Percentage Error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def mape(y: np.ndarray, y_hat: np.ndarray, \n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\" Mean Absolute Percentage Error\n",
"\n",
" Calculates Mean Absolute Percentage Error between\n",
" `y` and `y_hat`. MAPE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the percentual deviation\n",
" of the prediction and the observed value at a given time and\n",
" averages these devations over the length of the series.\n",
" The closer to zero an observed value is, the higher penalty MAPE loss\n",
" assigns to the corresponding error.\n",
"\n",
" $$ \\mathrm{MAPE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} \\\\frac{|y_{\\\\tau}-\\hat{y}_{\\\\tau}|}{|y_{\\\\tau}|} $$\n",
"\n",
" **Parameters:**<br>\n",
" `y`: numpy array, Actual values.<br>\n",
" `y_hat`: numpy array, Predicted values.<br>\n",
" `mask`: numpy array, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mape`: numpy array, (single value).\n",
" \"\"\"\n",
" _metric_protections(y, y_hat, weights)\n",
" \n",
" delta_y = np.abs(y - y_hat)\n",
" scale = np.abs(y)\n",
" mape = _divide_no_nan(delta_y, scale)\n",
" mape = np.average(mape, weights=weights, axis=axis)\n",
" \n",
" return mape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(mape, title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_losses/mape_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## SMAPE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def smape(y: np.ndarray, y_hat: np.ndarray,\n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\" Symmetric Mean Absolute Percentage Error\n",
"\n",
" Calculates Symmetric Mean Absolute Percentage Error between\n",
" `y` and `y_hat`. SMAPE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the relative deviation\n",
" of the prediction and the observed value scaled by the sum of the\n",
" absolute values for the prediction and observed value at a\n",
" given time, then averages these devations over the length\n",
" of the series. This allows the SMAPE to have bounds between\n",
" 0% and 200% which is desirable compared to normal MAPE that\n",
" may be undetermined when the target is zero.\n",
"\n",
" $$ \\mathrm{sMAPE}_{2}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} \\\\frac{|y_{\\\\tau}-\\hat{y}_{\\\\tau}|}{|y_{\\\\tau}|+|\\hat{y}_{\\\\tau}|} $$\n",
"\n",
" **Parameters:**<br>\n",
" `y`: numpy array, Actual values.<br>\n",
" `y_hat`: numpy array, Predicted values.<br>\n",
" `mask`: numpy array, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `smape`: numpy array, (single value).\n",
" \n",
" **References:**<br>\n",
" [Makridakis S., \"Accuracy measures: theoretical and practical concerns\".](https://www.sciencedirect.com/science/article/pii/0169207093900793)\n",
" \"\"\"\n",
" _metric_protections(y, y_hat, weights)\n",
" \n",
" delta_y = np.abs(y - y_hat)\n",
" scale = np.abs(y) + np.abs(y_hat)\n",
" smape = _divide_no_nan(delta_y, scale)\n",
" smape = 2 * np.average(smape, weights=weights, axis=axis)\n",
" \n",
" if isinstance(smape, float):\n",
" assert smape <= 2, 'SMAPE should be lower than 200'\n",
" else:\n",
" assert all(smape <= 2), 'SMAPE should be lower than 200'\n",
" \n",
" return smape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(smape, title_level=3)"
]
},
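{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small usage sketch of `smape` with toy values (illustrative only), including a zero target to show that the result remains bounded between 0 and 2 (0% and 200%)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative usage of `smape`; note the zero target in the second entry\n",
"y = np.array([10.0, 0.0, 5.0])\n",
"y_hat = np.array([12.0, 3.0, 5.0])\n",
"smape(y, y_hat)"
]
},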
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Scale-independent Errors\n",
"\n",
"These metrics measure the relative improvements versus baselines."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mean Absolute Scaled Error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def mase(y: np.ndarray, y_hat: np.ndarray, \n",
" y_train: np.ndarray,\n",
" seasonality: int,\n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\" Mean Absolute Scaled Error \n",
" Calculates the Mean Absolute Scaled Error between\n",
" `y` and `y_hat`. MASE measures the relative prediction\n",
" accuracy of a forecasting method by comparinng the mean absolute errors\n",
" of the prediction and the observed value against the mean\n",
" absolute errors of the seasonal naive model.\n",
" The MASE partially composed the Overall Weighted Average (OWA), \n",
" used in the M4 Competition.\n",
"\n",
" $$ \\mathrm{MASE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{season}_{\\\\tau}) = \\\\frac{1}{H} \\sum^{t+H}_{\\\\tau=t+1} \\\\frac{|y_{\\\\tau}-\\hat{y}_{\\\\tau}|}{\\mathrm{MAE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{season}_{\\\\tau})} $$\n",
"\n",
" **Parameters:**<br>\n",
" `y`: numpy array, (batch_size, output_size), Actual values.<br>\n",
" `y_hat`: numpy array, (batch_size, output_size)), Predicted values.<br>\n",
" `y_insample`: numpy array, (batch_size, input_size), Actual insample Seasonal Naive predictions.<br>\n",
" `seasonality`: int. Main frequency of the time series; Hourly 24, Daily 7, Weekly 52, Monthly 12, Quarterly 4, Yearly 1. \n",
" `mask`: numpy array, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mase`: numpy array, (single value).\n",
" \n",
" **References:**<br>\n",
" [Rob J. Hyndman, & Koehler, A. B. \"Another look at measures of forecast accuracy\".](https://www.sciencedirect.com/science/article/pii/S0169207006000239)<br>\n",
" [Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, \"The M4 Competition: 100,000 time series and 61 forecasting methods\".](https://www.sciencedirect.com/science/article/pii/S0169207019301128)\n",
" \"\"\"\n",
" delta_y = np.abs(y - y_hat)\n",
" delta_y = np.average(delta_y, weights=weights, axis=axis)\n",
"\n",
" scale = np.abs(y_train[:-seasonality] - y_train[seasonality:])\n",
" scale = np.average(scale, axis=axis)\n",
"\n",
" mase = delta_y / scale\n",
"\n",
" return mase"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(mase, title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_losses/mase_loss.png)"
]
},
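{
"cell_type": "markdown",
"metadata": {},
"source": [
"A usage sketch of `mase` with toy values (illustrative only): the forecast's mean absolute error is scaled by the in-sample mean absolute error of a seasonal naive baseline, here with `seasonality=1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative usage of `mase` on toy data\n",
"y_train = np.array([10.0, 12.0, 14.0, 11.0, 13.0, 15.0, 12.0, 14.0])  # insample values\n",
"y = np.array([13.0, 15.0])      # observed test values\n",
"y_hat = np.array([12.0, 16.0])  # forecasts\n",
"\n",
"# Values below 1 indicate an improvement over the (seasonal) naive baseline\n",
"mase(y=y, y_hat=y_hat, y_train=y_train, seasonality=1)"
]
},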
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relative Mean Absolute Error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def rmae(y: np.ndarray, \n",
" y_hat1: np.ndarray, y_hat2: np.ndarray, \n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\" RMAE\n",
" \n",
" Calculates Relative Mean Absolute Error (RMAE) between\n",
" two sets of forecasts (from two different forecasting methods).\n",
" A number smaller than one implies that the forecast in the \n",
" numerator is better than the forecast in the denominator.\n",
" \n",
" $$ \\mathrm{rMAE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{base}_{\\\\tau}) = \\\\frac{1}{H} \\sum^{t+H}_{\\\\tau=t+1} \\\\frac{|y_{\\\\tau}-\\hat{y}_{\\\\tau}|}{\\mathrm{MAE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{base}_{\\\\tau})} $$\n",
" \n",
" **Parameters:**<br>\n",
" `y`: numpy array, observed values.<br>\n",
" `y_hat1`: numpy array. Predicted values of first model.<br>\n",
" `y_hat2`: numpy array. Predicted values of baseline model.<br>\n",
" `weights`: numpy array, optional. Weights for weighted average.<br>\n",
" `axis`: None or int, optional.Axis or axes along which to average a.<br> \n",
" The default, axis=None, will average over all of the elements of\n",
" the input array.\n",
" \n",
" **Returns:**<br>\n",
" `rmae`: numpy array or double.\n",
"\n",
" **References:**<br>\n",
" [Rob J. Hyndman, & Koehler, A. B. \"Another look at measures of forecast accuracy\".](https://www.sciencedirect.com/science/article/pii/S0169207006000239)\n",
" \"\"\"\n",
" numerator = mae(y=y, y_hat=y_hat1, weights=weights, axis=axis)\n",
" denominator = mae(y=y, y_hat=y_hat2, weights=weights, axis=axis)\n",
" rmae = numerator / denominator\n",
"\n",
" return rmae"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(rmae, title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_losses/rmae_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Probabilistic Errors\n",
"\n",
"These measure absolute deviation non-symmetrically, that produce under/over estimation."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quantile Loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def quantile_loss(y: np.ndarray, y_hat: np.ndarray, q: float = 0.5, \n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\" Quantile Loss\n",
"\n",
" Computes the quantile loss between `y` and `y_hat`.\n",
" QL measures the deviation of a quantile forecast.\n",
" By weighting the absolute deviation in a non symmetric way, the\n",
" loss pays more attention to under or over estimation.\n",
" A common value for q is 0.5 for the deviation from the median (Pinball loss).\n",
"\n",
" $$ \\mathrm{QL}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{(q)}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} \\Big( (1-q)\\,( \\hat{y}^{(q)}_{\\\\tau} - y_{\\\\tau} )_{+} + q\\,( y_{\\\\tau} - \\hat{y}^{(q)}_{\\\\tau} )_{+} \\Big) $$\n",
"\n",
" **Parameters:**<br>\n",
" `y`: numpy array, Actual values.<br>\n",
" `y_hat`: numpy array, Predicted values.<br>\n",
" `q`: float, between 0 and 1. The slope of the quantile loss, in the context of quantile regression, the q determines the conditional quantile level.<br>\n",
" `mask`: numpy array, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `quantile_loss`: numpy array, (single value).\n",
" \n",
" **References:**<br>\n",
" [Roger Koenker and Gilbert Bassett, Jr., \"Regression Quantiles\".](https://www.jstor.org/stable/1913643)\n",
" \"\"\"\n",
" _metric_protections(y, y_hat, weights)\n",
"\n",
" delta_y = y - y_hat\n",
" loss = np.maximum(q * delta_y, (q - 1) * delta_y)\n",
"\n",
" if weights is not None:\n",
" quantile_loss = np.average(loss[~np.isnan(loss)], \n",
" weights=weights[~np.isnan(loss)],\n",
" axis=axis)\n",
" else:\n",
" quantile_loss = np.nanmean(loss, axis=axis)\n",
" \n",
" return quantile_loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(quantile_loss, title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_losses/q_loss.png)"
]
},
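{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the asymmetry with toy values (illustrative only): with `q=0.9` the loss charges under-forecasts much more heavily than over-forecasts of the same magnitude."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative usage of `quantile_loss`: asymmetric penalty for q=0.9\n",
"y = np.array([10.0, 10.0, 10.0])\n",
"y_hat_low = np.array([8.0, 8.0, 8.0])      # under-forecast\n",
"y_hat_high = np.array([12.0, 12.0, 12.0])  # over-forecast\n",
"\n",
"quantile_loss(y, y_hat_low, q=0.9), quantile_loss(y, y_hat_high, q=0.9)"
]
},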
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-Quantile Loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def mqloss(y: np.ndarray, y_hat: np.ndarray, \n",
" quantiles: np.ndarray, \n",
" weights: Optional[np.ndarray] = None,\n",
" axis: Optional[int] = None) -> Union[float, np.ndarray]:\n",
" \"\"\" Multi-Quantile loss\n",
"\n",
" Calculates the Multi-Quantile loss (MQL) between `y` and `y_hat`.\n",
" MQL calculates the average multi-quantile Loss for\n",
" a given set of quantiles, based on the absolute \n",
" difference between predicted quantiles and observed values.\n",
"\n",
" $$ \\mathrm{MQL}(\\\\mathbf{y}_{\\\\tau},[\\\\mathbf{\\hat{y}}^{(q_{1})}_{\\\\tau}, ... ,\\hat{y}^{(q_{n})}_{\\\\tau}]) = \\\\frac{1}{n} \\\\sum_{q_{i}} \\mathrm{QL}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{(q_{i})}_{\\\\tau}) $$\n",
"\n",
" The limit behavior of MQL allows to measure the accuracy \n",
" of a full predictive distribution $\\mathbf{\\hat{F}}_{\\\\tau}$ with \n",
" the continuous ranked probability score (CRPS). This can be achieved \n",
" through a numerical integration technique, that discretizes the quantiles \n",
" and treats the CRPS integral with a left Riemann approximation, averaging over \n",
" uniformly distanced quantiles. \n",
"\n",
" $$ \\mathrm{CRPS}(y_{\\\\tau}, \\mathbf{\\hat{F}}_{\\\\tau}) = \\int^{1}_{0} \\mathrm{QL}(y_{\\\\tau}, \\hat{y}^{(q)}_{\\\\tau}) dq $$\n",
"\n",
" **Parameters:**<br>\n",
" `y`: numpy array, Actual values.<br>\n",
" `y_hat`: numpy array, Predicted values.<br>\n",
" `quantiles`: numpy array,(n_quantiles). Quantiles to estimate from the distribution of y.<br>\n",
" `mask`: numpy array, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mqloss`: numpy array, (single value).\n",
" \n",
" **References:**<br>\n",
" [Roger Koenker and Gilbert Bassett, Jr., \"Regression Quantiles\".](https://www.jstor.org/stable/1913643)<br>\n",
" [James E. Matheson and Robert L. Winkler, \"Scoring Rules for Continuous Probability Distributions\".](https://www.jstor.org/stable/2629907)\n",
" \"\"\"\n",
" if weights is None: weights = np.ones(y.shape)\n",
" \n",
" _metric_protections(y, y_hat, weights)\n",
" n_q = len(quantiles)\n",
" \n",
" y_rep = np.expand_dims(y, axis=-1)\n",
" error = y_hat - y_rep\n",
" sq = np.maximum(-error, np.zeros_like(error))\n",
" s1_q = np.maximum(error, np.zeros_like(error))\n",
" mqloss = (quantiles * sq + (1 - quantiles) * s1_q)\n",
" \n",
" # Match y/weights dimensions and compute weighted average\n",
" weights = np.repeat(np.expand_dims(weights, axis=-1), repeats=n_q, axis=-1)\n",
" mqloss = np.average(mqloss, weights=weights, axis=axis)\n",
"\n",
" return mqloss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(mqloss, title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_losses/mq_loss.png)"
]
},
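{
"cell_type": "markdown",
"metadata": {},
"source": [
"A usage sketch of `mqloss` with toy values (illustrative only): `y_hat` carries one extra trailing dimension holding a forecast per quantile."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative usage of `mqloss` on toy data\n",
"quantiles = np.array([0.1, 0.5, 0.9])\n",
"y = np.array([[10.0, 12.0]])            # (series, horizon)\n",
"y_hat = np.array([[[ 8.0, 10.0, 12.0],  # (series, horizon, quantiles)\n",
"                   [10.0, 12.0, 14.0]]])\n",
"\n",
"mqloss(y=y, y_hat=y_hat, quantiles=quantiles)"
]
},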
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examples and Validation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import unittest\n",
"import torch as t \n",
"import numpy as np\n",
"\n",
"from neuralforecast.losses.pytorch import (\n",
" MAE, MSE, RMSE, # unscaled errors\n",
" MAPE, SMAPE, # percentage errors\n",
" MASE, # scaled error\n",
" QuantileLoss, MQLoss # probabilistic errors\n",
")\n",
"\n",
"from neuralforecast.losses.numpy import (\n",
" mae, mse, rmse, # unscaled errors\n",
" mape, smape, # percentage errors\n",
" mase, # scaled error\n",
" quantile_loss, mqloss # probabilistic errors\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Test class for pytorch/numpy loss functions\n",
"class TestLoss(unittest.TestCase):\n",
" def setUp(self): \n",
" self.num_quantiles = np.random.randint(3, 10)\n",
" self.first_num = np.random.randint(1, 300)\n",
" self.second_num = np.random.randint(1, 300)\n",
" \n",
" self.y = t.rand(self.first_num, self.second_num)\n",
" self.y_hat = t.rand(self.first_num, self.second_num)\n",
" self.y_hat2 = t.rand(self.first_num, self.second_num)\n",
" self.y_hat_quantile = t.rand(self.first_num, self.second_num, self.num_quantiles)\n",
" \n",
" self.quantiles = t.rand(self.num_quantiles)\n",
" self.q_float = np.random.random_sample()\n",
"\n",
" def test_mae(self):\n",
" mae_numpy = mae(self.y, self.y_hat)\n",
" mae_pytorch = MAE()\n",
" mae_pytorch = mae_pytorch(self.y, self.y_hat).numpy()\n",
" self.assertAlmostEqual(mae_numpy, mae_pytorch, places=6)\n",
"\n",
" def test_mse(self):\n",
" mse_numpy = mse(self.y, self.y_hat)\n",
" mse_pytorch = MSE()\n",
" mse_pytorch = mse_pytorch(self.y, self.y_hat).numpy()\n",
" self.assertAlmostEqual(mse_numpy, mse_pytorch, places=6)\n",
"\n",
" def test_rmse(self):\n",
" rmse_numpy = rmse(self.y, self.y_hat)\n",
" rmse_pytorch = RMSE()\n",
" rmse_pytorch = rmse_pytorch(self.y, self.y_hat).numpy()\n",
" self.assertAlmostEqual(rmse_numpy, rmse_pytorch, places=6)\n",
"\n",
" def test_mape(self):\n",
" mape_numpy = mape(y=self.y, y_hat=self.y_hat)\n",
" mape_pytorch = MAPE()\n",
" mape_pytorch = mape_pytorch(y=self.y, y_hat=self.y_hat).numpy()\n",
" self.assertAlmostEqual(mape_numpy, mape_pytorch, places=6)\n",
"\n",
" def test_smape(self):\n",
" smape_numpy = smape(self.y, self.y_hat)\n",
" smape_pytorch = SMAPE()\n",
" smape_pytorch = smape_pytorch(self.y, self.y_hat).numpy()\n",
" self.assertAlmostEqual(smape_numpy, smape_pytorch, places=4)\n",
" \n",
" #def test_mase(self):\n",
" # y_insample = t.rand(self.first_num, self.second_num)\n",
" # seasonality = 24\n",
" # # Hourly 24, Daily 7, Weekly 52\n",
" # # Monthly 12, Quarterly 4, Yearly 1\n",
" # mase_numpy = mase(y=self.y, y_hat=self.y_hat,\n",
" # y_insample=y_insample, seasonality=seasonality)\n",
" # mase_object = MASE(seasonality=seasonality)\n",
" # mase_pytorch = mase_object(y=self.y, y_hat=self.y_hat,\n",
" # y_insample=y_insample).numpy()\n",
" # self.assertAlmostEqual(mase_numpy, mase_pytorch, places=2)\n",
"\n",
" #def test_rmae(self):\n",
" # rmae_numpy = rmae(self.y, self.y_hat, self.y_hat2)\n",
" # rmae_object = RMAE()\n",
" # rmae_pytorch = rmae_object(self.y, self.y_hat, self.y_hat2).numpy()\n",
" # self.assertAlmostEqual(rmae_numpy, rmae_pytorch, places=4)\n",
"\n",
" def test_quantile(self):\n",
" quantile_numpy = quantile_loss(self.y, self.y_hat, q = self.q_float)\n",
" quantile_pytorch = QuantileLoss(q = self.q_float)\n",
" quantile_pytorch = quantile_pytorch(self.y, self.y_hat).numpy()\n",
" self.assertAlmostEqual(quantile_numpy, quantile_pytorch, places=6)\n",
" \n",
" # def test_mqloss(self):\n",
" # weights = np.ones_like(self.y)\n",
"\n",
" # mql_np_w = mqloss(self.y, self.y_hat_quantile, self.quantiles, weights=weights)\n",
" # mql_np_default_w = mqloss(self.y, self.y_hat_quantile, self.quantiles)\n",
"\n",
" # mql_object = MQLoss(quantiles=self.quantiles)\n",
" # mql_py_w = mql_object(y=self.y,\n",
" # y_hat=self.y_hat_quantile,\n",
" # mask=t.Tensor(weights)).numpy()\n",
" \n",
" # print('self.y.shape', self.y.shape)\n",
" # print('self.y_hat_quantile.shape', self.y_hat_quantile.shape)\n",
" # mql_py_default_w = mql_object(y=self.y,\n",
" # y_hat=self.y_hat_quantile).numpy()\n",
"\n",
" # weights[0,:] = 0\n",
" # mql_np_new_w = mqloss(self.y, self.y_hat_quantile, self.quantiles, weights=weights)\n",
" # mql_py_new_w = mql_object(y=self.y,\n",
" # y_hat=self.y_hat_quantile,\n",
" # mask=t.Tensor(weights)).numpy()\n",
"\n",
" # self.assertAlmostEqual(mql_np_w, mql_np_default_w)\n",
" # self.assertAlmostEqual(mql_py_w, mql_py_default_w)\n",
" # self.assertAlmostEqual(mql_np_new_w, mql_py_new_w)\n",
" \n",
"\n",
"unittest.main(argv=[''], verbosity=2, exit=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "524620c1",
"metadata": {},
"outputs": [],
"source": [
"#| default_exp losses.pytorch"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15392f6f",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"id": "fd532cb1-d11d-468e-a0e5-eb1101ba6662",
"metadata": {},
"source": [
"# PyTorch Losses\n",
"\n",
"> NeuralForecast contains a collection PyTorch Loss classes aimed to be used during the models' optimization."
]
},
{
"cell_type": "markdown",
"id": "096cfbec-1d59-454a-b572-5890103b2f1f",
"metadata": {},
"source": [
"The most important train signal is the forecast error, which is the difference between the observed value $y_{\\tau}$ and the prediction $\\hat{y}_{\\tau}$, at time $y_{\\tau}$:\n",
"\n",
"$$e_{\\tau} = y_{\\tau}-\\hat{y}_{\\tau} \\qquad \\qquad \\tau \\in \\{t+1,\\dots,t+H \\}$$\n",
"\n",
"The train loss summarizes the forecast errors in different train optimization objectives.\n",
"\n",
"All the losses are `torch.nn.modules` which helps to automatically moved them across CPU/GPU/TPU devices with Pytorch Lightning. "
]
},
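{
"cell_type": "markdown",
"id": "7f3a1c2e",
"metadata": {},
"source": [
"As a minimal sketch of the device handling described above (using a built-in PyTorch loss purely for illustration, not one of this module's classes), any `torch.nn.Module`-based loss can be moved with `.to(...)`; PyTorch Lightning performs this placement automatically for the losses defined here."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c4d2b1a",
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: module-based losses follow the usual device-placement API\n",
"import torch\n",
"\n",
"device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
"loss_fn = torch.nn.L1Loss().to(device)  # built-in loss, used only as an example\n",
"y = torch.rand(2, 12, device=device)\n",
"y_hat = torch.rand(2, 12, device=device)\n",
"loss_fn(y_hat, y)"
]
},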
{
"cell_type": "code",
"execution_count": null,
"id": "acfa68dc",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"from typing import Optional, Union, Tuple\n",
"\n",
"import math\n",
"import numpy as np\n",
"import torch\n",
"\n",
"import torch.nn.functional as F\n",
"from torch.distributions import Distribution\n",
"from torch.distributions import (\n",
" Bernoulli,\n",
" Normal, \n",
" StudentT, \n",
" Poisson,\n",
" NegativeBinomial\n",
")\n",
"\n",
"from torch.distributions import constraints"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2508f7a9-1433-4ad8-8f2f-0078c6ed6c3c",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"import matplotlib.pyplot as plt\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc\n",
"from neuralforecast.utils import generate_series"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84e07e98-b4c8-4ade-b3b6-1d27f367aa0a",
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"def _divide_no_nan(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:\n",
" \"\"\"\n",
" Auxiliary funtion to handle divide by 0\n",
" \"\"\"\n",
" div = a / b\n",
" div[div != div] = 0.0\n",
" div[div == float('inf')] = 0.0\n",
" return div"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "132db0ca",
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"def _weighted_mean(losses, weights):\n",
" \"\"\"\n",
" Compute weighted mean of losses per datapoint.\n",
" \"\"\"\n",
" return _divide_no_nan(torch.sum(losses * weights), torch.sum(weights))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f41562a4",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class BasePointLoss(torch.nn.Module):\n",
" \"\"\"\n",
" Base class for point loss functions.\n",
"\n",
" **Parameters:**<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
" `outputsize_multiplier`: Multiplier for the output size. <br>\n",
" `output_names`: Names of the outputs. <br>\n",
" \"\"\"\n",
" def __init__(self, horizon_weight, outputsize_multiplier, output_names):\n",
" super(BasePointLoss, self).__init__()\n",
" if horizon_weight is not None:\n",
" horizon_weight = torch.Tensor(horizon_weight.flatten())\n",
" self.horizon_weight = horizon_weight\n",
" self.outputsize_multiplier = outputsize_multiplier\n",
" self.output_names = output_names\n",
" self.is_distribution_output = False\n",
"\n",
" def domain_map(self, y_hat: torch.Tensor):\n",
" \"\"\"\n",
" Univariate loss operates in dimension [B,T,H]/[B,H]\n",
" This changes the network's output from [B,H,1]->[B,H]\n",
" \"\"\"\n",
" return y_hat.squeeze(-1)\n",
"\n",
" def _compute_weights(self, y, mask):\n",
" \"\"\"\n",
" Compute final weights for each datapoint (based on all weights and all masks)\n",
" Set horizon_weight to a ones[H] tensor if not set.\n",
" If set, check that it has the same length as the horizon in x.\n",
" \"\"\"\n",
" if mask is None:\n",
" mask = torch.ones_like(y, device=y.device)\n",
"\n",
" if self.horizon_weight is None:\n",
" self.horizon_weight = torch.ones(mask.shape[-1])\n",
" else:\n",
" assert mask.shape[-1] == len(self.horizon_weight), \\\n",
" 'horizon_weight must have same length as Y'\n",
"\n",
" weights = self.horizon_weight.clone()\n",
" weights = torch.ones_like(mask, device=mask.device) * weights.to(mask.device)\n",
" return weights * mask"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b8a94d7d",
"metadata": {},
"source": [
"# 1. Scale-dependent Errors\n",
"\n",
"These metrics are on the same scale as the data."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "82fc4679",
"metadata": {},
"source": [
"## Mean Absolute Error (MAE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e413fae-c590-4713-aab9-37c61ed37dff",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class MAE(BasePointLoss):\n",
" \"\"\"Mean Absolute Error\n",
"\n",
" Calculates Mean Absolute Error between\n",
" `y` and `y_hat`. MAE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the\n",
" deviation of the prediction and the true\n",
" value at a given time and averages these devations\n",
" over the length of the series.\n",
"\n",
" $$ \\mathrm{MAE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} |y_{\\\\tau} - \\hat{y}_{\\\\tau}| $$\n",
"\n",
" **Parameters:**<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
" \"\"\" \n",
" def __init__(self, horizon_weight=None):\n",
" super(MAE, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[''])\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies datapoints to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mae`: tensor (single value).\n",
" \"\"\"\n",
" losses = torch.abs(y - y_hat)\n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" return _weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d004cd0",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MAE, name='MAE.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a20a273",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MAE.__call__, name='MAE.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0292c74d",
"metadata": {},
"source": [
"![](imgs_losses/mae_loss.png)"
]
},
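{
"cell_type": "markdown",
"id": "3e9b7a55",
"metadata": {},
"source": [
"A usage sketch of the `MAE` class with toy tensors (values are illustrative only): `horizon_weight` re-weights the timestamps of the forecast window, and an optional `mask` drops datapoints from the loss."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c1f2d80",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative usage of MAE with and without horizon_weight\n",
"y = torch.Tensor([[1.0, 2.0, 3.0]])\n",
"y_hat = torch.Tensor([[1.5, 2.0, 2.0]])\n",
"\n",
"mae_plain = MAE()\n",
"mae_weighted = MAE(horizon_weight=np.array([1.0, 1.0, 0.0]))  # ignore the last step\n",
"mae_plain(y=y, y_hat=y_hat), mae_weighted(y=y, y_hat=y_hat)"
]
},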
{
"attachments": {},
"cell_type": "markdown",
"id": "4f31cc3d",
"metadata": {},
"source": [
"## Mean Squared Error (MSE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46cfe937",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class MSE(BasePointLoss):\n",
" \"\"\" Mean Squared Error\n",
"\n",
" Calculates Mean Squared Error between\n",
" `y` and `y_hat`. MSE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the \n",
" squared deviation of the prediction and the true\n",
" value at a given time, and averages these devations\n",
" over the length of the series.\n",
" \n",
" $$ \\mathrm{MSE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} (y_{\\\\tau} - \\hat{y}_{\\\\tau})^{2} $$\n",
"\n",
" **Parameters:**<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
" \"\"\"\n",
" def __init__(self, horizon_weight=None):\n",
" super(MSE, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[''])\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies datapoints to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mse`: tensor (single value).\n",
" \"\"\"\n",
" losses = (y - y_hat)**2\n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" return _weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8c65b82",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MSE, name='MSE.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0126a7f",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MSE.__call__, name='MSE.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7b23f9c1",
"metadata": {},
"source": [
"![](imgs_losses/mse_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b160140b",
"metadata": {},
"source": [
"## Root Mean Squared Error (RMSE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "545ebfb7",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class RMSE(BasePointLoss):\n",
" \"\"\" Root Mean Squared Error\n",
"\n",
" Calculates Root Mean Squared Error between\n",
" `y` and `y_hat`. RMSE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the squared deviation\n",
" of the prediction and the observed value at a given time and\n",
" averages these devations over the length of the series.\n",
" Finally the RMSE will be in the same scale\n",
" as the original time series so its comparison with other\n",
" series is possible only if they share a common scale. \n",
" RMSE has a direct connection to the L2 norm.\n",
" \n",
" $$ \\mathrm{RMSE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\sqrt{\\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} (y_{\\\\tau} - \\hat{y}_{\\\\tau})^{2}} $$\n",
"\n",
" **Parameters:**<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
" \"\"\"\n",
" def __init__(self, horizon_weight=None):\n",
" super(RMSE, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[''])\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies datapoints to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `rmse`: tensor (single value).\n",
" \"\"\"\n",
" losses = (y - y_hat)**2\n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" losses = _weighted_mean(losses=losses, weights=weights)\n",
" return torch.sqrt(losses)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d961d383",
"metadata": {},
"outputs": [],
"source": [
"show_doc(RMSE, name='RMSE.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d398d3e3",
"metadata": {},
"outputs": [],
"source": [
"show_doc(RMSE.__call__, name='RMSE.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d4539e38",
"metadata": {},
"source": [
"![](imgs_losses/rmse_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "8bcf5488",
"metadata": {},
"source": [
"# 2. Percentage errors\n",
"\n",
"These metrics are unit-free, suitable for comparisons across series."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "8eab97ec",
"metadata": {},
"source": [
"## Mean Absolute Percentage Error (MAPE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "adecb6bf",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class MAPE(BasePointLoss):\n",
" \"\"\" Mean Absolute Percentage Error\n",
"\n",
" Calculates Mean Absolute Percentage Error between\n",
" `y` and `y_hat`. MAPE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the percentual deviation\n",
" of the prediction and the observed value at a given time and\n",
" averages these devations over the length of the series.\n",
" The closer to zero an observed value is, the higher penalty MAPE loss\n",
" assigns to the corresponding error.\n",
"\n",
" $$ \\mathrm{MAPE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} \\\\frac{|y_{\\\\tau}-\\hat{y}_{\\\\tau}|}{|y_{\\\\tau}|} $$\n",
"\n",
" **Parameters:**<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
"\n",
" **References:**<br>\n",
" [Makridakis S., \"Accuracy measures: theoretical and practical concerns\".](https://www.sciencedirect.com/science/article/pii/0169207093900793) \n",
" \"\"\"\n",
" def __init__(self, horizon_weight=None):\n",
" super(MAPE, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[''])\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mape`: tensor (single value).\n",
" \"\"\"\n",
" scale = _divide_no_nan(torch.ones_like(y, device=y.device), torch.abs(y))\n",
" losses = torch.abs(y - y_hat) * scale\n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" mape = _weighted_mean(losses=losses, weights=weights)\n",
" return mape"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "174e8042",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MAPE, name='MAPE.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da63f136",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MAPE.__call__, name='MAPE.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c8ccdc69",
"metadata": {},
"source": [
"![](imgs_losses/mape_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cb245891",
"metadata": {},
"source": [
"## Symmetric MAPE (sMAPE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7566e649",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class SMAPE(BasePointLoss):\n",
" \"\"\" Symmetric Mean Absolute Percentage Error\n",
"\n",
" Calculates Symmetric Mean Absolute Percentage Error between\n",
" `y` and `y_hat`. SMAPE measures the relative prediction\n",
" accuracy of a forecasting method by calculating the relative deviation\n",
" of the prediction and the observed value scaled by the sum of the\n",
" absolute values for the prediction and observed value at a\n",
" given time, then averages these devations over the length\n",
" of the series. This allows the SMAPE to have bounds between\n",
" 0% and 200% which is desireble compared to normal MAPE that\n",
" may be undetermined when the target is zero.\n",
"\n",
" $$ \\mathrm{sMAPE}_{2}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} \\\\frac{|y_{\\\\tau}-\\hat{y}_{\\\\tau}|}{|y_{\\\\tau}|+|\\hat{y}_{\\\\tau}|} $$\n",
"\n",
" **Parameters:**<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
"\n",
" **References:**<br>\n",
" [Makridakis S., \"Accuracy measures: theoretical and practical concerns\".](https://www.sciencedirect.com/science/article/pii/0169207093900793)\n",
" \"\"\"\n",
" def __init__(self, horizon_weight=None):\n",
" super(SMAPE, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[''])\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `smape`: tensor (single value).\n",
" \"\"\"\n",
" delta_y = torch.abs((y - y_hat))\n",
" scale = torch.abs(y) + torch.abs(y_hat)\n",
" losses = _divide_no_nan(delta_y, scale)\n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" return 2*_weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dee99fb8",
"metadata": {},
"outputs": [],
"source": [
"show_doc(SMAPE, name='SMAPE.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db62a845",
"metadata": {},
"outputs": [],
"source": [
"show_doc(SMAPE.__call__, name='SMAPE.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "bc3f2d6f",
"metadata": {},
"source": [
"# 3. Scale-independent Errors\n",
"\n",
"These metrics measure the relative improvements versus baselines."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5b2dee1f",
"metadata": {},
"source": [
"## Mean Absolute Scaled Error (MASE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cc34fae",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class MASE(BasePointLoss):\n",
" \"\"\" Mean Absolute Scaled Error \n",
" Calculates the Mean Absolute Scaled Error between\n",
" `y` and `y_hat`. MASE measures the relative prediction\n",
" accuracy of a forecasting method by comparinng the mean absolute errors\n",
" of the prediction and the observed value against the mean\n",
" absolute errors of the seasonal naive model.\n",
" The MASE partially composed the Overall Weighted Average (OWA), \n",
" used in the M4 Competition.\n",
" \n",
" $$ \\mathrm{MASE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{season}_{\\\\tau}) = \\\\frac{1}{H} \\sum^{t+H}_{\\\\tau=t+1} \\\\frac{|y_{\\\\tau}-\\hat{y}_{\\\\tau}|}{\\mathrm{MAE}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{season}_{\\\\tau})} $$\n",
"\n",
" **Parameters:**<br>\n",
" `seasonality`: int. Main frequency of the time series; Hourly 24, Daily 7, Weekly 52, Monthly 12, Quarterly 4, Yearly 1.\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
" \n",
" **References:**<br>\n",
" [Rob J. Hyndman, & Koehler, A. B. \"Another look at measures of forecast accuracy\".](https://www.sciencedirect.com/science/article/pii/S0169207006000239)<br>\n",
" [Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, \"The M4 Competition: 100,000 time series and 61 forecasting methods\".](https://www.sciencedirect.com/science/article/pii/S0169207019301128)\n",
" \"\"\"\n",
" def __init__(self, seasonality: int, horizon_weight=None):\n",
" super(MASE, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[''])\n",
" self.seasonality = seasonality\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" y_insample: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor (batch_size, output_size), Actual values.<br>\n",
" `y_hat`: tensor (batch_size, output_size)), Predicted values.<br>\n",
" `y_insample`: tensor (batch_size, input_size), Actual insample Seasonal Naive predictions.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mase`: tensor (single value).\n",
" \"\"\"\n",
" delta_y = torch.abs(y - y_hat)\n",
" scale = torch.mean(torch.abs(y_insample[:, self.seasonality:] - \\\n",
" y_insample[:, :-self.seasonality]), axis=1)\n",
" losses = _divide_no_nan(delta_y, scale[:, None])\n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" return _weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6a4cf21",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MASE, name='MASE.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32a2c11b",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MASE.__call__, name='MASE.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6e0c8fe5",
"metadata": {},
"source": [
"![](imgs_losses/mase_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "73bbdc4e",
"metadata": {},
"source": [
"## Relative Mean Squared Error (relMSE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "954911d4",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class relMSE(BasePointLoss):\n",
" \"\"\"Relative Mean Squared Error\n",
" Computes Relative Mean Squared Error (relMSE), as proposed by Hyndman & Koehler (2006)\n",
" as an alternative to percentage errors, to avoid measure unstability.\n",
" $$ \\mathrm{relMSE}(\\\\mathbf{y}, \\\\mathbf{\\hat{y}}, \\\\mathbf{\\hat{y}}^{naive1}) =\n",
" \\\\frac{\\mathrm{MSE}(\\\\mathbf{y}, \\\\mathbf{\\hat{y}})}{\\mathrm{MSE}(\\\\mathbf{y}, \\\\mathbf{\\hat{y}}^{naive1})} $$\n",
"\n",
" **Parameters:**<br>\n",
" `y_train`: numpy array, Training values.<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
"\n",
" **References:**<br>\n",
" - [Hyndman, R. J and Koehler, A. B. (2006).\n",
" \"Another look at measures of forecast accuracy\",\n",
" International Journal of Forecasting, Volume 22, Issue 4.](https://www.sciencedirect.com/science/article/pii/S0169207006000239)<br>\n",
" - [Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. \n",
" \"Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures. \n",
" Submitted to the International Journal Forecasting, Working paper available at arxiv.](https://arxiv.org/pdf/2110.13179.pdf)\n",
" \"\"\"\n",
" def __init__(self, y_train, horizon_weight=None):\n",
" super(relMSE, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[''])\n",
" self.y_train = y_train\n",
" self.mse = MSE(horizon_weight=horizon_weight)\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor (batch_size, output_size), Actual values.<br>\n",
" `y_hat`: tensor (batch_size, output_size)), Predicted values.<br>\n",
" `y_insample`: tensor (batch_size, input_size), Actual insample Seasonal Naive predictions.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `relMSE`: tensor (single value).\n",
" \"\"\"\n",
" horizon = y.shape[-1]\n",
" last_col = self.y_train[:, -1].unsqueeze(1)\n",
" y_naive = last_col.repeat(1, horizon)\n",
"\n",
" norm = self.mse(y=y, y_hat=y_naive, mask=mask) # Already weighted\n",
" norm = norm + 1e-5 # Numerical stability\n",
" loss = self.mse(y=y, y_hat=y_hat, mask=mask) # Already weighted\n",
" loss = _divide_no_nan(loss, norm)\n",
" return loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "edeb6f9a",
"metadata": {},
"outputs": [],
"source": [
"show_doc(relMSE, name='relMSE.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a317b5c5",
"metadata": {},
"outputs": [],
"source": [
"show_doc(relMSE.__call__, name='relMSE.__call__', title_level=3)"
]
},
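{
"attachments": {},
"cell_type": "markdown",
"id": "relmse-usage-md",
"metadata": {},
"source": [
"A minimal, hypothetical sketch of `relMSE`: the loss divides the forecast MSE by the MSE of a naive benchmark that repeats the last value of `y_train`, so values below 1 indicate an improvement over the naive forecast. The tensors below are made up for illustration and assume `torch` is imported in earlier cells."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "relmse-usage-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: relMSE < 1 means the forecast beats the naive benchmark.\n",
"y_train = torch.tensor([[ 1.,  2.,  3.,  4.],\n",
"                        [10., 20., 30., 40.]])   # [B=2, input_size=4]\n",
"y       = torch.tensor([[ 5.,  6.], [50., 60.]]) # [B=2, H=2] actuals\n",
"y_hat   = torch.tensor([[ 5.,  5.], [45., 55.]]) # [B=2, H=2] forecasts\n",
"\n",
"rel_mse = relMSE(y_train=y_train)\n",
"print(rel_mse(y=y, y_hat=y_hat))                 # MSE(y, y_hat) / MSE(y, naive)"
]
},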
{
"attachments": {},
"cell_type": "markdown",
"id": "c828438e",
"metadata": {},
"source": [
"# 4. Probabilistic Errors\n",
"\n",
"These methods use statistical approaches for estimating unknown probability distributions using observed data. \n",
"\n",
"Maximum likelihood estimation involves finding the parameter values that maximize the likelihood function, which measures the probability of obtaining the observed data given the parameter values. MLE has good theoretical properties and efficiency under certain satisfied assumptions.\n",
"\n",
"On the non-parametric approach, quantile regression measures non-symmetrically deviation, producing under/over estimation."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "999d8cb2",
"metadata": {},
"source": [
"## Quantile Loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd296fcb",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class QuantileLoss(BasePointLoss):\n",
" \"\"\" Quantile Loss\n",
"\n",
" Computes the quantile loss between `y` and `y_hat`.\n",
" QL measures the deviation of a quantile forecast.\n",
" By weighting the absolute deviation in a non symmetric way, the\n",
" loss pays more attention to under or over estimation.\n",
" A common value for q is 0.5 for the deviation from the median (Pinball loss).\n",
"\n",
" $$ \\mathrm{QL}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{(q)}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} \\Big( (1-q)\\,( \\hat{y}^{(q)}_{\\\\tau} - y_{\\\\tau} )_{+} + q\\,( y_{\\\\tau} - \\hat{y}^{(q)}_{\\\\tau} )_{+} \\Big) $$\n",
"\n",
" **Parameters:**<br>\n",
" `q`: float, between 0 and 1. The slope of the quantile loss, in the context of quantile regression, the q determines the conditional quantile level.<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
"\n",
" **References:**<br>\n",
" [Roger Koenker and Gilbert Bassett, Jr., \"Regression Quantiles\".](https://www.jstor.org/stable/1913643)\n",
" \"\"\"\n",
" def __init__(self, q, horizon_weight=None):\n",
" super(QuantileLoss, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[f'_ql{q}'])\n",
" self.q = q\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies datapoints to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `quantile_loss`: tensor (single value).\n",
" \"\"\"\n",
" delta_y = y - y_hat\n",
" losses = torch.max(torch.mul(self.q, delta_y), torch.mul((self.q - 1), delta_y))\n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" return _weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70bd46d9",
"metadata": {},
"outputs": [],
"source": [
"show_doc(QuantileLoss, name='QuantileLoss.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b1588e9",
"metadata": {},
"outputs": [],
"source": [
"show_doc(QuantileLoss.__call__, name='QuantileLoss.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "51ac874f",
"metadata": {},
"source": [
"![](imgs_losses/q_loss.png)"
]
},
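{
"attachments": {},
"cell_type": "markdown",
"id": "qloss-usage-md",
"metadata": {},
"source": [
"A toy, hypothetical comparison of two quantile levels on the same under-predicting forecast: with q=0.9 the loss penalizes under-prediction (`y_hat` below `y`) more heavily than with q=0.5. It assumes `torch` is imported in earlier cells."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "qloss-usage-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: higher q penalizes under-prediction more heavily.\n",
"y     = torch.tensor([[10., 10., 10.]])  # [B=1, H=3] actuals\n",
"y_hat = torch.tensor([[ 8.,  9., 10.]])  # [B=1, H=3] under-predicting forecasts\n",
"\n",
"print(QuantileLoss(q=0.5)(y=y, y_hat=y_hat))  # pinball loss at the median\n",
"print(QuantileLoss(q=0.9)(y=y, y_hat=y_hat))  # larger, under-prediction is costlier"
]
},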
{
"attachments": {},
"cell_type": "markdown",
"id": "92dbb002",
"metadata": {},
"source": [
"## Multi Quantile Loss (MQLoss)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "291a0530",
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"def level_to_outputs(level):\n",
" qs = sum([[50-l/2, 50+l/2] for l in level], [])\n",
" output_names = sum([[f'-lo-{l}', f'-hi-{l}'] for l in level], [])\n",
"\n",
" sort_idx = np.argsort(qs)\n",
" quantiles = np.array(qs)[sort_idx]\n",
"\n",
" # Add default median\n",
" quantiles = np.concatenate([np.array([50]), quantiles])\n",
" quantiles = torch.Tensor(quantiles) / 100\n",
" output_names = list(np.array(output_names)[sort_idx])\n",
" output_names.insert(0, '-median')\n",
" \n",
" return quantiles, output_names\n",
"\n",
"def quantiles_to_outputs(quantiles):\n",
" output_names = []\n",
" for q in quantiles:\n",
" if q<.50:\n",
" output_names.append(f'-lo-{np.round(100-200*q,2)}')\n",
" elif q>.50:\n",
" output_names.append(f'-hi-{np.round(100-200*(1-q),2)}')\n",
" else:\n",
" output_names.append('-median')\n",
" return quantiles, output_names"
]
},
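{
"attachments": {},
"cell_type": "markdown",
"id": "level-outputs-md",
"metadata": {},
"source": [
"A quick, hypothetical illustration of the two helpers above: `level_to_outputs` converts prediction-interval levels into sorted quantiles (with the median prepended), while `quantiles_to_outputs` builds the matching output-name suffixes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "level-outputs-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sanity check of the helper mappings defined above.\n",
"quantiles, names = level_to_outputs([80, 90])\n",
"print(quantiles)  # 5 quantiles in [0, 1]: the median plus lo/hi bounds for 80% and 90%\n",
"print(names)      # matching output-name suffixes ('-median', '-lo-90', ..., '-hi-90')\n",
"\n",
"print(quantiles_to_outputs([0.1, 0.5, 0.9]))  # lo/median/hi suffixes per quantile"
]
},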
{
"cell_type": "code",
"execution_count": null,
"id": "21dc7968",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class MQLoss(BasePointLoss):\n",
" \"\"\" Multi-Quantile loss\n",
"\n",
" Calculates the Multi-Quantile loss (MQL) between `y` and `y_hat`.\n",
" MQL calculates the average multi-quantile Loss for\n",
" a given set of quantiles, based on the absolute \n",
" difference between predicted quantiles and observed values.\n",
" \n",
" $$ \\mathrm{MQL}(\\\\mathbf{y}_{\\\\tau},[\\\\mathbf{\\hat{y}}^{(q_{1})}_{\\\\tau}, ... ,\\hat{y}^{(q_{n})}_{\\\\tau}]) = \\\\frac{1}{n} \\\\sum_{q_{i}} \\mathrm{QL}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{(q_{i})}_{\\\\tau}) $$\n",
" \n",
" The limit behavior of MQL allows to measure the accuracy \n",
" of a full predictive distribution $\\mathbf{\\hat{F}}_{\\\\tau}$ with \n",
" the continuous ranked probability score (CRPS). This can be achieved \n",
" through a numerical integration technique, that discretizes the quantiles \n",
" and treats the CRPS integral with a left Riemann approximation, averaging over \n",
" uniformly distanced quantiles. \n",
" \n",
" $$ \\mathrm{CRPS}(y_{\\\\tau}, \\mathbf{\\hat{F}}_{\\\\tau}) = \\int^{1}_{0} \\mathrm{QL}(y_{\\\\tau}, \\hat{y}^{(q)}_{\\\\tau}) dq $$\n",
"\n",
" **Parameters:**<br>\n",
" `level`: int list [0,100]. Probability levels for prediction intervals (Defaults median).\n",
" `quantiles`: float list [0., 1.]. Alternative to level, quantiles to estimate from y distribution.\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
"\n",
" **References:**<br>\n",
" [Roger Koenker and Gilbert Bassett, Jr., \"Regression Quantiles\".](https://www.jstor.org/stable/1913643)<br>\n",
" [James E. Matheson and Robert L. Winkler, \"Scoring Rules for Continuous Probability Distributions\".](https://www.jstor.org/stable/2629907)\n",
" \"\"\"\n",
" def __init__(self, level=[80, 90], quantiles=None, horizon_weight=None):\n",
"\n",
" qs, output_names = level_to_outputs(level)\n",
" qs = torch.Tensor(qs)\n",
" # Transform quantiles to homogeneus output names\n",
" if quantiles is not None:\n",
" _, output_names = quantiles_to_outputs(quantiles)\n",
" qs = torch.Tensor(quantiles)\n",
"\n",
" super(MQLoss, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=len(qs),\n",
" output_names=output_names)\n",
" \n",
" self.quantiles = torch.nn.Parameter(qs, requires_grad=False)\n",
"\n",
" def domain_map(self, y_hat: torch.Tensor):\n",
" \"\"\"\n",
" Identity domain map [B,T,H,Q]/[B,H,Q]\n",
" \"\"\"\n",
" return y_hat\n",
" \n",
" def _compute_weights(self, y, mask):\n",
" \"\"\"\n",
" Compute final weights for each datapoint (based on all weights and all masks)\n",
" Set horizon_weight to a ones[H] tensor if not set.\n",
" If set, check that it has the same length as the horizon in x.\n",
" \"\"\"\n",
" if mask is None:\n",
" mask = torch.ones_like(y, device=y.device)\n",
" else:\n",
" mask = mask.unsqueeze(1) # Add Q dimension.\n",
"\n",
" if self.horizon_weight is None:\n",
" self.horizon_weight = torch.ones(mask.shape[-1])\n",
" else:\n",
" assert mask.shape[-1] == len(self.horizon_weight), \\\n",
" 'horizon_weight must have same length as Y'\n",
" \n",
" weights = self.horizon_weight.clone()\n",
" weights = torch.ones_like(mask, device=mask.device) * weights.to(mask.device)\n",
" return weights * mask\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `mqloss`: tensor (single value).\n",
" \"\"\"\n",
" \n",
" error = y_hat - y.unsqueeze(-1)\n",
" sq = torch.maximum(-error, torch.zeros_like(error))\n",
" s1_q = torch.maximum(error, torch.zeros_like(error))\n",
" losses = (1/len(self.quantiles))*(self.quantiles * sq + (1 - self.quantiles) * s1_q)\n",
"\n",
" if y_hat.ndim == 3: # BaseWindows\n",
" losses = losses.swapaxes(-2,-1) # [B,H,Q] -> [B,Q,H] (needed for horizon weighting, H at the end)\n",
" elif y_hat.ndim == 4: # BaseRecurrent\n",
" losses = losses.swapaxes(-2,-1)\n",
" losses = losses.swapaxes(-2,-3) # [B,seq_len,H,Q] -> [B,Q,seq_len,H] (needed for horizon weighting, H at the end)\n",
"\n",
" weights = self._compute_weights(y=losses, mask=mask) # Use losses for extra dim\n",
" # NOTE: Weights do not have Q dimension.\n",
"\n",
" return _weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f42ec82",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MQLoss, name='MQLoss.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bac2237a",
"metadata": {},
"outputs": [],
"source": [
"show_doc(MQLoss.__call__, name='MQLoss.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "33b66b0e",
"metadata": {},
"source": [
"![](imgs_losses/mq_loss.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da37f2ef",
"metadata": {},
"outputs": [],
"source": [
"# | hide\n",
"# Unit tests to check MQLoss' stored quantiles\n",
"# attribute is correctly instantiated\n",
"check = MQLoss(level=[80, 90])\n",
"test_eq(len(check.quantiles), 5)\n",
"\n",
"check = MQLoss(quantiles=[0.0100, 0.1000, 0.5, 0.9000, 0.9900])\n",
"print(check.output_names)\n",
"print(check.quantiles)\n",
"test_eq(len(check.quantiles), 5)\n",
"\n",
"check = MQLoss(quantiles=[0.0100, 0.1000, 0.9000, 0.9900])\n",
"test_eq(len(check.quantiles), 4)"
]
},
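{
"attachments": {},
"cell_type": "markdown",
"id": "mqloss-usage-md",
"metadata": {},
"source": [
"A minimal, hypothetical `MQLoss` evaluation on made-up data: `y_hat` carries one extra trailing dimension with the predicted quantiles (Q), matching the `quantiles` passed to the constructor. It assumes `torch` is imported in earlier cells."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "mqloss-usage-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: MQLoss expects y_hat with a trailing quantile dimension.\n",
"mql   = MQLoss(quantiles=[0.1, 0.5, 0.9])\n",
"y     = torch.tensor([[10., 12.]])             # [B=1, H=2] actuals\n",
"y_hat = torch.tensor([[[ 8., 10., 12.],\n",
"                       [10., 12., 14.]]])      # [B=1, H=2, Q=3] quantile forecasts\n",
"\n",
"print(mql(y=y, y_hat=y_hat))                   # averages the pinball losses over Q"
]
},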
{
"attachments": {},
"cell_type": "markdown",
"id": "895ec0c0",
"metadata": {},
"source": [
"## DistributionLoss"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "801785b7",
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"def weighted_average(x: torch.Tensor, \n",
" weights: Optional[torch.Tensor]=None, dim=None) -> torch.Tensor:\n",
" \"\"\"\n",
" Computes the weighted average of a given tensor across a given dim, masking\n",
" values associated with weight zero,\n",
" meaning instead of `nan * 0 = nan` you will get `0 * 0 = 0`.\n",
"\n",
" **Parameters:**<br>\n",
" `x`: Input tensor, of which the average must be computed.<br>\n",
" `weights`: Weights tensor, of the same shape as `x`.<br>\n",
" `dim`: The dim along which to average `x`.<br>\n",
"\n",
" **Returns:**<br>\n",
" `Tensor`: The tensor with values averaged along the specified `dim`.<br>\n",
" \"\"\"\n",
" if weights is not None:\n",
" weighted_tensor = torch.where(\n",
" weights != 0, x * weights, torch.zeros_like(x)\n",
" )\n",
" sum_weights = torch.clamp(\n",
" weights.sum(dim=dim) if dim else weights.sum(), min=1.0\n",
" )\n",
" return (\n",
" weighted_tensor.sum(dim=dim) if dim else weighted_tensor.sum()\n",
" ) / sum_weights\n",
" else:\n",
" return x.mean(dim=dim)"
]
},
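{
"attachments": {},
"cell_type": "markdown",
"id": "weighted-average-md",
"metadata": {},
"source": [
"A short, hypothetical check of `weighted_average`: entries with zero weight are masked out before averaging, so even `nan` values paired with zero weight do not propagate."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "weighted-average-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical check: zero-weight entries (even nan) are excluded from the average.\n",
"x = torch.tensor([[1., 2., float('nan')],\n",
"                  [4., 4., 4.]])\n",
"w = torch.tensor([[1., 1., 0.],\n",
"                  [1., 1., 1.]])\n",
"print(weighted_average(x, weights=w, dim=1))  # tensor([1.5000, 4.0000])"
]
},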
{
"cell_type": "code",
"execution_count": null,
"id": "83b90c8a",
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"def bernoulli_domain_map(input: torch.Tensor):\n",
" \"\"\" Bernoulli Domain Map\n",
" Maps input into distribution constraints, by construction input's \n",
" last dimension is of matching `distr_args` length.\n",
"\n",
" **Parameters:**<br>\n",
" `input`: tensor, of dimensions [B,T,H,theta] or [B,H,theta].<br>\n",
"\n",
" **Returns:**<br>\n",
" `(probs,)`: tuple with tensors of Poisson distribution arguments.<br>\n",
" \"\"\"\n",
" return (input.squeeze(-1),)\n",
"\n",
"def bernoulli_scale_decouple(output, loc=None, scale=None):\n",
" \"\"\" Bernoulli Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning residual\n",
" variance and residual location based on anchoring `loc`, `scale`.\n",
" Also adds Bernoulli domain protection to the distribution parameters.\n",
" \"\"\"\n",
" probs = output[0]\n",
" #if (loc is not None) and (scale is not None):\n",
" # rate = (rate * scale) + loc\n",
" probs = F.sigmoid(probs)#.clone()\n",
" return (probs,)\n",
"\n",
"def student_domain_map(input: torch.Tensor):\n",
" \"\"\" Student T Domain Map\n",
" Maps input into distribution constraints, by construction input's \n",
" last dimension is of matching `distr_args` length.\n",
"\n",
" **Parameters:**<br>\n",
" `input`: tensor, of dimensions [B,T,H,theta] or [B,H,theta].<br>\n",
" `eps`: float, helps the initialization of scale for easier optimization.<br>\n",
"\n",
" **Returns:**<br>\n",
" `(df, loc, scale)`: tuple with tensors of StudentT distribution arguments.<br>\n",
" \"\"\"\n",
" df, loc, scale = torch.tensor_split(input, 3, dim=-1)\n",
" return df.squeeze(-1), loc.squeeze(-1), scale.squeeze(-1)\n",
"\n",
"def student_scale_decouple(output, loc=None, scale=None, eps: float=0.1):\n",
" \"\"\" Normal Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning residual\n",
" variance and residual location based on anchoring `loc`, `scale`.\n",
" Also adds StudentT domain protection to the distribution parameters.\n",
" \"\"\"\n",
" df, mean, tscale = output\n",
" tscale = F.softplus(tscale)\n",
" if (loc is not None) and (scale is not None):\n",
" mean = (mean * scale) + loc\n",
" tscale = (tscale + eps) * scale\n",
" df = 2.0 + F.softplus(df)\n",
" return (df, mean, tscale)\n",
"\n",
"def normal_domain_map(input: torch.Tensor):\n",
" \"\"\" Normal Domain Map\n",
" Maps input into distribution constraints, by construction input's \n",
" last dimension is of matching `distr_args` length.\n",
"\n",
" **Parameters:**<br>\n",
" `input`: tensor, of dimensions [B,T,H,theta] or [B,H,theta].<br>\n",
" `eps`: float, helps the initialization of scale for easier optimization.<br>\n",
"\n",
" **Returns:**<br>\n",
" `(mean, std)`: tuple with tensors of Normal distribution arguments.<br>\n",
" \"\"\"\n",
" mean, std = torch.tensor_split(input, 2, dim=-1)\n",
" return mean.squeeze(-1), std.squeeze(-1)\n",
"\n",
"def normal_scale_decouple(output, loc=None, scale=None, eps: float=0.2):\n",
" \"\"\" Normal Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning residual\n",
" variance and residual location based on anchoring `loc`, `scale`.\n",
" Also adds Normal domain protection to the distribution parameters.\n",
" \"\"\"\n",
" mean, std = output\n",
" std = F.softplus(std)\n",
" if (loc is not None) and (scale is not None):\n",
" mean = (mean * scale) + loc\n",
" std = (std + eps) * scale\n",
" return (mean, std)\n",
"\n",
"def poisson_domain_map(input: torch.Tensor):\n",
" \"\"\" Poisson Domain Map\n",
" Maps input into distribution constraints, by construction input's \n",
" last dimension is of matching `distr_args` length.\n",
"\n",
" **Parameters:**<br>\n",
" `input`: tensor, of dimensions [B,T,H,theta] or [B,H,theta].<br>\n",
"\n",
" **Returns:**<br>\n",
" `(rate,)`: tuple with tensors of Poisson distribution arguments.<br>\n",
" \"\"\"\n",
" return (input.squeeze(-1),)\n",
"\n",
"def poisson_scale_decouple(output, loc=None, scale=None):\n",
" \"\"\" Poisson Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning residual\n",
" variance and residual location based on anchoring `loc`, `scale`.\n",
" Also adds Poisson domain protection to the distribution parameters.\n",
" \"\"\"\n",
" eps = 1e-10\n",
" rate = output[0]\n",
" if (loc is not None) and (scale is not None):\n",
" rate = (rate * scale) + loc\n",
" rate = F.softplus(rate) + eps\n",
" return (rate,)\n",
"\n",
"def nbinomial_domain_map(input: torch.Tensor):\n",
" \"\"\" Negative Binomial Domain Map\n",
" Maps input into distribution constraints, by construction input's \n",
" last dimension is of matching `distr_args` length.\n",
"\n",
" **Parameters:**<br>\n",
" `input`: tensor, of dimensions [B,T,H,theta] or [B,H,theta].<br>\n",
"\n",
" **Returns:**<br>\n",
" `(total_count, alpha)`: tuple with tensors of N.Binomial distribution arguments.<br>\n",
" \"\"\"\n",
" mu, alpha = torch.tensor_split(input, 2, dim=-1)\n",
" return mu.squeeze(-1), alpha.squeeze(-1)\n",
"\n",
"def nbinomial_scale_decouple(output, loc=None, scale=None):\n",
" \"\"\" Negative Binomial Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning total\n",
" count and logits based on anchoring `loc`, `scale`.\n",
" Also adds Negative Binomial domain protection to the distribution parameters.\n",
" \"\"\"\n",
" mu, alpha = output\n",
" mu = F.softplus(mu) + 1e-8\n",
" alpha = F.softplus(alpha) + 1e-8 # alpha = 1/total_counts\n",
" if (loc is not None) and (scale is not None):\n",
" mu *= loc\n",
" alpha /= (loc + 1.)\n",
"\n",
" # mu = total_count * (probs/(1-probs))\n",
" # => probs = mu / (total_count + mu)\n",
" # => probs = mu / [total_count * (1 + mu * (1/total_count))]\n",
" total_count = 1.0 / alpha\n",
" probs = (mu * alpha / (1.0 + mu * alpha)) + 1e-8\n",
" return (total_count, probs)"
]
},
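{
"attachments": {},
"cell_type": "markdown",
"id": "normal-map-md",
"metadata": {},
"source": [
"A hypothetical walk-through of the Normal pair above: `normal_domain_map` splits a raw network output of shape [B,H,2] into `(mean, std)`, and `normal_scale_decouple` re-anchors them with a location/scale (e.g. from a temporal scaler) while keeping `std` positive via softplus. The `loc`/`scale` values below are made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "normal-map-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example of the Normal domain map and scale decoupling.\n",
"raw = torch.randn(1, 3, 2)                      # [B=1, H=3, theta=2] raw network output\n",
"mean, std = normal_domain_map(raw)              # split last dim -> two [B, H] tensors\n",
"print(mean.shape, std.shape)\n",
"\n",
"loc, scale = torch.tensor([[100.]]), torch.tensor([[10.]])  # made-up anchors\n",
"mean, std = normal_scale_decouple((mean, std), loc=loc, scale=scale)\n",
"print(bool((std > 0).all()))                    # softplus + eps keeps std positive"
]
},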
{
"cell_type": "code",
"execution_count": null,
"id": "03294edd",
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"def est_lambda(mu, rho):\n",
" return mu ** (2 - rho) / (2 - rho)\n",
"\n",
"def est_alpha(rho):\n",
" return (2 - rho) / (rho - 1)\n",
"\n",
"def est_beta(mu, rho):\n",
" return mu ** (1 - rho) / (rho - 1)\n",
"\n",
"\n",
"class Tweedie(Distribution):\n",
" \"\"\" Tweedie Distribution\n",
"\n",
" The Tweedie distribution is a compound probability, special case of exponential\n",
" dispersion models EDMs defined by its mean-variance relationship.\n",
" The distribution particularly useful to model sparse series as the probability has\n",
" possitive mass at zero but otherwise is continuous.\n",
"\n",
" $Y \\sim \\mathrm{ED}(\\\\mu,\\\\sigma^{2}) \\qquad\n",
" \\mathbb{P}(y|\\\\mu ,\\\\sigma^{2})=h(\\\\sigma^{2},y) \\\\exp \\\\left({\\\\frac {\\\\theta y-A(\\\\theta )}{\\\\sigma^{2}}}\\\\right)$<br>\n",
" \n",
" $\\mu =A'(\\\\theta ) \\qquad \\mathrm{Var}(Y) = \\\\sigma^{2} \\\\mu^{\\\\rho}$\n",
" \n",
" Cases of the variance relationship include Normal (`rho` = 0), Poisson (`rho` = 1),\n",
" Gamma (`rho` = 2), inverse Gaussian (`rho` = 3).\n",
"\n",
" **Parameters:**<br>\n",
" `log_mu`: tensor, with log of means.<br>\n",
" `rho`: float, Tweedie variance power (1,2). Fixed across all observations.<br>\n",
" `sigma2`: tensor, Tweedie variance. Currently fixed in 1.<br>\n",
"\n",
" **References:**<br>\n",
" - [Tweedie, M. C. K. (1984). An index which distinguishes between some important exponential families. Statistics: Applications and New Directions. \n",
" Proceedings of the Indian Statistical Institute Golden Jubilee International Conference (Eds. J. K. Ghosh and J. Roy), pp. 579-604. Calcutta: Indian Statistical Institute.]()<br>\n",
" - [Jorgensen, B. (1987). Exponential Dispersion Models. Journal of the Royal Statistical Society. \n",
" Series B (Methodological), 49(2), 127–162. http://www.jstor.org/stable/2345415](http://www.jstor.org/stable/2345415)<br>\n",
" \"\"\"\n",
" def __init__(self, log_mu, rho, validate_args=None):\n",
" # TODO: add sigma2 dispersion\n",
" # TODO add constraints\n",
" # arg_constraints = {'log_mu': constraints.real, 'rho': constraints.positive}\n",
" # support = constraints.real\n",
" self.log_mu = log_mu\n",
" self.rho = rho\n",
" assert rho>1 and rho<2, f'rho={rho} parameter needs to be between (1,2).'\n",
"\n",
" batch_shape = log_mu.size()\n",
" super(Tweedie, self).__init__(batch_shape, validate_args=validate_args)\n",
"\n",
" @property\n",
" def mean(self):\n",
" return torch.exp(self.log_mu)\n",
"\n",
" @property\n",
" def variance(self):\n",
" return torch.ones_line(self.log_mu) #TODO need to be assigned\n",
"\n",
" def sample(self, sample_shape=torch.Size()):\n",
" shape = self._extended_shape(sample_shape)\n",
" with torch.no_grad():\n",
" mu = self.mean\n",
" rho = self.rho * torch.ones_like(mu)\n",
" sigma2 = 1 #TODO\n",
"\n",
" rate = est_lambda(mu, rho) / sigma2 # rate for poisson\n",
" alpha = est_alpha(rho) # alpha for Gamma distribution\n",
" beta = est_beta(mu, rho) / sigma2 # beta for Gamma distribution\n",
" \n",
" # Expand for sample\n",
" rate = rate.expand(shape)\n",
" alpha = alpha.expand(shape)\n",
" beta = beta.expand(shape)\n",
"\n",
" N = torch.poisson(rate)\n",
" gamma = torch.distributions.gamma.Gamma(N*alpha, beta)\n",
" samples = gamma.sample()\n",
" samples[N==0] = 0\n",
"\n",
" return samples\n",
"\n",
" def log_prob(self, y_true):\n",
" rho = self.rho\n",
" y_pred = self.log_mu\n",
"\n",
" a = y_true * torch.exp((1 - rho) * y_pred) / (1 - rho)\n",
" b = torch.exp((2 - rho) * y_pred) / (2 - rho)\n",
"\n",
" return a - b\n",
"\n",
"def tweedie_domain_map(input: torch.Tensor):\n",
" \"\"\" Tweedie Domain Map\n",
" Maps input into distribution constraints, by construction input's \n",
" last dimension is of matching `distr_args` length.\n",
"\n",
" **Parameters:**<br>\n",
" `input`: tensor, of dimensions [B,T,H,theta] or [B,H,theta].<br>\n",
"\n",
" **Returns:**<br>\n",
" `(log_mu,)`: tuple with tensors of Tweedie distribution arguments.<br>\n",
" \"\"\"\n",
" # log_mu, probs = torch.tensor_split(input, 2, dim=-1)\n",
" return (input.squeeze(-1),)\n",
"\n",
"def tweedie_scale_decouple(output, loc=None, scale=None):\n",
" \"\"\" Tweedie Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning total\n",
" count and logits based on anchoring `loc`, `scale`.\n",
" Also adds Tweedie domain protection to the distribution parameters.\n",
" \"\"\"\n",
" log_mu = output[0]\n",
" if (loc is not None) and (scale is not None):\n",
" log_mu += torch.log(loc) # TODO : rho scaling\n",
" return (log_mu,)"
]
},
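{
"attachments": {},
"cell_type": "markdown",
"id": "tweedie-usage-md",
"metadata": {},
"source": [
"A small, hypothetical smoke test of the `Tweedie` distribution above: its `mean` is `exp(log_mu)` and `log_prob` returns the quasi log-likelihood terms used as the training objective (terms independent of the parameters are dropped). The values are made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "tweedie-usage-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical smoke test of the Tweedie quasi log-likelihood.\n",
"log_mu = torch.log(torch.tensor([[5., 10., 20.]]))      # [B=1, H=3]\n",
"tw = Tweedie(log_mu=log_mu, rho=1.5, validate_args=False)  # variance power in (1,2)\n",
"\n",
"print(tw.mean)                                          # exp(log_mu)\n",
"print(tw.log_prob(torch.tensor([[0., 5., 30.]])))       # quasi log-likelihood terms"
]
},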
{
"cell_type": "code",
"execution_count": null,
"id": "5931f6c6",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class DistributionLoss(torch.nn.Module):\n",
" \"\"\" DistributionLoss\n",
"\n",
" This PyTorch module wraps the `torch.distribution` classes allowing it to \n",
" interact with NeuralForecast models modularly. It shares the negative \n",
" log-likelihood as the optimization objective and a sample method to \n",
" generate empirically the quantiles defined by the `level` list.\n",
"\n",
" Additionally, it implements a distribution transformation that factorizes the\n",
" scale-dependent likelihood parameters into a base scale and a multiplier \n",
" efficiently learnable within the network's non-linearities operating ranges.\n",
"\n",
" Available distributions:<br>\n",
" - Poisson<br>\n",
" - Normal<br>\n",
" - StudentT<br>\n",
" - NegativeBinomial<br>\n",
" - Tweedie<br>\n",
" - Bernoulli (Temporal Classifiers)\n",
"\n",
" **Parameters:**<br>\n",
" `distribution`: str, identifier of a torch.distributions.Distribution class.<br>\n",
" `level`: float list [0,100], confidence levels for prediction intervals.<br>\n",
" `quantiles`: float list [0,1], alternative to level list, target quantiles.<br>\n",
" `num_samples`: int=500, number of samples for the empirical quantiles.<br>\n",
" `return_params`: bool=False, wether or not return the Distribution parameters.<br><br>\n",
"\n",
" **References:**<br>\n",
" - [PyTorch Probability Distributions Package: StudentT.](https://pytorch.org/docs/stable/distributions.html#studentt)<br>\n",
" - [David Salinas, Valentin Flunkert, Jan Gasthaus, Tim Januschowski (2020).\n",
" \"DeepAR: Probabilistic forecasting with autoregressive recurrent networks\". International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207019301888)<br>\n",
" \"\"\"\n",
" def __init__(self, distribution, level=[80, 90], quantiles=None,\n",
" num_samples=1000, return_params=False, **distribution_kwargs):\n",
" super(DistributionLoss, self).__init__()\n",
"\n",
" available_distributions = dict(\n",
" Bernoulli=Bernoulli,\n",
" Normal=Normal,\n",
" Poisson=Poisson,\n",
" StudentT=StudentT,\n",
" NegativeBinomial=NegativeBinomial,\n",
" Tweedie=Tweedie)\n",
" domain_maps = dict(Bernoulli=bernoulli_domain_map,\n",
" Normal=normal_domain_map,\n",
" Poisson=poisson_domain_map,\n",
" StudentT=student_domain_map,\n",
" NegativeBinomial=nbinomial_domain_map,\n",
" Tweedie=tweedie_domain_map)\n",
" scale_decouples = dict(\n",
" Bernoulli=bernoulli_scale_decouple,\n",
" Normal=normal_scale_decouple,\n",
" Poisson=poisson_scale_decouple,\n",
" StudentT=student_scale_decouple,\n",
" NegativeBinomial=nbinomial_scale_decouple,\n",
" Tweedie=tweedie_scale_decouple)\n",
" param_names = dict(Bernoulli=[\"-logits\"],\n",
" Normal=[\"-loc\", \"-scale\"],\n",
" Poisson=[\"-loc\"],\n",
" StudentT=[\"-df\", \"-loc\", \"-scale\"],\n",
" NegativeBinomial=[\"-total_count\", \"-logits\"],\n",
" Tweedie=[\"-log_mu\"])\n",
" assert (distribution in available_distributions.keys()), f'{distribution} not available'\n",
"\n",
" self.distribution = distribution\n",
" self._base_distribution = available_distributions[distribution]\n",
" self.domain_map = domain_maps[distribution]\n",
" self.scale_decouple = scale_decouples[distribution]\n",
" self.param_names = param_names[distribution]\n",
"\n",
" self.distribution_kwargs = distribution_kwargs\n",
"\n",
" qs, self.output_names = level_to_outputs(level)\n",
" qs = torch.Tensor(qs)\n",
"\n",
" # Transform quantiles to homogeneus output names\n",
" if quantiles is not None:\n",
" _, self.output_names = quantiles_to_outputs(quantiles)\n",
" qs = torch.Tensor(quantiles)\n",
" self.quantiles = torch.nn.Parameter(qs, requires_grad=False)\n",
" self.num_samples = num_samples\n",
"\n",
" # If True, predict_step will return Distribution's parameters\n",
" self.return_params = return_params\n",
" if self.return_params:\n",
" self.output_names = self.output_names + self.param_names\n",
"\n",
" # Add first output entry for the sample_mean\n",
" self.output_names.insert(0, \"\")\n",
"\n",
" self.outputsize_multiplier = len(self.param_names)\n",
" self.is_distribution_output = True\n",
"\n",
" def get_distribution(self, distr_args, **distribution_kwargs) -> Distribution:\n",
" \"\"\"\n",
" Construct the associated Pytorch Distribution, given the collection of\n",
" constructor arguments and, optionally, location and scale tensors.\n",
"\n",
" **Parameters**<br>\n",
" `distr_args`: Constructor arguments for the underlying Distribution type.<br>\n",
"\n",
" **Returns**<br>\n",
" `Distribution`: AffineTransformed distribution.<br>\n",
" \"\"\"\n",
" # TransformedDistribution(distr, [AffineTransform(loc=loc, scale=scale)])\n",
" distr = self._base_distribution(*distr_args, **distribution_kwargs)\n",
" \n",
" if self.distribution =='Poisson':\n",
" distr.support = constraints.nonnegative\n",
" return distr\n",
"\n",
" def sample(self,\n",
" distr_args: torch.Tensor,\n",
" num_samples: Optional[int] = None):\n",
" \"\"\"\n",
" Construct the empirical quantiles from the estimated Distribution,\n",
" sampling from it `num_samples` independently.\n",
"\n",
" **Parameters**<br>\n",
" `distr_args`: Constructor arguments for the underlying Distribution type.<br>\n",
" `loc`: Optional tensor, of the same shape as the batch_shape + event_shape\n",
" of the resulting distribution.<br>\n",
" `scale`: Optional tensor, of the same shape as the batch_shape+event_shape \n",
" of the resulting distribution.<br>\n",
" `num_samples`: int=500, overwrite number of samples for the empirical quantiles.<br>\n",
"\n",
" **Returns**<br>\n",
" `samples`: tensor, shape [B,H,`num_samples`].<br>\n",
" `quantiles`: tensor, empirical quantiles defined by `levels`.<br>\n",
" \"\"\"\n",
" if num_samples is None:\n",
" num_samples = self.num_samples\n",
"\n",
" B, H = distr_args[0].size()\n",
" Q = len(self.quantiles)\n",
"\n",
" # Instantiate Scaled Decoupled Distribution\n",
" distr = self.get_distribution(distr_args=distr_args, **self.distribution_kwargs)\n",
" samples = distr.sample(sample_shape=(num_samples,))\n",
" samples = samples.permute(1,2,0) # [samples,B,H] -> [B,H,samples]\n",
" samples = samples.to(distr_args[0].device)\n",
" samples = samples.view(B*H, num_samples)\n",
" sample_mean = torch.mean(samples, dim=-1)\n",
"\n",
" # Compute quantiles\n",
" quantiles_device = self.quantiles.to(distr_args[0].device)\n",
" quants = torch.quantile(input=samples, \n",
" q=quantiles_device, dim=1)\n",
" quants = quants.permute((1,0)) # [Q, B*H] -> [B*H, Q]\n",
"\n",
" # Final reshapes\n",
" samples = samples.view(B, H, num_samples)\n",
" sample_mean = sample_mean.view(B, H, 1)\n",
" quants = quants.view(B, H, Q)\n",
"\n",
" return samples, sample_mean, quants\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" distr_args: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" Computes the negative log-likelihood objective function. \n",
" To estimate the following predictive distribution:\n",
"\n",
" $$\\mathrm{P}(\\mathbf{y}_{\\\\tau}\\,|\\,\\\\theta) \\\\quad \\mathrm{and} \\\\quad -\\log(\\mathrm{P}(\\mathbf{y}_{\\\\tau}\\,|\\,\\\\theta))$$\n",
"\n",
" where $\\\\theta$ represents the distributions parameters. It aditionally \n",
" summarizes the objective signal using a weighted average using the `mask` tensor. \n",
"\n",
" **Parameters**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `distr_args`: Constructor arguments for the underlying Distribution type.<br>\n",
" `loc`: Optional tensor, of the same shape as the batch_shape + event_shape\n",
" of the resulting distribution.<br>\n",
" `scale`: Optional tensor, of the same shape as the batch_shape+event_shape \n",
" of the resulting distribution.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns**<br>\n",
" `loss`: scalar, weighted loss function against which backpropagation will be performed.<br>\n",
" \"\"\"\n",
" # Instantiate Scaled Decoupled Distribution\n",
" distr = self.get_distribution(distr_args=distr_args, **self.distribution_kwargs)\n",
" loss_values = -distr.log_prob(y)\n",
" loss_weights = mask\n",
" return weighted_average(loss_values, weights=loss_weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a462101b",
"metadata": {},
"outputs": [],
"source": [
"show_doc(DistributionLoss, name='DistributionLoss.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8c367f8",
"metadata": {},
"outputs": [],
"source": [
"show_doc(DistributionLoss.sample, name='DistributionLoss.sample', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "04e32679",
"metadata": {},
"outputs": [],
"source": [
"show_doc(DistributionLoss.__call__, name='DistributionLoss.__call__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14a7e381",
"metadata": {},
"outputs": [],
"source": [
"# | hide\n",
"# Unit tests to check DistributionLoss' stored quantiles\n",
"# attribute is correctly instantiated\n",
"check = DistributionLoss(distribution='Normal', level=[80, 90])\n",
"test_eq(len(check.quantiles), 5)\n",
"\n",
"check = DistributionLoss(distribution='Normal', \n",
" quantiles=[0.0100, 0.1000, 0.5, 0.9000, 0.9900])\n",
"print(check.output_names)\n",
"print(check.quantiles)\n",
"test_eq(len(check.quantiles), 5)\n",
"\n",
"check = DistributionLoss(distribution='Normal',\n",
" quantiles=[0.0100, 0.1000, 0.9000, 0.9900])\n",
"test_eq(len(check.quantiles), 4)"
]
},
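{
"attachments": {},
"cell_type": "markdown",
"id": "distloss-usage-md",
"metadata": {},
"source": [
"A hypothetical end-to-end sketch of how a model consumes `DistributionLoss('Normal')`: a raw head output of size `outputsize_multiplier` per step is mapped to distribution arguments with `domain_map` and `scale_decouple`, then used both to sample empirical quantiles and to evaluate the negative log-likelihood. The tensors are random stand-ins for real network outputs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "distloss-usage-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: from raw network output to samples, quantiles and NLL.\n",
"dloss = DistributionLoss(distribution='Normal', level=[80, 90], num_samples=2000)\n",
"\n",
"B, H = 2, 3\n",
"raw = torch.randn(B, H, dloss.outputsize_multiplier)       # [B, H, 2] raw head output\n",
"distr_args = dloss.domain_map(raw)                          # (mean, std), each [B, H]\n",
"distr_args = dloss.scale_decouple(output=distr_args)        # softplus keeps std positive\n",
"\n",
"samples, sample_mean, quants = dloss.sample(distr_args=distr_args)\n",
"print(samples.shape, sample_mean.shape, quants.shape)       # [B,H,2000], [B,H,1], [B,H,5]\n",
"\n",
"y = torch.zeros(B, H)\n",
"print(dloss(y=y, distr_args=distr_args, mask=torch.ones_like(y)))  # mean NLL"
]
},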
{
"attachments": {},
"cell_type": "markdown",
"id": "07f459b8",
"metadata": {},
"source": [
"## Poisson Mixture Mesh (PMM)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46ec688f",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class PMM(torch.nn.Module):\n",
" \"\"\" Poisson Mixture Mesh\n",
"\n",
" This Poisson Mixture statistical model assumes independence across groups of \n",
" data $\\mathcal{G}=\\{[g_{i}]\\}$, and estimates relationships within the group.\n",
"\n",
" $$ \\mathrm{P}\\\\left(\\mathbf{y}_{[b][t+1:t+H]}\\\\right) = \n",
" \\prod_{ [g_{i}] \\in \\mathcal{G}} \\mathrm{P} \\\\left(\\mathbf{y}_{[g_{i}][\\\\tau]} \\\\right) =\n",
" \\prod_{\\\\beta\\in[g_{i}]} \n",
" \\\\left(\\sum_{k=1}^{K} w_k \\prod_{(\\\\beta,\\\\tau) \\in [g_i][t+1:t+H]} \\mathrm{Poisson}(y_{\\\\beta,\\\\tau}, \\hat{\\\\lambda}_{\\\\beta,\\\\tau,k}) \\\\right)$$\n",
"\n",
" **Parameters:**<br>\n",
" `n_components`: int=10, the number of mixture components.<br>\n",
" `level`: float list [0,100], confidence levels for prediction intervals.<br>\n",
" `quantiles`: float list [0,1], alternative to level list, target quantiles.<br>\n",
" `return_params`: bool=False, wether or not return the Distribution parameters.<br>\n",
" `batch_correlation`: bool=False, wether or not model batch correlations.<br>\n",
" `horizon_correlation`: bool=False, wether or not model horizon correlations.<br>\n",
"\n",
" **References:**<br>\n",
" [Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. \n",
" Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures. Submitted to the International \n",
" Journal Forecasting, Working paper available at arxiv.](https://arxiv.org/pdf/2110.13179.pdf)\n",
" \"\"\"\n",
" def __init__(self, n_components=10, level=[80, 90], quantiles=None,\n",
" num_samples=1000, return_params=False,\n",
" batch_correlation=False, horizon_correlation=False):\n",
" super(PMM, self).__init__()\n",
" # Transform level to MQLoss parameters\n",
" qs, self.output_names = level_to_outputs(level)\n",
" qs = torch.Tensor(qs)\n",
"\n",
" # Transform quantiles to homogeneus output names\n",
" if quantiles is not None:\n",
" _, self.output_names = quantiles_to_outputs(quantiles)\n",
" qs = torch.Tensor(quantiles)\n",
" self.quantiles = torch.nn.Parameter(qs, requires_grad=False)\n",
" self.num_samples = num_samples\n",
" self.batch_correlation = batch_correlation\n",
" self.horizon_correlation = horizon_correlation\n",
"\n",
" # If True, predict_step will return Distribution's parameters\n",
" self.return_params = return_params\n",
" if self.return_params:\n",
" self.param_names = [f\"-lambda-{i}\" for i in range(1, n_components + 1)]\n",
" self.output_names = self.output_names + self.param_names\n",
"\n",
" # Add first output entry for the sample_mean\n",
" self.output_names.insert(0, \"\")\n",
"\n",
" self.outputsize_multiplier = n_components\n",
" self.is_distribution_output = True\n",
"\n",
" def domain_map(self, output: torch.Tensor):\n",
" return (output,)#, weights\n",
" \n",
" def scale_decouple(self, \n",
" output,\n",
" loc: Optional[torch.Tensor] = None,\n",
" scale: Optional[torch.Tensor] = None):\n",
" \"\"\" Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning residual\n",
" variance and residual location based on anchoring `loc`, `scale`.\n",
" Also adds domain protection to the distribution parameters.\n",
" \"\"\"\n",
" lambdas = output[0]\n",
" if (loc is not None) and (scale is not None):\n",
" loc = loc.view(lambdas.size(dim=0), 1, -1)\n",
" scale = scale.view(lambdas.size(dim=0), 1, -1)\n",
" lambdas = (lambdas * scale) + loc\n",
" lambdas = F.softplus(lambdas)\n",
" return (lambdas,)\n",
"\n",
" def sample(self, distr_args, num_samples=None):\n",
" \"\"\"\n",
" Construct the empirical quantiles from the estimated Distribution,\n",
" sampling from it `num_samples` independently.\n",
"\n",
" **Parameters**<br>\n",
" `distr_args`: Constructor arguments for the underlying Distribution type.<br>\n",
" `loc`: Optional tensor, of the same shape as the batch_shape + event_shape\n",
" of the resulting distribution.<br>\n",
" `scale`: Optional tensor, of the same shape as the batch_shape+event_shape \n",
" of the resulting distribution.<br>\n",
" `num_samples`: int=500, overwrites number of samples for the empirical quantiles.<br>\n",
"\n",
" **Returns**<br>\n",
" `samples`: tensor, shape [B,H,`num_samples`].<br>\n",
" `quantiles`: tensor, empirical quantiles defined by `levels`.<br>\n",
" \"\"\"\n",
" if num_samples is None:\n",
" num_samples = self.num_samples\n",
"\n",
" lambdas = distr_args[0]\n",
" B, H, K = lambdas.size()\n",
" Q = len(self.quantiles)\n",
"\n",
" # Sample K ~ Mult(weights)\n",
" # shared across B, H\n",
" # weights = torch.repeat_interleave(input=weights, repeats=H, dim=2)\n",
" weights = (1/K) * torch.ones_like(lambdas, device=lambdas.device)\n",
"\n",
" # Avoid loop, vectorize\n",
" weights = weights.reshape(-1, K)\n",
" lambdas = lambdas.flatten() \n",
"\n",
" # Vectorization trick to recover row_idx\n",
" sample_idxs = torch.multinomial(input=weights, \n",
" num_samples=num_samples,\n",
" replacement=True)\n",
" aux_col_idx = torch.unsqueeze(torch.arange(B * H, device=lambdas.device), -1) * K\n",
"\n",
" # To device\n",
" sample_idxs = sample_idxs.to(lambdas.device)\n",
"\n",
" sample_idxs = sample_idxs + aux_col_idx\n",
" sample_idxs = sample_idxs.flatten()\n",
"\n",
" sample_lambdas = lambdas[sample_idxs]\n",
"\n",
" # Sample y ~ Poisson(lambda) independently\n",
" samples = torch.poisson(sample_lambdas).to(lambdas.device)\n",
" samples = samples.view(B*H, num_samples)\n",
" sample_mean = torch.mean(samples, dim=-1)\n",
"\n",
" # Compute quantiles\n",
" quantiles_device = self.quantiles.to(lambdas.device)\n",
" quants = torch.quantile(input=samples, q=quantiles_device, dim=1)\n",
" quants = quants.permute((1,0)) # Q, B*H\n",
"\n",
" # Final reshapes\n",
" samples = samples.view(B, H, num_samples)\n",
" sample_mean = sample_mean.view(B, H, 1)\n",
" quants = quants.view(B, H, Q)\n",
"\n",
" return samples, sample_mean, quants\n",
" \n",
" def neglog_likelihood(self,\n",
" y: torch.Tensor,\n",
" distr_args: Tuple[torch.Tensor],\n",
" mask: Union[torch.Tensor, None] = None,):\n",
" if mask is None: \n",
" mask = (y > 0) * 1\n",
" else:\n",
" mask = mask * ((y > 0) * 1)\n",
"\n",
" eps = 1e-10\n",
" lambdas = distr_args[0]\n",
" B, H, K = lambdas.size()\n",
"\n",
" weights = (1/K) * torch.ones_like(lambdas, device=lambdas.device)\n",
"\n",
" y = y[:,:,None]\n",
" mask = mask[:,:,None]\n",
"\n",
" y = y * mask # Protect y negative entries\n",
" \n",
" # Single Poisson likelihood\n",
" log_pi = y.xlogy(lambdas + eps) - lambdas - (y + 1).lgamma()\n",
"\n",
" if self.batch_correlation:\n",
" log_pi = torch.sum(log_pi, dim=0, keepdim=True)\n",
"\n",
" if self.horizon_correlation:\n",
" log_pi = torch.sum(log_pi, dim=1, keepdim=True)\n",
"\n",
" # Numerically Stable Mixture loglikelihood\n",
" loglik = torch.logsumexp((torch.log(weights) + log_pi), dim=2, keepdim=True)\n",
" loglik = loglik * mask\n",
"\n",
" mean = torch.sum(weights * lambdas, axis=-1, keepdims=True)\n",
" reglrz = torch.mean(torch.square(y - mean) * mask)\n",
" loss = -torch.mean(loglik) + 0.001 * reglrz\n",
" return loss\n",
"\n",
" def __call__(self, y: torch.Tensor,\n",
" distr_args: Tuple[torch.Tensor],\n",
" mask: Union[torch.Tensor, None] = None):\n",
"\n",
" return self.neglog_likelihood(y=y, distr_args=distr_args, mask=mask)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62d7daba",
"metadata": {},
"outputs": [],
"source": [
"show_doc(PMM, name='PMM.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa8da65c",
"metadata": {},
"outputs": [],
"source": [
"show_doc(PMM.sample, name='PMM.sample', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba75717c",
"metadata": {},
"outputs": [],
"source": [
"show_doc(PMM.__call__, name='PMM.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f7518450",
"metadata": {},
"source": [
"![](imgs_losses/pmm.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4a20e21",
"metadata": {},
"outputs": [],
"source": [
"# | hide\n",
"# Unit tests to check PMM's stored quantiles\n",
"# attribute is correctly instantiated\n",
"check = PMM(n_components=2, level=[80, 90])\n",
"test_eq(len(check.quantiles), 5)\n",
"\n",
"check = PMM(n_components=2, \n",
" quantiles=[0.0100, 0.1000, 0.5, 0.9000, 0.9900])\n",
"print(check.output_names)\n",
"print(check.quantiles)\n",
"test_eq(len(check.quantiles), 5)\n",
"\n",
"check = PMM(n_components=2,\n",
" quantiles=[0.0100, 0.1000, 0.9000, 0.9900])\n",
"test_eq(len(check.quantiles), 4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a56a2fbe",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Create single mixture and broadcast to N,H,K\n",
"weights = torch.ones((1,3))[None, :, :]\n",
"lambdas = torch.Tensor([[5,10,15], [10,20,30]])[None, :, :]\n",
"\n",
"# Create repetitions for the batch dimension N.\n",
"N=2\n",
"weights = torch.repeat_interleave(input=weights, repeats=N, dim=0)\n",
"lambdas = torch.repeat_interleave(input=lambdas, repeats=N, dim=0)\n",
"\n",
"print('weights.shape (N,H,K) \\t', weights.shape)\n",
"print('lambdas.shape (N,H,K) \\t', lambdas.shape)\n",
"\n",
"distr = PMM(quantiles=[0.1, 0.40, 0.5, 0.60, 0.9])\n",
"distr_args = (lambdas,)\n",
"samples, sample_mean, quants = distr.sample(distr_args)\n",
"\n",
"print('samples.shape (N,H,num_samples) ', samples.shape)\n",
"print('sample_mean.shape (N,H) ', sample_mean.shape)\n",
"print('quants.shape (N,H,Q) \\t\\t', quants.shape)\n",
"\n",
"# Plot synthethic data\n",
"x_plot = range(quants.shape[1]) # H length\n",
"y_plot_hat = quants[0,:,:] # Filter N,G,T -> H,Q\n",
"samples_hat = samples[0,:,:] # Filter N,G,T -> H,num_samples\n",
"\n",
"# Kernel density plot for single forecast horizon \\tau = t+1\n",
"fig, ax = plt.subplots(figsize=(3.7, 2.9))\n",
"\n",
"ax.hist(samples_hat[0,:], alpha=0.5, label=r'Horizon $\\tau+1$')\n",
"ax.hist(samples_hat[1,:], alpha=0.5, label=r'Horizon $\\tau+2$')\n",
"ax.set(xlabel='Y values', ylabel='Probability')\n",
"plt.title('Single horizon Distributions')\n",
"plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)\n",
"plt.grid()\n",
"plt.show()\n",
"plt.close()\n",
"\n",
"# Plot simulated trajectory\n",
"fig, ax = plt.subplots(figsize=(3.7, 2.9))\n",
"plt.plot(x_plot, y_plot_hat[:,2], color='black', label='median [q50]')\n",
"plt.fill_between(x_plot,\n",
" y1=y_plot_hat[:,1], y2=y_plot_hat[:,3],\n",
" facecolor='blue', alpha=0.4, label='[p25-p75]')\n",
"plt.fill_between(x_plot,\n",
" y1=y_plot_hat[:,0], y2=y_plot_hat[:,4],\n",
" facecolor='blue', alpha=0.2, label='[p1-p99]')\n",
"ax.set(xlabel='Horizon', ylabel='Y values')\n",
"plt.title('PMM Probabilistic Predictions')\n",
"plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)\n",
"plt.grid()\n",
"plt.show()\n",
"plt.close()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e84e0dd4",
"metadata": {},
"source": [
"## Gaussian Mixture Mesh (GMM)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6928b0c5",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class GMM(torch.nn.Module):\n",
" \"\"\" Gaussian Mixture Mesh\n",
"\n",
" This Gaussian Mixture statistical model assumes independence across groups of \n",
" data $\\mathcal{G}=\\{[g_{i}]\\}$, and estimates relationships within the group.\n",
"\n",
" $$ \\mathrm{P}\\\\left(\\mathbf{y}_{[b][t+1:t+H]}\\\\right) = \n",
" \\prod_{ [g_{i}] \\in \\mathcal{G}} \\mathrm{P}\\left(\\mathbf{y}_{[g_{i}][\\\\tau]}\\\\right)=\n",
" \\prod_{\\\\beta\\in[g_{i}]}\n",
" \\\\left(\\sum_{k=1}^{K} w_k \\prod_{(\\\\beta,\\\\tau) \\in [g_i][t+1:t+H]} \n",
" \\mathrm{Gaussian}(y_{\\\\beta,\\\\tau}, \\hat{\\mu}_{\\\\beta,\\\\tau,k}, \\sigma_{\\\\beta,\\\\tau,k})\\\\right)$$\n",
"\n",
" **Parameters:**<br>\n",
" `n_components`: int=10, the number of mixture components.<br>\n",
" `level`: float list [0,100], confidence levels for prediction intervals.<br>\n",
" `quantiles`: float list [0,1], alternative to level list, target quantiles.<br>\n",
" `return_params`: bool=False, wether or not return the Distribution parameters.<br>\n",
" `batch_correlation`: bool=False, wether or not model batch correlations.<br>\n",
" `horizon_correlation`: bool=False, wether or not model horizon correlations.<br><br>\n",
"\n",
" **References:**<br>\n",
" [Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. \n",
" Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures. Submitted to the International \n",
" Journal Forecasting, Working paper available at arxiv.](https://arxiv.org/pdf/2110.13179.pdf)\n",
" \"\"\"\n",
" def __init__(self, n_components=1, level=[80, 90], quantiles=None, \n",
" num_samples=1000, return_params=False,\n",
" batch_correlation=False, horizon_correlation=False):\n",
" super(GMM, self).__init__()\n",
" # Transform level to MQLoss parameters\n",
" qs, self.output_names = level_to_outputs(level)\n",
" qs = torch.Tensor(qs)\n",
"\n",
" # Transform quantiles to homogeneus output names\n",
" if quantiles is not None:\n",
" _, self.output_names = quantiles_to_outputs(quantiles)\n",
" qs = torch.Tensor(quantiles)\n",
" self.quantiles = torch.nn.Parameter(qs, requires_grad=False)\n",
" self.num_samples = num_samples\n",
" self.batch_correlation = batch_correlation\n",
" self.horizon_correlation = horizon_correlation \n",
"\n",
" # If True, predict_step will return Distribution's parameters\n",
" self.return_params = return_params\n",
" if self.return_params:\n",
" mu_names = [f\"-mu-{i}\" for i in range(1, n_components + 1)]\n",
" std_names = [f\"-std-{i}\" for i in range(1, n_components + 1)]\n",
" mu_std_names = [i for j in zip(mu_names, std_names) for i in j]\n",
" self.output_names = self.output_names + mu_std_names\n",
"\n",
" # Add first output entry for the sample_mean\n",
" self.output_names.insert(0, \"\")\n",
"\n",
" self.outputsize_multiplier = 2 * n_components\n",
" self.is_distribution_output = True\n",
"\n",
" def domain_map(self, output: torch.Tensor):\n",
" means, stds = torch.tensor_split(output, 2, dim=-1)\n",
" return (means, stds)\n",
"\n",
" def scale_decouple(self, \n",
" output,\n",
" loc: Optional[torch.Tensor] = None,\n",
" scale: Optional[torch.Tensor] = None,\n",
" eps: float=0.2):\n",
" \"\"\" Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning residual\n",
" variance and residual location based on anchoring `loc`, `scale`.\n",
" Also adds domain protection to the distribution parameters.\n",
" \"\"\"\n",
" means, stds = output\n",
" stds = F.softplus(stds)\n",
" if (loc is not None) and (scale is not None):\n",
" loc = loc.view(means.size(dim=0), 1, -1)\n",
" scale = scale.view(means.size(dim=0), 1, -1) \n",
" means = (means * scale) + loc\n",
" stds = (stds + eps) * scale\n",
" return (means, stds)\n",
"\n",
" def sample(self, distr_args, num_samples=None):\n",
" \"\"\"\n",
" Construct the empirical quantiles from the estimated Distribution,\n",
" sampling from it `num_samples` independently.\n",
"\n",
" **Parameters**<br>\n",
" `distr_args`: Constructor arguments for the underlying Distribution type.<br>\n",
" `loc`: Optional tensor, of the same shape as the batch_shape + event_shape\n",
" of the resulting distribution.<br>\n",
" `scale`: Optional tensor, of the same shape as the batch_shape+event_shape \n",
" of the resulting distribution.<br>\n",
" `num_samples`: int=500, number of samples for the empirical quantiles.<br>\n",
"\n",
" **Returns**<br>\n",
" `samples`: tensor, shape [B,H,`num_samples`].<br>\n",
" `quantiles`: tensor, empirical quantiles defined by `levels`.<br>\n",
" \"\"\"\n",
" if num_samples is None:\n",
" num_samples = self.num_samples\n",
" \n",
" means, stds = distr_args\n",
" B, H, K = means.size()\n",
" Q = len(self.quantiles)\n",
" assert means.shape == stds.shape\n",
"\n",
" # Sample K ~ Mult(weights)\n",
" # shared across B, H\n",
" # weights = torch.repeat_interleave(input=weights, repeats=H, dim=2)\n",
" \n",
" weights = (1/K) * torch.ones_like(means, device=means.device)\n",
" \n",
" # Avoid loop, vectorize\n",
" weights = weights.reshape(-1, K)\n",
" means = means.flatten()\n",
" stds = stds.flatten()\n",
"\n",
" # Vectorization trick to recover row_idx\n",
" sample_idxs = torch.multinomial(input=weights, \n",
" num_samples=num_samples,\n",
" replacement=True)\n",
" aux_col_idx = torch.unsqueeze(torch.arange(B * H, device=means.device),-1) * K\n",
"\n",
" # To device\n",
" sample_idxs = sample_idxs.to(means.device)\n",
"\n",
" sample_idxs = sample_idxs + aux_col_idx\n",
" sample_idxs = sample_idxs.flatten()\n",
"\n",
" sample_means = means[sample_idxs]\n",
" sample_stds = stds[sample_idxs]\n",
"\n",
" # Sample y ~ Normal(mu, std) independently\n",
" samples = torch.normal(sample_means, sample_stds).to(means.device)\n",
" samples = samples.view(B*H, num_samples)\n",
" sample_mean = torch.mean(samples, dim=-1)\n",
"\n",
" # Compute quantiles\n",
" quantiles_device = self.quantiles.to(means.device)\n",
" quants = torch.quantile(input=samples, q=quantiles_device, dim=1)\n",
" quants = quants.permute((1,0)) # Q, B*H\n",
"\n",
" # Final reshapes\n",
" samples = samples.view(B, H, num_samples)\n",
" sample_mean = sample_mean.view(B, H, 1)\n",
" quants = quants.view(B, H, Q)\n",
"\n",
" return samples, sample_mean, quants\n",
"\n",
" def neglog_likelihood(self,\n",
" y: torch.Tensor,\n",
" distr_args: Tuple[torch.Tensor, torch.Tensor],\n",
" mask: Union[torch.Tensor, None] = None):\n",
"\n",
" if mask is None: \n",
" mask = torch.ones_like(y)\n",
" \n",
" means, stds = distr_args\n",
" B, H, K = means.size()\n",
" \n",
" weights = (1/K) * torch.ones_like(means, device=means.device)\n",
" \n",
" y = y[:,:, None]\n",
" mask = mask[:,:,None]\n",
" \n",
" var = stds ** 2\n",
" log_stds = torch.log(stds)\n",
" log_pi = - ((y - means) ** 2 / (2 * var)) - log_stds \\\n",
" - math.log(math.sqrt(2 * math.pi))\n",
"\n",
" if self.batch_correlation:\n",
" log_pi = torch.sum(log_pi, dim=0, keepdim=True)\n",
"\n",
" if self.horizon_correlation: \n",
" log_pi = torch.sum(log_pi, dim=1, keepdim=True)\n",
"\n",
" # Numerically Stable Mixture loglikelihood\n",
" loglik = torch.logsumexp((torch.log(weights) + log_pi), dim=2, keepdim=True)\n",
" loglik = loglik * mask\n",
"\n",
" loss = -torch.mean(loglik)\n",
" return loss\n",
" \n",
" def __call__(self, y: torch.Tensor,\n",
" distr_args: Tuple[torch.Tensor, torch.Tensor],\n",
" mask: Union[torch.Tensor, None] = None,):\n",
"\n",
" return self.neglog_likelihood(y=y, distr_args=distr_args, mask=mask)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec4ebf3d",
"metadata": {},
"outputs": [],
"source": [
"show_doc(GMM, name='GMM.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bea56d8d",
"metadata": {},
"outputs": [],
"source": [
"show_doc(GMM.sample, name='GMM.sample', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f16e4f7",
"metadata": {},
"outputs": [],
"source": [
"show_doc(GMM.__call__, name='GMM.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "aed232a4",
"metadata": {},
"source": [
"![](imgs_losses/gmm.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ebe4250",
"metadata": {},
"outputs": [],
"source": [
"# | hide\n",
"# Unit tests to check PMM's stored quantiles\n",
"# attribute is correctly instantiated\n",
"check = GMM(n_components=2, level=[80, 90])\n",
"test_eq(len(check.quantiles), 5)\n",
"\n",
"check = GMM(n_components=2, \n",
" quantiles=[0.0100, 0.1000, 0.5, 0.9000, 0.9900])\n",
"print(check.output_names)\n",
"print(check.quantiles)\n",
"test_eq(len(check.quantiles), 5)\n",
"\n",
"check = GMM(n_components=2,\n",
" quantiles=[0.0100, 0.1000, 0.9000, 0.9900])\n",
"test_eq(len(check.quantiles), 4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "684d2382",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Create single mixture and broadcast to N,H,K\n",
"means = torch.Tensor([[5,10,15], [10,20,30]])[None, :, :]\n",
"\n",
"# # Create repetitions for the batch dimension N.\n",
"N=2\n",
"means = torch.repeat_interleave(input=means, repeats=N, dim=0)\n",
"weights = torch.ones_like(means)\n",
"stds = torch.ones_like(means)\n",
"\n",
"print('weights.shape (N,H,K) \\t', weights.shape)\n",
"print('means.shape (N,H,K) \\t', means.shape)\n",
"print('stds.shape (N,H,K) \\t', stds.shape)\n",
"\n",
"distr = GMM(quantiles=[0.1, 0.40, 0.5, 0.60, 0.9])\n",
"distr_args = (means, stds)\n",
"samples, sample_mean, quants = distr.sample(distr_args)\n",
"\n",
"print('samples.shape (N,H,num_samples) ', samples.shape)\n",
"print('sample_mean.shape (N,H) ', sample_mean.shape)\n",
"print('quants.shape (N,H,Q) \\t\\t', quants.shape)\n",
"\n",
"# Plot synthethic data\n",
"x_plot = range(quants.shape[1]) # H length\n",
"y_plot_hat = quants[0,:,:] # Filter N,G,T -> H,Q\n",
"samples_hat = samples[0,:,:] # Filter N,G,T -> H,num_samples\n",
"\n",
"# Kernel density plot for single forecast horizon \\tau = t+1\n",
"fig, ax = plt.subplots(figsize=(3.7, 2.9))\n",
"\n",
"ax.hist(samples_hat[0,:], alpha=0.5, bins=50,\n",
" label=r'Horizon $\\tau+1$')\n",
"ax.hist(samples_hat[1,:], alpha=0.5, bins=50,\n",
" label=r'Horizon $\\tau+2$')\n",
"ax.set(xlabel='Y values', ylabel='Probability')\n",
"plt.title('Single horizon Distributions')\n",
"plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)\n",
"plt.grid()\n",
"plt.show()\n",
"plt.close()\n",
"\n",
"# Plot simulated trajectory\n",
"fig, ax = plt.subplots(figsize=(3.7, 2.9))\n",
"plt.plot(x_plot, y_plot_hat[:,2], color='black', label='median [q50]')\n",
"plt.fill_between(x_plot,\n",
" y1=y_plot_hat[:,1], y2=y_plot_hat[:,3],\n",
" facecolor='blue', alpha=0.4, label='[p25-p75]')\n",
"plt.fill_between(x_plot,\n",
" y1=y_plot_hat[:,0], y2=y_plot_hat[:,4],\n",
" facecolor='blue', alpha=0.2, label='[p1-p99]')\n",
"ax.set(xlabel='Horizon', ylabel='Y values')\n",
"plt.title('GMM Probabilistic Predictions')\n",
"plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)\n",
"plt.grid()\n",
"plt.show()\n",
"plt.close()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "694a2afe",
"metadata": {},
"source": [
"## Negative Binomial Mixture Mesh (NBMM)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cdbe5c9",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class NBMM(torch.nn.Module):\n",
" \"\"\" Negative Binomial Mixture Mesh\n",
"\n",
" This N. Binomial Mixture statistical model assumes independence across groups of \n",
" data $\\mathcal{G}=\\{[g_{i}]\\}$, and estimates relationships within the group.\n",
"\n",
" $$ \\mathrm{P}\\\\left(\\mathbf{y}_{[b][t+1:t+H]}\\\\right) = \n",
" \\prod_{ [g_{i}] \\in \\mathcal{G}} \\mathrm{P}\\left(\\mathbf{y}_{[g_{i}][\\\\tau]}\\\\right)=\n",
" \\prod_{\\\\beta\\in[g_{i}]}\n",
" \\\\left(\\sum_{k=1}^{K} w_k \\prod_{(\\\\beta,\\\\tau) \\in [g_i][t+1:t+H]} \n",
" \\mathrm{NBinomial}(y_{\\\\beta,\\\\tau}, \\hat{r}_{\\\\beta,\\\\tau,k}, \\hat{p}_{\\\\beta,\\\\tau,k})\\\\right)$$\n",
"\n",
" **Parameters:**<br>\n",
" `n_components`: int=10, the number of mixture components.<br>\n",
" `level`: float list [0,100], confidence levels for prediction intervals.<br>\n",
" `quantiles`: float list [0,1], alternative to level list, target quantiles.<br>\n",
" `return_params`: bool=False, wether or not return the Distribution parameters.<br><br>\n",
"\n",
" **References:**<br>\n",
" [Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. \n",
" Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures. Submitted to the International \n",
" Journal Forecasting, Working paper available at arxiv.](https://arxiv.org/pdf/2110.13179.pdf)\n",
" \"\"\"\n",
" def __init__(self, n_components=1, level=[80, 90], quantiles=None, \n",
" num_samples=1000, return_params=False):\n",
" super(NBMM, self).__init__()\n",
" # Transform level to MQLoss parameters\n",
" qs, self.output_names = level_to_outputs(level)\n",
" qs = torch.Tensor(qs)\n",
"\n",
" # Transform quantiles to homogeneus output names\n",
" if quantiles is not None:\n",
" _, self.output_names = quantiles_to_outputs(quantiles)\n",
" qs = torch.Tensor(quantiles)\n",
" self.quantiles = torch.nn.Parameter(qs, requires_grad=False)\n",
" self.num_samples = num_samples\n",
"\n",
" # If True, predict_step will return Distribution's parameters\n",
" self.return_params = return_params\n",
" if self.return_params:\n",
" total_count_names = [f\"-total_count-{i}\" for i in range(1, n_components + 1)]\n",
" probs_names = [f\"-probs-{i}\" for i in range(1, n_components + 1)]\n",
" param_names = [i for j in zip(total_count_names, probs_names) for i in j]\n",
" self.output_names = self.output_names + param_names\n",
"\n",
" # Add first output entry for the sample_mean\n",
" self.output_names.insert(0, \"\") \n",
"\n",
" self.outputsize_multiplier = 2 * n_components\n",
" self.is_distribution_output = True\n",
"\n",
" def domain_map(self, output: torch.Tensor):\n",
" mu, alpha = torch.tensor_split(output, 2, dim=-1)\n",
" return (mu, alpha)\n",
"\n",
" def scale_decouple(self, \n",
" output,\n",
" loc: Optional[torch.Tensor] = None,\n",
" scale: Optional[torch.Tensor] = None,\n",
" eps: float=0.2):\n",
" \"\"\" Scale Decouple\n",
"\n",
" Stabilizes model's output optimization, by learning residual\n",
" variance and residual location based on anchoring `loc`, `scale`.\n",
" Also adds domain protection to the distribution parameters.\n",
" \"\"\"\n",
" # Efficient NBinomial parametrization\n",
" mu, alpha = output\n",
" mu = F.softplus(mu) + 1e-8\n",
" alpha = F.softplus(alpha) + 1e-8 # alpha = 1/total_counts\n",
" if (loc is not None) and (scale is not None):\n",
" loc = loc.view(mu.size(dim=0), 1, -1)\n",
" mu *= loc\n",
" alpha /= (loc + 1.)\n",
"\n",
" # mu = total_count * (probs/(1-probs))\n",
" # => probs = mu / (total_count + mu)\n",
" # => probs = mu / [total_count * (1 + mu * (1/total_count))]\n",
" total_count = 1.0 / alpha\n",
" probs = (mu * alpha / (1.0 + mu * alpha)) + 1e-8 \n",
" return (total_count, probs)\n",
"\n",
" def sample(self, distr_args, num_samples=None):\n",
" \"\"\"\n",
" Construct the empirical quantiles from the estimated Distribution,\n",
" sampling from it `num_samples` independently.\n",
"\n",
" **Parameters**<br>\n",
" `distr_args`: Constructor arguments for the underlying Distribution type.<br>\n",
" `loc`: Optional tensor, of the same shape as the batch_shape + event_shape\n",
" of the resulting distribution.<br>\n",
" `scale`: Optional tensor, of the same shape as the batch_shape+event_shape \n",
" of the resulting distribution.<br>\n",
" `num_samples`: int=500, number of samples for the empirical quantiles.<br>\n",
"\n",
" **Returns**<br>\n",
" `samples`: tensor, shape [B,H,`num_samples`].<br>\n",
" `quantiles`: tensor, empirical quantiles defined by `levels`.<br>\n",
" \"\"\"\n",
" if num_samples is None:\n",
" num_samples = self.num_samples\n",
" \n",
" total_count, probs = distr_args\n",
" B, H, K = total_count.size()\n",
" Q = len(self.quantiles)\n",
" assert total_count.shape == probs.shape\n",
"\n",
" # Sample K ~ Mult(weights)\n",
" # shared across B, H\n",
" # weights = torch.repeat_interleave(input=weights, repeats=H, dim=2)\n",
" \n",
" weights = (1/K) * torch.ones_like(probs, device=probs.device)\n",
" \n",
" # Avoid loop, vectorize\n",
" weights = weights.reshape(-1, K)\n",
" total_count = total_count.flatten()\n",
" probs = probs.flatten()\n",
"\n",
" # Vectorization trick to recover row_idx\n",
" sample_idxs = torch.multinomial(input=weights, \n",
" num_samples=num_samples,\n",
" replacement=True)\n",
" aux_col_idx = torch.unsqueeze(torch.arange(B * H, device=probs.device),-1) * K\n",
"\n",
" # To device\n",
" sample_idxs = sample_idxs.to(probs.device)\n",
"\n",
" sample_idxs = sample_idxs + aux_col_idx\n",
" sample_idxs = sample_idxs.flatten()\n",
"\n",
" sample_total_count = total_count[sample_idxs]\n",
" sample_probs = probs[sample_idxs]\n",
"\n",
" # Sample y ~ NBinomial(total_count, probs) independently\n",
" dist = NegativeBinomial(total_count=sample_total_count, \n",
" probs=sample_probs)\n",
" samples = dist.sample(sample_shape=(1,)).to(probs.device)[0]\n",
" samples = samples.view(B*H, num_samples)\n",
" sample_mean = torch.mean(samples, dim=-1)\n",
"\n",
" # Compute quantiles\n",
" quantiles_device = self.quantiles.to(probs.device)\n",
" quants = torch.quantile(input=samples, q=quantiles_device, dim=1)\n",
" quants = quants.permute((1,0)) # Q, B*H\n",
"\n",
" # Final reshapes\n",
" samples = samples.view(B, H, num_samples)\n",
" sample_mean = sample_mean.view(B, H, 1)\n",
" quants = quants.view(B, H, Q)\n",
"\n",
" return samples, sample_mean, quants\n",
"\n",
" def neglog_likelihood(self,\n",
" y: torch.Tensor,\n",
" distr_args: Tuple[torch.Tensor, torch.Tensor],\n",
" mask: Union[torch.Tensor, None] = None):\n",
"\n",
" if mask is None: \n",
" mask = torch.ones_like(y)\n",
" \n",
" total_count, probs = distr_args\n",
" B, H, K = total_count.size()\n",
" \n",
" weights = (1/K) * torch.ones_like(probs, device=probs.device)\n",
" \n",
" y = y[:,:, None]\n",
" mask = mask[:,:,None]\n",
"\n",
" log_unnormalized_prob = (total_count * torch.log(1.-probs) + y * torch.log(probs))\n",
" log_normalization = (-torch.lgamma(total_count + y) + torch.lgamma(1. + y) +\n",
" torch.lgamma(total_count))\n",
" log_normalization[total_count + y == 0.] = 0.\n",
" log = log_unnormalized_prob - log_normalization\n",
"\n",
" #log = torch.sum(log, dim=0, keepdim=True) # Joint within batch/group\n",
" #log = torch.sum(log, dim=1, keepdim=True) # Joint within horizon\n",
"\n",
" # Numerical stability mixture and loglik\n",
" log_max = torch.amax(log, dim=2, keepdim=True) # [1,1,K] (collapsed joints)\n",
" lik = weights * torch.exp(log-log_max) # Take max\n",
" loglik = torch.log(torch.sum(lik, dim=2, keepdim=True)) + log_max # Return max\n",
" \n",
" loglik = loglik * mask #replace with mask\n",
"\n",
" loss = -torch.mean(loglik)\n",
" return loss\n",
" \n",
" def __call__(self, y: torch.Tensor,\n",
" distr_args: Tuple[torch.Tensor, torch.Tensor],\n",
" mask: Union[torch.Tensor, None] = None,):\n",
"\n",
" return self.neglog_likelihood(y=y, distr_args=distr_args, mask=mask)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eed5e73c",
"metadata": {},
"outputs": [],
"source": [
"show_doc(NBMM, name='NBMM.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41ea98ba",
"metadata": {},
"outputs": [],
"source": [
"show_doc(NBMM.sample, name='NBMM.sample', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c7189c5",
"metadata": {},
"outputs": [],
"source": [
"show_doc(NBMM.__call__, name='NBMM.__call__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b67e2931",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Create single mixture and broadcast to N,H,K\n",
"counts = torch.Tensor([[10,20,30], [20,40,60]])[None, :, :]\n",
"\n",
"# # Create repetitions for the batch dimension N.\n",
"N=2\n",
"counts = torch.repeat_interleave(input=counts, repeats=N, dim=0)\n",
"weights = torch.ones_like(counts)\n",
"probs = torch.ones_like(counts) * 0.5\n",
"\n",
"print('weights.shape (N,H,K) \\t', weights.shape)\n",
"print('counts.shape (N,H,K) \\t', counts.shape)\n",
"print('probs.shape (N,H,K) \\t', probs.shape)\n",
"\n",
"model = NBMM(quantiles=[0.1, 0.40, 0.5, 0.60, 0.9])\n",
"distr_args = (counts, probs)\n",
"samples, sample_mean, quants = model.sample(distr_args, num_samples=2000)\n",
"\n",
"print('samples.shape (N,H,num_samples) ', samples.shape)\n",
"print('sample_mean.shape (N,H) ', sample_mean.shape)\n",
"print('quants.shape (N,H,Q) \\t\\t', quants.shape)\n",
"\n",
"# Plot synthethic data\n",
"x_plot = range(quants.shape[1]) # H length\n",
"y_plot_hat = quants[0,:,:] # Filter N,G,T -> H,Q\n",
"samples_hat = samples[0,:,:] # Filter N,G,T -> H,num_samples\n",
"\n",
"# Kernel density plot for single forecast horizon \\tau = t+1\n",
"fig, ax = plt.subplots(figsize=(3.7, 2.9))\n",
"\n",
"ax.hist(samples_hat[0,:], alpha=0.5, bins=30,\n",
" label=r'Horizon $\\tau+1$')\n",
"ax.hist(samples_hat[1,:], alpha=0.5, bins=30,\n",
" label=r'Horizon $\\tau+2$')\n",
"ax.set(xlabel='Y values', ylabel='Probability')\n",
"plt.title('Single horizon Distributions')\n",
"plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)\n",
"plt.grid()\n",
"plt.show()\n",
"plt.close()\n",
"\n",
"# Plot simulated trajectory\n",
"fig, ax = plt.subplots(figsize=(3.7, 2.9))\n",
"plt.plot(x_plot, y_plot_hat[:,2], color='black', label='median [q50]')\n",
"plt.fill_between(x_plot,\n",
" y1=y_plot_hat[:,1], y2=y_plot_hat[:,3],\n",
" facecolor='blue', alpha=0.4, label='[p25-p75]')\n",
"plt.fill_between(x_plot,\n",
" y1=y_plot_hat[:,0], y2=y_plot_hat[:,4],\n",
" facecolor='blue', alpha=0.2, label='[p1-p99]')\n",
"ax.set(xlabel='Horizon', ylabel='Y values')\n",
"plt.title('NBM Probabilistic Predictions')\n",
"plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)\n",
"plt.grid()\n",
"plt.show()\n",
"plt.close()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a6cf4850",
"metadata": {},
"source": [
"# 5. Robustified Errors\n",
"\n",
"This type of errors from robust statistic focus on methods resistant to outliers and violations of assumptions, providing reliable estimates and inferences. Robust estimators are used to reduce the impact of outliers, offering more stable results."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7588f6d2",
"metadata": {},
"source": [
"## Huber Loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ae9f60c",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class HuberLoss(BasePointLoss):\n",
" \"\"\" Huber Loss\n",
"\n",
" The Huber loss, employed in robust regression, is a loss function that \n",
" exhibits reduced sensitivity to outliers in data when compared to the \n",
" squared error loss. This function is also refered as SmoothL1.\n",
"\n",
" The Huber loss function is quadratic for small errors and linear for large \n",
" errors, with equal values and slopes of the different sections at the two \n",
" points where $(y_{\\\\tau}-\\hat{y}_{\\\\tau})^{2}$=$|y_{\\\\tau}-\\hat{y}_{\\\\tau}|$.\n",
"\n",
" $$ L_{\\delta}(y_{\\\\tau},\\; \\hat{y}_{\\\\tau})\n",
" =\\\\begin{cases}{\\\\frac{1}{2}}(y_{\\\\tau}-\\hat{y}_{\\\\tau})^{2}\\;{\\\\text{for }}|y_{\\\\tau}-\\hat{y}_{\\\\tau}|\\leq \\delta \\\\\\ \n",
" \\\\delta \\ \\cdot \\left(|y_{\\\\tau}-\\hat{y}_{\\\\tau}|-{\\\\frac {1}{2}}\\delta \\\\right),\\;{\\\\text{otherwise.}}\\end{cases}$$\n",
"\n",
" where $\\\\delta$ is a threshold parameter that determines the point at which the loss transitions from quadratic to linear,\n",
" and can be tuned to control the trade-off between robustness and accuracy in the predictions.\n",
"\n",
" **Parameters:**<br>\n",
" `delta`: float=1.0, Specifies the threshold at which to change between delta-scaled L1 and L2 loss.\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
" \n",
" **References:**<br>\n",
" [Huber Peter, J (1964). \"Robust Estimation of a Location Parameter\". Annals of Statistics](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-35/issue-1/Robust-Estimation-of-a-Location-Parameter/10.1214/aoms/1177703732.full)\n",
" \"\"\" \n",
" def __init__(self, delta: float=1., horizon_weight=None):\n",
" super(HuberLoss, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[''])\n",
" self.delta = delta\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `huber_loss`: tensor (single value).\n",
" \"\"\"\n",
" losses = F.huber_loss(y, y_hat, reduction='none', delta=self.delta) \n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" return _weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ccbfa88",
"metadata": {},
"outputs": [],
"source": [
"show_doc(HuberLoss, name='HuberLoss.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6226178b",
"metadata": {},
"outputs": [],
"source": [
"show_doc(HuberLoss.__call__, name='HuberLoss.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "06aad81b",
"metadata": {},
"source": [
"![](imgs_losses/huber_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7f835621",
"metadata": {},
"source": [
"## Tukey Loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26ea3109",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class TukeyLoss(torch.nn.Module):\n",
" \"\"\" Tukey Loss\n",
"\n",
" The Tukey loss function, also known as Tukey's biweight function, is a \n",
" robust statistical loss function used in robust statistics. Tukey's loss exhibits\n",
" quadratic behavior near the origin, like the Huber loss; however, it is even more\n",
" robust to outliers as the loss for large residuals remains constant instead of \n",
" scaling linearly.\n",
"\n",
" The parameter $c$ in Tukey's loss determines the ''saturation'' point\n",
" of the function: Higher values of $c$ enhance sensitivity, while lower values \n",
" increase resistance to outliers.\n",
"\n",
" $$ L_{c}(y_{\\\\tau},\\; \\hat{y}_{\\\\tau})\n",
" =\\\\begin{cases}{\n",
" \\\\frac{c^{2}}{6}} \\\\left[1-(\\\\frac{y_{\\\\tau}-\\hat{y}_{\\\\tau}}{c})^{2} \\\\right]^{3} \\;\\\\text{for } |y_{\\\\tau}-\\hat{y}_{\\\\tau}|\\leq c \\\\\\ \n",
" \\\\frac{c^{2}}{6} \\qquad \\\\text{otherwise.} \\end{cases}$$\n",
"\n",
" Please note that the Tukey loss function assumes the data to be stationary or\n",
" normalized beforehand. If the error values are excessively large, the algorithm\n",
" may need help to converge during optimization. It is advisable to employ small learning rates.\n",
"\n",
" **Parameters:**<br>\n",
" `c`: float=4.685, Specifies the Tukey loss' threshold on which residuals are no longer considered.<br>\n",
" `normalize`: bool=True, Wether normalization is performed within Tukey loss' computation.<br>\n",
"\n",
" **References:**<br>\n",
" [Beaton, A. E., and Tukey, J. W. (1974). \"The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data.\"](https://www.jstor.org/stable/1267936)\n",
" \"\"\"\n",
" def __init__(self, c: float=4.685, normalize: bool=True):\n",
" super(TukeyLoss, self).__init__()\n",
" self.outputsize_multiplier = 1\n",
" self.c = c\n",
" self.normalize = normalize\n",
" self.output_names = ['']\n",
" self.is_distribution_output = False\n",
"\n",
" def domain_map(self, y_hat: torch.Tensor):\n",
" \"\"\"\n",
" Univariate loss operates in dimension [B,T,H]/[B,H]\n",
" This changes the network's output from [B,H,1]->[B,H]\n",
" \"\"\"\n",
" return y_hat.squeeze(-1)\n",
"\n",
" def masked_mean(self, x, mask, dim):\n",
" x_nan = x.masked_fill(mask < 1, float(\"nan\"))\n",
" x_mean = x_nan.nanmean(dim=dim, keepdim=True)\n",
" x_mean = torch.nan_to_num(x_mean, nan=0.0)\n",
" return x_mean\n",
"\n",
" def __call__(self, y: torch.Tensor, y_hat: torch.Tensor, \n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `tukey_loss`: tensor (single value).\n",
" \"\"\"\n",
" if mask is None:\n",
" mask = torch.ones_like(y_hat)\n",
"\n",
" # We normalize the Tukey loss, to satisfy 4.685 normal outlier bounds\n",
" if self.normalize:\n",
" y_mean = self.masked_mean(x=y, mask=mask, dim=-1)\n",
" y_std = torch.sqrt(self.masked_mean(x=(y - y_mean) ** 2, mask=mask, dim=-1)) + 1e-2\n",
" else:\n",
" y_std = 1.\n",
" delta_y = torch.abs(y - y_hat) / y_std\n",
"\n",
" tukey_mask = torch.greater_equal(self.c * torch.ones_like(delta_y), delta_y)\n",
" tukey_loss = tukey_mask * mask * (1-(delta_y/(self.c))**2)**3 + (1-(tukey_mask * 1))\n",
" tukey_loss = (self.c**2 / 6) * torch.mean(tukey_loss)\n",
" return tukey_loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd4653e3",
"metadata": {},
"outputs": [],
"source": [
"show_doc(TukeyLoss, name='TukeyLoss.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7686462",
"metadata": {},
"outputs": [],
"source": [
"show_doc(TukeyLoss.__call__, name='TukeyLoss.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "8ae50f25",
"metadata": {},
"source": [
"![](imgs_losses/tukey_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a8a28d9c",
"metadata": {},
"source": [
"## Huberized Quantile Loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "549e6bdb",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class HuberQLoss(BasePointLoss):\n",
" \"\"\" Huberized Quantile Loss\n",
"\n",
" The Huberized quantile loss is a modified version of the quantile loss function that\n",
" combines the advantages of the quantile loss and the Huber loss. It is commonly used\n",
" in regression tasks, especially when dealing with data that contains outliers or heavy tails.\n",
"\n",
" The Huberized quantile loss between `y` and `y_hat` measure the Huber Loss in a non-symmetric way.\n",
" The loss pays more attention to under/over-estimation depending on the quantile parameter $q$; \n",
" and controls the trade-off between robustness and accuracy in the predictions with the parameter $delta$.\n",
"\n",
" $$ \\mathrm{HuberQL}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{(q)}_{\\\\tau}) = \n",
" (1-q)\\, L_{\\delta}(y_{\\\\tau},\\; \\hat{y}^{(q)}_{\\\\tau}) \\mathbb{1}\\{ \\hat{y}^{(q)}_{\\\\tau} \\geq y_{\\\\tau} \\} + \n",
" q\\, L_{\\delta}(y_{\\\\tau},\\; \\hat{y}^{(q)}_{\\\\tau}) \\mathbb{1}\\{ \\hat{y}^{(q)}_{\\\\tau} < y_{\\\\tau} \\} $$\n",
"\n",
" **Parameters:**<br>\n",
" `delta`: float=1.0, Specifies the threshold at which to change between delta-scaled L1 and L2 loss.<br>\n",
" `q`: float, between 0 and 1. The slope of the quantile loss, in the context of quantile regression, the q determines the conditional quantile level.<br>\n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br>\n",
"\n",
" **References:**<br>\n",
" [Huber Peter, J (1964). \"Robust Estimation of a Location Parameter\". Annals of Statistics](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-35/issue-1/Robust-Estimation-of-a-Location-Parameter/10.1214/aoms/1177703732.full)<br>\n",
" [Roger Koenker and Gilbert Bassett, Jr., \"Regression Quantiles\".](https://www.jstor.org/stable/1913643)\n",
" \"\"\"\n",
" def __init__(self, q, delta: float=1., horizon_weight=None):\n",
" super(HuberQLoss, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=1,\n",
" output_names=[f'_q{q}_d{delta}'])\n",
" self.q = q\n",
" self.delta = delta\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies datapoints to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `huber_qloss`: tensor (single value).\n",
" \"\"\"\n",
" error = y_hat - y\n",
" zero_error = torch.zeros_like(error)\n",
" sq = torch.maximum(-error, zero_error)\n",
" s1_q = torch.maximum(error, zero_error)\n",
" losses = self.q * F.huber_loss(sq, zero_error, \n",
" reduction='none', delta=self.delta) + \\\n",
" (1 - self.q) * F.huber_loss(s1_q, zero_error, \n",
" reduction='none', delta=self.delta)\n",
"\n",
" weights = self._compute_weights(y=y, mask=mask)\n",
" return _weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec830ac0",
"metadata": {},
"outputs": [],
"source": [
"show_doc(HuberQLoss, name='HuberQLoss.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15409d3f",
"metadata": {},
"outputs": [],
"source": [
"show_doc(HuberQLoss.__call__, name='HuberQLoss.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a2d97f31",
"metadata": {},
"source": [
"![](imgs_losses/huber_qloss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "2e7e3143",
"metadata": {},
"source": [
"## Huberized MQLoss"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc992c47",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class HuberMQLoss(BasePointLoss):\n",
" \"\"\" Huberized Multi-Quantile loss\n",
"\n",
" The Huberized Multi-Quantile loss (HuberMQL) is a modified version of the multi-quantile loss function \n",
" that combines the advantages of the quantile loss and the Huber loss. HuberMQL is commonly used in regression \n",
" tasks, especially when dealing with data that contains outliers or heavy tails. The loss function pays \n",
" more attention to under/over-estimation depending on the quantile list $[q_{1},q_{2},\\dots]$ parameter. \n",
" It controls the trade-off between robustness and prediction accuracy with the parameter $\\\\delta$.\n",
"\n",
" $$ \\mathrm{HuberMQL}_{\\delta}(\\\\mathbf{y}_{\\\\tau},[\\\\mathbf{\\hat{y}}^{(q_{1})}_{\\\\tau}, ... ,\\hat{y}^{(q_{n})}_{\\\\tau}]) = \n",
" \\\\frac{1}{n} \\\\sum_{q_{i}} \\mathrm{HuberQL}_{\\\\delta}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}^{(q_{i})}_{\\\\tau}) $$\n",
"\n",
" **Parameters:**<br>\n",
" `level`: int list [0,100]. Probability levels for prediction intervals (Defaults median).\n",
" `quantiles`: float list [0., 1.]. Alternative to level, quantiles to estimate from y distribution.\n",
" `delta`: float=1.0, Specifies the threshold at which to change between delta-scaled L1 and L2 loss.<br> \n",
" `horizon_weight`: Tensor of size h, weight for each timestamp of the forecasting window. <br> \n",
"\n",
" **References:**<br>\n",
" [Huber Peter, J (1964). \"Robust Estimation of a Location Parameter\". Annals of Statistics](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-35/issue-1/Robust-Estimation-of-a-Location-Parameter/10.1214/aoms/1177703732.full)<br>\n",
" [Roger Koenker and Gilbert Bassett, Jr., \"Regression Quantiles\".](https://www.jstor.org/stable/1913643)\n",
" \"\"\"\n",
" def __init__(self, level=[80, 90], quantiles=None, delta: float=1.0, horizon_weight=None):\n",
"\n",
" qs, output_names = level_to_outputs(level)\n",
" qs = torch.Tensor(qs)\n",
" # Transform quantiles to homogeneus output names\n",
" if quantiles is not None:\n",
" _, output_names = quantiles_to_outputs(quantiles)\n",
" qs = torch.Tensor(quantiles)\n",
"\n",
" super(HuberMQLoss, self).__init__(horizon_weight=horizon_weight,\n",
" outputsize_multiplier=len(qs),\n",
" output_names=output_names)\n",
" \n",
" self.quantiles = torch.nn.Parameter(qs, requires_grad=False)\n",
" self.delta = delta\n",
"\n",
" def domain_map(self, y_hat: torch.Tensor):\n",
" \"\"\"\n",
" Identity domain map [B,T,H,Q]/[B,H,Q]\n",
" \"\"\"\n",
" return y_hat\n",
" \n",
" def _compute_weights(self, y, mask):\n",
" \"\"\"\n",
" Compute final weights for each datapoint (based on all weights and all masks)\n",
" Set horizon_weight to a ones[H] tensor if not set.\n",
" If set, check that it has the same length as the horizon in x.\n",
" \"\"\"\n",
" if mask is None:\n",
" mask = torch.ones_like(y, device=y.device)\n",
" else:\n",
" mask = mask.unsqueeze(1) # Add Q dimension.\n",
"\n",
" if self.horizon_weight is None:\n",
" self.horizon_weight = torch.ones(mask.shape[-1])\n",
" else:\n",
" assert mask.shape[-1] == len(self.horizon_weight), \\\n",
" 'horizon_weight must have same length as Y'\n",
" \n",
" weights = self.horizon_weight.clone()\n",
" weights = torch.ones_like(mask, device=mask.device) * weights.to(mask.device)\n",
" return weights * mask\n",
"\n",
" def __call__(self,\n",
" y: torch.Tensor,\n",
" y_hat: torch.Tensor,\n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `hmqloss`: tensor (single value).\n",
" \"\"\"\n",
"\n",
" error = y_hat - y.unsqueeze(-1)\n",
" zero_error = torch.zeros_like(error) \n",
" sq = torch.maximum(-error, torch.zeros_like(error))\n",
" s1_q = torch.maximum(error, torch.zeros_like(error))\n",
" losses = F.huber_loss(self.quantiles * sq, zero_error, \n",
" reduction='none', delta=self.delta) + \\\n",
" F.huber_loss((1 - self.quantiles) * s1_q, zero_error, \n",
" reduction='none', delta=self.delta)\n",
" losses = (1/len(self.quantiles)) * losses\n",
"\n",
" if y_hat.ndim == 3: # BaseWindows\n",
" losses = losses.swapaxes(-2,-1) # [B,H,Q] -> [B,Q,H] (needed for horizon weighting, H at the end)\n",
" elif y_hat.ndim == 4: # BaseRecurrent\n",
" losses = losses.swapaxes(-2,-1)\n",
" losses = losses.swapaxes(-2,-3) # [B,seq_len,H,Q] -> [B,Q,seq_len,H] (needed for horizon weighting, H at the end)\n",
"\n",
" weights = self._compute_weights(y=losses, mask=mask) # Use losses for extra dim\n",
" # NOTE: Weights do not have Q dimension.\n",
"\n",
" return _weighted_mean(losses=losses, weights=weights)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a662632",
"metadata": {},
"outputs": [],
"source": [
"show_doc(HuberMQLoss, name='HuberMQLoss.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "82f733ce",
"metadata": {},
"outputs": [],
"source": [
"show_doc(HuberMQLoss.__call__, name='HuberMQLoss.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "47782e38",
"metadata": {},
"source": [
"![](imgs_losses/hmq_loss.png)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "eb99f88b",
"metadata": {},
"source": [
"# 6. Others"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "013d1502",
"metadata": {},
"source": [
"## Accuracy"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4fda0a2",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class Accuracy(torch.nn.Module):\n",
" \"\"\" Accuracy\n",
"\n",
" Computes the accuracy between categorical `y` and `y_hat`.\n",
" This evaluation metric is only meant for evalution, as it\n",
" is not differentiable.\n",
"\n",
" $$ \\mathrm{Accuracy}(\\\\mathbf{y}_{\\\\tau}, \\\\mathbf{\\hat{y}}_{\\\\tau}) = \\\\frac{1}{H} \\\\sum^{t+H}_{\\\\tau=t+1} \\mathrm{1}\\{\\\\mathbf{y}_{\\\\tau}==\\\\mathbf{\\hat{y}}_{\\\\tau}\\} $$\n",
"\n",
" \"\"\"\n",
" def __init__(self,):\n",
" super(Accuracy, self).__init__()\n",
" self.is_distribution_output = False\n",
"\n",
" def domain_map(self, y_hat: torch.Tensor):\n",
" \"\"\"\n",
" Univariate loss operates in dimension [B,T,H]/[B,H]\n",
" This changes the network's output from [B,H,1]->[B,H]\n",
" \"\"\"\n",
" return y_hat.squeeze(-1)\n",
"\n",
" def __call__(self, y: torch.Tensor, y_hat: torch.Tensor, \n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies date stamps per serie to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `accuracy`: tensor (single value).\n",
" \"\"\"\n",
" if mask is None:\n",
" mask = torch.ones_like(y_hat)\n",
"\n",
" measure = (y.unsqueeze(-1) == y_hat) * mask.unsqueeze(-1)\n",
" accuracy = torch.mean(measure)\n",
" return accuracy"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5eeb2d06",
"metadata": {},
"outputs": [],
"source": [
"show_doc(Accuracy, name='Accuracy.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2111646c",
"metadata": {},
"outputs": [],
"source": [
"show_doc(Accuracy.__call__, name='Accuracy.__call__', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3742e6be",
"metadata": {},
"source": [
"## Scaled Continuous Ranked Probability Score (sCRPS)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d210a2e",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class sCRPS(torch.nn.Module):\n",
" \"\"\"Scaled Continues Ranked Probability Score\n",
"\n",
" Calculates a scaled variation of the CRPS, as proposed by Rangapuram (2021),\n",
" to measure the accuracy of predicted quantiles `y_hat` compared to the observation `y`.\n",
"\n",
" This metric averages percentual weighted absolute deviations as \n",
" defined by the quantile losses.\n",
"\n",
" $$ \\mathrm{sCRPS}(\\\\mathbf{\\hat{y}}^{(q)}_{\\\\tau}, \\mathbf{y}_{\\\\tau}) = \\\\frac{2}{N} \\sum_{i}\n",
" \\int^{1}_{0}\n",
" \\\\frac{\\mathrm{QL}(\\\\mathbf{\\hat{y}}^{(q}_{\\\\tau} y_{i,\\\\tau})_{q}}{\\sum_{i} | y_{i,\\\\tau} |} dq $$\n",
"\n",
" where $\\\\mathbf{\\hat{y}}^{(q}_{\\\\tau}$ is the estimated quantile, and $y_{i,\\\\tau}$\n",
" are the target variable realizations.\n",
"\n",
" **Parameters:**<br>\n",
" `level`: int list [0,100]. Probability levels for prediction intervals (Defaults median).\n",
" `quantiles`: float list [0., 1.]. Alternative to level, quantiles to estimate from y distribution.\n",
"\n",
" **References:**<br>\n",
" - [Gneiting, Tilmann. (2011). \\\"Quantiles as optimal point forecasts\\\". \n",
" International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207010000063)<br>\n",
" - [Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, Zhi Chen, Anil Gaba, Ilia Tsetlin, Robert L. Winkler. (2022). \n",
" \\\"The M5 uncertainty competition: Results, findings and conclusions\\\". \n",
" International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207021001722)<br>\n",
" - [Syama Sundar Rangapuram, Lucien D Werner, Konstantinos Benidis, Pedro Mercado, Jan Gasthaus, Tim Januschowski. (2021). \n",
" \\\"End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series\\\". \n",
" Proceedings of the 38th International Conference on Machine Learning (ICML).](https://proceedings.mlr.press/v139/rangapuram21a.html)\n",
" \"\"\"\n",
" def __init__(self, level=[80, 90], quantiles=None):\n",
" super(sCRPS, self).__init__()\n",
" self.mql = MQLoss(level=level, quantiles=quantiles)\n",
" self.is_distribution_output = False\n",
" \n",
" def __call__(self, y: torch.Tensor, y_hat: torch.Tensor, \n",
" mask: Union[torch.Tensor, None] = None):\n",
" \"\"\"\n",
" **Parameters:**<br>\n",
" `y`: tensor, Actual values.<br>\n",
" `y_hat`: tensor, Predicted values.<br>\n",
" `mask`: tensor, Specifies date stamps per series to consider in loss.<br>\n",
"\n",
" **Returns:**<br>\n",
" `scrps`: tensor (single value).\n",
" \"\"\"\n",
" mql = self.mql(y=y, y_hat=y_hat, mask=mask)\n",
" norm = torch.sum(torch.abs(y))\n",
" unmean = torch.sum(mask)\n",
" scrps = 2 * mql * unmean / (norm + 1e-5)\n",
" return scrps"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53770648",
"metadata": {},
"outputs": [],
"source": [
"show_doc(sCRPS, name='sCRPS.__init__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3646250f",
"metadata": {},
"outputs": [],
"source": [
"show_doc(sCRPS.__call__, name='sCRPS.__call__', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cdfa174",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Each 1 is an error, there are 6 datapoints.\n",
"y = torch.Tensor([[0,0,0],[0,0,0]])\n",
"y_hat = torch.Tensor([[0,0,1],[1,0,1]])\n",
"\n",
"# Complete mask and horizon_weight\n",
"mask = torch.Tensor([[1,1,1],[1,1,1]])\n",
"horizon_weight = torch.Tensor([1,1,1])\n",
"\n",
"mae = MAE(horizon_weight=horizon_weight)\n",
"loss = mae(y=y, y_hat=y_hat, mask=mask)\n",
"assert loss==(3/6), 'Should be 3/6'\n",
"\n",
"# Incomplete mask and complete horizon_weight\n",
"mask = torch.Tensor([[1,1,1],[0,1,1]]) # Only 1 error and points is masked.\n",
"horizon_weight = torch.Tensor([1,1,1])\n",
"mae = MAE(horizon_weight=horizon_weight)\n",
"loss = mae(y=y, y_hat=y_hat, mask=mask)\n",
"assert loss==(2/5), 'Should be 2/5'\n",
"\n",
"# Complete mask and incomplete horizon_weight\n",
"mask = torch.Tensor([[1,1,1],[1,1,1]])\n",
"horizon_weight = torch.Tensor([1,1,0]) # 2 errors and points are masked.\n",
"mae = MAE(horizon_weight=horizon_weight)\n",
"loss = mae(y=y, y_hat=y_hat, mask=mask)\n",
"assert loss==(1/4), 'Should be 1/4'\n",
"\n",
"# Incomplete mask and incomplete horizon_weight\n",
"mask = torch.Tensor([[0,1,1],[1,1,1]])\n",
"horizon_weight = torch.Tensor([1,1,0]) # 2 errors are masked, and 3 points.\n",
"mae = MAE(horizon_weight=horizon_weight)\n",
"loss = mae(y=y, y_hat=y_hat, mask=mask)\n",
"assert loss==(1/3), 'Should be 1/3'"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"$schema": "https://mintlify.com/schema.json",
"name": "Nixtla",
"logo": {
"light": "/light.png",
"dark": "/dark.png"
},
"favicon": "/favicon.svg",
"colors": {
"primary": "#0E0E0E",
"light": "#FAFAFA",
"dark": "#0E0E0E",
"anchors": {
"from": "#2AD0CA",
"to": "#0E00F8"
}
},
"topbarCtaButton": {
"type": "github",
"url": "https://github.com/Nixtla/neuralforecast"
},
"topAnchor": {
"name": "NeuralForecast",
"icon": "brain-circuit"
},
"navigation": [
{
"group": "",
"pages": ["index.html"]
},
{
"group": "Getting Started",
"pages": [
"examples/installation.html",
"examples/getting_started.html",
"examples/data_format.html",
"examples/exogenous_variables.html",
"examples/time_series_scaling.html",
"examples/automatic_hyperparameter_tuning.html",
"examples/cross_validation_tutorial.html",
"examples/predictinsample.html",
"examples/save_load_models.html",
"examples/getting_started_complete.html",
"examples/neuralforecast_map.html",
"examples/how_to_add_models.html",
"examples/mlflow_and_neuralforecast.html"
]
},
{
"group": "Tutorials",
"pages": [
"examples/signal_decomposition.html",
"examples/uncertaintyintervals.html",
"examples/longhorizon_probabilistic.html",
"examples/longhorizon_with_transformers.html",
"examples/intermittentdata.html",
"examples/robust_regression.html",
"examples/electricitypeakforecasting.html",
"examples/hierarchicalnetworks.html",
"examples/transfer_learning.html",
"examples/temporal_classifiers.html",
"examples/predictive_maintenance.html",
"examples/statsmlneuralmethods.html"
]
},
{
"group": "",
"pages": ["examples/models_intro"]
},
{
"group": "API Reference",
"pages": [
"core.html",
"models.html",
{
"group": "Models' Documentation",
"pages": [
{
"group": "A. RNN-Based",
"pages": [
"models.rnn.html",
"models.gru.html",
"models.lstm.html",
"models.dilated_rnn.html",
"models.tcn.html",
"models.deepar.html"
]
},
{
"group": "B. MLP-Based",
"pages": [
"models.mlp.html",
"models.nhits.html",
"models.nbeats.html",
"models.nbeatsx.html"
]
},
{
"group": "C. Transformer-Based",
"pages": [
"models.tft.html",
"models.vanillatransformer.html",
"models.informer.html",
"models.autoformer.html",
"models.patchtst.html"
]
},
{
"group": "D. CNN-Based",
"pages": ["models.timesnet.html"]
},
{
"group": "E. Multivariate",
"pages": ["models.hint.html", "models.stemgnn.html"]
}
]
},
{
"group": "Train/Evaluation",
"pages": ["losses.pytorch.html", "losses.numpy.html"]
},
{
"group": "Common Components",
"pages": [
"common.base_auto.html",
"common.base_recurrent.html",
"common.base_windows.html",
"common.scalers.html",
"common.modules.html"
]
},
{
"group": "Utils",
"pages": ["tsdataset.html", "utils.html"]
}
]
}
]
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp models.autoformer"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Autoformer"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The Autoformer model tackles the challenge of finding reliable dependencies on intricate temporal patterns of long-horizon forecasting.\n",
"\n",
"The architecture has the following distinctive features:\n",
"- In-built progressive decomposition in trend and seasonal compontents based on a moving average filter.\n",
"- Auto-Correlation mechanism that discovers the period-based dependencies by\n",
"calculating the autocorrelation and aggregating similar sub-series based on the periodicity.\n",
"- Classic encoder-decoder proposed by Vaswani et al. (2017) with a multi-head attention mechanism.\n",
"\n",
"The Autoformer model utilizes a three-component approach to define its embedding:\n",
"- It employs encoded autoregressive features obtained from a convolution network.\n",
"- Absolute positional embeddings obtained from calendar features are utilized."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**References**<br>\n",
"- [Wu, Haixu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. \"Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting\"](https://proceedings.neurips.cc/paper/2021/hash/bcc0d400288793e8bdcd7c19a8ac0c2b-Abstract.html)<br>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![Figure 1. Autoformer Architecture.](imgs_models/autoformer.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import math\n",
"import numpy as np\n",
"from typing import Optional\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"\n",
"from neuralforecast.common._modules import DataEmbedding\n",
"from neuralforecast.common._base_windows import BaseWindows\n",
"\n",
"from neuralforecast.losses.pytorch import MAE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Auxiliary Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class AutoCorrelation(nn.Module):\n",
" \"\"\"\n",
" AutoCorrelation Mechanism with the following two phases:\n",
" (1) period-based dependencies discovery\n",
" (2) time delay aggregation\n",
" This block can replace the self-attention family mechanism seamlessly.\n",
" \"\"\"\n",
" def __init__(self, mask_flag=True, factor=1, scale=None, attention_dropout=0.1, output_attention=False):\n",
" super(AutoCorrelation, self).__init__()\n",
" self.factor = factor\n",
" self.scale = scale\n",
" self.mask_flag = mask_flag\n",
" self.output_attention = output_attention\n",
" self.dropout = nn.Dropout(attention_dropout)\n",
"\n",
" def time_delay_agg_training(self, values, corr):\n",
" \"\"\"\n",
" SpeedUp version of Autocorrelation (a batch-normalization style design)\n",
" This is for the training phase.\n",
" \"\"\"\n",
" head = values.shape[1]\n",
" channel = values.shape[2]\n",
" length = values.shape[3]\n",
" # find top k\n",
" top_k = int(self.factor * math.log(length))\n",
" mean_value = torch.mean(torch.mean(corr, dim=1), dim=1)\n",
" index = torch.topk(torch.mean(mean_value, dim=0), top_k, dim=-1)[1]\n",
" weights = torch.stack([mean_value[:, index[i]] for i in range(top_k)], dim=-1)\n",
" # update corr\n",
" tmp_corr = torch.softmax(weights, dim=-1)\n",
" # aggregation\n",
" tmp_values = values\n",
" delays_agg = torch.zeros_like(values, dtype=torch.float, device=values.device)\n",
" for i in range(top_k):\n",
" pattern = torch.roll(tmp_values, -int(index[i]), -1)\n",
" delays_agg = delays_agg + pattern * \\\n",
" (tmp_corr[:, i].unsqueeze(1).unsqueeze(1).unsqueeze(1).repeat(1, head, channel, length))\n",
" return delays_agg\n",
"\n",
" def time_delay_agg_inference(self, values, corr):\n",
" \"\"\"\n",
" SpeedUp version of Autocorrelation (a batch-normalization style design)\n",
" This is for the inference phase.\n",
" \"\"\"\n",
" batch = values.shape[0]\n",
" head = values.shape[1]\n",
" channel = values.shape[2]\n",
" length = values.shape[3]\n",
" # index init\n",
" init_index = torch.arange(length, device=values.device).unsqueeze(0).unsqueeze(0).unsqueeze(0).repeat(batch, head, channel, 1)\n",
" # find top k\n",
" top_k = int(self.factor * math.log(length))\n",
" mean_value = torch.mean(torch.mean(corr, dim=1), dim=1)\n",
" weights = torch.topk(mean_value, top_k, dim=-1)[0]\n",
" delay = torch.topk(mean_value, top_k, dim=-1)[1]\n",
" # update corr\n",
" tmp_corr = torch.softmax(weights, dim=-1)\n",
" # aggregation\n",
" tmp_values = values.repeat(1, 1, 1, 2)\n",
" delays_agg = torch.zeros_like(values, dtype=torch.float, device=values.device)\n",
" for i in range(top_k):\n",
" tmp_delay = init_index + delay[:, i].unsqueeze(1).unsqueeze(1).unsqueeze(1).repeat(1, head, channel, length)\n",
" pattern = torch.gather(tmp_values, dim=-1, index=tmp_delay)\n",
" delays_agg = delays_agg + pattern * \\\n",
" (tmp_corr[:, i].unsqueeze(1).unsqueeze(1).unsqueeze(1).repeat(1, head, channel, length))\n",
" return delays_agg\n",
"\n",
" def time_delay_agg_full(self, values, corr):\n",
" \"\"\"\n",
" Standard version of Autocorrelation\n",
" \"\"\"\n",
" batch = values.shape[0]\n",
" head = values.shape[1]\n",
" channel = values.shape[2]\n",
" length = values.shape[3]\n",
" # index init\n",
" init_index = torch.arange(length, device=values.device).unsqueeze(0).unsqueeze(0).unsqueeze(0).repeat(batch, head, channel, 1)\n",
" # find top k\n",
" top_k = int(self.factor * math.log(length))\n",
" weights = torch.topk(corr, top_k, dim=-1)[0]\n",
" delay = torch.topk(corr, top_k, dim=-1)[1]\n",
" # update corr\n",
" tmp_corr = torch.softmax(weights, dim=-1)\n",
" # aggregation\n",
" tmp_values = values.repeat(1, 1, 1, 2)\n",
" delays_agg = torch.zeros_like(values, dtype=torch.float, device=values.device)\n",
" for i in range(top_k):\n",
" tmp_delay = init_index + delay[..., i].unsqueeze(-1)\n",
" pattern = torch.gather(tmp_values, dim=-1, index=tmp_delay)\n",
" delays_agg = delays_agg + pattern * (tmp_corr[..., i].unsqueeze(-1))\n",
" return delays_agg\n",
"\n",
" def forward(self, queries, keys, values, attn_mask):\n",
" B, L, H, E = queries.shape\n",
" _, S, _, D = values.shape\n",
" if L > S:\n",
" zeros = torch.zeros_like(queries[:, :(L - S), :], dtype=torch.float, device=queries.device)\n",
" values = torch.cat([values, zeros], dim=1)\n",
" keys = torch.cat([keys, zeros], dim=1)\n",
" else:\n",
" values = values[:, :L, :, :]\n",
" keys = keys[:, :L, :, :]\n",
"\n",
" # period-based dependencies\n",
" q_fft = torch.fft.rfft(queries.permute(0, 2, 3, 1).contiguous(), dim=-1)\n",
" k_fft = torch.fft.rfft(keys.permute(0, 2, 3, 1).contiguous(), dim=-1)\n",
" res = q_fft * torch.conj(k_fft)\n",
" corr = torch.fft.irfft(res, dim=-1)\n",
"\n",
" # time delay agg\n",
" if self.training:\n",
" V = self.time_delay_agg_training(values.permute(0, 2, 3, 1).contiguous(), corr).permute(0, 3, 1, 2)\n",
" else:\n",
" V = self.time_delay_agg_inference(values.permute(0, 2, 3, 1).contiguous(), corr).permute(0, 3, 1, 2)\n",
"\n",
" if self.output_attention:\n",
" return (V.contiguous(), corr.permute(0, 3, 1, 2))\n",
" else:\n",
" return (V.contiguous(), None)\n",
"\n",
"\n",
"class AutoCorrelationLayer(nn.Module):\n",
" def __init__(self, correlation, hidden_size, n_head, d_keys=None,\n",
" d_values=None):\n",
" super(AutoCorrelationLayer, self).__init__()\n",
"\n",
" d_keys = d_keys or (hidden_size // n_head)\n",
" d_values = d_values or (hidden_size // n_head)\n",
"\n",
" self.inner_correlation = correlation\n",
" self.query_projection = nn.Linear(hidden_size, d_keys * n_head)\n",
" self.key_projection = nn.Linear(hidden_size, d_keys * n_head)\n",
" self.value_projection = nn.Linear(hidden_size, d_values * n_head)\n",
" self.out_projection = nn.Linear(d_values * n_head, hidden_size)\n",
" self.n_head = n_head\n",
"\n",
" def forward(self, queries, keys, values, attn_mask):\n",
" B, L, _ = queries.shape\n",
" _, S, _ = keys.shape\n",
" H = self.n_head\n",
"\n",
" queries = self.query_projection(queries).view(B, L, H, -1)\n",
" keys = self.key_projection(keys).view(B, S, H, -1)\n",
" values = self.value_projection(values).view(B, S, H, -1)\n",
"\n",
" out, attn = self.inner_correlation(\n",
" queries,\n",
" keys,\n",
" values,\n",
" attn_mask\n",
" )\n",
" out = out.view(B, L, -1)\n",
"\n",
" return self.out_projection(out), attn\n",
" \n",
"\n",
"class LayerNorm(nn.Module):\n",
" \"\"\"\n",
" Special designed layernorm for the seasonal part\n",
" \"\"\"\n",
" def __init__(self, channels):\n",
" super(LayerNorm, self).__init__()\n",
" self.layernorm = nn.LayerNorm(channels)\n",
"\n",
" def forward(self, x):\n",
" x_hat = self.layernorm(x)\n",
" bias = torch.mean(x_hat, dim=1).unsqueeze(1).repeat(1, x.shape[1], 1)\n",
" return x_hat - bias\n",
"\n",
"\n",
"class MovingAvg(nn.Module):\n",
" \"\"\"\n",
" Moving average block to highlight the trend of time series\n",
" \"\"\"\n",
" def __init__(self, kernel_size, stride):\n",
" super(MovingAvg, self).__init__()\n",
" self.kernel_size = kernel_size\n",
" self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=stride, padding=0)\n",
"\n",
" def forward(self, x):\n",
" # padding on the both ends of time series\n",
" front = x[:, 0:1, :].repeat(1, (self.kernel_size - 1) // 2, 1)\n",
" end = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)\n",
" x = torch.cat([front, x, end], dim=1)\n",
" x = self.avg(x.permute(0, 2, 1))\n",
" x = x.permute(0, 2, 1)\n",
" return x\n",
"\n",
"\n",
"class SeriesDecomp(nn.Module):\n",
" \"\"\"\n",
" Series decomposition block\n",
" \"\"\"\n",
" def __init__(self, kernel_size):\n",
" super(SeriesDecomp, self).__init__()\n",
" self.MovingAvg = MovingAvg(kernel_size, stride=1)\n",
"\n",
" def forward(self, x):\n",
" moving_mean = self.MovingAvg(x)\n",
" res = x - moving_mean\n",
" return res, moving_mean\n",
"\n",
"\n",
"class EncoderLayer(nn.Module):\n",
" \"\"\"\n",
" Autoformer encoder layer with the progressive decomposition architecture\n",
" \"\"\"\n",
" def __init__(self, attention, hidden_size, conv_hidden_size=None, MovingAvg=25, dropout=0.1, activation=\"relu\"):\n",
" super(EncoderLayer, self).__init__()\n",
" conv_hidden_size = conv_hidden_size or 4 * hidden_size\n",
" self.attention = attention\n",
" self.conv1 = nn.Conv1d(in_channels=hidden_size, out_channels=conv_hidden_size, kernel_size=1, bias=False)\n",
" self.conv2 = nn.Conv1d(in_channels=conv_hidden_size, out_channels=hidden_size, kernel_size=1, bias=False)\n",
" self.decomp1 = SeriesDecomp(MovingAvg)\n",
" self.decomp2 = SeriesDecomp(MovingAvg)\n",
" self.dropout = nn.Dropout(dropout)\n",
" self.activation = F.relu if activation == \"relu\" else F.gelu\n",
"\n",
" def forward(self, x, attn_mask=None):\n",
" new_x, attn = self.attention(\n",
" x, x, x,\n",
" attn_mask=attn_mask\n",
" )\n",
" x = x + self.dropout(new_x)\n",
" x, _ = self.decomp1(x)\n",
" y = x\n",
" y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))\n",
" y = self.dropout(self.conv2(y).transpose(-1, 1))\n",
" res, _ = self.decomp2(x + y)\n",
" return res, attn\n",
"\n",
"\n",
"class Encoder(nn.Module):\n",
" \"\"\"\n",
" Autoformer encoder\n",
" \"\"\"\n",
" def __init__(self, attn_layers, conv_layers=None, norm_layer=None):\n",
" super(Encoder, self).__init__()\n",
" self.attn_layers = nn.ModuleList(attn_layers)\n",
" self.conv_layers = nn.ModuleList(conv_layers) if conv_layers is not None else None\n",
" self.norm = norm_layer\n",
"\n",
" def forward(self, x, attn_mask=None):\n",
" attns = []\n",
" if self.conv_layers is not None:\n",
" for attn_layer, conv_layer in zip(self.attn_layers, self.conv_layers):\n",
" x, attn = attn_layer(x, attn_mask=attn_mask)\n",
" x = conv_layer(x)\n",
" attns.append(attn)\n",
" x, attn = self.attn_layers[-1](x)\n",
" attns.append(attn)\n",
" else:\n",
" for attn_layer in self.attn_layers:\n",
" x, attn = attn_layer(x, attn_mask=attn_mask)\n",
" attns.append(attn)\n",
"\n",
" if self.norm is not None:\n",
" x = self.norm(x)\n",
"\n",
" return x, attns\n",
"\n",
"\n",
"class DecoderLayer(nn.Module):\n",
" \"\"\"\n",
" Autoformer decoder layer with the progressive decomposition architecture\n",
" \"\"\"\n",
" def __init__(self, self_attention, cross_attention, hidden_size, c_out, conv_hidden_size=None,\n",
" MovingAvg=25, dropout=0.1, activation=\"relu\"):\n",
" super(DecoderLayer, self).__init__()\n",
" conv_hidden_size = conv_hidden_size or 4 * hidden_size\n",
" self.self_attention = self_attention\n",
" self.cross_attention = cross_attention\n",
" self.conv1 = nn.Conv1d(in_channels=hidden_size, out_channels=conv_hidden_size, kernel_size=1, bias=False)\n",
" self.conv2 = nn.Conv1d(in_channels=conv_hidden_size, out_channels=hidden_size, kernel_size=1, bias=False)\n",
" self.decomp1 = SeriesDecomp(MovingAvg)\n",
" self.decomp2 = SeriesDecomp(MovingAvg)\n",
" self.decomp3 = SeriesDecomp(MovingAvg)\n",
" self.dropout = nn.Dropout(dropout)\n",
" self.projection = nn.Conv1d(in_channels=hidden_size, out_channels=c_out, kernel_size=3, stride=1, padding=1,\n",
" padding_mode='circular', bias=False)\n",
" self.activation = F.relu if activation == \"relu\" else F.gelu\n",
"\n",
" def forward(self, x, cross, x_mask=None, cross_mask=None):\n",
" x = x + self.dropout(self.self_attention(\n",
" x, x, x,\n",
" attn_mask=x_mask\n",
" )[0])\n",
" x, trend1 = self.decomp1(x)\n",
" x = x + self.dropout(self.cross_attention(\n",
" x, cross, cross,\n",
" attn_mask=cross_mask\n",
" )[0])\n",
" x, trend2 = self.decomp2(x)\n",
" y = x\n",
" y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))\n",
" y = self.dropout(self.conv2(y).transpose(-1, 1))\n",
" x, trend3 = self.decomp3(x + y)\n",
"\n",
" residual_trend = trend1 + trend2 + trend3\n",
" residual_trend = self.projection(residual_trend.permute(0, 2, 1)).transpose(1, 2)\n",
" return x, residual_trend\n",
"\n",
"\n",
"class Decoder(nn.Module):\n",
" \"\"\"\n",
" Autoformer decoder\n",
" \"\"\"\n",
" def __init__(self, layers, norm_layer=None, projection=None):\n",
" super(Decoder, self).__init__()\n",
" self.layers = nn.ModuleList(layers)\n",
" self.norm = norm_layer\n",
" self.projection = projection\n",
"\n",
" def forward(self, x, cross, x_mask=None, cross_mask=None, trend=None):\n",
" for layer in self.layers:\n",
" x, residual_trend = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)\n",
" trend = trend + residual_trend\n",
"\n",
" if self.norm is not None:\n",
" x = self.norm(x)\n",
"\n",
" if self.projection is not None:\n",
" x = self.projection(x)\n",
" return x, trend"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Autoformer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class Autoformer(BaseWindows):\n",
" \"\"\" Autoformer\n",
"\n",
" The Autoformer model tackles the challenge of finding reliable dependencies on intricate temporal patterns of long-horizon forecasting.\n",
"\n",
" The architecture has the following distinctive features:\n",
" - In-built progressive decomposition in trend and seasonal compontents based on a moving average filter.\n",
" - Auto-Correlation mechanism that discovers the period-based dependencies by\n",
" calculating the autocorrelation and aggregating similar sub-series based on the periodicity.\n",
" - Classic encoder-decoder proposed by Vaswani et al. (2017) with a multi-head attention mechanism.\n",
"\n",
" The Autoformer model utilizes a three-component approach to define its embedding:\n",
" - It employs encoded autoregressive features obtained from a convolution network.\n",
" - Absolute positional embeddings obtained from calendar features are utilized.\n",
"\n",
" *Parameters:*<br>\n",
" `h`: int, forecast horizon.<br>\n",
" `input_size`: int, maximum sequence length for truncated train backpropagation. Default -1 uses all history.<br>\n",
" `futr_exog_list`: str list, future exogenous columns.<br>\n",
" `hist_exog_list`: str list, historic exogenous columns.<br>\n",
" `stat_exog_list`: str list, static exogenous columns.<br>\n",
" `exclude_insample_y`: bool=False, the model skips the autoregressive features y[t-input_size:t] if True.<br>\n",
"\t`decoder_input_size_multiplier`: float = 0.5, .<br>\n",
" `hidden_size`: int=128, units of embeddings and encoders.<br>\n",
" `n_head`: int=4, controls number of multi-head's attention.<br>\n",
" `dropout`: float (0, 1), dropout throughout Autoformer architecture.<br>\n",
"\t`factor`: int=3, Probsparse attention factor.<br>\n",
"\t`conv_hidden_size`: int=32, channels of the convolutional encoder.<br>\n",
"\t`activation`: str=`GELU`, activation from ['ReLU', 'Softplus', 'Tanh', 'SELU', 'LeakyReLU', 'PReLU', 'Sigmoid', 'GELU'].<br>\n",
" `encoder_layers`: int=2, number of layers for the TCN encoder.<br>\n",
" `decoder_layers`: int=1, number of layers for the MLP decoder.<br>\n",
" `distil`: bool = True, wether the Autoformer decoder uses bottlenecks.<br>\n",
" `loss`: PyTorch module, instantiated train loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `max_steps`: int=1000, maximum number of training steps.<br>\n",
" `learning_rate`: float=1e-3, Learning rate between (0, 1).<br>\n",
" `num_lr_decays`: int=-1, Number of learning rate decays, evenly distributed across max_steps.<br>\n",
" `early_stop_patience_steps`: int=-1, Number of validation iterations before early stopping.<br>\n",
" `val_check_steps`: int=100, Number of training steps between every validation loss check.<br>\n",
" `batch_size`: int=32, number of different series in each batch.<br>\n",
" `valid_batch_size`: int=None, number of different series in each validation and test batch, if None uses batch_size.<br>\n",
" `windows_batch_size`: int=1024, number of windows to sample in each training batch, default uses all.<br>\n",
" `inference_windows_batch_size`: int=1024, number of windows to sample in each inference batch.<br>\n",
" `start_padding_enabled`: bool=False, if True, the model will pad the time series with zeros at the beginning, by input size.<br>\n",
" `scaler_type`: str='robust', type of scaler for temporal inputs normalization see [temporal scalers](https://nixtla.github.io/neuralforecast/common.scalers.html).<br>\n",
" `random_seed`: int=1, random_seed for pytorch initializer and numpy generators.<br>\n",
" `num_workers_loader`: int=os.cpu_count(), workers to be used by `TimeSeriesDataLoader`.<br>\n",
" `drop_last_loader`: bool=False, if True `TimeSeriesDataLoader` drops last non-full batch.<br>\n",
" `alias`: str, optional, Custom name of the model.<br>\n",
" `optimizer`: Subclass of 'torch.optim.Optimizer', optional, user specified optimizer instead of the default choice (Adam).<br>\n",
" `optimizer_kwargs`: dict, optional, list of parameters used by the user specified `optimizer`.<br>\n",
" `**trainer_kwargs`: int, keyword trainer arguments inherited from [PyTorch Lighning's trainer](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).<br>\n",
"\n",
"\t*References*<br>\n",
"\t- [Wu, Haixu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. \"Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting\"](https://proceedings.neurips.cc/paper/2021/hash/bcc0d400288793e8bdcd7c19a8ac0c2b-Abstract.html)<br>\n",
" \"\"\"\n",
" # Class attributes\n",
" SAMPLING_TYPE = 'windows'\n",
"\n",
" def __init__(self,\n",
" h: int, \n",
" input_size: int,\n",
" stat_exog_list = None,\n",
" hist_exog_list = None,\n",
" futr_exog_list = None,\n",
" exclude_insample_y = False,\n",
" decoder_input_size_multiplier: float = 0.5,\n",
" hidden_size: int = 128, \n",
" dropout: float = 0.05,\n",
" factor: int = 3,\n",
" n_head: int = 4,\n",
" conv_hidden_size: int = 32,\n",
" activation: str = 'gelu',\n",
" encoder_layers: int = 2, \n",
" decoder_layers: int = 1,\n",
" MovingAvg_window: int = 25,\n",
" loss = MAE(),\n",
" valid_loss = None,\n",
" max_steps: int = 5000,\n",
" learning_rate: float = 1e-4,\n",
" num_lr_decays: int = -1,\n",
" early_stop_patience_steps: int =-1,\n",
" val_check_steps: int = 100,\n",
" batch_size: int = 32,\n",
" valid_batch_size: Optional[int] = None,\n",
" windows_batch_size = 1024,\n",
" inference_windows_batch_size = 1024,\n",
" start_padding_enabled = False,\n",
" step_size: int = 1,\n",
" scaler_type: str = 'identity',\n",
" random_seed: int = 1,\n",
" num_workers_loader: int = 0,\n",
" drop_last_loader: bool = False,\n",
" optimizer = None,\n",
" optimizer_kwargs = None,\n",
" **trainer_kwargs):\n",
" super(Autoformer, self).__init__(h=h,\n",
" input_size=input_size,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" futr_exog_list = futr_exog_list,\n",
" exclude_insample_y = exclude_insample_y,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" max_steps=max_steps,\n",
" learning_rate=learning_rate,\n",
" num_lr_decays=num_lr_decays,\n",
" early_stop_patience_steps=early_stop_patience_steps,\n",
" val_check_steps=val_check_steps,\n",
" batch_size=batch_size,\n",
" windows_batch_size=windows_batch_size,\n",
" valid_batch_size=valid_batch_size,\n",
" inference_windows_batch_size=inference_windows_batch_size,\n",
" start_padding_enabled = start_padding_enabled,\n",
" step_size=step_size,\n",
" scaler_type=scaler_type,\n",
" num_workers_loader=num_workers_loader,\n",
" drop_last_loader=drop_last_loader,\n",
" random_seed=random_seed,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" **trainer_kwargs)\n",
"\n",
" # Architecture\n",
" self.futr_input_size = len(self.futr_exog_list)\n",
" self.hist_input_size = len(self.hist_exog_list)\n",
" self.stat_input_size = len(self.stat_exog_list)\n",
"\n",
" if self.stat_input_size > 0:\n",
" raise Exception('Autoformer does not support static variables yet')\n",
" \n",
" if self.hist_input_size > 0:\n",
" raise Exception('Autoformer does not support historical variables yet')\n",
"\n",
" self.label_len = int(np.ceil(input_size * decoder_input_size_multiplier))\n",
" if (self.label_len >= input_size) or (self.label_len <= 0):\n",
" raise Exception(f'Check decoder_input_size_multiplier={decoder_input_size_multiplier}, range (0,1)')\n",
"\n",
" if activation not in ['relu', 'gelu']:\n",
" raise Exception(f'Check activation={activation}')\n",
" \n",
" self.c_out = self.loss.outputsize_multiplier\n",
" self.output_attention = False\n",
" self.enc_in = 1 \n",
" self.dec_in = 1\n",
"\n",
" # Decomposition\n",
" self.decomp = SeriesDecomp(MovingAvg_window)\n",
"\n",
" # Embedding\n",
" self.enc_embedding = DataEmbedding(c_in=self.enc_in,\n",
" exog_input_size=self.hist_input_size,\n",
" hidden_size=hidden_size, \n",
" pos_embedding=False,\n",
" dropout=dropout)\n",
" self.dec_embedding = DataEmbedding(self.dec_in,\n",
" exog_input_size=self.hist_input_size,\n",
" hidden_size=hidden_size, \n",
" pos_embedding=False,\n",
" dropout=dropout)\n",
"\n",
" # Encoder\n",
" self.encoder = Encoder(\n",
" [\n",
" EncoderLayer(\n",
" AutoCorrelationLayer(\n",
" AutoCorrelation(False, factor,\n",
" attention_dropout=dropout,\n",
" output_attention=self.output_attention),\n",
" hidden_size, n_head),\n",
" hidden_size=hidden_size,\n",
" conv_hidden_size=conv_hidden_size,\n",
" MovingAvg=MovingAvg_window,\n",
" dropout=dropout,\n",
" activation=activation\n",
" ) for l in range(encoder_layers)\n",
" ],\n",
" norm_layer=LayerNorm(hidden_size)\n",
" )\n",
" # Decoder\n",
" self.decoder = Decoder(\n",
" [\n",
" DecoderLayer(\n",
" AutoCorrelationLayer(\n",
" AutoCorrelation(True, factor, attention_dropout=dropout, output_attention=False),\n",
" hidden_size, n_head),\n",
" AutoCorrelationLayer(\n",
" AutoCorrelation(False, factor, attention_dropout=dropout, output_attention=False),\n",
" hidden_size, n_head),\n",
" hidden_size=hidden_size,\n",
" c_out=self.c_out,\n",
" conv_hidden_size=conv_hidden_size,\n",
" MovingAvg=MovingAvg_window,\n",
" dropout=dropout,\n",
" activation=activation,\n",
" )\n",
" for l in range(decoder_layers)\n",
" ],\n",
" norm_layer=LayerNorm(hidden_size),\n",
" projection=nn.Linear(hidden_size, self.c_out, bias=True)\n",
" )\n",
"\n",
" def forward(self, windows_batch):\n",
" # Parse windows_batch\n",
" insample_y = windows_batch['insample_y']\n",
" #insample_mask = windows_batch['insample_mask']\n",
" #hist_exog = windows_batch['hist_exog']\n",
" #stat_exog = windows_batch['stat_exog']\n",
" futr_exog = windows_batch['futr_exog']\n",
"\n",
" # Parse inputs\n",
" insample_y = insample_y.unsqueeze(-1) # [Ws,L,1]\n",
" if self.futr_input_size > 0:\n",
" x_mark_enc = futr_exog[:,:self.input_size,:]\n",
" x_mark_dec = futr_exog[:,-(self.label_len+self.h):,:]\n",
" else:\n",
" x_mark_enc = None\n",
" x_mark_dec = None\n",
"\n",
" x_dec = torch.zeros(size=(len(insample_y),self.h,1), device=insample_y.device)\n",
" x_dec = torch.cat([insample_y[:,-self.label_len:,:], x_dec], dim=1)\n",
"\n",
" # decomp init\n",
" mean = torch.mean(insample_y, dim=1).unsqueeze(1).repeat(1, self.h, 1)\n",
" zeros = torch.zeros([x_dec.shape[0], self.h, x_dec.shape[2]], device=insample_y.device)\n",
" seasonal_init, trend_init = self.decomp(insample_y)\n",
" # decoder input\n",
" trend_init = torch.cat([trend_init[:, -self.label_len:, :], mean], dim=1)\n",
" seasonal_init = torch.cat([seasonal_init[:, -self.label_len:, :], zeros], dim=1)\n",
" # enc\n",
" enc_out = self.enc_embedding(insample_y, x_mark_enc)\n",
" enc_out, attns = self.encoder(enc_out, attn_mask=None)\n",
" # dec\n",
" dec_out = self.dec_embedding(seasonal_init, x_mark_dec)\n",
" seasonal_part, trend_part = self.decoder(dec_out, enc_out, x_mask=None, cross_mask=None,\n",
" trend=trend_init)\n",
" # final\n",
" dec_out = trend_part + seasonal_part\n",
"\n",
" forecast = self.loss.domain_map(dec_out[:, -self.h:])\n",
" return forecast"
]
},
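{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `forward` method above initializes the decoder with a trend/seasonal split of the insample window. The snippet below is a minimal sketch of that initialization, assuming `SeriesDecomp` implements the padded moving-average decomposition of the original Autoformer paper (the actual module is defined earlier in this notebook); names mirror the ones used in `forward` and the helper is illustrative only.\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"def moving_avg_decomp(insample_y, window=25):\n",
"    # insample_y: [Ws, L, 1]; replicate-pad both ends so the moving average keeps length L\n",
"    pad_front = (window - 1) // 2\n",
"    pad_back = window - 1 - pad_front\n",
"    front = insample_y[:, :1, :].repeat(1, pad_front, 1)\n",
"    back = insample_y[:, -1:, :].repeat(1, pad_back, 1)\n",
"    padded = torch.cat([front, insample_y, back], dim=1)\n",
"    trend = F.avg_pool1d(padded.permute(0, 2, 1), kernel_size=window, stride=1).permute(0, 2, 1)\n",
"    seasonal = insample_y - trend  # residual after removing the trend\n",
"    return seasonal, trend\n",
"\n",
"# Decoder initialization as in Autoformer.forward: the trend branch is extended with the\n",
"# window mean over the horizon h, the seasonal branch with zeros.\n",
"Ws, L, h, label_len = 4, 24, 12, 12\n",
"insample_y = torch.randn(Ws, L, 1)\n",
"seasonal_init, trend_init = moving_avg_decomp(insample_y)\n",
"mean = insample_y.mean(dim=1, keepdim=True).repeat(1, h, 1)\n",
"trend_init = torch.cat([trend_init[:, -label_len:, :], mean], dim=1)        # [Ws, label_len + h, 1]\n",
"seasonal_init = torch.cat([seasonal_init[:, -label_len:, :], torch.zeros(Ws, h, 1)], dim=1)\n",
"```"
]
},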
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(Autoformer)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(Autoformer.fit, name='Autoformer.fit')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(Autoformer.predict, name='Autoformer.predict')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pytorch_lightning as pl\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.models import MLP\n",
"from neuralforecast.losses.pytorch import MQLoss, DistributionLoss\n",
"from neuralforecast.tsdataset import TimeSeriesDataset\n",
"from neuralforecast.utils import AirPassengers, AirPassengersPanel, AirPassengersStatic, augment_calendar_df\n",
"\n",
"AirPassengersPanel, calendar_cols = augment_calendar_df(df=AirPassengersPanel, freq='M')\n",
"\n",
"Y_train_df = AirPassengersPanel[AirPassengersPanel.ds<AirPassengersPanel['ds'].values[-12]] # 132 train\n",
"Y_test_df = AirPassengersPanel[AirPassengersPanel.ds>=AirPassengersPanel['ds'].values[-12]].reset_index(drop=True) # 12 test\n",
"\n",
"model = Autoformer(h=12,\n",
" input_size=24,\n",
" hidden_size = 16,\n",
" conv_hidden_size = 32,\n",
" n_head=2,\n",
" loss=MAE(),\n",
" futr_exog_list=calendar_cols,\n",
" scaler_type='robust',\n",
" learning_rate=1e-3,\n",
" max_steps=300,\n",
" val_check_steps=50,\n",
" early_stop_patience_steps=2)\n",
"\n",
"nf = NeuralForecast(\n",
" models=[model],\n",
" freq='M'\n",
")\n",
"nf.fit(df=Y_train_df, static_df=AirPassengersStatic, val_size=12)\n",
"forecasts = nf.predict(futr_df=Y_test_df)\n",
"\n",
"Y_hat_df = forecasts.reset_index(drop=False).drop(columns=['unique_id','ds'])\n",
"plot_df = pd.concat([Y_test_df, Y_hat_df], axis=1)\n",
"plot_df = pd.concat([Y_train_df, plot_df])\n",
"\n",
"if model.loss.is_distribution_output:\n",
" plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
" plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
" plt.plot(plot_df['ds'], plot_df['Autoformer-median'], c='blue', label='median')\n",
" plt.fill_between(x=plot_df['ds'][-12:], \n",
" y1=plot_df['Autoformer-lo-90'][-12:].values, \n",
" y2=plot_df['Autoformer-hi-90'][-12:].values,\n",
" alpha=0.4, label='level 90')\n",
" plt.grid()\n",
" plt.legend()\n",
" plt.plot()\n",
"else:\n",
" plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
" plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
" plt.plot(plot_df['ds'], plot_df['Autoformer'], c='blue', label='Forecast')\n",
" plt.legend()\n",
" plt.grid()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp models.bitcn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# BiTCN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bidirectional Temporal Convolutional Network (BiTCN) is a forecasting architecture based on two temporal convolutional networks (TCNs). The first network ('forward') encodes future covariates of the time series, whereas the second network ('backward') encodes past observations and covariates. This method allows to preserve the temporal information of sequence data, and is computationally more efficient than common RNN methods (LSTM, GRU, ...). As compared to Transformer-based methods, BiTCN has a lower space complexity, i.e. it requires orders of magnitude less parameters.\n",
"\n",
"This model may be a good choice if you seek a small model (small amount of trainable parameters) with few hyperparameters to tune (only 2).\n",
"\n",
"**References**<br>\n",
"-[Olivier Sprangers, Sebastian Schelter, Maarten de Rijke (2023). Parameter-Efficient Deep Probabilistic Forecasting. International Journal of Forecasting 39, no. 1 (1 January 2023): 332–45. URL: https://doi.org/10.1016/j.ijforecast.2021.11.011.](https://doi.org/10.1016/j.ijforecast.2021.11.011)<br>\n",
"-[Shaojie Bai, Zico Kolter, Vladlen Koltun. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. Computing Research Repository, abs/1803.01271. URL: https://arxiv.org/abs/1803.01271.](https://arxiv.org/abs/1803.01271)<br>\n",
"-[van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. Computing Research Repository, abs/1609.03499. URL: http://arxiv.org/abs/1609.03499. arXiv:1609.03499.](https://arxiv.org/abs/1609.03499)<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Figure 1. Visualization of a stack of dilated causal convolutional layers.](imgs_models/bitcn.png)"
]
},
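{
"cell_type": "markdown",
"metadata": {},
"source": [
"A stack of dilated causal convolutions with kernel size $k$ and dilations $1, 2, 4, \\ldots, 2^{n-1}$ has a receptive field of $1 + (k-1)(2^{n}-1)$ steps. The constructor below uses exactly this relation to choose the number of TCN layers so that the backward network covers the input window and the forward network additionally covers the horizon. A small sketch of the arithmetic (the helper name is illustrative, not part of the library API):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def n_dilated_layers(receptive_field: int, kernel_size: int = 2) -> int:\n",
"    # Smallest n such that 1 + (kernel_size - 1) * (2**n - 1) >= receptive_field\n",
"    return int(np.ceil(np.log2((receptive_field - 1) / (kernel_size - 1) + 1)))\n",
"\n",
"input_size, h = 24, 12\n",
"n_bwd = n_dilated_layers(input_size)      # backward TCN covers the input window\n",
"n_fwd = n_dilated_layers(input_size + h)  # forward TCN also covers the horizon\n",
"assert 1 + (2 - 1) * (2 ** n_bwd - 1) >= input_size\n",
"print(n_bwd, n_fwd)  # 5, 6 for input_size=24, h=12\n",
"```"
]
},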
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"from typing import Optional\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import numpy as np\n",
"\n",
"from neuralforecast.losses.pytorch import MAE\n",
"from neuralforecast.common._base_windows import BaseWindows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Auxiliary Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class CustomConv1d(nn.Module):\n",
" def __init__(self, in_channels, out_channels, kernel_size, padding=0, dilation=1, mode='backward', groups=1):\n",
" super().__init__()\n",
" k = np.sqrt(1 / (in_channels * kernel_size))\n",
" weight_data = -k + 2 * k * torch.rand((out_channels, in_channels // groups, kernel_size))\n",
" bias_data = -k + 2 * k * torch.rand((out_channels))\n",
" self.weight = nn.Parameter(weight_data, requires_grad=True)\n",
" self.bias = nn.Parameter(bias_data, requires_grad=True) \n",
" self.dilation = dilation\n",
" self.groups = groups\n",
" if mode == 'backward':\n",
" self.padding_left = padding\n",
" self.padding_right= 0\n",
" elif mode == 'forward':\n",
" self.padding_left = 0\n",
" self.padding_right= padding \n",
"\n",
" def forward(self, x):\n",
" xp = F.pad(x, (self.padding_left, self.padding_right))\n",
" return F.conv1d(xp, self.weight, self.bias, dilation=self.dilation, groups=self.groups)\n",
"\n",
"class TCNCell(nn.Module):\n",
" def __init__(self, in_channels, out_channels, kernel_size, padding, dilation, mode, groups, dropout):\n",
" super().__init__()\n",
" self.conv1 = CustomConv1d(in_channels, out_channels, kernel_size, padding, dilation, mode, groups)\n",
" self.conv2 = CustomConv1d(out_channels, in_channels * 2, 1)\n",
" self.drop = nn.Dropout(dropout)\n",
" \n",
" def forward(self, x):\n",
" h_prev, out_prev = x\n",
" h = self.drop(F.gelu(self.conv1(h_prev)))\n",
" h_next, out_next = self.conv2(h).chunk(2, 1)\n",
" return (h_prev + h_next, out_prev + out_next)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. BiTCN"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class BiTCN(BaseWindows):\n",
" \"\"\" BiTCN\n",
"\n",
" Bidirectional Temporal Convolutional Network (BiTCN) is a forecasting architecture based on two temporal convolutional networks (TCNs). The first network ('forward') encodes future covariates of the time series, whereas the second network ('backward') encodes past observations and covariates. This is a univariate model.\n",
"\n",
" **Parameters:**<br>\n",
" `h`: int, forecast horizon.<br>\n",
" `input_size`: int, considered autorregresive inputs (lags), y=[1,2,3,4] input_size=2 -> lags=[1,2].<br>\n",
" `hidden_size`: int=16, units for the TCN's hidden state size.<br>\n",
" `dropout`: float=0.1, dropout rate used for the dropout layers throughout the architecture.<br>\n",
" `futr_exog_list`: str list, future exogenous columns.<br>\n",
" `hist_exog_list`: str list, historic exogenous columns.<br>\n",
" `stat_exog_list`: str list, static exogenous columns.<br>\n",
" `exclude_insample_y`: bool=False, the model skips the autoregressive features y[t-input_size:t] if True.<br>\n",
" `loss`: PyTorch module, instantiated train loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `valid_loss`: PyTorch module=`loss`, instantiated valid loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `max_steps`: int=1000, maximum number of training steps.<br>\n",
" `learning_rate`: float=1e-3, Learning rate between (0, 1).<br>\n",
" `num_lr_decays`: int=-1, Number of learning rate decays, evenly distributed across max_steps.<br>\n",
" `early_stop_patience_steps`: int=-1, Number of validation iterations before early stopping.<br>\n",
" `val_check_steps`: int=100, Number of training steps between every validation loss check.<br>\n",
" `batch_size`: int=32, number of different series in each batch.<br>\n",
" `valid_batch_size`: int=None, number of different series in each validation and test batch, if None uses batch_size.<br>\n",
" `windows_batch_size`: int=1024, number of windows to sample in each training batch, default uses all.<br>\n",
" `inference_windows_batch_size`: int=-1, number of windows to sample in each inference batch, -1 uses all.<br>\n",
" `start_padding_enabled`: bool=False, if True, the model will pad the time series with zeros at the beginning, by input size.<br>\n",
" `step_size`: int=1, step size between each window of temporal data.<br>\n",
" `scaler_type`: str='identity', type of scaler for temporal inputs normalization see [temporal scalers](https://nixtla.github.io/neuralforecast/common.scalers.html).<br>\n",
" `random_seed`: int=1, random_seed for pytorch initializer and numpy generators.<br>\n",
" `num_workers_loader`: int=os.cpu_count(), workers to be used by `TimeSeriesDataLoader`.<br>\n",
" `drop_last_loader`: bool=False, if True `TimeSeriesDataLoader` drops last non-full batch.<br>\n",
" `alias`: str, optional, Custom name of the model.<br>\n",
" `optimizer`: Subclass of 'torch.optim.Optimizer', optional, user specified optimizer instead of the default choice (Adam).<br>\n",
" `optimizer_kwargs`: dict, optional, list of parameters used by the user specified `optimizer`.<br>\n",
" `**trainer_kwargs`: int, keyword trainer arguments inherited from [PyTorch Lighning's trainer](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).<br> \n",
"\n",
" \"\"\"\n",
" # Class attributes\n",
" SAMPLING_TYPE = 'windows'\n",
" \n",
" def __init__(self,\n",
" h: int,\n",
" input_size: int,\n",
" hidden_size: int = 16,\n",
" dropout: float = 0.5,\n",
" futr_exog_list = None,\n",
" hist_exog_list = None,\n",
" stat_exog_list = None,\n",
" exclude_insample_y = False,\n",
" loss = MAE(),\n",
" valid_loss = None,\n",
" max_steps: int = 1000,\n",
" learning_rate: float = 1e-3,\n",
" num_lr_decays: int = -1,\n",
" early_stop_patience_steps: int =-1,\n",
" val_check_steps: int = 100,\n",
" batch_size: int = 32,\n",
" valid_batch_size: Optional[int] = None,\n",
" windows_batch_size = 1024,\n",
" inference_windows_batch_size = 1024,\n",
" start_padding_enabled = False,\n",
" step_size: int = 1,\n",
" scaler_type: str = 'identity',\n",
" random_seed: int = 1,\n",
" num_workers_loader: int = 0,\n",
" drop_last_loader: bool = False,\n",
" optimizer = None,\n",
" optimizer_kwargs = None,\n",
" **trainer_kwargs):\n",
" super(BiTCN, self).__init__(\n",
" h=h,\n",
" input_size=input_size,\n",
" futr_exog_list=futr_exog_list,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" exclude_insample_y = exclude_insample_y,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" max_steps=max_steps,\n",
" learning_rate=learning_rate,\n",
" num_lr_decays=num_lr_decays,\n",
" early_stop_patience_steps=early_stop_patience_steps,\n",
" val_check_steps=val_check_steps,\n",
" batch_size=batch_size,\n",
" valid_batch_size=valid_batch_size,\n",
" windows_batch_size=windows_batch_size,\n",
" inference_windows_batch_size=inference_windows_batch_size,\n",
" start_padding_enabled=start_padding_enabled,\n",
" step_size=step_size,\n",
" scaler_type=scaler_type,\n",
" random_seed=random_seed,\n",
" num_workers_loader=num_workers_loader,\n",
" drop_last_loader=drop_last_loader,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" **trainer_kwargs\n",
" )\n",
"\n",
" #----------------------------------- Parse dimensions -----------------------------------#\n",
" # TCN\n",
" kernel_size = 2 # Not really necessary as parameter, so simplifying the architecture here.\n",
" self.kernel_size = kernel_size\n",
" self.hidden_size = hidden_size\n",
" self.h = h\n",
" self.input_size = input_size\n",
" self.dropout = dropout\n",
" \n",
" # Calculate required number of TCN layers based on the required receptive field of the TCN\n",
" self.n_layers_bwd = int(np.ceil(np.log2(((self.input_size - 1) / (self.kernel_size - 1)) + 1)))\n",
"\n",
" self.futr_exog_size = len(self.futr_exog_list)\n",
" self.hist_exog_size = len(self.hist_exog_list)\n",
" self.stat_exog_size = len(self.stat_exog_list) \n",
" \n",
" #---------------------------------- Instantiate Model -----------------------------------#\n",
" \n",
" # Dense layers\n",
" self.lin_hist = nn.Linear(1 + self.hist_exog_size + self.stat_exog_size + self.futr_exog_size, hidden_size)\n",
" self.drop_hist = nn.Dropout(dropout)\n",
" \n",
" # TCN looking back\n",
" layers_bwd = [TCNCell(\n",
" hidden_size, \n",
" hidden_size, \n",
" kernel_size, \n",
" padding = (kernel_size-1)*2**i, \n",
" dilation = 2**i, \n",
" mode = 'backward', \n",
" groups = 1, \n",
" dropout = dropout) for i in range(self.n_layers_bwd)] \n",
" self.net_bwd = nn.Sequential(*layers_bwd)\n",
" \n",
" # TCN looking forward when future covariates exist\n",
" output_lin_dim_multiplier = 1\n",
" if self.futr_exog_size > 0:\n",
" self.n_layers_fwd = int(np.ceil(np.log2(((self.h + self.input_size - 1) / (self.kernel_size - 1)) + 1)))\n",
" self.lin_futr = nn.Linear(self.futr_exog_size, hidden_size)\n",
" self.drop_futr = nn.Dropout(dropout)\n",
" layers_fwd = [TCNCell(\n",
" hidden_size, \n",
" hidden_size, \n",
" kernel_size, \n",
" padding = (kernel_size - 1)*2**i, \n",
" dilation = 2**i, \n",
" mode = 'forward', \n",
" groups = 1, \n",
" dropout = dropout) for i in range(self.n_layers_fwd)] \n",
" self.net_fwd = nn.Sequential(*layers_fwd)\n",
" output_lin_dim_multiplier += 2\n",
"\n",
" # Dense temporal and output layers\n",
" self.drop_temporal = nn.Dropout(dropout)\n",
" self.temporal_lin1 = nn.Linear(self.input_size, hidden_size)\n",
" self.temporal_lin2 = nn.Linear(hidden_size, self.h)\n",
" self.output_lin = nn.Linear(output_lin_dim_multiplier * hidden_size, self.loss.outputsize_multiplier)\n",
"\n",
" def forward(self, windows_batch):\n",
" # Parse windows_batch\n",
" x = windows_batch['insample_y'].unsqueeze(-1) # [B, L, 1]\n",
" hist_exog = windows_batch['hist_exog'] # [B, L, X]\n",
" futr_exog = windows_batch['futr_exog'] # [B, L + h, F]\n",
" stat_exog = windows_batch['stat_exog'] # [B, S]\n",
"\n",
" # Concatenate x with historic exogenous\n",
" batch_size, seq_len = x.shape[:2] # B = batch_size, L = seq_len\n",
" if self.hist_exog_size > 0:\n",
" x = torch.cat((x, hist_exog), dim=2) # [B, L, 1] + [B, L, X] -> [B, L, 1 + X]\n",
"\n",
" # Concatenate x with static exogenous\n",
" if self.stat_exog_size > 0:\n",
" stat_exog = stat_exog.unsqueeze(1).repeat(1, seq_len, 1) # [B, S] -> [B, L, S]\n",
" x = torch.cat((x, stat_exog), dim=2) # [B, L, 1 + X] + [B, L, S] -> [B, L, 1 + X + S]\n",
"\n",
" # Concatenate x with future exogenous & apply forward TCN to x_futr\n",
" if self.futr_exog_size > 0:\n",
" x = torch.cat((x, futr_exog[:, :seq_len]), dim=2) # [B, L, 1 + X + S] + [B, L, F] -> [B, L, 1 + X + S + F]\n",
" x_futr = self.drop_futr(self.lin_futr(futr_exog)) # [B, L + h, F] -> [B, L + h, hidden_size]\n",
" x_futr = x_futr.permute(0, 2, 1) # [B, L + h, hidden_size] -> [B, hidden_size, L + h]\n",
" _, x_futr = self.net_fwd((x_futr, 0)) # [B, hidden_size, L + h] -> [B, hidden_size, L + h]\n",
" x_futr_L = x_futr[:, :, :seq_len] # [B, hidden_size, L + h] -> [B, hidden_size, L]\n",
" x_futr_h = x_futr[:, :, seq_len:] # [B, hidden_size, L + h] -> [B, hidden_size, h]\n",
"\n",
" # Apply backward TCN to x\n",
" x = self.drop_hist(self.lin_hist(x)) # [B, L, 1 + X + S + F] -> [B, L, hidden_size]\n",
" x = x.permute(0, 2, 1) # [B, L, hidden_size] -> [B, hidden_size, L]\n",
" _, x = self.net_bwd((x, 0)) # [B, hidden_size, L] -> [B, hidden_size, L]\n",
"\n",
" # Concatenate with future exogenous for seq_len\n",
" if self.futr_exog_size > 0:\n",
" x = torch.cat((x, x_futr_L), dim=1) # [B, hidden_size, L] + [B, hidden_size, L] -> [B, 2 * hidden_size, L]\n",
"\n",
" # Temporal dense layer to go to output horizon\n",
" x = self.drop_temporal(F.gelu(self.temporal_lin1(x))) # [B, 2 * hidden_size, L] -> [B, 2 * hidden_size, hidden_size]\n",
" x = self.temporal_lin2(x) # [B, 2 * hidden_size, hidden_size] -> [B, 2 * hidden_size, h]\n",
" \n",
" # Concatenate with future exogenous for horizon\n",
" if self.futr_exog_size > 0:\n",
" x = torch.cat((x, x_futr_h), dim=1) # [B, 2 * hidden_size, h] + [B, hidden_size, h] -> [B, 3 * hidden_size, h]\n",
"\n",
" # Output layer to create forecasts\n",
" x = x.permute(0, 2, 1) # [B, 3 * hidden_size, h] -> [B, h, 3 * hidden_size]\n",
" x = self.output_lin(x) # [B, h, 3 * hidden_size] -> [B, h, n_outputs] \n",
"\n",
" # Map to output domain\n",
" forecast = self.loss.domain_map(x)\n",
" \n",
" return forecast"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(BiTCN)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(BiTCN.fit, name='BiTCN.fit')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(BiTCN.predict, name='BiTCN.predict')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast.utils import AirPassengersDF as Y_df\n",
"from neuralforecast.tsdataset import TimeSeriesDataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_train_df = Y_df[Y_df.ds<='1959-12-31'] # 132 train\n",
"Y_test_df = Y_df[Y_df.ds>'1959-12-31'] # 12 test\n",
"\n",
"dataset, *_ = TimeSeriesDataset.from_df(Y_train_df)\n",
"model = BiTCN(h=12, input_size=24, max_steps=500, scaler_type='standard')\n",
"model.fit(dataset=dataset)\n",
"y_hat = model.predict(dataset=dataset)\n",
"Y_test_df['BiTCN'] = y_hat\n",
"\n",
"#test we recover the same forecast\n",
"y_hat2 = model.predict(dataset=dataset)\n",
"test_eq(y_hat, y_hat2)\n",
"\n",
"pd.concat([Y_train_df, Y_test_df]).drop('unique_id', axis=1).set_index('ds').plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import pytorch_lightning as pl\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.losses.pytorch import GMM, DistributionLoss\n",
"from neuralforecast.utils import AirPassengersPanel, AirPassengersStatic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_train_df = AirPassengersPanel[AirPassengersPanel.ds<AirPassengersPanel['ds'].values[-12]] # 132 train\n",
"Y_test_df = AirPassengersPanel[AirPassengersPanel.ds>=AirPassengersPanel['ds'].values[-12]].reset_index(drop=True) # 12 test\n",
"\n",
"fcst = NeuralForecast(\n",
" models=[\n",
" BiTCN(h=12,\n",
" input_size=24,\n",
" loss=GMM(n_components=7, return_params=True, level=[80,90]),\n",
" max_steps=500,\n",
" scaler_type='standard',\n",
" futr_exog_list=['y_[lag12]'],\n",
" hist_exog_list=None,\n",
" stat_exog_list=['airline1'],\n",
" ), \n",
" ],\n",
" freq='M'\n",
")\n",
"fcst.fit(df=Y_train_df, static_df=AirPassengersStatic)\n",
"forecasts = fcst.predict(futr_df=Y_test_df)\n",
"\n",
"# Plot quantile predictions\n",
"Y_hat_df = forecasts.reset_index(drop=False).drop(columns=['unique_id','ds'])\n",
"plot_df = pd.concat([Y_test_df, Y_hat_df], axis=1)\n",
"plot_df = pd.concat([Y_train_df, plot_df])\n",
"\n",
"plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
"plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
"plt.plot(plot_df['ds'], plot_df['BiTCN-median'], c='blue', label='median')\n",
"plt.fill_between(x=plot_df['ds'][-12:], \n",
" y1=plot_df['BiTCN-lo-90'][-12:].values,\n",
" y2=plot_df['BiTCN-hi-90'][-12:].values,\n",
" alpha=0.4, label='level 90')\n",
"plt.legend()\n",
"plt.grid()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp models.deepar"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# DeepAR"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The DeepAR model produces probabilistic forecasts based on an autoregressive recurrent neural network optimized on panel data using cross-learning. DeepAR obtains its forecast distribution uses a Markov Chain Monte Carlo sampler with the following conditional probability:\n",
"$$\\mathbb{P}(\\mathbf{y}_{[t+1:t+H]}|\\;\\mathbf{y}_{[:t]},\\; \\mathbf{x}^{(f)}_{[:t+H]},\\; \\mathbf{x}^{(s)})$$\n",
"\n",
"where $\\mathbf{x}^{(s)}$ are static exogenous inputs, $\\mathbf{x}^{(f)}_{[:t+H]}$ are future exogenous available at the time of the prediction.\n",
"The predictions are obtained by transforming the hidden states $\\mathbf{h}_{t}$ into predictive distribution parameters $\\theta_{t}$, and then generating samples $\\mathbf{\\hat{y}}_{[t+1:t+H]}$ through Monte Carlo sampling trajectories.\n",
"\n",
"$$\n",
"\\begin{align}\n",
"\\mathbf{h}_{t} &= \\textrm{RNN}([\\mathbf{y}_{t},\\mathbf{x}^{(f)}_{t+1},\\mathbf{x}^{(s)}], \\mathbf{h}_{t-1})\\\\\n",
"\\mathbf{\\theta}_{t}&=\\textrm{Linear}(\\mathbf{h}_{t}) \\\\\n",
"\\hat{y}_{t+1}&=\\textrm{sample}(\\;\\mathrm{P}(y_{t+1}\\;|\\;\\mathbf{\\theta}_{t})\\;)\n",
"\\end{align}\n",
"$$\n",
"\n",
"**References**<br>\n",
"- [David Salinas, Valentin Flunkert, Jan Gasthaus, Tim Januschowski (2020). \"DeepAR: Probabilistic forecasting with autoregressive recurrent networks\". International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207019301888)<br>\n",
"- [Alexander Alexandrov et. al (2020). \"GluonTS: Probabilistic and Neural Time Series Modeling in Python\". Journal of Machine Learning Research.](https://www.jmlr.org/papers/v21/19-820.html)<br>\n",
"\n",
"\n",
":::{.callout-warning collapse=\"false\"}\n",
"#### Exogenous Variables, Losses, and Parameters Availability\n",
"\n",
"Given the sampling procedure during inference, DeepAR only supports `DistributionLoss` as training loss.\n",
"\n",
"Note that DeepAR generates a non-parametric forecast distribution using Monte Carlo. We use this sampling procedure also during validation to make it closer to the inference procedure. Therefore, only the `MQLoss` is available for validation.\n",
"\n",
"Aditionally, Monte Carlo implies that historic exogenous variables are not available for the model.\n",
":::"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![Figure 1. DeepAR model, during training the optimization signal comes from likelihood of observations, during inference a recurrent multi-step strategy is used to generate predictive distributions.](imgs_models/deepar.jpeg)"
]
},
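{
"cell_type": "markdown",
"metadata": {},
"source": [
"At inference the recurrence above is unrolled one step at a time: the decoder maps the hidden state to distribution parameters, one sample is drawn per Monte Carlo trajectory, and the re-normalized sample is fed back as the next input. The forecast returned to the user is the mean and the requested quantiles across trajectories. A minimal sketch of that final reduction step (names are illustrative, not the library API):\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"# Suppose the RNN was unrolled for B series, S Monte Carlo trajectories and a horizon h,\n",
"# collecting one sample per trajectory and step.\n",
"B, S, h = 8, 100, 12\n",
"samples = torch.randn(B, S, h)                 # stand-in for the sampled trajectories\n",
"quantiles = torch.tensor([0.05, 0.5, 0.95])\n",
"\n",
"point_forecast = samples.mean(dim=1)                     # [B, h] mean over trajectories\n",
"q_forecast = torch.quantile(samples, quantiles, dim=1)   # [Q, B, h]\n",
"q_forecast = q_forecast.permute(1, 2, 0)                 # [B, h, Q]\n",
"y_hat = torch.cat([point_forecast.unsqueeze(-1), q_forecast], dim=-1)  # [B, h, 1 + Q]\n",
"```"
]
},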
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import numpy as np\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"from typing import Optional\n",
"\n",
"from neuralforecast.common._base_windows import BaseWindows\n",
"from neuralforecast.losses.pytorch import DistributionLoss, MQLoss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"import logging\n",
"import warnings\n",
"\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"logging.getLogger(\"pytorch_lightning\").setLevel(logging.ERROR)\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class Decoder(nn.Module):\n",
" \"\"\"Multi-Layer Perceptron Decoder\n",
"\n",
" **Parameters:**<br>\n",
" `in_features`: int, dimension of input.<br>\n",
" `out_features`: int, dimension of output.<br>\n",
" `hidden_size`: int, dimension of hidden layers.<br>\n",
" `num_layers`: int, number of hidden layers.<br>\n",
" \"\"\"\n",
"\n",
" def __init__(self, in_features, out_features, hidden_size, hidden_layers):\n",
" super().__init__()\n",
"\n",
" if hidden_layers == 0:\n",
" # Input layer\n",
" layers = [nn.Linear(in_features=in_features, out_features=out_features)]\n",
" else:\n",
" # Input layer\n",
" layers = [nn.Linear(in_features=in_features, out_features=hidden_size), nn.ReLU()]\n",
" # Hidden layers\n",
" for i in range(hidden_layers - 2):\n",
" layers += [nn.Linear(in_features=hidden_size, out_features=hidden_size), nn.ReLU()]\n",
" # Output layer\n",
" layers += [nn.Linear(in_features=hidden_size, out_features=out_features)]\n",
"\n",
" # Store in layers as ModuleList\n",
" self.layers = nn.Sequential(*layers)\n",
"\n",
" def forward(self, x):\n",
" return self.layers(x)"
]
},
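{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `hidden_layers=0` the decoder collapses to a single linear projection; otherwise it is a small MLP with ReLU activations. A quick, illustrative shape check:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"dec_linear = Decoder(in_features=128, out_features=2, hidden_size=0, hidden_layers=0)  # Linear(128 -> 2)\n",
"dec_mlp = Decoder(in_features=128, out_features=2, hidden_size=64, hidden_layers=2)    # 128 -> 64 -> ReLU -> 2\n",
"\n",
"hidden_states = torch.randn(32, 24, 128)  # [B, T, lstm_hidden_size]\n",
"assert dec_linear(hidden_states).shape == (32, 24, 2)\n",
"assert dec_mlp(hidden_states).shape == (32, 24, 2)\n",
"```"
]
},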
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class DeepAR(BaseWindows):\n",
" \"\"\" DeepAR\n",
"\n",
" **Parameters:**<br>\n",
" `h`: int, Forecast horizon. <br>\n",
" `input_size`: int, autorregresive inputs size, y=[1,2,3,4] input_size=2 -> y_[t-2:t]=[1,2].<br>\n",
" `lstm_n_layers`: int=2, number of LSTM layers.<br>\n",
" `lstm_hidden_size`: int=128, LSTM hidden size.<br>\n",
" `lstm_dropout`: float=0.1, LSTM dropout.<br>\n",
" `decoder_hidden_layers`: int=0, number of decoder MLP hidden layers. Default: 0 for linear layer. <br>\n",
" `decoder_hidden_size`: int=0, decoder MLP hidden size. Default: 0 for linear layer.<br>\n",
" `trajectory_samples`: int=100, number of Monte Carlo trajectories during inference.<br>\n",
" `stat_exog_list`: str list, static exogenous columns.<br>\n",
" `hist_exog_list`: str list, historic exogenous columns.<br>\n",
" `futr_exog_list`: str list, future exogenous columns.<br>\n",
" `exclude_insample_y`: bool=False, the model skips the autoregressive features y[t-input_size:t] if True.<br>\n",
" `loss`: PyTorch module, instantiated train loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `valid_loss`: PyTorch module=`loss`, instantiated valid loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `max_steps`: int=1000, maximum number of training steps.<br>\n",
" `learning_rate`: float=1e-3, Learning rate between (0, 1).<br>\n",
" `num_lr_decays`: int=-1, Number of learning rate decays, evenly distributed across max_steps.<br>\n",
" `early_stop_patience_steps`: int=-1, Number of validation iterations before early stopping.<br>\n",
" `val_check_steps`: int=100, Number of training steps between every validation loss check.<br>\n",
" `batch_size`: int=32, number of different series in each batch.<br>\n",
" `valid_batch_size`: int=None, number of different series in each validation and test batch, if None uses batch_size.<br>\n",
" `windows_batch_size`: int=1024, number of windows to sample in each training batch, default uses all.<br>\n",
" `inference_windows_batch_size`: int=-1, number of windows to sample in each inference batch, -1 uses all.<br>\n",
" `start_padding_enabled`: bool=False, if True, the model will pad the time series with zeros at the beginning, by input size.<br>\n",
" `step_size`: int=1, step size between each window of temporal data.<br>\n",
" `scaler_type`: str='identity', type of scaler for temporal inputs normalization see [temporal scalers](https://nixtla.github.io/neuralforecast/common.scalers.html).<br>\n",
" `random_seed`: int, random_seed for pytorch initializer and numpy generators.<br>\n",
" `num_workers_loader`: int=os.cpu_count(), workers to be used by `TimeSeriesDataLoader`.<br>\n",
" `drop_last_loader`: bool=False, if True `TimeSeriesDataLoader` drops last non-full batch.<br>\n",
" `alias`: str, optional, Custom name of the model.<br>\n",
" `optimizer`: Subclass of 'torch.optim.Optimizer', optional, user specified optimizer instead of the default choice (Adam).<br>\n",
" `optimizer_kwargs`: dict, optional, list of parameters used by the user specified `optimizer`.<br>\n",
" `**trainer_kwargs`: int, keyword trainer arguments inherited from [PyTorch Lighning's trainer](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).<br> \n",
"\n",
" **References**<br>\n",
" - [David Salinas, Valentin Flunkert, Jan Gasthaus, Tim Januschowski (2020). \"DeepAR: Probabilistic forecasting with autoregressive recurrent networks\". International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207019301888)<br>\n",
" - [Alexander Alexandrov et. al (2020). \"GluonTS: Probabilistic and Neural Time Series Modeling in Python\". Journal of Machine Learning Research.](https://www.jmlr.org/papers/v21/19-820.html)<br>\n",
"\n",
" \"\"\"\n",
" # Class attributes\n",
" SAMPLING_TYPE = 'windows'\n",
" \n",
" def __init__(self,\n",
" h,\n",
" input_size: int = -1,\n",
" lstm_n_layers: int = 2,\n",
" lstm_hidden_size: int = 128,\n",
" lstm_dropout: float = 0.1,\n",
" decoder_hidden_layers: int = 0,\n",
" decoder_hidden_size: int = 0,\n",
" trajectory_samples: int = 100,\n",
" futr_exog_list = None,\n",
" hist_exog_list = None,\n",
" stat_exog_list = None,\n",
" exclude_insample_y = False,\n",
" loss = DistributionLoss(distribution='StudentT', level=[80, 90], return_params=False),\n",
" valid_loss = MQLoss(level=[80, 90]),\n",
" max_steps: int = 1000,\n",
" learning_rate: float = 1e-3,\n",
" num_lr_decays: int = 3,\n",
" early_stop_patience_steps: int =-1,\n",
" val_check_steps: int = 100,\n",
" batch_size: int = 32,\n",
" valid_batch_size: Optional[int] = None,\n",
" windows_batch_size: int = 1024,\n",
" inference_windows_batch_size: int = -1,\n",
" start_padding_enabled = False,\n",
" step_size: int = 1,\n",
" scaler_type: str = 'identity',\n",
" random_seed: int = 1,\n",
" num_workers_loader = 0,\n",
" drop_last_loader = False,\n",
" optimizer = None,\n",
" optimizer_kwargs = None,\n",
" **trainer_kwargs):\n",
"\n",
" # DeepAR does not support historic exogenous variables\n",
" if hist_exog_list is not None:\n",
" raise Exception('DeepAR does not support historic exogenous variables.')\n",
"\n",
" if exclude_insample_y:\n",
" raise Exception('DeepAR has no possibility for excluding y.')\n",
" \n",
" if not loss.is_distribution_output:\n",
" raise Exception('DeepAR only supports distributional outputs.')\n",
" \n",
" if str(type(valid_loss)) not in [\"<class 'neuralforecast.losses.pytorch.MQLoss'>\"]:\n",
" raise Exception('DeepAR only supports MQLoss as validation loss.')\n",
"\n",
" if loss.return_params:\n",
" raise Exception('DeepAR does not return distribution parameters due to Monte Carlo sampling.')\n",
" \n",
" # Inherit BaseWindows class\n",
" super(DeepAR, self).__init__(h=h,\n",
" input_size=input_size,\n",
" futr_exog_list=futr_exog_list,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" exclude_insample_y = exclude_insample_y,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" max_steps=max_steps,\n",
" learning_rate=learning_rate,\n",
" num_lr_decays=num_lr_decays,\n",
" early_stop_patience_steps=early_stop_patience_steps,\n",
" val_check_steps=val_check_steps,\n",
" batch_size=batch_size,\n",
" windows_batch_size=windows_batch_size,\n",
" valid_batch_size=valid_batch_size,\n",
" inference_windows_batch_size=inference_windows_batch_size,\n",
" start_padding_enabled=start_padding_enabled,\n",
" step_size=step_size,\n",
" scaler_type=scaler_type,\n",
" num_workers_loader=num_workers_loader,\n",
" drop_last_loader=drop_last_loader,\n",
" random_seed=random_seed,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" **trainer_kwargs)\n",
"\n",
" self.horizon_backup = self.h # Used because h=0 during training\n",
" self.trajectory_samples = trajectory_samples\n",
"\n",
" # LSTM\n",
" self.encoder_n_layers = lstm_n_layers\n",
" self.encoder_hidden_size = lstm_hidden_size\n",
" self.encoder_dropout = lstm_dropout\n",
"\n",
" self.futr_exog_size = len(self.futr_exog_list)\n",
" self.hist_exog_size = 0\n",
" self.stat_exog_size = len(self.stat_exog_list)\n",
" \n",
" # LSTM input size (1 for target variable y)\n",
" input_encoder = 1 + self.futr_exog_size + self.stat_exog_size\n",
"\n",
" # Instantiate model\n",
" self.hist_encoder = nn.LSTM(input_size=input_encoder,\n",
" hidden_size=self.encoder_hidden_size,\n",
" num_layers=self.encoder_n_layers,\n",
" dropout=self.encoder_dropout,\n",
" batch_first=True)\n",
"\n",
" # Decoder MLP\n",
" self.decoder = Decoder(in_features=lstm_hidden_size,\n",
" out_features=self.loss.outputsize_multiplier,\n",
" hidden_size=decoder_hidden_size,\n",
" hidden_layers=decoder_hidden_layers)\n",
"\n",
" # Override BaseWindows method\n",
" def training_step(self, batch, batch_idx):\n",
"\n",
" # During training h=0 \n",
" self.h = 0\n",
" y_idx = batch['y_idx']\n",
"\n",
" # Create and normalize windows [Ws, L, C]\n",
" windows = self._create_windows(batch, step='train')\n",
" original_insample_y = windows['temporal'][:, :, y_idx].clone() # windows: [B, L, Feature] -> [B, L]\n",
" original_insample_y = original_insample_y[:,1:] # Remove first (shift in DeepAr, cell at t outputs t+1)\n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, _, _, _, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
"\n",
" windows_batch = dict(insample_y=insample_y, # [Ws, L]\n",
" insample_mask=insample_mask, # [Ws, L]\n",
" futr_exog=futr_exog, # [Ws, L+H]\n",
" hist_exog=None, # None\n",
" stat_exog=stat_exog,\n",
" y_idx=y_idx) # [Ws, 1]\n",
"\n",
" # Model Predictions\n",
" output = self.train_forward(windows_batch)\n",
"\n",
" if self.loss.is_distribution_output:\n",
" _, y_loc, y_scale = self._inv_normalization(y_hat=original_insample_y,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" outsample_y = original_insample_y\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" mask = insample_mask[:,1:].clone() # Remove first (shift in DeepAr, cell at t outputs t+1)\n",
" loss = self.loss(y=outsample_y, distr_args=distr_args, mask=mask)\n",
" else:\n",
" raise Exception('DeepAR only supports distributional outputs.')\n",
"\n",
" if torch.isnan(loss):\n",
" print('Model Parameters', self.hparams)\n",
" print('insample_y', torch.isnan(insample_y).sum())\n",
" print('outsample_y', torch.isnan(outsample_y).sum())\n",
" print('output', torch.isnan(output).sum())\n",
" raise Exception('Loss is NaN, training stopped.')\n",
"\n",
" self.log(\n",
" 'train_loss',\n",
" loss.item(),\n",
" batch_size=outsample_y.size(0),\n",
" prog_bar=True,\n",
" on_epoch=True,\n",
" )\n",
" self.train_trajectories.append((self.global_step, loss.item()))\n",
"\n",
" self.h = self.horizon_backup # Restore horizon\n",
" return loss\n",
"\n",
" def validation_step(self, batch, batch_idx):\n",
"\n",
" self.h == self.horizon_backup\n",
"\n",
" if self.val_size == 0:\n",
" return np.nan\n",
"\n",
" # TODO: Hack to compute number of windows\n",
" windows = self._create_windows(batch, step='val')\n",
" n_windows = len(windows['temporal'])\n",
" y_idx = batch['y_idx']\n",
"\n",
" # Number of windows in batch\n",
" windows_batch_size = self.inference_windows_batch_size\n",
" if windows_batch_size < 0:\n",
" windows_batch_size = n_windows\n",
" n_batches = int(np.ceil(n_windows/windows_batch_size))\n",
"\n",
" valid_losses = []\n",
" batch_sizes = []\n",
" for i in range(n_batches):\n",
" # Create and normalize windows [Ws, L+H, C]\n",
" w_idxs = np.arange(i*windows_batch_size, \n",
" min((i+1)*windows_batch_size, n_windows))\n",
" windows = self._create_windows(batch, step='val', w_idxs=w_idxs)\n",
" original_outsample_y = torch.clone(windows['temporal'][:,-self.h:,0])\n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, _, outsample_mask, \\\n",
" _, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
" windows_batch = dict(insample_y=insample_y,\n",
" insample_mask=insample_mask,\n",
" futr_exog=futr_exog,\n",
" hist_exog=None,\n",
" stat_exog=stat_exog,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx) \n",
" \n",
" # Model Predictions\n",
" output_batch = self(windows_batch)\n",
" # Monte Carlo already returns y_hat with mean and quantiles\n",
" output_batch = output_batch[:,:, 1:] # Remove mean\n",
" valid_loss_batch = self.valid_loss(y=original_outsample_y, y_hat=output_batch, mask=outsample_mask)\n",
" valid_losses.append(valid_loss_batch)\n",
" batch_sizes.append(len(output_batch))\n",
"\n",
" valid_loss = torch.stack(valid_losses)\n",
" batch_sizes = torch.tensor(batch_sizes, device=valid_loss.device)\n",
" batch_size = torch.sum(batch_sizes)\n",
" valid_loss = torch.sum(valid_loss * batch_sizes) / batch_size\n",
"\n",
" if torch.isnan(valid_loss):\n",
" raise Exception('Loss is NaN, training stopped.')\n",
"\n",
" self.log(\n",
" 'valid_loss',\n",
" valid_loss.item(),\n",
" batch_size=batch_size,\n",
" prog_bar=True,\n",
" on_epoch=True,\n",
" )\n",
" self.validation_step_outputs.append(valid_loss)\n",
" return valid_loss\n",
"\n",
" def predict_step(self, batch, batch_idx):\n",
"\n",
" self.h == self.horizon_backup\n",
"\n",
" # TODO: Hack to compute number of windows\n",
" windows = self._create_windows(batch, step='predict')\n",
" n_windows = len(windows['temporal'])\n",
" y_idx = batch['y_idx']\n",
"\n",
" # Number of windows in batch\n",
" windows_batch_size = self.inference_windows_batch_size\n",
" if windows_batch_size < 0:\n",
" windows_batch_size = n_windows\n",
" n_batches = int(np.ceil(n_windows/windows_batch_size))\n",
"\n",
" y_hats = []\n",
" for i in range(n_batches):\n",
" # Create and normalize windows [Ws, L+H, C]\n",
" w_idxs = np.arange(i*windows_batch_size, \n",
" min((i+1)*windows_batch_size, n_windows))\n",
" windows = self._create_windows(batch, step='predict', w_idxs=w_idxs)\n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, _, _, _, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
" windows_batch = dict(insample_y=insample_y, # [Ws, L]\n",
" insample_mask=insample_mask, # [Ws, L]\n",
" futr_exog=futr_exog, # [Ws, L+H]\n",
" stat_exog=stat_exog,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" \n",
" # Model Predictions\n",
" y_hat = self(windows_batch)\n",
" # Monte Carlo already returns y_hat with mean and quantiles\n",
" y_hats.append(y_hat)\n",
" y_hat = torch.cat(y_hats, dim=0)\n",
" return y_hat\n",
"\n",
" def train_forward(self, windows_batch):\n",
"\n",
" # Parse windows_batch\n",
" encoder_input = windows_batch['insample_y'][:,:, None] # <- [B,T,1]\n",
" futr_exog = windows_batch['futr_exog']\n",
" stat_exog = windows_batch['stat_exog']\n",
"\n",
" #[B, input_size-1, X]\n",
" encoder_input = encoder_input[:,:-1,:] # Remove last (shift in DeepAr, cell at t outputs t+1)\n",
" _, input_size = encoder_input.shape[:2]\n",
" if self.futr_exog_size > 0:\n",
" # Shift futr_exog (t predicts t+1, last output is outside insample_y)\n",
" encoder_input = torch.cat((encoder_input, futr_exog[:,1:,:]), dim=2)\n",
" if self.stat_exog_size > 0:\n",
" stat_exog = stat_exog.unsqueeze(1).repeat(1, input_size, 1) # [B, S] -> [B, input_size-1, S]\n",
" encoder_input = torch.cat((encoder_input, stat_exog), dim=2)\n",
"\n",
" # RNN forward\n",
" hidden_state, _ = self.hist_encoder(encoder_input) # [B, input_size-1, rnn_hidden_state]\n",
"\n",
" # Decoder forward\n",
" output = self.decoder(hidden_state) # [B, input_size-1, output_size]\n",
" output = self.loss.domain_map(output)\n",
" return output\n",
" \n",
" def forward(self, windows_batch):\n",
"\n",
" # Parse windows_batch\n",
" encoder_input = windows_batch['insample_y'][:,:, None] # <- [B,L,1]\n",
" futr_exog = windows_batch['futr_exog'] # <- [B,L+H, n_f]\n",
" stat_exog = windows_batch['stat_exog']\n",
" y_idx = windows_batch['y_idx']\n",
"\n",
" #[B, seq_len, X]\n",
" batch_size, input_size = encoder_input.shape[:2]\n",
" if self.futr_exog_size > 0:\n",
" futr_exog_input_window = futr_exog[:,1:input_size+1,:] # Align y_t with futr_exog_t+1\n",
" encoder_input = torch.cat((encoder_input, futr_exog_input_window), dim=2)\n",
" if self.stat_exog_size > 0:\n",
" stat_exog_input_window = stat_exog.unsqueeze(1).repeat(1, input_size, 1) # [B, S] -> [B, input_size, S]\n",
" encoder_input = torch.cat((encoder_input, stat_exog_input_window), dim=2)\n",
"\n",
" # Use input_size history to predict first h of the forecasting window\n",
" _, h_c_tuple = self.hist_encoder(encoder_input)\n",
" h_n = h_c_tuple[0] # [n_layers, B, lstm_hidden_state]\n",
" c_n = h_c_tuple[1] # [n_layers, B, lstm_hidden_state]\n",
"\n",
" # Vectorizes trajectory samples in batch dimension [1]\n",
" h_n = torch.repeat_interleave(h_n, self.trajectory_samples, 1) # [n_layers, B*trajectory_samples, rnn_hidden_state]\n",
" c_n = torch.repeat_interleave(c_n, self.trajectory_samples, 1) # [n_layers, B*trajectory_samples, rnn_hidden_state]\n",
"\n",
" # Scales for inverse normalization\n",
" y_scale = self.scaler.x_scale[:, 0, [y_idx]].squeeze(-1).to(encoder_input.device)\n",
" y_loc = self.scaler.x_shift[:, 0, [y_idx]].squeeze(-1).to(encoder_input.device)\n",
" y_scale = torch.repeat_interleave(y_scale, self.trajectory_samples, 0)\n",
" y_loc = torch.repeat_interleave(y_loc, self.trajectory_samples, 0)\n",
"\n",
" # Recursive strategy prediction\n",
" quantiles = self.loss.quantiles.to(encoder_input.device)\n",
" y_hat = torch.zeros(batch_size, self.h, len(quantiles)+1, device=encoder_input.device)\n",
" for tau in range(self.h):\n",
" # Decoder forward\n",
" last_layer_h = h_n[-1] # [B*trajectory_samples, lstm_hidden_state]\n",
" output = self.decoder(last_layer_h) \n",
" output = self.loss.domain_map(output)\n",
"\n",
" # Inverse normalization\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" # Add horizon (1) dimension\n",
" distr_args = list(distr_args)\n",
" for i in range(len(distr_args)):\n",
" distr_args[i] = distr_args[i].unsqueeze(-1)\n",
" distr_args = tuple(distr_args)\n",
" samples_tau, _, _ = self.loss.sample(distr_args=distr_args, num_samples=1)\n",
" samples_tau = samples_tau.reshape(batch_size, self.trajectory_samples)\n",
" sample_mean = torch.mean(samples_tau, dim=-1).to(encoder_input.device)\n",
" quants = torch.quantile(input=samples_tau, \n",
" q=quantiles, dim=-1).to(encoder_input.device)\n",
" y_hat[:,tau,0] = sample_mean\n",
" y_hat[:,tau,1:] = quants.permute((1,0)) # [Q, B] -> [B, Q]\n",
" \n",
" # Stop if already in the last step (no need to predict next step)\n",
" if tau+1 == self.h:\n",
" continue\n",
" # Normalize to use as input\n",
" encoder_input = self.scaler.scaler(samples_tau.flatten(), y_loc, y_scale) # [B*n_samples]\n",
" encoder_input = encoder_input[:, None, None] # [B*n_samples, 1, 1]\n",
"\n",
" # Update input\n",
" if self.futr_exog_size > 0:\n",
" futr_exog_tau = futr_exog[:,[input_size+tau+1],:] # [B, 1, n_f]\n",
" futr_exog_tau = torch.repeat_interleave(futr_exog_tau, self.trajectory_samples, 0) # [B*n_samples, 1, n_f]\n",
" encoder_input = torch.cat((encoder_input, futr_exog_tau), dim=2) # [B*n_samples, 1, 1+n_f]\n",
" if self.stat_exog_size > 0:\n",
" stat_exog_tau = torch.repeat_interleave(stat_exog, self.trajectory_samples, 0) # [B*n_samples, n_s]\n",
" encoder_input = torch.cat((encoder_input, stat_exog_tau[:,None,:]), dim=2) # [B*n_samples, 1, 1+n_f+n_s]\n",
" \n",
" _, h_c_tuple = self.hist_encoder(encoder_input, (h_n, c_n))\n",
" h_n = h_c_tuple[0] # [n_layers, B, rnn_hidden_state]\n",
" c_n = h_c_tuple[1] # [n_layers, B, rnn_hidden_state]\n",
"\n",
" return y_hat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(DeepAR, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(DeepAR.fit, name='DeepAR.fit', title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(DeepAR.predict, name='DeepAR.predict', title_level=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.losses.pytorch import MQLoss, DistributionLoss, GMM, PMM\n",
"from neuralforecast.tsdataset import TimeSeriesDataset\n",
"from neuralforecast.utils import AirPassengers, AirPassengersPanel, AirPassengersStatic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"import pandas as pd\n",
"import pytorch_lightning as pl\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast import NeuralForecast\n",
"#from neuralforecast.models import DeepAR\n",
"from neuralforecast.losses.pytorch import DistributionLoss, HuberMQLoss\n",
"from neuralforecast.tsdataset import TimeSeriesDataset\n",
"from neuralforecast.utils import AirPassengers, AirPassengersPanel, AirPassengersStatic\n",
"\n",
"#AirPassengersPanel['y'] = AirPassengersPanel['y'] + 10\n",
"Y_train_df = AirPassengersPanel[AirPassengersPanel.ds<AirPassengersPanel['ds'].values[-12]] # 132 train\n",
"Y_test_df = AirPassengersPanel[AirPassengersPanel.ds>=AirPassengersPanel['ds'].values[-12]].reset_index(drop=True) # 12 test\n",
"\n",
"nf = NeuralForecast(\n",
" models=[DeepAR(h=12,\n",
" input_size=48,\n",
" lstm_n_layers=3,\n",
" trajectory_samples=100,\n",
" loss=DistributionLoss(distribution='Normal', level=[80, 90], return_params=False),\n",
" learning_rate=0.005,\n",
" stat_exog_list=['airline1'],\n",
" futr_exog_list=['trend'],\n",
" max_steps=100,\n",
" val_check_steps=10,\n",
" early_stop_patience_steps=-1,\n",
" scaler_type='standard',\n",
" enable_progress_bar=True),\n",
" ],\n",
" freq='M'\n",
")\n",
"nf.fit(df=Y_train_df, static_df=AirPassengersStatic, val_size=12)\n",
"Y_hat_df = nf.predict(futr_df=Y_test_df)\n",
"\n",
"# Plot quantile predictions\n",
"Y_hat_df = Y_hat_df.reset_index(drop=False).drop(columns=['unique_id','ds'])\n",
"plot_df = pd.concat([Y_test_df, Y_hat_df], axis=1)\n",
"plot_df = pd.concat([Y_train_df, plot_df])\n",
"\n",
"plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
"plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
"#plt.plot(plot_df['ds'], plot_df['DeepAR'], c='purple', label='mean')\n",
"plt.plot(plot_df['ds'], plot_df['DeepAR-median'], c='blue', label='median')\n",
"plt.fill_between(x=plot_df['ds'][-12:], \n",
" y1=plot_df['DeepAR-lo-90'][-12:].values, \n",
" y2=plot_df['DeepAR-hi-90'][-12:].values,\n",
" alpha=0.4, label='level 90')\n",
"plt.legend()\n",
"plt.grid()\n",
"plt.plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp models.dilated_rnn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dilated RNN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Dilated Recurrent Neural Network (`DilatedRNN`) addresses common challenges of modeling long sequences like vanishing gradients, computational efficiency, and improved model flexibility to model complex relationships while maintaining its parsimony. The `DilatedRNN` builds a deep stack of RNN layers using skip conditions on the temporal and the network's depth dimensions. The temporal dilated recurrent skip connections offer the capability to focus on multi-resolution inputs.The predictions are obtained by transforming the hidden states into contexts $\\mathbf{c}_{[t+1:t+H]}$, that are decoded and adapted into $\\mathbf{\\hat{y}}_{[t+1:t+H],[q]}$ through MLPs.\n",
"\n",
"\\begin{align}\n",
" \\mathbf{h}_{t} &= \\textrm{DilatedRNN}([\\mathbf{y}_{t},\\mathbf{x}^{(h)}_{t},\\mathbf{x}^{(s)}], \\mathbf{h}_{t-1})\\\\\n",
"\\mathbf{c}_{[t+1:t+H]}&=\\textrm{Linear}([\\mathbf{h}_{t}, \\mathbf{x}^{(f)}_{[:t+H]}]) \\\\ \n",
"\\hat{y}_{\\tau,[q]}&=\\textrm{MLP}([\\mathbf{c}_{\\tau},\\mathbf{x}^{(f)}_{\\tau}])\n",
"\\end{align}\n",
"\n",
"where $\\mathbf{h}_{t}$, is the hidden state for time $t$, $\\mathbf{y}_{t}$ is the input at time $t$ and $\\mathbf{h}_{t-1}$ is the hidden state of the previous layer at $t-1$, $\\mathbf{x}^{(s)}$ are static exogenous inputs, $\\mathbf{x}^{(h)}_{t}$ historic exogenous, $\\mathbf{x}^{(f)}_{[:t+H]}$ are future exogenous available at the time of the prediction.\n",
"\n",
"**References**<br>-[Shiyu Chang, et al. \"Dilated Recurrent Neural Networks\".](https://arxiv.org/abs/1710.02224)<br>-[Yao Qin, et al. \"A Dual-Stage Attention-Based recurrent neural network for time series prediction\".](https://arxiv.org/abs/1704.02971)<br>-[Kashif Rasul, et al. \"Zalando Research: PyTorch Dilated Recurrent Neural Networks\".](https://arxiv.org/abs/1710.02224)<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Figure 1. Three layer DilatedRNN with dilation 1, 2, 4.](imgs_models/dilated_rnn.png)"
]
},
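{
"cell_type": "markdown",
"metadata": {},
"source": [
"The key difference to a standard RNN is the recurrent skip: at a layer with dilation $d$, the hidden state at step $t$ is updated from the state at step $t-d$ rather than $t-1$. Below is a minimal sketch of a single dilated recurrent layer, assuming a generic `cell(x_t, h_prev) -> h_t` interface; it is illustrative only and not the (more efficient) implementation exported from this notebook.\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"def dilated_rnn_layer(cell: nn.RNNCell, x: torch.Tensor, dilation: int) -> torch.Tensor:\n",
"    # x: [L, B, input_size]; the hidden state at step t is carried over from step t - dilation\n",
"    L, B, _ = x.shape\n",
"    hidden = [torch.zeros(B, cell.hidden_size) for _ in range(dilation)]\n",
"    outputs = []\n",
"    for t in range(L):\n",
"        h_t = cell(x[t], hidden[t % dilation])  # skip connection over `dilation` steps\n",
"        hidden[t % dilation] = h_t\n",
"        outputs.append(h_t)\n",
"    return torch.stack(outputs)                 # [L, B, hidden_size]\n",
"\n",
"cell = nn.RNNCell(input_size=3, hidden_size=8)\n",
"out = dilated_rnn_layer(cell, torch.randn(20, 4, 3), dilation=4)  # deeper layers would use 1, 2, 4, ...\n",
"```"
]
},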
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from nbdev.showdoc import show_doc\n",
"from neuralforecast.utils import generate_series"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"from typing import List, Optional\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"from neuralforecast.losses.pytorch import MAE\n",
"from neuralforecast.common._base_recurrent import BaseRecurrent\n",
"from neuralforecast.common._modules import MLP"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"class LSTMCell(nn.Module):\n",
" def __init__(self, input_size, hidden_size, dropout=0.):\n",
" super(LSTMCell, self).__init__()\n",
" self.input_size = input_size\n",
" self.hidden_size = hidden_size\n",
" self.weight_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size))\n",
" self.weight_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size))\n",
" self.bias_ih = nn.Parameter(torch.randn(4 * hidden_size))\n",
" self.bias_hh = nn.Parameter(torch.randn(4 * hidden_size))\n",
" self.dropout = dropout\n",
"\n",
" def forward(self, inputs, hidden):\n",
" hx, cx = hidden[0].squeeze(0), hidden[1].squeeze(0)\n",
" gates = (torch.matmul(inputs, self.weight_ih.t()) + self.bias_ih +\n",
" torch.matmul(hx, self.weight_hh.t()) + self.bias_hh)\n",
" ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)\n",
"\n",
" ingate = torch.sigmoid(ingate)\n",
" forgetgate = torch.sigmoid(forgetgate)\n",
" cellgate = torch.tanh(cellgate)\n",
" outgate = torch.sigmoid(outgate)\n",
"\n",
" cy = (forgetgate * cx) + (ingate * cellgate)\n",
" hy = outgate * torch.tanh(cy)\n",
"\n",
" return hy, (hy, cy)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"class ResLSTMCell(nn.Module):\n",
" def __init__(self, input_size, hidden_size, dropout=0.):\n",
" super(ResLSTMCell, self).__init__()\n",
" self.register_buffer('input_size', torch.Tensor([input_size]))\n",
" self.register_buffer('hidden_size', torch.Tensor([hidden_size]))\n",
" self.weight_ii = nn.Parameter(torch.randn(3 * hidden_size, input_size))\n",
" self.weight_ic = nn.Parameter(torch.randn(3 * hidden_size, hidden_size))\n",
" self.weight_ih = nn.Parameter(torch.randn(3 * hidden_size, hidden_size))\n",
" self.bias_ii = nn.Parameter(torch.randn(3 * hidden_size))\n",
" self.bias_ic = nn.Parameter(torch.randn(3 * hidden_size))\n",
" self.bias_ih = nn.Parameter(torch.randn(3 * hidden_size))\n",
" self.weight_hh = nn.Parameter(torch.randn(1 * hidden_size, hidden_size))\n",
" self.bias_hh = nn.Parameter(torch.randn(1 * hidden_size))\n",
" self.weight_ir = nn.Parameter(torch.randn(hidden_size, input_size))\n",
" self.dropout = dropout\n",
"\n",
" def forward(self, inputs, hidden):\n",
" hx, cx = hidden[0].squeeze(0), hidden[1].squeeze(0)\n",
"\n",
" ifo_gates = (torch.matmul(inputs, self.weight_ii.t()) + self.bias_ii +\n",
" torch.matmul(hx, self.weight_ih.t()) + self.bias_ih +\n",
" torch.matmul(cx, self.weight_ic.t()) + self.bias_ic)\n",
" ingate, forgetgate, outgate = ifo_gates.chunk(3, 1)\n",
"\n",
" cellgate = torch.matmul(hx, self.weight_hh.t()) + self.bias_hh\n",
"\n",
" ingate = torch.sigmoid(ingate)\n",
" forgetgate = torch.sigmoid(forgetgate)\n",
" cellgate = torch.tanh(cellgate)\n",
" outgate = torch.sigmoid(outgate)\n",
"\n",
" cy = (forgetgate * cx) + (ingate * cellgate)\n",
" ry = torch.tanh(cy)\n",
"\n",
" if self.input_size == self.hidden_size:\n",
" hy = outgate * (ry + inputs)\n",
" else:\n",
" hy = outgate * (ry + torch.matmul(inputs, self.weight_ir.t()))\n",
" return hy, (hy, cy)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"class ResLSTMLayer(nn.Module):\n",
" def __init__(self, input_size, hidden_size, dropout=0.):\n",
" super(ResLSTMLayer, self).__init__()\n",
" self.input_size = input_size\n",
" self.hidden_size = hidden_size\n",
" self.cell = ResLSTMCell(input_size, hidden_size, dropout=0.)\n",
"\n",
" def forward(self, inputs, hidden):\n",
" inputs = inputs.unbind(0)\n",
" outputs = []\n",
" for i in range(len(inputs)):\n",
" out, hidden = self.cell(inputs[i], hidden)\n",
" outputs += [out]\n",
" outputs = torch.stack(outputs)\n",
" return outputs, hidden"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"class AttentiveLSTMLayer(nn.Module):\n",
" def __init__(self, input_size, hidden_size, dropout=0.0):\n",
" super(AttentiveLSTMLayer, self).__init__()\n",
" self.input_size = input_size\n",
" self.hidden_size = hidden_size\n",
" attention_hsize = hidden_size\n",
" self.attention_hsize = attention_hsize\n",
"\n",
" self.cell = LSTMCell(input_size, hidden_size)\n",
" self.attn_layer = nn.Sequential(nn.Linear(2 * hidden_size + input_size, attention_hsize),\n",
" nn.Tanh(),\n",
" nn.Linear(attention_hsize, 1))\n",
" self.softmax = nn.Softmax(dim=0)\n",
" self.dropout = dropout\n",
"\n",
" def forward(self, inputs, hidden):\n",
" inputs = inputs.unbind(0)\n",
" outputs = []\n",
"\n",
" for t in range(len(inputs)):\n",
" # attention on windows\n",
" hx, cx = (tensor.squeeze(0) for tensor in hidden)\n",
" hx_rep = hx.repeat(len(inputs), 1, 1)\n",
" cx_rep = cx.repeat(len(inputs), 1, 1)\n",
" x = torch.cat((inputs, hx_rep, cx_rep), dim=-1)\n",
" l = self.attn_layer(x)\n",
" beta = self.softmax(l)\n",
" context = torch.bmm(beta.permute(1, 2, 0),\n",
" inputs.permute(1, 0, 2)).squeeze(1)\n",
" out, hidden = self.cell(context, hidden)\n",
" outputs += [out]\n",
" outputs = torch.stack(outputs)\n",
" return outputs, hidden"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"class DRNN(nn.Module):\n",
"\n",
" def __init__(self, n_input, n_hidden, n_layers, dilations, dropout=0, cell_type='GRU', batch_first=True):\n",
" super(DRNN, self).__init__()\n",
"\n",
" self.dilations = dilations\n",
" self.cell_type = cell_type\n",
" self.batch_first = batch_first\n",
"\n",
" layers = []\n",
" if self.cell_type == \"GRU\":\n",
" cell = nn.GRU\n",
" elif self.cell_type == \"RNN\":\n",
" cell = nn.RNN\n",
" elif self.cell_type == \"LSTM\":\n",
" cell = nn.LSTM\n",
" elif self.cell_type == \"ResLSTM\":\n",
" cell = ResLSTMLayer\n",
" elif self.cell_type == \"AttentiveLSTM\":\n",
" cell = AttentiveLSTMLayer\n",
" else:\n",
" raise NotImplementedError\n",
"\n",
" for i in range(n_layers):\n",
" if i == 0:\n",
" c = cell(n_input, n_hidden, dropout=dropout)\n",
" else:\n",
" c = cell(n_hidden, n_hidden, dropout=dropout)\n",
" layers.append(c)\n",
" self.cells = nn.Sequential(*layers)\n",
"\n",
" def forward(self, inputs, hidden=None):\n",
" if self.batch_first:\n",
" inputs = inputs.transpose(0, 1)\n",
" outputs = []\n",
" for i, (cell, dilation) in enumerate(zip(self.cells, self.dilations)):\n",
" if hidden is None:\n",
" inputs, _ = self.drnn_layer(cell, inputs, dilation)\n",
" else:\n",
" inputs, hidden[i] = self.drnn_layer(cell, inputs, dilation, hidden[i])\n",
"\n",
" outputs.append(inputs[-dilation:])\n",
"\n",
" if self.batch_first:\n",
" inputs = inputs.transpose(0, 1)\n",
" return inputs, outputs\n",
"\n",
" def drnn_layer(self, cell, inputs, rate, hidden=None):\n",
" n_steps = len(inputs)\n",
" batch_size = inputs[0].size(0)\n",
" hidden_size = cell.hidden_size\n",
"\n",
" inputs, dilated_steps = self._pad_inputs(inputs, n_steps, rate)\n",
" dilated_inputs = self._prepare_inputs(inputs, rate)\n",
"\n",
" if hidden is None:\n",
" dilated_outputs, hidden = self._apply_cell(dilated_inputs, cell, batch_size, rate, hidden_size)\n",
" else:\n",
" hidden = self._prepare_inputs(hidden, rate)\n",
" dilated_outputs, hidden = self._apply_cell(dilated_inputs, cell, batch_size, rate, hidden_size,\n",
" hidden=hidden)\n",
"\n",
" splitted_outputs = self._split_outputs(dilated_outputs, rate)\n",
" outputs = self._unpad_outputs(splitted_outputs, n_steps)\n",
"\n",
" return outputs, hidden\n",
"\n",
" def _apply_cell(self, dilated_inputs, cell, batch_size, rate, hidden_size, hidden=None):\n",
" if hidden is None:\n",
" hidden = torch.zeros(batch_size * rate, hidden_size,\n",
" dtype=dilated_inputs.dtype,\n",
" device=dilated_inputs.device)\n",
" hidden = hidden.unsqueeze(0)\n",
" \n",
" if self.cell_type in ['LSTM', 'ResLSTM', 'AttentiveLSTM']:\n",
" hidden = (hidden, hidden)\n",
" \n",
" dilated_outputs, hidden = cell(dilated_inputs, hidden) # compatibility hack\n",
"\n",
" return dilated_outputs, hidden\n",
"\n",
" def _unpad_outputs(self, splitted_outputs, n_steps):\n",
" return splitted_outputs[:n_steps]\n",
"\n",
" def _split_outputs(self, dilated_outputs, rate):\n",
" batchsize = dilated_outputs.size(1) // rate\n",
"\n",
" blocks = [dilated_outputs[:, i * batchsize: (i + 1) * batchsize, :] for i in range(rate)]\n",
"\n",
" interleaved = torch.stack((blocks)).transpose(1, 0).contiguous()\n",
" interleaved = interleaved.view(dilated_outputs.size(0) * rate,\n",
" batchsize,\n",
" dilated_outputs.size(2))\n",
" return interleaved\n",
"\n",
" def _pad_inputs(self, inputs, n_steps, rate):\n",
" iseven = (n_steps % rate) == 0\n",
"\n",
" if not iseven:\n",
" dilated_steps = n_steps // rate + 1\n",
"\n",
" zeros_ = torch.zeros(dilated_steps * rate - inputs.size(0),\n",
" inputs.size(1),\n",
" inputs.size(2), \n",
" dtype=inputs.dtype,\n",
" device=inputs.device)\n",
" inputs = torch.cat((inputs, zeros_))\n",
" else:\n",
" dilated_steps = n_steps // rate\n",
"\n",
" return inputs, dilated_steps\n",
"\n",
" def _prepare_inputs(self, inputs, rate):\n",
" dilated_inputs = torch.cat([inputs[j::rate, :, :] for j in range(rate)], 1)\n",
" return dilated_inputs"
]
},
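{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick shape check of the `DRNN` module defined above (illustrative only, with an arbitrary toy configuration): with `batch_first=True` it consumes `[batch, time, features]` tensors and returns a `[batch, time, hidden]` tensor together with the last dilated hidden states kept for each layer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"# Sanity check with random data; the configuration below is arbitrary.\n",
"import torch\n",
"\n",
"drnn = DRNN(n_input=3, n_hidden=16, n_layers=2, dilations=[1, 2], cell_type='LSTM')\n",
"x = torch.randn(4, 20, 3)               # [batch=4, time=20, features=3]\n",
"out, last_states = drnn(x)\n",
"print(out.shape)                         # torch.Size([4, 20, 16])\n",
"print([s.shape for s in last_states])    # last `dilation` hidden states kept per layer"
]
},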
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class DilatedRNN(BaseRecurrent):\n",
" \"\"\" DilatedRNN\n",
"\n",
" **Parameters:**<br>\n",
" `h`: int, forecast horizon.<br>\n",
" `input_size`: int, maximum sequence length for truncated train backpropagation. Default -1 uses all history.<br>\n",
" `inference_input_size`: int, maximum sequence length for truncated inference. Default -1 uses all history.<br>\n",
" `cell_type`: str, type of RNN cell to use. Options: 'GRU', 'RNN', 'LSTM', 'ResLSTM', 'AttentiveLSTM'.<br>\n",
" `dilations`: int list, dilations betweem layers.<br>\n",
" `encoder_hidden_size`: int=200, units for the RNN's hidden state size.<br>\n",
" `context_size`: int=10, size of context vector for each timestamp on the forecasting window.<br>\n",
" `decoder_hidden_size`: int=200, size of hidden layer for the MLP decoder.<br>\n",
" `decoder_layers`: int=2, number of layers for the MLP decoder.<br>\n",
" `futr_exog_list`: str list, future exogenous columns.<br>\n",
" `hist_exog_list`: str list, historic exogenous columns.<br>\n",
" `stat_exog_list`: str list, static exogenous columns.<br>\n",
" `loss`: PyTorch module, instantiated train loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `valid_loss`: PyTorch module=`loss`, instantiated valid loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `max_steps`: int, maximum number of training steps.<br>\n",
" `learning_rate`: float, Learning rate between (0, 1).<br>\n",
" `num_lr_decays`: int, Number of learning rate decays, evenly distributed across max_steps.<br>\n",
" `early_stop_patience_steps`: int, Number of validation iterations before early stopping.<br>\n",
" `val_check_steps`: int, Number of training steps between every validation loss check.<br>\n",
" `batch_size`: int=32, number of different series in each batch.<br>\n",
" `valid_batch_size`: int=None, number of different series in each validation and test batch.<br>\n",
" `step_size`: int=1, step size between each window of temporal data.<br>\n",
" `scaler_type`: str='robust', type of scaler for temporal inputs normalization see [temporal scalers](https://nixtla.github.io/neuralforecast/common.scalers.html).<br>\n",
" `random_seed`: int=1, random_seed for pytorch initializer and numpy generators.<br>\n",
" `num_workers_loader`: int=os.cpu_count(), workers to be used by `TimeSeriesDataLoader`.<br>\n",
" `drop_last_loader`: bool=False, if True `TimeSeriesDataLoader` drops last non-full batch.<br>\n",
" `alias`: str, optional, Custom name of the model.<br>\n",
" `optimizer`: Subclass of 'torch.optim.Optimizer', optional, user specified optimizer instead of the default choice (Adam).<br>\n",
" `optimizer_kwargs`: dict, optional, list of parameters used by the user specified `optimizer`.<br>\n",
" `**trainer_kwargs`: int, keyword trainer arguments inherited from [PyTorch Lighning's trainer](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).<br> \n",
" \"\"\"\n",
" # Class attributes\n",
" SAMPLING_TYPE = 'recurrent'\n",
" \n",
" def __init__(self,\n",
" h: int,\n",
" input_size: int = -1,\n",
" inference_input_size: int = -1,\n",
" cell_type: str = 'LSTM',\n",
" dilations: List[List[int]] = [[1, 2], [4, 8]],\n",
" encoder_hidden_size: int = 200,\n",
" context_size: int = 10,\n",
" decoder_hidden_size: int = 200,\n",
" decoder_layers: int = 2,\n",
" futr_exog_list = None,\n",
" hist_exog_list = None,\n",
" stat_exog_list = None,\n",
" loss = MAE(),\n",
" valid_loss = None,\n",
" max_steps: int = 1000,\n",
" learning_rate: float = 1e-3,\n",
" num_lr_decays: int = 3,\n",
" early_stop_patience_steps: int =-1,\n",
" val_check_steps: int = 100,\n",
" batch_size = 32,\n",
" valid_batch_size: Optional[int] = None,\n",
" step_size: int = 1,\n",
" scaler_type: str = 'robust',\n",
" random_seed: int = 1,\n",
" num_workers_loader: int = 0,\n",
" drop_last_loader: bool = False,\n",
" optimizer = None,\n",
" optimizer_kwargs = None,\n",
" **trainer_kwargs):\n",
" super(DilatedRNN, self).__init__(\n",
" h=h,\n",
" input_size=input_size,\n",
" inference_input_size=inference_input_size,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" max_steps=max_steps,\n",
" learning_rate=learning_rate,\n",
" num_lr_decays=num_lr_decays,\n",
" early_stop_patience_steps=early_stop_patience_steps,\n",
" val_check_steps=val_check_steps,\n",
" batch_size=batch_size,\n",
" valid_batch_size=valid_batch_size,\n",
" scaler_type=scaler_type,\n",
" futr_exog_list=futr_exog_list,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" num_workers_loader=num_workers_loader,\n",
" drop_last_loader=drop_last_loader,\n",
" random_seed=random_seed,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" **trainer_kwargs\n",
" )\n",
"\n",
" # Dilated RNN\n",
" self.cell_type = cell_type\n",
" self.dilations = dilations\n",
" self.encoder_hidden_size = encoder_hidden_size\n",
" \n",
" # Context adapter\n",
" self.context_size = context_size\n",
"\n",
" # MLP decoder\n",
" self.decoder_hidden_size = decoder_hidden_size\n",
" self.decoder_layers = decoder_layers\n",
"\n",
" self.futr_exog_size = len(self.futr_exog_list)\n",
" self.hist_exog_size = len(self.hist_exog_list)\n",
" self.stat_exog_size = len(self.stat_exog_list)\n",
" \n",
" # RNN input size (1 for target variable y)\n",
" input_encoder = 1 + self.hist_exog_size + self.stat_exog_size\n",
"\n",
" # Instantiate model\n",
" layers = []\n",
" for grp_num in range(len(self.dilations)):\n",
" if grp_num == 0:\n",
" input_encoder = 1 + self.hist_exog_size + self.stat_exog_size\n",
" else:\n",
" input_encoder = self.encoder_hidden_size\n",
" layer = DRNN(input_encoder,\n",
" self.encoder_hidden_size,\n",
" n_layers=len(self.dilations[grp_num]),\n",
" dilations=self.dilations[grp_num],\n",
" cell_type=self.cell_type)\n",
" layers.append(layer)\n",
"\n",
" self.rnn_stack = nn.Sequential(*layers)\n",
"\n",
" # Context adapter\n",
" self.context_adapter = nn.Linear(in_features=self.encoder_hidden_size + self.futr_exog_size * h,\n",
" out_features=self.context_size * h)\n",
"\n",
" # Decoder MLP\n",
" self.mlp_decoder = MLP(in_features=self.context_size + self.futr_exog_size,\n",
" out_features=self.loss.outputsize_multiplier,\n",
" hidden_size=self.decoder_hidden_size,\n",
" num_layers=self.decoder_layers,\n",
" activation='ReLU',\n",
" dropout=0.0)\n",
"\n",
" def forward(self, windows_batch):\n",
" \n",
" # Parse windows_batch\n",
" encoder_input = windows_batch['insample_y'] # [B, seq_len, 1]\n",
" futr_exog = windows_batch['futr_exog']\n",
" hist_exog = windows_batch['hist_exog']\n",
" stat_exog = windows_batch['stat_exog']\n",
"\n",
" # Concatenate y, historic and static inputs\n",
" # [B, C, seq_len, 1] -> [B, seq_len, C]\n",
" # Contatenate [ Y_t, | X_{t-L},..., X_{t} | S ]\n",
" batch_size, seq_len = encoder_input.shape[:2]\n",
" if self.hist_exog_size > 0:\n",
" hist_exog = hist_exog.permute(0,2,1,3).squeeze(-1) # [B, X, seq_len, 1] -> [B, seq_len, X]\n",
" encoder_input = torch.cat((encoder_input, hist_exog), dim=2)\n",
"\n",
" if self.stat_exog_size > 0:\n",
" stat_exog = stat_exog.unsqueeze(1).repeat(1, seq_len, 1) # [B, S] -> [B, seq_len, S]\n",
" encoder_input = torch.cat((encoder_input, stat_exog), dim=2)\n",
"\n",
" # DilatedRNN forward\n",
" for layer_num in range(len(self.rnn_stack)):\n",
" residual = encoder_input\n",
" output, _ = self.rnn_stack[layer_num](encoder_input)\n",
" if layer_num > 0:\n",
" output += residual\n",
" encoder_input = output\n",
"\n",
" if self.futr_exog_size > 0:\n",
" futr_exog = futr_exog.permute(0,2,3,1)[:,:,1:,:] # [B, F, seq_len, 1+H] -> [B, seq_len, H, F]\n",
" encoder_input = torch.cat(( encoder_input, futr_exog.reshape(batch_size, seq_len, -1)), dim=2)\n",
"\n",
" # Context adapter\n",
" context = self.context_adapter(encoder_input)\n",
" context = context.reshape(batch_size, seq_len, self.h, self.context_size)\n",
"\n",
" # Residual connection with futr_exog\n",
" if self.futr_exog_size > 0:\n",
" context = torch.cat((context, futr_exog), dim=-1)\n",
"\n",
" # Final forecast\n",
" output = self.mlp_decoder(context)\n",
" output = self.loss.domain_map(output)\n",
" \n",
" return output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pytorch_lightning as pl\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.models import DilatedRNN\n",
"from neuralforecast.losses.pytorch import MQLoss, DistributionLoss\n",
"from neuralforecast.utils import AirPassengersPanel, AirPassengersStatic\n",
"from neuralforecast.tsdataset import TimeSeriesDataset, TimeSeriesLoader\n",
"\n",
"Y_train_df = AirPassengersPanel[AirPassengersPanel.ds<AirPassengersPanel['ds'].values[-12]] # 132 train\n",
"Y_test_df = AirPassengersPanel[AirPassengersPanel.ds>=AirPassengersPanel['ds'].values[-12]].reset_index(drop=True) # 12 test\n",
"\n",
"fcst = NeuralForecast(\n",
" models=[DilatedRNN(h=12,\n",
" input_size=-1,\n",
" loss=DistributionLoss(distribution='Normal', level=[80, 90]),\n",
" scaler_type='robust',\n",
" encoder_hidden_size=100,\n",
" max_steps=200,\n",
" futr_exog_list=['y_[lag12]'],\n",
" hist_exog_list=None,\n",
" stat_exog_list=['airline1'],\n",
" )\n",
" ],\n",
" freq='M'\n",
")\n",
"fcst.fit(df=Y_train_df, static_df=AirPassengersStatic)\n",
"forecasts = fcst.predict(futr_df=Y_test_df)\n",
"\n",
"Y_hat_df = forecasts.reset_index(drop=False).drop(columns=['unique_id','ds'])\n",
"plot_df = pd.concat([Y_test_df, Y_hat_df], axis=1)\n",
"plot_df = pd.concat([Y_train_df, plot_df])\n",
"\n",
"plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
"plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
"plt.plot(plot_df['ds'], plot_df['DilatedRNN-median'], c='blue', label='median')\n",
"plt.fill_between(x=plot_df['ds'][-12:], \n",
" y1=plot_df['DilatedRNN-lo-90'][-12:].values, \n",
" y2=plot_df['DilatedRNN-hi-90'][-12:].values,\n",
" alpha=0.4, label='level 90')\n",
"plt.legend()\n",
"plt.grid()\n",
"plt.plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp models.dlinear"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# DLinear"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"DLinear is a simple and fast yet accurate time series forecasting model for long-horizon forecasting.\n",
"\n",
"The architecture has the following distinctive features:\n",
"- Uses Autoformmer's trend and seasonality decomposition.\n",
"- Simple linear layers for trend and seasonality component."
]
},
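{
"cell_type": "markdown",
"metadata": {},
"source": [
"In compact form, DLinear first splits the input window into a trend and a remainder with a moving-average filter, and then maps each part to the forecast horizon with its own linear layer (a summary of the implementation below, using our own notation):\n",
"\n",
"\\begin{align}\n",
"\\mathbf{T} &= \\textrm{MovingAvg}(\\mathbf{y}_{[t-L+1:t]}), \\qquad \\mathbf{S} = \\mathbf{y}_{[t-L+1:t]} - \\mathbf{T}\\\\\n",
"\\mathbf{\\hat{y}}_{[t+1:t+H]} &= \\mathbf{W}_{trend}\\,\\mathbf{T} + \\mathbf{W}_{season}\\,\\mathbf{S}\n",
"\\end{align}"
]
},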
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**References**<br>\n",
"- [Zeng, Ailing, et al. \"Are transformers effective for time series forecasting?.\" Proceedings of the AAAI conference on artificial intelligence. Vol. 37. No. 9. 2023.\"](https://ojs.aaai.org/index.php/AAAI/article/view/26317)<br>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![Figure 1. DLinear Architecture.](imgs_models/dlinear.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"from typing import Optional\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"from neuralforecast.common._base_windows import BaseWindows\n",
"\n",
"from neuralforecast.losses.pytorch import MAE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Auxiliary Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class MovingAvg(nn.Module):\n",
" \"\"\"\n",
" Moving average block to highlight the trend of time series\n",
" \"\"\"\n",
" def __init__(self, kernel_size, stride):\n",
" super(MovingAvg, self).__init__()\n",
" self.kernel_size = kernel_size\n",
" self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=stride, padding=0)\n",
" \n",
" def forward(self, x):\n",
" # padding on the both ends of time series\n",
" front = x[:, 0:1].repeat(1, (self.kernel_size - 1) // 2)\n",
" end = x[:, -1:].repeat(1, (self.kernel_size - 1) // 2)\n",
" x = torch.cat([front, x, end], dim=1)\n",
" x = self.avg(x)\n",
" return x\n",
" \n",
"class SeriesDecomp(nn.Module):\n",
" \"\"\"\n",
" Series decomposition block\n",
" \"\"\"\n",
" def __init__(self, kernel_size):\n",
" super(SeriesDecomp, self).__init__()\n",
" self.MovingAvg = MovingAvg(kernel_size, stride=1)\n",
"\n",
" def forward(self, x):\n",
" moving_mean = self.MovingAvg(x)\n",
" res = x - moving_mean\n",
" return res, moving_mean"
]
},
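{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before wiring the decomposition into the model, a toy example (illustrative only) shows what `SeriesDecomp` returns: the detrended remainder and a smoothed trend, both with the same shape as the input window, and the two parts sum back to the input."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"# Illustrative example: decompose a trending sine wave of length 48.\n",
"import torch\n",
"\n",
"t = torch.arange(48, dtype=torch.float32)\n",
"x = (0.1 * t + torch.sin(t / 3)).unsqueeze(0)     # [batch=1, time=48]\n",
"decomp = SeriesDecomp(kernel_size=25)\n",
"seasonal, trend = decomp(x)\n",
"print(seasonal.shape, trend.shape)                # torch.Size([1, 48]) torch.Size([1, 48])\n",
"print(torch.allclose(seasonal + trend, x))        # True: the two parts sum back to the input"
]
},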
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. DLinear"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class DLinear(BaseWindows):\n",
" \"\"\" DLinear\n",
"\n",
" *Parameters:*<br>\n",
" `h`: int, forecast horizon.<br>\n",
" `input_size`: int, maximum sequence length for truncated train backpropagation. Default -1 uses all history.<br>\n",
" `futr_exog_list`: str list, future exogenous columns.<br>\n",
" `hist_exog_list`: str list, historic exogenous columns.<br>\n",
" `stat_exog_list`: str list, static exogenous columns.<br>\n",
" `exclude_insample_y`: bool=False, the model skips the autoregressive features y[t-input_size:t] if True.<br>\n",
" `moving_avg_window`: int=25, window size for trend-seasonality decomposition. Should be uneven.<br>\n",
" `loss`: PyTorch module, instantiated train loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `max_steps`: int=1000, maximum number of training steps.<br>\n",
" `learning_rate`: float=1e-3, Learning rate between (0, 1).<br>\n",
" `num_lr_decays`: int=-1, Number of learning rate decays, evenly distributed across max_steps.<br>\n",
" `early_stop_patience_steps`: int=-1, Number of validation iterations before early stopping.<br>\n",
" `val_check_steps`: int=100, Number of training steps between every validation loss check.<br>\n",
" `batch_size`: int=32, number of different series in each batch.<br>\n",
" `valid_batch_size`: int=None, number of different series in each validation and test batch, if None uses batch_size.<br>\n",
" `windows_batch_size`: int=1024, number of windows to sample in each training batch, default uses all.<br>\n",
" `inference_windows_batch_size`: int=1024, number of windows to sample in each inference batch.<br>\n",
" `start_padding_enabled`: bool=False, if True, the model will pad the time series with zeros at the beginning, by input size.<br>\n",
" `scaler_type`: str='robust', type of scaler for temporal inputs normalization see [temporal scalers](https://nixtla.github.io/neuralforecast/common.scalers.html).<br>\n",
" `random_seed`: int=1, random_seed for pytorch initializer and numpy generators.<br>\n",
" `num_workers_loader`: int=os.cpu_count(), workers to be used by `TimeSeriesDataLoader`.<br>\n",
" `drop_last_loader`: bool=False, if True `TimeSeriesDataLoader` drops last non-full batch.<br>\n",
" `alias`: str, optional, Custom name of the model.<br>\n",
" `optimizer`: Subclass of 'torch.optim.Optimizer', optional, user specified optimizer instead of the default choice (Adam).<br>\n",
" `optimizer_kwargs`: dict, optional, list of parameters used by the user specified `optimizer`.<br>\n",
" `**trainer_kwargs`: int, keyword trainer arguments inherited from [PyTorch Lighning's trainer](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).<br>\n",
"\n",
"\t*References*<br>\n",
"\t- Zeng, Ailing, et al. \"Are transformers effective for time series forecasting?.\" Proceedings of the AAAI conference on artificial intelligence. Vol. 37. No. 9. 2023.\"\n",
" \"\"\"\n",
" # Class attributes\n",
" SAMPLING_TYPE = 'windows'\n",
"\n",
" def __init__(self,\n",
" h: int, \n",
" input_size: int,\n",
" stat_exog_list = None,\n",
" hist_exog_list = None,\n",
" futr_exog_list = None,\n",
" exclude_insample_y = False,\n",
" moving_avg_window: int = 25,\n",
" loss = MAE(),\n",
" valid_loss = None,\n",
" max_steps: int = 5000,\n",
" learning_rate: float = 1e-4,\n",
" num_lr_decays: int = -1,\n",
" early_stop_patience_steps: int =-1,\n",
" val_check_steps: int = 100,\n",
" batch_size: int = 32,\n",
" valid_batch_size: Optional[int] = None,\n",
" windows_batch_size = 1024,\n",
" inference_windows_batch_size = 1024,\n",
" start_padding_enabled = False,\n",
" step_size: int = 1,\n",
" scaler_type: str = 'identity',\n",
" random_seed: int = 1,\n",
" num_workers_loader: int = 0,\n",
" drop_last_loader: bool = False,\n",
" optimizer = None,\n",
" optimizer_kwargs = None,\n",
" **trainer_kwargs):\n",
" super(DLinear, self).__init__(h=h,\n",
" input_size=input_size,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" futr_exog_list = futr_exog_list,\n",
" exclude_insample_y = exclude_insample_y,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" max_steps=max_steps,\n",
" learning_rate=learning_rate,\n",
" num_lr_decays=num_lr_decays,\n",
" early_stop_patience_steps=early_stop_patience_steps,\n",
" val_check_steps=val_check_steps,\n",
" batch_size=batch_size,\n",
" windows_batch_size=windows_batch_size,\n",
" valid_batch_size=valid_batch_size,\n",
" inference_windows_batch_size=inference_windows_batch_size,\n",
" start_padding_enabled = start_padding_enabled,\n",
" step_size=step_size,\n",
" scaler_type=scaler_type,\n",
" num_workers_loader=num_workers_loader,\n",
" drop_last_loader=drop_last_loader,\n",
" random_seed=random_seed,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" **trainer_kwargs)\n",
" \n",
" # Architecture\n",
" self.futr_input_size = len(self.futr_exog_list)\n",
" self.hist_input_size = len(self.hist_exog_list)\n",
" self.stat_input_size = len(self.stat_exog_list)\n",
"\n",
" if self.stat_input_size > 0:\n",
" raise Exception('DLinear does not support static variables yet')\n",
" \n",
" if self.hist_input_size > 0:\n",
" raise Exception('DLinear does not support historical variables yet')\n",
" \n",
" if self.futr_input_size > 0:\n",
" raise Exception('DLinear does not support future variables yet')\n",
"\n",
" if moving_avg_window % 2 == 0:\n",
" raise Exception('moving_avg_window should be uneven')\n",
"\n",
" self.c_out = self.loss.outputsize_multiplier\n",
" self.output_attention = False\n",
" self.enc_in = 1 \n",
" self.dec_in = 1\n",
"\n",
" # Decomposition\n",
" self.decomp = SeriesDecomp(moving_avg_window)\n",
"\n",
" self.linear_trend = nn.Linear(self.input_size, self.loss.outputsize_multiplier * h, bias=True)\n",
" self.linear_season = nn.Linear(self.input_size, self.loss.outputsize_multiplier * h, bias=True)\n",
"\n",
" def forward(self, windows_batch):\n",
" # Parse windows_batch\n",
" insample_y = windows_batch['insample_y']\n",
" #insample_mask = windows_batch['insample_mask']\n",
" #hist_exog = windows_batch['hist_exog']\n",
" #stat_exog = windows_batch['stat_exog']\n",
" #futr_exog = windows_batch['futr_exog']\n",
"\n",
" # Parse inputs\n",
" batch_size = len(insample_y)\n",
" seasonal_init, trend_init = self.decomp(insample_y)\n",
"\n",
" trend_part = self.linear_trend(trend_init)\n",
" seasonal_part = self.linear_season(seasonal_init)\n",
" \n",
" # Final\n",
" forecast = trend_part + seasonal_part\n",
" forecast = forecast.reshape(batch_size, self.h, self.loss.outputsize_multiplier)\n",
" forecast = self.loss.domain_map(forecast)\n",
" return forecast"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(DLinear)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(DLinear.fit, name='DLinear.fit')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(DLinear.predict, name='DLinear.predict')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pytorch_lightning as pl\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.models import MLP\n",
"from neuralforecast.losses.pytorch import MQLoss, DistributionLoss\n",
"from neuralforecast.tsdataset import TimeSeriesDataset\n",
"from neuralforecast.utils import AirPassengers, AirPassengersPanel, AirPassengersStatic, augment_calendar_df\n",
"\n",
"AirPassengersPanel, calendar_cols = augment_calendar_df(df=AirPassengersPanel, freq='M')\n",
"\n",
"Y_train_df = AirPassengersPanel[AirPassengersPanel.ds<AirPassengersPanel['ds'].values[-12]] # 132 train\n",
"Y_test_df = AirPassengersPanel[AirPassengersPanel.ds>=AirPassengersPanel['ds'].values[-12]].reset_index(drop=True) # 12 test\n",
"\n",
"model = DLinear(h=12,\n",
" input_size=24,\n",
" loss=MAE(),\n",
" #loss=DistributionLoss(distribution='StudentT', level=[80, 90], return_params=True),\n",
" scaler_type='robust',\n",
" learning_rate=1e-3,\n",
" max_steps=500,\n",
" val_check_steps=50,\n",
" early_stop_patience_steps=2)\n",
"\n",
"nf = NeuralForecast(\n",
" models=[model],\n",
" freq='M'\n",
")\n",
"nf.fit(df=Y_train_df, static_df=AirPassengersStatic, val_size=12)\n",
"forecasts = nf.predict(futr_df=Y_test_df)\n",
"\n",
"Y_hat_df = forecasts.reset_index(drop=False).drop(columns=['unique_id','ds'])\n",
"plot_df = pd.concat([Y_test_df, Y_hat_df], axis=1)\n",
"plot_df = pd.concat([Y_train_df, plot_df])\n",
"\n",
"if model.loss.is_distribution_output:\n",
" plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
" plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
" plt.plot(plot_df['ds'], plot_df['DLinear-median'], c='blue', label='median')\n",
" plt.fill_between(x=plot_df['ds'][-12:], \n",
" y1=plot_df['DLinear-lo-90'][-12:].values, \n",
" y2=plot_df['DLinear-hi-90'][-12:].values,\n",
" alpha=0.4, label='level 90')\n",
" plt.grid()\n",
" plt.legend()\n",
" plt.plot()\n",
"else:\n",
" plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
" plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
" plt.plot(plot_df['ds'], plot_df['DLinear'], c='blue', label='Forecast')\n",
" plt.legend()\n",
" plt.grid()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp models.fedformer"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# FEDformer"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The FEDformer model tackles the challenge of finding reliable dependencies on intricate temporal patterns of long-horizon forecasting.\n",
"\n",
"The architecture has the following distinctive features:\n",
"- In-built progressive decomposition in trend and seasonal components based on a moving average filter.\n",
"- Frequency Enhanced Block and Frequency Enhanced Attention to perform attention in the sparse representation on basis such as Fourier transform.\n",
"- Classic encoder-decoder proposed by Vaswani et al. (2017) with a multi-head attention mechanism.\n",
"\n",
"The FEDformer model utilizes a three-component approach to define its embedding:\n",
"- It employs encoded autoregressive features obtained from a convolution network.\n",
"- Absolute positional embeddings obtained from calendar features are utilized."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**References**<br>\n",
"- [Zhou, Tian, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin.. \"FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting\"](https://proceedings.mlr.press/v162/zhou22g.html)<br>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![Figure 1. FEDformer Architecture.](imgs_models/fedformer.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import numpy as np\n",
"from typing import Optional\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"\n",
"from neuralforecast.common._modules import DataEmbedding\n",
"from neuralforecast.common._base_windows import BaseWindows\n",
"\n",
"from neuralforecast.losses.pytorch import MAE"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Auxiliary functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class MovingAvg(nn.Module):\n",
" \"\"\"\n",
" Moving average block to highlight the trend of time series\n",
" \"\"\"\n",
" def __init__(self, kernel_size, stride):\n",
" super(MovingAvg, self).__init__()\n",
" self.kernel_size = kernel_size\n",
" self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=stride, padding=0)\n",
"\n",
" def forward(self, x):\n",
" # padding on the both ends of time series\n",
" front = x[:, 0:1, :].repeat(1, (self.kernel_size - 1) // 2, 1)\n",
" end = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)\n",
" x = torch.cat([front, x, end], dim=1)\n",
" x = self.avg(x.permute(0, 2, 1))\n",
" x = x.permute(0, 2, 1)\n",
" return x\n",
"\n",
"class SeriesDecomp(nn.Module):\n",
" \"\"\"\n",
" Series decomposition block\n",
" \"\"\"\n",
" def __init__(self, kernel_size):\n",
" super(SeriesDecomp, self).__init__()\n",
" self.MovingAvg = MovingAvg(kernel_size, stride=1)\n",
"\n",
" def forward(self, x):\n",
" moving_mean = self.MovingAvg(x)\n",
" res = x - moving_mean\n",
" return res, moving_mean\n",
" \n",
"class LayerNorm(nn.Module):\n",
" \"\"\"\n",
" Special designed layernorm for the seasonal part\n",
" \"\"\"\n",
" def __init__(self, channels):\n",
" super(LayerNorm, self).__init__()\n",
" self.layernorm = nn.LayerNorm(channels)\n",
"\n",
" def forward(self, x):\n",
" x_hat = self.layernorm(x)\n",
" bias = torch.mean(x_hat, dim=1).unsqueeze(1).repeat(1, x.shape[1], 1)\n",
" return x_hat - bias\n",
"\n",
"\n",
"class AutoCorrelationLayer(nn.Module):\n",
" def __init__(self, correlation, hidden_size, n_head, d_keys=None,\n",
" d_values=None):\n",
" super(AutoCorrelationLayer, self).__init__()\n",
"\n",
" d_keys = d_keys or (hidden_size // n_head)\n",
" d_values = d_values or (hidden_size // n_head)\n",
"\n",
" self.inner_correlation = correlation\n",
" self.query_projection = nn.Linear(hidden_size, d_keys * n_head)\n",
" self.key_projection = nn.Linear(hidden_size, d_keys * n_head)\n",
" self.value_projection = nn.Linear(hidden_size, d_values * n_head)\n",
" self.out_projection = nn.Linear(d_values * n_head, hidden_size)\n",
" self.n_head = n_head\n",
"\n",
" def forward(self, queries, keys, values, attn_mask):\n",
" B, L, _ = queries.shape\n",
" _, S, _ = keys.shape\n",
" H = self.n_head\n",
"\n",
" queries = self.query_projection(queries).view(B, L, H, -1)\n",
" keys = self.key_projection(keys).view(B, S, H, -1)\n",
" values = self.value_projection(values).view(B, S, H, -1)\n",
"\n",
" out, attn = self.inner_correlation(\n",
" queries,\n",
" keys,\n",
" values,\n",
" attn_mask\n",
" )\n",
" out = out.view(B, L, -1)\n",
"\n",
" return self.out_projection(out), attn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class EncoderLayer(nn.Module):\n",
" \"\"\"\n",
" FEDformer encoder layer with the progressive decomposition architecture\n",
" \"\"\"\n",
" def __init__(self, attention, hidden_size, conv_hidden_size=None, MovingAvg=25, dropout=0.1, activation=\"relu\"):\n",
" super(EncoderLayer, self).__init__()\n",
" conv_hidden_size = conv_hidden_size or 4 * hidden_size\n",
" self.attention = attention\n",
" self.conv1 = nn.Conv1d(in_channels=hidden_size, out_channels=conv_hidden_size, kernel_size=1, bias=False)\n",
" self.conv2 = nn.Conv1d(in_channels=conv_hidden_size, out_channels=hidden_size, kernel_size=1, bias=False)\n",
" self.decomp1 = SeriesDecomp(MovingAvg)\n",
" self.decomp2 = SeriesDecomp(MovingAvg)\n",
" self.dropout = nn.Dropout(dropout)\n",
" self.activation = F.relu if activation == \"relu\" else F.gelu\n",
"\n",
" def forward(self, x, attn_mask=None):\n",
" new_x, attn = self.attention(\n",
" x, x, x,\n",
" attn_mask=attn_mask\n",
" )\n",
" x = x + self.dropout(new_x)\n",
" x, _ = self.decomp1(x)\n",
" y = x\n",
" y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))\n",
" y = self.dropout(self.conv2(y).transpose(-1, 1))\n",
" res, _ = self.decomp2(x + y)\n",
" return res, attn\n",
"\n",
"\n",
"class Encoder(nn.Module):\n",
" \"\"\"\n",
" FEDformer encoder\n",
" \"\"\"\n",
" def __init__(self, attn_layers, conv_layers=None, norm_layer=None):\n",
" super(Encoder, self).__init__()\n",
" self.attn_layers = nn.ModuleList(attn_layers)\n",
" self.conv_layers = nn.ModuleList(conv_layers) if conv_layers is not None else None\n",
" self.norm = norm_layer\n",
"\n",
" def forward(self, x, attn_mask=None):\n",
" attns = []\n",
" if self.conv_layers is not None:\n",
" for attn_layer, conv_layer in zip(self.attn_layers, self.conv_layers):\n",
" x, attn = attn_layer(x, attn_mask=attn_mask)\n",
" x = conv_layer(x)\n",
" attns.append(attn)\n",
" x, attn = self.attn_layers[-1](x)\n",
" attns.append(attn)\n",
" else:\n",
" for attn_layer in self.attn_layers:\n",
" x, attn = attn_layer(x, attn_mask=attn_mask)\n",
" attns.append(attn)\n",
"\n",
" if self.norm is not None:\n",
" x = self.norm(x)\n",
"\n",
" return x, attns\n",
"\n",
"\n",
"class DecoderLayer(nn.Module):\n",
" \"\"\"\n",
" FEDformer decoder layer with the progressive decomposition architecture\n",
" \"\"\"\n",
" def __init__(self, self_attention, cross_attention, hidden_size, c_out, conv_hidden_size=None,\n",
" MovingAvg=25, dropout=0.1, activation=\"relu\"):\n",
" super(DecoderLayer, self).__init__()\n",
" conv_hidden_size = conv_hidden_size or 4 * hidden_size\n",
" self.self_attention = self_attention\n",
" self.cross_attention = cross_attention\n",
" self.conv1 = nn.Conv1d(in_channels=hidden_size, out_channels=conv_hidden_size, kernel_size=1, bias=False)\n",
" self.conv2 = nn.Conv1d(in_channels=conv_hidden_size, out_channels=hidden_size, kernel_size=1, bias=False)\n",
" self.decomp1 = SeriesDecomp(MovingAvg)\n",
" self.decomp2 = SeriesDecomp(MovingAvg)\n",
" self.decomp3 = SeriesDecomp(MovingAvg)\n",
" self.dropout = nn.Dropout(dropout)\n",
" self.projection = nn.Conv1d(in_channels=hidden_size, out_channels=c_out, kernel_size=3, stride=1, padding=1,\n",
" padding_mode='circular', bias=False)\n",
" self.activation = F.relu if activation == \"relu\" else F.gelu\n",
"\n",
" def forward(self, x, cross, x_mask=None, cross_mask=None):\n",
" x = x + self.dropout(self.self_attention(\n",
" x, x, x,\n",
" attn_mask=x_mask\n",
" )[0])\n",
" x, trend1 = self.decomp1(x)\n",
" x = x + self.dropout(self.cross_attention(\n",
" x, cross, cross,\n",
" attn_mask=cross_mask\n",
" )[0])\n",
" x, trend2 = self.decomp2(x)\n",
" y = x\n",
" y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))\n",
" y = self.dropout(self.conv2(y).transpose(-1, 1))\n",
" x, trend3 = self.decomp3(x + y)\n",
"\n",
" residual_trend = trend1 + trend2 + trend3\n",
" residual_trend = self.projection(residual_trend.permute(0, 2, 1)).transpose(1, 2)\n",
" return x, residual_trend\n",
"\n",
"\n",
"class Decoder(nn.Module):\n",
" \"\"\"\n",
" FEDformer decoder\n",
" \"\"\"\n",
" def __init__(self, layers, norm_layer=None, projection=None):\n",
" super(Decoder, self).__init__()\n",
" self.layers = nn.ModuleList(layers)\n",
" self.norm = norm_layer\n",
" self.projection = projection\n",
"\n",
" def forward(self, x, cross, x_mask=None, cross_mask=None, trend=None):\n",
" for layer in self.layers:\n",
" x, residual_trend = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)\n",
" trend = trend + residual_trend\n",
"\n",
" if self.norm is not None:\n",
" x = self.norm(x)\n",
"\n",
" if self.projection is not None:\n",
" x = self.projection(x)\n",
" return x, trend"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def get_frequency_modes(seq_len, modes=64, mode_select_method='random'):\n",
" \"\"\"\n",
" Get modes on frequency domain:\n",
" 'random' for sampling randomly\n",
" 'else' for sampling the lowest modes;\n",
" \"\"\"\n",
" modes = min(modes, seq_len//2)\n",
" if mode_select_method == 'random':\n",
" index = list(range(0, seq_len // 2))\n",
" np.random.shuffle(index)\n",
" index = index[:modes]\n",
" else:\n",
" index = list(range(0, modes))\n",
" index.sort()\n",
" return index\n",
"\n",
"\n",
"class FourierBlock(nn.Module):\n",
" def __init__(self, in_channels, out_channels, seq_len, modes=0, mode_select_method='random'):\n",
" super(FourierBlock, self).__init__()\n",
" \"\"\"\n",
" Fourier block\n",
" \"\"\"\n",
" # get modes on frequency domain\n",
" self.index = get_frequency_modes(seq_len, modes=modes, mode_select_method=mode_select_method)\n",
"\n",
" self.scale = (1 / (in_channels * out_channels))\n",
" self.weights1 = nn.Parameter(\n",
" self.scale * torch.rand(8, in_channels // 8, out_channels // 8, len(self.index), dtype=torch.cfloat))\n",
"\n",
" # Complex multiplication\n",
" def compl_mul1d(self, input, weights):\n",
" # (batch, in_channel, x ), (in_channel, out_channel, x) -> (batch, out_channel, x)\n",
" return torch.einsum(\"bhi,hio->bho\", input, weights)\n",
"\n",
" def forward(self, q, k, v, mask):\n",
" # size = [B, L, H, E]\n",
" B, L, H, E = q.shape\n",
" \n",
" x = q.permute(0, 2, 3, 1)\n",
" # Compute Fourier coefficients\n",
" x_ft = torch.fft.rfft(x, dim=-1)\n",
" # Perform Fourier neural operations\n",
" out_ft = torch.zeros(B, H, E, L // 2 + 1, device=x.device, dtype=torch.cfloat)\n",
" for wi, i in enumerate(self.index):\n",
" out_ft[:, :, :, wi] = self.compl_mul1d(x_ft[:, :, :, i], self.weights1[:, :, :, wi])\n",
" # Return to time domain\n",
" x = torch.fft.irfft(out_ft, n=x.size(-1))\n",
" return (x, None)\n",
"\n",
"class FourierCrossAttention(nn.Module):\n",
" def __init__(self, in_channels, out_channels, seq_len_q, seq_len_kv, modes=64, mode_select_method='random',\n",
" activation='tanh', policy=0):\n",
" super(FourierCrossAttention, self).__init__()\n",
" \"\"\"\n",
" Fourier Cross Attention layer\n",
" \"\"\"\n",
" self.activation = activation\n",
" self.in_channels = in_channels\n",
" self.out_channels = out_channels\n",
" # get modes for queries and keys (& values) on frequency domain\n",
" self.index_q = get_frequency_modes(seq_len_q, modes=modes, mode_select_method=mode_select_method)\n",
" self.index_kv = get_frequency_modes(seq_len_kv, modes=modes, mode_select_method=mode_select_method)\n",
"\n",
" self.scale = (1 / (in_channels * out_channels))\n",
" self.weights1 = nn.Parameter(\n",
" self.scale * torch.rand(8, in_channels // 8, out_channels // 8, len(self.index_q), dtype=torch.cfloat))\n",
"\n",
" # Complex multiplication\n",
" def compl_mul1d(self, input, weights):\n",
" # (batch, in_channel, x ), (in_channel, out_channel, x) -> (batch, out_channel, x)\n",
" return torch.einsum(\"bhi,hio->bho\", input, weights)\n",
"\n",
" def forward(self, q, k, v, mask):\n",
" # size = [B, L, H, E]\n",
" B, L, H, E = q.shape\n",
" xq = q.permute(0, 2, 3, 1) # size = [B, H, E, L]\n",
" xk = k.permute(0, 2, 3, 1)\n",
" #xv = v.permute(0, 2, 3, 1)\n",
"\n",
" # Compute Fourier coefficients\n",
" xq_ft_ = torch.zeros(B, H, E, len(self.index_q), device=xq.device, dtype=torch.cfloat)\n",
" xq_ft = torch.fft.rfft(xq, dim=-1)\n",
" for i, j in enumerate(self.index_q):\n",
" xq_ft_[:, :, :, i] = xq_ft[:, :, :, j]\n",
" xk_ft_ = torch.zeros(B, H, E, len(self.index_kv), device=xq.device, dtype=torch.cfloat)\n",
" xk_ft = torch.fft.rfft(xk, dim=-1)\n",
" for i, j in enumerate(self.index_kv):\n",
" xk_ft_[:, :, :, i] = xk_ft[:, :, :, j]\n",
"\n",
" # Attention mechanism on frequency domain\n",
" xqk_ft = (torch.einsum(\"bhex,bhey->bhxy\", xq_ft_, xk_ft_))\n",
" if self.activation == 'tanh':\n",
" xqk_ft = xqk_ft.tanh()\n",
" elif self.activation == 'softmax':\n",
" xqk_ft = torch.softmax(abs(xqk_ft), dim=-1)\n",
" xqk_ft = torch.complex(xqk_ft, torch.zeros_like(xqk_ft))\n",
" else:\n",
" raise Exception('{} actiation function is not implemented'.format(self.activation))\n",
" xqkv_ft = torch.einsum(\"bhxy,bhey->bhex\", xqk_ft, xk_ft_)\n",
" xqkvw = torch.einsum(\"bhex,heox->bhox\", xqkv_ft, self.weights1)\n",
" out_ft = torch.zeros(B, H, E, L // 2 + 1, device=xq.device, dtype=torch.cfloat)\n",
" for i, j in enumerate(self.index_q):\n",
" out_ft[:, :, :, j] = xqkvw[:, :, :, i]\n",
" \n",
" # Return to time domain\n",
" out = torch.fft.irfft(out_ft / self.in_channels / self.out_channels, n=xq.size(-1))\n",
" return (out, None)"
]
},
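{
"cell_type": "markdown",
"metadata": {},
"source": [
"The frequency-enhanced blocks operate on a small subset of Fourier modes rather than the full spectrum. The snippet below (illustrative only) calls `get_frequency_modes` directly to show which frequency indices would be retained for a short encoder window."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"# Illustrative: frequency indices kept by a Fourier block for a length-24 window.\n",
"random_modes = get_frequency_modes(seq_len=24, modes=4, mode_select_method='random')\n",
"lowest_modes = get_frequency_modes(seq_len=24, modes=4, mode_select_method='lowest')\n",
"print(random_modes)   # e.g. [1, 5, 8, 11]: 4 random indices out of the 12 available\n",
"print(lowest_modes)   # [0, 1, 2, 3]: any non-'random' method keeps the lowest modes"
]
},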
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class FEDformer(BaseWindows):\n",
" \"\"\" FEDformer\n",
"\n",
" The FEDformer model tackles the challenge of finding reliable dependencies on intricate temporal patterns of long-horizon forecasting.\n",
"\n",
" The architecture has the following distinctive features:\n",
" - In-built progressive decomposition in trend and seasonal components based on a moving average filter.\n",
" - Frequency Enhanced Block and Frequency Enhanced Attention to perform attention in the sparse representation on basis such as Fourier transform.\n",
" - Classic encoder-decoder proposed by Vaswani et al. (2017) with a multi-head attention mechanism.\n",
"\n",
" The FEDformer model utilizes a three-component approach to define its embedding:\n",
" - It employs encoded autoregressive features obtained from a convolution network.\n",
" - Absolute positional embeddings obtained from calendar features are utilized.\n",
"\n",
" *Parameters:*<br>\n",
" `h`: int, forecast horizon.<br>\n",
" `input_size`: int, maximum sequence length for truncated train backpropagation. Default -1 uses all history.<br>\n",
" `futr_exog_list`: str list, future exogenous columns.<br>\n",
" `hist_exog_list`: str list, historic exogenous columns.<br>\n",
" `stat_exog_list`: str list, static exogenous columns.<br>\n",
"\t`decoder_input_size_multiplier`: float = 0.5, .<br>\n",
" `version`: str = 'Fourier', version of the model.<br>\n",
" `modes`: int = 64, number of modes for the Fourier block.<br>\n",
" `mode_select`: str = 'random', method to select the modes for the Fourier block.<br>\n",
" `hidden_size`: int=128, units of embeddings and encoders.<br>\n",
" `dropout`: float (0, 1), dropout throughout Autoformer architecture.<br>\n",
" `n_head`: int=8, controls number of multi-head's attention.<br>\n",
"\t`conv_hidden_size`: int=32, channels of the convolutional encoder.<br>\n",
"\t`activation`: str=`GELU`, activation from ['ReLU', 'Softplus', 'Tanh', 'SELU', 'LeakyReLU', 'PReLU', 'Sigmoid', 'GELU'].<br>\n",
" `encoder_layers`: int=2, number of layers for the TCN encoder.<br>\n",
" `decoder_layers`: int=1, number of layers for the MLP decoder.<br>\n",
" `MovingAvg_window`: int=25, window size for the moving average filter.<br>\n",
" `loss`: PyTorch module, instantiated train loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `valid_loss`: PyTorch module, instantiated validation loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `max_steps`: int=1000, maximum number of training steps.<br>\n",
" `learning_rate`: float=1e-3, Learning rate between (0, 1).<br>\n",
" `num_lr_decays`: int=-1, Number of learning rate decays, evenly distributed across max_steps.<br>\n",
" `early_stop_patience_steps`: int=-1, Number of validation iterations before early stopping.<br>\n",
" `val_check_steps`: int=100, Number of training steps between every validation loss check.<br>\n",
" `batch_size`: int=32, number of different series in each batch.<br>\n",
" `valid_batch_size`: int=None, number of different series in each validation and test batch, if None uses batch_size.<br>\n",
" `windows_batch_size`: int=1024, number of windows to sample in each training batch, default uses all.<br>\n",
" `inference_windows_batch_size`: int=1024, number of windows to sample in each inference batch.<br>\n",
" `start_padding_enabled`: bool=False, if True, the model will pad the time series with zeros at the beginning, by input size.<br>\n",
" `scaler_type`: str='robust', type of scaler for temporal inputs normalization see [temporal scalers](https://nixtla.github.io/neuralforecast/common.scalers.html).<br>\n",
" `random_seed`: int=1, random_seed for pytorch initializer and numpy generators.<br>\n",
" `num_workers_loader`: int=os.cpu_count(), workers to be used by `TimeSeriesDataLoader`.<br>\n",
" `drop_last_loader`: bool=False, if True `TimeSeriesDataLoader` drops last non-full batch.<br>\n",
" `alias`: str, optional, Custom name of the model.<br>\n",
" `optimizer`: Subclass of 'torch.optim.Optimizer', optional, user specified optimizer instead of the default choice (Adam).<br>\n",
" `optimizer_kwargs`: dict, optional, list of parameters used by the user specified `optimizer`.<br>\n",
" `**trainer_kwargs`: int, keyword trainer arguments inherited from [PyTorch Lighning's trainer](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).<br>\n",
"\n",
" \"\"\"\n",
" # Class attributes\n",
" SAMPLING_TYPE = 'windows'\n",
"\n",
" def __init__(self,\n",
" h: int, \n",
" input_size: int,\n",
" stat_exog_list = None,\n",
" hist_exog_list = None,\n",
" futr_exog_list = None,\n",
" decoder_input_size_multiplier: float = 0.5,\n",
" version: str = 'Fourier',\n",
" modes: int = 64,\n",
" mode_select: str = 'random',\n",
" hidden_size: int = 128, \n",
" dropout: float = 0.05,\n",
" n_head: int = 8,\n",
" conv_hidden_size: int = 32,\n",
" activation: str = 'gelu',\n",
" encoder_layers: int = 2, \n",
" decoder_layers: int = 1,\n",
" MovingAvg_window: int = 25,\n",
" loss = MAE(),\n",
" valid_loss = None,\n",
" max_steps: int = 5000,\n",
" learning_rate: float = 1e-4,\n",
" num_lr_decays: int = -1,\n",
" early_stop_patience_steps: int =-1,\n",
" start_padding_enabled = False,\n",
" val_check_steps: int = 100,\n",
" batch_size: int = 32,\n",
" valid_batch_size: Optional[int] = None,\n",
" windows_batch_size = 1024,\n",
" inference_windows_batch_size = 1024,\n",
" step_size: int = 1,\n",
" scaler_type: str = 'identity',\n",
" random_seed: int = 1,\n",
" num_workers_loader: int = 0,\n",
" drop_last_loader: bool = False,\n",
" optimizer=None,\n",
" optimizer_kwargs=None,\n",
" **trainer_kwargs):\n",
" super(FEDformer, self).__init__(h=h,\n",
" input_size=input_size,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" futr_exog_list = futr_exog_list,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" max_steps=max_steps,\n",
" learning_rate=learning_rate,\n",
" num_lr_decays=num_lr_decays,\n",
" early_stop_patience_steps=early_stop_patience_steps,\n",
" val_check_steps=val_check_steps,\n",
" batch_size=batch_size,\n",
" windows_batch_size=windows_batch_size,\n",
" valid_batch_size=valid_batch_size,\n",
" inference_windows_batch_size=inference_windows_batch_size,\n",
" start_padding_enabled=start_padding_enabled,\n",
" step_size=step_size,\n",
" scaler_type=scaler_type,\n",
" num_workers_loader=num_workers_loader,\n",
" drop_last_loader=drop_last_loader,\n",
" random_seed=random_seed,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" **trainer_kwargs)\n",
" # Architecture\n",
" self.futr_input_size = len(self.futr_exog_list)\n",
" self.hist_input_size = len(self.hist_exog_list)\n",
" self.stat_input_size = len(self.stat_exog_list)\n",
"\n",
" if self.stat_input_size > 0:\n",
" raise Exception('Autoformer does not support static variables yet')\n",
" \n",
" if self.hist_input_size > 0:\n",
" raise Exception('Autoformer does not support historical variables yet')\n",
"\n",
" self.label_len = int(np.ceil(input_size * decoder_input_size_multiplier))\n",
" if (self.label_len >= input_size) or (self.label_len <= 0):\n",
" raise Exception(f'Check decoder_input_size_multiplier={decoder_input_size_multiplier}, range (0,1)')\n",
"\n",
" if activation not in ['relu', 'gelu']:\n",
" raise Exception(f'Check activation={activation}')\n",
" \n",
" if n_head != 8:\n",
" raise Exception('n_head must be 8')\n",
" \n",
" if version not in ['Fourier']:\n",
" raise Exception('Only Fourier version is supported currently.')\n",
"\n",
" self.c_out = self.loss.outputsize_multiplier\n",
" self.output_attention = False\n",
" self.enc_in = 1 \n",
" self.dec_in = 1\n",
" \n",
" self.decomp = SeriesDecomp(MovingAvg_window)\n",
"\n",
" # Embedding\n",
" self.enc_embedding = DataEmbedding(c_in=self.enc_in,\n",
" exog_input_size=self.hist_input_size,\n",
" hidden_size=hidden_size, \n",
" pos_embedding=False,\n",
" dropout=dropout)\n",
" self.dec_embedding = DataEmbedding(self.dec_in,\n",
" exog_input_size=self.hist_input_size,\n",
" hidden_size=hidden_size, \n",
" pos_embedding=False,\n",
" dropout=dropout)\n",
"\n",
" encoder_self_att = FourierBlock(in_channels=hidden_size,\n",
" out_channels=hidden_size,\n",
" seq_len=input_size,\n",
" modes=modes,\n",
" mode_select_method=mode_select)\n",
" decoder_self_att = FourierBlock(in_channels=hidden_size,\n",
" out_channels=hidden_size,\n",
" seq_len=input_size//2+self.h,\n",
" modes=modes,\n",
" mode_select_method=mode_select)\n",
" decoder_cross_att = FourierCrossAttention(in_channels=hidden_size,\n",
" out_channels=hidden_size,\n",
" seq_len_q=input_size//2+self.h,\n",
" seq_len_kv=input_size,\n",
" modes=modes,\n",
" mode_select_method=mode_select)\n",
"\n",
" self.encoder = Encoder(\n",
" [\n",
" EncoderLayer(\n",
" AutoCorrelationLayer(\n",
" encoder_self_att,\n",
" hidden_size, n_head),\n",
"\n",
" hidden_size=hidden_size,\n",
" conv_hidden_size=conv_hidden_size,\n",
" MovingAvg=MovingAvg_window,\n",
" dropout=dropout,\n",
" activation=activation\n",
" ) for l in range(encoder_layers)\n",
" ],\n",
" norm_layer=LayerNorm(hidden_size)\n",
" )\n",
" # Decoder\n",
" self.decoder = Decoder(\n",
" [\n",
" DecoderLayer(\n",
" AutoCorrelationLayer(\n",
" decoder_self_att,\n",
" hidden_size, n_head),\n",
" AutoCorrelationLayer(\n",
" decoder_cross_att,\n",
" hidden_size, n_head),\n",
" hidden_size=hidden_size,\n",
" c_out=self.c_out,\n",
" conv_hidden_size=conv_hidden_size,\n",
" MovingAvg=MovingAvg_window,\n",
" dropout=dropout,\n",
" activation=activation,\n",
" )\n",
" for l in range(decoder_layers)\n",
" ],\n",
" norm_layer=LayerNorm(hidden_size),\n",
" projection=nn.Linear(hidden_size, self.c_out, bias=True)\n",
" )\n",
"\n",
" def forward(self, windows_batch):\n",
" # Parse windows_batch\n",
" insample_y = windows_batch['insample_y']\n",
" #insample_mask = windows_batch['insample_mask']\n",
" #hist_exog = windows_batch['hist_exog']\n",
" #stat_exog = windows_batch['stat_exog']\n",
" futr_exog = windows_batch['futr_exog']\n",
"\n",
" # Parse inputs\n",
" insample_y = insample_y.unsqueeze(-1) # [Ws,L,1]\n",
" if self.futr_input_size > 0:\n",
" x_mark_enc = futr_exog[:,:self.input_size,:]\n",
" x_mark_dec = futr_exog[:,-(self.label_len+self.h):,:]\n",
" else:\n",
" x_mark_enc = None\n",
" x_mark_dec = None\n",
"\n",
" x_dec = torch.zeros(size=(len(insample_y),self.h, self.dec_in), device=insample_y.device)\n",
" x_dec = torch.cat([insample_y[:,-self.label_len:,:], x_dec], dim=1)\n",
" \n",
" # decomp init\n",
" mean = torch.mean(insample_y, dim=1).unsqueeze(1).repeat(1, self.h, 1)\n",
" zeros = torch.zeros([x_dec.shape[0], self.h, x_dec.shape[2]], device=insample_y.device)\n",
" seasonal_init, trend_init = self.decomp(insample_y)\n",
" # decoder input\n",
" trend_init = torch.cat([trend_init[:, -self.label_len:, :], mean], dim=1)\n",
" seasonal_init = torch.cat([seasonal_init[:, -self.label_len:, :], zeros], dim=1)\n",
" # enc\n",
" enc_out = self.enc_embedding(insample_y, x_mark_enc)\n",
" enc_out, attns = self.encoder(enc_out, attn_mask=None)\n",
" # dec\n",
" dec_out = self.dec_embedding(seasonal_init, x_mark_dec)\n",
" seasonal_part, trend_part = self.decoder(dec_out, enc_out, x_mask=None, cross_mask=None,\n",
" trend=trend_init)\n",
" # final\n",
" dec_out = trend_part + seasonal_part\n",
"\n",
" forecast = self.loss.domain_map(dec_out[:, -self.h:])\n",
" return forecast"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pytorch_lightning as pl\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.models import MLP\n",
"from neuralforecast.losses.pytorch import MQLoss, DistributionLoss, MSE\n",
"from neuralforecast.tsdataset import TimeSeriesDataset\n",
"from neuralforecast.utils import AirPassengers, AirPassengersPanel, AirPassengersStatic, augment_calendar_df\n",
"\n",
"AirPassengersPanel, calendar_cols = augment_calendar_df(df=AirPassengersPanel, freq='M')\n",
"\n",
"Y_train_df = AirPassengersPanel[AirPassengersPanel.ds<AirPassengersPanel['ds'].values[-12]] # 132 train\n",
"Y_test_df = AirPassengersPanel[AirPassengersPanel.ds>=AirPassengersPanel['ds'].values[-12]].reset_index(drop=True) # 12 test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"model = FEDformer(h=12,\n",
" input_size=24,\n",
" modes=64,\n",
" hidden_size=64,\n",
" conv_hidden_size=128,\n",
" n_head=8,\n",
" loss=MAE(),\n",
" futr_exog_list=calendar_cols,\n",
" scaler_type='robust',\n",
" learning_rate=1e-3,\n",
" max_steps=500,\n",
" batch_size=2,\n",
" windows_batch_size=32,\n",
" val_check_steps=50,\n",
" early_stop_patience_steps=2)\n",
"\n",
"nf = NeuralForecast(\n",
" models=[model],\n",
" freq='M',\n",
")\n",
"nf.fit(df=Y_train_df, static_df=None, val_size=12)\n",
"forecasts = nf.predict(futr_df=Y_test_df)\n",
"\n",
"Y_hat_df = forecasts.reset_index(drop=False).drop(columns=['unique_id','ds'])\n",
"plot_df = pd.concat([Y_test_df, Y_hat_df], axis=1)\n",
"plot_df = pd.concat([Y_train_df, plot_df])\n",
"\n",
"if model.loss.is_distribution_output:\n",
" plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
" plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
" plt.plot(plot_df['ds'], plot_df['FEDformer-median'], c='blue', label='median')\n",
" plt.fill_between(x=plot_df['ds'][-12:], \n",
" y1=plot_df['FEDformer-lo-90'][-12:].values, \n",
" y2=plot_df['FEDformer-hi-90'][-12:].values,\n",
" alpha=0.4, label='level 90')\n",
" plt.grid()\n",
" plt.legend()\n",
" plt.plot()\n",
"else:\n",
" plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
" plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
" plt.plot(plot_df['ds'], plot_df['FEDformer'], c='blue', label='Forecast')\n",
" plt.legend()\n",
" plt.grid()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp models.gru"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GRU"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cho et. al proposed the Gated Recurrent Unit (`GRU`) to improve on LSTM and Elman cells. The predictions at each time are given by a MLP decoder. This architecture follows closely the original Multi Layer Elman `RNN` with the main difference being its use of the GRU cells. The predictions are obtained by transforming the hidden states into contexts $\\mathbf{c}_{[t+1:t+H]}$, that are decoded and adapted into $\\mathbf{\\hat{y}}_{[t+1:t+H],[q]}$ through MLPs.\n",
"\n",
"\\begin{align}\n",
" \\mathbf{h}_{t} &= \\textrm{GRU}([\\mathbf{y}_{t},\\mathbf{x}^{(h)}_{t},\\mathbf{x}^{(s)}], \\mathbf{h}_{t-1})\\\\\n",
"\\mathbf{c}_{[t+1:t+H]}&=\\textrm{Linear}([\\mathbf{h}_{t}, \\mathbf{x}^{(f)}_{[:t+H]}]) \\\\ \n",
"\\hat{y}_{\\tau,[q]}&=\\textrm{MLP}([\\mathbf{c}_{\\tau},\\mathbf{x}^{(f)}_{\\tau}])\n",
"\\end{align}\n",
"\n",
"where $\\mathbf{h}_{t}$, is the hidden state for time $t$, $\\mathbf{y}_{t}$ is the input at time $t$ and $\\mathbf{h}_{t-1}$ is the hidden state of the previous layer at $t-1$, $\\mathbf{x}^{(s)}$ are static exogenous inputs, $\\mathbf{x}^{(h)}_{t}$ historic exogenous, $\\mathbf{x}^{(f)}_{[:t+H]}$ are future exogenous available at the time of the prediction.\n",
"\n",
"**References**<br>\n",
"-[Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio (2014). \"Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling\".](https:arxivorg/abs/1412.3555)<br>\n",
"-[Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, Yoshua Bengio (2014). \"On the Properties of Neural Machine Translation: Encoder-Decoder Approaches\".](https://arxiv.org/abs/1409.1259)<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Figure 1. Gated Recurrent Unit Cell.](imgs_models/gru.png)"
]
},
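{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal, self-contained sketch of the dataflow described above, with made-up sizes; it is illustrative only and is not the `GRU` class implemented later in this notebook. A `torch.nn.GRU` encoder produces hidden states $\\mathbf{h}_{t}$, a linear adapter maps each hidden state into $H$ context vectors $\\mathbf{c}_{[t+1:t+H]}$, and a small MLP head decodes every context into a forecast."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"# Minimal sketch of the GRU -> contexts -> MLP decoder dataflow.\n",
"# Sizes are illustrative assumptions; the library's GRU class is defined below.\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"B, L, H, hidden, context = 4, 24, 12, 64, 10  # batch, input length, horizon, sizes\n",
"y = torch.randn(B, L, 1)                      # univariate target windows [B, L, 1]\n",
"\n",
"gru = nn.GRU(input_size=1, hidden_size=hidden, num_layers=2, batch_first=True)\n",
"adapter = nn.Linear(hidden, context * H)      # h_t -> H context vectors per anchor t\n",
"decoder = nn.Sequential(nn.Linear(context, 32), nn.ReLU(), nn.Linear(32, 1))\n",
"\n",
"h, _ = gru(y)                                 # [B, L, hidden]\n",
"c = adapter(h).reshape(B, L, H, context)      # [B, L, H, context]\n",
"y_hat = decoder(c).squeeze(-1)                # [B, L, H] forecasts for every anchor t\n",
"print(y_hat.shape)                            # torch.Size([4, 24, 12])"
]
},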
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"from typing import Optional\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"from neuralforecast.losses.pytorch import MAE\n",
"from neuralforecast.common._base_recurrent import BaseRecurrent\n",
"from neuralforecast.common._modules import MLP"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class GRU(BaseRecurrent):\n",
" \"\"\" GRU\n",
"\n",
" Multi Layer Recurrent Network with Gated Units (GRU), and\n",
" MLP decoder. The network has `tanh` or `relu` non-linearities, it is trained \n",
" using ADAM stochastic gradient descent. The network accepts static, historic \n",
" and future exogenous data, flattens the inputs.\n",
"\n",
" **Parameters:**<br>\n",
" `h`: int, forecast horizon.<br>\n",
" `input_size`: int, maximum sequence length for truncated train backpropagation. Default -1 uses all history.<br>\n",
" `inference_input_size`: int, maximum sequence length for truncated inference. Default -1 uses all history.<br>\n",
" `encoder_n_layers`: int=2, number of layers for the GRU.<br>\n",
" `encoder_hidden_size`: int=200, units for the GRU's hidden state size.<br>\n",
" `encoder_activation`: str=`tanh`, type of GRU activation from `tanh` or `relu`.<br>\n",
" `encoder_bias`: bool=True, whether or not to use biases b_ih, b_hh within GRU units.<br>\n",
" `encoder_dropout`: float=0., dropout regularization applied to GRU outputs.<br>\n",
" `context_size`: int=10, size of context vector for each timestamp on the forecasting window.<br>\n",
" `decoder_hidden_size`: int=200, size of hidden layer for the MLP decoder.<br>\n",
" `decoder_layers`: int=2, number of layers for the MLP decoder.<br>\n",
" `futr_exog_list`: str list, future exogenous columns.<br>\n",
" `hist_exog_list`: str list, historic exogenous columns.<br>\n",
" `stat_exog_list`: str list, static exogenous columns.<br>\n",
" `loss`: PyTorch module, instantiated train loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `valid_loss`: PyTorch module=`loss`, instantiated valid loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).<br>\n",
" `max_steps`: int=1000, maximum number of training steps.<br>\n",
" `learning_rate`: float=1e-3, Learning rate between (0, 1).<br>\n",
" `num_lr_decays`: int=-1, Number of learning rate decays, evenly distributed across max_steps.<br>\n",
" `early_stop_patience_steps`: int=-1, Number of validation iterations before early stopping.<br>\n",
" `val_check_steps`: int=100, Number of training steps between every validation loss check.<br>\n",
" `batch_size`: int=32, number of differentseries in each batch.<br>\n",
" `valid_batch_size`: int=None, number of different series in each validation and test batch.<br>\n",
" `scaler_type`: str='robust', type of scaler for temporal inputs normalization see [temporal scalers](https://nixtla.github.io/neuralforecast/common.scalers.html).<br>\n",
" `random_seed`: int=1, random_seed for pytorch initializer and numpy generators.<br>\n",
" `num_workers_loader`: int=os.cpu_count(), workers to be used by `TimeSeriesDataLoader`.<br>\n",
" `drop_last_loader`: bool=False, if True `TimeSeriesDataLoader` drops last non-full batch.<br>\n",
" `alias`: str, optional, Custom name of the model.<br>\n",
" `optimizer`: Subclass of 'torch.optim.Optimizer', optional, user specified optimizer instead of the default choice (Adam).<br>\n",
" `optimizer_kwargs`: dict, optional, list of parameters used by the user specified `optimizer`.<br>\n",
" `**trainer_kwargs`: int, keyword trainer arguments inherited from [PyTorch Lighning's trainer](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).<br> \n",
" \"\"\"\n",
" # Class attributes\n",
" SAMPLING_TYPE = 'recurrent'\n",
" \n",
" def __init__(self,\n",
" h: int,\n",
" input_size: int = -1,\n",
" inference_input_size: int = -1,\n",
" encoder_n_layers: int = 2,\n",
" encoder_hidden_size: int = 200,\n",
" encoder_activation: str = 'tanh',\n",
" encoder_bias: bool = True,\n",
" encoder_dropout: float = 0.,\n",
" context_size: int = 10,\n",
" decoder_hidden_size: int = 200,\n",
" decoder_layers: int = 2,\n",
" futr_exog_list = None,\n",
" hist_exog_list = None,\n",
" stat_exog_list = None,\n",
" loss = MAE(),\n",
" valid_loss = None,\n",
" max_steps: int = 1000,\n",
" learning_rate: float = 1e-3,\n",
" num_lr_decays: int = -1,\n",
" early_stop_patience_steps: int =-1,\n",
" val_check_steps: int = 100,\n",
" batch_size=32,\n",
" valid_batch_size: Optional[int] = None,\n",
" scaler_type: str='robust',\n",
" random_seed=1,\n",
" num_workers_loader=0,\n",
" drop_last_loader=False,\n",
" optimizer=None,\n",
" optimizer_kwargs=None,\n",
" **trainer_kwargs):\n",
" super(GRU, self).__init__(\n",
" h=h,\n",
" input_size=input_size,\n",
" inference_input_size=inference_input_size,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" max_steps=max_steps,\n",
" learning_rate=learning_rate,\n",
" num_lr_decays=num_lr_decays,\n",
" early_stop_patience_steps=early_stop_patience_steps,\n",
" val_check_steps=val_check_steps,\n",
" batch_size=batch_size,\n",
" valid_batch_size=valid_batch_size,\n",
" scaler_type=scaler_type,\n",
" futr_exog_list=futr_exog_list,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" num_workers_loader=num_workers_loader,\n",
" drop_last_loader=drop_last_loader,\n",
" random_seed=random_seed,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" **trainer_kwargs\n",
" )\n",
"\n",
" # RNN\n",
" self.encoder_n_layers = encoder_n_layers\n",
" self.encoder_hidden_size = encoder_hidden_size\n",
" self.encoder_bias = encoder_bias\n",
" self.encoder_dropout = encoder_dropout\n",
" \n",
" # Context adapter\n",
" self.context_size = context_size\n",
"\n",
" # MLP decoder\n",
" self.decoder_hidden_size = decoder_hidden_size\n",
" self.decoder_layers = decoder_layers\n",
"\n",
" self.futr_exog_size = len(self.futr_exog_list)\n",
" self.hist_exog_size = len(self.hist_exog_list)\n",
" self.stat_exog_size = len(self.stat_exog_list)\n",
" \n",
" # RNN input size (1 for target variable y)\n",
" input_encoder = 1 + self.hist_exog_size + self.stat_exog_size\n",
"\n",
" # Instantiate model\n",
" self.hist_encoder = nn.GRU(input_size=input_encoder,\n",
" hidden_size=self.encoder_hidden_size,\n",
" num_layers=self.encoder_n_layers,\n",
" bias=self.encoder_bias,\n",
" dropout=self.encoder_dropout,\n",
" batch_first=True)\n",
"\n",
" # Context adapter\n",
" self.context_adapter = nn.Linear(in_features=self.encoder_hidden_size + self.futr_exog_size * h,\n",
" out_features=self.context_size * h)\n",
"\n",
" # Decoder MLP\n",
" self.mlp_decoder = MLP(in_features=self.context_size + self.futr_exog_size,\n",
" out_features=self.loss.outputsize_multiplier,\n",
" hidden_size=self.decoder_hidden_size,\n",
" num_layers=self.decoder_layers,\n",
" activation='ReLU',\n",
" dropout=0.0)\n",
"\n",
" def forward(self, windows_batch):\n",
" \n",
" # Parse windows_batch\n",
" encoder_input = windows_batch['insample_y'] # [B, seq_len, 1]\n",
" futr_exog = windows_batch['futr_exog']\n",
" hist_exog = windows_batch['hist_exog']\n",
" stat_exog = windows_batch['stat_exog']\n",
"\n",
" # Concatenate y, historic and static inputs\n",
" # [B, C, seq_len, 1] -> [B, seq_len, C]\n",
" # Contatenate [ Y_t, | X_{t-L},..., X_{t} | S ]\n",
" batch_size, seq_len = encoder_input.shape[:2]\n",
" if self.hist_exog_size > 0:\n",
" hist_exog = hist_exog.permute(0,2,1,3).squeeze(-1) # [B, X, seq_len, 1] -> [B, seq_len, X]\n",
" encoder_input = torch.cat((encoder_input, hist_exog), dim=2)\n",
"\n",
" if self.stat_exog_size > 0:\n",
" stat_exog = stat_exog.unsqueeze(1).repeat(1, seq_len, 1) # [B, S] -> [B, seq_len, S]\n",
" encoder_input = torch.cat((encoder_input, stat_exog), dim=2)\n",
"\n",
" # RNN forward\n",
" hidden_state, _ = self.hist_encoder(encoder_input) # [B, seq_len, rnn_hidden_state]\n",
"\n",
" if self.futr_exog_size > 0:\n",
" futr_exog = futr_exog.permute(0,2,3,1)[:,:,1:,:] # [B, F, seq_len, 1+H] -> [B, seq_len, H, F]\n",
" hidden_state = torch.cat(( hidden_state, futr_exog.reshape(batch_size, seq_len, -1)), dim=2)\n",
"\n",
" # Context adapter\n",
" context = self.context_adapter(hidden_state)\n",
" context = context.reshape(batch_size, seq_len, self.h, self.context_size)\n",
"\n",
" # Residual connection with futr_exog\n",
" if self.futr_exog_size > 0:\n",
" context = torch.cat((context, futr_exog), dim=-1)\n",
"\n",
" # Final forecast\n",
" output = self.mlp_decoder(context)\n",
" output = self.loss.domain_map(output)\n",
" \n",
" return output"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(GRU)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(GRU.fit, name='GRU.fit')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(GRU.predict, name='GRU.predict')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pytorch_lightning as pl\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.models import GRU\n",
"from neuralforecast.losses.pytorch import MQLoss, DistributionLoss\n",
"from neuralforecast.utils import AirPassengersPanel, AirPassengersStatic\n",
"from neuralforecast.tsdataset import TimeSeriesDataset, TimeSeriesLoader\n",
"\n",
"Y_train_df = AirPassengersPanel[AirPassengersPanel.ds<AirPassengersPanel['ds'].values[-12]] # 132 train\n",
"Y_test_df = AirPassengersPanel[AirPassengersPanel.ds>=AirPassengersPanel['ds'].values[-12]].reset_index(drop=True) # 12 test\n",
"\n",
"fcst = NeuralForecast(\n",
" models=[GRU(h=12,input_size=-1,\n",
" loss=DistributionLoss(distribution='Normal', level=[80, 90]),\n",
" scaler_type='robust',\n",
" encoder_n_layers=2,\n",
" encoder_hidden_size=128,\n",
" context_size=10,\n",
" decoder_hidden_size=128,\n",
" decoder_layers=2,\n",
" max_steps=200,\n",
" futr_exog_list=None,\n",
" hist_exog_list=['y_[lag12]'],\n",
" stat_exog_list=['airline1'],\n",
" )\n",
" ],\n",
" freq='M'\n",
")\n",
"fcst.fit(df=Y_train_df, static_df=AirPassengersStatic)\n",
"forecasts = fcst.predict(futr_df=Y_test_df)\n",
"\n",
"Y_hat_df = forecasts.reset_index(drop=False).drop(columns=['unique_id','ds'])\n",
"plot_df = pd.concat([Y_test_df, Y_hat_df], axis=1)\n",
"plot_df = pd.concat([Y_train_df, plot_df])\n",
"\n",
"plot_df = plot_df[plot_df.unique_id=='Airline1'].drop('unique_id', axis=1)\n",
"plt.plot(plot_df['ds'], plot_df['y'], c='black', label='True')\n",
"plt.plot(plot_df['ds'], plot_df['GRU-median'], c='blue', label='median')\n",
"plt.fill_between(x=plot_df['ds'][-12:], \n",
" y1=plot_df['GRU-lo-90'][-12:].values, \n",
" y2=plot_df['GRU-hi-90'][-12:].values,\n",
" alpha=0.4, label='level 90')\n",
"plt.legend()\n",
"plt.grid()\n",
"plt.plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp models.hint"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# HINT"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The Hierarchical Mixture Networks (HINT) are a highly modular framework that combines SoTA neural forecast architectures with task-specialized mixture probability and advanced hierarchical reconciliation strategies. This powerful combination allows HINT to produce accurate and coherent probabilistic forecasts.\n",
"\n",
"HINT's incorporates a `TemporalNorm` module into any neural forecast architecture, the module normalizes inputs into the network's non-linearities operating range and recomposes its output's scales through a global skip connection, improving accuracy and training robustness. HINT ensures the forecast coherence via bootstrap sample reconciliation that restores the aggregation constraints into its base samples.\n",
"\n",
"**References**<br>\n",
"- [Kin G. Olivares, David Luo, Cristian Challu, Stefania La Vattiata, Max Mergenthaler, Artur Dubrawski (2023). \"HINT: Hierarchical Mixture Networks For Coherent Probabilistic Forecasting\". Neural Information Processing Systems, submitted. Working Paper version available at arxiv.](https://arxiv.org/abs/2305.07089)<br>\n",
"- [Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker (2022). \"Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures\". International Journal of Forecasting, accepted paper available at arxiv.](https://arxiv.org/pdf/2110.13179.pdf)<br>\n",
"- [Kin G. Olivares, Federico Garza, David Luo, Cristian Challu, Max Mergenthaler, Souhaib Ben Taieb, Shanika Wickramasuriya, and Artur Dubrawski (2022). \"HierarchicalForecast: A reference framework for hierarchical forecasting in python\". Journal of Machine Learning Research, submitted, abs/2207.03517, 2022b.](https://arxiv.org/abs/2207.03517)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![Figure 1. Hierarchical Mixture Networks (HINT).](imgs_models/hint.png)"
]
},
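{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a hand-written, illustrative sketch of bootstrap sample reconciliation on a toy hierarchy with Total = A + B: incoherent base samples are projected with a BottomUp $\\mathbf{S}\\mathbf{P}$ matrix so that every reconciled sample satisfies the aggregation constraint. The matrices and numbers are made up for the example; the library's reconciliation helpers are defined in the next section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"# Illustrative sketch: reconcile forecast samples of a toy hierarchy Total = A + B.\n",
"# Matrices are written by hand; the reconciliation helpers below cover the general case.\n",
"import numpy as np\n",
"\n",
"S = np.array([[1., 1.],   # Total\n",
"              [1., 0.],   # A\n",
"              [0., 1.]])  # B\n",
"P_bu = np.array([[0., 1., 0.],   # BottomUp projection: keep only the bottom rows\n",
"                 [0., 0., 1.]])\n",
"\n",
"# Base (incoherent) forecast samples: [n_series, n_samples]\n",
"rng = np.random.default_rng(0)\n",
"base_samples = rng.normal(loc=[[10.], [4.], [5.]], scale=1.0, size=(3, 1000))\n",
"\n",
"coherent_samples = S @ P_bu @ base_samples  # restores Total = A + B in every sample\n",
"np.testing.assert_allclose(coherent_samples[0],\n",
"                           coherent_samples[1] + coherent_samples[2])"
]
},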
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from nbdev.showdoc import show_doc\n",
"from neuralforecast.losses.pytorch import GMM\n",
"from neuralforecast import NeuralForecast\n",
"from neuralforecast.models import NHITS\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"from typing import Optional\n",
"\n",
"import numpy as np\n",
"import torch"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reconciliation Methods"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"def get_bottomup_P(S: np.ndarray):\n",
" \"\"\"BottomUp Reconciliation Matrix.\n",
"\n",
" Creates BottomUp hierarchical \\\"projection\\\" matrix is defined as:\n",
" $$\\mathbf{P}_{\\\\text{BU}} = [\\mathbf{0}_{\\mathrm{[b],[a]}}\\;|\\;\\mathbf{I}_{\\mathrm{[b][b]}}]$$ \n",
"\n",
" **Parameters:**<br>\n",
" `S`: Summing matrix of size (`base`, `bottom`).<br>\n",
"\n",
" **Returns:**<br>\n",
" `P`: Reconciliation matrix of size (`bottom`, `base`).<br>\n",
"\n",
" **References:**<br>\n",
" - [Orcutt, G.H., Watts, H.W., & Edwards, J.B.(1968). \\\"Data aggregation and information loss\\\". The American \n",
" Economic Review, 58 , 773(787)](http://www.jstor.org/stable/1815532). \n",
" \"\"\"\n",
" n_series = len(S)\n",
" n_agg = n_series-S.shape[1]\n",
" P = np.zeros_like(S)\n",
" P[n_agg:,:] = S[n_agg:,:]\n",
" P = P.T\n",
" return P\n",
"\n",
"def get_mintrace_ols_P(S: np.ndarray):\n",
" \"\"\"MinTraceOLS Reconciliation Matrix.\n",
"\n",
" Creates MinTraceOLS reconciliation matrix as proposed by Wickramasuriya et al.\n",
"\n",
" $$\\mathbf{P}_{\\\\text{MinTraceOLS}}=\\\\left(\\mathbf{S}^{\\intercal}\\mathbf{S}\\\\right)^{-1}\\mathbf{S}^{\\intercal}$$\n",
"\n",
" **Parameters:**<br>\n",
" `S`: Summing matrix of size (`base`, `bottom`).<br>\n",
" \n",
" **Returns:**<br>\n",
" `P`: Reconciliation matrix of size (`bottom`, `base`).<br>\n",
"\n",
" **References:**<br>\n",
" - [Wickramasuriya, S.L., Turlach, B.A. & Hyndman, R.J. (2020). \\\"Optimal non-negative\n",
" forecast reconciliation\". Stat Comput 30, 1167–1182,\n",
" https://doi.org/10.1007/s11222-020-09930-0](https://robjhyndman.com/publications/nnmint/).\n",
" \"\"\"\n",
" n_hiers, n_bottom = S.shape\n",
" n_agg = n_hiers - n_bottom\n",
"\n",
" W = np.eye(n_hiers)\n",
"\n",
" # We compute reconciliation matrix with\n",
" # Equation 10 from https://robjhyndman.com/papers/MinT.pdf\n",
" A = S[:n_agg,:]\n",
" U = np.hstack((np.eye(n_agg), -A)).T\n",
" J = np.hstack((np.zeros((n_bottom,n_agg)), np.eye(n_bottom)))\n",
" P = J - (J @ W @ U) @ np.linalg.pinv(U.T @ W @ U) @ U.T\n",
" return P\n",
"\n",
"def get_mintrace_wls_P(S: np.ndarray):\n",
" \"\"\"MinTraceOLS Reconciliation Matrix.\n",
"\n",
" Creates MinTraceOLS reconciliation matrix as proposed by Wickramasuriya et al.\n",
" Depending on a weighted GLS estimator and an estimator of the covariance matrix of the coherency errors $\\mathbf{W}_{h}$.\n",
"\n",
" $$ \\mathbf{W}_{h} = \\mathrm{Diag}(\\mathbf{S} \\mathbb{1}_{[b]})$$\n",
"\n",
" $$\\mathbf{P}_{\\\\text{MinTraceWLS}}=\\\\left(\\mathbf{S}^{\\intercal}\\mathbf{W}_{h}\\mathbf{S}\\\\right)^{-1}\n",
" \\mathbf{S}^{\\intercal}\\mathbf{W}^{-1}_{h}$$ \n",
"\n",
" **Parameters:**<br>\n",
" `S`: Summing matrix of size (`base`, `bottom`).<br>\n",
" \n",
" **Returns:**<br>\n",
" `P`: Reconciliation matrix of size (`bottom`, `base`).<br>\n",
"\n",
" **References:**<br>\n",
" - [Wickramasuriya, S.L., Turlach, B.A. & Hyndman, R.J. (2020). \\\"Optimal non-negative\n",
" forecast reconciliation\". Stat Comput 30, 1167–1182,\n",
" https://doi.org/10.1007/s11222-020-09930-0](https://robjhyndman.com/publications/nnmint/).\n",
" \"\"\"\n",
" n_hiers, n_bottom = S.shape\n",
" n_agg = n_hiers - n_bottom\n",
" \n",
" W = np.diag(S @ np.ones((n_bottom,)))\n",
"\n",
" # We compute reconciliation matrix with\n",
" # Equation 10 from https://robjhyndman.com/papers/MinT.pdf\n",
" A = S[:n_agg,:]\n",
" U = np.hstack((np.eye(n_agg), -A)).T\n",
" J = np.hstack((np.zeros((n_bottom,n_agg)), np.eye(n_bottom)))\n",
" P = J - (J @ W @ U) @ np.linalg.pinv(U.T @ W @ U) @ U.T\n",
" return P\n",
"\n",
"def get_identity_P(S: np.ndarray):\n",
" # Placeholder function for identity P (no reconciliation).\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(get_bottomup_P, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(get_mintrace_ols_P, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(get_mintrace_wls_P, title_level=3)"
]
},
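{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustrative check of these projections (using a hand-written summing matrix, not one loaded from data), the cell below verifies the defining property $\\mathbf{S}\\mathbf{P}\\mathbf{S}=\\mathbf{S}$, so forecasts that are already coherent are left unchanged by $\\mathbf{S}\\mathbf{P}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"# Illustrative check on a hand-written summing matrix: S P S = S for each projection,\n",
"# i.e. coherent vectors are fixed points of the S P reconciliation map.\n",
"import numpy as np\n",
"\n",
"S = np.array([[1., 1., 1., 1.],   # Total\n",
"              [1., 1., 0., 0.],   # Mid1\n",
"              [0., 0., 1., 1.],   # Mid2\n",
"              [1., 0., 0., 0.],   # Bottom1\n",
"              [0., 1., 0., 0.],   # Bottom2\n",
"              [0., 0., 1., 0.],   # Bottom3\n",
"              [0., 0., 0., 1.]])  # Bottom4\n",
"\n",
"for get_P in [get_bottomup_P, get_mintrace_ols_P, get_mintrace_wls_P]:\n",
"    P = get_P(S=S)\n",
"    np.testing.assert_allclose(S @ P @ S, S, atol=1e-6)\n",
"    print(get_P.__name__, 'P shape:', P.shape)  # (bottom, base) = (4, 7)"
]
},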
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## HINT"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class HINT:\n",
" \"\"\" HINT\n",
"\n",
" The Hierarchical Mixture Networks (HINT) are a highly modular framework that \n",
" combines SoTA neural forecast architectures with a task-specialized mixture \n",
" probability and advanced hierarchical reconciliation strategies. This powerful \n",
" combination allows HINT to produce accurate and coherent probabilistic forecasts.\n",
"\n",
" HINT's incorporates a `TemporalNorm` module into any neural forecast architecture, \n",
" the module normalizes inputs into the network's non-linearities operating range \n",
" and recomposes its output's scales through a global skip connection, improving \n",
" accuracy and training robustness. HINT ensures the forecast coherence via bootstrap \n",
" sample reconciliation that restores the aggregation constraints into its base samples.\n",
"\n",
" Available reconciliations:<br>\n",
" - BottomUp<br>\n",
" - MinTraceOLS<br>\n",
" - MinTraceWLS<br>\n",
" - Identity\n",
"\n",
" **Parameters:**<br>\n",
" `h`: int, Forecast horizon. <br>\n",
" `model`: NeuralForecast model, instantiated model class from [architecture collection](https://nixtla.github.io/neuralforecast/models.pytorch.html).<br>\n",
" `S`: np.ndarray, dumming matrix of size (`base`, `bottom`) see HierarchicalForecast's [aggregate method](https://nixtla.github.io/hierarchicalforecast/utils.html#aggregate).<br>\n",
" `reconciliation`: str, HINT's reconciliation method from ['BottomUp', 'MinTraceOLS', 'MinTraceWLS'].<br>\n",
" `alias`: str, optional, Custom name of the model.<br>\n",
" \"\"\"\n",
" def __init__(self,\n",
" h: int,\n",
" S: np.ndarray,\n",
" model,\n",
" reconciliation: str,\n",
" alias: Optional[str] = None):\n",
" \n",
" if model.h != h:\n",
" raise Exception(f\"Model h {model.h} does not match HINT h {h}\")\n",
" \n",
" if not model.loss.is_distribution_output:\n",
" raise Exception(f\"The NeuralForecast model's loss {model.loss} is not a probabilistic objective\")\n",
" \n",
" self.h = h\n",
" self.model = model\n",
" self.early_stop_patience_steps = model.early_stop_patience_steps\n",
" self.S = S\n",
" self.reconciliation = reconciliation\n",
" self.loss = model.loss\n",
"\n",
" available_reconciliations = dict(\n",
" BottomUp=get_bottomup_P,\n",
" MinTraceOLS=get_mintrace_ols_P,\n",
" MinTraceWLS=get_mintrace_wls_P,\n",
" Identity=get_identity_P,\n",
" )\n",
"\n",
" if reconciliation not in available_reconciliations:\n",
" raise Exception(f\"Reconciliation {reconciliation} not available\")\n",
"\n",
" # Get SP matrix\n",
" self.reconciliation = reconciliation\n",
" if reconciliation== 'Identity':\n",
" self.SP = None\n",
" else:\n",
" P = available_reconciliations[reconciliation](S=S)\n",
" self.SP = S @ P\n",
"\n",
" qs = torch.Tensor((np.arange(self.loss.num_samples)/self.loss.num_samples))\n",
" self.sample_quantiles = torch.nn.Parameter(qs, requires_grad=False)\n",
" self.alias = alias\n",
" \n",
" def __repr__(self):\n",
" return type(self).__name__ if self.alias is None else self.alias\n",
"\n",
"\n",
" def fit(self, dataset, val_size=0, test_size=0, random_seed=None, distributed_config=None):\n",
" \"\"\" HINT.fit\n",
"\n",
" HINT trains on the entire hierarchical dataset, by minimizing a composite log likelihood objective.\n",
" HINT framework integrates `TemporalNorm` into the neural forecast architecture for a scale-decoupled \n",
" optimization that robustifies cross-learning the hierachy's series scales.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset` see details [here](https://nixtla.github.io/neuralforecast/tsdataset.html)<br>\n",
" `val_size`: int, size of the validation set, (default 0).<br>\n",
" `test_size`: int, size of the test set, (default 0).<br>\n",
" `random_seed`: int, random seed for the prediction.<br>\n",
"\n",
" **Returns:**<br>\n",
" `self`: A fitted base `NeuralForecast` model.<br>\n",
" \"\"\"\n",
" model = self.model.fit(dataset=dataset,\n",
" val_size=val_size,\n",
" test_size=test_size,\n",
" random_seed=random_seed,\n",
" distributed_config=distributed_config)\n",
"\n",
" # Added attributes for compatibility with NeuralForecast core\n",
" self.futr_exog_list = self.model.futr_exog_list\n",
" self.hist_exog_list = self.model.hist_exog_list\n",
" self.stat_exog_list = self.model.stat_exog_list\n",
" return model\n",
"\n",
" def predict(self, dataset, step_size=1, random_seed=None, **data_module_kwargs):\n",
" \"\"\" HINT.predict\n",
"\n",
" After fitting a base model on the entire hierarchical dataset.\n",
" HINT restores the hierarchical aggregation constraints using \n",
" bootstrapped sample reconciliation.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset` see details [here](https://nixtla.github.io/neuralforecast/tsdataset.html)<br>\n",
" `step_size`: int, steps between sequential predictions, (default 1).<br>\n",
" `random_seed`: int, random seed for the prediction.<br>\n",
" `**data_kwarg`: additional parameters for the dataset module.<br>\n",
"\n",
" **Returns:**<br>\n",
" `y_hat`: numpy predictions of the `NeuralForecast` model.<br>\n",
" \"\"\"\n",
" # Non-reconciled predictions\n",
" if self.reconciliation=='Identity':\n",
" forecasts = self.model.predict(dataset=dataset, \n",
" step_size=step_size,\n",
" random_seed=random_seed,\n",
" **data_module_kwargs)\n",
" return forecasts\n",
"\n",
" num_samples = self.model.loss.num_samples\n",
"\n",
" # Hack to get samples by simulating quantiles (samples will be ordered)\n",
" # Mysterious parsing associated to default [mean,quantiles] output\n",
" quantiles_old = self.model.loss.quantiles\n",
" names_old = self.model.loss.output_names\n",
" self.model.loss.quantiles = self.sample_quantiles\n",
" self.model.loss.output_names = ['1'] * (1 + num_samples)\n",
" samples = self.model.predict(dataset=dataset, \n",
" step_size=step_size,\n",
" random_seed=random_seed,\n",
" **data_module_kwargs)\n",
" samples = samples[:,1:] # Eliminate mean from quantiles\n",
" self.model.loss.quantiles = quantiles_old\n",
" self.model.loss.output_names = names_old\n",
"\n",
" # Hack requires to break quantiles correlations between samples\n",
" idxs = np.random.choice(num_samples, size=samples.shape, replace=True)\n",
" aux_col_idx = np.arange(len(samples))[:,None] * num_samples\n",
" idxs = idxs + aux_col_idx\n",
" samples = samples.flatten()[idxs]\n",
" samples = samples.reshape(dataset.n_groups, -1, self.h, num_samples)\n",
" \n",
" # Bootstrap Sample Reconciliation\n",
" # Default output [mean, quantiles]\n",
" samples = np.einsum('ij, jwhp -> iwhp', self.SP, samples)\n",
"\n",
" sample_mean = np.mean(samples, axis=-1, keepdims=True)\n",
" sample_mean = sample_mean.reshape(-1, 1)\n",
"\n",
" forecasts = np.quantile(samples, self.model.loss.quantiles, axis=-1)\n",
" forecasts = forecasts.transpose(1,2,3,0) # [...,samples]\n",
" forecasts = forecasts.reshape(-1, len(self.model.loss.quantiles))\n",
"\n",
" forecasts = np.concatenate([sample_mean, forecasts], axis=-1)\n",
" return forecasts\n",
"\n",
" def set_test_size(self, test_size):\n",
" self.model.test_size = test_size\n",
"\n",
" def get_test_size(self):\n",
" return self.model.test_size\n",
"\n",
" def save(self, path):\n",
" \"\"\" HINT.save\n",
"\n",
" Save the HINT fitted model to disk.\n",
"\n",
" **Parameters:**<br>\n",
" `path`: str, path to save the model.<br>\n",
" \"\"\"\n",
" self.model.save(path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(HINT, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(HINT.fit, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(HINT.predict, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# | hide\n",
"# Unit test to check hierarchical coherence\n",
"# Probabilistic coherent => Sample coherent => Mean coherence\n",
"\n",
"def sort_df_hier(Y_df, S_df):\n",
" # NeuralForecast core, sorts unique_id lexicographically\n",
" # by default, this class matches S_df and Y_hat_df order. \n",
" Y_df.unique_id = Y_df.unique_id.astype('category')\n",
" Y_df.unique_id = Y_df.unique_id.cat.set_categories(S_df.index)\n",
" Y_df = Y_df.sort_values(by=['unique_id', 'ds'])\n",
" return Y_df\n",
"\n",
"# -----Create synthetic dataset-----\n",
"np.random.seed(123)\n",
"train_steps = 20\n",
"num_levels = 7\n",
"level = np.arange(0, 100, 0.1)\n",
"qs = [[50-lv/2, 50+lv/2] for lv in level]\n",
"quantiles = np.sort(np.concatenate(qs)/100)\n",
"\n",
"levels = ['Top', 'Mid1', 'Mid2', 'Bottom1', 'Bottom2', 'Bottom3', 'Bottom4']\n",
"unique_ids = np.repeat(levels, train_steps)\n",
"\n",
"S = np.array([[1., 1., 1., 1.],\n",
" [1., 1., 0., 0.],\n",
" [0., 0., 1., 1.],\n",
" [1., 0., 0., 0.],\n",
" [0., 1., 0., 0.],\n",
" [0., 0., 1., 0.],\n",
" [0., 0., 0., 1.]])\n",
"\n",
"S_dict = {col: S[:, i] for i, col in enumerate(levels[3:])}\n",
"S_df = pd.DataFrame(S_dict, index=levels)\n",
"\n",
"ds = pd.date_range(start='2018-03-31', periods=train_steps, freq='Q').tolist() * num_levels\n",
"# Create Y_df\n",
"y_lists = [S @ np.random.uniform(low=100, high=500, size=4) for i in range(train_steps)]\n",
"y = [elem for tup in zip(*y_lists) for elem in tup]\n",
"Y_df = pd.DataFrame({'unique_id': unique_ids, 'ds': ds, 'y': y})\n",
"Y_df = sort_df_hier(Y_df, S_df)\n",
"\n",
"# ------Fit/Predict HINT Model------\n",
"# Model + Distribution + Reconciliation\n",
"nhits = NHITS(h=4,\n",
" input_size=4,\n",
" loss=GMM(n_components=2, quantiles=quantiles, num_samples=len(quantiles)),\n",
" max_steps=5,\n",
" early_stop_patience_steps=2,\n",
" val_check_steps=1,\n",
" scaler_type='robust',\n",
" learning_rate=1e-3)\n",
"model = HINT(h=4, model=nhits, S=S, reconciliation='BottomUp')\n",
"\n",
"# Fit and Predict\n",
"nf = NeuralForecast(models=[model], freq='Q')\n",
"forecasts = nf.cross_validation(df=Y_df, val_size=4, n_windows=1)\n",
"\n",
"# ---Check Hierarchical Coherence---\n",
"parent_children_dict = {0: [1, 2], 1: [3, 4], 2: [5, 6]}\n",
"# check coherence for each horizon time step\n",
"for _, df in forecasts.groupby('ds'):\n",
" hint_mean = df['HINT'].values\n",
" for parent_idx, children_list in parent_children_dict.items():\n",
" parent_value = hint_mean[parent_idx]\n",
" children_sum = hint_mean[children_list].sum()\n",
" np.testing.assert_allclose(children_sum, parent_value)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage Example"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example we will use HINT for the hierarchical forecast task, a multivariate regression problem with aggregation constraints. The aggregation constraints can be compactcly represented by the summing matrix $\\mathbf{S}_{[i][b]}$, the Figure belows shows an example.\n",
"\n",
"In this example we will make coherent predictions for the TourismL dataset. \n",
"\n",
"Outline<br>\n",
"1. Import packages<br>\n",
"2. Load hierarchical dataset<br>\n",
"3. Fit and Predict HINT<br>\n",
"4. Forecast Plot"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](imgs_models/hint_notation.png)"
]
},
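{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Before loading the real data, here is a toy instance of the summing-matrix notation above (values made up, not the TourismLarge hierarchy): $\\mathbf{S}_{[i][b]}$ maps the bottom series to every level of the hierarchy, and HINT consumes it as `S_df.values`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"# Toy summing matrix in the S_df format used below (made-up hierarchy, not TourismLarge):\n",
"# every level of the hierarchy is obtained by aggregating the bottom series with S.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"bottom = ['Bottom1', 'Bottom2', 'Bottom3', 'Bottom4']\n",
"levels = ['Total', 'Mid1', 'Mid2'] + bottom\n",
"S_toy = pd.DataFrame(np.array([[1., 1., 1., 1.],\n",
"                               [1., 1., 0., 0.],\n",
"                               [0., 0., 1., 1.],\n",
"                               [1., 0., 0., 0.],\n",
"                               [0., 1., 0., 0.],\n",
"                               [0., 0., 1., 0.],\n",
"                               [0., 0., 0., 1.]]),\n",
"                     index=levels, columns=bottom)\n",
"\n",
"y_bottom = np.array([10., 20., 30., 40.])\n",
"print(S_toy.values @ y_bottom)  # [100.  30.  70.  10.  20.  30.  40.]\n",
"# HINT receives the summing matrix as S_df.values, as in the TourismLarge example below."
]
},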
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from neuralforecast.losses.pytorch import GMM, sCRPS\n",
"from datasetsforecast.hierarchical import HierarchicalData\n",
"\n",
"# Auxiliary sorting\n",
"def sort_df_hier(Y_df, S_df):\n",
" # NeuralForecast core, sorts unique_id lexicographically\n",
" # by default, this class matches S_df and Y_hat_df order. \n",
" Y_df.unique_id = Y_df.unique_id.astype('category')\n",
" Y_df.unique_id = Y_df.unique_id.cat.set_categories(S_df.index)\n",
" Y_df = Y_df.sort_values(by=['unique_id', 'ds'])\n",
" return Y_df\n",
"\n",
"# Load TourismSmall dataset\n",
"horizon = 12\n",
"Y_df, S_df, tags = HierarchicalData.load('./data', 'TourismLarge')\n",
"Y_df['ds'] = pd.to_datetime(Y_df['ds'])\n",
"Y_df = sort_df_hier(Y_df, S_df)\n",
"level = [80,90]\n",
"\n",
"# Instantiate HINT\n",
"# BaseNetwork + Distribution + Reconciliation\n",
"nhits = NHITS(h=horizon,\n",
" input_size=24,\n",
" loss=GMM(n_components=10, level=level),\n",
" max_steps=2000,\n",
" early_stop_patience_steps=10,\n",
" val_check_steps=50,\n",
" scaler_type='robust',\n",
" learning_rate=1e-3,\n",
" valid_loss=sCRPS(level=level))\n",
"\n",
"model = HINT(h=horizon, S=S_df.values,\n",
" model=nhits, reconciliation='BottomUp')\n",
"\n",
"# Fit and Predict\n",
"nf = NeuralForecast(models=[model], freq='MS')\n",
"Y_hat_df = nf.cross_validation(df=Y_df, val_size=12, n_windows=1)\n",
"Y_hat_df = Y_hat_df.reset_index()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| eval: false\n",
"# Plot coherent probabilistic forecast\n",
"unique_id = 'TotalAll'\n",
"Y_plot_df = Y_df[Y_df.unique_id==unique_id]\n",
"plot_df = Y_hat_df[Y_hat_df.unique_id==unique_id]\n",
"plot_df = Y_plot_df.merge(plot_df, on=['ds', 'unique_id'], how='left')\n",
"n_years = 5\n",
"\n",
"plt.plot(plot_df['ds'][-12*n_years:], plot_df['y_x'][-12*n_years:], c='black', label='True')\n",
"plt.plot(plot_df['ds'][-12*n_years:], plot_df['HINT'][-12*n_years:], c='purple', label='mean')\n",
"plt.plot(plot_df['ds'][-12*n_years:], plot_df['HINT-median'][-12*n_years:], c='blue', label='median')\n",
"plt.fill_between(x=plot_df['ds'][-12*n_years:],\n",
" y1=plot_df['HINT-lo-90'][-12*n_years:].values,\n",
" y2=plot_df['HINT-hi-90'][-12*n_years:].values,\n",
" alpha=0.4, label='level 90')\n",
"plt.legend()\n",
"plt.grid()\n",
"plt.plot()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}