Commit e3f7f7b3 authored by chenzk

v1.0
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10-py38
ENV DEBIAN_FRONTEND=noninteractive
# RUN yum update && yum install -y git cmake wget build-essential
RUN source /opt/dtk-23.10/env.sh
# Install pip dependencies
COPY requirements.txt requirements.txt
RUN pip3 install -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com -r requirements.txt
coreforecast>=0.0.6
fsspec
gitpython
hyperopt
jupyterlab
matplotlib
numba
numpy>=1.21.6
optuna
pandas>=1.3.5
pyarrow
# pytorch>=2.0.0
# pytorch-cuda>=11.8
pytorch-lightning>=2.0.0
s3fs
nbdev
black
polars
ray[tune]>=2.2.0
utilsforecast>=0.0.24
datasetsforecast
docker run -it --shm-size=32G -v $PWD/neuralforecast:/home/neuralforecast -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name neuralforecast ffa1f63239fc bash
# python -m torch.utils.collect_env
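# Inside the container, a quick sanity check (a sketch; it assumes the DTK build exposes the DCU devices through torch's CUDA API, as collect_env above also reports) confirms that PyTorch sees the accelerators:
# python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"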
name: neuralforecast
channels:
- pytorch
- conda-forge
dependencies:
- coreforecast>=0.0.6
- cpuonly
- fsspec
- gitpython
- hyperopt
- jupyterlab
- matplotlib
- numba
- numpy>=1.21.6
- optuna
- pandas>=1.3.5
- pyarrow
- pytorch>=2.0.0
- pytorch-lightning>=2.0.0
- pip
- s3fs
- snappy<1.2.0
- pip:
- nbdev
- black
- polars
- ray[tune]>=2.2.0
- utilsforecast>=0.0.25
name: neuralforecast
channels:
- pytorch
- nvidia
- conda-forge
dependencies:
- coreforecast>=0.0.6
- fsspec
- gitpython
- hyperopt
- jupyterlab
- matplotlib
- numba
- numpy>=1.21.6
- optuna
- pandas>=1.3.5
- pyarrow
- pytorch>=2.0.0
- pytorch-cuda>=11.8
- pytorch-lightning>=2.0.0
- pip
- s3fs
- pip:
- nbdev
- black
- polars
- "ray[tune]>=2.2.0"
- utilsforecast>=0.0.24
# Long Horizon Forecasting Experiments with NHITS
In these experiments we use `NHITS` on the [ETTh1, ETTh2, ETTm1, ETTm2](https://github.com/zhouhaoyi/ETDataset) long-horizon benchmark datasets. The table below reports test MSE and MAE for `NHITS` alongside the corresponding TiDE results.
| Dataset | Horizon | NHITS-MSE | NHITS-MAE | TiDE-MSE | TiDE-MAE |
|----------|----------|------------|------------|------------|------------|
| ETTh1 | 96 | 0.378 | 0.393 | 0.375 | 0.398 |
| ETTh1 | 192 | 0.427 | 0.436 | 0.412 | 0.422 |
| ETTh1 | 336 | 0.458 | 0.484 | 0.435 | 0.433 |
| ETTh1 | 720 | 0.561 | 0.501 | 0.454 | 0.465 |
| ETTh2 | 96 | 0.274 | 0.345 | 0.270 | 0.336 |
| ETTh2 | 192 | 0.353 | 0.401 | 0.332 | 0.380 |
| ETTh2 | 336 | 0.382 | 0.425 | 0.360 | 0.407 |
| ETTh2 | 720 | 0.625 | 0.557 | 0.419 | 0.451 |
| ETTm1 | 96 | 0.302 | 0.350 | 0.306 | 0.349 |
| ETTm1 | 192 | 0.347 | 0.383 | 0.335 | 0.366 |
| ETTm1 | 336 | 0.369 | 0.402 | 0.364 | 0.384 |
| ETTm1 | 720 | 0.431 | 0.441 | 0.413 | 0.413 |
| ETTm2 | 96 | 0.176 | 0.255 | 0.161 | 0.251 |
| ETTm2 | 192 | 0.245 | 0.305 | 0.215 | 0.289 |
| ETTm2 | 336 | 0.295 | 0.346 | 0.267 | 0.326 |
| ETTm2 | 720 | 0.401 | 0.413 | 0.352 | 0.383 |
<br>
## Reproducibility
1. Create a conda environment `long_horizon` using the `environment.yml` file.
```shell
conda env create -f environment.yml
```
2. Activate the conda environment:
```shell
conda activate long_horizon
```
Alternatively, installing `neuralforecast` and `datasetsforecast` directly from GitHub with pip may suffice:
```shell
pip install git+https://github.com/Nixtla/datasetsforecast.git
pip install git+https://github.com/Nixtla/neuralforecast.git
```
3. Run the experiments for each dataset and horizon (a loop over all combinations is sketched after the example below), with the
- `--horizon` parameter in `[96, 192, 336, 720]`
- `--dataset` parameter in `['ETTh1', 'ETTh2', 'ETTm1', 'ETTm2']`
<br>
```shell
python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 20
```
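To sweep every dataset and horizon combination, a simple shell loop over the same script works (a sketch; adjust `--num_samples` to your compute budget):
```shell
for dataset in ETTh1 ETTh2 ETTm1 ETTm2; do
  for horizon in 96 192 336 720; do
    python run_nhits.py --dataset "$dataset" --horizon "$horizon" --num_samples 20
  done
done
```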
You can access the final forecasts from the `./data/{dataset}/{horizon}_forecasts.csv` file. Example: `./data/ETTh1/96_forecasts.csv`.
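As a quick check, a saved run can be re-scored with the same numpy losses used by `run_nhits.py` (a sketch; the `y` and `AutoNHITS` columns come from the cross-validation output the script writes):
```python
import pandas as pd
from neuralforecast.losses.numpy import mae, mse

# Re-score one saved run from its forecasts file
df = pd.read_csv('./data/ETTh1/96_forecasts.csv')
print('MSE:', mse(df['y'].values, df['AutoNHITS'].values))
print('MAE:', mae(df['y'].values, df['AutoNHITS'].values))
```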
<br><br>
## References
- [Cristian Challu, Kin G. Olivares, Boris N. Oreshkin, Federico Garza, Max Mergenthaler-Canseco, Artur Dubrawski (2023). "NHITS: Neural Hierarchical Interpolation for Time Series Forecasting". Accepted at the Thirty-Seventh AAAI Conference on Artificial Intelligence.](https://arxiv.org/abs/2201.12886)
name: long_horizon
channels:
- conda-forge
dependencies:
- numpy<1.24
- pip
- pip:
- "git+https://github.com/Nixtla/datasetsforecast.git"
- "git+https://github.com/Nixtla/neuralforecast.git"
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
import argparse
import pandas as pd
from ray import tune
from neuralforecast.auto import AutoNHITS
from neuralforecast.core import NeuralForecast
from neuralforecast.losses.pytorch import MAE, HuberLoss
from neuralforecast.losses.numpy import mae, mse
#from datasetsforecast.long_horizon import LongHorizon, LongHorizonInfo
from datasetsforecast.long_horizon2 import LongHorizon2, LongHorizon2Info
import logging
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)
if __name__ == '__main__':
# Parse execution parameters
verbose = True
parser = argparse.ArgumentParser()
parser.add_argument("-horizon", "--horizon", type=int)
parser.add_argument("-dataset", "--dataset", type=str)
parser.add_argument("-num_samples", "--num_samples", default=5, type=int)
args = parser.parse_args()
horizon = args.horizon
dataset = args.dataset
num_samples = args.num_samples
assert horizon in [96, 192, 336, 720]
# Load dataset
#Y_df, _, _ = LongHorizon.load(directory='./data/', group=dataset)
#Y_df['ds'] = pd.to_datetime(Y_df['ds'])
Y_df = LongHorizon2.load(directory='./data/', group=dataset)
freq = LongHorizon2Info[dataset].freq
n_time = len(Y_df.ds.unique())
#val_size = int(.2 * n_time)
#test_size = int(.2 * n_time)
val_size = LongHorizon2Info[dataset].val_size
test_size = LongHorizon2Info[dataset].test_size
# Adapt input_size to available data
input_size = tune.choice([7 * horizon])
if dataset=='ETTm1' and horizon==720:
input_size = tune.choice([2 * horizon])
nhits_config = {
#"learning_rate": tune.choice([1e-3]), # Initial Learning rate
"learning_rate": tune.loguniform(1e-5, 5e-3),
"max_steps": tune.choice([200, 1000]), # Number of SGD steps
"input_size": input_size, # input_size = multiplier * horizon
"batch_size": tune.choice([7]), # Number of series in windows
"windows_batch_size": tune.choice([256]), # Number of windows in batch
"n_pool_kernel_size": tune.choice([[2, 2, 2], [16, 8, 1]]), # MaxPool's Kernelsize
"n_freq_downsample": tune.choice([[(96*7)//2, 96//2, 1],
[(24*7)//2, 24//2, 1],
[1, 1, 1]]), # Interpolation expressivity ratios
"dropout_prob_theta": tune.choice([0.5]), # Dropout regularization
"activation": tune.choice(['ReLU']), # Type of non-linear activation
"n_blocks": tune.choice([[1, 1, 1]]), # Blocks per each 3 stacks
"mlp_units": tune.choice([[[512, 512], [512, 512], [512, 512]]]), # 2 512-Layers per block for each stack
"interpolation_mode": tune.choice(['linear']), # Type of multi-step interpolation
"val_check_steps": tune.choice([100]), # Compute validation every 100 epochs
"random_seed": tune.randint(1, 10),
}
models = [AutoNHITS(h=horizon,
loss=HuberLoss(delta=0.5),
valid_loss=MAE(),
config=nhits_config,
num_samples=num_samples,
refit_with_val=True)]
nf = NeuralForecast(models=models, freq=freq)
Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
test_size=test_size, n_windows=None)
y_true = Y_hat_df.y.values
y_hat = Y_hat_df['AutoNHITS'].values
n_series = len(Y_df.unique_id.unique())
y_true = y_true.reshape(n_series, -1, horizon)
y_hat = y_hat.reshape(n_series, -1, horizon)
print('\n'*4)
print('Parsed results')
print(f'NHITS {dataset} h={horizon}')
print('test_size', test_size)
print('y_true.shape (n_series, n_windows, n_time_out):\t', y_true.shape)
print('y_hat.shape (n_series, n_windows, n_time_out):\t', y_hat.shape)
print('MSE: ', mse(y_hat, y_true))
print('MAE: ', mae(y_hat, y_true))
# Save Outputs
if not os.path.exists(f'./data/{dataset}'):
os.makedirs(f'./data/{dataset}')
yhat_file = f'./data/{dataset}/{horizon}_forecasts.csv'
Y_hat_df.to_csv(yhat_file, index=False)
import pandas as pd
import numpy as np
from datasetsforecast.long_horizon import LongHorizon
from neuralforecast.core import NeuralForecast
def load_data(name):
if name == "ettm2":
Y_df, X_df, S_df = LongHorizon.load(directory='./ETT-small/', group='ETTm2')
Y_df = Y_df[Y_df['unique_id'] == 'OT']
Y_df['ds'] = pd.to_datetime(Y_df['ds'])
val_size = 11520
test_size = 11520
freq = '15T'
return Y_df, val_size, test_size, freq
# infer
Y_df, val_size, test_size, freq = load_data('ettm2')
nf = NeuralForecast.load(path='./checkpoints/test_run/')
Y_hat_df = nf.predict(Y_df).reset_index()#_predict(df: pd.DataFrame, static_cols, futr_exog_cols, models, freq, id_col, time_col, target_col)
print("Y_hat_df: ", Y_hat_df)
'''
futr_df = pd.read_csv('https://datasets-nixtla.s3.amazonaws.com/EPF_FR_BE_futr.csv')
futr_df['ds'] = pd.to_datetime(futr_df['ds'])
Y_hat_df = nf.predict(futr_df=futr_df)
Y_hat_df.head()
'''
'''
from neuralforecast.utils import AirPassengersDF
Y_df = AirPassengersDF # Defined in neuralforecast.utils
Y_df.head()
'''
from neuralforecast import NeuralForecast
from neuralforecast.models import iTransformer
from neuralforecast.utils import AirPassengersDF
horizon = 8
nf = NeuralForecast(
# models = [iTransformer( h=12, input_size=24, n_series=1, max_steps=100)],
    models = [iTransformer(h=horizon, input_size=2*horizon, n_series=1, max_steps=1000, early_stop_patience_steps=3)],  # n_series=1: AirPassengersDF contains a single series
freq = 'M'
)
nf.fit(df=AirPassengersDF, val_size=20)
print(nf.predict())
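# To reuse the fitted model with the NeuralForecast.load pattern from the inference script
# above, the object can be checkpointed after fitting (a sketch; the path is illustrative):
nf.save(path='./checkpoints/test_run/', overwrite=True)
nf2 = NeuralForecast.load(path='./checkpoints/test_run/')
print(nf2.predict())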
# Model code
modelCode=613
# Model name
modelName=neuralforecast-itransformer_pytorch
# Model description
modelDescription=The iTransformer algorithm from the neuralforecast time-series forecasting library makes efficient use of long-range temporal features.
# Application scenarios
appScenario=inference,training,finance,operations,e-commerce,manufacturing,energy,healthcare
# Framework type
frameType=pytorch
/.quarto/
/lightning_logs/
project:
type: website
format:
html:
theme: cosmo
fontsize: 1em
linestretch: 1.7
css: styles.css
toc: true
website:
twitter-card:
image: "https://farm6.staticflickr.com/5510/14338202952_93595258ff_z.jpg"
site: "@Nixtlainc"
open-graph:
image: "https://github.com/Nixtla/styles/blob/2abf51612584169874c90cd7c4d347e3917eaf73/images/Banner%20Github.png"
google-analytics: "G-NXJNCVR18L"
repo-actions: [issue]
favicon: favicon_png.png
navbar:
background: primary
search: true
collapse-below: lg
left:
- text: "Get Started"
href: examples/Getting_Started.ipynb
- text: "NixtlaVerse"
menu:
- text: "MLForecast 🤖"
href: https://github.com/nixtla/mlforecast
- text: "StatsForecast ⚡️"
href: https://github.com/nixtla/statsforecast
- text: "HierarchicalForecast 👑"
href: "https://github.com/nixtla/hierarchicalforecast"
- text: "Help"
menu:
- text: "Report an Issue"
icon: bug
href: https://github.com/nixtla/neuralforecast/issues/new/choose
- text: "Join our Slack"
icon: chat-right-text
href: https://join.slack.com/t/nixtlaworkspace/shared_invite/zt-135dssye9-fWTzMpv2WBthq8NK0Yvu6A
right:
- icon: github
href: "https://github.com/nixtla/neuralforecast"
- icon: twitter
href: https://twitter.com/nixtlainc
aria-label: Nixtla Twitter
sidebar:
style: floating
body-footer: |
Give us a ⭐ on [Github](https://github.com/nixtla/neuralforecast)
metadata-files: [nbdev.yml, sidebar.yml]
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "524620c1",
"metadata": {},
"outputs": [],
"source": [
"#| default_exp common._base_auto"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15392f6f",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "12fa25a4",
"metadata": {},
"source": [
"# Hyperparameter Optimization\n",
"\n",
"> Machine Learning forecasting methods are defined by many hyperparameters that control their behavior, with effects ranging from their speed and memory requirements to their predictive performance. For a long time, manual hyperparameter tuning prevailed. This approach is time-consuming, **automated hyperparameter optimization** methods have been introduced, proving more efficient than manual tuning, grid search, and random search.<br><br> The `BaseAuto` class offers shared API connections to hyperparameter optimization algorithms like [Optuna](https://docs.ray.io/en/latest/tune/examples/bayesopt_example.html), [HyperOpt](https://docs.ray.io/en/latest/tune/examples/hyperopt_example.html), [Dragonfly](https://docs.ray.io/en/latest/tune/examples/dragonfly_example.html) among others through `ray`, which gives you access to grid search, bayesian optimization and other state-of-the-art tools like hyperband.<br><br>Comprehending the impacts of hyperparameters is still a precious skill, as it can help guide the design of informed hyperparameter spaces that are faster to explore automatically."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e37fd67c",
"metadata": {},
"source": [
"![Figure 1. Example of dataset split (left), validation (yellow) and test (orange). The hyperparameter optimization guiding signal is obtained from the validation set.](imgs_models/data_splits.png)"
]
},
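{
"attachments": {},
"cell_type": "markdown",
"id": "searchalg-sketch-md",
"metadata": {},
"source": [
"As a concrete illustration of the shared `ray` interface described above (a sketch, not part of the exported module; it assumes the optional `hyperopt` dependency listed in the project environment is installed), a `HyperOptSearch` instance can be passed as `search_alg` to any `Auto*` model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "searchalg-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: plugging a HyperOpt-based search algorithm into the ray backend\n",
"from ray import tune\n",
"from ray.tune.search.hyperopt import HyperOptSearch\n",
"from neuralforecast.auto import AutoMLP\n",
"\n",
"hyperopt_config = {\n",
" \"input_size\": tune.choice([12, 24]),\n",
" \"learning_rate\": tune.loguniform(1e-4, 1e-2),\n",
" \"max_steps\": 10,\n",
" \"val_check_steps\": 5,\n",
"}\n",
"auto_mlp = AutoMLP(h=12, config=hyperopt_config, search_alg=HyperOptSearch(),\n",
" num_samples=2, cpus=1, gpus=0)"
]
},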
{
"cell_type": "code",
"execution_count": null,
"id": "2508f7a9-1433-4ad8-8f2f-0078c6ed6c3c",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44065066-e72a-431f-938f-1528adef9fe8",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"from copy import deepcopy\n",
"from os import cpu_count\n",
"\n",
"import torch\n",
"import pytorch_lightning as pl\n",
"\n",
"from ray import air, tune\n",
"from ray.tune.integration.pytorch_lightning import TuneReportCallback\n",
"from ray.tune.search.basic_variant import BasicVariantGenerator"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45cecbda-68c8-4426-a186-9a2a94dcc54e",
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"class MockTrial:\n",
" def suggest_int(*args, **kwargs):\n",
" return 'int'\n",
" def suggest_categorical(*args, **kwargs):\n",
" return 'categorical'\n",
" def suggest_uniform(*args, **kwargs):\n",
" return 'uniform'\n",
" def suggest_loguniform(*args, **kwargs):\n",
" return 'loguniform'\n",
" def suggest_float(*args, **kwargs):\n",
" if 'log' in kwargs:\n",
" return 'quantized_log'\n",
" elif 'step' in kwargs:\n",
" return 'quantized_loguniform'\n",
" return 'float'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c253583-8239-4abe-8a04-0c0ba635d8a5",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class BaseAuto(pl.LightningModule):\n",
" \"\"\"\n",
" Class for Automatic Hyperparameter Optimization, it builds on top of `ray` to \n",
" give access to a wide variety of hyperparameter optimization tools ranging \n",
" from classic grid search, to Bayesian optimization and HyperBand algorithm.\n",
"\n",
" The validation loss to be optimized is defined by the `config['loss']` dictionary\n",
" value, the config also contains the rest of the hyperparameter search space.\n",
"\n",
" It is important to note that the success of this hyperparameter optimization\n",
" heavily relies on a strong correlation between the validation and test periods.\n",
"\n",
" Parameters\n",
" ----------\n",
" cls_model : PyTorch/PyTorchLightning model\n",
" See `neuralforecast.models` [collection here](https://nixtla.github.io/neuralforecast/models.html).\n",
" h : int\n",
" Forecast horizon\n",
" loss : PyTorch module\n",
" Instantiated train loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).\n",
" valid_loss : PyTorch module\n",
" Instantiated valid loss class from [losses collection](https://nixtla.github.io/neuralforecast/losses.pytorch.html).\n",
" config : dict or callable\n",
" Dictionary with ray.tune defined search space or function that takes an optuna trial and returns a configuration dict.\n",
" search_alg : ray.tune.search variant or optuna.sampler\n",
" For ray see https://docs.ray.io/en/latest/tune/api_docs/suggestion.html\n",
" For optuna see https://optuna.readthedocs.io/en/stable/reference/samplers/index.html.\n",
" num_samples : int\n",
" Number of hyperparameter optimization steps/samples.\n",
" cpus : int (default=os.cpu_count())\n",
" Number of cpus to use during optimization. Only used with ray tune.\n",
" gpus : int (default=torch.cuda.device_count())\n",
" Number of gpus to use during optimization, default all available. Only used with ray tune.\n",
" refit_with_val : bool\n",
" Refit of best model should preserve val_size.\n",
" verbose : bool\n",
" Track progress.\n",
" alias : str, optional (default=None)\n",
" Custom name of the model.\n",
" backend : str (default='ray')\n",
" Backend to use for searching the hyperparameter space, can be either 'ray' or 'optuna'.\n",
" callbacks : list of callable, optional (default=None)\n",
" List of functions to call during the optimization process.\n",
" ray reference: https://docs.ray.io/en/latest/tune/tutorials/tune-metrics.html\n",
" optuna reference: https://optuna.readthedocs.io/en/stable/tutorial/20_recipes/007_optuna_callback.html\n",
" \"\"\"\n",
" def __init__(self, \n",
" cls_model,\n",
" h,\n",
" loss,\n",
" valid_loss,\n",
" config, \n",
" search_alg=BasicVariantGenerator(random_state=1),\n",
" num_samples=10,\n",
" cpus=cpu_count(),\n",
" gpus=torch.cuda.device_count(),\n",
" refit_with_val=False,\n",
" verbose=False,\n",
" alias=None,\n",
" backend='ray',\n",
" callbacks=None,\n",
" ):\n",
" super(BaseAuto, self).__init__()\n",
" self.save_hyperparameters() # Allows instantiation from a checkpoint from class\n",
"\n",
" if backend == 'ray':\n",
" if not isinstance(config, dict):\n",
" raise ValueError(\n",
" \"You have to provide a dict as `config` when using `backend='ray'`\"\n",
" )\n",
" config_base = deepcopy(config)\n",
" elif backend == 'optuna':\n",
" if not callable(config):\n",
" raise ValueError(\n",
" \"You have to provide a function that takes a trial and returns a dict as `config` when using `backend='optuna'`\"\n",
" )\n",
" # extract constant values from the config fn for validations\n",
" config_base = config(MockTrial())\n",
" else:\n",
" raise ValueError(f\"Unknown backend {backend}. The supported backends are 'ray' and 'optuna'.\")\n",
" if config_base.get('h', None) is not None:\n",
" raise Exception(\"Please use `h` init argument instead of `config['h']`.\")\n",
" if config_base.get('loss', None) is not None:\n",
" raise Exception(\"Please use `loss` init argument instead of `config['loss']`.\")\n",
" if config_base.get('valid_loss', None) is not None:\n",
" raise Exception(\"Please use `valid_loss` init argument instead of `config['valid_loss']`.\")\n",
" # This attribute helps to protect \n",
" # model and datasets interactions protections\n",
" if 'early_stop_patience_steps' in config_base.keys():\n",
" self.early_stop_patience_steps = 1\n",
" else:\n",
" self.early_stop_patience_steps = -1\n",
"\n",
" if callable(config):\n",
" # reset config_base here to save params to override in the config fn\n",
" config_base = {}\n",
"\n",
" # Add losses to config and protect valid_loss default\n",
" config_base['h'] = h\n",
" config_base['loss'] = loss\n",
" if valid_loss is None:\n",
" valid_loss = loss\n",
" config_base['valid_loss'] = valid_loss\n",
"\n",
" if isinstance(config, dict):\n",
" self.config = config_base \n",
" else:\n",
" def config_f(trial):\n",
" return {**config(trial), **config_base}\n",
" self.config = config_f \n",
" \n",
" self.h = h\n",
" self.cls_model = cls_model\n",
" self.loss = loss\n",
" self.valid_loss = valid_loss\n",
"\n",
" self.num_samples = num_samples\n",
" self.search_alg = search_alg\n",
" self.cpus = cpus\n",
" self.gpus = gpus\n",
" self.refit_with_val = refit_with_val\n",
" self.verbose = verbose\n",
" self.alias = alias\n",
" self.backend = backend\n",
" self.callbacks = callbacks\n",
"\n",
" # Base Class attributes\n",
" self.SAMPLING_TYPE = cls_model.SAMPLING_TYPE\n",
"\n",
" def __repr__(self):\n",
" return type(self).__name__ if self.alias is None else self.alias\n",
" \n",
" def _train_tune(self, config_step, cls_model, dataset, val_size, test_size):\n",
" \"\"\" BaseAuto._train_tune\n",
"\n",
" Internal function that instantiates a NF class model, then automatically\n",
" explores the validation loss (ptl/val_loss) on which the hyperparameter \n",
" exploration is based.\n",
"\n",
" **Parameters:**<br>\n",
" `config_step`: Dict, initialization parameters of a NF model.<br>\n",
" `cls_model`: NeuralForecast model class, yet to be instantiated.<br>\n",
" `dataset`: NeuralForecast dataset, to fit the model.<br>\n",
" `val_size`: int, validation size for temporal cross-validation.<br>\n",
" `test_size`: int, test size for temporal cross-validation.<br>\n",
" \"\"\"\n",
" metrics = {\"loss\": \"ptl/val_loss\", \"train_loss\": \"train_loss\"}\n",
" callbacks = [TuneReportCallback(metrics, on=\"validation_end\")]\n",
" if 'callbacks' in config_step.keys():\n",
" callbacks.extend(config_step['callbacks'])\n",
" config_step = {**config_step, **{'callbacks': callbacks}}\n",
"\n",
" # Protect dtypes from tune samplers\n",
" if 'batch_size' in config_step.keys():\n",
" config_step['batch_size'] = int(config_step['batch_size'])\n",
" if 'windows_batch_size' in config_step.keys():\n",
" config_step['windows_batch_size'] = int(config_step['windows_batch_size'])\n",
"\n",
" # Tune session receives validation signal\n",
" # from the specialized PL TuneReportCallback\n",
" _ = self._fit_model(cls_model=cls_model,\n",
" config=config_step,\n",
" dataset=dataset,\n",
" val_size=val_size,\n",
" test_size=test_size)\n",
"\n",
" def _tune_model(self, cls_model, dataset, val_size, test_size,\n",
" cpus, gpus, verbose, num_samples, search_alg, config):\n",
" train_fn_with_parameters = tune.with_parameters(\n",
" self._train_tune,\n",
" cls_model=cls_model,\n",
" dataset=dataset,\n",
" val_size=val_size,\n",
" test_size=test_size,\n",
" )\n",
"\n",
" # Device\n",
" if gpus > 0:\n",
" device_dict = {'gpu':gpus}\n",
" else:\n",
" device_dict = {'cpu':cpus}\n",
"\n",
" # on Windows, prevent long trial directory names\n",
" import platform\n",
" trial_dirname_creator=(lambda trial: f\"{trial.trainable_name}_{trial.trial_id}\") if platform.system() == 'Windows' else None\n",
"\n",
" tuner = tune.Tuner(\n",
" tune.with_resources(train_fn_with_parameters, device_dict),\n",
" run_config=air.RunConfig(callbacks=self.callbacks, verbose=verbose),\n",
" tune_config=tune.TuneConfig(\n",
" metric=\"loss\",\n",
" mode=\"min\",\n",
" num_samples=num_samples, \n",
" search_alg=search_alg,\n",
" trial_dirname_creator=trial_dirname_creator,\n",
" ),\n",
" param_space=config,\n",
" )\n",
" results = tuner.fit()\n",
" return results\n",
"\n",
" @staticmethod\n",
" def _ray_config_to_optuna(ray_config):\n",
" def optuna_config(trial):\n",
" out = {}\n",
" for k, v in ray_config.items():\n",
" if hasattr(v, 'sampler'):\n",
" sampler = v.sampler\n",
" if isinstance(sampler, tune.search.sample.Integer.default_sampler_cls):\n",
" v = trial.suggest_int(k, v.lower, v.upper)\n",
" elif isinstance(sampler, tune.search.sample.Categorical.default_sampler_cls):\n",
" v = trial.suggest_categorical(k, v.categories) \n",
" elif isinstance(sampler, tune.search.sample.Uniform):\n",
" v = trial.suggest_uniform(k, v.lower, v.upper)\n",
" elif isinstance(sampler, tune.search.sample.LogUniform):\n",
" v = trial.suggest_loguniform(k, v.lower, v.upper)\n",
" elif isinstance(sampler, tune.search.sample.Quantized):\n",
" if isinstance(sampler.get_sampler(), tune.search.sample.Float._LogUniform):\n",
" v = trial.suggest_float(k, v.lower, v.upper, log=True)\n",
" elif isinstance(sampler.get_sampler(), tune.search.sample.Float._Uniform):\n",
" v = trial.suggest_float(k, v.lower, v.upper, step=sampler.q)\n",
" else:\n",
" raise ValueError(f\"Couldn't translate {type(v)} to optuna.\")\n",
" out[k] = v\n",
" return out\n",
" return optuna_config\n",
"\n",
" def _optuna_tune_model(\n",
" self,\n",
" cls_model,\n",
" dataset,\n",
" val_size,\n",
" test_size,\n",
" verbose,\n",
" num_samples,\n",
" search_alg,\n",
" config,\n",
" distributed_config,\n",
" ):\n",
" import optuna\n",
"\n",
" def objective(trial):\n",
" user_cfg = config(trial)\n",
" cfg = deepcopy(user_cfg)\n",
" model = self._fit_model(\n",
" cls_model=cls_model,\n",
" config=cfg,\n",
" dataset=dataset,\n",
" val_size=val_size,\n",
" test_size=test_size,\n",
" distributed_config=distributed_config,\n",
" )\n",
" trial.set_user_attr('ALL_PARAMS', user_cfg)\n",
" metrics = model.metrics\n",
" trial.set_user_attr('METRICS', {\n",
" \"loss\": metrics[\"ptl/val_loss\"],\n",
" \"train_loss\": metrics[\"train_loss\"],\n",
" })\n",
" return trial.user_attrs['METRICS']['loss']\n",
"\n",
" if isinstance(search_alg, optuna.samplers.BaseSampler):\n",
" sampler = search_alg\n",
" else:\n",
" sampler = None\n",
"\n",
" study = optuna.create_study(sampler=sampler, direction='minimize')\n",
" study.optimize(\n",
" objective,\n",
" n_trials=num_samples,\n",
" show_progress_bar=verbose,\n",
" callbacks=self.callbacks,\n",
" )\n",
" return study\n",
"\n",
" def _fit_model(self, cls_model, config,\n",
" dataset, val_size, test_size, distributed_config=None):\n",
" model = cls_model(**config)\n",
" model = model.fit(\n",
" dataset,\n",
" val_size=val_size, \n",
" test_size=test_size,\n",
" distributed_config=distributed_config,\n",
" )\n",
" return model\n",
"\n",
" def fit(self, dataset, val_size=0, test_size=0, random_seed=None, distributed_config=None):\n",
" \"\"\" BaseAuto.fit\n",
"\n",
" Perform the hyperparameter optimization as specified by the BaseAuto configuration \n",
" dictionary `config`.\n",
"\n",
" The optimization is performed on the `TimeSeriesDataset` using temporal cross validation with \n",
" the validation set that sequentially precedes the test set.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset` see details [here](https://nixtla.github.io/neuralforecast/tsdataset.html)<br>\n",
" `val_size`: int, size of temporal validation set (needs to be bigger than 0).<br>\n",
" `test_size`: int, size of temporal test set (default 0).<br>\n",
" `random_seed`: int=None, random_seed for hyperparameter exploration algorithms, not yet implemented.<br>\n",
" **Returns:**<br>\n",
" `self`: fitted instance of `BaseAuto` with best hyperparameters and results<br>.\n",
" \"\"\"\n",
" #we need val_size > 0 to perform\n",
" #hyperparameter selection.\n",
" search_alg = deepcopy(self.search_alg)\n",
" val_size = val_size if val_size > 0 else self.h\n",
" if self.backend == 'ray':\n",
" if distributed_config is not None:\n",
" raise ValueError('distributed training is not supported for the ray backend.')\n",
" results = self._tune_model(\n",
" cls_model=self.cls_model,\n",
" dataset=dataset,\n",
" val_size=val_size,\n",
" test_size=test_size, \n",
" cpus=self.cpus,\n",
" gpus=self.gpus,\n",
" verbose=self.verbose,\n",
" num_samples=self.num_samples, \n",
" search_alg=search_alg, \n",
" config=self.config,\n",
" ) \n",
" best_config = results.get_best_result().config \n",
" else:\n",
" results = self._optuna_tune_model(\n",
" cls_model=self.cls_model,\n",
" dataset=dataset,\n",
" val_size=val_size, \n",
" test_size=test_size, \n",
" verbose=self.verbose,\n",
" num_samples=self.num_samples, \n",
" search_alg=search_alg, \n",
" config=self.config,\n",
" distributed_config=distributed_config,\n",
" )\n",
" best_config = results.best_trial.user_attrs['ALL_PARAMS']\n",
" self.model = self._fit_model(\n",
" cls_model=self.cls_model,\n",
" config=best_config,\n",
" dataset=dataset,\n",
" val_size=val_size * (1 - self.refit_with_val),\n",
" test_size=test_size,\n",
" distributed_config=distributed_config,\n",
" )\n",
" self.results = results\n",
"\n",
" # Added attributes for compatibility with NeuralForecast core\n",
" self.futr_exog_list = self.model.futr_exog_list\n",
" self.hist_exog_list = self.model.hist_exog_list\n",
" self.stat_exog_list = self.model.stat_exog_list\n",
" return self\n",
"\n",
" def predict(self, dataset, step_size=1, **data_kwargs):\n",
" \"\"\" BaseAuto.predict\n",
"\n",
" Predictions of the best performing model on validation.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset` see details [here](https://nixtla.github.io/neuralforecast/tsdataset.html)<br>\n",
" `step_size`: int, steps between sequential predictions, (default 1).<br>\n",
" `**data_kwarg`: additional parameters for the dataset module.<br>\n",
" `random_seed`: int=None, random_seed for hyperparameter exploration algorithms (not implemented).<br>\n",
" **Returns:**<br>\n",
" `y_hat`: numpy predictions of the `NeuralForecast` model.<br>\n",
" \"\"\"\n",
" return self.model.predict(dataset=dataset, \n",
" step_size=step_size, **data_kwargs)\n",
"\n",
" def set_test_size(self, test_size):\n",
" self.model.set_test_size(test_size)\n",
"\n",
" def get_test_size(self):\n",
" return self.model.test_size\n",
" \n",
" def save(self, path):\n",
" \"\"\" BaseAuto.save\n",
"\n",
" Save the fitted model to disk.\n",
"\n",
" **Parameters:**<br>\n",
" `path`: str, path to save the model.<br>\n",
" \"\"\"\n",
" self.model.save(path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2376ed06",
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseAuto, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "623ebb06",
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseAuto.fit, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69d3c1ae",
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseAuto.predict, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbfd4e8f-2565-4f85-b615-7329a1ae3f43",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"import logging\n",
"import warnings\n",
"\n",
"import pytorch_lightning as pl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "421db156-4ee6-420f-ac9e-f0ddc9781841",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"logging.getLogger(\"pytorch_lightning\").setLevel(logging.ERROR)\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1e776fb-fa7e-49c6-afd2-b30891c83a73",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"import optuna\n",
"import pandas as pd\n",
"from neuralforecast.models.mlp import MLP\n",
"from neuralforecast.utils import AirPassengersDF as Y_df\n",
"from neuralforecast.tsdataset import TimeSeriesDataset\n",
"from neuralforecast.losses.numpy import mae\n",
"from neuralforecast.losses.pytorch import MAE, MSE"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c26739d-c405-4700-a833-79c3a0fec497",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"Y_train_df = Y_df[Y_df.ds<='1959-12-31'] # 132 train\n",
"Y_test_df = Y_df[Y_df.ds>'1959-12-31'] # 12 test\n",
"\n",
"dataset, *_ = TimeSeriesDataset.from_df(Y_train_df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88148bbe-b4c1-41c3-8ce1-4f7695161d99",
"metadata": {},
"outputs": [],
"source": [
"class RayLogLossesCallback(tune.Callback):\n",
" def on_trial_complete(self, iteration, trials, trial, **info):\n",
" result = trial.last_result\n",
" print(40 * '-' + 'Trial finished' + 40 * '-')\n",
" print(f'Train loss: {result[\"train_loss\"]:.2f}. Valid loss: {result[\"loss\"]:.2f}')\n",
" print(80 * '-')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae8912d7-9128-42ab-a581-5f63b6ea34eb",
"metadata": {},
"outputs": [],
"source": [
"config = {\n",
" \"hidden_size\": tune.choice([512]),\n",
" \"num_layers\": tune.choice([3, 4]),\n",
" \"input_size\": 12,\n",
" \"max_steps\": 10,\n",
" \"val_check_steps\": 5\n",
"}\n",
"auto = BaseAuto(h=12, loss=MAE(), valid_loss=MSE(), cls_model=MLP, config=config, num_samples=2, cpus=1, gpus=0, callbacks=[RayLogLossesCallback()])\n",
"auto.fit(dataset=dataset)\n",
"y_hat = auto.predict(dataset=dataset)\n",
"assert mae(Y_test_df['y'].values, y_hat[:, 0]) < 200"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63d46d13-f0d0-4bc0-aba2-bd094a9a78c4",
"metadata": {},
"outputs": [],
"source": [
"def config_f(trial):\n",
" return {\n",
" \"hidden_size\": trial.suggest_categorical('hidden_size', [512]),\n",
" \"num_layers\": trial.suggest_categorical('num_layers', [3, 4]),\n",
" \"input_size\": 12,\n",
" \"max_steps\": 10,\n",
" \"val_check_steps\": 5\n",
" }\n",
"\n",
"class OptunaLogLossesCallback:\n",
" def __call__(self, study, trial):\n",
" metrics = trial.user_attrs['METRICS']\n",
" print(40 * '-' + 'Trial finished' + 40 * '-')\n",
" print(f'Train loss: {metrics[\"train_loss\"]:.2f}. Valid loss: {metrics[\"loss\"]:.2f}')\n",
" print(80 * '-')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d979d9df-3a8d-4aab-aaa9-5b66067aef26",
"metadata": {},
"outputs": [],
"source": [
"auto2 = BaseAuto(h=12, loss=MAE(), valid_loss=MSE(), cls_model=MLP, config=config_f, search_alg=optuna.samplers.RandomSampler(), num_samples=2, backend='optuna', callbacks=[OptunaLogLossesCallback()])\n",
"auto2.fit(dataset=dataset)\n",
"assert isinstance(auto2.results, optuna.Study)\n",
"y_hat2 = auto2.predict(dataset=dataset)\n",
"assert mae(Y_test_df['y'].values, y_hat2[:, 0]) < 200"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66ad2eec-dd93-4bc4-ae19-5df4199577be",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"Y_test_df['AutoMLP'] = y_hat\n",
"\n",
"pd.concat([Y_train_df, Y_test_df]).drop('unique_id', axis=1).set_index('ds').plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "463d4dc0-b25a-4ce6-9172-5690dc979f0b",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Unit tests to guarantee that losses are correctly instantiated\n",
"import pandas as pd\n",
"from neuralforecast.models.mlp import MLP\n",
"from neuralforecast.utils import AirPassengersDF as Y_df\n",
"from neuralforecast.tsdataset import TimeSeriesDataset\n",
"from neuralforecast.losses.pytorch import MAE, MSE"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "882c8331-440a-4758-a56c-07a78c0b1603",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Unit tests to guarantee that losses are correctly instantiated\n",
"Y_train_df = Y_df[Y_df.ds<='1959-12-31'] # 132 train\n",
"Y_test_df = Y_df[Y_df.ds>'1959-12-31'] # 12 test\n",
"\n",
"dataset, *_ = TimeSeriesDataset.from_df(Y_train_df)\n",
"config = {\n",
" \"hidden_size\": tune.choice([512]),\n",
" \"num_layers\": tune.choice([3, 4]),\n",
" \"input_size\": 12,\n",
" \"max_steps\": 1,\n",
" \"val_check_steps\": 1\n",
"}\n",
"\n",
"# Test instantiation\n",
"auto = BaseAuto(h=12, loss=MAE(), valid_loss=MSE(), \n",
" cls_model=MLP, config=config, num_samples=2, cpus=1, gpus=0)\n",
"test_eq(str(type(auto.loss)), \"<class 'neuralforecast.losses.pytorch.MAE'>\")\n",
"test_eq(str(type(auto.valid_loss)), \"<class 'neuralforecast.losses.pytorch.MSE'>\")\n",
"\n",
"# Test validation default\n",
"auto = BaseAuto(h=12, loss=MSE(), valid_loss=None,\n",
" cls_model=MLP, config=config, num_samples=2, cpus=1, gpus=0)\n",
"test_eq(str(type(auto.loss)), \"<class 'neuralforecast.losses.pytorch.MSE'>\")\n",
"test_eq(str(type(auto.valid_loss)), \"<class 'neuralforecast.losses.pytorch.MSE'>\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3c8e2d46",
"metadata": {},
"source": [
"### References\n",
"- [James Bergstra, Remi Bardenet, Yoshua Bengio, and Balazs Kegl (2011). \"Algorithms for Hyper-Parameter Optimization\". In: Advances in Neural Information Processing Systems. url: https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf](https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf)\n",
"- [Kirthevasan Kandasamy, Karun Raju Vysyaraju, Willie Neiswanger, Biswajit Paria, Christopher R. Collins, Jeff Schneider, Barnabas Poczos, Eric P. Xing (2019). \"Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly\". Journal of Machine Learning Research. url: https://arxiv.org/abs/1903.06694](https://arxiv.org/abs/1903.06694)\n",
"- [Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar (2016). \"Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization\". Journal of Machine Learning Research. url: https://arxiv.org/abs/1603.06560](https://arxiv.org/abs/1603.06560)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "267cbf1e",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "8e5c6594-e5e8-4966-8cb8-a3e6a9ed7d89",
"metadata": {},
"outputs": [],
"source": [
"#| default_exp common._base_model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fce0c950-2e03-4be1-95d4-a02409d8dba3",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c7c2ba5-19ee-421e-9252-7224b03f5201",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import inspect\n",
"import random\n",
"import warnings\n",
"from contextlib import contextmanager\n",
"from copy import deepcopy\n",
"from dataclasses import dataclass\n",
"\n",
"import fsspec\n",
"import numpy as np\n",
"import torch\n",
"import torch.nn as nn\n",
"import pytorch_lightning as pl\n",
"from pytorch_lightning.callbacks.early_stopping import EarlyStopping\n",
"\n",
"from neuralforecast.tsdataset import (\n",
" TimeSeriesDataModule,\n",
" TimeSeriesDataset,\n",
" _DistributedTimeSeriesDataModule,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6d4c4fd",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"@dataclass\n",
"class DistributedConfig:\n",
" partitions_path: str\n",
" num_nodes: int\n",
" devices: int"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5197e340-11f1-4c8c-96d1-ed396ac2b710",
"metadata": {},
"outputs": [],
"source": [
"#| exporti\n",
"@contextmanager\n",
"def _disable_torch_init():\n",
" \"\"\"Context manager used to disable pytorch's weight initialization.\n",
"\n",
" This is especially useful when loading saved models, since when initializing\n",
" a model the weights are also initialized following some method\n",
" (e.g. kaiming uniform), and that time is wasted since we'll override them with\n",
" the saved weights.\"\"\"\n",
" def noop(*args, **kwargs):\n",
" return\n",
" \n",
" kaiming_uniform = nn.init.kaiming_uniform_\n",
" kaiming_normal = nn.init.kaiming_normal_\n",
" xavier_uniform = nn.init.xavier_uniform_\n",
" xavier_normal = nn.init.xavier_normal_\n",
" \n",
" nn.init.kaiming_uniform_ = noop\n",
" nn.init.kaiming_normal_ = noop\n",
" nn.init.xavier_uniform_ = noop\n",
" nn.init.xavier_normal_ = noop\n",
" try:\n",
" yield\n",
" finally:\n",
" nn.init.kaiming_uniform_ = kaiming_uniform\n",
" nn.init.kaiming_normal_ = kaiming_normal\n",
" nn.init.xavier_uniform_ = xavier_uniform\n",
" nn.init.xavier_normal_ = xavier_normal"
]
},
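{
"cell_type": "markdown",
"id": "disable-torch-init-demo-md",
"metadata": {},
"source": [
"A minimal illustration of the context manager (a sketch, not part of the exported module): inside the block the patched Kaiming/Xavier initializers are no-ops, so constructing a layer skips their usual weight initialization; once the block exits the original initializers are restored."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "disable-torch-init-demo-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: nn.Linear normally runs kaiming_uniform_ on its weight at construction;\n",
"# inside _disable_torch_init that call is a no-op, so the weight keeps the values\n",
"# torch.empty allocated for it.\n",
"with _disable_torch_init():\n",
" uninitialized = nn.Linear(2, 2)\n",
"regular = nn.Linear(2, 2) # outside the block the usual initialization applies again"
]
},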
{
"cell_type": "code",
"execution_count": null,
"id": "60c40a64-8381-46a2-8cbb-70ec70ed7914",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class BaseModel(pl.LightningModule):\n",
" def __init__(\n",
" self,\n",
" random_seed,\n",
" loss,\n",
" valid_loss,\n",
" optimizer,\n",
" optimizer_kwargs,\n",
" futr_exog_list,\n",
" hist_exog_list,\n",
" stat_exog_list,\n",
" max_steps,\n",
" early_stop_patience_steps,\n",
" **trainer_kwargs,\n",
" ):\n",
" super().__init__()\n",
" with warnings.catch_warnings(record=False):\n",
" warnings.filterwarnings('ignore')\n",
" # the following line issues a warning about the loss attribute being saved\n",
" # but we do want to save it\n",
" self.save_hyperparameters() # Allows instantiation from a checkpoint from class\n",
" self.random_seed = random_seed\n",
" pl.seed_everything(self.random_seed, workers=True)\n",
"\n",
" # Loss\n",
" self.loss = loss\n",
" if valid_loss is None:\n",
" self.valid_loss = loss\n",
" else:\n",
" self.valid_loss = valid_loss\n",
" self.train_trajectories = []\n",
" self.valid_trajectories = []\n",
"\n",
" # Optimization\n",
" if optimizer is not None and not issubclass(optimizer, torch.optim.Optimizer):\n",
" raise TypeError(\"optimizer is not a valid subclass of torch.optim.Optimizer\")\n",
" self.optimizer = optimizer\n",
" self.optimizer_kwargs = optimizer_kwargs if optimizer_kwargs else {}\n",
"\n",
" # Variables\n",
" self.futr_exog_list = list(futr_exog_list) if futr_exog_list is not None else []\n",
" self.hist_exog_list = list(hist_exog_list) if hist_exog_list is not None else []\n",
" self.stat_exog_list = list(stat_exog_list) if stat_exog_list is not None else []\n",
"\n",
" ## Trainer arguments ##\n",
" # Max steps, validation steps and check_val_every_n_epoch\n",
" trainer_kwargs = {**trainer_kwargs, 'max_steps': max_steps}\n",
"\n",
" if 'max_epochs' in trainer_kwargs.keys():\n",
" raise Exception('max_epochs is deprecated, use max_steps instead.')\n",
"\n",
" # Callbacks\n",
" if early_stop_patience_steps > 0:\n",
" if 'callbacks' not in trainer_kwargs:\n",
" trainer_kwargs['callbacks'] = []\n",
" trainer_kwargs['callbacks'].append(\n",
" EarlyStopping(\n",
" monitor='ptl/val_loss', patience=early_stop_patience_steps\n",
" )\n",
" )\n",
"\n",
" # Add GPU accelerator if available\n",
" if trainer_kwargs.get('accelerator', None) is None:\n",
" if torch.cuda.is_available():\n",
" trainer_kwargs['accelerator'] = \"gpu\"\n",
" if trainer_kwargs.get('devices', None) is None:\n",
" if torch.cuda.is_available():\n",
" trainer_kwargs['devices'] = -1\n",
"\n",
" # Avoid saturating local memory, disabled fit model checkpoints\n",
" if trainer_kwargs.get('enable_checkpointing', None) is None:\n",
" trainer_kwargs['enable_checkpointing'] = False\n",
"\n",
" self.trainer_kwargs = trainer_kwargs\n",
"\n",
" def __repr__(self):\n",
" return type(self).__name__ if self.alias is None else self.alias\n",
"\n",
" def _check_exog(self, dataset):\n",
" temporal_cols = set(dataset.temporal_cols.tolist())\n",
" static_cols = set(dataset.static_cols.tolist() if dataset.static_cols is not None else [])\n",
"\n",
" missing_hist = set(self.hist_exog_list) - temporal_cols\n",
" missing_futr = set(self.futr_exog_list) - temporal_cols\n",
" missing_stat = set(self.stat_exog_list) - static_cols\n",
" if missing_hist:\n",
" raise Exception(f'{missing_hist} historical exogenous variables not found in input dataset')\n",
" if missing_futr:\n",
" raise Exception(f'{missing_futr} future exogenous variables not found in input dataset')\n",
" if missing_stat:\n",
" raise Exception(f'{missing_stat} static exogenous variables not found in input dataset')\n",
"\n",
" def _restart_seed(self, random_seed):\n",
" if random_seed is None:\n",
" random_seed = self.random_seed\n",
" torch.manual_seed(random_seed)\n",
"\n",
" def _get_temporal_exogenous_cols(self, temporal_cols):\n",
" return list(\n",
" set(temporal_cols.tolist()) & set(self.hist_exog_list + self.futr_exog_list)\n",
" )\n",
"\n",
" def _fit(\n",
" self,\n",
" dataset,\n",
" batch_size,\n",
" valid_batch_size=1024,\n",
" val_size=0,\n",
" test_size=0,\n",
" random_seed=None,\n",
" shuffle_train=True,\n",
" distributed_config=None,\n",
" ):\n",
" self._check_exog(dataset)\n",
" self._restart_seed(random_seed)\n",
"\n",
" self.val_size = val_size\n",
" self.test_size = test_size\n",
" is_local = isinstance(dataset, TimeSeriesDataset)\n",
" if is_local:\n",
" datamodule_constructor = TimeSeriesDataModule\n",
" else:\n",
" datamodule_constructor = _DistributedTimeSeriesDataModule\n",
" datamodule = datamodule_constructor(\n",
" dataset=dataset, \n",
" batch_size=batch_size,\n",
" valid_batch_size=valid_batch_size,\n",
" num_workers=self.num_workers_loader,\n",
" drop_last=self.drop_last_loader,\n",
" shuffle_train=shuffle_train,\n",
" )\n",
"\n",
" if self.val_check_steps > self.max_steps:\n",
" warnings.warn(\n",
" 'val_check_steps is greater than max_steps, '\n",
" 'setting val_check_steps to max_steps.'\n",
" )\n",
" val_check_interval = min(self.val_check_steps, self.max_steps)\n",
" self.trainer_kwargs['val_check_interval'] = int(val_check_interval)\n",
" self.trainer_kwargs['check_val_every_n_epoch'] = None\n",
"\n",
" if is_local:\n",
" model = self\n",
" trainer = pl.Trainer(**model.trainer_kwargs)\n",
" trainer.fit(model, datamodule=datamodule)\n",
" model.metrics = trainer.callback_metrics\n",
" model.__dict__.pop('_trainer', None)\n",
" else:\n",
" assert distributed_config is not None\n",
" from pyspark.ml.torch.distributor import TorchDistributor\n",
"\n",
" def train_fn(\n",
" model_cls,\n",
" model_params,\n",
" datamodule,\n",
" trainer_kwargs,\n",
" num_tasks,\n",
" num_proc_per_task,\n",
" val_size,\n",
" test_size,\n",
" ):\n",
" import pytorch_lightning as pl\n",
"\n",
" # we instantiate here to avoid pickling large tensors (weights)\n",
" model = model_cls(**model_params)\n",
" model.val_size = val_size\n",
" model.test_size = test_size\n",
" for arg in ('devices', 'num_nodes'):\n",
" trainer_kwargs.pop(arg, None)\n",
" trainer = pl.Trainer(\n",
" strategy=\"ddp\",\n",
" use_distributed_sampler=False, # to ensure our dataloaders are used as-is\n",
" num_nodes=num_tasks,\n",
" devices=num_proc_per_task,\n",
" **trainer_kwargs,\n",
" )\n",
" trainer.fit(model=model, datamodule=datamodule)\n",
" model.metrics = trainer.callback_metrics\n",
" model.__dict__.pop('_trainer', None)\n",
" return model\n",
"\n",
" def is_gpu_accelerator(accelerator):\n",
" from pytorch_lightning.accelerators.cuda import CUDAAccelerator\n",
"\n",
" return (\n",
" accelerator == \"gpu\"\n",
" or isinstance(accelerator, CUDAAccelerator)\n",
" or (accelerator == \"auto\" and CUDAAccelerator.is_available())\n",
" )\n",
"\n",
" local_mode = distributed_config.num_nodes == 1\n",
" if local_mode:\n",
" num_tasks = 1\n",
" num_proc_per_task = distributed_config.devices\n",
" else:\n",
" num_tasks = distributed_config.devices * distributed_config.devices\n",
" num_proc_per_task = 1 # number of GPUs per task\n",
" num_proc = num_tasks * num_proc_per_task\n",
" use_gpu = is_gpu_accelerator(self.trainer_kwargs[\"accelerator\"])\n",
" model = TorchDistributor(\n",
" num_processes=num_proc,\n",
" local_mode=local_mode,\n",
" use_gpu=use_gpu,\n",
" ).run(\n",
" train_fn,\n",
" model_cls=type(self),\n",
" model_params=self.hparams,\n",
" datamodule=datamodule,\n",
" trainer_kwargs=self.trainer_kwargs,\n",
" num_tasks=num_tasks,\n",
" num_proc_per_task=num_proc_per_task,\n",
" val_size=val_size,\n",
" test_size=test_size,\n",
" )\n",
" return model\n",
"\n",
" def on_fit_start(self):\n",
" torch.manual_seed(self.random_seed)\n",
" np.random.seed(self.random_seed)\n",
" random.seed(self.random_seed)\n",
"\n",
" def configure_optimizers(self):\n",
" if self.optimizer:\n",
" optimizer_signature = inspect.signature(self.optimizer)\n",
" optimizer_kwargs = deepcopy(self.optimizer_kwargs)\n",
" if 'lr' in optimizer_signature.parameters:\n",
" if 'lr' in optimizer_kwargs:\n",
" warnings.warn(\"ignoring learning rate passed in optimizer_kwargs, using the model's learning rate\")\n",
" optimizer_kwargs['lr'] = self.learning_rate\n",
" optimizer = self.optimizer(params=self.parameters(), **optimizer_kwargs)\n",
" else:\n",
" optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)\n",
" scheduler = {\n",
" 'scheduler': torch.optim.lr_scheduler.StepLR(\n",
" optimizer=optimizer, step_size=self.lr_decay_steps, gamma=0.5\n",
" ),\n",
" 'frequency': 1,\n",
" 'interval': 'step',\n",
" }\n",
" return {'optimizer': optimizer, 'lr_scheduler': scheduler}\n",
"\n",
" def get_test_size(self):\n",
" return self.test_size\n",
"\n",
" def set_test_size(self, test_size):\n",
" self.test_size = test_size\n",
"\n",
" def on_validation_epoch_end(self):\n",
" if self.val_size == 0:\n",
" return\n",
" losses = torch.stack(self.validation_step_outputs)\n",
" avg_loss = losses.mean().item()\n",
" self.log(\n",
" \"ptl/val_loss\",\n",
" avg_loss,\n",
" batch_size=losses.size(0),\n",
" sync_dist=True,\n",
" )\n",
" self.valid_trajectories.append((self.global_step, avg_loss))\n",
" self.validation_step_outputs.clear() # free memory (compute `avg_loss` per epoch)\n",
"\n",
" def save(self, path):\n",
" with fsspec.open(path, 'wb') as f:\n",
" torch.save(\n",
" {'hyper_parameters': self.hparams, 'state_dict': self.state_dict()},\n",
" f,\n",
" )\n",
"\n",
" @classmethod\n",
" def load(cls, path, **kwargs):\n",
" with fsspec.open(path, 'rb') as f:\n",
" content = torch.load(f, **kwargs)\n",
" with _disable_torch_init():\n",
" model = cls(**content['hyper_parameters']) \n",
" model.load_state_dict(content['state_dict'], strict=True, assign=True)\n",
" return model"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp common._base_multivariate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# BaseMultivariate\n",
"\n",
"> The `BaseWindows` class contains standard methods shared across window-based multivariate neural networks; in contrast to recurrent neural networks these models commit to a fixed sequence length input."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The standard methods include data preprocessing `_normalization`, optimization utilities like parameter initialization, `training_step`, `validation_step`, and shared `fit` and `predict` methods.These shared methods enable all the `neuralforecast.models` compatibility with the `core.NeuralForecast` wrapper class. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import numpy as np\n",
"import torch\n",
"import torch.nn as nn\n",
"import pytorch_lightning as pl\n",
"\n",
"import neuralforecast.losses.pytorch as losses\n",
"from neuralforecast.common._base_model import BaseModel\n",
"from neuralforecast.common._scalers import TemporalNorm\n",
"from neuralforecast.tsdataset import TimeSeriesDataModule\n",
"from neuralforecast.utils import get_indexer_raise_missing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class BaseMultivariate(BaseModel):\n",
" \"\"\" Base Multivariate\n",
" \n",
" Base class for all multivariate models. The forecasts for all time-series are produced simultaneously \n",
" within each window, which are randomly sampled during training.\n",
" \n",
" This class implements the basic functionality for all windows-based models, including:\n",
" - PyTorch Lightning's methods training_step, validation_step, predict_step.<br>\n",
" - fit and predict methods used by NeuralForecast.core class.<br>\n",
" - sampling and wrangling methods to generate multivariate windows.\n",
" \"\"\"\n",
" def __init__(self, \n",
" h,\n",
" input_size,\n",
" loss,\n",
" valid_loss,\n",
" learning_rate,\n",
" max_steps,\n",
" val_check_steps,\n",
" n_series,\n",
" batch_size,\n",
" step_size=1,\n",
" num_lr_decays=0,\n",
" early_stop_patience_steps=-1,\n",
" scaler_type='robust',\n",
" futr_exog_list=None,\n",
" hist_exog_list=None,\n",
" stat_exog_list=None,\n",
" num_workers_loader=0,\n",
" drop_last_loader=False,\n",
" random_seed=1, \n",
" alias=None,\n",
" optimizer=None,\n",
" optimizer_kwargs=None,\n",
" **trainer_kwargs):\n",
" super().__init__(\n",
" random_seed=random_seed,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" futr_exog_list=futr_exog_list,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" max_steps=max_steps,\n",
" early_stop_patience_steps=early_stop_patience_steps,\n",
" **trainer_kwargs,\n",
" )\n",
"\n",
" # Padder to complete train windows, \n",
" # example y=[1,2,3,4,5] h=3 -> last y_output = [5,0,0]\n",
" self.h = h\n",
" self.input_size = input_size\n",
" self.n_series = n_series\n",
" self.padder = nn.ConstantPad1d(padding=(0, self.h), value=0)\n",
"\n",
" # Multivariate models do not support these loss functions yet.\n",
" unsupported_losses = (\n",
" losses.sCRPS,\n",
" losses.MQLoss,\n",
" losses.DistributionLoss,\n",
" losses.PMM,\n",
" losses.GMM,\n",
" losses.HuberMQLoss,\n",
" losses.MASE,\n",
" losses.relMSE,\n",
" losses.NBMM,\n",
" )\n",
" if isinstance(self.loss, unsupported_losses):\n",
" raise Exception(f\"{self.loss} is not supported in a Multivariate model.\") \n",
" if isinstance(self.valid_loss, unsupported_losses):\n",
" raise Exception(f\"{self.valid_loss} is not supported in a Multivariate model.\") \n",
"\n",
" self.batch_size = batch_size\n",
" \n",
" # Optimization\n",
" self.learning_rate = learning_rate\n",
" self.max_steps = max_steps\n",
" self.num_lr_decays = num_lr_decays\n",
" self.lr_decay_steps = max(max_steps // self.num_lr_decays, 1) if self.num_lr_decays > 0 else 10e7\n",
" self.early_stop_patience_steps = early_stop_patience_steps\n",
" self.val_check_steps = val_check_steps\n",
" self.step_size = step_size\n",
"\n",
" # Scaler\n",
" self.scaler = TemporalNorm(scaler_type=scaler_type, dim=2) # Time dimension is in the second axis\n",
"\n",
" # Fit arguments\n",
" self.val_size = 0\n",
" self.test_size = 0\n",
"\n",
" # Model state\n",
" self.decompose_forecast = False\n",
"\n",
" # DataModule arguments\n",
" self.num_workers_loader = num_workers_loader\n",
" self.drop_last_loader = drop_last_loader\n",
" # used by on_validation_epoch_end hook\n",
" self.validation_step_outputs = []\n",
" self.alias = alias\n",
"\n",
" def _create_windows(self, batch, step):\n",
" # Parse common data\n",
" window_size = self.input_size + self.h\n",
" temporal_cols = batch['temporal_cols']\n",
" temporal = batch['temporal']\n",
"\n",
" if step == 'train':\n",
" if self.val_size + self.test_size > 0:\n",
" cutoff = -self.val_size - self.test_size\n",
" temporal = temporal[:, :, :cutoff]\n",
"\n",
" temporal = self.padder(temporal)\n",
" windows = temporal.unfold(dimension=-1, \n",
" size=window_size, \n",
" step=self.step_size)\n",
" # [n_series, C, Ws, L+H] 0, 1, 2, 3\n",
"\n",
" # Sample and Available conditions\n",
" available_idx = temporal_cols.get_loc('available_mask')\n",
" sample_condition = windows[:, available_idx, :, -self.h:]\n",
" sample_condition = torch.sum(sample_condition, axis=2) # Sum over time\n",
" sample_condition = torch.sum(sample_condition, axis=0) # Sum over time-series\n",
" available_condition = windows[:, available_idx, :, :-self.h]\n",
" available_condition = torch.sum(available_condition, axis=2) # Sum over time\n",
" available_condition = torch.sum(available_condition, axis=0) # Sum over time-series\n",
" final_condition = (sample_condition > 0) & (available_condition > 0) # Of shape [Ws]\n",
" windows = windows[:, :, final_condition, :]\n",
"\n",
" # Get Static data\n",
" static = batch.get('static', None)\n",
" static_cols = batch.get('static_cols', None)\n",
"\n",
" # Protection of empty windows\n",
" if final_condition.sum() == 0:\n",
" raise Exception('No windows available for training')\n",
"\n",
" # Sample windows\n",
" n_windows = windows.shape[2]\n",
" if self.batch_size is not None:\n",
" w_idxs = np.random.choice(n_windows, \n",
" size=self.batch_size,\n",
" replace=(n_windows < self.batch_size))\n",
" windows = windows[:, :, w_idxs, :]\n",
"\n",
" windows = windows.permute(2, 1, 3, 0) # [Ws, C, L+H, n_series]\n",
"\n",
" windows_batch = dict(temporal=windows,\n",
" temporal_cols=temporal_cols,\n",
" static=static,\n",
" static_cols=static_cols)\n",
"\n",
" return windows_batch\n",
"\n",
" elif step in ['predict', 'val']:\n",
"\n",
" if step == 'predict':\n",
" predict_step_size = self.predict_step_size\n",
" cutoff = - self.input_size - self.test_size\n",
" temporal = batch['temporal'][:, :, cutoff:]\n",
"\n",
" elif step == 'val':\n",
" predict_step_size = self.step_size\n",
" cutoff = -self.input_size - self.val_size - self.test_size\n",
" if self.test_size > 0:\n",
" temporal = batch['temporal'][:, :, cutoff:-self.test_size]\n",
" else:\n",
" temporal = batch['temporal'][:, :, cutoff:]\n",
"\n",
" if (step=='predict') and (self.test_size==0) and (len(self.futr_exog_list)==0):\n",
" temporal = self.padder(temporal)\n",
"\n",
" windows = temporal.unfold(dimension=-1,\n",
" size=window_size,\n",
" step=predict_step_size)\n",
" # [n_series, C, Ws, L+H] -> [Ws, C, L+H, n_series]\n",
" windows = windows.permute(2, 1, 3, 0)\n",
"\n",
" # Get Static data\n",
" static = batch.get('static', None)\n",
" static_cols=batch.get('static_cols', None)\n",
"\n",
" windows_batch = dict(temporal=windows,\n",
" temporal_cols=temporal_cols,\n",
" static=static,\n",
" static_cols=static_cols)\n",
"\n",
"\n",
" return windows_batch\n",
" else:\n",
" raise ValueError(f'Unknown step {step}') \n",
"\n",
" def _normalization(self, windows, y_idx):\n",
" \n",
" # windows are already filtered by train/validation/test\n",
" # from the `create_windows_method` nor leakage risk\n",
" temporal = windows['temporal'] # [Ws, C, L+H, n_series]\n",
" temporal_cols = windows['temporal_cols'].copy() # [Ws, C, L+H, n_series]\n",
"\n",
" # To avoid leakage uses only the lags\n",
" temporal_data_cols = self._get_temporal_exogenous_cols(temporal_cols=temporal_cols)\n",
" temporal_idxs = get_indexer_raise_missing(temporal_cols, temporal_data_cols)\n",
" temporal_idxs = np.append(y_idx, temporal_idxs)\n",
" temporal_data = temporal[:, temporal_idxs, :, :]\n",
" temporal_mask = temporal[:, temporal_cols.get_loc('available_mask'), :, :].clone()\n",
" temporal_mask[:, -self.h:, :] = 0.0\n",
"\n",
" # Normalize. self.scaler stores the shift and scale for inverse transform\n",
" temporal_mask = temporal_mask.unsqueeze(1) # Add channel dimension for scaler.transform.\n",
" temporal_data = self.scaler.transform(x=temporal_data, mask=temporal_mask)\n",
" # Replace values in windows dict\n",
" temporal[:, temporal_idxs, :, :] = temporal_data\n",
" windows['temporal'] = temporal\n",
"\n",
" return windows\n",
"\n",
" def _inv_normalization(self, y_hat, temporal_cols, y_idx):\n",
" # Receives window predictions [Ws, H, n_series]\n",
" # Broadcasts outputs and inverts normalization\n",
"\n",
" # Add C dimension\n",
" # if y_hat.ndim == 2:\n",
" # remove_dimension = True\n",
" # y_hat = y_hat.unsqueeze(-1)\n",
" # else:\n",
" # remove_dimension = False\n",
" \n",
" y_scale = self.scaler.x_scale[:, [y_idx], :].squeeze(1)\n",
" y_loc = self.scaler.x_shift[:, [y_idx], :].squeeze(1)\n",
"\n",
" # y_scale = torch.repeat_interleave(y_scale, repeats=y_hat.shape[-1], dim=-1)\n",
" # y_loc = torch.repeat_interleave(y_loc, repeats=y_hat.shape[-1], dim=-1)\n",
"\n",
" y_hat = self.scaler.inverse_transform(z=y_hat, x_scale=y_scale, x_shift=y_loc)\n",
"\n",
" # if remove_dimension:\n",
" # y_hat = y_hat.squeeze(-1)\n",
" # y_loc = y_loc.squeeze(-1)\n",
" # y_scale = y_scale.squeeze(-1)\n",
"\n",
" return y_hat, y_loc, y_scale\n",
"\n",
" def _parse_windows(self, batch, windows):\n",
" # Temporal: [Ws, C, L+H, n_series]\n",
"\n",
" # Filter insample lags from outsample horizon\n",
" mask_idx = batch['temporal_cols'].get_loc('available_mask')\n",
" y_idx = batch['y_idx'] \n",
" insample_y = windows['temporal'][:, y_idx, :-self.h, :]\n",
" insample_mask = windows['temporal'][:, mask_idx, :-self.h, :]\n",
" outsample_y = windows['temporal'][:, y_idx, -self.h:, :]\n",
" outsample_mask = windows['temporal'][:, mask_idx, -self.h:, :]\n",
"\n",
" # Filter historic exogenous variables\n",
" if len(self.hist_exog_list):\n",
" hist_exog_idx = get_indexer_raise_missing(windows['temporal_cols'], self.hist_exog_list)\n",
" hist_exog = windows['temporal'][:, hist_exog_idx, :-self.h, :]\n",
" else:\n",
" hist_exog = None\n",
" \n",
" # Filter future exogenous variables\n",
" if len(self.futr_exog_list):\n",
" futr_exog_idx = get_indexer_raise_missing(windows['temporal_cols'], self.futr_exog_list)\n",
" futr_exog = windows['temporal'][:, futr_exog_idx, :, :]\n",
" else:\n",
" futr_exog = None\n",
"\n",
" # Filter static variables\n",
" if len(self.stat_exog_list):\n",
" static_idx = get_indexer_raise_missing(windows['static_cols'], self.stat_exog_list)\n",
" stat_exog = windows['static'][:, static_idx]\n",
" else:\n",
" stat_exog = None\n",
"\n",
" return insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog\n",
"\n",
" def training_step(self, batch, batch_idx): \n",
" # Create and normalize windows [batch_size, n_series, C, L+H]\n",
" windows = self._create_windows(batch, step='train')\n",
" y_idx = batch['y_idx']\n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
"\n",
" windows_batch = dict(insample_y=insample_y, # [batch_size, L, n_series]\n",
" insample_mask=insample_mask, # [batch_size, L, n_series]\n",
" futr_exog=futr_exog, # [batch_size, n_feats, L+H, n_series]\n",
" hist_exog=hist_exog, # [batch_size, n_feats, L, n_series]\n",
" stat_exog=stat_exog) # [n_series, n_feats]\n",
"\n",
" # Model Predictions\n",
" output = self(windows_batch)\n",
" if self.loss.is_distribution_output:\n",
" outsample_y, y_loc, y_scale = self._inv_normalization(y_hat=outsample_y,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" loss = self.loss(y=outsample_y, distr_args=distr_args, mask=outsample_mask)\n",
" else:\n",
" loss = self.loss(y=outsample_y, y_hat=output, mask=outsample_mask)\n",
"\n",
" if torch.isnan(loss):\n",
" print('Model Parameters', self.hparams)\n",
" print('insample_y', torch.isnan(insample_y).sum())\n",
" print('outsample_y', torch.isnan(outsample_y).sum())\n",
" print('output', torch.isnan(output).sum())\n",
" raise Exception('Loss is NaN, training stopped.')\n",
"\n",
" self.log(\n",
" 'train_loss',\n",
" loss.item(),\n",
" batch_size=outsample_y.size(0),\n",
" prog_bar=True,\n",
" on_epoch=True,\n",
" )\n",
" self.train_trajectories.append((self.global_step, loss.item()))\n",
" return loss\n",
"\n",
" def validation_step(self, batch, batch_idx):\n",
" if self.val_size == 0:\n",
" return np.nan\n",
" \n",
" # Create and normalize windows [Ws, L+H, C]\n",
" windows = self._create_windows(batch, step='val')\n",
" y_idx = batch['y_idx']\n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
"\n",
" windows_batch = dict(insample_y=insample_y, # [Ws, L]\n",
" insample_mask=insample_mask, # [Ws, L]\n",
" futr_exog=futr_exog, # [Ws, L+H]\n",
" hist_exog=hist_exog, # [Ws, L]\n",
" stat_exog=stat_exog) # [Ws, 1]\n",
"\n",
" # Model Predictions\n",
" output = self(windows_batch)\n",
" if self.loss.is_distribution_output:\n",
" outsample_y, y_loc, y_scale = self._inv_normalization(y_hat=outsample_y,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
"\n",
" if str(type(self.valid_loss)) in\\\n",
" [\"<class 'neuralforecast.losses.pytorch.sCRPS'>\", \"<class 'neuralforecast.losses.pytorch.MQLoss'>\"]:\n",
" _, output = self.loss.sample(distr_args=distr_args)\n",
"\n",
" # Validation Loss evaluation\n",
" if self.valid_loss.is_distribution_output:\n",
" valid_loss = self.valid_loss(y=outsample_y, distr_args=distr_args, mask=outsample_mask)\n",
" else:\n",
" valid_loss = self.valid_loss(y=outsample_y, y_hat=output, mask=outsample_mask)\n",
"\n",
" if torch.isnan(valid_loss):\n",
" raise Exception('Loss is NaN, training stopped.')\n",
"\n",
" self.log(\n",
" 'valid_loss',\n",
" valid_loss.item(),\n",
" batch_size=outsample_y.size(0),\n",
" prog_bar=True,\n",
" on_epoch=True,\n",
" )\n",
" self.validation_step_outputs.append(valid_loss)\n",
" return valid_loss\n",
"\n",
" def predict_step(self, batch, batch_idx): \n",
" # Create and normalize windows [Ws, L+H, C]\n",
" windows = self._create_windows(batch, step='predict')\n",
" y_idx = batch['y_idx'] \n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, _, _, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
"\n",
" windows_batch = dict(insample_y=insample_y, # [Ws, L]\n",
" insample_mask=insample_mask, # [Ws, L]\n",
" futr_exog=futr_exog, # [Ws, L+H]\n",
" hist_exog=hist_exog, # [Ws, L]\n",
" stat_exog=stat_exog) # [Ws, 1]\n",
"\n",
" # Model Predictions\n",
" output = self(windows_batch)\n",
" if self.loss.is_distribution_output:\n",
" _, y_loc, y_scale = self._inv_normalization(y_hat=output[0],\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" _, y_hat = self.loss.sample(distr_args=distr_args)\n",
"\n",
" if self.loss.return_params:\n",
" distr_args = torch.stack(distr_args, dim=-1)\n",
" distr_args = torch.reshape(distr_args, (len(windows[\"temporal\"]), self.h, -1))\n",
" y_hat = torch.concat((y_hat, distr_args), axis=2)\n",
" else:\n",
" y_hat, _, _ = self._inv_normalization(y_hat=output,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" return y_hat\n",
" \n",
" def fit(self, dataset, val_size=0, test_size=0, random_seed=None, distributed_config=None):\n",
" \"\"\" Fit.\n",
"\n",
" The `fit` method, optimizes the neural network's weights using the\n",
" initialization parameters (`learning_rate`, `windows_batch_size`, ...)\n",
" and the `loss` function as defined during the initialization. \n",
" Within `fit` we use a PyTorch Lightning `Trainer` that\n",
" inherits the initialization's `self.trainer_kwargs`, to customize\n",
" its inputs, see [PL's trainer arguments](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).\n",
"\n",
" The method is designed to be compatible with SKLearn-like classes\n",
" and in particular to be compatible with the StatsForecast library.\n",
"\n",
" By default the `model` is not saving training checkpoints to protect \n",
" disk memory, to get them change `enable_checkpointing=True` in `__init__`.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset`, see [documentation](https://nixtla.github.io/neuralforecast/tsdataset.html).<br>\n",
" `val_size`: int, validation size for temporal cross-validation.<br>\n",
" `test_size`: int, test size for temporal cross-validation.<br>\n",
" \"\"\"\n",
" if distributed_config is not None:\n",
" raise ValueError(\"multivariate models cannot be trained using distributed data parallel.\")\n",
" return self._fit(\n",
" dataset=dataset,\n",
" batch_size=self.n_series,\n",
" valid_batch_size=self.n_series,\n",
" val_size=val_size,\n",
" test_size=test_size,\n",
" random_seed=random_seed,\n",
" shuffle_train=False,\n",
" distributed_config=None,\n",
" )\n",
"\n",
" def predict(self, dataset, test_size=None, step_size=1, random_seed=None, **data_module_kwargs):\n",
" \"\"\" Predict.\n",
"\n",
" Neural network prediction with PL's `Trainer` execution of `predict_step`.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset`, see [documentation](https://nixtla.github.io/neuralforecast/tsdataset.html).<br>\n",
" `test_size`: int=None, test size for temporal cross-validation.<br>\n",
" `step_size`: int=1, Step size between each window.<br>\n",
" `**data_module_kwargs`: PL's TimeSeriesDataModule args, see [documentation](https://pytorch-lightning.readthedocs.io/en/1.6.1/extensions/datamodules.html#using-a-datamodule).\n",
" \"\"\"\n",
" self._check_exog(dataset)\n",
" self._restart_seed(random_seed)\n",
"\n",
" self.predict_step_size = step_size\n",
" self.decompose_forecast = False\n",
" datamodule = TimeSeriesDataModule(dataset=dataset, \n",
" valid_batch_size=self.n_series, \n",
" batch_size=self.n_series,\n",
" **data_module_kwargs)\n",
"\n",
" # Protect when case of multiple gpu. PL does not support return preds with multiple gpu.\n",
" pred_trainer_kwargs = self.trainer_kwargs.copy()\n",
" if (pred_trainer_kwargs.get('accelerator', None) == \"gpu\") and (torch.cuda.device_count() > 1):\n",
" pred_trainer_kwargs['devices'] = [0]\n",
"\n",
" trainer = pl.Trainer(**pred_trainer_kwargs)\n",
" fcsts = trainer.predict(self, datamodule=datamodule)\n",
" fcsts = torch.vstack(fcsts).numpy()\n",
"\n",
" fcsts = np.transpose(fcsts, (2,0,1))\n",
" fcsts = fcsts.flatten()\n",
" fcsts = fcsts.reshape(-1, len(self.loss.output_names))\n",
" return fcsts\n",
"\n",
" def decompose(self, dataset, step_size=1, random_seed=None, **data_module_kwargs):\n",
" raise NotImplementedError('decompose')"
]
},
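{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal, self-contained sketch (illustrative only, not part of the library) of how `_create_windows` builds training windows: `torch.Tensor.unfold` slides a window of size `input_size + h` over the time axis, producing the `[n_series, C, Ws, L+H]` tensor that the training branch later permutes to `[Ws, C, L+H, n_series]`. The toy sizes below are assumptions chosen for readability."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Illustrative sketch (assumed toy sizes) of the window construction in `_create_windows`.\n",
"import torch\n",
"\n",
"n_series, C, T = 3, 2, 20\n",
"input_size, h, step_size = 4, 2, 1\n",
"window_size = input_size + h\n",
"\n",
"temporal = torch.arange(n_series * C * T, dtype=torch.float32).reshape(n_series, C, T)\n",
"windows = temporal.unfold(dimension=-1, size=window_size, step=step_size)\n",
"print(windows.shape)   # [n_series, C, Ws, L+H] -> torch.Size([3, 2, 15, 6])\n",
"\n",
"# Same permutation as the training branch: [Ws, C, L+H, n_series]\n",
"windows = windows.permute(2, 1, 3, 0)\n",
"print(windows.shape)   # torch.Size([15, 2, 6, 3])"
]
},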
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from fastcore.test import test_fail"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# test unsupported losses\n",
"test_fail(\n",
" lambda: BaseMultivariate(\n",
" h=1,\n",
" input_size=1,\n",
" loss=losses.MQLoss(),\n",
" valid_loss=losses.RMSE(),\n",
" learning_rate=1,\n",
" max_steps=1,\n",
" val_check_steps=1,\n",
" n_series=1,\n",
" batch_size=1,\n",
" ),\n",
" contains='MQLoss() is not supported'\n",
")\n",
"\n",
"test_fail(\n",
" lambda: BaseMultivariate(\n",
" h=1,\n",
" input_size=1,\n",
" loss=losses.RMSE(),\n",
" valid_loss=losses.MASE(seasonality=1),\n",
" learning_rate=1,\n",
" max_steps=1,\n",
" val_check_steps=1,\n",
" n_series=1,\n",
" batch_size=1,\n",
" ),\n",
" contains='MASE() is not supported'\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp common._base_recurrent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# BaseRecurrent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> The `BaseRecurrent` class contains standard methods shared across recurrent neural networks; these models possess the ability to process variable-length sequences of inputs through their internal memory states. The class is represented by `LSTM`, `GRU`, and `RNN`, along with other more sophisticated architectures like `MQCNN`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The standard methods include `TemporalNorm` preprocessing, optimization utilities like parameter initialization, `training_step`, `validation_step`, and shared `fit` and `predict` methods.These shared methods enable all the `neuralforecast.models` compatibility with the `core.NeuralForecast` wrapper class."
]
},
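{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal round-trip sketch (illustrative only, not a library test) of the `TemporalNorm` usage in `_normalization` and `_inv_normalization` below: `transform` normalizes a `[B, C, T]` tensor given an availability mask and stores `x_shift`/`x_scale`, which `inverse_transform` then uses to recover the original scale. The toy shapes and the `robust` scaler are assumptions for readability."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Illustrative round-trip through TemporalNorm (assumed toy data).\n",
"import torch\n",
"from neuralforecast.common._scalers import TemporalNorm\n",
"\n",
"B, C, T = 2, 1, 24                               # toy [batch, channels, time]\n",
"x = torch.arange(B * C * T, dtype=torch.float32).reshape(B, C, T)\n",
"mask = torch.ones(B, 1, T)                       # all timestamps available\n",
"\n",
"scaler = TemporalNorm(scaler_type='robust', dim=-1, num_features=C)\n",
"z = scaler.transform(x=x, mask=mask)             # stores scaler.x_shift / scaler.x_scale\n",
"x_rec = scaler.inverse_transform(z=z, x_scale=scaler.x_scale, x_shift=scaler.x_shift)\n",
"assert torch.allclose(x, x_rec, atol=1e-4)"
]
},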
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import numpy as np\n",
"import torch\n",
"import torch.nn as nn\n",
"import pytorch_lightning as pl\n",
"\n",
"from neuralforecast.common._base_model import BaseModel\n",
"from neuralforecast.common._scalers import TemporalNorm\n",
"from neuralforecast.tsdataset import TimeSeriesDataModule\n",
"from neuralforecast.utils import get_indexer_raise_missing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class BaseRecurrent(BaseModel):\n",
" \"\"\" Base Recurrent\n",
" \n",
" Base class for all recurrent-based models. The forecasts are produced sequentially between \n",
" windows.\n",
" \n",
" This class implements the basic functionality for all windows-based models, including:\n",
" - PyTorch Lightning's methods training_step, validation_step, predict_step. <br>\n",
" - fit and predict methods used by NeuralForecast.core class. <br>\n",
" - sampling and wrangling methods to sequential windows. <br>\n",
" \"\"\"\n",
" def __init__(self,\n",
" h,\n",
" input_size,\n",
" inference_input_size,\n",
" loss,\n",
" valid_loss,\n",
" learning_rate,\n",
" max_steps,\n",
" val_check_steps,\n",
" batch_size,\n",
" valid_batch_size,\n",
" scaler_type='robust',\n",
" num_lr_decays=0,\n",
" early_stop_patience_steps=-1,\n",
" futr_exog_list=None,\n",
" hist_exog_list=None,\n",
" stat_exog_list=None,\n",
" num_workers_loader=0,\n",
" drop_last_loader=False,\n",
" random_seed=1, \n",
" alias=None,\n",
" optimizer=None,\n",
" optimizer_kwargs=None,\n",
" **trainer_kwargs):\n",
" super().__init__(\n",
" random_seed=random_seed,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" futr_exog_list=futr_exog_list,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" max_steps=max_steps,\n",
" early_stop_patience_steps=early_stop_patience_steps, \n",
" **trainer_kwargs,\n",
" )\n",
"\n",
" # Padder to complete train windows, \n",
" # example y=[1,2,3,4,5] h=3 -> last y_output = [5,0,0]\n",
" self.h = h\n",
" self.input_size = input_size\n",
" self.inference_input_size = inference_input_size\n",
" self.padder = nn.ConstantPad1d(padding=(0, self.h), value=0)\n",
"\n",
"\n",
" if str(type(self.loss)) == \"<class 'neuralforecast.losses.pytorch.DistributionLoss'>\" and\\\n",
" self.loss.distribution=='Bernoulli':\n",
" raise Exception('Temporal Classification not yet available for Recurrent-based models')\n",
"\n",
" # Valid batch_size\n",
" self.batch_size = batch_size\n",
" if valid_batch_size is None:\n",
" self.valid_batch_size = batch_size\n",
" else:\n",
" self.valid_batch_size = valid_batch_size\n",
"\n",
" # Optimization\n",
" self.learning_rate = learning_rate\n",
" self.max_steps = max_steps\n",
" self.num_lr_decays = num_lr_decays\n",
" self.lr_decay_steps = max(max_steps // self.num_lr_decays, 1) if self.num_lr_decays > 0 else 10e7\n",
" self.early_stop_patience_steps = early_stop_patience_steps\n",
" self.val_check_steps = val_check_steps\n",
"\n",
" # Scaler\n",
" self.scaler = TemporalNorm(\n",
" scaler_type=scaler_type,\n",
" dim=-1, # Time dimension is -1.\n",
" num_features=1+len(self.hist_exog_list)+len(self.futr_exog_list)\n",
" )\n",
"\n",
" # Fit arguments\n",
" self.val_size = 0\n",
" self.test_size = 0\n",
"\n",
" # DataModule arguments\n",
" self.num_workers_loader = num_workers_loader\n",
" self.drop_last_loader = drop_last_loader\n",
" # used by on_validation_epoch_end hook\n",
" self.validation_step_outputs = []\n",
" self.alias = alias\n",
"\n",
" def _normalization(self, batch, val_size=0, test_size=0):\n",
" temporal = batch['temporal'] # B, C, T\n",
" temporal_cols = batch['temporal_cols'].copy()\n",
" y_idx = batch['y_idx']\n",
"\n",
" # Separate data and mask\n",
" temporal_data_cols = self._get_temporal_exogenous_cols(temporal_cols=temporal_cols)\n",
" temporal_idxs = get_indexer_raise_missing(temporal_cols, temporal_data_cols)\n",
" temporal_idxs = np.append(y_idx, temporal_idxs)\n",
" temporal_data = temporal[:, temporal_idxs, :]\n",
" temporal_mask = temporal[:, temporal_cols.get_loc('available_mask'), :].clone()\n",
"\n",
" # Remove validation and test set to prevent leakeage\n",
" if val_size + test_size > 0:\n",
" cutoff = val_size + test_size\n",
" temporal_mask[:, -cutoff:] = 0\n",
"\n",
" # Normalize. self.scaler stores the shift and scale for inverse transform\n",
" temporal_mask = temporal_mask.unsqueeze(1) # Add channel dimension for scaler.transform.\n",
" temporal_data = self.scaler.transform(x=temporal_data, mask=temporal_mask)\n",
"\n",
" # Replace values in windows dict\n",
" temporal[:, temporal_idxs, :] = temporal_data\n",
" batch['temporal'] = temporal\n",
"\n",
" return batch\n",
"\n",
" def _inv_normalization(self, y_hat, temporal_cols, y_idx):\n",
" # Receives window predictions [B, seq_len, H, output]\n",
" # Broadcasts outputs and inverts normalization\n",
"\n",
" # Get 'y' scale and shift, and add W dimension\n",
" y_loc = self.scaler.x_shift[:, [y_idx], 0].flatten() #[B,C,T] -> [B] \n",
" y_scale = self.scaler.x_scale[:, [y_idx], 0].flatten() #[B,C,T] -> [B]\n",
"\n",
" # Expand scale and shift to y_hat dimensions\n",
" y_loc = y_loc.view(*y_loc.shape, *(1,)*(y_hat.ndim-1))#.expand(y_hat) \n",
" y_scale = y_scale.view(*y_scale.shape, *(1,)*(y_hat.ndim-1))#.expand(y_hat)\n",
"\n",
" y_hat = self.scaler.inverse_transform(z=y_hat, x_scale=y_scale, x_shift=y_loc)\n",
"\n",
" return y_hat, y_loc, y_scale\n",
"\n",
" def _create_windows(self, batch, step):\n",
" temporal = batch['temporal']\n",
" temporal_cols = batch['temporal_cols']\n",
"\n",
" if step == 'train':\n",
" if self.val_size + self.test_size > 0:\n",
" cutoff = -self.val_size - self.test_size\n",
" temporal = temporal[:, :, :cutoff]\n",
" temporal = self.padder(temporal)\n",
"\n",
" # Truncate batch to shorter time-series \n",
" av_condition = torch.nonzero(torch.min(temporal[:, temporal_cols.get_loc('available_mask')], axis=0).values)\n",
" min_time_stamp = int(av_condition.min())\n",
" \n",
" available_ts = temporal.shape[-1] - min_time_stamp\n",
" if available_ts < 1 + self.h:\n",
" raise Exception(\n",
" 'Time series too short for given input and output size. \\n'\n",
" f'Available timestamps: {available_ts}'\n",
" )\n",
"\n",
" temporal = temporal[:, :, min_time_stamp:]\n",
"\n",
" if step == 'val':\n",
" if self.test_size > 0:\n",
" temporal = temporal[:, :, :-self.test_size]\n",
" temporal = self.padder(temporal)\n",
"\n",
" if step == 'predict':\n",
" if (self.test_size == 0) and (len(self.futr_exog_list)==0):\n",
" temporal = self.padder(temporal)\n",
"\n",
" # Test size covers all data, pad left one timestep with zeros\n",
" if temporal.shape[-1] == self.test_size:\n",
" padder_left = nn.ConstantPad1d(padding=(1, 0), value=0)\n",
" temporal = padder_left(temporal)\n",
"\n",
" # Parse batch\n",
" window_size = 1 + self.h # 1 for current t and h for future\n",
" windows = temporal.unfold(dimension=-1,\n",
" size=window_size,\n",
" step=1)\n",
"\n",
" # Truncated backprogatation/inference (shorten sequence where RNNs unroll)\n",
" n_windows = windows.shape[2]\n",
" input_size = -1\n",
" if (step == 'train') and (self.input_size>0):\n",
" input_size = self.input_size\n",
" if (input_size > 0) and (n_windows > input_size):\n",
" max_sampleable_time = n_windows-self.input_size+1\n",
" start = np.random.choice(max_sampleable_time)\n",
" windows = windows[:, :, start:(start+input_size), :]\n",
"\n",
" if (step == 'val') and (self.inference_input_size>0):\n",
" cutoff = self.inference_input_size + self.val_size\n",
" windows = windows[:, :, -cutoff:, :]\n",
"\n",
" if (step == 'predict') and (self.inference_input_size>0):\n",
" cutoff = self.inference_input_size + self.test_size\n",
" windows = windows[:, :, -cutoff:, :]\n",
" \n",
" # [B, C, input_size, 1+H]\n",
" windows_batch = dict(temporal=windows,\n",
" temporal_cols=temporal_cols,\n",
" static=batch.get('static', None),\n",
" static_cols=batch.get('static_cols', None))\n",
"\n",
" return windows_batch\n",
"\n",
" def _parse_windows(self, batch, windows):\n",
" # [B, C, seq_len, 1+H]\n",
" # Filter insample lags from outsample horizon\n",
" mask_idx = batch['temporal_cols'].get_loc('available_mask')\n",
" y_idx = batch['y_idx'] \n",
" insample_y = windows['temporal'][:, y_idx, :, :-self.h]\n",
" insample_mask = windows['temporal'][:, mask_idx, :, :-self.h]\n",
" outsample_y = windows['temporal'][:, y_idx, :, -self.h:].contiguous()\n",
" outsample_mask = windows['temporal'][:, mask_idx, :, -self.h:].contiguous()\n",
"\n",
" # Filter historic exogenous variables\n",
" if len(self.hist_exog_list):\n",
" hist_exog_idx = get_indexer_raise_missing(windows['temporal_cols'], self.hist_exog_list)\n",
" hist_exog = windows['temporal'][:, hist_exog_idx, :, :-self.h]\n",
" else:\n",
" hist_exog = None\n",
" \n",
" # Filter future exogenous variables\n",
" if len(self.futr_exog_list):\n",
" futr_exog_idx = get_indexer_raise_missing(windows['temporal_cols'], self.futr_exog_list)\n",
" futr_exog = windows['temporal'][:, futr_exog_idx, :, :]\n",
" else:\n",
" futr_exog = None\n",
" # Filter static variables\n",
" if len(self.stat_exog_list):\n",
" static_idx = get_indexer_raise_missing(windows['static_cols'], self.stat_exog_list)\n",
" stat_exog = windows['static'][:, static_idx]\n",
" else:\n",
" stat_exog = None\n",
"\n",
" return insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog\n",
"\n",
" def training_step(self, batch, batch_idx):\n",
" # Create and normalize windows [Ws, L+H, C]\n",
" batch = self._normalization(batch, val_size=self.val_size, test_size=self.test_size)\n",
" windows = self._create_windows(batch, step='train')\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
"\n",
" windows_batch = dict(insample_y=insample_y, # [B, seq_len, 1]\n",
" insample_mask=insample_mask, # [B, seq_len, 1]\n",
" futr_exog=futr_exog, # [B, F, seq_len, 1+H]\n",
" hist_exog=hist_exog, # [B, C, seq_len]\n",
" stat_exog=stat_exog) # [B, S]\n",
"\n",
" # Model predictions\n",
" output = self(windows_batch) # tuple([B, seq_len, H, output])\n",
" if self.loss.is_distribution_output:\n",
" outsample_y, y_loc, y_scale = self._inv_normalization(y_hat=outsample_y,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=batch['y_idx'])\n",
" B = output[0].size()[0]\n",
" T = output[0].size()[1]\n",
" H = output[0].size()[2]\n",
" output = [arg.view(-1, *(arg.size()[2:])) for arg in output]\n",
" outsample_y = outsample_y.view(B*T,H)\n",
" outsample_mask = outsample_mask.view(B*T,H)\n",
" y_loc = y_loc.repeat_interleave(repeats=T, dim=0).squeeze(-1)\n",
" y_scale = y_scale.repeat_interleave(repeats=T, dim=0).squeeze(-1)\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" loss = self.loss(y=outsample_y, distr_args=distr_args, mask=outsample_mask)\n",
" else:\n",
" loss = self.loss(y=outsample_y, y_hat=output, mask=outsample_mask)\n",
"\n",
" if torch.isnan(loss):\n",
" print('Model Parameters', self.hparams)\n",
" print('insample_y', torch.isnan(insample_y).sum())\n",
" print('outsample_y', torch.isnan(outsample_y).sum())\n",
" print('output', torch.isnan(output).sum())\n",
" raise Exception('Loss is NaN, training stopped.')\n",
"\n",
" self.log(\n",
" 'train_loss',\n",
" loss.item(),\n",
" batch_size=outsample_y.size(0),\n",
" prog_bar=True,\n",
" on_epoch=True,\n",
" )\n",
" self.train_trajectories.append((self.global_step, loss.item()))\n",
" return loss\n",
"\n",
" def validation_step(self, batch, batch_idx):\n",
" if self.val_size == 0:\n",
" return np.nan\n",
"\n",
" # Create and normalize windows [Ws, L+H, C]\n",
" batch = self._normalization(batch, val_size=self.val_size, test_size=self.test_size)\n",
" windows = self._create_windows(batch, step='val')\n",
" y_idx = batch['y_idx']\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
"\n",
" windows_batch = dict(insample_y=insample_y, # [B, seq_len, 1]\n",
" insample_mask=insample_mask, # [B, seq_len, 1]\n",
" futr_exog=futr_exog, # [B, F, seq_len, 1+H]\n",
" hist_exog=hist_exog, # [B, C, seq_len]\n",
" stat_exog=stat_exog) # [B, S]\n",
"\n",
" # Remove train y_hat (+1 and -1 for padded last window with zeros)\n",
" # tuple([B, seq_len, H, output]) -> tuple([B, validation_size, H, output])\n",
" val_windows = (self.val_size) + 1\n",
" outsample_y = outsample_y[:, -val_windows:-1, :]\n",
" outsample_mask = outsample_mask[:, -val_windows:-1, :] \n",
"\n",
" # Model predictions\n",
" output = self(windows_batch) # tuple([B, seq_len, H, output])\n",
" if self.loss.is_distribution_output:\n",
" output = [arg[:, -val_windows:-1] for arg in output]\n",
" outsample_y, y_loc, y_scale = self._inv_normalization(y_hat=outsample_y,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" B = output[0].size()[0]\n",
" T = output[0].size()[1]\n",
" H = output[0].size()[2]\n",
" output = [arg.reshape(-1, *(arg.size()[2:])) for arg in output]\n",
" outsample_y = outsample_y.reshape(B*T,H)\n",
" outsample_mask = outsample_mask.reshape(B*T,H)\n",
" y_loc = y_loc.repeat_interleave(repeats=T, dim=0).squeeze(-1)\n",
" y_scale = y_scale.repeat_interleave(repeats=T, dim=0).squeeze(-1)\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" _, sample_mean, quants = self.loss.sample(distr_args=distr_args)\n",
"\n",
" if str(type(self.valid_loss)) in\\\n",
" [\"<class 'neuralforecast.losses.pytorch.sCRPS'>\", \"<class 'neuralforecast.losses.pytorch.MQLoss'>\"]:\n",
" output = quants\n",
" elif str(type(self.valid_loss)) in [\"<class 'neuralforecast.losses.pytorch.relMSE'>\"]:\n",
" output = torch.unsqueeze(sample_mean, dim=-1) # [N,H,1] -> [N,H]\n",
" \n",
" else:\n",
" output = output[:, -val_windows:-1, :]\n",
"\n",
" # Validation Loss evaluation\n",
" if self.valid_loss.is_distribution_output:\n",
" valid_loss = self.valid_loss(y=outsample_y, distr_args=distr_args, mask=outsample_mask)\n",
" else:\n",
" outsample_y, _, _ = self._inv_normalization(y_hat=outsample_y, temporal_cols=batch['temporal_cols'], y_idx=y_idx)\n",
" output, _, _ = self._inv_normalization(y_hat=output, temporal_cols=batch['temporal_cols'], y_idx=y_idx)\n",
" valid_loss = self.valid_loss(y=outsample_y, y_hat=output, mask=outsample_mask)\n",
"\n",
" if torch.isnan(valid_loss):\n",
" raise Exception('Loss is NaN, training stopped.')\n",
"\n",
" self.log(\n",
" 'valid_loss',\n",
" valid_loss.item(),\n",
" batch_size=outsample_y.size(0),\n",
" prog_bar=True,\n",
" on_epoch=True,\n",
" )\n",
" self.validation_step_outputs.append(valid_loss)\n",
" return valid_loss\n",
"\n",
" def predict_step(self, batch, batch_idx):\n",
" # Create and normalize windows [Ws, L+H, C]\n",
" batch = self._normalization(batch, val_size=0, test_size=self.test_size)\n",
" windows = self._create_windows(batch, step='predict')\n",
" y_idx = batch['y_idx']\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, _, _, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
"\n",
" windows_batch = dict(insample_y=insample_y, # [B, seq_len, 1]\n",
" insample_mask=insample_mask, # [B, seq_len, 1]\n",
" futr_exog=futr_exog, # [B, F, seq_len, 1+H]\n",
" hist_exog=hist_exog, # [B, C, seq_len]\n",
" stat_exog=stat_exog) # [B, S]\n",
"\n",
" # Model Predictions\n",
" output = self(windows_batch) # tuple([B, seq_len, H], ...)\n",
" if self.loss.is_distribution_output:\n",
" _, y_loc, y_scale = self._inv_normalization(y_hat=output[0],\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" B = output[0].size()[0]\n",
" T = output[0].size()[1]\n",
" H = output[0].size()[2]\n",
" output = [arg.reshape(-1, *(arg.size()[2:])) for arg in output]\n",
" y_loc = y_loc.repeat_interleave(repeats=T, dim=0).squeeze(-1)\n",
" y_scale = y_scale.repeat_interleave(repeats=T, dim=0).squeeze(-1)\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" _, sample_mean, quants = self.loss.sample(distr_args=distr_args)\n",
" y_hat = torch.concat((sample_mean, quants), axis=2)\n",
" y_hat = y_hat.view(B, T, H, -1)\n",
"\n",
" if self.loss.return_params:\n",
" distr_args = torch.stack(distr_args, dim=-1)\n",
" distr_args = torch.reshape(distr_args, (B, T, H, -1))\n",
" y_hat = torch.concat((y_hat, distr_args), axis=3)\n",
" else:\n",
" y_hat, _, _ = self._inv_normalization(y_hat=output,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" return y_hat\n",
"\n",
" def fit(self, dataset, val_size=0, test_size=0, random_seed=None, distributed_config=None):\n",
" \"\"\" Fit.\n",
"\n",
" The `fit` method, optimizes the neural network's weights using the\n",
" initialization parameters (`learning_rate`, `batch_size`, ...)\n",
" and the `loss` function as defined during the initialization. \n",
" Within `fit` we use a PyTorch Lightning `Trainer` that\n",
" inherits the initialization's `self.trainer_kwargs`, to customize\n",
" its inputs, see [PL's trainer arguments](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).\n",
"\n",
" The method is designed to be compatible with SKLearn-like classes\n",
" and in particular to be compatible with the StatsForecast library.\n",
"\n",
" By default the `model` is not saving training checkpoints to protect \n",
" disk memory, to get them change `enable_checkpointing=True` in `__init__`. \n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset`, see [documentation](https://nixtla.github.io/neuralforecast/tsdataset.html).<br>\n",
" `val_size`: int, validation size for temporal cross-validation.<br>\n",
" `test_size`: int, test size for temporal cross-validation.<br>\n",
" `random_seed`: int=None, random_seed for pytorch initializer and numpy generators, overwrites model.__init__'s.<br>\n",
" \"\"\"\n",
" return self._fit(\n",
" dataset=dataset,\n",
" batch_size=self.batch_size,\n",
" valid_batch_size=self.valid_batch_size,\n",
" val_size=val_size,\n",
" test_size=test_size,\n",
" random_seed=random_seed,\n",
" distributed_config=distributed_config,\n",
" )\n",
"\n",
" def predict(self, dataset, step_size=1,\n",
" random_seed=None, **data_module_kwargs):\n",
" \"\"\" Predict.\n",
"\n",
" Neural network prediction with PL's `Trainer` execution of `predict_step`.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset`, see [documentation](https://nixtla.github.io/neuralforecast/tsdataset.html).<br>\n",
" `step_size`: int=1, Step size between each window.<br>\n",
" `random_seed`: int=None, random_seed for pytorch initializer and numpy generators, overwrites model.__init__'s.<br>\n",
" `**data_module_kwargs`: PL's TimeSeriesDataModule args, see [documentation](https://pytorch-lightning.readthedocs.io/en/1.6.1/extensions/datamodules.html#using-a-datamodule).\n",
" \"\"\"\n",
" self._check_exog(dataset)\n",
" self._restart_seed(random_seed)\n",
"\n",
" if step_size > 1:\n",
" raise Exception('Recurrent models do not support step_size > 1')\n",
"\n",
" # fcsts (window, batch, h)\n",
" # Protect when case of multiple gpu. PL does not support return preds with multiple gpu.\n",
" pred_trainer_kwargs = self.trainer_kwargs.copy()\n",
" if (pred_trainer_kwargs.get('accelerator', None) == \"gpu\") and (torch.cuda.device_count() > 1):\n",
" pred_trainer_kwargs['devices'] = [0]\n",
"\n",
" trainer = pl.Trainer(**pred_trainer_kwargs)\n",
"\n",
" datamodule = TimeSeriesDataModule(\n",
" dataset=dataset,\n",
" valid_batch_size=self.valid_batch_size,\n",
" num_workers=self.num_workers_loader,\n",
" **data_module_kwargs\n",
" )\n",
" fcsts = trainer.predict(self, datamodule=datamodule)\n",
" if self.test_size > 0:\n",
" # Remove warmup windows (from train and validation)\n",
" # [N,T,H,output], avoid indexing last dim for univariate output compatibility\n",
" fcsts = torch.vstack([fcst[:, -(1+self.test_size-self.h):,:] for fcst in fcsts])\n",
" fcsts = fcsts.numpy().flatten()\n",
" fcsts = fcsts.reshape(-1, len(self.loss.output_names))\n",
" else:\n",
" fcsts = torch.vstack([fcst[:,-1:,:] for fcst in fcsts]).numpy().flatten()\n",
" fcsts = fcsts.reshape(-1, len(self.loss.output_names))\n",
" return fcsts"
]
},
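{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch (illustrative only, assumed toy values) of the right-padding performed by `self.padder` in `__init__` above: appending `h` zeros lets the unfolded `1+h` windows reach the end of the series, so the final observation still appears at the start of a horizon slice, as in the `y=[1,2,3,4,5], h=3 -> [5,0,0]` example from the comment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Illustrative sketch (assumed toy values) of the train-time padding and 1+h unfolding.\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"h = 3\n",
"y = torch.tensor([[[1., 2., 3., 4., 5.]]])           # [B=1, C=1, T=5]\n",
"padder = nn.ConstantPad1d(padding=(0, h), value=0)\n",
"padded = padder(y)                                   # [1, 1, 8]: 1, 2, 3, 4, 5, 0, 0, 0\n",
"\n",
"windows = padded.unfold(dimension=-1, size=1 + h, step=1)  # [1, 1, 5, 1+h]\n",
"print(windows[0, 0, :, -h:])  # horizon targets per window; the last usable one is [5., 0., 0.]"
]
},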
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseRecurrent, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseRecurrent.fit, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseRecurrent.predict, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from neuralforecast.losses.pytorch import MAE\n",
"from neuralforecast.utils import AirPassengersDF\n",
"from neuralforecast.tsdataset import TimeSeriesDataset, TimeSeriesDataModule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# add h=0,1 unit test for _parse_windows \n",
"# Declare batch\n",
"AirPassengersDF['x'] = np.array(len(AirPassengersDF))\n",
"AirPassengersDF['x2'] = np.array(len(AirPassengersDF)) * 2\n",
"dataset, indices, dates, ds = TimeSeriesDataset.from_df(df=AirPassengersDF)\n",
"data = TimeSeriesDataModule(dataset=dataset, batch_size=1, drop_last=True)\n",
"\n",
"train_loader = data.train_dataloader()\n",
"batch = next(iter(train_loader))\n",
"\n",
"# Test that hist_exog_list and futr_exog_list correctly filter data that is sent to scaler.\n",
"baserecurrent = BaseRecurrent(h=12,\n",
" input_size=117,\n",
" hist_exog_list=['x', 'x2'],\n",
" futr_exog_list=['x'],\n",
" loss=MAE(),\n",
" valid_loss=MAE(),\n",
" learning_rate=0.001,\n",
" max_steps=1,\n",
" val_check_steps=0,\n",
" batch_size=1,\n",
" valid_batch_size=1,\n",
" windows_batch_size=10,\n",
" inference_input_size=2,\n",
" start_padding_enabled=True)\n",
"\n",
"windows = baserecurrent._create_windows(batch, step='train')\n",
"\n",
"temporal_cols = windows['temporal_cols'].copy() # B, L+H, C\n",
"temporal_data_cols = baserecurrent._get_temporal_exogenous_cols(temporal_cols=temporal_cols)\n",
"\n",
"test_eq(set(temporal_data_cols), set(['x', 'x2']))\n",
"test_eq(windows['temporal'].shape, torch.Size([1,len(['y', 'x', 'x2', 'available_mask']),117,12+1]))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "524620c1",
"metadata": {},
"outputs": [],
"source": [
"#| default_exp common._base_windows"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15392f6f",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"id": "1e0f9607-d12d-44e5-b2be-91a57a0bca79",
"metadata": {},
"source": [
"# BaseWindows\n",
"\n",
"> The `BaseWindows` class contains standard methods shared across window-based neural networks; in contrast to recurrent neural networks these models commit to a fixed sequence length input. The class is represented by `MLP`, and other more sophisticated architectures like `NBEATS`, and `NHITS`."
]
},
{
"cell_type": "markdown",
"id": "1730a556-1574-40ad-92a2-23b924ceb398",
"metadata": {},
"source": [
"The standard methods include data preprocessing `_normalization`, optimization utilities like parameter initialization, `training_step`, `validation_step`, and shared `fit` and `predict` methods.These shared methods enable all the `neuralforecast.models` compatibility with the `core.NeuralForecast` wrapper class. "
]
},
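{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch (illustrative only, assumed toy sizes) of how `BaseWindows._create_windows` flattens windows across series: unfolding a `[B, C, T]` tensor over time gives `[B, C, Ws, L+H]`, which is then permuted and reshaped into the `[B*Ws, L+H, C]` layout that the training step samples from."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Illustrative sketch (assumed toy sizes) of the [B, C, T] -> [B*Ws, L+H, C] reshaping.\n",
"import torch\n",
"\n",
"B, C, T = 2, 3, 16\n",
"input_size, h, step_size = 5, 3, 1\n",
"window_size = input_size + h\n",
"\n",
"temporal = torch.randn(B, C, T)\n",
"windows = temporal.unfold(dimension=-1, size=window_size, step=step_size)\n",
"print(windows.shape)   # [B, C, Ws, L+H] -> torch.Size([2, 3, 9, 8])\n",
"\n",
"windows = windows.permute(0, 2, 3, 1).contiguous()\n",
"windows = windows.reshape(-1, window_size, C)\n",
"print(windows.shape)   # [B*Ws, L+H, C] -> torch.Size([18, 8, 3])"
]
},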
{
"cell_type": "code",
"execution_count": null,
"id": "2508f7a9-1433-4ad8-8f2f-0078c6ed6c3c",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from fastcore.test import test_eq\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44065066-e72a-431f-938f-1528adef9fe8",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import numpy as np\n",
"import torch\n",
"import torch.nn as nn\n",
"import pytorch_lightning as pl\n",
"\n",
"from neuralforecast.common._base_model import BaseModel\n",
"from neuralforecast.common._scalers import TemporalNorm\n",
"from neuralforecast.tsdataset import TimeSeriesDataModule\n",
"from neuralforecast.utils import get_indexer_raise_missing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce70cd14-ecb1-4205-8511-fecbd26c8408",
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class BaseWindows(BaseModel):\n",
" \"\"\" Base Windows\n",
" \n",
" Base class for all windows-based models. The forecasts are produced separately \n",
" for each window, which are randomly sampled during training.\n",
" \n",
" This class implements the basic functionality for all windows-based models, including:\n",
" - PyTorch Lightning's methods training_step, validation_step, predict_step.<br>\n",
" - fit and predict methods used by NeuralForecast.core class.<br>\n",
" - sampling and wrangling methods to generate windows.\n",
" \"\"\"\n",
" def __init__(self,\n",
" h,\n",
" input_size,\n",
" loss,\n",
" valid_loss,\n",
" learning_rate,\n",
" max_steps,\n",
" val_check_steps,\n",
" batch_size,\n",
" valid_batch_size,\n",
" windows_batch_size,\n",
" inference_windows_batch_size,\n",
" start_padding_enabled,\n",
" step_size=1,\n",
" num_lr_decays=0,\n",
" early_stop_patience_steps=-1,\n",
" scaler_type='identity',\n",
" futr_exog_list=None,\n",
" hist_exog_list=None,\n",
" stat_exog_list=None,\n",
" exclude_insample_y=False,\n",
" num_workers_loader=0,\n",
" drop_last_loader=False,\n",
" random_seed=1,\n",
" alias=None,\n",
" optimizer=None,\n",
" optimizer_kwargs=None,\n",
" **trainer_kwargs):\n",
" super().__init__(\n",
" random_seed=random_seed,\n",
" loss=loss,\n",
" valid_loss=valid_loss,\n",
" optimizer=optimizer,\n",
" optimizer_kwargs=optimizer_kwargs,\n",
" futr_exog_list=futr_exog_list,\n",
" hist_exog_list=hist_exog_list,\n",
" stat_exog_list=stat_exog_list,\n",
" max_steps=max_steps,\n",
" early_stop_patience_steps=early_stop_patience_steps, \n",
" **trainer_kwargs,\n",
" )\n",
"\n",
" # Padder to complete train windows, \n",
" # example y=[1,2,3,4,5] h=3 -> last y_output = [5,0,0]\n",
" self.h = h\n",
" self.input_size = input_size\n",
" self.windows_batch_size = windows_batch_size\n",
" self.start_padding_enabled = start_padding_enabled\n",
" if start_padding_enabled:\n",
" self.padder_train = nn.ConstantPad1d(padding=(self.input_size-1, self.h), value=0)\n",
" else:\n",
" self.padder_train = nn.ConstantPad1d(padding=(0, self.h), value=0)\n",
"\n",
" # Batch sizes\n",
" self.batch_size = batch_size\n",
" if valid_batch_size is None:\n",
" self.valid_batch_size = batch_size\n",
" else:\n",
" self.valid_batch_size = valid_batch_size\n",
" if inference_windows_batch_size is None:\n",
" self.inference_windows_batch_size = windows_batch_size\n",
" else:\n",
" self.inference_windows_batch_size = inference_windows_batch_size\n",
"\n",
" # Optimization \n",
" self.learning_rate = learning_rate\n",
" self.max_steps = max_steps\n",
" self.num_lr_decays = num_lr_decays\n",
" self.lr_decay_steps = (\n",
" max(max_steps // self.num_lr_decays, 1) if self.num_lr_decays > 0 else 10e7\n",
" )\n",
" self.early_stop_patience_steps = early_stop_patience_steps\n",
" self.val_check_steps = val_check_steps\n",
" self.windows_batch_size = windows_batch_size\n",
" self.step_size = step_size\n",
" \n",
" self.exclude_insample_y = exclude_insample_y\n",
"\n",
" # Scaler\n",
" self.scaler = TemporalNorm(\n",
" scaler_type=scaler_type,\n",
" dim=1, # Time dimension is 1.\n",
" num_features=1+len(self.hist_exog_list)+len(self.futr_exog_list)\n",
" )\n",
"\n",
" # Fit arguments\n",
" self.val_size = 0\n",
" self.test_size = 0\n",
"\n",
" # Model state\n",
" self.decompose_forecast = False\n",
"\n",
" # DataModule arguments\n",
" self.num_workers_loader = num_workers_loader\n",
" self.drop_last_loader = drop_last_loader\n",
" # used by on_validation_epoch_end hook\n",
" self.validation_step_outputs = []\n",
" self.alias = alias\n",
"\n",
" def _create_windows(self, batch, step, w_idxs=None):\n",
" # Parse common data\n",
" window_size = self.input_size + self.h\n",
" temporal_cols = batch['temporal_cols']\n",
" temporal = batch['temporal']\n",
"\n",
" if step == 'train':\n",
" if self.val_size + self.test_size > 0:\n",
" cutoff = -self.val_size - self.test_size\n",
" temporal = temporal[:, :, :cutoff]\n",
"\n",
" temporal = self.padder_train(temporal)\n",
" if temporal.shape[-1] < window_size:\n",
" raise Exception('Time series is too short for training, consider setting a smaller input size or set start_padding_enabled=True')\n",
" windows = temporal.unfold(dimension=-1, \n",
" size=window_size, \n",
" step=self.step_size)\n",
"\n",
" # [B, C, Ws, L+H] 0, 1, 2, 3\n",
" # -> [B * Ws, L+H, C] 0, 2, 3, 1\n",
" windows_per_serie = windows.shape[2]\n",
" windows = windows.permute(0, 2, 3, 1).contiguous()\n",
" windows = windows.reshape(-1, window_size, len(temporal_cols))\n",
"\n",
" # Sample and Available conditions\n",
" available_idx = temporal_cols.get_loc('available_mask')\n",
" available_condition = windows[:, :self.input_size, available_idx]\n",
" available_condition = torch.sum(available_condition, axis=1)\n",
" final_condition = (available_condition > 0)\n",
" if self.h > 0:\n",
" sample_condition = windows[:, self.input_size:, available_idx]\n",
" sample_condition = torch.sum(sample_condition, axis=1)\n",
" final_condition = (sample_condition > 0) & (available_condition > 0)\n",
" windows = windows[final_condition]\n",
"\n",
" # Parse Static data to match windows\n",
" # [B, S_in] -> [B, Ws, S_in] -> [B*Ws, S_in]\n",
" static = batch.get('static', None)\n",
" static_cols=batch.get('static_cols', None)\n",
" if static is not None:\n",
" static = torch.repeat_interleave(static, \n",
" repeats=windows_per_serie, dim=0)\n",
" static = static[final_condition]\n",
"\n",
" # Protection of empty windows\n",
" if final_condition.sum() == 0:\n",
" raise Exception('No windows available for training')\n",
"\n",
" # Sample windows\n",
" n_windows = len(windows)\n",
" if self.windows_batch_size is not None:\n",
" w_idxs = np.random.choice(n_windows, \n",
" size=self.windows_batch_size,\n",
" replace=(n_windows < self.windows_batch_size))\n",
" windows = windows[w_idxs]\n",
" \n",
" if static is not None:\n",
" static = static[w_idxs]\n",
"\n",
" # think about interaction available * sample mask\n",
" # [B, C, Ws, L+H]\n",
" windows_batch = dict(temporal=windows,\n",
" temporal_cols=temporal_cols,\n",
" static=static,\n",
" static_cols=static_cols)\n",
" return windows_batch\n",
"\n",
" elif step in ['predict', 'val']:\n",
"\n",
" if step == 'predict':\n",
" initial_input = temporal.shape[-1] - self.test_size\n",
" if initial_input <= self.input_size: # There is not enough data to predict first timestamp\n",
" padder_left = nn.ConstantPad1d(padding=(self.input_size-initial_input, 0), value=0)\n",
" temporal = padder_left(temporal)\n",
" predict_step_size = self.predict_step_size\n",
" cutoff = - self.input_size - self.test_size\n",
" temporal = temporal[:, :, cutoff:]\n",
"\n",
" elif step == 'val':\n",
" predict_step_size = self.step_size\n",
" cutoff = -self.input_size - self.val_size - self.test_size\n",
" if self.test_size > 0:\n",
" temporal = batch['temporal'][:, :, cutoff:-self.test_size]\n",
" else:\n",
" temporal = batch['temporal'][:, :, cutoff:]\n",
" if temporal.shape[-1] < window_size:\n",
" initial_input = temporal.shape[-1] - self.val_size\n",
" padder_left = nn.ConstantPad1d(padding=(self.input_size-initial_input, 0), value=0)\n",
" temporal = padder_left(temporal)\n",
"\n",
" if (step=='predict') and (self.test_size==0) and (len(self.futr_exog_list)==0):\n",
" padder_right = nn.ConstantPad1d(padding=(0, self.h), value=0)\n",
" temporal = padder_right(temporal)\n",
"\n",
" windows = temporal.unfold(dimension=-1,\n",
" size=window_size,\n",
" step=predict_step_size)\n",
"\n",
" # [batch, channels, windows, window_size] 0, 1, 2, 3\n",
" # -> [batch * windows, window_size, channels] 0, 2, 3, 1\n",
" windows_per_serie = windows.shape[2]\n",
" windows = windows.permute(0, 2, 3, 1).contiguous()\n",
" windows = windows.reshape(-1, window_size, len(temporal_cols))\n",
"\n",
" static = batch.get('static', None)\n",
" static_cols=batch.get('static_cols', None)\n",
" if static is not None:\n",
" static = torch.repeat_interleave(static, \n",
" repeats=windows_per_serie, dim=0)\n",
" \n",
" # Sample windows for batched prediction\n",
" if w_idxs is not None:\n",
" windows = windows[w_idxs]\n",
" if static is not None:\n",
" static = static[w_idxs]\n",
" \n",
" windows_batch = dict(temporal=windows,\n",
" temporal_cols=temporal_cols,\n",
" static=static,\n",
" static_cols=static_cols)\n",
" return windows_batch\n",
" else:\n",
" raise ValueError(f'Unknown step {step}')\n",
"\n",
" def _normalization(self, windows, y_idx):\n",
" # windows are already filtered by train/validation/test\n",
" # from the `create_windows_method` nor leakage risk\n",
" temporal = windows['temporal'] # B, L+H, C\n",
" temporal_cols = windows['temporal_cols'].copy() # B, L+H, C\n",
"\n",
" # To avoid leakage uses only the lags\n",
" #temporal_data_cols = temporal_cols.drop('available_mask').tolist()\n",
" temporal_data_cols = self._get_temporal_exogenous_cols(temporal_cols=temporal_cols)\n",
" temporal_idxs = get_indexer_raise_missing(temporal_cols, temporal_data_cols)\n",
" temporal_idxs = np.append(y_idx, temporal_idxs)\n",
" temporal_data = temporal[:, :, temporal_idxs]\n",
" temporal_mask = temporal[:, :, temporal_cols.get_loc('available_mask')].clone()\n",
" if self.h > 0:\n",
" temporal_mask[:, -self.h:] = 0.0\n",
"\n",
" # Normalize. self.scaler stores the shift and scale for inverse transform\n",
" temporal_mask = temporal_mask.unsqueeze(-1) # Add channel dimension for scaler.transform.\n",
" temporal_data = self.scaler.transform(x=temporal_data, mask=temporal_mask)\n",
"\n",
" # Replace values in windows dict\n",
" temporal[:, :, temporal_idxs] = temporal_data\n",
" windows['temporal'] = temporal\n",
"\n",
" return windows\n",
"\n",
" def _inv_normalization(self, y_hat, temporal_cols, y_idx):\n",
" # Receives window predictions [B, H, output]\n",
" # Broadcasts outputs and inverts normalization\n",
"\n",
" # Add C dimension\n",
" if y_hat.ndim == 2:\n",
" remove_dimension = True\n",
" y_hat = y_hat.unsqueeze(-1)\n",
" else:\n",
" remove_dimension = False\n",
"\n",
" y_scale = self.scaler.x_scale[:, :, [y_idx]]\n",
" y_loc = self.scaler.x_shift[:, :, [y_idx]]\n",
"\n",
" y_scale = torch.repeat_interleave(y_scale, repeats=y_hat.shape[-1], dim=-1).to(y_hat.device)\n",
" y_loc = torch.repeat_interleave(y_loc, repeats=y_hat.shape[-1], dim=-1).to(y_hat.device)\n",
"\n",
" y_hat = self.scaler.inverse_transform(z=y_hat, x_scale=y_scale, x_shift=y_loc)\n",
" y_loc = y_loc.to(y_hat.device)\n",
" y_scale = y_scale.to(y_hat.device)\n",
" \n",
" if remove_dimension:\n",
" y_hat = y_hat.squeeze(-1)\n",
" y_loc = y_loc.squeeze(-1)\n",
" y_scale = y_scale.squeeze(-1)\n",
"\n",
" return y_hat, y_loc, y_scale\n",
"\n",
" def _parse_windows(self, batch, windows):\n",
" # Filter insample lags from outsample horizon\n",
" y_idx = batch['y_idx']\n",
" mask_idx = batch['temporal_cols'].get_loc('available_mask')\n",
"\n",
" insample_y = windows['temporal'][:, :self.input_size, y_idx]\n",
" insample_mask = windows['temporal'][:, :self.input_size, mask_idx]\n",
"\n",
" # Declare additional information\n",
" outsample_y = None\n",
" outsample_mask = None\n",
" hist_exog = None\n",
" futr_exog = None\n",
" stat_exog = None\n",
"\n",
" if self.h > 0:\n",
" outsample_y = windows['temporal'][:, self.input_size:, y_idx]\n",
" outsample_mask = windows['temporal'][:, self.input_size:, mask_idx]\n",
"\n",
" if len(self.hist_exog_list):\n",
" hist_exog_idx = get_indexer_raise_missing(windows['temporal_cols'], self.hist_exog_list)\n",
" hist_exog = windows['temporal'][:, :self.input_size, hist_exog_idx]\n",
"\n",
" if len(self.futr_exog_list):\n",
" futr_exog_idx = get_indexer_raise_missing(windows['temporal_cols'], self.futr_exog_list)\n",
" futr_exog = windows['temporal'][:, :, futr_exog_idx]\n",
"\n",
" if len(self.stat_exog_list):\n",
" static_idx = get_indexer_raise_missing(windows['static_cols'], self.stat_exog_list)\n",
" stat_exog = windows['static'][:, static_idx]\n",
"\n",
" # TODO: think a better way of removing insample_y features\n",
" if self.exclude_insample_y:\n",
" insample_y = insample_y * 0\n",
"\n",
" return insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog\n",
"\n",
" def training_step(self, batch, batch_idx):\n",
" # Create and normalize windows [Ws, L+H, C]\n",
" windows = self._create_windows(batch, step='train')\n",
" y_idx = batch['y_idx']\n",
" original_outsample_y = torch.clone(windows['temporal'][:,-self.h:,y_idx])\n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
"\n",
" windows_batch = dict(insample_y=insample_y, # [Ws, L]\n",
" insample_mask=insample_mask, # [Ws, L]\n",
" futr_exog=futr_exog, # [Ws, L+H]\n",
" hist_exog=hist_exog, # [Ws, L]\n",
" stat_exog=stat_exog) # [Ws, 1]\n",
"\n",
" # Model Predictions\n",
" output = self(windows_batch)\n",
" if self.loss.is_distribution_output:\n",
" _, y_loc, y_scale = self._inv_normalization(y_hat=outsample_y,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" outsample_y = original_outsample_y\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" loss = self.loss(y=outsample_y, distr_args=distr_args, mask=outsample_mask)\n",
" else:\n",
" loss = self.loss(y=outsample_y, y_hat=output, mask=outsample_mask)\n",
"\n",
" if torch.isnan(loss):\n",
" print('Model Parameters', self.hparams)\n",
" print('insample_y', torch.isnan(insample_y).sum())\n",
" print('outsample_y', torch.isnan(outsample_y).sum())\n",
" print('output', torch.isnan(output).sum())\n",
" raise Exception('Loss is NaN, training stopped.')\n",
"\n",
" self.log(\n",
" 'train_loss',\n",
" loss.item(),\n",
" batch_size=outsample_y.size(0),\n",
" prog_bar=True,\n",
" on_epoch=True,\n",
" )\n",
" self.train_trajectories.append((self.global_step, loss.item()))\n",
" return loss\n",
"\n",
" def _compute_valid_loss(self, outsample_y, output, outsample_mask, temporal_cols, y_idx):\n",
" if self.loss.is_distribution_output:\n",
" _, y_loc, y_scale = self._inv_normalization(y_hat=outsample_y,\n",
" temporal_cols=temporal_cols,\n",
" y_idx=y_idx)\n",
" distr_args = self.loss.scale_decouple(output=output, loc=y_loc, scale=y_scale)\n",
" _, sample_mean, quants = self.loss.sample(distr_args=distr_args)\n",
"\n",
" if str(type(self.valid_loss)) in\\\n",
" [\"<class 'neuralforecast.losses.pytorch.sCRPS'>\", \"<class 'neuralforecast.losses.pytorch.MQLoss'>\"]:\n",
" output = quants\n",
" elif str(type(self.valid_loss)) in [\"<class 'neuralforecast.losses.pytorch.relMSE'>\"]:\n",
" output = torch.unsqueeze(sample_mean, dim=-1) # [N,H,1] -> [N,H]\n",
"\n",
" # Validation Loss evaluation\n",
" if self.valid_loss.is_distribution_output:\n",
" valid_loss = self.valid_loss(y=outsample_y, distr_args=distr_args, mask=outsample_mask)\n",
" else:\n",
" output, _, _ = self._inv_normalization(y_hat=output,\n",
" temporal_cols=temporal_cols,\n",
" y_idx=y_idx)\n",
" valid_loss = self.valid_loss(y=outsample_y, y_hat=output, mask=outsample_mask)\n",
" return valid_loss\n",
" \n",
" def validation_step(self, batch, batch_idx):\n",
" if self.val_size == 0:\n",
" return np.nan\n",
"\n",
" # TODO: Hack to compute number of windows\n",
" windows = self._create_windows(batch, step='val')\n",
" n_windows = len(windows['temporal'])\n",
" y_idx = batch['y_idx']\n",
"\n",
" # Number of windows in batch\n",
" windows_batch_size = self.inference_windows_batch_size\n",
" if windows_batch_size < 0:\n",
" windows_batch_size = n_windows\n",
" n_batches = int(np.ceil(n_windows/windows_batch_size))\n",
"\n",
" valid_losses = []\n",
" batch_sizes = []\n",
" for i in range(n_batches):\n",
" # Create and normalize windows [Ws, L+H, C]\n",
" w_idxs = np.arange(i*windows_batch_size, \n",
" min((i+1)*windows_batch_size, n_windows))\n",
" windows = self._create_windows(batch, step='val', w_idxs=w_idxs)\n",
" original_outsample_y = torch.clone(windows['temporal'][:,-self.h:,y_idx])\n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, _, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
" windows_batch = dict(insample_y=insample_y, # [Ws, L]\n",
" insample_mask=insample_mask, # [Ws, L]\n",
" futr_exog=futr_exog, # [Ws, L+H]\n",
" hist_exog=hist_exog, # [Ws, L]\n",
" stat_exog=stat_exog) # [Ws, 1]\n",
" \n",
" # Model Predictions\n",
" output_batch = self(windows_batch)\n",
" valid_loss_batch = self._compute_valid_loss(outsample_y=original_outsample_y,\n",
" output=output_batch, outsample_mask=outsample_mask,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=batch['y_idx'])\n",
" valid_losses.append(valid_loss_batch)\n",
" batch_sizes.append(len(output_batch))\n",
" \n",
" valid_loss = torch.stack(valid_losses)\n",
" batch_sizes = torch.tensor(batch_sizes, device=valid_loss.device)\n",
" batch_size = torch.sum(batch_sizes)\n",
" valid_loss = torch.sum(valid_loss * batch_sizes) / batch_size\n",
"\n",
" if torch.isnan(valid_loss):\n",
" raise Exception('Loss is NaN, training stopped.')\n",
"\n",
" self.log(\n",
" 'valid_loss',\n",
" valid_loss.item(),\n",
" batch_size=batch_size,\n",
" prog_bar=True,\n",
" on_epoch=True,\n",
" )\n",
" self.validation_step_outputs.append(valid_loss)\n",
" return valid_loss\n",
"\n",
" def predict_step(self, batch, batch_idx):\n",
"\n",
" # TODO: Hack to compute number of windows\n",
" windows = self._create_windows(batch, step='predict')\n",
" n_windows = len(windows['temporal'])\n",
" y_idx = batch['y_idx']\n",
"\n",
" # Number of windows in batch\n",
" windows_batch_size = self.inference_windows_batch_size\n",
" if windows_batch_size < 0:\n",
" windows_batch_size = n_windows\n",
" n_batches = int(np.ceil(n_windows/windows_batch_size))\n",
"\n",
" y_hats = []\n",
" for i in range(n_batches):\n",
" # Create and normalize windows [Ws, L+H, C]\n",
" w_idxs = np.arange(i*windows_batch_size, \n",
" min((i+1)*windows_batch_size, n_windows))\n",
" windows = self._create_windows(batch, step='predict', w_idxs=w_idxs)\n",
" windows = self._normalization(windows=windows, y_idx=y_idx)\n",
"\n",
" # Parse windows\n",
" insample_y, insample_mask, _, _, \\\n",
" hist_exog, futr_exog, stat_exog = self._parse_windows(batch, windows)\n",
" windows_batch = dict(insample_y=insample_y, # [Ws, L]\n",
" insample_mask=insample_mask, # [Ws, L]\n",
" futr_exog=futr_exog, # [Ws, L+H]\n",
" hist_exog=hist_exog, # [Ws, L]\n",
" stat_exog=stat_exog) # [Ws, 1]\n",
" \n",
" # Model Predictions\n",
" output_batch = self(windows_batch)\n",
" # Inverse normalization and sampling\n",
" if self.loss.is_distribution_output:\n",
" _, y_loc, y_scale = self._inv_normalization(y_hat=output_batch[0],\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" distr_args = self.loss.scale_decouple(output=output_batch, loc=y_loc, scale=y_scale)\n",
" _, sample_mean, quants = self.loss.sample(distr_args=distr_args)\n",
" y_hat = torch.concat((sample_mean, quants), axis=2)\n",
"\n",
" if self.loss.return_params:\n",
" distr_args = torch.stack(distr_args, dim=-1)\n",
" distr_args = torch.reshape(distr_args, (len(windows[\"temporal\"]), self.h, -1))\n",
" y_hat = torch.concat((y_hat, distr_args), axis=2)\n",
" else:\n",
" y_hat, _, _ = self._inv_normalization(y_hat=output_batch,\n",
" temporal_cols=batch['temporal_cols'],\n",
" y_idx=y_idx)\n",
" y_hats.append(y_hat)\n",
" y_hat = torch.cat(y_hats, dim=0)\n",
" return y_hat\n",
" \n",
" def fit(self, dataset, val_size=0, test_size=0, random_seed=None, distributed_config=None):\n",
" \"\"\" Fit.\n",
"\n",
" The `fit` method, optimizes the neural network's weights using the\n",
" initialization parameters (`learning_rate`, `windows_batch_size`, ...)\n",
" and the `loss` function as defined during the initialization. \n",
" Within `fit` we use a PyTorch Lightning `Trainer` that\n",
" inherits the initialization's `self.trainer_kwargs`, to customize\n",
" its inputs, see [PL's trainer arguments](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.trainer.trainer.Trainer.html?highlight=trainer).\n",
"\n",
" The method is designed to be compatible with SKLearn-like classes\n",
" and in particular to be compatible with the StatsForecast library.\n",
"\n",
" By default the `model` is not saving training checkpoints to protect \n",
" disk memory, to get them change `enable_checkpointing=True` in `__init__`.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset`, see [documentation](https://nixtla.github.io/neuralforecast/tsdataset.html).<br>\n",
" `val_size`: int, validation size for temporal cross-validation.<br>\n",
" `random_seed`: int=None, random_seed for pytorch initializer and numpy generators, overwrites model.__init__'s.<br>\n",
" `test_size`: int, test size for temporal cross-validation.<br>\n",
" \"\"\"\n",
" return self._fit(\n",
" dataset=dataset,\n",
" batch_size=self.batch_size,\n",
" valid_batch_size=self.valid_batch_size,\n",
" val_size=val_size,\n",
" test_size=test_size,\n",
" random_seed=random_seed,\n",
" distributed_config=distributed_config,\n",
" )\n",
"\n",
" def predict(self, dataset, test_size=None, step_size=1,\n",
" random_seed=None, **data_module_kwargs):\n",
" \"\"\" Predict.\n",
"\n",
" Neural network prediction with PL's `Trainer` execution of `predict_step`.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset`, see [documentation](https://nixtla.github.io/neuralforecast/tsdataset.html).<br>\n",
" `test_size`: int=None, test size for temporal cross-validation.<br>\n",
" `step_size`: int=1, Step size between each window.<br>\n",
" `random_seed`: int=None, random_seed for pytorch initializer and numpy generators, overwrites model.__init__'s.<br>\n",
" `**data_module_kwargs`: PL's TimeSeriesDataModule args, see [documentation](https://pytorch-lightning.readthedocs.io/en/1.6.1/extensions/datamodules.html#using-a-datamodule).\n",
" \"\"\"\n",
" self._check_exog(dataset)\n",
" self._restart_seed(random_seed)\n",
"\n",
" self.predict_step_size = step_size\n",
" self.decompose_forecast = False\n",
" datamodule = TimeSeriesDataModule(dataset=dataset,\n",
" valid_batch_size=self.valid_batch_size,\n",
" **data_module_kwargs)\n",
"\n",
" # Protect when case of multiple gpu. PL does not support return preds with multiple gpu.\n",
" pred_trainer_kwargs = self.trainer_kwargs.copy()\n",
" if (pred_trainer_kwargs.get('accelerator', None) == \"gpu\") and (torch.cuda.device_count() > 1):\n",
" pred_trainer_kwargs['devices'] = [0]\n",
"\n",
" trainer = pl.Trainer(**pred_trainer_kwargs)\n",
" fcsts = trainer.predict(self, datamodule=datamodule) \n",
" fcsts = torch.vstack(fcsts).numpy().flatten()\n",
" fcsts = fcsts.reshape(-1, len(self.loss.output_names))\n",
" return fcsts\n",
"\n",
" def decompose(self, dataset, step_size=1, random_seed=None, **data_module_kwargs):\n",
" \"\"\" Decompose Predictions.\n",
"\n",
" Decompose the predictions through the network's layers.\n",
" Available methods are `ESRNN`, `NHITS`, `NBEATS`, and `NBEATSx`.\n",
"\n",
" **Parameters:**<br>\n",
" `dataset`: NeuralForecast's `TimeSeriesDataset`, see [documentation here](https://nixtla.github.io/neuralforecast/tsdataset.html).<br>\n",
" `step_size`: int=1, step size between each window of temporal data.<br>\n",
" `**data_module_kwargs`: PL's TimeSeriesDataModule args, see [documentation](https://pytorch-lightning.readthedocs.io/en/1.6.1/extensions/datamodules.html#using-a-datamodule).\n",
" \"\"\"\n",
" # Restart random seed\n",
" if random_seed is None:\n",
" random_seed = self.random_seed\n",
" torch.manual_seed(random_seed)\n",
"\n",
" self.predict_step_size = step_size\n",
" self.decompose_forecast = True\n",
" datamodule = TimeSeriesDataModule(dataset=dataset,\n",
" valid_batch_size=self.valid_batch_size,\n",
" **data_module_kwargs)\n",
" trainer = pl.Trainer(**self.trainer_kwargs)\n",
" fcsts = trainer.predict(self, datamodule=datamodule)\n",
" self.decompose_forecast = False # Default decomposition back to false\n",
" return torch.vstack(fcsts).numpy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1712ea15",
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseWindows, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48063f70",
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseWindows.fit, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75529be6",
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseWindows.predict, title_level=3)"
]
},
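  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3b1c2d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "#| eval: false\n",
    "# A minimal usage sketch (not executed and not part of the exported module):\n",
    "# any concrete BaseWindows subclass exposes the same fit/predict API.\n",
    "# It assumes NHITS is importable from neuralforecast.models.\n",
    "from neuralforecast.models import NHITS\n",
    "from neuralforecast.utils import AirPassengersDF\n",
    "from neuralforecast.tsdataset import TimeSeriesDataset\n",
    "\n",
    "dataset, *_ = TimeSeriesDataset.from_df(AirPassengersDF)\n",
    "model = NHITS(h=12, input_size=24, max_steps=10)\n",
    "model.fit(dataset=dataset)              # optimizes the network weights\n",
    "y_hat = model.predict(dataset=dataset)  # numpy array reshaped to (-1, len(loss.output_names))\n",
    "print(y_hat.shape)"
   ]
  },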
{
"cell_type": "code",
"execution_count": null,
"id": "a1f8315d",
"metadata": {},
"outputs": [],
"source": [
"show_doc(BaseWindows.decompose, title_level=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8927f2e5-f376-4c99-bb8f-8cbb73efe01e",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from neuralforecast.losses.pytorch import MAE\n",
"from neuralforecast.utils import AirPassengersDF\n",
"from neuralforecast.tsdataset import TimeSeriesDataset, TimeSeriesDataModule"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61490e69-f014-4087-83c5-540d5bd7d458",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# add h=0,1 unit test for _parse_windows \n",
"# Declare batch\n",
"AirPassengersDF['x'] = np.array(len(AirPassengersDF))\n",
"AirPassengersDF['x2'] = np.array(len(AirPassengersDF)) * 2\n",
"dataset, indices, dates, ds = TimeSeriesDataset.from_df(df=AirPassengersDF)\n",
"data = TimeSeriesDataModule(dataset=dataset, batch_size=1, drop_last=True)\n",
"\n",
"train_loader = data.train_dataloader()\n",
"batch = next(iter(train_loader))\n",
"\n",
"# Instantiate BaseWindows to test _parse_windows method h in [0,1]\n",
"for h in [0, 1]:\n",
" basewindows = BaseWindows(h=h,\n",
" input_size=len(AirPassengersDF)-h,\n",
" hist_exog_list=['x'],\n",
" loss=MAE(),\n",
" valid_loss=MAE(),\n",
" learning_rate=0.001,\n",
" max_steps=1,\n",
" val_check_steps=0,\n",
" batch_size=1,\n",
" valid_batch_size=1,\n",
" windows_batch_size=1,\n",
" inference_windows_batch_size=1,\n",
" start_padding_enabled=False)\n",
"\n",
" windows = basewindows._create_windows(batch, step='train')\n",
" original_outsample_y = torch.clone(windows['temporal'][:,-basewindows.h:,0])\n",
" windows = basewindows._normalization(windows=windows, y_idx=0)\n",
"\n",
" insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = basewindows._parse_windows(batch, windows)\n",
"\n",
" # Check equality of parsed and original insample_y\n",
" parsed_insample_y = insample_y.numpy().flatten()\n",
" original_insample_y = AirPassengersDF.y.values\n",
" test_eq(parsed_insample_y, original_insample_y[:basewindows.input_size])\n",
"\n",
" # Check equality of parsed and original hist_exog\n",
" parsed_hist_exog = hist_exog.numpy().flatten()\n",
" original_hist_exog = AirPassengersDF.x.values\n",
" test_eq(parsed_hist_exog, original_hist_exog[:basewindows.input_size])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "86ab58a9",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"# Test that start_padding_enabled=True solves the problem of short series\n",
"h = 12\n",
"basewindows = BaseWindows(h=h,\n",
" input_size=500,\n",
" hist_exog_list=['x'],\n",
" loss=MAE(),\n",
" valid_loss=MAE(),\n",
" learning_rate=0.001,\n",
" max_steps=1,\n",
" val_check_steps=0,\n",
" batch_size=1,\n",
" valid_batch_size=1,\n",
" windows_batch_size=10,\n",
" inference_windows_batch_size=2,\n",
" start_padding_enabled=True)\n",
"\n",
"windows = basewindows._create_windows(batch, step='train')\n",
"windows = basewindows._normalization(windows=windows, y_idx=0)\n",
"insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = basewindows._parse_windows(batch, windows)\n",
"\n",
"basewindows.val_size = 12\n",
"windows = basewindows._create_windows(batch, step='val')\n",
"windows = basewindows._normalization(windows=windows, y_idx=0)\n",
"insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = basewindows._parse_windows(batch, windows)\n",
"\n",
"basewindows.test_size = 12\n",
"basewindows.predict_step_size = 1\n",
"windows = basewindows._create_windows(batch, step='predict')\n",
"windows = basewindows._normalization(windows=windows, y_idx=0)\n",
"insample_y, insample_mask, outsample_y, outsample_mask, \\\n",
" hist_exog, futr_exog, stat_exog = basewindows._parse_windows(batch, windows)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54d2e850",
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"\n",
"# Test that hist_exog_list and futr_exog_list correctly filter data.\n",
"# that is sent to scaler.\n",
"basewindows = BaseWindows(h=12,\n",
" input_size=500,\n",
" hist_exog_list=['x', 'x2'],\n",
" futr_exog_list=['x'],\n",
" loss=MAE(),\n",
" valid_loss=MAE(),\n",
" learning_rate=0.001,\n",
" max_steps=1,\n",
" val_check_steps=0,\n",
" batch_size=1,\n",
" valid_batch_size=1,\n",
" windows_batch_size=10,\n",
" inference_windows_batch_size=2,\n",
" start_padding_enabled=True)\n",
"\n",
"windows = basewindows._create_windows(batch, step='train')\n",
"\n",
"temporal_cols = windows['temporal_cols'].copy() # B, L+H, C\n",
"temporal_data_cols = basewindows._get_temporal_exogenous_cols(temporal_cols=temporal_cols)\n",
"\n",
"test_eq(set(temporal_data_cols), set(['x', 'x2']))\n",
"test_eq(windows['temporal'].shape, torch.Size([10,500+12,len(['y', 'x', 'x2', 'available_mask'])]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf493ff9",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp common._modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NN Modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import math\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"from nbdev.showdoc import show_doc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"ACTIVATIONS = ['ReLU','Softplus','Tanh','SELU','LeakyReLU','PReLU','Sigmoid']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. MLP\n",
"\n",
"Multi-Layer Perceptron"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class MLP(nn.Module):\n",
" \"\"\"Multi-Layer Perceptron Class\n",
"\n",
" **Parameters:**<br>\n",
" `in_features`: int, dimension of input.<br>\n",
" `out_features`: int, dimension of output.<br>\n",
" `activation`: str, activation function to use.<br>\n",
" `hidden_size`: int, dimension of hidden layers.<br>\n",
" `num_layers`: int, number of hidden layers.<br>\n",
" `dropout`: float, dropout rate.<br>\n",
" \"\"\"\n",
" def __init__(self, in_features, out_features, activation, hidden_size, num_layers, dropout):\n",
" super().__init__()\n",
" assert activation in ACTIVATIONS, f'{activation} is not in {ACTIVATIONS}'\n",
" \n",
" self.activation = getattr(nn, activation)()\n",
"\n",
" # MultiLayer Perceptron\n",
" # Input layer\n",
" layers = [nn.Linear(in_features=in_features, out_features=hidden_size),\n",
" self.activation,\n",
" nn.Dropout(dropout)]\n",
" # Hidden layers\n",
" for i in range(num_layers - 2):\n",
" layers += [nn.Linear(in_features=hidden_size, out_features=hidden_size),\n",
" self.activation,\n",
" nn.Dropout(dropout)]\n",
" # Output layer\n",
" layers += [nn.Linear(in_features=hidden_size, out_features=out_features)]\n",
"\n",
" # Store in layers as ModuleList\n",
" self.layers = nn.Sequential(*layers)\n",
"\n",
" def forward(self, x):\n",
" return self.layers(x)"
]
},
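  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative shape check (not exported): an MLP with three linear layers mapping\n",
    "# 24 inputs to 12 outputs; hidden_size and dropout values are arbitrary.\n",
    "mlp = MLP(in_features=24, out_features=12, activation='ReLU',\n",
    "          hidden_size=32, num_layers=3, dropout=0.1)\n",
    "x = torch.randn(8, 24)   # [batch, in_features]\n",
    "print(mlp(x).shape)      # torch.Size([8, 12])"
   ]
  },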
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Temporal Convolutions\n",
"\n",
"For long time in deep learning, sequence modelling was synonymous with recurrent networks, yet several papers have shown that simple convolutional architectures can outperform canonical recurrent networks like LSTMs by demonstrating longer effective memory.\n",
"\n",
"**References**<br>\n",
"-[van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. Computing Research Repository, abs/1609.03499. URL: http://arxiv.org/abs/1609.03499. arXiv:1609.03499.](https://arxiv.org/abs/1609.03499)<br>\n",
"-[Shaojie Bai, Zico Kolter, Vladlen Koltun. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. Computing Research Repository, abs/1803.01271. URL: https://arxiv.org/abs/1803.01271.](https://arxiv.org/abs/1803.01271)<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class Chomp1d(nn.Module):\n",
" \"\"\" Chomp1d\n",
"\n",
" Receives `x` input of dim [N,C,T], and trims it so that only\n",
" 'time available' information is used. \n",
" Used by one dimensional causal convolutions `CausalConv1d`.\n",
"\n",
" **Parameters:**<br>\n",
" `horizon`: int, length of outsample values to skip.\n",
" \"\"\"\n",
" def __init__(self, horizon):\n",
" super(Chomp1d, self).__init__()\n",
" self.horizon = horizon\n",
"\n",
" def forward(self, x):\n",
" return x[:, :, :-self.horizon].contiguous()\n",
"\n",
"\n",
"class CausalConv1d(nn.Module):\n",
" \"\"\" Causal Convolution 1d\n",
"\n",
" Receives `x` input of dim [N,C_in,T], and computes a causal convolution\n",
" in the time dimension. Skipping the H steps of the forecast horizon, through\n",
" its dilation.\n",
" Consider a batch of one element, the dilated convolution operation on the\n",
" $t$ time step is defined:\n",
"\n",
" $\\mathrm{Conv1D}(\\mathbf{x},\\mathbf{w})(t) = (\\mathbf{x}_{[*d]} \\mathbf{w})(t) = \\sum^{K}_{k=1} w_{k} \\mathbf{x}_{t-dk}$\n",
"\n",
" where $d$ is the dilation factor, $K$ is the kernel size, $t-dk$ is the index of\n",
" the considered past observation. The dilation effectively applies a filter with skip\n",
" connections. If $d=1$ one recovers a normal convolution.\n",
"\n",
" **Parameters:**<br>\n",
" `in_channels`: int, dimension of `x` input's initial channels.<br> \n",
" `out_channels`: int, dimension of `x` outputs's channels.<br> \n",
" `activation`: str, identifying activations from PyTorch activations.\n",
" select from 'ReLU','Softplus','Tanh','SELU', 'LeakyReLU','PReLU','Sigmoid'.<br>\n",
" `padding`: int, number of zero padding used to the left.<br>\n",
" `kernel_size`: int, convolution's kernel size.<br>\n",
" `dilation`: int, dilation skip connections.<br>\n",
" \n",
" **Returns:**<br>\n",
" `x`: tensor, torch tensor of dim [N,C_out,T] activation(conv1d(inputs, kernel) + bias). <br>\n",
" \"\"\"\n",
" def __init__(self, in_channels, out_channels, kernel_size,\n",
" padding, dilation, activation, stride:int=1):\n",
" super(CausalConv1d, self).__init__()\n",
" assert activation in ACTIVATIONS, f'{activation} is not in {ACTIVATIONS}'\n",
" \n",
" self.conv = nn.Conv1d(in_channels=in_channels, out_channels=out_channels, \n",
" kernel_size=kernel_size, stride=stride, padding=padding,\n",
" dilation=dilation)\n",
" \n",
" self.chomp = Chomp1d(padding)\n",
" self.activation = getattr(nn, activation)()\n",
" self.causalconv = nn.Sequential(self.conv, self.chomp, self.activation)\n",
" \n",
" def forward(self, x):\n",
" return self.causalconv(x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(CausalConv1d, title_level=3)"
]
},
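  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative shape check (not exported): with padding=(kernel_size-1)*dilation,\n",
    "# Chomp1d trims the positions introduced by the padding, so the time length of the\n",
    "# input is preserved.\n",
    "conv = CausalConv1d(in_channels=1, out_channels=8, kernel_size=3,\n",
    "                    padding=2, dilation=1, activation='ReLU')\n",
    "x = torch.randn(4, 1, 48)   # [N, C_in, T]\n",
    "print(conv(x).shape)        # torch.Size([4, 8, 48])"
   ]
  },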
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class TemporalConvolutionEncoder(nn.Module):\n",
" \"\"\" Temporal Convolution Encoder\n",
"\n",
" Receives `x` input of dim [N,T,C_in], permutes it to [N,C_in,T]\n",
" applies a deep stack of exponentially dilated causal convolutions.\n",
" The exponentially increasing dilations of the convolutions allow for \n",
" the creation of weighted averages of exponentially large long-term memory.\n",
"\n",
" **Parameters:**<br>\n",
" `in_channels`: int, dimension of `x` input's initial channels.<br> \n",
" `out_channels`: int, dimension of `x` outputs's channels.<br>\n",
" `kernel_size`: int, size of the convolving kernel.<br>\n",
" `dilations`: int list, controls the temporal spacing between the kernel points.<br>\n",
" `activation`: str, identifying activations from PyTorch activations.\n",
" select from 'ReLU','Softplus','Tanh','SELU', 'LeakyReLU','PReLU','Sigmoid'.<br>\n",
"\n",
" **Returns:**<br>\n",
" `x`: tensor, torch tensor of dim [N,T,C_out].<br>\n",
" \"\"\"\n",
" # TODO: Add dilations parameter and change layers declaration to for loop\n",
" def __init__(self, in_channels, out_channels, \n",
" kernel_size, dilations,\n",
" activation:str='ReLU'):\n",
" super(TemporalConvolutionEncoder, self).__init__()\n",
" layers = []\n",
" for dilation in dilations:\n",
" layers.append(CausalConv1d(in_channels=in_channels, out_channels=out_channels, \n",
" kernel_size=kernel_size, padding=(kernel_size-1)*dilation, \n",
" activation=activation, dilation=dilation))\n",
" in_channels = out_channels\n",
" self.tcn = nn.Sequential(*layers)\n",
"\n",
" def forward(self, x):\n",
" # [N,T,C_in] -> [N,C_in,T] -> [N,T,C_out]\n",
" x = x.permute(0, 2, 1).contiguous()\n",
" x = self.tcn(x)\n",
" x = x.permute(0, 2, 1).contiguous()\n",
" return x"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_doc(TemporalConvolutionEncoder, title_level=3)"
]
},
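  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative shape check (not exported): a stack of exponentially dilated causal\n",
    "# convolutions keeps the [N, T, C] layout and only changes the channel dimension.\n",
    "tcn = TemporalConvolutionEncoder(in_channels=1, out_channels=16,\n",
    "                                 kernel_size=2, dilations=[1, 2, 4, 8])\n",
    "x = torch.randn(4, 48, 1)   # [N, T, C_in]\n",
    "print(tcn(x).shape)         # torch.Size([4, 48, 16])"
   ]
  },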
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Transformers"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**References**<br>\n",
"- [Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, Wancai Zhang. \"Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting\"](https://arxiv.org/abs/2012.07436)<br>\n",
"- [Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.](https://arxiv.org/abs/2106.13008)<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class TransEncoderLayer(nn.Module):\n",
" def __init__(self, attention, hidden_size, conv_hidden_size=None, dropout=0.1, activation=\"relu\"):\n",
" super(TransEncoderLayer, self).__init__()\n",
" conv_hidden_size = conv_hidden_size or 4 * hidden_size\n",
" self.attention = attention\n",
" self.conv1 = nn.Conv1d(in_channels=hidden_size, out_channels=conv_hidden_size, kernel_size=1)\n",
" self.conv2 = nn.Conv1d(in_channels=conv_hidden_size, out_channels=hidden_size, kernel_size=1)\n",
" self.norm1 = nn.LayerNorm(hidden_size)\n",
" self.norm2 = nn.LayerNorm(hidden_size)\n",
" self.dropout = nn.Dropout(dropout)\n",
" self.activation = F.relu if activation == \"relu\" else F.gelu\n",
"\n",
" def forward(self, x, attn_mask=None):\n",
" new_x, attn = self.attention(\n",
" x, x, x,\n",
" attn_mask=attn_mask\n",
" )\n",
" \n",
" x = x + self.dropout(new_x)\n",
"\n",
" y = x = self.norm1(x)\n",
" y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))\n",
" y = self.dropout(self.conv2(y).transpose(-1, 1))\n",
"\n",
" return self.norm2(x + y), attn\n",
"\n",
"\n",
"class TransEncoder(nn.Module):\n",
" def __init__(self, attn_layers, conv_layers=None, norm_layer=None):\n",
" super(TransEncoder, self).__init__()\n",
" self.attn_layers = nn.ModuleList(attn_layers)\n",
" self.conv_layers = nn.ModuleList(conv_layers) if conv_layers is not None else None\n",
" self.norm = norm_layer\n",
"\n",
" def forward(self, x, attn_mask=None):\n",
" # x [B, L, D]\n",
" attns = []\n",
" if self.conv_layers is not None:\n",
" for attn_layer, conv_layer in zip(self.attn_layers, self.conv_layers):\n",
" x, attn = attn_layer(x, attn_mask=attn_mask)\n",
" x = conv_layer(x)\n",
" attns.append(attn)\n",
" x, attn = self.attn_layers[-1](x)\n",
" attns.append(attn)\n",
" else:\n",
" for attn_layer in self.attn_layers:\n",
" x, attn = attn_layer(x, attn_mask=attn_mask)\n",
" attns.append(attn)\n",
"\n",
" if self.norm is not None:\n",
" x = self.norm(x)\n",
"\n",
" return x, attns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class TransDecoderLayer(nn.Module):\n",
" def __init__(self, self_attention, cross_attention, hidden_size, conv_hidden_size=None,\n",
" dropout=0.1, activation=\"relu\"):\n",
" super(TransDecoderLayer, self).__init__()\n",
" conv_hidden_size = conv_hidden_size or 4 * hidden_size\n",
" self.self_attention = self_attention\n",
" self.cross_attention = cross_attention\n",
" self.conv1 = nn.Conv1d(in_channels=hidden_size, out_channels=conv_hidden_size, kernel_size=1)\n",
" self.conv2 = nn.Conv1d(in_channels=conv_hidden_size, out_channels=hidden_size, kernel_size=1)\n",
" self.norm1 = nn.LayerNorm(hidden_size)\n",
" self.norm2 = nn.LayerNorm(hidden_size)\n",
" self.norm3 = nn.LayerNorm(hidden_size)\n",
" self.dropout = nn.Dropout(dropout)\n",
" self.activation = F.relu if activation == \"relu\" else F.gelu\n",
"\n",
" def forward(self, x, cross, x_mask=None, cross_mask=None):\n",
" x = x + self.dropout(self.self_attention(\n",
" x, x, x,\n",
" attn_mask=x_mask\n",
" )[0])\n",
" x = self.norm1(x)\n",
"\n",
" x = x + self.dropout(self.cross_attention(\n",
" x, cross, cross,\n",
" attn_mask=cross_mask\n",
" )[0])\n",
"\n",
" y = x = self.norm2(x)\n",
" y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))\n",
" y = self.dropout(self.conv2(y).transpose(-1, 1))\n",
"\n",
" return self.norm3(x + y)\n",
"\n",
"\n",
"class TransDecoder(nn.Module):\n",
" def __init__(self, layers, norm_layer=None, projection=None):\n",
" super(TransDecoder, self).__init__()\n",
" self.layers = nn.ModuleList(layers)\n",
" self.norm = norm_layer\n",
" self.projection = projection\n",
"\n",
" def forward(self, x, cross, x_mask=None, cross_mask=None):\n",
" for layer in self.layers:\n",
" x = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)\n",
"\n",
" if self.norm is not None:\n",
" x = self.norm(x)\n",
"\n",
" if self.projection is not None:\n",
" x = self.projection(x)\n",
" return x"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class AttentionLayer(nn.Module):\n",
" def __init__(self, attention, hidden_size, n_head, d_keys=None,\n",
" d_values=None):\n",
" super(AttentionLayer, self).__init__()\n",
"\n",
" d_keys = d_keys or (hidden_size // n_head)\n",
" d_values = d_values or (hidden_size // n_head)\n",
"\n",
" self.inner_attention = attention\n",
" self.query_projection = nn.Linear(hidden_size, d_keys * n_head)\n",
" self.key_projection = nn.Linear(hidden_size, d_keys * n_head)\n",
" self.value_projection = nn.Linear(hidden_size, d_values * n_head)\n",
" self.out_projection = nn.Linear(d_values * n_head, hidden_size)\n",
" self.n_head = n_head\n",
"\n",
" def forward(self, queries, keys, values, attn_mask):\n",
" B, L, _ = queries.shape\n",
" _, S, _ = keys.shape\n",
" H = self.n_head\n",
"\n",
" queries = self.query_projection(queries).view(B, L, H, -1)\n",
" keys = self.key_projection(keys).view(B, S, H, -1)\n",
" values = self.value_projection(values).view(B, S, H, -1)\n",
"\n",
" out, attn = self.inner_attention(\n",
" queries,\n",
" keys,\n",
" values,\n",
" attn_mask\n",
" )\n",
" out = out.view(B, L, -1)\n",
"\n",
" return self.out_projection(out), attn"
]
},
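  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch (not exported): how AttentionLayer, TransEncoderLayer and\n",
    "# TransEncoder fit together. _ToyFullAttention is a hypothetical stand-in for the\n",
    "# library's attention mechanisms, which live in a separate module; it only needs to\n",
    "# accept [B,L,H,E] queries/keys/values and return an (output, attention) pair.\n",
    "class _ToyFullAttention(nn.Module):\n",
    "    def forward(self, queries, keys, values, attn_mask=None):\n",
    "        B, L, H, E = queries.shape\n",
    "        scores = torch.einsum('blhe,bshe->bhls', queries, keys) / math.sqrt(E)\n",
    "        attn = torch.softmax(scores, dim=-1)\n",
    "        out = torch.einsum('bhls,bshd->blhd', attn, values)\n",
    "        return out.contiguous(), attn\n",
    "\n",
    "hidden_size, n_head = 32, 4\n",
    "encoder = TransEncoder(\n",
    "    [TransEncoderLayer(AttentionLayer(_ToyFullAttention(), hidden_size, n_head),\n",
    "                       hidden_size) for _ in range(2)],\n",
    "    norm_layer=nn.LayerNorm(hidden_size))\n",
    "x = torch.randn(4, 48, hidden_size)   # [B, L, D]\n",
    "out, attns = encoder(x)\n",
    "print(out.shape)                      # torch.Size([4, 48, 32])"
   ]
  },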
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class PositionalEmbedding(nn.Module):\n",
" def __init__(self, hidden_size, max_len=5000):\n",
" super(PositionalEmbedding, self).__init__()\n",
" # Compute the positional encodings once in log space.\n",
" pe = torch.zeros(max_len, hidden_size).float()\n",
" pe.require_grad = False\n",
"\n",
" position = torch.arange(0, max_len).float().unsqueeze(1)\n",
" div_term = (torch.arange(0, hidden_size, 2).float() * -(math.log(10000.0) / hidden_size)).exp()\n",
"\n",
" pe[:, 0::2] = torch.sin(position * div_term)\n",
" pe[:, 1::2] = torch.cos(position * div_term)\n",
"\n",
" pe = pe.unsqueeze(0)\n",
" self.register_buffer('pe', pe)\n",
"\n",
" def forward(self, x):\n",
" return self.pe[:, :x.size(1)]\n",
"\n",
"class TokenEmbedding(nn.Module):\n",
" def __init__(self, c_in, hidden_size):\n",
" super(TokenEmbedding, self).__init__()\n",
" padding = 1 if torch.__version__ >= '1.5.0' else 2\n",
" self.tokenConv = nn.Conv1d(in_channels=c_in, out_channels=hidden_size,\n",
" kernel_size=3, padding=padding, padding_mode='circular', bias=False)\n",
" for m in self.modules():\n",
" if isinstance(m, nn.Conv1d):\n",
" nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')\n",
"\n",
" def forward(self, x):\n",
" x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)\n",
" return x\n",
"\n",
"class TimeFeatureEmbedding(nn.Module):\n",
" def __init__(self, input_size, hidden_size):\n",
" super(TimeFeatureEmbedding, self).__init__()\n",
" self.embed = nn.Linear(input_size, hidden_size, bias=False)\n",
"\n",
" def forward(self, x):\n",
" return self.embed(x)\n",
"\n",
"class DataEmbedding(nn.Module):\n",
" def __init__(self, c_in, exog_input_size, hidden_size, pos_embedding=True, dropout=0.1):\n",
" super(DataEmbedding, self).__init__()\n",
"\n",
" self.value_embedding = TokenEmbedding(c_in=c_in, hidden_size=hidden_size)\n",
"\n",
" if pos_embedding:\n",
" self.position_embedding = PositionalEmbedding(hidden_size=hidden_size)\n",
" else:\n",
" self.position_embedding = None\n",
"\n",
" if exog_input_size > 0:\n",
" self.temporal_embedding = TimeFeatureEmbedding(input_size=exog_input_size,\n",
" hidden_size=hidden_size)\n",
" else:\n",
" self.temporal_embedding = None\n",
"\n",
" self.dropout = nn.Dropout(p=dropout)\n",
"\n",
" def forward(self, x, x_mark=None):\n",
"\n",
" # Convolution\n",
" x = self.value_embedding(x)\n",
"\n",
" # Add positional (relative withing window) embedding with sines and cosines\n",
" if self.position_embedding is not None:\n",
" x = x + self.position_embedding(x)\n",
"\n",
" # Add temporal (absolute in time series) embedding with linear layer\n",
" if self.temporal_embedding is not None:\n",
" x = x + self.temporal_embedding(x_mark) \n",
"\n",
" return self.dropout(x)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}