Commit 8abe4930 authored by Eli Simhayev, committed by GitHub

[Time-Series] informer model (#21099)

* added informer to gitignore

* added informer to gitignore

* WIP informer2020

* added checking that instantiate works

* added config using gluonTS by kashif

* WIP config

* adding InformerConfig; need to remove FeatureEmbedder

* done InformerConfig, but need to change the names

* Done informer model init. working on enc-dec

* added things to address, after reading again enc-dec in the paper

* done modeling - checking initialization work

* added informer to gitignore

* WIP informer2020

* added checking that instantiate works

* added config using gluonTS by kashif

* WIP config

* adding InformerConfig; need to remove FeatureEmbedder

* done InformerConfig, but need to change the names

* Done informer model init. working on enc-dec

* added things to address, after reading again enc-dec in the paper

* done modeling - checking initialization work

* moved enc-dec init to InformerEncoder/Decoder init

* added 'init_std' to config, now model init works!

* WIP conversion script, and added code sources

* WIP conversion script: loading original informer pth works

* WIP conversion script: change defaults in the config

* WIP conversion script: supporting Informer input embedding

* WIP conversion script: added parameters for the informer embed

* WIP conversion script: change dim_feedforward=2048

* WIP conversion script: remove unused args for loading checkpoint

* just cleaning up

* DataEmbedding removed, after thinking with Kashif

* working on forward pass

* WIP forward pass: trying to establish working batch for forward pass

* cleaning and finalizing

* adding HF names and docs

* init after cleaning works

* WIP in tests

* added docs for the informer specific args

* fix style

* undo change

* cleaning informer, now need to work only enc-dec

* initial enc-dec classes

* added encoder and decoder

* added todo

* add todos for conv_layers

* added decoder docs from vanilla

* added encoder docs from vanilla

* remove encoder decoder from the original informer

* removed AttentionLayer from the original paper

* removed TriangularCausalMask, same as decoder_attention_mask

* initial sparse attention

* use conv_layers

* fixed test_config test

* fix parentheses when iterating zip(layers, conv_layers)

* error found in prob attention, added sizes as comments

* fix sizes

* added proposal for q_reduce indexing, and remove unused

* WIP ProbMask, and changed factor=2 for testing

* remove unused libs for this PR for creating the env

* fix checking the attn_weights.size() after bmm

* Q_reduce: changed from torch.gather to simple slicing

* WIP calculate final attn_output

* finish adding v_aggregated, attn_output ready

* changed tgt_len to u in attention_mask, need to fix the size error

* comment attention_mask for encoder, and fix if cond for v_agg

* added ProbMask support (wip), removed old original code

* finished ProbMask 😃



* Revert "remove unused libs for this PR for creating the env"

This reverts commit 11a081e09e92771e51a5d2758d53a9afb59547f0.

* fixes

* make style

* fix initial tests

* fix more tests

* dry

* make style

* remove unused files

* style

* added integration tests

* fix num_static_real_features

* fix header

* remove unused function

* fix example

* fix docs

* Update src/transformers/models/informer/configuration_informer.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/informer/modeling_informer.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/informer/configuration_informer.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/informer/configuration_informer.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/informer/configuration_informer.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/informer/configuration_informer.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* fixes for reviewer

* use prediction_length from model

* fix style

* fixed informer.mdx

* added to index

* updated readme

* undo

* make fix-copies

* typo

* fix copy

* added Informer to toctree

* in order

* fixed comments

* remove unneeded new lines in docs

* make static real and cat optional

* fix use of distil conv layers

* fixed integration test

* added checkpoint for convlayer

* make fix-copies

* updated from time series model

* make fix-copies

* copy decoder

* fix unit tests

* updated scaling config

* fix integration tests

* IGNORE_NON_TESTED

* IGNORE_NON_AUTO_CONFIGURED

* IGNORE_NON_AUTO_CONFIGURED

* updated check configs

* fix formatting

* undo change from time series

* prediction_length should not be None

* align with the blog: prettify ProbSparse and change attention_factor to sampling_factor

* make style

* make fix-copies

* niels CR: update contributed by

* niels CR: update configuration_informer.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* niels CR: update kashif -> huggingface
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* niels CR: `sampling_factor` only relevant when `attention_type`=prob

* make style

* fixed U_part: added multiplication by `L_Q`

* fixed bug: remove `is not None` from `if config.distil`

* fixed test: `decoder_seq_length` to `encoder_seq_length` in cross_attentions check

* fix integration tests

* updated model hub

* do not shift as in training

* undo

* fix make-copies

* make fix-copies

* added `if prediction_length is None`

* changed `ProbSparseAttention` to `InformerProbSparseAttention`

* changed `V_sum` -> `v_mean_dim_time`

* changed `ConvLayer` to `InformerConvLayer` and fixed `super()`

* TimeSeriesTransformer->Informer in decoder's Copied from

* more descriptive in ProbSparse

* make style

* fix copied from

* Revert "added `if prediction_length is None`"

This reverts commit b4cbddfa05e3bd739b79569cd3c3b89e316f2451.

* fixed indent

* use InformerSinusoidalPositionalEmbedding

* make fix-style

* fix from #21860

* fix name

* make fix-copies

* use time series utils

* fix dec num_heads

* docstring

* added time series util doc

* _import_structure

* formatting

* changes from review

* make style

* fix docs

* fix doc

* removed NegativeLogLikelihood

---------
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
parent dde718e7
......@@ -16,25 +16,22 @@
""" PyTorch Time Series Transformer model."""
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union
from typing import List, Optional, Tuple, Union
import numpy as np
import torch
from torch import nn
from torch.distributions import (
AffineTransform,
Distribution,
Independent,
NegativeBinomial,
Normal,
StudentT,
TransformedDistribution,
)
from ...activations import ACT2FN
from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions, ModelOutput
from ...modeling_outputs import (
BaseModelOutput,
BaseModelOutputWithPastAndCrossAttentions,
SampleTSPredictionOutput,
Seq2SeqTSModelOutput,
Seq2SeqTSPredictionOutput,
)
from ...modeling_utils import PreTrainedModel
from ...time_series_utils import NegativeBinomialOutput, NormalOutput, StudentTOutput
from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
from .configuration_time_series_transformer import TimeSeriesTransformerConfig
......@@ -50,189 +47,17 @@ TIME_SERIES_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [
]
class AffineTransformed(TransformedDistribution):
def __init__(self, base_distribution: Distribution, loc=None, scale=None, event_dim=0):
self.scale = 1.0 if scale is None else scale
self.loc = 0.0 if loc is None else loc
super().__init__(base_distribution, [AffineTransform(loc=self.loc, scale=self.scale, event_dim=event_dim)])
@property
def mean(self):
"""
Returns the mean of the distribution.
"""
return self.base_dist.mean * self.scale + self.loc
@property
def variance(self):
"""
Returns the variance of the distribution.
"""
return self.base_dist.variance * self.scale**2
@property
def stddev(self):
"""
Returns the standard deviation of the distribution.
"""
return self.variance.sqrt()
class ParameterProjection(nn.Module):
def __init__(
self, in_features: int, args_dim: Dict[str, int], domain_map: Callable[..., Tuple[torch.Tensor]], **kwargs
) -> None:
super().__init__(**kwargs)
self.args_dim = args_dim
self.proj = nn.ModuleList([nn.Linear(in_features, dim) for dim in args_dim.values()])
self.domain_map = domain_map
def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor]:
params_unbounded = [proj(x) for proj in self.proj]
return self.domain_map(*params_unbounded)
class LambdaLayer(nn.Module):
def __init__(self, function):
super().__init__()
self.function = function
def forward(self, x, *args):
return self.function(x, *args)
class DistributionOutput:
distribution_class: type
in_features: int
args_dim: Dict[str, int]
def __init__(self, dim: int = 1) -> None:
self.dim = dim
self.args_dim = {k: dim * self.args_dim[k] for k in self.args_dim}
def _base_distribution(self, distr_args):
if self.dim == 1:
return self.distribution_class(*distr_args)
else:
return Independent(self.distribution_class(*distr_args), 1)
def distribution(
self,
distr_args,
loc: Optional[torch.Tensor] = None,
scale: Optional[torch.Tensor] = None,
) -> Distribution:
distr = self._base_distribution(distr_args)
if loc is None and scale is None:
return distr
else:
return AffineTransformed(distr, loc=loc, scale=scale, event_dim=self.event_dim)
@property
def event_shape(self) -> Tuple:
r"""
Shape of each individual event contemplated by the distributions that this object constructs.
"""
return () if self.dim == 1 else (self.dim,)
@property
def event_dim(self) -> int:
r"""
Number of event dimensions, i.e., length of the `event_shape` tuple, of the distributions that this object
constructs.
"""
return len(self.event_shape)
@property
def value_in_support(self) -> float:
r"""
A float that will have a valid numeric value when computing the log-loss of the corresponding distribution. By
default 0.0. This value will be used when padding data series.
"""
return 0.0
def get_parameter_projection(self, in_features: int) -> nn.Module:
r"""
Return the parameter projection layer that maps the input to the appropriate parameters of the distribution.
"""
return ParameterProjection(
in_features=in_features,
args_dim=self.args_dim,
domain_map=LambdaLayer(self.domain_map),
)
def domain_map(self, *args: torch.Tensor):
r"""
Converts arguments to the right shape and domain. The domain depends on the type of distribution, while the
correct shape is obtained by reshaping the trailing axis in such a way that the returned tensors define a
distribution of the right event_shape.
class TimeSeriesFeatureEmbedder(nn.Module):
"""
raise NotImplementedError()
Embed a sequence of categorical features.
@classmethod
def squareplus(cls, x: torch.Tensor) -> torch.Tensor:
r"""
Helper to map inputs to the positive orthant by applying the square-plus operation. Reference:
https://twitter.com/jon_barron/status/1387167648669048833
Args:
cardinalities (`list[int]`):
List of cardinalities of the categorical features.
embedding_dims (`list[int]`):
List of embedding dimensions of the categorical features.
"""
return (x + torch.sqrt(torch.square(x) + 4.0)) / 2.0
class StudentTOutput(DistributionOutput):
args_dim: Dict[str, int] = {"df": 1, "loc": 1, "scale": 1}
distribution_class: type = StudentT
@classmethod
def domain_map(cls, df: torch.Tensor, loc: torch.Tensor, scale: torch.Tensor):
scale = cls.squareplus(scale)
df = 2.0 + cls.squareplus(df)
return df.squeeze(-1), loc.squeeze(-1), scale.squeeze(-1)
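For readers following the `domain_map` logic above, here is a minimal scalar sketch of what `squareplus` and the Student-T mapping do to unbounded projection outputs. It uses plain floats and the standard library instead of tensors, so there is no `squeeze`; the function `student_t_domain_map` is a hypothetical stand-in for the classmethod, not code from this commit.

```python
import math


def squareplus(x: float) -> float:
    """Scalar square-plus: (x + sqrt(x^2 + 4)) / 2, mathematically positive for any real x."""
    return (x + math.sqrt(x * x + 4.0)) / 2.0


def student_t_domain_map(df: float, loc: float, scale: float):
    """Scalar analogue of StudentTOutput.domain_map (hypothetical helper).

    df is pushed above 2 so the variance of the Student-T is finite,
    scale is pushed above 0, and loc is left unconstrained.
    """
    return 2.0 + squareplus(df), loc, squareplus(scale)


# Raw, unbounded values as they might come out of a linear projection.
df, loc, scale = student_t_domain_map(-3.0, 0.5, -10.0)
print(df, loc, scale)
```

Unlike a softplus, the map needs only a square root, and for large positive inputs it approaches the identity.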
class NormalOutput(DistributionOutput):
args_dim: Dict[str, int] = {"loc": 1, "scale": 1}
distribution_class: type = Normal
@classmethod
def domain_map(cls, loc: torch.Tensor, scale: torch.Tensor):
scale = cls.squareplus(scale)
return loc.squeeze(-1), scale.squeeze(-1)
class NegativeBinomialOutput(DistributionOutput):
args_dim: Dict[str, int] = {"total_count": 1, "logits": 1}
distribution_class: type = NegativeBinomial
@classmethod
def domain_map(cls, total_count: torch.Tensor, logits: torch.Tensor):
total_count = cls.squareplus(total_count)
return total_count.squeeze(-1), logits.squeeze(-1)
def _base_distribution(self, distr_args) -> Distribution:
total_count, logits = distr_args
if self.dim == 1:
return self.distribution_class(total_count=total_count, logits=logits)
else:
return Independent(self.distribution_class(total_count=total_count, logits=logits), 1)
# Overwrites the parent class method. We cannot scale using the affine
# transformation since negative binomial should return integers. Instead
# we scale the parameters.
def distribution(
self, distr_args, loc: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None
) -> Distribution:
total_count, logits = distr_args
if scale is not None:
# See scaling property of Gamma.
logits += scale.log()
return self._base_distribution((total_count, logits))
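The comment above says the negative binomial is scaled through its parameters instead of an affine transform. A small sketch of why adding `log(scale)` to the logits works: under the `total_count`/`logits` parameterization of `torch.distributions.NegativeBinomial`, the mean is `total_count * exp(logits)`, so shifting the logits by `log(scale)` multiplies the mean by `scale` while samples stay integer-valued. The helper names below are hypothetical and operate on floats, not tensors.

```python
import math


def negative_binomial_mean(total_count: float, logits: float) -> float:
    """Mean of a negative binomial given total_count and logits (log-odds of success)."""
    return total_count * math.exp(logits)


def scale_logits(logits: float, scale: float) -> float:
    """Out-of-place version of `logits += scale.log()` from the hunk above."""
    return logits + math.log(scale)


total_count, logits, scale = 4.0, -0.7, 25.0
base_mean = negative_binomial_mean(total_count, logits)
scaled_mean = negative_binomial_mean(total_count, scale_logits(logits, scale))
print(base_mean, scaled_mean)
```

The original uses an in-place `+=`, which modifies the tensor passed in by the caller; the sketch returns a new value instead to keep the example free of side effects.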
class FeatureEmbedder(nn.Module):
def __init__(self, cardinalities: List[int], embedding_dims: List[int]) -> None:
super().__init__()
......@@ -256,7 +81,7 @@ class FeatureEmbedder(nn.Module):
)
class StdScaler(nn.Module):
class TimeSeriesStdScaler(nn.Module):
"""
Standardize features by calculating the mean and scaling along some given dimension `dim`, and then normalizes it
by subtracting from the mean and dividing by the standard deviation.
......@@ -289,7 +114,7 @@ class StdScaler(nn.Module):
return (data - loc) / scale, loc, scale
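As a rough illustration of the standardization this scaler's docstring describes, the sketch below works on a single 1-D series with an observation mask. The function name, the mask argument, and the `minimum_scale` floor are assumptions made for this example; the body of the real module is not shown in this diff.

```python
import math
from typing import List, Tuple


def std_scale(
    data: List[float], observed: List[bool], minimum_scale: float = 1e-5
) -> Tuple[List[float], float, float]:
    """Standardize `data` using statistics of the observed points only (hypothetical helper)."""
    n = max(sum(observed), 1)  # avoid dividing by zero when nothing is observed
    loc = sum(x for x, o in zip(data, observed) if o) / n
    var = sum((x - loc) ** 2 for x, o in zip(data, observed) if o) / n
    scale = math.sqrt(var + minimum_scale)  # floor keeps the scale strictly positive
    return [(x - loc) / scale for x in data], loc, scale


series = [10.0, 12.0, 14.0, 0.0]
mask = [True, True, True, False]  # last point is padding / missing
scaled, loc, scale = std_scale(series, mask)

# loc and scale are returned so predictions can be mapped back to the original magnitude.
restored = [x * scale + loc for x in scaled]
print(loc, scale, restored)
```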
class MeanScaler(nn.Module):
class TimeSeriesMeanScaler(nn.Module):
"""
Computes a scaling factor as the weighted average absolute value along dimension `dim`, and scales the data
accordingly.
......@@ -344,7 +169,7 @@ class MeanScaler(nn.Module):
return scaled_data, torch.zeros_like(scale), scale
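The mean scaler divides by the average absolute value and, as the return statement above shows, reports a zero `loc`. A minimal 1-D sketch under assumptions: the mask argument, the fallback scale of 1.0 for a fully unobserved series, and the `minimum_scale` floor are choices made for this example and may differ from the module's actual behavior.

```python
from typing import List, Tuple


def mean_scale(
    data: List[float], observed: List[bool], minimum_scale: float = 1e-10
) -> Tuple[List[float], float, float]:
    """Scale `data` by the mean absolute value of its observed points (hypothetical helper)."""
    num_observed = sum(observed)
    abs_sum = sum(abs(x) for x, o in zip(data, observed) if o)
    # Assumed fallback: a series with no observations is left unscaled.
    scale = abs_sum / num_observed if num_observed > 0 else 1.0
    scale = max(scale, minimum_scale)
    return [x / scale for x in data], 0.0, scale


scaled, loc, scale = mean_scale([2.0, -4.0, 6.0], [True, True, True])
print(scaled, loc, scale)
```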
class NOPScaler(nn.Module):
class TimeSeriesNOPScaler(nn.Module):
"""
Assigns a scaling factor equal to 1 along dimension `dim`, and therefore applies no scaling to the input data.
......@@ -368,6 +193,13 @@ class NOPScaler(nn.Module):
return data, loc, scale
def nll(input: torch.distributions.Distribution, target: torch.Tensor) -> torch.Tensor:
"""
Computes the negative log likelihood loss from input distribution with respect to target.
"""
return -input.log_prob(target)
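The `nll` function only needs an object with a `log_prob` method. The sketch below shows that contract with a tiny stand-in `Normal` class written for this example (it is not `torch.distributions.Normal`), so it runs without PyTorch.

```python
import math


class Normal:
    """Minimal scalar normal distribution exposing log_prob (stand-in for illustration)."""

    def __init__(self, loc: float, scale: float) -> None:
        self.loc = loc
        self.scale = scale

    def log_prob(self, value: float) -> float:
        z = (value - self.loc) / self.scale
        return -0.5 * math.log(2.0 * math.pi) - math.log(self.scale) - 0.5 * z * z


def nll(input, target):
    """Same body as the function above: the negated log-likelihood of the target."""
    return -input.log_prob(target)


dist = Normal(loc=0.0, scale=1.0)
loss_at_mean = nll(dist, 0.0)
loss_far_away = nll(dist, 2.0)
print(loss_at_mean, loss_far_away)
```

Targets far from the predicted location incur a larger loss, which is what training the distribution head minimizes.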
def weighted_average(input_tensor: torch.Tensor, weights: Optional[torch.Tensor] = None, dim=None) -> torch.Tensor:
"""
Computes the weighted average of a given tensor across a given `dim`, masking values associated with weight zero,
......@@ -392,15 +224,6 @@ def weighted_average(input_tensor: torch.Tensor, weights: Optional[torch.Tensor]
return input_tensor.mean(dim=dim)
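To make the masking behavior in the `weighted_average` docstring concrete, here is a 1-D sketch in plain Python. Only the unweighted fallback (a plain mean) is visible in this diff; the handling of zero weights and the clamp of the weight sum to at least 1.0 are assumptions of this example.

```python
import math
from typing import List, Optional


def weighted_average(values: List[float], weights: Optional[List[float]] = None) -> float:
    """Weighted mean that skips entries whose weight is zero (illustrative 1-D version)."""
    if weights is None:
        return sum(values) / len(values)
    # Skip zero-weight entries entirely so NaN or inf in padded positions cannot leak in.
    total = sum(v * w for v, w in zip(values, weights) if w != 0)
    # Assumed guard: clamp the denominator so an all-zero mask yields 0.0, not a division error.
    return total / max(sum(weights), 1.0)


losses = [1.0, math.nan, 3.0]   # middle entry belongs to a padded time step
mask = [1.0, 0.0, 1.0]
print(weighted_average(losses, mask))
```

Multiplying by the mask alone would not be enough here, because `nan * 0` is still `nan`; the entry has to be excluded.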
class NegativeLogLikelihood:
"""
Computes the negative log likelihood loss from input distribution with respect to target.
"""
def __call__(self, input: torch.distributions.Distribution, target: torch.Tensor) -> torch.Tensor:
return -input.log_prob(target)
# Copied from transformers.models.bart.modeling_bart._make_causal_mask
def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0):
"""
......@@ -467,164 +290,15 @@ class TimeSeriesSinusoidalPositionalEmbedding(nn.Embedding):
return super().forward(positions)
class ValueEmbedding(nn.Module):
class TimeSeriesValueEmbedding(nn.Module):
def __init__(self, feature_size, d_model):
super(ValueEmbedding, self).__init__()
super().__init__()
self.value_projection = nn.Linear(in_features=feature_size, out_features=d_model, bias=False)
def forward(self, x):
return self.value_projection(x)
@dataclass
class Seq2SeqTimeSeriesModelOutput(ModelOutput):
"""
Base class for model encoder's outputs that also contains pre-computed hidden states that can speed up sequential
decoding.
Args:
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the decoder of the model.
If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
hidden_size)` is output.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
loc (`torch.FloatTensor` of shape `(batch_size,)`, *optional*):
Shift values of each time series' context window which is used to give the model inputs of the same
magnitude and then used to shift back to the original magnitude.
scale (`torch.FloatTensor` of shape `(batch_size,)`, *optional*):
Scaling values of each time series' context window which is used to give the model inputs of the same
magnitude and then used to rescale back to the original magnitude.
static_features: (`torch.FloatTensor` of shape `(batch_size, feature size)`, *optional*):
Static features of each time series' in a batch which are copied to the covariates at inference time.
"""
last_hidden_state: torch.FloatTensor = None
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None
cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
encoder_last_hidden_state: Optional[torch.FloatTensor] = None
encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None
loc: Optional[torch.FloatTensor] = None
scale: Optional[torch.FloatTensor] = None
static_features: Optional[torch.FloatTensor] = None
@dataclass
class Seq2SeqTimeSeriesPredictionOutput(ModelOutput):
"""
Base class for model's predictions outputs that also contain the loss as well parameters of the chosen
distribution.
Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when a `future_values` is provided):
Distributional loss.
params (`torch.FloatTensor` of shape `(batch_size, num_samples, num_params)`):
Parameters of the chosen distribution.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
loc (`torch.FloatTensor` of shape `(batch_size,)`, *optional*):
Shift values of each time series' context window which is used to give the model inputs of the same
magnitude and then used to shift back to the original magnitude.
scale (`torch.FloatTensor` of shape `(batch_size,)`, *optional*):
Scaling values of each time series' context window which is used to give the model inputs of the same
magnitude and then used to rescale back to the original magnitude.
static_features: (`torch.FloatTensor` of shape `(batch_size, feature size)`, *optional*):
Static features of each time series' in a batch which are copied to the covariates at inference time.
"""
loss: Optional[torch.FloatTensor] = None
params: Optional[Tuple[torch.FloatTensor]] = None
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None
cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
encoder_last_hidden_state: Optional[torch.FloatTensor] = None
encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None
loc: Optional[torch.FloatTensor] = None
scale: Optional[torch.FloatTensor] = None
static_features: Optional[torch.FloatTensor] = None
@dataclass
class SampleTimeSeriesPredictionOutput(ModelOutput):
sequences: torch.FloatTensor = None
# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->TimeSeriesTransformer
class TimeSeriesTransformerAttention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
......@@ -1179,7 +853,7 @@ class TimeSeriesTransformerEncoder(TimeSeriesTransformerPreTrainedModel):
if config.prediction_length is None:
raise ValueError("The `prediction_length` config needs to be specified.")
self.value_embedding = ValueEmbedding(feature_size=config.feature_size, d_model=config.d_model)
self.value_embedding = TimeSeriesValueEmbedding(feature_size=config.feature_size, d_model=config.d_model)
self.embed_positions = TimeSeriesSinusoidalPositionalEmbedding(
config.context_length + config.prediction_length, config.d_model
)
......@@ -1316,7 +990,7 @@ class TimeSeriesTransformerDecoder(TimeSeriesTransformerPreTrainedModel):
if config.prediction_length is None:
raise ValueError("The `prediction_length` config needs to be specified.")
self.value_embedding = ValueEmbedding(feature_size=config.feature_size, d_model=config.d_model)
self.value_embedding = TimeSeriesValueEmbedding(feature_size=config.feature_size, d_model=config.d_model)
self.embed_positions = TimeSeriesSinusoidalPositionalEmbedding(
config.context_length + config.prediction_length, config.d_model
)
......@@ -1547,14 +1221,14 @@ class TimeSeriesTransformerModel(TimeSeriesTransformerPreTrainedModel):
super().__init__(config)
if config.scaling == "mean" or config.scaling:
self.scaler = MeanScaler(dim=1, keepdim=True)
self.scaler = TimeSeriesMeanScaler(dim=1, keepdim=True)
elif config.scaling == "std":
self.scaler = StdScaler(dim=1, keepdim=True)
self.scaler = TimeSeriesStdScaler(dim=1, keepdim=True)
else:
self.scaler = NOPScaler(dim=1, keepdim=True)
self.scaler = TimeSeriesNOPScaler(dim=1, keepdim=True)
if config.num_static_categorical_features > 0:
self.embedder = FeatureEmbedder(
self.embedder = TimeSeriesFeatureEmbedder(
cardinalities=config.cardinality,
embedding_dims=config.embedding_dimension,
)
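A note on the scaler selection as it reads in this hunk: a non-empty string is truthy in Python, so `config.scaling == "mean" or config.scaling` also holds for `"std"`, and the `elif` branch looks unreachable as written. The sketch below reproduces the branch order with plain labels; `pick_scaler_strict` is a hypothetical alternative shown for contrast and is not part of this commit.

```python
def pick_scaler(scaling):
    """Branch order copied from the hunk above, returning a label instead of a module."""
    if scaling == "mean" or scaling:
        return "mean"
    elif scaling == "std":
        return "std"
    else:
        return "nop"


def pick_scaler_strict(scaling):
    """Hypothetical variant: only the literal True (or "mean") selects the mean scaler."""
    if scaling == "mean" or scaling is True:
        return "mean"
    elif scaling == "std":
        return "std"
    else:
        return "nop"


for value in ("mean", "std", True, None, False):
    print(repr(value), pick_scaler(value), pick_scaler_strict(value))
```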
......@@ -1681,7 +1355,7 @@ class TimeSeriesTransformerModel(TimeSeriesTransformerPreTrainedModel):
return self.decoder
@add_start_docstrings_to_model_forward(TIME_SERIES_TRANSFORMER_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=Seq2SeqTimeSeriesModelOutput, config_class=_CONFIG_FOR_DOC)
@replace_return_docstrings(output_type=Seq2SeqTSModelOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
past_values: torch.Tensor,
......@@ -1701,7 +1375,7 @@ class TimeSeriesTransformerModel(TimeSeriesTransformerPreTrainedModel):
output_attentions: Optional[bool] = None,
use_cache: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Seq2SeqTimeSeriesModelOutput, Tuple]:
) -> Union[Seq2SeqTSModelOutput, Tuple]:
r"""
Returns:
......@@ -1784,7 +1458,7 @@ class TimeSeriesTransformerModel(TimeSeriesTransformerPreTrainedModel):
if not return_dict:
return decoder_outputs + encoder_outputs + (loc, scale, static_feat)
return Seq2SeqTimeSeriesModelOutput(
return Seq2SeqTSModelOutput(
last_hidden_state=decoder_outputs.last_hidden_state,
past_key_values=decoder_outputs.past_key_values,
decoder_hidden_states=decoder_outputs.hidden_states,
......@@ -1820,7 +1494,7 @@ class TimeSeriesTransformerForPrediction(TimeSeriesTransformerPreTrainedModel):
self.target_shape = self.distribution_output.event_shape
if config.loss == "nll":
self.loss = NegativeLogLikelihood()
self.loss = nll
else:
raise ValueError(f"Unknown loss function {config.loss}")
......@@ -1844,7 +1518,7 @@ class TimeSeriesTransformerForPrediction(TimeSeriesTransformerPreTrainedModel):
return self.distribution_output.distribution(sliced_params, loc=loc, scale=scale)
@add_start_docstrings_to_model_forward(TIME_SERIES_TRANSFORMER_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=Seq2SeqTimeSeriesModelOutput, config_class=_CONFIG_FOR_DOC)
@replace_return_docstrings(output_type=Seq2SeqTSModelOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
past_values: torch.Tensor,
......@@ -1865,7 +1539,7 @@ class TimeSeriesTransformerForPrediction(TimeSeriesTransformerPreTrainedModel):
output_attentions: Optional[bool] = None,
use_cache: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Seq2SeqTimeSeriesModelOutput, Tuple]:
) -> Union[Seq2SeqTSModelOutput, Tuple]:
r"""
Returns:
......@@ -1962,7 +1636,7 @@ class TimeSeriesTransformerForPrediction(TimeSeriesTransformerPreTrainedModel):
outputs = ((params,) + outputs[1:]) if params is not None else outputs[1:]
return ((prediction_loss,) + outputs) if prediction_loss is not None else outputs
return Seq2SeqTimeSeriesPredictionOutput(
return Seq2SeqTSPredictionOutput(
loss=prediction_loss,
params=params,
past_key_values=outputs.past_key_values,
......@@ -1988,7 +1662,7 @@ class TimeSeriesTransformerForPrediction(TimeSeriesTransformerPreTrainedModel):
static_real_features: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
) -> SampleTimeSeriesPredictionOutput:
) -> SampleTSPredictionOutput:
r"""
Greedily generate sequences of sample predictions from a model with a probability distribution head.
......@@ -2072,9 +1746,9 @@ class TimeSeriesTransformerForPrediction(TimeSeriesTransformerPreTrainedModel):
Whether or not to return the hidden states of all layers.
Return:
[`SampleTimeSeriesPredictionOutput`] where the outputs `sequences` tensor will have shape `(batch_size,
number of samples, prediction_length)` or `(batch_size, number of samples, prediction_length, input_size)`
for multivariate predictions.
[`SampleTSPredictionOutput`] where the outputs `sequences` tensor will have shape `(batch_size, number of
samples, prediction_length)` or `(batch_size, number of samples, prediction_length, input_size)` for
multivariate predictions.
"""
outputs = self(
static_categorical_features=static_categorical_features,
......@@ -2139,7 +1813,7 @@ class TimeSeriesTransformerForPrediction(TimeSeriesTransformerPreTrainedModel):
concat_future_samples = torch.cat(future_samples, dim=1)
return SampleTimeSeriesPredictionOutput(
return SampleTSPredictionOutput(
sequences=concat_future_samples.reshape(
(-1, num_parallel_samples, self.config.prediction_length) + self.target_shape,
)
......
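The reshape at the end of `generate` folds the sampled values into the shape documented in the docstring. The sketch below only does the shape arithmetic, resolving the `-1` entry the way a reshape would; `resolve_shape` is a hypothetical helper, and the sizes are made up for illustration.

```python
from math import prod
from typing import Tuple


def resolve_shape(num_elements: int, shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Replace a single -1 in `shape` with the size implied by `num_elements`."""
    known = prod(d for d in shape if d != -1)
    if num_elements % known != 0:
        raise ValueError(f"cannot reshape {num_elements} elements into {shape}")
    return tuple(num_elements // known if d == -1 else d for d in shape)


batch_size, num_parallel_samples, prediction_length = 8, 100, 24

# Univariate target: event_shape is (), so target_shape adds no trailing axis.
univariate = resolve_shape(
    batch_size * num_parallel_samples * prediction_length,
    (-1, num_parallel_samples, prediction_length) + (),
)

# Multivariate target with 3 variates: event_shape is (3,).
multivariate = resolve_shape(
    batch_size * num_parallel_samples * prediction_length * 3,
    (-1, num_parallel_samples, prediction_length) + (3,),
)
print(univariate, multivariate)
```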
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Time series distributional output classes and utilities.
"""
from typing import Callable, Dict, Optional, Tuple
import torch
from torch import nn
from torch.distributions import (
AffineTransform,
Distribution,
Independent,
NegativeBinomial,
Normal,
StudentT,
TransformedDistribution,
)
class AffineTransformed(TransformedDistribution):
def __init__(self, base_distribution: Distribution, loc=None, scale=None, event_dim=0):
self.scale = 1.0 if scale is None else scale
self.loc = 0.0 if loc is None else loc
super().__init__(base_distribution, [AffineTransform(loc=self.loc, scale=self.scale, event_dim=event_dim)])
@property
def mean(self):
"""
Returns the mean of the distribution.
"""
return self.base_dist.mean * self.scale + self.loc
@property
def variance(self):
"""
Returns the variance of the distribution.
"""
return self.base_dist.variance * self.scale**2
@property
def stddev(self):
"""
Returns the standard deviation of the distribution.
"""
return self.variance.sqrt()
class ParameterProjection(nn.Module):
def __init__(
self, in_features: int, args_dim: Dict[str, int], domain_map: Callable[..., Tuple[torch.Tensor]], **kwargs
) -> None:
super().__init__(**kwargs)
self.args_dim = args_dim
self.proj = nn.ModuleList([nn.Linear(in_features, dim) for dim in args_dim.values()])
self.domain_map = domain_map
def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor]:
params_unbounded = [proj(x) for proj in self.proj]
return self.domain_map(*params_unbounded)
class LambdaLayer(nn.Module):
def __init__(self, function):
super().__init__()
self.function = function
def forward(self, x, *args):
return self.function(x, *args)
class DistributionOutput:
distribution_class: type
in_features: int
args_dim: Dict[str, int]
def __init__(self, dim: int = 1) -> None:
self.dim = dim
self.args_dim = {k: dim * self.args_dim[k] for k in self.args_dim}
def _base_distribution(self, distr_args):
if self.dim == 1:
return self.distribution_class(*distr_args)
else:
return Independent(self.distribution_class(*distr_args), 1)
def distribution(
self,
distr_args,
loc: Optional[torch.Tensor] = None,
scale: Optional[torch.Tensor] = None,
) -> Distribution:
distr = self._base_distribution(distr_args)
if loc is None and scale is None:
return distr
else:
return AffineTransformed(distr, loc=loc, scale=scale, event_dim=self.event_dim)
@property
def event_shape(self) -> Tuple:
r"""
Shape of each individual event contemplated by the distributions that this object constructs.
"""
return () if self.dim == 1 else (self.dim,)
@property
def event_dim(self) -> int:
r"""
Number of event dimensions, i.e., length of the `event_shape` tuple, of the distributions that this object
constructs.
"""
return len(self.event_shape)
@property
def value_in_support(self) -> float:
r"""
A float that will have a valid numeric value when computing the log-loss of the corresponding distribution. By
default 0.0. This value will be used when padding data series.
"""
return 0.0
def get_parameter_projection(self, in_features: int) -> nn.Module:
r"""
Return the parameter projection layer that maps the input to the appropriate parameters of the distribution.
"""
return ParameterProjection(
in_features=in_features,
args_dim=self.args_dim,
domain_map=LambdaLayer(self.domain_map),
)
def domain_map(self, *args: torch.Tensor):
r"""
Converts arguments to the right shape and domain. The domain depends on the type of distribution, while the
correct shape is obtained by reshaping the trailing axis in such a way that the returned tensors define a
distribution of the right event_shape.
"""
raise NotImplementedError()
@staticmethod
def squareplus(x: torch.Tensor) -> torch.Tensor:
r"""
Helper to map inputs to the positive orthant by applying the square-plus operation. Reference:
https://twitter.com/jon_barron/status/1387167648669048833
"""
return (x + torch.sqrt(torch.square(x) + 4.0)) / 2.0
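For intuition, the `squareplus` helper above can be checked on a few scalars. The following is a minimal plain-Python sketch (it uses `math` rather than `torch`, and the name `squareplus_scalar` is illustrative, not part of this commit). It shows that the outputs stay strictly positive, that 0 maps to 1, and that large positive inputs are left almost unchanged.

```python
import math


def squareplus_scalar(x: float) -> float:
    """Scalar version of (x + sqrt(x^2 + 4)) / 2; hypothetical helper for illustration."""
    return (x + math.sqrt(x * x + 4.0)) / 2.0


values = [-10.0, -1.0, 0.0, 1.0, 10.0]
outputs = [squareplus_scalar(v) for v in values]

# Print each input next to its strictly positive image.
for v, out in zip(values, outputs):
    print(f"squareplus({v:+.1f}) = {out:.4f}")
```

This is why the `domain_map` methods can use it to turn unconstrained projections into valid `scale`, `df`, or `total_count` parameters.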
class StudentTOutput(DistributionOutput):
"""
Student-T distribution output class.
"""
args_dim: Dict[str, int] = {"df": 1, "loc": 1, "scale": 1}
distribution_class: type = StudentT
@classmethod
def domain_map(cls, df: torch.Tensor, loc: torch.Tensor, scale: torch.Tensor):
scale = cls.squareplus(scale)
df = 2.0 + cls.squareplus(df)
return df.squeeze(-1), loc.squeeze(-1), scale.squeeze(-1)
class NormalOutput(DistributionOutput):
"""
Normal distribution output class.
"""
args_dim: Dict[str, int] = {"loc": 1, "scale": 1}
distribution_class: type = Normal
@classmethod
def domain_map(cls, loc: torch.Tensor, scale: torch.Tensor):
scale = cls.squareplus(scale)
return loc.squeeze(-1), scale.squeeze(-1)
class NegativeBinomialOutput(DistributionOutput):
"""
Negative Binomial distribution output class.
"""
args_dim: Dict[str, int] = {"total_count": 1, "logits": 1}
distribution_class: type = NegativeBinomial
@classmethod
def domain_map(cls, total_count: torch.Tensor, logits: torch.Tensor):
total_count = cls.squareplus(total_count)
return total_count.squeeze(-1), logits.squeeze(-1)
def _base_distribution(self, distr_args) -> Distribution:
total_count, logits = distr_args
if self.dim == 1:
return self.distribution_class(total_count=total_count, logits=logits)
else:
return Independent(self.distribution_class(total_count=total_count, logits=logits), 1)
# Overrides the parent class method. We cannot scale using the affine
# transformation since negative binomial should return integers. Instead
# we scale the parameters.
def distribution(
self, distr_args, loc: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None
) -> Distribution:
total_count, logits = distr_args
if scale is not None:
# See scaling property of Gamma.
logits += scale.log()
return self._base_distribution((total_count, logits))
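The parameter scaling in `NegativeBinomialOutput.distribution` can be sanity-checked numerically. This is a minimal sketch, assuming the `(total_count, logits)` parameterization of `torch.distributions.NegativeBinomial`, whose mean is `total_count * exp(logits)`. The helper `nb_mean` and the numbers are illustrative only. Adding `log(scale)` to the logits multiplies the mean by `scale` without applying an affine transform to the (integer) samples.

```python
import math


def nb_mean(total_count: float, logits: float) -> float:
    # With probs / (1 - probs) == exp(logits), the mean is total_count * exp(logits).
    return total_count * math.exp(logits)


total_count, logits, scale = 5.0, -0.3, 4.0

mean_before = nb_mean(total_count, logits)
# Shift the logits by log(scale), as the overridden `distribution` method does.
mean_after = nb_mean(total_count, logits + math.log(scale))

print(mean_before, mean_after)
```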
......@@ -3430,6 +3430,30 @@ def load_tf_weights_in_imagegpt(*args, **kwargs):
requires_backends(load_tf_weights_in_imagegpt, ["torch"])
INFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
class InformerForPrediction(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class InformerModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class InformerPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
JUKEBOX_PRETRAINED_MODEL_ARCHIVE_LIST = None
......
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch Informer model. """
import inspect
import tempfile
import unittest
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import is_torch_available
from transformers.testing_utils import is_flaky, require_torch, slow, torch_device
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
TOLERANCE = 1e-4
if is_torch_available():
import torch
from transformers import InformerConfig, InformerForPrediction, InformerModel
from transformers.models.informer.modeling_informer import InformerDecoder, InformerEncoder
@require_torch
class InformerModelTester:
def __init__(
self,
parent,
batch_size=13,
prediction_length=7,
context_length=14,
cardinality=19,
embedding_dimension=5,
num_time_features=4,
is_training=True,
hidden_size=16,
num_hidden_layers=2,
num_attention_heads=4,
intermediate_size=4,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
lags_sequence=[1, 2, 3, 4, 5],
sampling_factor=10,
distil=False,
):
self.parent = parent
self.batch_size = batch_size
self.prediction_length = prediction_length
self.context_length = context_length
self.cardinality = cardinality
self.num_time_features = num_time_features
self.lags_sequence = lags_sequence
self.embedding_dimension = embedding_dimension
self.is_training = is_training
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.encoder_seq_length = min(
sampling_factor * np.ceil(np.log1p(context_length)).astype("int").item(), context_length
)
self.decoder_seq_length = min(
sampling_factor * np.ceil(np.log1p(prediction_length)).astype("int").item(), prediction_length
)
self.sampling_factor = sampling_factor
self.distil = distil
def get_config(self):
return InformerConfig(
prediction_length=self.prediction_length,
d_model=self.hidden_size,
encoder_layers=self.num_hidden_layers,
decoder_layers=self.num_hidden_layers,
encoder_attention_heads=self.num_attention_heads,
decoder_attention_heads=self.num_attention_heads,
encoder_ffn_dim=self.intermediate_size,
decoder_ffn_dim=self.intermediate_size,
dropout=self.hidden_dropout_prob,
attention_dropout=self.attention_probs_dropout_prob,
context_length=self.context_length,
lags_sequence=self.lags_sequence,
num_time_features=self.num_time_features,
num_static_categorical_features=1,
num_static_real_features=1,
cardinality=[self.cardinality],
embedding_dimension=[self.embedding_dimension],
sampling_factor=self.sampling_factor,
distil=self.distil,
)
def prepare_informer_inputs_dict(self, config):
_past_length = config.context_length + max(config.lags_sequence)
static_categorical_features = ids_tensor([self.batch_size, 1], config.cardinality[0])
static_real_features = floats_tensor([self.batch_size, 1])
past_time_features = floats_tensor([self.batch_size, _past_length, config.num_time_features])
past_values = floats_tensor([self.batch_size, _past_length])
past_observed_mask = floats_tensor([self.batch_size, _past_length])
# decoder inputs
future_time_features = floats_tensor([self.batch_size, config.prediction_length, config.num_time_features])
future_values = floats_tensor([self.batch_size, config.prediction_length])
inputs_dict = {
"past_values": past_values,
"static_categorical_features": static_categorical_features,
"static_real_features": static_real_features,
"past_time_features": past_time_features,
"past_observed_mask": past_observed_mask,
"future_time_features": future_time_features,
"future_values": future_values,
}
return inputs_dict
def prepare_config_and_inputs(self):
config = self.get_config()
inputs_dict = self.prepare_informer_inputs_dict(config)
return config, inputs_dict
def prepare_config_and_inputs_for_common(self):
config, inputs_dict = self.prepare_config_and_inputs()
return config, inputs_dict
def check_encoder_decoder_model_standalone(self, config, inputs_dict):
model = InformerModel(config=config).to(torch_device).eval()
outputs = model(**inputs_dict)
encoder_last_hidden_state = outputs.encoder_last_hidden_state
last_hidden_state = outputs.last_hidden_state
with tempfile.TemporaryDirectory() as tmpdirname:
encoder = model.get_encoder()
encoder.save_pretrained(tmpdirname)
encoder = InformerEncoder.from_pretrained(tmpdirname).to(torch_device)
transformer_inputs, _, _, _ = model.create_network_inputs(**inputs_dict)
enc_input = transformer_inputs[:, : config.context_length, ...]
dec_input = transformer_inputs[:, config.context_length :, ...]
encoder_last_hidden_state_2 = encoder(inputs_embeds=enc_input)[0]
self.parent.assertTrue((encoder_last_hidden_state_2 - encoder_last_hidden_state).abs().max().item() < 1e-3)
with tempfile.TemporaryDirectory() as tmpdirname:
decoder = model.get_decoder()
decoder.save_pretrained(tmpdirname)
decoder = InformerDecoder.from_pretrained(tmpdirname).to(torch_device)
last_hidden_state_2 = decoder(
inputs_embeds=dec_input,
encoder_hidden_states=encoder_last_hidden_state,
)[0]
self.parent.assertTrue((last_hidden_state_2 - last_hidden_state).abs().max().item() < 1e-3)
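With the tester defaults, the `encoder_seq_length` and `decoder_seq_length` expressions in `InformerModelTester.__init__` reduce to the full context and prediction lengths. A short plain-Python sketch of the same arithmetic follows (the function name `sampled_seq_length` and the 1000-step example are illustrative, not from the commit):

```python
import math


def sampled_seq_length(seq_length: int, sampling_factor: int) -> int:
    # Same arithmetic as the tester: min(factor * ceil(ln(1 + L)), L).
    return min(sampling_factor * math.ceil(math.log1p(seq_length)), seq_length)


# Tester defaults: context_length=14, prediction_length=7, sampling_factor=10.
encoder_seq_length = sampled_seq_length(14, 10)
decoder_seq_length = sampled_seq_length(7, 10)

# A longer, hypothetical sequence where the logarithmic cap takes effect.
long_seq_length = sampled_seq_length(1000, 5)

print(encoder_seq_length, decoder_seq_length, long_seq_length)
```

For short sequences the cap is inactive, so the attention shape checks in the tests see the full lengths. Only for long inputs does the logarithmic term become the smaller of the two.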
@require_torch
class InformerModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (InformerModel, InformerForPrediction) if is_torch_available() else ()
all_generative_model_classes = (InformerForPrediction,) if is_torch_available() else ()
is_encoder_decoder = True
test_pruning = False
test_head_masking = False
test_missing_keys = False
test_torchscript = False
test_inputs_embeds = False
test_model_common_attributes = False
def setUp(self):
self.model_tester = InformerModelTester(self)
self.config_tester = ConfigTester(
self,
config_class=InformerConfig,
has_text_modality=False,
prediction_length=self.model_tester.prediction_length,
)
def test_config(self):
self.config_tester.run_common_tests()
def test_save_load_strict(self):
config, _ = self.model_tester.prepare_config_and_inputs()
for model_class in self.all_model_classes:
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
def test_encoder_decoder_model_standalone(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.check_encoder_decoder_model_standalone(*config_and_inputs)
def test_hidden_states_output(self):
def check_hidden_states_output(inputs_dict, config, model_class):
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
hidden_states = outputs.encoder_hidden_states if config.is_encoder_decoder else outputs.hidden_states
expected_num_layers = getattr(
self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
)
self.assertEqual(len(hidden_states), expected_num_layers)
if hasattr(self.model_tester, "encoder_seq_length"):
seq_length = self.model_tester.context_length
if hasattr(self.model_tester, "chunk_length") and self.model_tester.chunk_length > 1:
seq_length = seq_length * self.model_tester.chunk_length
else:
seq_length = self.model_tester.seq_length
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[seq_length, self.model_tester.hidden_size],
)
if config.is_encoder_decoder:
hidden_states = outputs.decoder_hidden_states
self.assertIsInstance(hidden_states, (list, tuple))
self.assertEqual(len(hidden_states), expected_num_layers)
seq_len = getattr(self.model_tester, "seq_length", None)
decoder_seq_length = getattr(self.model_tester, "prediction_length", seq_len)
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[decoder_seq_length, self.model_tester.hidden_size],
)
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
inputs_dict["output_hidden_states"] = True
check_hidden_states_output(inputs_dict, config, model_class)
# check that output_hidden_states also works using config
del inputs_dict["output_hidden_states"]
config.output_hidden_states = True
check_hidden_states_output(inputs_dict, config, model_class)
# Ignore since we have no token embeddings
def test_resize_tokens_embeddings(self):
pass
def test_model_outputs_equivalence(self):
pass
def test_determinism(self):
pass
# Input is 'past_values' not 'input_ids'
def test_model_main_input_name(self):
model_signature = inspect.signature(getattr(InformerModel, "forward"))
# The main input is the name of the argument after `self`
observed_main_input_name = list(model_signature.parameters.keys())[1]
self.assertEqual(InformerModel.main_input_name, observed_main_input_name)
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = [
"past_values",
"past_time_features",
"past_observed_mask",
"static_categorical_features",
"static_real_features",
"future_values",
"future_time_features",
]
expected_arg_names.extend(
[
"future_observed_mask",
"decoder_attention_mask",
"head_mask",
"decoder_head_mask",
"cross_attn_head_mask",
"encoder_outputs",
"past_key_values",
"output_hidden_states",
"output_attentions",
"use_cache",
"return_dict",
]
if "future_observed_mask" in arg_names
else [
"decoder_attention_mask",
"head_mask",
"decoder_head_mask",
"cross_attn_head_mask",
"encoder_outputs",
"past_key_values",
"output_hidden_states",
"output_attentions",
"use_cache",
"return_dict",
]
)
self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
def test_attention_outputs(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
seq_len = getattr(self.model_tester, "seq_length", None)
decoder_seq_length = getattr(self.model_tester, "decoder_seq_length", seq_len)
encoder_seq_length = getattr(self.model_tester, "encoder_seq_length", seq_len)
context_length = getattr(self.model_tester, "context_length", seq_len)
prediction_length = getattr(self.model_tester, "prediction_length", seq_len)
for model_class in self.all_model_classes:
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
# check that output_attentions also works using config
del inputs_dict["output_attentions"]
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
attentions = outputs.encoder_attentions
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, encoder_seq_length, context_length],
)
out_len = len(outputs)
correct_outlen = 7
if "last_hidden_state" in outputs:
correct_outlen += 1
if "past_key_values" in outputs:
correct_outlen += 1 # past_key_values have been returned
if "loss" in outputs:
correct_outlen += 1
if "params" in outputs:
correct_outlen += 1
self.assertEqual(out_len, correct_outlen)
# decoder attentions
decoder_attentions = outputs.decoder_attentions
self.assertIsInstance(decoder_attentions, (list, tuple))
self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(decoder_attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, decoder_seq_length, prediction_length],
)
# cross attentions
cross_attentions = outputs.cross_attentions
self.assertIsInstance(cross_attentions, (list, tuple))
self.assertEqual(len(cross_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(cross_attentions[0].shape[-3:]),
[
self.model_tester.num_attention_heads,
decoder_seq_length,
encoder_seq_length,
],
)
# Check attention is always last and order is fine
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self.assertEqual(out_len + 2, len(outputs))
self_attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, encoder_seq_length, context_length],
)
@is_flaky()
def test_retain_grad_hidden_states_attentions(self):
super().test_retain_grad_hidden_states_attentions()
def prepare_batch(filename="train-batch.pt"):
file = hf_hub_download(repo_id="kashif/tourism-monthly-batch", filename=filename, repo_type="dataset")
batch = torch.load(file, map_location=torch_device)
return batch
@require_torch
@slow
class InformerModelIntegrationTests(unittest.TestCase):
def test_inference_no_head(self):
model = InformerModel.from_pretrained("huggingface/informer-tourism-monthly").to(torch_device)
batch = prepare_batch()
torch.manual_seed(0)
with torch.no_grad():
output = model(
past_values=batch["past_values"],
past_time_features=batch["past_time_features"],
past_observed_mask=batch["past_observed_mask"],
static_categorical_features=batch["static_categorical_features"],
future_values=batch["future_values"],
future_time_features=batch["future_time_features"],
).last_hidden_state
expected_shape = torch.Size((64, model.config.context_length, model.config.d_model))
self.assertEqual(output.shape, expected_shape)
expected_slice = torch.tensor(
[[0.4699, 0.7295, 0.8967], [0.4858, 0.3810, 0.9641], [-0.0233, 0.3608, 1.0303]],
device=torch_device,
)
self.assertTrue(torch.allclose(output[0, :3, :3], expected_slice, atol=TOLERANCE))
def test_inference_head(self):
model = InformerForPrediction.from_pretrained("huggingface/informer-tourism-monthly").to(torch_device)
batch = prepare_batch("val-batch.pt")
torch.manual_seed(0)
with torch.no_grad():
output = model(
past_values=batch["past_values"],
past_time_features=batch["past_time_features"],
past_observed_mask=batch["past_observed_mask"],
static_categorical_features=batch["static_categorical_features"],
future_time_features=batch["future_time_features"],
).encoder_last_hidden_state
# encoder distils the context length to 1/8th of the original length
expected_shape = torch.Size((64, model.config.context_length // 8, model.config.d_model))
self.assertEqual(output.shape, expected_shape)
expected_slice = torch.tensor(
[[0.4170, 0.9067, 0.8153], [0.3004, 0.7574, 0.7066], [0.6803, -0.6323, 1.2802]], device=torch_device
)
self.assertTrue(torch.allclose(output[0, :3, :3], expected_slice, atol=TOLERANCE))
def test_seq_to_seq_generation(self):
model = InformerForPrediction.from_pretrained("huggingface/informer-tourism-monthly").to(torch_device)
batch = prepare_batch("val-batch.pt")
torch.manual_seed(0)
with torch.no_grad():
outputs = model.generate(
static_categorical_features=batch["static_categorical_features"],
past_time_features=batch["past_time_features"],
past_values=batch["past_values"],
future_time_features=batch["future_time_features"],
past_observed_mask=batch["past_observed_mask"],
)
expected_shape = torch.Size((64, model.config.num_parallel_samples, model.config.prediction_length))
self.assertEqual(outputs.sequences.shape, expected_shape)
expected_slice = torch.tensor([3400.8005, 4289.2637, 7101.9209], device=torch_device)
mean_prediction = outputs.sequences.mean(dim=1)
self.assertTrue(torch.allclose(mean_prediction[0, -3:], expected_slice, rtol=1e-1))
......@@ -69,6 +69,10 @@ SPECIAL_CASES_TO_ALLOW = {
"RetriBertConfig": ["layer_norm_eps"],
# having default values other than `1e-5` - we can't fix them without breaking
"TrajectoryTransformerConfig": ["layer_norm_eps"],
# used internally to calculate the feature size
"InformerConfig": ["num_static_real_features", "num_time_features"],
# used internally to calculate the feature size
"TimeSeriesTransformerConfig": ["num_static_real_features", "num_time_features"],
}
# TODO (ydshieh): Check the failing cases, try to fix them or move some cases to the above block once we are sure
......@@ -97,7 +101,6 @@ SPECIAL_CASES_TO_ALLOW.update(
"SwitchTransformersConfig": True,
"TableTransformerConfig": True,
"TapasConfig": True,
"TimeSeriesTransformerConfig": True,
"TrajectoryTransformerConfig": True,
"TransfoXLConfig": True,
"UniSpeechConfig": True,
......
......@@ -68,6 +68,8 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
"TableTransformerDecoder", # Building part of bigger (tested) model.
"TimeSeriesTransformerEncoder", # Building part of bigger (tested) model.
"TimeSeriesTransformerDecoder", # Building part of bigger (tested) model.
"InformerEncoder", # Building part of bigger (tested) model.
"InformerDecoder", # Building part of bigger (tested) model.
"JukeboxVQVAE", # Building part of bigger (tested) model.
"JukeboxPrior", # Building part of bigger (tested) model.
"DeformableDetrEncoder", # Building part of bigger (tested) model.
......@@ -208,6 +210,7 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
"EsmForProteinFolding",
"GPTSanJapaneseModel",
"TimeSeriesTransformerForPrediction",
"InformerForPrediction",
"JukeboxVQVAE",
"JukeboxPrior",
"PegasusXEncoder",
......