Unverified commit 3323146e, authored by Sylvain Gugger and committed by GitHub

Models doc (#7345)



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FlauBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 58405a52
@@ -38,38 +38,37 @@ class RetriBertConfig(PretrainedConfig):
    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30522):
            Vocabulary size of the RetriBERT model. Defines the number of different tokens that can be represented
            by the :obj:`inputs_ids` passed when calling :class:`~transformers.RetriBertModel`.
        hidden_size (:obj:`int`, `optional`, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed into :class:`~transformers.BertModel`.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        share_encoders (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to use the same Bert-type encoder for the queries and document.
        projection_dim (:obj:`int`, `optional`, defaults to 128):
            Final dimension of the query and document representation after projection.
    """

    model_type = "retribert"
...
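As a quick illustration of how such a configuration is consumed (a minimal sketch, assuming the `transformers` package from around this commit; the overridden values are arbitrary)::

    from transformers import RetriBertConfig, RetriBertModel

    config = RetriBertConfig()                                              # documented defaults (vocab_size=30522, ...)
    small_config = RetriBertConfig(num_hidden_layers=6, projection_dim=64)  # override any documented argument
    model = RetriBertModel(small_config)                                    # randomly initialized model built from the config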
@@ -33,10 +33,10 @@ ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class RobertaConfig(BertConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.RobertaModel` or a
    :class:`~transformers.TFRobertaModel`. It is used to instantiate a RoBERTa model according to the specified
    arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
@@ -45,7 +45,7 @@ class RobertaConfig(BertConfig):
    The :class:`~transformers.RobertaConfig` class directly inherits :class:`~transformers.BertConfig`.
    It reuses the same defaults. Please check the parent class for more information.

    Examples::

        >>> from transformers import RobertaConfig, RobertaModel
...
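Since every configuration inherits from :class:`~transformers.PretrainedConfig`, it can be serialized and reloaded; a minimal sketch (the directory name is made up for illustration)::

    from transformers import RobertaConfig

    config = RobertaConfig()                       # reuses the BertConfig defaults
    config.save_pretrained("./my-roberta-config")  # writes config.json to the (hypothetical) directory

    reloaded = RobertaConfig.from_pretrained("./my-roberta-config")
    assert reloaded.to_dict() == config.to_dict()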
@@ -31,33 +31,44 @@ T5_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class T5Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.T5Model` or a
    :class:`~transformers.TFT5Model`. It is used to instantiate a T5 model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the T5 `t5-small <https://huggingface.co/t5-small>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
    for more information.

    Arguments:
        vocab_size (:obj:`int`, `optional`, defaults to 32128):
            Vocabulary size of the T5 model. Defines the number of different tokens that can be represented by the
            :obj:`inputs_ids` passed when calling :class:`~transformers.T5Model` or
            :class:`~transformers.TFT5Model`.
        n_positions (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something
            large just in case (e.g., 512 or 1024 or 2048).
        d_model (:obj:`int`, `optional`, defaults to 512):
            Size of the encoder layers and the pooler layer.
        d_kv (:obj:`int`, `optional`, defaults to 64):
            Size of the key, query, value projections per attention head. :obj:`d_kv` has to be equal to
            :obj:`d_model // num_heads`.
        d_ff (:obj:`int`, `optional`, defaults to 2048):
            Size of the intermediate feed forward layer in each :obj:`T5Block`.
        num_layers (:obj:`int`, `optional`, defaults to 6):
            Number of hidden layers in the Transformer encoder.
        num_heads (:obj:`int`, `optional`, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        relative_attention_num_buckets (:obj:`int`, `optional`, defaults to 32):
            The number of buckets to use for each attention layer.
        dropout_rate (:obj:`float`, `optional`, defaults to 0.1):
            The ratio for all dropout layers.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-6):
            The epsilon used by the layer normalization layers.
        initializer_factor (:obj:`float`, `optional`, defaults to 1):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).
    """

    model_type = "t5"
...
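The constraint on :obj:`d_kv` mentioned above (it should equal :obj:`d_model // num_heads`) can be checked directly; a small sketch with illustrative values::

    from transformers import T5Config

    config = T5Config(d_model=512, num_heads=8, d_kv=64)   # 512 // 8 == 64
    assert config.d_kv == config.d_model // config.num_heads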
@@ -31,69 +31,70 @@ TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class TransfoXLConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.TransfoXLModel` or a
    :class:`~transformers.TFTransfoXLModel`. It is used to instantiate a Transformer-XL model according to the
    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield
    a similar configuration to that of the `Transformer XL <https://huggingface.co/transfo-xl-wt103>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
    for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 267735):
            Vocabulary size of the Transformer-XL model. Defines the number of different tokens that can be
            represented by the :obj:`inputs_ids` passed when calling :class:`~transformers.TransfoXLModel` or
            :class:`~transformers.TFTransfoXLModel`.
        cutoffs (:obj:`List[int]`, `optional`, defaults to :obj:`[20000, 40000, 200000]`):
            Cutoffs for the adaptive softmax.
        d_model (:obj:`int`, `optional`, defaults to 1024):
            Dimensionality of the model's hidden states.
        d_embed (:obj:`int`, `optional`, defaults to 1024):
            Dimensionality of the embeddings.
        n_head (:obj:`int`, `optional`, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        d_head (:obj:`int`, `optional`, defaults to 64):
            Dimensionality of the model's heads.
        d_inner (:obj:`int`, `optional`, defaults to 4096):
            Inner dimension of the feed-forward (FF) layers.
        div_val (:obj:`int`, `optional`, defaults to 4):
            Dividend value for the adaptive input and softmax.
        pre_lnorm (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to apply LayerNorm to the input instead of the output in the blocks.
        n_layer (:obj:`int`, `optional`, defaults to 18):
            Number of hidden layers in the Transformer encoder.
        mem_len (:obj:`int`, `optional`, defaults to 1600):
            Length of the retained previous hidden states (the memory).
        clamp_len (:obj:`int`, `optional`, defaults to 1000):
            Use the same positional embeddings after clamp_len.
        same_length (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to use the same attention length for all tokens.
        proj_share_all_but_first (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to share all but the first projection layers.
        attn_type (:obj:`int`, `optional`, defaults to 0):
            Attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.
        sample_softmax (:obj:`int`, `optional`, defaults to -1):
            Number of samples in the sampled softmax.
        adaptive (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to use adaptive softmax.
        dropout (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        dropatt (:obj:`float`, `optional`, defaults to 0):
            The dropout ratio for the attention probabilities.
        untie_r (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to untie relative position biases.
        init (:obj:`str`, `optional`, defaults to :obj:`"normal"`):
            Parameter initializer to use.
        init_range (:obj:`float`, `optional`, defaults to 0.01):
            Parameters initialized by U(-init_range, init_range).
        proj_init_std (:obj:`float`, `optional`, defaults to 0.01):
            Parameters initialized by N(0, init_std).
        init_std (:obj:`float`, `optional`, defaults to 0.02):
            Parameters initialized by N(0, init_std).
        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
            The epsilon to use in the layer normalization layers.

    Examples::

        >>> from transformers import TransfoXLConfig, TransfoXLModel
...
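A minimal sketch of how the adaptive-softmax arguments documented above fit together (the values repeat the documented defaults and are only for illustration)::

    from transformers import TransfoXLConfig, TransfoXLModel

    # the cutoffs split the 267735-token vocabulary into frequency bands for the adaptive input/softmax
    config = TransfoXLConfig(cutoffs=[20000, 40000, 200000], adaptive=True, mem_len=1600)
    model = TransfoXLModel(config)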
@@ -36,109 +36,109 @@ XLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class XLMConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.XLMModel` or a
    :class:`~transformers.TFXLMModel`. It is used to instantiate an XLM model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
    for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30145):
            Vocabulary size of the XLM model. Defines the number of different tokens that can be represented by the
            :obj:`inputs_ids` passed when calling :class:`~transformers.XLMModel` or
            :class:`~transformers.TFXLMModel`.
        emb_dim (:obj:`int`, `optional`, defaults to 2048):
            Dimensionality of the encoder layers and the pooler layer.
        n_layer (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        n_head (:obj:`int`, `optional`, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        dropout (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for the attention mechanism.
        gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to use `gelu` for the activations instead of `relu`.
        sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
        causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not the model should behave in a causal manner.
            Causal models use a triangular attention mask in order to only attend to the left-side context instead
            of a bidirectional context.
        asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the
            prediction layer.
        n_langs (:obj:`int`, `optional`, defaults to 1):
            The number of languages the model handles. Set to 1 for monolingual models.
        use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to use language embeddings. Some models use additional language embeddings, see
            `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__
            for information on how to use them.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something
            large just in case (e.g., 512 or 1024 or 2048).
        embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
            The standard deviation of the truncated_normal_initializer for initializing the embedding matrices.
        init_std (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices except
            the embedding matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        bos_index (:obj:`int`, `optional`, defaults to 0):
            The index of the beginning of sentence token in the vocabulary.
        eos_index (:obj:`int`, `optional`, defaults to 1):
            The index of the end of sentence token in the vocabulary.
        pad_index (:obj:`int`, `optional`, defaults to 2):
            The index of the padding token in the vocabulary.
        unk_index (:obj:`int`, `optional`, defaults to 3):
            The index of the unknown token in the vocabulary.
        mask_index (:obj:`int`, `optional`, defaults to 5):
            The index of the masking token in the vocabulary.
        is_encoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
        summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

            Has to be one of the following options:

                - :obj:`"last"`: Take the last token hidden state (like XLNet).
                - :obj:`"first"`: Take the first token hidden state (like BERT).
                - :obj:`"mean"`: Take the mean of all tokens hidden states.
                - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
                - :obj:`"attn"`: Not implemented now, use multi-head attention.
        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

            Whether or not to add a projection after the vector extraction.
        summary_activation (:obj:`str`, `optional`):
            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
        summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Used in the sequence classification and multiple choice models.

            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
        summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
            Used in the sequence classification and multiple choice models.

            The dropout ratio to be used after the projection and activation.
        start_n_top (:obj:`int`, `optional`, defaults to 5):
            Used in the SQuAD evaluation script.
        end_n_top (:obj:`int`, `optional`, defaults to 5):
            Used in the SQuAD evaluation script.
        mask_token_id (:obj:`int`, `optional`, defaults to 0):
            Model agnostic parameter to identify masked tokens when generating text in an MLM context.
        lang_id (:obj:`int`, `optional`, defaults to 1):
            The ID of the language used by the model. This parameter is used when generating text in a given
            language.

    Examples::

        >>> from transformers import XLMConfig, XLMModel
...
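A minimal sketch of a multilingual setup using the language-related arguments above (assuming the `transformers` package from around this commit)::

    from transformers import XLMConfig, XLMModel

    # language embeddings are only meaningful when the model handles more than one language
    config = XLMConfig(n_langs=2, use_lang_emb=True, causal=False)
    model = XLMModel(config)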
@@ -31,85 +31,86 @@ XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class XLNetConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.XLNetModel` or a
    :class:`~transformers.TFXLNetModel`. It is used to instantiate an XLNet model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the `xlnet-large-cased <https://huggingface.co/xlnet-large-cased>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
    for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 32000):
            Vocabulary size of the XLNet model. Defines the number of different tokens that can be represented by the
            :obj:`inputs_ids` passed when calling :class:`~transformers.XLNetModel` or
            :class:`~transformers.TFXLNetModel`.
        d_model (:obj:`int`, `optional`, defaults to 1024):
            Dimensionality of the encoder layers and the pooler layer.
        n_layer (:obj:`int`, `optional`, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        n_head (:obj:`int`, `optional`, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        d_inner (:obj:`int`, `optional`, defaults to 4096):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        ff_activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        untie_r (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to untie relative position biases.
        attn_type (:obj:`str`, `optional`, defaults to :obj:`"bi"`):
            The attention type used by the model. Set :obj:`"bi"` for XLNet, :obj:`"uni"` for Transformer-XL.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        dropout (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        mem_len (:obj:`int` or :obj:`None`, `optional`):
            The number of tokens to cache. The key/value pairs that have already been pre-computed
            in a previous forward pass won't be re-computed. See the
            `quickstart <https://huggingface.co/transformers/quickstart.html#using-the-past>`__
            for more information.
        reuse_len (:obj:`int`, `optional`):
            The number of tokens in the current batch to be cached and reused in the future.
        bi_data (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use bidirectional input pipeline. Usually set to :obj:`True` during
            pretraining and :obj:`False` during finetuning.
        clamp_len (:obj:`int`, `optional`, defaults to -1):
            Clamp all relative distances larger than clamp_len.
            Setting this attribute to -1 means no clamping.
        same_length (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use the same attention length for each token.
        summary_type (:obj:`str`, `optional`, defaults to :obj:`"last"`):
            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

            Has to be one of the following options:

                - :obj:`"last"`: Take the last token hidden state (like XLNet).
                - :obj:`"first"`: Take the first token hidden state (like BERT).
                - :obj:`"mean"`: Take the mean of all tokens hidden states.
                - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
                - :obj:`"attn"`: Not implemented now, use multi-head attention.
        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

            Whether or not to add a projection after the vector extraction.
        summary_activation (:obj:`str`, `optional`):
            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.

            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
        summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Used in the sequence classification and multiple choice models.

            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
        summary_last_dropout (:obj:`float`, `optional`, defaults to 0.1):
            Used in the sequence classification and multiple choice models.

            The dropout ratio to be used after the projection and activation.
        start_n_top (:obj:`int`, `optional`, defaults to 5):
            Used in the SQuAD evaluation script.
        end_n_top (:obj:`int`, `optional`, defaults to 5):
            Used in the SQuAD evaluation script.
        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not the model should return the last pre-computed hidden states.
@@ -117,7 +118,7 @@ class XLNetConfig(PretrainedConfig):
            This flag behaves differently from other models: it just controls the inference behavior, during
            training the model always uses ``use_cache=True``.

    Examples::

        >>> from transformers import XLNetConfig, XLNetModel
...
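A minimal sketch of the caching arguments described above (the values are arbitrary)::

    from transformers import XLNetConfig, XLNetModel

    # keep up to 128 tokens of pre-computed key/value pairs between forward passes
    config = XLNetConfig(mem_len=128, attn_type="bi", use_cache=True)
    model = XLNetModel(config)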
@@ -471,6 +471,7 @@ TF_SEQUENCE_CLASSIFICATION_SAMPLE = r"""
TF_MASKED_LM_SAMPLE = r"""
    Example::

        >>> from transformers import {tokenizer_class}, {model_class}
        >>> import tensorflow as tf
...
@@ -428,7 +428,8 @@ class AlbertForPreTrainingOutput(ModelOutput):
    Args:
        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
            Total loss as the sum of the masked language modeling loss and the next sequence prediction
            (classification) loss.
        prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        sop_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
@@ -456,7 +457,11 @@ class AlbertForPreTrainingOutput(ModelOutput):
ALBERT_START_DOCSTRING = r"""

    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the
    generic methods the library implements for all its models (such as downloading or saving, resizing the input
    embeddings, pruning heads etc.)

    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
    general usage and behavior.
@@ -468,27 +473,31 @@ ALBERT_START_DOCSTRING = r"""
ALBERT_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
            Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using :class:`~transformers.AlbertTokenizer`.
            See :meth:`transformers.PreTrainedTokenizer.__call__` and
            :meth:`transformers.PreTrainedTokenizer.encode` for details.

            `What are input IDs? <../glossary.html#input-ids>`__
        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
            Mask to avoid performing attention on padding token indices.
            Mask values selected in ``[0, 1]``:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            `What are attention masks? <../glossary.html#attention-mask>`__
        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
            Segment token indices to indicate first and second portions of the inputs.
            Indices are selected in ``[0, 1]``:

            - 0 corresponds to a `sentence A` token,
            - 1 corresponds to a `sentence B` token.

            `What are token type IDs? <../glossary.html#token-type-ids>`_
        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
            Indices of positions of each input sequence tokens in the position embeddings.
            Selected in the range ``[0, config.max_position_embeddings - 1]``.
@@ -496,18 +505,22 @@ ALBERT_INPUTS_DOCSTRING = r"""
        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
            Mask to nullify selected heads of the self-attention modules.
            Mask values selected in ``[0, 1]``:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.

        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
            representation. This is useful if you want more control over how to convert :obj:`input_ids` indices
            into associated vectors than the model's internal embedding lookup matrix.
        output_attentions (:obj:`bool`, `optional`):
            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
            returned tensors for more detail.
        output_hidden_states (:obj:`bool`, `optional`):
            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors
            for more detail.
        return_dict (:obj:`bool`, `optional`):
            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""
...@@ -562,7 +575,7 @@ class AlbertModel(AlbertPreTrainedModel): ...@@ -562,7 +575,7 @@ class AlbertModel(AlbertPreTrainedModel):
inner_group_idx = int(layer - group_idx * self.config.inner_group_num) inner_group_idx = int(layer - group_idx * self.config.inner_group_num)
self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads) self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads)
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="albert-base-v2", checkpoint="albert-base-v2",
...@@ -656,7 +669,7 @@ class AlbertForPreTraining(AlbertPreTrainedModel): ...@@ -656,7 +669,7 @@ class AlbertForPreTraining(AlbertPreTrainedModel):
def get_input_embeddings(self): def get_input_embeddings(self):
return self.albert.embeddings.word_embeddings return self.albert.embeddings.word_embeddings
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=AlbertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) @replace_return_docstrings(output_type=AlbertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
def forward( def forward(
self, self,
...@@ -674,22 +687,22 @@ class AlbertForPreTraining(AlbertPreTrainedModel): ...@@ -674,22 +687,22 @@ class AlbertForPreTraining(AlbertPreTrainedModel):
**kwargs, **kwargs,
): ):
r""" r"""
labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`): labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
Labels for computing the masked language modeling loss. Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]`` in ``[0, ..., config.vocab_size]``
sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`): sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring) Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
Indices should be in ``[0, 1]``. Indices should be in ``[0, 1]``.
``0`` indicates original order (sequence A, then sequence B), ``0`` indicates original order (sequence A, then sequence B),
``1`` indicates switched order (sequence B, then sequence A). ``1`` indicates switched order (sequence B, then sequence A).
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Used to hide legacy arguments that have been deprecated. Used to hide legacy arguments that have been deprecated.
Returns: Returns:
Examples:: Example::
>>> from transformers import AlbertTokenizer, AlbertForPreTraining >>> from transformers import AlbertTokenizer, AlbertForPreTraining
>>> import torch >>> import torch
...@@ -807,7 +820,7 @@ class AlbertForMaskedLM(AlbertPreTrainedModel): ...@@ -807,7 +820,7 @@ class AlbertForMaskedLM(AlbertPreTrainedModel):
def get_input_embeddings(self): def get_input_embeddings(self):
return self.albert.embeddings.word_embeddings return self.albert.embeddings.word_embeddings
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="albert-base-v2", checkpoint="albert-base-v2",
...@@ -894,7 +907,7 @@ class AlbertForSequenceClassification(AlbertPreTrainedModel): ...@@ -894,7 +907,7 @@ class AlbertForSequenceClassification(AlbertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="albert-base-v2", checkpoint="albert-base-v2",
...@@ -978,7 +991,7 @@ class AlbertForTokenClassification(AlbertPreTrainedModel): ...@@ -978,7 +991,7 @@ class AlbertForTokenClassification(AlbertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="albert-base-v2", checkpoint="albert-base-v2",
...@@ -1061,7 +1074,7 @@ class AlbertForQuestionAnswering(AlbertPreTrainedModel): ...@@ -1061,7 +1074,7 @@ class AlbertForQuestionAnswering(AlbertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="albert-base-v2", checkpoint="albert-base-v2",
...@@ -1085,11 +1098,11 @@ class AlbertForQuestionAnswering(AlbertPreTrainedModel): ...@@ -1085,11 +1098,11 @@ class AlbertForQuestionAnswering(AlbertPreTrainedModel):
r""" r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss. Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss. Position outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss. Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...@@ -1158,7 +1171,7 @@ class AlbertForMultipleChoice(AlbertPreTrainedModel): ...@@ -1158,7 +1171,7 @@ class AlbertForMultipleChoice(AlbertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)")) @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="albert-base-v2", checkpoint="albert-base-v2",
......
...@@ -619,7 +619,8 @@ class BertForPreTrainingOutput(ModelOutput): ...@@ -619,7 +619,8 @@ class BertForPreTrainingOutput(ModelOutput):
Args: Args:
loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`): loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss. Total loss as the sum of the masked language modeling loss and the next sequence prediction
(classification) loss.
prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`): prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`): seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
...@@ -646,7 +647,12 @@ class BertForPreTrainingOutput(ModelOutput): ...@@ -646,7 +647,12 @@ class BertForPreTrainingOutput(ModelOutput):
BERT_START_DOCSTRING = r""" BERT_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior. usage and behavior.
...@@ -658,27 +664,31 @@ BERT_START_DOCSTRING = r""" ...@@ -658,27 +664,31 @@ BERT_START_DOCSTRING = r"""
BERT_INPUTS_DOCSTRING = r""" BERT_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`): input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.BertTokenizer`. Indices can be obtained using :class:`~transformers.BertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`): token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_ `What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`): position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``. Selected in the range ``[0, config.max_position_embeddings - 1]``.
...@@ -686,18 +696,22 @@ BERT_INPUTS_DOCSTRING = r""" ...@@ -686,18 +696,22 @@ BERT_INPUTS_DOCSTRING = r"""
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
...@@ -710,18 +724,15 @@ class BertModel(BertPreTrainedModel): ...@@ -710,18 +724,15 @@ class BertModel(BertPreTrainedModel):
The model can behave as an encoder (with only self-attention) as well The model can behave as an encoder (with only self-attention) as well
as a decoder, in which case a layer of cross-attention is added between as a decoder, in which case a layer of cross-attention is added between
the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani, the self-attention layers, following the architecture described in `Attention is all you need
Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
To behave as a decoder the model needs to be initialized with the To behave as a decoder the model needs to be initialized with the
:obj:`is_decoder` argument of the configuration set to :obj:`True`. :obj:`is_decoder` argument of the configuration set to :obj:`True`.
To be used in a Seq2Seq model, the model needs to be initialized with both :obj:`is_decoder` To be used in a Seq2Seq model, the model needs to be initialized with both :obj:`is_decoder`
argument and :obj:`add_cross_attention` set to :obj:`True`; an argument and :obj:`add_cross_attention` set to :obj:`True`; an
:obj:`encoder_hidden_states` is then expected as an input to the forward pass. :obj:`encoder_hidden_states` is then expected as an input to the forward pass.
.. _`Attention is all you need`:
https://arxiv.org/abs/1706.03762
""" """
def __init__(self, config): def __init__(self, config):
...@@ -748,7 +759,7 @@ class BertModel(BertPreTrainedModel): ...@@ -748,7 +759,7 @@ class BertModel(BertPreTrainedModel):
for layer, heads in heads_to_prune.items(): for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads) self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-uncased", checkpoint="bert-base-uncased",
...@@ -777,7 +788,9 @@ class BertModel(BertPreTrainedModel): ...@@ -777,7 +788,9 @@ class BertModel(BertPreTrainedModel):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder. is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
""" """
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = ( output_hidden_states = (
...@@ -867,7 +880,7 @@ class BertForPreTraining(BertPreTrainedModel): ...@@ -867,7 +880,7 @@ class BertForPreTraining(BertPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.cls.predictions.decoder return self.cls.predictions.decoder
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
def forward( def forward(
self, self,
...@@ -885,22 +898,23 @@ class BertForPreTraining(BertPreTrainedModel): ...@@ -885,22 +898,23 @@ class BertForPreTraining(BertPreTrainedModel):
**kwargs **kwargs
): ):
r""" r"""
labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`): labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`):
Labels for computing the masked language modeling loss. Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]`` in ``[0, ..., config.vocab_size]``
next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`): next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring) Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
Indices should be in ``[0, 1]``. Indices should be in ``[0, 1]``:
``0`` indicates sequence B is a continuation of sequence A,
``1`` indicates sequence B is a random sequence. - 0 indicates sequence B is a continuation of sequence A,
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): - 1 indicates sequence B is a random sequence.
Used to hide legacy arguments that have been deprecated. kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Used to hide legacy arguments that have been deprecated.
Returns: Returns:
Examples:: Example::
>>> from transformers import BertTokenizer, BertForPreTraining >>> from transformers import BertTokenizer, BertForPreTraining
>>> import torch >>> import torch
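A hedged, self-contained sketch of how the two pre-training heads are typically exercised; the checkpoint name and input sentence are assumptions rather than part of this diff::

    >>> from transformers import BertTokenizer, BertForPreTraining
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> model = BertForPreTraining.from_pretrained("bert-base-uncased")
    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    >>> outputs = model(**inputs, return_dict=True)
    >>> prediction_logits = outputs.prediction_logits              # masked language modeling head
    >>> seq_relationship_logits = outputs.seq_relationship_logits  # next sentence prediction head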
...@@ -976,7 +990,7 @@ class BertLMHeadModel(BertPreTrainedModel): ...@@ -976,7 +990,7 @@ class BertLMHeadModel(BertPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.cls.predictions.decoder return self.cls.predictions.decoder
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=CausalLMOutput, config_class=_CONFIG_FOR_DOC) @replace_return_docstrings(output_type=CausalLMOutput, config_class=_CONFIG_FOR_DOC)
def forward( def forward(
self, self,
...@@ -994,19 +1008,21 @@ class BertLMHeadModel(BertPreTrainedModel): ...@@ -994,19 +1008,21 @@ class BertLMHeadModel(BertPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
if the model is configured as a decoder. if the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder. is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): - 1 for tokens that are **not masked**,
Labels for computing the left-to-right language modeling loss (next word prediction). - 0 for tokens that are **masked**.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels Labels for computing the left-to-right language modeling loss (next word prediction).
in ``[0, ..., config.vocab_size]`` Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
Returns: Returns:
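A hedged sketch of the causal-LM loss described above, reusing the input IDs as labels; the checkpoint name and sentence are assumptions, and :obj:`is_decoder` is set as this docstring requires::

    >>> from transformers import BertConfig, BertTokenizer, BertLMHeadModel
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> config = BertConfig.from_pretrained("bert-base-uncased", is_decoder=True)
    >>> model = BertLMHeadModel.from_pretrained("bert-base-uncased", config=config)
    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    >>> # every label position is scored; setting a position to -100 would exclude it from the loss
    >>> outputs = model(**inputs, labels=inputs["input_ids"], return_dict=True)
    >>> loss, logits = outputs.loss, outputs.logits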
...@@ -1092,7 +1108,7 @@ class BertForMaskedLM(BertPreTrainedModel): ...@@ -1092,7 +1108,7 @@ class BertForMaskedLM(BertPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.cls.predictions.decoder return self.cls.predictions.decoder
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-uncased", checkpoint="bert-base-uncased",
...@@ -1196,7 +1212,7 @@ class BertForNextSentencePrediction(BertPreTrainedModel): ...@@ -1196,7 +1212,7 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC) @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)
def forward( def forward(
self, self,
...@@ -1212,11 +1228,12 @@ class BertForNextSentencePrediction(BertPreTrainedModel): ...@@ -1212,11 +1228,12 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring) Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair
Indices should be in ``[0, 1]``. (see ``input_ids`` docstring). Indices should be in ``[0, 1]``:
``0`` indicates sequence B is a continuation of sequence A,
``1`` indicates sequence B is a random sequence. - 0 indicates sequence B is a continuation of sequence A,
- 1 indicates sequence B is a random sequence.
Returns: Returns:
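A hedged sketch of the ``next_sentence_label`` convention spelled out above; the sentences and checkpoint name are assumptions, and label 1 marks sequence B as a random sentence::

    >>> import torch
    >>> from transformers import BertTokenizer, BertForNextSentencePrediction
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
    >>> prompt = "The cat sat on the mat."
    >>> unrelated = "Stock markets closed higher today."
    >>> encoding = tokenizer(prompt, unrelated, return_tensors="pt")
    >>> # label 1: sequence B is a random sentence rather than the true continuation
    >>> outputs = model(**encoding, next_sentence_label=torch.LongTensor([1]), return_dict=True)
    >>> loss, logits = outputs.loss, outputs.logits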
...@@ -1287,7 +1304,7 @@ class BertForSequenceClassification(BertPreTrainedModel): ...@@ -1287,7 +1304,7 @@ class BertForSequenceClassification(BertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-uncased", checkpoint="bert-base-uncased",
...@@ -1370,7 +1387,7 @@ class BertForMultipleChoice(BertPreTrainedModel): ...@@ -1370,7 +1387,7 @@ class BertForMultipleChoice(BertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-uncased", checkpoint="bert-base-uncased",
...@@ -1393,8 +1410,8 @@ class BertForMultipleChoice(BertPreTrainedModel): ...@@ -1393,8 +1410,8 @@ class BertForMultipleChoice(BertPreTrainedModel):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss. Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above) of the input tensors. (See :obj:`input_ids` above)
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
...@@ -1460,7 +1477,7 @@ class BertForTokenClassification(BertPreTrainedModel): ...@@ -1460,7 +1477,7 @@ class BertForTokenClassification(BertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-uncased", checkpoint="bert-base-uncased",
...@@ -1545,7 +1562,7 @@ class BertForQuestionAnswering(BertPreTrainedModel): ...@@ -1545,7 +1562,7 @@ class BertForQuestionAnswering(BertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-uncased", checkpoint="bert-base-uncased",
...@@ -1569,11 +1586,11 @@ class BertForQuestionAnswering(BertPreTrainedModel): ...@@ -1569,11 +1586,11 @@ class BertForQuestionAnswering(BertPreTrainedModel):
r""" r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss. Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss. Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......
...@@ -188,7 +188,12 @@ class BertGenerationPreTrainedModel(PreTrainedModel): ...@@ -188,7 +188,12 @@ class BertGenerationPreTrainedModel(PreTrainedModel):
BERT_GENERATION_START_DOCSTRING = r""" BERT_GENERATION_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior. usage and behavior.
...@@ -200,21 +205,23 @@ BERT_GENERATION_START_DOCSTRING = r""" ...@@ -200,21 +205,23 @@ BERT_GENERATION_START_DOCSTRING = r"""
BERT_GENERATION_INPUTS_DOCSTRING = r""" BERT_GENERATION_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`): input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.BertGenerationTokenizer`. Indices can be obtained using :class:`~transformers.BertGenerationTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.__call__` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.encode` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`): position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``. Selected in the range ``[0, config.max_position_embeddings - 1]``.
...@@ -222,18 +229,22 @@ BERT_GENERATION_INPUTS_DOCSTRING = r""" ...@@ -222,18 +229,22 @@ BERT_GENERATION_INPUTS_DOCSTRING = r"""
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
...@@ -246,10 +257,13 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel): ...@@ -246,10 +257,13 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
The model can behave as an encoder (with only self-attention) as well The model can behave as an encoder (with only self-attention) as well
as a decoder, in which case a layer of cross-attention is added between as a decoder, in which case a layer of cross-attention is added between
the self-attention layers, following the architecture described in `Attention is all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, the self-attention layers, following the architecture described in `Attention is all you need
Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
This model should be used when leveraging Bert or Roberta checkpoints for the `EncoderDecoderModel` class as described in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. This model should be used when leveraging Bert or Roberta checkpoints for the
:class:`~transformers.EncoderDecoderModel` class as described in `Leveraging Pre-trained Checkpoints for Sequence
Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn.
To behave as a decoder the model needs to be initialized with the To behave as a decoder the model needs to be initialized with the
:obj:`is_decoder` argument of the configuration set to :obj:`True`. :obj:`is_decoder` argument of the configuration set to :obj:`True`.
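A hedged sketch of the warm-started Seq2Seq setup this paragraph points at; the checkpoint names, the use of BERT's CLS (101) and SEP (102) IDs as BOS/EOS, and the sample texts are assumptions::

    >>> from transformers import (BertGenerationEncoder, BertGenerationDecoder,
    ...                           EncoderDecoderModel, BertTokenizer)
    >>> # reuse a plain BERT checkpoint for both halves of the Seq2Seq model
    >>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
    >>> decoder = BertGenerationDecoder.from_pretrained(
    ...     "bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
    ... )
    >>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
    >>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
    >>> input_ids = tokenizer("This is a long article to summarize", add_special_tokens=False, return_tensors="pt").input_ids
    >>> labels = tokenizer("This is a short summary", return_tensors="pt").input_ids
    >>> outputs = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels, return_dict=True)
    >>> loss = outputs.loss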
...@@ -281,7 +295,7 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel): ...@@ -281,7 +295,7 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
for layer, heads in heads_to_prune.items(): for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads) self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(BERT_GENERATION_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_GENERATION_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="google/bert_for_seq_generation_L-24_bbc_encoder", checkpoint="google/bert_for_seq_generation_L-24_bbc_encoder",
...@@ -410,7 +424,7 @@ class BertGenerationDecoder(BertGenerationPreTrainedModel): ...@@ -410,7 +424,7 @@ class BertGenerationDecoder(BertGenerationPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.lm_head.decoder return self.lm_head.decoder
@add_start_docstrings_to_callable(BERT_GENERATION_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_GENERATION_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=CausalLMOutput, config_class=_CONFIG_FOR_DOC) @replace_return_docstrings(output_type=CausalLMOutput, config_class=_CONFIG_FOR_DOC)
def forward( def forward(
self, self,
...@@ -427,19 +441,21 @@ class BertGenerationDecoder(BertGenerationPreTrainedModel): ...@@ -427,19 +441,21 @@ class BertGenerationDecoder(BertGenerationPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
if the model is configured as a decoder. if the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder. is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): - 1 for tokens that are **not masked**,
Labels for computing the left-to-right language modeling loss (next word prediction). - 0 for tokens that are **masked**.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels Labels for computing the left-to-right language modeling loss (next word prediction).
in ``[0, ..., config.vocab_size]`` Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with
labels in ``[0, ..., config.vocab_size]``
Returns: Returns:
......
...@@ -42,7 +42,11 @@ CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ ...@@ -42,7 +42,11 @@ CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
CAMEMBERT_START_DOCSTRING = r""" CAMEMBERT_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior. usage and behavior.
......
...@@ -233,7 +233,12 @@ class CTRLPreTrainedModel(PreTrainedModel): ...@@ -233,7 +233,12 @@ class CTRLPreTrainedModel(PreTrainedModel):
CTRL_START_DOCSTRING = r""" CTRL_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior. usage and behavior.
...@@ -245,33 +250,38 @@ CTRL_START_DOCSTRING = r""" ...@@ -245,33 +250,38 @@ CTRL_START_DOCSTRING = r"""
CTRL_INPUTS_DOCSTRING = r""" CTRL_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`): input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
:obj:`input_ids_length` = ``sequence_length`` if ``past_key_values`` is ``None`` else :obj:`input_ids_length` = ``sequence_length`` if ``past_key_values`` is ``None`` else
``past_key_values[0].shape[-2]`` (``sequence_length`` of input past key value states). ``past_key_values[0].shape[-2]`` (``sequence_length`` of input past key value states).
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
If ``past_key_values`` is used, only input_ids that do not have their past calculated should be passed as If ``past_key_values`` is used, only input IDs that do not have their past calculated should be passed as
``input_ids``. ``input_ids``.
Indices can be obtained using :class:`transformers.CTRLTokenizer`. Indices can be obtained using :class:`~transformers.CTRLTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.__call__` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.encode` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
past_key_values (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`): past_key_values (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
(see ``past_key_values`` output below). Can be used to speed up sequential decoding. (see ``past_key_values`` output below). Can be used to speed up sequential decoding.
The ``input_ids`` which have their past given to this model should not be passed as input ids as they have already been computed. The ``input_ids`` which have their past given to this model should not be passed as input ids as they have
already been computed.
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_ `What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
...@@ -282,21 +292,25 @@ CTRL_INPUTS_DOCSTRING = r""" ...@@ -282,21 +292,25 @@ CTRL_INPUTS_DOCSTRING = r"""
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
This is useful if you want more control over how to convert `input_ids` indices into associated vectors Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
than the model's internal embedding lookup matrix. This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
If ``past_key_values`` is used, optionally only the last `inputs_embeds` have to be input (see ``past_key_values``). vectors than the model's internal embedding lookup matrix.
use_cache (:obj:`bool`): use_cache (:obj:`bool`, `optional`):
If `use_cache` is True, ``past_key_values`` key value states are returned and If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
can be used to speed up decoding (see ``past_key_values``). Defaults to `True`. decoding (see ``past_key_values``).
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
......
...@@ -372,7 +372,11 @@ class DistilBertPreTrainedModel(PreTrainedModel): ...@@ -372,7 +372,11 @@ class DistilBertPreTrainedModel(PreTrainedModel):
DISTILBERT_START_DOCSTRING = r""" DISTILBERT_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior. usage and behavior.
...@@ -384,35 +388,41 @@ DISTILBERT_START_DOCSTRING = r""" ...@@ -384,35 +388,41 @@ DISTILBERT_START_DOCSTRING = r"""
DISTILBERT_INPUTS_DOCSTRING = r""" DISTILBERT_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.DistilBertTokenizer`. Indices can be obtained using :class:`~transformers.DistilBertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
...@@ -443,7 +453,7 @@ class DistilBertModel(DistilBertPreTrainedModel): ...@@ -443,7 +453,7 @@ class DistilBertModel(DistilBertPreTrainedModel):
for layer, heads in heads_to_prune.items(): for layer, heads in heads_to_prune.items():
self.transformer.layer[layer].attention.prune_heads(heads) self.transformer.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="distilbert-base-uncased", checkpoint="distilbert-base-uncased",
...@@ -516,7 +526,7 @@ class DistilBertForMaskedLM(DistilBertPreTrainedModel): ...@@ -516,7 +526,7 @@ class DistilBertForMaskedLM(DistilBertPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.vocab_projector return self.vocab_projector
@add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="distilbert-base-uncased", checkpoint="distilbert-base-uncased",
...@@ -539,8 +549,8 @@ class DistilBertForMaskedLM(DistilBertPreTrainedModel): ...@@ -539,8 +549,8 @@ class DistilBertForMaskedLM(DistilBertPreTrainedModel):
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss. Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with
in ``[0, ..., config.vocab_size]`` labels in ``[0, ..., config.vocab_size]``.
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Used to hide legacy arguments that have been deprecated. Used to hide legacy arguments that have been deprecated.
""" """
...@@ -601,7 +611,7 @@ class DistilBertForSequenceClassification(DistilBertPreTrainedModel): ...@@ -601,7 +611,7 @@ class DistilBertForSequenceClassification(DistilBertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="distilbert-base-uncased", checkpoint="distilbert-base-uncased",
...@@ -681,7 +691,7 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel): ...@@ -681,7 +691,7 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="distilbert-base-uncased", checkpoint="distilbert-base-uncased",
...@@ -703,11 +713,11 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel): ...@@ -703,11 +713,11 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
r""" r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss. Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss. Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...@@ -857,7 +867,7 @@ class DistilBertForMultipleChoice(DistilBertPreTrainedModel): ...@@ -857,7 +867,7 @@ class DistilBertForMultipleChoice(DistilBertPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)")) @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
@replace_return_docstrings(output_type=MultipleChoiceModelOutput, config_class=_CONFIG_FOR_DOC) @replace_return_docstrings(output_type=MultipleChoiceModelOutput, config_class=_CONFIG_FOR_DOC)
def forward( def forward(
self, self,
...@@ -871,10 +881,10 @@ class DistilBertForMultipleChoice(DistilBertPreTrainedModel): ...@@ -871,10 +881,10 @@ class DistilBertForMultipleChoice(DistilBertPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss. Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above) of the input tensors. (See :obj:`input_ids` above)
Returns: Returns:
......
...@@ -315,7 +315,11 @@ class DPRPretrainedReader(PreTrainedModel): ...@@ -315,7 +315,11 @@ class DPRPretrainedReader(PreTrainedModel):
DPR_START_DOCSTRING = r""" DPR_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior. usage and behavior.
...@@ -327,9 +331,9 @@ DPR_START_DOCSTRING = r""" ...@@ -327,9 +331,9 @@ DPR_START_DOCSTRING = r"""
DPR_ENCODERS_INPUTS_DOCSTRING = r""" DPR_ENCODERS_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids: (:obj:``torch.LongTensor`` of shape ``(batch_size, sequence_length)``): input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
To match pre-training, DPR input sequence should be formatted with [CLS] and [SEP] tokens as follows: To match pretraining, DPR input sequence should be formatted with [CLS] and [SEP] tokens as follows:
(a) For sequence pairs (for a pair title+text for example): (a) For sequence pairs (for a pair title+text for example):
...@@ -346,57 +350,74 @@ DPR_ENCODERS_INPUTS_DOCSTRING = r""" ...@@ -346,57 +350,74 @@ DPR_ENCODERS_INPUTS_DOCSTRING = r"""
DPR is a model with absolute position embeddings so it's usually advised to pad the inputs on DPR is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left. the right rather than the left.
Indices can be obtained using :class:`transformers.DPRTokenizer`. Indices can be obtained using :class:`~transformers.DPRTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
attention_mask: (:obj:``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
token_type_ids: (:obj:``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`): - 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states tensors of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
DPR_READER_INPUTS_DOCSTRING = r""" DPR_READER_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids: (:obj:``torch.LongTensor`` of shape ``(n_passages, sequence_length)``): input_ids (:obj:`Tuple[torch.LongTensor]` of shapes :obj:`(n_passages, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
It has to be a sequence triplet with 1) the question, 2) the passage titles and 3) the passage texts. It has to be a sequence triplet with 1) the question, 2) the passage titles and 3) the passage texts.
To match pre-training, DPR `input_ids` sequence should be formatted with [CLS] and [SEP] with the format: To match pretraining, DPR :obj:`input_ids` sequence should be formatted with [CLS] and [SEP] with the
format:
[CLS] <question token ids> [SEP] <titles ids> [SEP] <texts ids> ``[CLS] <question token ids> [SEP] <titles ids> [SEP] <texts ids>``
DPR is a model with absolute position embeddings so it's usually advised to pad the inputs on DPR is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left. the right rather than the left.
Indices can be obtained using :class:`transformers.DPRReaderTokenizer`. Indices can be obtained using :class:`~transformers.DPRReaderTokenizer`. See this class documentation for
See :class:`transformers.DPRReaderTokenizer` for more details more details.
attention_mask: (:obj:torch.FloatTensor``, of shape ``(n_passages, sequence_length)``, `optional`: attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(n_passages, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(n_passages, sequence_length, hidden_size)`, `optional`): inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(n_passages, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states tensors of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
......
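A minimal sketch (not in the patch) of how the reader's question/title/text triplet described above is built by the tokenizer; the checkpoint name and strings are placeholders::

    from transformers import DPRReader, DPRReaderTokenizer

    tokenizer = DPRReaderTokenizer.from_pretrained("facebook/dpr-reader-single-nq-base")
    model = DPRReader.from_pretrained("facebook/dpr-reader-single-nq-base")

    # The tokenizer assembles [CLS] <question> [SEP] <title> [SEP] <text> for each passage.
    encoded_inputs = tokenizer(
        questions=["Where is the Eiffel Tower?"],
        titles=["Eiffel Tower"],
        texts=["The Eiffel Tower is located in Paris."],
        return_tensors="pt",
    )
    outputs = model(**encoded_inputs, return_dict=True)
    start_logits, end_logits, relevance_logits = outputs.start_logits, outputs.end_logits, outputs.relevance_logits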
...@@ -218,7 +218,12 @@ class ElectraForPreTrainingOutput(ModelOutput): ...@@ -218,7 +218,12 @@ class ElectraForPreTrainingOutput(ModelOutput):
ELECTRA_START_DOCSTRING = r""" ELECTRA_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior. usage and behavior.
...@@ -230,27 +235,31 @@ ELECTRA_START_DOCSTRING = r""" ...@@ -230,27 +235,31 @@ ELECTRA_START_DOCSTRING = r"""
ELECTRA_INPUTS_DOCSTRING = r""" ELECTRA_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.ElectraTokenizer`. Indices can be obtained using :class:`~transformers.ElectraTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_ `What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``. Selected in the range ``[0, config.max_position_embeddings - 1]``.
...@@ -258,26 +267,33 @@ ELECTRA_INPUTS_DOCSTRING = r""" ...@@ -258,26 +267,33 @@ ELECTRA_INPUTS_DOCSTRING = r"""
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
if the model is configured as a decoder. if the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder. is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
...@@ -318,7 +334,7 @@ class ElectraModel(ElectraPreTrainedModel): ...@@ -318,7 +334,7 @@ class ElectraModel(ElectraPreTrainedModel):
for layer, heads in heads_to_prune.items(): for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads) self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="google/electra-small-discriminator", checkpoint="google/electra-small-discriminator",
...@@ -414,7 +430,7 @@ class ElectraForSequenceClassification(ElectraPreTrainedModel): ...@@ -414,7 +430,7 @@ class ElectraForSequenceClassification(ElectraPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="google/electra-small-discriminator", checkpoint="google/electra-small-discriminator",
...@@ -496,7 +512,7 @@ class ElectraForPreTraining(ElectraPreTrainedModel): ...@@ -496,7 +512,7 @@ class ElectraForPreTraining(ElectraPreTrainedModel):
self.discriminator_predictions = ElectraDiscriminatorPredictions(config) self.discriminator_predictions = ElectraDiscriminatorPredictions(config)
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=ElectraForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) @replace_return_docstrings(output_type=ElectraForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
def forward( def forward(
self, self,
...@@ -512,11 +528,12 @@ class ElectraForPreTraining(ElectraPreTrainedModel): ...@@ -512,11 +528,12 @@ class ElectraForPreTraining(ElectraPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`): labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring) Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)
Indices should be in ``[0, 1]``. Indices should be in ``[0, 1]``:
``0`` indicates the token is an original token,
``1`` indicates the token was replaced. - 0 indicates the token is an original token,
- 1 indicates the token was replaced.
Returns: Returns:
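A minimal sketch (not in the patch) of the 0/1 replaced-token labels described above; in practice the labels come from the generator, here they are hand-built placeholders::

    import torch
    from transformers import ElectraTokenizer, ElectraForPreTraining

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

    input_ids = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")["input_ids"]
    # 0 marks an original token, 1 marks a replaced token (none in this toy example).
    labels = torch.zeros_like(input_ids)

    outputs = model(input_ids, labels=labels, return_dict=True)
    loss, logits = outputs.loss, outputs.logits  # logits: one replaced/original score per token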
...@@ -592,7 +609,7 @@ class ElectraForMaskedLM(ElectraPreTrainedModel): ...@@ -592,7 +609,7 @@ class ElectraForMaskedLM(ElectraPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.generator_lm_head return self.generator_lm_head
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="google/electra-small-discriminator", checkpoint="google/electra-small-discriminator",
...@@ -681,7 +698,7 @@ class ElectraForTokenClassification(ElectraPreTrainedModel): ...@@ -681,7 +698,7 @@ class ElectraForTokenClassification(ElectraPreTrainedModel):
self.classifier = nn.Linear(config.hidden_size, config.num_labels) self.classifier = nn.Linear(config.hidden_size, config.num_labels)
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="google/electra-small-discriminator", checkpoint="google/electra-small-discriminator",
...@@ -767,7 +784,7 @@ class ElectraForQuestionAnswering(ElectraPreTrainedModel): ...@@ -767,7 +784,7 @@ class ElectraForQuestionAnswering(ElectraPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="google/electra-small-discriminator", checkpoint="google/electra-small-discriminator",
...@@ -791,11 +808,11 @@ class ElectraForQuestionAnswering(ElectraPreTrainedModel): ...@@ -791,11 +808,11 @@ class ElectraForQuestionAnswering(ElectraPreTrainedModel):
r""" r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss. Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss. Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...@@ -866,7 +883,7 @@ class ElectraForMultipleChoice(ElectraPreTrainedModel): ...@@ -866,7 +883,7 @@ class ElectraForMultipleChoice(ElectraPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)")) @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="google/electra-small-discriminator", checkpoint="google/electra-small-discriminator",
...@@ -889,8 +906,8 @@ class ElectraForMultipleChoice(ElectraPreTrainedModel): ...@@ -889,8 +906,8 @@ class ElectraForMultipleChoice(ElectraPreTrainedModel):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss. Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above) of the input tensors. (See :obj:`input_ids` above)
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
......
...@@ -30,16 +30,28 @@ logger = logging.get_logger(__name__) ...@@ -30,16 +30,28 @@ logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "EncoderDecoderConfig" _CONFIG_FOR_DOC = "EncoderDecoderConfig"
ENCODER_DECODER_START_DOCSTRING = r""" ENCODER_DECODER_START_DOCSTRING = r"""
This class can be used to inialize a sequence-to-sequnece model with any pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder. The encoder is loaded via :meth:`~transformers.AutoModel.from_pretrained` function and the decoder is loaded via :meth:`~transformers.AutoModelForCausalLM.from_pretrained` function. This class can be used to initialize a sequence-to-sequence model with any pretrained autoencoding model as the
Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, *i.e.* summarization. encoder and any pretrained autoregressive model as the decoder. The encoder is loaded via
the :meth:`~transformers.AutoModel.from_pretrained` function and the decoder is loaded via
the :meth:`~transformers.AutoModelForCausalLM.from_pretrained` function.
Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative
task, like summarization.
The effectiveness of initializing sequence-to-sequence models with pre-trained checkpoints for sequence generation tasks was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. tasks was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
<https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
After such an Encoder Decoder model has been trained / fine-tuned, it can be saved / loaded just like any other models (see Examples for more information). After such an Encoder Decoder model has been trained/fine-tuned, it can be saved/loaded just like any other model
(see the examples for more information).
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#module>`__ sub-class. Use it as a This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior.
Parameters: Parameters:
config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model. config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
...@@ -50,38 +62,47 @@ ENCODER_DECODER_START_DOCSTRING = r""" ...@@ -50,38 +62,47 @@ ENCODER_DECODER_START_DOCSTRING = r"""
ENCODER_DECODER_INPUTS_DOCSTRING = r""" ENCODER_DECODER_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary for the encoder. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.PretrainedTokenizer`.
See :meth:`~transformers.PreTrainedTokenizer.encode` and Indices can be obtained using :class:`~transformers.PreTrainedTokenizer`.
:meth:`~transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert :obj:`input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices for the encoder. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
encoder_outputs (:obj:`tuple(torch.FloatTensor)`, `optional`): encoder_outputs (:obj:`tuple(torch.FloatTensor)`, `optional`):
This tuple must consist of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`: :obj:`attentions`) This tuple must consist of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`: :obj:`attentions`)
`last_hidden_state` (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`) is a tensor of hidden-states at the output of the last layer of the encoder. :obj:`last_hidden_state` (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`)
is a tensor of hidden-states at the output of the last layer of the encoder.
Used in the cross-attention of the decoder. Used in the cross-attention of the decoder.
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for sequence to sequence training to the decoder. Provide for sequence to sequence training to the decoder.
Indices can be obtained using :class:`transformers.PretrainedTokenizer`. Indices can be obtained using :class:`~transformers.PreTrainedTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default.
decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`): decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded
This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors representation. This is useful if you want more control over how to convert :obj:`decoder_input_ids`
than the model's internal embedding lookup matrix. indices into associated vectors than the model's internal embedding lookup matrix.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss for the decoder. Labels for computing the masked language modeling loss for the decoder.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with
in ``[0, ..., config.vocab_size]`` labels in ``[0, ..., config.vocab_size]``
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.Seq2SeqLMOutput` instead of a If set to ``True``, the model will return a :class:`~transformers.file_utils.Seq2SeqLMOutput` instead of a
plain tuple. plain tuple.
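A minimal training-style sketch (not in the patch) wiring together the ``input_ids``, ``decoder_input_ids`` and ``labels`` arguments described above; the BERT checkpoints and sentences are placeholders::

    from transformers import BertTokenizer, EncoderDecoderModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

    inputs = tokenizer("This is a rather long article that should be summarized.", return_tensors="pt")
    targets = tokenizer("A short summary.", return_tensors="pt")

    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=targets["input_ids"],
        labels=targets["input_ids"],
        return_dict=True,
    )
    loss = outputs.loss  # masked LM loss computed on the decoder side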
...@@ -97,8 +118,8 @@ class EncoderDecoderModel(PreTrainedModel): ...@@ -97,8 +118,8 @@ class EncoderDecoderModel(PreTrainedModel):
:class:`~transformers.EncoderDecoder` is a generic model class that will be :class:`~transformers.EncoderDecoder` is a generic model class that will be
instantiated as a transformer architecture with one of the base model instantiated as a transformer architecture with one of the base model
classes of the library as encoder and another one as classes of the library as encoder and another one as
decoder when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)` decoder when created with the :meth:`~transformers.AutoModel.from_pretrained` class method for the encoder and
class method for the encoder and `AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path)` class method for the decoder. :meth:`~transformers.AutoModelForCausalLM.from_pretrained` class method for the decoder.
""" """
config_class = EncoderDecoderConfig config_class = EncoderDecoderConfig
base_model_prefix = "encoder_decoder" base_model_prefix = "encoder_decoder"
...@@ -169,40 +190,57 @@ class EncoderDecoderModel(PreTrainedModel): ...@@ -169,40 +190,57 @@ class EncoderDecoderModel(PreTrainedModel):
*model_args, *model_args,
**kwargs **kwargs
) -> PreTrainedModel: ) -> PreTrainedModel:
r"""Instantiates an encoder and a decoder from one or two base classes of the library from pre-trained model checkpoints. r"""
Instantiate an encoder and a decoder from one or two base classes of the library from pretrained model
checkpoints.
The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated). The model is set in evaluation mode by default using :obj:`model.eval()` (Dropout modules are deactivated).
To train the model, you need to first set it back in training mode with `model.train()`. To train the model, you need to first set it back in training mode with :obj:`model.train()`.
Params: Params:
encoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`): encoder_pretrained_model_name_or_path (:obj:`str`, `optional`):
information necessary to initiate the encoder. Either: Information necessary to initiate the encoder. Can be either:
- a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``. - A string with the `shortcut name` of a pretrained model to load from cache or download, e.g.,
- a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``. ``bert-base-uncased``.
- a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/encoder``. - A string with the `identifier name` of a pretrained model that was user-uploaded to our S3, e.g.,
- a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. ``dbmdz/bert-base-german-cased``.
- A path to a `directory` containing model weights saved using
:func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
- A path or url to a `tensorflow index checkpoint file` (e.g., ``./tf_model/model.ckpt.index``). In
this case, ``from_tf`` should be set to :obj:`True` and a configuration object should be provided
as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in
a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
decoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`): decoder_pretrained_model_name_or_path (:obj:`str`, `optional`):
information necessary to initiate the decoder. Either: Information necessary to initiate the decoder. Can be either:
- a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``. - A string with the `shortcut name` of a pretrained model to load from cache or download, e.g.,
- a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``. ``bert-base-uncased``.
- a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/decoder``. - A string with the `identifier name` of a pretrained model that was user-uploaded to our S3, e.g.,
- a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. ``dbmdz/bert-base-german-cased``.
- A path to a `directory` containing model weights saved using
:func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
- A path or url to a `tensorflow index checkpoint file` (e.g., ``./tf_model/model.ckpt.index``). In
this case, ``from_tf`` should be set to :obj:`True` and a configuration object should be provided
as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in
a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
model_args: (`optional`) Sequence of positional arguments: model_args (remaining positional arguments, `optional`):
All remaning positional arguments will be passed to the underlying model's ``__init__`` method All remaining positional arguments will be passed to the underlying model's ``__init__`` method.
kwargs: (`optional`) Remaining dictionary of keyword arguments. kwargs (remaining dictionary of keyword arguments, `optional`):
Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attentions=True``). Can be used to update the configuration object (after it has been loaded) and initiate the model (e.g.,
- To update the encoder configuration, use the prefix `encoder_` for each configuration parameter :obj:`output_attentions=True`).
- To update the decoder configuration, use the prefix `decoder_` for each configuration parameter
- To update the parent model configuration, do not use a prefix for each configuration parameter
Behave differently depending on whether a :obj:`config` is provided or automatically loaded.
Examples:: - To update the encoder configuration, use the prefix `encoder_` for each configuration parameter.
- To update the decoder configuration, use the prefix `decoder_` for each configuration parameter.
- To update the parent model configuration, do not use a prefix for each configuration parameter.
Behaves differently depending on whether a :obj:`config` is provided or automatically loaded.
Example::
>>> from transformers import EncoderDecoderModel >>> from transformers import EncoderDecoderModel
>>> # initialize a bert2bert from two pretrained BERT models. Note that the cross-attention layers will be randomly initialized >>> # initialize a bert2bert from two pretrained BERT models. Note that the cross-attention layers will be randomly initialized
......
...@@ -52,7 +52,11 @@ FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ ...@@ -52,7 +52,11 @@ FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
FLAUBERT_START_DOCSTRING = r""" FLAUBERT_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior. usage and behavior.
...@@ -67,21 +71,25 @@ FLAUBERT_INPUTS_DOCSTRING = r""" ...@@ -67,21 +71,25 @@ FLAUBERT_INPUTS_DOCSTRING = r"""
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.BertTokenizer`. Indices can be obtained using :class:`~transformers.FlaubertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_ `What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
...@@ -91,28 +99,32 @@ FLAUBERT_INPUTS_DOCSTRING = r""" ...@@ -91,28 +99,32 @@ FLAUBERT_INPUTS_DOCSTRING = r"""
`What are position IDs? <../glossary.html#position-ids>`_ `What are position IDs? <../glossary.html#position-ids>`_
lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Length of each sentence that can be used to avoid performing attention on padding token indices. Length of each sentence that can be used to avoid performing attention on padding token indices.
You can also use `attention_mask` for the same result (see above), kept here for compatbility. You can also use :obj:`attention_mask` for the same result (see above), kept here for compatibility.
Indices selected in ``[0, ..., input_ids.size(-1)]``: Indices selected in ``[0, ..., input_ids.size(-1)]``:
cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`): cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`):
dictionary with ``torch.FloatTensor`` that contains pre-computed Dictionary mapping strings to ``torch.FloatTensor`` that contains precomputed
hidden-states (key and values in the attention blocks) as computed by the model hidden-states (key and values in the attention blocks) as computed by the model
(see `cache` output below). Can be used to speed up sequential decoding. (see :obj:`cache` output below). Can be used to speed up sequential decoding.
The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states. The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
...@@ -372,8 +384,8 @@ class FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple): ...@@ -372,8 +384,8 @@ class FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple):
@add_start_docstrings( @add_start_docstrings(
"""Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of """Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like
the hidden-states output to compute `span start logits` and `span end logits`). """, SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
FLAUBERT_START_DOCSTRING, FLAUBERT_START_DOCSTRING,
) )
class FlaubertForQuestionAnswering(XLMForQuestionAnswering): class FlaubertForQuestionAnswering(XLMForQuestionAnswering):
......
...@@ -176,8 +176,13 @@ PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py facebook/w ...@@ -176,8 +176,13 @@ PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py facebook/w
FSMT_START_DOCSTRING = r""" FSMT_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use it as a regular PyTorch Module and This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
refer to the PyTorch documentation for all matters related to general usage and behavior. methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior.
Parameters: Parameters:
config (:class:`~transformers.FSMTConfig`): Model configuration class with all the parameters of the model. config (:class:`~transformers.FSMTConfig`): Model configuration class with all the parameters of the model.
...@@ -207,39 +212,52 @@ FSMT_GENERATION_EXAMPLE = r""" ...@@ -207,39 +212,52 @@ FSMT_GENERATION_EXAMPLE = r"""
FSMT_INPUTS_DOCSTRING = r""" FSMT_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Use FSMTTokenizer.encode to produce them. Indices of input sequence tokens in the vocabulary.
Padding will be ignored by default should you provide it.
Indices can be obtained using :class:`transformers.FSMTTokenizer.encode(text)`. Indices can be obtained using :class:`~transformers.FSMTTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices in input_ids. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`): - 1 for tokens that are **not masked**,
Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`) - 0 for tokens that are **masked**.
`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`) is a sequence of hidden-states at the output of the last layer of the encoder.
Used in the cross-attention of the decoder. `What are attention masks? <../glossary.html#attention-mask>`__
encoder_outputs (:obj:`Tuple(torch.FloatTensor)`, `optional`):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`: :obj:`attentions`)
:obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of
hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper. Provide for translation and summarization training. By default, the model will create this tensor by
shifting the input_ids right, following the paper.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
If you want to change padding behavior, you should read :func:`~transformers.modeling_fairseqtranslator._prepare_decoder_inputs` and modify. also be used by default.
If you want to change padding behavior, you should read
:func:`modeling_fsmt._prepare_fsmt_decoder_inputs` and modify.
See diagram 1 in the paper for more info on the default strategy See diagram 1 in the paper for more info on the default strategy
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): past_key_values (:obj:`Tuple(torch.FloatTensor)` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains pre-computed key and value hidden-states of the attention blocks. Contains precomputed key and value hidden-states of the attention blocks.
Can be used to speed up decoding. Can be used to speed up decoding.
If ``past_key_values`` are used, the user can optionally input only the last If :obj:`past_key_values` are used, the user can optionally input only the last
``decoder_input_ids`` (those that don't have their past key value states given to this model) of shape :obj:`decoder_input_ids` (those that don't have their past key value states given to this model) of shape
:obj:`(batch_size, 1)` instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`. :obj:`(batch_size, 1)` instead of all :obj:`decoder_input_ids` of shape
:obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`): use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
If `use_cache` is True, ``past_key_values`` are returned and can be used to speed up decoding (see If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
``past_key_values``). decoding (see ``past_key_values``).
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
......
...@@ -826,11 +826,17 @@ class FunnelForPreTrainingOutput(ModelOutput): ...@@ -826,11 +826,17 @@ class FunnelForPreTrainingOutput(ModelOutput):
attentions: Optional[Tuple[torch.FloatTensor]] = None attentions: Optional[Tuple[torch.FloatTensor]] = None
FUNNEL_START_DOCSTRING = r""" The Funnel Transformer model was proposed in FUNNEL_START_DOCSTRING = r"""
The Funnel Transformer model was proposed in
`Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing `Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
<https://arxiv.org/abs/2006.03236>`__ by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. <https://arxiv.org/abs/2006.03236>`__ by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior. usage and behavior.
...@@ -841,38 +847,41 @@ FUNNEL_START_DOCSTRING = r""" The Funnel Transformer model was proposed in ...@@ -841,38 +847,41 @@ FUNNEL_START_DOCSTRING = r""" The Funnel Transformer model was proposed in
""" """
FUNNEL_INPUTS_DOCSTRING = r""" FUNNEL_INPUTS_DOCSTRING = r"""
Inputs: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`): input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.FunnelTokenizer`. Indices can be obtained using :class:`~transformers.BertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`): token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_ `What are token type IDs? <../glossary.html#token-type-ids>`_
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`): inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`): tensors for more detail.
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. output_hidden_states (:obj:`bool`, `optional`):
return_dict (:obj:`bool`, `optional`, defaults to :obj:`None`): Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a more detail.
plain tuple. return_dict (:obj:`bool`, `optional`):
Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
""" """
...@@ -896,7 +905,7 @@ class FunnelBaseModel(FunnelPreTrainedModel): ...@@ -896,7 +905,7 @@ class FunnelBaseModel(FunnelPreTrainedModel):
def set_input_embeddings(self, new_embeddings): def set_input_embeddings(self, new_embeddings):
self.embeddings.word_embeddings = new_embeddings self.embeddings.word_embeddings = new_embeddings
@add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="funnel-transformer/small-base", checkpoint="funnel-transformer/small-base",
...@@ -973,7 +982,7 @@ class FunnelModel(FunnelPreTrainedModel): ...@@ -973,7 +982,7 @@ class FunnelModel(FunnelPreTrainedModel):
def set_input_embeddings(self, new_embeddings): def set_input_embeddings(self, new_embeddings):
self.embeddings.word_embeddings = new_embeddings self.embeddings.word_embeddings = new_embeddings
@add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="funnel-transformer/small", checkpoint="funnel-transformer/small",
...@@ -1071,7 +1080,7 @@ class FunnelForPreTraining(FunnelPreTrainedModel): ...@@ -1071,7 +1080,7 @@ class FunnelForPreTraining(FunnelPreTrainedModel):
self.discriminator_predictions = FunnelDiscriminatorPredictions(config) self.discriminator_predictions = FunnelDiscriminatorPredictions(config)
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=FunnelForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) @replace_return_docstrings(output_type=FunnelForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
def forward( def forward(
self, self,
...@@ -1085,11 +1094,12 @@ class FunnelForPreTraining(FunnelPreTrainedModel): ...@@ -1085,11 +1094,12 @@ class FunnelForPreTraining(FunnelPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`): labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
Labels for computing the ELECTRA-style loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring) Labels for computing the ELECTRA-style loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)
Indices should be in ``[0, 1]``. Indices should be in ``[0, 1]``:
``0`` indicates the token is an original token,
``1`` indicates the token was replaced. - 0 indicates the token is an original token,
- 1 indicates the token was replaced.
Returns: Returns:
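A minimal sketch of the ELECTRA-style labels described above, assuming every input token is labelled as original (0); the checkpoint and sentence are placeholders:

import torch
from transformers import FunnelTokenizer, FunnelForPreTraining

tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/small")
model = FunnelForPreTraining.from_pretrained("funnel-transformer/small")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.zeros_like(inputs["input_ids"])  # 0 = original token, 1 = replaced token
outputs = model(**inputs, labels=labels, return_dict=True)
print(outputs.loss, outputs.logits.shape)  # per-token original/replaced logits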
...@@ -1155,7 +1165,7 @@ class FunnelForMaskedLM(FunnelPreTrainedModel): ...@@ -1155,7 +1165,7 @@ class FunnelForMaskedLM(FunnelPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.lm_head return self.lm_head
@add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="funnel-transformer/small", checkpoint="funnel-transformer/small",
...@@ -1174,7 +1184,7 @@ class FunnelForMaskedLM(FunnelPreTrainedModel): ...@@ -1174,7 +1184,7 @@ class FunnelForMaskedLM(FunnelPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss. Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
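For instance, a hedged sketch that scores a single masked position and ignores every other token by setting its label to -100 (the checkpoint and target word are illustrative):

import torch
from transformers import FunnelTokenizer, FunnelForMaskedLM

tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/small")
model = FunnelForMaskedLM.from_pretrained("funnel-transformer/small")

inputs = tokenizer(f"The capital of France is {tokenizer.mask_token}.", return_tensors="pt")
labels = torch.full_like(inputs["input_ids"], -100)       # -100 everywhere: those positions are ignored
mask_positions = inputs["input_ids"] == tokenizer.mask_token_id
labels[mask_positions] = tokenizer.convert_tokens_to_ids("paris")
outputs = model(**inputs, labels=labels, return_dict=True)
print(outputs.loss, outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)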
...@@ -1226,7 +1236,7 @@ class FunnelForSequenceClassification(FunnelPreTrainedModel): ...@@ -1226,7 +1236,7 @@ class FunnelForSequenceClassification(FunnelPreTrainedModel):
self.classifier = FunnelClassificationHead(config, config.num_labels) self.classifier = FunnelClassificationHead(config, config.num_labels)
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="funnel-transformer/small-base", checkpoint="funnel-transformer/small-base",
...@@ -1245,7 +1255,7 @@ class FunnelForSequenceClassification(FunnelPreTrainedModel): ...@@ -1245,7 +1255,7 @@ class FunnelForSequenceClassification(FunnelPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss. Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[0, ..., config.num_labels - 1]`. Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
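A hedged usage sketch for these labels (binary classification assumed; the checkpoint and label value are placeholders):

import torch
from transformers import FunnelTokenizer, FunnelForSequenceClassification

tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/small-base")
model = FunnelForSequenceClassification.from_pretrained("funnel-transformer/small-base", num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])  # one class index per example, in [0, config.num_labels - 1]
outputs = model(**inputs, labels=labels, return_dict=True)
print(outputs.loss, outputs.logits.shape)  # logits: (batch_size, num_labels)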
...@@ -1302,7 +1312,7 @@ class FunnelForMultipleChoice(FunnelPreTrainedModel): ...@@ -1302,7 +1312,7 @@ class FunnelForMultipleChoice(FunnelPreTrainedModel):
self.classifier = FunnelClassificationHead(config, 1) self.classifier = FunnelClassificationHead(config, 1)
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)")) @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="funnel-transformer/small-base", checkpoint="funnel-transformer/small-base",
...@@ -1321,10 +1331,10 @@ class FunnelForMultipleChoice(FunnelPreTrainedModel): ...@@ -1321,10 +1331,10 @@ class FunnelForMultipleChoice(FunnelPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss. Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above) of the input tensors. (See :obj:`input_ids` above)
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
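A hedged sketch of the expected (batch_size, num_choices, sequence_length) layout and of these labels (prompt, choices and checkpoint are illustrative):

import torch
from transformers import FunnelTokenizer, FunnelForMultipleChoice

tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/small-base")
model = FunnelForMultipleChoice.from_pretrained("funnel-transformer/small-base")

prompt = "The sky is"
choices = ["blue.", "made of cheese."]
encoding = tokenizer([prompt, prompt], choices, padding=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}  # add the num_choices dimension: (1, 2, seq_len)
labels = torch.tensor([0])  # index of the correct choice for each example
outputs = model(**inputs, labels=labels, return_dict=True)
print(outputs.loss, outputs.logits.shape)  # logits: (batch_size, num_choices)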
...@@ -1386,7 +1396,7 @@ class FunnelForTokenClassification(FunnelPreTrainedModel): ...@@ -1386,7 +1396,7 @@ class FunnelForTokenClassification(FunnelPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="funnel-transformer/small", checkpoint="funnel-transformer/small",
...@@ -1405,7 +1415,7 @@ class FunnelForTokenClassification(FunnelPreTrainedModel): ...@@ -1405,7 +1415,7 @@ class FunnelForTokenClassification(FunnelPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the token classification loss. Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``. Indices should be in ``[0, ..., config.num_labels - 1]``.
""" """
...@@ -1466,7 +1476,7 @@ class FunnelForQuestionAnswering(FunnelPreTrainedModel): ...@@ -1466,7 +1476,7 @@ class FunnelForQuestionAnswering(FunnelPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="funnel-transformer/small", checkpoint="funnel-transformer/small",
...@@ -1486,13 +1496,13 @@ class FunnelForQuestionAnswering(FunnelPreTrainedModel): ...@@ -1486,13 +1496,13 @@ class FunnelForQuestionAnswering(FunnelPreTrainedModel):
return_dict=None, return_dict=None,
): ):
r""" r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`): start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss. Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`): end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss. Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......
...@@ -391,7 +391,11 @@ class GPT2DoubleHeadsModelOutput(ModelOutput): ...@@ -391,7 +391,11 @@ class GPT2DoubleHeadsModelOutput(ModelOutput):
GPT2_START_DOCSTRING = r""" GPT2_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior. usage and behavior.
...@@ -411,27 +415,31 @@ GPT2_INPUTS_DOCSTRING = r""" ...@@ -411,27 +415,31 @@ GPT2_INPUTS_DOCSTRING = r"""
If ``past_key_values`` is used, only ``input_ids`` that do not have their past calculated should be passed If ``past_key_values`` is used, only ``input_ids`` that do not have their past calculated should be passed
as ``input_ids``. as ``input_ids``.
Indices can be obtained using :class:`transformers.GPT2Tokenizer`. Indices can be obtained using :class:`~transformers.GPT2Tokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
past_key_values (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`): past_key_values (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model
(see ``past_key_values`` output below). Can be used to speed up sequential decoding. (see ``past_key_values`` output below). Can be used to speed up sequential decoding.
The ``input_ids`` which have their past given to this model should not be passed as ``input_ids`` as they have already been computed. The ``input_ids`` which have their past given to this model should not be passed as ``input_ids`` as they
have already been computed.
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`, `optional`): token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`, `optional`):
`input_ids_length` = `sequence_length` if `past` is None else 1
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_ `What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Indices of positions of each input sequence tokens in the position embeddings.
...@@ -441,20 +449,28 @@ GPT2_INPUTS_DOCSTRING = r""" ...@@ -441,20 +449,28 @@ GPT2_INPUTS_DOCSTRING = r"""
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
This is useful if you want more control over how to convert `input_ids` indices into associated vectors Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
than the model's internal embedding lookup matrix. This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
If ``past_key_values`` is used, optionally only the last `inputs_embeds` have to be input (see ``past_key_values``). vectors than the model's internal embedding lookup matrix.
use_cache (:obj:`bool`):
If `use_cache` is True, ``past_key_values`` key value states are returned and can be used to speed up decoding (see ``past_key_values``). Defaults to `True`. If ``past_key_values`` is used, optionally only the last :obj:`inputs_embeds` have to be input (see
``past_key_values``).
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
decoding (see ``past_key_values``).
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
...@@ -809,25 +825,25 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel): ...@@ -809,25 +825,25 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
**kwargs, **kwargs,
): ):
r""" r"""
mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, defaults to index of the last token of the input): mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, defaults to index of the last token of the input):
Index of the classification token in each input sequence. Index of the classification token in each input sequence.
Selected in the range ``[0, input_ids.size(-1) - 1[``. Selected in the range ``[0, input_ids.size(-1) - 1[``.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for language modeling. Labels for language modeling.
Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids`` Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
Indices are selected in ``[-100, 0, ..., config.vocab_size]`` Indices are selected in ``[-100, 0, ..., config.vocab_size]``
All labels set to ``-100`` are ignored (masked), the loss is only All labels set to ``-100`` are ignored (masked), the loss is only
computed for labels in ``[0, ..., config.vocab_size]`` computed for labels in ``[0, ..., config.vocab_size]``
mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`): mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`):
Labels for computing the multiple choice classification loss. Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices - 1]`` where `num_choices` is the size of the second dimension Indices should be in ``[0, ..., num_choices - 1]`` where `num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above) of the input tensors. (see `input_ids` above)
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Used to hide legacy arguments that have been deprecated. Used to hide legacy arguments that have been deprecated.
Return: Return:
Examples:: Example::
>>> import torch >>> import torch
>>> from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel >>> from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel
......
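As a hedged sketch of how mc_token_ids, mc_labels and the (batch_size, num_choices, sequence_length) input layout fit together (the added [CLS] token, the choices and the checkpoint are illustrative assumptions):

import torch
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2DoubleHeadsModel.from_pretrained("gpt2")

# Add a classification token and resize the embeddings so the new id has a vector.
tokenizer.add_special_tokens({"cls_token": "[CLS]"})
model.resize_token_embeddings(len(tokenizer))

choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
encoded = [tokenizer.encode(c) for c in choices]
input_ids = torch.tensor(encoded).unsqueeze(0)                # (batch_size=1, num_choices=2, seq_len)
mc_token_ids = torch.tensor([[len(e) - 1 for e in encoded]])  # position of [CLS] in each choice
mc_labels = torch.tensor([0])                                 # the first choice is the correct one
outputs = model(input_ids, mc_token_ids=mc_token_ids, mc_labels=mc_labels, return_dict=True)
print(outputs.mc_logits.shape)  # (batch_size, num_choices)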