Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
If you want to change padding behavior, you should read :func:`~transformers.modeling_bart._prepare_decoder_inputs` and modify.
See diagram 1 in the paper for more info on the default strategy
decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains pre-computed key and value hidden-states of the attention blocks.
Can be used to speed up decoding.
If ``decoder_past_key_value_states`` are used, the user can optionally input only the last
``decoder_input_ids`` (those that don't have their past key value states given to this model) of shape
:obj:`(batch_size, 1)` instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
If `use_cache` is True, ``decoder_past_key_values`` are returned and can be used to speed up decoding (see
``decoder_past_key_values``).
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
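Usage note (not part of the diff): the caching contract documented above — pass ``use_cache=True``, receive the decoder cache back, then feed only the last ``decoder_input_ids`` of shape ``(batch_size, 1)`` together with that cache — can be sketched roughly as below. The public keyword name ``decoder_past_key_values`` and the cache sitting at index 1 of the returned outputs are assumptions of this sketch, consistent with the renames in the hunks that follow.

>>> from transformers import BartTokenizer, BartModel
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> model = BartModel.from_pretrained('facebook/bart-large')
>>> input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt")["input_ids"]
>>> decoder_input_ids = input_ids[:, :3]
>>> outputs = model(input_ids, decoder_input_ids=decoder_input_ids, use_cache=True)
>>> cache = outputs[1]  # assumed position of decoder_past_key_values in this sketch
>>> # Next step: only the last decoder token, shape (batch_size, 1), plus the cache.
>>> outputs = model(input_ids, decoder_input_ids=decoder_input_ids[:, -1:],
...                 decoder_past_key_values=cache, use_cache=True)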
...
@@ -482,7 +491,7 @@ class BartDecoder(nn.Module):
encoder_padding_mask,
decoder_padding_mask,
decoder_causal_mask,
-decoder_cached_states=None,
+decoder_past_key_values=None,
use_cache=False,
output_attentions=False,
output_hidden_states=False,
...
@@ -499,7 +508,7 @@ class BartDecoder(nn.Module):
encoder_hidden_states: output from the encoder, used for
encoder-side attention
encoder_padding_mask: for ignoring pad tokens
-decoder_cached_states (dict or None): dictionary used for storing state during generation
+decoder_past_key_values (dict or None): dictionary used for storing state during generation
Returns:
BaseModelOutputWithPast or tuple:
...
@@ -508,6 +517,13 @@ class BartDecoder(nn.Module):
- hidden states
- attentions
"""
if"decoder_cached_states"inunused:
warnings.warn(
"The `decoder_cached_states` argument is deprecated and will be removed in a future version, use `decoder_past_key_values` instead.",
`What are input IDs? <../glossary.html#input-ids>`__
-past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
+past_key_values (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-(see `past` output below). Can be used to speed up sequential decoding.
+(see ``past_key_values`` output below). Can be used to speed up sequential decoding.
-The `input_ids` which have their past given to this model should not be passed as `input_ids` as they have already been computed.
+The ``input_ids`` which have their past given to this model should not be passed as ``input_ids`` as they have already been computed.
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
...
@@ -386,9 +388,9 @@ GPT2_INPUTS_DOCSTRING = r"""
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
-If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).
+If ``past_key_values`` is used, optionally only the last `inputs_embeds` have to be input (see ``past_key_values``).
use_cache (:obj:`bool`):
-If `use_cache` is True, `past` key value states are returned and can be used to speed up decoding (see `past`). Defaults to `True`.
+If `use_cache` is True, ``past_key_values`` key value states are returned and can be used to speed up decoding (see ``past_key_values``). Defaults to `True`.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
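Usage note (not part of the diff): the renamed ``past_key_values`` input documented above is what enables incremental decoding with GPT-2. A minimal sketch, assuming the renamed keyword from the hunks below and that the cache is the second element of the returned outputs (that position is an assumption of this sketch):

>>> import torch
>>> from transformers import GPT2Tokenizer, GPT2Model
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> model = GPT2Model.from_pretrained("gpt2")
>>> input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(input_ids, use_cache=True)
>>> past_key_values = outputs[1]  # cache position assumed for this sketch
>>> # On the next step, pass only the newly generated token together with the cache.
>>> next_token = torch.tensor([[tokenizer.eos_token_id]])
>>> outputs = model(next_token, past_key_values=past_key_values, use_cache=True)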
...
@@ -437,7 +439,7 @@ class GPT2Model(GPT2PreTrainedModel):
def forward(
self,
input_ids=None,
-past=None,
+past_key_values=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
...
@@ -447,7 +449,16 @@ class GPT2Model(GPT2PreTrainedModel):
output_attentions=None,
output_hidden_states=None,
return_tuple=None,
+**kwargs,
):
if"past"inkwargs:
warnings.warn(
"The `past` argument is deprecated and will be removed in a future version, use `past_key_values` instead.",
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
-If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).
+If `decoder_past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_values`).
To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at
`T5 Training <./t5.html#training>`__. If decoder_input_ids and decoder_inputs_embeds are both None,
decoder_input_ids takes the value of input_ids.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
-decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
+decoder_past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains pre-computed key and value hidden-states of the attention blocks.
Can be used to speed up decoding.
-If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`
+If `decoder_past_key_values` are used, the user can optionally input only the last `decoder_input_ids`
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
instead of all `decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
-If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).
+If `use_cache` is True, `decoder_past_key_values` are returned and can be used to speed up decoding (see `decoder_past_key_values`).
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
-If `decoder_past_key_value_states` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `decoder_past_key_value_states`).
+If `decoder_past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `decoder_past_key_values`).
This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both None,
decoder_inputs_embeds takes the value of inputs_embeds.
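Usage note (not part of the diff): the interaction between ``use_cache`` and ``decoder_past_key_values`` documented above can be exercised as follows. The renamed keyword and the cache being the second element of the returned outputs are assumptions of this sketch, consistent with the hunks below:

>>> from transformers import T5Tokenizer, T5Model
>>> tokenizer = T5Tokenizer.from_pretrained('t5-small')
>>> model = T5Model.from_pretrained('t5-small')
>>> input_ids = tokenizer.encode("Studies have been shown that owning a dog is good for you", return_tensors="pt")
>>> decoder_input_ids = tokenizer.encode("Studies show that", return_tensors="pt")
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, use_cache=True)
>>> decoder_past_key_values = outputs[1]  # cache position assumed for this sketch
>>> # Subsequent steps only need the last decoder token once the cache is passed.
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids[:, -1:],
...                 decoder_past_key_values=decoder_past_key_values, use_cache=True)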
...
@@ -923,7 +923,7 @@ class T5Model(T5PreTrainedModel):
encoder_outputs=None,
decoder_input_ids=None,
decoder_attention_mask=None,
-decoder_past_key_value_states=None,
+decoder_past_key_values=None,
use_cache=None,
inputs_embeds=None,
decoder_inputs_embeds=None,
...
@@ -931,6 +931,7 @@ class T5Model(T5PreTrainedModel):
output_attentions=None,
output_hidden_states=None,
return_tuple=None,
+**kwargs,
):
r"""
Returns:
...
@@ -947,6 +948,14 @@ class T5Model(T5PreTrainedModel):
>>> last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
if"decoder_past_key_value_states"inkwargs:
warnings.warn(
"The `decoder_past_key_value_states` argument is deprecated and will be removed in a future version, use `decoder_past_key_values` instead.",