Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
...
@@ -710,11 +717,11 @@ class BartEncoder(BartPretrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
...
@@ -875,7 +882,7 @@ class BartDecoder(BartPretrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -912,18 +919,18 @@ class BartDecoder(BartPretrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing
+cross-attention on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
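The hunks above rename the decoder's cross-attention mask argument from `encoder_head_mask` to `cross_attn_head_mask` and correct the documented shapes. As a minimal sketch (not part of the patch itself), assuming the `facebook/bart-base` checkpoint and the post-rename signature, the three masks could be passed like this:

```python
# Hedged sketch: exercises the renamed mask arguments documented above.
# Assumes the facebook/bart-base checkpoint and the post-rename signature.
import torch
from transformers import BartModel, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

inputs = tokenizer("Studies show that good examples help.", return_tensors="pt")

cfg = model.config
# 1 keeps a head, 0 nullifies it, exactly as the docstrings above describe.
head_mask = torch.ones(cfg.encoder_layers, cfg.encoder_attention_heads)
decoder_head_mask = torch.ones(cfg.decoder_layers, cfg.decoder_attention_heads)
cross_attn_head_mask = torch.ones(cfg.decoder_layers, cfg.decoder_attention_heads)
head_mask[0, 0] = 0.0  # e.g. drop the first self-attention head of the first encoder layer

outputs = model(
    **inputs,
    head_mask=head_mask,
    decoder_head_mask=decoder_head_mask,
    cross_attn_head_mask=cross_attn_head_mask,
)
print(outputs.last_hidden_state.shape)
```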
...
@@ -993,11 +1000,12 @@ class BartDecoder(BartPretrainedModel):
...
@@ -1123,6 +1133,7 @@ class BartModel(BartPretrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1172,7 +1183,7 @@ class BartModel(BartPretrainedModel):
encoder_hidden_states=encoder_outputs[0],
encoder_attention_mask=attention_mask,
head_mask=decoder_head_mask,
-encoder_head_mask=head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=decoder_inputs_embeds,
use_cache=use_cache,
...
@@ -1248,6 +1259,7 @@ class BartForConditionalGeneration(BartPretrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1282,6 +1294,7 @@ class BartForConditionalGeneration(BartPretrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1386,6 +1399,7 @@ class BartForSequenceClassification(BartPretrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
inputs_embeds=None,
decoder_inputs_embeds=None,
...
@@ -1416,6 +1430,7 @@ class BartForSequenceClassification(BartPretrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
encoder_outputs=encoder_outputs,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1496,6 +1511,7 @@ class BartForQuestionAnswering(BartPretrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
start_positions=None,
end_positions=None,
...
@@ -1527,6 +1543,7 @@ class BartForQuestionAnswering(BartPretrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
encoder_outputs=encoder_outputs,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1633,7 +1650,7 @@ class BartForCausalLM(BartPretrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
...
@@ -1666,18 +1683,17 @@ class BartForCausalLM(BartPretrainedModel):
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used
in the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+cross_attn_head_mask described as: Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -1734,7 +1750,7 @@ class BartForCausalLM(BartPretrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
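The `past_key_values` entries above describe the decoding cache. A rough sketch of that caching pattern with `BartForCausalLM` (illustrative only, not from the patch; loading just the decoder from the full `facebook/bart-base` checkpoint will warn about unused encoder weights):

```python
# Hedged sketch of the past_key_values caching described above: once a cache
# exists, only the newly generated token is fed back in.
import torch
from transformers import BartForCausalLM, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForCausalLM.from_pretrained("facebook/bart-base")
model.eval()

input_ids = tokenizer("The head mask", return_tensors="pt").input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(5):
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values           # reuse the cached keys/values
        next_token = out.logits[:, -1:].argmax(dim=-1)  # greedy pick of the next token
        input_ids = next_token                          # feed only the new token next step
```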
...
@@ -666,11 +673,11 @@ class BlenderbotEncoder(BlenderbotPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
...
@@ -834,7 +841,7 @@ class BlenderbotDecoder(BlenderbotPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -871,18 +878,19 @@ class BlenderbotDecoder(BlenderbotPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in ``[0,
+1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing
+cross-attention on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
@@ -951,11 +959,12 @@ class BlenderbotDecoder(BlenderbotPreTrainedModel):
...
@@ -951,11 +959,12 @@ class BlenderbotDecoder(BlenderbotPreTrainedModel):
@@ -1090,6 +1101,7 @@ class BlenderbotModel(BlenderbotPreTrainedModel):
...
@@ -1090,6 +1101,7 @@ class BlenderbotModel(BlenderbotPreTrainedModel):
decoder_attention_mask=None,
decoder_attention_mask=None,
head_mask=None,
head_mask=None,
decoder_head_mask=None,
decoder_head_mask=None,
cross_attn_head_mask=None,
encoder_outputs=None,
encoder_outputs=None,
past_key_values=None,
past_key_values=None,
inputs_embeds=None,
inputs_embeds=None,
...
@@ -1147,7 +1159,7 @@ class BlenderbotModel(BlenderbotPreTrainedModel):
...
@@ -1147,7 +1159,7 @@ class BlenderbotModel(BlenderbotPreTrainedModel):
encoder_hidden_states=encoder_outputs[0],
encoder_hidden_states=encoder_outputs[0],
encoder_attention_mask=attention_mask,
encoder_attention_mask=attention_mask,
head_mask=decoder_head_mask,
head_mask=decoder_head_mask,
encoder_head_mask=head_mask,
cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
past_key_values=past_key_values,
inputs_embeds=decoder_inputs_embeds,
inputs_embeds=decoder_inputs_embeds,
use_cache=use_cache,
use_cache=use_cache,
...
@@ -1241,6 +1253,7 @@ class BlenderbotForConditionalGeneration(BlenderbotPreTrainedModel):
...
@@ -1241,6 +1253,7 @@ class BlenderbotForConditionalGeneration(BlenderbotPreTrainedModel):
decoder_attention_mask=None,
decoder_attention_mask=None,
head_mask=None,
head_mask=None,
decoder_head_mask=None,
decoder_head_mask=None,
cross_attn_head_mask=None,
encoder_outputs=None,
encoder_outputs=None,
past_key_values=None,
past_key_values=None,
inputs_embeds=None,
inputs_embeds=None,
...
@@ -1275,6 +1288,7 @@ class BlenderbotForConditionalGeneration(BlenderbotPreTrainedModel):
...
@@ -1275,6 +1288,7 @@ class BlenderbotForConditionalGeneration(BlenderbotPreTrainedModel):
decoder_attention_mask=decoder_attention_mask,
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
decoder_head_mask=decoder_head_mask,
cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1395,7 +1409,7 @@ class BlenderbotForCausalLM(BlenderbotPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
...
@@ -1428,18 +1442,17 @@ class BlenderbotForCausalLM(BlenderbotPreTrainedModel):
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used
in the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -1496,7 +1509,7 @@ class BlenderbotForCausalLM(BlenderbotPreTrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
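The `encoder_outputs` tuple described above (`last_hidden_state`, optional `hidden_states`, optional `attentions`) can be precomputed once and reused instead of re-encoding on every call. A hedged sketch, assuming the `facebook/blenderbot-400M-distill` checkpoint and illustrative decoder inputs:

```python
# Hedged sketch of reusing precomputed encoder states, as the encoder_outputs
# docstring above describes. Checkpoint name is illustrative.
import torch
from transformers import BlenderbotForConditionalGeneration, BlenderbotTokenizer

name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
encoder_outputs = model.get_encoder()(**inputs)  # BaseModelOutput(last_hidden_state, ...)

# The seq2seq forward accepts the precomputed encoder states instead of re-encoding.
decoder_input_ids = tokenizer(" I am fine.", return_tensors="pt").input_ids
out = model(
    encoder_outputs=encoder_outputs,
    attention_mask=inputs["attention_mask"],
    decoder_input_ids=decoder_input_ids,
)
print(out.logits.shape)
```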
...
@@ -667,11 +674,11 @@ class BlenderbotSmallEncoder(BlenderbotSmallPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
...
@@ -834,7 +841,7 @@ class BlenderbotSmallDecoder(BlenderbotSmallPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -871,18 +878,18 @@ class BlenderbotSmallDecoder(BlenderbotSmallPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing
+cross-attention on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -953,10 +960,12 @@ class BlenderbotSmallDecoder(BlenderbotSmallPreTrainedModel):
...
@@ -1077,6 +1088,7 @@ class BlenderbotSmallModel(BlenderbotSmallPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1134,7 +1146,7 @@ class BlenderbotSmallModel(BlenderbotSmallPreTrainedModel):
encoder_hidden_states=encoder_outputs[0],
encoder_attention_mask=attention_mask,
head_mask=decoder_head_mask,
-encoder_head_mask=head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=decoder_inputs_embeds,
use_cache=use_cache,
...
@@ -1216,6 +1228,7 @@ class BlenderbotSmallForConditionalGeneration(BlenderbotSmallPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1250,6 +1263,7 @@ class BlenderbotSmallForConditionalGeneration(BlenderbotSmallPreTrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1370,7 +1384,7 @@ class BlenderbotSmallForCausalLM(BlenderbotSmallPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
...
@@ -1403,18 +1417,17 @@ class BlenderbotSmallForCausalLM(BlenderbotSmallPreTrainedModel):
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used
in the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -1471,7 +1484,7 @@ class BlenderbotSmallForCausalLM(BlenderbotSmallPreTrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
...
@@ -1730,7 +1685,7 @@ class LEDEncoder(LEDPreTrainedModel):
- 0 for local attention (a sliding window attention),
- 1 for global attention (tokens that attend to all other tokens, and all other tokens attend to them).
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
...
@@ -1914,7 +1869,7 @@ class LEDDecoder(LEDPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -1961,18 +1916,17 @@ class LEDDecoder(LEDPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+cross_attn_head_mask described as: Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -2052,11 +2006,12 @@ class LEDDecoder(LEDPreTrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
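The LED hunk above distinguishes local (sliding-window) attention from global attention. A small illustrative sketch, assuming the `allenai/led-base-16384` checkpoint, that gives only the first token global attention:

```python
# Hedged sketch of the local/global attention flags mentioned above for LED.
# global_attention_mask marks tokens that attend globally (1); all other tokens
# use the sliding-window local attention (0).
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

name = "allenai/led-base-16384"
tokenizer = LEDTokenizer.from_pretrained(name)
model = LEDForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("A long document to summarize ...", return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the first (<s>) token global attention

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=32,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```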
...
@@ -704,6 +722,12 @@ class M2M100Encoder(M2M100PreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
+head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+- 1 indicates the head is **not masked**,
+- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
representation. This is useful if you want more control over how to convert :obj:`input_ids` indices
...
@@ -841,7 +865,7 @@ class M2M100Decoder(M2M100PreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -878,6 +902,19 @@ class M2M100Decoder(M2M100PreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+- 1 indicates the head is **not masked**,
+- 0 indicates the head is **masked**.
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing
+cross-attention on hidden heads. Mask values selected in ``[0, 1]``:
+- 1 indicates the head is **not masked**,
+- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
decoding.
...
@@ -955,11 +992,12 @@ class M2M100Decoder(M2M100PreTrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
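The M2M100 docstrings above also allow `inputs_embeds` in place of `input_ids`. A hedged sketch, assuming the `facebook/m2m100_418M` checkpoint (note this bypasses the model's internal embedding lookup and scaling, so it is only an API illustration):

```python
# Hedged sketch of the inputs_embeds alternative documented above: look up the
# embeddings yourself and hand them to the model instead of input_ids.
import torch
from transformers import M2M100Model, M2M100Tokenizer

name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(name, src_lang="en")
model = M2M100Model.from_pretrained(name)

enc = tokenizer("Masks select attention heads.", return_tensors="pt")
inputs_embeds = model.get_input_embeddings()(enc["input_ids"])

outputs = model(
    inputs_embeds=inputs_embeds,
    attention_mask=enc["attention_mask"],
    decoder_input_ids=enc["input_ids"],  # any valid ids work for this illustration
)
print(outputs.last_hidden_state.shape)
```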
...
@@ -678,11 +685,11 @@ class MarianEncoder(MarianPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
...
@@ -842,7 +849,7 @@ class MarianDecoder(MarianPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -879,18 +886,18 @@ class MarianDecoder(MarianPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing
+cross-attention on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -959,11 +966,12 @@ class MarianDecoder(MarianPreTrainedModel):
...
@@ -1084,6 +1094,7 @@ class MarianModel(MarianPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1142,7 +1153,7 @@ class MarianModel(MarianPreTrainedModel):
encoder_hidden_states=encoder_outputs[0],
encoder_attention_mask=attention_mask,
head_mask=decoder_head_mask,
-encoder_head_mask=head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=decoder_inputs_embeds,
use_cache=use_cache,
...
@@ -1229,6 +1240,7 @@ class MarianMTModel(MarianPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1264,6 +1276,7 @@ class MarianMTModel(MarianPreTrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1391,7 +1404,7 @@ class MarianForCausalLM(MarianPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
...
@@ -1424,18 +1437,17 @@ class MarianForCausalLM(MarianPreTrainedModel):
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used
in the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+cross_attn_head_mask described as: Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -1492,7 +1504,7 @@ class MarianForCausalLM(MarianPreTrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
...
@@ -708,11 +715,11 @@ class MBartEncoder(MBartPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
...
@@ -877,7 +884,7 @@ class MBartDecoder(MBartPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
-encoder_head_mask=None,
+cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -914,18 +921,18 @@ class MBartDecoder(MBartPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
-head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
-encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
-Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
-on hidden heads. Mask values selected in ``[0, 1]``:
+cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing
+cross-attention on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
-- 0 indicates the heas is **masked**.
+- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -995,11 +1002,12 @@ class MBartDecoder(MBartPreTrainedModel):
...
@@ -1127,6 +1137,7 @@ class MBartModel(MBartPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1173,7 +1184,7 @@ class MBartModel(MBartPreTrainedModel):
encoder_hidden_states=encoder_outputs[0],
encoder_attention_mask=attention_mask,
head_mask=decoder_head_mask,
-encoder_head_mask=head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=decoder_inputs_embeds,
use_cache=use_cache,
...
@@ -1254,6 +1265,7 @@ class MBartForConditionalGeneration(MBartPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1287,6 +1299,7 @@ class MBartForConditionalGeneration(MBartPreTrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1384,6 +1397,7 @@ class MBartForSequenceClassification(MBartPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
inputs_embeds=None,
decoder_inputs_embeds=None,
...
@@ -1414,6 +1428,7 @@ class MBartForSequenceClassification(MBartPreTrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
encoder_outputs=encoder_outputs,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1495,6 +1510,7 @@ class MBartForQuestionAnswering(MBartPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
+cross_attn_head_mask=None,
encoder_outputs=None,
start_positions=None,
end_positions=None,
...
@@ -1526,6 +1542,7 @@ class MBartForQuestionAnswering(MBartPreTrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
+cross_attn_head_mask=cross_attn_head_mask,
encoder_outputs=encoder_outputs,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1634,7 +1651,7 @@ class MBartForCausalLM(MBartPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
encoder_head_mask=None,
cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
...
@@ -1667,18 +1684,17 @@ class MBartForCausalLM(MBartPreTrainedModel):
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used
in the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
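A small sketch of the caching behaviour described above for ``MBartForCausalLM``, assuming the ``facebook/mbart-large-cc25`` weights (loading the seq2seq checkpoint into the decoder-only class should simply ignore the encoder weights):

    import torch
    from transformers import MBartTokenizer, MBartForCausalLM

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
    model = MBartForCausalLM.from_pretrained("facebook/mbart-large-cc25")

    inputs = tokenizer("Studies have shown that", return_tensors="pt")

    # First pass over the whole prefix; ask the model to return its key/value cache.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values  # one entry per decoder layer

    # Later passes only need the newly chosen token plus the cache.
    next_token = out.logits[:, -1:].argmax(-1)
    out = model(input_ids=next_token, past_key_values=past, use_cache=True)
    print(out.logits.shape)  # (batch_size, 1, vocab_size)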
...
@@ -1735,7 +1751,7 @@ class MBartForCausalLM(MBartPreTrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
...
@@ -679,11 +686,11 @@ class PegasusEncoder(PegasusPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
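For the encoder, the mask dimensions follow ``encoder_layers`` and ``encoder_attention_heads``, as the corrected shape above states. A hedged sketch using the public ``google/pegasus-xsum`` checkpoint (any Pegasus checkpoint should work the same way):

    import torch
    from transformers import PegasusTokenizer, PegasusModel

    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    model = PegasusModel.from_pretrained("google/pegasus-xsum")

    cfg = model.config
    head_mask = torch.ones(cfg.encoder_layers, cfg.encoder_attention_heads)
    head_mask[0, 0] = 0.0  # silence head 0 of the first encoder layer

    inputs = tokenizer("PEGASUS was pre-trained with gap-sentence generation.", return_tensors="pt")
    outputs = model(
        input_ids=inputs.input_ids,
        decoder_input_ids=inputs.input_ids[:, :1],
        head_mask=head_mask,  # applied to the encoder's self-attention layers
    )
    print(outputs.last_hidden_state.shape)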
...
@@ -848,7 +855,7 @@ class PegasusDecoder(PegasusPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
encoder_head_mask=None,
cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -885,18 +892,18 @@ class PegasusDecoder(PegasusPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
Mask to nullify selected heads of the cross-attention modules in decoder to avoid performing
on hidden heads. Mask values selected in ``[0, 1]``:
cross-attention on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
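The attention-mask lines earlier in this docstring describe a plain binary mask (1 = attend, 0 = padding). The tokenizer produces it automatically when padding a batch; a quick illustration, assuming the same ``google/pegasus-xsum`` tokenizer as above:

    from transformers import PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    batch = tokenizer(
        ["a short sentence", "a noticeably longer sentence that forces the first one to be padded"],
        padding=True,
        return_tensors="pt",
    )
    # 1 for real tokens, 0 for padding; this is exactly the mask the docstrings above describe.
    print(batch.attention_mask)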
...
@@ -965,11 +972,12 @@ class PegasusDecoder(PegasusPreTrainedModel):
...
@@ -1092,6 +1102,7 @@ class PegasusModel(PegasusPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1150,7 +1161,7 @@ class PegasusModel(PegasusPreTrainedModel):
encoder_hidden_states=encoder_outputs[0],
encoder_attention_mask=attention_mask,
head_mask=decoder_head_mask,
encoder_head_mask=head_mask,
cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=decoder_inputs_embeds,
use_cache=use_cache,
...
@@ -1232,6 +1243,7 @@ class PegasusForConditionalGeneration(PegasusPreTrainedModel):
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -1267,6 +1279,7 @@ class PegasusForConditionalGeneration(PegasusPreTrainedModel):
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -1390,7 +1403,7 @@ class PegasusForCausalLM(PegasusPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
encoder_head_mask=None,
cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
...
@@ -1423,18 +1436,17 @@ class PegasusForCausalLM(PegasusPreTrainedModel):
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used
in the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -1491,7 +1503,7 @@ class PegasusForCausalLM(PegasusPreTrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
...
@@ -728,11 +738,11 @@ class Speech2TextEncoder(Speech2TextPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
...
@@ -884,7 +894,7 @@ class Speech2TextDecoder(Speech2TextPreTrainedModel):
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
encoder_head_mask=None,
cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -921,18 +931,18 @@ class Speech2TextDecoder(Speech2TextPreTrainedModel):
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
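Speech2Text uses the same configuration fields for the mask shapes. A sketch that only needs the configuration class (default values, no checkpoint download), to make the documented shapes concrete:

    import torch
    from transformers import Speech2TextConfig

    cfg = Speech2TextConfig()  # default architecture values
    head_mask = torch.ones(cfg.encoder_layers, cfg.encoder_attention_heads)            # encoder self-attention
    decoder_head_mask = torch.ones(cfg.decoder_layers, cfg.decoder_attention_heads)    # decoder self-attention
    cross_attn_head_mask = torch.ones(cfg.decoder_layers, cfg.decoder_attention_heads) # decoder cross-attention
    print(head_mask.shape, decoder_head_mask.shape, cross_attn_head_mask.shape)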
...
@@ -1001,12 +1011,12 @@ class Speech2TextDecoder(Speech2TextPreTrainedModel):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
:obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
...
@@ -2211,10 +2218,11 @@ class {{cookiecutter.camelcase_modelname}}Encoder({{cookiecutter.camelcase_model
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
...
@@ -2377,7 +2385,7 @@ class {{cookiecutter.camelcase_modelname}}Decoder({{cookiecutter.camelcase_model
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
encoder_head_mask=None,
cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
...
@@ -2414,18 +2422,17 @@ class {{cookiecutter.camelcase_modelname}}Decoder({{cookiecutter.camelcase_model
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
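The template documents the same mask semantics as the concrete models above. Mechanically, a BART-style attention layer multiplies its post-softmax attention weights by that layer's row of the mask, roughly as in this standalone sketch (not the template code itself):

    import torch

    batch, heads, tgt_len, src_len = 2, 4, 5, 7
    attn_weights = torch.softmax(torch.randn(batch, heads, tgt_len, src_len), dim=-1)

    layer_head_mask = torch.tensor([1.0, 0.0, 1.0, 1.0])  # 0 nullifies the second head of this layer
    attn_weights = layer_head_mask.view(1, -1, 1, 1) * attn_weights

    print(attn_weights[:, 1].abs().max())  # tensor(0.): the masked head contributes nothing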
...
@@ -2493,12 +2500,12 @@ class {{cookiecutter.camelcase_modelname}}Decoder({{cookiecutter.camelcase_model
...
@@ -2621,6 +2628,7 @@ class {{cookiecutter.camelcase_modelname}}Model({{cookiecutter.camelcase_modelna
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -2662,7 +2670,7 @@ class {{cookiecutter.camelcase_modelname}}Model({{cookiecutter.camelcase_modelna
encoder_hidden_states=encoder_outputs[0],
encoder_attention_mask=attention_mask,
head_mask=decoder_head_mask,
encoder_head_mask=head_mask,
cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=decoder_inputs_embeds,
use_cache=use_cache,
...
@@ -2743,6 +2751,7 @@ class {{cookiecutter.camelcase_modelname}}ForConditionalGeneration({{cookiecutte
decoder_attention_mask=None,
head_mask=None,
decoder_head_mask=None,
cross_attn_head_mask=None,
encoder_outputs=None,
past_key_values=None,
inputs_embeds=None,
...
@@ -2791,6 +2800,7 @@ class {{cookiecutter.camelcase_modelname}}ForConditionalGeneration({{cookiecutte
decoder_attention_mask=decoder_attention_mask,
head_mask=head_mask,
decoder_head_mask=decoder_head_mask,
cross_attn_head_mask=cross_attn_head_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
decoder_inputs_embeds=decoder_inputs_embeds,
...
@@ -3124,7 +3134,7 @@ class {{cookiecutter.camelcase_modelname}}ForCausalLM({{cookiecutter.camelcase_m
encoder_hidden_states=None,
encoder_attention_mask=None,
head_mask=None,
encoder_head_mask=None,
cross_attn_head_mask=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
...
@@ -3157,18 +3167,17 @@ class {{cookiecutter.camelcase_modelname}}ForCausalLM({{cookiecutter.camelcase_m
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used
in the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
on hidden heads. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the heas is **masked**.
- 0 indicates the head is **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
...
@@ -3225,7 +3234,7 @@ class {{cookiecutter.camelcase_modelname}}ForCausalLM({{cookiecutter.camelcase_m