Unverified commit 08f534d2 authored by Sylvain Gugger, committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Styling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
@@ -24,11 +24,11 @@ from .modeling_bart import BartForConditionalGeneration
class MarianMTModel(BartForConditionalGeneration):
r"""
Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints. Available
models are listed `here <https://huggingface.co/models?search=Helsinki-NLP>`__.
This class overrides :class:`~transformers.BartForConditionalGeneration`. Please check the superclass for the
appropriate documentation alongside usage examples.
Examples::
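A minimal translation sketch, assuming the Helsinki-NLP/opus-mt-en-de checkpoint (any OPUS-MT checkpoint works the
same way):
>>> from transformers import MarianMTModel, MarianTokenizer
>>> tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")  # example checkpoint
>>> model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
>>> batch = tokenizer(["Machine learning is great, isn't it?"], return_tensors="pt")
>>> generated_ids = model.generate(**batch)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)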
@@ -14,8 +14,8 @@ MBART_PRETRAINED_MODEL_ARCHIVE_LIST = [
class MBartForConditionalGeneration(BartForConditionalGeneration):
r"""
This class overrides :class:`~transformers.BartForConditionalGeneration`. Please check the superclass for the
appropriate documentation alongside usage examples.
Examples::
>>> from transformers import MBartForConditionalGeneration, MBartTokenizer
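A sketch of one possible continuation, assuming the facebook/mbart-large-en-ro checkpoint and that the tokenizer
exposes a ``lang_code_to_id`` mapping:
>>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")  # example checkpoint
>>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
>>> batch = tokenizer(["UN Chief Says There Is No Military Solution in Syria"], return_tensors="pt")
>>> generated = model.generate(**batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
>>> tokenizer.batch_decode(generated, skip_special_tokens=True)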
@@ -77,23 +77,23 @@ class ModalEmbeddings(nn.Module):
MMBT_START_DOCSTRING = r"""
MMBT model was proposed in `Supervised Multimodal Bitransformers for Classifying Images and Text
<https://github.com/facebookresearch/mmbt>`__ by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
It's a supervised multimodal bitransformer model that fuses information from text and other image encoders, and
obtains state-of-the-art performance on various multimodal classification benchmark tasks.
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
Parameters:
config (:class:`~transformers.MMBTConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration.
transformer (:class:`~nn.Module`): A text transformer that is used by MMBT.
It should have embeddings, encoder, and pooler attributes.
encoder (:class:`~nn.Module`): Encoder for the second modality.
@@ -103,13 +103,12 @@ MMBT_START_DOCSTRING = r"""
MMBT_INPUTS_DOCSTRING = r"""
Args:
input_modal (``torch.FloatTensor`` of shape ``(batch_size, ***)``):
The other modality data. It will be the shape that the encoder for that type expects. e.g. With an Image
Encoder, the shape would be (batch_size, channels, height, width)
input_ids (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``):
Indices of input sequence tokens in the vocabulary. It does not expect [CLS] token to be added as it's
appended to the end of other modality embeddings. Indices can be obtained using
:class:`~transformers.BertTokenizer`. See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
@@ -119,27 +118,26 @@ MMBT_INPUTS_DOCSTRING = r"""
modal_end_tokens (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
Optional end token to be added to Other Modality Embedding. [SEP] is most commonly used.
attention_mask (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
modal_token_type_ids (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:
Segment token indices to indicate different portions of the non-text modality. The embeddings from these
tokens will be summed with the respective token embeddings for the non-text modality.
position_ids (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`__
modal_position_ids (``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``, `optional`):
@@ -148,8 +146,7 @@ MMBT_INPUTS_DOCSTRING = r"""
`What are position IDs? <../glossary.html#position-ids>`__
head_mask (``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
@@ -162,9 +159,8 @@ MMBT_INPUTS_DOCSTRING = r"""
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
@@ -309,31 +305,29 @@ class MMBTModel(nn.Module, ModuleUtilsMixin):
@add_start_docstrings(
"""
MMBT Model with a sequence classification/regression head on top (a linear layer on top of the pooled output)
""",
MMBT_START_DOCSTRING,
MMBT_INPUTS_DOCSTRING,
)
class MMBTForClassification(nn.Module):
r"""
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for computing the sequence classification/regression loss. Indices should be in ``[0, ...,
config.num_labels - 1]``. If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
Returns: `Tuple` comprising various elements depending on the configuration (config) and inputs: **loss**:
(`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: Classification (or
regression if config.num_labels==1) loss. **logits**: ``torch.FloatTensor`` of shape ``(batch_size,
config.num_labels)`` Classification (or regression if config.num_labels==1) scores (before SoftMax).
**hidden_states**: (`optional`, returned when ``output_hidden_states=True``) list of ``torch.FloatTensor`` (one for
the output of each layer + the output of the embeddings) of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs. **attentions**:
(`optional`, returned when ``output_attentions=True``) list of ``torch.FloatTensor`` (one for each layer) of shape
``(batch_size, num_heads, sequence_length, sequence_length)``: Attentions weights after the attention softmax, used
to compute the weighted average in the self-attention heads.
Examples::
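A rough construction sketch; the image encoder below is a hypothetical stand-in, and the exact ``MMBTConfig`` /
``MMBTForClassification`` signatures should be checked against the source:
>>> import torch.nn as nn
>>> from transformers import BertConfig, BertModel, MMBTConfig, MMBTForClassification
>>> transformer = BertModel.from_pretrained("bert-base-uncased")
>>> config = MMBTConfig(BertConfig.from_pretrained("bert-base-uncased"), num_labels=2)  # assumed signature
>>> image_encoder = nn.Linear(2048, config.modal_hidden_size)  # hypothetical encoder for pooled image features
>>> model = MMBTForClassification(config, transformer, image_encoder)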
@@ -668,8 +668,9 @@ class MobileBertPreTrainingHeads(nn.Module):
class MobileBertPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = MobileBertConfig
@@ -703,16 +704,16 @@ class MobileBertForPreTrainingOutput(ModelOutput):
prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -731,14 +732,15 @@ MOBILEBERT_START_DOCSTRING = r"""
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
Parameters:
config (:class:`~transformers.MobileBertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
MOBILEBERT_INPUTS_DOCSTRING = r"""
@@ -746,35 +748,33 @@ MOBILEBERT_INPUTS_DOCSTRING = r"""
input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.BertTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
@@ -784,12 +784,11 @@ MOBILEBERT_INPUTS_DOCSTRING = r"""
This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
@@ -831,9 +830,9 @@ class MobileBertModel(MobileBertPreTrainedModel):
self.embeddings.word_embeddings = value
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
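A small illustration of the expected ``heads_to_prune`` format, using the public ``prune_heads`` helper inherited
from :class:`~transformers.PreTrainedModel` (layer and head indices below are arbitrary):
>>> from transformers import MobileBertModel
>>> model = MobileBertModel.from_pretrained("google/mobilebert-uncased")
>>> model.prune_heads({1: [0, 2], 5: [3]})  # prune heads 0 and 2 in layer 1, head 3 in layer 5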
@@ -933,8 +932,10 @@ class MobileBertModel(MobileBertPreTrainedModel):
@add_start_docstrings(
"""
MobileBert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a
`next sentence prediction (classification)` head.
""",
MOBILEBERT_START_DOCSTRING,
)
class MobileBertForPreTraining(MobileBertPreTrainedModel):
@@ -950,9 +951,8 @@ class MobileBertForPreTraining(MobileBertPreTrainedModel):
def tie_weights(self):
"""
Tie the weights between the input embeddings and the output embeddings. If the `torchscript` flag is set in the
configuration, TorchScript can't handle parameter sharing, so we clone the weights instead.
"""
output_embeddings = self.get_output_embeddings()
input_embeddings = self.get_input_embeddings()
@@ -988,13 +988,12 @@ class MobileBertForPreTraining(MobileBertPreTrainedModel):
):
r"""
labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair
(see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``:
- 0 indicates sequence B is a continuation of sequence A,
- 1 indicates sequence B is a random sequence.
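A small sketch of how these two label tensors might be built (the checkpoint name and the masked position are
illustrative only):
>>> import torch
>>> from transformers import MobileBertTokenizer, MobileBertForPreTraining
>>> tokenizer = MobileBertTokenizer.from_pretrained("google/mobilebert-uncased")
>>> model = MobileBertForPreTraining.from_pretrained("google/mobilebert-uncased")
>>> inputs = tokenizer("Paris is the capital of France.", "It lies on the Seine.", return_tensors="pt")
>>> labels = torch.full_like(inputs["input_ids"], -100)  # -100 marks positions ignored by the MLM loss
>>> labels[0, 4] = inputs["input_ids"][0, 4]             # compute the loss on a single (pretend-masked) position
>>> next_sentence_label = torch.tensor([0])              # 0: sentence B really follows sentence A
>>> outputs = model(**inputs, labels=labels, next_sentence_label=next_sentence_label, return_dict=True)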
@@ -1070,9 +1069,8 @@ class MobileBertForMaskedLM(MobileBertPreTrainedModel):
def tie_weights(self):
"""
Tie the weights between the input embeddings and the output embeddings. If the `torchscript` flag is set in the
configuration, TorchScript can't handle parameter sharing, so we clone the weights instead.
"""
output_embeddings = self.get_output_embeddings()
input_embeddings = self.get_input_embeddings()
@@ -1115,10 +1113,9 @@ class MobileBertForMaskedLM(MobileBertPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Used to hide legacy arguments that have been deprecated.
"""
@@ -1204,8 +1201,8 @@ class MobileBertForNextSentencePrediction(MobileBertPreTrainedModel):
):
r"""
next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair
(see ``input_ids`` docstring) Indices should be in ``[0, 1]``.
- 0 indicates sequence B is a continuation of sequence A,
- 1 indicates sequence B is a random sequence.
@@ -1263,8 +1260,10 @@ class MobileBertForNextSentencePrediction(MobileBertPreTrainedModel):
@add_start_docstrings(
"""
MobileBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the
pooled output) e.g. for GLUE tasks.
""",
MOBILEBERT_START_DOCSTRING,
)
class MobileBertForSequenceClassification(MobileBertPreTrainedModel):
@@ -1299,9 +1298,8 @@ class MobileBertForSequenceClassification(MobileBertPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -1344,8 +1342,10 @@ class MobileBertForSequenceClassification(MobileBertPreTrainedModel):
@add_start_docstrings(
"""
MobileBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a
linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
MOBILEBERT_START_DOCSTRING,
)
class MobileBertForQuestionAnswering(MobileBertPreTrainedModel):
@@ -1385,12 +1385,12 @@ class MobileBertForQuestionAnswering(MobileBertPreTrainedModel):
r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
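A minimal training-style sketch for the span labels above (checkpoint and positions are illustrative only):
>>> import torch
>>> from transformers import MobileBertTokenizer, MobileBertForQuestionAnswering
>>> tokenizer = MobileBertTokenizer.from_pretrained("google/mobilebert-uncased")
>>> model = MobileBertForQuestionAnswering.from_pretrained("google/mobilebert-uncased")
>>> inputs = tokenizer("Who wrote Hamlet?", "Hamlet was written by Shakespeare.", return_tensors="pt")
>>> start_positions = torch.tensor([10])  # token index where the answer span starts (illustrative)
>>> end_positions = torch.tensor([10])    # token index where the answer span ends (illustrative)
>>> outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions, return_dict=True)
>>> outputs.loss  # averaged start/end cross-entropy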
......@@ -1444,8 +1444,10 @@ class MobileBertForQuestionAnswering(MobileBertPreTrainedModel):
@add_start_docstrings(
"""MobileBert Model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
"""
MobileBert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and
a softmax) e.g. for RocStories/SWAG tasks.
""",
MOBILEBERT_START_DOCSTRING,
)
class MobileBertForMultipleChoice(MobileBertPreTrainedModel):
@@ -1480,9 +1482,9 @@ class MobileBertForMultipleChoice(MobileBertPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,
num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See
:obj:`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
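A quick sketch of the expected ``(batch_size, num_choices, sequence_length)`` layout (prompt and choices are made
up):
>>> import torch
>>> from transformers import MobileBertTokenizer, MobileBertForMultipleChoice
>>> tokenizer = MobileBertTokenizer.from_pretrained("google/mobilebert-uncased")
>>> model = MobileBertForMultipleChoice.from_pretrained("google/mobilebert-uncased")
>>> prompt = "The cat sat on the"
>>> choices = ["mat.", "moon."]
>>> encoding = tokenizer([prompt, prompt], choices, return_tensors="pt", padding=True)
>>> inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}  # -> (batch_size=1, num_choices=2, seq_len)
>>> labels = torch.tensor([0])  # the first choice is the correct one
>>> outputs = model(**inputs, labels=labels, return_dict=True)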
@@ -1533,8 +1535,10 @@ class MobileBertForMultipleChoice(MobileBertPreTrainedModel):
@add_start_docstrings(
"""
MobileBert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g.
for Named-Entity-Recognition (NER) tasks.
""",
MOBILEBERT_START_DOCSTRING,
)
class MobileBertForTokenClassification(MobileBertPreTrainedModel):
@@ -1573,8 +1577,8 @@ class MobileBertForTokenClassification(MobileBertPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -
1]``.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -272,8 +272,9 @@ class Block(nn.Module):
class OpenAIGPTPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = OpenAIGPTConfig
@@ -314,8 +315,8 @@ class OpenAIGPTDoubleHeadsModelOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -335,14 +336,15 @@ OPENAI_GPT_START_DOCSTRING = r"""
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
Parameters:
config (:class:`~transformers.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
OPENAI_GPT_INPUTS_DOCSTRING = r"""
@@ -350,35 +352,33 @@ OPENAI_GPT_INPUTS_DOCSTRING = r"""
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.OpenAIGPTTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`__
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
@@ -421,8 +421,8 @@ class OpenAIGPTModel(OpenAIGPTPreTrainedModel):
self.tokens_embed = new_embeddings
def _prune_heads(self, heads_to_prune):
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
"""
for layer, heads in heads_to_prune.items():
self.h[layer].attn.prune_heads(heads)
@@ -526,8 +526,10 @@ class OpenAIGPTModel(OpenAIGPTPreTrainedModel):
@add_start_docstrings(
"""
OpenAI GPT Model transformer with a language modeling head on top (linear layer with weights tied to the input
embeddings).
""",
OPENAI_GPT_START_DOCSTRING,
)
class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
@@ -563,11 +565,9 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
``labels = input_ids`` Indices are selected in ``[-100, 0, ..., config.vocab_size]`` All labels set to
``-100`` are ignored (masked), the loss is only computed for labels in ``[0, ..., config.vocab_size]``
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
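A tiny sketch of the ``labels = input_ids`` pattern described above (the model shifts the labels internally, so no
manual shifting is needed; the openai-gpt checkpoint is used for illustration):
>>> from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel
>>> tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
>>> model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
>>> inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"], return_dict=True)
>>> outputs.loss  # causal language modeling loss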
@@ -607,10 +607,11 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
@add_start_docstrings(
"""
OpenAI GPT Model transformer with a language modeling and a multiple-choice classification head on top e.g. for
RocStories/SWAG tasks. The two heads are two linear layers. The language modeling head has its weights tied to the
input embeddings, the classification head takes as input the input of a specified classification token index in the
input sequence.
""",
OPENAI_GPT_START_DOCSTRING,
)
@@ -648,18 +649,16 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
):
r"""
mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input):
Index of the classification token in each input sequence. Selected in the range ``[0, input_ids.size(-1) -
1]``.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
``labels = input_ids`` Indices are selected in ``[-1, 0, ..., config.vocab_size]`` All labels set to
``-100`` are ignored (masked), the loss is only computed for labels in ``[0, ..., config.vocab_size]``
mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`):
Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,
num_choices]`` where `num_choices` is the size of the second dimension of the input tensors. (see
`input_ids` above)
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Used to hide legacy arguments that have been deprecated.
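A usage sketch for ``mc_token_ids`` and ``mc_labels``, mirroring the common pattern of appending a [CLS] token and
pointing ``mc_token_ids`` at it (details are illustrative):
>>> import torch
>>> from transformers import OpenAIGPTTokenizer, OpenAIGPTDoubleHeadsModel
>>> tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
>>> model = OpenAIGPTDoubleHeadsModel.from_pretrained("openai-gpt")
>>> tokenizer.add_special_tokens({"cls_token": "[CLS]"})  # add a classification token
>>> model.resize_token_embeddings(len(tokenizer))
>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # both tokenize to the same length
>>> input_ids = torch.tensor([tokenizer.encode(c) for c in choices]).unsqueeze(0)  # (1, num_choices, seq_len)
>>> mc_token_ids = torch.tensor([input_ids.size(-1) - 1] * 2).unsqueeze(0)  # point at the [CLS] position
>>> outputs = model(input_ids, mc_token_ids=mc_token_ids, mc_labels=torch.tensor([0]), return_dict=True)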
@@ -735,15 +734,14 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
@add_start_docstrings(
"""
The Original OpenAI GPT Model transformer with a sequence classification head on top (linear layer).
:class:`~transformers.OpenAIGPTForSequenceClassification` uses the last token in order to do the classification, as
other causal models (e.g. GPT-2) do. Since it does classification on the last token, it needs to know the
position of the last token. If a :obj:`pad_token_id` is defined in the configuration, it finds the last token that
is not a padding token in each row. If no :obj:`pad_token_id` is defined, it simply takes the last value in each
row of the batch. Since it cannot guess the padding tokens when :obj:`inputs_embeds` are passed instead of
:obj:`input_ids`, it does the same (take the last value in each row of the batch).
""",
OPENAI_GPT_START_DOCSTRING,
)
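A self-contained sketch of the last-non-padding-token selection described above (this illustrates the idea, not
necessarily the exact implementation):
>>> import torch
>>> pad_token_id = 0
>>> input_ids = torch.tensor([[5, 7, 9, pad_token_id, pad_token_id]])  # one padded row
>>> logits = torch.randn(1, 5, 2)                                      # (batch_size, sequence_length, num_labels)
>>> sequence_lengths = torch.ne(input_ids, pad_token_id).sum(-1) - 1   # index of the last non-padding token
>>> pooled_logits = logits[torch.arange(logits.size(0)), sequence_lengths]  # logits at that position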
@@ -778,9 +776,8 @@ class OpenAIGPTForSequenceClassification(OpenAIGPTPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -20,8 +20,8 @@ class BaseModelOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -41,18 +41,17 @@ class BaseModelOutputWithPooling(ModelOutput):
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`):
Last layer hidden-state of the first token of the sequence (classification token) further processed by a
Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence
prediction (classification) objective during pretraining.
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
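A short sketch of how these output objects are typically consumed (BERT is just a convenient example of a model
returning a pooling output):
>>> from transformers import BertTokenizer, BertModel
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> model = BertModel.from_pretrained("bert-base-uncased")
>>> inputs = tokenizer("Hello world", return_tensors="pt")
>>> outputs = model(**inputs, output_hidden_states=True, output_attentions=True, return_dict=True)
>>> outputs.last_hidden_state.shape  # (batch_size, sequence_length, hidden_size)
>>> outputs.pooler_output.shape      # (batch_size, hidden_size)
>>> len(outputs.hidden_states)       # one per layer, plus the embedding output
>>> outputs.attentions[0].shape      # (batch_size, num_heads, sequence_length, sequence_length)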
@@ -73,11 +72,11 @@ class BaseModelOutputWithPast(ModelOutput):
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
If :obj:`past_key_values` is used, only the last hidden-state of the sequences of shape :obj:`(batch_size,
1, hidden_size)` is output.
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
:obj:`past_key_values` input) to speed up sequential decoding.
@@ -87,8 +86,8 @@ class BaseModelOutputWithPast(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -110,10 +109,11 @@ class Seq2SeqModelOutput(ModelOutput):
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the decoder of the model.
If :obj:`past_key_values` is used, only the last hidden-state of the sequences of shape :obj:`(batch_size,
1, hidden_size)` is output.
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
used (see :obj:`past_key_values` input) to speed up sequential decoding.
@@ -123,8 +123,8 @@ class Seq2SeqModelOutput(ModelOutput):
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
@@ -136,8 +136,8 @@ class Seq2SeqModelOutput(ModelOutput):
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
@@ -168,8 +168,8 @@ class CausalLMOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -192,8 +192,8 @@ class CausalLMOutputWithPast(ModelOutput):
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
:obj:`past_key_values` input) to speed up sequential decoding.
@@ -203,8 +203,8 @@ class CausalLMOutputWithPast(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -228,8 +228,8 @@ class SequenceClassifierOutputWithPast(ModelOutput):
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
``past_key_values`` input) to speed up sequential decoding.
@@ -239,8 +239,8 @@ class SequenceClassifierOutputWithPast(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -269,8 +269,8 @@ class MaskedLMOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -293,8 +293,8 @@ class Seq2SeqLMOutput(ModelOutput):
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
used (see :obj:`past_key_values` input) to speed up sequential decoding.
@@ -304,8 +304,8 @@ class Seq2SeqLMOutput(ModelOutput):
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
@@ -317,8 +317,8 @@ class Seq2SeqLMOutput(ModelOutput):
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
@@ -343,15 +343,16 @@ class NextSentencePredictorOutput(ModelOutput):
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):
Next sequence prediction (classification) loss.
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -379,8 +380,8 @@ class SequenceClassifierOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -403,8 +404,8 @@ class Seq2SeqSequenceClassifierOutput(ModelOutput):
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
:obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`).
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
used (see :obj:`past_key_values` input) to speed up sequential decoding.
......@@ -414,8 +415,8 @@ class Seq2SeqSequenceClassifierOutput(ModelOutput):
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
......@@ -427,8 +428,8 @@ class Seq2SeqSequenceClassifierOutput(ModelOutput):
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
......@@ -462,8 +463,8 @@ class MultipleChoiceModelOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -491,8 +492,8 @@ class TokenClassifierOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -522,8 +523,8 @@ class QuestionAnsweringModelOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
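            A minimal sketch of how these optional fields are usually requested and read back (the model and checkpoint
            are only examples)::

                >>> from transformers import BertTokenizer, BertForQuestionAnswering

                >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
                >>> model = BertForQuestionAnswering.from_pretrained("bert-base-uncased", return_dict=True)
                >>> inputs = tokenizer("Who wrote Hamlet?", "Hamlet was written by Shakespeare.", return_tensors="pt")
                >>> outputs = model(**inputs, output_attentions=True, output_hidden_states=True)
                >>> # one tensor per layer, each of shape (batch_size, num_heads, sequence_length, sequence_length)
                >>> print(len(outputs.attentions), outputs.attentions[0].shape)
                >>> # one tensor per layer (+ embeddings), each of shape (batch_size, sequence_length, hidden_size)
                >>> print(len(outputs.hidden_states), outputs.hidden_states[0].shape)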
......@@ -549,8 +550,8 @@ class Seq2SeqQuestionAnsweringModelOutput(ModelOutput):
end_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
Span-end scores (before SoftMax).
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
:obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`).
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
used (see :obj:`past_key_values` input) to speed up sequential decoding.
......@@ -560,8 +561,8 @@ class Seq2SeqQuestionAnsweringModelOutput(ModelOutput):
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
......@@ -573,8 +574,8 @@ class Seq2SeqQuestionAnsweringModelOutput(ModelOutput):
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
......
......@@ -23,11 +23,11 @@ from .modeling_bart import BART_START_DOCSTRING, BartForConditionalGeneration
@add_start_docstrings("The Pegasus Model for summarization ", BART_START_DOCSTRING)
class PegasusForConditionalGeneration(BartForConditionalGeneration):
r"""
Pytorch version of google's pegasus model for summarization.
Available models are listed `here <https://huggingface.co/models?search=pegasus>`__.
    PyTorch version of Google's Pegasus model for summarization. Available models are listed `here
<https://huggingface.co/models?search=pegasus>`__.
This class overrides :class:`~transformers.BartForConditionalGeneration`. Please check the
superclass for the appropriate documentation alongside usage examples.
This class overrides :class:`~transformers.BartForConditionalGeneration`. Please check the superclass for the
appropriate documentation alongside usage examples.
Examples::
......
......@@ -47,17 +47,19 @@ PROPHETNET_START_DOCSTRING = r"""
    methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
Original ProphetNet code can be found at <https://github.com/microsoft/ProphetNet> .
Checkpoints were converted from original Fairseq checkpoints. For more information on
the checkpoint conversion, please take a look at the file ``convert_prophetnet_original_pytorch_checkpoint_to_pytorch.py``.
    Original ProphetNet code can be found at <https://github.com/microsoft/ProphetNet>. Checkpoints were converted
from original Fairseq checkpoints. For more information on the checkpoint conversion, please take a look at the
file ``convert_prophetnet_original_pytorch_checkpoint_to_pytorch.py``.
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use
it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
behavior.
Parameters:
config (:class:`~transformers.ProphetNetConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
PROPHETNET_INPUTS_DOCSTRING = r"""
......@@ -66,14 +68,13 @@ PROPHETNET_INPUTS_DOCSTRING = r"""
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using :class:`~transformers.ProphetNetTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
Indices can be obtained using :class:`~transformers.ProphetNetTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
......@@ -90,16 +91,16 @@ PROPHETNET_INPUTS_DOCSTRING = r"""
modify to your needs. See diagram 1 in `the paper <https://arxiv.org/abs/1910.13461>`__ for more
information on the default strategy.
encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`: :obj:`attentions`)
:obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`) is a
sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of
the decoder.
            Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
            :obj:`attentions`). :obj:`last_hidden_state` (of shape :obj:`(batch_size, sequence_length, hidden_size)`,
            `optional`) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the
            cross-attention of the decoder.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up decoding.
If :obj:`past_key_values` are used, the user can optionally input only the last
``decoder_input_ids`` (those that don't have their past key value states given to this model) of shape
:obj:`(batch_size, 1)` instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
If :obj:`past_key_values` are used, the user can optionally input only the last ``decoder_input_ids``
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
decoding (see :obj:`past_key_values`).
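            A minimal, hedged sketch of how the cache is typically used for step-by-step decoding (the checkpoint name
            is only an example and the decoder start token is assumed to be given by the checkpoint's
            ``decoder_start_token_id``)::

                >>> import torch
                >>> from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration

                >>> tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased")
                >>> model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/prophetnet-large-uncased", return_dict=True)
                >>> inputs = tokenizer("ProphetNet predicts future n-grams.", return_tensors="pt")
                >>> decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])  # assumed to be set

                >>> # first step: no cache yet, ask the model to return one
                >>> out = model(input_ids=inputs["input_ids"], decoder_input_ids=decoder_input_ids, use_cache=True)
                >>> next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

                >>> # later steps: feed only the last decoder token together with the returned cache
                >>> out = model(input_ids=inputs["input_ids"], decoder_input_ids=next_token,
                ...             past_key_values=out.past_key_values, use_cache=True)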
......@@ -119,14 +120,13 @@ PROPHETNET_STANDALONE_INPUTS_DOCSTRING = r"""
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using :class:`~transformers.ProphetNetTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
Indices can be obtained using :class:`~transformers.ProphetNetTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
......@@ -222,21 +222,19 @@ class ProphetNetSeq2SeqLMOutput(ModelOutput):
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
            Language modeling loss.
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, decoder_sequence_length, config.vocab_size)`):
Prediction scores of the main stream language modeling head
(scores for each vocabulary token before SoftMax).
Prediction scores of the main stream language modeling head (scores for each vocabulary token before
SoftMax).
logits_ngram (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, ngram * decoder_sequence_length, config.vocab_size)`):
Prediction scores of the predict stream language modeling head (scores for
each vocabulary token before SoftMax).
Prediction scores of the predict stream language modeling head (scores for each vocabulary token before
SoftMax).
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
:obj:`(2, batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)`).
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
            batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks)
of the decoder that can be
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
used (see :obj:`past_key_values` input) to speed up sequential decoding.
decoder_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings
+ one for the output of each layer)
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, decoder_sequence_length, hidden_size)`.
Hidden-states of main stream of the decoder at the output of each layer plus the initial embedding outputs.
......@@ -244,40 +242,37 @@ class ProphetNetSeq2SeqLMOutput(ModelOutput):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, ngram * decoder_sequence_length, hidden_size)`.
Hidden-states of the predict stream of the decoder at the
output of each layer plus the initial embedding outputs.
Hidden-states of the predict stream of the decoder at the output of each layer plus the initial embedding
outputs.
decoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, decoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
decoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
decoder_ngram_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, decoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
decoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the predict stream of the decoder, after the attention softmax,
used to compute the weighted average in the self-attention heads.
Attentions weights of the predict stream of the decoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
decoder_cross_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, encoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
encoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the cross-attention layer of the decoder,
after the attention softmax, used to compute the weighted average in the
            Attentions weights of the cross-attention layer of the decoder, after the attention softmax, used to
            compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, encoder_sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the
encoder of the model.
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, encoder_sequence_length, hidden_size)`.
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, encoder_sequence_length, encoder_sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used
to compute the weighted average in the
self-attention heads.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
encoder_sequence_length, encoder_sequence_length)`. Attentions weights of the encoder, after the attention
softmax, used to compute the weighted average in the self-attention heads.
"""
loss: Optional[torch.FloatTensor] = None
......@@ -304,16 +299,15 @@ class ProphetNetSeq2SeqModelOutput(ModelOutput):
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, decoder_sequence_length, hidden_size)`):
Sequence of main stream hidden-states at the output of the last layer of the decoder of the model.
If :obj:`past_key_values` is used only the last hidden-state of
the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.
            If :obj:`past_key_values` is used, only the last hidden-state of the sequences of shape :obj:`(batch_size,
1, hidden_size)` is output.
last_hidden_state_ngram (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,ngram * decoder_sequence_length, config.vocab_size)`):
Sequence of predict stream hidden-states at the output of the last layer of the decoder of the model.
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
:obj:`(2, batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)`).
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
            batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks)
of the decoder that can be
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
used (see :obj:`past_key_values` input) to speed up sequential decoding.
decoder_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
......@@ -324,27 +318,26 @@ class ProphetNetSeq2SeqModelOutput(ModelOutput):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, ngram * decoder_sequence_length, hidden_size)`.
Hidden-states of the predict stream of the decoder at the
output of each layer plus the initial embedding outputs.
Hidden-states of the predict stream of the decoder at the output of each layer plus the initial embedding
outputs.
decoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, decoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
decoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
decoder_ngram_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, decoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
decoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the predict stream of the decoder,
after the attention softmax, used to compute the weighted average in the
            Attentions weights of the predict stream of the decoder, after the attention softmax, used to compute the
            weighted average in the self-attention heads.
decoder_cross_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, encoder_sequence_length,
decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
encoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the cross-attention layer of the decoder,
after the attention softmax, used to compute the weighted average in the
            Attentions weights of the cross-attention layer of the decoder, after the attention softmax, used to
            compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, encoder_sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
......@@ -353,8 +346,8 @@ class ProphetNetSeq2SeqModelOutput(ModelOutput):
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, encoder_sequence_length, encoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
encoder_sequence_length, encoder_sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
......@@ -382,13 +375,13 @@ class ProphetNetDecoderModelOutput(ModelOutput):
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, decoder_sequence_length, hidden_size)`):
Sequence of main stream hidden-states at the output of the last layer of the decoder of the model.
If :obj:`past_key_values` is used only the last hidden-state
of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.
            If :obj:`past_key_values` is used, only the last hidden-state of the sequences of shape :obj:`(batch_size,
1, hidden_size)` is output.
last_hidden_state_ngram (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, ngram * decoder_sequence_length, config.vocab_size)`):
Sequence of predict stream hidden-states at the output of the last layer of the decoder of the model.
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
:obj:`(2, batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)`).
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
            batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
used (see :obj:`past_key_values` input) to speed up sequential decoding.
......@@ -396,32 +389,31 @@ class ProphetNetDecoderModelOutput(ModelOutput):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, decoder_sequence_length, hidden_size)`.
Hidden-states of main stream of the decoder at the output
of each layer plus the initial embedding outputs.
Hidden-states of main stream of the decoder at the output of each layer plus the initial embedding outputs.
ngram_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, ngram * decoder_sequence_length, hidden_size)`.
Hidden-states of the predict stream of the decoder at the
output of each layer plus the initial embedding outputs.
Hidden-states of the predict stream of the decoder at the output of each layer plus the initial embedding
outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, decoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
decoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the decoder, after the attention softmax,
used to compute the weighted average in the
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
ngram_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, decoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
decoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the predict stream of the decoder,
after the attention softmax, used to compute the weighted average in the
            Attentions weights of the predict stream of the decoder, after the attention softmax, used to compute the
            weighted average in the self-attention heads.
cross_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, encoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
encoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the cross-attention layer of the decoder, after the attention softmax, used to compute the weighted average in the
            Attentions weights of the cross-attention layer of the decoder, after the attention softmax, used to
            compute the weighted average in the cross-attention heads.
"""
last_hidden_state: torch.FloatTensor
......@@ -443,12 +435,14 @@ class ProphetNetDecoderLMOutput(ModelOutput):
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
            Language modeling loss.
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, decoder_sequence_length, config.vocab_size)`):
Prediction scores of the main stream language modeling head (scores for each vocabulary token before SoftMax).
Prediction scores of the main stream language modeling head (scores for each vocabulary token before
SoftMax).
logits_ngram (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, ngram * decoder_sequence_length, config.vocab_size)`):
Prediction scores of the predict stream language modeling head (scores for each vocabulary token before SoftMax).
Prediction scores of the predict stream language modeling head (scores for each vocabulary token before
SoftMax).
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
:obj:`(2, batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)`).
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
            batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
used (see :obj:`past_key_values` input) to speed up sequential decoding.
......@@ -461,23 +455,26 @@ class ProphetNetDecoderLMOutput(ModelOutput):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, ngram * decoder_sequence_length, hidden_size)`.
Hidden-states of the predict stream of the decoder at the output of each layer plus the initial embedding outputs.
Hidden-states of the predict stream of the decoder at the output of each layer plus the initial embedding
outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, decoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
decoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
ngram_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, decoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
decoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the predict stream of the decoder, after the attention softmax, used to compute the weighted average in the
            Attentions weights of the predict stream of the decoder, after the attention softmax, used to compute the
            weighted average in the self-attention heads.
cross_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_attn_heads, encoder_sequence_length, decoder_sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_attn_heads,
encoder_sequence_length, decoder_sequence_length)`.
Attentions weights of the cross-attention layer of the decoder, after the attention softmax, used to compute the weighted average in the
            Attentions weights of the cross-attention layer of the decoder, after the attention softmax, used to
            compute the weighted average in the cross-attention heads.
"""
loss: Optional[torch.FloatTensor] = None
......@@ -540,10 +537,9 @@ class ProphetNetPreTrainedModel(PreTrainedModel):
class ProhpetNetPositionalEmbeddings(nn.Embedding):
"""
This module learns positional embeddings up to a fixed maximum size.
Padding ids are ignored by either offsetting based on padding_idx
or by setting padding_idx to None and ensuring that the appropriate
position ids are passed to the forward function.
This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting
based on padding_idx or by setting padding_idx to None and ensuring that the appropriate position ids are passed to
the forward function.
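    As a rough, hedged sketch of the offset idea only (a ``padding_idx`` of 1 is an assumption for illustration, not
    necessarily what this module uses), position ids can be derived from the attention mask so that every padded
    position maps to ``padding_idx``::

        >>> import torch

        >>> attention_mask = torch.tensor([[1, 1, 1, 0, 0]])
        >>> padding_idx = 1
        >>> position_ids = torch.cumsum(attention_mask, dim=1) * attention_mask + padding_idx
        >>> # tensor([[2, 3, 4, 1, 1]]): real tokens get increasing positions, padding collapses to padding_idx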
"""
def __init__(self, config: ProphetNetConfig):
......@@ -701,8 +697,7 @@ class ProphetNetSelfAttention(nn.Module):
class ProhpetNetFeedForward(nn.Module):
"""
This is the residual two feed-forward layer block based on the original
Transformer implementation.
This is the residual two feed-forward layer block based on the original Transformer implementation.
"""
def __init__(self, config: ProphetNetConfig, ffn_dim: int):
......@@ -1118,8 +1113,8 @@ class ProphetNetDecoderLayer(nn.Module):
class ProphetNetEncoder(ProphetNetPreTrainedModel):
r"""
    word_embeddings (:obj:`torch.nn.Embedding` of shape :obj:`(config.vocab_size, config.hidden_size)`, `optional`):
The word embedding parameters. This can be used to initialize :class:`~transformers.ProphetNetEncoder` with pre-defined
word embeddings instead of randomely initialized word embeddings.
The word embedding parameters. This can be used to initialize :class:`~transformers.ProphetNetEncoder` with
        pre-defined word embeddings instead of randomly initialized word embeddings.
"""
def __init__(self, config: ProphetNetConfig, word_embeddings: nn.Embedding = None):
......@@ -1227,8 +1222,8 @@ class ProphetNetEncoder(ProphetNetPreTrainedModel):
class ProphetNetDecoder(ProphetNetPreTrainedModel):
r"""
    word_embeddings (:obj:`torch.nn.Embedding` of shape :obj:`(config.vocab_size, config.hidden_size)`, `optional`):
The word embedding parameters. This can be used to initialize :class:`~transformers.ProphetNetEncoder` with pre-defined
word embeddings instead of randomely initialized word embeddings.
        The word embedding parameters. This can be used to initialize :class:`~transformers.ProphetNetDecoder` with
        pre-defined word embeddings instead of randomly initialized word embeddings.
"""
def __init__(self, config: ProphetNetConfig, word_embeddings: nn.Embedding = None):
......@@ -1276,18 +1271,17 @@ class ProphetNetDecoder(ProphetNetPreTrainedModel):
):
r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
if the model is configured as a decoder.
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up decoding.
If :obj:`past_key_values` are used, the user can optionally input only the last
``decoder_input_ids`` (those that don't have their past key value states given to this model) of shape
:obj:`(batch_size, 1)` instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
If :obj:`past_key_values` are used, the user can optionally input only the last ``decoder_input_ids``
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
decoding (see :obj:`past_key_values`).
......@@ -1700,10 +1694,9 @@ class ProphetNetForConditionalGeneration(ProphetNetPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.
All labels set to ``-100`` are ignored (masked), the loss is only
computed for labels in ``[0, ..., config.vocab_size]``
            Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[-100, 0, ...,
            config.vocab_size - 1]`. All labels set to ``-100`` are ignored (masked); the loss is only computed for
            labels in ``[0, ..., config.vocab_size - 1]``.
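            A minimal, hedged sketch of how :obj:`labels` are usually prepared (the checkpoint name is only an example;
            padded label positions are typically set to ``-100`` so that they are ignored by the loss)::

                >>> from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration

                >>> tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased")
                >>> model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/prophetnet-large-uncased", return_dict=True)
                >>> src = tokenizer(["ProphetNet is a sequence-to-sequence model with n-gram prediction."], return_tensors="pt")
                >>> tgt = tokenizer(["ProphetNet predicts n-grams."], return_tensors="pt")
                >>> labels = tgt["input_ids"].masked_fill(tgt["attention_mask"] == 0, -100)  # mask out padding, if any
                >>> outputs = model(input_ids=src["input_ids"], attention_mask=src["attention_mask"], labels=labels)
                >>> loss = outputs.loss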
Returns:
......@@ -1890,18 +1883,17 @@ class ProphetNetForCausalLM(ProphetNetPreTrainedModel):
):
r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
if the model is configured as a decoder.
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up decoding.
If :obj:`past_key_values` are used, the user can optionally input only the last
``decoder_input_ids`` (those that don't have their past key value states given to this model) of shape
:obj:`(batch_size, 1)` instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
If :obj:`past_key_values` are used, the user can optionally input only the last ``decoder_input_ids``
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
decoding (see :obj:`past_key_values`).
......@@ -1910,10 +1902,9 @@ class ProphetNetForCausalLM(ProphetNetPreTrainedModel):
- 0 for tokens that are **masked**.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the left-to-right language modeling loss (next word prediction).
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
n ``[0, ..., config.vocab_size]``
            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
            ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are
            ignored (masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``.
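            A minimal, hedged sketch of computing this loss (the checkpoint name is only an example; the labels can
            simply be the input ids, and every token then contributes to the next-word prediction loss)::

                >>> from transformers import ProphetNetTokenizer, ProphetNetForCausalLM

                >>> tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased")
                >>> model = ProphetNetForCausalLM.from_pretrained("microsoft/prophetnet-large-uncased", return_dict=True)
                >>> inputs = tokenizer("ProphetNet predicts future n-grams.", return_tensors="pt")
                >>> outputs = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"])
                >>> loss, logits = outputs.loss, outputs.logits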
Returns:
......
......@@ -42,25 +42,24 @@ class RetrievAugLMMarginOutput(ModelOutput):
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
            Language modeling loss.
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head.
The score is possibly marginalized over all documents for each vocabulary token.
Prediction scores of the language modeling head. The score is possibly marginalized over all documents for
each vocabulary token.
doc_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs)`):
Score between each retrieved document embeddings (see :obj:`retrieved_doc_embeds`) and
:obj:`question_encoder_last_hidden_state`.
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
:obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`).
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
            batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used
(see :obj:`past_key_values` input) to speed up sequential decoding.
retrieved_doc_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs, hidden_size)`, `optional`, returned when `output_retrieved=True`):
Embedded documents retrieved by the retriever.
Is used with ``question_encoder_last_hidden_state`` to compute the ``doc_scores``.
            Embedded documents retrieved by the retriever. Used together with ``question_encoder_last_hidden_state`` to
compute the ``doc_scores``.
retrieved_doc_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, config.n_docs)`, `optional`, returned when `output_retrieved=True`):
The indexes of the embedded documents retrieved by the retriever.
context_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
Input ids post-processed from the retrieved documents
and the question encoder input_ids by the retriever.
Input ids post-processed from the retrieved documents and the question encoder input_ids by the retriever.
context_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
Attention mask post-processed from the retrieved documents and the question encoder :obj:`input_ids` by the
retriever.
......@@ -73,8 +72,8 @@ class RetrievAugLMMarginOutput(ModelOutput):
Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.
question_enc_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the question encoder, after the attention softmax, used to compute the weighted
average in the self-attention heads.
......@@ -86,8 +85,8 @@ class RetrievAugLMMarginOutput(ModelOutput):
Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.
generator_enc_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the generator encoder, after the attention softmax, used to compute the weighted
average in the self-attention heads.
......@@ -97,8 +96,8 @@ class RetrievAugLMMarginOutput(ModelOutput):
Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.
generator_dec_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the generator decoder, after the attention softmax, used to compute the weighted
average in the self-attention heads.
......@@ -127,25 +126,24 @@ class RetrievAugLMOutput(ModelOutput):
"""
Args:
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head.
The score is possibly marginalized over all documents for each vocabulary token.
Prediction scores of the language modeling head. The score is possibly marginalized over all documents for
each vocabulary token.
doc_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs)`):
Score between each retrieved document embeddings (see :obj:`retrieved_doc_embeds`) and
:obj:`question_encoder_last_hidden_state`.
past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
:obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`).
List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape :obj:`(2,
            batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used
(see :obj:`past_key_values` input) to speed up sequential decoding.
retrieved_doc_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs, hidden_size)`, `optional`, returned when `output_retrieved=True`):
Embedded documents retrieved by the retriever.
Is used with ``question_encoder_last_hidden_state`` to compute the ``doc_scores``.
            Embedded documents retrieved by the retriever. Used together with ``question_encoder_last_hidden_state`` to
compute the ``doc_scores``.
retrieved_doc_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, config.n_docs)`, `optional`, returned when `output_retrieved=True`):
The indexes of the embedded documents retrieved by the retriever.
context_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
Input ids post-processed from the retrieved documents
and the question encoder input_ids by the retriever.
Input ids post-processed from the retrieved documents and the question encoder input_ids by the retriever.
context_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
Attention mask post-processed from the retrieved documents and the question encoder :obj:`input_ids` by the
retriever.
......@@ -158,8 +156,8 @@ class RetrievAugLMOutput(ModelOutput):
Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.
question_enc_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the question encoder, after the attention softmax, used to compute the weighted
average in the self-attention heads.
......@@ -171,8 +169,8 @@ class RetrievAugLMOutput(ModelOutput):
Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.
generator_enc_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the generator encoder, after the attention softmax, used to compute the weighted
average in the self-attention heads.
......@@ -182,8 +180,8 @@ class RetrievAugLMOutput(ModelOutput):
Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.
generator_dec_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights of the generator decoder, after the attention softmax, used to compute the weighted
average in the self-attention heads.
......@@ -232,8 +230,8 @@ class RagPreTrainedModel(PreTrainedModel):
        Instantiates a question encoder and a generator from one or two base classes of the library from pretrained
model checkpoints.
The model is set in evaluation mode by default using :obj:`model.eval()` (Dropout modules are deactivated).
To train the model, you need to first set it back in training mode with :obj:`model.train()`.
The model is set in evaluation mode by default using :obj:`model.eval()` (Dropout modules are deactivated). To
train the model, you need to first set it back in training mode with :obj:`model.train()`.
Params:
question_encoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):
......@@ -269,8 +267,8 @@ class RagPreTrainedModel(PreTrainedModel):
retriever (:class:`~transformers.RagRetriever`, `optional`):
The retriever to use.
            kwargs (remaining dictionary of keyword arguments, `optional`):
Can be used to update the configuration object (after it being loaded) and initiate the model
(e.g., ``output_attentions=True``).
                Can be used to update the configuration object (after it has been loaded) and initialize the model (e.g.,
``output_attentions=True``).
- To update the question_encoder configuration, use the prefix `question_encoder_` for each
configuration parameter.
......@@ -358,34 +356,33 @@ class RagPreTrainedModel(PreTrainedModel):
RAG_START_DOCSTRING = r"""
RAG is a seq2seq model which encapsulates two core components: a question encoder and a generator.
During a forward pass, we encode the input with the question encoder and pass it
to the retriever to extract relevant context documents. The documents are then prepended to the input.
Such contextualized inputs is passed to the generator.
RAG is a seq2seq model which encapsulates two core components: a question encoder and a generator. During a forward
pass, we encode the input with the question encoder and pass it to the retriever to extract relevant context
    documents. The documents are then prepended to the input. Such contextualized inputs are passed to the generator.
The question encoder can be any `autoencoding` model, preferably :class:`~transformers.DPRQuestionEncoder`, and the
generator can be any `seq2seq` model, preferably :class:`~transformers.BartForConditionalGeneration`.
The model can be initialized with a :class:`~transformers.RagRetriever` for end-to-end generation or used in
combination with the outputs of a retriever in multiple steps---see examples for more details.
The model is compatible any `autoencoding` model as the ``question_encoder`` and any `seq2seq` model with language
model head as the ``generator``. It has been tested with :class:`~transformers.DPRQuestionEncoder` as the
``question_encoder`` and :class:`~transformers.BartForConditionalGeneration` or
:class:`~transformers.T5ForConditionalGeneration` as the ``generator``.
    combination with the outputs of a retriever in multiple steps---see examples for more details. The model is
    compatible with any `autoencoding` model as the ``question_encoder`` and any `seq2seq` model with a language model head as
the ``generator``. It has been tested with :class:`~transformers.DPRQuestionEncoder` as the ``question_encoder``
and :class:`~transformers.BartForConditionalGeneration` or :class:`~transformers.T5ForConditionalGeneration` as the
``generator``.
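    A minimal end-to-end sketch (checkpoint names and retriever settings are only illustrative)::

        >>> from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

        >>> tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
        >>> retriever = RagRetriever.from_pretrained("facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True)
        >>> model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)
        >>> input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
        >>> generated = model.generate(input_ids=input_dict["input_ids"])
        >>> print(tokenizer.batch_decode(generated, skip_special_tokens=True))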
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
    methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior.
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
Args:
config (:class:`~transformers.RagConfig`):
Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
:meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
question_encoder (:class:`transformers.PreTrainedModel`):
An encoder model compatible with the faiss index encapsulated by the ``retriever``.
generator (:class:`transformers.PreTrainedModel`):
......@@ -398,12 +395,11 @@ RAG_START_DOCSTRING = r"""
RAG_FORWARD_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
:class:`~transformers.RagConfig`, used to initialize the model, specifies which generator to use; it also
specifies a compatible generator tokenizer. Use that tokenizer class to obtain the indices.
Indices of input sequence tokens in the vocabulary. :class:`~transformers.RagConfig`, used to initialize
the model, specifies which generator to use; it also specifies a compatible generator tokenizer. Use that
tokenizer class to obtain the indices.
attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
......@@ -424,15 +420,15 @@ RAG_FORWARD_INPUTS_DOCSTRING = r"""
also be used by default.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))`):
Tuple consists of two elements: :obj:`encoder_outputs` of the RAG model (see :obj:`encoder_outputs`) and
:obj:`past_key_values` of the underlying generator.
Can be used to speed up decoding. :obj:`past_key_values` are used in the
(:class:`~transformers.RagTokenForGeneration`) model during decoding.
:obj:`past_key_values` of the underlying generator. Can be used to speed up decoding.
:obj:`past_key_values` are used in the (:class:`~transformers.RagTokenForGeneration`) model during
decoding.
doc_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs)`):
Score between each retrieved document embedding (see :obj:`retrieved_doc_embeds`) and
:obj:`question_encoder_last_hidden_state`.
If the model is not initialized with a ``retriever``, :obj:`doc_scores` has to be provided to the
forward pass. :obj:`doc_scores` can be computed via :obj:`question_encoder_last_hidden_state` and
:obj:`retrieved_doc_embeds`, see examples for more information.
:obj:`question_encoder_last_hidden_state`. If the model is not initialized with a ``retriever``,
:obj:`doc_scores` has to be provided to the forward pass. :obj:`doc_scores` can be computed via
:obj:`question_encoder_last_hidden_state` and :obj:`retrieved_doc_embeds`, see examples for more
information (a short sketch of this computation is also shown below).
context_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
Input IDs post-processed from the retrieved documents and the question encoder :obj:`input_ids` by the
retriever.
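The relationship between ``doc_scores``, ``question_encoder_last_hidden_state`` and ``retrieved_doc_embeds`` can be
sketched as follows (tensor names and the dummy shapes are assumptions for illustration)::

    import torch

    batch_size, n_docs, hidden_size = 2, 5, 768
    question_hidden_states = torch.randn(batch_size, hidden_size)          # from the question encoder
    retrieved_doc_embeds = torch.randn(batch_size, n_docs, hidden_size)    # from the retriever

    # one score per (question, document) pair
    doc_scores = torch.bmm(
        question_hidden_states.unsqueeze(1), retrieved_doc_embeds.transpose(1, 2)
    ).squeeze(1)  # -> (batch_size, n_docs)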
......@@ -663,7 +659,8 @@ class RagModel(RagPreTrainedModel):
@add_start_docstrings_to_callable(
"""A RAG-sequence model impementation. It performs RAG-sequence specific marginalization in the forward pass.
"""
A RAG-sequence model implementation. It performs RAG-sequence specific marginalization in the forward pass.
""",
RAG_START_DOCSTRING,
)
......@@ -715,12 +712,11 @@ class RagSequenceForGeneration(RagPreTrainedModel):
):
r"""
exclude_bos_score (:obj:`bool`, `optional`):
Only relevant if ``labels`` is passed.
If :obj:`True`, the score of the BOS token is disregarded when computing
the loss.
Only relevant if ``labels`` is passed. If :obj:`True`, the score of the BOS token is disregarded when
computing the loss.
reduce_loss (:obj:`bool`, `optional`):
Only relevant if ``labels`` is passed.
If :obj:`True`, the NLL loss is reduced using the ``torch.Tensor.sum`` operation.
Only relevant if ``labels`` is passed. If :obj:`True`, the NLL loss is reduced using the
``torch.Tensor.sum`` operation.
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Legacy dictionary, which is required so that the model can use the `generate()` function.
......@@ -832,17 +828,15 @@ class RagSequenceForGeneration(RagPreTrainedModel):
**kwargs
):
"""
Implements RAG sequence "thorough" decoding.
Read the :meth:`~transformers.PreTrainedModel.generate` documentation for more information on how to set other
generate input parameters.
Implements RAG sequence "thorough" decoding. Read the :meth:`~transformers.PreTrainedModel.generate`
documentation for more information on how to set other generate input parameters.
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
The sequence used as a prompt for the generation. If :obj:`input_ids` is not passed, then
:obj:`context_input_ids` has to be provided.
attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
......@@ -852,8 +846,8 @@ class RagSequenceForGeneration(RagPreTrainedModel):
Input IDs post-processed from the retrieved documents and the question encoder input_ids by the
retriever.
do_deduplication (:obj:`bool`, `optional`):
Whether or not to deduplicate the generations from different context documents for a given input.
Has to be set to :obj:`False` if used while training with a distributed backend.
Whether or not to deduplicate the generations from different context documents for a given input. Has
to be set to :obj:`False` if used while training with a distributed backend.
num_return_sequences(:obj:`int`, `optional`, defaults to 1):
The number of independently computed returned sequences for each element in the batch. Note that this
is not the value we pass to the ``generator``'s :func:`~transformers.PreTrainedModel.generate`
......@@ -866,9 +860,9 @@ class RagSequenceForGeneration(RagPreTrainedModel):
Additional kwargs will be passed to :meth:`~transformers.PreTrainedModel.generate`.
Return:
:obj:`torch.LongTensor` of shape :obj:`(batch_size * num_return_sequences, sequence_length)`:
The generated sequences. The second dimension (sequence length) is either equal to :obj:`max_length` or
shorter if all batches finished early due to the :obj:`eos_token_id`.
:obj:`torch.LongTensor` of shape :obj:`(batch_size * num_return_sequences, sequence_length)`: The generated
sequences. The second dimension (sequence length) is either equal to :obj:`max_length` or shorter if all
batches finished early due to the :obj:`eos_token_id`.
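For orientation, a minimal usage sketch (the checkpoint name, flags and the question below are illustrative
assumptions, not part of the original documentation)::

    >>> from transformers import RagRetriever, RagTokenizer, RagSequenceForGeneration
    >>> tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
    >>> retriever = RagRetriever.from_pretrained("facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True)
    >>> model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)
    >>> input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
    >>> generated = model.generate(input_ids=input_dict["input_ids"], num_return_sequences=2)
    >>> print(tokenizer.batch_decode(generated, skip_special_tokens=True))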
"""
n_docs = n_docs if n_docs is not None else self.config.n_docs
......@@ -991,7 +985,8 @@ class RagSequenceForGeneration(RagPreTrainedModel):
@add_start_docstrings_to_callable(
"""A RAG-token model impementation. It performs RAG-token specific marginalization in the forward pass.
"""
A RAG-token model implementation. It performs RAG-token specific marginalization in the forward pass.
""",
RAG_START_DOCSTRING,
)
......@@ -1110,11 +1105,11 @@ class RagTokenForGeneration(RagPreTrainedModel):
):
r"""
do_marginalize (:obj:`bool`, `optional`):
If :obj:`True`, the logits are marginalized over all documents
by making use of ``torch.nn.functional.log_softmax``.
If :obj:`True`, the logits are marginalized over all documents by making use of
``torch.nn.functional.log_softmax``.
reduce_loss (:obj:`bool`, `optional`):
Only relevant if ``labels`` is passed.
If :obj:`True`, the NLL loss is reduced using the ``torch.Tensor.sum`` operation.
Only relevant if ``labels`` is passed. If :obj:`True`, the NLL loss is reduced using the
``torch.Tensor.sum`` operation.
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Legacy dictionary, which is required so that the model can use the `generate()` function.
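The marginalization enabled by ``do_marginalize`` above can be sketched roughly as follows (variable names and
shapes are illustrative assumptions; this is a simplification of the forward-pass logic, not the exact code)::

    import torch

    batch_size, n_docs, tgt_len, vocab_size = 2, 5, 7, 50264
    seq_logits = torch.randn(batch_size * n_docs, tgt_len, vocab_size)   # generator logits
    doc_scores = torch.randn(batch_size, n_docs)                         # retrieval scores

    seq_logprobs = torch.log_softmax(seq_logits, dim=-1).view(batch_size, n_docs, tgt_len, vocab_size)
    doc_logprobs = torch.log_softmax(doc_scores, dim=1).unsqueeze(-1).unsqueeze(-1)
    # add log-probabilities and marginalize (logsumexp) over the document axis
    marginalized = torch.logsumexp(seq_logprobs + doc_logprobs, dim=1)   # -> (batch_size, tgt_len, vocab_size)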
......@@ -1241,8 +1236,7 @@ class RagTokenForGeneration(RagPreTrainedModel):
The sequence used as a prompt for the generation. If :obj:`input_ids` is not passed, then
:obj:`context_input_ids` has to be provided.
attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
......@@ -1274,7 +1268,8 @@ class RagTokenForGeneration(RagPreTrainedModel):
min_length (:obj:`int`, `optional`, defaults to 10):
The minimum length of the sequence to be generated.
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to stop the beam search when at least ``num_beams`` sentences are finished per batch.
Whether or not to stop the beam search when at least ``num_beams`` sentences are finished per
batch.
use_cache: (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should use the past last key/values attentions (if applicable to the model) to
speed up decoding.
......@@ -1306,9 +1301,9 @@ class RagTokenForGeneration(RagPreTrainedModel):
Number of documents to retrieve and/or number of documents for which to generate an answer.
Return:
:obj:`torch.LongTensor` of shape :obj:`(batch_size * num_return_sequences, sequence_length)`:
The generated sequences. The second dimension (sequence_length) is either equal to :obj:`max_length` or
shorter if all batches finished early due to the :obj:`eos_token_id`.
:obj:`torch.LongTensor` of shape :obj:`(batch_size * num_return_sequences, sequence_length)`: The generated
sequences. The second dimension (sequence_length) is either equal to :obj:`max_length` or shorter if all
batches finished early due to the :obj:`eos_token_id`.
"""
# set default parameters
n_docs = n_docs if n_docs is not None else self.config.n_docs
......
......@@ -113,8 +113,8 @@ def _get_min_chunk_len(config):
class AxialPositionEmbeddings(nn.Module):
"""Constructs axial position embeddings. Useful for very long input
sequences to save memory and time.
"""
Constructs axial position embeddings. Useful for very long input sequences to save memory and time.
"""
def __init__(self, config):
......@@ -272,7 +272,8 @@ class EfficientAttentionMixin:
"""
def _look_adjacent(self, vectors, num_chunks_before, num_chunks_after):
"""Used to implement attention between consecutive chunks.
"""
Used to implement attention between consecutive chunks.
Args:
vectors: array of shape [batch_size, num_attention_heads, n_chunks, chunk_len, ...]
......@@ -280,8 +281,7 @@ class EfficientAttentionMixin:
num_chunks_after: chunks after current chunk to include in attention
Returns:
tensor of shape [num_chunks, N * chunk_length, ...], where
N = (1 + num_chunks_before + num_chunks_after).
tensor of shape [num_chunks, N * chunk_length, ...], where N = (1 + num_chunks_before + num_chunks_after).
"""
if num_chunks_before == 0 and num_chunks_after == 0:
return vectors
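A rough functional sketch of this chunk-neighbourhood gather (not the exact implementation)::

    import torch

    def look_adjacent(vectors, num_chunks_before, num_chunks_after):
        # vectors: [batch_size, num_attention_heads, n_chunks, chunk_len, ...]
        if num_chunks_before == 0 and num_chunks_after == 0:
            return vectors
        slices = []
        for i in range(-num_chunks_before, num_chunks_after + 1):
            # shift the chunk axis so that every chunk is aligned with its i-th neighbour
            slices.append(vectors if i == 0 else torch.roll(vectors, shifts=-i, dims=2))
        # concatenate along the chunk-length axis -> [..., n_chunks, N * chunk_len, ...]
        return torch.cat(slices, dim=3)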
......@@ -985,11 +985,8 @@ class LSHSelfAttention(nn.Module, EfficientAttentionMixin):
class ReverseSort(Function):
"""
After chunked attention is applied, which sorted the clusters,
the original ordering has to be restored.
Since a customized backward function is used for Reformer,
the gradients of the output vectors have to be explicitly
sorted here.
After chunked attention is applied, which sorted the clusters, the original ordering has to be restored. Since a
customized backward function is used for Reformer, the gradients of the output vectors have to be explicitly sorted here.
"""
@staticmethod
......@@ -1425,11 +1422,8 @@ class ReformerLayer(nn.Module):
def _init_attention_seed(self):
"""
This function sets a new seed for the
attention layer to make dropout deterministic
for both forward calls: 1 normal forward
call and 1 forward call in backward
to recalculate activations.
This function sets a new seed for the attention layer to make dropout deterministic for both forward calls: one
normal forward call and one forward call in the backward pass to recalculate activations.
"""
# randomize seeds
......@@ -1446,11 +1440,8 @@ class ReformerLayer(nn.Module):
def _init_feed_forward_seed(self):
"""
This function sets a new seed for the
feed forward layer to make dropout deterministic
for both forward calls: 1 normal forward
call and 1 forward call in backward
to recalculate activations.
This function sets a new seed for the feed forward layer to make dropout deterministic for both forward calls:
one normal forward call and one forward call in the backward pass to recalculate activations.
"""
# randomize seeds
# use cuda generator if available
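The seeding pattern described in the two docstrings above boils down to re-seeding the RNG before recomputation so
that the dropout mask is identical; a minimal sketch (a simplification, the module also handles CUDA generators)::

    import torch

    hidden_states = torch.randn(2, 8, 16)

    # forward pass: draw and remember a fresh seed, then apply dropout under it
    seed = int(torch.randint(0, 2 ** 31 - 1, (1,)).item())
    torch.manual_seed(seed)
    out_forward = torch.nn.functional.dropout(hidden_states, p=0.1, training=True)

    # backward pass: re-seed with the stored value so the recomputed dropout mask matches
    torch.manual_seed(seed)
    out_recomputed = torch.nn.functional.dropout(hidden_states, p=0.1, training=True)
    assert torch.equal(out_forward, out_recomputed)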
......@@ -1580,11 +1571,9 @@ class ReformerLayer(nn.Module):
class _ReversibleFunction(Function):
"""
To prevent PyTorch from performing the usual backpropagation,
a customized backward function is implemented here. This way
it is ensured that no memory-expensive activations are
saved during the forward pass.
This function is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py
To prevent PyTorch from performing the usual backpropagation, a customized backward function is implemented here.
This way it is ensured that no memory-expensive activations are saved during the forward pass. This function is
heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py
"""
@staticmethod
......@@ -1775,8 +1764,9 @@ class ReformerOnlyLMHead(nn.Module):
class ReformerPreTrainedModel(PreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = ReformerConfig
......@@ -1825,8 +1815,8 @@ class ReformerModelOutput(ModelOutput):
past_buckets_states (:obj:`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`Tuple(torch.LongTensor, torch.FloatTensor)` of length :obj:`config.n_layers`, with the first
element being the previous `buckets` of shape :obj:`(batch_size, num_heads, num_hashes, sequence_length)`
and the second being the previous `hidden_states` of shape
:obj:`(batch_size, sequence_length, hidden_size)`.
and the second being the previous `hidden_states` of shape :obj:`(batch_size, sequence_length,
hidden_size)`.
Contains precomputed buckets and hidden-states that can be used (see ``past_buckets_states`` input) to
speed up sequential decoding.
......@@ -1836,8 +1826,8 @@ class ReformerModelOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -1865,8 +1855,8 @@ class ReformerModelWithLMHeadOutput(ModelOutput):
past_buckets_states (:obj:`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
List of :obj:`Tuple(torch.LongTensor, torch.FloatTensor)` of length :obj:`config.n_layers`, with the first
element being the previous `buckets` of shape :obj:`(batch_size, num_heads, num_hashes, sequence_length)`
and the second being the previous `hidden_states` of shape
:obj:`(batch_size, sequence_length, hidden_size)`.
and the second being the previous `hidden_states` of shape :obj:`(batch_size, sequence_length,
hidden_size)`.
Contains precomputed buckets and hidden-states that can be used (see ``past_buckets_states`` input) to
speed up sequential decoding.
......@@ -1876,8 +1866,8 @@ class ReformerModelWithLMHeadOutput(ModelOutput):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -1891,52 +1881,50 @@ class ReformerModelWithLMHeadOutput(ModelOutput):
REFORMER_START_DOCSTRING = r"""
Reformer was proposed in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`__
by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
Reformer was proposed in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita
Kitaev, Łukasz Kaiser, Anselm Levskaya.
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior.
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
Parameters:
config (:class:`~transformers.ReformerConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
REFORMER_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
During training the input_ids sequence_length has to be a multiple of the relevant model's
chunk lengths (lsh's, local's or both). During evaluation, the indices are automatically
padded to be a multiple of the chunk length.
Indices of input sequence tokens in the vocabulary. During training the input_ids sequence_length has to be
a multiple of the relevant model's chunk lengths (lsh's, local's or both). During evaluation, the indices
are automatically padded to be a multiple of the chunk length.
Indices can be obtained using :class:`~transformers.ReformerTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
Indices can be obtained using :class:`~transformers.ReformerTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``.
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`__
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
......@@ -1953,8 +1941,8 @@ REFORMER_INPUTS_DOCSTRING = r"""
past_buckets_states (:obj:`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, `optional`):
List of :obj:`Tuple(torch.LongTensor, torch.FloatTensor)` of length :obj:`config.n_layers`, with the first
element being the previous `buckets` of shape :obj:`(batch_size, num_heads, num_hashes, sequence_length)`
and the second being the previous `hidden_states` of shape
:obj:`(batch_size, sequence_length, hidden_size)`.
and the second being the previous `hidden_states` of shape :obj:`(batch_size, sequence_length,
hidden_size)`.
Contains precomputed hidden-states and buckets (only relevant for LSH Self-Attention). Can be used to speed
up sequential decoding.
......@@ -1996,9 +1984,9 @@ class ReformerModel(ReformerPreTrainedModel):
self.embeddings.word_embeddings = value
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}. See the base
class PreTrainedModel.
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
......@@ -2231,9 +2219,8 @@ class ReformerModelWithLMHead(ReformerPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.
All labels set to ``-100`` are ignored (masked), the loss is only
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[-100, 0,
..., config.vocab_size - 1]`. All labels set to ``-100`` are ignored (masked), the loss is only
computed for labels in ``[0, ..., config.vocab_size]``
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -2344,9 +2331,9 @@ class ReformerForMaskedLM(ReformerPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -2384,8 +2371,10 @@ class ReformerForMaskedLM(ReformerPreTrainedModel):
@add_start_docstrings(
"""Reformer Model transformer with a sequence classification/regression head on top (a linear layer
on top of the pooled output) e.g. for GLUE tasks. """,
"""
Reformer Model transformer with a sequence classification/regression head on top (a linear layer on top of the
pooled output) e.g. for GLUE tasks.
""",
REFORMER_START_DOCSTRING,
)
class ReformerForSequenceClassification(ReformerPreTrainedModel):
......@@ -2422,9 +2411,8 @@ class ReformerForSequenceClassification(ReformerPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
if :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -2486,9 +2474,10 @@ class ReformerClassificationHead(nn.Module):
@add_start_docstrings(
"""Reformer Model with a span classification head on top for
extractive question-answering tasks like SQuAD / TriviaQA (a linear layer on
top of the hidden-states output to compute `span start logits` and `span end logits`). """,
"""
Reformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / TriviaQA
(a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
REFORMER_START_DOCSTRING,
)
class ReformerForQuestionAnswering(ReformerPreTrainedModel):
......@@ -2526,12 +2515,12 @@ class ReformerForQuestionAnswering(ReformerPreTrainedModel):
r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......
......@@ -40,8 +40,9 @@ RETRIBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #
class RetriBertPreTrainedModel(PreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = RetriBertConfig
......@@ -65,14 +66,15 @@ RETRIBERT_START_DOCSTRING = r"""
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior.
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
Parameters:
config (:class:`~transformers.RetriBertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
......@@ -175,14 +177,13 @@ class RetriBertModel(RetriBertPreTrainedModel):
input_ids_query (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary for the queries in a batch.
Indices can be obtained using :class:`~transformers.RetriBertTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
Indices can be obtained using :class:`~transformers.RetriBertTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__`
for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask_query (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
......
......@@ -124,8 +124,8 @@ class RobertaEmbeddings(nn.Module):
return embeddings
def create_position_ids_from_inputs_embeds(self, inputs_embeds):
"""We are provided embeddings directly. We cannot infer which are padded so just generate
sequential position ids.
"""
We are provided embeddings directly. We cannot infer which are padded so just generate sequential position ids.
Args:
inputs_embeds: torch.Tensor
......@@ -456,8 +456,9 @@ class RobertaPooler(nn.Module):
class RobertaPreTrainedModel(PreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = RobertaConfig
......@@ -483,14 +484,15 @@ ROBERTA_START_DOCSTRING = r"""
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior.
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
Parameters:
config (:class:`~transformers.RobertaConfig`): Model configuration class with all the parameters of the
model. Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
model. Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
ROBERTA_INPUTS_DOCSTRING = r"""
......@@ -498,35 +500,33 @@ ROBERTA_INPUTS_DOCSTRING = r"""
input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.RobertaTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
Indices can be obtained using :class:`~transformers.RobertaTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``.
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
......@@ -553,19 +553,17 @@ ROBERTA_INPUTS_DOCSTRING = r"""
class RobertaModel(RobertaPreTrainedModel):
"""
The model can behave as an encoder (with only self-attention) as well
as a decoder, in which case a layer of cross-attention is added between
the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,
Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
cross-attention is added between the self-attention layers, following the architecture described in `Attention is
all you need`_ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser and Illia Polosukhin.
To behave as a decoder the model needs to be initialized with the
:obj:`is_decoder` argument of the configuration set to :obj:`True`.
To be used in a Seq2Seq model, the model needs to be initialized with both the :obj:`is_decoder`
argument and :obj:`add_cross_attention` set to :obj:`True`; an
:obj:`encoder_hidden_states` is then expected as an input to the forward pass.
To behave as a decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration
set to :obj:`True`. To be used in a Seq2Seq model, the model needs to be initialized with both the :obj:`is_decoder`
argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an
input to the forward pass.
.. _`Attention is all you need`:
https://arxiv.org/abs/1706.03762
.. _`Attention is all you need`: https://arxiv.org/abs/1706.03762
"""
......@@ -590,9 +588,9 @@ class RobertaModel(RobertaPreTrainedModel):
self.embeddings.word_embeddings = value
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}. See the base
class PreTrainedModel.
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
......@@ -621,13 +619,12 @@ class RobertaModel(RobertaPreTrainedModel):
):
r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
if the model is configured as a decoder.
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``: ``1`` for
tokens that are NOT MASKED, ``0`` for MASKED tokens.
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
......@@ -740,21 +737,19 @@ class RobertaForCausalLM(RobertaPreTrainedModel):
):
r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
if the model is configured as a decoder.
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the left-to-right language modeling loss (next word prediction).
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with
labels in ``[0, ..., config.vocab_size]``
Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are
ignored (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
Returns:
......@@ -869,10 +864,9 @@ class RobertaForMaskedLM(RobertaPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Used to hide legacy arguments that have been deprecated.
"""
......@@ -944,8 +938,10 @@ class RobertaLMHead(nn.Module):
@add_start_docstrings(
"""RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer
on top of the pooled output) e.g. for GLUE tasks. """,
"""
RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer on top of the
pooled output) e.g. for GLUE tasks.
""",
ROBERTA_START_DOCSTRING,
)
class RobertaForSequenceClassification(RobertaPreTrainedModel):
......@@ -982,9 +978,8 @@ class RobertaForSequenceClassification(RobertaPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
if :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -1026,8 +1021,10 @@ class RobertaForSequenceClassification(RobertaPreTrainedModel):
@add_start_docstrings(
"""Roberta Model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
"""
Roberta Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a
softmax) e.g. for RocStories/SWAG tasks.
""",
ROBERTA_START_DOCSTRING,
)
class RobertaForMultipleChoice(RobertaPreTrainedModel):
......@@ -1064,9 +1061,9 @@ class RobertaForMultipleChoice(RobertaPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (See :obj:`input_ids` above)
Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,
num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See
:obj:`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
......@@ -1116,8 +1113,10 @@ class RobertaForMultipleChoice(RobertaPreTrainedModel):
@add_start_docstrings(
"""Roberta Model with a token classification head on top (a linear layer on top of
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
"""
Roberta Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for
Named-Entity-Recognition (NER) tasks.
""",
ROBERTA_START_DOCSTRING,
)
class RobertaForTokenClassification(RobertaPreTrainedModel):
......@@ -1156,8 +1155,8 @@ class RobertaForTokenClassification(RobertaPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -
1]``.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -1224,8 +1223,10 @@ class RobertaClassificationHead(nn.Module):
@add_start_docstrings(
"""Roberta Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
the hidden-states output to compute `span start logits` and `span end logits`). """,
"""
Roberta Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
ROBERTA_START_DOCSTRING,
)
class RobertaForQuestionAnswering(RobertaPreTrainedModel):
......@@ -1265,12 +1266,12 @@ class RobertaForQuestionAnswering(RobertaPreTrainedModel):
r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -1324,9 +1325,9 @@ class RobertaForQuestionAnswering(RobertaPreTrainedModel):
def create_position_ids_from_input_ids(input_ids, padding_idx):
"""Replace non-padding symbols with their position numbers. Position numbers begin at
padding_idx+1. Padding symbols are ignored. This is modified from fairseq's
`utils.make_positions`.
"""
Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols
are ignored. This is modified from fairseq's `utils.make_positions`.
Args:
x: torch.Tensor x:
......
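A minimal sketch of the behaviour described above (equivalent in spirit to, but not necessarily identical with, the
actual helper)::

    import torch

    padding_idx = 1
    input_ids = torch.tensor([[0, 31414, 232, 2, padding_idx, padding_idx]])
    mask = input_ids.ne(padding_idx).int()
    # non-padding tokens get positions padding_idx + 1, padding_idx + 2, ...; padding keeps padding_idx
    position_ids = torch.cumsum(mask, dim=1) * mask + padding_idx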
......@@ -93,9 +93,8 @@ class SqueezeBertEmbeddings(nn.Module):
class MatMulWrapper(torch.nn.Module):
"""
Wrapper for torch.matmul(). This makes flop-counting easier to implement.
Note that if you directly call torch.matmul() in your code, the flop counter will typically
ignore the flops of the matmul.
Wrapper for torch.matmul(). This makes flop-counting easier to implement. Note that if you directly call
torch.matmul() in your code, the flop counter will typically ignore the flops of the matmul.
"""
def __init__(self):
......@@ -104,13 +103,10 @@ class MatMulWrapper(torch.nn.Module):
def forward(self, mat1, mat2):
"""
:param inputs: two torch tensors
:return: matmul of these tensors
:param inputs: two torch tensors :return: matmul of these tensors
Here are the typical dimensions found in BERT (the B is optional)
mat1.shape: [B, <optional extra dims>, M, K]
mat2.shape: [B, <optional extra dims>, K, N]
output shape: [B, <optional extra dims>, M, N]
Here are the typical dimensions found in BERT (the B is optional) mat1.shape: [B, <optional extra dims>, M, K]
mat2.shape: [B, <optional extra dims>, K, N] output shape: [B, <optional extra dims>, M, N]
"""
return torch.matmul(mat1, mat2)
......@@ -119,9 +115,7 @@ class SqueezeBertLayerNorm(nn.LayerNorm):
"""
This is a nn.LayerNorm subclass that accepts NCW data layout and performs normalization in the C dimension.
N = batch
C = channels
W = sequence length
N = batch, C = channels, W = sequence length
"""
def __init__(self, hidden_size, eps=1e-12):
......@@ -171,8 +165,7 @@ class ConvActivation(nn.Module):
class SqueezeBertSelfAttention(nn.Module):
def __init__(self, config, cin, q_groups=1, k_groups=1, v_groups=1):
"""
config = used for some things; ignored for others (work in progress...)
cin = input channels = output channels
config = used for some things; ignored for others (work in progress...) cin = input channels = output channels
groups = number of groups to use in conv1d layers
"""
super().__init__()
......@@ -197,9 +190,8 @@ class SqueezeBertSelfAttention(nn.Module):
def transpose_for_scores(self, x):
"""
input: [N, C, W]
output: [N, C1, W, C2]
where C1 is the head index, and C2 is one head's contents
- input: [N, C, W]
- output: [N, C1, W, C2] where C1 is the head index, and C2 is one head's contents
"""
new_x_shape = (x.size()[0], self.num_attention_heads, self.attention_head_size, x.size()[-1]) # [N, C1, C2, W]
x = x.view(*new_x_shape)
......@@ -207,9 +199,8 @@ class SqueezeBertSelfAttention(nn.Module):
def transpose_key_for_scores(self, x):
"""
input: [N, C, W]
output: [N, C1, C2, W]
where C1 is the head index, and C2 is one head's contents
- input: [N, C, W]
- output: [N, C1, C2, W] where C1 is the head index, and C2 is one head's contents
"""
new_x_shape = (x.size()[0], self.num_attention_heads, self.attention_head_size, x.size()[-1]) # [N, C1, C2, W]
x = x.view(*new_x_shape)
......@@ -218,8 +209,8 @@ class SqueezeBertSelfAttention(nn.Module):
def transpose_output(self, x):
"""
input: [N, C1, W, C2]
output: [N, C, W]
- input: [N, C1, W, C2]
- output: [N, C, W]
"""
x = x.permute(0, 1, 3, 2).contiguous() # [N, C1, C2, W]
new_x_shape = (x.size()[0], self.all_head_size, x.size()[3]) # [N, C, W]
......@@ -265,10 +256,11 @@ class SqueezeBertSelfAttention(nn.Module):
class SqueezeBertModule(nn.Module):
def __init__(self, config):
"""
hidden_size = input chans = output chans for Q, K, V (they are all the same ... for now) = output chans for the module
intermediate_size = output chans for intermediate layer
groups = number of groups for all layers in the BertModule. (eventually we could change the interface to allow
different groups for different layers)
- hidden_size = input chans = output chans for Q, K, V (they are all the same ... for now) = output chans for
the module
- intermediate_size = output chans for intermediate layer
- groups = number of groups for all layers in the BertModule. (eventually we could change the interface to
allow different groups for different layers)
"""
super().__init__()
......@@ -382,8 +374,9 @@ class SqueezeBertPooler(nn.Module):
class SqueezeBertPreTrainedModel(PreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = SqueezeBertConfig
......@@ -405,25 +398,26 @@ class SqueezeBertPreTrainedModel(PreTrainedModel):
SQUEEZEBERT_START_DOCSTRING = r"""
The SqueezeBERT model was proposed in
`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer
The SqueezeBERT model was proposed in `SqueezeBERT: What can computer vision teach NLP about efficient neural
networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W.
Keutzer
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior.
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
For best results finetuning SqueezeBERT on text classification tasks, it is recommended to use the
`squeezebert/squeezebert-mnli-headless` checkpoint as a starting point.
Parameters:
config (:class:`~transformers.SqueezeBertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
Hierarchy::
......@@ -439,9 +433,8 @@ SQUEEZEBERT_START_DOCSTRING = r"""
Input data is in [batch, sequence_length, hidden_size] format.
Data inside the encoder is in [batch, hidden_size, sequence_length] format.
But, if :obj:`output_hidden_states == True`, the data from inside the encoder is
returned in [batch, sequence_length, hidden_size] format.
Data inside the encoder is in [batch, hidden_size, sequence_length] format. But, if :obj:`output_hidden_states
== True`, the data from inside the encoder is returned in [batch, sequence_length, hidden_size] format.
The final output of the encoder is in [batch, sequence_length, hidden_size] format.
"""
......@@ -451,35 +444,33 @@ SQUEEZEBERT_INPUTS_DOCSTRING = r"""
input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.SqueezeBertTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
Indices can be obtained using :class:`~transformers.SqueezeBertTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``.
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
......@@ -520,9 +511,9 @@ class SqueezeBertModel(SqueezeBertPreTrainedModel):
self.embeddings.word_embeddings = new_embeddings
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}. See the base
class PreTrainedModel.
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
......@@ -636,10 +627,9 @@ class SqueezeBertForMaskedLM(SqueezeBertPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -676,8 +666,10 @@ class SqueezeBertForMaskedLM(SqueezeBertPreTrainedModel):
@add_start_docstrings(
"""SqueezeBERT Model transformer with a sequence classification/regression head on top (a linear layer on top of
the pooled output) e.g. for GLUE tasks. """,
"""
SqueezeBERT Model transformer with a sequence classification/regression head on top (a linear layer on top of the
pooled output) e.g. for GLUE tasks.
""",
SQUEEZEBERT_START_DOCSTRING,
)
class SqueezeBertForSequenceClassification(SqueezeBertPreTrainedModel):
......@@ -713,9 +705,8 @@ class SqueezeBertForSequenceClassification(SqueezeBertPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1`, a regression loss is computed (Mean-Square loss);
if :obj:`config.num_labels > 1`, a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
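A minimal sketch of the two label shapes this corresponds to::

    import torch

    classification_labels = torch.tensor([0, 2, 1])       # config.num_labels > 1: class indices
    regression_labels = torch.tensor([0.7, -1.2, 3.4])    # config.num_labels == 1: float targets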
......@@ -760,8 +751,10 @@ class SqueezeBertForSequenceClassification(SqueezeBertPreTrainedModel):
@add_start_docstrings(
"""SqueezeBERT Model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
"""
SqueezeBERT Model with a multiple choice classification head on top (a linear layer on top of the pooled output and
a softmax) e.g. for RocStories/SWAG tasks.
""",
SQUEEZEBERT_START_DOCSTRING,
)
class SqueezeBertForMultipleChoice(SqueezeBertPreTrainedModel):
......@@ -798,9 +791,9 @@ class SqueezeBertForMultipleChoice(SqueezeBertPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above)
Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,
num_choices-1]`` where `num_choices` is the size of the second dimension of the input tensors. (see
`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
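For illustration, a hedged sketch of how a ``(batch_size, num_choices, sequence_length)`` batch is typically assembled
for this head (the checkpoint name is only an example)::

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-uncased")  # example checkpoint
    prompt = "The sky is"
    choices = ["blue.", "made of cheese."]
    encoding = tokenizer([prompt] * len(choices), choices, padding=True, return_tensors="pt")
    input_ids = encoding["input_ids"].unsqueeze(0)              # (1, num_choices, seq_len)
    attention_mask = encoding["attention_mask"].unsqueeze(0)
    # labels would then be e.g. torch.tensor([0]) if the first choice is the correct one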
......@@ -851,8 +844,10 @@ class SqueezeBertForMultipleChoice(SqueezeBertPreTrainedModel):
@add_start_docstrings(
"""SqueezeBERT Model with a token classification head on top (a linear layer on top of
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
"""
SqueezeBERT Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g.
for Named-Entity-Recognition (NER) tasks.
""",
SQUEEZEBERT_START_DOCSTRING,
)
class SqueezeBertForTokenClassification(SqueezeBertPreTrainedModel):
......@@ -888,8 +883,8 @@ class SqueezeBertForTokenClassification(SqueezeBertPreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -
1]``.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -937,8 +932,9 @@ class SqueezeBertForTokenClassification(SqueezeBertPreTrainedModel):
@add_start_docstrings(
"""SqueezeBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD
(a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
"""
SqueezeBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD (a
linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
SQUEEZEBERT_START_DOCSTRING,
)
......@@ -976,12 +972,12 @@ class SqueezeBertForQuestionAnswering(SqueezeBertPreTrainedModel):
r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
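A small sketch of the clamping behaviour described above, shown in isolation with the usual ``ignore_index`` trick
(values are made up)::

    import torch
    from torch.nn import CrossEntropyLoss

    sequence_length = 8
    start_logits = torch.randn(2, sequence_length)
    start_positions = torch.tensor([3, 100])                       # second target is out of range
    start_positions = start_positions.clamp(0, sequence_length)    # -> tensor([3, 8])
    loss_fct = CrossEntropyLoss(ignore_index=sequence_length)      # clamped index contributes nothing
    loss = loss_fct(start_logits, start_positions)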
......
......@@ -155,8 +155,8 @@ def load_tf_weights_in_t5(model, config, tf_checkpoint_path):
class T5LayerNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""Construct a layernorm module in the T5 style
No bias and no subtraction of mean.
"""
Construct a layernorm module in the T5 style. No bias and no subtraction of mean.
"""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
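The rest of the module is not shown above; a hedged sketch of what a layer norm with no bias and no mean subtraction
(an RMS-style norm) typically looks like::

    import torch
    from torch import nn

    class T5StyleLayerNorm(nn.Module):                     # illustrative re-implementation
        def __init__(self, hidden_size, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.variance_epsilon = eps

        def forward(self, hidden_states):
            # scale by the root mean square over the last dimension; no mean subtraction, no bias
            variance = hidden_states.pow(2).mean(-1, keepdim=True)
            hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
            return self.weight * hidden_states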
......@@ -245,24 +245,21 @@ class T5Attention(nn.Module):
Adapted from Mesh Tensorflow:
https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593
Translate relative position to a bucket number for relative attention.
The relative position is defined as memory_position - query_position, i.e.
the distance in tokens from the attending position to the attended-to
position. If bidirectional=False, then positive relative positions are
invalid.
We use smaller buckets for small absolute relative_position and larger buckets
for larger absolute relative_positions. All relative positions >=max_distance
map to the same bucket. All relative positions <=-max_distance map to the
same bucket. This should allow for more graceful generalization to longer
sequences than the model has been trained on.
Translate relative position to a bucket number for relative attention. The relative position is defined as
memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to
position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for
small absolute relative_position and larger buckets for larger absolute relative_positions. All relative
positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket.
This should allow for more graceful generalization to longer sequences than the model has been trained on.
Args:
relative_position: an int32 Tensor
bidirectional: a boolean - whether the attention is bidirectional
num_buckets: an integer
max_distance: an integer
max_distance: an integer
Returns:
a Tensor with the same shape as relative_position, containing int32
values in the range [0, num_buckets)
a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets)
"""
ret = 0
n = -relative_position
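The remainder of the function is not shown above; a hedged sketch of the bucketing scheme the docstring describes
(half of the buckets hold exact small distances, the other half are log-spaced up to ``max_distance``)::

    import math

    import torch

    def relative_position_bucket_sketch(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
        # illustrative sketch of the scheme described above, not the exact diffed implementation
        ret = 0
        n = -relative_position
        if bidirectional:
            num_buckets //= 2
            ret += (n < 0).long() * num_buckets          # separate bucket ranges for each direction
            n = torch.abs(n)
        else:
            n = torch.max(n, torch.zeros_like(n))        # positive relative positions are invalid
        max_exact = num_buckets // 2
        is_small = n < max_exact                         # small distances each get their own bucket
        val_if_large = max_exact + (
            torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)
        ).long()
        val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))
        return ret + torch.where(is_small, n, val_if_large)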
......@@ -573,8 +570,9 @@ class T5Block(nn.Module):
class T5PreTrainedModel(PreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = T5Config
......@@ -808,39 +806,38 @@ T5_START_DOCSTRING = r"""
The T5 model was proposed in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
<https://arxiv.org/abs/1910.10683>`__ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. It's an encoder decoder transformer pre-trained in a text-to-text
denoising generative setting.
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior.
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
general usage and behavior.
Parameters:
config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
T5_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
T5 is a model with relative position embeddings so you should be able to pad the inputs on both the right
and the left.
Indices of input sequence tokens in the vocabulary. T5 is a model with relative position embeddings so you
should be able to pad the inputs on both the right and the left.
Indices can be obtained using :class:`~transformers.T5Tokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for detail.
Indices can be obtained using :class:`~transformers.T5Tokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
detail.
To know more on how to prepare :obj:`input_ids` for pretraining take a look at
`T5 Training <./t5.html#training>`__.
To know more on how to prepare :obj:`input_ids` for pretraining take a look at `T5 Training
<./t5.html#training>`__.
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
......@@ -848,20 +845,20 @@ T5_INPUTS_DOCSTRING = r"""
`What are attention masks? <../glossary.html#attention-mask>`__
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for
:obj:`decoder_input_ids` generation.
If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see
:obj:`past_key_values`).
:obj:`decoder_input_ids` generation. If :obj:`past_key_values` is used, optionally only the last
:obj:`decoder_input_ids` have to be input (see :obj:`past_key_values`).
To know more on how to prepare :obj:`decoder_input_ids` for pretraining take a look at
`T5 Training <./t5.html#training>`__. If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both
unset, :obj:`decoder_input_ids` takes the value of :obj:`input_ids`.
To know more on how to prepare :obj:`decoder_input_ids` for pretraining take a look at `T5 Training
<./t5.html#training>`__. If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both unset,
:obj:`decoder_input_ids` takes the value of :obj:`input_ids`.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default.
encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`):
Tuple consists of (:obj:`last_hidden_state`, :obj:`optional`: `hidden_states`, :obj:`optional`: `attentions`)
:obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of
hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
Tuple consists of (:obj:`last_hidden_state`, :obj:`optional`: `hidden_states`, :obj:`optional`:
`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a
sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of
the decoder.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
......@@ -869,8 +866,7 @@ T5_INPUTS_DOCSTRING = r"""
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
......@@ -881,14 +877,12 @@ T5_INPUTS_DOCSTRING = r"""
vectors than the model's internal embedding lookup matrix.
decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded
representation.
If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_inputs_embeds` have to be input
(see :obj:`past_key_values`).
This is useful if you want more control over how to convert :obj:`decoder_input_ids` indices into
associated vectors than the model's internal embedding lookup matrix.
representation. If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_inputs_embeds`
have to be input (see :obj:`past_key_values`). This is useful if you want more control over how to convert
:obj:`decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both
unset, :obj:`decoder_inputs_embeds` takes the value of :obj:`inputs_embeds`.
If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both unset, :obj:`decoder_inputs_embeds`
takes the value of :obj:`inputs_embeds`.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
......@@ -942,9 +936,9 @@ class T5Model(T5PreTrainedModel):
return self.decoder
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
......@@ -1114,10 +1108,9 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.
All labels set to ``-100`` are ignored (masked), the loss is only
computed for labels in ``[0, ..., config.vocab_size]``
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[-100, 0, ...,
config.vocab_size - 1]`. All labels set to ``-100`` are ignored (masked), the loss is only computed for
labels in ``[0, ..., config.vocab_size]``
kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
Used to hide legacy arguments that have been deprecated.
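A hedged usage sketch for the ``labels`` argument above (``t5-small`` is only an example checkpoint; padding positions
in the labels would normally also be set to ``-100``)::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")                     # example checkpoint
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
    targets = tokenizer("Das Haus ist wunderbar.", return_tensors="pt")

    # when decoder_input_ids are not given, they are derived from the labels (shifted right)
    outputs = model(input_ids=inputs["input_ids"], labels=targets["input_ids"], return_dict=True)
    loss = outputs.loss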
......
......@@ -120,14 +120,18 @@ class TFAlbertEmbeddings(tf.keras.layers.Layer):
mode="embedding",
training=False,
):
"""Get token embeddings of inputs.
"""
Get token embeddings of inputs
Args:
inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)
mode: string, a valid value is one of "embedding" and "linear".
mode: string, a valid value is one of "embedding" and "linear"
Returns:
outputs: (1) If mode == "embedding", output embedding tensor, float32 with
shape [batch_size, length, embedding_size]; (2) mode == "linear", output
linear tensor, float32 with shape [batch_size, length, vocab_size].
outputs: (1) If mode == "embedding", output embedding tensor, float32 with shape [batch_size, length,
embedding_size]; (2) mode == "linear", output linear tensor, float32 with shape [batch_size, length,
vocab_size]
Raises:
ValueError: if mode is not valid.
......@@ -167,9 +171,12 @@ class TFAlbertEmbeddings(tf.keras.layers.Layer):
return embeddings
def _linear(self, inputs):
"""Computes logits by running inputs through a linear layer.
"""
Computes logits by running inputs through a linear layer
Args:
inputs: A float32 tensor with shape [batch_size, length, embedding_size]
inputs: A float32 tensor with shape [batch_size, length, embedding_size]
Returns:
float32 tensor with shape [batch_size, length, vocab_size].
"""
......@@ -420,8 +427,9 @@ class TFAlbertTransformer(tf.keras.layers.Layer):
class TFAlbertPreTrainedModel(TFPreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = AlbertConfig
......@@ -493,9 +501,9 @@ class TFAlbertMainLayer(tf.keras.layers.Layer):
raise NotImplementedError
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
raise NotImplementedError
......@@ -619,16 +627,16 @@ class TFAlbertForPreTrainingOutput(ModelOutput):
prediction_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
sop_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False
continuation before SoftMax).
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`tf.Tensor` (one for each layer) of shape :obj:`(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -646,9 +654,9 @@ ALBERT_START_DOCSTRING = r"""
generic methods the library implements for all its model (such as downloading or saving, resizing the input
embeddings, pruning heads etc.)
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
usage and behavior.
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass. Use
it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage
and behavior.
.. note::
......@@ -657,11 +665,11 @@ ALBERT_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having all
the tensors in the first argument of the model call function: :obj:`model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors
in the first positional argument :
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in
the first positional argument :
- a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(input_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
......@@ -671,8 +679,9 @@ ALBERT_START_DOCSTRING = r"""
Args:
config (:class:`~transformers.AlbertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
ALBERT_INPUTS_DOCSTRING = r"""
......@@ -680,35 +689,33 @@ ALBERT_INPUTS_DOCSTRING = r"""
input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.AlbertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.__call__` and
:func:`transformers.PreTrainedTokenizer.encode` for details.
Indices can be obtained using :class:`~transformers.AlbertTokenizer`. See
:func:`transformers.PreTrainedTokenizer.__call__` and :func:`transformers.PreTrainedTokenizer.encode` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``.
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
......@@ -753,8 +760,10 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
@add_start_docstrings(
"""Albert Model with two heads on top for pre-training:
a `masked language modeling` head and a `sentence order prediction` (classification) head. """,
"""
Albert Model with two heads on top for pre-training: a `masked language modeling` head and a `sentence order
prediction` (classification) head.
""",
ALBERT_START_DOCSTRING,
)
class TFAlbertForPreTraining(TFAlbertPreTrainedModel):
......@@ -861,10 +870,9 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see the ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
(masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``.
"""
return_dict = return_dict if return_dict is not None else self.albert.return_dict
if isinstance(inputs, (tuple, list)):
......@@ -905,8 +913,10 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
@add_start_docstrings(
"""Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of
the pooled output) e.g. for GLUE tasks. """,
"""
Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled
output) e.g. for GLUE tasks.
""",
ALBERT_START_DOCSTRING,
)
class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClassificationLoss):
......@@ -943,9 +953,8 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
Labels for computing the sequence classification/regression loss. Indices should be in ``[0, ...,
config.num_labels - 1]``. If ``config.num_labels == 1``, a regression loss is computed (Mean-Square loss);
if ``config.num_labels > 1``, a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.albert.return_dict
......@@ -989,8 +998,10 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
@add_start_docstrings(
"""Albert Model with a token classification head on top (a linear layer on top of
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
"""
Albert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for
Named-Entity-Recognition (NER) tasks.
""",
ALBERT_START_DOCSTRING,
)
class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificationLoss):
......@@ -1030,8 +1041,8 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -
1]``.
"""
return_dict = return_dict if return_dict is not None else self.albert.return_dict
if isinstance(inputs, (tuple, list)):
......@@ -1074,8 +1085,10 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
@add_start_docstrings(
"""Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
"""
Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
ALBERT_START_DOCSTRING,
)
class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringLoss):
......@@ -1116,12 +1129,12 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
r"""
start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.albert.return_dict
if isinstance(inputs, (tuple, list)):
......@@ -1173,8 +1186,10 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
@add_start_docstrings(
"""Albert Model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
"""
Albert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a
softmax) e.g. for RocStories/SWAG tasks.
""",
ALBERT_START_DOCSTRING,
)
class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
......@@ -1189,7 +1204,8 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
@property
def dummy_inputs(self):
"""Dummy inputs to build the network.
"""
Dummy inputs to build the network.
Returns:
tf.Tensor with dummy inputs
......@@ -1219,9 +1235,9 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (See :obj:`input_ids` above)
Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,
num_choices - 1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See
:obj:`input_ids` above)
"""
if isinstance(inputs, (tuple, list)):
input_ids = inputs[0]
......
......@@ -337,9 +337,9 @@ TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
TF_AUTO_MODEL_PRETRAINED_DOCSTRING = r"""
The model class to instantiate is selected based on the :obj:`model_type` property of the config object
(either passed as an argument or loaded from :obj:`pretrained_model_name_or_path` if possible), or when it's
missing, by falling back to using pattern matching on :obj:`pretrained_model_name_or_path`:
The model class to instantiate is selected based on the :obj:`model_type` property of the config object (either
passed as an argument or loaded from :obj:`pretrained_model_name_or_path` if possible), or when it's missing,
by falling back to using pattern matching on :obj:`pretrained_model_name_or_path`:
List options
......@@ -393,12 +393,10 @@ TF_AUTO_MODEL_PRETRAINED_DOCSTRING = r"""
Whether or not to delete incompletely received files. Will attempt to resume the download if such a
file exists.
proxies (:obj:`Dict[str, str], `optional`):
A dictionary of proxy servers to use by protocol or endpoint, e.g.,
:obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each
request.
A dictionary of proxy servers to use by protocol or endpoint, e.g., :obj:`{'http': 'foo.bar:3128',
'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
output_loading_info(:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to also return a dictionary containing missing keys, unexpected keys and error
messages.
Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages.
local_files_only(:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to only look at local files (e.g., not try downloading the model).
use_cdn(:obj:`bool`, `optional`, defaults to :obj:`True`):
......@@ -422,8 +420,8 @@ TF_AUTO_MODEL_PRETRAINED_DOCSTRING = r"""
class TFAutoModel(object):
r"""
This is a generic model class that will be instantiated as one of the base model classes of the library
when created with the :meth:`~transformers.TFAutoModel.from_pretrained` class method or the
This is a generic model class that will be instantiated as one of the base model classes of the library when
created with the :meth:`~transformers.TFAutoModel.from_pretrained` class method or the
:meth:`~transformers.TFAutoModel.from_config` class methods.
This class cannot be instantiated directly using ``__init__()`` (throws an error).
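For illustration, the two entry points mentioned above (``bert-base-uncased`` is only an example checkpoint)::

    from transformers import AutoConfig, TFAutoModel

    # download a pretrained checkpoint together with its weights
    model = TFAutoModel.from_pretrained("bert-base-uncased")

    # instantiate from a configuration only; this does not load any pretrained weights
    config = AutoConfig.from_pretrained("bert-base-uncased")
    model = TFAutoModel.from_config(config)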
......@@ -443,9 +441,8 @@ class TFAutoModel(object):
Instantiates one of the base model classes of the library from a configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use :meth:`~transformers.TFAutoModel.from_pretrained` to load
the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModel.from_pretrained` to load the model weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -537,9 +534,9 @@ class TFAutoModelForPreTraining(object):
model---from a configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use
:meth:`~transformers.TFAutoModelForPreTraining.from_pretrained` to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelForPreTraining.from_pretrained` to load the
model weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -637,9 +634,9 @@ class TFAutoModelWithLMHead(object):
Instantiates one of the model classes of the library---with a language modeling head---from a configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use :meth:`~transformers.TFAutoModelWithLMHead.from_pretrained`
to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelWithLMHead.from_pretrained` to load the model
weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -721,8 +718,8 @@ class TFAutoModelWithLMHead(object):
class TFAutoModelForCausalLM:
r"""
This is a generic model class that will be instantiated as one of the model classes of the library---with a
causal language modeling head---when created with the
This is a generic model class that will be instantiated as one of the model classes of the library---with a causal
language modeling head---when created with the
:meth:`~transformers.TFAutoModelForCausalLM.from_pretrained` class method or the
:meth:`~transformers.TFAutoModelForCausalLM.from_config` class method.
......@@ -744,9 +741,9 @@ class TFAutoModelForCausalLM:
configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use :meth:`~transformers.TFAutoModelForCausalLM.from_pretrained`
to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelForCausalLM.from_pretrained` to load the model
weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -815,8 +812,8 @@ class TFAutoModelForCausalLM:
class TFAutoModelForMaskedLM:
r"""
This is a generic model class that will be instantiated as one of the model classes of the library---with a
masked language modeling head---when created with the
This is a generic model class that will be instantiated as one of the model classes of the library---with a masked
language modeling head---when created with the
:meth:`~transformers.TFAutoModelForMaskedLM.from_pretrained` class method or the
:meth:`~transformers.TFAutoModelForMaskedLM.from_config` class method.
......@@ -838,9 +835,9 @@ class TFAutoModelForMaskedLM:
configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use :meth:`~transformers.TFAutoModelForMaskedLM.from_pretrained`
to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelForMaskedLM.from_pretrained` to load the model
weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -932,9 +929,9 @@ class TFAutoModelForSeq2SeqLM:
head---from a configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use
:meth:`~transformers.TFAutoModelForSeq2SeqLM.from_pretrained` to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelForSeq2SeqLM.from_pretrained` to load the model
weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -1030,9 +1027,9 @@ class TFAutoModelForSequenceClassification(object):
configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use
:meth:`~transformers.TFAutoModelForSequenceClassification.from_pretrained` to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelForSequenceClassification.from_pretrained` to
load the model weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -1127,9 +1124,9 @@ class TFAutoModelForQuestionAnswering(object):
Instantiates one of the model classes of the library---with a question answering head---from a configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use
:meth:`~transformers.TFAutoModelForQuestionAnswering.from_pretrained` to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelForQuestionAnswering.from_pretrained` to load
the model weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -1202,8 +1199,8 @@ class TFAutoModelForQuestionAnswering(object):
class TFAutoModelForTokenClassification:
r"""
This is a generic model class that will be instantiated as one of the model classes of the library---with a
token classification head---when created with the
This is a generic model class that will be instantiated as one of the model classes of the library---with a token
classification head---when created with the
:meth:`~transformers.TFAutoModelForTokenClassification.from_pretrained` class method or the
:meth:`~transformers.TFAutoModelForTokenClassification.from_config` class method.
......@@ -1224,9 +1221,9 @@ class TFAutoModelForTokenClassification:
Instantiates one of the model classes of the library---with a token classification head---from a configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use
:meth:`~transformers.TFAutoModelForTokenClassification.from_pretrained` to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelForTokenClassification.from_pretrained` to load
the model weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......@@ -1322,9 +1319,9 @@ class TFAutoModelForMultipleChoice:
configuration.
Note:
Loading a model from its configuration file does **not** load the model weights.
It only affects the model's configuration. Use
:meth:`~transformers.TFAutoModelForMultipleChoice.from_pretrained` to load the model weights.
Loading a model from its configuration file does **not** load the model weights. It only affects the
model's configuration. Use :meth:`~transformers.TFAutoModelForMultipleChoice.from_pretrained` to load the
model weights.
Args:
config (:class:`~transformers.PretrainedConfig`):
......
......@@ -51,9 +51,9 @@ BART_START_DOCSTRING = r"""
generic methods the library implements for all its model (such as downloading or saving, resizing the input
embeddings, pruning heads etc.)
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
usage and behavior.
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass. Use
it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage
and behavior.
.. note::
......@@ -62,11 +62,11 @@ BART_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having all
the tensors in the first argument of the model call function: :obj:`model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors
in the first positional argument :
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in
the first positional argument :
- a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(input_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
......@@ -76,8 +76,9 @@ BART_START_DOCSTRING = r"""
Args:
config (:class:`~transformers.BartConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.TFPreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.TFPreTrainedModel.from_pretrained` method to load the
model weights.
"""
......@@ -86,14 +87,13 @@ BART_INPUTS_DOCSTRING = r"""
input_ids (:obj:`tf.Tensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.BertTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
Indices can be obtained using :class:`~transformers.BertTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
......@@ -134,9 +134,9 @@ logger = logging.get_logger(__name__)
def create_position_ids_from_input_ids(input_ids, padding_idx):
"""Replace non-padding symbols with their position numbers. Position numbers begin at
padding_idx+1. Padding symbols are ignored. This is modified from fairseq's
`utils.make_positions`.
"""
Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols
are ignored. This is modified from fairseq's `utils.make_positions`.
"""
mask = tf.cast(tf.math.not_equal(input_ids, padding_idx), tf.int32)  # 1 for real tokens, 0 for padding
incremental_indices = tf.cumsum(mask, axis=1) * mask  # running position count, zeroed again at padding
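A toy example of the intended behaviour: with ``padding_idx = 1``, non-padding tokens get positions
``padding_idx + 1, padding_idx + 2, ...`` and padding tokens keep ``padding_idx``::

    import tensorflow as tf

    padding_idx = 1
    input_ids = tf.constant([[5, 6, 7, 1, 1]])                              # 1 is the padding token here
    mask = tf.cast(tf.math.not_equal(input_ids, padding_idx), tf.int32)     # [[1, 1, 1, 0, 0]]
    incremental_indices = tf.cumsum(mask, axis=1) * mask                    # [[1, 2, 3, 0, 0]]
    position_ids = incremental_indices + padding_idx                        # [[2, 3, 4, 1, 1]]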
......@@ -144,8 +144,9 @@ def create_position_ids_from_input_ids(input_ids, padding_idx):
def causal_attention_mask(nd, ns, dtype):
"""1's in the lower triangle, counting from the lower right corner.
Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.
"""
1's in the lower triangle, counting from the lower right corner. Same as tf.matrix_band_part(tf.ones([nd, ns]), -1,
ns-nd), but doesn't produce garbage on TPUs.
"""
i = tf.range(nd)[:, None]
j = tf.range(ns)
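The remaining lines are not shown above; a hedged sketch of how such a mask can be completed, together with a small
example of what it produces::

    import tensorflow as tf

    def causal_attention_mask_sketch(nd, ns, dtype):
        # illustrative completion: 1.0 where query position i may attend to key position j
        i = tf.range(nd)[:, None]
        j = tf.range(ns)
        return tf.cast(i >= j - ns + nd, dtype)

    print(causal_attention_mask_sketch(3, 3, tf.float32))
    # [[1. 0. 0.]
    #  [1. 1. 0.]
    #  [1. 1. 1.]]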
......@@ -273,8 +274,8 @@ class TFEncoderLayer(tf.keras.layers.Layer):
class TFBartEncoder(tf.keras.layers.Layer):
# config_class = BartConfig
"""
Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer
is a :class:`TFEncoderLayer`.
Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer is a
:class:`TFEncoderLayer`.
Args:
config: BartConfig
......@@ -323,15 +324,15 @@ class TFBartEncoder(tf.keras.layers.Layer):
Args:
input_ids (Tensor): tokens in the source language of shape
`(batch, src_len)`
attention_mask (Tensor): indicating which indices are padding tokens.
attention_mask (Tensor): indicating which indices are padding tokens
Returns:
namedtuple:
- **x** (Tensor): the last encoder layer's output of
shape `(src_len, batch, embed_dim)`
- **encoder_states** (List[Tensor]): all intermediate
hidden states of shape `(src_len, batch, embed_dim)`.
Only populated if *return_all_hiddens* is True.
- **x** (Tensor): the last encoder layer's output of shape `(src_len, batch, embed_dim)`
- **encoder_states** (List[Tensor]): all intermediate hidden states of shape `(src_len, batch,
embed_dim)`. Only populated if *return_all_hiddens* is True.
- **all_attentions** (List[Tensor]): Attention weights for each layer.
During training might not be of length n_layers because of layer dropout.
"""
......@@ -424,8 +425,7 @@ class TFDecoderLayer(tf.keras.layers.Layer):
Args:
x (Tensor): input to the layer of shape `(seq_len, batch, embed_dim)`
encoder_attn_mask (ByteTensor, optional): binary
ByteTensor of shape `(batch, src_len)` where padding
elements are indicated by ``1``.
ByteTensor of shape `(batch, src_len)` where padding elements are indicated by ``1``.
need_attn_weights (bool, optional): return attention weights
for each head (default: return average over heads).
......@@ -477,8 +477,8 @@ class TFDecoderLayer(tf.keras.layers.Layer):
class TFBartDecoder(tf.keras.layers.Layer):
"""
Transformer decoder consisting of *config.decoder_layers* layers. Each layer
is a :class:`TFDecoderLayer`.
Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a :class:`TFDecoderLayer`.
Args:
config: BartConfig
embed_tokens: output embedding
......@@ -666,16 +666,16 @@ class TFAttention(tf.keras.layers.Layer):
attn_mask: Optional[Tensor] = None,
training=False,
) -> Tuple[Tensor, Optional[Tensor]]:
"""Input shape: Time(SeqLen) x Batch x Channel
"""
Input shape: Time(SeqLen) x Batch x Channel
Args:
key_padding_mask (ByteTensor, optional): mask to exclude
keys that are pads, of shape `(batch, src_len)`, where
padding elements are indicated by 1s.
keys that are pads, of shape `(batch, src_len)`, where padding elements are indicated by 1s.
attn_mask (ByteTensor, optional): typically used to
implement causal attention, where the mask prevents the
attention from looking forward in time (default: None).
implement causal attention, where the mask prevents the attention from looking forward in time
(default: None).
"""
static_kv = self.encoder_decoder_attention # value=key=encoder_hidden_states,
tgt_len, bsz, embed_dim = query.shape
......@@ -755,10 +755,9 @@ class TFAttention(tf.keras.layers.Layer):
class TFLearnedPositionalEmbedding(TFSharedEmbeddings):
"""
This module learns positional embeddings up to a fixed maximum size.
Padding ids are ignored by either offsetting based on padding_idx
or by setting padding_idx to None and ensuring that the appropriate
position ids are passed to the forward function.
This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting
based on padding_idx or by setting padding_idx to None and ensuring that the appropriate position ids are passed to
the forward function.
"""
def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int, offset, **kwargs):
......@@ -812,8 +811,9 @@ class TFBartModel(TFPretrainedBartModel):
decoder_attn_mask=None,
mask_dtype=None,
):
"""Prepare masks that ignore padding tokens decoder and a causal lm mask for the decoder if
none are provided. This mimics the default behavior in fairseq. To override it pass in masks.
"""
Prepare masks that ignore padding tokens decoder and a causal lm mask for the decoder if none are provided.
This mimics the default behavior in fairseq. To override it pass in masks.
"""
pad_token_id = self.config.pad_token_id
if decoder_input_ids is None:
......
......@@ -137,14 +137,18 @@ class TFBertEmbeddings(tf.keras.layers.Layer):
mode="embedding",
training=False,
):
"""Get token embeddings of inputs.
"""
Get token embeddings of inputs.
Args:
inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)
mode: string, a valid value is one of "embedding" and "linear".
Returns:
outputs: (1) If mode == "embedding", output embedding tensor, float32 with
shape [batch_size, length, embedding_size]; (2) mode == "linear", output
linear tensor, float32 with shape [batch_size, length, vocab_size].
outputs: If mode == "embedding", output embedding tensor, float32 with shape [batch_size, length,
embedding_size]; if mode == "linear", output linear tensor, float32 with shape [batch_size, length,
vocab_size].
Raises:
ValueError: if mode is not valid.
......@@ -187,9 +191,12 @@ class TFBertEmbeddings(tf.keras.layers.Layer):
return embeddings
def _linear(self, inputs):
"""Computes logits by running inputs through a linear layer.
"""
Computes logits by running inputs through a linear layer.
Args:
inputs: A float32 tensor with shape [batch_size, length, hidden_size]
inputs: A float32 tensor with shape [batch_size, length, hidden_size].
Returns:
float32 tensor with shape [batch_size, length, vocab_size].
"""
......@@ -528,9 +535,9 @@ class TFBertMainLayer(tf.keras.layers.Layer):
self.embeddings.vocab_size = value.shape[0]
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
raise NotImplementedError
......@@ -647,8 +654,9 @@ class TFBertMainLayer(tf.keras.layers.Layer):
class TFBertPreTrainedModel(TFPreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = BertConfig
......@@ -664,16 +672,16 @@ class TFBertForPreTrainingOutput(ModelOutput):
prediction_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
seq_relationship_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False
continuation before SoftMax).
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of :obj:`tf.Tensor` (one for each layer) of shape :obj:`(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -691,9 +699,9 @@ BERT_START_DOCSTRING = r"""
generic methods the library implements for all its model (such as downloading or saving, resizing the input
embeddings, pruning heads etc.)
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
usage and behavior.
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass. Use
it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage
and behavior.
.. note::
......@@ -702,11 +710,11 @@ BERT_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
This second option is useful when using the :meth:`tf.keras.Model.fit` method, which currently requires having all
the tensors in the first argument of the model call function: :obj:`model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors
in the first positional argument :
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in
the first positional argument:
- a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
......@@ -716,8 +724,9 @@ BERT_START_DOCSTRING = r"""
Args:
config (:class:`~transformers.BertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.TFPreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.TFPreTrainedModel.from_pretrained` method to load the
model weights.
"""
BERT_INPUTS_DOCSTRING = r"""
......@@ -725,35 +734,33 @@ BERT_INPUTS_DOCSTRING = r"""
input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.BertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.__call__` and
:func:`transformers.PreTrainedTokenizer.encode` for details.
Indices can be obtained using :class:`~transformers.BertTokenizer`. See
:func:`transformers.PreTrainedTokenizer.__call__` and :func:`transformers.PreTrainedTokenizer.encode` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`__
position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``.
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`__
head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
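The tensors documented above are typically produced directly by the tokenizer; a small sketch (checkpoint name, sentences and padding length are placeholders)::

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # Encoding a sentence pair yields input_ids, token_type_ids (0 for sentence A,
    # 1 for sentence B) and attention_mask (1 for real tokens, 0 for padding).
    encoded = tokenizer("Sentence A", "Sentence B", padding="max_length", max_length=16, return_tensors="tf")
    print(encoded["input_ids"].shape)    # (1, 16)
    print(encoded["token_type_ids"])     # 0s for sentence A tokens, 1s for sentence B tokens
    print(encoded["attention_mask"])     # 1s for real tokens, 0s for the padding positions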
......@@ -800,8 +807,10 @@ class TFBertModel(TFBertPreTrainedModel):
@add_start_docstrings(
"""Bert Model with two heads on top as done during the pre-training:
a `masked language modeling` head and a `next sentence prediction (classification)` head. """,
"""
Bert Model with two heads on top as done during the pre-training:
a `masked language modeling` head and a `next sentence prediction (classification)` head.
""",
BERT_START_DOCSTRING,
)
class TFBertForPreTraining(TFBertPreTrainedModel):
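For instance, a minimal sketch of inspecting the two pre-training heads described by :class:`TFBertForPreTrainingOutput` above (checkpoint and sentence are placeholders)::

    from transformers import BertTokenizer, TFBertForPreTraining

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TFBertForPreTraining.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
    outputs = model(inputs, return_dict=True)
    # Masked language modeling head: one score per vocabulary token and position.
    prediction_logits = outputs.prediction_logits              # (batch_size, sequence_length, vocab_size)
    # Next sentence prediction head: True/False continuation scores.
    seq_relationship_logits = outputs.seq_relationship_logits  # (batch_size, 2)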
......@@ -895,10 +904,9 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
(masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``.
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
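A sketch of how such labels can be built with the ``-100`` ignore index (the masked sentence and target token are invented for illustration)::

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="tf")
    # Keep the target id at the [MASK] position and set every other position to
    # -100 so that it is ignored by the loss.
    target_id = tokenizer.convert_tokens_to_ids("paris")
    labels = tf.where(inputs["input_ids"] == tokenizer.mask_token_id, target_id, -100)
    outputs = model(inputs, labels=labels, return_dict=True)
    loss, logits = outputs.loss, outputs.logits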
......@@ -977,8 +985,8 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the cross entropy classification loss.
Indices should be in ``[0, ..., config.vocab_size - 1]``.
Labels for computing the cross entropy classification loss. Indices should be in ``[0, ...,
config.vocab_size - 1]``.
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
......@@ -1073,8 +1081,10 @@ class TFBertForNextSentencePrediction(TFBertPreTrainedModel):
@add_start_docstrings(
"""Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of
the pooled output) e.g. for GLUE tasks. """,
"""
Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled
output) e.g. for GLUE tasks.
""",
BERT_START_DOCSTRING,
)
class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassificationLoss):
......@@ -1111,9 +1121,8 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss);
if :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
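As a small sketch of the classification case (the checkpoint, the ``num_labels`` value and the label are placeholders; ``num_labels == 1`` would switch to the regression loss instead)::

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

    inputs = tokenizer("This movie was great!", return_tensors="tf")
    outputs = model(inputs, labels=tf.constant([2]), return_dict=True)
    loss, logits = outputs.loss, outputs.logits   # logits has shape (batch_size, num_labels)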
......@@ -1156,8 +1165,10 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
@add_start_docstrings(
"""Bert Model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
"""
Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a
softmax) e.g. for RocStories/SWAG tasks.
""",
BERT_START_DOCSTRING,
)
class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
......@@ -1172,7 +1183,8 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
@property
def dummy_inputs(self):
"""Dummy inputs to build the network.
"""
Dummy inputs to build the network.
Returns:
tf.Tensor with dummy inputs
......@@ -1202,9 +1214,9 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (See :obj:`input_ids` above)
Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,
num_choices]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See
:obj:`input_ids` above)
"""
if isinstance(inputs, (tuple, list)):
input_ids = inputs[0]
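A sketch of the expected ``(batch_size, num_choices, sequence_length)`` input layout (prompt and choices are invented for illustration)::

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForMultipleChoice

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TFBertForMultipleChoice.from_pretrained("bert-base-uncased")

    prompt = "The cat sat on the"
    choices = ["mat.", "keyboard."]
    # Encode each (prompt, choice) pair, then add a leading batch dimension so
    # every tensor has shape (batch_size, num_choices, sequence_length).
    encoding = tokenizer([prompt, prompt], choices, padding=True, return_tensors="tf")
    inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}
    outputs = model(inputs, labels=tf.constant([0]), return_dict=True)
    logits = outputs.logits   # shape (batch_size, num_choices)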
......@@ -1282,8 +1294,10 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
@add_start_docstrings(
"""Bert Model with a token classification head on top (a linear layer on top of
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
"""
Bert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for
Named-Entity-Recognition (NER) tasks.
""",
BERT_START_DOCSTRING,
)
class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationLoss):
......@@ -1324,8 +1338,8 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -
1]``.
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
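A small sketch with one label per token (checkpoint, ``num_labels`` and the all-zero labels are placeholders)::

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForTokenClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TFBertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=5)

    inputs = tokenizer("HuggingFace is based in New York City", return_tensors="tf")
    labels = tf.zeros_like(inputs["input_ids"])   # one label in [0, num_labels - 1] per token
    outputs = model(inputs, labels=labels, return_dict=True)
    loss, logits = outputs.loss, outputs.logits   # logits: (batch_size, sequence_length, num_labels)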
......@@ -1366,8 +1380,10 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
@add_start_docstrings(
"""Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
"""
Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
BERT_START_DOCSTRING,
)
class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss):
......@@ -1409,12 +1425,12 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss)
r"""
start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (:obj:`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
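A sketch of passing span labels (question, context and the hard-coded positions are only illustrative)::

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForQuestionAnswering

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")

    question, context = "Who wrote the play?", "The play was written by Shakespeare."
    inputs = tokenizer(question, context, return_tensors="tf")
    # start_positions / end_positions index into the sequence; positions beyond
    # sequence_length are clamped and ignored by the loss.
    outputs = model(inputs, start_positions=tf.constant([8]), end_positions=tf.constant([8]), return_dict=True)
    loss = outputs.loss
    start_logits, end_logits = outputs.start_logits, outputs.end_logits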
......
......@@ -41,9 +41,9 @@ CAMEMBERT_START_DOCSTRING = r"""
generic methods the library implements for all its model (such as downloading or saving, resizing the input
embeddings, pruning heads etc.)
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
usage and behavior.
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass. Use
it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage
and behavior.
.. note::
......@@ -52,11 +52,11 @@ CAMEMBERT_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
This second option is useful when using the :meth:`tf.keras.Model.fit` method, which currently requires having all
the tensors in the first argument of the model call function: :obj:`model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors
in the first positional argument :
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in
the first positional argument:
- a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
......@@ -66,8 +66,9 @@ CAMEMBERT_START_DOCSTRING = r"""
Parameters:
config (:class:`~transformers.CamembertConfig`): Model configuration class with all the parameters of the
model. Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
model. Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
......@@ -77,8 +78,8 @@ CAMEMBERT_START_DOCSTRING = r"""
)
class TFCamembertModel(TFRobertaModel):
"""
This class overrides :class:`~transformers.TFRobertaModel`. Please check the
superclass for the appropriate documentation alongside usage examples.
This class overrides :class:`~transformers.TFRobertaModel`. Please check the superclass for the appropriate
documentation alongside usage examples.
"""
config_class = CamembertConfig
......@@ -90,63 +91,72 @@ class TFCamembertModel(TFRobertaModel):
)
class TFCamembertForMaskedLM(TFRobertaForMaskedLM):
"""
This class overrides :class:`~transformers.TFRobertaForMaskedLM`. Please check the
superclass for the appropriate documentation alongside usage examples.
This class overrides :class:`~transformers.TFRobertaForMaskedLM`. Please check the superclass for the appropriate
documentation alongside usage examples.
"""
config_class = CamembertConfig
@add_start_docstrings(
"""CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer
on top of the pooled output) e.g. for GLUE tasks. """,
"""
CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer on top of the
pooled output) e.g. for GLUE tasks.
""",
CAMEMBERT_START_DOCSTRING,
)
class TFCamembertForSequenceClassification(TFRobertaForSequenceClassification):
"""
This class overrides :class:`~transformers.TFRobertaForSequenceClassification`. Please check the
superclass for the appropriate documentation alongside usage examples.
This class overrides :class:`~transformers.TFRobertaForSequenceClassification`. Please check the superclass for the
appropriate documentation alongside usage examples.
"""
config_class = CamembertConfig
@add_start_docstrings(
"""CamemBERT Model with a token classification head on top (a linear layer on top of
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
"""
CamemBERT Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g.
for Named-Entity-Recognition (NER) tasks.
""",
CAMEMBERT_START_DOCSTRING,
)
class TFCamembertForTokenClassification(TFRobertaForTokenClassification):
"""
This class overrides :class:`~transformers.TFRobertaForTokenClassification`. Please check the
superclass for the appropriate documentation alongside usage examples.
This class overrides :class:`~transformers.TFRobertaForTokenClassification`. Please check the superclass for the
appropriate documentation alongside usage examples.
"""
config_class = CamembertConfig
@add_start_docstrings(
"""CamemBERT Model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
"""
CamemBERT Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a
softmax) e.g. for RocStories/SWAG tasks.
""",
CAMEMBERT_START_DOCSTRING,
)
class TFCamembertForMultipleChoice(TFRobertaForMultipleChoice):
"""
This class overrides :class:`~transformers.TFRobertaForMultipleChoice`. Please check the
superclass for the appropriate documentation alongside usage examples.
This class overrides :class:`~transformers.TFRobertaForMultipleChoice`. Please check the superclass for the
appropriate documentation alongside usage examples.
"""
config_class = CamembertConfig
@add_start_docstrings(
"""CamemBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
"""
CamemBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
CAMEMBERT_START_DOCSTRING,
)
class TFCamembertForQuestionAnswering(TFRobertaForQuestionAnswering):
"""
This class overrides :class:`~transformers.TFRobertaForQuestionAnswering`. Please check the
superclass for the appropriate documentation alongside usage examples.
This class overrides :class:`~transformers.TFRobertaForQuestionAnswering`. Please check the superclass for the
appropriate documentation alongside usage examples.
"""
config_class = CamembertConfig
......@@ -245,8 +245,8 @@ class TFCTRLMainLayer(tf.keras.layers.Layer):
raise NotImplementedError
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
"""
raise NotImplementedError
......@@ -426,8 +426,9 @@ class TFCTRLMainLayer(tf.keras.layers.Layer):
class TFCTRLPreTrainedModel(TFPreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = CTRLConfig
......@@ -440,9 +441,9 @@ CTRL_START_DOCSTRING = r"""
generic methods the library implements for all its model (such as downloading or saving, resizing the input
embeddings, pruning heads etc.)
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
usage and behavior.
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass. Use
it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage
and behavior.
.. note::
......@@ -451,11 +452,11 @@ CTRL_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
This second option is useful when using the :meth:`tf.keras.Model.fit` method, which currently requires having all
the tensors in the first argument of the model call function: :obj:`model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors
in the first positional argument :
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in
the first positional argument:
- a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
......@@ -465,8 +466,9 @@ CTRL_START_DOCSTRING = r"""
Parameters:
config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
CTRL_INPUTS_DOCSTRING = r"""
......@@ -480,40 +482,37 @@ CTRL_INPUTS_DOCSTRING = r"""
If :obj:`past` is used, only input IDs that do not have their past calculated should be passed as
``input_ids``.
Indices can be obtained using :class:`~transformers.CTRLTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.__call__` and
:meth:`transformers.PreTrainedTokenizer.encode` for details.
Indices can be obtained using :class:`~transformers.CTRLTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.__call__` and :meth:`transformers.PreTrainedTokenizer.encode` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
(see :obj:`past` output below). Can be used to speed up sequential decoding.
The token ids which have their past given to this model
should not be passed as input ids as they have already been computed.
Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see
:obj:`past` output below). Can be used to speed up sequential decoding. The token ids which have their past
given to this model should not be passed as input ids as they have already been computed.
attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`__
position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``.
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`__
head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
......@@ -523,8 +522,8 @@ CTRL_INPUTS_DOCSTRING = r"""
This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, ``past`` key value states are returned and can be used to speed up
decoding (see ``past``).
If set to :obj:`True`, ``past`` key value states are returned and can be used to speed up decoding (see
``past``).
output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
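A rough sketch of the ``past``/``use_cache`` mechanism described above (the prompt is a placeholder, and it is assumed here that the cached states are exposed as ``past_key_values`` on the returned output object)::

    import tensorflow as tf
    from transformers import CTRLTokenizer, TFCTRLLMHeadModel

    tokenizer = CTRLTokenizer.from_pretrained("ctrl")
    model = TFCTRLLMHeadModel.from_pretrained("ctrl")

    inputs = tokenizer("Links Hello, my dog is", return_tensors="tf")
    # First pass: run the full prompt and cache the key/value states.
    outputs = model(inputs, use_cache=True, return_dict=True)
    past = outputs.past_key_values
    next_token = tf.cast(tf.argmax(outputs.logits[:, -1, :], axis=-1), tf.int32)[:, None]
    # Second pass: feed only the new token together with the cached past;
    # ids whose past has already been computed must not be passed again.
    outputs = model(next_token, past=past, use_cache=True, return_dict=True)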
......@@ -580,8 +579,10 @@ class TFCTRLLMHead(tf.keras.layers.Layer):
@add_start_docstrings(
"""The CTRL Model transformer with a language modeling head on top
(linear layer with weights tied to the input embeddings). """,
"""
The CTRL Model transformer with a language modeling head on top (linear layer with weights tied to the input
embeddings).
""",
CTRL_START_DOCSTRING,
)
class TFCTRLLMHeadModel(TFCTRLPreTrainedModel, TFCausalLanguageModelingLoss):
......@@ -626,8 +627,8 @@ class TFCTRLLMHeadModel(TFCTRLPreTrainedModel, TFCausalLanguageModelingLoss):
):
r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the cross entropy classification loss.
Indices should be in ``[0, ..., config.vocab_size - 1]``.
Labels for computing the cross entropy classification loss. Indices should be in ``[0, ...,
config.vocab_size - 1]``.
"""
return_dict = return_dict if return_dict is not None else self.transformer.return_dict
if isinstance(inputs, (tuple, list)):
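A minimal sketch of supervising the language modeling head with such labels (the prompt is a placeholder; for language modeling the labels are typically just the input ids, with the shift handled by the model)::

    from transformers import CTRLTokenizer, TFCTRLLMHeadModel

    tokenizer = CTRLTokenizer.from_pretrained("ctrl")
    model = TFCTRLLMHeadModel.from_pretrained("ctrl")

    inputs = tokenizer("Links Hello, my dog is cute", return_tensors="tf")
    # The loss is the cross entropy between each position's prediction and the
    # following token of the same sequence.
    outputs = model(inputs, labels=inputs["input_ids"], return_dict=True)
    loss, logits = outputs.loss, outputs.logits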
......