Unverified Commit 011cc0be authored by Sylvain Gugger, committed by GitHub

Fix all sphynx warnings (#5068)

parent af497b56
@@ -17,7 +17,6 @@ The ``.optimization`` module provides:
~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AdamWeightDecay
-    :members:
.. autofunction:: transformers.create_optimizer
...
@@ -7,7 +7,7 @@ Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction an
There are two categories of pipeline abstractions to be aware of:
-- The :class:`~transformers.pipeline` which is the most powerful object encapsulating all other pipelines
+- The :func:`~transformers.pipeline` which is the most powerful object encapsulating all other pipelines
- The other task-specific pipelines, such as :class:`~transformers.TokenClassificationPipeline`
  or :class:`~transformers.QuestionAnsweringPipeline`
@@ -17,8 +17,7 @@ The pipeline abstraction
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated like any
other pipeline but requires an additional argument: the `task`.
-.. autoclass:: transformers.pipeline
+.. autofunction:: transformers.pipeline
-    :members:
The task specific pipelines
...
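For readers skimming this diff, the wrapper being re-documented here is used like this; a minimal sketch (the model is downloaded on first use, and the printed score is only illustrative)::

    from transformers import pipeline

    # Passing only the task name lets the factory pick a default model
    # and tokenizer for that task.
    nlp = pipeline("sentiment-analysis")
    print(nlp("Fixing the docs build is very satisfying."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]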
@@ -30,35 +30,35 @@ Instantiating one of ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer`` will di
``AutoModelForPreTraining``
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForPreTraining
    :members:
``AutoModelWithLMHead``
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelWithLMHead
    :members:
``AutoModelForSequenceClassification``
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForSequenceClassification
    :members:
``AutoModelForQuestionAnswering``
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForQuestionAnswering
    :members:
``AutoModelForTokenClassification``
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForTokenClassification
    :members:
...
Encoder Decoder Models
-----------
+------------------------
This class can wrap an encoder model, such as ``BertModel``, and a decoder model with a language modeling head, such as ``BertForMaskedLM``, into an encoder-decoder model.
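A minimal sketch of that wrapping, using the ``EncoderDecoderModel.from_encoder_decoder_pretrained`` helper (the checkpoint names are only examples)::

    from transformers import BertTokenizer, EncoderDecoderModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # Ties a BERT encoder to a BERT decoder; the decoder gets a language
    # modeling head and cross-attention so the pair forms a seq2seq model.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "bert-base-uncased", "bert-base-uncased"
    )

    input_ids = tokenizer.encode("This is a test sentence.", return_tensors="pt")
    outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)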
@@ -10,7 +10,7 @@ An application of this architecture could be *summarization* using two pretraine
``EncoderDecoderConfig``
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.EncoderDecoderConfig
    :members:
...
@@ -4,7 +4,7 @@ Reformer
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
Overview
-~~~~~
+~~~~~~~~~~
The Reformer model was presented in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
Here the abstract:
@@ -13,7 +13,7 @@ Here the abstract:
The Authors' code can be found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`_ .
Axial Positional Encodings
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~
Axial Positional Encodings were first implemented in Google's `trax library <https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29>`_ and developed by the authors of this model's paper. In models that process very long input sequences, the conventional position id encodings store an embeddings vector of size :math:`d` (the ``config.hidden_size``) for every position :math:`i, \ldots, n_s`, with :math:`n_s` being ``config.max_embedding_size``. *E.g.*, having a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a ``config.hidden_size`` of :math:`d = 2^{10} \approx 1000` would result in a position encoding matrix:
.. math::
...
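As a rough companion to the numbers in that paragraph, a hedged configuration sketch (it assumes the ``axial_pos_*`` options of ``ReformerConfig``; the shapes are chosen only to mirror the example)::

    from transformers import ReformerConfig

    # Instead of one 2**19 x 2**10 position matrix (~537M weights), axial
    # encodings factor the positions into a 512 x 1024 grid and split the
    # hidden size into two smaller embeddings.
    config = ReformerConfig(
        hidden_size=1024,
        axial_pos_embds=True,
        axial_pos_shape=[512, 1024],      # 512 * 1024 = 2**19 positions
        axial_pos_embds_dim=[256, 768],   # 256 + 768 = hidden_size
    )
    print(512 * 256 + 1024 * 768)  # ~0.9M position weights instead of ~537M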
@@ -22,10 +22,12 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-multilingual-uncased`` | | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on lower-cased text in the top 102 languages with the largest Wikipedias |
+| | | |
| | | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-multilingual-cased`` | | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased text in the top 104 languages with the largest Wikipedias |
+| | | |
| | | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-chinese`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
@@ -33,64 +35,79 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-german-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased German text by Deepset.ai |
+| | | |
| | | (see `details on deepset.ai website <https://deepset.ai/german-bert>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-large-uncased-whole-word-masking`` | | 24-layer, 1024-hidden, 16-heads, 340M parameters. |
| | | | Trained on lower-cased English text using Whole-Word-Masking |
+| | | |
| | | (see `details <https://github.com/google-research/bert/#bert>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-large-cased-whole-word-masking`` | | 24-layer, 1024-hidden, 16-heads, 340M parameters. |
| | | | Trained on cased English text using Whole-Word-Masking |
+| | | |
| | | (see `details <https://github.com/google-research/bert/#bert>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-large-uncased-whole-word-masking-finetuned-squad`` | | 24-layer, 1024-hidden, 16-heads, 340M parameters. |
| | | | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD |
+| | | |
| | | (see details of fine-tuning in the `example section <https://github.com/huggingface/transformers/tree/master/examples>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-large-cased-whole-word-masking-finetuned-squad`` | | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| | | | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD |
+| | | |
| | | (see `details of fine-tuning in the example section <https://huggingface.co/transformers/examples.html>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-cased-finetuned-mrpc`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | The ``bert-base-cased`` model fine-tuned on MRPC |
+| | | |
| | | (see `details of fine-tuning in the example section <https://huggingface.co/transformers/examples.html>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-german-dbmdz-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased German text by DBMDZ |
+| | | |
| | | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-german-dbmdz-uncased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on uncased German text by DBMDZ |
+| | | |
| | | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``cl-tohoku/bert-base-japanese`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece. |
| | | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization. |
+| | | |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``cl-tohoku/bert-base-japanese-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece. |
| | | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization. |
+| | | |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``cl-tohoku/bert-base-japanese-char`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text. Text is tokenized into characters. |
+| | | |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``cl-tohoku/bert-base-japanese-char-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters. |
+| | | |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``TurkuNLP/bert-base-finnish-cased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased Finnish text. |
+| | | |
| | | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``TurkuNLP/bert-base-finnish-uncased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on uncased Finnish text. |
+| | | |
| | | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``wietsedv/bert-base-dutch-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased Dutch text. |
+| | | |
| | | (see `details on wietsedv repository <https://github.com/wietsedv/bertje/>`__). |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| GPT | ``openai-gpt`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
@@ -149,54 +166,67 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| RoBERTa | ``roberta-base`` | | 12-layer, 768-hidden, 12-heads, 125M parameters |
| | | | RoBERTa using the BERT-base architecture |
+| | | |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``roberta-large`` | | 24-layer, 1024-hidden, 16-heads, 355M parameters |
| | | | RoBERTa using the BERT-large architecture |
+| | | |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``roberta-large-mnli`` | | 24-layer, 1024-hidden, 16-heads, 355M parameters |
| | | | ``roberta-large`` fine-tuned on `MNLI <http://www.nyu.edu/projects/bowman/multinli/>`__. |
+| | | |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilroberta-base`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
| | | | The DistilRoBERTa model distilled from the RoBERTa model `roberta-base` checkpoint. |
+| | | |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``roberta-base-openai-detector`` | | 12-layer, 768-hidden, 12-heads, 125M parameters |
| | | | ``roberta-base`` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model. |
+| | | |
| | | (see `details <https://github.com/openai/gpt-2-output-dataset/tree/master/detector>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``roberta-large-openai-detector`` | | 24-layer, 1024-hidden, 16-heads, 355M parameters |
| | | | ``roberta-large`` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model. |
+| | | |
| | | (see `details <https://github.com/openai/gpt-2-output-dataset/tree/master/detector>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| DistilBERT | ``distilbert-base-uncased`` | | 6-layer, 768-hidden, 12-heads, 66M parameters |
| | | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint |
+| | | |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-uncased-distilled-squad`` | | 6-layer, 768-hidden, 12-heads, 66M parameters |
| | | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer. |
+| | | |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-cased`` | | 6-layer, 768-hidden, 12-heads, 65M parameters |
| | | | The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint |
+| | | |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-cased-distilled-squad`` | | 6-layer, 768-hidden, 12-heads, 65M parameters |
| | | | The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint, with an additional question answering layer. |
+| | | |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilgpt2`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
| | | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint. |
+| | | |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-german-cased`` | | 6-layer, 768-hidden, 12-heads, 66M parameters |
| | | | The German DistilBERT model distilled from the German DBMDZ BERT model `bert-base-german-dbmdz-cased` checkpoint. |
+| | | |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-multilingual-cased`` | | 6-layer, 768-hidden, 12-heads, 134M parameters |
| | | | The multilingual DistilBERT model distilled from the Multilingual BERT model `bert-base-multilingual-cased` checkpoint. |
+| | | |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| CTRL | ``ctrl`` | | 48-layer, 1280-hidden, 16-heads, 1.6B parameters |
@@ -204,38 +234,47 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| CamemBERT | ``camembert-base`` | | 12-layer, 768-hidden, 12-heads, 110M parameters |
| | | | CamemBERT using the BERT-base architecture |
+| | | |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/camembert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| ALBERT | ``albert-base-v1`` | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters |
| | | | ALBERT base model |
+| | | |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-large-v1`` | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters |
| | | | ALBERT large model |
+| | | |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-xlarge-v1`` | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters |
| | | | ALBERT xlarge model |
+| | | |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-xxlarge-v1`` | | 12 repeating layer, 128 embedding, 4096-hidden, 64-heads, 223M parameters |
| | | | ALBERT xxlarge model |
+| | | |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-base-v2`` | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters |
| | | | ALBERT base model with no dropout, additional training data and longer training |
+| | | |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-large-v2`` | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters |
| | | | ALBERT large model with no dropout, additional training data and longer training |
+| | | |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-xlarge-v2`` | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters |
| | | | ALBERT xlarge model with no dropout, additional training data and longer training |
+| | | |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-xxlarge-v2`` | | 12 repeating layer, 128 embedding, 4096-hidden, 64-heads, 223M parameters |
| | | | ALBERT xxlarge model with no dropout, additional training data and longer training |
+| | | |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| T5 | ``t5-small`` | | ~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads, |
@@ -261,21 +300,26 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| FlauBERT | ``flaubert/flaubert_small_cased`` | | 6-layer, 512-hidden, 8-heads, 54M parameters |
| | | | FlauBERT small architecture |
+| | | |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert/flaubert_base_uncased`` | | 12-layer, 768-hidden, 12-heads, 137M parameters |
| | | | FlauBERT base architecture with uncased vocabulary |
+| | | |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert/flaubert_base_cased`` | | 12-layer, 768-hidden, 12-heads, 138M parameters |
| | | | FlauBERT base architecture with cased vocabulary |
+| | | |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert/flaubert_large_cased`` | | 24-layer, 1024-hidden, 16-heads, 373M parameters |
| | | | FlauBERT large architecture |
+| | | |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Bart | ``facebook/bart-large`` | | 24-layer, 1024-hidden, 16-heads, 406M parameters |
+| | | |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``facebook/bart-base`` | | 12-layer, 768-hidden, 16-heads, 139M parameters |
...
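Any shortcut name in the table above can be passed directly to ``from_pretrained``; a small sketch (the checkpoint picked here is arbitrary)::

    from transformers import AutoModel, AutoTokenizer

    # Weights and vocabulary are downloaded and cached on first use.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    input_ids = tokenizer.encode("Hello world", return_tensors="pt")
    outputs = model(input_ids)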
@@ -693,6 +693,7 @@ following array should be the output:
::
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
Summarization
----------------------------------------------------
@@ -770,6 +771,7 @@ Here Google`s T5 model is used that was only pre-trained on a multi-task mixed d
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
print(outputs)
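``generate`` returns token ids, so ``print(outputs)`` shows a tensor rather than text; a short follow-up sketch, assuming the ``tokenizer`` and ``outputs`` objects from the (partially elided) example above::

    # Convert the generated ids back into a readable summary string.
    summary = tokenizer.decode(outputs[0].numpy().tolist(), skip_special_tokens=True)
    print(summary)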
Translation
----------------------------------------------------
...
@@ -134,6 +134,7 @@ class AutoConfig:
The configuration class to instantiate is selected
based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `t5`: :class:`~transformers.T5Config` (T5 model)
- `distilbert`: :class:`~transformers.DistilBertConfig` (DistilBERT model)
- `albert`: :class:`~transformers.AlbertConfig` (ALBERT model)
...
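A quick sketch of the selection rule this docstring describes (the checkpoint name is just an example)::

    from transformers import AutoConfig, T5Config

    # The "t5" pattern in the name makes AutoConfig return a T5Config.
    config = AutoConfig.from_pretrained("t5-small")
    assert isinstance(config, T5Config)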
@@ -53,7 +53,7 @@ class T5Config(PretrainedConfig):
probabilities.
n_positions: The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
-(e.g., 512 or 1024 or 2048). `n_positions` can also be accessed via the property `max_position_embeddings'.
+(e.g., 512 or 1024 or 2048). `n_positions` can also be accessed via the property `max_position_embeddings`.
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
`T5Model`.
initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).
...
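A tiny sketch of the alias mentioned in the corrected line (the value 512 is arbitrary)::

    from transformers import T5Config

    config = T5Config(n_positions=512)
    # `max_position_embeddings` is a property that reads back `n_positions`.
    assert config.max_position_embeddings == config.n_positions == 512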
...@@ -84,6 +84,7 @@ class XLNetConfig(PretrainedConfig): ...@@ -84,6 +84,7 @@ class XLNetConfig(PretrainedConfig):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in for the multiple choice head in
:class:transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`. :class:transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`.
Is one of the following options: Is one of the following options:
- 'last' => take the last token hidden state (like XLNet) - 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert) - 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states - 'mean' => take the mean of all tokens hidden states
......
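A minimal sketch of the sequence-summary options listed above, assuming the documented argument is the configuration's ``summary_type`` attribute (the attribute name is not shown in the excerpt)::

    from transformers import XLNetConfig

    config = XLNetConfig(summary_type="mean")   # one of 'last', 'first', 'mean', ...
    print(config.summary_type)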
...@@ -83,7 +83,8 @@ class DataProcessor: ...@@ -83,7 +83,8 @@ class DataProcessor:
"""Base class for data converters for sequence classification data sets.""" """Base class for data converters for sequence classification data sets."""
def get_example_from_tensor_dict(self, tensor_dict): def get_example_from_tensor_dict(self, tensor_dict):
"""Gets an example from a dict with tensorflow tensors """Gets an example from a dict with tensorflow tensors.
Args: Args:
tensor_dict: Keys and values should match the corresponding Glue tensor_dict: Keys and values should match the corresponding Glue
tensorflow_dataset examples. tensorflow_dataset examples.
...@@ -91,15 +92,15 @@ class DataProcessor: ...@@ -91,15 +92,15 @@ class DataProcessor:
raise NotImplementedError() raise NotImplementedError()
def get_train_examples(self, data_dir): def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set.""" """Gets a collection of :class:`InputExample` for the train set."""
raise NotImplementedError() raise NotImplementedError()
def get_dev_examples(self, data_dir): def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set.""" """Gets a collection of :class:`InputExample` for the dev set."""
raise NotImplementedError() raise NotImplementedError()
def get_test_examples(self, data_dir): def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the test set.""" """Gets a collection of :class:`InputExample` for the test set."""
raise NotImplementedError() raise NotImplementedError()
def get_labels(self): def get_labels(self):
......
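For context, a hypothetical subclass of the ``DataProcessor`` interface shown above might look as follows (the TSV layout, file names and the ``_read`` helper are assumptions for illustration)::

    import csv

    from transformers import DataProcessor, InputExample


    class SentimentProcessor(DataProcessor):
        # two-label sentiment processor reading tab-separated "text<TAB>label" files

        def get_train_examples(self, data_dir):
            return self._read(f"{data_dir}/train.tsv", "train")

        def get_dev_examples(self, data_dir):
            return self._read(f"{data_dir}/dev.tsv", "dev")

        def get_test_examples(self, data_dir):
            return self._read(f"{data_dir}/test.tsv", "test")

        def get_labels(self):
            return ["negative", "positive"]

        def _read(self, path, set_type):
            examples = []
            with open(path, encoding="utf-8") as f:
                for i, (text, label) in enumerate(csv.reader(f, delimiter="\t")):
                    examples.append(InputExample(guid=f"{set_type}-{i}", text_a=text, label=label))
            return examples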
...@@ -393,6 +393,7 @@ class AutoModel: ...@@ -393,6 +393,7 @@ class AutoModel:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `t5`: :class:`~transformers.T5Model` (T5 model) - `t5`: :class:`~transformers.T5Model` (T5 model)
- `distilbert`: :class:`~transformers.DistilBertModel` (DistilBERT model) - `distilbert`: :class:`~transformers.DistilBertModel` (DistilBERT model)
- `albert`: :class:`~transformers.AlbertModel` (ALBERT model) - `albert`: :class:`~transformers.AlbertModel` (ALBERT model)
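An illustrative sketch of the resolution described above (the checkpoint name is an assumption): the returned class follows the configuration's ``model_type``, falling back to the name string::

    from transformers import AutoModel

    model = AutoModel.from_pretrained("distilbert-base-uncased")   # resolves to DistilBertModel
    print(type(model).__name__)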
...@@ -546,6 +547,7 @@ class AutoModelForPreTraining: ...@@ -546,6 +547,7 @@ class AutoModelForPreTraining:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `t5`: :class:`~transformers.T5ModelWithLMHead` (T5 model) - `t5`: :class:`~transformers.T5ModelWithLMHead` (T5 model)
- `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model) - `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
- `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model) - `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model)
...@@ -698,6 +700,7 @@ class AutoModelWithLMHead: ...@@ -698,6 +700,7 @@ class AutoModelWithLMHead:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `t5`: :class:`~transformers.T5ForConditionalGeneration` (T5 model) - `t5`: :class:`~transformers.T5ForConditionalGeneration` (T5 model)
- `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model) - `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
- `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model) - `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model)
...@@ -845,6 +848,7 @@ class AutoModelForCausalLM: ...@@ -845,6 +848,7 @@ class AutoModelForCausalLM:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `bert`: :class:`~transformers.BertLMHeadModel` (Bert model) - `bert`: :class:`~transformers.BertLMHeadModel` (Bert model)
- `openai-gpt`: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model) - `openai-gpt`: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
- `gpt2`: :class:`~transformers.GPT2LMHeadModel` (OpenAI GPT-2 model) - `gpt2`: :class:`~transformers.GPT2LMHeadModel` (OpenAI GPT-2 model)
...@@ -982,6 +986,7 @@ class AutoModelForMaskedLM: ...@@ -982,6 +986,7 @@ class AutoModelForMaskedLM:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model) - `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
- `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model) - `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model)
- `camembert`: :class:`~transformers.CamembertForMaskedLM` (CamemBERT model) - `camembert`: :class:`~transformers.CamembertForMaskedLM` (CamemBERT model)
...@@ -1118,6 +1123,7 @@ class AutoModelForSeq2SeqLM: ...@@ -1118,6 +1123,7 @@ class AutoModelForSeq2SeqLM:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `t5`: :class:`~transformers.T5ForConditionalGeneration` (T5 model) - `t5`: :class:`~transformers.T5ForConditionalGeneration` (T5 model)
- `bart`: :class:`~transformers.BartForConditionalGeneration` (Bart model) - `bart`: :class:`~transformers.BartForConditionalGeneration` (Bart model)
- `marian`: :class:`~transformers.MarianMTModel` (Marian model) - `marian`: :class:`~transformers.MarianMTModel` (Marian model)
...@@ -1256,6 +1262,7 @@ class AutoModelForSequenceClassification: ...@@ -1256,6 +1262,7 @@ class AutoModelForSequenceClassification:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `distilbert`: :class:`~transformers.DistilBertForSequenceClassification` (DistilBERT model) - `distilbert`: :class:`~transformers.DistilBertForSequenceClassification` (DistilBERT model)
- `albert`: :class:`~transformers.AlbertForSequenceClassification` (ALBERT model) - `albert`: :class:`~transformers.AlbertForSequenceClassification` (ALBERT model)
- `camembert`: :class:`~transformers.CamembertForSequenceClassification` (CamemBERT model) - `camembert`: :class:`~transformers.CamembertForSequenceClassification` (CamemBERT model)
...@@ -1402,6 +1409,7 @@ class AutoModelForQuestionAnswering: ...@@ -1402,6 +1409,7 @@ class AutoModelForQuestionAnswering:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `distilbert`: :class:`~transformers.DistilBertForQuestionAnswering` (DistilBERT model) - `distilbert`: :class:`~transformers.DistilBertForQuestionAnswering` (DistilBERT model)
- `albert`: :class:`~transformers.AlbertForQuestionAnswering` (ALBERT model) - `albert`: :class:`~transformers.AlbertForQuestionAnswering` (ALBERT model)
- `bert`: :class:`~transformers.BertForQuestionAnswering` (Bert model) - `bert`: :class:`~transformers.BertForQuestionAnswering` (Bert model)
...@@ -1547,6 +1555,7 @@ class AutoModelForTokenClassification: ...@@ -1547,6 +1555,7 @@ class AutoModelForTokenClassification:
The `from_pretrained()` method takes care of returning the correct model class instance The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing, based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string: falling back to using pattern matching on the `pretrained_model_name_or_path` string:
- `distilbert`: :class:`~transformers.DistilBertForTokenClassification` (DistilBERT model) - `distilbert`: :class:`~transformers.DistilBertForTokenClassification` (DistilBERT model)
- `xlm`: :class:`~transformers.XLMForTokenClassification` (XLM model) - `xlm`: :class:`~transformers.XLMForTokenClassification` (XLM model)
- `xlm-roberta`: :class:`~transformers.XLMRobertaForTokenClassification` (XLM-RoBERTa model) - `xlm-roberta`: :class:`~transformers.XLMRobertaForTokenClassification` (XLM-RoBERTa model)
......
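The task-specific Auto classes above resolve the architecture the same way but attach the corresponding head. A hedged sketch for sequence classification (checkpoint name, label count and example sentence are assumptions; the head is freshly initialized and would normally be fine-tuned)::

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

    inputs = tokenizer.encode("This library is great!", return_tensors="pt")
    logits = model(inputs)[0]   # first element of the returned tuple holds the classification logits
    print(logits.shape)         # torch.Size([1, 2])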
...@@ -745,9 +745,10 @@ class ElectraForTokenClassification(ElectraPreTrainedModel): ...@@ -745,9 +745,10 @@ class ElectraForTokenClassification(ElectraPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
"""ELECTRA Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of """
the hidden-states output to compute `span start logits` and `span end logits`). """, ELECTRA Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
ELECTRA_INPUTS_DOCSTRING, layers on top of the hidden-states output to compute `span start logits` and `span end logits`).""",
ELECTRA_START_DOCSTRING,
) )
class ElectraForQuestionAnswering(ElectraPreTrainedModel): class ElectraForQuestionAnswering(ElectraPreTrainedModel):
config_class = ElectraConfig config_class = ElectraConfig
......
...@@ -435,7 +435,7 @@ class LongformerSelfAttention(nn.Module): ...@@ -435,7 +435,7 @@ class LongformerSelfAttention(nn.Module):
LONGFORMER_START_DOCSTRING = r""" LONGFORMER_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ sub-class.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior. usage and behavior.
...@@ -467,7 +467,7 @@ LONGFORMER_INPUTS_DOCSTRING = r""" ...@@ -467,7 +467,7 @@ LONGFORMER_INPUTS_DOCSTRING = r"""
Tokens with global attention attend to all other tokens, and all other tokens attend to them. This is important for Tokens with global attention attend to all other tokens, and all other tokens attend to them. This is important for
task-specific finetuning because it makes the model more flexible at representing the task. For example, task-specific finetuning because it makes the model more flexible at representing the task. For example,
for classification, the <s> token should be given global attention. For QA, all question tokens should also have for classification, the <s> token should be given global attention. For QA, all question tokens should also have
global attention. Please refer to the Longformer paper https://arxiv.org/abs/2004.05150 for more details. global attention. Please refer to the `Longformer paper <https://arxiv.org/abs/2004.05150>`__ for more details.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``0`` for local attention (a sliding window attention), ``0`` for local attention (a sliding window attention),
``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them). ``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).
...@@ -500,7 +500,7 @@ class LongformerModel(RobertaModel): ...@@ -500,7 +500,7 @@ class LongformerModel(RobertaModel):
""" """
This class overrides :class:`~transformers.RobertaModel` to provide the ability to process This class overrides :class:`~transformers.RobertaModel` to provide the ability to process
long sequences following the self-attention approach described in `Longformer: the Long-Document Transformer long sequences following the self-attention approach described in `Longformer: the Long-Document Transformer
<https://arxiv.org/abs/2004.05150>`_ by Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer self-attention <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer self-attention
combines a local (sliding window) and global attention to extend to long documents without the O(n^2) increase in combines a local (sliding window) and global attention to extend to long documents without the O(n^2) increase in
memory and compute. memory and compute.
......
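To make the local/global mask described above concrete, a rough sketch (the ``global_attention_mask`` argument name and the ``allenai/longformer-base-4096`` checkpoint are assumptions based on this docstring, not part of the diff)::

    import torch
    from transformers import LongformerModel, LongformerTokenizer

    tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    input_ids = tokenizer.encode("A very long document ...", return_tensors="pt")
    global_attention_mask = torch.zeros_like(input_ids)  # 0 = local (sliding window) attention everywhere
    global_attention_mask[:, 0] = 1                      # e.g. give the <s> token global attention for classification

    sequence_output = model(input_ids, global_attention_mask=global_attention_mask)[0]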
...@@ -1451,14 +1451,10 @@ class ReformerPreTrainedModel(PreTrainedModel): ...@@ -1451,14 +1451,10 @@ class ReformerPreTrainedModel(PreTrainedModel):
REFORMER_START_DOCSTRING = r""" REFORMER_START_DOCSTRING = r"""
Reformer was proposed in Reformer was proposed in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`__
`Reformer: The Efficient Transformer`_
by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
.. _`Reformer: The Efficient Transformer`: This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ sub-class.
https://arxiv.org/abs/2001.04451
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior. usage and behavior.
......
...@@ -775,19 +775,14 @@ class T5Stack(T5PreTrainedModel): ...@@ -775,19 +775,14 @@ class T5Stack(T5PreTrainedModel):
return outputs # last-layer hidden state, (presents,) (all hidden states), (all attentions) return outputs # last-layer hidden state, (presents,) (all hidden states), (all attentions)
T5_START_DOCSTRING = r""" The T5 model was proposed in T5_START_DOCSTRING = r"""
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_ The T5 model was proposed in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting. It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.
This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#module>`__ sub-class. Use it as a
refer to the PyTorch documentation for all matter related to general usage and behavior. regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
.. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:
https://arxiv.org/abs/1910.10683
.. _`torch.nn.Module`:
https://pytorch.org/docs/stable/nn.html#module
Parameters: Parameters:
config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model. config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
...@@ -804,7 +799,7 @@ T5_INPUTS_DOCSTRING = r""" ...@@ -804,7 +799,7 @@ T5_INPUTS_DOCSTRING = r"""
See :func:`transformers.PreTrainedTokenizer.encode` and See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
To know more on how to prepare :obj:`input_ids` for pre-training take a look at To know more on how to prepare :obj:`input_ids` for pre-training take a look at
`T5 Training <./t5.html#training>`_ . `T5 Training <./t5.html#training>`__.
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
...@@ -817,7 +812,7 @@ T5_INPUTS_DOCSTRING = r""" ...@@ -817,7 +812,7 @@ T5_INPUTS_DOCSTRING = r"""
Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation. Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`). If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).
To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at
`T5 Training <./t5.html#training>`_ . `T5 Training <./t5.html#training>`__.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`): decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
...@@ -902,8 +897,8 @@ class T5Model(T5PreTrainedModel): ...@@ -902,8 +897,8 @@ class T5Model(T5PreTrainedModel):
output_attentions=None, output_attentions=None,
): ):
r""" r"""
Return: Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs. :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs:
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model. Sequence of hidden-states at the output of the last layer of the model.
If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output. If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.
...@@ -1038,7 +1033,7 @@ class T5ForConditionalGeneration(T5PreTrainedModel): ...@@ -1038,7 +1033,7 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
Used to hide legacy arguments that have been deprecated. Used to hide legacy arguments that have been deprecated.
Returns: Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs. :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided): loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
Classification loss (cross entropy). Classification loss (cross entropy).
prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`) prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
......
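As a usage sketch for the inputs documented above (checkpoint and prompt are assumptions; ``generate`` takes care of the decoder start token, so ``decoder_input_ids`` need not be passed explicitly)::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    input_ids = tokenizer.encode("translate English to German: The house is wonderful.", return_tensors="pt")
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))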
...@@ -408,12 +408,11 @@ class TFElectraModel(TFElectraPreTrainedModel): ...@@ -408,12 +408,11 @@ class TFElectraModel(TFElectraPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """Electra model with a binary classification head on top as used during pre-training for identifying generated
Electra model with a binary classification head on top as used during pre-training for identifying generated tokens.
tokens.
Even though both the discriminator and generator may be loaded into this model, the discriminator is Even though both the discriminator and generator may be loaded into this model, the discriminator is
the only model of the two to have the correct classification head to be used for this model.""", the only model of the two to have the correct classification head to be used for this model.""",
ELECTRA_START_DOCSTRING, ELECTRA_START_DOCSTRING,
) )
class TFElectraForPreTraining(TFElectraPreTrainedModel): class TFElectraForPreTraining(TFElectraPreTrainedModel):
...@@ -501,11 +500,10 @@ class TFElectraMaskedLMHead(tf.keras.layers.Layer): ...@@ -501,11 +500,10 @@ class TFElectraMaskedLMHead(tf.keras.layers.Layer):
@add_start_docstrings( @add_start_docstrings(
""" """Electra model with a language modeling head on top.
Electra model with a language modeling head on top.
Even though both the discriminator and generator may be loaded into this model, the generator is Even though both the discriminator and generator may be loaded into this model, the generator is
the only model of the two to have been trained for the masked language modeling task.""", the only model of the two to have been trained for the masked language modeling task.""",
ELECTRA_START_DOCSTRING, ELECTRA_START_DOCSTRING,
) )
class TFElectraForMaskedLM(TFElectraPreTrainedModel): class TFElectraForMaskedLM(TFElectraPreTrainedModel):
...@@ -588,10 +586,9 @@ class TFElectraForMaskedLM(TFElectraPreTrainedModel): ...@@ -588,10 +586,9 @@ class TFElectraForMaskedLM(TFElectraPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """Electra model with a token classification head on top.
Electra model with a token classification head on top.
Both the discriminator and generator may be loaded into this model.""", Both the discriminator and generator may be loaded into this model.""",
ELECTRA_START_DOCSTRING, ELECTRA_START_DOCSTRING,
) )
class TFElectraForTokenClassification(TFElectraPreTrainedModel, TFTokenClassificationLoss): class TFElectraForTokenClassification(TFElectraPreTrainedModel, TFTokenClassificationLoss):
......
...@@ -772,19 +772,15 @@ class TFT5PreTrainedModel(TFPreTrainedModel): ...@@ -772,19 +772,15 @@ class TFT5PreTrainedModel(TFPreTrainedModel):
return dummy_inputs return dummy_inputs
T5_START_DOCSTRING = r""" The T5 model was proposed in T5_START_DOCSTRING = r"""
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_ The T5 model was proposed in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting. It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.
This model is a tf.keras.Model `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and This model is a `tf.keras.Model <https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model>`__
refer to the TF 2.0 documentation for all matter related to general usage and behavior. sub-class. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to
general usage and behavior.
.. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:
https://arxiv.org/abs/1910.10683
.. _`tf.keras.Model`:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model
Note on the model inputs: Note on the model inputs:
TF 2.0 models accept two formats as inputs: TF 2.0 models accept two formats as inputs:
...@@ -796,7 +792,7 @@ T5_START_DOCSTRING = r""" The T5 model was proposed in ...@@ -796,7 +792,7 @@ T5_START_DOCSTRING = r""" The T5 model was proposed in
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument: If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
- a single Tensor with inputs only and nothing else: `model(inputs_ids) - a single Tensor with inputs only and nothing else: `model(inputs_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
`model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])` `model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])`
- a dictionary with one or several input Tensors associated with the input names given in the docstring: - a dictionary with one or several input Tensors associated with the input names given in the docstring:
...@@ -818,7 +814,7 @@ T5_INPUTS_DOCSTRING = r""" ...@@ -818,7 +814,7 @@ T5_INPUTS_DOCSTRING = r"""
the right or the left. the right or the left.
Indices can be obtained using :class:`transformers.T5Tokenizer`. Indices can be obtained using :class:`transformers.T5Tokenizer`.
To know more on how to prepare :obj:`inputs` for pre-training take a look at To know more on how to prepare :obj:`inputs` for pre-training take a look at
`T5 Training <./t5.html#training>`_ . `T5 Training <./t5.html#training>`__.
See :func:`transformers.PreTrainedTokenizer.encode` and See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`): decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
...@@ -850,7 +846,7 @@ T5_INPUTS_DOCSTRING = r""" ...@@ -850,7 +846,7 @@ T5_INPUTS_DOCSTRING = r"""
This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
than the model's internal embedding lookup matrix. than the model's internal embedding lookup matrix.
To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at
`T5 Training <./t5.html#training>`_ . `T5 Training <./t5.html#training>`__.
head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`): head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
...@@ -897,8 +893,8 @@ class TFT5Model(TFT5PreTrainedModel): ...@@ -897,8 +893,8 @@ class TFT5Model(TFT5PreTrainedModel):
@add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
def call(self, inputs, **kwargs): def call(self, inputs, **kwargs):
r""" r"""
Return: Returns:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs. :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs:
last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model. Sequence of hidden-states at the output of the last layer of the model.
If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output. If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.
...@@ -1024,8 +1020,8 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel): ...@@ -1024,8 +1020,8 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
@add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
def call(self, inputs, **kwargs): def call(self, inputs, **kwargs):
r""" r"""
Return: Returns:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs. :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs:
loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided): loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):
Classification loss (cross entropy). Classification loss (cross entropy).
prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`) prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
......
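A hedged sketch of the TF 2.0 input formats described in the note above (checkpoint and sentences are assumptions; parameter names follow the docstring excerpt and may differ between versions)::

    from transformers import T5Tokenizer, TFT5Model

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = TFT5Model.from_pretrained("t5-small")

    input_ids = tokenizer.encode("Studies have shown that owning a dog is good for you.", return_tensors="tf")
    decoder_input_ids = tokenizer.encode("Studies show that", return_tensors="tf")

    # first positional argument as a single Tensor, remaining inputs as keyword arguments;
    # a list or a dictionary keyed by the documented input names is also accepted
    outputs = model(input_ids, decoder_input_ids=decoder_input_ids)
    last_hidden_state = outputs[0]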
...@@ -294,7 +294,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin): ...@@ -294,7 +294,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
Parameters: Parameters:
pretrained_model_name_or_path: either: pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``. - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
- a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``. - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
- a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``. - a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
...@@ -306,8 +305,8 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin): ...@@ -306,8 +305,8 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
config: (`optional`) one of: config: (`optional`) one of:
- an instance of a class derived from :class:`~transformers.PretrainedConfig`, or - an instance of a class derived from :class:`~transformers.PretrainedConfig`, or
- a string valid as input to :func:`~transformers.PretrainedConfig.from_pretrained()` - a string valid as input to :func:`~transformers.PretrainedConfig.from_pretrained()`
Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:
Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:
- the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or
- the model was saved using :func:`~transformers.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory. - the model was saved using :func:`~transformers.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.
- the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory. - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.
......
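For orientation, a sketch of the ``pretrained_model_name_or_path`` variants and the optional ``config`` argument listed above, using a concrete TF class (class, checkpoint and directory names are placeholders)::

    from transformers import BertConfig, TFBertModel

    # 1. shortcut name of a pre-trained model
    model = TFBertModel.from_pretrained("bert-base-uncased")

    # 2. user-uploaded identifiers such as ``dbmdz/bert-base-german-cased`` work the same way

    # 3. a local directory produced by save_pretrained
    model.save_pretrained("./my_model_directory/")
    model = TFBertModel.from_pretrained("./my_model_directory/")

    # an explicit config replaces the automatically loaded one
    config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
    model = TFBertModel.from_pretrained("bert-base-uncased", config=config)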