Unverified Commit 3323146e authored by Sylvain Gugger, committed by GitHub

Models doc (#7345)



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FlauBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Pegasus
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__ and assign
@sshleifer.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for
Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`_ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides ``pegasus-large``, from which the other checkpoints are fine-tuned.
- Each checkpoint is 2.2 GB on disk and 568M parameters.
- FP16 is not supported (help/ideas on this appreciated!).
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from ``BartForConditionalGeneration``
Usage Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
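A minimal end-to-end sketch of the flow behind this check, assuming the ``google/pegasus-xsum`` checkpoint and a
shortened, illustrative source text:

.. code-block:: python

    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    model_name = "google/pegasus-xsum"
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)

    src_text = ["PG&E scheduled the blackouts in response to forecasts for high winds."]
    # Tokenize the source text and summarize it with the checkpoint's default generation settings.
    batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt")
    summary_ids = model.generate(**batch)
    tgt_text = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
    print(tgt_text[0])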
PegasusForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This class inherits all functionality from ``BartForConditionalGeneration``; see that page for method signatures.
Available models are listed at the `Model List <https://huggingface.co/models?search=pegasus>`__.
PegasusConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This config fully inherits from ``BartConfig``, but Pegasus uses different default values:
Up-to-date parameter values can be seen in `S3 <https://s3.amazonaws.com/models.huggingface.co/bert/google/pegasus-xsum/config.json>`__.
As of Aug 10, 2020, they are:
PegasusTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Warning: ``add_tokens`` does not work at the moment.
.. autoclass:: transformers.PegasusTokenizer
Reformer
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Reformer model was proposed in the paper `Reformer: The Efficient Transformer
<https://arxiv.org/abs/2001.04451.pdf>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
The abstract from the paper is the following:
*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can
be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of
Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its
complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual
layers instead of the standard residuals, which allows storing activations only once in the training process instead of
N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models
while being much more memory-efficient and much faster on long sequences.*
The Authors' code can be found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`__.
Axial Positional Encodings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Axial Positional Encodings were first implemented in Google's `trax library
<https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29>`__
and developed by the authors of this model's paper. In models that process very long input sequences, the
conventional position id encodings store an embedding vector of size :math:`d` (the :obj:`config.hidden_size`) for
every position :math:`i, \ldots, n_s`, with :math:`n_s` being :obj:`config.max_embedding_size`. This means that having
a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a :obj:`config.hidden_size` of :math:`d = 2^{10} \approx 1000`
would result in a position encoding matrix:
.. math::
    X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]
Therefore the following holds:

.. math::
    X_{i,j} = \begin{cases}
        X^{1}_{i, k}, & \text{if }\ i < d^1 \text{ with } k = j \mod n_s^1 \\
        X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor
    \end{cases}
Intuitively, this means that a position embedding vector :math:`x_j \in \mathbb{R}^{d}` is now the composition of two
factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where the :obj:`config.max_embedding_size` dimension
:math:`j` is factorized into :math:`k \text{ and } l`. This design ensures that each position embedding vector
:math:`x_j` is unique.
Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}`
can drastically reduce the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.
In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` whose sum has to be
equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` whose
product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the
`sequence length` of the :obj:`input_ids`.
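A hedged configuration sketch with illustrative values (argument names assumed to match
:class:`~transformers.ReformerConfig`):

.. code-block:: python

    from transformers import ReformerConfig

    # d^1 + d^2 must equal hidden_size and n_s^1 * n_s^2 must equal the training sequence length.
    config = ReformerConfig(
        hidden_size=256,
        axial_pos_embds=True,
        axial_pos_embds_dim=[64, 192],    # (d^1, d^2), sums to hidden_size
        axial_pos_shape=[64, 256],        # (n_s^1, n_s^2), product is the sequence length (16384 here)
        max_position_embeddings=64 * 256,
    )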
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In locality sensitive hashing (LSH) self attention, the key and query projection weights are tied. Therefore, the key
query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
`Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key
query embedding vectors to one of :obj:`config.num_buckets` possible buckets. The premise is that the more "similar"
key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to
the same bucket.
The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument
:obj:`num_hashes` of the forward function so that the output of the LSH self attention better approximates the output
of the "normal" full self attention. The buckets are then sorted and chunked into query key embedding vector chunks
each of length :obj:`config.lsh_chunk_length`. For each chunk, the query embedding vectors attend to its key vectors
(which are tied to themselves) and to the key embedding vectors of :obj:`config.lsh_num_chunks_before` previous
neighboring chunks and :obj:`config.lsh_num_chunks_after` following neighboring chunks.
For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post
<https://www.pragmatic.ml/reformer-deep-dive/>`__.
Note that :obj:`config.num_buckets` can also be factorized into a list
:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to
one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of
:math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`.
This is crucial for very long sequences to save memory.
When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the
sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be
saved in the config and should be reused for inference.
Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from
:math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory
and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
Local Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Local self attention is essentially a "normal" self attention layer with key, query and value projections, but is
chunked so that in each chunk of length :obj:`config.local_chunk_length` the query embedding vectors only attend to
the key embedding vectors in their chunk and to the key embedding vectors of :obj:`config.local_num_chunks_before`
previous neighboring chunks and :obj:`config.local_num_chunks_after` following neighboring chunks.
Using Local self attention, the memory and time complexity of the query-key matmul operation can be reduced from
:math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory
and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
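Both attention flavours are selected per layer through the configuration. A hedged sketch with illustrative values;
the keyword arguments are assumed to follow :class:`~transformers.ReformerConfig`, where the chunk-length parameters
carry an ``attn`` infix (e.g. ``lsh_attn_chunk_length`` for the ``lsh_chunk_length`` mentioned above):

.. code-block:: python

    from transformers import ReformerConfig

    # Alternate local and LSH layers; the hashing and chunking arguments below map to
    # the quantities discussed in the two sections above.
    config = ReformerConfig(
        attn_layers=["local", "lsh", "local", "lsh", "local", "lsh"],
        num_hashes=4,                  # more hashing rounds -> closer to full attention
        num_buckets=None,              # inferred from the sequence length on the fly
        lsh_attn_chunk_length=64,
        lsh_num_chunks_before=1,
        lsh_num_chunks_after=0,
        local_attn_chunk_length=64,
        local_num_chunks_before=1,
        local_num_chunks_after=0,
    )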
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
During training, we must ensure that the sequence length is set to a value that can be divided by the least common
multiple of :obj:`config.lsh_chunk_length` and :obj:`config.local_chunk_length` and that the parameters of the Axial
Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can
easily be trained on sequences as long as 64000 tokens.
For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows:
.. code-block::

    input_ids = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
    loss = model(input_ids, labels=input_ids)[0]
ReformerConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerConfig
    :members:


ReformerTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerTokenizer
    :members: save_vocabulary


ReformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerModel
    :members: forward


ReformerModelWithLMHead
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerModelWithLMHead
    :members: forward


ReformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerForMaskedLM
    :members: forward


ReformerForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerForSequenceClassification
    :members: forward


ReformerForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerForQuestionAnswering
    :members: forward
RetriBERT
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The RetriBERT model was proposed in the blog post `Explain Anything Like I'm Five: A Model for Open Domain Long Form
Question Answering <https://yjernite.github.io/lfqa.html>`__. RetriBERT is a small model that uses either a single or
pair of BERT encoders with lower-dimension projection for dense semantic indexing of text.
Code to train and use the model can be found `here
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
RetriBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RetriBertConfig
    :members:


RetriBertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RetriBertTokenizer
    :members:


RetriBertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RetriBertTokenizerFast
    :members:


RetriBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RetriBertModel
    :members: forward
RoBERTa
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretraining Approach
<https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
objective and training with much larger mini-batches and learning rates.
Tips:
- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
setup for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
different pretraining scheme.
- RoBERTa doesn't have :obj:`token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
  separate your segments with the separation token :obj:`tokenizer.sep_token` (or :obj:`</s>`), as in the sketch below.
- :doc:`CamemBERT <camembert>` is a wrapper around RoBERTa. Refer to this page for usage examples.
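A hedged sketch of that segment handling, assuming the ``roberta-base`` checkpoint:

.. code-block:: python

    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

    # Two segments are joined with separator tokens rather than distinguished by token_type_ids.
    encoded = tokenizer("A first segment.", "A second segment.")
    print(tokenizer.decode(encoded["input_ids"]))
    # roughly: <s>A first segment.</s></s>A second segment.</s>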
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_.
RobertaConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaConfig
    :members:


RobertaTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
RobertaTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaTokenizerFast
    :members: build_inputs_with_special_tokens


RobertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaModel
    :members: forward


RobertaForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaForCausalLM
    :members: forward


RobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaForMaskedLM
    :members: forward


RobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaForSequenceClassification
    :members: forward


RobertaForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaForMultipleChoice
    :members: forward


RobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaForTokenClassification
    :members: forward


RobertaForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RobertaForQuestionAnswering
    :members: forward


TFRobertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFRobertaModel
    :members: call


TFRobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFRobertaForMaskedLM
    :members: call


TFRobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFRobertaForSequenceClassification
    :members: call


TFRobertaForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFRobertaForMultipleChoice
    :members: call


TFRobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFRobertaForTokenClassification
    :members: call


TFRobertaForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFRobertaForQuestionAnswering
    :members: call
T5
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
<https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
The abstract from the paper is the following:
*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream
task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning
has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of
transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a
text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer
approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration
with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering
summarization, question answering, text classification, and more. To facilitate future work on transfer learning for
NLP, we release our dataset, pre-trained models, and code.*
Tips:
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which
each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
for summarization: *summarize: ...*.
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper
<https://arxiv.org/pdf/1910.10683.pdf>`__.
- For sequence-to-sequence generation, it is recommended to use :obj:`T5ForConditionalGeneration.generate()`. This
  method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively
  generates the decoder output (see the sketch after these tips).
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
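A hedged sketch of this prefix-based usage with :obj:`generate()`, assuming the ``t5-small`` checkpoint and an
illustrative sentence:

.. code-block:: python

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The task is selected purely through the text prefix.
    input_ids = tokenizer.encode("translate English to German: The house is wonderful.", return_tensors="pt")
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))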
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher
forcing. This means that for training we always need an input sequence and a target sequence. The input sequence is fed
to the model using :obj:`input_ids`. The target sequence is shifted to the right, i.e., prepended by a start-sequence
token and fed to the decoder using the :obj:`decoder_input_ids`. In teacher-forcing style, the target sequence is then
appended by the EOS token and corresponds to the :obj:`labels`. The PAD token is hereby used as the start-sequence
token. T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
- Unsupervised denoising training
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens)
and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens.
Each sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`,
:obj:`<extra_id_1>`, ... up to :obj:`<extra_id_99>`. As a default, 100 sentinel tokens are available in
:class:`~transformers.T5Tokenizer`.
For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
processed as follows:
.. code-block::

    input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
    labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
- Supervised training
In this setup the input sequence and output sequence form a standard sequence-to-sequence input-output mapping.
In translation, for instance, with the input sequence "The house is wonderful." and output sequence "Das Haus ist
wunderbar.", the sentences should be processed as follows:
.. code-block::

    input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
    labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
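A hedged sketch completing this training step, assuming ``model`` and ``tokenizer`` come from the ``t5-small``
checkpoint:

.. code-block:: python

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
    labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')

    # The model builds decoder_input_ids by shifting the labels to the right internally
    # and returns the cross-entropy loss as its first output.
    loss = model(input_ids=input_ids, labels=labels)[0]
    loss.backward()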
T5Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Config
    :members:


T5Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Tokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, prepare_seq2seq_batch, save_vocabulary


T5Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Model
    :members: forward


T5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5ForConditionalGeneration
    :members: forward


TFT5Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFT5Model
    :members: call


TFT5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFT5ForConditionalGeneration
    :members: call
Transformer XL
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Transformer-XL model was proposed in `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
<https://arxiv.org/abs/1901.02860>`__ by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan
Salakhutdinov. It's a causal (uni-directional) transformer with relative positioning (sinusoïdal) embeddings which can
reuse previously computed hidden-states to attend to longer context (memory). This model also uses adaptive softmax
inputs and outputs (tied).
The abstract from the paper is the following:
Tips:
The original implementation trains on SQuAD with padding on the left, so the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.
TransfoXLConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TransfoXLConfig
    :members:


TransfoXLTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TransfoXLTokenizer
    :members: save_vocabulary


TransfoXLTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TransfoXLTokenizerFast
    :members:


TransfoXL specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.modeling_transfo_xl.TransfoXLModelOutput
    :members:
TransfoXLModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TransfoXLModel
    :members: forward


TransfoXLLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TransfoXLLMHeadModel
    :members: forward


TFTransfoXLModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFTransfoXLModel
    :members: call


TFTransfoXLLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFTransfoXLLMHeadModel
    :members: call
XLM
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The XLM model was proposed in `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`__ by
Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the following objectives:
- a causal language modeling (CLM) objective (next token prediction),
- a masked language modeling (MLM) objective (BERT-like), or
- a Translation Language Modeling (TLM) objective (an extension of BERT's MLM to multiple language inputs)
The abstract from the paper is the following:
Tips:
- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
  select the correct objective for your task (e.g., MLM checkpoints are not suitable for generation); the sketch after
  these tips shows picking a checkpoint that matches the objective.
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the
:doc:`multi-lingual <../multilingual>` page for more information.
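A hedged sketch of picking a checkpoint that matches the training objective (checkpoint names assumed):

.. code-block:: python

    from transformers import XLMTokenizer, XLMWithLMHeadModel

    # An MLM checkpoint; a CLM checkpoint such as "xlm-clm-enfr-1024" would be the
    # one to pick for generation tasks.
    tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-en-2048")
    model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-en-2048")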
The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.
XLMConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMConfig
    :members:


XLMTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
XLM specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.modeling_xlm.XLMForQuestionAnsweringOutput
    :members:


XLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMModel
    :members: forward


XLMWithLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMWithLMHeadModel
    :members: forward


XLMForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMForSequenceClassification
    :members: forward


XLMForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMForMultipleChoice
    :members: forward


XLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMForTokenClassification
    :members: forward


XLMForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMForQuestionAnsweringSimple
    :members: forward


XLMForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMForQuestionAnswering
    :members: forward


TFXLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMModel
    :members: call


TFXLMWithLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMWithLMHeadModel
    :members: call


TFXLMForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMForSequenceClassification
    :members: call


TFXLMForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMForMultipleChoice
    :members: call


TFXLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMForTokenClassification
    :members: call


TFXLMForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMForQuestionAnsweringSimple
    :members: call
XLM-RoBERTa
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The XLM-RoBERTa model was proposed in `Unsupervised Cross-lingual Representation Learning at Scale
<https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's
RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl
data.
The abstract from the paper is the following:
Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct
language from the input ids.
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage
examples as well as the information relative to the inputs and outputs.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.
XLMRobertaConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaConfig
    :members:


XLMRobertaTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
XLMRobertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaModel
    :members: forward


XLMRobertaForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaForCausalLM
    :members: forward


XLMRobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaForMaskedLM
    :members: forward


XLMRobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaForSequenceClassification
    :members: forward


XLMRobertaForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaForMultipleChoice
    :members: forward


XLMRobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaForTokenClassification
    :members: forward


XLMRobertaForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.XLMRobertaForQuestionAnswering
    :members: forward


TFXLMRobertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMRobertaModel
    :members: call


TFXLMRobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMRobertaForMaskedLM
    :members: call


TFXLMRobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMRobertaForSequenceClassification
    :members: call


TFXLMRobertaForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMRobertaForMultipleChoice
    :members: call


TFXLMRobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMRobertaForTokenClassification
    :members: call


TFXLMRobertaForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFXLMRobertaForQuestionAnswering
    :members: call
XLNet
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The XLNet model was proposed in `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_
by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method
to learn bidirectional contexts by maximizing the expected likelihood over all permutations
of the input sequence factorization order.
The XLNet model was proposed in `XLNet: Generalized Autoregressive Pretraining for Language Understanding
<https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov,
Quoc V. Le. XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn
bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization
order.
The abstract from the paper is the following:
......@@ -24,26 +24,26 @@ a large margin, including question answering, natural language inference, sentim
Tips:
- The specific attention pattern can be controlled at training and test time using the `perm_mask` input.
- Due to the difficulty of training a fully auto-regressive model over various factorization order,
XLNet is pretrained using only a sub-set of the output tokens as target which are selected
with the `target_mapping` input.
- To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and
`target_mapping` inputs to control the attention span and outputs (see examples in `examples/text-generation/run_generation.py`)
- The specific attention pattern can be controlled at training and test time using the :obj:`perm_mask` input.
- Due to the difficulty of training a fully auto-regressive model over various factorization orders, XLNet is pretrained
  using only a sub-set of the output tokens as targets, which are selected with the :obj:`target_mapping` input.
- To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the :obj:`perm_mask` and
:obj:`target_mapping` inputs to control the attention span and outputs (see examples in
`examples/text-generation/run_generation.py`)
- XLNet is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/zihangdai/xlnet/>`_.
The original code can be found `here <https://github.com/zihangdai/xlnet/>`__.
XLNetConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetConfig
:members:
XLNetTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
......@@ -51,7 +51,7 @@ XLNetTokenizer
XLNet specific outputs
~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_xlnet.XLNetModelOutput
:members:
......@@ -94,91 +94,91 @@ XLNet specific outputs
XLNetModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetModel
:members:
:members: forward
XLNetLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetLMHeadModel
:members:
:members: forward
XLNetForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForSequenceClassification
:members:
:members: forward
XLNetForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForMultipleChoice
:members:
:members: forward
XLNetForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForTokenClassification
:members:
:members: forward
XLNetForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForQuestionAnsweringSimple
:members:
:members: forward
XLNetForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForQuestionAnswering
:members:
:members: forward
TFXLNetModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetModel
:members:
:members: call
TFXLNetLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetLMHeadModel
:members:
:members: call
TFXLNetForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetForSequenceClassification
:members:
:members: call
TFXLNetForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetForMultipleChoice
:members:
:members: call
TFXLNetForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetForTokenClassification
:members:
:members: call
TFXLNetForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetForQuestionAnsweringSimple
:members:
:members: call
Model sharing and uploading
===========================
In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
the `model hub <https://huggingface.co/models>`__.
.. note::
You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
Optionally, you can join an existing organization or create a new one.
Prepare your model for uploading
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` class. Let's see how you can share the result on
the `model hub <https://huggingface.co/models>`__.
Basic steps
^^^^^^^^^^^
..
When #5258 is merged, we can remove the need to create the directory.
First, pick a directory with the name you want your model to have on the model hub (its full name will then be
`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`) and create it with either
::
mkdir path/to/awesome-name-you-picked
or in python
::
import os
os.makedirs("path/to/awesome-name-you-picked")
then you can save your model and tokenizer with:
::
model.save_pretrained("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/awesome-name-you-picked")
Or, if you're using the Trainer API:
::
trainer.save_model("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/awesome-name-you-picked")
Make your model work on all frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
..
TODO Sylvain: make this automatic during the upload
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's super easy to do (and in a future version,
it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to
worry about the GPU, so it should be very easy. Check the
`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__
and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
First check that your model class exists in the other framework, that is try to import the same model by either adding
or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to
type
::
from transformers import TFDistilBertForSequenceClassification
and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to
type
::
from transformers import DistilBertForSequenceClassification
This will give back an error if your model does not exist in the other framework (something that should be pretty rare
since we're aiming for full parity between the two frameworks). In this case, skip this and go to the next step.
Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your
model class:
::
tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
tf_model.save_pretrained("path/to/awesome-name-you-picked")
and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your
model class:
::
pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
pt_model.save_pretrained("path/to/awesome-name-you-picked")
That's all there is to it!
Check the directory before uploading
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure there are no garbage files in the directory you'll upload. It should only have:
- a `config.json` file, which saves the :doc:`configuration <main_classes/configuration>` of your model;
- a `pytorch_model.bin` file, which is the PyTorch checkpoint (unless you can't have it for some reason);
- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason);
- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `vocab.txt`, which is the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>`
save;
- possibly an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
Other files can safely be deleted.
Upload your model with the CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now go in a terminal and run the following command. It should be run in the virtual environment where you installed 🤗
Transformers, since the :obj:`transformers-cli` command comes from the library.
::
transformers-cli login
Then log in using the same credentials as on huggingface.co. To upload your model, just type
::
transformers-cli upload path/to/awesome-name-you-picked/
This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple models and need to script that process, you can add `-y` to bypass the prompt. For example:
::
transformers-cli upload -y path/to/awesome-name-you-picked/
If you want to upload a single file (a new version of your model, or the other framework checkpoint you want to add),
just type:
::
transformers-cli upload path/to/awesome-name-you-picked/that-file
or
::
transformers-cli upload path/to/awesome-name-you-picked/that-file --filename awesome-name-you-picked/new_name
if you want to change its filename.
This uploads the model to your personal account. If you want your model to be namespaced by your organization name
rather than your username, add the following flag to any command:
::
--organization organization_name
so for instance:
::
transformers-cli upload path/to/awesome-name-you-picked/ --organization organization_name
Your model will then be accessible through its identifier, which is, as we saw above,
`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`.
Add a model card
^^^^^^^^^^^^^^^^
To make sure everyone knows what your model can do, what its limitations are and what potential biases or ethical
considerations it raises, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
placed in a subfolder with your username or organization, then another subfolder named like your model
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will
get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a
model card template (meta-suggestions are welcome).
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built.
If you have never made a pull request to the 🤗 Transformers repo, look at the
:doc:`contributing guide <contributing>` to see the steps to follow.
.. Note::
You can also send your model card in the folder you uploaded with the CLI by placing it in a `README.md` file
inside `path/to/awesome-name-you-picked/`.
Using your model
^^^^^^^^^^^^^^^^
Your model now has a page on huggingface.co/models 🔥
Anyone can load it from code:
::
tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
Additional commands
^^^^^^^^^^^^^^^^^^^
You can list all the files you uploaded on the hub like this:
::
transformers-cli s3 ls
You can also delete unneeded files with
::
transformers-cli s3 rm awesome-name-you-picked/filename
Model sharing and uploading
=======================================================================================================================
In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
the `model hub <https://huggingface.co/models>`__.
.. note::
You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
Optionally, you can join an existing organization or create a new one.
Prepare your model for uploading
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` class. Let's see how you can share the result on
the `model hub <https://huggingface.co/models>`__.
Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
..
When #5258 is merged, we can remove the need to create the directory.
First, pick a directory with the name you want your model to have on the model hub (its full name will then be
`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`) and create it with either
.. code-block::
mkdir path/to/awesome-name-you-picked
or in python
.. code-block::
import os
os.makedirs("path/to/awesome-name-you-picked")
then you can save your model and tokenizer with:
.. code-block::
model.save_pretrained("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/awesome-name-you-picked")
Or, if you're using the Trainer API:
.. code-block::
trainer.save_model("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/awesome-name-you-picked")
Make your model work on all frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
..
TODO Sylvain: make this automatic during the upload
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's super easy to do (and in a future version,
it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to
worry about the GPU, so it should be very easy. Check the
`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__
and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
First check that your model class exists in the other framework, that is try to import the same model by either adding
or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to
type
.. code-block::
from transformers import TFDistilBertForSequenceClassification
and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to
type
.. code-block::
from transformers import DistilBertForSequenceClassification
This will give back an error if your model does not exist in the other framework (something that should be pretty rare
since we're aiming for full parity between the two frameworks). In this case, skip this and go to the next step.
Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your
model class:
.. code-block::
tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
tf_model.save_pretrained("path/to/awesome-name-you-picked")
and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your
model class:
.. code-block::
pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
pt_model.save_pretrained("path/to/awesome-name-you-picked")
That's all there is to it!
Check the directory before uploading
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure there are no garbage files in the directory you'll upload. It should only have:
- a `config.json` file, which saves the :doc:`configuration <main_classes/configuration>` of your model;
- a `pytorch_model.bin` file, which is the PyTorch checkpoint (unless you can't have it for some reason);
- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason);
- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `vocab.txt`, which is the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>`
save;
- possibly an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
Other files can safely be deleted.
Upload your model with the CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now go in a terminal and run the following command. It should be run in the virtual environment where you installed 🤗
Transformers, since the :obj:`transformers-cli` command comes from the library.
.. code-block::
transformers-cli login
Then log in using the same credentials as on huggingface.co. To upload your model, just type
.. code-block::
transformers-cli upload path/to/awesome-name-you-picked/
This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple models and need to script that process, you can add `-y` to bypass the prompt. For example:
.. code-block::
transformers-cli upload -y path/to/awesome-name-you-picked/
If you want to upload a single file (a new version of your model, or the other framework checkpoint you want to add),
just type:
.. code-block::
transformers-cli upload path/to/awesome-name-you-picked/that-file
or
.. code-block::
transformers-cli upload path/to/awesome-name-you-picked/that-file --filename awesome-name-you-picked/new_name
if you want to change its filename.
This uploads the model to your personal account. If you want your model to be namespaced by your organization name
rather than your username, add the following flag to any command:
.. code-block::
--organization organization_name
so for instance:
.. code-block::
transformers-cli upload path/to/awesome-name-you-picked/ --organization organization_name
Your model will then be accessible through its identifier, which is, as we saw above,
`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`.
Add a model card
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To make sure everyone knows what your model can do, what its limitations are and what potential biases or ethical
considerations it raises, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
placed in a subfolder with your username or organization, then another subfolder named like your model
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will
get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a
model card template (meta-suggestions are welcome).
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built.
If you have never made a pull request to the 🤗 Transformers repo, look at the
:doc:`contributing guide <contributing>` to see the steps to follow.
.. Note::
You can also send your model card in the folder you uploaded with the CLI by placing it in a `README.md` file
inside `path/to/awesome-name-you-picked/`.
Using your model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Your model now has a page on huggingface.co/models 🔥
Anyone can load it from code:
.. code-block::
tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
Additional commands
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can list all the files you uploaded on the hub like this:
.. code-block::
transformers-cli s3 ls
You can also delete unneeded files with
.. code-block::
transformers-cli s3 rm awesome-name-you-picked/filename
Summary of the models
================================================
This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original
`transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
models. You can check each of them in more detail in their respective documentation. Also check out the
:doc:`pretrained model page </pretrained_models>` to see the checkpoints available for each type of model and all `the
community models <https://huggingface.co/models>`_.
Each one of the models in the library falls into one of the following categories:
* :ref:`autoregressive-models`
* :ref:`autoencoding-models`
* :ref:`seq-to-seq-models`
* :ref:`multimodal-models`
* :ref:`retrieval-based-models`
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the text, and not what's after. Although those
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation.
A typical example of such models is GPT.
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can
be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is
sentence classification or token classification. A typical example of such models is BERT.
Note that the only difference between autoregressive models and autoencoding models is in the way the model is
pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first
introduced.
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
most natural applications are translation, summarization and question answering. The original transformer model is an
example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.
Multimodal models mix text inputs with other kinds (e.g. images) and are more specific to a given task.
.. _autoregressive-models:
Autoregressive models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
that, at each position, the model can only look at the tokens before it in the sentence.
Original GPT
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=openai-gpt">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
</a>
<a href="model_doc/gpt.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a>
`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_,
Alec Radford et al.
The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
The library provides versions of the model for language modeling and multitask language modeling/multiple choice
classification.
GPT-2
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=gpt2">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
</a>
<a href="model_doc/gpt2.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a>
`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
Alec Radford et al.
A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links on Reddit with a karma of at
least 3).
The library provides versions of the model for language modeling and multitask language modeling/multiple choice
classification.
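For text generation, a minimal sketch with the library might look like the following (the ``gpt2`` checkpoint and the
sampling parameters are only illustrative choices):

.. code-block::

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Encode a prompt and let the model continue it with top-k sampling.
    input_ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=30, do_sample=True, top_k=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))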
CTRL
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=ctrl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
</a>
<a href="model_doc/ctrl.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-ctrl-blueviolet">
</a>
`CTRL: A Conditional Transformer Language Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`_,
Nitish Shirish Keskar et al.
Same as the GPT model but adds the idea of control codes. Text is generated from a prompt (which can be empty) and one
(or several) of those control codes, which are then used to influence the text generation: generate in the style of a
Wikipedia article, a book or a movie review.
The library provides a version of the model for language modeling only.
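As a rough sketch of how a control code is used, the prompt simply starts with one of the codes listed in the CTRL
paper (here ``Books``, chosen only for illustration; the generation parameters are arbitrary):

.. code-block::

    from transformers import CTRLLMHeadModel, CTRLTokenizer

    tokenizer = CTRLTokenizer.from_pretrained("ctrl")
    model = CTRLLMHeadModel.from_pretrained("ctrl")

    # The control code is just the first token of the prompt.
    input_ids = tokenizer("Books Knowledge is", return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=40, repetition_penalty=1.2)
    print(tokenizer.decode(output_ids[0]))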
Transformer-XL
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=transfo-xl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
</a>
<a href="model_doc/transformerxl.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a>
`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_,
Zihang Dai et al.
Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
may span across multiple documents, and segments are fed in order to the model.
Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention
scores. This allows the model to pay attention to information that was in the previous segment as well as the current
one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
This changes the positional embeddings to positional relative embeddings (as the regular positional embeddings would
give the same results in the current input and the current hidden state at a given position) and needs to make some
adjustments in the way attention scores are computed.
The library provides a version of the model for language modeling only.
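From the user's point of view, the recurrence boils down to passing the memories returned for one segment along with
the next segment. A minimal sketch, assuming the ``transfo-xl-wt103`` checkpoint and dictionary-style outputs:

.. code-block::

    from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer

    tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
    model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

    first_segment = tokenizer("The first segment of a long text", return_tensors="pt").input_ids
    second_segment = tokenizer("and the segment that directly follows it", return_tensors="pt").input_ids

    outputs = model(first_segment, return_dict=True)
    # `mems` caches the hidden states of the first segment so the second one can attend to them.
    outputs = model(second_segment, mems=outputs.mems, return_dict=True)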
.. _reformer:
Reformer
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=reformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
</a>
<a href="model_doc/reformer.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a>
`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_,
Nikita Kitaev et al.
An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
include:
* Use :ref:`Axial position encoding <axial-pos-encoding>` (see below for more details). It's a mechanism to avoid
  having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller
  matrices.
* Replace traditional attention by :ref:`LSH (locality-sensitive hashing) attention <lsh-attention>` (see below for
  more details). It's a technique to avoid computing the full query-key product in the attention layers.
* Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during
the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them
for results inside a given layer (less efficient than storing them but saves memory).
* Compute the feedforward operations by chunks and not on the whole batch.
With those tricks, the model can be fed much larger sentences than traditional transformer autoregressive models.
**Note:** This model could very well be used in an autoencoding setting; there is no checkpoint for such a
pretraining yet, though.
The library provides a version of the model for language modeling only.
XLNet
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=xlnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
</a>
<a href="model_doc/xlnet.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a>
`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_,
Zhilin Yang et al.
XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
with a mask, the sentence is actually fed to the model in the right order, but instead of masking the first n tokens
for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1,...,sequence length.
XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.
The library provides a version of the model for language modeling, token classification, sentence classification,
multiple choice classification and question answering.
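For instance, the :obj:`perm_mask` and :obj:`target_mapping` inputs can be combined to predict a single position, as
in this minimal sketch (the ``xlnet-base-cased`` checkpoint is only an illustrative choice):

.. code-block::

    import torch
    from transformers import XLNetLMHeadModel, XLNetTokenizer

    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

    input_ids = tokenizer("Hello, my dog is very cute", return_tensors="pt").input_ids
    seq_len = input_ids.shape[1]

    # Hide the last token from every position...
    perm_mask = torch.zeros((1, seq_len, seq_len))
    perm_mask[:, :, -1] = 1.0
    # ...and ask the model to predict only that last position.
    target_mapping = torch.zeros((1, 1, seq_len))
    target_mapping[0, 0, -1] = 1.0

    outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
    next_token_logits = outputs[0]  # logits for the masked position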
.. _autoencoding-models:
Autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions.
BERT
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=bert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
</a>
<a href="model_doc/bert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bert-blueviolet">
</a>
`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_,
Jacob Devlin et al.
Corrupts the inputs by using random masking. More precisely, during pretraining, a given percentage of tokens (usually
15%) is masked by:
* a special mask token with probability 0.8
* a random token different from the one masked with probability 0.1
* the same token with probability 0.1
The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a
separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50%
they are not related. The model has to predict if the sentences are consecutive or not.
The library provides a version of the model for language modeling (traditional or masked), next sentence prediction,
token classification, sentence classification, multiple choice classification and question answering.
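A minimal, pure-Python sketch of the masking rule described above (the ``mask_tokens`` helper and the toy vocabulary
are made up for the example):

.. code-block::

    import random

    def mask_tokens(tokens, mask_token="[MASK]", vocab=("the", "dog", "cat", "runs"), mlm_probability=0.15):
        # Illustrative sketch of the 80/10/10 masking scheme described above.
        inputs, labels = [], []
        for token in tokens:
            if random.random() < mlm_probability:
                labels.append(token)  # this position is part of the masked LM loss
                draw = random.random()
                if draw < 0.8:
                    inputs.append(mask_token)            # 80%: replace with the mask token
                elif draw < 0.9:
                    inputs.append(random.choice(vocab))  # 10%: replace with a random token
                else:
                    inputs.append(token)                 # 10%: keep the original token
            else:
                inputs.append(token)
                labels.append(None)  # not predicted
        return inputs, labels

    print(mask_tokens("my dog is very cute".split()))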
ALBERT
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=albert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
</a>
<a href="model_doc/albert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-albert-blueviolet">
</a>
`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_,
Zhenzhong Lan et al.
Same as BERT but with a few tweaks:
* Embedding size E is different from hidden size H, which is justified because the embeddings are context independent
  (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents
  a sequence of tokens), so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
  being the vocab size); if E < H, it has fewer parameters.
* Layers are split in groups that share parameters (to save memory).
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B
(that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
been swapped or not.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
RoBERTa
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
</a>
<a href="model_doc/roberta.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a>
`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_,
Yinhan Liu et al.
Same as BERT with better pretraining tricks:
* dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
* no NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of
  contiguous texts together to reach 512 tokens (so the sentences are in an order that may span several documents)
* train with larger batches
* use BPE with bytes as a subunit and not characters (because of unicode characters)
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
DistilBERT
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=distilbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
</a>
<a href="model_doc/distilbert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-distilbert-blueviolet">
</a>
`DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_,
Victor Sanh et al.
Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict
the same probabilities as the larger model. The actual objective is a combination of:
* finding the same probabilities as the teacher model
* predicting the masked tokens correctly (but no next-sentence objective)
* a cosine similarity between the hidden states of the student and the teacher model
The library provides a version of the model for masked language modeling, token classification, sentence classification
and question answering.
XLM
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=xlm">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
</a>
<a href="model_doc/xlm.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm-blueviolet">
</a>
`Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_, Guillaume Lample and Alexis Conneau
A transformer model trained on several languages. There are three different types of training for this model and the
library provides checkpoints for all of them:
* Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the
previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens, that may span over several documents in one of those languages.
* Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with
dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
different languages, with random masking. To predict one of the masked tokens, the model can use both the
surrounding context in language 1 and the context given by language 2.
Checkpoints refer to which method was used for pretraining by having `clm`, `mlm` or `mlm-tlm` in their names. On top
of positional embeddings, the model has language embeddings. When training using MLM/CLM, this gives the model an
indication of the language used, and when training using MLM+TLM, an indication of the language used for each part.
The library provides a version of the model for language modeling, token classification, sentence classification and
question answering.
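A minimal sketch of how the language embeddings are fed at inference time, assuming a checkpoint trained with language
embeddings such as ``xlm-mlm-enfr-1024``:

.. code-block::

    import torch
    from transformers import XLMModel, XLMTokenizer

    tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-enfr-1024")
    model = XLMModel.from_pretrained("xlm-mlm-enfr-1024")

    input_ids = tokenizer("Wikipedia was used to", return_tensors="pt").input_ids
    # One language id (here English) per input token.
    langs = torch.full_like(input_ids, tokenizer.lang2id["en"])

    outputs = model(input_ids, langs=langs)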
XLM-RoBERTa
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=xlm-roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
</a>
<a href="model_doc/xlmroberta.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm--roberta-blueviolet">
</a>
`Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`_, Alexis Conneau et
al.
Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses
masked language modeling on sentences coming from one language. However, the model is trained on many more languages
(100) and doesn't use the language embeddings, so it's capable of detecting the input language by itself.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
FlauBERT
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=flaubert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
</a>
<a href="model_doc/flaubert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-flaubert-blueviolet">
</a>
`FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`_, Hang Le et al.
Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
The library provides a version of the model for language modeling and sentence classification.
ELECTRA
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=electra">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
</a>
<a href="model_doc/electra.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-electra-blueviolet">
</a>
`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators <https://arxiv.org/abs/2003.10555>`_,
Kevin Clark et al.
ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are
corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA
has to predict which token is an original and which one has been replaced. Like for GAN training, the small language
model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a
traditional GAN setting) then the ELECTRA model is trained for a few steps.
The library provides a version of the model for masked language modeling, token classification and sentence
classification.
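A minimal sketch of the discriminator at work, assuming the ``google/electra-small-discriminator`` checkpoint (the
replaced word is planted by hand here instead of being sampled from a small generator):

.. code-block::

    import torch
    from transformers import ElectraForPreTraining, ElectraTokenizer

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

    # "rocks" replaces the original "jumps"; the discriminator should flag it.
    inputs = tokenizer("The quick brown fox rocks over the lazy dog", return_tensors="pt")
    logits = model(**inputs)[0]
    predictions = (torch.sigmoid(logits) > 0.5).long()  # 1 = replaced, 0 = original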
Funnel Transformer
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=funnel">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
</a>
<a href="model_doc/funnel.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-funnel-blueviolet">
</a>
`Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
<https://arxiv.org/abs/2006.03236>`_, Zihang Dai et al.
Funnel Transformer is a transformer model using pooling, a bit like a ResNet model: layers are grouped in blocks, and
at the beginning of each block (except the first one), the hidden states are pooled along the sequence dimension. This
way, their length is divided by 2, which speeds up the computation of the next hidden states. All pretrained models
have three blocks, which means the final hidden state has a sequence length that is one fourth of the original sequence
length.
For tasks such as classification, this is not a problem, but for tasks like masked language modeling or token
classification, we need a hidden state with the same sequence length as the original input. In those cases, the final
hidden states are upsampled to the input sequence length and go through two additional layers. That's why there are two
versions of each checkpoint. The version suffixed with "-base" contains only the three blocks, while the version
without that suffix contains the three blocks and the upsampling head with its additional layers.
The pretrained models available use the same pretraining objective as ELECTRA.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
.. _longformer:
Longformer
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=longformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
</a>
<a href="model_doc/longformer.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-longformer-blueviolet">
</a>
`Longformer: The Long-Document Transformer <https://arxiv.org/abs/2004.05150>`_, Iz Beltagy et al.
A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g.,
what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are
still given global attention, but the attention matrix has far fewer values to compute, resulting in a speed-up. See the
:ref:`local attention section <local-attention>` for more information.
It is otherwise pretrained the same way as RoBERTa.
**Note:** This model could very well be used in an autoregressive setting; there is no checkpoint for such a
pretraining yet, though.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
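A minimal sketch of requesting global attention for a few positions, assuming the ``allenai/longformer-base-4096``
checkpoint and the :obj:`global_attention_mask` argument:

.. code-block::

    import torch
    from transformers import LongformerModel, LongformerTokenizer

    tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    inputs = tokenizer("A very long document " * 500, return_tensors="pt")
    # Local attention everywhere, plus global attention on the first (<s>) token.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    outputs = model(**inputs, global_attention_mask=global_attention_mask)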
.. _seq-to-seq-models:
Sequence-to-sequence models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models keep both the encoder and the decoder of the original transformer.
BART
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=bart">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
</a>
<a href="model_doc/bart.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bart-blueviolet">
</a>
`BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
<https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
Sequence-to-sequence model with an encoder and a decoder. The encoder is fed a corrupted version of the tokens, the
decoder is fed the original tokens (but has a mask to hide the future words, like a regular transformer decoder). For
the pretraining task, a composition of the following transformations is applied to the encoder input:
* mask random tokens (like in BERT)
* delete random tokens
* mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
* permute sentences
* rotate the document to make it start at a specific token
The library provides a version of this model for conditional generation and sequence classification.
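A minimal, pure-Python sketch of the span-masking transformation from the list above (the ``infill_spans`` helper and
the span-length choice are made up for the example):

.. code-block::

    import random

    def infill_spans(tokens, mask_token="<mask>", max_span=3):
        # Replace a random span (possibly of length 0, i.e. a pure insertion) with a single mask token.
        start = random.randrange(len(tokens) + 1)
        length = random.randint(0, max_span)
        return tokens[:start] + [mask_token] + tokens[start + length:]

    print(infill_spans("the quick brown fox jumps over the lazy dog".split()))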
Pegasus
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=pegasus">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
</a>
<a href="model_doc/pegasus.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
</a>
`PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization-specific pre-training objective, called Gap Sentence Generation (GSG).
* MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like in BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal mask to hide the future words like a regular auto-regressive transformer decoder.
In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
The library provides a version of this model for conditional generation, which should be used for summarization.
MarianMT
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=marian">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
</a>
<a href="model_doc/marian.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-marian-blueviolet">
</a>
`Marian: Fast Neural Machine Translation in C++ <https://arxiv.org/abs/1804.00344>`_, Marcin Junczys-Dowmunt et al.
A framework for translation models, using the same models as BART.
The library provides a version of this model for conditional generation.
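A minimal translation sketch, assuming one of the ``Helsinki-NLP/opus-mt-*`` checkpoints (here English to German):

.. code-block::

    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer(["Machine translation is fun."], return_tensors="pt")
    translated_ids = model.generate(**inputs)
    print(tokenizer.decode(translated_ids[0], skip_special_tokens=True))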
T5
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=t5">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
</a>
<a href="model_doc/t5.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a>
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_,
Colin Raffel et al.
Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at
each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
prefixes: ``summarize:``, ``question:``, ``translate English to German:`` and so forth.
The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and
replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the
original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
For instance, if we have the sentence "My dog is very cute ." and we decide to remove the tokens "dog", "is" and "cute", the encoder
input becomes "My <x> very <y> ." and the target input becomes "<x> dog is <y> cute .<z>".
The library provides a version of this model for conditional generation.
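A minimal sketch of the text-to-text interface, assuming the ``t5-small`` checkpoint (the prefix fully determines the
task):

.. code-block::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
    output_ids = model.generate(**inputs)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))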
MBart
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=mbart">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
</a>
<a href="model_doc/mbart.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a>
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov,
Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
The model architecture and pre-training objective are the same as BART, but MBart is trained on 25 languages
and is intended for supervised and unsupervised machine translation. MBart is one of the first methods
for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages.
The library provides a version of this model for conditional generation.
The `mbart-large-en-ro checkpoint <https://huggingface.co/facebook/mbart-large-en-ro>`_ can be used for English to Romanian translation.
The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ checkpoint can be finetuned for other translation and summarization tasks, using code in ``examples/seq2seq/``, but is not very useful without finetuning.
.. _multimodal-models:
Multimodal models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There is one multimodal model in the library which has not been pretrained in the self-supervised fashion like the
others.
MMBT
----------------------------------------------
`Supervised Multimodal Bitransformers for Classifying Images and Text <https://arxiv.org/abs/1909.02950>`_, Douwe Kiela
et al.
A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images
(taken after the pooling layer), passed through a linear layer (to go from the number of features at the end of the
resnet to the hidden state dimension of the transformer).
The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
model know which part of the input vector corresponds to the text and which to the image.
The pretrained model only works for classification.
..
More information in this :doc:`model documentation </model_doc/mmbt.html>`.
TODO: write this page
.. _retrieval-based-models:
Retrieval-based models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Some models use document retrieval during (pre)training and inference, for example for open-domain question answering.
DPR
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=dpr">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
</a>
<a href="model_doc/dpr.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a>
`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_,
Vladimir Karpukhin et al.
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering research.
DPR consists of three models:
* Question encoder: encode questions as vectors
* Context encoder: encode contexts as vectors
* Reader: extract the answer to the question from the retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and then it calls the reader with the question and the retrieved documents to get the answer.
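A minimal sketch of scoring a passage against a question with the two encoders, assuming the
``facebook/dpr-question_encoder-single-nq-base`` and ``facebook/dpr-ctx_encoder-single-nq-base`` checkpoints:

.. code-block::

    from transformers import (
        DPRContextEncoder,
        DPRContextEncoderTokenizer,
        DPRQuestionEncoder,
        DPRQuestionEncoderTokenizer,
    )

    q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

    question_emb = q_encoder(**q_tokenizer("Who wrote Hamlet?", return_tensors="pt"))[0]
    passage_emb = ctx_encoder(**ctx_tokenizer("Hamlet is a tragedy by William Shakespeare.", return_tensors="pt"))[0]

    # The relevance score of a passage for a question is the dot product of their embeddings.
    score = question_emb @ passage_emb.T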
RAG
----------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=rag">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
</a>
<a href="model_doc/rag.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a>
`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_,
Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models.
RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs.
The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
The two models RAG-Token and RAG-Sequence are available for generation.
More technical aspects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Full vs sparse attention
----------------------------------------------
Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
computational bottleneck when you have long texts. Longformer and reformer are models that try to be more efficient and
use a sparse version of the attention matrix to speed up training.
.. _lsh-attention:
**LSH attention**
:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because it will give a query and a key equal (so very
similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by
a n_rounds parameter) and then are averaged together.
.. _local-attention:
**Local attention**
:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the left and
right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the
last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
representation of the whole sentence.
Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
their local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:
.. image:: imgs/local_attention_mask.png
:scale: 50 %
:align: center
Using those attention matrices with less parameters then allows the model to have inputs having a bigger sequence
length.
Other tricks
----------------------------------------------
.. _axial-pos-encoding:
**Axial positional encodings**
:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and
E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l`
and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for
time step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and
:math:`j // l1` in E2.
Summary of the models
=======================================================================================================================
This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original
`transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction, check the `annotated transformer
<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
models. You can check them in more detail in their respective documentation. Also check out the
:doc:`pretrained model page </pretrained_models>` to see the checkpoints available for each type of model and all `the
community models <https://huggingface.co/models>`_.
Each one of the models in the library falls into one of the following categories:
* :ref:`autoregressive-models`
* :ref:`autoencoding-models`
* :ref:`seq-to-seq-models`
* :ref:`multimodal-models`
* :ref:`retrieval-based-models`
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the text, and not what's after. Although those
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation.
A typical example of such models is GPT.
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can
be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is
sentence classification or token classification. A typical example of such models is BERT.
Note that the only difference between autoregressive models and autoencoding models is in the way the model is
pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first
introduced.
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
most natural applications are translation, summarization and question answering. The original transformer model is an
example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.
Multimodal models mix text inputs with other kinds (e.g. images) and are more specific to a given task.
.. _autoregressive-models:
Autoregressive models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
that at each position, the attention heads can only look at the tokens that came before.
Original GPT
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=openai-gpt">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
</a>
<a href="model_doc/gpt.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a>
`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_,
Alec Radford et al.
The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
The library provides versions of the model for language modeling and multitask language modeling/multiple choice
classification.
GPT-2
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=gpt2">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
</a>
<a href="model_doc/gpt2.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a>
`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
Alec Radford et al.
A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
more).
The library provides versions of the model for language modeling and multitask language modeling/multiple choice
classification.
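For instance, a minimal (purely illustrative) text generation example with the public ``gpt2`` checkpoint could look
like the following sketch; the prompt and sampling settings are arbitrary:

.. code-block::

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Encode a prompt and let the model continue it autoregressively.
    inputs = tokenizer("The history of natural language processing", return_tensors="pt")
    generated = model.generate(inputs["input_ids"], max_length=50, do_sample=True, top_k=50)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))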
CTRL
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=ctrl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
</a>
<a href="model_doc/ctrl.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-ctrl-blueviolet">
</a>
`CTRL: A Conditional Transformer Language Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`_,
Nitish Shirish Keskar et al.
Same as the GPT model but adds the idea of control codes. Text is generated from a prompt (can be empty) and one (or
several) of those control codes, which are then used to influence the text generation: generate text in the style of a
Wikipedia article, a book or a movie review.
The library provides a version of the model for language modeling only.
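As a hedged sketch, using a control code simply amounts to prepending it to the prompt (we assume here that ``Books``
is one of the available control codes; the generation settings are illustrative):

.. code-block::

    from transformers import CTRLTokenizer, CTRLLMHeadModel

    tokenizer = CTRLTokenizer.from_pretrained("ctrl")
    model = CTRLLMHeadModel.from_pretrained("ctrl")

    # The control code (assumed here to be "Books") is simply prepended to the prompt.
    inputs = tokenizer("Books Once upon a time", return_tensors="pt")
    generated = model.generate(inputs["input_ids"], max_length=50, repetition_penalty=1.2)
    print(tokenizer.decode(generated[0]))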
Transformer-XL
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=transfo-xl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
</a>
<a href="model_doc/transformerxl.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a>
`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_,
Zihang Dai et al.
Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
may span across multiple documents, and segments are fed in order to the model.
Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention
scores. This allows the model to pay attention to information that was in the previous segment as well as the current
one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
This changes the positional embeddings to positional relative embeddings (as the regular positional embeddings would
give the same results in the current input and the current hidden state at a given position) and needs to make some
adjustments in the way attention scores are computed.
The library provides a version of the model for language modeling only.
.. _reformer:
Reformer
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=reformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
</a>
<a href="model_doc/reformer.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a>
`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_,
Nikita Kitaev et al.
An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
include:
* Use :ref:`Axial position encoding <axial-pos-encoding>` (see below for more details). It's a mechanism to avoid
  having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller
  matrices.
* Replace traditional attention by :ref:`LSH (locality-sensitive hashing) attention <lsh-attention>` (see below for
  more details). It's a technique to avoid computing the full query-key product in the attention layers.
* Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during
the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them
for results inside a given layer (less efficient than storing them but saves memory).
* Compute the feedforward operations by chunks and not on the whole batch.
With those tricks, the model can be fed much larger sentences than traditional transformer autoregressive models.
**Note:** This model could very well be used in an autoencoding setting; there is no checkpoint for such a
pretraining yet, though.
The library provides a version of the model for language modeling only.
XLNet
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=xlnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
</a>
<a href="model_doc/xlnet.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a>
`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_,
Zhilin Yang et al.
XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
with a mask, the sentence is actually fed to the model in the right order, but instead of masking the first n tokens
for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1,...,sequence length.
XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.
The library provides a version of the model for language modeling, token classification, sentence classification,
multiple choice classification and question answering.
.. _autoencoding-models:
Autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions.
BERT
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=bert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
</a>
<a href="model_doc/bert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bert-blueviolet">
</a>
`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_,
Jacob Devlin et al.
Corrupts the inputs by using random masking. More precisely, during pretraining, a given percentage of tokens (usually
15%) is masked by:
* a special mask token with probability 0.8
* a random token different from the one masked with probability 0.1
* the same token with probability 0.1
The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a
separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50%
they are not related. The model has to predict if the sentences are consecutive or not.
The library provides a version of the model for language modeling (traditional or masked), next sentence prediction,
token classification, sentence classification, multiple choice classification and question answering.
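A minimal sketch of the masking procedure described above (this is only an illustration, not the library's actual data
collator) could look like:

.. code-block::

    import torch

    def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
        labels = input_ids.clone()
        # Select ~15% of the tokens; the loss is only computed on those (-100 elsewhere).
        masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
        labels[~masked] = -100

        # 80% of the selected tokens are replaced by the mask token.
        replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
        input_ids[replaced] = mask_token_id

        # Half of the rest (10% overall) get a random token, the remaining 10% stay unchanged.
        randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
        input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]
        return input_ids, labels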
ALBERT
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=albert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
</a>
<a href="model_doc/albert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-albert-blueviolet">
</a>
`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_,
Zhenzhong Lan et al.
Same as BERT but with a few tweaks:
* Embedding size E is different from hidden size H, which is justified because the embeddings are context independent
  (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state
  represents a sequence of tokens), so it makes sense to have H >> E. Also, the embedding matrix is large since it's
  V x E (V being the vocab size). If E < H, it has fewer parameters (see the sketch after this list).
* Layers are split in groups that share parameters (to save memory).
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B
(that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
been swapped or not.
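To make the first point concrete, here is a back-of-the-envelope comparison of the embedding parameter counts with and
without the factorization (the sizes below are only illustrative):

.. code-block::

    V, H, E = 30000, 768, 128  # illustrative vocab size, hidden size and embedding size

    bert_like_embeddings = V * H            # tokens are projected directly to the hidden size
    albert_like_embeddings = V * E + E * H  # V x E embedding followed by an E x H projection

    print(bert_like_embeddings, albert_like_embeddings)  # 23040000 versus 3938304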
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
RoBERTa
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
</a>
<a href="model_doc/roberta.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a>
`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_,
Yinhan Liu et al.
Same as BERT with better pretraining tricks:
* dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
* no NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of
  contiguous texts together to reach 512 tokens (so the sentences are in an order that may span several documents)
* train with larger batches
* use BPE with bytes as a subunit and not characters (because of unicode characters)
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
DistilBERT
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=distilbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
</a>
<a href="model_doc/distilbert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-distilbert-blueviolet">
</a>
`DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_,
Victor Sanh et al.
Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict
the same probabilities as the larger model. The actual objective is a combination of:
* finding the same probabilities as the teacher model
* predicting the masked tokens correctly (but no next-sentence objective)
* a cosine similarity between the hidden states of the student and the teacher model
The library provides a version of the model for masked language modeling, token classification, sentence classification
and question answering.
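A rough sketch of how those three terms can be combined into one training loss (simplified, and not the library's
actual distillation code) is shown below:

.. code-block::

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                          labels, temperature=2.0):
        # 1. Match the (softened) probabilities of the teacher model.
        kl = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        # 2. Usual masked language modeling loss on the masked tokens (labels are -100 elsewhere).
        mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
        # 3. Cosine similarity between the hidden states of the student and the teacher.
        cosine = 1 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
        return kl + mlm + cosine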
XLM
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=xlm">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
</a>
<a href="model_doc/xlm.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm-blueviolet">
</a>
`Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_, Guillaume Lample and Alexis Conneau
A transformer model trained on several languages. There are three different types of training for this model and the
library provides checkpoints for all of them:
* Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the
previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens, that may span over several documents in one of those languages.
* Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with
dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
  different languages, with random masking. To predict one of the masked tokens, the model can use both the
  surrounding context in language 1 and the context given by language 2.
Checkpoints refer to which method was used for pretraining by having `clm`, `mlm` or `mlm-tlm` in their names. On top
of positional embeddings, the model has language embeddings. When training using MLM/CLM, this gives the model an
indication of the language used, and when training using MLM+TLM, an indication of the language used for each part.
The library provides a version of the model for language modeling, token classification, sentence classification and
question answering.
XLM-RoBERTa
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=xlm-roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
</a>
<a href="model_doc/xlmroberta.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm--roberta-blueviolet">
</a>
`Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`_, Alexis Conneau et
al.
Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses
masked language modeling on sentences coming from one language. However, the model is trained on many more languages
(100) and doesn't use the language embeddings, so it's capable of detecting the input language by itself.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
FlauBERT
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=flaubert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
</a>
<a href="model_doc/flaubert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-flaubert-blueviolet">
</a>
`FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`_, Hang Le et al.
Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
The library provides a version of the model for language modeling and sentence classification.
ELECTRA
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=electra">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
</a>
<a href="model_doc/electra.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-electra-blueviolet">
</a>
`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators <https://arxiv.org/abs/2003.10555>`_,
Kevin Clark et al.
ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are
corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA
has to predict which token is an original and which one has been replaced. Like for GAN training, the small language
model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a
traditional GAN setting) then the ELECTRA model is trained for a few steps.
The library provides a version of the model for masked language modeling, token classification and sentence
classification.
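As a hypothetical illustration (the checkpoint name below is the public small discriminator; the output handling is
simplified), you can ask the discriminator which tokens it believes were replaced:

.. code-block::

    from transformers import ElectraTokenizer, ElectraForPreTraining

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

    # "fake" has been put in place of the original word to give the discriminator something to find.
    inputs = tokenizer("The quick brown fox fake over the lazy dog", return_tensors="pt")
    logits = model(**inputs)[0]

    # A positive logit means the model thinks the corresponding token was replaced.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    print(list(zip(tokens, (logits[0] > 0).long().tolist())))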
Funnel Transformer
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=funnel">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
</a>
<a href="model_doc/funnel.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-funnel-blueviolet">
</a>
`Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
<https://arxiv.org/abs/2006.03236>`_, Zihang Dai et al.
Funnel Transformer is a transformer model using pooling, a bit like a ResNet model: layers are grouped in blocks, and
at the beginning of each block (except the first one), the hidden states are pooled along the sequence dimension. This
way, their length is divided by 2, which speeds up the computation of the next hidden states. All pretrained models
have three blocks, which means the final hidden state has a sequence length that is one fourth of the original sequence
length.
For tasks such as classification, this is not a problem, but for tasks like masked language modeling or token
classification, we need a hidden state with the same sequence length as the original input. In those cases, the final
hidden states are upsampled to the input sequence length and go through two additional layers. That's why there are two
versions of each checkpoint. The version suffixed with "-base" contains only the three blocks, while the version
without that suffix contains the three blocks and the upsampling head with its additional layers.
The pretrained models available use the same pretraining objective as ELECTRA.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
.. _longformer:
Longformer
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=longformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
</a>
<a href="model_doc/longformer.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-longformer-blueviolet">
</a>
`Longformer: The Long-Document Transformer <https://arxiv.org/abs/2004.05150>`_, Iz Beltagy et al.
A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g.,
what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are
still given global attention, but the attention matrix has way fewer parameters, resulting in a speed-up. See the
:ref:`local attention section <local-attention>` for more information.
It is otherwise pretrained the same way as RoBERTa.
**Note:** This model could very well be used in an autoregressive setting; there is no checkpoint for such a
pretraining yet, though.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
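A hedged sketch of how global attention can be requested for specific tokens (which tokens should get it is task
dependent; here we arbitrarily pick the first one):

.. code-block::

    import torch
    from transformers import LongformerTokenizer, LongformerModel

    tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    inputs = tokenizer("A very long document. " * 200, return_tensors="pt")

    # 1 = global attention (sees and is seen by every token), 0 = local attention only.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    outputs = model(**inputs, global_attention_mask=global_attention_mask)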
.. _seq-to-seq-models:
Sequence-to-sequence models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models keep both the encoder and the decoder of the original transformer.
BART
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=bart">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
</a>
<a href="model_doc/bart.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bart-blueviolet">
</a>
`BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
<https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
Sequence-to-sequence model with an encoder and a decoder. The encoder is fed a corrupted version of the tokens, the
decoder is fed the original tokens (but has a mask to hide future words, like a regular transformer decoder). For the
pretraining task, a composition of the following transformations is applied to the encoder input:
* mask random tokens (like in BERT)
* delete random tokens
* mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
* permute sentences
* rotate the document to make it start at a specific token
The library provides a version of this model for conditional generation and sequence classification.
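For instance, summarization with the publicly fine-tuned CNN/DailyMail checkpoint could look like the following sketch
(the checkpoint name and generation settings are illustrative):

.. code-block::

    from transformers import BartTokenizer, BartForConditionalGeneration

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

    article = "The Eiffel Tower is 324 metres tall, about the same height as an 81-storey building, " \
              "and was the tallest man-made structure in the world for 41 years."
    inputs = tokenizer([article], max_length=1024, truncation=True, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))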
Pegasus
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=pegasus">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
</a>
<a href="model_doc/pegasus.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
</a>
`PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training objective, called Gap Sentence Generation (GSG).
* MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like in
  BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal
  mask to hide future words, like a regular auto-regressive transformer decoder.
In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
The library provides a version of this model for conditional generation, which should be used for summarization.
MarianMT
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=marian">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
</a>
<a href="model_doc/marian.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-marian-blueviolet">
</a>
`Marian: Fast Neural Machine Translation in C++ <https://arxiv.org/abs/1804.00344>`_, Marcin Junczys-Dowmunt et al.
A framework for translation models, using the same models as BART.
The library provides a version of this model for conditional generation.
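As an illustrative sketch, translating English to German with one of the publicly available checkpoints could look
like:

.. code-block::

    from transformers import MarianTokenizer, MarianMTModel

    model_name = "Helsinki-NLP/opus-mt-en-de"  # one checkpoint among the many language pairs available
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer(["I love reading books."], return_tensors="pt", padding=True)
    translated = model.generate(**inputs)
    print(tokenizer.decode(translated[0], skip_special_tokens=True))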
T5
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=t5">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
</a>
<a href="model_doc/t5.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a>
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_,
Colin Raffel et al.
Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
prefixes: "summarize: ", "question: ", "translate English to German: " and so forth.
The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and
replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the
original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
For instance, if we have the sentence "My dog is very cute .", and we decide to remove the tokens "dog", "is" and
"cute", the encoder input becomes "My <x> very <y> ." and the target input becomes "<x> dog is <y> cute .<z>".
The library provides a version of this model for conditional generation.
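A quick illustration of the text-to-text interface (the prefix and checkpoint below are just one possible choice):

.. code-block::

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The task is selected through a textual prefix added to the input.
    inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
    outputs = model.generate(inputs["input_ids"])
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))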
MBart
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=mbart">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
</a>
<a href="model_doc/mbart.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a>
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
The model architecture and pre-training objective is the same as BART, but MBart is trained on 25 languages and is
intended for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a
complete sequence-to-sequence model by denoising full texts in multiple languages.
The library provides a version of this model for conditional generation.
The `mbart-large-en-ro checkpoint <https://huggingface.co/facebook/mbart-large-en-ro>`_ can be used for English ->
Romanian translation.
The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ checkpoint can be finetuned for other
translation and summarization tasks, using code in ``examples/seq2seq/``, but is not very useful without finetuning.
.. _multimodal-models:
Multimodal models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There is one multimodal model in the library which has not been pretrained in the self-supervised fashion like the
others.
MMBT
-----------------------------------------------------------------------------------------------------------------------
`Supervised Multimodal Bitransformers for Classifying Images and Text <https://arxiv.org/abs/1909.02950>`_, Douwe Kiela
et al.
A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images
(after the pooling layer), which go through a linear layer (to go from the number of features at the end of the resnet
to the hidden state dimension of the transformer).
The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
model know which part of the input vector corresponds to the text and which to the image.
The pretrained model only works for classification.
..
More information in this :doc:`model documentation </model_doc/mmbt.html>`.
TODO: write this page
.. _retrieval-based-models:
Retrieval-based models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Some models use document retrieval during (pre)training and inference, for example for open-domain question answering.
DPR
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=dpr">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
</a>
<a href="model_doc/dpr.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a>
`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_,
Vladimir Karpukhin et al.
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering
research. DPR consists of three models:
* Question encoder: encode questions as vectors
* Context encoder: encode contexts as vectors
* Reader: extract the answer to the question inside retrieved contexts, along with a relevance score (high if the
  inferred span actually answers the question).
DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and then it calls the reader with the question and the retrieved documents to get the answer.
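A hedged sketch of how the two encoders can be used to score a context against a question (the checkpoint names are
the public single-NQ ones; the dot product is the relevance score used during retrieval):

.. code-block::

    import torch
    from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                              DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

    q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

    question_emb = q_encoder(**q_tokenizer("Who wrote Hamlet?", return_tensors="pt"))[0]
    context_emb = ctx_encoder(**ctx_tokenizer("Hamlet is a tragedy written by William Shakespeare.",
                                              return_tensors="pt"))[0]

    # A higher dot product means the context is more relevant to the question.
    print(torch.matmul(question_emb, context_emb.T))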
RAG
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=rag">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
</a>
<a href="model_doc/rag.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a>
`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_,
Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models.
RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs.
The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
The two models RAG-Token and RAG-Sequence are available for generation.
More technical aspects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Full vs sparse attention
-----------------------------------------------------------------------------------------------------------------------
Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
computational bottleneck when you have long texts. Longformer and Reformer are models that try to be more efficient and
use a sparse version of the attention matrix to speed up training.
.. _lsh-attention:
**LSH attention**
:ref:`Reformer <reformer>` uses LSH attention. In softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because it will give a query and a key that are
equal (so very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
(determined by an ``n_rounds`` parameter) and then averaged together.
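A toy sketch of the bucketing step (random-projection hashing, not the library's actual implementation) may help
picture it:

.. code-block::

    import torch

    def lsh_bucket(vectors, projections):
        # Vectors that point in similar directions tend to get the same argmax, hence the same bucket.
        rotated = vectors @ projections
        return torch.argmax(torch.cat([rotated, -rotated], dim=-1), dim=-1)

    d, n_buckets = 64, 4
    projections = torch.randn(d, n_buckets // 2)  # shared between queries and keys

    queries = torch.randn(8, d)
    print(lsh_bucket(queries, projections))  # attention is then restricted to each bucket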
.. _local-attention:
**Local attention**
:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the
left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
window, the last layer will have a receptive field of more than just the tokens in the window, allowing the model to
build a representation of the whole sentence.
Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
their local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:
.. image:: imgs/local_attention_mask.png
:scale: 50 %
:align: center
Using those attention matrices with fewer parameters then allows the model to handle inputs with a bigger sequence
length.
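In code, a small sketch of how such a local + global attention mask can be built (the window size and the choice of
global positions are arbitrary here):

.. code-block::

    import torch

    def local_attention_mask(seq_len, window, global_positions=()):
        # 1 where a query (row) may attend to a key (column), 0 elsewhere.
        positions = torch.arange(seq_len)
        mask = (positions[None, :] - positions[:, None]).abs() <= window
        # Tokens with global attention see, and are seen by, every other token.
        for g in global_positions:
            mask[g, :] = True
            mask[:, g] = True
        return mask.long()

    print(local_attention_mask(seq_len=8, window=2, global_positions=[0]))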
Other tricks
-----------------------------------------------------------------------------------------------------------------------
.. _axial-pos-encoding:
**Axial positional encodings**
:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
:math:`E` is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension
of the hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To
alleviate that, axial positional encodings consist of factorizing that big matrix :math:`E` into two smaller matrices
:math:`E_{1}` and :math:`E_{2}`, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that
:math:`l_{1} \times l_{2} = l` and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way
smaller). The embedding for time step :math:`j` in :math:`E` is obtained by concatenating the embeddings for timestep
:math:`j \% l_{1}` in :math:`E_{1}` and :math:`j // l_{1}` in :math:`E_{2}`.
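As a small worked example (the sizes below are made up), the memory savings can be checked directly:

.. code-block::

    l1, l2 = 128, 512   # factorization of a sequence length l = 65536
    d1, d2 = 256, 768   # factorization of a hidden size d = 1024

    full_encoding = (l1 * l2) * (d1 + d2)   # a full l x d matrix: 67108864 entries
    axial_encoding = l1 * d1 + l2 * d2      # the two factorized matrices: 425984 entries

    # The encoding of position j is the concatenation of row j % l1 of E1 and row j // l1 of E2.
    print(full_encoding, axial_encoding)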
Multi-lingual models
================================================
=======================================================================================================================
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few
multi-lingual models are available and have different mechanisms than mono-lingual models.
......@@ -8,13 +8,13 @@ This page details the usage of these models.
The two models that currently support multiple languages are BERT and XLM.
XLM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
be split into two categories: the checkpoints that make use of language embeddings, and those that don't.
XLM & Language Embeddings
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
This section concerns the following checkpoints:
......@@ -82,7 +82,7 @@ The example `run_generation.py <https://github.com/huggingface/transformers/blob
can generate text using the CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
This section concerns the following checkpoints:
......@@ -94,7 +94,7 @@ sentence representations, differently from previously-mentioned XLM checkpoints.
BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
BERT has two checkpoints that can be used for multi-lingual tasks:
......@@ -105,7 +105,7 @@ These checkpoints do not require language embeddings at inference time. They sho
used in the context and infer accordingly.
XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong
gains over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification,
......
Perplexity of fixed-length models
=================================
=======================================================================================================================
Perplexity (PPL) is one of the most common metrics for evaluating language
models. Before diving in, we should note that the metric applies specifically
......@@ -31,7 +31,7 @@ relationship to Bits Per Character (BPC) and data compression, check out this
<https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If we weren't limited by a model's context size, we would evaluate the
model's perplexity by autoregressively factorizing a sequence and
......@@ -83,7 +83,7 @@ time. This allows computation to procede much faster while still giving the
model a large context to make predictions at each step.
Example: Calculating perplexity with GPT-2 in 🤗 Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's demonstrate this process with GPT-2.
......
Philosophy
==========
=======================================================================================================================
🤗 Transformers is an opinionated library built for:
......@@ -48,7 +48,7 @@ A few other goals:
- Switch easily between PyTorch and TensorFlow 2.0, allowing training using one framework and inference using another.
Main concepts
~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The library is built around three types of classes for each model:
......
Preprocessing data
==================
In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
As we saw in the :doc:`quicktour </quicktour>`, the tokenizer will first split a given text into words (or parts of
words, punctuation symbols, etc.), usually called `tokens`. Then it will convert those `tokens` into numbers, to be
able to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might
expect to work properly.
.. note::
    If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split
    the text you give it into tokens the same way as for the pretraining corpus, and it will use the same
    correspondence token to index (that we usually call a `vocab`) as during pretraining.
To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
:func:`~transformers.AutoTokenizer.from_pretrained` method:
.. code-block::
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
Base use
~~~~~~~~
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
.. code-block::
>>> encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> print(encoded_input)
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This returns a dictionary mapping strings to lists of ints.
The `input_ids <glossary.html#input-ids>`__ are the indices corresponding to each token in our sentence. We will see
below what the `attention_mask <glossary.html#attention-mask>`__ is used for and in
:ref:`the next section <sentence-pairs>` the goal of `token_type_ids <glossary.html#token-type-ids>`__.
The tokenizer can decode a list of token ids back into a proper sentence:
.. code-block::
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used `gpt2-medium` instead of `bert-base-cased` to create our tokenizer, we
would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if
you have added those special tokens yourself) by passing ``add_special_tokens=False``.
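For instance, with the tokenizer created above, the sentence should come back unchanged when the special tokens are
disabled:

.. code-block::

    >>> encoded_input = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
    >>> tokenizer.decode(encoded_input["input_ids"])
    "Hello, I'm a single sentence!"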
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:
.. code-block::
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[101, 1262, 1330, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]]}
We get back a dictionary once again, this time with values being lists of lists of ints.
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
probably want:
- To pad each sentence to the maximum length there is in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.
You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
.. code-block::
>>> ## PYTORCH CODE
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(batch)
{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
>>> ## TENSORFLOW CODE
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(batch)
{'input_ids': tf.Tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
It returns a dictionary mapping strings to tensors. We can now see what the `attention_mask <glossary.html#attention-mask>`__ is
all about: it points out which tokens the model should pay attention to and which ones it should not (because they
represent padding in this case).
Note that if your model does not have a maximum length associated with it, the command above will throw a warning. You
can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer from throwing those kinds of warnings.
.. _sentence-pairs:
Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify whether two
sentences in a pair are similar, or for question-answering models, which take a context and a question. For BERT
models, the input is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
This will once again return a dict mapping strings to lists of ints:
.. code-block::
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
of the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by
using ``return_token_type_ids`` or ``return_attention_mask``.
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
.. code-block::
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
list of first sentences and the list of second sentences:
.. code-block::
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
... "And I should be encoded with the second sentence",
... "And I go with the very last one"]
>>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
As we can see, it returns a dictionary with the values being lists of lists of ints.
To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
.. code-block::
    >>> for ids in encoded_inputs["input_ids"]:
    ...     print(tokenizer.decode(ids))
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
length the model can accept and return tensors directly with the following:
.. code-block::
## PYTORCH CODE
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
## TENSORFLOW CODE
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
Everything you always wanted to know about padding and truncation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
- :obj:`padding` controls the padding. It can be a boolean or a string which should be:
- :obj:`True` or :obj:`'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
a single sequence).
- :obj:`'max_length'` to pad to a length specified by the :obj:`max_length` argument or the maximum length accepted
by the model if no :obj:`max_length` is provided (``max_length=None``). If you only provide a single sequence,
padding will still be applied to it.
- :obj:`False` or :obj:`'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
behavior.
- :obj:`truncation` controls the truncation. It can be a boolean or a string which should be:
- :obj:`True` or :obj:`'only_first'` truncate to a maximum length specified by the :obj:`max_length` argument or
the maximum length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will
only truncate the first sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
- :obj:`'only_second'` truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will only truncate
the second sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
- :obj:`'longest_first'` truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will truncate token
by token, removing a token from the longest sequence in the pair until the proper length is reached.
- :obj:`False` or :obj:`'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
default behavior.
- :obj:`max_length` to control the length of the padding/truncation. It can be an integer or :obj:`None`, in which case
it will default to the maximum length the model can accept. If the model has no specific maximum input length,
truncation/padding to :obj:`max_length` is deactivated.
Here is a table summarizing the recommend way to setup padding and truncation. If you use pair of inputs sequence in
any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in
:obj:`['only_first', 'only_second', 'longest_first']`, i.e. :obj:`truncation='only_second'` or
:obj:`truncation= 'longest_first'` to control how both sequence in the pair are truncated as detailed before.
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| Truncation | Padding | Instruction |
+======================================+===================================+=============================================================================================+
| no truncation | no padding | :obj:`tokenizer(batch_sentences)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding='longest')` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length')` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | :obj:`tokenizer(batch_sentences, padding='max_length', max_length=42)` |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| truncation to max model input length | no padding | :obj:`tokenizer(batch_sentences, truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | Not possible |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| truncation to specific length | no padding | :obj:`tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | Not possible |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
Pre-tokenized inputs
~~~~~~~~~~~~~~~~~~~~
The tokenizer also accept pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
.. warning::
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them though the tokenizer
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).
If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
tokenizer. For instance, we have:
.. code-block::
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
``add_special_tokens=False``.
This works exactly as before for batch of sentences or batch of pairs of sentences. You can encode a batch of sentences
like this:
.. code-block::
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
["And", "another", "sentence"],
["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
or a batch of pair sentences like this:
.. code-block::
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
And you can add padding, truncation as well as directly return tensors like before:
.. code-block::
## PYTORCH CODE
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="pt")
## TENSORFLOW CODE
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="tf")
Preprocessing data
=======================================================================================================================
In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
As we saw in the :doc:`quicktour </quicktour>`, the tokenizer will first split a given text into words (or parts of words, punctuation symbols, etc.), usually called `tokens`. Then it will convert those `tokens` into numbers so it can build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect in order to work properly.
.. note::
If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split the text you give it into tokens the same way it did for the pretraining corpus, and it will use the same token-to-index correspondence (usually called the `vocab`) as during pretraining.
To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
:func:`~transformers.AutoTokenizer.from_pretrained` method:
.. code-block::
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
Base use
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
.. code-block::
>>> encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> print(encoded_input)
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This returns a dictionary mapping strings to lists of ints.
The `input_ids <glossary.html#input-ids>`__ are the indices corresponding to each token in our sentence. We will see
below what the `attention_mask <glossary.html#attention-mask>`__ is used for and in
:ref:`the next section <sentence-pairs>` the goal of `token_type_ids <glossary.html#token-type-ids>`__.
The tokenizer can decode a list of token ids back into a proper sentence:
.. code-block::
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need special
tokens; for instance, if we had used `gpt2-medium` instead of `bert-base-cased` to create our tokenizer, we would have
seen the same sentence as the original one here. You can disable this behavior (which is only advised if you have added
those special tokens yourself) by passing ``add_special_tokens=False``.
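For instance, here is a minimal sketch of the same call with the special tokens disabled (reusing the ``bert-base-cased`` tokenizer created above):
.. code-block::
>>> encoded_input = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
>>> tokenizer.decode(encoded_input["input_ids"])
"Hello, I'm a single sentence!"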
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:
.. code-block::
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[101, 1262, 1330, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]]}
We get back a dictionary once again, this time with values being lists of lists of ints.
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
probably want:
- To pad each sentence to the maximum length there is in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.
You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
.. code-block::
>>> ## PYTORCH CODE
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(batch)
{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
>>> ## TENSORFLOW CODE
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(batch)
{'input_ids': tf.Tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
It returns a dictionary mapping strings to tensors. We can now see what the `attention_mask <glossary.html#attention-mask>`__ is
all about: it indicates which tokens the model should pay attention to and which ones it should not (because they
represent padding in this case).
Note that if your model does not have a maximum length associated with it, the command above will throw a warning. You
can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer from throwing those kinds of warnings.
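For instance, a minimal sketch of silencing those warnings (same call as above, with the extra ``verbose`` flag):
.. code-block::
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt", verbose=False)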
.. _sentence-pairs:
Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometimes you need to feed a pair of sentences to your model, for instance if you want to classify whether two sentences in a
pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is
then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
This will once again return a dictionary mapping strings to lists of ints:
.. code-block::
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
of the inputs corresponds to the first sentence and which part corresponds to the second sentence. Note that
`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by
using ``return_input_ids`` or ``return_token_type_ids``.
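For example, here is a minimal sketch of explicitly disabling the `token_type_ids` for the pair above (``return_token_type_ids`` is the flag assumed here):
.. code-block::
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old", return_token_type_ids=False)
>>> "token_type_ids" in encoded_input
False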
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
.. code-block::
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
list of first sentences and the list of second sentences:
.. code-block::
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
... "And I should be encoded with the second sentence",
... "And I go with the very last one"]
>>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
As we can see, it returns a dictionary with the values being lists of lists of ints.
To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
.. code-block::
>>> for ids in encoded_inputs["input_ids"]:
...     print(tokenizer.decode(ids))
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
length the model can accept and return tensors directly with the following:
.. code-block::
## PYTORCH CODE
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
## TENSORFLOW CODE
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
Everything you always wanted to know about padding and truncation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen the commands that will work for most cases (pad your batch to the length of the longest sentence and
truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
- :obj:`padding` controls the padding. It can be a boolean or a string which should be:
- :obj:`True` or :obj:`'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
a single sequence).
- :obj:`'max_length'` to pad to a length specified by the :obj:`max_length` argument or the maximum length accepted
by the model if no :obj:`max_length` is provided (``max_length=None``). If you only provide a single sequence,
padding will still be applied to it.
- :obj:`False` or :obj:`'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
behavior.
- :obj:`truncation` controls the truncation. It can be a boolean or a string which should be:
- :obj:`True` or :obj:`'only_first'` to truncate to a maximum length specified by the :obj:`max_length` argument or
the maximum length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will
only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- :obj:`'only_second'` to truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will only truncate
the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- :obj:`'longest_first'` to truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will truncate token
by token, removing a token from the longest sequence in the pair, until the proper length is reached.
- :obj:`False` or :obj:`'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
default behavior.
- :obj:`max_length` controls the length of the padding/truncation. It can be an integer or :obj:`None`, in which case
it will default to the maximum length the model can accept. If the model has no specific maximum input length,
truncation/padding to :obj:`max_length` is deactivated.
Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in
:obj:`['only_first', 'only_second', 'longest_first']`, i.e., :obj:`truncation='only_second'` or
:obj:`truncation='longest_first'`, to control how both sequences in the pair are truncated, as detailed before.
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| Truncation | Padding | Instruction |
+======================================+===================================+=============================================================================================+
| no truncation | no padding | :obj:`tokenizer(batch_sentences)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding='longest')` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length')` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | :obj:`tokenizer(batch_sentences, padding='max_length', max_length=42)` |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| truncation to max model input length | no padding | :obj:`tokenizer(batch_sentences, truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | Not possible |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| truncation to specific length | no padding | :obj:`tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | Not possible |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
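For instance, a sketch of the last row of this table, combining padding to a specific length with a pair-truncation strategy (reusing the batches defined earlier):
.. code-block::
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='max_length', truncation='only_second', max_length=42)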
Pre-tokenized inputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tokenizer also accepts pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
.. warning::
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).
If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
tokenizer. For instance, we have:
.. code-block::
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
``add_special_tokens=False``.
This works exactly as before for batches of sentences or batches of pairs of sentences. You can encode a batch of sentences
like this:
.. code-block::
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
["And", "another", "sentence"],
["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
or a batch of pairs of sentences like this:
.. code-block::
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
And you can add padding and truncation, as well as directly return tensors, like before:
.. code-block::
## PYTORCH CODE
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="pt")
## TENSORFLOW CODE
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="tf")
Pretrained models
================================================
=======================================================================================================================
Here is the full list of the currently provided pretrained models together with a short presentation of each model.
......
Quick tour
==========
=======================================================================================================================
Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for
Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
......@@ -14,7 +14,7 @@ will dig a little bit more and see how the library gives you access to those mod
not, the code is expected to work for both backends without any change needed.
Getting started on a task with a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`. 🤗 Transformers
provides the following tasks out of the box:
......@@ -123,7 +123,7 @@ to share your fine-tuned model on the hub with the community, using :doc:`this t
.. _pretrained-model:
Under the hood: pretrained models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
using the :obj:`from_pretrained` method:
......@@ -142,7 +142,7 @@ using the :obj:`from_pretrained` method:
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
Using the tokenizer
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text into
words (or parts of words, punctuation symbols, etc.), usually called `tokens`. There are multiple rules that can govern
......@@ -210,7 +210,7 @@ padding token the model was pretrained with. The attention mask is also adapted
You can learn more about tokenizers :doc:`here <preprocessing>`.
Using the model
^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the
......@@ -330,7 +330,7 @@ Lastly, you can also ask the model to return all hidden states and all attention
>>> all_hidden_states, all_attentions = tf_outputs[-2:]
Accessing the code
^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that will automatically work with any
pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the
......@@ -358,7 +358,7 @@ without the auto magic:
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
Customizing the model
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you want to change how the model itself is built, you can define your custom configuration class. Each architecture
comes with its own relevant configuration (in the case of DistilBERT, :class:`~transformers.DistilBertConfig`) which
......
**********************************************
***********************************************************************************************************************
Exporting transformers models
**********************************************
***********************************************************************************************************************
ONNX / ONNXRuntime
==============================================
=======================================================================================================================
Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT) <https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field
to provide a unified and community-driven format to store and, by extension, efficiently execute neural networks leveraging a variety
......@@ -42,7 +42,7 @@ Also, the conversion tool supports different options which let you tune the beha
Optimizations
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph.
Below are some of the operators which can be enabled to speed up inference through ONNXRuntime (*see note below*):
......@@ -68,7 +68,7 @@ Optimizations can then be enabled when loading the model through ONNX runtime fo
For more information about the optimizations enabled by ONNXRuntime, please have a look at the `ONNXRuntime Github <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_
Quantization
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
The ONNX exporter supports generating a quantized version of the model to allow efficient inference.
......@@ -116,7 +116,7 @@ Example of quantized BERT model export:
TorchScript
=======================================
=======================================================================================================================
.. note::
This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities
......@@ -141,10 +141,10 @@ These necessities imply several things developers should be careful about. These
Implications
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
TorchScript flag and tied weights
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied weights, therefore
it is necessary to untie and clone the weights beforehand.
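As a minimal sketch (assuming the ``bert-base-uncased`` checkpoint; the full export example follows below), the flag is simply set when loading the model:
.. code-block:: python
from transformers import BertModel
# with torchscript=True the shared weights are untied and cloned, so the model can be traced
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)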
......@@ -157,7 +157,7 @@ This is not the case for models that do not have a Language Model head, as those
can be safely exported without the ``torchscript`` flag.
Dummy inputs and standard lengths
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
PyTorch keeps track of the different operations executed on each tensor. These recorded operations are then used
......@@ -178,12 +178,12 @@ It is recommended to be careful of the total number of operations done on each i
when exporting varying sequence-length models.
Using TorchScript in Python
-------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Below is an example, showing how to save, load models as well as how to use the trace for inference.
Saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated
according to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``
......@@ -229,7 +229,7 @@ according to a ``BertConfig`` class and then saved to disk under the filename ``
torch.jit.save(traced_model, "traced_bert.pt")
Loading a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This snippet shows how to load the ``BertModel`` that was previously saved to disk under the name ``traced_bert.pt``.
We are re-using the previously initialised ``dummy_input``.
......@@ -242,7 +242,7 @@ We are re-using the previously initialised ``dummy_input``.
all_encoder_layers, pooled_output = loaded_model(*dummy_input)
Using a traced model for inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using the traced model for inference is as simple as using its ``__call__`` dunder method:
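Concretely, a minimal sketch reusing the ``loaded_model`` and ``dummy_input`` from the snippets above:
.. code-block:: python
# the traced model is called exactly like the original one
all_encoder_layers, pooled_output = loaded_model(*dummy_input)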
......
Summary of the tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This page shows the most frequent use-cases when using the library. The models available allow for many different
configurations and a great versatility in use-cases. The simplest ones are presented here, showcasing usage
......@@ -38,7 +38,7 @@ Both approaches are showcased here.
This would produce random output.
Sequence Classification
--------------------------
-----------------------------------------------------------------------------------------------------------------------
Sequence classification is the task of classifying sequences according to a given number of classes. An example
of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
......@@ -152,7 +152,7 @@ of each other. The process is the following:
is paraphrase: 6%
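Condensed, a sketch of that paraphrase-classification process (assuming the publicly available ``bert-base-cased-finetuned-mrpc`` checkpoint; exact percentages will vary) looks like this:
.. code-block:: python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]
# encode the two sentences as a pair, then classify
inputs = tokenizer("The company HuggingFace is based in New York City",
                   "HuggingFace's headquarters are situated in Manhattan",
                   return_tensors="pt")
logits = model(**inputs)[0]
probs = torch.softmax(logits, dim=1)[0]
for label, prob in zip(classes, probs):
    print(f"{label}: {round(prob.item() * 100)}%")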
Extractive Question Answering
----------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
......@@ -297,7 +297,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
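A shorter sketch uses the ``question-answering`` pipeline directly (the context string below is just an illustration):
.. code-block:: python
from transformers import pipeline

question_answerer = pipeline("question-answering")
context = "Extractive Question Answering is the task of extracting an answer from a text given a question."
result = question_answerer(question="What is extracted from a text?", context=context)
# the pipeline returns the answer span together with a confidence score
print(result["answer"], round(result["score"], 4))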
Language Modeling
----------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based
models are trained using a variant of language modeling, e.g. BERT with masked language modeling, GPT-2 with
......@@ -308,7 +308,7 @@ domain-specific: using a language model trained over a very large corpus, and th
or on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
Masked Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
......@@ -421,7 +421,7 @@ This prints five sequences, with the top 5 tokens predicted by the model:
Causal Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the
model only attends to the left context (tokens on the left of the mask). Such training is particularly interesting
......@@ -493,7 +493,7 @@ This outputs a (hopefully) coherent next token following the original sequence,
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to generate multiple tokens up to a user-defined length.
Text Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In text generation (*a.k.a* *open-ended text generation*) the goal is to create a coherent portion of text that is a continuation of the given context. The following example shows how *GPT-2* can be used in pipelines to generate text. By default, all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations (see the `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`__ for example).
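Here is a short sketch of that pipeline usage (the prompt is arbitrary and greedy decoding is forced with ``do_sample=False``):
.. code-block:: python
from transformers import pipeline

text_generator = pipeline("text-generation")
# generate up to 50 tokens as a continuation of the prompt
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))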
......@@ -576,7 +576,7 @@ For more information on how to apply different decoding strategies for text gene
Named Entity Recognition
----------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a
token as a person, an organisation or a location.
......@@ -723,7 +723,7 @@ following array should be the output:
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
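A lighter-weight sketch of the same task uses the ``ner`` pipeline (its output format differs slightly from the array above):
.. code-block:: python
from transformers import pipeline

ner_pipe = pipeline("ner")
sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
            "therefore very close to the Manhattan Bridge.")
# each recognized token comes back with its entity label and score
print(ner_pipe(sequence))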
Summarization
----------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Summarization is the task of summarizing a document or an article into a shorter text.
......@@ -798,7 +798,7 @@ In this example we use Google`s T5 model. Even though it was pre-trained only on
>>> outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
Translation
----------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Translation is the task of translating a text from one language to another.
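A minimal sketch with the ``translation_en_to_fr`` pipeline (the input sentence is arbitrary):
.. code-block:: python
from transformers import pipeline

translator = pipeline("translation_en_to_fr")
print(translator("Hugging Face is a technology company based in New York and Paris.", max_length=40))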
......
Testing
==========
=======================================================================================================================
Let's take a look at how 🤗 Transformer models are tested and how you can write new tests and improve the existing ones.
......@@ -10,7 +10,7 @@ There are 2 test suites in the repository:
2. ``examples`` -- tests primarily for various applications that aren't part of the API
How transformers are tested
---------------------------
-----------------------------------------------------------------------------------------------------------------------
1. Once a PR is submitted, it gets tested with 9 CircleCI jobs. Every new commit to that PR gets retested. These jobs are defined in this `config file <https://github.com/huggingface/transformers/blob/master/.circleci/config.yml>`__, so that if needed you can reproduce the same environment on your machine.
......@@ -34,14 +34,14 @@ How transformers are tested
Running tests
-------------
-----------------------------------------------------------------------------------------------------------------------
Choosing which tests to run
~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This document goes into many details of how tests can be run. If, after reading everything, you need even more details, you will find them `here <https://docs.pytest.org/en/latest/usage.html>`__.
......@@ -75,7 +75,7 @@ which tells pytest to:
Getting the list of all tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All tests of the test suite:
......@@ -92,7 +92,7 @@ All tests of a given test file:
Run a specific test module
~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To run an individual test module:
......@@ -102,7 +102,7 @@ To run an individual test module:
Run specific tests
~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest class containing those tests. For example, it could be:
......@@ -156,7 +156,7 @@ And you can combine the two patterns in one:
Run only modified tests
~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can run the tests related to the unstaged files or the current branch (according to Git) by using `pytest-picked <https://github.com/anapaulagomes/pytest-picked>`__. This is a great way of quickly testing that your changes didn't break anything, since it won't run the tests related to files you didn't touch.
......@@ -172,7 +172,7 @@ All tests will be run from files and folders which are modified, but not
yet committed.
Automatically rerun failed tests on source modification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`pytest-xdist <https://github.com/pytest-dev/pytest-xdist>`__ provides a
very useful feature of detecting all failed tests, and then waiting for
......@@ -212,7 +212,7 @@ alternative implementation of this functionality.
Skip a test module
~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to run all test modules except a few, you can exclude them by giving an explicit list of tests to run. For example, to run all except the ``test_modeling_*.py`` tests:
......@@ -222,7 +222,7 @@ If you want to run all test modules, except a few you can exclude them by giving
Clearing state
~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For CI builds, and when isolation is important (at the expense of speed), the cache should
be cleared:
......@@ -232,7 +232,7 @@ be cleared:
pytest --cache-clear tests
Running tests in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As mentioned earlier, ``make test`` runs tests in parallel via the ``pytest-xdist`` plugin (``-n X`` argument, e.g. ``-n 2`` to run 2 parallel jobs).
......@@ -246,7 +246,7 @@ tests in the same order, which should help with then somehow reducing
that failing sequence to a minimum.
Test order and repetition
~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's good to repeat the tests several times, in sequence, randomly, or
in sets, to detect any potential inter-dependency and state-related bugs
......@@ -255,7 +255,7 @@ detect some problems that get uncovered by randomness of DL.
Repeat tests
^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `pytest-flakefinder <https://github.com/dropbox/pytest-flakefinder>`__:
......@@ -277,7 +277,7 @@ And then run every test multiple times (50 by default):
Run tests in a random order
^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
......@@ -333,10 +333,10 @@ details please see its `documentation <https://github.com/jbasko/pytest-random-o
Another randomization alternative is `pytest-randomly <https://github.com/pytest-dev/pytest-randomly>`__. This module has very similar functionality/interface, but it doesn't have the bucket modes available in ``pytest-random-order``. It has the same problem of imposing itself once installed.
Look and feel variations
~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pytest-sugar
^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`pytest-sugar <https://github.com/Frozenball/pytest-sugar>`__ is a
plugin that improves the look-and-feel, adds a progress bar, and shows tests
......@@ -358,7 +358,7 @@ or uninstall it.
Report each sub-test name and its progress
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For a single or a group of tests via ``pytest`` (after
``pip install pytest-pspec``):
......@@ -370,7 +370,7 @@ For a single or a group of tests via ``pytest`` (after
Instantly shows failed tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`pytest-instafail <https://github.com/pytest-dev/pytest-instafail>`__
shows failures and errors instantly instead of waiting until the end of
......@@ -385,7 +385,7 @@ test session.
pytest --instafail
To GPU or not to GPU
~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On a GPU-enabled setup, to test in CPU-only mode add ``CUDA_VISIBLE_DEVICES=""``:
......@@ -403,14 +403,14 @@ This is handy when you want to run different tasks on different GPUs.
And we have these decorators that require the condition described by the marker:
.. code-block:: python
@require_torch
@require_tf
@require_multigpu
@require_non_multigpu
@require_torch_tpu
@require_torch_and_cuda
Some decorators like ``@parametrized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed last for them to work correctly. Here is an example of the correct usage:
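A sketch of that ordering (the test body is a placeholder; ``require_torch`` comes from ``transformers.testing_utils`` and ``parameterized`` is the external package):
.. code-block:: python
import unittest
from parameterized import parameterized
from transformers.testing_utils import require_torch

class TestSomething(unittest.TestCase):
    # @require_torch is listed last, after @parameterized.expand, so the skip still applies to the rewritten test names
    @parameterized.expand([(0,), (1,)])
    @require_torch
    def test_feature(self, value):
        self.assertIn(value, (0, 1))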
......@@ -437,7 +437,7 @@ Inside tests:
Output capture
~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
During test execution any output sent to ``stdout`` and ``stderr`` is
captured. If a test or a setup method fails, its corresponding captured
......@@ -458,7 +458,7 @@ To send test results to JUnit format output:
Color control
~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To have no color (e.g., yellow on white background is not readable):
......@@ -469,7 +469,7 @@ To have no color (e.g., yellow on white background is not readable):
Sending test report to online pastebin service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating a URL for each test failure:
......@@ -490,7 +490,7 @@ Creating a URL for a whole test session log:
Writing tests
-------------
-----------------------------------------------------------------------------------------------------------------------
🤗 transformers tests are based on ``unittest``, but run by ``pytest``, so most of the time features from both systems can be used.
......@@ -498,7 +498,7 @@ You can read `here <https://docs.pytest.org/en/stable/unittest.html>`__ which fe
Parametrization
~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within the test, but then there is no way of running that test for just one set of arguments.
......@@ -596,7 +596,7 @@ as in the previous example.
Temporary files and directories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using unique temporary files and directories is essential for parallel test running, so that the tests won't overwrite each other's data. We also want the temporary files and directories removed at the end of each test that created them. Therefore, using packages like ``tempfile``, which address these needs, is essential.
......@@ -646,7 +646,7 @@ In this and all the following scenarios the temporary directory will be auto-rem
Skipping tests
~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is useful when a bug is found and a new test is written, but the
bug is not fixed yet. In order to be able to commit it to the main
......@@ -673,7 +673,7 @@ causes some bad state that will affect other tests, do not use
``xfail``.
Implementation
^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Here is how to skip a whole test unconditionally:
......@@ -749,7 +749,7 @@ or skip the whole module:
More details, example and ways are `here <https://docs.pytest.org/en/latest/skipping.html>`__.
Custom markers
~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Slow tests
......@@ -776,7 +776,7 @@ Some decorators like ``@parametrized`` rewrite test names, therefore ``@slow`` a
def test_integration_foo():
Testing the stdout/stderr output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to test functions that write to ``stdout`` and/or ``stderr``,
the test can access those streams using ``pytest``'s `capsys
......@@ -885,7 +885,7 @@ If you need to capture both streams at once, use the parent
Capturing logger stream
~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you need to validate the output of a logger, you can use :obj:`CaptureLogger`:
......@@ -903,7 +903,7 @@ If you need to validate the output of a logger, you can use :obj:`CaptureLogger`
Testing with environment variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to test the impact of environment variables for a specific test, you can use the helper decorator ``transformers.testing_utils.mockenv``
......@@ -917,7 +917,7 @@ If you want to test the impact of environment variables for a specific test you
Getting reproducible results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In some situations you may want to remove randomness from your tests. To
get identical, reproducible results, you will need to fix the seed:
......@@ -944,7 +944,7 @@ get identical reproducable results set, you will need to fix the seed:
tf.random.set_seed(seed)
Debugging tests
~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To start a debugger at the point of the warning, do this:
......