Commit 3323146e authored by Sylvain Gugger, committed by GitHub

Models doc (#7345)



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FlauBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 58405a52
Benchmarks
=======================================================================================================================
Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here <https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow you to flexibly benchmark 🤗 Transformer models.
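Before going further, here is a minimal sketch of how the PyTorch benchmark class can be invoked; the model name, batch sizes and sequence lengths below are illustrative choices, not values prescribed by this page.

.. code-block:: python

    >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    >>> # illustrative settings; any model identifier from the model hub works here
    >>> args = PyTorchBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[32, 128])
    >>> benchmark = PyTorchBenchmark(args)
    >>> results = benchmark.run()  # reports speed and peak memory for each configuration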
The benchmark classes allow us to measure the `peak memory usage` and `required time` for both
...@@ -300,7 +300,7 @@ deciding for which configuration the model should be trained.
Benchmark best practices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This section lists a couple of best practices one should be aware of when benchmarking a model.
...@@ -311,7 +311,7 @@ This section lists a couple of best practices one should be aware of when benchm
Sharing your benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Previously all available core models (10 at the time) were benchmarked for `inference time`, across many different settings: using PyTorch, with
and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for
......
BERTology
-----------------------------------------------------------------------------------------------------------------------
There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:
......
Converting TensorFlow Checkpoints
=======================================================================================================================
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints to models that can be loaded using the ``from_pretrained`` methods of the library.
...@@ -10,7 +10,7 @@ A command-line interface is provided to convert original Bert/GPT/GPT-2/Transfor
The documentation below reflects the **transformers-cli convert** command format.
BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) into a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
...@@ -34,7 +34,7 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py>`_ script.
...@@ -54,7 +54,7 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__.
OpenAI GPT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint is saved in the same format as the OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ )
...@@ -70,7 +70,7 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
OpenAI GPT-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ )
...@@ -85,7 +85,7 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode
[--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK]
Transformer-XL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
...@@ -101,7 +101,7 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo
XLNet
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained XLNet model:
...@@ -118,7 +118,7 @@ Here is an example of the conversion process for a pre-trained XLNet model:
XLM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained XLM model:
......
Fine-tuning with custom datasets
=======================================================================================================================
.. note::
...@@ -24,7 +24,7 @@ We include several examples, each of which demonstrates a different type of comm
.. _seq_imdb:
Sequence Classification with IMDb Reviews
-----------------------------------------------------------------------------------------------------------------------
.. note::
...@@ -139,7 +139,7 @@ Now that our datasets our ready, we can fine-tune a model either with the 🤗
.. _ft_trainer:
Fine-tuning with Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The steps above prepared the datasets in the way the trainer expects. Now all we need to do is create a
model to fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments`
...@@ -200,7 +200,7 @@ and instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer
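To make the flow concrete, here is a condensed sketch of that Trainer setup for the PyTorch case; the DistilBERT checkpoint and the ``train_dataset``/``val_dataset`` names are placeholders standing in for the datasets encoded earlier in this tutorial, and the argument values are only examples.

.. code-block:: python

    from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
    training_args = TrainingArguments(
        output_dir="./results",              # where checkpoints and logs are written
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,         # placeholder: the encoded training split
        eval_dataset=val_dataset,            # placeholder: the encoded validation split
    )
    trainer.train()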
.. _ft_native:
Fine-tuning with native PyTorch/TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We can also train using native PyTorch or TensorFlow:
...@@ -244,7 +244,7 @@ We can also train use native PyTorch or TensorFlow:
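For reference, a bare-bones native PyTorch loop over the same placeholder dataset might look like the following sketch; the dataset name and the batch keys assume the data was encoded with a tokenizer as above, and are not part of the original page.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader
    from transformers import AdamW, DistilBertForSequenceClassification

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased").to(device)
    model.train()
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)  # placeholder dataset
    optim = AdamW(model.parameters(), lr=5e-5)

    for batch in train_loader:
        optim.zero_grad()
        outputs = model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            labels=batch["labels"].to(device),
        )
        loss = outputs[0]   # the loss is the first element of the returned tuple
        loss.backward()
        optim.step()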
.. _tok_ner:
Token Classification with W-NUT Emerging Entities
-----------------------------------------------------------------------------------------------------------------------
.. note::
...@@ -443,7 +443,7 @@ sequence classification example above.
.. _qa_squad:
Question Answering with SQuAD 2.0
-----------------------------------------------------------------------------------------------------------------------
.. note::
...@@ -655,7 +655,7 @@ multiple model outputs.
.. _resources:
Additional Resources
-----------------------------------------------------------------------------------------------------------------------
- `How to train a new language model from scratch using Transformers and Tokenizers
  <https://huggingface.co/blog/how-to-train>`_. Blog post showing the steps to load in Esperanto data and train a
...@@ -666,7 +666,7 @@ Additional Resources
.. _nlplib:
Using the 🤗 NLP Datasets & Metrics library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with
🤗 Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the
......
Glossary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
General terms
-----------------------------------------------------------------------------------------------------------------------
- autoencoding models: see MLM
- autoregressive models: see CLM
...@@ -27,7 +27,7 @@ General terms
or a punctuation symbol.
Model inputs
-----------------------------------------------------------------------------------------------------------------------
Every model is different yet bears similarities with the others. Therefore most models use the same inputs, which are
detailed here alongside usage examples.
...@@ -35,7 +35,7 @@ detailed here alongside usage examples.
.. _input-ids:
Input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
numerical representations of tokens building the sequences that will be used as input by the model*.
...@@ -43,7 +43,7 @@ numerical representations of tokens building the sequences that will be used as
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ tokenizer:
.. code-block::
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
...@@ -52,7 +52,7 @@ tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ token
The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.
.. code-block::
>>> tokenized_sequence = tokenizer.tokenize(sequence)
...@@ -60,7 +60,7 @@ The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is
added for "RA" and "M":
.. code-block::
>>> print(tokenized_sequence)
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
...@@ -69,14 +69,14 @@ These tokens can then be converted into IDs which are understandable by the mode
the sentence to the tokenizer, which leverages the Rust implementation of
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance.
.. code-block::
>>> inputs = tokenizer(sequence)
The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The
token indices are under the key "input_ids":
.. code-block::
>>> encoded_sequence = inputs["input_ids"]
>>> print(encoded_sequence)
...@@ -87,13 +87,13 @@ IDs the model sometimes uses.
If we decode the previous sequence of ids,
.. code-block::
>>> decoded_sequence = tokenizer.decode(encoded_sequence)
we will see
.. code-block::
>>> print(decoded_sequence)
[CLS] A Titan RTX has 24GB of VRAM [SEP]
...@@ -103,14 +103,14 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
.. _attention-mask:
Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together. This argument indicates to the
model which tokens should be attended to, and which should not.
For example, consider these two sequences:
.. code-block::
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
...@@ -123,7 +123,7 @@ For example, consider these two sequences:
The encoded versions have different lengths:
.. code-block::
>>> len(encoded_sequence_a), len(encoded_sequence_b)
(8, 19)
...@@ -134,13 +134,13 @@ of the second one, or the second one needs to be truncated down to the length of
In the first case, the list of IDs will be extended by the padding indices. We can pass a list to the tokenizer and ask
it to pad like this:
.. code-block::
>>> padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
We can see that 0s have been added on the right of the first sentence to make it the same length as the second one:
.. code-block::
>>> padded_sequences["input_ids"]
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
...@@ -150,7 +150,7 @@ the position of the padded indices so that the model does not attend to them. Fo
:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates
a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask":
.. code-block::
>>> padded_sequences["attention_mask"]
[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
...@@ -158,20 +158,20 @@ a padded value. This attention mask is in the dictionary returned by the tokeniz
.. _token-type-ids:
Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``)
tokens. For example, the BERT model builds its two sequence input as such:
.. code-block::
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and
not a list, like before) like this:
.. code-block::
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
...@@ -183,7 +183,7 @@ not a list, like before) like this:
which will return:
.. code-block::
>>> print(decoded)
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
...@@ -194,7 +194,7 @@ mask identifying the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry:
.. code-block::
>>> encoded_dict['token_type_ids']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
...@@ -207,7 +207,7 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
.. _position-ids:
Position IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contrary to RNNs that have the position of each token embedded within them,
transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens.
...@@ -221,7 +221,7 @@ use other types of positional embeddings, such as sinusoidal position embeddings
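A hedged sketch of what passing explicit position IDs looks like (models build this absolute range themselves when the argument is omitted, so this is rarely needed in practice):

.. code-block:: python

    >>> import torch
    >>> from transformers import BertModel, BertTokenizer

    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    >>> model = BertModel.from_pretrained("bert-base-cased")
    >>> inputs = tokenizer("A Titan RTX has 24GB of VRAM", return_tensors="pt")
    >>> # one position index per token, shaped (batch_size, sequence_length)
    >>> position_ids = torch.arange(inputs["input_ids"].shape[1]).unsqueeze(0)
    >>> outputs = model(**inputs, position_ids=position_ids)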
.. _feed-forward-chunking:
Feed Forward Chunking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g.,
......
Transformers
=======================================================================================================================
State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
...@@ -11,7 +11,7 @@ TensorFlow 2.0 and PyTorch.
This is the documentation of our repository `transformers <https://github.com/huggingface/transformers>`_.
Features
-----------------------------------------------------------------------------------------------------------------------
- High performance on NLU and NLG tasks
- Low barrier to entry for educators and practitioners
...@@ -36,7 +36,7 @@ Choose the right framework for every part of a model's lifetime:
- Seamlessly pick the right framework for training, evaluation, production
Contents
-----------------------------------------------------------------------------------------------------------------------
The documentation is organized in five parts:
......
Custom Layers and Utilities
-----------------------------------------------------------------------------------------------------------------------
This page lists all the custom layers used by the library, as well as the utility functions it provides for modeling.
Most of those are only useful if you are studying the code of the models in the library.
Pytorch custom modules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_utils.Conv1D
...@@ -29,8 +29,8 @@ Most of those are only useful if you are studying the code of the models in the
:members: forward
PyTorch Helper Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: transformers.apply_chunking_to_forward
...@@ -42,8 +42,8 @@ Most of those are only useful if you are studying the code of the models in the
.. autofunction:: transformers.modeling_utils.prune_linear_layer
TensorFlow custom layers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_utils.TFConv1D
...@@ -54,8 +54,8 @@ Most of those are only useful if you are studying the code of the models in the
:members: call
TensorFlow loss functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_utils.TFCausalLanguageModelingLoss
:members:
...@@ -76,8 +76,8 @@ Most of those are only useful if you are studying the code of the models in the
:members:
TensorFlow Helper Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: transformers.modeling_tf_utils.cast_bool_to_primitive
......
Utilities for pipelines
-----------------------------------------------------------------------------------------------------------------------
This page lists all the utility functions the library provides for pipelines.
Most of those are only useful if you are studying the code of the models in the library.
Argument handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.pipelines.ArgumentHandler
.. autoclass:: transformers.pipelines.ZeroShotClassificationArgumentHandler
.. autoclass:: transformers.pipelines.QuestionAnsweringArgumentHandler
Data format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.pipelines.PipelineDataFormat
:members:
.. autoclass:: transformers.pipelines.CsvPipelineDataFormat
:members:
.. autoclass:: transformers.pipelines.JsonPipelineDataFormat
:members:
.. autoclass:: transformers.pipelines.PipedPipelineDataFormat
:members:
Utilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: transformers.pipelines.get_framework
.. autoclass:: transformers.pipelines.PipelineException
Utilities for Tokenizers
-----------------------------------------------------------------------------------------------------------------------
This page lists all the utility functions used by the tokenizers, mainly the class
:class:`~transformers.tokenization_utils_base.PreTrainedTokenizerBase` that implements the common methods between
:class:`~transformers.PreTrainedTokenizer` and :class:`~transformers.PreTrainedTokenizerFast` and the mixin
:class:`~transformers.tokenization_utils_base.SpecialTokensMixin`.
Most of those are only useful if you are studying the code of the tokenizers in the library.
PreTrainedTokenizerBase
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.tokenization_utils_base.PreTrainedTokenizerBase
:special-members: __call__
:members:
SpecialTokensMixin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.tokenization_utils_base.SpecialTokensMixin
:members:
Enums and namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum
.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy
.. autoclass:: transformers.tokenization_utils_base.TensorType
.. autoclass:: transformers.tokenization_utils_base.TruncationStrategy
.. autoclass:: transformers.tokenization_utils_base.CharSpan
.. autoclass:: transformers.tokenization_utils_base.TokenSpan
Configuration
-----------------------------------------------------------------------------------------------------------------------
The base class :class:`~transformers.PretrainedConfig` implements the common methods for loading/saving a configuration
either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded
...@@ -7,7 +7,7 @@ from HuggingFace's AWS S3 repository).
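As a quick illustration of those common methods, here is a hedged sketch using a BERT configuration; the checkpoint name and local path are arbitrary examples.

.. code-block:: python

    >>> from transformers import BertConfig

    >>> config = BertConfig.from_pretrained("bert-base-uncased")  # downloaded or read from the local cache
    >>> config.hidden_size
    768
    >>> config.save_pretrained("./my-bert-config")                # writes config.json to a local directory
    >>> reloaded = BertConfig.from_pretrained("./my-bert-config")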
PretrainedConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PretrainedConfig
:members:
Logging
-----------------------------------------------------------------------------------------------------------------------
🤗 Transformers has a centralized logging system, so that you can set up the verbosity of the library easily.
Currently the default verbosity of the library is ``WARNING``.
To change the level of verbosity, just use one of the direct setters. For instance, here is how to change the verbosity
to the INFO level.
.. code-block:: python
import transformers
transformers.logging.set_verbosity_info()
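The generic getter and setter can also be used with the level constants listed further down; a small sketch:

.. code-block:: python

    import transformers

    transformers.logging.set_verbosity(transformers.logging.ERROR)  # only report errors
    current_level = transformers.logging.get_verbosity()            # returns the current level as an int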
You can also use the environment variable ``TRANSFORMERS_VERBOSITY`` to override the default verbosity. You can set it
to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example:
.. code-block:: bash
...@@ -32,7 +34,7 @@ verbose to the most verbose), those levels (with their corresponding int values
- :obj:`transformers.logging.DEBUG` (int value, 10): report all information.
Base setters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: transformers.logging.set_verbosity_error
...@@ -43,7 +45,7 @@ Base setters
.. autofunction:: transformers.logging.set_verbosity_debug
Other functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: transformers.logging.get_verbosity
......
Models
-----------------------------------------------------------------------------------------------------------------------
The base classes :class:`~transformers.PreTrainedModel` and :class:`~transformers.TFPreTrainedModel` implement the
common methods for loading/saving a model either from a local file or directory, or from a pretrained model
...@@ -17,36 +17,36 @@ for text generation, :class:`~transformers.generation_utils.GenerationMixin` (fo
:class:`~transformers.generation_tf_utils.TFGenerationMixin` (for the TensorFlow models)
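For orientation, a minimal sketch of the shared loading/saving methods, using BERT as an arbitrary example checkpoint and an arbitrary local path:

.. code-block:: python

    >>> from transformers import BertModel

    >>> model = BertModel.from_pretrained("bert-base-cased")   # download from the hub or load from cache
    >>> model.save_pretrained("./my-bert")                      # writes config and weights to a local directory
    >>> reloaded = BertModel.from_pretrained("./my-bert")       # the same method reloads from that directory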
PreTrainedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PreTrainedModel
:members:
ModuleUtilsMixin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_utils.ModuleUtilsMixin
:members:
TFPreTrainedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFPreTrainedModel
:members:
TFModelUtilsMixin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_utils.TFModelUtilsMixin
:members:
Generative models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.generation_utils.GenerationMixin
:members:
......
Optimization
-----------------------------------------------------------------------------------------------------------------------
The ``.optimization`` module provides:
...@@ -7,29 +7,29 @@ The ``.optimization`` module provides:
- several schedules in the form of schedule objects that inherit from ``_LRSchedule``:
- a gradient accumulation class to accumulate the gradients of multiple batches
AdamW (PyTorch)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AdamW
:members:
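A hedged usage sketch combining ``AdamW`` with one of the schedules from this module; ``model``, ``train_dataloader`` and the step counts are placeholders rather than values taken from this page.

.. code-block:: python

    from transformers import AdamW, get_linear_schedule_with_warmup

    optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)       # `model` is any nn.Module
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000)

    for batch in train_dataloader:       # placeholder dataloader
        outputs = model(**batch)
        loss = outputs[0]                # assumes the model returns the loss first
        loss.backward()
        optimizer.step()
        scheduler.step()                 # update the learning rate after each optimizer step
        optimizer.zero_grad()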
AdaFactor (PyTorch)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Adafactor
AdamWeightDecay (TensorFlow)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AdamWeightDecay
.. autofunction:: transformers.create_optimizer
Schedules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Learning Rate Schedules (Pytorch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autofunction:: transformers.get_constant_schedule
...@@ -62,16 +62,16 @@ Learning Rate Schedules (Pytorch)
:target: /imgs/warmup_linear_schedule.png
:alt:
Warmup (TensorFlow)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.WarmUp
:members:
Gradient Strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
GradientAccumulator (TensorFlow)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.GradientAccumulator
Model outputs
-----------------------------------------------------------------------------------------------------------------------
PyTorch models have outputs that are instances of subclasses of :class:`~transformers.file_utils.ModelOutput`. Those
are data structures containing all the information returned by the model, but that can also be used as tuples or
...@@ -44,98 +44,217 @@ values. Here for instance, it has two keys that are ``loss`` and ``logits``.
We document here the generic model outputs that are used by more than one model type. Specific output types are
documented on their corresponding model page.
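To make that behaviour concrete, here is a hedged sketch with a BERT sequence classification checkpoint (an arbitrary choice); the output can be addressed by attribute or indexed like a tuple.

.. code-block:: python

    >>> import torch
    >>> from transformers import BertForSequenceClassification, BertTokenizer

    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> model = BertForSequenceClassification.from_pretrained("bert-base-uncased", return_dict=True)
    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    >>> outputs = model(**inputs, labels=torch.tensor([1]))
    >>> loss, logits = outputs.loss, outputs.logits   # attribute access on the output dataclass
    >>> loss, logits = outputs[0], outputs[1]         # the same values, accessed as a tuple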
ModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.file_utils.ModelOutput
:members:
BaseModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.BaseModelOutput
:members:
BaseModelOutputWithPooling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.BaseModelOutputWithPooling
:members:
BaseModelOutputWithPast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.BaseModelOutputWithPast
:members:
Seq2SeqModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.Seq2SeqModelOutput
:members:
CausalLMOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.CausalLMOutput
:members:
CausalLMOutputWithPast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.CausalLMOutputWithPast
:members:
MaskedLMOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.MaskedLMOutput
:members:
Seq2SeqLMOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.Seq2SeqLMOutput
:members:
NextSentencePredictorOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.NextSentencePredictorOutput
:members:
SequenceClassifierOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.SequenceClassifierOutput
:members:
Seq2SeqSequenceClassifierOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput
:members:
MultipleChoiceModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.MultipleChoiceModelOutput
:members:
TokenClassifierOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.TokenClassifierOutput
:members:
QuestionAnsweringModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.QuestionAnsweringModelOutput
:members:
Seq2SeqQuestionAnsweringModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput
:members:
TFBaseModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFBaseModelOutput
:members:
TFBaseModelOutputWithPooling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling
:members:
TFBaseModelOutputWithPast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFBaseModelOutputWithPast
:members:
TFSeq2SeqModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFSeq2SeqModelOutput
:members:
TFCausalLMOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFCausalLMOutput
:members:
TFCausalLMOutputWithPast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFCausalLMOutputWithPast
:members:
TFMaskedLMOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFMaskedLMOutput
:members:
TFSeq2SeqLMOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFSeq2SeqLMOutput
:members:
TFNextSentencePredictorOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFNextSentencePredictorOutput
:members:
TFSequenceClassifierOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFSequenceClassifierOutput
:members:
TFSeq2SeqSequenceClassifierOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput
:members:
TFMultipleChoiceModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFMultipleChoiceModelOutput
:members:
TFTokenClassifierOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFTokenClassifierOutput
:members:
TFQuestionAnsweringModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput
:members:
TFSeq2SeqQuestionAnsweringModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput
:members:
Pipelines Pipelines
---------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most
of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
...@@ -24,7 +24,7 @@ There are two categories of pipeline abstractions to be aware about: ...@@ -24,7 +24,7 @@ There are two categories of pipeline abstractions to be aware about:
- :class:`~transformers.Text2TextGenerationPipeline` - :class:`~transformers.Text2TextGenerationPipeline`
The pipeline abstraction The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any
other pipeline but requires an additional argument which is the `task`. other pipeline but requires an additional argument which is the `task`.
...@@ -33,10 +33,10 @@ other pipeline but requires an additional argument which is the `task`. ...@@ -33,10 +33,10 @@ other pipeline but requires an additional argument which is the `task`.
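For illustration, here is a minimal sketch of the `pipeline` abstraction applied to sentiment analysis (the default model for the task is downloaded on first use):

.. code-block:: python

    from transformers import pipeline

    # Instantiating the wrapper only requires the task name; a default model is selected for it.
    classifier = pipeline("sentiment-analysis")

    # The call returns a list of dictionaries, each with a label and a score.
    print(classifier("We are very happy to show you the 🤗 Transformers library."))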
The task specific pipelines The task specific pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ConversationalPipeline ConversationalPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.Conversation .. autoclass:: transformers.Conversation
...@@ -45,76 +45,76 @@ ConversationalPipeline ...@@ -45,76 +45,76 @@ ConversationalPipeline
:members: :members:
FeatureExtractionPipeline FeatureExtractionPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.FeatureExtractionPipeline .. autoclass:: transformers.FeatureExtractionPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
FillMaskPipeline FillMaskPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.FillMaskPipeline .. autoclass:: transformers.FillMaskPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
NerPipeline NerPipeline
========================================== =======================================================================================================================
This class is an alias of the :class:`~transformers.TokenClassificationPipeline` defined below. Please refer to that This class is an alias of the :class:`~transformers.TokenClassificationPipeline` defined below. Please refer to that
pipeline for documentation and usage examples. pipeline for documentation and usage examples.
QuestionAnsweringPipeline QuestionAnsweringPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.QuestionAnsweringPipeline .. autoclass:: transformers.QuestionAnsweringPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
SummarizationPipeline SummarizationPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.SummarizationPipeline .. autoclass:: transformers.SummarizationPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
TextClassificationPipeline TextClassificationPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.TextClassificationPipeline .. autoclass:: transformers.TextClassificationPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
TextGenerationPipeline TextGenerationPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.TextGenerationPipeline .. autoclass:: transformers.TextGenerationPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
Text2TextGenerationPipeline Text2TextGenerationPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.Text2TextGenerationPipeline .. autoclass:: transformers.Text2TextGenerationPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
TokenClassificationPipeline TokenClassificationPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.TokenClassificationPipeline .. autoclass:: transformers.TokenClassificationPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
ZeroShotClassificationPipeline ZeroShotClassificationPipeline
========================================== =======================================================================================================================
.. autoclass:: transformers.ZeroShotClassificationPipeline .. autoclass:: transformers.ZeroShotClassificationPipeline
:special-members: __call__ :special-members: __call__
:members: :members:
Parent class: :obj:`Pipeline` Parent class: :obj:`Pipeline`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Pipeline .. autoclass:: transformers.Pipeline
:members: :members:
Processors Processors
---------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
This library includes processors for several traditional tasks. These processors can be used to process a dataset into This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model. examples that can be fed to a model.
Processors Processors
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All processors follow the same architecture which is that of the All processors follow the same architecture which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list :class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list
...@@ -26,7 +26,7 @@ of :class:`~transformers.data.processors.utils.InputExample`. These ...@@ -26,7 +26,7 @@ of :class:`~transformers.data.processors.utils.InputExample`. These
GLUE GLUE
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates `General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates
the performance of models across a diverse set of existing NLU tasks. It was released together with the paper the performance of models across a diverse set of existing NLU tasks. It was released together with the paper
...@@ -52,13 +52,13 @@ Additionally, the following method can be used to load values from a data file ...@@ -52,13 +52,13 @@ Additionally, the following method can be used to load values from a data file
.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features .. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features
Example usage Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script. An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
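To give a rough idea of that workflow, here is a minimal sketch combining a GLUE processor with the conversion method above; ``path/to/MRPC`` is a placeholder for a local copy of the MRPC data:

.. code-block:: python

    from transformers import AutoTokenizer, glue_convert_examples_to_features
    from transformers.data.processors.glue import MrpcProcessor

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    # "path/to/MRPC" is a placeholder: point it at a local download of the MRPC data.
    processor = MrpcProcessor()
    examples = processor.get_train_examples("path/to/MRPC")

    # Turn the InputExample objects into features a model can consume.
    features = glue_convert_examples_to_features(examples, tokenizer, max_length=128, task="mrpc")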
XNLI XNLI
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates `The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
the quality of cross-lingual text representations. the quality of cross-lingual text representations.
...@@ -78,7 +78,7 @@ An example using these processors is given in the ...@@ -78,7 +78,7 @@ An example using these processors is given in the
SQuAD SQuAD
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates `The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper
...@@ -88,7 +88,7 @@ the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://a ...@@ -88,7 +88,7 @@ the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://a
This library hosts a processor for each of the two versions: This library hosts a processor for each of the two versions:
Processors Processors
^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Those processors are: Those processors are:
- :class:`~transformers.data.processors.utils.SquadV1Processor` - :class:`~transformers.data.processors.utils.SquadV1Processor`
...@@ -109,7 +109,7 @@ Examples are given below. ...@@ -109,7 +109,7 @@ Examples are given below.
Example usage Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example using the processors as well as the conversion method using data files: Here is an example using the processors as well as the conversion method using data files:
Example:: Example::
......
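As a minimal sketch of that workflow (assuming the SQuAD v2.0 JSON files live in a local ``data_dir``):

.. code-block:: python

    from transformers import AutoTokenizer, SquadV2Processor, squad_convert_examples_to_features

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # "data_dir" is a placeholder for a directory containing the SQuAD v2.0 JSON files.
    processor = SquadV2Processor()
    examples = processor.get_train_examples("data_dir")

    # Convert the examples into model-ready features (the values below are common defaults).
    features = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=True,
    )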
Tokenizer Tokenizer
---------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most
of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the
...@@ -36,24 +36,24 @@ alignment methods which can be used to map between the original string (characte ...@@ -36,24 +36,24 @@ alignment methods which can be used to map between the original string (characte
getting the index of the token comprising a given character or the span of characters corresponding to a given token). getting the index of the token comprising a given character or the span of characters corresponding to a given token).
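For instance, a minimal sketch of those alignment methods with a fast tokenizer (``bert-base-uncased`` is used purely for illustration):

.. code-block:: python

    from transformers import AutoTokenizer

    # Fast tokenizers return a BatchEncoding exposing the alignment helpers mentioned above.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
    encoding = tokenizer("Hello, how are you?")

    print(encoding.tokens())           # the subword tokens
    print(encoding.char_to_token(7))   # index of the token covering character 7
    print(encoding.token_to_chars(1))  # span of characters covered by token 1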
``PreTrainedTokenizer`` PreTrainedTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PreTrainedTokenizer .. autoclass:: transformers.PreTrainedTokenizer
:special-members: __call__ :special-members: __call__
:members: :members:
``PreTrainedTokenizerFast`` PreTrainedTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PreTrainedTokenizerFast .. autoclass:: transformers.PreTrainedTokenizerFast
:special-members: __call__ :special-members: __call__
:members: :members:
``BatchEncoding`` BatchEncoding
~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BatchEncoding .. autoclass:: transformers.BatchEncoding
:members: :members:
Trainer Trainer
---------- -----------------------------------------------------------------------------------------------------------------------
The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`. training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.
Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training. customization during training.
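As a rough sketch of that setup (``model``, ``train_dataset`` and ``eval_dataset`` are assumed to be defined elsewhere):

.. code-block:: python

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="./results",              # where checkpoints and logs are written
        num_train_epochs=3,
        per_device_train_batch_size=8,
    )

    # model, train_dataset and eval_dataset are assumed to be defined elsewhere.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()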
The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
<https://github.com/NVIDIA/apex>`__ for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow. <https://github.com/NVIDIA/apex>`__ for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.
Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain the basic training loop supporting the Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain the basic training loop supporting the
previous features. To inject custom behavior you can subclass them and override the following methods: previous features. To inject custom behavior you can subclass them and override the following methods:
- **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset. - **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset.
- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaluation DataLoader (PyTorch) or TF Dataset. - **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaluation DataLoader (PyTorch) or TF Dataset.
- **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset. - **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
- **log** -- Logs information on the various objects watching training. - **log** -- Logs information on the various objects watching training.
- **setup_wandb** -- Sets up wandb (see `here <https://docs.wandb.com/huggingface>`__ for more information). - **setup_wandb** -- Sets up wandb (see `here <https://docs.wandb.com/huggingface>`__ for more information).
- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at - **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at
init. init.
- **compute_loss** -- Computes the loss on a batch of training inputs. - **compute_loss** -- Computes the loss on a batch of training inputs.
- **training_step** -- Performs a training step. - **training_step** -- Performs a training step.
- **prediction_step** -- Performs an evaluation/test step. - **prediction_step** -- Performs an evaluation/test step.
- **run_model** (TensorFlow only) -- Basic pass through the model. - **run_model** (TensorFlow only) -- Basic pass through the model.
- **evaluate** -- Runs an evaluation loop and returns metrics. - **evaluate** -- Runs an evaluation loop and returns metrics.
- **predict** -- Returns predictions (with metrics if labels are available) on a test set. - **predict** -- Returns predictions (with metrics if labels are available) on a test set.
Here is an example of how to customize :class:`~transformers.Trainer` using a custom loss function: Here is an example of how to customize :class:`~transformers.Trainer` using a custom loss function:
.. code-block:: python .. code-block:: python
from transformers import Trainer from transformers import Trainer
class MyTrainer(Trainer): class MyTrainer(Trainer):
def compute_loss(self, model, inputs): def compute_loss(self, model, inputs):
labels = inputs.pop("labels") labels = inputs.pop("labels")
outputs = model(**inputs) outputs = model(**inputs)
logits = outputs[0] logits = outputs[0]
return my_custom_loss(logits, labels) return my_custom_loss(logits, labels)
``Trainer`` Trainer
~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Trainer .. autoclass:: transformers.Trainer
:members: :members:
``TFTrainer`` TFTrainer
~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTrainer .. autoclass:: transformers.TFTrainer
:members: :members:
``TrainingArguments`` TrainingArguments
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrainingArguments .. autoclass:: transformers.TrainingArguments
:members: :members:
``TFTrainingArguments`` TFTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTrainingArguments .. autoclass:: transformers.TFTrainingArguments
:members: :members:
Utilities Utilities
~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.EvalPrediction .. autoclass:: transformers.EvalPrediction
.. autofunction:: transformers.set_seed .. autofunction:: transformers.set_seed
.. autofunction:: transformers.torch_distributed_zero_first .. autofunction:: transformers.torch_distributed_zero_first
ALBERT ALBERT
---------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
Overview Overview
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_ The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents <https://arxiv.org/abs/1909.11942>`__ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
speed of BERT:
- Splitting the embedding matrix into two smaller matrices - Splitting the embedding matrix into two smaller matrices.
- Using repeating layers split among groups - Using repeating layers split among groups.
The abstract from the paper is the following: The abstract from the paper is the following:
...@@ -30,17 +31,17 @@ Tips: ...@@ -30,17 +31,17 @@ Tips:
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
number of (repeating) layers. number of (repeating) layers.
The original code can be found `here <https://github.com/google-research/ALBERT>`_. The original code can be found `here <https://github.com/google-research/ALBERT>`__.
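As a quick usage sketch (the ``albert-base-v2`` checkpoint is used purely for illustration):

.. code-block:: python

    import torch
    from transformers import AlbertTokenizer, AlbertModel

    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
    model = AlbertModel.from_pretrained("albert-base-v2")

    inputs = tokenizer("ALBERT shares parameters across its layers.", return_tensors="pt")
    with torch.no_grad():
        last_hidden_state = model(**inputs)[0]

    print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)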
AlbertConfig AlbertConfig
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertConfig .. autoclass:: transformers.AlbertConfig
:members: :members:
AlbertTokenizer AlbertTokenizer
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertTokenizer .. autoclass:: transformers.AlbertTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask, :members: build_inputs_with_special_tokens, get_special_tokens_mask,
...@@ -48,7 +49,7 @@ AlbertTokenizer ...@@ -48,7 +49,7 @@ AlbertTokenizer
Albert specific outputs Albert specific outputs
~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_albert.AlbertForPreTrainingOutput .. autoclass:: transformers.modeling_albert.AlbertForPreTrainingOutput
:members: :members:
...@@ -58,98 +59,98 @@ Albert specific outputs ...@@ -58,98 +59,98 @@ Albert specific outputs
AlbertModel AlbertModel
~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertModel .. autoclass:: transformers.AlbertModel
:members: :members: forward
AlbertForPreTraining AlbertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForPreTraining .. autoclass:: transformers.AlbertForPreTraining
:members: :members: forward
AlbertForMaskedLM AlbertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForMaskedLM .. autoclass:: transformers.AlbertForMaskedLM
:members: :members: forward
AlbertForSequenceClassification AlbertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForSequenceClassification .. autoclass:: transformers.AlbertForSequenceClassification
:members: :members: forward
AlbertForMultipleChoice AlbertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForMultipleChoice .. autoclass:: transformers.AlbertForMultipleChoice
:members: :members:
AlbertForTokenClassification AlbertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForTokenClassification .. autoclass:: transformers.AlbertForTokenClassification
:members: :members: forward
AlbertForQuestionAnswering AlbertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForQuestionAnswering .. autoclass:: transformers.AlbertForQuestionAnswering
:members: :members: forward
TFAlbertModel TFAlbertModel
~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertModel .. autoclass:: transformers.TFAlbertModel
:members: :members: call
TFAlbertForPreTraining TFAlbertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForPreTraining .. autoclass:: transformers.TFAlbertForPreTraining
:members: :members: call
TFAlbertForMaskedLM TFAlbertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForMaskedLM .. autoclass:: transformers.TFAlbertForMaskedLM
:members: :members: call
TFAlbertForSequenceClassification TFAlbertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForSequenceClassification .. autoclass:: transformers.TFAlbertForSequenceClassification
:members: :members: call
TFAlbertForMultipleChoice TFAlbertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForMultipleChoice .. autoclass:: transformers.TFAlbertForMultipleChoice
:members: :members: call
TFAlbertForTokenClassification TFAlbertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForTokenClassification .. autoclass:: transformers.TFAlbertForTokenClassification
:members: :members: call
TFAlbertForQuestionAnswering TFAlbertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForQuestionAnswering .. autoclass:: transformers.TFAlbertForQuestionAnswering
:members: :members: call
AutoClasses AutoClasses
----------- -----------------------------------------------------------------------------------------------------------------------
In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
are supplying to the :obj:`from_pretrained()` method. are supplying to the :obj:`from_pretrained()` method.
...@@ -20,112 +20,112 @@ There is one class of :obj:`AutoModel` for each task, and for each backend (PyTo ...@@ -20,112 +20,112 @@ There is one class of :obj:`AutoModel` for each task, and for each backend (PyTo
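For instance, a minimal sketch of letting the auto classes infer the architecture from a checkpoint name (``bert-base-cased`` is used purely for illustration):

.. code-block:: python

    from transformers import AutoConfig, AutoModel, AutoTokenizer

    # The architecture (here BERT) is inferred from the checkpoint name.
    config = AutoConfig.from_pretrained("bert-base-cased")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")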
AutoConfig AutoConfig
~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoConfig .. autoclass:: transformers.AutoConfig
:members: :members:
AutoTokenizer AutoTokenizer
~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoTokenizer .. autoclass:: transformers.AutoTokenizer
:members: :members:
AutoModel AutoModel
~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModel .. autoclass:: transformers.AutoModel
:members: :members:
AutoModelForPreTraining AutoModelForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForPreTraining .. autoclass:: transformers.AutoModelForPreTraining
:members: :members:
AutoModelWithLMHead AutoModelWithLMHead
~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelWithLMHead .. autoclass:: transformers.AutoModelWithLMHead
:members: :members:
AutoModelForSequenceClassification AutoModelForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForSequenceClassification .. autoclass:: transformers.AutoModelForSequenceClassification
:members: :members:
AutoModelForMultipleChoice AutoModelForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForMultipleChoice .. autoclass:: transformers.AutoModelForMultipleChoice
:members: :members:
AutoModelForTokenClassification AutoModelForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForTokenClassification .. autoclass:: transformers.AutoModelForTokenClassification
:members: :members:
AutoModelForQuestionAnswering AutoModelForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForQuestionAnswering .. autoclass:: transformers.AutoModelForQuestionAnswering
:members: :members:
TFAutoModel TFAutoModel
~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModel .. autoclass:: transformers.TFAutoModel
:members: :members:
TFAutoModelForPreTraining TFAutoModelForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelForPreTraining .. autoclass:: transformers.TFAutoModelForPreTraining
:members: :members:
TFAutoModelWithLMHead TFAutoModelWithLMHead
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelWithLMHead .. autoclass:: transformers.TFAutoModelWithLMHead
:members: :members:
TFAutoModelForSequenceClassification TFAutoModelForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelForSequenceClassification .. autoclass:: transformers.TFAutoModelForSequenceClassification
:members: :members:
TFAutoModelForMultipleChoice TFAutoModelForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelForMultipleChoice .. autoclass:: transformers.TFAutoModelForMultipleChoice
:members: :members:
TFAutoModelForTokenClassification TFAutoModelForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelForTokenClassification .. autoclass:: transformers.TFAutoModelForTokenClassification
:members: :members:
TFAutoModelForQuestionAnswering TFAutoModelForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelForQuestionAnswering .. autoclass:: transformers.TFAutoModelForQuestionAnswering
:members: :members: