Unverified Commit 08f534d2 authored by Sylvain Gugger, committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Styling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
@@ -3,7 +3,7 @@ MarianMT

**Bugs:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @sshleifer.

Translations should be similar, but not identical to, output in the test set linked to in each model card.

@@ -12,13 +12,14 @@ Implementation Notes

- Each model is about 298 MB on disk, there are more than 1,000 models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by `Jörg Tiedemann
  <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
  <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
  in a model card.
- The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:

  - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
  - a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
  - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)

@@ -29,17 +30,17 @@ Implementation Notes

Naming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}` (see the sketch after this list)
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here
  <https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
  code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
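For illustration, loading and running one of these checkpoints could look roughly like the sketch below. The
`Helsinki-NLP/opus-mt-en-de` checkpoint is just one instance of the naming scheme picked for the example; any other
language pair follows the same pattern.

.. code-block::

    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"  # opus-mt-{src}-{tgt} with src=en, tgt=de
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # encode the source sentence, generate in the target language, decode back to text
    batch = tokenizer.prepare_seq2seq_batch(["I am a small frog."], return_tensors="pt")
    translated = model.generate(**batch)
    print(tokenizer.batch_decode(translated, skip_special_tokens=True))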
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:

- If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
  looking at the model card, or the Group Members `mapping

@@ -112,6 +113,7 @@ Code to see available pretrained models:

MarianConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianConfig
    :members:
@@ -7,9 +7,10 @@ MBart

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete

@@ -21,12 +22,13 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma

Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
multilingual, it expects the sequences in a different format. A special language id token is added in both the source
and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The target
text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.

The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode
the sequences for sequence-to-sequence fine-tuning.
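As a rough sketch of that batch preparation (the `facebook/mbart-large-en-ro` checkpoint and the returned keys are
assumptions for the example and may differ between versions):

.. code-block::

    from transformers import MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
    batch = tokenizer.prepare_seq2seq_batch(
        src_texts=["UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria"],
        src_lang="en_XX",
        tgt_texts=["Şeful ONU declară că nu există o soluţie militară în Siria"],
        tgt_lang="ro_RO",
        return_tensors="pt",
    )
    # the language id tokens are inserted as described above; the batch typically holds
    # input_ids and attention_mask for the source and labels for the target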
- Supervised training

@@ -44,8 +46,8 @@ the sequences for sequence-to-sequence fine-tuning.

- Generation

  While generating the target text, set the :obj:`decoder_start_token_id` to the target language id. The following
  example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.

  .. code-block::
@@ -14,23 +14,23 @@ The abstract from the paper is the following:

*Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds
of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot
be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating
the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to
various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while
equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks.
To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE
model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is
4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the
natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms
latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of
90.0/79.2 (1.5/2.1 higher than BERT_BASE).*

Tips:

- MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
  than the left.
- MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
  with a causal language modeling (CLM) objective are better in that regard. A masked-token prediction sketch follows
  this list.
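A minimal sketch, assuming the `google/mobilebert-uncased` checkpoint:

.. code-block::

    from transformers import MobileBertForMaskedLM, MobileBertTokenizer

    tokenizer = MobileBertTokenizer.from_pretrained("google/mobilebert-uncased")
    model = MobileBertForMaskedLM.from_pretrained("google/mobilebert-uncased")

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    logits = model(**inputs, return_dict=True).logits

    # take the highest-scoring token at the masked position
    mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    print(tokenizer.decode(logits[0, mask_positions].argmax(-1).tolist()))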
The original code can be found `here <https://github.com/google-research/mobilebert>`__.
@@ -9,9 +9,8 @@ and assign @sshleifer.

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.

According to the abstract,

@@ -26,7 +25,7 @@ The Authors' code can be found `here <https://github.com/google-research/pegasus

Checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides
`pegasus-large`, from which the other checkpoints are fine-tuned:

- Each checkpoint is 2.2 GB on disk and 568M parameters.

@@ -44,6 +43,7 @@ Implementation Notes

- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from :class:`~transformers.BartForConditionalGeneration`
- Some key configuration differences:

  - static, sinusoidal position embeddings
  - no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
  - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.

@@ -84,6 +84,7 @@ PegasusConfig

PegasusTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

warning: ``add_tokens`` does not work at the moment.

.. autoclass:: transformers.PegasusTokenizer
@@ -8,13 +8,24 @@ ProphetNet

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.

ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just
the next token.

The abstract from the paper is the following:

*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*

The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
RAG
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate

@@ -15,46 +15,40 @@ Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäs

The abstract from the paper is the following:

*Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve
state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely
manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind
task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge
remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric
memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a
general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained
parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a
pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a
pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages
across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our
models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks,
outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation
tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art
parametric-only seq2seq baseline.*

RagConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagConfig
    :members:

RagTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagTokenizer
    :members: prepare_seq2seq_batch

Rag specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.modeling_rag.RetrievAugLMMarginOutput
    :members:

@@ -63,28 +57,28 @@ Rag specific outputs

    :members:

RagRetriever
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagRetriever
    :members:

RagModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagModel
    :members: forward

RagSequenceForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagSequenceForGeneration
    :members: forward, generate

RagTokenForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagTokenForGeneration
    :members: forward, generate
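A rough end-to-end sketch with :class:`~transformers.RagTokenForGeneration` (assuming the `facebook/rag-token-nq`
checkpoint; :obj:`use_dummy_dataset=True` keeps the example from downloading the full Wikipedia index):

.. code-block::

    from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
    retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
    model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

    # retrieve supporting documents and generate an answer conditioned on them
    input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
    generated = model.generate(input_ids=input_dict["input_ids"])
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])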
@@ -10,7 +10,7 @@ Overview

The Reformer model was proposed in the paper `Reformer: The Efficient Transformer
<https://arxiv.org/abs/2001.04451.pdf>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

The abstract from the paper is the following:

*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can
be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of

@@ -36,12 +36,12 @@ would result in a position encoding matrix:

.. math::

    X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]

which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:

.. math::

    X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]

and

.. math::

    X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right]

@@ -67,22 +67,23 @@ factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the :obj

Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}`
can drastically reduce the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.

In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` whose sum has to be
equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` whose
product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the `sequence
length` of the :obj:`input_ids`.
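As a sketch, the example values above (:math:`d^1 = d^2 = 2^5`, :math:`n_s^1 = 2^9`, :math:`n_s^2 = 2^{10}`) would
translate into a configuration roughly like the following; the numbers are illustrative, not recommended settings:

.. code-block::

    from transformers import ReformerConfig, ReformerModelWithLMHead

    config = ReformerConfig(
        hidden_size=64,                # must equal the sum of axial_pos_embds_dim (d^1 + d^2)
        axial_pos_embds=True,
        axial_pos_embds_dim=(32, 32),  # (d^1, d^2)
        axial_pos_shape=(512, 1024),   # (n_s^1, n_s^2); the product is the training sequence length
    )
    model = ReformerModelWithLMHead(config)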
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
`Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key
query embedding vectors to one of :obj:`config.num_buckets` possible buckets. The premise is that the more "similar"
key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to
the same bucket.

The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument
:obj:`num_hashes` of the forward function so that the output of the LSH self attention better approximates the output
of the "normal" full self attention. The buckets are then sorted and chunked into query key embedding vector chunks
each of length :obj:`config.lsh_chunk_length`. For each chunk, the query embedding vectors attend to its key vectors

@@ -92,11 +93,11 @@ neighboring chunks and :obj:`config.lsh_num_chunks_after` following neighboring

For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post
<https://www.pragmatic.ml/reformer-deep-dive/>`__.

Note that :obj:`config.num_buckets` can also be factorized into a list :math:`(n_{\text{buckets}}^1,
n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots,
n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots,
1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to
save memory.

When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the
sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be

@@ -128,7 +129,7 @@ multiple of :obj:`config.lsh_chunk_length` and :obj:`config.local_chunk_length`

Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can
easily be trained on sequences as long as 64000 tokens.

For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows:

.. code-block::
@@ -8,8 +8,8 @@ The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretrainin

<https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with
much larger mini-batches and learning rates.

The abstract from the paper is the following:

@@ -17,15 +17,15 @@ The abstract from the paper is the following:

approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every
model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results
highlight the importance of previously overlooked design choices, and raise questions about the source of recently
reported improvements. We release our models and code.*

Tips:

- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a setup
  for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
  different pretraining scheme.
- RoBERTa doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just
@@ -4,38 +4,34 @@ SqueezeBERT

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The SqueezeBERT model was proposed in `SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It's a
bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the
SqueezeBERT architecture is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
instead of fully-connected layers for the Q, K, V and FFN layers.

The abstract from the paper is the following:

*Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets,
large computing systems, and better neural network models, natural language processing (NLP) technology has made
significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant
opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we
consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's
highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with
BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods
such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these
techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called
SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test
set. The SqueezeBERT code will be released.*

Tips:

- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
  rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
  with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
  `squeezebert/squeezebert-mnli-headless` checkpoint (see the sketch after this list).
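A possible starting point for such a fine-tuning run, sketched under the assumption that a fresh two-way classification
head is wanted:

.. code-block::

    import torch
    from transformers import SqueezeBertForSequenceClassification, SqueezeBertTokenizer

    tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
    model = SqueezeBertForSequenceClassification.from_pretrained(
        "squeezebert/squeezebert-mnli-headless", num_labels=2  # the classification head is newly initialized
    )

    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    outputs = model(**inputs, labels=torch.tensor([1]))  # loss and logits for one labeled example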
@@ -29,13 +29,12 @@ Tips:

  each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
  different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
  for summarization: *summarize: ...*.

  For more information about which prefix to use, it is easiest to look into Appendix D of the `paper
  <https://arxiv.org/pdf/1910.10683.pdf>`__.

- For sequence-to-sequence generation, it is recommended to use :obj:`T5ForConditionalGeneration.generate()`, as in the
  sketch after this list. This method takes care of feeding the encoded input via cross-attention layers to the decoder
  and auto-regressively generates the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
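A minimal generation sketch (the `t5-small` checkpoint is an arbitrary choice; any T5 checkpoint with any prefix from
Appendix D works the same way):

.. code-block::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # prepend the task prefix, then let generate() run the decoder auto-regressively
    input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))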
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.

@@ -51,14 +50,14 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash

- Unsupervised denoising training

  In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
  the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
  sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`,
  :obj:`<extra_id_1>`, ... up to :obj:`<extra_id_99>`. As a default, 100 sentinel tokens are available in
  :class:`~transformers.T5Tokenizer`.

  For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
  processed as follows:

  .. code-block::

@@ -69,10 +68,10 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash

- Supervised training

  In this setup the input sequence and output sequence are standard sequence-to-sequence input output mapping. In
  translation, for instance with the input sequence "The house is wonderful." and output sequence "Das Haus ist
  wunderbar.", the sentences should be processed as follows:

  .. code-block::

      input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
@@ -14,19 +14,19 @@ The abstract from the paper is the following:

*Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the
setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency
beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a
novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the
context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450%
longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+
times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of
bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn
Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably
coherent, novel text articles with thousands of tokens.*

Tips:

- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The
  original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.

The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.
@@ -14,21 +14,21 @@ Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the

The abstract from the paper is the following:

*Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We
propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain
state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our
approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we
obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised
machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the
previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*

Tips:

- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
  select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the :doc:`multi-lingual
  <../multilingual>` page for more information; a short sketch follows this list.
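A short sketch of passing language ids to a multilingual checkpoint (the `xlm-clm-enfr-1024` checkpoint is assumed; the
forward argument is named :obj:`langs` and the mapping comes from the tokenizer's :obj:`lang2id`):

.. code-block::

    import torch
    from transformers import XLMTokenizer, XLMWithLMHeadModel

    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
    model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

    input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])
    # one language id per token, taken from the tokenizer's lang2id mapping
    langs = torch.full_like(input_ids, tokenizer.lang2id["en"])
    outputs = model(input_ids, langs=langs)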
The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.
@@ -9,13 +9,25 @@ XLM-ProphetNet

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.

XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of
just the next token. Its architecture is identical to ProphetNet, but the model was trained on the multi-lingual
"wiki100" Wikipedia dump.

The abstract from the paper is the following:

*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*

The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
......
...@@ -12,25 +12,25 @@ data. ...@@ -12,25 +12,25 @@ data.
The abstract from the paper is the following: The abstract from the paper is the following:
*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for *This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a
a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly
outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on
on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We
We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the
including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource
low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing
without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We
and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.* will make XLM-R code, data, and models publicly available.*
Tips: Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does - XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct
language from the input ids, as in the sketch below. language from the input ids, as in the sketch below.
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage - This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
examples as well as the information relative to the inputs and outputs. as well as the information relative to the inputs and outputs.
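For instance, here is a minimal sketch of the first tip above (the ``xlm-roberta-base`` checkpoint and the French
sentence are just illustrative choices; note that no ``lang`` tensor is passed):

.. code-block::

    from transformers import XLMRobertaModel, XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
    model = XLMRobertaModel.from_pretrained("xlm-roberta-base")

    # No language tensor is needed: the model infers the language from the input ids.
    inputs = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt")
    outputs = model(**inputs)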
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__. The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.
......
...@@ -16,11 +16,11 @@ The abstract from the paper is the following: ...@@ -16,11 +16,11 @@ The abstract from the paper is the following:
better performance than pretraining approaches based on autoregressive language modeling. However, relying on better performance than pretraining approaches based on autoregressive language modeling. However, relying on
corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a
pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive
pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all
all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into
into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large
a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.* margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
Tips: Tips:
......
...@@ -15,8 +15,8 @@ Prepare your model for uploading ...@@ -15,8 +15,8 @@ Prepare your model for uploading
We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the done something similar on your task, either using the model directly in your own training loop or using the
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` class. Let's see how you can share the result on :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` class. Let's see how you can share the result on the
the `model hub <https://huggingface.co/models>`__. `model hub <https://huggingface.co/models>`__.
Basic steps Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...@@ -60,22 +60,20 @@ Make your model work on all frameworks ...@@ -60,22 +60,20 @@ Make your model work on all frameworks
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's super easy to do (and in a future version, your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's
it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to super easy to do (and in a future version, it will all be automatic). You will need to install both PyTorch and
worry about the GPU, so it should be very easy. Check the TensorFlow for this step, but you don't need to worry about the GPU, so it should be very easy. Check the `TensorFlow
`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ and/or the `PyTorch
and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how. installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
First check that your model class exists in the other framework, that is, try to import the same model by either adding First check that your model class exists in the other framework, that is, try to import the same model by either adding
or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to type
type
.. code-block:: .. code-block::
from transformers import TFDistilBertForSequenceClassification from transformers import TFDistilBertForSequenceClassification
and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to type
type
.. code-block:: .. code-block::
...@@ -112,7 +110,8 @@ Make sure there are no garbage files in the directory you'll upload. It should o ...@@ -112,7 +110,8 @@ Make sure there are no garbage files in the directory you'll upload. It should o
- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason); - a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason);
- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save; - a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save; - a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>` save; - files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part
of your :doc:`tokenizer <main_classes/tokenizer>` save;
- maybe an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save. - maybe an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
Other files can safely be deleted. Other files can safely be deleted.
...@@ -135,7 +134,8 @@ Then log in using the same credentials as on huggingface.co. To upload your mode ...@@ -135,7 +134,8 @@ Then log in using the same credentials as on huggingface.co. To upload your mode
This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section. This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple models and need to script that process, you can add `-y` to bypass the prompt. For example: By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple
models and need to script that process, you can add `-y` to bypass the prompt. For example:
.. code-block:: .. code-block::
...@@ -179,15 +179,15 @@ Add a model card ...@@ -179,15 +179,15 @@ Add a model card
To make sure everyone knows what your model can do, its limitations, and any potential bias or ethical To make sure everyone knows what your model can do, its limitations, and any potential bias or ethical
considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
placed in a subfolder with your username or organization, then another subfolder named like your model placed in a subfolder with your username or organization, then another subfolder named like your model
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page; it will (`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page; it will get
get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a model
model card template (meta-suggestions are welcome). card template (meta-suggestions are welcome).
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do), If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built. don't forget to link to its model card so that people can fully trace how your model was built.
If you have never made a pull request to the 🤗 Transformers repo, look at the If you have never made a pull request to the 🤗 Transformers repo, look at the :doc:`contributing guide <contributing>`
:doc:`contributing guide <contributing>` to see the steps to follow. to see the steps to follow.
.. Note:: .. Note::
......
Summary of the models Summary of the models
======================================================================================================================= =======================================================================================================================
This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original `transformer
`transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
models. You can check them more in detail in their respective documentation. Also check out the models. You can check them more in detail in their respective documentation. Also check out the :doc:`pretrained model
:doc:`pretrained model page </pretrained_models>` to see the checkpoints available for each type of model and all `the page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
community models <https://huggingface.co/models>`_. <https://huggingface.co/models>`_.
Each one of the models in the library falls into one of the following categories: Each one of the models in the library falls into one of the following categories:
...@@ -19,8 +19,8 @@ Each one of the models in the library falls into one of the following categories ...@@ -19,8 +19,8 @@ Each one of the models in the library falls into one of the following categories
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the text, and not what's after. Although those sentence so that the attention heads can only see what was before in the text, and not what's after. Although those
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
A typical example of such models is GPT. typical example of such models is GPT.
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
...@@ -30,8 +30,8 @@ sentence classification or token classification. A typical example of such model ...@@ -30,8 +30,8 @@ sentence classification or token classification. A typical example of such model
Note that the only difference between autoregressive models and autoencoding models is in the way the model is Note that the only difference between autoregressive models and autoencoding models is in the way the model is
pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first model has been used for both types of pretraining, we have put it in the category corresponding to the article where it
introduced. was first introduced.
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
...@@ -60,8 +60,8 @@ Original GPT ...@@ -60,8 +60,8 @@ Original GPT
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a> </a>
`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_, `Improving Language Understanding by Generative Pre-Training
Alec Radford et al. <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_, Alec Radford et al.
The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset. The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
...@@ -80,7 +80,8 @@ GPT-2 ...@@ -80,7 +80,8 @@ GPT-2
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a> </a>
`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_, `Language Models are Unsupervised Multitask Learners
<https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
Alec Radford et al. Alec Radford et al.
A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
...@@ -122,8 +123,8 @@ Transformer-XL ...@@ -122,8 +123,8 @@ Transformer-XL
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a> </a>
`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_, `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_, Zihang
Zihang Dai et al. Dai et al.
Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
...@@ -153,8 +154,7 @@ Reformer ...@@ -153,8 +154,7 @@ Reformer
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a> </a>
`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_, `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_, Nikita Kitaev et al.
Nikita Kitaev et al.
An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
include: include:
...@@ -188,8 +188,8 @@ XLNet ...@@ -188,8 +188,8 @@ XLNet
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a> </a>
`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_, `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_, Zhilin
Zhilin Yang et al. Yang et al.
XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
...@@ -207,7 +207,8 @@ Autoencoding models ...@@ -207,7 +207,8 @@ Autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions. look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
corrupted versions.
BERT BERT
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
...@@ -260,8 +261,8 @@ Same as BERT but with a few tweaks: ...@@ -260,8 +261,8 @@ Same as BERT but with a few tweaks:
sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
being the vocab size). If E < H, it has fewer parameters. being the vocab size). If E < H, it has fewer parameters.
* Layers are split in groups that share parameters (to save memory). * Layers are split in groups that share parameters (to save memory).
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B * Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and
(that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
been swapped or not. been swapped or not.
The library provides a version of the model for masked language modeling, token classification, sentence The library provides a version of the model for masked language modeling, token classification, sentence
...@@ -279,8 +280,7 @@ RoBERTa ...@@ -279,8 +280,7 @@ RoBERTa
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a> </a>
`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_, `RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_, Yinhan Liu et al.
Yinhan Liu et al.
Same as BERT with better pretraining tricks: Same as BERT with better pretraining tricks:
...@@ -339,8 +339,8 @@ library provides checkpoints for all of them: ...@@ -339,8 +339,8 @@ library provides checkpoints for all of them:
previous section as well). One of the languages is selected for each training sample, and the model input is a previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens, that may span over several documents in one of those languages. sentence of 256 tokens, that may span over several documents in one of those languages.
* Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample, * Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages,
dynamic masking of the tokens. with dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two * A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
different languages, with random masking. To predict one of the masked tokens, the model can use both the different languages, with random masking. To predict one of the masked tokens, the model can use both the
surrounding context in language 1 and the context given by language 2. surrounding context in language 1 and the context given by language 2.
...@@ -523,20 +523,21 @@ Pegasus ...@@ -523,20 +523,21 @@ Pegasus
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
</a> </a>
`PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019. <https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
objective, called Gap Sentence Generation (GSG). objective, called Gap Sentence Generation (GSG).
* MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like * MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like in
in BERT) BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a
causal mask to hide the future words like a regular auto-regressive transformer decoder. causal mask to hide the future words like a regular auto-regressive transformer decoder.
In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. masked and are generated together as one output sequence from the remaining sentences, similar to an extractive
summary.
The library provides a version of this model for conditional generation, which should be used for summarization. The library provides a version of this model for conditional generation, which should be used for summarization.
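As a rough sketch of that summarization use (the ``google/pegasus-xsum`` checkpoint and the input text below are
illustrative assumptions, not a recommendation):

.. code-block::

    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

    text = "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions."
    batch = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")

    # Generate a short abstractive summary of the input document.
    summary_ids = model.generate(**batch)
    print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])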
...@@ -571,20 +572,20 @@ T5 ...@@ -571,20 +572,20 @@ T5
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a> </a>
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_, `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel et al. <https://arxiv.org/abs/1910.10683>`_, Colin Raffel et al.
Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
prefixes: summarize: , question: , translate English to German: and so forth. prefixes: summarize: , question: , translate English to German: and so forth.
The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above). tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with
replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a
is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original
is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens. sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
For instance, if we have the sentence "My dog is very cute .", and we decide to remove the tokens: "dog", "is" and For instance, if we have the sentence "My dog is very cute .", and we decide to remove the tokens: "dog", "is" and
"cute", the encoder input becomes "My <x> very <y> ." and the target input becomes "<x> dog is <y> cute .<z>" "cute", the encoder input becomes "My <x> very <y> ." and the target input becomes "<x> dog is <y> cute .<z>"
...@@ -603,13 +604,12 @@ MBart ...@@ -603,13 +604,12 @@ MBart
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a> </a>
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
The model architecture and pre-training objective are the same as BART's, but MBart is trained on 25 languages The model architecture and pre-training objective are the same as BART's, but MBart is trained on 25 languages and is intended
and is intended for supervised and unsupervised machine translation. MBart is one of the first methods for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages. sequence-to-sequence model by denoising full texts in multiple languages.
The library provides a version of this model for conditional generation. The library provides a version of this model for conditional generation.
...@@ -636,11 +636,11 @@ ProphetNet ...@@ -636,11 +636,11 @@ ProphetNet
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou. Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
each time step instead instead of just the single next token. The future n-gram prediction explicitly encourages time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
the model to plan for the future tokens and prevent overfitting on strong local correlations. to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
The model architecture is based on the original Transformer, but replaces the "standard" self-attention mechanism the original Transformer, but replaces the "standard" self-attention mechanism in the decoder by a a main
in the decoder by a a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism. self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
summarization. summarization.
...@@ -682,8 +682,8 @@ et al. ...@@ -682,8 +682,8 @@ et al.
A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images
(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the (after the pooling layer) that goes through a linear layer (to go from number of features at the end of the resnet to
resnet to the hidden state dimension of the transformer). the hidden state dimension of the transformer).
The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
model know which part of the input vector corresponds to the text and which to the image. model know which part of the input vector corresponds to the text and which to the image.
...@@ -691,8 +691,7 @@ model know which part of the input vector corresponds to the text and which to t ...@@ -691,8 +691,7 @@ model know which part of the input vector corresponds to the text and which to t
The pretrained model only works for classification. The pretrained model only works for classification.
.. ..
More information in this :doc:`model documentation </model_doc/mmbt.html>`. More information in this :doc:`model documentation </model_doc/mmbt.html>`. TODO: write this page
TODO: write this page
.. _retrieval-based-models: .. _retrieval-based-models:
...@@ -714,19 +713,22 @@ DPR ...@@ -714,19 +713,22 @@ DPR
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a> </a>
`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_, `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_, Vladimir Karpukhin et
Vladimir Karpukhin et al. al.
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering research. Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering
research.
DPR consists of three models: DPR consists of three models:
* Question encoder: encode questions as vectors * Question encoder: encode questions as vectors
* Context encoder: encode contexts as vectors * Context encoder: encode contexts as vectors
* Reader: extract the answer to the question from the retrieved contexts, along with a relevance score (high if the inferred span actually answers the question). * Reader: extract the answer to the question from the retrieved contexts, along with a relevance score (high if the
inferred span actually answers the question).
DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and then it calls the reader with the question and the retrieved documents to get the answer. DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and
then it calls the reader with the question and the retrieved documents to get the answer.
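For example, a minimal sketch of the question encoder (the ``facebook/dpr-question_encoder-single-nq-base`` checkpoint
is one of the released weights; the context encoder and reader have analogous classes):

.. code-block::

    from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

    tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

    inputs = tokenizer("What is the capital of France?", return_tensors="pt")
    # The first output is the dense vector used to search the context index.
    question_embedding = model(**inputs)[0]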
RAG RAG
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
...@@ -740,12 +742,14 @@ RAG ...@@ -740,12 +742,14 @@ RAG
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a> </a>
`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_, `Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_, Patrick Lewis,
Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau
Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models. Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq
RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and
The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks. seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation
to adapt to downstream tasks.
The two models RAG-Token and RAG-Sequence are available for generation. The two models RAG-Token and RAG-Sequence are available for generation.
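A rough sketch of RAG-Token generation (this assumes the ``facebook/rag-token-nq`` checkpoint and the small dummy
retrieval index, which may require the ``datasets`` and ``faiss`` packages to be installed):

.. code-block::

    from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
    # use_dummy_dataset avoids downloading the full Wikipedia index for this sketch.
    retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
    model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

    input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle?", return_tensors="pt")
    generated = model.generate(input_ids=input_dict["input_ids"])
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])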
...@@ -764,19 +768,19 @@ use a sparse version of the attention matrix to speed up training. ...@@ -764,19 +768,19 @@ use a sparse version of the attention matrix to speed up training.
**LSH attention** **LSH attention**
:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax :ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because the query and the key would be equal (hence very modified to mask the current token (except at the first position), because the query and the key would be equal (hence
similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
an n_rounds parameter) and then are averaged together. (determined by an n_rounds parameter) and then are averaged together.
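To make the bucketing idea concrete, here is a toy illustration of one hash round (a simplification, not the library's
actual Reformer implementation; all sizes are made up):

.. code-block::

    import torch

    torch.manual_seed(0)
    d, n_buckets = 64, 8
    qk = torch.randn(10, d)                        # 10 shared query/key vectors
    rotation = torch.randn(d, n_buckets // 2)      # one random rotation = one hash round

    # Angular LSH: project, then take the argmax over the rotated and negated projections.
    buckets = torch.argmax(torch.cat([qk @ rotation, -qk @ rotation], dim=-1), dim=-1)
    # Tokens that land in the same bucket are the only ones that attend to each other;
    # several such hash rounds are computed and averaged in practice.
    print(buckets)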
.. _local-attention: .. _local-attention:
**Local attention** **Local attention**
:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the left and :ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the
right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
last layer will have a receptive field of more than just the tokens in the window, allowing them to build a window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
representation of the whole sentence. representation of the whole sentence.
Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
...@@ -799,8 +803,9 @@ Other tricks ...@@ -799,8 +803,9 @@ Other tricks
:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding :ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` and
time step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for time
:math:`j // l1` in E2. step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and :math:`j // l1`
in E2.
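A small numerical sketch of that factorization (the sizes below are made-up values chosen so that
:math:`l_{1} \times l_{2} = l` and :math:`d_{1} + d_{2} = d`):

.. code-block::

    import torch

    l1, l2, d1, d2 = 32, 16, 64, 192       # so l = 512 positions and d = 256 dimensions
    E1 = torch.randn(l1, d1)
    E2 = torch.randn(l2, d2)

    j = 137                                 # some time step in the full sequence
    # The embedding for position j concatenates row j % l1 of E1 and row j // l1 of E2.
    e_j = torch.cat([E1[j % l1], E2[j // l1]])
    print(e_j.shape)                        # torch.Size([256])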
Multi-lingual models Multi-lingual models
======================================================================================================================= =======================================================================================================================
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
multi-lingual models are available and have different mechanisms than mono-lingual models. models are available and have different mechanisms than mono-lingual models. This page details the usage of these
This page details the usage of these models. models.
The two models that currently support multiple languages are BERT and XLM. The two models that currently support multiple languages are BERT and XLM.
...@@ -28,8 +28,8 @@ This section concerns the following checkpoints: ...@@ -28,8 +28,8 @@ This section concerns the following checkpoints:
These checkpoints require language embeddings that will specify the language used at inference time. These language These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes from
from the tokenizer. the tokenizer.
Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French): Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
...@@ -78,8 +78,9 @@ You can then feed it all as input to your model: ...@@ -78,8 +78,9 @@ You can then feed it all as input to your model:
>>> outputs = model(input_ids, langs=langs) >>> outputs = model(input_ids, langs=langs)
The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ The example `run_generation.py
can generate text using the CLM checkpoints from XLM, using the language embeddings. <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ can generate
text using the CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings XLM without Language Embeddings
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
...@@ -89,8 +90,8 @@ This section concerns the following checkpoints: ...@@ -89,8 +90,8 @@ This section concerns the following checkpoints:
- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages) - ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages) - ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
These checkpoints do not require language embeddings at inference time. These models produce generic These checkpoints do not require language embeddings at inference time. These models produce generic sentence
sentence representations, unlike the previously-mentioned XLM checkpoints. representations, unlike the previously-mentioned XLM checkpoints.
BERT BERT
...@@ -101,15 +102,15 @@ BERT has two checkpoints that can be used for multi-lingual tasks: ...@@ -101,15 +102,15 @@ BERT has two checkpoints that can be used for multi-lingual tasks:
- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages) - ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages) - ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language These checkpoints do not require language embeddings at inference time. They should identify the language used in the
used in the context and infer accordingly. context and infer accordingly.
XLM-RoBERTa XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
gains over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
sequence labeling and question answering. labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks: Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
......
Perplexity of fixed-length models Perplexity of fixed-length models
======================================================================================================================= =======================================================================================================================
Perplexity (PPL) is one of the most common metrics for evaluating language Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
models. Before diving in, we should note that the metric applies specifically that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
to classical language models (sometimes called autoregressive or causal models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
language models) and is not well defined for masked language models like BERT <model_summary>`).
(see :doc:`summary of the models <model_summary>`).
Perplexity is defined as the exponentiated average log-likelihood of a Perplexity is defined as the exponentiated average log-likelihood of a sequence. If we have a tokenized sequence
sequence. If we have a tokenized sequence :math:`X = (x_0, x_1, \dots, x_t)`, :math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,
then the perplexity of :math:`X` is,
.. math:: .. math::
\text{PPL}(X) \text{PPL}(X)
= \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\} = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}
where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
token conditioned on the preceding tokens :math:`x_{<i}` according to our :math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
model. Intuitively, it can be thought of as an evaluation of the model's predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
ability to predict uniformly among the set of specified tokens in a corpus. procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing
Importantly, this means that the tokenization procedure has a direct impact different models.
on a model's perplexity which should always be taken into consideration when
comparing different models.
This is also equivalent to the exponentiation of the cross-entropy between This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
the data and model predictions. For more intuition about perplexity and its intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
relationship to Bits Per Character (BPC) and data compression, check out this `fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
`fantastic blog post on The Gradient
<https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
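In code, this equivalence means the perplexity of a sequence can be read off the cross-entropy loss that causal
language models in the library return; a minimal sketch with GPT-2 (the checkpoint and sentence are arbitrary):

.. code-block::

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    input_ids = tokenizer("Perplexity is the exponentiated cross-entropy.", return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the first output is the average negative log-likelihood.
        neg_log_likelihood = model(input_ids, labels=input_ids)[0]
    ppl = torch.exp(neg_log_likelihood)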
Calculating PPL with fixed-length models Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If we weren't limited by a model's context size, we would evaluate the If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
model's perplexity by autoregressively factorizing a sequence and factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.
conditioning on the entire preceding subsequence at each step, as shown
below.
.. image:: imgs/ppl_full.gif
   :width: 600
   :alt: Full decomposition of a sequence with unlimited context length
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.
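If you are unsure of a checkpoint's maximum input size, one way to check is to read it from the model configuration
(shown here for the base ``gpt2`` checkpoint, purely as an illustration):

.. code-block:: python

    from transformers import GPT2Config

    # n_positions is the number of positional embeddings, i.e. the maximum context length
    config = GPT2Config.from_pretrained("gpt2")
    print(config.n_positions)  # 1024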
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.
.. image:: imgs/ppl_chunked.gif
   :width: 600
   :alt: Suboptimal PPL not taking advantage of full available context
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
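A minimal sketch of this chunked evaluation, assuming a causal language model ``model``, a tokenized corpus
``encodings`` and a ``device`` like the ones set up in the GPT-2 example below, could look like this:

.. code-block:: python

    import torch

    max_length = model.config.n_positions  # e.g. 1024 for GPT-2

    nlls, n_tokens = [], 0
    for begin_loc in range(0, encodings.input_ids.size(1), max_length):
        end_loc = min(begin_loc + max_length, encodings.input_ids.size(1))
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)

        with torch.no_grad():
            # the returned loss is the average negative log-likelihood over this chunk,
            # so rescale it back to a (roughly) summed negative log-likelihood
            neg_log_likelihood = model(input_ids, labels=input_ids)[0] * (end_loc - begin_loc)

        nlls.append(neg_log_likelihood)
        n_tokens += end_loc - begin_loc

    ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)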
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.
.. image:: imgs/ppl_sliding.gif
   :width: 600
   :alt: Sliding window PPL taking advantage of all available context
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
Example: Calculating perplexity with GPT-2 in 🤗 Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's demonstrate this process with GPT-2.

.. code-block:: python

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # model_id (e.g. 'gpt2-large') and device (e.g. 'cuda') are set earlier in the full example
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.
.. code-block:: python

    from datasets import load_dataset

    test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).
.. code-block:: python

    # lls and end_loc are accumulated by the strided evaluation loop sketched below
    ppl = torch.exp(torch.stack(lls).sum() / end_loc)
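The ``lls`` list and ``end_loc`` value above are produced by a strided evaluation loop over the corpus. A minimal
sketch of such a loop, assuming the ``model``, ``encodings`` and ``device`` objects from the snippets above are in
scope (variable names and bookkeeping are illustrative), could look like this:

.. code-block:: python

    import torch

    max_length = model.config.n_positions  # 1024 for GPT-2
    stride = 512

    lls = []
    for i in range(0, encodings.input_ids.size(1), stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, encodings.input_ids.size(1))
        trg_len = end_loc - i  # may be shorter than stride on the last window

        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # only score the last trg_len tokens of the window

        with torch.no_grad():
            # the loss is the average negative log-likelihood over the scored tokens;
            # rescale it to a (roughly) summed negative log-likelihood for this window
            outputs = model(input_ids, labels=target_ids)
            lls.append(outputs[0] * trg_len)

    ppl = torch.exp(torch.stack(lls).sum() / end_loc)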
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our strided sliding-window
strategy, this jumps down to ``16.53``. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.
The library was designed with two strong goals in mind:

- Be as easy and fast to use as possible:
  - We strongly limited the number of user-facing abstractions to learn; in fact, there are almost no abstractions,
    just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`,
    :doc:`models <main_classes/model>` and :doc:`tokenizer <main_classes/tokenizer>`.
  - All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
    :obj:`from_pretrained()` instantiation method, which will take care of downloading (if needed), caching and
    loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
    and models' weights) from a pretrained checkpoint provided on `Hugging Face Hub
    <https://huggingface.co/models>`__ or your own saved checkpoint.
  - On top of those three base classes, the library provides two APIs: :func:`~transformers.pipeline` for quickly
    using a model (plus its associated tokenizer and configuration) on a given task and
    :func:`~transformers.Trainer`/:func:`~transformers.TFTrainer` to quickly train or fine-tune a given model.
  - As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to
    extend/build-upon the library, just use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the base
Main concepts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The library is built around three types of classes for each model:
- **Model classes** such as :class:`~transformers.BertModel`, which are 30+ PyTorch models (`torch.nn.Module
  <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models (`tf.keras.Model
  <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained weights provided in the
  library.
- **Configuration classes** such as :class:`~transformers.BertConfig`, which store all the parameters required to build
  a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
  without any modification, creating the model will automatically take care of instantiating the configuration (which
All these classes can be instantiated from pretrained instances and saved locally using two methods:
- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`) or
  stored locally (or on a server) by the user,
- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
  :obj:`from_pretrained()` (a short example is sketched below).
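As a minimal illustration of this round trip (the checkpoint name and save directory below are only examples):

.. code-block:: python

    from transformers import BertModel, BertTokenizer

    # Download (or load from the local cache) a pretrained model and its tokenizer
    model = BertModel.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Save both to a local directory...
    model.save_pretrained("./my-bert")
    tokenizer.save_pretrained("./my-bert")

    # ...and reload them later with the same from_pretrained() method
    model = BertModel.from_pretrained("./my-bert")
    tokenizer = BertTokenizer.from_pretrained("./my-bert")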