Unverified Commit 08f534d2 authored by Sylvain Gugger, committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
......@@ -12,13 +12,14 @@ Implementation Notes
- Each model is about 298 MB on disk; there are more than 1,000 models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by `Jörg Tiedemann
<https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
<https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
in a model card.
- The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
- static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
- a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
- no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
......@@ -31,8 +32,8 @@ Naming
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
- The language codes used to name models are inconsistent. Two-digit codes can usually be found `here
  <https://developers.google.com/admin-sdk/directory/v1/languages>`__; three-digit codes require googling "language
  code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
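As a quick illustration of this naming scheme, here is a minimal translation sketch (the English-to-German pair
:obj:`Helsinki-NLP/opus-mt-en-de` is just one example checkpoint):

.. code-block::

    from transformers import MarianMTModel, MarianTokenizer

    # Build a checkpoint name following the Helsinki-NLP/opus-mt-{src}-{tgt} convention.
    model_name = "Helsinki-NLP/opus-mt-en-de"  # English -> German
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = tokenizer(["Hello, how are you?"], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))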
......@@ -112,6 +113,7 @@ Code to see available pretrained models:
MarianConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianConfig
:members:
......
......@@ -7,9 +7,10 @@ MBart
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
......@@ -21,10 +22,11 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
multilingual it expects the sequences in a different format. A special language id token is added in both the source
and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The target
text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode
the sequences for sequence-to-sequence fine-tuning.
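For illustration, a minimal fine-tuning sketch (the checkpoint name and the sentence pair are placeholders, and the
exact :meth:`prepare_seq2seq_batch` signature may differ slightly between versions):

.. code-block::

    from transformers import MBartForConditionalGeneration, MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

    src_text = "UN Chief Says There Is No Military Solution in Syria"
    tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

    # The tokenizer appends the language codes, so the model sees
    # "X [eos, src_lang_code]" as input and "[tgt_lang_code] X [eos]" as labels.
    batch = tokenizer.prepare_seq2seq_batch(
        src_texts=[src_text], src_lang="en_XX", tgt_texts=[tgt_text], tgt_lang="ro_RO", return_tensors="pt"
    )
    outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])  # forward pass for fine-tuning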
......@@ -44,8 +46,8 @@ the sequences for sequence-to-sequence fine-tuning.
- Generation
While generating the target text set the :obj:`decoder_start_token_id` to the target language id. The following
example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.
.. code-block::
......
......@@ -14,23 +14,23 @@ The abstract from the paper is the following:
*Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds
of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot
be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating
the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to
various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while
equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks.
To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE
model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is
4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the
natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms
latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of
90.0/79.2 (1.5/2.1 higher than BERT_BASE).*
Tips:
- MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
than the left.
- MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
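As a short illustration of the MLM objective, a sketch assuming the :obj:`google/mobilebert-uncased` checkpoint:

.. code-block::

    from transformers import pipeline

    # MobileBERT was pretrained with masked language modeling, so filling in a
    # masked token is a natural use case.
    unmasker = pipeline("fill-mask", model="google/mobilebert-uncased")
    print(unmasker("The capital of France is [MASK]."))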
The original code can be found `here <https://github.com/google-research/mobilebert>`__.
......
......@@ -9,9 +9,8 @@ and assign @sshleifer.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
According to the abstract,
......@@ -44,6 +43,7 @@ Implementation Notes
- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from :class:`~transformers.BartForConditionalGeneration`
- Some key configuration differences:
- static, sinusoidal position embeddings
- no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
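A minimal summarization sketch (the :obj:`google/pegasus-xsum` checkpoint and the input sentence are just examples):

.. code-block::

    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    model_name = "google/pegasus-xsum"
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)

    src_text = "PG&E stated it scheduled the blackouts in response to forecasts for high winds."
    batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt")
    # Generation starts from pad_token_id, as noted in the implementation notes above.
    summary_ids = model.generate(**batch)
    print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))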
......@@ -84,6 +84,7 @@ PegasusConfig
PegasusTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
warning: ``add_tokens`` does not work at the moment.
.. autoclass:: transformers.PegasusTokenizer
......
......@@ -8,13 +8,24 @@ ProphetNet
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.
ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just
the next token.
The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
......
RAG
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate
......@@ -15,46 +15,40 @@ Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäs
The abstract from the paper is the following:
*Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve
state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely
manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind
task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge
remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric
memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a
general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained
parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a
pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a
pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages
across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our
models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks,
outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation
tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art
parametric-only seq2seq baseline.*
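As an illustration of the retrieve-then-generate flow described in the overview, here is a minimal sketch (the
:obj:`facebook/rag-sequence-nq` checkpoint and the dummy index are used only to keep the example small):

.. code-block::

    from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
    # use_dummy_dataset avoids downloading the full Wikipedia index for this sketch.
    retriever = RagRetriever.from_pretrained("facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True)
    model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

    input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
    generated = model.generate(input_ids=input_dict["input_ids"])
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))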
RagConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagConfig
:members:
RagTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagTokenizer
:members: prepare_seq2seq_batch
Rag specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_rag.RetrievAugLMMarginOutput
:members:
......@@ -63,28 +57,28 @@ Rag specific outputs
:members:
RagRetriever
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagRetriever
:members:
RagModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagModel
:members: forward
RagSequenceForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagSequenceForGeneration
:members: forward, generate
RagTokenForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagTokenForGeneration
:members: forward, generate
......@@ -67,14 +67,15 @@ factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the :obj
Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}`
can drastically reduce the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.
In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` whose sum has to be
equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` whose
product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the `sequence
length` of the :obj:`input_ids`.
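For instance, with made-up numbers, the two constraints can be written down explicitly
(:obj:`max_position_embeddings` is used here as the padded training sequence length; this is a sketch, not a
recommended configuration):

.. code-block::

    from transformers import ReformerConfig

    config = ReformerConfig(
        hidden_size=256,
        axial_pos_embds_dim=(64, 192),   # 64 + 192 == hidden_size
        axial_pos_shape=(128, 512),      # 128 * 512 == 65536 == padded sequence length during training
        max_position_embeddings=65536,
    )
    assert sum(config.axial_pos_embds_dim) == config.hidden_size
    assert config.axial_pos_shape[0] * config.axial_pos_shape[1] == config.max_position_embeddings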
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
`Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key
......@@ -92,11 +93,11 @@ neighboring chunks and :obj:`config.lsh_num_chunks_after` following neighboring
For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post
<https://www.pragmatic.ml/reformer-deep-dive/>`__.
Note that :obj:`config.num_buckets` can also be factorized into a list :math:`(n_{\text{buckets}}^1,
n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots,
n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots,
1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to
save memory.
When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the
sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be
......
......@@ -8,8 +8,8 @@ The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretrainin
<https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with
much larger mini-batches and learning rates.
The abstract from the paper is the following:
......@@ -17,15 +17,15 @@ The abstract from the paper is the following:
approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every
model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results
highlight the importance of previously overlooked design choices, and raise questions about the source of recently
reported improvements. We release our models and code.*
Tips:
- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a setup
for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
different pretraining scheme.
- RoBERTa doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just
......
......@@ -4,38 +4,34 @@ SqueezeBERT
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The SqueezeBERT model was proposed in `SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It's a
bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the
SqueezeBERT architecture is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
instead of fully-connected layers for the Q, K, V and FFN layers.
The abstract from the paper is the following:
*Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets,
large computing systems, and better neural network models, natural language processing (NLP) technology has made
significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant
opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we
consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's
highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with
BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods
such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these
techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called
SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test
set. The SqueezeBERT code will be released.*
Tips:
- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
`squeezebert/squeezebert-mnli-headless` checkpoint.
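For example, a minimal sketch of loading that checkpoint for fine-tuning (the number of labels depends on your task):

.. code-block::

    from transformers import SqueezeBertForSequenceClassification, SqueezeBertTokenizer

    tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
    # The "headless" checkpoint ships without a classification head, so a fresh
    # one is initialized here and should then be fine-tuned on your dataset.
    model = SqueezeBertForSequenceClassification.from_pretrained(
        "squeezebert/squeezebert-mnli-headless", num_labels=2
    )
    inputs = tokenizer("This movie was great!", return_tensors="pt")
    outputs = model(**inputs)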
......
......@@ -31,11 +31,10 @@ Tips:
for summarization: *summarize: ...*.
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper
<https://arxiv.org/pdf/1910.10683.pdf>`__.
- For sequence-to-sequence generation, it is recommended to use :obj:`T5ForConditionalGeneration.generate()`. This
  method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively
  generates the decoder output (see the short sketch after this list).
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
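A minimal translation sketch combining the prefix convention and :obj:`generate()` (the :obj:`t5-small` checkpoint is
just an example):

.. code-block::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))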
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.
......@@ -51,9 +50,9 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
- Unsupervised denoising training
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`,
:obj:`<extra_id_1>`, ... up to :obj:`<extra_id_99>`. As a default, 100 sentinel tokens are available in
:class:`~transformers.T5Tokenizer`.
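For instance, a sketch of such a training pair (the sentences are arbitrary, and accessing the loss as an attribute
assumes :obj:`return_dict=True`):

.. code-block::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # "cute dog" and "the" are masked out of the input; each masked span reappears
    # in the target, introduced by its sentinel token.
    input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
    labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
    loss = model(input_ids=input_ids, labels=labels, return_dict=True).loss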
......@@ -69,8 +68,8 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
- Supervised training
In this setup the input sequence and output sequence are standard sequence-to-sequence input output mapping. In
translation, for instance with the input sequence "The house is wonderful." and output sequence "Das Haus ist
wunderbar.", the sentences should be processed as follows:
.. code-block::
......
......@@ -14,19 +14,19 @@ The abstract from the paper is the following:
*Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the
setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency
beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a
novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the
context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450%
longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+
times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of
bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn
Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably
coherent, novel text articles with thousands of tokens.*
Tips:
- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The
original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.
......
......@@ -14,21 +14,21 @@ Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the
The abstract from the paper is the following:
*Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We
propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain
state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our
approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we
obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised
machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the
previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*
Tips:
- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the :doc:`multi-lingual
<../multilingual>` page for more information.
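For instance, a sketch with one of the CLM checkpoints that uses language embeddings (see the multi-lingual page for
the full list of such checkpoints):

.. code-block::

    import torch
    from transformers import XLMTokenizer, XLMWithLMHeadModel

    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
    model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

    input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])
    # Build a langs tensor telling the model that every token is English.
    language_id = tokenizer.lang2id["en"]
    langs = torch.full_like(input_ids, language_id)
    outputs = model(input_ids, langs=langs)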
The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.
......
......@@ -9,13 +9,25 @@ XLM-ProphetNet
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.
XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of
just the next token. Its architecture is identical to ProphetNet, but the model was trained on the multi-lingual
"wiki100" Wikipedia dump.
The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
......
......@@ -12,25 +12,25 @@ data.
The abstract from the paper is the following:
*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a
wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly
outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on
XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We
also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the
trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource
languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing
per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We
will make XLM-R code, data, and models publicly available.*
Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct
language from the input ids.
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
as well as the information relative to the inputs and outputs.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.
......
......@@ -16,11 +16,11 @@ The abstract from the paper is the following:
better performance than pretraining approaches based on autoregressive language modeling. However, relying on
corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a
pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive
pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all
permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into
pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large
margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
Tips:
......
......@@ -15,8 +15,8 @@ Prepare your model for uploading
We have seen in the :doc:`training tutorial <training>`: how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the
:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on the
`model hub <https://huggingface.co/models>`__.
Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
......@@ -60,22 +60,20 @@ Make your model work on all frameworks
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's
super easy to do (and in a future version, it will all be automatic). You will need to install both PyTorch and
TensorFlow for this step, but you don't need to worry about the GPU, so it should be very easy. Check the `TensorFlow
installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ and/or the `PyTorch
installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
First check that your model class exists in the other framework, that is try to import the same model by either adding
or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to type
.. code-block::
from transformers import TFDistilBertForSequenceClassification
and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to type
.. code-block::
......@@ -112,7 +110,8 @@ Make sure there are no garbage files in the directory you'll upload. It should o
- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason) ;
- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part
of your :doc:`tokenizer <main_classes/tokenizer>` save;
- maybe an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
Other files can safely be deleted.
......@@ -135,7 +134,8 @@ Then log in using the same credentials as on huggingface.co. To upload your mode
This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple
models and need to script that process, you can add `-y` to bypass the prompt. For example:
.. code-block::
......@@ -179,15 +179,15 @@ Add a model card
To make sure everyone knows what your model can do, and what its limitations, potential bias, or ethical
considerations are, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
placed in a subfolder with your username or organization, then another subfolder named like your model
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will get
you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a model
card template (meta-suggestions are welcome).
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built.
If you have never made a pull request to the 🤗 Transformers repo, look at the :doc:`contributing guide <contributing>`
to see the steps to follow.
.. Note::
......
Summary of the models
=======================================================================================================================
This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original `transformer
model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
models. You can check them more in detail in their respective documentation. Also check out the :doc:`pretrained model
page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
<https://huggingface.co/models>`_.
Each one of the models in the library falls into one of the following categories:
......@@ -19,8 +19,8 @@ Each one of the models in the library falls into one of the following categories
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the text, and not what's after. Although those
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
typical example of such models is GPT.
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
......@@ -30,8 +30,8 @@ sentence classification or token classification. A typical example of such model
Note that the only difference between autoregressive models and autoencoding models is in the way the model is
pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
model has been used for both types of pretraining, we have put it in the category corresponding to the article where it
was first introduced.
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
......@@ -60,8 +60,8 @@ Original GPT
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a>
`Improving Language Understanding by Generative Pre-Training
<https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_, Alec Radford et al.
The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
......@@ -80,7 +80,8 @@ GPT-2
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a>
`Language Models are Unsupervised Multitask Learners
<https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
Alec Radford et al.
A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
......@@ -122,8 +123,8 @@ Transformer-XL
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a>
`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_, Zihang
Dai et al.
Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
......@@ -153,8 +154,7 @@ Reformer
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a>
`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_, Nikita Kitaev et al.
An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
include:
......@@ -188,8 +188,8 @@ XLNet
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a>
`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_, Zhilin
Yang et al.
XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
......@@ -207,7 +207,8 @@ Autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
corrupted versions.
BERT
-----------------------------------------------------------------------------------------------------------------------
......@@ -260,8 +261,8 @@ Same as BERT but with a few tweaks:
sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
being the vocab size). If E < H, it has fewer parameters (see the quick parameter count after this list).
* Layers are split in groups that share parameters (to save memory).
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and
B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
been swapped or not.
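As a rough illustration of the savings from this embedding factorization, here is a quick parameter count; the numbers
below are illustrative (a BERT-base-sized vocabulary and hidden size), not taken from an actual ALBERT configuration:

.. code-block:: python

    V, H, E = 30000, 768, 128       # vocab size, hidden size, embedding size (illustrative values)

    tied = V * H                    # single V x H embedding matrix, BERT-style: 23,040,000 parameters
    factorized = V * E + E * H      # V x E embeddings plus an E x H projection: 3,938,304 parameters
    print(tied, factorized)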
The library provides a version of the model for masked language modeling, token classification, sentence
......@@ -279,8 +280,7 @@ RoBERTa
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a>
`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_, Yinhan Liu et al.
Same as BERT with better pretraining tricks:
......@@ -339,8 +339,8 @@ library provides checkpoints for all of them:
previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens, that may span over several documents in one of those languages.
* Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages,
with dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
different languages, with random masking. To predict one of the masked tokens, the model can use both the
surrounding context in language 1 and the context given by language 2.
......@@ -530,13 +530,14 @@ Sequence-to-sequence model with the same encoder-decoder model architecture as B
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
objective, called Gap Sentence Generation (GSG).
* MLM: encoder input tokens are randomly replaced by mask tokens and have to be predicted by the encoder (like in
  BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal
  mask to hide the future words like a regular auto-regressive transformer decoder.
In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
masked and are generated together as one output sequence from the remaining sentences, similar to an extractive
summary.
The library provides a version of this model for conditional generation, which should be used for summarization.
......@@ -571,20 +572,20 @@ T5
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a>
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
<https://arxiv.org/abs/1910.10683>`_, Colin Raffel et al.
Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
prefixes: summarize: , question: , translate English to German: and so forth.
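For instance, a minimal sketch of selecting the task through the text prefix (the ``t5-small`` checkpoint is used here
as an assumed example, and the exact generated translation may vary):

.. code-block:: python

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The task is picked purely through the prefix of the input text.
    input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))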
The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with
individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a
single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original
sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
For instance, if we have the sentence My dog is very cute ., and we decide to remove the tokens: "dog", "is" and
"cute", the encoder input becomes My <x> very <y> . and the target input becomes <x> dog is <y> cute .<z>
......@@ -603,13 +604,12 @@ MBart
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a>
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
The model architecture and pre-training objective is the same as BART, but MBart is trained on 25 languages and is
intended for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a
complete sequence-to-sequence model by denoising full texts in multiple languages.
The library provides a version of this model for conditional generation.
......@@ -636,11 +636,11 @@ ProphetNet
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
time step instead of just the single next token. The future n-gram prediction explicitly encourages the model to plan
for the future tokens and prevents overfitting on strong local correlations. The model architecture is based on the
original Transformer, but replaces the "standard" self-attention mechanism in the decoder with a main self-attention
mechanism and a self and n-stream (predict) self-attention mechanism.
The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
summarization.
......@@ -682,8 +682,8 @@ et al.
A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images
(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the resnet to
the hidden state dimension of the transformer).
The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
model know which part of the input vector corresponds to the text and which to the image.
......@@ -691,8 +691,7 @@ model know which part of the input vector corresponds to the text and which to t
The pretrained model only works for classification.
..
More information in this :doc:`model documentation </model_doc/mmbt.html>`. TODO: write this page
.. _retrieval-based-models:
......@@ -714,19 +713,22 @@ DPR
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a>
`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_, Vladimir Karpukhin et
al.
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering
research.
DPR consists of three models:
* Question encoder: encode questions as vectors
* Context encoder: encode contexts as vectors
* Reader: extract the answer to the question from the retrieved contexts, along with a relevance score (high if the
  inferred span actually answers the question).
DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and
then it calls the reader with the question and the retrieved documents to get the answer.
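A minimal sketch of scoring a passage against a question with the two encoders (the checkpoint names and the
``pooler_output`` attribute are assumptions based on the DPR classes shipped with the library; the reader is omitted):

.. code-block:: python

    from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                              DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

    q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

    # Encode a question and a candidate passage, then score them with a dot product.
    q_emb = q_encoder(**q_tokenizer("Who wrote Hamlet?", return_tensors="pt")).pooler_output
    p_emb = ctx_encoder(**ctx_tokenizer("Hamlet is a tragedy written by William Shakespeare.", return_tensors="pt")).pooler_output
    print((q_emb @ p_emb.T).item())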
RAG
-----------------------------------------------------------------------------------------------------------------------
......@@ -740,12 +742,14 @@ RAG
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a>
`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_, Patrick Lewis,
Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau
Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq
models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and
seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation
to adapt to downstream tasks.
The two models RAG-Token and RAG-Sequence are available for generation.
......@@ -766,17 +770,17 @@ use a sparse version of the attention matrix to speed up training.
:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because the query and key of a given token are equal
(and thus very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
(determined by an n_rounds parameter) and their results are then averaged together.
.. _local-attention:
**Local attention**
:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the
left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
window, the last layer will have a receptive field of more than just the tokens in the window, allowing the model to
build a representation of the whole sentence.
Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
......@@ -799,8 +803,9 @@ Other tricks
:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
that, axial positional encodings consist of factorizing that big matrix E into two smaller matrices E1 and E2, with
dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` and
:math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for time
step :math:`j` in E is obtained by concatenating the embeddings for time step :math:`j \% l_{1}` in E1 and
:math:`j // l_{1}` in E2.
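A small numeric sketch of this factorization (the sizes below are chosen for illustration and are not taken from a real
Reformer configuration):

.. code-block:: python

    import torch

    l, d = 64 * 1024, 1024   # sequence length and hidden size
    l1, l2 = 1024, 64        # l1 * l2 == l
    d1, d2 = 256, 768        # d1 + d2 == d

    E1 = torch.randn(l1, d1)
    E2 = torch.randn(l2, d2)

    # Storing E1 and E2 takes l1*d1 + l2*d2 = 311,296 values instead of l*d = 67,108,864 for the full E.
    def axial_position_embedding(j):
        # Embedding for time step j: concatenate the rows at j % l1 in E1 and j // l1 in E2.
        return torch.cat([E1[j % l1], E2[j // l1]])  # shape (d,)

    print(axial_position_embedding(5000).shape)  # torch.Size([1024])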
Multi-lingual models
=======================================================================================================================
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
models are available and have different mechanisms from mono-lingual models. This page details the usage of these
models.
The two models that currently support multiple languages are BERT and XLM.
......@@ -28,8 +28,8 @@ This section concerns the following checkpoints:
These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes from
the tokenizer.
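For instance, a quick sketch of inspecting those attributes (assuming the ``xlm-clm-enfr-1024`` tokenizer used in the
example below):

.. code-block:: python

    from transformers import XLMTokenizer

    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
    print(tokenizer.lang2id)  # e.g. {'en': 0, 'fr': 1}
    print(tokenizer.id2lang)  # e.g. {0: 'en', 1: 'fr'}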
Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
......@@ -78,8 +78,9 @@ You can then feed it all as input to your model:
>>> outputs = model(input_ids, langs=langs)
The example `run_generation.py
<https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ can generate
text using the CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings
-----------------------------------------------------------------------------------------------------------------------
......@@ -89,8 +90,8 @@ This section concerns the following checkpoints:
- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
These checkpoints do not require language embeddings at inference time. These models are used to produce generic
sentence representations, unlike the previously-mentioned XLM checkpoints.
BERT
......@@ -101,15 +102,15 @@ BERT has two checkpoints that can be used for multi-lingual tasks:
- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language used in the
context and infer accordingly.
XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
......
Perplexity of fixed-length models
=======================================================================================================================
Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
<model_summary>`).
Perplexity is defined as the exponentiated average log-likelihood of a sequence. If we have a tokenized sequence
:math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,
.. math::

    \text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}
where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
:math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing
different models.
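As a toy numeric sketch of this definition (the per-token log-likelihoods below are made up):

.. code-block:: python

    import torch

    # log p(x_i | x_<i) for each token of a short tokenized sequence (made-up values)
    log_probs = torch.tensor([-2.1, -0.5, -3.2, -1.0])

    ppl = torch.exp(-log_probs.mean())  # exponentiated average negative log-likelihood
    print(ppl)  # tensor(5.4739)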
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
`fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.
.. image:: imgs/ppl_full.gif
:width: 600
:alt: Full decomposition of a sequence with unlimited context length
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.
.. image:: imgs/ppl_chunked.gif
:width: 600
:alt: Suboptimal PPL not taking advantage of full available context
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.
.. image:: imgs/ppl_sliding.gif
:width: 600
:alt: Sliding window PPL taking advantage of all available context
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
Example: Calculating perplexity with GPT-2 in 🤗 Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
......@@ -95,10 +78,9 @@ Let's demonstrate this process with GPT-2.
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.
.. code-block:: python
......@@ -106,16 +88,13 @@ entire dataset in memory.
test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).
.. code-block:: python
......@@ -139,14 +118,11 @@ are 512 preceding tokens available to condition on).
ppl = torch.exp(torch.stack(lls).sum() / end_loc)
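A minimal sketch of the strided evaluation loop described above (it reuses ``model``, ``encodings`` and ``device`` from
the earlier snippets and assumes ``tqdm`` is installed; it is a sketch, not necessarily identical to the full example):

.. code-block:: python

    import torch
    from tqdm import tqdm

    max_length = model.config.n_positions  # 1024 for GPT-2
    stride = 512

    lls = []
    for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, encodings.input_ids.size(1))
        trg_len = end_loc - i  # may be shorter than stride on the last window
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # only score the last trg_len tokens

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            log_likelihood = outputs[0] * trg_len  # the loss is an average, rescale it to a sum

        lls.append(log_likelihood)

    ppl = torch.exp(torch.stack(lls).sum() / end_loc)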
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our striding window
strategy, this jumps down to ``16.53``. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.
......@@ -17,8 +17,8 @@ The library was designed with two strong goals in mind:
- All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
:obj:`from_pretrained()` instantiation method which will take care of downloading (if needed), caching and
loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
and models' weights) from a pretrained checkpoint provided on `Hugging Face Hub
<https://huggingface.co/models>`__ or your own saved checkpoint.
- On top of those three base classes, the library provides two APIs: :func:`~transformers.pipeline` for quickly
using a model (plus its associated tokenizer and configuration) on a given task and
:func:`~transformers.Trainer`/:func:`~transformers.TFTrainer` to quickly train or fine-tune a given model.
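For instance, a minimal sketch of the :func:`~transformers.pipeline` API (the printed label and score are
illustrative):

.. code-block:: python

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")  # downloads a default model and tokenizer for the task
    print(classifier("Transformers documentation is getting easier to read."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]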
......@@ -52,10 +52,10 @@ Main concepts
The library is built around three types of classes for each model:
- **Model classes** such as :class:`~transformers.BertModel`, which are 30+ PyTorch models (`torch.nn.Module
<https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models (`tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained weights provided in the
library.
- **Configuration classes** such as :class:`~transformers.BertConfig`, which store all the parameters required to build
a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
without any modification, creating the model will automatically take care of instantiating the configuration (which
......@@ -66,8 +66,8 @@ The library is built around three types of classes for each model:
All these classes can be instantiated from pretrained instances and saved locally using two methods:
- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`) or
stored locally (or on a server) by the user,
- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
:obj:`from_pretrained()`.
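A short sketch of that round trip (the local directory name is arbitrary):

.. code-block:: python

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # Save locally, then reload the same objects from the saved files.
    model.save_pretrained("./my-bert")
    tokenizer.save_pretrained("./my-bert")

    model = BertModel.from_pretrained("./my-bert")
    tokenizer = BertTokenizer.from_pretrained("./my-bert")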