Doc styling (#8067)

* Important files * Styling them all * Revert "Styling them all" This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e. * Syling them for realsies * Fix syntax error * Fix benchmark_utils * More fixes * Fix modeling auto and script * Remove new line * Fixes * More fixes * Fix more files * Style * Add FSMT * More fixes * More fixes * More fixes * More fixes * Fixes * More fixes * More fixes * Last fixes * Make sphinx happy

Doc styling (#8067)
* Important files * Styling them all * Revert "Styling them all" This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e. * Syling them for realsies * Fix syntax error * Fix benchmark_utils * More fixes * Fix modeling auto and script * Remove new line * Fixes * More fixes * Fix more files * Style * Add FSMT * More fixes * More fixes * More fixes * More fixes * Fixes * More fixes * More fixes * Last fixes * Make sphinx happy
08f534d2 · Sylvain Gugger · GitHub · 04a17f85 · 08f534d2 · 08f534d2
Unverified Commit 08f534d2 authored Oct 26, 2020 by Sylvain Gugger Committed by GitHub Oct 26, 2020
20 changed files
--- a/docs/source/model_doc/marian.rst
+++ b/docs/source/model_doc/marian.rst
@@ -3,7 +3,7 @@ MarianMT

 **Bugs:** If you see something strange, file a `Github Issue
 <https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
-and assign @sshleifer. 
+and assign @sshleifer.

 Translations should be similar, but not identical to, output in the test set linked to in each model card.

@@ -12,13 +12,14 @@ Implementation Notes

 - Each model is about 298 MB on disk, there are more than 1,000 models.
 - The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by 
-  `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the
-  `Marian <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
+- Models were originally trained by `Jörg Tiedemann
+  <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
+  <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
 - All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
  in a model card.
 - The 80 opus models that require BPE preprocessing are not supported.
 - The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
+
    - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
    - a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
    - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
@@ -29,17 +30,17 @@ Implementation Notes
 Naming
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- All  model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
+- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
 - The language codes used to name models are inconsistent. Two digit codes can usually be found `here
-  <https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling
-  "language code {code}".
+  <https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
+  code {code}".
 - Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.


 Multilingual Models
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-All  model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
+All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:

    - If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
      looking at the model card, or the Group Members `mapping
@@ -112,6 +113,7 @@ Code to see available pretrained models:

 MarianConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 .. autoclass:: transformers.MarianConfig
    :members:


--- a/docs/source/model_doc/mbart.rst
+++ b/docs/source/model_doc/mbart.rst
@@ -7,9 +7,10 @@ MBart

 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
-<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
-Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
+<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan
+Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

 According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
 corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
@@ -21,12 +22,13 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma

 Training
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation task. 
-As the model is multilingual it expects the sequences in a different format. A special language id token 
-is added in both the source and target text. The source text format is :obj:`X [eos, src_lang_code]`
-where :obj:`X` is the source text. The target text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.

-The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode 
+MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation task. As the model is
+multilingual it expects the sequences in a different format. A special language id token is added in both the source
+and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The target
+text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
+
+The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode
 the sequences for sequence-to-sequence fine-tuning.

 - Supervised training
@@ -44,8 +46,8 @@ the sequences for sequence-to-sequence fine-tuning.

 - Generation

-    While generating the target text set the :obj:`decoder_start_token_id` to the target language id. 
-    The following example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.
+    While generating the target text set the :obj:`decoder_start_token_id` to the target language id. The following
+    example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.

 .. code-block::


--- a/docs/source/model_doc/mobilebert.rst
+++ b/docs/source/model_doc/mobilebert.rst
@@ -14,23 +14,23 @@ The abstract from the paper is the following:
 *Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds
 of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot
 be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating
-the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied
-to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while
-equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward
-networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated
-BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that
-MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known
-benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUEscore o 77.7
-(0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task,
-MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).*
+the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to
+various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while
+equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks.
+To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE
+model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is
+4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the
+natural language inference tasks of GLUE, MobileBERT achieves a GLUEscore o 77.7 (0.6 lower than BERT_BASE), and 62 ms
+latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of
+90.0/79.2 (1.5/2.1 higher than BERT_BASE).*

 Tips:

- MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
-  the right rather than the left.
- MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
-  It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
-  text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
+- MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
+  than the left.
+- MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
+  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
+  with a causal language modeling (CLM) objective are better in that regard.

 The original code can be found `here <https://github.com/google-research/mobilebert>`__.


--- a/docs/source/model_doc/pegasus.rst
+++ b/docs/source/model_doc/pegasus.rst
@@ -9,9 +9,8 @@ and assign @sshleifer.
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for
-Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and
-Peter J. Liu on Dec 18, 2019.
+The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
+<https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.

 According to the abstract,

@@ -26,7 +25,7 @@ The Authors' code can be found `here <https://github.com/google-research/pegasus
 Checkpoints
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides 
+All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides
 `pegasus-large`, whence the other checkpoints are fine-tuned:

 - Each checkpoint is 2.2 GB on disk and 568M parameters.
@@ -44,6 +43,7 @@ Implementation Notes
 - All models are transformer encoder-decoders with 16 layers in each component.
 - The implementation is completely inherited from :class:`~transformers.BartForConditionalGeneration`
 - Some key configuration differences:
+
    - static, sinusoidal position embeddings
    - no :obj:`layernorm_embedding` (:obj`PegasusConfig.normalize_embedding=False`)
    - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
@@ -84,6 +84,7 @@ PegasusConfig

 PegasusTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 warning: ``add_tokens`` does not work at the moment.

 .. autoclass:: transformers.PegasusTokenizer

--- a/docs/source/model_doc/prophetnet.rst
+++ b/docs/source/model_doc/prophetnet.rst
@@ -8,13 +8,24 @@ ProphetNet
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
+The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
+<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
+Zhang, Ming Zhou on 13 Jan, 2020.

-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just the next token.
+ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just
+the next token.

 The abstract from the paper is the following:

-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
+the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
+n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
+step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
+overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
+dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
+abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
+state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*

 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.


--- a/docs/source/model_doc/rag.rst
+++ b/docs/source/model_doc/rag.rst
 RAG
----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------

 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
 sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate
@@ -15,46 +15,40 @@ Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäs

 The abstract from the paper is the following:

-*Large pre-trained language models have been shown to store factual knowledge
-in their parameters, and achieve state-of-the-art results when fine-tuned on
-downstream NLP tasks. However, their ability to access and precisely manipulate
-knowledge is still limited, and hence on knowledge-intensive tasks, their
-performance lags behind task-specific architectures. Additionally, providing
-provenance for their decisions and updating their world knowledge remain open
-research problems. Pre-trained models with a differentiable access mechanism to
-explicit nonparametric memory can overcome this issue, but have so far been only
-investigated for extractive downstream tasks. We explore a general-purpose
-fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine
-pre-trained parametric and non-parametric memory for language generation. We
-introduce RAG models where the parametric memory is a pre-trained seq2seq model and
-the non-parametric memory is a dense vector index of Wikipedia, accessed with
-a pre-trained neural retriever. We compare two RAG formulations, one which
-conditions on the same retrieved passages across the whole generated sequence, the
-other can use different passages per token. We fine-tune and evaluate our models
-on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art
-on three open domain QA tasks, outperforming parametric seq2seq models and
-task-specific retrieve-and-extract architectures. For language generation tasks, we
-find that RAG models generate more specific, diverse and factual language than a
-state-of-the-art parametric-only seq2seq baseline.*
+*Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve
+state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely
+manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind
+task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge
+remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric
+memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a
+general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained
+parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a
+pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a
+pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages
+across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our
+models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks,
+outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation
+tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art
+parametric-only seq2seq baseline.*



 RagConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.RagConfig
    :members:


 RagTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.RagTokenizer
    :members: prepare_seq2seq_batch


 Rag specific outputs
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.modeling_rag.RetrievAugLMMarginOutput
    :members:
@@ -63,28 +57,28 @@ Rag specific outputs
    :members:

 RagRetriever
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.RagRetriever
    :members:


 RagModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.RagModel
    :members: forward


 RagSequenceForGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.RagSequenceForGeneration
    :members: forward, generate


 RagTokenForGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.RagTokenForGeneration
    :members: forward, generate
--- a/docs/source/model_doc/reformer.rst
+++ b/docs/source/model_doc/reformer.rst
@@ -10,7 +10,7 @@ Overview
 The Reformer model was proposed in the paper `Reformer: The Efficient Transformer
 <https://arxiv.org/abs/2001.04451.pdf>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

-The abstract from the paper is the following: 
+The abstract from the paper is the following:

 *Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can
 be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of
@@ -36,12 +36,12 @@ would result in a position encoding matrix:
 .. math::
    X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right] 

-which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices: 
+which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:

 .. math::
    X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right] 

-and 
+and

 .. math::
    X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right] 
@@ -67,22 +67,23 @@ factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the :obj
 Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}`
 can drastically reduced the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.

-In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` which sum has to
-be equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` which
-product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the
-`sequence length` of the :obj:`input_ids`.
+In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` which sum has to be
+equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` which
+product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the `sequence
+length` of the :obj:`input_ids`.


 LSH Self Attention
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
 query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
 `Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key
 query embedding vectors to one of :obj:`config.num_buckets` possible buckets. The premise is that the more "similar"
 key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to
-the same bucket. 
+the same bucket.

-The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument 
+The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument
 :obj:`num_hashes` of the forward function so that the output of the LSH self attention better approximates the output
 of the "normal" full self attention. The buckets are then sorted and chunked into query key embedding vector chunks
 each of length :obj:`config.lsh_chunk_length`. For each chunk, the query embedding vectors attend to its key vectors
@@ -92,11 +93,11 @@ neighboring chunks and :obj:`config.lsh_num_chunks_after` following neighboring
 For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post
 <https://www.pragmatic.ml/reformer-deep-dive/>`__.

-Note that :obj:`config.num_buckets` can also be factorized into a list
-:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to
-one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of
-:math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`.
-This is crucial for very long sequences to save memory.
+Note that :obj:`config.num_buckets` can also be factorized into a list :math:`(n_{\text{buckets}}^1,
+n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots,
+n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots,
+1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to
+save memory.

 When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the
 sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be
@@ -128,7 +129,7 @@ multiple of :obj:`config.lsh_chunk_length` and :obj:`config.local_chunk_length`
 Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can
 easily be trained on sequences as long as 64000 tokens.

-For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows: 
+For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows:

 .. code-block::


--- a/docs/source/model_doc/roberta.rst
+++ b/docs/source/model_doc/roberta.rst
@@ -8,8 +8,8 @@ The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretrainin
 <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
 Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.

-It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
-objective and training with much larger mini-batches and learning rates.
+It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with
+much larger mini-batches and learning rates.

 The abstract from the paper is the following:

@@ -17,15 +17,15 @@ The abstract from the paper is the following:
 approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
 and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
 study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
-training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of
-every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These
-results highlight the importance of previously overlooked design choices, and raise questions about the source
-of recently reported improvements. We release our models and code.*
+training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every
+model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results
+highlight the importance of previously overlooked design choices, and raise questions about the source of recently
+reported improvements. We release our models and code.*

 Tips:

- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
-  setup for Roberta pretrained models.
+- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a setup
+  for Roberta pretrained models.
 - RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
  different pretraining scheme.
 - RoBERTa doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just

--- a/docs/source/model_doc/squeezebert.rst
+++ b/docs/source/model_doc/squeezebert.rst
@@ -4,38 +4,34 @@ SqueezeBERT
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The SqueezeBERT model was proposed in
-`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
-<https://arxiv.org/abs/2006.11316>`__
-by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer.
-It's a bidirectional transformer similar to the BERT model.
-The key difference between the BERT architecture and the SqueezeBERT architecture
-is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
+The SqueezeBERT model was proposed in `SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
+<https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It's a
+bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the
+SqueezeBERT architecture is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
 instead of fully-connected layers for the Q, K, V and FFN layers.

 The abstract from the paper is the following:

-*Humans read and write hundreds of billions of messages every day. Further, due to the availability of
-large datasets, large computing systems, and better neural network models, natural language processing (NLP)
-technology has made significant strides in understanding, proofreading, and organizing these messages.
-Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users,
-social networks, and businesses. In particular, we consider smartphones and other mobile devices as
-crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network
-models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds
-to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped
-convolutions have yielded significant speedups for computer vision networks, but many of these techniques
-have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
-self-attention layers with grouped convolutions, and we use this technique in a novel network architecture
-called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive
-accuracy on the GLUE test set. The SqueezeBERT code will be released.*
+*Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets,
+large computing systems, and better neural network models, natural language processing (NLP) technology has made
+significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant
+opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we
+consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's
+highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with
+BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods
+such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these
+techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
+self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called
+SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test
+set. The SqueezeBERT code will be released.*

 Tips:

- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
-  the right rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
-  It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
-  text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
+- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
+  rather than the left.
+- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
+  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
+  with a causal language modeling (CLM) objective are better in that regard.
 - For best results when finetuning on sequence classification tasks, it is recommended to start with the
  `squeezebert/squeezebert-mnli-headless` checkpoint.


--- a/docs/source/model_doc/t5.rst
+++ b/docs/source/model_doc/t5.rst
@@ -29,13 +29,12 @@ Tips:
  each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
  different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
  for summarization: *summarize: ...*.
-  
+
  For more information about which prefix to use, it is easiest to look into Appendix D of the `paper
-  <https://arxiv.org/pdf/1910.10683.pdf>`__.
- For sequence-to-sequence generation, it is recommended to use :obj:`T5ForConditionalGeneration.generate()``. This
-  method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively
-  generates the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
+  <https://arxiv.org/pdf/1910.10683.pdf>`__. - For sequence-to-sequence generation, it is recommended to use
+  :obj:`T5ForConditionalGeneration.generate()``. This method takes care of feeding the encoded input via
+  cross-attention layers to the decoder and auto-regressively generates the decoder output. - T5 uses relative scalar
+  embeddings. Encoder input padding can be done on the left and on the right.

 The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.

@@ -51,14 +50,14 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash

 - Unsupervised denoising training

-  In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) 
-  and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. 
-  Each sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`, 
+  In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
+  the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
+  sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`,
  :obj:`<extra_id_1>`, ... up to :obj:`<extra_id_99>`. As a default, 100 sentinel tokens are available in
  :class:`~transformers.T5Tokenizer`.
-  
+
  For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
-  processed as follows: 
+  processed as follows:

 .. code-block::

@@ -69,10 +68,10 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash

 - Supervised training

-  In this setup the input sequence and output sequence are standard sequence-to-sequence input output mapping.
-  In translation, for instance with the input sequence "The house is wonderful." and output sequence "Das Haus ist
+  In this setup the input sequence and output sequence are standard sequence-to-sequence input output mapping. In
+  translation, for instance with the input sequence "The house is wonderful." and output sequence "Das Haus ist
  wunderbar.", the sentences should be processed as follows:
-  
+
 .. code-block::

  input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids

--- a/docs/source/model_doc/transformerxl.rst
+++ b/docs/source/model_doc/transformerxl.rst
@@ -14,19 +14,19 @@ The abstract from the paper is the following:

 *Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the
 setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency
-beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and
-a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves
-the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and
-450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up
-to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results
-of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on
-Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably
+beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a
+novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the
+context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450%
+longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+
+times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of
+bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn
+Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably
 coherent, novel text articles with thousands of tokens.*

 Tips:

- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right.
-  The original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
+- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The
+  original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
 - Transformer-XL is one of the few models that has no sequence length limit.

 The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.

--- a/docs/source/model_doc/xlm.rst
+++ b/docs/source/model_doc/xlm.rst
@@ -14,21 +14,21 @@ Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the
 The abstract from the paper is the following:

 *Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
-In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining.
-We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
+In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We
+propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
 data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain
-state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI,
-our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation,
-we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On
-supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming
-the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*
+state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our
+approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we
+obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised
+machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the
+previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*

 Tips:

 - XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
  select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the
-  :doc:`multi-lingual <../multilingual>` page for more information.
+- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the :doc:`multi-lingual
+  <../multilingual>` page for more information.

 The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.


--- a/docs/source/model_doc/xlmprophetnet.rst
+++ b/docs/source/model_doc/xlmprophetnet.rst
@@ -9,13 +9,25 @@ XLM-ProphetNet
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
+The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
+<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
+Zhang, Ming Zhou on 13 Jan, 2020.

-XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just the next token. Its architecture is identical to ProhpetNet, but the model was trained on the multi-lingual "wiki100" Wikipedia dump.
+XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of
+just the next token. Its architecture is identical to ProhpetNet, but the model was trained on the multi-lingual
+"wiki100" Wikipedia dump.

 The abstract from the paper is the following:

-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
+the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
+n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
+step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
+overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
+dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
+abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
+state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*

 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.


--- a/docs/source/model_doc/xlmroberta.rst
+++ b/docs/source/model_doc/xlmroberta.rst
@@ -12,25 +12,25 @@ data.

 The abstract from the paper is the following:

-*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for
-a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
+*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a
+wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
 languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly
-outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy
-on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
-low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model.
-We also present a detailed empirical evaluation of the key factors that are required to achieve these gains,
-including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and
-low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling
-without sacrificing per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE
-and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.*
+outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on
+XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
+low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We
+also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the
+trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource
+languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing
+per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We
+will make XLM-R code, data, and models publicly available.*

 Tips:

 - XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
  not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct
  language from the input ids.
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage
-  examples as well as the information relative to the inputs and outputs.
+- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
+  as well as the information relative to the inputs and outputs.

 The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.


--- a/docs/source/model_doc/xlnet.rst
+++ b/docs/source/model_doc/xlnet.rst
@@ -16,11 +16,11 @@ The abstract from the paper is the following:
 better performance than pretraining approaches based on autoregressive language modeling. However, relying on
 corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a
 pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive
-pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over
-all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
-formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model,
-into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by
-a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
+pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all
+permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
+formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into
+pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large
+margin, including question answering, natural language inference, sentiment analysis, and document ranking.*

 Tips:


--- a/docs/source/model_sharing.rst
+++ b/docs/source/model_sharing.rst
@@ -15,8 +15,8 @@ Prepare your model for uploading

 We have seen in the :doc:`training tutorial <training>`: how to fine-tune a model on a given task. You have probably
 done something similar on your task, either using the model directly in your own training loop or using the
-:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on
-the `model hub <https://huggingface.co/models>`__.
+:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on the
+`model hub <https://huggingface.co/models>`__.

 Basic steps
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -60,22 +60,20 @@ Make your model work on all frameworks

 You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
 PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
-your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's super easy to do (and in a future version,
-it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to
-worry about the GPU, so it should be very easy. Check the
-`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ 
-and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
+your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's
+super easy to do (and in a future version, it will all be automatic). You will need to install both PyTorch and
+TensorFlow for this step, but you don't need to worry about the GPU, so it should be very easy. Check the `TensorFlow
+installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ and/or the `PyTorch
+installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.

 First check that your model class exists in the other framework, that is try to import the same model by either adding
-or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to
-type
+or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to type

 .. code-block::

    from transformers import TFDistilBertForSequenceClassification

-and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to
-type
+and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to type

 .. code-block::

@@ -112,7 +110,8 @@ Make sure there are no garbage files in the directory you'll upload. It should o
 - a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason) ;
 - a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
 - a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>` save;
+- files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part
+  of your :doc:`tokenizer <main_classes/tokenizer>` save;
 - maybe a `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.

 Other files can safely be deleted.
@@ -135,7 +134,8 @@ Then log in using the same credentials as on huggingface.co. To upload your mode

 This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.

-By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple models and need to script that process, you can add `-y` to bypass the prompt. For example:
+By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple
+models and need to script that process, you can add `-y` to bypass the prompt. For example:

 .. code-block::

@@ -179,15 +179,15 @@ Add a model card
 To make sure everyone knows what your model can do, what its limitations and potential bias or ethetical
 considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
 placed in a subfolder with your username or organization, then another subfolder named like your model
-(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will
-get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a
-model card template (meta-suggestions are welcome).
+(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will get
+you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a model
+card template (meta-suggestions are welcome).

 If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
 don't forget to link to its model card so that people can fully trace how your model was built.

-If you have never made a pull request to the 🤗 Transformers repo, look at the
-:doc:`contributing guide <contributing>` to see the steps to follow.
+If you have never made a pull request to the 🤗 Transformers repo, look at the :doc:`contributing guide <contributing>`
+to see the steps to follow.

 .. Note::


--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
 Multi-lingual models
 =======================================================================================================================

-Most of the models available in this library are mono-lingual models (English, Chinese and German). A few
-multi-lingual models are available and have a different mechanisms than mono-lingual models.
-This page details the usage of these models.
+Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
+models are available and have a different mechanisms than mono-lingual models. This page details the usage of these
+models.

 The two models that currently support multiple languages are BERT and XLM.

@@ -28,8 +28,8 @@ This section concerns the following checkpoints:

 These checkpoints require language embeddings that will specify the language used at inference time. These language
 embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
-these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes
-from the tokenizer.
+these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes from
+the tokenizer.

 Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):

@@ -78,8 +78,9 @@ You can then feed it all as input to your model:
    >>> outputs = model(input_ids, langs=langs)


-The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__
-can generate text using the CLM checkpoints from XLM, using the language embeddings.
+The example `run_generation.py
+<https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ can generate
+text using the CLM checkpoints from XLM, using the language embeddings.

 XLM without Language Embeddings
 -----------------------------------------------------------------------------------------------------------------------
@@ -89,8 +90,8 @@ This section concerns the following checkpoints:
 - ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
 - ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)

-These checkpoints do not require language embeddings at inference time. These models are used to have generic
-sentence representations, differently from previously-mentioned XLM checkpoints.
+These checkpoints do not require language embeddings at inference time. These models are used to have generic sentence
+representations, differently from previously-mentioned XLM checkpoints.


 BERT
@@ -101,15 +102,15 @@ BERT has two checkpoints that can be used for multi-lingual tasks:
 - ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
 - ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)

-These checkpoints do not require language embeddings at inference time. They should identify the language
-used in the context and infer accordingly.
+These checkpoints do not require language embeddings at inference time. They should identify the language used in the
+context and infer accordingly.

 XLM-RoBERTa
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong
-gains over previously released multi-lingual models like mBERT or XLM on downstream taks like classification,
-sequence labeling and question answering.
+XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
+over previously released multi-lingual models like mBERT or XLM on downstream taks like classification, sequence
+labeling and question answering.

 Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:


--- a/docs/source/perplexity.rst
+++ b/docs/source/perplexity.rst
 Perplexity of fixed-length models
 =======================================================================================================================

-Perplexity (PPL) is one of the most common metrics for evaluating language
-models. Before diving in, we should note that the metric applies specifically
-to classical language models (sometimes called autoregressive or causal
-language models) and is not well defined for masked language models like BERT
-(see :doc:`summary of the models <model_summary>`).
+Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
+that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
+models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
+<model_summary>`).

-Perplexity is defined as the exponentiated average log-likelihood of a
-sequence. If we have a tokenized sequence :math:`X = (x_0, x_1, \dots, x_t)`,
-then the perplexity of :math:`X` is,
+Perplexity is defined as the exponentiated average log-likelihood of a sequence. If we have a tokenized sequence
+:math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,

 .. math::

    \text{PPL}(X)
    = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}

-where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith
-token conditioned on the preceding tokens :math:`x_{<i}` according to our
-model. Intuitively, it can be thought of as an evaluation of the model's
-ability to predict uniformly among the set of specified tokens in a corpus.
-Importantly, this means that the tokenization procedure has a direct impact
-on a model's perplexity which should always be taken into consideration when
-comparing different models.
+where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
+:math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
+predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
+procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing
+different models.

-This is also equivalent to the exponentiation of the cross-entropy between
-the data and model predictions. For more intuition about perplexity and its
-relationship to Bits Per Character (BPC) and data compression, check out this
-`fantastic blog post on The Gradient
-<https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
+This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
+intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
+`fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.

 Calculating PPL with fixed-length models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-If we weren't limited by a model's context size, we would evaluate the
-model's perplexity by autoregressively factorizing a sequence and
-conditioning on the entire preceding subsequence at each step, as shown
-below.
+If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
+factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

 .. image:: imgs/ppl_full.gif
    :width: 600
    :alt: Full decomposition of a sequence with unlimited context length

-When working with approximate models, however, we typically have a constraint
-on the number of tokens the model can process. The largest version
-of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024
-tokens, so we cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when
-:math:`t` is greater than 1024.
+When working with approximate models, however, we typically have a constraint on the number of tokens the model can
+process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
+cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.

-Instead, the sequence is typically broken into subsequences equal to the
-model's maximum input size. If a model's max input size is :math:`k`, we
-then approximate the likelihood of a token :math:`x_t` by conditioning only
-on the :math:`k-1` tokens that precede it rather than the entire context.
-When evaluating the model's perplexity of a sequence, a tempting but
-suboptimal approach is to break the sequence into disjoint chunks and
-add up the decomposed log-likelihoods of each segment independently.
+Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
+input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
+:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
+sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
+log-likelihoods of each segment independently.

 .. image:: imgs/ppl_chunked.gif
    :width: 600
    :alt: Suboptimal PPL not taking advantage of full available context

-This is quick to compute since the perplexity of each segment can be computed
-in one forward pass, but serves as a poor approximation of the
-fully-factorized perplexity and will typically yield a higher (worse) PPL
-because the model will have less context at most of the prediction steps.
+This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
+approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
+have less context at most of the prediction steps.

-Instead, the PPL of fixed-length models should be evaluated with a
-sliding-window strategy. This involves repeatedly sliding the
-context window so that the model has more context when making each
-prediction.
+Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
+sliding the context window so that the model has more context when making each prediction.

 .. image:: imgs/ppl_sliding.gif
    :width: 600
    :alt: Sliding window PPL taking advantage of all available context

-This is a closer approximation to the true decomposition of the
-sequence probability and will typically yield a more favorable score.
-The downside is that it requires a separate forward pass for each token in
-the corpus. A good practical compromise is to employ a strided sliding
-window, moving the context by larger strides rather than sliding by 1 token a
-time. This allows computation to procede much faster while still giving the
-model a large context to make predictions at each step.
+This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
+favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
+practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
+1 token a time. This allows computation to procede much faster while still giving the model a large context to make
+predictions at each step.

 Example: Calculating perplexity with GPT-2 in 🤗 Transformers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -95,10 +78,9 @@ Let's demonstrate this process with GPT-2.
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

-We'll load in the WikiText-2 dataset and evaluate the perplexity using a few
-different sliding-window strategies. Since this dataset is small and we're
-just doing one forward pass over the set, we can just load and encode the
-entire dataset in memory.
+We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
+this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
+dataset in memory.

 .. code-block:: python

@@ -106,16 +88,13 @@ entire dataset in memory.
    test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')

-With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels``
-to our model, and the average log-likelihood for each token is returned as
-the loss. With our sliding window approach, however, there is overlap in the
-tokens we pass to the model at each iteration. We don't want the
-log-likelihood for the tokens we're just treating as context to be included
-in our loss, so we can set these targets to ``-100`` so that they are
-ignored. The following is an example of how we could do this with a stride of
-``512``. This means that the model will have at least 512 tokens for context
-when calculating the conditional likelihood of any one token (provided there
-are 512 preceding tokens available to condition on).
+With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average
+log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
+the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
+as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
+is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
+for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
+available to condition on).

 .. code-block:: python

@@ -139,14 +118,11 @@ are 512 preceding tokens available to condition on).

    ppl = torch.exp(torch.stack(lls).sum() / end_loc)

-Running this with the stride length equal to the max input length is
-equivalent to the suboptimal, non-sliding-window strategy we discussed above.
-The smaller the stride, the more context the model will have in making each
-prediction, and the better the reported perplexity will typically be.
-
-When we run the above with ``stride = 1024``, i.e. no overlap, the resulting
-PPL is ``19.64``, which is about the same as the ``19.93`` reported in the
-GPT-2 paper. By using ``stride = 512`` and thereby employing our striding
-window strategy, this jumps down to ``16.53``. This is not only a more
-favorable score, but is calculated in a way that is closer to the true
-autoregressive decomposition of a sequence likelihood.
+Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
+strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
+and the better the reported perplexity will typically be.
+
+When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
+as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our striding window
+strategy, this jumps down to ``16.53``. This is not only a more favorable score, but is calculated in a way that is
+closer to the true autoregressive decomposition of a sequence likelihood.
--- a/docs/source/philosophy.rst
+++ b/docs/source/philosophy.rst
@@ -12,15 +12,15 @@ The library was designed with two strong goals in mind:
 - Be as easy and fast to use as possible:

    - We strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions,
-      just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`, 
+      just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`,
      :doc:`models <main_classes/model>` and :doc:`tokenizer <main_classes/tokenizer>`.
    - All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
      :obj:`from_pretrained()` instantiation method which will take care of downloading (if needed), caching and
-      loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary, 
-      and models' weights) from a pretrained checkpoint provided on 
-      `Hugging Face Hub <https://huggingface.co/models>`__ or your own saved checkpoint.
+      loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
+      and models' weights) from a pretrained checkpoint provided on `Hugging Face Hub
+      <https://huggingface.co/models>`__ or your own saved checkpoint.
    - On top of those three base classes, the library provides two APIs: :func:`~transformers.pipeline` for quickly
-      using a model (plus its associated tokenizer and configuration) on a given task and 
+      using a model (plus its associated tokenizer and configuration) on a given task and
      :func:`~transformers.Trainer`/:func:`~transformers.TFTrainer` to quickly train or fine-tune a given model.
    - As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to
      extend/build-upon the library, just use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the base
@@ -52,10 +52,10 @@ Main concepts

 The library is built around three types of classes for each model:

- **Model classes**  such as :class:`~transformers.BertModel`, which are 30+ PyTorch models 
-  (`torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models 
-  (`tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained
-  weights provided in the library.
+- **Model classes** such as :class:`~transformers.BertModel`, which are 30+ PyTorch models (`torch.nn.Module
+  <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models (`tf.keras.Model
+  <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained weights provided in the
+  library.
 - **Configuration classes** such as :class:`~transformers.BertConfig`, which store all the parameters required to build
  a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
  without any modification, creating the model will automatically take care of instantiating the configuration (which
@@ -66,8 +66,8 @@ The library is built around three types of classes for each model:
 All these classes can be instantiated from pretrained instances and saved locally using two methods:

 - :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
-  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`
-  or stored locally (or on a server) by the user,
+  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>` or
+  stored locally (or on a server) by the user,
 - :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
  :obj:`from_pretrained()`.