Tips + whitespaces

Commit 9ddf60b6 authored Jan 21, 2020 by Lysandre Committed by Lysandre Debut Jan 23, 2020
20 changed files
--- a/docs/source/model_doc/camembert.rst
+++ b/docs/source/model_doc/camembert.rst
 CamemBERT
 ----------------------------------------------------
-The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model`_
+The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__
 by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
-Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019.
+Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
+trained on 138GB of French text.
-It is a model trained on 138GB of French text.
+The abstract from the paper is the following:
-This implementation is the same as RoBERTa.
+*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success,
+most available models have either been trained on English data or on the concatenation of data in multiple
+languages. This makes practical use of such models --in all languages except English-- very limited. Aiming
+to address this issue for French, we release CamemBERT, a French version of the Bi-directional Encoders for
+Transformers (BERT). We measure the performance of CamemBERT compared to multilingual models in multiple
+downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural
+language inference. CamemBERT improves the state of the art for most of the tasks considered. We release the
+pretrained model for CamemBERT hoping to foster research and downstream applications for French NLP.*
-``CamembertConfig``
+Tips:
+- This implementation is the same as RoBERTa. Refer to the `documentation of RoBERTa <./roberta.html>`__ for usage
+  examples as well as information about the inputs and outputs.
+CamembertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertConfig
    :members:
-``CamembertTokenizer``
+CamembertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertTokenizer
    :members:
-``CamembertModel``
+CamembertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertModel
    :members:
-``CamembertForMaskedLM``
+CamembertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForMaskedLM
    :members:
-``CamembertForSequenceClassification``
+CamembertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForSequenceClassification
    :members:
-``CamembertForMultipleChoice``
+CamembertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForMultipleChoice
    :members:
-``CamembertForTokenClassification``
+CamembertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForTokenClassification

--- a/docs/source/model_doc/ctrl.rst
+++ b/docs/source/model_doc/ctrl.rst
@@ -6,51 +6,68 @@ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Ric
 It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
 corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
-This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+The abstract from the paper is the following:
-Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
-usage and behavior.
+*Large-scale language models show promising text generation capabilities, but users cannot easily control particular
+aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
-Note: if you fine-tune a CTRL model using the Salesforce code (https://github.com/salesforce/ctrl),
+trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
-you'll be able to convert from TF to our HuggingFace/Transformers format using the 
+derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning
-``convert_tf_to_huggingface_pytorch.py`` script (see `issue #1654 <https://github.com/huggingface/transformers/issues/1654>`_).
+while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of
+the training data are most likely given a sequence. This provides a potential method for analyzing large amounts
+of data via model-based source attribution.*
-``CTRLConfig``
+Tips:
+- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
+  or links to generate coherent text. Refer to the `original implementation <https://github.com/salesforce/ctrl>`__
+  for more information.
+- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
+  the right rather than the left.
+- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
+  token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text, as
+  can be observed in the `run_generation.py` example script.
+- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
+  this `past` value prevents the model from re-computing pre-computed values in the context of text generation.
+  See `reusing the past in generative models <../quickstart.html#using-the-past>`_ for more information on the usage
+  of this argument.
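
The `past` tip above can be illustrated with a toy sketch. This is not the transformers API: `project` and `step` are hypothetical helpers, and the "projections" are placeholder arithmetic. It only shows why passing cached key/value pairs avoids recomputing them for tokens that were already processed.

```python
# Toy illustration of key/value caching; NOT the transformers API.
from typing import List, Optional, Tuple

Vector = List[float]
Past = Tuple[List[Vector], List[Vector]]  # (cached keys, cached values)


def project(token_id: int, offset: float) -> Vector:
    """Hypothetical stand-in for a learned key or value projection."""
    return [token_id + offset, token_id * 0.5 + offset]


def step(new_ids: List[int], past: Optional[Past] = None) -> Tuple[int, Past]:
    """Process only `new_ids`, reusing cached keys/values from `past`.

    Returns how many tokens were projected in this call and the updated cache.
    """
    keys, values = (list(past[0]), list(past[1])) if past else ([], [])
    for token_id in new_ids:
        keys.append(project(token_id, 0.0))
        values.append(project(token_id, 1.0))
    return len(new_ids), (keys, values)


prompt = [11, 12, 13]
computed_prompt, past = step(prompt)         # full pass over the prompt
computed_next, past = step([14], past=past)  # only the new token is projected
_, recomputed = step(prompt + [14])          # no cache: everything is recomputed
```

With the cache, the second call projects a single token, yet ends up with the same keys and values as recomputing the whole sequence from scratch.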
+CTRLConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLConfig
    :members:
-``CTRLTokenizer``
+CTRLTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLTokenizer
    :members:
-``CTRLModel``
+CTRLModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLModel
    :members:
-``CTRLLMHeadModel``
+CTRLLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLLMHeadModel
    :members:
-``TFCTRLModel``
+TFCTRLModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCTRLModel
    :members:
-``TFCTRLLMHeadModel``
+TFCTRLLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCTRLLMHeadModel

--- a/docs/source/model_doc/distilbert.rst
+++ b/docs/source/model_doc/distilbert.rst
 DistilBERT
 ----------------------------------------------------
-DistilBERT is a small, fast, cheap and light Transformer model
+The DistilBERT model was proposed in the blog post
-trained by distilling Bert base. It has 40% less parameters than
+`Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__,
-`bert-base-uncased`, runs 60% faster while preserving over 95% of
+and the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__.
-Bert's performances as measured on the GLUE language understanding benchmark.
+DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer
+parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance as measured on
-Here are the differences between the interface of Bert and DistilBert:
+the GLUE language understanding benchmark.
+The abstract from the paper is the following:
+*As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
+operating these large models in on-the-edge and/or under constrained computational training or inference budgets
+remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
+model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
+counterparts. While most prior work investigated the use of distillation for building task-specific models, we
+leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a
+BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage
+the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language
+modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train
+and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative
+on-device study.*
+Tips:
 - DistilBert doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
 - DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary, though; just let us know if you need this option.
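
As a rough sketch of the first tip, the snippet below joins two segments with a separator and produces no segment ids. `SEP_TOKEN` and `build_pair_input` are hypothetical names used only for illustration; in practice the separator would come from `tokenizer.sep_token`, and the tokenizer may add other special tokens that are left out here.

```python
from typing import Dict, List

SEP_TOKEN = "[SEP]"  # assumed value of tokenizer.sep_token


def build_pair_input(segment_a: List[str], segment_b: List[str]) -> Dict[str, List[str]]:
    """Join two token lists with the separator; no token_type_ids are produced."""
    return {"tokens": segment_a + [SEP_TOKEN] + segment_b}


pair = build_pair_input(["how", "are", "you"], ["fine", "thanks"])
```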
-For more information on DistilBERT, please refer to our
-`detailed blog post`_
-.. _`detailed blog post`:
-    https://medium.com/huggingface/distilbert-8cf3380435b5
-``DistilBertConfig``
+DistilBertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertConfig
    :members:
-``DistilBertTokenizer``
+DistilBertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertTokenizer
    :members:
-``DistilBertModel``
+DistilBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertModel
    :members:
-``DistilBertForMaskedLM``
+DistilBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertForMaskedLM
    :members:
-``DistilBertForSequenceClassification``
+DistilBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertForSequenceClassification
    :members:
-``DistilBertForQuestionAnswering``
+DistilBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertForQuestionAnswering
    :members:
-``TFDistilBertModel``
+TFDistilBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertModel
    :members:
-``TFDistilBertForMaskedLM``
+TFDistilBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertForMaskedLM
    :members:
-``TFDistilBertForSequenceClassification``
+TFDistilBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertForSequenceClassification
    :members:
-``TFDistilBertForQuestionAnswering``
+TFDistilBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertForQuestionAnswering

--- a/docs/source/model_doc/gpt.rst
+++ b/docs/source/model_doc/gpt.rst
 OpenAI GPT
 ----------------------------------------------------
+Overview
+~~~~~~~~~~~~~~~~~~~~~
 OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training`_
 by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional)
 transformer pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
@@ -33,56 +36,56 @@ Tips:
 `Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by
 Hugging Face showcasing the generative capabilities of several models. GPT is one of them.
-``OpenAIGPTConfig``
+OpenAIGPTConfig
 ~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTConfig
    :members:
-``OpenAIGPTTokenizer``
+OpenAIGPTTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTTokenizer
    :members:
-``OpenAIGPTModel``
+OpenAIGPTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTModel
    :members:
-``OpenAIGPTLMHeadModel``
+OpenAIGPTLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTLMHeadModel
    :members:
-``OpenAIGPTDoubleHeadsModel``
+OpenAIGPTDoubleHeadsModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTDoubleHeadsModel
    :members:
-``TFOpenAIGPTModel``
+TFOpenAIGPTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFOpenAIGPTModel
    :members:
-``TFOpenAIGPTLMHeadModel``
+TFOpenAIGPTLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFOpenAIGPTLMHeadModel
    :members:
-``TFOpenAIGPTDoubleHeadsModel``
+TFOpenAIGPTDoubleHeadsModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFOpenAIGPTDoubleHeadsModel

--- a/docs/source/model_doc/gpt2.rst
+++ b/docs/source/model_doc/gpt2.rst
@@ -35,56 +35,56 @@ Hugging Face showcasing the generative capabilities of several models. GPT-2 is
 different sizes: small, medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.
-``GPT2Config``
+GPT2Config
 ~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2Config
    :members:
-``GPT2Tokenizer``
+GPT2Tokenizer
 ~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2Tokenizer
    :members:
-``GPT2Model``
+GPT2Model
 ~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2Model
    :members:
-``GPT2LMHeadModel``
+GPT2LMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2LMHeadModel
    :members:
-``GPT2DoubleHeadsModel``
+GPT2DoubleHeadsModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2DoubleHeadsModel
    :members:
-``TFGPT2Model``
+TFGPT2Model
 ~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFGPT2Model
    :members:
-``TFGPT2LMHeadModel``
+TFGPT2LMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFGPT2LMHeadModel
    :members:
-``TFGPT2DoubleHeadsModel``
+TFGPT2DoubleHeadsModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFGPT2DoubleHeadsModel

--- a/docs/source/model_doc/roberta.rst
+++ b/docs/source/model_doc/roberta.rst
 RoBERTa
 ----------------------------------------------------
-The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretraining Approach`_
+The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_
 by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer
 and Veselin Stoyanov. It is based on Google's BERT model released in 2018.
 It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
 objective and training with much larger mini-batches and learning rates.
-This implementation is the same as BertModel with a tiny embeddings tweak as well as a setup for Roberta pretrained
+The abstract from the paper is the following:
-models.
+*Language model pretraining has led to significant performance gains but careful comparison between different
+approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
+and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
+study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
+training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of
+every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These
+results highlight the importance of previously overlooked design choices, and raise questions about the source
+of recently reported improvements. We release our models and code.*
+Tips:
+- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
+  setup for Roberta pretrained models.
+- `Camembert <./camembert.html>`__ is a wrapper around RoBERTa. Refer to this page for usage examples.
 RobertaConfig
 ~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/transformerxl.rst
+++ b/docs/source/model_doc/transformerxl.rst
 Transformer XL
 ----------------------------------------------------
+Overview
+~~~~~~~~~~~~~~~~~~~~~
 The Transformer-XL model was proposed in
-`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context`_
+`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__
 by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
 It's a causal (uni-directional) transformer with relative positioning (sinusoidal) embeddings which can reuse
 previously computed hidden-states to attend to longer context (memory).
@@ -23,46 +26,47 @@ coherent, novel text articles with thousands of tokens.*
 Tips:
-- Transformer-XL uses relative sinusoidal positional embeddings so it's usually advised to pad the inputs on
+- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right.
-  the left rather than the right.
+  The original implementation trains on SQuAD with padding on the left; therefore, the padding defaults are set to left.
+- Transformer-XL is one of the few models that have no sequence length limit.
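
To make the padding tip concrete, here is a minimal sketch of left versus right padding together with the matching attention mask. The `pad` helper, the `pad_id=0` value and the mask convention (1 for real tokens, 0 for padding) are assumptions made for illustration, not the library's tokenizer API.

```python
from typing import List, Tuple


def pad(ids: List[int], length: int, pad_id: int = 0, side: str = "left") -> Tuple[List[int], List[int]]:
    """Pad `ids` to `length` on the given side and return (padded_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if len(ids) > length:
        raise ValueError("sequence is longer than the target length")
    filler = [pad_id] * (length - len(ids))
    mask_real = [1] * len(ids)
    mask_pad = [0] * len(filler)
    if side == "left":
        return filler + ids, mask_pad + mask_real
    return ids + filler, mask_real + mask_pad


left_ids, left_mask = pad([5, 6, 7], 5, side="left")
right_ids, right_mask = pad([5, 6, 7], 5, side="right")
```

With relative positional embeddings either layout can work; with absolute position embeddings, right padding keeps the real tokens at the positions they had during training.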
-``TransfoXLConfig``
+TransfoXLConfig
 ~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TransfoXLConfig
    :members:
-``TransfoXLTokenizer``
+TransfoXLTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TransfoXLTokenizer
    :members:
-``TransfoXLModel``
+TransfoXLModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TransfoXLModel
    :members:
-``TransfoXLLMHeadModel``
+TransfoXLLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TransfoXLLMHeadModel
    :members:
-``TFTransfoXLModel``
+TFTransfoXLModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFTransfoXLModel
    :members:
-``TFTransfoXLLMHeadModel``
+TFTransfoXLLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFTransfoXLLMHeadModel

--- a/docs/source/model_doc/xlm.rst
+++ b/docs/source/model_doc/xlm.rst
 XLM
 ----------------------------------------------------
-The XLM model was proposed in `Cross-lingual Language Model Pretraining`_
+Overview
+~~~~~~~~~~~~~~~~~~~~~
+The XLM model was proposed in `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_
 by Guillaume Lample*, Alexis Conneau*. It's a transformer pre-trained using one of the following objectives:
-    - a causal language modeling (CLM) objective (next token prediction),
+- a causal language modeling (CLM) objective (next token prediction),
-    - a masked language modeling (MLM) objective (Bert-like), or
+- a masked language modeling (MLM) objective (Bert-like), or
-    - a Translation Language Modeling (TLM) object (extension of Bert's MLM to multiple language inputs)
+- a Translation Language Modeling (TLM) objective (extension of Bert's MLM to multiple language inputs)
+The abstract from the paper is the following:
+*Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
+In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining.
+We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
+data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain
+state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI,
+our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation,
+we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On
+supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming
+the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*
-Original code can be found `here <https://github.com/facebookresearch/XLM>`_.
+Tips:
-This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
+- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
-refer to the PyTorch documentation for all matter related to general usage and behavior.
+  select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
+- XLM has multilingual checkpoints which leverage a specific `lang` parameter. Check out the
+  `multi-lingual <../multilingual.html>`__ page for more information.
 XLMConfig

--- a/docs/source/model_doc/xlmroberta.rst
+++ b/docs/source/model_doc/xlmroberta.rst
 XLM-RoBERTa
 ------------------------------------------
-The XLM-RoBERTa model was proposed in `Unsupervised Cross-lingual Representation Learning at Scale`_
+The XLM-RoBERTa model was proposed in `Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__
-by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019.
+by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán,
+Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019.
 It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
-This implementation is the same as RoBERTa.
+The abstract from the paper is the following:
-This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
-refer to the PyTorch documentation for all matter related to general usage and behavior.
-.. _`Unsupervised Cross-lingual Representation Learning at Scale`:
+*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for
-    https://arxiv.org/abs/1911.02116
+a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
+languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly
+outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy
+on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
+low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model.
+We also present a detailed empirical evaluation of the key factors that are required to achieve these gains,
+including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and
+low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling
+without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE
+and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.*
-.. _`torch.nn.Module`:
+Tips:
-    https://pytorch.org/docs/stable/nn.html#module
+- This implementation is the same as RoBERTa. Refer to the `documentation of RoBERTa <./roberta.html>`__ for usage
+  examples as well as information about the inputs and outputs.
 XLMRobertaConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/xlnet.rst
+++ b/docs/source/model_doc/xlnet.rst
 XLNet
 ----------------------------------------------------
-The XLNet model was proposed in `XLNet: Generalized Autoregressive Pretraining for Language Understanding`_
+Overview
+~~~~~~~~~~~~~~~~~~~~~
+The XLNet model was proposed in `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_
 by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method
 to learn bidirectional contexts by maximizing the expected likelihood over all permutations
 of the input sequence factorization order.
-The specific attention pattern can be controlled at training and test time using the `perm_mask` input.
+The abstract from the paper is the following:
+*With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves
+better performance than pretraining approaches based on autoregressive language modeling. However, relying on
+corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a
+pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive
+pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over
+all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
+formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model,
+into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by
+a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
-Due to the difficulty of training a fully auto-regressive model over various factorization order,
+Tips:
-XLNet is pretrained using only a sub-set of the output tokens as target which are selected
-with the `target_mapping` input.
-To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and
+- The specific attention pattern can be controlled at training and test time using the `perm_mask` input.
-`target_mapping` inputs to control the attention span and outputs (see examples in `examples/run_generation.py`)
+- Due to the difficulty of training a fully auto-regressive model over various factorization orders,
+  XLNet is pretrained using only a sub-set of the output tokens as targets, which are selected
+  with the `target_mapping` input.
+- To use XLNet for sequential decoding (i.e. not in a fully bi-directional setting), use the `perm_mask` and
+  `target_mapping` inputs to control the attention span and outputs (see examples in `examples/run_generation.py`).
+- XLNet is one of the few models that have no sequence length limit.
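
A minimal sketch of how `perm_mask` and `target_mapping` might be laid out when predicting only the last position, written with plain lists and no batch dimension. It assumes the convention that `perm_mask[i][j] == 1.0` means position `i` does not attend to position `j`, and that `target_mapping[t][j] == 1.0` means the `t`-th prediction is made for position `j`. The helper name `decoding_masks` is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def decoding_masks(seq_len: int) -> Tuple[Matrix, Matrix]:
    """Build masks so that only the last position is hidden and predicted."""
    last = seq_len - 1
    # No position may attend to the last token, whose content is unknown.
    perm_mask = [[1.0 if j == last else 0.0 for j in range(seq_len)] for _ in range(seq_len)]
    # A single prediction target, placed on the last position.
    target_mapping = [[1.0 if j == last else 0.0 for j in range(seq_len)]]
    return perm_mask, target_mapping


perm_mask, target_mapping = decoding_masks(3)
```

For real inputs these would be tensors with a leading batch dimension; the nested lists here only show the shape and the placement of the ones.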
-``XLNetConfig``
+XLNetConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetConfig
    :members:
-``XLNetTokenizer``
+XLNetTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetTokenizer
    :members:
-``XLNetModel``
+XLNetModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetModel
    :members:
-``XLNetLMHeadModel``
+XLNetLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetLMHeadModel
    :members:
-``XLNetForSequenceClassification``
+XLNetForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetForSequenceClassification
    :members:
-``XLNetForTokenClassification``
+XLNetForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetForTokenClassification
    :members:
-``XLNetForMultipleChoice``
+XLNetForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetForMultipleChoice
    :members:
-``XLNetForQuestionAnsweringSimple``
+XLNetForQuestionAnsweringSimple
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetForQuestionAnsweringSimple
    :members:
-``XLNetForQuestionAnswering``
+XLNetForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.XLNetForQuestionAnswering
    :members:
-``TFXLNetModel``
+TFXLNetModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFXLNetModel
    :members:
-``TFXLNetLMHeadModel``
+TFXLNetLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFXLNetLMHeadModel
    :members:
-``TFXLNetForSequenceClassification``
+TFXLNetForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFXLNetForSequenceClassification
    :members:
-``TFXLNetForQuestionAnsweringSimple``
+TFXLNetForQuestionAnsweringSimple
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFXLNetForQuestionAnsweringSimple

--- a/src/transformers/configuration_camembert.py
+++ b/src/transformers/configuration_camembert.py
@@ -29,35 +29,10 @@ CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 class CamembertConfig(RobertaConfig):
-    r"""
-        This is the configuration class to store the configuration of an :class:`~transformers.CamembertModel`.
-        It is used to instantiate an Camembert model according to the specified arguments, defining the model
-        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
-        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
-        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
-        for more information.
-        The :class:`~transformers.CamembertConfig` class directly inherits :class:`~transformers.BertConfig`.
-        It reuses the same defaults. Please check the parent class for more information.
-        Example::
-            from transformers import CamembertModel, CamembertConfig
-            # Initializing a CamemBERT configuration
-            configuration = CamembertConfig()
-            # Initializing a model from the configuration
-            model = CamembertModel(configuration)
-            # Accessing the model configuration
-            configuration = model.config
-        Attributes:
-            pretrained_config_archive_map (Dict[str, str]):
-                A dictionary containing all the available pre-trained checkpoints.
    """
+    This class overrides :class:`~transformers.RobertaConfig`. Please check the
+    superclass for the appropriate documentation alongside usage examples.
+    """
    pretrained_config_archive_map = CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
    model_type = "camembert"
--- a/src/transformers/file_utils.py
+++ b/src/transformers/file_utils.py
--- a/src/transformers/modeling_albert.py
+++ b/src/transformers/modeling_albert.py
@@ -377,14 +377,10 @@ class AlbertPreTrainedModel(PreTrainedModel):
 ALBERT_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
-    refer to the PyTorch documentation for all matter related to general usage and behavior.
-    .. _`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`:
+    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
-        https://arxiv.org/abs/1909.11942
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
+    usage and behavior.
-    .. _`torch.nn.Module`:
-        https://pytorch.org/docs/stable/nn.html#module
    Args:
        config (:class:`~transformers.AlbertConfig`): Model configuration class with all the parameters of the model.

--- a/src/transformers/modeling_bert.py
+++ b/src/transformers/modeling_bert.py
--- a/src/transformers/modeling_camembert.py
+++ b/src/transformers/modeling_camembert.py
--- a/src/transformers/modeling_ctrl.py
+++ b/src/transformers/modeling_ctrl.py
@@ -185,8 +185,9 @@ class CTRLPreTrainedModel(PreTrainedModel):
 CTRL_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
+    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
-    refer to the PyTorch documentation for all matter related to general usage and behavior.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
+    usage and behavior.
    Parameters:
        config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.

--- a/src/transformers/modeling_distilbert.py
+++ b/src/transformers/modeling_distilbert.py
@@ -351,6 +351,7 @@ class DistilBertPreTrainedModel(PreTrainedModel):
 DISTILBERT_START_DOCSTRING = r"""
    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
    usage and behavior.

--- a/src/transformers/modeling_gpt2.py
+++ b/src/transformers/modeling_gpt2.py
@@ -266,6 +266,7 @@ class GPT2PreTrainedModel(PreTrainedModel):
 GPT2_START_DOCSTRING = r"""
    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
    usage and behavior.

--- a/src/transformers/modeling_openai.py
+++ b/src/transformers/modeling_openai.py
@@ -280,14 +280,10 @@ class OpenAIGPTPreTrainedModel(PreTrainedModel):
 OPENAI_GPT_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
-    refer to the PyTorch documentation for all matter related to general usage and behavior.
-    .. _`Improving Language Understanding by Generative Pre-Training`:
+    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
-        https://openai.com/blog/language-unsupervised/
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
+    usage and behavior.
-    .. _`torch.nn.Module`:
-        https://pytorch.org/docs/stable/nn.html#module
    Parameters:
        config (:class:`~transformers.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.

--- a/src/transformers/modeling_roberta.py
+++ b/src/transformers/modeling_roberta.py
@@ -94,8 +94,9 @@ class RobertaEmbeddings(BertEmbeddings):
 ROBERTA_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use it as a regular PyTorch Module and
+    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
-    refer to the PyTorch documentation for all matter related to general usage and behavior.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
+    usage and behavior.
    Parameters:
        config (:class:`~transformers.RobertaConfig`): Model configuration class with all the parameters of the