Unverified Commit 207594be authored by Sylvain Gugger, committed by GitHub

Convert rst files (#14888)

* Convert all tutorials and guides

* Convert all remaining rst to mdx

* Track and fix bad links
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Wav2Vec2
## Overview
The Wav2Vec2 model was proposed in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
The abstract from the paper is the following:
*We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on
transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks
the speech input in the latent space and solves a contrastive task defined over a quantization of the latent
representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the
clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state
of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and
pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech
recognition with limited amounts of labeled data.*
Tips:
- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- The Wav2Vec2 model was trained using connectionist temporal classification (CTC), so the model output has to be decoded
using [`Wav2Vec2CTCTokenizer`] (see the example below).
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
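The sketch below illustrates the decoding step mentioned in the tips: the processor turns raw audio into `input_values`, the CTC head produces per-frame logits, and the CTC tokenizer collapses the argmax ids into text. It is a minimal, hedged example; the `facebook/wav2vec2-base-960h` checkpoint and the dummy LibriSpeech split are only convenient public choices, any fine-tuned Wav2Vec2 CTC checkpoint works the same way.
```python
from datasets import load_dataset
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# the processor bundles the feature extractor (audio -> input_values) and the CTC tokenizer (ids -> text)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# any 16 kHz mono waveform as a float array works; this dummy split is just for illustration
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# CTC decoding: argmax over the vocabulary, then collapse repeated tokens and blanks
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```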
## Wav2Vec2Config
[[autodoc]] Wav2Vec2Config
## Wav2Vec2CTCTokenizer
[[autodoc]] Wav2Vec2CTCTokenizer
- __call__
- save_vocabulary
## Wav2Vec2FeatureExtractor
[[autodoc]] Wav2Vec2FeatureExtractor
- __call__
## Wav2Vec2Processor
[[autodoc]] Wav2Vec2Processor
- __call__
- pad
- from_pretrained
- save_pretrained
- batch_decode
- decode
- as_target_processor
## Wav2Vec2ProcessorWithLM
[[autodoc]] Wav2Vec2ProcessorWithLM
- __call__
- pad
- from_pretrained
- save_pretrained
- batch_decode
- decode
- as_target_processor
## Wav2Vec2 specific outputs
[[autodoc]] models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2DecoderWithLMOutput
[[autodoc]] models.wav2vec2.modeling_wav2vec2.Wav2Vec2BaseModelOutput
[[autodoc]] models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput
[[autodoc]] models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2BaseModelOutput
[[autodoc]] models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2ForPreTrainingOutput
## Wav2Vec2Model
[[autodoc]] Wav2Vec2Model
- forward
## Wav2Vec2ForCTC
[[autodoc]] Wav2Vec2ForCTC
- forward
## Wav2Vec2ForSequenceClassification
[[autodoc]] Wav2Vec2ForSequenceClassification
- forward
## Wav2Vec2ForAudioFrameClassification
[[autodoc]] Wav2Vec2ForAudioFrameClassification
- forward
## Wav2Vec2ForXVector
[[autodoc]] Wav2Vec2ForXVector
- forward
## Wav2Vec2ForPreTraining
[[autodoc]] Wav2Vec2ForPreTraining
- forward
## TFWav2Vec2Model
[[autodoc]] TFWav2Vec2Model
- call
## TFWav2Vec2ForCTC
[[autodoc]] TFWav2Vec2ForCTC
- call
## FlaxWav2Vec2Model
[[autodoc]] FlaxWav2Vec2Model
- __call__
## FlaxWav2Vec2ForCTC
[[autodoc]] FlaxWav2Vec2ForCTC
- __call__
## FlaxWav2Vec2ForPreTraining
[[autodoc]] FlaxWav2Vec2ForPreTraining
- __call__
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# WavLM
## Overview
The WavLM model was proposed in [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen,
Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu,
Michael Zeng, Furu Wei.
Tips:
- WavLM is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. Please use
[`Wav2Vec2Processor`] for the feature extraction.
- WavLM can be fine-tuned using connectionist temporal classification (CTC), so the model output has to be decoded
using [`Wav2Vec2CTCTokenizer`].
- WavLM performs especially well on speaker verification, speaker identification, and speaker diarization tasks (see the
example below).
Relevant checkpoints can be found under https://huggingface.co/models?other=wavlm.
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
found [here](https://github.com/microsoft/unilm/tree/master/wavlm).
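As a hedged illustration of the speaker-verification use case above, the sketch below embeds two utterances with [`WavLMForXVector`] and compares them with cosine similarity; the `microsoft/wavlm-base-plus-sv` checkpoint and the dummy dataset are assumptions chosen for the example, not the only suitable ones.
```python
from datasets import load_dataset
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

# checkpoint fine-tuned for speaker verification (illustrative choice)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = [ds[0]["audio"]["array"], ds[1]["audio"]["array"]]
inputs = feature_extractor(audio, sampling_rate=16_000, padding=True, return_tensors="pt")

with torch.no_grad():
    embeddings = model(**inputs).embeddings

# cosine similarity between the two speaker embeddings; higher means more likely the same speaker
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
print(float(similarity))
```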
## WavLMConfig
[[autodoc]] WavLMConfig
## WavLM specific outputs
[[autodoc]] models.wavlm.modeling_wavlm.WavLMBaseModelOutput
## WavLMModel
[[autodoc]] WavLMModel
- forward
## WavLMForCTC
[[autodoc]] WavLMForCTC
- forward
## WavLMForSequenceClassification
[[autodoc]] WavLMForSequenceClassification
- forward
## WavLMForAudioFrameClassification
[[autodoc]] WavLMForAudioFrameClassification
- forward
## WavLMForXVector
[[autodoc]] WavLMForXVector
- forward
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# XLM
## Overview
The XLM model was proposed in [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by
Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the following objectives:
- a causal language modeling (CLM) objective (next token prediction),
- a masked language modeling (MLM) objective (BERT-like), or
- a Translation Language Modeling (TLM) objective (an extension of BERT's MLM to multiple language inputs)
The abstract from the paper is the following:
*Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We
propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain
state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our
approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we
obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised
machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the
previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*
Tips:
- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific `lang` parameter. Check out the [multi-lingual](../multilingual) page for more information.
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/facebookresearch/XLM/).
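The following sketch illustrates the language-embedding mechanism from the tips: multilingual checkpoints expect a tensor of language ids (passed as the `langs` argument of the forward pass) alongside the input ids. The `xlm-clm-enfr-1024` checkpoint is only one example of such a multilingual model.
```python
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

# the tokenizer maps language codes to the ids the checkpoint was trained with, e.g. {"en": 0, "fr": 1}
language_id = tokenizer.lang2id["en"]

input_ids = tokenizer("Wikipedia was used to", return_tensors="pt")["input_ids"]
# one language id per input position
langs = torch.full_like(input_ids, language_id)

outputs = model(input_ids, langs=langs)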
## XLMConfig
[[autodoc]] XLMConfig
## XLMTokenizer
[[autodoc]] XLMTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## XLM specific outputs
[[autodoc]] models.xlm.modeling_xlm.XLMForQuestionAnsweringOutput
## XLMModel
[[autodoc]] XLMModel
- forward
## XLMWithLMHeadModel
[[autodoc]] XLMWithLMHeadModel
- forward
## XLMForSequenceClassification
[[autodoc]] XLMForSequenceClassification
- forward
## XLMForMultipleChoice
[[autodoc]] XLMForMultipleChoice
- forward
## XLMForTokenClassification
[[autodoc]] XLMForTokenClassification
- forward
## XLMForQuestionAnsweringSimple
[[autodoc]] XLMForQuestionAnsweringSimple
- forward
## XLMForQuestionAnswering
[[autodoc]] XLMForQuestionAnswering
- forward
## TFXLMModel
[[autodoc]] TFXLMModel
- call
## TFXLMWithLMHeadModel
[[autodoc]] TFXLMWithLMHeadModel
- call
## TFXLMForSequenceClassification
[[autodoc]] TFXLMForSequenceClassification
- call
## TFXLMForMultipleChoice
[[autodoc]] TFXLMForMultipleChoice
- call
## TFXLMForTokenClassification
[[autodoc]] TFXLMForTokenClassification
- call
## TFXLMForQuestionAnsweringSimple
[[autodoc]] TFXLMForQuestionAnsweringSimple
- call
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# XLM-ProphetNet
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten
## Overview
The XLM-ProphetNet model was proposed in [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.
XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of
just the next token.
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
The Authors' code can be found [here](https://github.com/microsoft/ProphetNet).
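As a minimal sketch of how an encoder-decoder model like this is typically driven, the snippet below runs beam-search generation with [`XLMProphetNetForConditionalGeneration`]; the `microsoft/xprophetnet-large-wiki100-cased` checkpoint name is an illustrative assumption, and a task-specific fine-tuned checkpoint would normally be used for summarization or question generation.
```python
from transformers import XLMProphetNetForConditionalGeneration, XLMProphetNetTokenizer

# illustrative checkpoint; fine-tuned variants exist for downstream generation tasks
tokenizer = XLMProphetNetTokenizer.from_pretrained("microsoft/xprophetnet-large-wiki100-cased")
model = XLMProphetNetForConditionalGeneration.from_pretrained("microsoft/xprophetnet-large-wiki100-cased")

article = "Microsoft Corporation announced a new release of its open-source deep learning toolkit."
inputs = tokenizer(article, return_tensors="pt")

# standard seq2seq decoding: the decoder predicts the target sequence with beam search
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=50, early_stopping=True)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```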
## XLMProphetNetConfig
[[autodoc]] XLMProphetNetConfig
## XLMProphetNetTokenizer
[[autodoc]] XLMProphetNetTokenizer
## XLMProphetNetModel
[[autodoc]] XLMProphetNetModel
## XLMProphetNetEncoder
[[autodoc]] XLMProphetNetEncoder
## XLMProphetNetDecoder
[[autodoc]] XLMProphetNetDecoder
## XLMProphetNetForConditionalGeneration
[[autodoc]] XLMProphetNetForConditionalGeneration
## XLMProphetNetForCausalLM
[[autodoc]] XLMProphetNetForCausalLM
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# XLM-RoBERTa
## Overview
The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's
RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl
data.
The abstract from the paper is the following:
*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a
wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly
outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on
XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We
also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the
trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource
languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing
per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We
will make XLM-R code, data, and models publicly available.*
Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
not require `lang` tensors to understand which language is used, and should be able to determine the correct
language from the input ids.
- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
as well as the information relative to the inputs and outputs.
This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
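As a small sketch of the points above, the multilingual `xlm-roberta-base` checkpoint can fill a masked token in any of its languages without being told which language the input is in:
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# no `lang` tensor needed: the French input is intentional, the model infers the language from the input ids
inputs = tokenizer("Bonjour, je suis un modèle <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# take the most likely token at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```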
## XLMRobertaConfig
[[autodoc]] XLMRobertaConfig
## XLMRobertaTokenizer
[[autodoc]] XLMRobertaTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## XLMRobertaTokenizerFast
[[autodoc]] XLMRobertaTokenizerFast
## XLMRobertaModel
[[autodoc]] XLMRobertaModel
- forward
## XLMRobertaForCausalLM
[[autodoc]] XLMRobertaForCausalLM
- forward
## XLMRobertaForMaskedLM
[[autodoc]] XLMRobertaForMaskedLM
- forward
## XLMRobertaForSequenceClassification
[[autodoc]] XLMRobertaForSequenceClassification
- forward
## XLMRobertaForMultipleChoice
[[autodoc]] XLMRobertaForMultipleChoice
- forward
## XLMRobertaForTokenClassification
[[autodoc]] XLMRobertaForTokenClassification
- forward
## XLMRobertaForQuestionAnswering
[[autodoc]] XLMRobertaForQuestionAnswering
- forward
## TFXLMRobertaModel
[[autodoc]] TFXLMRobertaModel
- call
## TFXLMRobertaForMaskedLM
[[autodoc]] TFXLMRobertaForMaskedLM
- call
## TFXLMRobertaForSequenceClassification
[[autodoc]] TFXLMRobertaForSequenceClassification
- call
## TFXLMRobertaForMultipleChoice
[[autodoc]] TFXLMRobertaForMultipleChoice
- call
## TFXLMRobertaForTokenClassification
[[autodoc]] TFXLMRobertaForTokenClassification
- call
## TFXLMRobertaForQuestionAnswering
[[autodoc]] TFXLMRobertaForQuestionAnswering
- call
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# XLNet
## Overview
The XLNet model was proposed in [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov,
Quoc V. Le. XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn
bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization
order.
The abstract from the paper is the following:
*With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves
better performance than pretraining approaches based on autoregressive language modeling. However, relying on
corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a
pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive
pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all
permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into
pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large
margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
Tips:
- The specific attention pattern can be controlled at training and test time using the `perm_mask` input.
- Due to the difficulty of training a fully auto-regressive model over various factorization orders, XLNet is pretrained
using only a subset of the output tokens as target, which are selected with the `target_mapping` input.
- To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and
`target_mapping` inputs to control the attention span and outputs (see examples in
*examples/pytorch/text-generation/run_generation.py*)
- XLNet is one of the few models that has no sequence length limit.
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/zihangdai/xlnet/).
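The sketch below shows the `perm_mask` / `target_mapping` mechanics from the tips on a single masked position; it is a minimal illustration using the `xlnet-base-cased` checkpoint, not a full decoding loop.
```python
import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

input_ids = tokenizer.encode("Hello, my dog is very cute", add_special_tokens=False, return_tensors="pt")

# perm_mask[b, i, j] = 1.0 means token i cannot attend to token j;
# here no position may look at the last token, so it behaves like a masked target
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]))
perm_mask[:, :, -1] = 1.0

# target_mapping selects the positions to predict: only the last token here
target_mapping = torch.zeros((1, 1, input_ids.shape[1]))
target_mapping[0, 0, -1] = 1.0

with torch.no_grad():
    outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs.logits  # shape (1, 1, vocab_size)
```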
## XLNetConfig
[[autodoc]] XLNetConfig
## XLNetTokenizer
[[autodoc]] XLNetTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## XLNetTokenizerFast
[[autodoc]] XLNetTokenizerFast
## XLNet specific outputs
[[autodoc]] models.xlnet.modeling_xlnet.XLNetModelOutput
[[autodoc]] models.xlnet.modeling_xlnet.XLNetLMHeadModelOutput
[[autodoc]] models.xlnet.modeling_xlnet.XLNetForSequenceClassificationOutput
[[autodoc]] models.xlnet.modeling_xlnet.XLNetForMultipleChoiceOutput
[[autodoc]] models.xlnet.modeling_xlnet.XLNetForTokenClassificationOutput
[[autodoc]] models.xlnet.modeling_xlnet.XLNetForQuestionAnsweringSimpleOutput
[[autodoc]] models.xlnet.modeling_xlnet.XLNetForQuestionAnsweringOutput
[[autodoc]] models.xlnet.modeling_tf_xlnet.TFXLNetModelOutput
[[autodoc]] models.xlnet.modeling_tf_xlnet.TFXLNetLMHeadModelOutput
[[autodoc]] models.xlnet.modeling_tf_xlnet.TFXLNetForSequenceClassificationOutput
[[autodoc]] models.xlnet.modeling_tf_xlnet.TFXLNetForMultipleChoiceOutput
[[autodoc]] models.xlnet.modeling_tf_xlnet.TFXLNetForTokenClassificationOutput
[[autodoc]] models.xlnet.modeling_tf_xlnet.TFXLNetForQuestionAnsweringSimpleOutput
## XLNetModel
[[autodoc]] XLNetModel
- forward
## XLNetLMHeadModel
[[autodoc]] XLNetLMHeadModel
- forward
## XLNetForSequenceClassification
[[autodoc]] XLNetForSequenceClassification
- forward
## XLNetForMultipleChoice
[[autodoc]] XLNetForMultipleChoice
- forward
## XLNetForTokenClassification
[[autodoc]] XLNetForTokenClassification
- forward
## XLNetForQuestionAnsweringSimple
[[autodoc]] XLNetForQuestionAnsweringSimple
- forward
## XLNetForQuestionAnswering
[[autodoc]] XLNetForQuestionAnswering
- forward
## TFXLNetModel
[[autodoc]] TFXLNetModel
- call
## TFXLNetLMHeadModel
[[autodoc]] TFXLNetLMHeadModel
- call
## TFXLNetForSequenceClassification
[[autodoc]] TFXLNetForSequenceClassification
- call
## TFXLNetForMultipleChoice
[[autodoc]] TFXLNetForMultipleChoice
- call
## TFXLNetForTokenClassification
[[autodoc]] TFXLNetForTokenClassification
- call
## TFXLNetForQuestionAnsweringSimple
[[autodoc]] TFXLNetForQuestionAnsweringSimple
- call
..
Copyright 2021 The HuggingFace Team. All rights reserved.
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
XLS-R
-----------------------------------------------------------------------------------------------------------------------
# XLS-R
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Overview
The XLS-R model was proposed in `XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
<https://arxiv.org/abs/2111.09296>`__ by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman
The XLS-R model was proposed in [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman
Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
The abstract from the paper is the following:
Tips:
- XLS-R is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- XLS-R model was trained using connectionist temporal classification (CTC) so the model output has to be decoded using
  [`Wav2Vec2CTCTokenizer`].
Relevant checkpoints can be found under https://huggingface.co/models?other=xls_r.
XLS-R's architecture is based on the Wav2Vec2 model, so one can refer to [Wav2Vec2's documentation page](wav2vec2).
The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/fairseq/models/wav2vec).
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# XLSR-Wav2Vec2

## Overview
The XLSR-Wav2Vec2 model was proposed in [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael
Auli.
The abstract from the paper is the following:
Tips:
- XLSR-Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- XLSR-Wav2Vec2 model was trained using connectionist temporal classification (CTC) so the model output has to be
  decoded using [`Wav2Vec2CTCTokenizer`].
XLSR-Wav2Vec2's architecture is based on the Wav2Vec2 model, so one can refer to [Wav2Vec2's documentation page](wav2vec2).
The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/fairseq/models/wav2vec).
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Model sharing and uploading

On this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
the [model hub](https://huggingface.co/models).
<iframe width="560" height="315" src="https://www.youtube.com/embed/XvSGPZFEjDY" title="YouTube video player"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
<Tip>
You will need to create an account on [huggingface.co](https://huggingface.co/join) for this.
Optionally, you can join an existing organization or create a new one.
</Tip>
We have seen in the [training tutorial](training) how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model's `fit()` method, directly in your own training loop or using the
[`Trainer`] class. Let's see how you can share the result on the
[model hub](https://huggingface.co/models).
## Model versioning
Since version v3.5.0, the model hub has built-in model versioning based on git and git-lfs. It is based on the paradigm
that one model *is* one repo.
For instance:
```python
>>> from transformers import AutoModel

>>> model = AutoModel.from_pretrained(
...     "julien-c/EsperBERTo-small",
...     revision="v2.0.1",  # tag name, or branch name, or commit hash
... )
```
## Push your model from Python
### Preparation
The first step is to make sure your credentials to the hub are stored somewhere. This can be done in two ways. If you
have access to a terminal, you can just run the following command in the virtual environment where you installed 🤗
Transformers:
```bash
huggingface-cli login
```

It will store your access token in the Hugging Face cache folder (by default `~/.cache/`).
If you don't have easy access to a terminal (for instance in a Colab session), you can find a token linked to your
account by going to [huggingface.co](https://huggingface.co/), clicking on your avatar in the top left corner, then on
*Edit profile* on the left, just beneath your profile picture. In the submenu *API Tokens*, you will find your API
token that you can just copy.
### Directly push your model to the hub

<Youtube id="Z1-XMy-GNLQ"/>
Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a
finetuned model you saved in `save_directory` by calling:
```python
finetuned_model.push_to_hub("my-awesome-model")
```
If you have your API token not stored in the cache, you will need to pass it with `use_auth_token=your_token`.
This will also be the case for all the examples below, so we won't mention it again.

This will create a repository in your namespace named `my-awesome-model`, so anyone can now run:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("your_username/my-awesome-model")
```

Even better, you can combine this push to the hub with the call to `save_pretrained`:

```python
finetuned_model.save_pretrained(save_directory, push_to_hub=True, repo_name="my-awesome-model")
```
If you are a premium user and want your model to be private, just add `private=True` to this call.

If you are a member of an organization and want to push it inside the namespace of the organization instead of yours,
just add `organization=my_amazing_org`.
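Continuing the `finetuned_model` example above, these options can be combined in a single call. This is only a sketch: `my-awesome-model` and `my_amazing_org` are placeholder names, and `your_token` stands for the API token mentioned earlier.

```python
# A sketch combining the options described above; the repository and
# organization names are placeholders.
finetuned_model.push_to_hub(
    "my-awesome-model",
    private=True,  # premium users: keep the repository private
    organization="my_amazing_org",  # push under an organization namespace
    use_auth_token=your_token,  # only needed if the token is not stored in the cache
)
```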
### Add new files to your model repo
Once you have pushed your model to the hub, you might want to add the tokenizer, or a version of your model for another
framework (TensorFlow, PyTorch, Flax). This is super easy to do! Let's begin with the tokenizer. You can add it to the
repo you created before like this:
```python
tokenizer.push_to_hub("my-awesome-model")
```

If you know its URL (it should be `https://huggingface.co/username/repo_name`), you can also do:

```python
tokenizer.push_to_hub(repo_url=my_repo_url)
```
And that's all there is to it! It's also a very easy way to fix a mistake if one of the files online had a bug.
To add a model for another backend, it's also super easy. Let's say you have fine-tuned your model in TensorFlow and want to
add the PyTorch model files to your model repo, so that anyone in the community can use it. The following allows you to
directly create a PyTorch version of your TensorFlow model:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(save_directory, from_tf=True)
```
You can also replace `save_directory` by the identifier of your model (`username/repo_name`) if you don't
have a local save of it anymore. Then, just do the same as before:
```python
model.push_to_hub("my-awesome-model")
```

or

```python
model.push_to_hub(repo_url=my_repo_url)
```

## Use your terminal and git

<Youtube id="rkCly_cbMBk"/>
### Basic steps
In order to upload a model, you'll need to first create a git repo. This repo will live on the model hub, allowing
users to clone it and you (and your organization members) to push to it.
You can create a model repo directly from [the /new page on the website](https://huggingface.co/new).
Alternatively, you can use the `transformers-cli`. The next steps describe that process:
Go to a terminal and run the following command. It should be in the virtual environment where you installed 🤗
Transformers, since that command `transformers-cli` comes from the library.

```bash
transformers-cli login
```
Once you are logged in with your model hub credentials, you can start building your repositories. To create a repo:
```bash
transformers-cli repo create your-model-name
```

If you want to create a repo under a specific organization, you should add a `--organization` flag:

```bash
transformers-cli repo create your-model-name --organization your-org-name
```
This creates a repo on the model hub, which can be cloned.
```bash
# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install

git clone https://huggingface.co/username/your-model-name
```
When you have your local clone of your repo and lfs installed, you can then add/remove from that clone as you would
with any other git repo.
```bash
# Commit as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
```
We are intentionally not wrapping git too much, so that you can go on with the workflow you're used to and the tools
you already know.
The only learning curve you might have compared to regular git is the one for git-lfs. The documentation at
[git-lfs.github.com](https://git-lfs.github.com/) is decent, but we'll work on a tutorial with some tips and tricks
in the coming weeks!
Additionally, if you want to change multiple repos at once, the [change_config.py script](https://github.com/huggingface/efficient_scripts/blob/main/change_config.py) can probably save you some time.
### Make your model work on all frameworks
<!--TODO Sylvain: make this automatic during the upload
-->
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch *and* TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's
super easy to do (and in a future version, it might all be automatic). You will need to install both PyTorch and
TensorFlow for this step, but you don't need to worry about the GPU, so it should be very easy. Check the [TensorFlow
installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or the [PyTorch
installation page](https://pytorch.org/get-started/locally/#start-locally) to see how.
First check that your model class exists in the other framework, that is try to import the same model by either adding
or removing TF. For instance, if you trained a [`DistilBertForSequenceClassification`], try to type

```python
>>> from transformers import TFDistilBertForSequenceClassification
```
and if you trained a [`TFDistilBertForSequenceClassification`], try to type

```python
>>> from transformers import DistilBertForSequenceClassification
```
This will give back an error if your model does not exist in the other framework (something that should be pretty rare
since we're aiming for full parity between the two frameworks). In this case, skip this and go to the next step.
Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your
model class:
```python
>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
>>> tf_model.save_pretrained("path/to/awesome-name-you-picked")
```
and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your
model class:
```python
>>> pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
>>> pt_model.save_pretrained("path/to/awesome-name-you-picked")
```
That's all there is to it!
### Check the directory before pushing to the model hub.
Make sure there are no garbage files in the directory you'll upload. It should only have:
- a *config.json* file, which saves the [configuration](main_classes/configuration) of your model;
- a *pytorch_model.bin* file, which is the PyTorch checkpoint (unless you can't have it for some reason);
- a *tf_model.h5* file, which is the TensorFlow checkpoint (unless you can't have it for some reason);
- a *special_tokens_map.json*, which is part of your [tokenizer](main_classes/tokenizer) save;
- a *tokenizer_config.json*, which is part of your [tokenizer](main_classes/tokenizer) save;
- files named *vocab.json*, *vocab.txt*, *merges.txt*, or similar, which contain the vocabulary of your tokenizer, part
  of your [tokenizer](main_classes/tokenizer) save;
- maybe an *added_tokens.json*, which is part of your [tokenizer](main_classes/tokenizer) save.
Other files can safely be deleted.
## Uploading your files
Once the repo is cloned, you can add the model, configuration and tokenizer files. For instance, saving the model and
tokenizer files:
```python
>>> model.save_pretrained("path/to/repo/clone/your-model-name")
>>> tokenizer.save_pretrained("path/to/repo/clone/your-model-name")
```
Or, if you're using the Trainer API:
```python
>>> trainer.save_model("path/to/awesome-name-you-picked")
>>> tokenizer.save_pretrained("path/to/repo/clone/your-model-name")
```
You can then add these files to the staging environment and verify that they have been correctly staged with the `git status` command:
```bash
git add --all
git status
```
Finally, the files should be committed:
```bash
git commit -m "First version of the your-model-name model and tokenizer."
```
And pushed to the remote:
```bash
git push
```
This will upload the folder containing the weights, tokenizer and configuration we have just prepared.
### Add a model card
To make sure everyone knows what your model can do, what its limitations, potential bias or ethical considerations are,
please add a README.md model card to your model repo. You can just create it, or there's also a convenient button
titled "Add a README.md" on your model page. A model card documentation can be found `here
<https://huggingface.co/docs/hub/model-repos>`__ (meta-suggestions are welcome). model card template (meta-suggestions
titled "Add a README.md" on your model page. A model card documentation can be found [here](https://huggingface.co/docs/hub/model-repos) (meta-suggestions are welcome). model card template (meta-suggestions
are welcome).
<Tip>

Model cards used to live in the 🤗 Transformers repo under *model_cards/*, but for consistency and scalability we
migrated every model card from the repo to its corresponding huggingface.co model repo.
</Tip>
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built.
### Using your model
Your model now has a page on huggingface.co/models 🔥
Anyone can load it from code:
```python
>>> tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
>>> model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
```

You may specify a revision by using the `revision` flag in the `from_pretrained` method:
```python
>>> tokenizer = AutoTokenizer.from_pretrained(
...     "julien-c/EsperBERTo-small",
...     revision="v2.0.1",  # tag name, or branch name, or commit hash
... )
```
## Workflow in a Colab notebook
If you're in a Colab notebook (or similar) with no direct access to a terminal, here is the workflow you can use to
upload your model. You can execute each one of them in a cell by adding a ! at the beginning.
First you need to install *git-lfs* in the environment used by the notebook:
```bash
sudo apt-get install git-lfs
```
Then you can either create a repo directly from [huggingface.co](https://huggingface.co/), or use the
`transformers-cli` to create it:
```bash
transformers-cli login
transformers-cli repo create your-model-name
```
Once it's created, you can clone it and configure it (replace username by your username on huggingface.co):
```bash
git lfs install

git clone https://username:password@huggingface.co/username/your-model-name
# Alternatively if you have a token,
# you can use it instead of your password
git clone https://username:token@huggingface.co/username/your-model-name

cd your-model-name
git config --global user.email "email@example.com"
# Tip: using the same email as for your huggingface.co account will link your commits to your profile
git config --global user.name "Your name"
```
Once you've saved your model inside, and your clone is set up with the right remote URL, you can add it and push it with
usual git commands.
```bash
git add .
git commit -m "Initial commit"
git push
```
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Summary of the models

This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original [transformer
model](https://arxiv.org/abs/1706.03762). For a gentle introduction check the [annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html). Here we focus on the high-level differences between the
models. You can check them more in detail in their respective documentation. Also check out [the Model Hub](https://huggingface.co/models) where you can filter the checkpoints by model architecture.
Each one of the models in the library falls into one of the following categories:
- [autoregressive-models](#autoregressive-models)
- [autoencoding-models](#autoencoding-models)
- [seq-to-seq-models](#seq-to-seq-models)
- [multimodal-models](#multimodal-models)
- [retrieval-based-models](#retrieval-based-models)
<iframe width="560" height="315" src="https://www.youtube.com/embed/H39Z_720T5s" title="YouTube video player"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
Multimodal models mix text inputs with other kinds (e.g. images) and are more specific to a given task.
<a id='autoregressive-models'></a>
## Decoders or autoregressive models
As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
that, at each position, the model can only look at the tokens before it in the attention heads.
<Youtube id="d_ixlCubqQw"/>
### Original GPT
<a href="https://huggingface.co/models?filter=openai-gpt">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
</a>
<a href="model_doc/gpt">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=openai-gpt">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
</a>
<a href="model_doc/gpt">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a>
[Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), Alec Radford et al.
The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
The library provides versions of the model for language modeling and multitask language modeling/multiple choice
classification.
### GPT-2
<a href="https://huggingface.co/models?filter=gpt2">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
</a>
<a href="model_doc/gpt2">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=gpt2">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
</a>
<a href="model_doc/gpt2">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a>
[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf),
Alec Radford et al.
A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
more).
The library provides versions of the model for language modeling and multitask language modeling/multiple choice
classification.
### CTRL
<a href="https://huggingface.co/models?filter=ctrl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
</a>
<a href="model_doc/ctrl">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-ctrl-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=ctrl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
</a>
<a href="model_doc/ctrl">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-ctrl-blueviolet">
</a>
[CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858),
Nitish Shirish Keskar et al.
Same as the GPT model but adds the idea of control codes. Text is generated from a prompt (can be empty) and one (or
several) of those control codes, which are then used to influence the text generation: generate with the style of a
wikipedia article, a book or a movie review.
The library provides a version of the model for language modeling only.
### Transformer-XL
<a href="https://huggingface.co/models?filter=transfo-xl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
</a>
<a href="model_doc/transformerxl">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=transfo-xl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
</a>
<a href="model_doc/transformerxl">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a>
[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860), Zihang
Dai et al.
Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
The library provides a version of the model for language modeling only.
<a id='reformer'></a>

### Reformer
<a href="https://huggingface.co/models?filter=reformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
</a>
<a href="model_doc/reformer">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=reformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
</a>
<a href="model_doc/reformer">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a>
[Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451), Nikita Kitaev et al.
An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
include:
- Use [Axial position encoding](#axial-pos-encoding) (see below for more details). It's a mechanism to avoid
  having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller
  matrices.
- Replace traditional attention by [LSH (local-sensitive hashing) attention](#lsh-attention) (see below for more
  details). It's a technique to avoid computing the full product query-key in the attention layers.
- Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during
  the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them
  for results inside a given layer (less efficient than storing them but saves memory).
- Compute the feedforward operations by chunks and not on the whole batch.
With those tricks, the model can be fed much larger sentences than traditional transformer autoregressive models.
<Tip>

This model could very well be used in an autoencoding setting; there is no checkpoint for such a
pretraining yet, though.

</Tip>

The library provides a version of the model for language modeling only.
### XLNet
<a href="https://huggingface.co/models?filter=xlnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
</a>
<a href="model_doc/xlnet">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=xlnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
</a>
<a href="model_doc/xlnet">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a>
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237), Zhilin
Yang et al.
XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
The library provides a version of the model for language modeling, token classification, sentence classification,
multiple choice classification and question answering.
<a id='autoencoding-models'></a>
## Encoders or autoencoding models
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
corrupted versions.
<Youtube id="MUqNwgPjJvQ"/>

### BERT
<a href="https://huggingface.co/models?filter=bert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
</a>
<a href="model_doc/bert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bert-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=bert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
</a>
<a href="model_doc/bert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bert-blueviolet">
</a>
[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805),
Jacob Devlin et al.
Corrupts the inputs by using random masking, more precisely, during pretraining, a given percentage of tokens (usually
15%) is masked by:
- a special mask token with probability 0.8
- a random token different from the one masked with probability 0.1
- the same token with probability 0.1
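To make those percentages concrete, here is a minimal sketch of the corruption scheme described above (a toy illustration, not the preprocessing code actually used for pretraining; the `-100` ignore-label convention is just an assumption borrowed from common PyTorch practice):

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Toy version of the masking scheme described above."""
    labels = [-100] * len(token_ids)  # -100 marks positions ignored by the loss
    corrupted = list(token_ids)
    for i, token in enumerate(token_ids):
        if random.random() < mlm_probability:  # pick ~15% of the positions
            labels[i] = token  # the model must recover the original token here
            draw = random.random()
            if draw < 0.8:  # 80%: replace with the special mask token
                corrupted[i] = mask_token_id
            elif draw < 0.9:  # 10%: replace with a random token
                corrupted[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return corrupted, labels
```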
The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a
separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50%
they are not related. The model has to predict if the sentences are consecutive or not.
The library provides a version of the model for language modeling (traditional or masked), next sentence prediction,
token classification, sentence classification, multiple choice classification and question answering.
### ALBERT
<a href="https://huggingface.co/models?filter=albert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
</a>
<a href="model_doc/albert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-albert-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=albert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
</a>
<a href="model_doc/albert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-albert-blueviolet">
</a>
[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942),
Zhenzhong Lan et al.
Same as BERT but with a few tweaks:
- Embedding size E is different from hidden size H, which is justified because the embeddings are context independent (one
  embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a
  sequence of tokens), so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
  being the vocab size). If E < H, it has fewer parameters.
- Layers are split in groups that share parameters (to save memory).
- Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and
  B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
  been swapped or not.
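As a rough, back-of-the-envelope illustration of the first point (the vocabulary size and dimensions below are made up for the example, and the E x H projection matrix is assumed from the usual factorized-embedding setup):

```python
V, H, E = 30_000, 768, 128  # made-up vocabulary size and dimensions

full_embedding = V * H      # a single V x H embedding matrix
factorized = V * E + E * H  # a V x E lookup plus an E x H projection

print(f"V x H         = {full_embedding:,}")  # 23,040,000 parameters
print(f"V x E + E x H = {factorized:,}")      # 3,938,304 parameters
```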
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
### RoBERTa
<a href="https://huggingface.co/models?filter=roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
</a>
<a href="model_doc/roberta">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
</a>
<a href="model_doc/roberta">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a>
[RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692), Yinhan Liu et al.
Same as BERT with better pretraining tricks:
- dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
- no NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of
  contiguous texts together to reach 512 tokens (so the sentences are in an order that may span several documents)
- train with larger batches
- use BPE with bytes as a subunit and not characters (because of unicode characters)
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
### DistilBERT
<a href="https://huggingface.co/models?filter=distilbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
</a>
<a href="model_doc/distilbert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-distilbert-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=distilbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
</a>
<a href="model_doc/distilbert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-distilbert-blueviolet">
</a>
[DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108),
Victor Sanh et al.
Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict
the same probabilities as the larger model. The actual objective is a combination of:
- finding the same probabilities as the teacher model
- predicting the masked tokens correctly (but no next-sentence objective)
- a cosine similarity between the hidden states of the student and the teacher model
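As a purely illustrative sketch of how the three terms above could be combined (the temperature value, tensor shapes and equal weighting are assumptions, not the exact recipe used to train DistilBERT):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, labels, temperature=2.0):
    """Sketch of a distillation objective combining the three terms described above."""
    # 1) match the teacher's output distribution (soft targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # 2) still predict the masked tokens correctly (hard targets)
    hard = F.cross_entropy(student_logits, labels)
    # 3) align student and teacher hidden states with a cosine loss
    cosine = F.cosine_embedding_loss(
        student_hidden, teacher_hidden, torch.ones(student_hidden.size(0))
    )
    return soft + hard + cosine
```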
The library provides a version of the model for masked language modeling, token classification, sentence classification
and question answering.
### ConvBERT
<a href="https://huggingface.co/models?filter=convbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-convbert-blueviolet">
</a>
<a href="model_doc/convbert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-convbert-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=convbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-convbert-blueviolet">
</a>
<a href="model_doc/convbert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-convbert-blueviolet">
</a>
[ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496), Zihang Jiang,
Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural
The library provides a version of the model for masked language modeling, token classification, sentence classification
and question answering.
### XLM
<a href="https://huggingface.co/models?filter=xlm">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
</a>
<a href="model_doc/xlm">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=xlm">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
</a>
<a href="model_doc/xlm">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm-blueviolet">
</a>
[Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291), Guillaume Lample and Alexis Conneau
A transformer model trained on several languages. There are three different types of training for this model and the
library provides checkpoints for all of them:
* Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the
previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens, that may span over several documents in one of those languages.
* Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages,
with dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
different languages, with random masking. To predict one of the masked tokens, the model can use both, the
surrounding context in language 1 and the context given by language 2.
Checkpoints refer to which method was used for pretraining by having `clm`, `mlm` or `mlm-tlm` in their names. On top
- Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the
previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens, that may span over several documents in one of those languages.
- Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages,
with dynamic masking of the tokens.
- A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
different languages, with random masking. To predict one of the masked tokens, the model can use both the
surrounding context in language 1 and the context given by language 2.
Checkpoints refer to which method was used for pretraining by having *clm*, *mlm* or *mlm-tlm* in their names. On top
of positional embeddings, the model has language embeddings. When training using MLM/CLM, this gives the model an
indication of the language used, and when training using MLM+TLM, an indication of the language used for each part.
The library provides a version of the model for language modeling, token classification, sentence classification and
question answering.
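For illustration, here is a minimal sketch of loading one of these checkpoints (the English MLM checkpoint `xlm-mlm-en-2048` is used here as an example; any other *clm*/*mlm*/*mlm-tlm* checkpoint name follows the same pattern):

```python
from transformers import XLMTokenizer, XLMWithLMHeadModel

# "mlm" in the checkpoint name indicates the masked language modeling objective.
tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-en-2048")
model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-en-2048")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
```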
### XLM-RoBERTa
<a href="https://huggingface.co/models?filter=xlm-roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
</a>
<a href="model_doc/xlmroberta">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm--roberta-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=xlm-roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
</a>
<a href="model_doc/xlmroberta">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm--roberta-blueviolet">
</a>
[Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116), Alexis Conneau et
al.
Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses
masked language modeling on sentences coming from one language. However, the model is trained on many more languages
(100) and does not use the language embeddings, so it is capable of detecting the input language by itself.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
### FlauBERT
<a href="https://huggingface.co/models?filter=flaubert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
</a>
<a href="model_doc/flaubert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-flaubert-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=flaubert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
</a>
<a href="model_doc/flaubert">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-flaubert-blueviolet">
</a>
[FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372), Hang Le et al.
Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
The library provides a version of the model for language modeling and sentence classification.
### ELECTRA
<a href="https://huggingface.co/models?filter=electra">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
</a>
<a href="model_doc/electra">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-electra-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=electra">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
</a>
<a href="model_doc/electra">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-electra-blueviolet">
</a>
[ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555),
Kevin Clark et al.
ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are
corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA
has to predict which token is an original and which one has been replaced. Like for GAN training, the small language
model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a
traditional GAN setting) then the ELECTRA model is trained for a few steps.
The library provides a version of the model for masked language modeling, token classification and sentence
classification.
### Funnel Transformer
<a href="https://huggingface.co/models?filter=funnel">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
</a>
<a href="model_doc/funnel">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-funnel-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=funnel">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
</a>
<a href="model_doc/funnel">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-funnel-blueviolet">
</a>
[Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236), Zihang Dai et al.
Funnel Transformer is a transformer model using pooling, a bit like a ResNet model: layers are grouped in blocks, and
at the beginning of each block (except the first one), the hidden states are pooled along the sequence dimension. This
way, their length is divided by 2, which speeds up the computation of the next hidden states.
The pretrained models available use the same pretraining objective as ELECTRA.
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
<a id='longformer'></a>
### Longformer
<a href="https://huggingface.co/models?filter=longformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
</a>
<a href="model_doc/longformer">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-longformer-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=longformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
</a>
<a href="model_doc/longformer">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-longformer-blueviolet">
</a>
[Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150), Iz Beltagy et al.
A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g.,
what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are
still given global attention, but the attention matrix has far fewer parameters, resulting in a speed-up. See the
[local attention section](#local-attention) for more information.
It is otherwise pretrained the same way as RoBERTa.
<Tip>
This model could very well be used in an autoregressive setting; however, there is no checkpoint for such a
pretraining yet.
</Tip>
The library provides a version of the model for masked language modeling, token classification, sentence
classification, multiple choice classification and question answering.
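As a short, hedged sketch of how the sparse attention is exposed, the snippet below marks the first token for global attention and leaves all others local (it assumes the publicly available `allenai/longformer-base-4096` checkpoint):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document " * 500, return_tensors="pt", truncation=True)

# By default every token only attends to its local window; tokens flagged with 1
# in global_attention_mask additionally attend to (and are attended by) all tokens.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give global attention to the <s> token

outputs = model(**inputs, global_attention_mask=global_attention_mask)
```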
<a id='seq-to-seq-models'></a>
## Sequence-to-sequence models
As mentioned before, these models keep both the encoder and the decoder of the original transformer.
<Youtube id="0_4KEb08xrE"/>
### BART
<a href="https://huggingface.co/models?filter=bart">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
</a>
<a href="model_doc/bart">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bart-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=bart">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
</a>
<a href="model_doc/bart">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bart-blueviolet">
</a>
[BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461), Mike Lewis et al.
Sequence-to-sequence model with both an encoder and a decoder. The encoder is fed a corrupted version of the tokens,
while the decoder is fed the original tokens (but has a mask to hide the future words, like a regular transformer
decoder). A composition of the following transformations is applied to the encoder input during pretraining:
- mask random tokens (like in BERT)
- delete random tokens
- mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
- permute sentences
- rotate the document to make it start at a specific token
The library provides a version of this model for conditional generation and sequence classification.
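For illustration, a minimal conditional-generation sketch using the summarization-finetuned `facebook/bart-large-cnn` checkpoint (the article text is a placeholder):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "The tower is 324 metres tall, about the same height as an 81-storey building."
inputs = tokenizer(article, return_tensors="pt", truncation=True)

# The decoder generates the target text autoregressively from the encoded input.
summary_ids = model.generate(**inputs, max_length=40, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```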
### Pegasus
<a href="https://huggingface.co/models?filter=pegasus">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
</a>
<a href="model_doc/pegasus">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=pegasus">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
</a>
<a href="model_doc/pegasus">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
</a>
[PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf), Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining
objective, called Gap Sentence Generation (GSG).
- MLM: encoder input tokens are randomly replaced by mask tokens and have to be predicted by the encoder (like in
  BERT)
- GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a
  causal mask to hide the future words like a regular auto-regressive transformer decoder.
In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
masked and are generated together as one output sequence from the remaining sentences, similar to an extractive
summary.
The library provides a version of this model for conditional generation, which should be used for summarization.
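A quick, hedged example with the summarization pipeline, assuming the publicly released `google/pegasus-xsum` checkpoint (the input text is a placeholder):

```python
from transformers import pipeline

# google/pegasus-xsum is one of the released Pegasus checkpoints, finetuned on XSum.
summarizer = pipeline("summarization", model="google/pegasus-xsum")

text = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires."
)
print(summarizer(text, max_length=30)[0]["summary_text"])
```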
### MarianMT
<a href="https://huggingface.co/models?filter=marian">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
</a>
<a href="model_doc/marian">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-marian-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=marian">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
</a>
<a href="model_doc/marian">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-marian-blueviolet">
</a>
[Marian: Fast Neural Machine Translation in C++](https://arxiv.org/abs/1804.00344), Marcin Junczys-Dowmunt et al.
A framework for translation models, using the same models as BART.
The library provides a version of this model for conditional generation.
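As a minimal sketch, each `Helsinki-NLP/opus-mt-{src}-{tgt}` checkpoint handles one translation direction (English to French is used here as an example):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer("Machine translation is fun.", return_tensors="pt")
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```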
### T5
<a href="https://huggingface.co/models?filter=t5">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
</a>
<a href="model_doc/t5">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=t5">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
</a>
<a href="model_doc/t5">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a>
[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683), Colin Raffel et al.
Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
prefixes: "summarize: ", "question: ", "translate English to German: " and so forth.
The library provides a version of this model for conditional generation.
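A short sketch of the text-to-text interface, assuming the `t5-small` checkpoint; the task is selected purely by the text prefix:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The prefix encodes the task; the same model handles translation, summarization,
# question answering, etc. depending on the prefix used.
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```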
### MT5
<a href="https://huggingface.co/models?filter=mt5">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mt5-blueviolet">
</a>
<a href="model_doc/mt5">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mt5-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=mt5">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mt5-blueviolet">
</a>
<a href="model_doc/mt5">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mt5-blueviolet">
</a>
[mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934), Linting Xue
et al.
The model architecture is the same as T5. mT5's pretraining objective includes T5's self-supervised training, but not
T5's supervised training. mT5 is trained on 101 languages.
The library provides a version of this model for conditional generation.
### MBart
<a href="https://huggingface.co/models?filter=mbart">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
</a>
<a href="model_doc/mbart">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=mbart">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
</a>
<a href="model_doc/mbart">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a>
[Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu,
Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
The model architecture and pretraining objective are the same as BART, but MBart is trained on 25 languages and is
intended for supervised and unsupervised machine translation. MBart is one of the first methods for pretraining a
complete sequence-to-sequence model by denoising full texts in multiple languages.
The library provides a version of this model for conditional generation.
The [mbart-large-en-ro checkpoint](https://huggingface.co/facebook/mbart-large-en-ro) can be used for English to
Romanian translation.
The [mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) checkpoint can be finetuned for other
translation and summarization tasks, using code in `examples/pytorch/translation/`, but is not very useful without
finetuning.
### ProphetNet
<a href="https://huggingface.co/models?filter=prophetnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-prophetnet-blueviolet">
</a>
<a href="model_doc/prophetnet">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-prophetnet-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=prophetnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-prophetnet-blueviolet">
</a>
<a href="model_doc/prophetnet">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-prophetnet-blueviolet">
</a>
[ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
ProphetNet introduces a novel *sequence-to-sequence* pretraining objective, called *future n-gram prediction*. In
future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
time step instead of just the single next token. The model architecture is based on the original Transformer, but
replaces the "standard" self-attention mechanism in the decoder by a main
self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
summarization.
### XLM-ProphetNet
<a href="https://huggingface.co/models?filter=xprophetnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
</a>
<a href="model_doc/xlmprophetnet">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xprophetnet-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=xprophetnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
</a>
<a href="model_doc/xlmprophetnet">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xprophetnet-blueviolet">
</a>
[ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
XLM-ProphetNet's model architecture and pretraining objective are the same as ProphetNet's, but XLM-ProphetNet was pre-trained
on the cross-lingual dataset [XGLUE](https://arxiv.org/abs/2004.01401).
The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
versions for headline generation and question generation, respectively.
<a id='multimodal-models'></a>
## Multimodal models
There is one multimodal model in the library which has not been pretrained in the self-supervised fashion like the
others.
### MMBT
[Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/abs/1909.02950), Douwe Kiela
et al.
A transformer model used in multimodal settings, combining a text and an image to make predictions. The transformer
model takes as inputs the embeddings of the tokenized text and the final activations of a ResNet pretrained on images
(after the pooling layer), which go through a linear layer (to go from the number of features at the end of the ResNet
to the hidden state dimension of the transformer). The different inputs are concatenated, and on top of the positional
embeddings, a segment embedding is added to let the
model know which part of the input vector corresponds to the text and which to the image.
The pretrained model only works for classification.
<!--More information in this [model documentation](model_doc/mmbt). TODO: write this page
-->
<a id='retrieval-based-models'></a>
## Retrieval-based models
Some models use document retrieval during (pre)training and inference for open-domain question answering, for example.
### DPR
<a href="https://huggingface.co/models?filter=dpr">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
</a>
<a href="model_doc/dpr">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=dpr">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
</a>
<a href="model_doc/dpr">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a>
[Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906), Vladimir Karpukhin et
al.
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering
research.
DPR consists of three models:
- Question encoder: encode questions as vectors
- Context encoder: encode contexts as vectors
- Reader: extract the answer to the questions inside retrieved contexts, along with a relevance score (high if the
  inferred span actually answers the question).
DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and
then it calls the reader with the question and the retrieved documents to get the answer.
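A minimal sketch of the first two models, assuming the released `single-nq-base` DPR checkpoints; retrieval scores are dot products between the question and context embeddings:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "What is the capital of France?"
contexts = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital of Germany.",
]

q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
ctx_emb = ctx_encoder(**ctx_tokenizer(contexts, return_tensors="pt", padding=True)).pooler_output

# Higher dot product = more relevant context for the question.
scores = torch.matmul(q_emb, ctx_emb.T)
print(scores)
```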
### RAG
<a href="https://huggingface.co/models?filter=rag">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
</a>
<a href="model_doc/rag">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a>
<a href="https://huggingface.co/models?filter=rag">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
</a>
<a href="model_doc/rag">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a>
[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401), Patrick Lewis,
Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau
Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate
outputs. The retriever and seq2seq modules are initialized from pretrained models and fine-tuned jointly, allowing
both retrieval and generation to adapt to downstream tasks.
The two models RAG-Token and RAG-Sequence are available for generation.
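As a hedged sketch of the RAG-Token variant (the dummy dataset option is used here only to avoid downloading the full Wikipedia index for a quick test):

```python
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
# use_dummy_dataset avoids downloading the full wiki_dpr index; real usage drops it.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```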
## More technical aspects
### Full vs sparse attention
Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
computational bottleneck when you have long texts. Longformer and Reformer are models that try to be more efficient and
use a sparse version of the attention matrix to speed up training.
<a id='lsh-attention'></a>
**LSH attention**
[Reformer](#reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because it will give a query and a key equal (so
very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
(determined by an n_rounds parameter) and then averaged together.
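A toy numpy sketch of the bucketing idea (it uses one hash round and omits the chunking and current-token masking that the real implementation adds; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_buckets = 16, 8, 4

q = rng.normal(size=(seq_len, d_model))  # in Reformer, queries and keys are shared

# Angular LSH: project onto random directions; the bucket is the argmax over the
# concatenated [xR, -xR]. Vectors pointing in similar directions land in the same bucket.
R = rng.normal(size=(d_model, n_buckets // 2))
proj = q @ R
buckets = np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

# Only let each query attend to keys that fall in the same bucket.
mask = buckets[:, None] == buckets[None, :]
scores = (q @ q.T) / np.sqrt(d_model)
scores = np.where(mask, scores, -1e9)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
```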
<a id='local-attention'></a>
**Local attention**
[Longformer](#longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
representation of the whole sentence.
Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
their local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:
<img scale="50 %" align="center" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local_attention_mask.png"/>
Using those attention matrices with fewer parameters then allows the model to handle inputs with a bigger sequence
length.
### Other tricks
<a id='axial-pos-encoding'></a>
**Axial positional encodings**
[Reformer](#reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
that, axial positional encodings consist of factorizing that big matrix E into two smaller matrices E1 and E2, with
dimensions \\(l_{1} \times d_{1}\\) and \\(l_{2} \times d_{2}\\), such that \\(l_{1} \times l_{2} = l\\) and
\\(d_{1} + d_{2} = d\\) (with the product for the lengths, this ends up being way smaller). The embedding for time
step \\(j\\) in E is obtained by concatenating the embeddings for timestep \\(j \% l1\\) in E1 and \\(j // l1\\)
in E2.
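To make the saving concrete, a small sketch with hypothetical sizes chosen only for illustration:

```python
# Illustrative sizes: l1 * l2 == l and d1 + d2 == d.
l, d = 16384, 1024            # full positional matrix: l * d entries
l1, l2 = 128, 128
d1, d2 = 256, 768

full = l * d                  # 16,777,216 entries
axial = l1 * d1 + l2 * d2     # 32,768 + 98,304 = 131,072 entries
print(full, axial, full / axial)  # roughly 128x smaller

# The encoding for position j is the concatenation of E1[j % l1] and E2[j // l1].
```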
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Philosophy
🤗 Transformers is an opinionated library built for:
- NLP researchers and educators seeking to use/study/extend large-scale transformers models
- hands-on practitioners who want to fine-tune those models and/or serve them in production
- engineers who just want to download a pretrained model and use it to solve a given NLP task.
The library was designed with two strong goals in mind:
- Be as easy and fast to use as possible:
- We strongly limited the number of user-facing abstractions to learn; in fact, there are almost no abstractions,
just three standard classes required to use each model: [configuration](main_classes/configuration),
[models](main_classes/model) and [tokenizer](main_classes/tokenizer).
- All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
`from_pretrained()` instantiation method which will take care of downloading (if needed), caching and
loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
and models' weights) from a pretrained checkpoint provided on [Hugging Face Hub](https://huggingface.co/models) or your own saved checkpoint.
- On top of those three base classes, the library provides two APIs: [`pipeline`] for quickly
using a model (plus its associated tokenizer and configuration) on a given task and
[`Trainer`]/`Keras.fit` to quickly train or fine-tune a given model.
- As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to
extend/build-upon the library, just use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the base
classes of the library to reuse functionalities like model loading/saving.
- Provide state-of-the-art models with performances as close as possible to the original models:
- We provide at least one example for each architecture which reproduces a result provided by the official authors
of said architecture.
- The code is usually as close to the original code base as possible, which means some PyTorch code may not be as
  *pytorchic* as it could be as a result of being converted from TensorFlow code, and vice versa.
A few other goals:
- Expose the models' internals as consistently as possible:
- We give access, using a single API, to the full hidden-states and attention weights.
- Tokenizer and base model's API are standardized to easily switch between models.
- Incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
- A simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning.
- Simple ways to mask and prune transformer heads.
- Switch easily between PyTorch and TensorFlow 2.0, allowing training using one framework and inference using another.
## Main concepts
The library is built around three types of classes for each model:
- **Model classes** such as [`BertModel`], which are 30+ PyTorch models ([torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)) or Keras models ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)) that work with the pretrained weights provided in the
library.
- **Configuration classes** such as [`BertConfig`], which store all the parameters required to build
a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
without any modification, creating the model will automatically take care of instantiating the configuration (which
is part of the model).
- **Tokenizer classes** such as [`BertTokenizer`], which store the vocabulary for each model and
provide methods for encoding/decoding strings into a list of token embedding indices to be fed to a model.
All these classes can be instantiated from pretrained instances and saved locally using two methods:
- `from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
provided by the library itself (the supported models can be found on the [Model Hub](https://huggingface.co/models)) or
stored locally (or on a server) by the user,
- `save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
`from_pretrained()`.
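A minimal sketch of this round trip with the BERT classes (the local directory name `./my-bert` is arbitrary):

```python
from transformers import BertConfig, BertModel, BertTokenizer

# Download (and cache) the configuration, vocabulary and weights from the Hub.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Save everything locally...
tokenizer.save_pretrained("./my-bert")
model.save_pretrained("./my-bert")

# ...and reload it later from that directory instead of the Hub.
tokenizer = BertTokenizer.from_pretrained("./my-bert")
model = BertModel.from_pretrained("./my-bert")
config = BertConfig.from_pretrained("./my-bert")
```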