Commit ec3567fe authored by Lysandre Debut, committed by GitHub

Convert model files from rst to mdx (#14865)



* First pass

* Apply suggestions from code review

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent d0422de5
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ByT5

## Overview

The ByT5 model was presented in [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
Kale, Adam Roberts, Colin Raffel.
The abstract from the paper is the following:

*[...] tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
experiments.*
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://github.com/google-research/byt5).

ByT5's architecture is based on the T5v1.1 model, so one can refer to [T5v1.1's documentation page](t5v1.1). They
only differ in how inputs should be prepared for the model, see the code examples below.
Since ByT5 was pre-trained without supervision, there's no real advantage to using a task prefix during single-task
fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
### Example

ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:

```python
from transformers import T5ForConditionalGeneration
import torch

model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')

input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add 3 for special tokens
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3  # add 3 for special tokens

loss = model(input_ids, labels=labels).loss  # forward pass
```
For batched inference and training it is however recommended to make use of the tokenizer:

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')

model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids

loss = model(**model_inputs, labels=labels).loss  # forward pass
```
## ByT5Tokenizer

[[autodoc]] ByT5Tokenizer

See [`ByT5Tokenizer`] for all details.
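The examples above only encode text into byte ids. As a complementary sketch (not part of the original page; the prompt and generation settings are illustrative), the tokenizer can also decode generated byte ids back into a string:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')  # resolves to a ByT5Tokenizer
model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')

inputs = tokenizer(["Life is like a box of chocolates."], padding="longest", return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=50)

# decode the generated byte ids back to text, dropping the special tokens
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```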
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# CamemBERT
## Overview
The CamemBERT model was proposed in [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
trained on 138GB of French text.
The abstract from the paper is the following:
*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
models have either been trained on English data or on the concatenation of data in multiple languages. This makes
practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
downstream applications for French NLP.*
Tips:
- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
as well as the information relative to the inputs and outputs.
This model was contributed by [camembert](https://huggingface.co/camembert). The original code can be found [here](https://camembert-model.fr/).
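Since the implementation mirrors RoBERTa, a minimal fill-mask sketch may help orient new users (added here for illustration; the `camembert-base` checkpoint name and the example sentence are assumptions, not from the original page):

```python
import torch
from transformers import CamembertTokenizer, CamembertForMaskedLM

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")

inputs = tokenizer("Le camembert est <mask> !", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# take the most likely token at the masked position
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```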
## CamembertConfig
[[autodoc]] CamembertConfig
## CamembertTokenizer
[[autodoc]] CamembertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## CamembertTokenizerFast
[[autodoc]] CamembertTokenizerFast
## CamembertModel
[[autodoc]] CamembertModel
## CamembertForCausalLM
[[autodoc]] CamembertForCausalLM
## CamembertForMaskedLM
[[autodoc]] CamembertForMaskedLM
## CamembertForSequenceClassification
[[autodoc]] CamembertForSequenceClassification
## CamembertForMultipleChoice
[[autodoc]] CamembertForMultipleChoice
## CamembertForTokenClassification
[[autodoc]] CamembertForTokenClassification
## CamembertForQuestionAnswering
[[autodoc]] CamembertForQuestionAnswering
## TFCamembertModel
[[autodoc]] TFCamembertModel
## TFCamembertForMaskedLM
[[autodoc]] TFCamembertForMaskedLM
## TFCamembertForSequenceClassification
[[autodoc]] TFCamembertForSequenceClassification
## TFCamembertForMultipleChoice
[[autodoc]] TFCamembertForMultipleChoice
## TFCamembertForTokenClassification
[[autodoc]] TFCamembertForTokenClassification
## TFCamembertForQuestionAnswering
[[autodoc]] TFCamembertForQuestionAnswering
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# CANINE

## Overview

The CANINE model was proposed in [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's
among the first papers that trains a Transformer without using an explicit tokenization step (such as Byte Pair
Encoding (BPE), WordPiece or SentencePiece). Instead, the model is trained directly at a Unicode character level.
Training at a character level inevitably comes with a longer sequence length, which CANINE solves with an efficient
downsampling strategy, before applying a deep Transformer encoder.

[...]

Tips:

- [...] the character embeddings, using local attention. Next, after downsampling, a "deep" encoder is applied. Finally,
after upsampling, a "shallow" encoder is used to create the final character embeddings. Details regarding up- and
downsampling can be found in the paper.
- CANINE uses a max sequence length of 2048 characters by default. One can use [`CanineTokenizer`]
to prepare text for the model.
- Classification can be done by placing a linear layer on top of the final hidden state of the special [CLS] token
(which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
details for this can be found in the paper.
- Models:

    - [google/canine-c](https://huggingface.co/google/canine-c): Pre-trained with autoregressive character loss,
    12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
    - [google/canine-s](https://huggingface.co/google/canine-s): Pre-trained with subword loss, 12-layer,
    768-hidden, 12-heads, 121M parameters (size ~500 MB).

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine).
### Example

CANINE works on raw characters, so it can be used without a tokenizer:

```python
>>> from transformers import CanineModel
>>> import torch

>>> model = CanineModel.from_pretrained('google/canine-c')  # model pre-trained with autoregressive character loss

>>> text = "hello world"
>>> # use Python's built-in ord() function to turn each character into its unicode code point id
>>> input_ids = torch.tensor([[ord(char) for char in text]])

>>> outputs = model(input_ids)  # forward pass
>>> pooled_output = outputs.pooler_output
>>> sequence_output = outputs.last_hidden_state
```
For batched inference and training, it is however recommended to make use of the tokenizer (to pad/truncate all
sequences to the same length):

```python
>>> from transformers import CanineTokenizer, CanineModel

>>> model = CanineModel.from_pretrained('google/canine-c')
>>> tokenizer = CanineTokenizer.from_pretrained('google/canine-c')

>>> inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
>>> encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")

>>> outputs = model(**encoding)  # forward pass
>>> pooled_output = outputs.pooler_output
>>> sequence_output = outputs.last_hidden_state
```
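To make the classification tip above concrete, here is a small additional sketch (not part of the original page) that puts a randomly initialized classification head on top of the pooled [CLS] representation; it only illustrates the API, and the head would still need fine-tuning:

```python
>>> import torch
>>> from transformers import CanineTokenizer, CanineForSequenceClassification

>>> tokenizer = CanineTokenizer.from_pretrained('google/canine-s')
>>> model = CanineForSequenceClassification.from_pretrained('google/canine-s', num_labels=2)

>>> encoding = tokenizer(["I love this movie!"], padding="longest", return_tensors="pt")
>>> outputs = model(**encoding, labels=torch.tensor([1]))
>>> loss, logits = outputs.loss, outputs.logits
```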
## CANINE specific outputs

[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling

## CanineConfig

[[autodoc]] CanineConfig

## CanineTokenizer

[[autodoc]] CanineTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences

## CanineModel

[[autodoc]] CanineModel
- forward

## CanineForSequenceClassification

[[autodoc]] CanineForSequenceClassification
- forward

## CanineForMultipleChoice

[[autodoc]] CanineForMultipleChoice
- forward

## CanineForTokenClassification

[[autodoc]] CanineForTokenClassification
- forward

## CanineForQuestionAnswering

[[autodoc]] CanineForQuestionAnswering
- forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# CLIP

## Overview

The CLIP model was proposed in [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP
(Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing
for the task, similarly to the zero-shot capabilities of GPT-2 and 3.

The abstract from the paper is the following:

*[...] for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet
zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and
pre-trained model weights at this https URL.*
## Usage

CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
classification. CLIP uses a ViT-like transformer to get visual features and a causal language model to get the text
features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
product between the projected image and text features is then used as a similarity score.

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.

The [`CLIPFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model. The
[`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps [`CLIPFeatureExtractor`] and
[`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows
how to get the image-text similarity scores using [`CLIPProcessor`] and [`CLIPModel`].
```python
>>> from PIL import Image
>>> import requests

>>> from transformers import CLIPProcessor, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```

This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP).
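Beyond the pooled similarity scores, [`CLIPModel`] also exposes `get_text_features` and `get_image_features` for embedding text and images separately. The short sketch below is an illustrative addition (not from the original page) that reuses the `model`, `processor` and `image` objects from the example above:

```python
>>> import torch

>>> text_inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)
>>> image_inputs = processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     text_features = model.get_text_features(**text_inputs)
...     image_features = model.get_image_features(**image_inputs)

>>> # cosine similarity between the projected text and image embeddings
>>> similarity = torch.nn.functional.cosine_similarity(text_features, image_features)
```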
## CLIPConfig

[[autodoc]] CLIPConfig
- from_text_vision_configs

## CLIPTextConfig

[[autodoc]] CLIPTextConfig

## CLIPVisionConfig

[[autodoc]] CLIPVisionConfig

## CLIPTokenizer

[[autodoc]] CLIPTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## CLIPTokenizerFast

[[autodoc]] CLIPTokenizerFast

## CLIPFeatureExtractor

[[autodoc]] CLIPFeatureExtractor

## CLIPProcessor

[[autodoc]] CLIPProcessor

## CLIPModel

[[autodoc]] CLIPModel
- forward
- get_text_features
- get_image_features

## CLIPTextModel

[[autodoc]] CLIPTextModel
- forward

## CLIPVisionModel

[[autodoc]] CLIPVisionModel
- forward

## FlaxCLIPModel

[[autodoc]] FlaxCLIPModel
- __call__
- get_text_features
- get_image_features

## FlaxCLIPTextModel

[[autodoc]] FlaxCLIPTextModel
- __call__

## FlaxCLIPVisionModel

[[autodoc]] FlaxCLIPVisionModel
- __call__
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConvBERT
## Overview
The ConvBERT model was proposed in [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
Yan.
The abstract from the paper is the following:
*Pre-trained language models like BERT and its variants have recently achieved impressive performance in various
natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers
large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
using less than 1/4 training cost. Code and pre-trained models will be released.*
ConvBERT training tips are similar to those of BERT.
This model was contributed by [abhishek](https://huggingface.co/abhishek). The original implementation can be found
here: https://github.com/yitu-opensource/ConvBert
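Since usage follows BERT, here is a minimal feature-extraction sketch (added for illustration; the `YituTech/conv-bert-base` checkpoint name is an assumption, check the Hub for the available weights):

```python
import torch
from transformers import ConvBertTokenizer, ConvBertModel

tokenizer = ConvBertTokenizer.from_pretrained("YituTech/conv-bert-base")
model = ConvBertModel.from_pretrained("YituTech/conv-bert-base")

inputs = tokenizer("ConvBERT replaces some attention heads with span-based dynamic convolutions.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)
```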
## ConvBertConfig
[[autodoc]] ConvBertConfig
## ConvBertTokenizer
[[autodoc]] ConvBertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## ConvBertTokenizerFast
[[autodoc]] ConvBertTokenizerFast
## ConvBertModel
[[autodoc]] ConvBertModel
- forward
## ConvBertForMaskedLM
[[autodoc]] ConvBertForMaskedLM
- forward
## ConvBertForSequenceClassification
[[autodoc]] ConvBertForSequenceClassification
- forward
## ConvBertForMultipleChoice
[[autodoc]] ConvBertForMultipleChoice
- forward
## ConvBertForTokenClassification
[[autodoc]] ConvBertForTokenClassification
- forward
## ConvBertForQuestionAnswering
[[autodoc]] ConvBertForQuestionAnswering
- forward
## TFConvBertModel
[[autodoc]] TFConvBertModel
- call
## TFConvBertForMaskedLM
[[autodoc]] TFConvBertForMaskedLM
- call
## TFConvBertForSequenceClassification
[[autodoc]] TFConvBertForSequenceClassification
- call
## TFConvBertForMultipleChoice
[[autodoc]] TFConvBertForMultipleChoice
- call
## TFConvBertForTokenClassification
[[autodoc]] TFConvBertForTokenClassification
- call
## TFConvBertForQuestionAnswering
[[autodoc]] TFConvBertForQuestionAnswering
- call
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# CPM

## Overview

The CPM model was proposed in [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
The abstract from the paper is the following:

*[...] language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation,
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
NLP tasks in the settings of few-shot (even zero-shot) learning.*
This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
here: https://github.com/TsinghuaAI/CPM-Generate

Note: We only have a tokenizer here, since the model architecture is the same as GPT-2.
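In practice this means [`CpmTokenizer`] is paired with the GPT-2 model classes. A minimal generation sketch, assuming the `TsinghuaAI/CPM-Generate` checkpoint hosted on the Hub (the checkpoint name and prompt are illustrative, and the weights are several GB):

```python
from transformers import CpmTokenizer, GPT2LMHeadModel

# CpmTokenizer typically needs the `jieba` and `sentencepiece` packages installed
tokenizer = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")
model = GPT2LMHeadModel.from_pretrained("TsinghuaAI/CPM-Generate")

inputs = tokenizer("清华大学", return_tensors="pt")
generated = model.generate(**inputs, max_length=30)
print(tokenizer.decode(generated[0]))
```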
## CpmTokenizer

[[autodoc]] CpmTokenizer
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# CTRL

## Overview

The CTRL model was proposed in [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
The abstract from the paper is the following:

*[...] via model-based source attribution.*

Tips:

- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
or links to generate coherent text. Refer to the [original implementation](https://github.com/salesforce/ctrl) for
more information, and see the generation sketch below.
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as it can be
observed in the *run_generation.py* example script.
- The PyTorch models can take the *past* as input, which is the previously computed key/value attention pairs. Using
this *past* value prevents the model from re-computing pre-computed values in the context of text generation. See
[reusing the past in generative models](../quickstart#using-the-past) for more information on the usage of
this argument.

This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found
[here](https://github.com/salesforce/ctrl).
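The following is a minimal generation sketch (added for illustration; the `ctrl` checkpoint name, the control code and the prompt are assumptions, and the weights are large):

```python
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl")

# prompts should start with a control code such as "Links", "Books" or "Wikipedia"
prompt = "Links Hello, my name is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# use_cache=True reuses the previously computed key/value pairs (the `past`) between decoding steps
generated = model.generate(input_ids, max_length=30, use_cache=True)
print(tokenizer.decode(generated[0]))
```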
## CTRLConfig

[[autodoc]] CTRLConfig

## CTRLTokenizer

[[autodoc]] CTRLTokenizer
- save_vocabulary

## CTRLModel

[[autodoc]] CTRLModel
- forward

## CTRLLMHeadModel

[[autodoc]] CTRLLMHeadModel
- forward

## CTRLForSequenceClassification

[[autodoc]] CTRLForSequenceClassification
- forward

## TFCTRLModel

[[autodoc]] TFCTRLModel
- call

## TFCTRLLMHeadModel

[[autodoc]] TFCTRLLMHeadModel
- call

## TFCTRLForSequenceClassification

[[autodoc]] TFCTRLForSequenceClassification
- call
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DeBERTa
## Overview
The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
RoBERTa.
The abstract from the paper is the following:
*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
position, respectively, and the attention weights among words are computed using disentangled matrices on their
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). The TF 2.0 implementation of this model was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa).
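As a quick orientation before the API reference, here is a minimal sketch of extracting hidden states with the PyTorch model (the `microsoft/deberta-base` checkpoint is assumed):

```python
import torch
from transformers import DebertaTokenizer, DebertaModel

tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
model = DebertaModel.from_pretrained("microsoft/deberta-base")

inputs = tokenizer("DeBERTa improves BERT with disentangled attention.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```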
## DebertaConfig
[[autodoc]] DebertaConfig
## DebertaTokenizer
[[autodoc]] DebertaTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## DebertaTokenizerFast
[[autodoc]] DebertaTokenizerFast
- build_inputs_with_special_tokens
- create_token_type_ids_from_sequences
## DebertaModel
[[autodoc]] DebertaModel
- forward
## DebertaPreTrainedModel
[[autodoc]] DebertaPreTrainedModel
## DebertaForMaskedLM
[[autodoc]] DebertaForMaskedLM
- forward
## DebertaForSequenceClassification
[[autodoc]] DebertaForSequenceClassification
- forward
## DebertaForTokenClassification
[[autodoc]] DebertaForTokenClassification
- forward
## DebertaForQuestionAnswering
[[autodoc]] DebertaForQuestionAnswering
- forward
## TFDebertaModel
[[autodoc]] TFDebertaModel
- call
## TFDebertaPreTrainedModel
[[autodoc]] TFDebertaPreTrainedModel
- call
## TFDebertaForMaskedLM
[[autodoc]] TFDebertaForMaskedLM
- call
## TFDebertaForSequenceClassification
[[autodoc]] TFDebertaForSequenceClassification
- call
## TFDebertaForTokenClassification
[[autodoc]] TFDebertaForTokenClassification
- call
## TFDebertaForQuestionAnswering
[[autodoc]] TFDebertaForQuestionAnswering
- call
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DeBERTa-v2
## Overview
The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
RoBERTa.
New in v2:
- **900M model & 1.5B model** Two additional model sizes are available: 900M and 1.5B, which significantly improve the
performance of downstream tasks.
This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). The TF 2.0 implementation of this model was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa).
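A minimal sketch of loading one of the v2 checkpoints (`microsoft/deberta-v2-xlarge` is assumed here; the classification head below is freshly initialized and would still need fine-tuning):

```python
from transformers import DebertaV2Tokenizer, DebertaV2ForSequenceClassification

# The v2 tokenizer is SentencePiece-based, unlike the BPE tokenizer used by v1.
tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v2-xlarge")
model = DebertaV2ForSequenceClassification.from_pretrained("microsoft/deberta-v2-xlarge", num_labels=2)

inputs = tokenizer("The 1.5B checkpoint is the largest of the v2 releases.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch_size, num_labels)
```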
## DebertaV2Config
[[autodoc]] DebertaV2Config
## DebertaV2Tokenizer
[[autodoc]] DebertaV2Tokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## DebertaV2Model
[[autodoc]] DebertaV2Model
- forward
## DebertaV2PreTrainedModel
[[autodoc]] DebertaV2PreTrainedModel
- forward
## DebertaV2ForMaskedLM
[[autodoc]] DebertaV2ForMaskedLM
- forward
## DebertaV2ForSequenceClassification
[[autodoc]] DebertaV2ForSequenceClassification
- forward
## DebertaV2ForTokenClassification
[[autodoc]] DebertaV2ForTokenClassification
- forward
## DebertaV2ForQuestionAnswering
[[autodoc]] DebertaV2ForQuestionAnswering
- forward
## TFDebertaV2Model
[[autodoc]] TFDebertaV2Model
- call
## TFDebertaV2PreTrainedModel
[[autodoc]] TFDebertaV2PreTrainedModel
- call
## TFDebertaV2ForMaskedLM
[[autodoc]] TFDebertaV2ForMaskedLM
- call
## TFDebertaV2ForSequenceClassification
[[autodoc]] TFDebertaV2ForSequenceClassification
- call
## TFDebertaV2ForTokenClassification
[[autodoc]] TFDebertaV2ForTokenClassification
- call
## TFDebertaV2ForQuestionAnswering
[[autodoc]] TFDebertaV2ForQuestionAnswering
- call
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DeiT
<Tip>
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
breaking changes to fix in the future. If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
</Tip>
## Overview
The DeiT model was proposed in [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
Sablayrolles, Hervé Jégou. The [Vision Transformer (ViT)](vit) introduced in [Dosovitskiy et al., 2020](https://arxiv.org/abs/2010.11929) has shown that one can match or even outperform existing convolutional neural
networks using a Transformer encoder (BERT-like). However, the ViT models introduced in that paper required training on
expensive infrastructure for multiple weeks, using external data. DeiT (data-efficient image transformers) are more
efficiently trained transformers for image classification, requiring far less data and far less computing resources.
Tips:
distillation head and the label predicted by the teacher). At inference time, one takes the average prediction
between both heads as the final prediction. (2) is also called "fine-tuning with distillation", because one relies on a
teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds to
[`DeiTForImageClassification`] and (2) corresponds to [`DeiTForImageClassificationWithTeacher`].
- Note that the authors also tried soft distillation for (2) (in which case the distillation prediction head is
trained using KL divergence to match the softmax output of the teacher), but hard distillation gave the best results.
- All released checkpoints were pre-trained and fine-tuned on ImageNet-1k only. No external data was used. This is in
contrast with the original ViT model, which used external data like the JFT-300M dataset/Imagenet-21k for
pre-training.
- The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into
[`ViTModel`] or [`ViTForImageClassification`]. Techniques like data augmentation, optimization, and regularization
were used in order to simulate training on a much larger dataset (while only using ImageNet-1k for pre-training).
There are 4 variants available (in 3 different sizes): *facebook/deit-tiny-patch16-224*,
*facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and *facebook/deit-base-patch16-384*. Note that
one should use [`DeiTFeatureExtractor`] in order to prepare images for the model; a short usage sketch follows the
overview.
This model was contributed by [nielsr](https://huggingface.co/nielsr).
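The sketch below illustrates the inference flow mentioned in the tips, assuming the distilled `facebook/deit-base-distilled-patch16-224` checkpoint: [`DeiTFeatureExtractor`] resizes and normalizes the image, and the with-teacher head returns the averaged prediction of the class and distillation heads:

```python
import requests
from PIL import Image
from transformers import DeiTFeatureExtractor, DeiTForImageClassificationWithTeacher

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = DeiTFeatureExtractor.from_pretrained("facebook/deit-base-distilled-patch16-224")
model = DeiTForImageClassificationWithTeacher.from_pretrained("facebook/deit-base-distilled-patch16-224")

inputs = feature_extractor(images=image, return_tensors="pt")
logits = model(**inputs).logits  # average of the class head and distillation head predictions
print(model.config.id2label[logits.argmax(-1).item()])
```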
## DeiTConfig
[[autodoc]] DeiTConfig
## DeiTFeatureExtractor
[[autodoc]] DeiTFeatureExtractor
- __call__
## DeiTModel
[[autodoc]] DeiTModel
- forward
## DeiTForImageClassification
[[autodoc]] DeiTForImageClassification
- forward
## DeiTForImageClassificationWithTeacher
[[autodoc]] DeiTForImageClassificationWithTeacher
- forward
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DialoGPT
## Overview
DialoGPT was proposed in [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao,
Jianfeng Gao, Jingjing Liu, Bill Dolan. It's a GPT2 Model trained on 147M conversation-like exchanges extracted from
Reddit.
Tips:
- DialoGPT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather
than the left.
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful
at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on [DialoGPT's model card](https://huggingface.co/microsoft/DialoGPT-medium); a sketch in the same spirit is given at the end of this section.
Training:
*We first concatenate all dialog turns within a dialogue session into a long text x_1, ..., x_N (N is the
sequence length), ended by the end-of-text token.* For more information please refer to the original paper.
DialoGPT's architecture is based on the GPT2 model, so one can refer to [GPT2's documentation page](gpt2).
The original code can be found [here](https://github.com/microsoft/DialoGPT).
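For reference, here is a short interactive sketch in the same spirit as the model card linked above (the `microsoft/DialoGPT-medium` checkpoint is assumed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

chat_history_ids = None
for step in range(3):
    # End each user turn with the end-of-text token, as done during training.
    new_input_ids = tokenizer.encode(input(">> User: ") + tokenizer.eos_token, return_tensors="pt")
    bot_input_ids = new_input_ids if chat_history_ids is None else torch.cat([chat_history_ids, new_input_ids], dim=-1)
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    print("DialoGPT:", tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True))
```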
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DistilBERT
## Overview
The DistilBERT model was proposed in the blog post [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5), and the paper [DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). DistilBERT is a
small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than
*bert-base-uncased* and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language
understanding benchmark.
The abstract from the paper is the following:
*As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
operating these large models in on-the-edge and/or under constrained computational training or inference budgets
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
study.*
Tips:
- DistilBERT doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`), as in the example below.
- DistilBERT doesn't have options to select the input positions (`position_ids` input). This could be added if
necessary; just let us know if you need this option.
This model was contributed by [victorsanh](https://huggingface.co/victorsanh). The JAX/Flax version of this model was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation).
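To illustrate the first tip, here is a minimal sketch (the `distilbert-base-uncased` checkpoint is assumed): the encoded inputs contain no `token_type_ids`, and a sentence pair is simply joined with the separation token.

```python
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

text = "Hello, how are you?" + tokenizer.sep_token + "I am fine, thanks."
inputs = tokenizer(text, return_tensors="pt")
print(inputs.keys())  # only input_ids and attention_mask, no token_type_ids

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```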
## DistilBertConfig
[[autodoc]] DistilBertConfig
## DistilBertTokenizer
[[autodoc]] DistilBertTokenizer
## DistilBertTokenizerFast
[[autodoc]] DistilBertTokenizerFast
## DistilBertModel
[[autodoc]] DistilBertModel
- forward
## DistilBertForMaskedLM
[[autodoc]] DistilBertForMaskedLM
- forward
## DistilBertForSequenceClassification
[[autodoc]] DistilBertForSequenceClassification
- forward
## DistilBertForMultipleChoice
[[autodoc]] DistilBertForMultipleChoice
- forward
## DistilBertForTokenClassification
[[autodoc]] DistilBertForTokenClassification
- forward
## DistilBertForQuestionAnswering
[[autodoc]] DistilBertForQuestionAnswering
- forward
## TFDistilBertModel
[[autodoc]] TFDistilBertModel
- call
## TFDistilBertForMaskedLM
[[autodoc]] TFDistilBertForMaskedLM
- call
## TFDistilBertForSequenceClassification
[[autodoc]] TFDistilBertForSequenceClassification
- call
## TFDistilBertForMultipleChoice
[[autodoc]] TFDistilBertForMultipleChoice
- call
## TFDistilBertForTokenClassification
[[autodoc]] TFDistilBertForTokenClassification
- call
## TFDistilBertForQuestionAnswering
[[autodoc]] TFDistilBertForQuestionAnswering
- call
## FlaxDistilBertModel
[[autodoc]] FlaxDistilBertModel
- __call__
## FlaxDistilBertForMaskedLM
[[autodoc]] FlaxDistilBertForMaskedLM
- __call__
## FlaxDistilBertForSequenceClassification
[[autodoc]] FlaxDistilBertForSequenceClassification
- __call__
## FlaxDistilBertForMultipleChoice
[[autodoc]] FlaxDistilBertForMultipleChoice
- __call__
## FlaxDistilBertForTokenClassification
[[autodoc]] FlaxDistilBertForTokenClassification
- __call__
## FlaxDistilBertForQuestionAnswering
[[autodoc]] FlaxDistilBertForQuestionAnswering
- __call__
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DPR
## Overview
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
introduced in [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
The abstract from the paper is the following:
*Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional
sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can
be practically implemented using dense representations alone, where embeddings are learned from a small number of
questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets,
our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage
retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA
benchmarks.*
This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The original code can be found [here](https://github.com/facebookresearch/DPR).
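As a hedged sketch of the dual-encoder setup described above (the `facebook/dpr-question_encoder-single-nq-base` checkpoint is assumed), the question encoder maps a query to a single dense vector that can then be compared against pre-computed passage embeddings:

```python
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

inputs = tokenizer("Is dense retrieval better than BM25?", return_tensors="pt")
question_embedding = model(**inputs).pooler_output  # shape: (batch_size, hidden_size)
```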
## DPRConfig
[[autodoc]] DPRConfig
## DPRContextEncoderTokenizer
[[autodoc]] DPRContextEncoderTokenizer
## DPRContextEncoderTokenizerFast
[[autodoc]] DPRContextEncoderTokenizerFast
## DPRQuestionEncoderTokenizer
[[autodoc]] DPRQuestionEncoderTokenizer
## DPRQuestionEncoderTokenizerFast
[[autodoc]] DPRQuestionEncoderTokenizerFast
## DPRReaderTokenizer
[[autodoc]] DPRReaderTokenizer
## DPRReaderTokenizerFast
[[autodoc]] DPRReaderTokenizerFast
## DPR specific outputs
[[autodoc]] models.dpr.modeling_dpr.DPRContextEncoderOutput
[[autodoc]] models.dpr.modeling_dpr.DPRQuestionEncoderOutput
[[autodoc]] models.dpr.modeling_dpr.DPRReaderOutput
## DPRContextEncoder
[[autodoc]] DPRContextEncoder
- forward
## DPRQuestionEncoder
[[autodoc]] DPRQuestionEncoder
- forward
## DPRReader
[[autodoc]] DPRReader
- forward
## TFDPRContextEncoder
[[autodoc]] TFDPRContextEncoder
- call
## TFDPRQuestionEncoder
[[autodoc]] TFDPRQuestionEncoder
- call
## TFDPRReader
[[autodoc]] TFDPRReader
- call