Commit ec3567fe authored by Lysandre Debut, committed by GitHub

Convert model files from rst to mdx (#14865)



* First pass

* Apply suggestions from code review

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent d0422de5
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ALBERT
## Overview
The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
speed of BERT:
- Splitting the embedding matrix into two smaller matrices.
- Using repeating layers split among groups.
The abstract from the paper is the following:
*Increasing model size when pretraining natural language representations often results in improved performance on
downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips:
- ALBERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather
than the left.
- ALBERT uses repeating layers, which results in a small memory footprint; however, the computational cost remains
similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same
number of (repeating) layers.
This model was contributed by [lysandre](https://huggingface.co/lysandre). The JAX version of this model was contributed by
[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).
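A minimal usage sketch, assuming the publicly available `albert-base-v2` checkpoint, that loads the tokenizer and model, runs a forward pass, and shows the factorized embedding sizes mentioned above:
```python
from transformers import AlbertConfig, AlbertModel, AlbertTokenizer

# The factorized embeddings are visible in the config: the embedding size is
# much smaller than the hidden size of the repeated transformer layers.
config = AlbertConfig.from_pretrained("albert-base-v2")
print(config.embedding_size, config.hidden_size)  # 128 vs. 768 for the base checkpoint

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

# Pad on the right (the default), as recommended for absolute position embeddings.
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt", padding=True)
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```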
## AlbertConfig
[[autodoc]] AlbertConfig
## AlbertTokenizer
[[autodoc]] AlbertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## AlbertTokenizerFast
[[autodoc]] AlbertTokenizerFast
## Albert specific outputs
[[autodoc]] models.albert.modeling_albert.AlbertForPreTrainingOutput
[[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
## AlbertModel
[[autodoc]] AlbertModel
- forward
## AlbertForPreTraining
[[autodoc]] AlbertForPreTraining
- forward
## AlbertForMaskedLM
[[autodoc]] AlbertForMaskedLM
- forward
## AlbertForSequenceClassification
[[autodoc]] AlbertForSequenceClassification
- forward
## AlbertForMultipleChoice
[[autodoc]] AlbertForMultipleChoice
## AlbertForTokenClassification
[[autodoc]] AlbertForTokenClassification
- forward
## AlbertForQuestionAnswering
[[autodoc]] AlbertForQuestionAnswering
- forward
## TFAlbertModel
[[autodoc]] TFAlbertModel
- call
## TFAlbertForPreTraining
[[autodoc]] TFAlbertForPreTraining
- call
## TFAlbertForMaskedLM
[[autodoc]] TFAlbertForMaskedLM
- call
## TFAlbertForSequenceClassification
[[autodoc]] TFAlbertForSequenceClassification
- call
## TFAlbertForMultipleChoice
[[autodoc]] TFAlbertForMultipleChoice
- call
## TFAlbertForTokenClassification
[[autodoc]] TFAlbertForTokenClassification
- call
## TFAlbertForQuestionAnswering
[[autodoc]] TFAlbertForQuestionAnswering
- call
## FlaxAlbertModel
[[autodoc]] FlaxAlbertModel
- __call__
## FlaxAlbertForPreTraining
[[autodoc]] FlaxAlbertForPreTraining
- __call__
## FlaxAlbertForMaskedLM
[[autodoc]] FlaxAlbertForMaskedLM
- __call__
## FlaxAlbertForSequenceClassification
[[autodoc]] FlaxAlbertForSequenceClassification
- __call__
## FlaxAlbertForMultipleChoice
[[autodoc]] FlaxAlbertForMultipleChoice
- __call__
## FlaxAlbertForTokenClassification
[[autodoc]] FlaxAlbertForTokenClassification
- __call__
## FlaxAlbertForQuestionAnswering
[[autodoc]] FlaxAlbertForQuestionAnswering
- __call__
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
ALBERT
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
<https://arxiv.org/abs/1909.11942>`__ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
speed of BERT:
- Splitting the embedding matrix into two smaller matrices.
- Using repeating layers split among groups.
The abstract from the paper is the following:
*Increasing model size when pretraining natural language representations often results in improved performance on
downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips:
- ALBERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather
than the left.
- ALBERT uses repeating layers, which results in a small memory footprint; however, the computational cost remains
similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same
number of (repeating) layers.
This model was contributed by `lysandre <https://huggingface.co/lysandre>`__. The JAX version of this model was
contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
<https://github.com/google-research/ALBERT>`__.
AlbertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertConfig
:members:
AlbertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
AlbertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertTokenizerFast
:members:
Albert specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.albert.modeling_albert.AlbertForPreTrainingOutput
:members:
.. autoclass:: transformers.models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
:members:
AlbertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertModel
:members: forward
AlbertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForPreTraining
:members: forward
AlbertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForMaskedLM
:members: forward
AlbertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForSequenceClassification
:members: forward
AlbertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForMultipleChoice
:members:
AlbertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForTokenClassification
:members: forward
AlbertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForQuestionAnswering
:members: forward
TFAlbertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertModel
:members: call
TFAlbertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForPreTraining
:members: call
TFAlbertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForMaskedLM
:members: call
TFAlbertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForSequenceClassification
:members: call
TFAlbertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForMultipleChoice
:members: call
TFAlbertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForTokenClassification
:members: call
TFAlbertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForQuestionAnswering
:members: call
FlaxAlbertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxAlbertModel
:members: __call__
FlaxAlbertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxAlbertForPreTraining
:members: __call__
FlaxAlbertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxAlbertForMaskedLM
:members: __call__
FlaxAlbertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxAlbertForSequenceClassification
:members: __call__
FlaxAlbertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxAlbertForMultipleChoice
:members: __call__
FlaxAlbertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxAlbertForTokenClassification
:members: __call__
FlaxAlbertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxAlbertForQuestionAnswering
:members: __call__
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BART
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten
## Overview
The Bart model was proposed in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
According to the abstract,
- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a
left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme,
where spans of text are replaced with a single mask token.
- BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It
matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new
state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
of up to 6 ROUGE.
This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/bart).
### Examples
- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
[examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/README.md).
- An example of how to train [`BartForConditionalGeneration`] with a Hugging Face `datasets`
object can be found in this [forum discussion](https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904).
- [Distilled checkpoints](https://huggingface.co/models?search=distilbart) are described in this [paper](https://arxiv.org/abs/2010.13002).
## Implementation Notes
- Bart doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or
[`~BartTokenizer.encode`] to get the proper splitting.
- The forward pass of [`BartModel`] will create the `decoder_input_ids` if they are not passed.
This is different from some other modeling APIs. A typical use case of this feature is mask filling.
- Model predictions are intended to be identical to the original implementation when
`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
[`fairseq.encode`] starts with a space.
- [`~generation_utils.GenerationMixin.generate`] should be used for conditional generation tasks like
summarization; see the example in that method's docstring and the summarization sketch after these notes.
- Models that load the *facebook/bart-large-cnn* weights will not have a `mask_token_id`, or be able to perform
mask-filling tasks.
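A minimal summarization sketch for the notes above, assuming the `facebook/bart-large-cnn` checkpoint and an arbitrary sample article; `generate` builds the `decoder_input_ids` internally, so only the encoder inputs are needed:
```python
from transformers import BartForConditionalGeneration, BartTokenizer

# facebook/bart-large-cnn is fine-tuned for summarization (and, as noted above, cannot do mask filling).
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

article = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires."
)
inputs = tokenizer(article, return_tensors="pt")

# generate() creates the decoder inputs itself and runs beam search for the summary.
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```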
## Mask Filling
The `facebook/bart-base` and `facebook/bart-large` checkpoints can be used to fill multi-token masks.
```python
from transformers import BartForConditionalGeneration, BartTokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
tok = BartTokenizer.from_pretrained("facebook/bart-large")
example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
batch = tok(example_english_phrase, return_tensors='pt')
generated_ids = model.generate(batch['input_ids'])
assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
```
## BartConfig
[[autodoc]] BartConfig
- all
## BartTokenizer
[[autodoc]] BartTokenizer
- all
## BartTokenizerFast
[[autodoc]] BartTokenizerFast
- all
## BartModel
[[autodoc]] BartModel
- forward
## BartForConditionalGeneration
[[autodoc]] BartForConditionalGeneration
- forward
## BartForSequenceClassification
[[autodoc]] BartForSequenceClassification
- forward
## BartForQuestionAnswering
[[autodoc]] BartForQuestionAnswering
- forward
## BartForCausalLM
[[autodoc]] BartForCausalLM
- forward
## TFBartModel
[[autodoc]] TFBartModel
- call
## TFBartForConditionalGeneration
[[autodoc]] TFBartForConditionalGeneration
- call
## FlaxBartModel
[[autodoc]] FlaxBartModel
- __call__
- encode
- decode
## FlaxBartForConditionalGeneration
[[autodoc]] FlaxBartForConditionalGeneration
- __call__
- encode
- decode
## FlaxBartForSequenceClassification
[[autodoc]] FlaxBartForSequenceClassification
- __call__
- encode
- decode
## FlaxBartForQuestionAnswering
[[autodoc]] FlaxBartForQuestionAnswering
- __call__
- encode
- decode
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
BART
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@patrickvonplaten
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Bart model was proposed in `BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
Translation, and Comprehension <https://arxiv.org/abs/1910.13461>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
According to the abstract,
- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a
left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme,
where spans of text are replaced with a single mask token.
- BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It
matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new
state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
of up to 6 ROUGE.
This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__. The authors' code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/bart>`__.
Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
:prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
object can be found in this `forum discussion
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distilbart>`__ are described in this `paper
<https://arxiv.org/abs/2010.13002>`__.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` or
:meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
- The forward pass of :class:`~transformers.BartModel` will create the ``decoder_input_ids`` if they are not passed.
This is different from some other modeling APIs. A typical use case of this feature is mask filling.
- Model predictions are intended to be identical to the original implementation when
:obj:`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
:func:`fairseq.encode` starts with a space.
- :meth:`~transformers.generation_utils.GenerationMixin.generate` should be used for conditional generation tasks like
summarization; see the example in that method's docstring.
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
mask-filling tasks.
Mask Filling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :obj:`facebook/bart-base` and :obj:`facebook/bart-large` checkpoints can be used to fill multi-token masks.
.. code-block::
from transformers import BartForConditionalGeneration, BartTokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
tok = BartTokenizer.from_pretrained("facebook/bart-large")
example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
batch = tok(example_english_phrase, return_tensors='pt')
generated_ids = model.generate(batch['input_ids'])
assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
BartConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartConfig
:members:
BartTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartTokenizer
:members:
BartTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartTokenizerFast
:members:
BartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartModel
:members: forward
BartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartForConditionalGeneration
:members: forward
BartForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartForSequenceClassification
:members: forward
BartForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartForQuestionAnswering
:members: forward
BartForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartForCausalLM
:members: forward
TFBartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBartModel
:members: call
TFBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBartForConditionalGeneration
:members: call
FlaxBartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxBartModel
:members: __call__, encode, decode
FlaxBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxBartForConditionalGeneration
:members: __call__, encode, decode
FlaxBartForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxBartForSequenceClassification
:members: __call__, encode, decode
FlaxBartForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxBartForQuestionAnswering
:members: __call__, encode, decode
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BARThez
## Overview
The BARThez model was proposed in [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis on 23 Oct,
2020.
The abstract of the paper:
@@ -35,26 +32,19 @@ summarization dataset, OrangeSum, that we release with this paper. We also conti
pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez,
provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.*
This model was contributed by [moussakam](https://huggingface.co/moussakam). The authors' code can be found [here](https://github.com/moussaKam/BARThez).
### Examples
- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check:
[examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/README.md).
## BarthezTokenizer
[[autodoc]] BarthezTokenizer
## BarthezTokenizerFast
[[autodoc]] BarthezTokenizerFast
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BARTpho
## Overview
The BARTpho model was proposed in [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
The abstract from the paper is the following:
@@ -30,57 +27,54 @@ research and applications of generative Vietnamese NLP tasks.*
Example of use:
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
>>> line = "Chúng tôi là những nghiên cứu viên."
>>> input_ids = tokenizer(line, return_tensors="pt")
>>> with torch.no_grad():
...     features = bartpho(**input_ids)  # Models outputs are now tuples
>>> # With TensorFlow 2.0+:
>>> from transformers import TFAutoModel
>>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)
```
Tips:
- Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of
both the encoder and decoder. Thus, usage examples in the [documentation of BART](bart), when adapting to use
with BARTpho, should be adjusted by replacing the BART-specialized classes with the mBART-specialized counterparts.
For example:
```python
>>> from transformers import MBartForConditionalGeneration
>>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
>>> TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
>>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
>>> logits = bartpho(input_ids).logits
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
>>> values, predictions = probs.topk(5)
>>> print(tokenizer.decode(predictions).split())
```
- This implementation is only for tokenization: "monolingual_vocab_file" consists of Vietnamese-specialized types
extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
## BartphoTokenizer
[[autodoc]] BartphoTokenizer
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BEiT
## Overview
The BEiT model was proposed in [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by
Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of
Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class
of an image (as done in the [original ViT paper](https://arxiv.org/abs/2010.11929)), BEiT models are pre-trained to
predict visual tokens from the codebook of OpenAI's [DALL-E model](https://arxiv.org/abs/2102.12092) given masked
patches.
The abstract from the paper is the following:
@@ -40,105 +38,77 @@ significantly outperforming from-scratch DeiT training (81.8%) with the same set
Tips:
- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when
fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace
[`ViTFeatureExtractor`] by [`BeitFeatureExtractor`] and
[`ViTForImageClassification`] by [`BeitForImageClassification`]).
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
- As the BEiT models expect each image to be of the same size (resolution), one can use
[`BeitFeatureExtractor`] to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of
14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
images and 1,000 classes).
- BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the
relative position bias among the several self-attention layers. During fine-tuning, each layer's relative position
bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to
pre-train a model from scratch, one needs to either set the `use_relative_position_bias` or the
`use_absolute_position_bias` attribute of [`BeitConfig`] to `True` in order to add
position embeddings.
This model was contributed by [nielsr](https://huggingface.co/nielsr). The JAX/FLAX version of this model was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).
## BEiT specific outputs
[[autodoc]] models.beit.modeling_beit.BeitModelOutputWithPooling
[[autodoc]] models.beit.modeling_flax_beit.FlaxBeitModelOutputWithPooling
## BeitConfig
[[autodoc]] BeitConfig
## BeitFeatureExtractor
[[autodoc]] BeitFeatureExtractor
- __call__
## BeitModel
[[autodoc]] BeitModel
- forward
## BeitForMaskedImageModeling
[[autodoc]] BeitForMaskedImageModeling
- forward
## BeitForImageClassification
[[autodoc]] BeitForImageClassification
- forward
## BeitForSemanticSegmentation
[[autodoc]] BeitForSemanticSegmentation
- forward
## FlaxBeitModel
[[autodoc]] FlaxBeitModel
- __call__
## FlaxBeitForMaskedImageModeling
[[autodoc]] FlaxBeitForMaskedImageModeling
- __call__
## FlaxBeitForImageClassification
[[autodoc]] FlaxBeitForImageClassification
- __call__
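A minimal image-classification sketch for the tips above, assuming the `microsoft/beit-base-patch16-224` checkpoint mentioned earlier and an arbitrary test image URL; the feature extractor handles resizing and normalization:
```python
import requests
import torch
from PIL import Image
from transformers import BeitFeatureExtractor, BeitForImageClassification

# Fine-tuned checkpoint (per the naming convention above: patch 16x16, resolution 224x224).
feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

# Any RGB image works here; this URL is just an example input.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
```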
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BertJapanese
## Overview
These are BERT models trained on Japanese text.
There are models with two different tokenization methods:
- Tokenize with MeCab and WordPiece. This requires an extra dependency, [fugashi](https://github.com/polm/fugashi), which is a wrapper around [MeCab](https://taku910.github.io/mecab/).
- Tokenize into characters.
To use *MecabTokenizer*, you should run `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install
from source) to install the dependencies.
See [details on cl-tohoku repository](https://github.com/cl-tohoku/bert-japanese).
Example of using a model with MeCab and WordPiece tokenization:
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"
>>> inputs = tokenizer(line, return_tensors="pt")
>>> print(tokenizer.decode(inputs['input_ids'][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]
>>> outputs = bertjapanese(**inputs)
```
Example of using a model with Character tokenization:
```python
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"
>>> inputs = tokenizer(line, return_tensors="pt")
>>> print(tokenizer.decode(inputs['input_ids'][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]
>>> outputs = bertjapanese(**inputs)
```
Tips:
- This implementation is the same as BERT, except for tokenization method. Refer to the [documentation of BERT](bert) for more usage examples.
This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku).
## BertJapaneseTokenizer
[[autodoc]] BertJapaneseTokenizer
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
BertJapanese
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These are BERT models trained on Japanese text.
There are models with two different tokenization methods:
- Tokenize with MeCab and WordPiece. This requires an extra dependency, `fugashi
<https://github.com/polm/fugashi>`__, which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__.
- Tokenize into characters.
To use `MecabTokenizer`, you should run ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install
from source) to install the dependencies.
See `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__.
Example of using a model with MeCab and WordPiece tokenization:
.. code-block::
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"
>>> inputs = tokenizer(line, return_tensors="pt")
>>> print(tokenizer.decode(inputs['input_ids'][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]
>>> outputs = bertjapanese(**inputs)
Example of using a model with Character tokenization:
.. code-block::
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"
>>> inputs = tokenizer(line, return_tensors="pt")
>>> print(tokenizer.decode(inputs['input_ids'][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]
>>> outputs = bertjapanese(**inputs)
Tips:
- This implementation is the same as BERT, except for tokenization method. Refer to the :doc:`documentation of BERT
<bert>` for more usage examples.
This model was contributed by `cl-tohoku <https://huggingface.co/cl-tohoku>`__.
BertJapaneseTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertJapaneseTokenizer
:members:
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BertGeneration
## Overview
The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using
[`EncoderDecoderModel`] as proposed in [Leveraging Pre-trained Checkpoints for Sequence Generation
Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
The abstract from the paper is the following:
*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT,
GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both
encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation,
Text Summarization, Sentence Splitting, and Sentence Fusion.*
Usage:
- The model can be used in combination with the [`EncoderDecoderModel`] to leverage two pretrained
BERT checkpoints for subsequent fine-tuning.
```python
>>> from transformers import BertGenerationDecoder, BertGenerationEncoder, BertTokenizer, EncoderDecoderModel
>>> # leverage checkpoints for Bert2Bert model...
>>> # use BERT's cls token as BOS token and sep token as EOS token
>>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
>>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
>>> decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
>>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
>>> # create tokenizer...
>>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
>>> input_ids = tokenizer('This is a long article to summarize', add_special_tokens=False, return_tensors="pt").input_ids
>>> labels = tokenizer('This is a short summary', return_tensors="pt").input_ids
>>> # train...
>>> loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
>>> loss.backward()
```
- Pretrained [`EncoderDecoderModel`] checkpoints are also directly available on the model hub, e.g.:
```python
>>> from transformers import AutoTokenizer, EncoderDecoderModel
>>> # instantiate sentence fusion model
>>> sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")
>>> tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")
>>> input_ids = tokenizer('This is the first sentence. This is the second sentence.', add_special_tokens=False, return_tensors="pt").input_ids
>>> outputs = sentence_fuser.generate(input_ids)
>>> print(tokenizer.decode(outputs[0]))
```
Tips:
- [`BertGenerationEncoder`] and [`BertGenerationDecoder`] should be used in
combination with [`EncoderDecoderModel`].
- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
Therefore, no EOS token should be added to the end of the input (see the sketch below).
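As the last tip notes, inputs for these tasks are tokenized without special tokens. A minimal sketch, assuming the publicly released `google/bert_for_seq_generation_L-24_bbc_encoder` checkpoint:
```python
>>> from transformers import BertGenerationTokenizer
>>> tokenizer = BertGenerationTokenizer.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
>>> # add_special_tokens=False keeps only the raw sentence pieces, so no EOS token is appended
>>> input_ids = tokenizer("This is a long article to summarize", add_special_tokens=False, return_tensors="pt").input_ids
>>> print(input_ids.shape)
```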
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder).
## BertGenerationConfig
[[autodoc]] BertGenerationConfig
## BertGenerationTokenizer
[[autodoc]] BertGenerationTokenizer
- save_vocabulary
## BertGenerationEncoder
[[autodoc]] BertGenerationEncoder
- forward
## BertGenerationDecoder
[[autodoc]] BertGenerationDecoder
- forward
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BERTweet
## Overview
The BERTweet model was proposed in [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.
The abstract from the paper is the following:
*We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
Part-of-speech tagging, Named-entity recognition and text classification.*
Example of use:
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
>>> # For transformers v4.x+:
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
>>> # For transformers v3.x:
>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
>>> # INPUT TWEET IS ALREADY NORMALIZED!
>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
>>> input_ids = torch.tensor([tokenizer.encode(line)])
>>> with torch.no_grad():
...     features = bertweet(input_ids)  # Model outputs are now tuples
>>> # With TensorFlow 2.0+:
>>> # from transformers import TFAutoModel
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
```
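If the tweets are not already normalized, the slow tokenizer can handle that step; a minimal sketch, assuming the `normalization` flag of [`BertweetTokenizer`] (which additionally requires the `emoji` package) is available in your installed version:
```python
>>> from transformers import BertweetTokenizer
>>> # normalization=True applies tweet normalization (user mentions, URLs, emoji) before BPE
>>> tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
>>> line = "SC has first two presumptive cases of coronavirus, DHEC confirms 😢"
>>> input_ids = tokenizer(line, return_tensors="pt").input_ids
```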
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet).
## BertweetTokenizer
[[autodoc]] BertweetTokenizer
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BigBird
## Overview
The BigBird model was proposed in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by
Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention
based transformer which extends Transformer based models, such as BERT, to much longer sequences. In addition to sparse
[...]
propose novel applications to genomics data.*
Tips:
- For an in-detail explanation on how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
**original_full** is advised as there is no benefit in using **block_sparse** attention (see the sketch below).
- The code currently uses a window size of 3 blocks and 2 global blocks.
- Sequence length must be divisible by the block size.
- The current implementation supports only **ITC**.
- The current implementation doesn't support **num_random_blocks = 0**.
This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta). The original code can be found
[here](https://github.com/google-research/bigbird).
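A minimal sketch of the attention-related tips above, assuming the public `google/bigbird-roberta-base` checkpoint (the keyword arguments below are forwarded to [`BigBirdConfig`]):
```python
>>> from transformers import BigBirdModel
>>> # default: block_sparse attention with a block size of 64 and 3 random blocks
>>> model = BigBirdModel.from_pretrained("google/bigbird-roberta-base", block_size=64, num_random_blocks=3)
>>> # for inputs shorter than 1024 tokens, full attention is advised instead
>>> model = BigBirdModel.from_pretrained("google/bigbird-roberta-base", attention_type="original_full")
```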
## BigBirdConfig
[[autodoc]] BigBirdConfig
## BigBirdTokenizer
[[autodoc]] BigBirdTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## BigBirdTokenizerFast
[[autodoc]] BigBirdTokenizerFast
## BigBird specific outputs
[[autodoc]] models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput
## BigBirdModel
[[autodoc]] BigBirdModel
- forward
## BigBirdForPreTraining
[[autodoc]] BigBirdForPreTraining
- forward
## BigBirdForCausalLM
[[autodoc]] BigBirdForCausalLM
- forward
## BigBirdForMaskedLM
[[autodoc]] BigBirdForMaskedLM
- forward
## BigBirdForSequenceClassification
[[autodoc]] BigBirdForSequenceClassification
- forward
## BigBirdForMultipleChoice
[[autodoc]] BigBirdForMultipleChoice
- forward
## BigBirdForTokenClassification
[[autodoc]] BigBirdForTokenClassification
- forward
## BigBirdForQuestionAnswering
[[autodoc]] BigBirdForQuestionAnswering
- forward
## FlaxBigBirdModel
[[autodoc]] FlaxBigBirdModel
- __call__
## FlaxBigBirdForPreTraining
[[autodoc]] FlaxBigBirdForPreTraining
- __call__
## FlaxBigBirdForMaskedLM
[[autodoc]] FlaxBigBirdForMaskedLM
- __call__
## FlaxBigBirdForSequenceClassification
[[autodoc]] FlaxBigBirdForSequenceClassification
- __call__
## FlaxBigBirdForMultipleChoice
[[autodoc]] FlaxBigBirdForMultipleChoice
- __call__
## FlaxBigBirdForTokenClassification
[[autodoc]] FlaxBigBirdForTokenClassification
- __call__
## FlaxBigBirdForQuestionAnswering
[[autodoc]] FlaxBigBirdForQuestionAnswering
- __call__
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BigBirdPegasus
## Overview
The BigBird model was proposed in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by
Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention
based transformer which extends Transformer based models, such as BERT, to much longer sequences. In addition to sparse
[...]
propose novel applications to genomics data.*
Tips:
- For an in-detail explanation on how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
**original_full** is advised as there is no benefit in using **block_sparse** attention (see the sketch below).
- The code currently uses a window size of 3 blocks and 2 global blocks.
- Sequence length must be divisible by the block size.
- The current implementation supports only **ITC**.
- The current implementation doesn't support **num_random_blocks = 0**.
- BigBirdPegasus uses the [PegasusTokenizer](https://github.com/huggingface/transformers/blob/master/src/transformers/models/pegasus/tokenization_pegasus.py).
The original code can be found [here](https://github.com/google-research/bigbird).
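A minimal summarization sketch, assuming the public `google/bigbird-pegasus-large-arxiv` checkpoint; the same attention-related keyword arguments as for BigBird (e.g. `attention_type`) are forwarded to the config:
```python
>>> from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration
>>> model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv", attention_type="original_full")
>>> tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
>>> inputs = tokenizer("Replace me by any long document you want to summarize.", return_tensors="pt")
>>> summary_ids = model.generate(**inputs)
>>> print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```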
## BigBirdPegasusConfig
[[autodoc]] BigBirdPegasusConfig
- all
## BigBirdPegasusModel
[[autodoc]] BigBirdPegasusModel
- forward
## BigBirdPegasusForConditionalGeneration
[[autodoc]] BigBirdPegasusForConditionalGeneration
- forward
## BigBirdPegasusForSequenceClassification
[[autodoc]] BigBirdPegasusForSequenceClassification
- forward
## BigBirdPegasusForQuestionAnswering
[[autodoc]] BigBirdPegasusForQuestionAnswering
- forward
## BigBirdPegasusForCausalLM
[[autodoc]] BigBirdPegasusForCausalLM
- forward
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Blenderbot
**DISCLAIMER:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
## Overview
The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
The abstract of the paper is the following:
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*
This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/facebookresearch/ParlAI).
## Implementation Notes
- Blenderbot uses a standard [seq2seq model transformer](https://arxiv.org/pdf/1706.03762.pdf) based architecture.
- Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot).
- This is the *default* Blenderbot model class. However, some smaller checkpoints, such as
`facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with
[BlenderbotSmall](blenderbot_small).
## Usage
Here is an example of model usage:
```python
>>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
>>> mname = 'facebook/blenderbot-400M-distill'
>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
>>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
>>> reply_ids = model.generate(**inputs)
>>> print(tokenizer.batch_decode(reply_ids))
["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
```
## BlenderbotConfig
[[autodoc]] BlenderbotConfig
## BlenderbotTokenizer
[[autodoc]] BlenderbotTokenizer
- build_inputs_with_special_tokens
## BlenderbotTokenizerFast
[[autodoc]] BlenderbotTokenizerFast
- build_inputs_with_special_tokens
## BlenderbotModel
See [`BartModel`] for arguments to *forward* and *generate*.
[[autodoc]] BlenderbotModel
- forward
## BlenderbotForConditionalGeneration
See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* and *generate*.
[[autodoc]] BlenderbotForConditionalGeneration
- forward
## BlenderbotForCausalLM
[[autodoc]] BlenderbotForCausalLM
- forward
## TFBlenderbotModel
[[autodoc]] TFBlenderbotModel
- call
## TFBlenderbotForConditionalGeneration
[[autodoc]] TFBlenderbotForConditionalGeneration
- call
## FlaxBlenderbotModel
[[autodoc]] FlaxBlenderbotModel
- __call__
- encode
- decode
## FlaxBlenderbotForConditionalGeneration
[[autodoc]] FlaxBlenderbotForConditionalGeneration
- __call__
- encode
- decode
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Blenderbot Small
Note that [`BlenderbotSmallModel`] and
[`BlenderbotSmallForConditionalGeneration`] are only used in combination with the checkpoint
[facebook/blenderbot-90M](https://huggingface.co/facebook/blenderbot-90M). Larger Blenderbot checkpoints should
instead be used with [`BlenderbotModel`] and
[`BlenderbotForConditionalGeneration`].
## Overview
The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
The abstract of the paper is the following:
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The authors' code can be
found [here](https://github.com/facebookresearch/ParlAI).
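Tying this to the note at the top of the page, a minimal sketch of chatting with the 90M checkpoint through the small classes:
```python
>>> from transformers import BlenderbotSmallForConditionalGeneration, BlenderbotSmallTokenizer
>>> mname = "facebook/blenderbot-90M"
>>> model = BlenderbotSmallForConditionalGeneration.from_pretrained(mname)
>>> tokenizer = BlenderbotSmallTokenizer.from_pretrained(mname)
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer([UTTERANCE], return_tensors="pt")
>>> reply_ids = model.generate(**inputs)
>>> print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True))
```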
## BlenderbotSmallConfig
[[autodoc]] BlenderbotSmallConfig
## BlenderbotSmallTokenizer
[[autodoc]] BlenderbotSmallTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## BlenderbotSmallTokenizerFast
[[autodoc]] BlenderbotSmallTokenizerFast
## BlenderbotSmallModel
[[autodoc]] BlenderbotSmallModel
- forward
## BlenderbotSmallForConditionalGeneration
[[autodoc]] BlenderbotSmallForConditionalGeneration
- forward
## BlenderbotSmallForCausalLM
[[autodoc]] BlenderbotSmallForCausalLM
- forward
## TFBlenderbotSmallModel
[[autodoc]] TFBlenderbotSmallModel
- call
## TFBlenderbotSmallForConditionalGeneration
[[autodoc]] TFBlenderbotSmallForConditionalGeneration
- call
## FlaxBlenderbotSmallModel
[[autodoc]] FlaxBlenderbotSmallModel
- __call__
- encode
- decode
## FlaxBlenderbotSmallForConditionalGeneration
[[autodoc]] FlaxBlenderbotSmallForConditionalGeneration
- __call__
- encode
- decode
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BORT
## Overview
The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) by
Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for BERT, which the
authors refer to as "Bort".
[...]
absolute, with respect to BERT-large, on multiple public natural language understanding [...]
Tips:
- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the
model's API as well as usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples (see also the sketch below).
- BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology),
which is sadly not open-sourced yet. It would be very useful for the community if someone tried to implement the
algorithm to make BORT fine-tuning work.
This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).
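A minimal sketch of the first two tips; the checkpoint name `amazon/bort` is assumed here, so substitute whichever Bort checkpoint you actually use:
```python
>>> from transformers import BertModel, RobertaTokenizer
>>> # Bort reuses the BERT model classes but the RoBERTa tokenizer ("amazon/bort" is an assumed checkpoint name)
>>> tokenizer = RobertaTokenizer.from_pretrained("amazon/bort")
>>> model = BertModel.from_pretrained("amazon/bort")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
```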