Unverified Commit 207594be authored by Sylvain Gugger, committed by GitHub

Convert rst files (#14888)

* Convert all tutorials and guides

* Convert all remaining rst to mdx

* Track and fix bad links
parent b0c7d2ec
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Speech2Text
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Speech2Text model was proposed in `fairseq S2T: Fast Speech-to-Text Modeling with fairseq
<https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. It's a
transformer-based seq2seq (encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech
Translation (ST). It uses a convolutional downsampler to reduce the length of speech inputs by 3/4th before they are
fed into the encoder. The model is trained with standard autoregressive cross-entropy loss and generates the
transcripts/translations autoregressively. Speech2Text has been fine-tuned on several datasets for ASR and ST:
`LibriSpeech <http://www.openslr.org/12>`__, `CoVoST 2 <https://github.com/facebookresearch/covost>`__, `MuST-C
<https://ict.fbk.eu/must-c/>`__.
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The original code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text>`__.
Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Speech2Text is a speech model that accepts a float tensor of log-mel filter-bank features extracted from the speech
signal. It's a transformer-based seq2seq model, so the transcripts/translations are generated autoregressively. The
:obj:`generate()` method can be used for inference.
The :class:`~transformers.Speech2TextFeatureExtractor` class is responsible for extracting the log-mel filter-bank
features. The :class:`~transformers.Speech2TextProcessor` wraps :class:`~transformers.Speech2TextFeatureExtractor` and
:class:`~transformers.Speech2TextTokenizer` into a single instance to both extract the input features and decode the
predicted token ids.
The feature extractor depends on :obj:`torchaudio` and the tokenizer depends on :obj:`sentencepiece` so be sure to
install those packages before running the examples. You could either install those as extra speech dependencies with
``pip install transformers"[speech, sentencepiece]"`` or install the packages separately with ``pip install torchaudio
sentencepiece``. Also ``torchaudio`` requires the development version of the `libsndfile
<http://www.mega-nerd.com/libsndfile/>`__ package which can be installed via a system package manager. On Ubuntu it can
be installed as follows: ``apt install libsndfile1-dev``
- ASR and Speech Translation
.. code-block::
>>> import torch
>>> from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
>>> processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
>>> def map_to_array(batch):
... speech, _ = sf.read(batch["file"])
... batch["speech"] = speech
... return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
>>> generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"])
>>> transcription = processor.batch_decode(generated_ids)
- Multilingual speech translation
For multilingual speech translation models, :obj:`eos_token_id` is used as the :obj:`decoder_start_token_id` and
the target language id is forced as the first generated token. To force the target language id as the first
generated token, pass the :obj:`forced_bos_token_id` parameter to the :obj:`generate()` method. The following
example shows how to translate English speech to French text using the ``facebook/s2t-medium-mustc-multilingual-st``
checkpoint.
.. code-block::
>>> import torch
>>> from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
>>> processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
>>> def map_to_array(batch):
... speech, _ = sf.read(batch["file"])
... batch["speech"] = speech
... return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
>>> generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"], forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"])
>>> translation = processor.batch_decode(generated_ids)
See the `model hub <https://huggingface.co/models?filter=speech_to_text>`__ to look for Speech2Text checkpoints.
Speech2TextConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextConfig
:members:
Speech2TextTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
Speech2TextFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextFeatureExtractor
:members: __call__
Speech2TextProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextProcessor
:members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor
Speech2TextModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextModel
:members: forward
Speech2TextForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextForConditionalGeneration
:members: forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Speech2Text2
## Overview
The Speech2Text2 model is used together with [Wav2Vec2](wav2vec2) for Speech Translation models proposed in
[Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
Speech2Text2 is a *decoder-only* transformer model that can be used with any speech *encoder-only*, such as
[Wav2Vec2](wav2vec2) or [HuBERT](hubert) for Speech-to-Text tasks. Please refer to the
[SpeechEncoderDecoder](speechencoderdecoder) class on how to combine Speech2Text2 with any speech *encoder-only*
model.
This model was contributed by [Patrick von Platen](https://huggingface.co/patrickvonplaten).
The original code can be found [here](https://github.com/pytorch/fairseq/blob/1f7ef9ed1e1061f8c7f88f8b94c7186834398690/fairseq/models/wav2vec/wav2vec2_asr.py#L266).
Tips:
- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
the [official models](https://huggingface.co/models?other=speech2text2) .
- Speech2Text2 is always used within the [SpeechEncoderDecoder](speechencoderdecoder) framework.
- Speech2Text2's tokenizer is based on [fastBPE](https://github.com/glample/fastBPE).
## Inference
Speech2Text2's [`SpeechEncoderDecoderModel`] model accepts raw waveform input values from speech and
makes use of [`~generation_utils.GenerationMixin.generate`] to translate the input speech
autoregressively to the target language.
The [`Wav2Vec2FeatureExtractor`] class is responsible for preprocessing the input speech and
[`Speech2Text2Tokenizer`] decodes the generated target tokens to the target string. The
[`Speech2Text2Processor`] wraps [`Wav2Vec2FeatureExtractor`] and
[`Speech2Text2Tokenizer`] into a single instance to both extract the input features and decode the
predicted token ids.
- Step-by-step Speech Translation
```python
>>> import torch
>>> from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
>>> processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
>>> def map_to_array(batch):
... speech, _ = sf.read(batch["file"])
... batch["speech"] = speech
... return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
>>> generated_ids = model.generate(input_ids=inputs["input_values"], attention_mask=inputs["attention_mask"])
>>> transcription = processor.batch_decode(generated_ids)
```
- Speech Translation via Pipelines
The automatic speech recognition pipeline can also be used to translate speech in just a couple lines of code
```python
>>> from datasets import load_dataset
>>> from transformers import pipeline
>>> librispeech_en = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-de", feature_extractor="facebook/s2t-wav2vec2-large-en-de")
>>> translation_de = asr(librispeech_en[0]["file"])
```
See [model hub](https://huggingface.co/models?filter=speech2text2) to look for Speech2Text2 checkpoints.
## Speech2Text2Config
[[autodoc]] Speech2Text2Config
## Speech2Text2Tokenizer
[[autodoc]] Speech2Text2Tokenizer
- batch_decode
- decode
- save_vocabulary
## Speech2Text2Processor
[[autodoc]] Speech2Text2Processor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode
- as_target_processor
## Speech2Text2ForCausalLM
[[autodoc]] Speech2Text2ForCausalLM
- forward
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Speech2Text2
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Speech2Text2 model is used together with :doc:`Wav2Vec2 <wav2vec2>` for Speech Translation models proposed in
`Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
Speech2Text2 is a *decoder-only* transformer model that can be used with any speech *encoder-only*, such as
:doc:`Wav2Vec2 <wav2vec2>` or :doc:`HuBERT <hubert>` for Speech-to-Text tasks. Please refer to the
:doc:`SpeechEncoderDecoder <speechencoderdecoder>` class on how to combine Speech2Text2 with any speech *encoder-only*
model.
This model was contributed by `Patrick von Platen <https://huggingface.co/patrickvonplaten>`__.
The original code can be found `here
<https://github.com/pytorch/fairseq/blob/1f7ef9ed1e1061f8c7f88f8b94c7186834398690/fairseq/models/wav2vec/wav2vec2_asr.py#L266>`__.
Tips:
- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
the `official models <https://huggingface.co/models?other=speech2text2>`__ .
- Speech2Text2 is always used within the :doc:`SpeechEncoderDecoder <speechencoderdecoder>` framework.
- Speech2Text2's tokenizer is based on `fastBPE <https://github.com/glample/fastBPE>`__.
Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Speech2Text2's :class:`~transformers.SpeechEncoderDecoderModel` model accepts raw waveform input values from speech and
makes use of :func:`~transformers.generation_utils.GenerationMixin.generate` to translate the input speech
autoregressively to the target language.
The :class:`~transformers.Wav2Vec2FeatureExtractor` class is responsible for preprocessing the input speech and
:class:`~transformers.Speech2Text2Tokenizer` decodes the generated target tokens to the target string. The
:class:`~transformers.Speech2Text2Processor` wraps :class:`~transformers.Wav2Vec2FeatureExtractor` and
:class:`~transformers.Speech2Text2Tokenizer` into a single instance to both extract the input features and decode the
predicted token ids.
- Step-by-step Speech Translation
.. code-block::
>>> import torch
>>> from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
>>> processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
>>> def map_to_array(batch):
... speech, _ = sf.read(batch["file"])
... batch["speech"] = speech
... return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
>>> generated_ids = model.generate(input_ids=inputs["input_values"], attention_mask=inputs["attention_mask"])
>>> transcription = processor.batch_decode(generated_ids)
- Speech Translation via Pipelines
The automatic speech recognition pipeline can also be used to translate speech in just a couple lines of code
.. code-block::
>>> from datasets import load_dataset
>>> from transformers import pipeline
>>> librispeech_en = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-de", feature_extractor="facebook/s2t-wav2vec2-large-en-de")
>>> translation_de = asr(librispeech_en[0]["file"])
See `model hub <https://huggingface.co/models?filter=speech2text2>`__ to look for Speech2Text2 checkpoints.
Speech2Text2Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2Text2Config
:members:
Speech2Text2Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2Text2Tokenizer
:members: batch_decode, decode, save_vocabulary
Speech2Text2Processor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2Text2Processor
:members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor
Speech2Text2ForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2Text2ForCausalLM
:members: forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Speech Encoder Decoder Models
The [`SpeechEncoderDecoderModel`] can be used to initialize a speech-sequence-to-text-sequence model
with any pretrained speech autoencoding model as the encoder (*e.g.* [Wav2Vec2](wav2vec2), [Hubert](hubert)) and any pretrained autoregressive model as the decoder.
The effectiveness of initializing speech-sequence-to-text-sequence models with pretrained checkpoints for speech
recognition and speech translation has *e.g.* been shown in [Large-Scale Self- and Semi-Supervised Learning for Speech
Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli,
Alexis Conneau.
An example of how to use a [`SpeechEncoderDecoderModel`] for inference can be seen in
[Speech2Text2](speech_to_text_2).
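A minimal warm-starting sketch (the encoder and decoder checkpoint names below are illustrative; any compatible pretrained speech encoder and autoregressive text decoder can be used):

```python
from transformers import SpeechEncoderDecoderModel

# Warm-start a speech-sequence-to-text-sequence model from a pretrained speech encoder
# and a pretrained autoregressive text decoder (checkpoint names are illustrative).
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base-960h", "bert-base-uncased"
)

# The cross-attention weights of the decoder are newly initialized, so the combined model
# needs to be fine-tuned on a speech-to-text task; generation-related config values such as
# decoder_start_token_id and pad_token_id must also be set to match the decoder's tokenizer.
```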
## SpeechEncoderDecoderConfig
[[autodoc]] SpeechEncoderDecoderConfig
## SpeechEncoderDecoderModel
[[autodoc]] SpeechEncoderDecoderModel
- forward
- from_encoder_decoder_pretrained
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Speech Encoder Decoder Models
-----------------------------------------------------------------------------------------------------------------------
The :class:`~transformers.SpeechEncoderDecoderModel` can be used to initialize a speech-sequence-to-text-sequence model
with any pretrained speech autoencoding model as the encoder (*e.g.* :doc:`Wav2Vec2 <wav2vec2>`, :doc:`Hubert
<hubert>`) and any pretrained autoregressive model as the decoder.
The effectiveness of initializing speech-sequence-to-text-sequence models with pretrained checkpoints for speech
recognition and speech translation has *e.g.* been shown in `Large-Scale Self- and Semi-Supervised Learning for Speech
Translation <https://arxiv.org/abs/2104.06678>`__ by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli,
Alexis Conneau.
An example of how to use a :class:`~transformers.SpeechEncoderDecoderModel` for inference can be seen in
:doc:`Speech2Text2 <speech_to_text_2>`.
SpeechEncoderDecoderConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.SpeechEncoderDecoderConfig
:members:
SpeechEncoderDecoderModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.SpeechEncoderDecoderModel
:members: forward, from_encoder_decoder_pretrained
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Splinter
## Overview
The Splinter model was proposed in [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. Splinter
is an encoder-only transformer (similar to BERT) pretrained using the recurring span selection task on a large corpus
comprising Wikipedia and the Toronto Book Corpus.
Tips:
- Splinter was trained to predict answer spans conditioned on a special [QUESTION] token. These tokens contextualize
to question representations which are used to predict the answers. This layer is called QASS, and is the default
behaviour in the [`SplinterForQuestionAnswering`] class. Therefore:
- Use [`SplinterTokenizer`] (rather than [`BertTokenizer`]), as it already
contains this special token. Also, its default behavior is to use this token when two sequences are given (for
example, in the *run_qa.py* script).
- If you plan on using Splinter outside *run_qa.py*, please keep in mind the question token - it might be important for
the success of your model, especially in a few-shot setting.
- Please note there are two different checkpoints for each size of Splinter. Both are basically the same, except that
one also has the pretrained weights of the QASS layer (*tau/splinter-base-qass* and *tau/splinter-large-qass*) and one
doesn't (*tau/splinter-base* and *tau/splinter-large*). This is done to support randomly initializing this layer at
fine-tuning, as it is shown to yield better results for some cases in the paper.
This model was contributed by [yuvalkirstain](https://huggingface.co/yuvalkirstain) and [oriram](https://huggingface.co/oriram). The original code can be found [here](https://github.com/oriram/splinter).
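A minimal sketch of extractive question answering with the QASS head, assuming the *tau/splinter-base-qass* checkpoint mentioned in the tips above (the question and context strings are illustrative):

```python
import torch
from transformers import SplinterTokenizer, SplinterForQuestionAnswering

tokenizer = SplinterTokenizer.from_pretrained("tau/splinter-base-qass")
model = SplinterForQuestionAnswering.from_pretrained("tau/splinter-base-qass")

question = "Who wrote the Iliad?"  # illustrative example
context = "The Iliad is an ancient Greek epic poem traditionally attributed to Homer."

# Given a question/context pair, the tokenizer inserts the special [QUESTION] token by default
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely answer span from the start/end logits and decode it back to text
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = tokenizer.decode(inputs["input_ids"][0, start : end + 1])
```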
## SplinterConfig
[[autodoc]] SplinterConfig
## SplinterTokenizer
[[autodoc]] SplinterTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## SplinterTokenizerFast
[[autodoc]] SplinterTokenizerFast
## SplinterModel
[[autodoc]] SplinterModel
- forward
## SplinterForQuestionAnswering
[[autodoc]] SplinterForQuestionAnswering
- forward
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# SqueezeBERT
## Overview
The SqueezeBERT model was proposed in [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It's a
bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the
SqueezeBERT architecture is that SqueezeBERT uses [grouped convolutions](https://blog.yani.io/filter-group-tutorial)
instead of fully-connected layers for the Q, K, V and FFN layers.
The abstract from the paper is the following:
Tips:
- Like BERT, SqueezeBERT relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
*squeezebert/squeezebert-mnli-headless* checkpoint.
This model was contributed by [forresti](https://huggingface.co/forresti).
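A minimal sketch of setting up such a fine-tuning run from the *squeezebert/squeezebert-mnli-headless* checkpoint recommended above (the example sentence, label and number of labels are illustrative; the classification head is freshly initialized and must be trained):

```python
import torch
from transformers import SqueezeBertTokenizer, SqueezeBertForSequenceClassification

tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
model = SqueezeBertForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-mnli-headless", num_labels=2  # illustrative two-class task
)

inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])  # illustrative label
loss = model(**inputs, labels=labels).loss  # backpropagate this loss in your training loop
```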
## SqueezeBertConfig
[[autodoc]] SqueezeBertConfig
## SqueezeBertTokenizer
[[autodoc]] SqueezeBertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## SqueezeBertTokenizerFast
[[autodoc]] SqueezeBertTokenizerFast
## SqueezeBertModel
[[autodoc]] SqueezeBertModel
## SqueezeBertForMaskedLM
[[autodoc]] SqueezeBertForMaskedLM
## SqueezeBertForSequenceClassification
[[autodoc]] SqueezeBertForSequenceClassification
## SqueezeBertForMultipleChoice
[[autodoc]] SqueezeBertForMultipleChoice
## SqueezeBertForTokenClassification
[[autodoc]] SqueezeBertForTokenClassification
## SqueezeBertForQuestionAnswering
[[autodoc]] SqueezeBertForQuestionAnswering
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# T5
## Overview
The T5 model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
The abstract from the paper is the following:
Tips:
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
- See the [training](#training), [inference](#inference) and [scripts](#scripts) sections below for all details regarding usage.
T5 comes in different sizes:
- [t5-small](https://huggingface.co/t5-small)
- [t5-base](https://huggingface.co/t5-base)
- [t5-large](https://huggingface.co/t5-large)
- [t5-3b](https://huggingface.co/t5-3b)
- [t5-11b](https://huggingface.co/t5-11b).
Based on the original T5 model, Google has released some follow-up works:
- **T5v1.1**: T5v1.1 is an improved version of T5 with some architectural tweaks, and is pre-trained on C4 only without
mixing in the supervised tasks. Refer to the documentation of T5v1.1 which can be found [here](t5v1.1).
- **mT5**: mT5 is a multilingual T5 model. It is pre-trained on the mC4 corpus, which includes 101 languages. Refer to
the documentation of mT5 which can be found [here](mt5).
- **byT5**: byT5 is a T5 model pre-trained on byte sequences rather than SentencePiece subword token sequences. Refer
to the documentation of byT5 which can be found [here](byt5).
All checkpoints can be found on the [hub](https://huggingface.co/models?search=t5).
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/text-to-text-transfer-transformer).
<a id='training'></a>
## Training
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher
forcing. This means that for training, we always need an input sequence and a corresponding target sequence. The input
sequence is fed to the model using `input_ids`. The target sequence is shifted to the right, i.e., prepended by a
start-sequence token and fed to the decoder using the `decoder_input_ids`. In teacher-forcing style, the target
sequence is then appended by the EOS token and corresponds to the `labels`. The PAD token is hereby used as the
start-sequence token. T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
One can use [`T5ForConditionalGeneration`] (or the Tensorflow/Flax variant), which includes the
language modeling head on top of the decoder.
- Unsupervised denoising training
In this setup, spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
sentinel token represents a unique mask token for this sentence and should start with `<extra_id_0>`,
`<extra_id_1>`, ... up to `<extra_id_99>`. As a default, 100 sentinel tokens are available in
[`T5Tokenizer`].
For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
processed as follows:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
```
If you're interested in pre-training T5 on a new corpus, check out the [run_t5_mlm_flax.py](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling) script in the Examples
directory.
- Supervised training
In this setup, the model is fine-tuned on pairs of input and target sequences. Suppose, for example, that we want to
fine-tune the model for translation, and we have a training example with the input
sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar."; then they should be prepared for
the model as follows:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
labels = tokenizer('Das Haus ist wunderbar.', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
```
As you can see, only 2 inputs are required for the model in order to compute a loss: `input_ids` (which are the
`input_ids` of the encoded input sequence) and `labels` (which are the `input_ids` of the encoded
target sequence). The model will automatically create the `decoder_input_ids` based on the `labels`, by
shifting them one position to the right and prepending the `config.decoder_start_token_id`, which for T5 is
equal to 0 (i.e. the id of the pad token). Also note the task prefix: we prepend the input sequence with 'translate
English to German: ' before encoding it. This will help in improving the performance, as this task prefix was used
during T5's pre-training.
However, the example above only shows a single training example. In practice, one trains deep learning models in
batches. This entails that we must pad/truncate examples to the same length. For encoder-decoder models, one
typically defines a `max_source_length` and `max_target_length`, which determine the maximum length of the
input and output sequences respectively (otherwise they are truncated). These should be carefully set depending on
the task.
In addition, we must make sure that padding token id's of the `labels` are not taken into account by the loss
function. In PyTorch and Tensorflow, this can be done by replacing them with -100, which is the `ignore_index`
of the `CrossEntropyLoss`. In Flax, one can use the `decoder_attention_mask` to ignore padded tokens from
the loss (see the [Flax summarization script](https://github.com/huggingface/transformers/tree/master/examples/flax/summarization) for details). We also pass
`attention_mask` as additional input to the model, which makes sure that padding tokens of the inputs are
ignored. The code example below illustrates all of this.
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# the following 2 hyperparameters are task-specific
max_source_length = 512
max_target_length = 128

# Suppose we have the following 2 training examples:
input_sequence_1 = "Welcome to NYC"
output_sequence_1 = "Bienvenue à NYC"

input_sequence_2 = "HuggingFace is a company"
output_sequence_2 = "HuggingFace est une entreprise"

# encode the inputs
task_prefix = "translate English to French: "
input_sequences = [input_sequence_1, input_sequence_2]
encoding = tokenizer([task_prefix + sequence for sequence in input_sequences],
                     padding='longest',
                     max_length=max_source_length,
                     truncation=True,
                     return_tensors="pt")
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask

# encode the targets
target_encoding = tokenizer([output_sequence_1, output_sequence_2],
                            padding='longest',
                            max_length=max_target_length,
                            truncation=True)
labels = target_encoding.input_ids

# replace padding token id's of the labels by -100
labels = torch.tensor(labels)
labels[labels == tokenizer.pad_token_id] = -100

# forward pass
loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
```
Additional training tips:
- T5 models need a slightly higher learning rate than the default one set in the `Trainer` when using the AdamW
optimizer. Typically, 1e-4 and 3e-4 work well for most problems (classification, summarization, translation, question
answering, question generation). Note that T5 was pre-trained using the AdaFactor optimizer (a short setup sketch
follows this list).
- According to [this forum post](https://discuss.huggingface.co/t/t5-finetuning-tips/684), task prefixes matter when
(1) doing multi-task training (2) your task is similar or related to one of the supervised tasks used in T5's
pre-training mixture (see Appendix D of the [paper](https://arxiv.org/pdf/1910.10683.pdf) for the task prefixes
used).
- If training on TPU, it is recommended to pad all examples of the dataset to the same length or make use of
*pad_to_multiple_of* to have a small number of predefined bucket sizes to fit all examples in. Dynamically padding
batches to the longest example is not recommended on TPU, as it triggers a recompilation for every batch shape
encountered during training and thus significantly slows down training.
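A minimal sketch of setting up Adafactor for fine-tuning, as referenced in the first tip above (the hyperparameters shown are illustrative, commonly used values, not prescriptions):

```python
from transformers import Adafactor, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Adafactor with a fixed learning rate and no relative-step schedule; values are illustrative
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```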
<a id='inference'></a>
## Inference
At inference time, it is recommended to use [`~generation_utils.GenerationMixin.generate`]. This
method takes care of encoding the input and feeding the encoded hidden states via cross-attention layers to the decoder
and auto-regressively generates the decoder output. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to know all the details about generating text with Transformers.
There's also [this blog post](https://huggingface.co/blog/encoder-decoder#encoder-decoder) which explains how
generation works in general in encoder-decoder models.
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Das Haus ist wunderbar.
```
Note that T5 uses the `pad_token_id` as the `decoder_start_token_id`, so when doing generation without using
[`~generation_utils.GenerationMixin.generate`], make sure you start it with the `pad_token_id`.
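A minimal greedy-decoding sketch illustrating this point (the maximum length is illustrative; in practice you would simply call `generate`):

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids

# start decoding from the pad token, which T5 uses as decoder_start_token_id
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    for _ in range(20):  # illustrative maximum length
        logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids).logits
        next_token = logits[:, -1, :].argmax(-1, keepdim=True)  # greedy choice
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```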
The example above only shows a single example. You can also do batched inference, like so:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# when generating, we will use the logits of right-most token to predict the next token
# so the padding should be on the left
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token  # to avoid an error

task_prefix = 'translate English to German: '
sentences = ['The house is wonderful.', 'I like to work in NYC.']  # use different length sentences to test batching
inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)

output_sequences = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    do_sample=False,  # disable sampling to test if batching affects output
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
# ['Das Haus ist wunderbar.', 'Ich arbeite gerne in NYC.']
```
<a id='scripts'></a>
## Example scripts
T5 is supported by several example scripts, both for pre-training and fine-tuning.
- pre-training: the [run_t5_mlm_flax.py](https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_t5_mlm_flax.py)
script allows you to further pre-train T5 or pre-train T5 from scratch on your own data. The [t5_tokenizer_model.py](https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/t5_tokenizer_model.py)
script allows you to further train a T5 tokenizer or train a T5 Tokenizer from scratch on your own data. Note that
Flax (a neural network library on top of JAX) is particularly useful to train on TPU hardware.
- fine-tuning: T5 is supported by the official summarization scripts ([PyTorch](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization), [Tensorflow](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/summarization), and [Flax](https://github.com/huggingface/transformers/tree/master/examples/flax/summarization)) and translation scripts
([PyTorch](https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation) and [Tensorflow](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/translation)). These scripts allow
you to easily fine-tune T5 on custom data for summarization/translation.
## T5Config
[[autodoc]] T5Config
## T5Tokenizer
[[autodoc]] T5Tokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## T5TokenizerFast
[[autodoc]] T5TokenizerFast
## T5Model
[[autodoc]] T5Model
- forward
- parallelize
- deparallelize
## T5ForConditionalGeneration
[[autodoc]] T5ForConditionalGeneration
- forward
- parallelize
- deparallelize
## T5EncoderModel
[[autodoc]] T5EncoderModel
- forward
- parallelize
- deparallelize
## TFT5Model
[[autodoc]] TFT5Model
- call
## TFT5ForConditionalGeneration
[[autodoc]] TFT5ForConditionalGeneration
- call
## TFT5EncoderModel
[[autodoc]] TFT5EncoderModel
- call
## FlaxT5Model
[[autodoc]] FlaxT5Model
- __call__
- encode
- decode
## FlaxT5ForConditionalGeneration
[[autodoc]] FlaxT5ForConditionalGeneration
- __call__
- encode
- decode
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# T5v1.1

## Overview

T5v1.1 was released in the [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511)
repository by Colin Raffel et al. It's an improved version of the original T5 model.
One can directly plug in the weights of T5v1.1 into a T5 model, like so:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('google/t5-v1_1-base')
```
T5 Version 1.1 includes the following improvements compared to the original T5 model:

- GEGLU activation in the feed-forward hidden layer, rather than ReLU. See [this paper](https://arxiv.org/abs/2002.05202).

- Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.

- No parameter sharing between the embedding and classifier layer.

- "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different: larger `d_model` and smaller
  `num_heads` and `d_ff`.

Note: T5 Version 1.1 was only pre-trained on [C4](https://huggingface.co/datasets/c4) excluding any supervised
training. Therefore, this model has to be fine-tuned before it is usable on a downstream task, unlike the original T5
model. Since T5v1.1 was pre-trained in an unsupervised fashion, there's no real advantage to using a task prefix during single-task
fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
Google has released the following variants:

- [google/t5-v1_1-small](https://huggingface.co/google/t5-v1_1-small)

- [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base)

- [google/t5-v1_1-large](https://huggingface.co/google/t5-v1_1-large)

- [google/t5-v1_1-xl](https://huggingface.co/google/t5-v1_1-xl)

- [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl).
One can refer to [T5's documentation page](t5) for all tips, code examples and notebooks.

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511).
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tapas_architecture.png"
alt="drawing" width="600"/>

<small> TAPAS architecture. Taken from the [official blog post](https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html). </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tensorflow version of this model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/tapas).
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Transformer XL

## Overview
The Transformer-XL model was proposed in [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan
Salakhutdinov. It's a causal (uni-directional) transformer with relative positioning (sinusoïdal) embeddings which can
reuse previously computed hidden-states to attend to longer context (memory). This model also uses adaptive softmax
inputs and outputs (tied).
Tips:

- The original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/kimiyoung/transformer-xl).

<Tip warning={true}>

TransformerXL does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035)

</Tip>
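The memory mechanism described in the overview can be exercised directly: the model returns `mems`, which can be fed back on the next forward pass so the new segment attends to the hidden states of the previous one. A minimal sketch (the `transfo-xl-wt103` checkpoint is the released pretrained model):

```python
import torch
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

first = tokenizer("The quick brown fox", return_tensors="pt")
second = tokenizer("jumps over the lazy dog", return_tensors="pt")

with torch.no_grad():
    out_1 = model(first.input_ids)                    # no memory on the first segment
    out_2 = model(second.input_ids, mems=out_1.mems)  # second segment attends to cached hidden states

print(out_2.prediction_scores.shape)  # (batch_size, sequence_length, vocab_size)
```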
## TransfoXLConfig

[[autodoc]] TransfoXLConfig

## TransfoXLTokenizer

[[autodoc]] TransfoXLTokenizer
- save_vocabulary

## TransfoXL specific outputs

[[autodoc]] models.transfo_xl.modeling_transfo_xl.TransfoXLModelOutput

[[autodoc]] models.transfo_xl.modeling_transfo_xl.TransfoXLLMHeadModelOutput

[[autodoc]] models.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLModelOutput

[[autodoc]] models.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLLMHeadModelOutput

## TransfoXLModel

[[autodoc]] TransfoXLModel
- forward

## TransfoXLLMHeadModel

[[autodoc]] TransfoXLLMHeadModel
- forward

## TransfoXLForSequenceClassification

[[autodoc]] TransfoXLForSequenceClassification
- forward

## TFTransfoXLModel

[[autodoc]] TFTransfoXLModel
- call

## TFTransfoXLLMHeadModel

[[autodoc]] TFTransfoXLLMHeadModel
- call

## TFTransfoXLForSequenceClassification

[[autodoc]] TFTransfoXLForSequenceClassification
- call

## Internal Layers

[[autodoc]] AdaptiveEmbedding

[[autodoc]] TFAdaptiveEmbedding
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/trocr_architecture.jpg"
alt="drawing" width="600"/>

<small> TrOCR architecture. Taken from the [original paper](https://arxiv.org/abs/2109.10282). </small>

Please refer to the [`VisionEncoderDecoder`] class on how to use this model.
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# UniSpeech

## Overview
The UniSpeech model was proposed in [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael
Zeng, Xuedong Huang.

The abstract from the paper is the following:
Tips:

- UniSpeech is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. Please
  use [`Wav2Vec2Processor`] for the feature extraction.
- UniSpeech model can be fine-tuned using connectionist temporal classification (CTC) so the model output has to be
  decoded using [`Wav2Vec2CTCTokenizer`] (a short sketch follows below).

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech).
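As a rough illustration of the two tips above, the sketch below runs CTC-based speech recognition. The checkpoint name is only an example of a CTC fine-tuned UniSpeech model on the Hub; any such checkpoint can be substituted.

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2Processor, UniSpeechForCTC

# example CTC fine-tuned checkpoint; substitute any fine-tuned UniSpeech model from the Hub
checkpoint = "microsoft/unispeech-1350-en-353-fr-ft-1h"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = UniSpeechForCTC.from_pretrained(checkpoint)

ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # (batch, time, vocab)
predicted_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
print(processor.batch_decode(predicted_ids))
```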
## UniSpeechConfig

[[autodoc]] UniSpeechConfig

## UniSpeech specific outputs

[[autodoc]] models.unispeech.modeling_unispeech.UniSpeechBaseModelOutput

[[autodoc]] models.unispeech.modeling_unispeech.UniSpeechForPreTrainingOutput

## UniSpeechModel

[[autodoc]] UniSpeechModel
- forward

## UniSpeechForCTC

[[autodoc]] UniSpeechForCTC
- forward

## UniSpeechForSequenceClassification

[[autodoc]] UniSpeechForSequenceClassification
- forward

## UniSpeechForPreTraining

[[autodoc]] UniSpeechForPreTraining
- forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# UniSpeech-SAT

## Overview
The UniSpeech-SAT model was proposed in [UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware
Pre-Training](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen,
Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

The abstract from the paper is the following:
Tips:

- UniSpeechSat is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
  Please use [`Wav2Vec2Processor`] for the feature extraction.
- UniSpeechSat model can be fine-tuned using connectionist temporal classification (CTC) so the model output has to be
  decoded using [`Wav2Vec2CTCTokenizer`].
- UniSpeechSat performs especially well on speaker verification, speaker identification, and speaker diarization tasks
  (see the sketch below).

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT).
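As a rough sketch of the speaker-verification use case mentioned in the tips, the example below extracts x-vector speaker embeddings and compares two utterances with cosine similarity; the checkpoint name is illustrative of the speaker-verification UniSpeech-SAT checkpoints on the Hub.

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForXVector

# illustrative speaker-verification checkpoint; substitute any x-vector UniSpeech-SAT model
checkpoint = "microsoft/unispeech-sat-base-plus-sv"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = UniSpeechSatForXVector.from_pretrained(checkpoint)

ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
audio = [ds[0]["audio"]["array"], ds[1]["audio"]["array"]]
inputs = feature_extractor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    embeddings = model(**inputs).embeddings  # one x-vector per utterance
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

# cosine similarity between the two speaker embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
print(float(similarity))
```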
## UniSpeechSatConfig

[[autodoc]] UniSpeechSatConfig

## UniSpeechSat specific outputs

[[autodoc]] models.unispeech_sat.modeling_unispeech_sat.UniSpeechSatBaseModelOutput

[[autodoc]] models.unispeech_sat.modeling_unispeech_sat.UniSpeechSatForPreTrainingOutput

## UniSpeechSatModel

[[autodoc]] UniSpeechSatModel
- forward

## UniSpeechSatForCTC

[[autodoc]] UniSpeechSatForCTC
- forward

## UniSpeechSatForSequenceClassification

[[autodoc]] UniSpeechSatForSequenceClassification
- forward

## UniSpeechSatForAudioFrameClassification

[[autodoc]] UniSpeechSatForAudioFrameClassification
- forward

## UniSpeechSatForXVector

[[autodoc]] UniSpeechSatForXVector
- forward

## UniSpeechSatForPreTraining

[[autodoc]] UniSpeechSatForPreTraining
- forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# VisionTextDualEncoder
## Overview
The [`VisionTextDualEncoderModel`] can be used to initialize a vision-text dual encoder model with
any pretrained vision autoencoding model as the vision encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit)) and any pretrained text autoencoding model as the text encoder (*e.g.* [RoBERTa](roberta), [BERT](bert)). Two projection layers are added on top of both the vision and text encoder to project the output embeddings
to a shared latent space. The projection layers are randomly initialized so the model should be fine-tuned on a
downstream task. This model can be used to align the vision-text embeddings using CLIP-like contrastive image-text
training and then can be used for zero-shot vision tasks such as image classification or retrieval.

In [LiT: Zero-Shot Transfer with Locked-image Text Tuning](https://arxiv.org/abs/2111.07991) it is shown how
leveraging pre-trained (locked/frozen) image and text models for contrastive learning yields a significant improvement on
new zero-shot vision tasks such as image classification or retrieval.
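A minimal sketch of assembling such a dual encoder from a pretrained ViT vision encoder and a pretrained BERT text encoder (the checkpoint names are just examples); the newly initialized projection layers still need contrastive fine-tuning before the embeddings are aligned:

```python
from transformers import (
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
    ViTFeatureExtractor,
    BertTokenizer,
)

# load a pretrained vision and text backbone; the projection layers on top are randomly initialized
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "bert-base-uncased"
)
processor = VisionTextDualEncoderProcessor(
    ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224"),
    BertTokenizer.from_pretrained("bert-base-uncased"),
)

# after contrastive fine-tuning, image and text embeddings live in the same space; roughly:
# outputs = model(input_ids=..., pixel_values=...)
# outputs.logits_per_image then gives image-text similarity scores
```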
## VisionTextDualEncoderConfig
[[autodoc]] VisionTextDualEncoderConfig
## VisionTextDualEncoderProcessor
[[autodoc]] VisionTextDualEncoderProcessor
## VisionTextDualEncoderModel
[[autodoc]] VisionTextDualEncoderModel
- forward
## FlaxVisionTextDualEncoderModel
[[autodoc]] FlaxVisionTextDualEncoderModel
- __call__
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Vision Encoder Decoder Models
The [`VisionEncoderDecoderModel`] can be used to initialize an image-to-text-sequence model with any
pretrained vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit))
and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert)).
The effectiveness of initializing image-to-text-sequence models with pretrained checkpoints has been shown in (for
example) [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
Zhoujun Li, Furu Wei.
An example of how to use a [`VisionEncoderDecoderModel`] for inference can be seen in [TrOCR](trocr).
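For instance, a ViT encoder and a GPT-2 decoder can be combined as in the hedged sketch below; the cross-attention layers connecting the two are newly initialized, so the combined model should be fine-tuned on an image-to-text task (e.g. captioning or OCR) before it produces useful output.

```python
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, GPT2Tokenizer

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# tell generation where to start and how to pad; GPT-2 has no pad token, so reusing EOS is a common workaround
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id

# inference after fine-tuning would look roughly like:
# pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
# generated_ids = model.generate(pixel_values)
# caption = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```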
## VisionEncoderDecoderConfig
[[autodoc]] VisionEncoderDecoderConfig
## VisionEncoderDecoderModel
[[autodoc]] VisionEncoderDecoderModel
- forward
- from_encoder_decoder_pretrained
## FlaxVisionEncoderDecoderModel
[[autodoc]] FlaxVisionEncoderDecoderModel
- __call__
- from_encoder_decoder_pretrained
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# VisualBERT

## Overview
The VisualBERT model was proposed in [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
VisualBERT is a neural network trained on a variety of (image, text) pairs.

The abstract from the paper is the following:
Tips:

1. Most of the checkpoints provided work with the [`VisualBertForPreTraining`] configuration. Other
   checkpoints provided are the fine-tuned checkpoints for down-stream tasks - VQA ('visualbert-vqa'), VCR
   ('visualbert-vcr'), NLVR2 ('visualbert-nlvr2'). Hence, if you are not working on these downstream tasks, it is
   recommended that you use the pretrained checkpoints.
We do not provide the detector and its weights as a part of the package, but it will be available in the research
projects, and the states can be loaded directly into the detector provided.
## Usage

VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice,
visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare
layer, and is expected to be bound by [CLS] and [SEP] tokens, as in BERT. The segment IDs must also be set
appropriately for the textual and visual parts.
The [`BertTokenizer`] is used to encode the text. A custom detector/feature extractor must be used
to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:

- [VisualBERT VQA demo notebook](https://github.com/huggingface/transformers/tree/master/examples/research_projects/visual_bert) : This notebook
  contains an example on VisualBERT VQA.

- [Generate Embeddings for VisualBERT (Colab Notebook)](https://colab.research.google.com/drive/1bLGxKdldwqnMVA5x4neY7-l_8fKGWQYI?usp=sharing) : This notebook contains
  an example on how to generate visual embeddings.

The following example shows how to get the last hidden state using [`VisualBertModel`]:
```python
>>> import torch
>>> from transformers import BertTokenizer, VisualBertModel

>>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

>>> inputs = tokenizer("What is the man eating?", return_tensors="pt")
>>> # this is a custom function that returns the visual embeddings given the image path
>>> visual_embeds = get_visual_embeddings(image_path)

>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update({
...     "visual_embeds": visual_embeds,
...     "visual_token_type_ids": visual_token_type_ids,
...     "visual_attention_mask": visual_attention_mask
... })

>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
```
This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/uclanlp/visualbert).
## VisualBertConfig

[[autodoc]] VisualBertConfig

## VisualBertModel

[[autodoc]] VisualBertModel
- forward

## VisualBertForPreTraining

[[autodoc]] VisualBertForPreTraining
- forward

## VisualBertForQuestionAnswering

[[autodoc]] VisualBertForQuestionAnswering
- forward

## VisualBertForMultipleChoice

[[autodoc]] VisualBertForMultipleChoice
- forward

## VisualBertForVisualReasoning

[[autodoc]] VisualBertForVisualReasoning
- forward

## VisualBertForRegionToPhraseAlignment

[[autodoc]] VisualBertForRegionToPhraseAlignment
- forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Vision Transformer (ViT)

<Tip>

This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).

</Tip>
## Overview

The Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk
Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob
Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transformer encoder on ImageNet, attaining
very good results compared to familiar convolutional architectures.
Tips:

- Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer).
- To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
  which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be
  used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of
  vectors to a standard Transformer encoder.
- As the Vision Transformer expects each image to be of the same size (resolution), one can use
  [`ViTFeatureExtractor`] to resize (or rescale) and normalize images for the model (see the sketch below).
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
  each checkpoint. For example, `google/vit-base-patch16-224` refers to a base-sized architecture with patch
  resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=vit).
- The available checkpoints are either (1) pre-trained on [ImageNet-21k](http://www.image-net.org/) (a collection of
  14 million images and 21k classes) only, or (2) also fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
  images and 1,000 classes).
- The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it is often beneficial to
  use a higher resolution than pre-training [(Touvron et al., 2019)](https://arxiv.org/abs/1906.06423), [(Kolesnikov
  et al., 2020)](https://arxiv.org/abs/1912.11370). In order to fine-tune at higher resolution, the authors perform
  2D interpolation of the pre-trained position embeddings, according to their location in the original image.
- The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed
  an experiment with a self-supervised pre-training objective, namely masked patched prediction (inspired by masked
Following the original Vision Transformer, some follow-up works have been made:

- DeiT (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision transformers. Refer to
  [DeiT's documentation page](deit). The authors of DeiT also released more efficiently trained ViT models, which
  you can directly plug into [`ViTModel`] or [`ViTForImageClassification`]. There
  are 4 variants available (in 3 different sizes): *facebook/deit-tiny-patch16-224*, *facebook/deit-small-patch16-224*,
  *facebook/deit-base-patch16-224* and *facebook/deit-base-patch16-384*. Note that one should use
  [`DeiTFeatureExtractor`] in order to prepare images for the model.

- BEiT (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained
  vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.
  Refer to [BEiT's documentation page](beit).

- DINO (a method for self-supervised training of Vision Transformers) by Facebook AI. Vision Transformers trained using
  the DINO method show very interesting properties not seen with convolutional models. They are capable of segmenting
  objects, without having ever been trained to do so. DINO checkpoints can be found on the [hub](https://huggingface.co/models?other=dino).
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code (written in JAX) can be
found [here](https://github.com/google-research/vision_transformer).

Note that we converted the weights from Ross Wightman's [timm library](https://github.com/rwightman/pytorch-image-models), who already converted the weights from JAX to PyTorch. Credits
go to him!
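As a quick illustration of the tips above, classification with an ImageNet fine-tuned checkpoint looks roughly like the sketch below (the COCO image URL is just a convenient test image):

```python
import torch
import requests
from PIL import Image
from transformers import ViTFeatureExtractor, ViTForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# the feature extractor resizes to 224x224 and normalizes, as described in the tips above
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```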
## ViTConfig

[[autodoc]] ViTConfig

## ViTFeatureExtractor

[[autodoc]] ViTFeatureExtractor
- __call__

## ViTModel

[[autodoc]] ViTModel
- forward

## ViTForImageClassification

[[autodoc]] ViTForImageClassification
- forward

## TFViTModel

[[autodoc]] TFViTModel
- call

## TFViTForImageClassification

[[autodoc]] TFViTForImageClassification
- call

## FlaxViTModel

[[autodoc]] FlaxViTModel
- __call__

## FlaxViTForImageClassification

[[autodoc]] FlaxViTForImageClassification
- __call__