Unverified Commit 5964f820 authored by Maria Khalusova, committed by GitHub

[Docs] Model_doc structure/clarity improvements (#26876)

* first batch of structure improvements for model_docs

* second batch of structure improvements for model_docs

* more structure improvements for model_docs

* more structure improvements for model_docs

* structure improvements for cv model_docs

* more structural refactoring

* addressed feedback about image processors
@@ -25,15 +25,15 @@ The abstract from the paper is the following:
*fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs.*
Tips:
This model was contributed by [andreasmaden](https://huggingface.co/andreasmaden).
The original code can be found [here](https://github.com/princeton-nlp/DinkyTrain).
## Usage tips
- The implementation is the same as [Roberta](roberta) except that instead of _Add and Norm_ it uses _Norm and Add_. _Add_ and _Norm_ refer to the addition and layer normalization described in [Attention Is All You Need](https://arxiv.org/abs/1706.03762).
- This is identical to using the `--encoder-normalize-before` flag in [fairseq](https://fairseq.readthedocs.io/).
This model was contributed by [andreasmaden](https://huggingface.co/andreasmaden).
The original code can be found [here](https://github.com/princeton-nlp/DinkyTrain).
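Since the API mirrors RoBERTa, masked language modeling works the same way. A minimal sketch, assuming the `andreasmadsen/efficient_mlm_m0.40` checkpoint from the paper's release (substitute any RoBERTa-PreLayerNorm checkpoint from the Hub):

```py
import torch
from transformers import AutoTokenizer, RobertaPreLayerNormForMaskedLM

# assumed checkpoint name; any RobertaPreLayerNorm checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained("andreasmadsen/efficient_mlm_m0.40")
model = RobertaPreLayerNormForMaskedLM.from_pretrained("andreasmadsen/efficient_mlm_m0.40")

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# decode the highest-scoring prediction for the masked position
mask_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_index].argmax(dim=-1)))
```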
## Documentation resources
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
@@ -46,6 +46,9 @@ The original code can be found [here](https://github.com/princeton-nlp/DinkyTrai
[[autodoc]] RobertaPreLayerNormConfig
<frameworkcontent>
<pt>
## RobertaPreLayerNormModel
[[autodoc]] RobertaPreLayerNormModel
@@ -81,6 +84,9 @@ The original code can be found [here](https://github.com/princeton-nlp/DinkyTrai
[[autodoc]] RobertaPreLayerNormForQuestionAnswering
- forward
</pt>
<tf>
## TFRobertaPreLayerNormModel
[[autodoc]] TFRobertaPreLayerNormModel
@@ -116,6 +122,9 @@ The original code can be found [here](https://github.com/princeton-nlp/DinkyTrai
[[autodoc]] TFRobertaPreLayerNormForQuestionAnswering
- call
</tf>
<jax>
## FlaxRobertaPreLayerNormModel
[[autodoc]] FlaxRobertaPreLayerNormModel
@@ -150,3 +159,6 @@ The original code can be found [here](https://github.com/princeton-nlp/DinkyTrai
[[autodoc]] FlaxRobertaPreLayerNormForQuestionAnswering
- __call__
</jax>
</frameworkcontent>
@@ -47,7 +47,9 @@ model published after it. Our best model achieves state-of-the-art results on GL
highlight the importance of previously overlooked design choices, and raise questions about the source of recently
reported improvements. We release our models and code.*
Tips:
This model was contributed by [julien-c](https://huggingface.co/julien-c). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/roberta).
## Usage tips
- This implementation is the same as [`BertModel`] with a tiny embeddings tweak as well as a setup
for Roberta pretrained models.
@@ -63,8 +65,6 @@ Tips:
* use BPE with bytes as a subunit and not characters (because of unicode characters)
- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to this page for usage examples.
This model was contributed by [julien-c](https://huggingface.co/julien-c). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/roberta).
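As a quick illustration of the byte-level BPE behavior described in the tips above, a leading space changes the tokenization (`Ġ` marks a space encoded into the token):

```py
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# the space before a word is treated as part of the token itself
print(tokenizer.tokenize("Hello world"))   # ['Hello', 'Ġworld']
print(tokenizer.tokenize(" Hello world"))  # ['ĠHello', 'Ġworld']
```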
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with RoBERTa. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
@@ -127,6 +127,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] RobertaTokenizerFast
- build_inputs_with_special_tokens
<frameworkcontent>
<pt>
## RobertaModel
[[autodoc]] RobertaModel
@@ -162,6 +165,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] RobertaForQuestionAnswering
- forward
</pt>
<tf>
## TFRobertaModel
[[autodoc]] TFRobertaModel
@@ -197,6 +203,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] TFRobertaForQuestionAnswering
- call
</tf>
<jax>
## FlaxRobertaModel
[[autodoc]] FlaxRobertaModel
@@ -231,3 +240,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] FlaxRobertaForQuestionAnswering
- __call__
</jax>
</frameworkcontent>
@@ -35,7 +35,7 @@ in the toxic content detection task under human-made attacks.*
This model was contributed by [weiweishi](https://huggingface.co/weiweishi).
## Documentation resources
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
@@ -49,7 +49,6 @@ This model was contributed by [weiweishi](https://huggingface.co/weiweishi).
[[autodoc]] RoCBertConfig
- all
## RoCBertTokenizer
[[autodoc]] RoCBertTokenizer
@@ -58,31 +57,26 @@ This model was contributed by [weiweishi](https://huggingface.co/weiweishi).
- create_token_type_ids_from_sequences
- save_vocabulary
## RoCBertModel
[[autodoc]] RoCBertModel
- forward
## RoCBertForPreTraining
[[autodoc]] RoCBertForPreTraining
- forward
## RoCBertForCausalLM
[[autodoc]] RoCBertForCausalLM
- forward
## RoCBertForMaskedLM
[[autodoc]] RoCBertForMaskedLM
- forward
## RoCBertForSequenceClassification
[[autodoc]] transformers.RoCBertForSequenceClassification
@@ -93,13 +87,11 @@ This model was contributed by [weiweishi](https://huggingface.co/weiweishi).
[[autodoc]] transformers.RoCBertForMultipleChoice
- forward
## RoCBertForTokenClassification
[[autodoc]] transformers.RoCBertForTokenClassification
- forward
## RoCBertForQuestionAnswering
[[autodoc]] RoCBertForQuestionAnswering
......
@@ -33,15 +33,13 @@ transformer with rotary position embedding, or RoFormer, achieves superior perfo
release the theoretical analysis along with some preliminary experiment results on Chinese data. The undergoing
experiment for English benchmark will soon be updated.*
Tips:
- RoFormer is a BERT-like autoencoding model with rotary position embeddings. Rotary position embeddings have shown
improved performance on classification tasks with long texts.
This model was contributed by [junnyu](https://huggingface.co/junnyu). The original code can be found [here](https://github.com/ZhuiyiTechnology/roformer).
## Documentation resources
## Usage tips
RoFormer is a BERT-like autoencoding model with rotary position embeddings. Rotary position embeddings have shown
improved performance on classification tasks with long texts.
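A minimal feature-extraction sketch, assuming the `junnyu/roformer_chinese_base` checkpoint from the contributor:

```py
import torch
from transformers import AutoTokenizer, RoFormerModel

tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_base")
model = RoFormerModel.from_pretrained("junnyu/roformer_chinese_base")

inputs = tokenizer("今天天气非常好。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# rotary position embeddings are applied inside the self-attention layers
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```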
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
@@ -67,6 +65,9 @@ This model was contributed by [junnyu](https://huggingface.co/junnyu). The origi
[[autodoc]] RoFormerTokenizerFast
- build_inputs_with_special_tokens
<frameworkcontent>
<pt>
## RoFormerModel
[[autodoc]] RoFormerModel
@@ -102,6 +103,9 @@ This model was contributed by [junnyu](https://huggingface.co/junnyu). The origi
[[autodoc]] RoFormerForQuestionAnswering
- forward
</pt>
<tf>
## TFRoFormerModel
[[autodoc]] TFRoFormerModel
@@ -137,6 +141,9 @@ This model was contributed by [junnyu](https://huggingface.co/junnyu). The origi
[[autodoc]] TFRoFormerForQuestionAnswering
- call
</tf>
<jax>
## FlaxRoFormerModel
[[autodoc]] FlaxRoFormerModel
@@ -166,3 +173,6 @@ This model was contributed by [junnyu](https://huggingface.co/junnyu). The origi
[[autodoc]] FlaxRoFormerForQuestionAnswering
- __call__
</jax>
</frameworkcontent>
@@ -27,7 +27,7 @@ This can be more efficient than a regular Transformer and can deal with sentence
This model was contributed by [sgugger](https://huggingface.co/sgugger).
The original code can be found [here](https://github.com/BlinkDL/RWKV-LM).
Example of use as an RNN:
## Usage example
```py
import torch
@@ -73,7 +73,6 @@ output = model.generate(inputs["input_ids"], max_new_tokens=64, stopping_criteri
[[autodoc]] RwkvConfig
## RwkvModel
[[autodoc]] RwkvModel
......
@@ -43,7 +43,7 @@ The figure below illustrates the architecture of SegFormer. Taken from the [orig
This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version
of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/NVlabs/SegFormer).
Tips:
## Usage tips
- SegFormer consists of a hierarchical Transformer encoder, and a lightweight all-MLP decoder head.
[`SegformerModel`] is the hierarchical Transformer encoder (which in the paper is also referred to
@@ -123,6 +123,9 @@ If you're interested in submitting a resource to be included here, please feel f
- preprocess
- post_process_semantic_segmentation
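The `post_process_semantic_segmentation` method listed above turns the decode head's logits into a per-pixel class map. A minimal sketch, assuming the `nvidia/segformer-b0-finetuned-ade-512-512` checkpoint:

```py
import requests
import torch
from PIL import Image
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# upsample the low-resolution logits and take the argmax per pixel
segmentation = processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
print(segmentation.shape)  # (height, width), one predicted class id per pixel
```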
<frameworkcontent>
<pt>
## SegformerModel
[[autodoc]] SegformerModel
@@ -143,6 +146,9 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] SegformerForSemanticSegmentation
- forward
</pt>
<tf>
## TFSegformerDecodeHead
[[autodoc]] TFSegformerDecodeHead
@@ -162,3 +168,6 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] TFSegformerForSemanticSegmentation
- call
</tf>
</frameworkcontent>
\ No newline at end of file
@@ -32,15 +32,15 @@ variety of training setups. For example, under the 100h-960h semi-supervised set
inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference
time, SEW reduces word error rate by 25-50% across different model sizes.*
Tips:
This model was contributed by [anton-l](https://huggingface.co/anton-l).
## Usage tips
- SEW-D is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- SEWDForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
using [`Wav2Vec2CTCTokenizer`].
This model was contributed by [anton-l](https://huggingface.co/anton-l).
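A minimal transcription sketch following the two tips above (raw waveform in, CTC decoding out), assuming the `asapp/sew-d-tiny-100k-ft-ls100h` checkpoint:

```py
import torch
from datasets import load_dataset
from transformers import AutoProcessor, SEWDForCTC

processor = AutoProcessor.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h")

# the model consumes the raw waveform as a float array
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# CTC: take the argmax over the vocabulary and collapse repeats with the tokenizer
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```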
## Documentation resources
## Resources
- [Audio classification task guide](../tasks/audio_classification)
- [Automatic speech recognition task guide](../tasks/asr)
......
@@ -32,15 +32,15 @@ variety of training setups. For example, under the 100h-960h semi-supervised set
inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference
time, SEW reduces word error rate by 25-50% across different model sizes.*
Tips:
This model was contributed by [anton-l](https://huggingface.co/anton-l).
## Usage tips
- SEW is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- SEWForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using
[`Wav2Vec2CTCTokenizer`].
This model was contributed by [anton-l](https://huggingface.co/anton-l).
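Since SEW follows the same raw-waveform-plus-CTC pattern, the high-level `pipeline` API is the quickest way to try it; a sketch assuming the `asapp/sew-tiny-100k-ft-ls100h` checkpoint and a hypothetical local 16kHz audio file:

```py
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="asapp/sew-tiny-100k-ft-ls100h")
print(asr("path/to/audio.wav"))  # hypothetical local file; returns {'text': '...'}
```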
## Documentation resources
## Resources
- [Audio classification task guide](../tasks/audio_classification)
- [Automatic speech recognition task guide](../tasks/asr)
......
@@ -27,7 +27,6 @@ transcripts/translations autoregressively. Speech2Text has been fine-tuned on se
This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text).
## Inference
Speech2Text is a speech model that accepts a float tensor of log-mel filter-bank features extracted from the speech
@@ -44,7 +43,6 @@ install those packages before running the examples. You could either install tho
`pip install "transformers[speech,sentencepiece]"` or install the packages separately with `pip install torchaudio sentencepiece`. Also `torchaudio` requires the development version of the [libsndfile](http://www.mega-nerd.com/libsndfile/) package which can be installed via a system package manager. On Ubuntu it can
be installed as follows: `apt install libsndfile1-dev`
- ASR and Speech Translation
```python
@@ -98,7 +96,6 @@ be installed as follows: `apt install libsndfile1-dev`
See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look for Speech2Text checkpoints.
## Speech2TextConfig
[[autodoc]] Speech2TextConfig
@@ -125,6 +122,9 @@ See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look
- batch_decode
- decode
<frameworkcontent>
<pt>
## Speech2TextModel
[[autodoc]] Speech2TextModel
@@ -135,6 +135,9 @@ See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look
[[autodoc]] Speech2TextForConditionalGeneration
- forward
</pt>
<tf>
## TFSpeech2TextModel
[[autodoc]] TFSpeech2TextModel
@@ -144,3 +147,6 @@ See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look
[[autodoc]] TFSpeech2TextForConditionalGeneration
- call
</tf>
</frameworkcontent>
@@ -31,8 +31,7 @@ This model was contributed by [Patrick von Platen](https://huggingface.co/patric
The original code can be found [here](https://github.com/pytorch/fairseq/blob/1f7ef9ed1e1061f8c7f88f8b94c7186834398690/fairseq/models/wav2vec/wav2vec2_asr.py#L266).
Tips:
## Usage tips
- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
the [official models](https://huggingface.co/models?other=speech2text2).
@@ -98,7 +97,7 @@ predicted token ids.
See [model hub](https://huggingface.co/models?filter=speech2text2) to look for Speech2Text2 checkpoints.
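Speech2Text2 is a decoder-only model and is used together with a speech encoder inside [`SpeechEncoderDecoderModel`]. A minimal speech-translation sketch, assuming the `facebook/s2t-wav2vec2-large-en-de` checkpoint:

```py
from datasets import load_dataset
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

# the wav2vec 2.0 encoder consumes the waveform; Speech2Text2 decodes the translation
generated_ids = model.generate(inputs=inputs.input_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```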
## Documentation resources
## Resources
- [Causal language modeling task guide](../tasks/language_modeling)
......
@@ -34,7 +34,9 @@ are replaced with a special token, viewed as a question representation, that is
the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD
with only 128 training examples), while maintaining competitive performance in the high-resource setting.
Tips:
This model was contributed by [yuvalkirstain](https://huggingface.co/yuvalkirstain) and [oriram](https://huggingface.co/oriram). The original code can be found [here](https://github.com/oriram/splinter).
## Usage tips
- Splinter was trained to predict answer spans conditioned on a special [QUESTION] token. These tokens are contextualized
into question representations, which are used to predict the answers. This layer is called QASS, and is the default
@@ -49,9 +51,7 @@ Tips:
doesn't (*tau/splinter-base* and *tau/splinter-large*). This is done to support randomly initializing this layer at
fine-tuning, as it is shown to yield better results for some cases in the paper.
This model was contributed by [yuvalkirstain](https://huggingface.co/yuvalkirstain) and [oriram](https://huggingface.co/oriram). The original code can be found [here](https://github.com/oriram/splinter).
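A minimal extractive question-answering sketch, assuming the `tau/splinter-base-qass` checkpoint so that the pretrained QASS layer described above is used:

```py
import torch
from transformers import AutoTokenizer, SplinterForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("tau/splinter-base-qass")
model = SplinterForQuestionAnswering.from_pretrained("tau/splinter-base-qass")

question = "Who wrote the play?"
context = "The play was written by William Shakespeare in 1603."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# pick the most likely start/end positions of the answer span
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs.input_ids[0, start : end + 1]))
```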
## Documentation resources
## Resources
- [Question answering task guide](../tasks/question-answering)
......
@@ -38,7 +38,9 @@ self-attention layers with grouped convolutions, and we use this technique in a
SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test
set. The SqueezeBERT code will be released.*
Tips:
This model was contributed by [forresti](https://huggingface.co/forresti).
## Usage tips
- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
rather than the left.
@@ -48,9 +50,7 @@ Tips:
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
*squeezebert/squeezebert-mnli-headless* checkpoint.
This model was contributed by [forresti](https://huggingface.co/forresti).
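Following the tip above, a fine-tuning run for sequence classification might start from the headless MNLI checkpoint; a minimal sketch (the `num_labels` value is task-specific):

```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
# a fresh classification head is initialized on top of the pretrained encoder
model = AutoModelForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-mnli-headless", num_labels=2
)

# pad on the right, as recommended for models with absolute position embeddings
inputs = tokenizer("An example sentence.", padding="max_length", max_length=128, return_tensors="pt")
```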
## Documentation resources
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
......
@@ -26,14 +26,9 @@ The abstract from the paper is the following:
*Self-attention has become a defacto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.*
Tips:
- One can use the [`ViTImageProcessor`] API to prepare images for the model.
This model was contributed by [shehan97](https://huggingface.co/shehan97).
The original code can be found [here](https://github.com/Amshaker/SwiftFormer).
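A minimal image-classification sketch using [`ViTImageProcessor`] as suggested in the tip above, assuming the `MBZUAI/swiftformer-xs` checkpoint:

```py
import requests
import torch
from PIL import Image
from transformers import SwiftFormerForImageClassification, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("MBZUAI/swiftformer-xs")
model = SwiftFormerForImageClassification.from_pretrained("MBZUAI/swiftformer-xs")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```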
## SwiftFormerConfig
[[autodoc]] SwiftFormerConfig
......
@@ -36,11 +36,6 @@ prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO
+2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.*
Tips:
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
- Swin pads the inputs, so it supports any input height and width divisible by `32`.
- Swin can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/swin_transformer_architecture.png"
alt="drawing" width="600"/>
@@ -48,6 +43,10 @@ alt="drawing" width="600"/>
This model was contributed by [novice03](https://huggingface.co/novice03). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts). The original code can be found [here](https://github.com/microsoft/Swin-Transformer).
## Usage tips
- Swin pads the inputs, so it supports any input height and width divisible by `32`.
- Swin can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`.
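A minimal sketch of the backbone behavior described above, assuming the `microsoft/swin-tiny-patch4-window7-224` checkpoint; random pixel values stand in for a real preprocessed image:

```py
import torch
from transformers import SwinModel

model = SwinModel.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

pixel_values = torch.randn(1, 3, 224, 224)  # any height/width divisible by 32 works
with torch.no_grad():
    outputs = model(pixel_values, output_hidden_states=True)

print(outputs.hidden_states[0].shape)           # (batch_size, sequence_length, num_channels)
print(outputs.reshaped_hidden_states[0].shape)  # (batch_size, num_channels, height, width)
```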
## Resources
@@ -68,6 +67,8 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] SwinConfig
<frameworkcontent>
<pt>
## SwinModel
@@ -84,6 +85,9 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] transformers.SwinForImageClassification
- forward
</pt>
<tf>
## TFSwinModel
[[autodoc]] TFSwinModel
@@ -98,3 +102,6 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] transformers.TFSwinForImageClassification
- call
</tf>
</frameworkcontent>
\ No newline at end of file
@@ -24,9 +24,6 @@ The abstract from the paper is the following:
*Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.*
Tips:
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
This model was contributed by [nandwalritik](https://huggingface.co/nandwalritik).
The original code can be found [here](https://github.com/microsoft/Swin-Transformer).
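A minimal feature-extraction sketch using the [`AutoImageProcessor`] API mentioned above, assuming the `microsoft/swinv2-tiny-patch4-window8-256` checkpoint:

```py
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, Swinv2Model

processor = AutoImageProcessor.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
model = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```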
......
@@ -23,19 +23,18 @@ The SwitchTransformers model was proposed in [Switch Transformers: Scaling to Tr
The Switch Transformer model uses a sparse T5 encoder-decoder architecture, where the MLPs are replaced by a Mixture of Experts (MoE). A routing mechanism (top-1 in this case) associates each token with one of the experts, where each expert is a dense MLP. While Switch Transformers have many more weights than their equivalent dense models, the sparsity allows better scaling and better finetuning performance at scale.
During a forward pass, only a fraction of the weights are used. The routing mechanism allows the model to select relevant weights on the fly which increases the model capacity without increasing the number of operations.
The abstract from the paper is the following:
*In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.*
Tips:
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker).
The original code can be found [here](https://github.com/google/flaxformer/tree/main/flaxformer/architectures/moe).
## Usage tips
- SwitchTransformers uses the [`T5Tokenizer`], which can be loaded directly from each model's repository.
- The released weights are pretrained on an English [masked language modeling](../glossary#general-terms) task, and should be finetuned.
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker).
The original code can be found [here](https://github.com/google/flaxformer/tree/main/flaxformer/architectures/moe).
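Since the released weights are MLM-pretrained, a quick check uses the T5 sentinel-token format; a minimal sketch, assuming the `google/switch-base-8` checkpoint:

```py
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-8")

# masked span prediction with T5-style sentinel tokens; routing picks one expert per token
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```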
## Resources
- [Translation task guide](../tasks/translation)
......
@@ -45,7 +45,11 @@ with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-
summarization, question answering, text classification, and more. To facilitate future work on transfer learning for
NLP, we release our dataset, pre-trained models, and code.*
Tips:
All checkpoints can be found on the [hub](https://huggingface.co/models?search=t5).
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/text-to-text-transfer-transformer).
## Usage tips
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which
each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
@@ -91,12 +95,6 @@ Based on the original T5 model, Google has released some follow-up works:
- **UMT5**: UMT5 is a multilingual T5 model trained on an improved and refreshed mC4 multilingual corpus, 29 trillion characters across 107 languages, using a new sampling method, UniMax. Refer to
the documentation of mT5 which can be found [here](umt5).
All checkpoints can be found on the [hub](https://huggingface.co/models?search=t5).
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/text-to-text-transfer-transformer).
<a id='training'></a>
## Training
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher
@@ -249,8 +247,6 @@ batches to the longest example is not recommended on TPU as it triggers a recomp
encountered during training thus significantly slowing down the training. Only padding up to the longest example in a
batch (dynamic padding) leads to very slow training on TPU.
<a id='inference'></a>
## Inference
At inference time, it is recommended to use [`~generation.GenerationMixin.generate`]. This
@@ -316,9 +312,6 @@ The predicted tokens will then be placed between the sentinel tokens.
['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']
```
<a id='scripts'></a>
## Performance
If you'd like a faster training and inference performance, install [apex](https://github.com/NVIDIA/apex#quick-start) and then the model will automatically use `apex.normalization.FusedRMSNorm` instead of `T5LayerNorm`. The former uses an optimized fused kernel which is several times faster than the latter.
@@ -386,6 +379,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] T5TokenizerFast
<frameworkcontent>
<pt>
## T5Model
[[autodoc]] T5Model
@@ -411,6 +407,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] T5ForQuestionAnswering
- forward
</pt>
<tf>
## TFT5Model
[[autodoc]] TFT5Model
@@ -426,6 +425,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] TFT5EncoderModel
- call
</tf>
<jax>
## FlaxT5Model
[[autodoc]] FlaxT5Model
@@ -444,3 +446,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] FlaxT5EncoderModel
- __call__
</jax>
</frameworkcontent>
@@ -20,6 +20,10 @@ rendered properly in your Markdown viewer.
T5v1.1 was released in the [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511)
repository by Colin Raffel et al. It's an improved version of the original T5 model.
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511).
## Usage tips
One can directly plug in the weights of T5v1.1 into a T5 model, like so:
@@ -59,7 +63,9 @@ Google has released the following variants:
- [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl).
One can refer to [T5's documentation page](t5) for all tips, code examples and notebooks.
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511).
<Tip>
Refer to [T5's documentation page](t5) for all API reference, tips, code examples and notebooks.
</Tip>
\ No newline at end of file
@@ -33,16 +33,15 @@ significant increase in training performance and a more reliable estimate of mod
object detection models trained on PubTables-1M produce excellent results for all three tasks of detection, structure recognition, and functional analysis without the need for any
special customization for these tasks.*
Tips:
- The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) (the task of recognizing the individual rows, columns etc. in a table).
- One can use the [`AutoImageProcessor`] API to prepare images and optional targets for the model. This will load a [`DetrImageProcessor`] behind the scenes.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/table_transformer_architecture.jpeg"
alt="drawing" width="600"/>
<small> Table detection and table structure recognition clarified. Taken from the <a href="https://arxiv.org/abs/2110.00061">original paper</a>. </small>
The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in
documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition)
(the task of recognizing the individual rows, columns etc. in a table).
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be
found [here](https://github.com/microsoft/table-transformer).
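A minimal table-detection sketch using the [`AutoImageProcessor`] API described above (it loads a [`DetrImageProcessor`] behind the scenes); the input file name is hypothetical:

```py
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

image = Image.open("document_page.png").convert("RGB")  # hypothetical document image

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# keep detections above a confidence threshold, rescaled to the original image size
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=torch.tensor([image.size[::-1]])
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```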
......
@@ -44,7 +44,7 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tensorflow version of this model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/tapas).
Tips:
## Usage tips
- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell of the table). Note that this is something that was added after the publication of the original TAPAS paper. According to the authors, this usually results in a slightly better performance, and allows you to encode longer sequences without running out of embeddings. This is reflected in the `reset_position_index_per_cell` parameter of [`TapasConfig`], which is set to `True` by default. The default versions of the models available on the [hub](https://huggingface.co/models?search=tapas) all use relative position embeddings. You can still use the ones with absolute position embeddings by passing in an additional argument `revision="no_reset"` when calling the `from_pretrained()` method. Note that it's usually advised to pad the inputs on the right rather than the left.
- TAPAS is based on BERT, so `TAPAS-base` for example corresponds to a `BERT-base` architecture. Of course, `TAPAS-large` will result in the best performance (the results reported in the paper are from `TAPAS-large`). Results of the various sized models are shown on the [original GitHub repository](https://github.com/google-research/tapas).
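A minimal sketch of the `revision` switch described in the first tip, toggling between relative and absolute position embeddings:

```py
from transformers import TapasModel

# default: relative position embeddings, reset at every table cell
model = TapasModel.from_pretrained("google/tapas-base")

# absolute position embeddings instead, via the dedicated revision
model = TapasModel.from_pretrained("google/tapas-base", revision="no_reset")
```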
@@ -573,7 +573,7 @@ Predicted answer: SUM > 87, 53, 69
In a conversational set-up, each table-question pair must be provided **sequentially** to the model, such that the `prev_labels` token types can be overwritten by the predicted `labels` of the previous table-question pair. Again, more info can be found in [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for PyTorch) and [this notebook](https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for TensorFlow).
## Documentation resources
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
@@ -590,6 +590,9 @@ In case of a conversational set-up, then each table-question pair must be provid
- convert_logits_to_predictions
- save_vocabulary
<frameworkcontent>
<pt>
## TapasModel
[[autodoc]] TapasModel
- forward
@@ -606,6 +609,9 @@ In case of a conversational set-up, then each table-question pair must be provid
[[autodoc]] TapasForQuestionAnswering
- forward
</pt>
<tf>
## TFTapasModel
[[autodoc]] TFTapasModel
- call
@@ -621,3 +627,8 @@ In case of a conversational set-up, then each table-question pair must be provid
## TFTapasForQuestionAnswering
[[autodoc]] TFTapasForQuestionAnswering
- call
</tf>
</frameworkcontent>