"docs/vscode:/vscode.git/clone" did not exist on "4eec5d0cf67116e98770c305640b5710571da4f6"
Unverified commit 5964f820, authored by Maria Khalusova and committed by GitHub

[Docs] Model_doc structure/clarity improvements (#26876)

* first batch of structure improvements for model_docs

* second batch of structure improvements for model_docs

* more structure improvements for model_docs

* more structure improvements for model_docs

* structure improvements for cv model_docs

* more structural refactoring

* addressed feedback about image processors
...@@ -44,7 +44,12 @@ approach on a wide range of benchmarks for natural language understanding. Our g
discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
the state of the art in 9 out of the 12 tasks studied.*
Tips:
[Write With Transformer](https://transformer.huggingface.co/doc/gpt) is a webapp created and hosted by Hugging Face
showcasing the generative capabilities of several models. GPT is one of them.
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/openai/finetune-transformer-lm).
## Usage tips
- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left (see the padding sketch after these tips).
...@@ -52,10 +57,6 @@ Tips:
token in a sequence. Leveraging this feature allows GPT to generate syntactically coherent text as can be
observed in the *run_generation.py* example script.
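As a quick illustration of the right-padding tip above, here is a minimal sketch. The `openai-gpt` checkpoint ships without a padding token, so the unknown token is reused purely for padding; that pairing is an illustrative choice, not something prescribed by the original docs:

```python
from transformers import AutoTokenizer

# right-padding keeps absolute position ids aligned with the real tokens at the start of each sequence
tokenizer = AutoTokenizer.from_pretrained("openai-gpt", padding_side="right")
tokenizer.pad_token = tokenizer.unk_token  # openai-gpt has no dedicated pad token

batch = tokenizer(
    ["a short prompt", "a noticeably longer prompt that forces padding"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)    # both sequences padded to the same length
print(batch["attention_mask"][0])  # trailing zeros mark the right-side padding
```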
[Write With Transformer](https://transformer.huggingface.co/doc/gpt) is a webapp created and hosted by Hugging Face
showcasing the generative capabilities of several models. GPT is one of them.
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/openai/finetune-transformer-lm).
Note:
...@@ -116,6 +117,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] models.openai.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
<frameworkcontent>
<pt>
## OpenAIGPTModel
[[autodoc]] OpenAIGPTModel
...@@ -136,6 +140,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] OpenAIGPTForSequenceClassification
- forward
</pt>
<tf>
## TFOpenAIGPTModel
[[autodoc]] TFOpenAIGPTModel
...@@ -155,3 +162,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] TFOpenAIGPTForSequenceClassification
- call
</tf>
</frameworkcontent>
...@@ -25,13 +25,13 @@ The abstract from the paper is the following:
*Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.*
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Younes Belkada](https://huggingface.co/ybelkada), and [Patrick Von Platen](https://huggingface.co/patrickvonplaten).
The original code can be found [here](https://github.com/facebookresearch/metaseq).
Tips:
- OPT has the same architecture as [`BartDecoder`].
- Contrary to GPT2, OPT adds the EOS token `</s>` to the beginning of every prompt.
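A small sketch of the second tip, assuming the `facebook/opt-350m` checkpoint (any OPT checkpoint behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

ids = tokenizer("Hello world").input_ids
# the tokenizer prepends </s> to every prompt, so the first id is the EOS id
print(ids[0] == tokenizer.convert_tokens_to_ids("</s>"))  # True
print(tokenizer.decode(ids))                              # "</s>Hello world"
```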
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Younes Belkada](https://huggingface.co/ybelkada), and [Patrick Von Platen](https://huggingface.co/patrickvonplaten).
The original code can be found [here](https://github.com/facebookresearch/metaseq).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with OPT. If you're
...@@ -66,6 +66,9 @@ The resource should ideally demonstrate something new instead of duplicating an
[[autodoc]] OPTConfig
<frameworkcontent>
<pt>
## OPTModel
[[autodoc]] OPTModel
...@@ -76,6 +79,19 @@ The resource should ideally demonstrate something new instead of duplicating an
[[autodoc]] OPTForCausalLM
- forward
## OPTForSequenceClassification
[[autodoc]] OPTForSequenceClassification
- forward
## OPTForQuestionAnswering
[[autodoc]] OPTForQuestionAnswering
- forward
</pt>
<tf>
## TFOPTModel
[[autodoc]] TFOPTModel
...@@ -86,23 +102,18 @@ The resource should ideally demonstrate something new instead of duplicating an
[[autodoc]] TFOPTForCausalLM
- call
## OPTForSequenceClassification
</tf>
<jax>
[[autodoc]] OPTForSequenceClassification
- forward
## OPTForQuestionAnswering
[[autodoc]] OPTForQuestionAnswering
- forward
## FlaxOPTModel
[[autodoc]] FlaxOPTModel
- __call__
## FlaxOPTForCausalLM
[[autodoc]] FlaxOPTForCausalLM
- __call__
</jax>
</frameworkcontent>
...@@ -24,11 +24,6 @@ The abstract from the paper is the following:
*Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.*
Tips:
- The architecture of OWLv2 is identical to [OWL-ViT](owlvit), however the object detection head now also includes an objectness classifier, which predicts the (query-agnostic) likelihood that a predicted box contains an object (as opposed to background). The objectness score can be used to rank or filter predictions independently of text queries.
- Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor ([`Owlv2ImageProcessor`]).
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/owlv2_overview.png" <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/owlv2_overview.png"
alt="drawing" width="600"/> alt="drawing" width="600"/>
...@@ -37,13 +32,12 @@ alt="drawing" width="600"/> ...@@ -37,13 +32,12 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr). This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit). The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
## Usage ## Usage example
OWLv2 is, just like its predecessor [OWL-ViT](owlvit), a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
[`Owlv2ImageProcessor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`Owlv2Processor`] wraps [`Owlv2ImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`Owlv2Processor`] and [`Owlv2ForObjectDetection`].
```python
>>> import requests
>>> from PIL import Image
...@@ -76,7 +70,15 @@ Detected a photo of a cat with confidence 0.665 at location [6.75, 38.97, 326.62
## Resources
A demo notebook on using OWLv2 for zero- and one-shot (image-guided) object detection can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/OWLv2).
- A demo notebook on using OWLv2 for zero- and one-shot (image-guided) object detection can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/OWLv2).
- [Zero-shot object detection task guide](../tasks/zero_shot_object_detection)
<Tip>
The architecture of OWLv2 is identical to [OWL-ViT](owlvit); however, the object detection head now also includes an objectness classifier, which predicts the (query-agnostic) likelihood that a predicted box contains an object (as opposed to background). The objectness score can be used to rank or filter predictions independently of text queries.
Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor ([`Owlv2ImageProcessor`]).
</Tip>
## Owlv2Config
...
...@@ -31,13 +31,12 @@ alt="drawing" width="600"/>
This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
## Usage
## Usage tips
OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
[`OwlViTImageProcessor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`].
```python
>>> import requests
>>> from PIL import Image
...
...@@ -25,9 +25,6 @@ rendered properly in your Markdown viewer.
</a>
</div>
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title)
and assign @patrickvonplaten.
## Overview
...@@ -42,13 +39,17 @@ According to the abstract,
This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The Authors' code can be found [here](https://github.com/google-research/pegasus).
Tips:
## Usage tips
- Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining objective, called Gap Sentence Generation (GSG).
  * MLM: encoder input tokens are randomly replaced by mask tokens and have to be predicted by the encoder (like in BERT)
  * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal mask to hide future words, like a regular auto-regressive transformer decoder.
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.
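A minimal sketch of the Adafactor setup for Pegasus fine-tuning, using the optimizer shipped with 🤗 Transformers (the `google/pegasus-xsum` checkpoint and the hyperparameters below are illustrative, not prescribed by the authors):

```python
from transformers import PegasusForConditionalGeneration
from transformers.optimization import Adafactor, AdafactorSchedule

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

# relative-step Adafactor: the learning rate is derived internally, so lr stays None
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
lr_scheduler = AdafactorSchedule(optimizer)
```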
## Checkpoints
All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tuned for summarization, besides
...@@ -60,20 +61,11 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun
- Full replication results and correctly pre-processed data can be found in this [Issue](https://github.com/huggingface/transformers/issues/6844#issue-689259666).
- [Distilled checkpoints](https://huggingface.co/models?search=distill-pegasus) are described in this [paper](https://arxiv.org/abs/2010.13002).
### Examples
- [Script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh) to fine-tune pegasus
on the XSUM dataset. Data download instructions at [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md).
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.
## Implementation Notes
- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from [`BartForConditionalGeneration`]
- Some key configuration differences:
  - static, sinusoidal position embeddings
  - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
  - more beams are used (`num_beams=8`)
...@@ -82,7 +74,6 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun
- The code to convert checkpoints trained in the author's [repo](https://github.com/google-research/pegasus) can be
found in `convert_pegasus_tf_to_pytorch.py`.
## Usage Example
```python
...@@ -106,8 +97,10 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun
... )
```
## Documentation resources
## Resources
- [Script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh) to fine-tune pegasus
on the XSUM dataset. Data download instructions at [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md).
- [Causal language modeling task guide](../tasks/language_modeling)
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)
...@@ -126,6 +119,9 @@ warning: `add_tokens` does not work at the moment.
[[autodoc]] PegasusTokenizerFast
<frameworkcontent>
<pt>
## PegasusModel
[[autodoc]] PegasusModel
...@@ -141,6 +137,9 @@ warning: `add_tokens` does not work at the moment.
[[autodoc]] PegasusForCausalLM
- forward
</pt>
<tf>
## TFPegasusModel
[[autodoc]] TFPegasusModel
...@@ -151,6 +150,9 @@ warning: `add_tokens` does not work at the moment.
[[autodoc]] TFPegasusForConditionalGeneration
- call
</tf>
<jax>
## FlaxPegasusModel
[[autodoc]] FlaxPegasusModel
...@@ -164,3 +166,6 @@ warning: `add_tokens` does not work at the moment.
- __call__
- encode
- decode
</jax>
</frameworkcontent>
...@@ -26,10 +26,6 @@ The abstract from the paper is the following:
*While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.*
Tips:
* PEGASUS-X uses the same tokenizer as PEGASUS.
This model was contributed by [zphang](https://huggingface.co/zphang). The original code can be found [here](https://github.com/google-research/pegasus).
## Documentation resources
...@@ -37,17 +33,21 @@ This model was contributed by [zphang](<https://huggingface.co/zphang). The orig
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)
<Tip>
PEGASUS-X uses the same tokenizer as [PEGASUS](pegasus).
</Tip>
## PegasusXConfig
[[autodoc]] PegasusXConfig
## PegasusXModel
[[autodoc]] PegasusXModel
- forward
## PegasusXForConditionalGeneration
[[autodoc]] PegasusXForConditionalGeneration
...
...@@ -81,7 +81,13 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/deepmind/deepmind-research/tree/master/perceiver).
Tips:
<Tip warning={true}>
Perceiver does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035)
</Tip>
## Resources
- The quickest way to get started with the Perceiver is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver).
...@@ -89,13 +95,6 @@ Tips:
is implemented in the library. Note that the models available in the library only showcase some examples of what you can do
with the Perceiver. There are many more use cases, including question answering, named-entity recognition, object detection,
audio classification, video classification, etc.
**Note**:
- Perceiver does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035)
## Documentation resources
- [Text classification task guide](../tasks/sequence_classification)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Image classification task guide](../tasks/image_classification)
...
...@@ -26,6 +26,10 @@ The authors showcase their approach to model evaluation, focusing on practical t
In terms of model details, the work outlines the architecture and training methodology of Persimmon-8B, providing insights into its design choices, sequence length, and dataset composition. The authors present a fast inference code that outperforms traditional implementations through operator fusion and CUDA graph utilization while maintaining code coherence. They express their anticipation of how the community will leverage this contribution to drive innovation, hinting at further upcoming releases as part of an ongoing series of developments.
This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/persimmon-ai-labs/adept-inference).
## Usage tips
<Tip warning={true}>
...@@ -67,8 +71,6 @@ model = PersimmonForCausalLM.from_pretrained("/output/path")
tokenizer = PersimmonTokenizer.from_pretrained("/output/path")
```
This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/persimmon-ai-labs/adept-inference).
- Persimmon uses a `sentencepiece` based tokenizer, with a `Unigram` model. It supports bytefallback, which is only available in `tokenizers==0.14.0` for the fast tokenizer.
The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece. The `chat` template will be updated with the templating functions in a follow-up PR!
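For reference, a short generation sketch; it assumes the `adept/persimmon-8b-base` checkpoint on the Hub and uses the Auto classes rather than any Persimmon-specific tokenizer class:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "adept/persimmon-8b-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```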
...
...@@ -28,7 +28,9 @@ best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves th
Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and
Natural language inference.*
Example of use:
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/PhoBERT).
## Usage example
```python
>>> import torch
...@@ -50,7 +52,12 @@ Example of use:
>>> # phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
```
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/PhoBERT).
<Tip>
PhoBERT's implementation is the same as BERT, except for tokenization. Refer to the [BERT documentation](bert) for information on
configuration classes and their parameters. The PhoBERT-specific tokenizer is documented below.
</Tip>
## PhobertTokenizer
...
...@@ -39,7 +39,6 @@ The original code can be found [here](https://github.com/google-research/pix2str
- [Fine-tuning Notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb)
- [All models](https://huggingface.co/models?search=pix2struct)
## Pix2StructConfig
[[autodoc]] Pix2StructConfig
...
...@@ -16,10 +16,7 @@ rendered properly in your Markdown viewer.
# PLBart
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
## Overview
[@gchhablani](https://www.github.com/gchhablani).
## Overview of PLBart
The PLBART model was proposed in [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
This is a BART-like model which can be used to perform code-summarization, code-generation, and code-translation tasks. The pre-trained model `plbart-base` has been trained using multilingual denoising task
...@@ -40,7 +37,7 @@ even with limited annotations.*
This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The Authors' code can be found [here](https://github.com/wasiahmad/PLBART).
### Training of PLBart
## Usage examples
PLBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for code-to-text, text-to-code, code-to-code tasks. As the
model is multilingual it expects the sequences in a different format. A special language id token is added in both the
...@@ -53,7 +50,7 @@ In cases where the language code is needed, the regular [`~PLBartTokenizer.__cal
when you pass texts as the first argument or with the keyword argument `text`, and will encode target text format if
it's passed with the `text_target` keyword argument.
- Supervised training
### Supervised training
```python
>>> from transformers import PLBartForConditionalGeneration, PLBartTokenizer
...@@ -65,7 +62,7 @@ it's passed with the `text_target` keyword argument.
>>> model(**inputs)
```
- Generation
### Generation
While generating the target text set the `decoder_start_token_id` to the target language id. The following
example shows how to translate Python to English using the `uclanlp/plbart-python-en_XX` model.
...@@ -82,7 +79,7 @@ it's passed with the `text_target` keyword argument.
"Returns the maximum value of a b c."
```
## Documentation resources
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Causal language modeling task guide](../tasks/language_modeling)
...
...@@ -28,8 +28,9 @@ The figure below illustrates the architecture of PoolFormer. Taken from the [ori
<img width="600" src="https://user-images.githubusercontent.com/15921929/142746124-1ab7635d-2536-4a0e-ad43-b4fe2c5a525d.png"/>
This model was contributed by [heytanay](https://huggingface.co/heytanay). The original code can be found [here](https://github.com/sail-sg/poolformer).
Tips:
## Usage tips
- PoolFormer has a hierarchical architecture, where instead of Attention, a simple Average Pooling layer is present. All checkpoints of the model can be found on the [hub](https://huggingface.co/models?other=poolformer).
- One can use [`PoolFormerImageProcessor`] to prepare images for the model.
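A short sketch of the image-processor tip above, assuming the `sail/poolformer_s12` checkpoint (an ImageNet-1k classifier):

```python
import requests
from PIL import Image
from transformers import PoolFormerForImageClassification, PoolFormerImageProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = PoolFormerImageProcessor.from_pretrained("sail/poolformer_s12")
model = PoolFormerForImageClassification.from_pretrained("sail/poolformer_s12")

# the image processor handles resizing, rescaling and normalization
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```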
...@@ -43,8 +44,6 @@ Tips:
| m36 | [6, 6, 18, 6] | [96, 192, 384, 768] | 56 | 82.1 |
| m48 | [8, 8, 24, 8] | [96, 192, 384, 768] | 73 | 82.5 |
This model was contributed by [heytanay](https://huggingface.co/heytanay). The original code can be found [here](https://github.com/sail-sg/poolformer).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with PoolFormer.
...
...@@ -32,7 +32,6 @@ is transformed to its waveform and passed to the encoder, which transforms it to
uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four
different token types: time, velocity, note and 'special'. The token ids are then decoded to their equivalent MIDI file.
The abstract from the paper is the following:
*Piano covers of pop music are enjoyed by many people. However, the
...@@ -49,22 +48,21 @@ directly from pop audio without using melody and chord extraction ...@@ -49,22 +48,21 @@ directly from pop audio without using melody and chord extraction
modules. We show that Pop2Piano, trained with our dataset, is capable modules. We show that Pop2Piano, trained with our dataset, is capable
of producing plausible piano covers.* of producing plausible piano covers.*
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/sweetcocoa/pop2piano).
Tips:
## Usage tips
1. To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules:
* To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules:
```
pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
```
Please note that you may need to restart your runtime after installation.
2. Pop2Piano is an Encoder-Decoder based model like T5.
* Pop2Piano is an Encoder-Decoder based model like T5.
3. Pop2Piano can be used to generate midi-audio files for a given audio sequence.
* Pop2Piano can be used to generate midi-audio files for a given audio sequence.
4. Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to variety of different results.
* Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to a variety of different results.
5. Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance.
* Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance.
6. Though Pop2Piano was mainly trained on Korean Pop music, it also does pretty well on other Western Pop or Hip Hop songs.
* Though Pop2Piano was mainly trained on Korean Pop music, it also does pretty well on other Western Pop or Hip Hop songs.
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/sweetcocoa/pop2piano).
## Examples
...
...@@ -25,10 +25,6 @@ rendered properly in your Markdown viewer.
</a>
</div>
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten
## Overview
The ProphetNet model was proposed in [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
...@@ -49,15 +45,15 @@ dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Giga ...@@ -49,15 +45,15 @@ dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Giga
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.* state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
Tips: The Authors' code can be found [here](https://github.com/microsoft/ProphetNet).
## Usage tips
- ProphetNet is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
- The model architecture is based on the original Transformer, but replaces the “standard” self-attention mechanism in the decoder by a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
The Authors' code can be found [here](https://github.com/microsoft/ProphetNet).
## Resources
## Documentation resources
- [Causal language modeling task guide](../tasks/language_modeling)
- [Translation task guide](../tasks/translation)
...
...@@ -32,22 +32,18 @@ by processors with high-throughput integer math pipelines. We also present a wor ...@@ -32,22 +32,18 @@ by processors with high-throughput integer math pipelines. We also present a wor
able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are
more difficult to quantize, such as MobileNets and BERT-large.* more difficult to quantize, such as MobileNets and BERT-large.*
Tips: This model was contributed by [shangz](https://huggingface.co/shangz).
## Usage tips
- QDQBERT model adds fake quantization operations (pair of QuantizeLinear/DequantizeLinear ops) to (i) linear layer
inputs and weights, (ii) matmul inputs, (iii) residual add inputs, in the BERT model (a quantizer-setup sketch follows these tips).
- QDQBERT requires the [Pytorch Quantization Toolkit](https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization). To install it: `pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com`
- QDQBERT model can be loaded from any checkpoint of HuggingFace BERT model (for example *bert-base-uncased*), and
perform Quantization Aware Training/Post Training Quantization.
- A complete example of using QDQBERT model to perform Quantization Aware Training and Post Training Quantization for
SQUAD task can be found at [transformers/examples/research_projects/quantization-qdqbert/](examples/research_projects/quantization-qdqbert/).
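The quantizer-setup sketch referenced in the first tip, roughly along the lines of the *Set default quantizers* section below (the 8-bit max/per-channel descriptors here are illustrative defaults, not a tuned recipe):

```python
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# per-tensor max calibration for activations, per-channel (axis 0) quantization for weights
input_desc = QuantDescriptor(num_bits=8, calib_method="max")
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
```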
This model was contributed by [shangz](https://huggingface.co/shangz).
### Set default quantizers
QDQBERT model adds fake quantization operations (pair of QuantizeLinear/DequantizeLinear ops) to BERT by
...@@ -118,7 +114,7 @@ the instructions in [torch.onnx](https://pytorch.org/docs/stable/onnx.html). Exa
>>> torch.onnx.export(...)
```
## Documentation resources
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
...
...@@ -52,8 +52,12 @@ parametric-only seq2seq baseline.*
This model was contributed by [ola13](https://huggingface.co/ola13).
Tips:
## Usage tips
- Retrieval-augmented generation (“RAG”) models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models.
RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq
modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt
to downstream tasks.
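A minimal retrieve-then-generate sketch; `use_dummy_dataset=True` keeps the index download small and is only meant for smoke-testing, not for real retrieval quality:

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
# the retriever fetches passages internally; generation marginalizes over them
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```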
## RagConfig
...@@ -73,6 +77,9 @@ Tips:
[[autodoc]] RagRetriever
<frameworkcontent>
<pt>
## RagModel
[[autodoc]] RagModel
...@@ -90,6 +97,9 @@ Tips:
- forward
- generate
</pt>
<tf>
## TFRagModel
[[autodoc]] TFRagModel
...@@ -106,3 +116,6 @@ Tips:
[[autodoc]] TFRagTokenForGeneration
- call
- generate
</tf>
</frameworkcontent>
...@@ -25,8 +25,6 @@ rendered properly in your Markdown viewer.
</a>
</div>
**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
## Overview
The Reformer model was proposed in the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451.pdf) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
...@@ -44,7 +42,7 @@ while being much more memory-efficient and much faster on long sequences.*
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
found [here](https://github.com/google/trax/tree/master/trax/models/reformer).
Tips:
## Usage tips
- Reformer does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035).
- Use Axial position encoding (see below for more details). It’s a mechanism to avoid having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller matrices.
...@@ -52,7 +50,7 @@ Tips:
- Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them for results inside a given layer (less efficient than storing them but saves memory).
- Compute the feedforward operations by chunks and not on the whole batch.
## Axial Positional Encodings ### Axial Positional Encodings
Axial Positional Encodings were first implemented in Google's [trax library](https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29)
and developed by the authors of this model's paper. In models that are treating very long input sequences, the
...@@ -96,7 +94,7 @@ product has to be equal to `config.max_embedding_size`, which during training ha
length* of the `input_ids`.
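The factorization is controlled through the configuration. Below is a minimal sketch, assuming the `ReformerConfig` attribute names `axial_pos_embds`, `axial_pos_shape` and `axial_pos_embds_dim`; see the configuration reference for the exact names and defaults.

```python
from transformers import ReformerConfig, ReformerModel

# Factorize a 4096-position encoding matrix into a 64 x 64 grid.
# The product of axial_pos_shape must equal the training sequence length,
# and the entries of axial_pos_embds_dim must sum to hidden_size.
config = ReformerConfig(
    axial_pos_embds=True,
    axial_pos_shape=[64, 64],       # 64 * 64 = 4096 positions
    axial_pos_embds_dim=[64, 192],  # 64 + 192 = 256 = hidden_size
    hidden_size=256,
)
model = ReformerModel(config)
```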
## LSH Self Attention ### LSH Self Attention
In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
...@@ -129,7 +127,7 @@ Using LSH self attention, the memory and time complexity of the query-key matmul
and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length.
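A minimal sketch of a model that uses LSH self attention in every layer, assuming the `ReformerConfig` attributes `attn_layers`, `lsh_attn_chunk_length` and `num_hashes`:

```python
from transformers import ReformerConfig, ReformerModel

# Use LSH self attention in every layer; key and query projections are shared.
config = ReformerConfig(
    attn_layers=["lsh", "lsh", "lsh", "lsh"],
    lsh_attn_chunk_length=64,  # chunk size the sorted query/key vectors are split into
    num_hashes=4,              # more hashing rounds are more accurate but slower
)
model = ReformerModel(config)
```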
## Local Self Attention ### Local Self Attention
Local self attention is essentially a "normal" self attention layer with key, query and value projections, but is
chunked so that in each chunk of length `config.local_chunk_length` the query embedding vectors only attend to
...@@ -141,7 +139,7 @@ Using Local self attention, the memory and time complexity of the query-key matm
and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length.
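Analogously, a minimal sketch of a configuration that uses local self attention in every layer, assuming the `ReformerConfig` attributes `local_attn_chunk_length`, `local_num_chunks_before` and `local_num_chunks_after`:

```python
from transformers import ReformerConfig, ReformerModel

# Use chunked "local" self attention in every layer.
config = ReformerConfig(
    attn_layers=["local", "local", "local", "local"],
    local_attn_chunk_length=64,  # queries attend within their own chunk...
    local_num_chunks_before=1,   # ...plus this many preceding chunks
    local_num_chunks_after=0,
)
model = ReformerModel(config)
```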
## Training ### Training
During training, we must ensure that the sequence length is set to a value that can be divided by the least common
multiple of `config.lsh_chunk_length` and `config.local_chunk_length` and that the parameters of the Axial
...@@ -155,7 +153,7 @@ input_ids = tokenizer.encode("This is a sentence from the training data", return
loss = model(input_ids, labels=input_ids)[0]
```
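The sketch below shows one way to compute a valid padded length, assuming the chunk lengths are exposed on the configuration as `lsh_attn_chunk_length` and `local_attn_chunk_length` (the prose above refers to them as `config.lsh_chunk_length` and `config.local_chunk_length`):

```python
import math

from transformers import ReformerConfig

config = ReformerConfig()  # defaults; your chunk lengths may differ

# A valid training sequence length is a multiple of the least common multiple
# of the two chunk lengths (and should match the product of config.axial_pos_shape).
lcm = math.lcm(config.lsh_attn_chunk_length, config.local_attn_chunk_length)
seq_len = 1000
padded_len = math.ceil(seq_len / lcm) * lcm
print(f"pad {seq_len} tokens up to {padded_len}")
```

When tokenizing, passing `pad_to_multiple_of` together with padding to the tokenizer usually achieves the same effect.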
## Documentation resources ## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Question answering task guide](../tasks/question_answering)
...
...@@ -26,15 +26,13 @@ The abstract from the paper is the following:
*In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.*
Tips:
- One can use [`AutoImageProcessor`] to prepare images for the model.
- The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988), trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer)
This model was contributed by [Francesco](https://huggingface.co/Francesco). The TensorFlow version of the model
was contributed by [sayakpaul](https://huggingface.com/sayakpaul) and [ariG23498](https://huggingface.com/ariG23498).
The original code can be found [here](https://github.com/facebookresearch/pycls).
The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988),
trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer)
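Below is a minimal image-classification sketch; the `facebook/regnet-y-040` checkpoint and the example image are placeholders, and any RegNet classification checkpoint from the hub works the same way.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, RegNetForImageClassification

# Any RGB image works; this COCO image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/regnet-y-040")
model = RegNetForImageClassification.from_pretrained("facebook/regnet-y-040")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```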
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with RegNet.
...@@ -50,37 +48,43 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] RegNetConfig
<frameworkcontent>
<pt>
## RegNetModel
[[autodoc]] RegNetModel
- forward
## RegNetForImageClassification
[[autodoc]] RegNetForImageClassification
- forward
</pt>
<tf>
## TFRegNetModel
[[autodoc]] TFRegNetModel
- call
## TFRegNetForImageClassification
[[autodoc]] TFRegNetForImageClassification
- call
</tf>
<jax>
## FlaxRegNetModel
[[autodoc]] FlaxRegNetModel
- __call__
## FlaxRegNetForImageClassification
[[autodoc]] FlaxRegNetForImageClassification
- __call__
\ No newline at end of file
</jax>
</frameworkcontent>
...@@ -34,14 +34,14 @@ Transformer representations to be more general and more transferable to other ta
findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the
number of parameters at the fine-tuning stage.*
Tips: ## Usage tips
For fine-tuning, RemBERT can be thought of as a bigger version of mBERT with an ALBERT-like factorization of the
embedding layer. The embeddings are not tied in pre-training, in contrast with BERT, which enables smaller input
embeddings (preserved during fine-tuning) and bigger output embeddings (discarded at fine-tuning). The tokenizer is
also similar to the Albert one rather than the BERT one.
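As a rough illustration of this factorization, the sketch below inspects the default configuration; it assumes the `RemBertConfig` attributes `input_embedding_size` and `output_embedding_size`, so check the configuration reference below for the exact names and defaults.

```python
from transformers import RemBertConfig

config = RemBertConfig()
# Input embeddings are smaller than the hidden size, while the output embeddings
# used during pre-training are larger and are dropped for fine-tuning.
print(config.input_embedding_size, config.hidden_size, config.output_embedding_size)
```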
## Documentation resources ## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
...@@ -70,6 +70,9 @@ also similar to the Albert one rather than the BERT one.
- create_token_type_ids_from_sequences
- save_vocabulary
<frameworkcontent>
<pt>
## RemBertModel
[[autodoc]] RemBertModel
...@@ -105,6 +108,9 @@ also similar to the Albert one rather than the BERT one.
[[autodoc]] RemBertForQuestionAnswering
- forward
</pt>
<tf>
## TFRemBertModel
[[autodoc]] TFRemBertModel
...@@ -139,3 +145,6 @@ also similar to the Albert one rather than the BERT one.
[[autodoc]] TFRemBertForQuestionAnswering
- call
</tf>
</frameworkcontent>
...@@ -27,10 +27,6 @@ The abstract from the paper is the following:
*Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.*
Tips:
- One can use [`AutoImageProcessor`] to prepare images for the model.
The figure below illustrates the architecture of ResNet. Taken from the [original paper](https://arxiv.org/abs/1512.03385).
<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/resnet_architecture.png"/>
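The sketch below extracts the final convolutional feature map with a plain `ResNetModel`; the `microsoft/resnet-50` checkpoint and the example image are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, ResNetModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = ResNetModel.from_pretrained("microsoft/resnet-50")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Final feature map, shape (batch_size, num_channels, height, width).
print(outputs.last_hidden_state.shape)
```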
...@@ -52,30 +48,35 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] ResNetConfig
<frameworkcontent>
<pt>
## ResNetModel
[[autodoc]] ResNetModel
- forward
## ResNetForImageClassification
[[autodoc]] ResNetForImageClassification
- forward
</pt>
<tf>
## TFResNetModel
[[autodoc]] TFResNetModel
- call
## TFResNetForImageClassification
[[autodoc]] TFResNetForImageClassification
- call
</tf>
<jax>
## FlaxResNetModel
[[autodoc]] FlaxResNetModel
...@@ -85,3 +86,6 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] FlaxResNetForImageClassification
- __call__
</jax>
</frameworkcontent>