"docs/vscode:/vscode.git/clone" did not exist on "2acedf47214d7a634c193846124832a4686cc8fd"
Unverified commit 5964f820, authored by Maria Khalusova and committed by GitHub

[Docs] Model_doc structure/clarity improvements (#26876)

* first batch of structure improvements for model_docs

* second batch of structure improvements for model_docs

* more structure improvements for model_docs

* more structure improvements for model_docs

* structure improvements for cv model_docs

* more structural refactoring

* addressed feedback about image processors
Parent: ad8ff962
````diff
@@ -38,7 +38,7 @@ model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cud
 GPT-NeoX-20B also has a different tokenizer from the one used in GPT-J-6B and GPT-Neo. The new tokenizer allocates
 additional tokens to whitespace characters, making the model more suitable for certain tasks like code generation.
-### Generation
+## Usage example
 The `generate()` method can be used to generate text using GPT Neo model.
@@ -61,7 +61,7 @@ The `generate()` method can be used to generate text using GPT Neo model.
 >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ```
-## Documentation resources
+## Resources
 - [Causal language modeling task guide](../tasks/language_modeling)
````
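The `generate()` example is only shown in part in the hunk above; a minimal sketch of the full flow it refers to (the prompt and sampling settings are illustrative, not taken from the diff):

```python
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/gpt-neox-20b")

prompt = "GPTNeoX20B is a 20B-parameter autoregressive Transformer model developed by EleutherAI."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# sample a continuation, then decode it as in the truncated snippet above
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
```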
````diff
@@ -25,7 +25,7 @@ Following the recommendations from Google's research on [PaLM](https://ai.google
 Development of the model was led by [Shinya Otani](https://github.com/SO0529), [Takayoshi Makabe](https://github.com/spider-man-tm), [Anuj Arora](https://github.com/Anuj040), and [Kyo Hattori](https://github.com/go5paopao) from [ABEJA, Inc.](https://www.abejainc.com/). For more information on this model-building activity, please refer [here (ja)](https://tech-blog.abeja.asia/entry/abeja-gpt-project-202207).
-### Generation
+### Usage example
 The `generate()` method can be used to generate text using GPT NeoX Japanese model.
@@ -51,7 +51,7 @@ The `generate()` method can be used to generate text using GPT NeoX Japanese mod
 人とAIが協調するためにはAIと人が共存しAIを正しく理解する必要があります
 ```
-## Documentation resources
+## Resources
 - [Causal language modeling task guide](../tasks/language_modeling)
````
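For context, Japanese output like the line shown above is typically produced along these lines; a sketch, with the checkpoint name and prompt as assumptions rather than content of the diff:

```python
from transformers import GPTNeoXJapaneseForCausalLM, GPTNeoXJapaneseTokenizer

model = GPTNeoXJapaneseForCausalLM.from_pretrained("abeja/gpt-neox-japanese-2.7b")
tokenizer = GPTNeoXJapaneseTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b")

prompt = "人とAIが協調するためには、"  # "For people and AI to cooperate, ..."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0])
```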
````diff
@@ -23,7 +23,7 @@ causal language model trained on [the Pile](https://pile.eleuther.ai/) dataset.
 This model was contributed by [Stella Biderman](https://huggingface.co/stellaathena).
-Tips:
+## Usage tips
 - To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size
 RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB
@@ -56,7 +56,7 @@ Tips:
 size, the tokenizer for [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) contains 143 extra tokens
 `<|extratoken_1|>... <|extratoken_143|>`, so the `vocab_size` of tokenizer also becomes 50400.
-### Generation
+## Usage examples
 The [`~generation.GenerationMixin.generate`] method can be used to generate text using GPT-J
 model.
````
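To make the memory tip above concrete, a sketch of loading GPT-J in half precision; the `float16` revision is the branch published on the model card, so verify it for your checkpoint:

```python
import torch
from transformers import GPTJForCausalLM

# loading the fp16 weights roughly halves the RAM/VRAM footprint compared to float32
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```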
````diff
@@ -138,6 +138,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] GPTJConfig
 - all
+<frameworkcontent>
+<pt>
 ## GPTJModel
 [[autodoc]] GPTJModel
@@ -158,6 +161,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] GPTJForQuestionAnswering
 - forward
+</pt>
+<tf>
 ## TFGPTJModel
 [[autodoc]] TFGPTJModel
@@ -178,6 +184,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] TFGPTJForQuestionAnswering
 - call
+</tf>
+<jax>
 ## FlaxGPTJModel
 [[autodoc]] FlaxGPTJModel
@@ -187,3 +196,5 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] FlaxGPTJForCausalLM
 - __call__
+</jax>
+</frameworkcontent>
````
````diff
@@ -24,7 +24,7 @@ GPTSAN is a Japanese language model using Switch Transformer. It has the same st
 in the T5 paper, and support both Text Generation and Masked Language Modeling tasks. These basic tasks similarly can
 fine-tune for translation or summarization.
-### Generation
+### Usage example
 The `generate()` method can be used to generate text using GPTSAN-Japanese model.
@@ -56,7 +56,7 @@ This length applies to the text entered in `prefix_text` for the tokenizer.
 The tokenizer returns the mask of the `Prefix` part of Prefix-LM as `token_type_ids`.
 The model treats the part where `token_type_ids` is 1 as a `Prefix` part, that is, the input can refer to both tokens before and after.
-Tips:
+## Usage tips
 Specifying the Prefix part is done with a mask passed to self-attention.
 When token_type_ids=None or all zero, it is equivalent to regular causal mask
````
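A sketch of how the `Prefix` mask described above comes out of the tokenizer; the `prefix_text` argument and the `Tanrei/GPTSAN-japanese` checkpoint follow the surrounding docs, and the input strings are illustrative:

```python
from transformers import GPTSanJapaneseTokenizer

tokenizer = GPTSanJapaneseTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")

# tokens coming from prefix_text are marked with token_type_ids == 1 (the bidirectional Prefix part),
# the remaining tokens get 0 and are handled with the regular causal mask
encoded = tokenizer("について教えてください。", prefix_text="織田信長")
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```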
````diff
@@ -23,26 +23,24 @@ The abstract from the paper is the following:
 *The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.*
-Tips:
+This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
+## Usage tips
 This model will not work well on large graphs (more than 100 nodes/edges), as it will make the memory explode.
 You can reduce the batch size, increase your RAM, or decrease the `UNREACHABLE_NODE_DISTANCE` parameter in algos_graphormer.pyx, but it will be hard to go above 700 nodes/edges.
 This model does not use a tokenizer, but instead a special collator during training.
-This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
 ## GraphormerConfig
 [[autodoc]] GraphormerConfig
 ## GraphormerModel
 [[autodoc]] GraphormerModel
 - forward
 ## GraphormerForGraphClassification
 [[autodoc]] GraphormerForGraphClassification
````
````diff
@@ -25,13 +25,13 @@ The abstract from the paper is the following:
 *Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision.*
-Tips:
-- You may specify `output_segmentation=True` in the forward of `GroupViTModel` to get the segmentation logits of input texts.
 This model was contributed by [xvjiarui](https://huggingface.co/xvjiarui). The TensorFlow version was contributed by [ariG23498](https://huggingface.co/ariG23498) with the help of [Yih-Dar SHIEH](https://huggingface.co/ydshieh), [Amy Roberts](https://huggingface.co/amyeroberts), and [Joao Gante](https://huggingface.co/joaogante).
 The original code can be found [here](https://github.com/NVlabs/GroupViT).
+## Usage tips
+- You may specify `output_segmentation=True` in the forward of `GroupViTModel` to get the segmentation logits of input texts.
 ## Resources
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GroupViT.
````
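A sketch of the `output_segmentation=True` tip above; the checkpoint and image URL are common examples assumed here, not content of the diff:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, GroupViTModel

processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")
model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs, output_segmentation=True)

# per-text segmentation logits over the image, on top of the usual image-text similarity scores
segmentation_logits = outputs.segmentation_logits
```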
````diff
@@ -52,6 +52,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] GroupViTVisionConfig
+<frameworkcontent>
+<pt>
 ## GroupViTModel
 [[autodoc]] GroupViTModel
@@ -69,6 +72,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] GroupViTVisionModel
 - forward
+</pt>
+<tf>
 ## TFGroupViTModel
 [[autodoc]] TFGroupViTModel
@@ -84,4 +90,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 ## TFGroupViTVisionModel
 [[autodoc]] TFGroupViTVisionModel
 - call
\ No newline at end of file
+</tf>
+</frameworkcontent>
````
````diff
@@ -37,7 +37,11 @@ which has the best average performance and obtains the best results for three ou
 extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based
 models.*
-Examples of use:
+This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found
+[here](https://github.com/allegro/HerBERT).
+## Usage example
 ```python
 >>> from transformers import HerbertTokenizer, RobertaModel
@@ -56,9 +60,12 @@ Examples of use:
 >>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
 ```
-This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found
-[here](https://github.com/allegro/HerBERT).
+<Tip>
+Herbert implementation is the same as `BERT` except for the tokenization method. Refer to [BERT documentation](bert)
+for API reference and examples.
+</Tip>
 ## HerbertTokenizer
````
````diff
@@ -36,15 +36,15 @@ state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-lig
 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER
 reduction on the more challenging dev-other and test-other evaluation subsets.*
-Tips:
+This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
+# Usage tips
 - Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
 - Hubert model was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
 using [`Wav2Vec2CTCTokenizer`].
-This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
-## Documentation resources
+## Resources
 - [Audio classification task guide](../tasks/audio_classification)
 - [Automatic speech recognition task guide](../tasks/asr)
````
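A sketch of the decoding flow the tips above describe, using a fine-tuned CTC checkpoint; the dummy dataset is only there to provide a 16 kHz float array, and any waveform at that sampling rate works:

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

# grab one raw waveform sampled at 16 kHz
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# CTC decoding: argmax over the vocabulary, then collapse repeats/blanks in the tokenizer
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
```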
````diff
@@ -53,6 +53,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] HubertConfig
+<frameworkcontent>
+<pt>
 ## HubertModel
 [[autodoc]] HubertModel
@@ -68,6 +71,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] HubertForSequenceClassification
 - forward
+</pt>
+<tf>
 ## TFHubertModel
 [[autodoc]] TFHubertModel
@@ -77,3 +83,6 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] TFHubertForCTC
 - call
+</tf>
+</frameworkcontent>
````
````diff
@@ -40,7 +40,7 @@ been open-sourced.*
 This model was contributed by [kssteven](https://huggingface.co/kssteven). The original code can be found [here](https://github.com/kssteven418/I-BERT).
-## Documentation resources
+## Resources
 - [Text classification task guide](../tasks/sequence_classification)
 - [Token classification task guide](../tasks/token_classification)
````
````diff
@@ -31,9 +31,9 @@ This model was contributed by [HuggingFaceM4](https://huggingface.co/HuggingFace
 <Tip warning={true}>
-Idefics modeling code in Transformers is for finetuning and inferencing the pre-trained Idefics models.
-To train a new Idefics model from scratch use the m4 codebase (a link will be provided once it's made public)
+IDEFICS modeling code in Transformers is for finetuning and inferencing the pre-trained IDEFICS models.
+To train a new IDEFICS model from scratch use the m4 codebase (a link will be provided once it's made public)
 </Tip>
````
````diff
@@ -40,7 +40,7 @@ alt="drawing" width="600"/>
 This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found
 [here](https://github.com/openai/image-gpt).
-Tips:
+## Usage tips
 - ImageGPT is almost exactly the same as [GPT-2](gpt2), with the exception that a different activation
 function is used (namely "quick gelu"), and the layer normalization layers don't mean center the inputs. ImageGPT
@@ -92,7 +92,6 @@ If you're interested in submitting a resource to be included here, please feel f
 ## ImageGPTFeatureExtractor
 [[autodoc]] ImageGPTFeatureExtractor
 - __call__
 ## ImageGPTImageProcessor
@@ -103,17 +102,14 @@ If you're interested in submitting a resource to be included here, please feel f
 ## ImageGPTModel
 [[autodoc]] ImageGPTModel
 - forward
 ## ImageGPTForCausalImageModeling
 [[autodoc]] ImageGPTForCausalImageModeling
 - forward
 ## ImageGPTForImageClassification
 [[autodoc]] ImageGPTForImageClassification
 - forward
````
````diff
@@ -39,13 +39,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] InformerConfig
 ## InformerModel
 [[autodoc]] InformerModel
 - forward
 ## InformerForPrediction
 [[autodoc]] InformerForPrediction
````
````diff
@@ -21,10 +21,6 @@ The abstract from the paper is the following:
 *General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models.*
-Tips:
-- InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/instructblip_architecture.jpg"
 alt="drawing" width="600"/>
@@ -33,6 +29,9 @@ alt="drawing" width="600"/>
 This model was contributed by [nielsr](https://huggingface.co/nielsr).
 The original code can be found [here](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip).
+## Usage tips
+InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
 ## InstructBlipConfig
````
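A sketch of how the instruction is passed alongside the image in practice; the checkpoint, image URL and prompt are illustrative assumptions:

```python
import requests
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What is unusual about this image?"

# one processor call prepares the pixel values and tokenizes the instruction,
# which the model feeds both to the Q-Former and to the language model
inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```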
````diff
@@ -32,7 +32,11 @@ The metadata such as *artist, genre and timing* are passed to each prior, in the
 ![JukeboxModel](https://gist.githubusercontent.com/ArthurZucker/92c1acaae62ebf1b6a951710bdd8b6af/raw/c9c517bf4eff61393f6c7dec9366ef02bdd059a3/jukebox.svg)
-Tips:
+This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ).
+The original code can be found [here](https://github.com/openai/jukebox).
+## Usage tips
 - This model only supports inference. This is for a few reasons, mostly because it requires a crazy amount of memory to train. Feel free to open a PR and add what's missing to have a full integration with the hugging face traineer!
 - This model is very slow, and takes 8h to generate a minute long audio using the 5b top prior on a V100 GPU. In order automaticallay handle the device on which the model should execute, use `accelerate`.
 - Contrary to the paper, the order of the priors goes from `0` to `1` as it felt more intuitive : we sample starting from `0`.
@@ -67,14 +71,12 @@ The original code can be found [here](https://github.com/openai/jukebox).
 - upsample
 - _sample
 ## JukeboxPrior
 [[autodoc]] JukeboxPrior
 - sample
 - forward
 ## JukeboxVQVAE
 [[autodoc]] JukeboxVQVAE
````
````diff
@@ -46,7 +46,7 @@ document-level pretraining. It achieves new state-of-the-art results in several
 understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
 (from 93.07 to 94.42).*
-Tips:
+## Usage tips
 - In addition to *input_ids*, [`~transformers.LayoutLMModel.forward`] also expects the input `bbox`, which are
 the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
@@ -123,6 +123,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] LayoutLMTokenizerFast
+<frameworkcontent>
+<pt>
 ## LayoutLMModel
 [[autodoc]] LayoutLMModel
@@ -143,6 +146,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] LayoutLMForQuestionAnswering
+</pt>
+<tf>
 ## TFLayoutLMModel
 [[autodoc]] TFLayoutLMModel
@@ -162,3 +168,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 ## TFLayoutLMForQuestionAnswering
 [[autodoc]] TFLayoutLMForQuestionAnswering
+</tf>
+</frameworkcontent>
````
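The `bbox` tip in the LayoutLM hunks above expects boxes on a 0-1000 scale; a minimal normalization helper as a sketch (the function name is ours, not part of the library):

```python
def normalize_bbox(bbox, width, height):
    # bbox = (x0, y0, x1, y1) in absolute pixel coordinates of the page image;
    # LayoutLM expects each coordinate scaled to the 0-1000 range
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]
```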
````diff
@@ -56,7 +56,7 @@ python -m pip install torchvision tesseract
 ```
 (If you are developing for LayoutLMv2, note that passing the doctests also requires the installation of these packages.)
-Tips:
+## Usage tips
 - The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
 pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
````
````diff
@@ -26,16 +26,6 @@ The abstract from the paper is the following:
 *Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*
-Tips:
-- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
-- images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
-- text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
-Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
-- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
-- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
-- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/layoutlmv3_architecture.png"
 alt="drawing" width="600"/>
@@ -43,6 +33,14 @@ alt="drawing" width="600"/>
 This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [chriskoo](https://huggingface.co/chriskoo), [tokec](https://huggingface.co/tokec), and [lre](https://huggingface.co/lre). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/layoutlmv3).
+## Usage tips
+- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
+- images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
+- text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
+Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
+- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
 ## Resources
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
````
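A sketch of the combined processor flow described in the tips above; the checkpoint and label count are illustrative, and by default the image processor also runs Tesseract OCR on the image to obtain words and boxes:

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)

image = Image.open("document.png").convert("RGB")  # any document image

# the processor resizes/normalizes the image (RGB channel order) and BPE-tokenizes the OCR'd words,
# returning input_ids, bbox and pixel_values in a single encoding
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
logits = outputs.logits
```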
````diff
@@ -53,6 +51,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 </Tip>
+- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
+- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).
 <PipelineTag pipeline="text-classification"/>
 - [`LayoutLMv2ForSequenceClassification`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb).
@@ -103,6 +104,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 [[autodoc]] LayoutLMv3Processor
 - __call__
+<frameworkcontent>
+<pt>
 ## LayoutLMv3Model
 [[autodoc]] LayoutLMv3Model
@@ -123,6 +127,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 [[autodoc]] LayoutLMv3ForQuestionAnswering
 - forward
+</pt>
+<tf>
 ## TFLayoutLMv3Model
 [[autodoc]] TFLayoutLMv3Model
@@ -142,3 +149,6 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 [[autodoc]] TFLayoutLMv3ForQuestionAnswering
 - call
+</tf>
+</frameworkcontent>
````
````diff
@@ -33,6 +33,10 @@ introduce a multilingual form understanding benchmark dataset named XFUN, which
 for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
 cross-lingual pre-trained models on the XFUN dataset.*
+This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).
+## Usage tips and examples
 One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:
 ```python
@@ -56,10 +60,10 @@ Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally appl
 [`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all
 data for the model.
-As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.
-This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).
+<Tip>
+As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.
+</Tip>
 ## LayoutXLMTokenizer
````
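A minimal sketch of the two usage points described above, plugging LayoutXLM weights into the LayoutLMv2 architecture and preparing inputs with the processor; the checkpoint name is an assumption taken from the usual Hub naming:

```python
from PIL import Image
from transformers import LayoutLMv2Model, LayoutXLMProcessor

# LayoutXLM checkpoints load directly into the LayoutLMv2 architecture
model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")

# the processor chains the image processor and LayoutXLMTokenizerFast;
# by default it runs OCR on the image to obtain words and boxes
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")
```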
````diff
@@ -35,7 +35,7 @@ WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED),
 long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
 dataset.*
-Tips:
+## Usage tips
 - [`LEDForConditionalGeneration`] is an extension of
 [`BartForConditionalGeneration`] exchanging the traditional *self-attention* layer with
@@ -52,15 +52,15 @@ Tips:
 errors. This can be done by executing `model.gradient_checkpointing_enable()`.
 Moreover, the `use_cache=False`
 flag can be used to disable the caching mechanism to save memory.
-- A notebook showing how to evaluate LED, can be accessed [here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
-- A notebook showing how to fine-tune LED, can be accessed [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
 - LED is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
 the left.
 This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
-## Documentation resources
+## Resources
+- [A notebook showing how to evaluate LED](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
+- [A notebook showing how to fine-tune LED](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
 - [Text classification task guide](../tasks/sequence_classification)
 - [Question answering task guide](../tasks/question_answering)
 - [Translation task guide](../tasks/translation)
````
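A sketch tying the tips above together: gradient checkpointing plus `use_cache=False` for a memory-friendly pass over a long input, with global attention on the first token (checkpoint, lengths and target text are illustrative):

```python
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

# trade compute for memory on long documents
model.gradient_checkpointing_enable()

article = "..."  # a long document, up to 16384 tokens
inputs = tokenizer(article, return_tensors="pt", max_length=4096, truncation=True)

# LED expects a global attention mask; global attention on the first token is the usual choice
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

labels = tokenizer("a short summary", return_tensors="pt").input_ids
outputs = model(
    **inputs,
    global_attention_mask=global_attention_mask,
    labels=labels,
    use_cache=False,  # disable caching to save memory while training
)
loss = outputs.loss
```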
````diff
@@ -100,6 +100,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput
+<frameworkcontent>
+<pt>
 ## LEDModel
 [[autodoc]] LEDModel
@@ -120,6 +123,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] LEDForQuestionAnswering
 - forward
+</pt>
+<tf>
 ## TFLEDModel
 [[autodoc]] TFLEDModel
@@ -129,3 +135,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] TFLEDForConditionalGeneration
 - call
+</tf>
+</frameworkcontent>
````
````diff
@@ -38,7 +38,9 @@ alt="drawing" width="600"/>
 <small> LeViT Architecture. Taken from the <a href="https://arxiv.org/abs/2104.01136">original paper</a>.</small>
-Tips:
+This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
+## Usage tips
 - Compared to ViT, LeViT models use an additional distillation head to effectively learn from a teacher (which, in the LeViT paper, is a ResNet like-model). The distillation head is learned through backpropagation under supervision of a ResNet like-model. They also draw inspiration from convolution neural networks to use activation maps with decreasing resolutions to increase the efficiency.
 - There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top
@@ -63,8 +65,6 @@ Tips:
 - You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
 (you can just replace [`ViTFeatureExtractor`] by [`LevitImageProcessor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).
-This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
 ## Resources
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LeViT.
@@ -90,7 +90,6 @@ If you're interested in submitting a resource to be included here, please feel f
 [[autodoc]] LevitImageProcessor
 - preprocess
 ## LevitModel
 [[autodoc]] LevitModel
````
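A sketch of the drop-in replacement mentioned in the last usage tip above; the checkpoint is one of the released LeViT sizes, assumed here as an example:

```python
import requests
from PIL import Image
from transformers import LevitImageProcessor, LevitForImageClassificationWithTeacher

processor = LevitImageProcessor.from_pretrained("facebook/levit-128S")
model = LevitForImageClassificationWithTeacher.from_pretrained("facebook/levit-128S")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits  # average of the classification and distillation head predictions
print(model.config.id2label[logits.argmax(-1).item()])
```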