Unverified Commit 5964f820 authored by Maria Khalusova, committed by GitHub

[Docs] Model_doc structure/clarity improvements (#26876)

* first batch of structure improvements for model_docs

* second batch of structure improvements for model_docs

* more structure improvements for model_docs

* more structure improvements for model_docs

* structure improvements for cv model_docs

* more structural refactoring

* addressed feedback about image processors
parent ad8ff962
...@@ -26,7 +26,15 @@ The abstract from the paper is the following:
*Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding. LiLT can be pre-trained on the structured documents of a single language and then directly fine-tuned on other languages with the corresponding off-the-shelf monolingual/multilingual pre-trained textual models. Experimental results on eight languages have shown that LiLT can achieve competitive or even superior performance on diverse widely-used downstream benchmarks, which enables language-independent benefit from the pre-training of document layout structure.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/lilt_architecture.jpg"
alt="drawing" width="600"/>
<small> LiLT architecture. Taken from the <a href="https://arxiv.org/abs/2202.13669">original paper</a>. </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/jpwang/lilt).
## Usage tips
- To combine the Language-Independent Layout Transformer with a new RoBERTa checkpoint from the [hub](https://huggingface.co/models?search=roberta), refer to [this guide](https://github.com/jpWang/LiLT#or-generate-your-own-checkpoint-optional).
The script will result in `config.json` and `pytorch_model.bin` files being stored locally. After doing this, one can do the following (assuming you're logged in with your Hugging Face account):
...@@ -42,14 +50,6 @@ model.push_to_hub("name_of_repo_on_the_hub")
- As [lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-roberta-en-base) uses the same vocabulary as [LayoutLMv3](layoutlmv3), one can use [`LayoutLMv3TokenizerFast`] to prepare data for the model.
The same is true for [lilt-infoxlm-base](https://huggingface.co/SCUT-DLVCLab/lilt-infoxlm-base): one can use [`LayoutXLMTokenizerFast`] for that model. See the sketch below for an example.
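As a rough sketch of the tip above, here is how one might prepare inputs with [`LayoutLMv3TokenizerFast`] and run them through LiLT (the words and boxes below are toy placeholders for real OCR output):

```python
from transformers import LayoutLMv3TokenizerFast, LiltModel

tokenizer = LayoutLMv3TokenizerFast.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base")
model = LiltModel.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base")

# words and their normalized (0-1000) bounding boxes, e.g. from an OCR engine
words = ["Hello", "world"]
boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]

encoding = tokenizer(words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
last_hidden_state = outputs.last_hidden_state
```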
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/lilt_architecture.jpg"
alt="drawing" width="600"/>
<small> LiLT architecture. Taken from the <a href="https://arxiv.org/abs/2202.13669">original paper</a>. </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/jpwang/lilt).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LiLT.
...
...@@ -24,7 +24,9 @@ The abstract from the paper is the following:
*We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.*
This model was contributed by [zphang](https://huggingface.co/zphang) with contributions from [BlackSamorez](https://huggingface.co/BlackSamorez). The code of the implementation in Hugging Face is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox). The original code of the authors can be found [here](https://github.com/facebookresearch/llama).
## Usage tips
- Weights for the LLaMA models can be obtained by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform?usp=send_form)
- After downloading the weights, they will need to be converted to the Hugging Face Transformers format using the [conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py). The script can be called with the following (example) command (see also the loading sketch after these tips):
...@@ -48,9 +50,6 @@ come in several checkpoints they each contain a part of each weight of the model
- The LLaMA tokenizer is a BPE model based on [sentencepiece](https://github.com/google/sentencepiece). One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string.
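After conversion, the weights load like any other Transformers checkpoint. A minimal sketch, where the path is a placeholder for the directory the conversion script writes to:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# "/output/path" is a placeholder for the conversion script's output directory
tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")
```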
Based on the original LLaMA model, Meta AI has released some follow-up works:
- **Llama2**: Llama2 is an improved version of Llama with some architectural tweaks (Grouped Query Attention) and is pre-trained on 2 trillion tokens. Refer to the [Llama2 documentation](llama2).
...@@ -82,7 +81,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] LlamaConfig
## LlamaTokenizer
[[autodoc]] LlamaTokenizer
...@@ -105,7 +103,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] LlamaModel
- forward
## LlamaForCausalLM
[[autodoc]] LlamaForCausalLM
...
...@@ -24,7 +24,10 @@ The abstract from the paper is the following:
*In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.*
Check out all Llama2 model checkpoints [here](https://huggingface.co/models?search=llama2).
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ) with contributions from [Lysandre Debut](https://huggingface.co/lysandre). The code of the implementation in Hugging Face is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox). The original code of the authors can be found [here](https://github.com/facebookresearch/llama).
## Usage tips
<Tip warning={true}>
...@@ -64,7 +67,6 @@ come in several checkpoints they each contain a part of each weight of the model
- The LLaMA tokenizer is a BPE model based on [sentencepiece](https://github.com/google/sentencepiece). One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string.
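For a quick end-to-end check, text generation with a Llama 2 checkpoint might look like the following. This is a minimal sketch: `meta-llama/Llama-2-7b-hf` is a gated repository, so you need to request access on the Hub and be logged in before it will download.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Hey, are you conscious? Can you talk to me?", return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```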
## Resources
...
...@@ -41,15 +41,15 @@ contrast to most prior work, we also pretrain Longformer and finetune it on a va
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
WikiHop and TriviaQA.*
This model was contributed by [beltagy](https://huggingface.co/beltagy). The Authors' code can be found [here](https://github.com/allenai/longformer).
## Usage tips
- Since the Longformer is based on RoBERTa, it doesn't have `token_type_ids`. You don't need to indicate which
token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or
`</s>`).
- Longformer replaces the full attention matrices with sparse ones to run faster. Often, the local context (e.g., what are the two tokens to the left and right?) is enough to take action for a given token. Some preselected input tokens are still given global attention, but the attention matrix has far fewer entries, resulting in a speed-up. See the local attention section for more information, and the sketch below for how to mark tokens for global attention.
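A minimal sketch of marking tokens for global attention with the base checkpoint (here only the leading `<s>` token gets global attention; task-specific code usually marks question tokens, the CLS token, etc.):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# 0 = local attention, 1 = global attention
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the <s> token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
last_hidden_state = outputs.last_hidden_state
```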
## Longformer Self Attention
Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
...@@ -93,7 +93,7 @@ mlm_labels = tokenizer.encode("This is a sentence from the training data", retur
loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
```
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
...@@ -143,6 +143,9 @@ loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput
<frameworkcontent>
<pt>
## LongformerModel
[[autodoc]] LongformerModel
...@@ -173,6 +176,9 @@ loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
[[autodoc]] LongformerForQuestionAnswering
- forward
</pt>
<tf>
## TFLongformerModel
[[autodoc]] TFLongformerModel
...@@ -202,3 +208,6 @@ loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
[[autodoc]] TFLongformerForMultipleChoice
- call
</tf>
</frameworkcontent>
...@@ -36,7 +36,10 @@ attention ideas from long-input transformers (ETC), and adopted pre-training str
able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on
question answering tasks.*
This model was contributed by [stancld](https://huggingface.co/stancld).
The original code can be found [here](https://github.com/google-research/longt5).
## Usage tips
- [`LongT5ForConditionalGeneration`] is an extension of [`T5ForConditionalGeneration`] that replaces the traditional
encoder *self-attention* layer with either efficient *local* attention or *transient-global* (*tglobal*) attention (see the configuration sketch below).
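As a rough sketch of how the attention variant is selected, the choice lives in [`LongT5Config`] (parameter names assumed from the configuration class; defaults may differ):

```python
from transformers import LongT5Config, LongT5ForConditionalGeneration

# build a randomly initialized model that uses transient-global attention in the encoder
config = LongT5Config(
    encoder_attention_type="transient-global",  # or "local"
    local_radius=127,      # radius of the local attention window
    global_block_size=16,  # tokens per transient-global block
)
model = LongT5ForConditionalGeneration(config)

# pretrained checkpoints already fix this choice, e.g. google/long-t5-local-base
# and google/long-t5-tglobal-base
```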
...@@ -87,10 +90,8 @@ The complexity of this mechanism is `O(l(r + l/k))`.
>>> rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])
```
## Resources
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)
...@@ -99,6 +100,9 @@ The original code can be found [here](https://github.com/google-research/longt5)
[[autodoc]] LongT5Config
<frameworkcontent>
<pt>
## LongT5Model
[[autodoc]] LongT5Model
...@@ -114,6 +118,9 @@ The original code can be found [here](https://github.com/google-research/longt5)
[[autodoc]] LongT5EncoderModel
- forward
</pt>
<jax>
## FlaxLongT5Model
[[autodoc]] FlaxLongT5Model
...@@ -127,3 +134,6 @@ The original code can be found [here](https://github.com/google-research/longt5)
- __call__
- encode
- decode
</jax>
</frameworkcontent>
...@@ -37,7 +37,9 @@ state-of-the-art results on five well-known datasets: Open Entity (entity typing
CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question
answering).*
This model was contributed by [ikuyamada](https://huggingface.co/ikuyamada) and [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/studio-ousia/luke).
## Usage tips
- This implementation is the same as [`RobertaModel`] with the addition of entity embeddings as well
as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities.
...@@ -75,13 +77,7 @@ Tips:
head models by specifying `task="entity_classification"`, `task="entity_pair_classification"`, or
`task="entity_span_classification"`. Please refer to the example code of each head model.
Usage example:
```python
>>> from transformers import LukeTokenizer, LukeModel, LukeForEntityPairClassification
...@@ -119,10 +115,10 @@ Example:
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
```
## Resources
- [A demo notebook on how to fine-tune [`LukeForEntityPairClassification`] for relation classification](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE)
- [Notebooks showcasing how to reproduce the results reported in the paper with the HuggingFace implementation of LUKE](https://github.com/studio-ousia/luke/tree/master/notebooks)
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
...
...@@ -41,7 +41,9 @@ best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablati
model components and pretraining strategies significantly contribute to our strong results; and also present several
attention visualizations for the different encoders*
This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). The original code can be found [here](https://github.com/airsplay/lxmert).
## Usage tips
- Bounding boxes are not required for the visual feature embeddings; any kind of visual-spatial features
will work (a minimal forward-pass sketch follows the tips below).
...@@ -53,9 +55,7 @@ Tips:
contains self-attention for each respective modality and cross-attention, only the cross attention is returned and
both self attention outputs are disregarded.
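A minimal sketch of a forward pass, with random tensors standing in for the region features and normalized box positions that an external detector (e.g. Faster R-CNN) would normally provide:

```python
import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("What is the cat doing?", return_tensors="pt")

# 36 region features of dimension 2048 plus their (x1, y1, x2, y2) positions
visual_feats = torch.rand(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
language_output = outputs.language_output
vision_output = outputs.vision_output
```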
## Resources
- [Question answering task guide](../tasks/question_answering)
...@@ -83,6 +83,9 @@ This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). T
[[autodoc]] models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
<frameworkcontent>
<pt>
## LxmertModel
[[autodoc]] LxmertModel
...@@ -98,6 +101,9 @@ This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). T
[[autodoc]] LxmertForQuestionAnswering
- forward
</pt>
<tf>
## TFLxmertModel
[[autodoc]] TFLxmertModel
...@@ -107,3 +113,6 @@ This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). T
[[autodoc]] TFLxmertForPreTraining
- call
</tf>
</frameworkcontent>
...@@ -38,7 +38,7 @@ open-source our scripts so that others may reproduce the data, evaluation, and f
This model was contributed by [valhalla](https://huggingface.co/valhalla).
## Usage tips and examples
M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
multilingual it expects the sequences in a certain format: A special language id token is used as prefix in both the
...@@ -48,7 +48,7 @@ id for source text and target language id for target text, with `X` being the so
The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
examples. To install `sentencepiece` run `pip install sentencepiece`.
**Supervised Training**
```python
from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer
...@@ -64,12 +64,12 @@ model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
loss = model(**model_inputs).loss  # forward pass
```
**Generation**
M2M100 uses the `eos_token_id` as the `decoder_start_token_id` for generation with the target language id
being forced as the first generated token. To force the target language id as the first generated token, pass the
*forced_bos_token_id* parameter to the *generate* method. The following example shows how to translate from
Hindi to French and from Chinese to English using the *facebook/m2m100_418M* checkpoint.
```python
>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
...@@ -95,7 +95,7 @@ loss = model(**model_inputs).loss # forward pass
"Life is like a box of chocolate."
```
## Resources
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)
...
...@@ -25,14 +25,11 @@ rendered properly in your Markdown viewer.
</a>
</div>
## Overview
A framework for translation models, using the same models as BART. Translations should be similar, but not identical to output in the test set linked to in each model card.
This model was contributed by [sshleifer](https://huggingface.co/sshleifer).
## Implementation Notes
...@@ -49,7 +46,7 @@ Tips:
- the model starts generating with `pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
`<s/>`),
- Code to bulk convert models can be found in `convert_marian_to_pytorch.py`.
## Naming
...@@ -165,7 +162,7 @@ Example of translating english to many romance languages, using old-style 2 char
'Y esto al español']
```
## Resources
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)
...@@ -180,6 +177,9 @@ Example of translating english to many romance languages, using old-style 2 char
[[autodoc]] MarianTokenizer
- build_inputs_with_special_tokens
<frameworkcontent>
<pt>
## MarianModel
[[autodoc]] MarianModel
...@@ -195,6 +195,9 @@ Example of translating english to many romance languages, using old-style 2 char
[[autodoc]] MarianForCausalLM
- forward
</pt>
<tf>
## TFMarianModel
[[autodoc]] TFMarianModel
...@@ -205,6 +208,9 @@ Example of translating english to many romance languages, using old-style 2 char
[[autodoc]] TFMarianMTModel
- call
</tf>
<jax>
## FlaxMarianModel
[[autodoc]] FlaxMarianModel
...@@ -214,3 +220,6 @@ Example of translating english to many romance languages, using old-style 2 char
[[autodoc]] FlaxMarianMTModel
- __call__
</jax>
</frameworkcontent>
...@@ -40,19 +40,19 @@ HTML/XML-based documents, where text and markup information is jointly pre-train
pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding
tasks. The pre-trained model and code will be publicly available.*
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/markuplm).
## Usage tips
- In addition to `input_ids`, [`~MarkupLMModel.forward`] expects 2 additional inputs, namely `xpath_tags_seq` and `xpath_subs_seq`.
These are the XPATH tags and subscripts respectively for each token in the input sequence.
- One can use [`MarkupLMProcessor`] to prepare all data for the model. Refer to the [usage guide](#usage-markuplmprocessor) for more info.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/markuplm_architecture.jpg"
alt="drawing" width="600"/>
<small> MarkupLM architecture. Taken from the <a href="https://arxiv.org/abs/2110.08518">original paper.</a> </small>
## Usage: MarkupLMProcessor
The easiest way to prepare data for the model is to use [`MarkupLMProcessor`], which internally combines a feature extractor
...@@ -197,8 +197,9 @@ all nodes and xpaths yourself, you can provide them directly to the processor. M
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq'])
```
## Resources
- [Demo notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/MarkupLM)
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
...
...@@ -25,16 +25,17 @@ The abstract from the paper is the following:
*Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice
of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/mask2former_architecture.jpg" alt="drawing" width="600"/>
<small> Mask2Former architecture. Taken from the <a href="https://arxiv.org/abs/2112.01527">original paper.</a> </small>
This model was contributed by [Shivalika Singh](https://huggingface.co/shivi) and [Alara Dirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/facebookresearch/Mask2Former).
## Usage tips
- Mask2Former uses the same preprocessing and postprocessing steps as [MaskFormer](maskformer). Use [`Mask2FormerImageProcessor`] or [`AutoImageProcessor`] to prepare images and optional targets for the model.
- To get the final segmentation, depending on the task, you can call [`~Mask2FormerImageProcessor.post_process_semantic_segmentation`] or [`~Mask2FormerImageProcessor.post_process_instance_segmentation`] or [`~Mask2FormerImageProcessor.post_process_panoptic_segmentation`]. All three tasks can be solved using the [`Mask2FormerForUniversalSegmentation`] output; panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object(s) (e.g. sky) together. See the sketch below.
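A minimal sketch of the inference flow described above, assuming a panoptic checkpoint such as `facebook/mask2former-swin-large-coco-panoptic` (swap in the post-processing method that matches your task):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-large-coco-panoptic")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-large-coco-panoptic")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# turn the class and mask queries into a single panoptic segmentation map
result = processor.post_process_panoptic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
panoptic_map = result["segmentation"]    # (height, width) tensor of segment ids
segments_info = result["segments_info"]  # per-segment labels and scores
```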
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Mask2Former.
...@@ -44,16 +45,16 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it.
The resource should ideally demonstrate something new instead of duplicating an existing resource.
## MaskFormer specific outputs
[[autodoc]] models.mask2former.modeling_mask2former.Mask2FormerModelOutput
[[autodoc]] models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput
## Mask2FormerConfig
[[autodoc]] Mask2FormerConfig
## Mask2FormerModel
[[autodoc]] Mask2FormerModel
...
...@@ -31,7 +31,14 @@ The abstract from the paper is the following:
*Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.*
The figure below illustrates the architecture of MaskFormer. Taken from the [original paper](https://arxiv.org/abs/2107.06278).
<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/maskformer_architecture.png"/>
This model was contributed by [francesco](https://huggingface.co/francesco). The original code can be found [here](https://github.com/facebookresearch/MaskFormer).
## Usage tips
- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR found it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxiliary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
- If you want to train the model in a distributed environment across multiple nodes, then one should update the
`get_num_masks` function inside the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be
...@@ -39,12 +46,6 @@ Tips:
- One can use [`MaskFormerImageProcessor`] to prepare images and optional targets for the model.
- To get the final segmentation, depending on the task, you can call [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or [`~MaskFormerImageProcessor.post_process_panoptic_segmentation`]. Both tasks can be solved using the [`MaskFormerForInstanceSegmentation`] output; panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object(s) (e.g. sky) together. A semantic segmentation sketch follows below.
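A minimal sketch of semantic segmentation post-processing, assuming an ADE20k checkpoint such as `facebook/maskformer-swin-base-ade`:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, MaskFormerForInstanceSegmentation

processor = AutoImageProcessor.from_pretrained("facebook/maskformer-swin-base-ade")
model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# combine the predicted class logits and mask logits into a per-pixel semantic map
semantic_map = processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
print(semantic_map.shape)  # (height, width)
```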
## Resources
<PipelineTag pipeline="image-segmentation"/>
...
...@@ -67,4 +67,10 @@ from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup
optimizer = Adafactor(self.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)
```
\ No newline at end of file
<Tip>
MatCha is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct).
</Tip>
\ No newline at end of file
...@@ -25,8 +25,6 @@ rendered properly in your Markdown viewer.
</a>
</div>
## Overview of MBart
...@@ -186,6 +184,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
[[autodoc]] MBart50TokenizerFast
<frameworkcontent>
<pt>
## MBartModel
[[autodoc]] MBartModel
...@@ -207,6 +208,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
[[autodoc]] MBartForCausalLM
- forward
</pt>
<tf>
## TFMBartModel
[[autodoc]] TFMBartModel
...@@ -217,6 +221,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
[[autodoc]] TFMBartForConditionalGeneration
- call
</tf>
<jax>
## FlaxMBartModel
[[autodoc]] FlaxMBartModel
...@@ -244,3 +251,6 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
- __call__
- encode
- decode
</jax>
</frameworkcontent>
...@@ -40,17 +40,15 @@ pseudo-labels for all languages, either from scratch or by fine-tuning. Experime
Common Voice and unlabeled VoxPopuli datasets show that our recipe can yield a model with better
performance for many languages that also transfers well to LibriSpeech.*
This model was contributed by [cwkeam](https://huggingface.co/cwkeam). The original code can be found [here](https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl).
## Usage tips
The PyTorch version of this model is only available in torch 1.9 and higher.
## Resources
- [Automatic speech recognition task guide](../tasks/asr)
## MCTCTConfig
...@@ -70,7 +68,6 @@ Tips:
- batch_decode
- decode
## MCTCTModel
[[autodoc]] MCTCTModel
...
...@@ -28,15 +28,17 @@ The abstract from the paper is the following:
*The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.*
This model was contributed by [mnaylor](https://huggingface.co/mnaylor).
The original code can be found [here](https://github.com/facebookresearch/mega).
## Usage tips
- MEGA can perform quite well with relatively few parameters. See Appendix D in the MEGA paper for examples of architectural specs which perform well in various settings. If using MEGA as a decoder, be sure to set `bidirectional=False` to avoid errors, since bidirectional is the default.
- Mega-chunk is a variant of Mega that reduces the time and space complexity from quadratic to linear. Enable chunking with `MegaConfig.use_chunking` and control the chunk size with `MegaConfig.chunk_size`. A configuration sketch follows below.
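A minimal sketch of a small decoder-style configuration (the sizes here are hypothetical and chosen only for illustration):

```python
from transformers import MegaConfig, MegaForCausalLM

config = MegaConfig(
    vocab_size=32000,     # hypothetical vocabulary size
    hidden_size=128,
    num_hidden_layers=4,
    bidirectional=False,  # required when using MEGA as a decoder
    is_decoder=True,
    use_chunking=True,    # enable the linear-complexity Mega-chunk variant
    chunk_size=64,
)
model = MegaForCausalLM(config)
```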
Implementation Notes: ## Implementation Notes
- The original implementation of MEGA had an inconsistent expectation of attention masks for padding and causal self-attention between the softmax attention and Laplace/squared ReLU method. This implementation addresses that inconsistency.
- The original implementation did not include token type embeddings; this implementation adds support for these, with the option controlled by `MegaConfig.add_token_type_embeddings` (see the sketch below).
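A small illustration of enabling this option when building a configuration from scratch:

```py
>>> from transformers import MegaConfig, MegaModel

>>> # Turn on the token type embeddings added by this implementation
>>> configuration = MegaConfig(add_token_type_embeddings=True)
>>> model = MegaModel(configuration)
```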
...

...@@ -40,7 +40,11 @@ achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.
accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy
of 89.4%).*
This model was contributed by [jdemouth](https://huggingface.co/jdemouth). The original code can be found [here](https://github.com/NVIDIA/Megatron-LM).
That repository contains a multi-GPU and multi-node implementation of the Megatron Language models. In particular,
it contains a hybrid model parallel approach using "tensor parallel" and "pipeline parallel" techniques.

## Usage tips
We have provided pretrained [BERT-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m) checkpoints
for use in evaluating or fine-tuning downstream tasks.
...@@ -78,11 +82,7 @@ python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpo
python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpoint.py megatron_bert_345m_v0_1_cased.zip
```
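Once converted, the checkpoint can be loaded like any local Transformers model. A minimal sketch, assuming the conversion wrote `config.json` and the weights into a hypothetical local folder named `megatron_bert_345m_v0_1_cased`:

```py
>>> from transformers import MegatronBertForMaskedLM

>>> # Hypothetical local path: point this at the directory produced by the conversion script
>>> model = MegatronBertForMaskedLM.from_pretrained("./megatron_bert_345m_v0_1_cased")
```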
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)

...

...@@ -40,7 +40,11 @@ achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.
accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy
of 89.4%).*
This model was contributed by [jdemouth](https://huggingface.co/jdemouth). The original code can be found [here](https://github.com/NVIDIA/Megatron-LM).
That repository contains a multi-GPU and multi-node implementation of the Megatron Language models. In particular, it
contains a hybrid model parallel approach using "tensor parallel" and "pipeline parallel" techniques.

## Usage tips
We have provided pretrained [GPT2-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_lm_345m) checkpoints
for use in evaluating or fine-tuning downstream tasks.
...@@ -65,7 +69,9 @@ The following command allows you to do the conversion. We assume that the folder
python3 $PATH_TO_TRANSFORMERS/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py megatron_gpt2_345m_v0_0.zip
```
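After conversion, the weights are in the standard GPT-2 format, so they can be loaded with the regular GPT-2 classes. A minimal sketch, assuming the converted files live in a hypothetical local folder named `megatron_gpt2_345m`:

```py
>>> from transformers import GPT2LMHeadModel

>>> # Hypothetical local path: the directory holding the converted config.json and weights
>>> model = GPT2LMHeadModel.from_pretrained("./megatron_gpt2_345m")
```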
<Tip>

MegatronGPT2 architecture is the same as OpenAI GPT-2. Refer to [GPT-2 documentation](gpt2) for information on
configuration classes and their parameters.

</Tip>
\ No newline at end of file
...@@ -29,12 +29,10 @@ alt="drawing" width="600"/>

<small> MGP-STR architecture. Taken from the <a href="https://arxiv.org/abs/2209.03592">original paper</a>. </small>
MGP-STR is trained on two synthetic datasets [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ) and [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST) without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE).
This model was contributed by [yuekun](https://huggingface.co/yuekun). The original code can be found [here](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR).

## Inference example
[`MgpstrModel`] accepts images as input and generates three types of predictions, which represent textual information at different granularities.
The three types of predictions are fused to give the final prediction result.
...@@ -46,7 +44,7 @@ into a single instance to both extract the input features and decode the predict

- Step-by-step Optical Character Recognition (OCR)

```py
>>> from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
>>> import requests
>>> from PIL import Image

...
...@@ -18,9 +18,9 @@ rendered properly in your Markdown viewer.

## Overview

Mistral-7B-v0.1 is Mistral AI's first Large Language Model (LLM).

### Model Details

Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices:

* Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens

...@@ -31,11 +31,11 @@ We also provide an instruction fine-tuned model: `Mistral-7B-Instruct-v0.1` whic

For more details please read our [release blog post](https://mistral.ai/news/announcing-mistral-7b/)

### License

Both `Mistral-7B-v0.1` and `Mistral-7B-Instruct-v0.1` are released under the Apache 2.0 license.
## Usage tips

`Mistral-7B-v0.1` and `Mistral-7B-Instruct-v0.1` can be found on the [Huggingface Hub](https://huggingface.co/mistralai)
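As a minimal sketch (assuming the base checkpoint is published under the id `mistralai/Mistral-7B-v0.1` and that enough memory is available for a 7B model), the model can be loaded and sampled from like any other causal LM:

```py
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> # Checkpoint id assumed from the Hub organization linked above
>>> model_id = "mistralai/Mistral-7B-v0.1"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModelForCausalLM.from_pretrained(model_id)

>>> inputs = tokenizer("My favourite condiment is", return_tensors="pt")
>>> generated_ids = model.generate(**inputs, max_new_tokens=20)
>>> print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```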
...