"examples/vscode:/vscode.git/clone" did not exist on "5f1e67a566d7c40eba34e87ffc37f4707f30416c"
Unverified Commit 7a5533b2 authored by Steven Liu, committed by GitHub

Refactor model summary (#21408)

* first draft of model summary

* restructure docs

* finish first draft

* minor reviews and edits

* apply feedbacks

* save important info, create new page for attention

* add attention doc to toctree

* few more minor fixes
parent a0e69a93
......@@ -12,6 +12,12 @@ specific language governing permissions and limitations under the License.
# RAG
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=rag">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
</a>
</div>
## Overview
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
......@@ -42,6 +48,8 @@ parametric-only seq2seq baseline.*
This model was contributed by [ola13](https://huggingface.co/ola13).
Tips:
- Retrieval-augmented generation (“RAG”) models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate outputs, as sketched below. The retriever and seq2seq modules are initialized from pretrained models and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
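A minimal sketch of this retrieve-then-generate flow, assuming the `facebook/rag-sequence-nq` checkpoint and a small dummy index so the snippet stays runnable (a real retriever loads the full `wiki_dpr` index):

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
# The dummy dataset keeps this lightweight; drop it to use the real wiki_dpr index.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
# generate() retrieves documents, conditions the seq2seq module on them,
# and marginalizes over the retrieved documents to produce the output.
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```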
## RagConfig
......
......@@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.
# Reformer
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=reformer">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/reformer-crime-and-punishment">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
**DISCLAIMER:** This model is still a work in progress. If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
## Overview
......@@ -31,9 +40,13 @@ while being much more memory-efficient and much faster on long sequences.*
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
found [here](https://github.com/google/trax/tree/master/trax/models/reformer).
Tips:
- Reformer does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035).
- Use axial position encoding (see below for more details). It’s a mechanism to avoid having a huge positional encoding matrix (when the sequence length is very long) by factorizing it into smaller matrices.
- Replace traditional attention with LSH (locality-sensitive hashing) attention (see below for more details). It’s a technique to avoid computing the full query-key product in the attention layers.
- Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during the backward pass (subtracting the residuals from the input of the next layer gives them back) or by recomputing them for results inside a given layer (less efficient than storing them but memory-saving).
- Compute the feedforward operations in chunks rather than on the whole batch. A configuration sketch of these options follows this list.
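As an illustration, these techniques map onto `ReformerConfig` options. The values below are assumptions chosen for a small, self-consistent demo (the axial shape must factorize `max_position_embeddings`, and the axial embedding dims must sum to `hidden_size`), not recommended settings:

```python
from transformers import ReformerConfig, ReformerModel

config = ReformerConfig(
    axial_pos_embds=True,
    axial_pos_shape=(64, 64),        # two small matrices cover 64 * 64 = 4096 positions
    axial_pos_embds_dim=(64, 192),   # 64 + 192 = hidden_size of 256
    hidden_size=256,
    attn_layers=["lsh", "local", "lsh", "local"],  # LSH attention instead of full attention
    chunk_size_feed_forward=64,      # feed-forward applied in chunks of 64 tokens
)
model = ReformerModel(config)
```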
## Axial Positional Encodings
......
......@@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.
# RoBERTa
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/roberta-base">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
## Overview
The RoBERTa model was proposed in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
......@@ -39,6 +48,12 @@ Tips:
different pretraining scheme.
- RoBERTa doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
separate your segments with the separation token `tokenizer.sep_token` (or `</s>`), as shown in the sketch after this list.
- Same as BERT with better pretraining tricks:
* dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
* no NSP (next sentence prediction) loss: instead of putting just two sentences together, chunks of contiguous text are packed together to reach 512 tokens (so the sentences are in an order that may span several documents)
* train with larger batches
* use BPE with bytes as a subunit and not characters (because of unicode characters)
- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to this page for usage examples.
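A minimal sketch of encoding a segment pair without `token_type_ids` (the sentences are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# No token_type_ids are produced; the two segments are joined with the sep token.
encoded = tokenizer("First segment.", "Second segment.")
print(tokenizer.decode(encoded["input_ids"]))
# <s>First segment.</s></s>Second segment.</s>
```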
This model was contributed by [julien-c](https://huggingface.co/julien-c). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/roberta).
......
......@@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.
# T5
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=t5">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/t5-base">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
## Overview
The T5 model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
......@@ -35,6 +44,8 @@ Tips:
each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
for summarization: *summarize: ...*.
- The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
- Self-supervised training uses corrupted tokens: 15% of the tokens are randomly removed and replaced with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original sentence, and the target is the dropped-out tokens delimited by their sentinel tokens.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
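As an illustration, a minimal sketch of the task-prefix format and the sentinel-token objective, assuming the `t5-small` checkpoint (the prompts are arbitrary examples):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task prefix tells T5 which text-to-text task to perform.
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# The self-supervised objective replaces spans with sentinel tokens (<extra_id_0>, ...).
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
loss = model(input_ids=input_ids, labels=labels).loss
```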
......
......@@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.
# Transformer XL
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=transfo-xl">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/transfo-xl-wt103">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
## Overview
The Transformer-XL model was proposed in [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan
......@@ -37,6 +46,9 @@ Tips:
- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The
original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.
- Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to regular RNNs with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that may span multiple documents, and segments are fed in order to the model.
- Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention scores. This allows the model to pay attention to information that was in the previous segment as well as the current one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
- This changes the positional embeddings to relative positional embeddings (as the regular positional embeddings would give the same results for the current input and the current hidden state at a given position) and requires some adjustments to the way attention scores are computed. A sketch of the recurrence mechanism follows this list.
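A minimal sketch of the recurrence mechanism via the `mems` the model returns, assuming the `transfo-xl-wt103` checkpoint (the segment texts are placeholders):

```python
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

# The hidden states of segment 1 come back as `mems`; passing them with
# segment 2 lets attention reach back into the previous segment.
segment_1 = tokenizer("The first segment of a long document", return_tensors="pt")["input_ids"]
out_1 = model(segment_1)

segment_2 = tokenizer("continues here in a second segment", return_tensors="pt")["input_ids"]
out_2 = model(segment_2, mems=out_1.mems)
```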
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/kimiyoung/transformer-xl).
......
......@@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.
# XLM-ProphetNet
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=xprophetnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/xprophetnet-large-wiki100-cased-xglue-ntg">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
**DISCLAIMER:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign @patrickvonplaten.
......@@ -39,6 +48,10 @@ state-of-the-art results on all these datasets compared to the models using the
The Authors' code can be found [here](https://github.com/microsoft/ProphetNet).
Tips:
- XLM-ProphetNet's model architecture and pretraining objective are the same as ProphetNet's, but XLM-ProphetNet was pretrained on the cross-lingual dataset XGLUE.
## XLMProphetNetConfig
[[autodoc]] XLMProphetNetConfig
......
......@@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.
# XLM-RoBERTa
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=xlm-roberta">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/xlm-roberta-base">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
## Overview
The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
......@@ -38,6 +47,7 @@ Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
not require `lang` tensors to understand which language is used, and should be able to determine the correct
language from the input ids.
- Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses masked language modeling on sentences coming from one language.
- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
as well as the information relative to the inputs and outputs.
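For instance, a minimal sketch where no `lang` tensor is passed (the checkpoint and the French prompt are arbitrary):

```python
from transformers import pipeline

# The model infers the language from the input ids alone.
unmasker = pipeline("fill-mask", model="xlm-roberta-base")
print(unmasker("Bonjour, je suis un modèle <mask>.")[0]["token_str"])
```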
......
......@@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.
# XLM
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=xlm">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/xlm-mlm-en-2048">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
## Overview
The XLM model was proposed in [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by
......@@ -38,6 +47,11 @@ Tips:
- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific `lang` parameter. Check out the [multi-lingual](../multilingual) page for more information.
- A transformer model trained on several languages. There are three different types of training for this model, and the library provides checkpoints for all of them (see the sketch after this list):
* Causal language modeling (CLM), the traditional autoregressive training (so this model could be in the previous section as well). One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens that may span several documents in one of those languages.
* Masked language modeling (MLM), which is like RoBERTa. One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens that may span several documents in one of those languages, with dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two different languages, with random masking. To predict one of the masked tokens, the model can use both the surrounding context in language 1 and the context given by language 2.
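A minimal sketch of passing the `langs` tensor to a multilingual checkpoint, assuming `xlm-clm-enfr-1024` (the English prompt is arbitrary):

```python
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

# Multilingual XLM checkpoints take a `langs` tensor marking each token's language.
input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])
langs = torch.full_like(input_ids, tokenizer.lang2id["en"])
outputs = model(input_ids, langs=langs)
```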
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/facebookresearch/XLM/).
......
......@@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.
# XLNet
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=xlnet">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/xlnet-base-cased">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
## Overview
The XLNet model was proposed in [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov,
......@@ -40,6 +49,8 @@ Tips:
`target_mapping` inputs to control the attention span and outputs (see examples in
*examples/pytorch/text-generation/run_generation.py*)
- XLNet is one of the few models that has no sequence length limit.
- XLNet is not a traditional autoregressive model but uses a training strategy that builds on that idea. It permutes the tokens in the sentence, then allows the model to use the last n tokens to predict token n+1. Since this is all done with a mask, the sentence is actually fed into the model in the right order, but instead of masking the first n tokens for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1,…,sequence length.
- XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.
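A minimal sketch of controlling the attention span with `perm_mask` and `target_mapping`, assuming the `xlnet-base-cased` checkpoint (the prompt is arbitrary):

```python
import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

input_ids = torch.tensor([tokenizer.encode("Hello, my dog is very cute")])
seq_len = input_ids.shape[1]

# perm_mask[b, i, j] = 1 means token i cannot see token j; hide the last token everywhere.
perm_mask = torch.zeros((1, seq_len, seq_len))
perm_mask[:, :, -1] = 1.0

# target_mapping selects which positions to predict; here only the last one.
target_mapping = torch.zeros((1, 1, seq_len))
target_mapping[0, 0, -1] = 1.0

outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs.logits  # shape (1, 1, vocab_size)
```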
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/zihangdai/xlnet/).
......