Unverified Commit 4c99e553 authored by NielsRogge and committed by GitHub

Improve documentation of some models (#14695)



* Migrate docs to mdx

* Update TAPAS docs

* Remove lines

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply some more suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add pt/tf switch to code examples

* More improvements

* Improve docstrings

* More improvements
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 32eb29fe
......@@ -40,8 +40,15 @@ significantly outperforming from-scratch DeiT training (81.8%) with the same set
Tips:
- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
outperform both the original model (ViT) as well as Data-efficient Image Transformers (DeiT) when fine-tuned on
ImageNet-1K and CIFAR-100.
outperform both the :doc:`original model (ViT) <vit>` as well as :doc:`Data-efficient Image Transformers (DeiT)
<deit>` when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
fine-tuning on custom data `here
<https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__ (you can just replace
:class:`~transformers.ViTFeatureExtractor` by :class:`~transformers.BeitFeatureExtractor` and
:class:`~transformers.ViTForImageClassification` by :class:`~transformers.BeitForImageClassification`).
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
performing masked image modeling. You can find it `here
<https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT>`__.
- As the BEiT models expect each image to be of the same size (resolution), one can use
:class:`~transformers.BeitFeatureExtractor` to resize (or rescale) and normalize images for the model (see the short example below).
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
......
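Below is a minimal inference sketch with :class:`~transformers.BeitFeatureExtractor` and
:class:`~transformers.BeitForImageClassification`. The ``microsoft/beit-base-patch16-224`` checkpoint and the COCO
sample image are assumptions chosen for illustration; adapt both to your use case.

.. code-block::

>>> from transformers import BeitFeatureExtractor, BeitForImageClassification
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224")
>>> model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")
>>> # resize (or rescale) and normalize the image, then run it through the model
>>> inputs = feature_extractor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> # this checkpoint is fine-tuned on ImageNet-1k, so the logits have shape (batch_size, 1000)
>>> predicted_class_idx = outputs.logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])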
..
Copyright 2021 The HuggingFace Team. All rights reserved.
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
ImageGPT
-----------------------------------------------------------------------------------------------------------------------
# ImageGPT
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Overview
The ImageGPT model was proposed in `Generative Pretraining from Pixels <https://openai.com/blog/image-gpt/>`__ by Mark
The ImageGPT model was proposed in [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt) by Mark
Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. ImageGPT (iGPT) is a GPT-2-like
model trained to predict the next pixel value, allowing for both unconditional and conditional image generation.
......@@ -31,25 +28,29 @@ ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pr
competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0%
top-1 accuracy on a linear probe of our features.*
The figure below summarizes the approach (taken from the `original paper
<https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf>`__):
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png"
alt="drawing" width="600"/>
.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png
:width: 600
<small> Summary of the approach. Taken from the [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found
[here](https://github.com/openai/image-gpt).
Tips:
- ImageGPT is almost exactly the same as :doc:`GPT-2 <gpt2>`, with the exception that a different activation function
is used (namely "quick gelu"), and the layer normalization layers don't mean center the inputs. ImageGPT also doesn't
have tied input- and output embeddings.
- Demo notebooks for ImageGPT can be found
[here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ImageGPT).
- ImageGPT is almost exactly the same as [GPT-2](./model_doc/gpt2), with the exception that a different activation
function is used (namely "quick gelu"), and the layer normalization layers don't mean center the inputs. ImageGPT
also doesn't have tied input- and output embeddings.
- As the time and memory requirements of the attention mechanism of Transformers scale quadratically in the sequence
length, the authors pre-trained ImageGPT on smaller input resolutions, such as 32x32 and 64x64. However, feeding a
sequence of 32x32x3=3072 tokens from 0..255 into a Transformer is still prohibitively expensive. Therefore, the authors
applied k-means clustering to the (R,G,B) pixel values with k=512. This way, we only have a 32*32 = 1024-long
sequence, but now of integers in the range 0..511. So we are shrinking the sequence length at the cost of a bigger
embedding matrix. In other words, the vocabulary size of ImageGPT is 512, plus 1 for a special "start of sentence" (SOS)
token, used at the beginning of every sequence. One can use :class:`~transformers.ImageGPTFeatureExtractor` to
prepare images for the model.
token, used at the beginning of every sequence. One can use [`ImageGPTFeatureExtractor`] to prepare
images for the model.
- Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly
performant image features useful for downstream tasks, such as image classification. The authors showed that the
features in the middle of the network are the most performant, and can be used as-is to train a linear model (such as
......@@ -57,54 +58,43 @@ Tips:
easily obtained by first forwarding the image through the model with `output_hidden_states=True`, and then
average-pooling the hidden states at whatever layer you like (a short sketch follows the table below).
- Alternatively, one can further fine-tune the entire model on a downstream dataset, similar to BERT. For this, you can
use :class:`~transformers.ImageGPTForImageClassification`.
use [`ImageGPTForImageClassification`].
- ImageGPT comes in different sizes: there's ImageGPT-small, ImageGPT-medium and ImageGPT-large. The authors also
trained an XL variant, which they didn't release. The differences in size are summarized in the following table:
+-------------------+----------------------+-----------------+---------------------+--------------+
| **Model variant** | **Number of layers** | **Hidden size** | **Number of heads** | **# params** |
+-------------------+----------------------+-----------------+---------------------+--------------+
| iGPT-small | 24 | 512 | 8 | 76 million |
+-------------------+----------------------+-----------------+---------------------+--------------+
| iGPT-medium | 36 | 1024 | 8 | 455 million |
+-------------------+----------------------+-----------------+---------------------+--------------+
| iGPT-large | 48 | 1536 | 16 | 1.4 billion |
+-------------------+----------------------+-----------------+---------------------+--------------+
| iGPT-XL | 60 | 3072 | not specified | 6.8 billion |
+-------------------+----------------------+-----------------+---------------------+--------------+
| **Model variant** | **Number of layers** | **Hidden size** | **Number of heads** | **# params** |
|---|---|---|---|---|
| iGPT-small | 24 | 512 | 8 | 76 million |
| iGPT-medium | 36 | 1024 | 8 | 455 million |
| iGPT-large | 48 | 1536 | 16 | 1.4 billion |
| iGPT-XL | 60 | 3072 | not specified | 6.8 billion |
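As a rough sketch of the feature-extraction workflow described in the tips above: the `openai/imagegpt-small`
checkpoint, the COCO sample image, and the choice of middle layer are assumptions for illustration only.

``` py
>>> import torch
>>> from transformers import ImageGPTFeatureExtractor, ImageGPTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")
>>> model = ImageGPTModel.from_pretrained("openai/imagegpt-small")

>>> # the feature extractor resizes the image and color-quantizes it into a sequence of cluster indices
>>> inputs = feature_extractor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs, output_hidden_states=True)

>>> # average-pool the hidden states of a middle layer to obtain image features for a linear probe
>>> middle_layer = len(outputs.hidden_states) // 2
>>> features = outputs.hidden_states[middle_layer].mean(dim=1)
```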
## ImageGPTConfig
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__, based on `this issue
<https://github.com/openai/image-gpt/issues/7>`__. The original code can be found `here
<https://github.com/openai/image-gpt>`__.
[[autodoc]] ImageGPTConfig
ImageGPTConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ImageGPTFeatureExtractor
.. autoclass:: transformers.ImageGPTConfig
:members:
[[autodoc]] ImageGPTFeatureExtractor
ImageGPTFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- __call__
.. autoclass:: transformers.ImageGPTFeatureExtractor
:members: __call__
## ImageGPTModel
ImageGPTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[[autodoc]] ImageGPTModel
.. autoclass:: transformers.ImageGPTModel
:members: forward
- forward
## ImageGPTForCausalImageModeling
ImageGPTForCausalImageModeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[[autodoc]] ImageGPTForCausalImageModeling
.. autoclass:: transformers.ImageGPTForCausalImageModeling
:members: forward
- forward
## ImageGPTForImageClassification
ImageGPTForImageClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[[autodoc]] ImageGPTForImageClassification
.. autoclass:: transformers.ImageGPTForImageClassification
:members: forward
- forward
\ No newline at end of file
......@@ -74,6 +74,9 @@ Tips:
head models by specifying ``task="entity_classification"``, ``task="entity_pair_classification"``, or
``task="entity_span_classification"``. Please refer to the example code of each head models.
A demo notebook on how to fine-tune :class:`~transformers.LukeForEntityPairClassification` for relation
classification can be found `here <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE>`__.
There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
the HuggingFace implementation of LUKE. They can be found `here
<https://github.com/studio-ousia/luke/tree/master/notebooks>`__.
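For reference, here is a minimal inference sketch with :class:`~transformers.LukeForEntityPairClassification`. The
``studio-ousia/luke-large-finetuned-tacred`` checkpoint, the sentence, and the entity spans are assumptions chosen for
illustration.

.. code-block::

>>> from transformers import LukeTokenizer, LukeForEntityPairClassification
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred", task="entity_pair_classification")
>>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> text = "Beyoncé lives in Los Angeles."
>>> # character-based spans of the head and tail entities ("Beyoncé" and "Los Angeles")
>>> entity_spans = [(0, 7), (17, 28)]
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_class_idx = outputs.logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])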
......
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Perceiver
## Overview
The Perceiver IO model was proposed in [Perceiver IO: A General Architecture for Structured Inputs &
Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
Perceiver IO is a generalization of [Perceiver](https://arxiv.org/abs/2103.03206) to handle arbitrary outputs in
addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to
classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio.
This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is
linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process
inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example,
Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.
The abstract from the paper is the following:
*The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point
clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of
inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without
sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales
linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural language and visual understanding,
StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT
baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art
performance on Sintel optical flow estimation.*
Here's a TLDR explaining how Perceiver works:
The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale
quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512
tokens. Perceiver aims to solve this issue by, instead of performing self-attention on the inputs, performing it on a
set of latent variables, and only using the inputs for cross-attention. In this way, the time and memory requirements
don't depend on the length of the inputs anymore, as one uses a fixed number of latent variables, like 256 or 512. These are
randomly initialized, after which they are trained end-to-end using backpropagation.
Internally, [`PerceiverModel`] will create the latents, which is a tensor of shape `(batch_size, num_latents,
d_latents)`. One must provide `inputs` (which could be text, images, audio, you name it!) to the model, which it will
use to perform cross-attention with the latents. The output of the Perceiver encoder is a tensor of the same shape. One
can then, similar to BERT, convert the last hidden states of the latents to classification logits by averaging along
the sequence dimension, and placing a linear layer on top of that to project the `d_latents` to `num_labels`.
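As a rough illustration of this pooling idea (randomly initialized weights, dummy byte inputs, and a hypothetical
2-label head — a sketch, not a training recipe):

``` py
>>> import torch
>>> from transformers import PerceiverConfig, PerceiverModel
>>> from transformers.models.perceiver.modeling_perceiver import PerceiverTextPreprocessor

>>> config = PerceiverConfig()
>>> # the text preprocessor embeds the inputs before they cross-attend with the latents
>>> model = PerceiverModel(config, input_preprocessor=PerceiverTextPreprocessor(config))

>>> inputs = torch.randint(0, config.vocab_size, (1, 512))  # dummy byte IDs
>>> outputs = model(inputs=inputs)
>>> last_hidden_state = outputs.last_hidden_state  # shape (batch_size, num_latents, d_latents)

>>> # average along the latent (sequence) dimension and project to a hypothetical number of labels
>>> classifier = torch.nn.Linear(config.d_latents, 2)
>>> logits = classifier(last_hidden_state.mean(dim=1))
```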
This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up
work, Perceiver IO, the authors generalized it to let the model also produce outputs of arbitrary size. How, you might ask? The
idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the
last hidden states of the latents, using the outputs as queries, and the latents as keys and values.
So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input
length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes,
providing `inputs` of length 2048 to the model. If one now masks out certain of these 2048 tokens, one can define the
`outputs` as being of shape: `(batch_size, 2048, 768)`. Next, one performs cross-attention with the final hidden states
of the latents to update the `outputs` tensor. After cross-attention, one still has a tensor of shape `(batch_size,
2048, 768)`. One can then place a regular language modeling head on top, to project the last dimension to the
vocabulary size of the model, i.e. creating logits of shape `(batch_size, 2048, 262)` (as Perceiver uses a vocabulary
size of 262 byte IDs).
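A minimal sketch of this byte-level masked language modeling, assuming the `deepmind/language-perceiver` checkpoint;
the byte offsets below are chosen to cover " missing." in the example sentence and are only illustrative.

``` py
>>> from transformers import PerceiverTokenizer, PerceiverForMaskedLM

>>> tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
>>> model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

>>> text = "This is an incomplete sentence where some words are missing."
>>> # the tokenizer operates on raw UTF-8 bytes and pads to a length of 2048
>>> encoding = tokenizer(text, padding="max_length", return_tensors="pt")

>>> # mask the bytes corresponding to " missing." (the +1 offset accounts for the prepended special token)
>>> encoding.input_ids[0, 52:61] = tokenizer.mask_token_id

>>> outputs = model(inputs=encoding.input_ids, attention_mask=encoding.attention_mask)
>>> logits = outputs.logits  # shape (batch_size, 2048, 262)
```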
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg"
alt="drawing" width="600"/>
<small> Perceiver IO architecture. Taken from the [original paper](https://arxiv.org/abs/2107.14795). </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/deepmind/deepmind-research/tree/master/perceiver).
Tips:
- The quickest way to get started with the Perceiver is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver).
- Note that the models available in the library only showcase some examples of what you can do with the Perceiver.
There are many more use cases, including question answering,
named-entity recognition, object detection, audio classification, video classification, etc.
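For example, here is a minimal image-classification sketch; the `deepmind/vision-perceiver-learned` checkpoint and the
COCO sample image are assumptions chosen for illustration.

``` py
>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
>>> model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

>>> # preprocess the image and classify it (this checkpoint is fine-tuned on ImageNet-1k)
>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> predicted_class_idx = outputs.logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
```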
## Perceiver specific outputs
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverModelOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverDecoderOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassifierOutput
## PerceiverConfig
[[autodoc]] PerceiverConfig
## PerceiverTokenizer
[[autodoc]] PerceiverTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## PerceiverFeatureExtractor
[[autodoc]] PerceiverFeatureExtractor
- __call__
## PerceiverTextPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
## PerceiverImagePreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverImagePreprocessor
## PerceiverOneHotPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor
## PerceiverAudioPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor
## PerceiverMultimodalPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor
## PerceiverProjectionDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverProjectionDecoder
## PerceiverBasicDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverBasicDecoder
## PerceiverClassificationDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassificationDecoder
## PerceiverOpticalFlowDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder
## PerceiverBasicVideoAutoencodingDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverBasicVideoAutoencodingDecoder
## PerceiverMultimodalDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder
## PerceiverProjectionPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor
## PerceiverAudioPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor
## PerceiverClassificationPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor
## PerceiverMultimodalPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor
## PerceiverModel
[[autodoc]] PerceiverModel
- forward
## PerceiverForMaskedLM
[[autodoc]] PerceiverForMaskedLM
- forward
## PerceiverForSequenceClassification
[[autodoc]] PerceiverForSequenceClassification
- forward
## PerceiverForImageClassificationLearned
[[autodoc]] PerceiverForImageClassificationLearned
- forward
## PerceiverForImageClassificationFourier
[[autodoc]] PerceiverForImageClassificationFourier
- forward
## PerceiverForImageClassificationConvProcessing
[[autodoc]] PerceiverForImageClassificationConvProcessing
- forward
## PerceiverForOpticalFlow
[[autodoc]] PerceiverForOpticalFlow
- forward
## PerceiverForMultimodalAutoencoding
[[autodoc]] PerceiverForMultimodalAutoencoding
- forward
\ No newline at end of file
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Perceiver
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Perceiver IO model was proposed in `Perceiver IO: A General Architecture for Structured Inputs & Outputs
<https://arxiv.org/abs/2107.14795>`__ by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
Perceiver IO is a generalization of `Perceiver <https://arxiv.org/abs/2103.03206>`__ to handle arbitrary outputs in
addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to
classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio.
This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is
linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process
inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example,
Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.
The abstract from the paper is the following:
*The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point
clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of
inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without
sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales
linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural language and visual understanding,
StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT
baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art
performance on Sintel optical flow estimation.*
Here's a TLDR explaining how Perceiver works:
The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale
quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512
tokens. Perceiver aims to solve this issue by, instead of performing self-attention on the inputs, perform it on a set
of latent variables, and only use the inputs for cross-attention. In this way, the time and memory requirements don't
depend on the length of the inputs anymore, as one uses a fixed amount of latent variables, like 256 or 512. These are
randomly initialized, after which they are trained end-to-end using backpropagation.
Internally, :class:`~transformers.PerceiverModel` will create the latents, which is a tensor of shape
:obj:`(batch_size, num_latents, d_latents)`. One must provide :obj:`inputs` (which could be text, images, audio, you
name it!) to the model, which it will use to perform cross-attention with the latents. The output of the Perceiver
encoder is a tensor of the same shape. One can then, similar to BERT, convert the last hidden states of the latents to
classification logits by averaging along the sequence dimension, and placing a linear layer on top of that to project
the :obj:`d_latents` to :obj:`num_labels`.
This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up
work, PerceiverIO, they generalized it to let the model also produce outputs of arbitrary size. How, you might ask? The
idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the
last hidden states of the latents, using the outputs as queries, and the latents as keys and values.
So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input
length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes,
providing :obj:`inputs` of length 2048 to the model. If one now masks out certain of these 2048 tokens, one can define
the :obj:`outputs` as being of shape: :obj:`(batch_size, 2048, 768)`. Next, one performs cross-attention with the final
hidden states of the latents to update the :obj:`outputs` tensor. After cross-attention, one still has a tensor of
shape :obj:`(batch_size, 2048, 768)`. One can then place a regular language modeling head on top, to project the last
dimension to the vocabulary size of the model, i.e. creating logits of shape :obj:`(batch_size, 2048, 262)` (as
Perceiver uses a vocabulary size of 262 byte IDs).
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/deepmind/deepmind-research/tree/master/perceiver>`__.
Perceiver specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverDecoderOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput
:members:
PerceiverConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverConfig
:members:
PerceiverTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
PerceiverFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverFeatureExtractor
:members:
PerceiverTextPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
:members:
PerceiverImagePreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor
:members:
PerceiverOneHotPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor
:members:
PerceiverAudioPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor
:members:
PerceiverMultimodalPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor
:members:
PerceiverProjectionPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor
:members:
PerceiverAudioPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor
:members:
PerceiverClassificationPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor
:members:
PerceiverMultimodalPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor
:members:
PerceiverModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverModel
:members: forward
PerceiverForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForMaskedLM
:members: forward
PerceiverForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForSequenceClassification
:members: forward
PerceiverForImageClassificationLearned
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationLearned
:members: forward
PerceiverForImageClassificationFourier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationFourier
:members: forward
PerceiverForImageClassificationConvProcessing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationConvProcessing
:members: forward
PerceiverForOpticalFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForOpticalFlow
:members: forward
PerceiverForMultimodalAutoencoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForMultimodalAutoencoding
:members: forward
......@@ -35,15 +35,15 @@ and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50
being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on
Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C.*
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/NVlabs/SegFormer>`__.
The figure below illustrates the architecture of SegFormer. Taken from the `original paper
<https://arxiv.org/abs/2105.15203>`__.
.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/segformer_architecture.png
:width: 600
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/NVlabs/SegFormer>`__.
Tips:
- SegFormer consists of a hierarchical Transformer encoder, and a lightweight all-MLP decode head.
......
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# TrOCR
## Overview
The TrOCR model was proposed in [TrOCR: Transformer-based Optical Character Recognition with Pre-trained
Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to
perform [optical character recognition (OCR)](https://en.wikipedia.org/wiki/Optical_character_recognition).
The abstract from the paper is the following:
*Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition
are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language
model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end
text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the
Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but
effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments
show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition
tasks.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/trocr_architecture.jpg"
alt="drawing" width="600"/>
<small> TrOCR architecture. Taken from the [original paper](https://arxiv.org/abs/2109.10282). </small>
Please refer to the [`VisionEncoderDecoder`] class on how to use this model.
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/microsoft/unilm/tree/6f60612e7cc86a2a1ae85c47231507a587ab4e01/trocr).
Tips:
- The quickest way to get started with TrOCR is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR), which show how to use the model
at inference time as well as fine-tuning on custom data.
- TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. It achieves state-of-the-art results
on both printed (e.g. the [SROIE dataset](https://paperswithcode.com/dataset/sroie)) and handwritten (e.g. the [IAM
Handwriting dataset](https://fki.tic.heia-fr.ch/databases/iam-handwriting-database)) text recognition tasks. For more
information, see the [official models](https://huggingface.co/models?other=trocr).
- TrOCR is always used within the [VisionEncoderDecoder](./model_doc/visionencoderdecoder) framework.
## Inference
TrOCR's [`VisionEncoderDecoderModel`] model accepts images as input and makes use of
[`~generation_utils.GenerationMixin.generate`] to autoregressively generate text given the input image.
The [`ViTFeatureExtractor`] class is responsible for preprocessing the input image and
[`RobertaTokenizer`] decodes the generated target tokens to the target string. The
[`TrOCRProcessor`] wraps [`ViTFeatureExtractor`] and [`RobertaTokenizer`]
into a single instance to both extract the input features and decode the predicted token ids.
- Step-by-step Optical Character Recognition (OCR)
``` py
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
See the [model hub](https://huggingface.co/models?filter=trocr) to look for TrOCR checkpoints.
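The same processor can also prepare text labels when fine-tuning TrOCR within the VisionEncoderDecoder framework. A
rough sketch follows; the target string below is a placeholder, not the actual transcription of the image.

``` py
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image

>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> # tokenize the target text with the wrapped RobertaTokenizer
>>> with processor.as_target_processor():
...     labels = processor("a placeholder transcription", return_tensors="pt").input_ids

>>> outputs = model(pixel_values=pixel_values, labels=labels)
>>> loss = outputs.loss
```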
## TrOCRConfig
[[autodoc]] TrOCRConfig
## TrOCRProcessor
[[autodoc]] TrOCRProcessor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode
- as_target_processor
## TrOCRForCausalLM
[[autodoc]] TrOCRForCausalLM
- forward
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
TrOCR
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The TrOCR model was proposed in `TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
<https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to
perform `optical character recognition (OCR) <https://en.wikipedia.org/wiki/Optical_character_recognition>`__.
Please refer to the :doc:`VisionEncoderDecoder <visionencoderdecoder>` class on how to use this model.
This model was contributed by `Niels Rogge <https://huggingface.co/nielsr>`__.
The original code can be found `here
<https://github.com/microsoft/unilm/tree/6f60612e7cc86a2a1ae85c47231507a587ab4e01/trocr>`__.
Tips:
- TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. It achieves state-of-the-art results
on both printed (e.g. the `SROIE dataset <https://paperswithcode.com/dataset/sroie>`__) and handwritten (e.g. the
`IAM Handwriting dataset <https://fki.tic.heia-fr.ch/databases/iam-handwriting-database>`__) text recognition tasks.
For more information, see the `official models <https://huggingface.co/models?other=trocr>`__.
- TrOCR is always used within the :doc:`VisionEncoderDecoder <visionencoderdecoder>` framework.
Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TrOCR's :class:`~transformers.VisionEncoderDecoderModel` model accepts images as input and makes use of
:func:`~transformers.generation_utils.GenerationMixin.generate` to autoregressively generate text given the input
image.
The :class:`~transformers.ViTFeatureExtractor` class is responsible for preprocessing the input image and
:class:`~transformers.RobertaTokenizer` decodes the generated target tokens to the target string. The
:class:`~transformers.TrOCRProcessor` wraps :class:`~transformers.ViTFeatureExtractor` and
:class:`~transformers.RobertaTokenizer` into a single instance to both extract the input features and decode the
predicted token ids.
- Step-by-step Optical Character Recognition (OCR)
.. code-block::
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
See the `model hub <https://huggingface.co/models?filter=trocr>`__ to look for TrOCR checkpoints.
TrOCRConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRConfig
:members:
TrOCRProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRProcessor
:members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor
TrOCRForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRForCausalLM
:members: forward
......@@ -43,6 +43,8 @@ substantially fewer computational resources to train.*
Tips:
- Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found `here
<https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__.
- To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be
used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of
......
......@@ -90,7 +90,8 @@ class PerceiverConfig(PretrainedConfig):
samples_per_patch (:obj:`int`, `optional`, defaults to 16):
Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model.
output_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[1, 16, 224, 224]`):
Shape of the output for the multimodal autoencoding model.
Shape of the output (batch_size, num_frames, height, width) for the video decoder queries of the multimodal
autoencoding model. This excludes the channel dimension.
Example::
......
......@@ -1865,7 +1865,13 @@ class PerceiverAbstractDecoder(nn.Module, metaclass=abc.ABCMeta):
class PerceiverProjectionDecoder(PerceiverAbstractDecoder):
"""Baseline projection decoder (no cross-attention)."""
"""
Baseline projection decoder (no cross-attention).
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
super().__init__()
......@@ -1884,11 +1890,38 @@ class PerceiverProjectionDecoder(PerceiverAbstractDecoder):
class PerceiverBasicDecoder(PerceiverAbstractDecoder):
"""
Cross-attention-based decoder.
Cross-attention-based decoder. This class can be used to decode the final hidden states of the latents using a
cross-attention operation, in which the latents produce keys and values.
Here, `output_num_channels` refers to the number of output channels. `num_channels` refers to the number of
channels of the output queries.
The shape of the output of this class depends on how one defines the output queries (also called decoder queries).
Args:
config ([`PerceiverConfig`]):
Model configuration.
output_num_channels (:obj:`int`, `optional`):
The number of channels in the output. Will only be used in case `final_project` is set to `True`.
position_encoding_type (:obj:`str`, `optional`, defaults to "trainable"):
The type of position encoding to use. Can be either "trainable", "fourier", or "none".
output_index_dims (:obj:`int`, `optional`):
The number of dimensions of the output queries. Ignored if 'position_encoding_type' == 'none'.
num_channels (:obj:`int`, `optional`, defaults to 128):
The number of channels of the decoder queries. Ignored if 'position_encoding_type' == 'none'.
qk_channels (:obj:`int`, `optional`):
The number of channels of the queries and keys in the cross-attention layer.
v_channels (:obj:`int`, `optional`):
The number of channels of the values in the cross-attention layer.
num_heads (:obj:`int`, `optional`, defaults to 1):
The number of attention heads in the cross-attention layer.
widening_factor (:obj:`int`, `optional`, defaults to 1):
The widening factor of the cross-attention layer.
use_query_residual (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use a residual connection between the query and the output of the cross-attention layer.
concat_preprocessed_input (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to concatenate the preprocessed input to the query.
final_project (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to project the output of the cross-attention layer to a target dimension.
position_encoding_only (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to only use this class to define output queries.
"""
def __init__(
......@@ -2035,9 +2068,13 @@ class PerceiverBasicDecoder(PerceiverAbstractDecoder):
class PerceiverClassificationDecoder(PerceiverAbstractDecoder):
"""
Cross-attention based classification decoder. Light-weight wrapper of `BasicDecoder` for logit output. Will turn
the output of the Perceiver encoder which is of shape (batch_size, num_latents, d_latents) to a tensor of shape
(batch_size, num_labels). The queries are of shape (batch_size, 1, num_labels).
Cross-attention based classification decoder. Light-weight wrapper of [`PerceiverBasicDecoder`] for logit output.
Will turn the output of the Perceiver encoder which is of shape (batch_size, num_latents, d_latents) to a tensor of
shape (batch_size, num_labels). The queries are of shape (batch_size, 1, num_labels).
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config, **decoder_kwargs):
......@@ -2100,8 +2137,16 @@ class PerceiverOpticalFlowDecoder(PerceiverAbstractDecoder):
class PerceiverBasicVideoAutoencodingDecoder(PerceiverAbstractDecoder):
"""
Cross-attention based video-autoencoding decoder. Light-weight wrapper of `BasicDecoder` with video reshaping
logic.
Cross-attention based video-autoencoding decoder. Light-weight wrapper of [`PerceiverBasicDecoder`] with video
reshaping logic.
Args:
config ([`PerceiverConfig`]):
Model configuration.
output_shape (:obj:`List[int]`):
Shape of the output as (batch_size, num_frames, height, width), excluding the channel dimension.
position_encoding_type (:obj:`str`):
The type of position encoding to use. Can be either "trainable", "fourier", or "none".
"""
def __init__(self, config, output_shape, position_encoding_type, **decoder_kwargs):
......@@ -2165,10 +2210,28 @@ def restructure(modality_sizes: ModalitySizeType, inputs: torch.Tensor) -> Mappi
class PerceiverMultimodalDecoder(PerceiverAbstractDecoder):
"""
Multimodal decoding by composing uni-modal decoders. The modalities argument of the constructor is a dictionary
Multimodal decoding by composing uni-modal decoders. The `modalities` argument of the constructor is a dictionary
mapping modality name to the decoder of that modality. That decoder will be used to construct queries for that
modality. However, there is a shared cross attention across all modalities, using the concatenated per-modality
query vectors.
modality. Modality-specific queries are padded with trainable modality-specific parameters, after which they are
concatenated along the time dimension.
Next, there is a shared cross attention operation across all modalities.
Args:
config ([`PerceiverConfig`]):
Model configuration.
modalities (:obj:`Dict[str, PerceiverAbstractDecoder]`):
Dictionary mapping modality name to the decoder of that modality.
num_outputs (:obj:`int`):
The number of outputs of the decoder.
output_num_channels (:obj:`int`):
The number of channels in the output.
min_padding_size (:obj:`int`, `optional`, defaults to 2):
The minimum padding size for all modalities. The final output will have num_channels equal to the maximum
channels across all modalities plus min_padding_size.
subsampled_index_dims (:obj:`Dict[str, PerceiverAbstractDecoder]`, `optional`):
Dictionary mapping modality name to the subsampled index dimensions to use for the decoder query of that
modality.
"""
def __init__(
......@@ -2556,7 +2619,15 @@ class AbstractPreprocessor(nn.Module):
class PerceiverTextPreprocessor(AbstractPreprocessor):
"""Text preprocessing for Perceiver Encoder."""
"""
Text preprocessing for Perceiver Encoder. Can be used to embed `inputs` and add positional encodings.
The dimensionality of the embeddings is determined by the `d_model` attribute of the configuration.
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
super().__init__()
......@@ -2578,10 +2649,15 @@ class PerceiverTextPreprocessor(AbstractPreprocessor):
class PerceiverEmbeddingDecoder(nn.Module):
"""Module to decode embeddings (for masked language modeling)."""
"""
Module to decode embeddings (for masked language modeling).
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
"""Constructs the module."""
super().__init__()
self.config = config
self.vocab_size = config.vocab_size
......@@ -2597,7 +2673,8 @@ class PerceiverEmbeddingDecoder(nn.Module):
class PerceiverMultimodalPostprocessor(nn.Module):
"""
Multimodal postprocessing for Perceiver.
Multimodal postprocessing for Perceiver. Can be used to combine modality-specific postprocessors into a single
postprocessor.
Args:
modalities (:obj:`Dict[str, PostprocessorType]`):
......@@ -2633,7 +2710,7 @@ class PerceiverClassificationPostprocessor(nn.Module):
Classification postprocessing for Perceiver. Can be used to convert the decoder output to classification logits.
Args:
config (:obj:`PerceiverConfig`):
config ([`PerceiverConfig`]):
Model configuration.
in_channels (:obj:`int`):
Number of channels in the input.
......@@ -2653,7 +2730,7 @@ class PerceiverAudioPostprocessor(nn.Module):
Audio postprocessing for Perceiver. Can be used to convert the decoder output to audio features.
Args:
config (:obj:`PerceiverConfig`):
config ([`PerceiverConfig`]):
Model configuration.
in_channels (:obj:`int`):
Number of channels in the input.
......@@ -2678,8 +2755,8 @@ class PerceiverAudioPostprocessor(nn.Module):
class PerceiverProjectionPostprocessor(nn.Module):
"""
Projection postprocessing for Perceiver. Can be used to convert the project the channels of the decoder output to a
lower dimension.
Projection postprocessing for Perceiver. Can be used to project the channels of the decoder output to a lower
dimension.
Args:
in_channels (:obj:`int`):
......@@ -2706,7 +2783,7 @@ class PerceiverImagePreprocessor(AbstractPreprocessor):
position encoding kwargs are set equal to the `out_channels`.
Args:
config (:obj:`PerceiverConfig`):
config ([`PerceiverConfig`]):
Model configuration.
prep_type (:obj:`str`, `optional`, defaults to :obj:`"conv"`):
Preprocessing type. Can be "conv1x1", "conv", "patches", "pixels".
......@@ -2931,7 +3008,7 @@ class PerceiverOneHotPreprocessor(AbstractPreprocessor):
One-hot preprocessor for Perceiver Encoder. Can be used to add a dummy index dimension to the input.
Args:
config (:obj:`PerceiverConfig`):
config ([`PerceiverConfig`]):
Model configuration.
"""
......@@ -2957,7 +3034,7 @@ class PerceiverAudioPreprocessor(AbstractPreprocessor):
Audio preprocessing for Perceiver Encoder.
Args:
config (:obj:`PerceiverConfig`):
config ([`PerceiverConfig`]):
Model configuration.
prep_type (:obj:`str`, `optional`, defaults to :obj:`"patches"`):
Preprocessor type to use. Only "patches" is supported.
......