"...resnet50_tensorflow.git" did not exist on "0dfc57301021e37cdf1fc62bdabc77d3e21dd1d3"
Unverified commit 4c99e553, authored by NielsRogge, committed by GitHub

Improve documentation of some models (#14695)



* Migrate docs to mdx

* Update TAPAS docs

* Remove lines

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply some more suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add pt/tf switch to code examples

* More improvements

* Improve docstrings

* More improvements
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 32eb29fe
@@ -40,8 +40,15 @@ significantly outperforming from-scratch DeiT training (81.8%) with the same set
Tips:

- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
  outperform both the :doc:`original model (ViT) <vit>` as well as :doc:`Data-efficient Image Transformers (DeiT)
  <deit>` when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
  fine-tuning on custom data `here
  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__ (you can just replace
  :class:`~transformers.ViTFeatureExtractor` by :class:`~transformers.BeitFeatureExtractor` and
  :class:`~transformers.ViTForImageClassification` by :class:`~transformers.BeitForImageClassification`).
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
  performing masked image modeling. You can find it `here
  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT>`__.
- As the BEiT models expect each image to be of the same size (resolution), one can use
  :class:`~transformers.BeitFeatureExtractor` to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
......
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ImageGPT

## Overview

The ImageGPT model was proposed in [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt) by Mark
Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. ImageGPT (iGPT) is a GPT-2-like
model trained to predict the next pixel value, allowing for both unconditional and conditional image generation.
@@ -31,25 +28,29 @@ ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pr
competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0%
top-1 accuracy on a linear probe of our features.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png"
alt="drawing" width="600"/>

<small> Summary of the approach. Taken from the [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found
[here](https://github.com/openai/image-gpt).
Tips:
- Demo notebooks for ImageGPT can be found
  [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ImageGPT).
- ImageGPT is almost exactly the same as [GPT-2](./model_doc/gpt2), with the exception that a different activation
  function is used (namely "quick gelu"), and the layer normalization layers don't mean center the inputs. ImageGPT
  also doesn't have tied input- and output embeddings.
- As the time and memory requirements of the attention mechanism of Transformers scale quadratically in the sequence
  length, the authors pre-trained ImageGPT on smaller input resolutions, such as 32x32 and 64x64. However, feeding a
  sequence of 32x32x3=3072 tokens from 0..255 into a Transformer is still prohibitively large. Therefore, the authors
  applied k-means clustering to the (R,G,B) pixel values with k=512. This way, we only have a 32*32 = 1024-long
  sequence, but now of integers in the range 0..511. So we are shrinking the sequence length at the cost of a bigger
  embedding matrix. In other words, the vocabulary size of ImageGPT is 512, + 1 for a special "start of sentence" (SOS)
  token, used at the beginning of every sequence. One can use [`ImageGPTFeatureExtractor`] to prepare
  images for the model.
- Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly
  performant image features useful for downstream tasks, such as image classification. The authors showed that the
  features in the middle of the network are the most performant, and can be used as-is to train a linear model (such as
@@ -57,54 +58,43 @@ Tips:
  easily obtained by first forwarding the image through the model, then specifying `output_hidden_states=True`, and
  then average-pooling the hidden states at whatever layer you like (see the sketch after the table below).
- Alternatively, one can further fine-tune the entire model on a downstream dataset, similar to BERT. For this, you can
  use [`ImageGPTForImageClassification`].
- ImageGPT comes in different sizes: there's ImageGPT-small, ImageGPT-medium and ImageGPT-large. The authors also
  trained an XL variant, which they didn't release. The differences in size are summarized in the following table:
| **Model variant** | **Number of layers** | **Hidden size** | **Number of heads** | **# params** |
|---|---|---|---|---|
| iGPT-small | 24 | 512 | 8 | 76 million |
| iGPT-medium | 36 | 1024 | 8 | 455 million |
| iGPT-large | 48 | 1536 | 16 | 1.4 billion |
| iGPT-XL | 60 | 3072 | not specified | 6.8 billion |
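As referenced in the tips above, here is a minimal sketch of extracting averaged hidden-state features with ImageGPT for linear probing. The `openai/imagegpt-small` checkpoint, the local image path and the choice of middle layer are illustrative assumptions, not part of the original docs:

```py
>>> import torch
>>> from PIL import Image
>>> from transformers import ImageGPTFeatureExtractor, ImageGPTModel

>>> feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")
>>> model = ImageGPTModel.from_pretrained("openai/imagegpt-small")

>>> image = Image.open("path/to/your/image.png")  # hypothetical local image
>>> inputs = feature_extractor(images=image, return_tensors="pt")  # maps pixels to color-cluster ids ("input_ids")

>>> with torch.no_grad():
...     outputs = model(**inputs, output_hidden_states=True)

>>> # pick a layer in the middle of the network and average-pool over the sequence dimension
>>> middle = len(outputs.hidden_states) // 2
>>> features = outputs.hidden_states[middle].mean(dim=1)  # shape (batch_size, hidden_size)
```

The resulting `features` tensor can then be fed to any linear classifier (e.g. scikit-learn logistic regression), mirroring the linear-probe setup described above.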
## ImageGPTConfig

[[autodoc]] ImageGPTConfig

## ImageGPTFeatureExtractor

[[autodoc]] ImageGPTFeatureExtractor
    - __call__

## ImageGPTModel

[[autodoc]] ImageGPTModel
    - forward

## ImageGPTForCausalImageModeling

[[autodoc]] ImageGPTForCausalImageModeling
    - forward

## ImageGPTForImageClassification

[[autodoc]] ImageGPTForImageClassification
    - forward
@@ -74,6 +74,9 @@ Tips:
  head models by specifying ``task="entity_classification"``, ``task="entity_pair_classification"``, or
  ``task="entity_span_classification"``. Please refer to the example code of each head model.
A demo notebook on how to fine-tune :class:`~transformers.LukeForEntityPairClassification` for relation
classification can be found `here <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE>`__.
There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
the HuggingFace implementation of LUKE. They can be found `here
<https://github.com/studio-ousia/luke/tree/master/notebooks>`__.
......
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Perceiver
## Overview
The Perceiver IO model was proposed in [Perceiver IO: A General Architecture for Structured Inputs &
Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
Perceiver IO is a generalization of [Perceiver](https://arxiv.org/abs/2103.03206) to handle arbitrary outputs in
addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to
classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio.
This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is
linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process
inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example,
Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.
The abstract from the paper is the following:
*The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point
clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of
inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without
sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales
linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural language and visual understanding,
StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT
baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art
performance on Sintel optical flow estimation.*
Here's a TLDR explaining how Perceiver works:
The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale
quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512
tokens. Perceiver aims to solve this issue by performing self-attention not on the inputs themselves but on a set of
latent variables, and only using the inputs for cross-attention. In this way, the time and memory requirements don't
depend on the length of the inputs anymore, as one uses a fixed amount of latent variables, like 256 or 512. These are
randomly initialized, after which they are trained end-to-end using backpropagation.
Internally, [`PerceiverModel`] will create the latents, which is a tensor of shape `(batch_size, num_latents,
d_latents)`. One must provide `inputs` (which could be text, images, audio, you name it!) to the model, which it will
use to perform cross-attention with the latents. The output of the Perceiver encoder is a tensor of the same shape. One
can then, similar to BERT, convert the last hidden states of the latents to classification logits by averaging along
the sequence dimension, and placing a linear layer on top of that to project the `d_latents` to `num_labels`.
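As a concrete illustration of this classification setup, here is a minimal sketch using one of the released image-classification checkpoints; the `deepmind/vision-perceiver-learned` checkpoint and the COCO image URL are example choices, not requirements:

```py
>>> import requests
>>> from PIL import Image
>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
>>> model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits  # shape (batch_size, num_labels)
>>> print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
```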
This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up
work, PerceiverIO, they generalized it to let the model also produce outputs of arbitrary size. How, you might ask? The
idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the
last hidden states of the latents, using the outputs as queries, and the latents as keys and values.
So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input
length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes,
providing `inputs` of length 2048 to the model. If one now masks out certain of these 2048 tokens, one can define the
`outputs` as being of shape: `(batch_size, 2048, 768)`. Next, one performs cross-attention with the final hidden states
of the latents to update the `outputs` tensor. After cross-attention, one still has a tensor of shape `(batch_size,
2048, 768)`. One can then place a regular language modeling head on top, to project the last dimension to the
vocabulary size of the model, i.e. creating logits of shape `(batch_size, 2048, 262)` (as Perceiver uses a vocabulary
size of 262 byte IDs).
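To make the byte-level masked language modeling example above concrete, here is a minimal sketch assuming the `deepmind/language-perceiver` checkpoint; the masked byte span is an illustrative choice:

```py
>>> from transformers import PerceiverTokenizer, PerceiverForMaskedLM

>>> tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
>>> model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

>>> text = "This is an incomplete sentence where some words are missing."
>>> # the tokenizer operates directly on UTF-8 bytes, padding to the model's input length of 2048
>>> inputs = tokenizer(text, padding="max_length", return_tensors="pt").input_ids
>>> inputs[0, 52:61] = tokenizer.mask_token_id  # mask roughly the bytes of " missing." (illustrative span)

>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits  # shape (batch_size, 2048, 262): one prediction per byte position
```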
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg"
alt="drawing" width="600"/>
<small> Perceiver IO architecture. Taken from the [original paper](https://arxiv.org/abs/2107.14795). </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/deepmind/deepmind-research/tree/master/perceiver).
Tips:
- The quickest way to get started with the Perceiver is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver).
- Note that the models available in the library only showcase some examples of what you can do with the Perceiver.
There are many more use cases, including question answering,
named-entity recognition, object detection, audio classification, video classification, etc.
## Perceiver specific outputs
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverModelOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverDecoderOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassifierOutput
## PerceiverConfig
[[autodoc]] PerceiverConfig
## PerceiverTokenizer
[[autodoc]] PerceiverTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## PerceiverFeatureExtractor
[[autodoc]] PerceiverFeatureExtractor
- __call__
## PerceiverTextPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
## PerceiverImagePreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverImagePreprocessor
## PerceiverOneHotPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor
## PerceiverAudioPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor
## PerceiverMultimodalPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor
## PerceiverProjectionDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverProjectionDecoder
## PerceiverBasicDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverBasicDecoder
## PerceiverClassificationDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassificationDecoder
## PerceiverOpticalFlowDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder
## PerceiverBasicVideoAutoencodingDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverBasicVideoAutoencodingDecoder
## PerceiverMultimodalDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder
## PerceiverProjectionPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor
## PerceiverAudioPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor
## PerceiverClassificationPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor
## PerceiverMultimodalPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor
## PerceiverModel
[[autodoc]] PerceiverModel
- forward
## PerceiverForMaskedLM
[[autodoc]] PerceiverForMaskedLM
- forward
## PerceiverForSequenceClassification
[[autodoc]] PerceiverForSequenceClassification
- forward
## PerceiverForImageClassificationLearned
[[autodoc]] PerceiverForImageClassificationLearned
- forward
## PerceiverForImageClassificationFourier
[[autodoc]] PerceiverForImageClassificationFourier
- forward
## PerceiverForImageClassificationConvProcessing
[[autodoc]] PerceiverForImageClassificationConvProcessing
- forward
## PerceiverForOpticalFlow
[[autodoc]] PerceiverForOpticalFlow
- forward
## PerceiverForMultimodalAutoencoding
[[autodoc]] PerceiverForMultimodalAutoencoding
- forward
@@ -35,15 +35,15 @@ and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50
being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on
Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C.*

The figure below illustrates the architecture of SegFormer. Taken from the `original paper
<https://arxiv.org/abs/2105.15203>`__.

.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/segformer_architecture.png
  :width: 600

This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/NVlabs/SegFormer>`__.

Tips:

- SegFormer consists of a hierarchical Transformer encoder, and a lightweight all-MLP decode head.
......
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# TAPAS
## Overview
The TAPAS model was proposed in [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://www.aclweb.org/anthology/2020.acl-main.398)
by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. It's a BERT-based model specifically
designed (and pre-trained) for answering questions about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7
token types that encode tabular structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset comprising
millions of tables from English Wikipedia and corresponding texts.
For question answering, TAPAS has 2 heads on top: a cell selection head and an aggregation head, for (optionally) performing aggregations (such as counting or summing) among selected cells. TAPAS has been fine-tuned on several datasets:
- [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253) (Sequential Question Answering by Microsoft)
- [WTQ](https://github.com/ppasupat/WikiTableQuestions) (Wiki Table Questions by Stanford University)
- [WikiSQL](https://github.com/salesforce/WikiSQL) (by Salesforce).
It achieves state-of-the-art on both SQA and WTQ, while having comparable performance to SOTA on WikiSQL, with a much simpler architecture.
The abstract from the paper is the following:
*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.*
In addition, the authors have further pre-trained TAPAS to recognize **table entailment**, by creating a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning. The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM, and then on another dataset). They found that intermediate pre-training further improves performance on SQA, achieving a new state-of-the-art as well as state-of-the-art on [TabFact](https://github.com/wenhuchen/Table-Fact-Checking), a large-scale dataset with 16k Wikipedia tables for table entailment (a binary classification task). For more details, see their follow-up paper: [Understanding tables with intermediate pre-training](https://www.aclweb.org/anthology/2020.findings-emnlp.27/) by Julian Martin Eisenschlos, Syrine Krichene and Thomas Müller.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tapas_architecture.png"
alt="drawing" width="600"/>
<small> TAPAS architecture. Taken from the [official blog post](https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html). </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tensorflow version of this model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/tapas).
Tips:
- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell of the table). Note that this is something that was added after the publication of the original TAPAS paper. According to the authors, this usually results in slightly better performance, and allows you to encode longer sequences without running out of embeddings. This is reflected in the `reset_position_index_per_cell` parameter of [`TapasConfig`], which is set to `True` by default. The default versions of the models available on the [hub](https://huggingface.co/models?search=tapas) all use relative position embeddings. You can still use the ones with absolute position embeddings by passing in an additional argument `revision="no_reset"` when calling the `from_pretrained()` method (see the snippet after these tips). Note that it's usually advised to pad the inputs on the right rather than the left.
- TAPAS is based on BERT, so `TAPAS-base` for example corresponds to a `BERT-base` architecture. Of course, `TAPAS-large` will result in the best performance (the results reported in the paper are from `TAPAS-large`). Results of the various sized models are shown on the [original GitHub repository](https://github.com/google-research/tapas).
- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that case, you have to feed every table-question pair one by one to the model, such that the `prev_labels` token type ids can be overwritten by the model's predicted `labels` for the previous question. See the "Usage" section for more info.
- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard. Note that TAPAS can be used as an encoder in the EncoderDecoderModel framework, to combine it with an autoregressive text decoder such as GPT-2.
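To illustrate the choice between relative and absolute position embeddings mentioned above, here is a minimal loading sketch; the `TapasModel` class and the `google/tapas-base` checkpoint are just examples, and the snippet assumes the checkpoint exposes a `no_reset` revision as described in the tip:

```py
>>> from transformers import TapasModel

>>> # default: relative position embeddings (reset_position_index_per_cell=True)
>>> model = TapasModel.from_pretrained("google/tapas-base")

>>> # absolute position embeddings instead, via the "no_reset" revision
>>> model = TapasModel.from_pretrained("google/tapas-base", revision="no_reset")
```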
## Usage: fine-tuning
Here we explain how you can fine-tune [`TapasForQuestionAnswering`] on your own dataset.
**STEP 1: Choose one of the 3 ways in which you can use TAPAS - or experiment**
There are 3 different ways in which one can fine-tune [`TapasForQuestionAnswering`], corresponding to the different datasets on which TAPAS was fine-tuned:
1. SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example if you first ask "what's the name of the first actor?" then you can ask a follow-up question such as "how old is he?". Here, questions do not involve any aggregation (all questions are cell selection questions).
2. WTQ: if you're not interested in asking questions in a conversational set-up, but rather just asking questions related to a table, which might involve aggregation, such as counting a number of rows, summing up cell values or averaging cell values. You can then for example ask "what's the total number of goals Cristiano Ronaldo made in his career?". This case is also called **weak supervision**, since the model itself must learn the appropriate aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.
3. WikiSQL-supervised: this dataset is based on WikiSQL with the model being given the ground truth aggregation operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation operator is much easier.
To summarize:
| **Task** | **Example dataset** | **Description** |
|-------------------------------------|---------------------|---------------------------------------------------------------------------------------------------------|
| Conversational | SQA | Conversational, only cell selection questions |
| Weak supervision for aggregation | WTQ | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
| Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the gold aggregation operator |
Initializing a model with a pre-trained base and randomly initialized classification heads from the hub can be done as shown below. Be sure to have installed the
[torch-scatter](https://github.com/rusty1s/pytorch_scatter) dependency for your environment in case you're using PyTorch, or the [tensorflow_probability](https://github.com/tensorflow/probability)
dependency in case you're using TensorFlow:
```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
===PT-TF-SPLIT===
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
```
Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing [`TapasConfig`], and then create a [`TapasForQuestionAnswering`] based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, then you can do it this way. Here's an example:
```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
===PT-TF-SPLIT===
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
```
What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See [here](https://github.com/google-research/tapas/issues/91#issuecomment-735719340) for more info.
For a list of all pre-trained and fine-tuned TAPAS checkpoints available on HuggingFace's hub, see [here](https://huggingface.co/models?search=tapas).
**STEP 2: Prepare your data in the SQA format**
Second, no matter what you picked above, you should prepare your dataset in the [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253) format. This format is a TSV/CSV file with the following columns:
- `id`: optional, id of the table-question pair, for bookkeeping purposes.
- `annotator`: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
- `position`: integer indicating if the question is the first, second, third,... related to the table. Only required in case of conversational setup (SQA). You don't need this column in case you're going for WTQ/WikiSQL-supervised.
- `question`: string
- `table_file`: string, name of a csv file containing the tabular data
- `answer_coordinates`: list of one or more tuples (each tuple being a cell coordinate, i.e. row, column pair that is part of the answer)
- `answer_text`: list of one or more strings (each string being a cell value that is part of the answer)
- `aggregation_label`: index of the aggregation operator. Only required in case of strong supervision for aggregation (the WikiSQL-supervised case)
- `float_answer`: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)
The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ, WikiSQL) into the SQA format. The author explains this [here](https://github.com/google-research/tapas/issues/50#issuecomment-705465960). A conversion of this script that works with HuggingFace's implementation can be found [here](https://github.com/NielsRogge/tapas_utils). Interestingly, these conversion scripts are not perfect (the `answer_coordinates` and `float_answer` fields are populated based on the `answer_text`), meaning that WTQ and WikiSQL results could actually be improved.
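For illustration only, here is a hypothetical way to assemble such a TSV file with pandas; the column values mirror the tokenizer example further below, and the file and folder names are placeholders rather than part of the SQA release:

```py
>>> import pandas as pd

>>> data = {
...     "id": [0],
...     "annotator": [0],
...     "position": [0],  # first question about the table (only needed in the conversational/SQA case)
...     "question": ["What is the name of the first actor?"],
...     "table_file": ["table_001.csv"],  # stored in a separate folder together with all other table csv files
...     "answer_coordinates": [[(0, 0)]],
...     "answer_text": [["Brad Pitt"]],
...     "float_answer": [float("nan")],  # only needed in the weak supervision (WTQ/WikiSQL) case
... }
>>> pd.DataFrame(data).to_csv("train.tsv", sep="\t", index=False)
```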
**STEP 3: Convert your data into PyTorch/TensorFlow tensors using TapasTokenizer**
Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then use [`TapasTokenizer`] to convert table-question pairs into `input_ids`, `attention_mask`, `token_type_ids` and so on. Again, based on which of the three cases you picked above, [`TapasForQuestionAnswering`]/[`TFTapasForQuestionAnswering`] requires different
inputs to be fine-tuned:
| **Task** | **Required inputs** |
|------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| Conversational | `input_ids`, `attention_mask`, `token_type_ids`, `labels` |
| Weak supervision for aggregation | `input_ids`, `attention_mask`, `token_type_ids`, `labels`, `numeric_values`, `numeric_values_scale`, `float_answer` |
| Strong supervision for aggregation | `input_ids`, `attention_mask`, `token_type_ids`, `labels`, `aggregation_labels` |
[`TapasTokenizer`] creates the `labels`, `numeric_values` and `numeric_values_scale` based on the `answer_coordinates` and `answer_text` columns of the TSV file. The `float_answer` and `aggregation_labels` are already in the TSV file of step 2. Here's an example:
```py
>>> from transformers import TapasTokenizer
>>> import pandas as pd
>>> model_name = 'google/tapas-base'
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}
===PT-TF-SPLIT===
>>> from transformers import TapasTokenizer
>>> import pandas as pd
>>> model_name = 'google/tapas-base'
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='tf')
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}
```
Note that [`TapasTokenizer`] expects the data of the table to be **text-only**. You can use `.astype(str)` on a dataframe to turn it into text-only data.
Of course, this only shows how to encode a single training example. It is advised to create a dataloader to iterate over batches:
```py
>>> import torch
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset(torch.utils.data.Dataset):
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __getitem__(self, idx):
... item = self.data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="pt"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: val.squeeze(0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = torch.tensor(item.float_answer)
... return encoding
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
===PT-TF-SPLIT===
>>> import tensorflow as tf
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset:
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __iter__(self):
... for idx in range(self.__len__()):
... item = self.data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="tf"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: tf.squeeze(val, 0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = tf.convert_to_tensor(item.float_answer, dtype=tf.float32)
... yield encoding['input_ids'], encoding['attention_mask'], encoding['numeric_values'], \
... encoding['numeric_values_scale'], encoding['token_type_ids'], encoding['labels'], \
... encoding['float_answer']
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> output_signature = (
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.float32),
... tf.TensorSpec(shape=(512,), dtype=tf.float32),
... tf.TensorSpec(shape=(512, 7), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(), dtype=tf.float32))  # the float_answer of a single example is a scalar
>>> train_dataloader = tf.data.Dataset.from_generator(train_dataset, output_signature=output_signature).batch(32)
```
Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. In case your dataset involves conversational questions (such as in SQA), then you should first group together the `queries`, `answer_coordinates` and `answer_text` per table (in the order of their `position`
index) and batch encode each table with its questions. This will make sure that the `prev_labels` token types (see docs of [`TapasTokenizer`]) are set correctly. See [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) for more info (PyTorch) and [this notebook](https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) for the TensorFlow model.
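As a rough sketch of this grouping step (not the exact code of the notebooks above; it assumes the `tokenizer`, `tsv_path` and `table_csv_path` variables of the previous snippet, and how the list-valued columns need to be parsed depends on how your TSV was written):
```py
>>> import ast
>>> import pandas as pd

>>> data = pd.read_csv(tsv_path, sep="\t")
>>> # the list-valued columns may need to be parsed back into Python objects first,
>>> # e.g. with ast.literal_eval if they were written as Python literals
>>> data["answer_coordinates"] = data["answer_coordinates"].apply(ast.literal_eval)
>>> data["answer_text"] = data["answer_text"].apply(ast.literal_eval)

>>> for table_file, group in data.groupby("table_file"):
...     # keep the questions of a table in conversational order
...     group = group.sort_values("position")
...     table = pd.read_csv(table_csv_path + table_file).astype(str)
...     # batch encode the table together with all of its questions, so that the
...     # prev_labels token types are derived from the answers to the previous question
...     encoding = tokenizer(
...         table=table,
...         queries=group["question"].tolist(),
...         answer_coordinates=group["answer_coordinates"].tolist(),
...         answer_text=group["answer_text"].tolist(),
...         truncation=True,
...         padding="max_length",
...         return_tensors="pt",
...     )
```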
**STEP 4: Train (fine-tune) TapasForQuestionAnswering/TFTapasForQuestionAnswering**
You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnswering`] as follows (shown here for the weak supervision for aggregation case):
```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering, AdamW
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = AdamW(model.parameters(), lr=5e-5)
>>> model.train()
>>> for epoch in range(2): # loop over the dataset multiple times
... for batch in train_dataloader:
... # get the inputs;
... input_ids = batch["input_ids"]
... attention_mask = batch["attention_mask"]
... token_type_ids = batch["token_type_ids"]
... labels = batch["labels"]
... numeric_values = batch["numeric_values"]
... numeric_values_scale = batch["numeric_values_scale"]
... float_answer = batch["float_answer"]
... # zero the parameter gradients
... optimizer.zero_grad()
... # forward + backward + optimize
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer)
... loss = outputs.loss
... loss.backward()
... optimizer.step()
===PT-TF-SPLIT===
>>> import tensorflow as tf
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
>>> for epoch in range(2): # loop over the dataset multiple times
... for batch in train_dataloader:
... # get the inputs;
... input_ids = batch[0]
... attention_mask = batch[1]
... token_type_ids = batch[4]
... labels = batch[5]
... numeric_values = batch[2]
... numeric_values_scale = batch[3]
... float_answer = batch[6]
... # forward + backward + optimize
... with tf.GradientTape() as tape:
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer )
... grads = tape.gradient(outputs.loss, model.trainable_weights)
... optimizer.apply_gradients(zip(grads, model.trainable_weights))
```
## Usage: inference
Here we explain how you can use [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnswering`] for inference (i.e. making predictions on new data). For inference, only `input_ids`, `attention_mask` and `token_type_ids` (which you can obtain using [`TapasTokenizer`]) have to be provided to the model to obtain the logits. Next, you can use the handy [`~models.tapas.tokenization_tapas.TapasTokenizer.convert_logits_to_predictions`] method to convert these into predicted coordinates and optional aggregation indices.
However, note that inference is **different** depending on whether or not the setup is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that:
```py
>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits.detach(),
... outputs.logits_aggregation.detach()
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
===PT-TF-SPLIT===
>>> from transformers import TapasTokenizer, TFTapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TFTapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="tf")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits,
... outputs.logits_aggregation
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
```
In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such that the `prev_labels` token types can be overwritten by the predicted `labels` of the previous table-question pair. Again, more info can be found in [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for PyTorch) and [this notebook](https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for TensorFlow).
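A minimal sketch of such a sequential loop is shown below, using a checkpoint fine-tuned on SQA. It assumes that `prev_labels` is the fourth token type created by [`TapasTokenizer`] (segment, column, row, prev_labels, ...) and that row/column token type ids are 1-based for table cells; the notebooks linked above contain the complete, tested version.
```py
>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd

>>> model_name = 'google/tapas-base-finetuned-sqa'
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> table = pd.DataFrame.from_dict(data)
>>> queries = ["Who is the first actor?", "How many movies has he played in?"]

>>> prev_coordinates = None
>>> for query in queries:
...     inputs = tokenizer(table=table, queries=query, padding='max_length', return_tensors="pt")
...     if prev_coordinates is not None:
...         # overwrite the prev_labels token types (assumed to be at index 3) with the
...         # cells that were predicted as answer to the previous question
...         column_ids = inputs["token_type_ids"][0, :, 1]
...         row_ids = inputs["token_type_ids"][0, :, 2]
...         for row, column in prev_coordinates:
...             mask = (row_ids == row + 1) & (column_ids == column + 1)
...             inputs["token_type_ids"][0, mask, 3] = 1
...     outputs = model(**inputs)
...     predicted_answer_coordinates = tokenizer.convert_logits_to_predictions(inputs, outputs.logits.detach())[0]
...     prev_coordinates = predicted_answer_coordinates[0]
...     print(query, "->", [table.iat[coordinate] for coordinate in prev_coordinates])
```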
## TAPAS specific outputs
[[autodoc]] models.tapas.modeling_tapas.TableQuestionAnsweringOutput
## TapasConfig
[[autodoc]] TapasConfig
## TapasTokenizer
[[autodoc]] TapasTokenizer
- __call__
- convert_logits_to_predictions
- save_vocabulary
## TapasModel
[[autodoc]] TapasModel
- forward
## TapasForMaskedLM
[[autodoc]] TapasForMaskedLM
- forward
## TapasForSequenceClassification
[[autodoc]] TapasForSequenceClassification
- forward
## TapasForQuestionAnswering
[[autodoc]] TapasForQuestionAnswering
- forward
## TFTapasModel
[[autodoc]] TFTapasModel
- call
## TFTapasForMaskedLM
[[autodoc]] TFTapasForMaskedLM
- call
## TFTapasForSequenceClassification
[[autodoc]] TFTapasForSequenceClassification
- call
## TFTapasForQuestionAnswering
[[autodoc]] TFTapasForQuestionAnswering
- call
TAPAS
-----------------------------------------------------------------------------------------------------------------------
.. note::
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
breaking changes that will be fixed in the future.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The TAPAS model was proposed in `TAPAS: Weakly Supervised Table Parsing via Pre-training
<https://www.aclweb.org/anthology/2020.acl-main.398>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
Francesco Piccinno and Julian Martin Eisenschlos. It's a BERT-based model specifically designed (and pre-trained) for
answering questions about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types
that encode tabular structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset
comprising millions of tables from English Wikipedia and corresponding texts. For question answering, TAPAS has 2 heads
on top: a cell selection head and an aggregation head, for (optionally) performing aggregations (such as counting or
summing) among selected cells. TAPAS has been fine-tuned on several datasets: `SQA
<https://www.microsoft.com/en-us/download/details.aspx?id=54253>`__ (Sequential Question Answering by Microsoft), `WTQ
<https://github.com/ppasupat/WikiTableQuestions>`__ (Wiki Table Questions by Stanford University) and `WikiSQL
<https://github.com/salesforce/WikiSQL>`__ (by Salesforce). It achieves state-of-the-art on both SQA and WTQ, while
having comparable performance to SOTA on WikiSQL, with a much simpler architecture.
The abstract from the paper is the following:
*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the
collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations
instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition,
the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we
present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak
supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation
operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective
joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with
three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by
improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL
and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our
setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.*
In addition, the authors have further pre-trained TAPAS to recognize **table entailment**, by creating a balanced
dataset of millions of automatically created training examples which are learned in an intermediate step prior to
fine-tuning. The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first
pre-trained on MLM, and then on another dataset). They found that intermediate pre-training further improves
performance on SQA, achieving a new state-of-the-art as well as state-of-the-art on `TabFact
<https://github.com/wenhuchen/Table-Fact-Checking>`__, a large-scale dataset with 16k Wikipedia tables for table
entailment (a binary classification task). For more details, see their follow-up paper: `Understanding tables with
intermediate pre-training <https://www.aclweb.org/anthology/2020.findings-emnlp.27/>`__ by Julian Martin Eisenschlos,
Syrine Krichene and Thomas Müller.
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The Tensorflow version of this model was
contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
<https://github.com/google-research/tapas>`__.
Tips:
- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell
of the table). Note that this is something that was added after the publication of the original TAPAS paper.
According to the authors, this usually results in a slightly better performance, and allows you to encode longer
sequences without running out of embeddings. This is reflected in the ``reset_position_index_per_cell`` parameter of
:class:`~transformers.TapasConfig`, which is set to ``True`` by default. The default versions of the models available
in the `model hub <https://huggingface.co/models?search=tapas>`_ all use relative position embeddings. You can still
use the ones with absolute position embeddings by passing in an additional argument ``revision="no_reset"`` when
calling the ``.from_pretrained()`` method (see the example after these tips). Note that it's usually advised to pad the
inputs on the right rather than the left.
- TAPAS is based on BERT, so ``TAPAS-base`` for example corresponds to a ``BERT-base`` architecture. Of course,
TAPAS-large will result in the best performance (the results reported in the paper are from TAPAS-large). Results of
the various sized models are shown on the `original Github repository <https://github.com/google-research/tapas>`_.
- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a
conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the
previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that
case, you have to feed every table-question pair one by one to the model, such that the `prev_labels` token type ids
can be overwritten by the predicted `labels` of the model to the previous question. See "Usage" section for more
info.
- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
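As referenced in the first tip above, loading a checkpoint with absolute position embeddings could look as follows (a minimal sketch; it assumes the ``no_reset`` revision is available for the checkpoint you pick):
.. code-block::
>>> from transformers import TapasModel
>>> # default: relative position embeddings (reset_position_index_per_cell=True)
>>> model = TapasModel.from_pretrained("google/tapas-base")
>>> # absolute position embeddings: load the "no_reset" revision of the same checkpoint
>>> model = TapasModel.from_pretrained("google/tapas-base", revision="no_reset")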
Usage: fine-tuning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here we explain how you can fine-tune :class:`~transformers.TapasForQuestionAnswering` on your own dataset.
**STEP 1: Choose one of the 3 ways in which you can use TAPAS - or experiment**
Basically, there are 3 different ways in which one can fine-tune :class:`~transformers.TapasForQuestionAnswering`,
corresponding to the different datasets on which Tapas was fine-tuned:
1. SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example
if you first ask "what's the name of the first actor?" then you can ask a follow-up question such as "how old is
he?". Here, questions do not involve any aggregation (all questions are cell selection questions).
2. WTQ: if you're not interested in asking questions in a conversational set-up, but rather just asking questions
related to a table, which might involve aggregation, such as counting a number of rows, summing up cell values or
averaging cell values. You can then for example ask "what's the total number of goals Cristiano Ronaldo made in his
career?". This case is also called **weak supervision**, since the model itself must learn the appropriate
aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.
3. WikiSQL-supervised: this dataset is based on WikiSQL with the model being given the ground truth aggregation
operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation
operator is much easier.
To summarize:
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| **Task** | **Example dataset** | **Description** |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Conversational | SQA | Conversational, only cell selection questions |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Weak supervision for aggregation | WTQ | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the gold aggregation operator |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
Initializing a model with a pre-trained base and randomly initialized classification heads from the model hub can be
done as follows (be sure to have installed the `torch-scatter dependency <https://github.com/rusty1s/pytorch_scatter>`_
for your environment):
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
In TensorFlow, this can be done as follows (make sure to have installed the `tensorflow_probability dependency
<https://github.com/tensorflow/probability>`__ for your environment):
.. code-block::
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also
experiment by defining any hyperparameters you want when initializing :class:`~transformers.TapasConfig`, and then
create a :class:`~transformers.TapasForQuestionAnswering` based on that configuration. For example, if you have a
dataset that has both conversational questions and questions that might involve aggregation, then you can do it this
way. Here's an example:
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
And here is the equivalent code for TensorFlow:
.. code-block::
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned
checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See `here
<https://github.com/google-research/tapas/issues/91#issuecomment-735719340>`__ for more info.
For a list of all pre-trained and fine-tuned TAPAS checkpoints available in the HuggingFace model hub, see `here
<https://huggingface.co/models?search=tapas>`__.
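For example, a sketch of starting from a checkpoint that was already fine-tuned on SQA (any other fine-tuned checkpoint from the hub works the same way):
.. code-block::
>>> from transformers import TapasForQuestionAnswering
>>> # start from a checkpoint that was already fine-tuned on SQA, instead of the pre-trained base
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-finetuned-sqa')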
**STEP 2: Prepare your data in the SQA format**
Second, no matter what you picked above, you should prepare your dataset in the `SQA format
<https://www.microsoft.com/en-us/download/details.aspx?id=54253>`__. This format is a TSV/CSV file with the following
columns:
- ``id``: optional, id of the table-question pair, for bookkeeping purposes.
- ``annotator``: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
- ``position``: integer indicating if the question is the first, second, third,... related to the table. Only required
in case of conversational setup (SQA). You don't need this column in case you're going for WTQ/WikiSQL-supervised.
- ``question``: string
- ``table_file``: string, name of a csv file containing the tabular data
- ``answer_coordinates``: list of one or more tuples (each tuple being a cell coordinate, i.e. row, column pair that is
part of the answer)
- ``answer_text``: list of one or more strings (each string being a cell value that is part of the answer)
- ``aggregation_label``: index of the aggregation operator. Only required in case of strong supervision for aggregation
(the WikiSQL-supervised case)
- ``float_answer``: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of
weak supervision for aggregation (such as WTQ and WikiSQL)
The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the
TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ, WikiSQL) into the
SQA format. The author explains this `here
<https://github.com/google-research/tapas/issues/50#issuecomment-705465960>`__. Interestingly, these conversion scripts
are not perfect (the ``answer_coordinates`` and ``float_answer`` fields are populated based on the ``answer_text``),
meaning that WTQ and WikiSQL results could actually be improved.
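As an illustration, a single (hypothetical) training example in this format could be created with pandas as follows; the file names and values are made up and only meant to show the expected columns (weak supervision for aggregation case):
.. code-block::
>>> import pandas as pd
>>> example = {
...     'id': ['example-1'],
...     'annotator': [0],
...     'position': [0],
...     'question': ["What is the total number of movies?"],
...     'table_file': ['table_001.csv'],
...     'answer_coordinates': [[(0, 1), (1, 1), (2, 1)]],
...     'answer_text': [["209"]],
...     'float_answer': [209.0],
... }
>>> # note that list-valued columns are written as strings and have to be parsed back when reading the TSV
>>> pd.DataFrame(example).to_csv('train.tsv', sep='\t', index=False)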
**STEP 3: Convert your data into PyTorch/TensorFlow tensors using TapasTokenizer**
Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular
data), you can then use :class:`~transformers.TapasTokenizer` to convert table-question pairs into :obj:`input_ids`,
:obj:`attention_mask`, :obj:`token_type_ids` and so on. Again, based on which of the three cases you picked above,
:class:`~transformers.TapasForQuestionAnswering`/:class:`~transformers.TFTapasForQuestionAnswering` requires different
inputs to be fine-tuned:
+------------------------------------+----------------------------------------------------------------------------------------------+
| **Task** | **Required inputs** |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Conversational | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``labels`` |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Weak supervision for aggregation | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``labels``, ``numeric_values``, |
| | ``numeric_values_scale``, ``float_answer`` |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Strong supervision for aggregation | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``labels``, ``aggregation_labels``    |
+------------------------------------+----------------------------------------------------------------------------------------------+
:class:`~transformers.TapasTokenizer` creates the ``labels``, ``numeric_values`` and ``numeric_values_scale`` based on
the ``answer_coordinates`` and ``answer_text`` columns of the TSV file. The ``float_answer`` and ``aggregation_labels``
are already in the TSV file of step 2. Here's an example:
.. code-block::
>>> from transformers import TapasTokenizer
>>> import pandas as pd
>>> model_name = 'google/tapas-base'
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}
Set ``return_tensors='tf'`` when calling the tokenizer to prepare data for the TF models.
Note that :class:`~transformers.TapasTokenizer` expects the data of the table to be **text-only**. You can use
``.astype(str)`` on a dataframe to turn it into text-only data. Of course, this only shows how to encode a single
training example. It is advised to create a PyTorch dataset and a corresponding dataloader:
.. code-block::
>>> import torch
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset(torch.utils.data.Dataset):
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __getitem__(self, idx):
... item = self.data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="pt"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: val.squeeze(0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = torch.tensor(item.float_answer)
... return encoding
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
And here is the equivalent code for TensorFlow:
.. code-block::
>>> import tensorflow as tf
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset:
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __iter__(self):
... for idx in range(self.__len__()):
... item = self.data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="tf"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: tf.squeeze(val, 0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = tf.convert_to_tensor(item.float_answer, dtype=tf.float32)
... yield encoding['input_ids'], encoding['attention_mask'], encoding['numeric_values'], \
... encoding['numeric_values_scale'], encoding['token_type_ids'], encoding['labels'], \
... encoding['float_answer']
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> output_signature = (
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.float32),
... tf.TensorSpec(shape=(512,), dtype=tf.float32),
... tf.TensorSpec(shape=(512, 7), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(), dtype=tf.float32))  # the float_answer of a single example is a scalar
>>> train_dataloader = tf.data.Dataset.from_generator(train_dataset, output_signature=output_signature).batch(32)
Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not
conversational**. In case your dataset involves conversational questions (such as in SQA), then you should first group
together the ``queries``, ``answer_coordinates`` and ``answer_text`` per table (in the order of their ``position``
index) and batch encode each table with its questions. This will make sure that the ``prev_labels`` token types (see
docs of :class:`~transformers.TapasTokenizer`) are set correctly. See `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
for more info (PyTorch). See `this notebook
<https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
for the TensorFlow model.
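As a rough sketch of this grouping step (not the exact code of the notebooks above; it assumes the ``tokenizer``, ``tsv_path`` and ``table_csv_path`` variables of the previous snippet, and that the list-valued columns have already been parsed back into Python objects):
.. code-block::
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> for table_file, group in data.groupby('table_file'):
...     # keep the questions of a table in conversational order
...     group = group.sort_values('position')
...     table = pd.read_csv(table_csv_path + table_file).astype(str)
...     # batch encode the table together with all of its questions, so that the
...     # prev_labels token types are derived from the answers to the previous question
...     encoding = tokenizer(table=table,
...                          queries=group['question'].tolist(),
...                          answer_coordinates=group['answer_coordinates'].tolist(),
...                          answer_text=group['answer_text'].tolist(),
...                          truncation=True,
...                          padding="max_length",
...                          return_tensors="pt")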
**STEP 4: Train (fine-tune) TapasForQuestionAnswering/TFTapasForQuestionAnswering**
You can then fine-tune :class:`~transformers.TapasForQuestionAnswering` using native PyTorch as follows (shown here for
the weak supervision for aggregation case):
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering, AdamW
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = AdamW(model.parameters(), lr=5e-5)
>>> for epoch in range(2): # loop over the dataset multiple times
... for idx, batch in enumerate(train_dataloader):
... # get the inputs;
... input_ids = batch["input_ids"]
... attention_mask = batch["attention_mask"]
... token_type_ids = batch["token_type_ids"]
... labels = batch["labels"]
... numeric_values = batch["numeric_values"]
... numeric_values_scale = batch["numeric_values_scale"]
... float_answer = batch["float_answer"]
... # zero the parameter gradients
... optimizer.zero_grad()
... # forward + backward + optimize
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer)
... loss = outputs.loss
... loss.backward()
... optimizer.step()
Equivalently, fine-tuning :class:`~transformers.TFTapasForQuestionAnswering` in native TensorFlow can be done as
follows (shown here for the weak supervision for aggregation case):
.. code-block::
>>> import tensorflow as tf
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
>>> for epoch in range(2): # loop over the dataset multiple times
... for idx, batch in enumerate(train_dataloader):
... # get the inputs;
... input_ids = batch[0]
... attention_mask = batch[1]
... token_type_ids = batch[4]
... labels = batch[5]
... numeric_values = batch[2]
... numeric_values_scale = batch[3]
... float_answer = batch[6]
... # forward + backward + optimize
... with tf.GradientTape() as tape:
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer )
... grads = tape.gradient(outputs.loss, model.trainable_weights)
... optimizer.apply_gradients(zip(grads, model.trainable_weights))
Usage: inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here we explain how you can use :class:`~transformers.TapasForQuestionAnswering` for inference (i.e. making predictions
on new data). For inference, only ``input_ids``, ``attention_mask`` and ``token_type_ids`` (which you can obtain using
:class:`~transformers.TapasTokenizer`) have to be provided to the model to obtain the logits. Next, you can use the
handy ``convert_logits_to_predictions`` method of :class:`~transformers.TapasTokenizer` to convert these into predicted
coordinates and optional aggregation indices.
However, note that inference is **different** depending on whether or not the setup is conversational. In a
non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example
of that:
.. code-block::
>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits.detach(),
... outputs.logits_aggregation.detach()
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
And here is the equivalent code for TensorFlow:
.. code-block::
>>> from transformers import TapasTokenizer, TFTapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TFTapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="tf")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits,
... outputs.logits_aggregation
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such
that the ``prev_labels`` token types can be overwritten by the predicted ``labels`` of the previous table-question
pair. Again, more info can be found in `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
(for PyTorch) and `this notebook
<https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
(for TensorFlow).
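A minimal sketch of such a sequential loop is shown below. It assumes that ``model`` and ``tokenizer`` correspond to a checkpoint fine-tuned on SQA (e.g. ``google/tapas-base-finetuned-sqa``) and that ``table`` is defined as above; it also assumes that ``prev_labels`` is the fourth token type created by the tokenizer and that row/column token type ids are 1-based for table cells. The notebooks linked above contain the complete, tested version.
.. code-block::
>>> queries = ["Who is the first actor?", "How many movies has he played in?"]
>>> prev_coordinates = None
>>> for query in queries:
...     inputs = tokenizer(table=table, queries=query, padding='max_length', return_tensors="pt")
...     if prev_coordinates is not None:
...         # overwrite the prev_labels token types with the cells predicted for the previous question
...         column_ids = inputs["token_type_ids"][0, :, 1]
...         row_ids = inputs["token_type_ids"][0, :, 2]
...         for row, column in prev_coordinates:
...             mask = (row_ids == row + 1) & (column_ids == column + 1)
...             inputs["token_type_ids"][0, mask, 3] = 1
...     outputs = model(**inputs)
...     predicted_answer_coordinates = tokenizer.convert_logits_to_predictions(inputs, outputs.logits.detach())[0]
...     prev_coordinates = predicted_answer_coordinates[0]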
Tapas specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.tapas.modeling_tapas.TableQuestionAnsweringOutput
:members:
TapasConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasConfig
:members:
TapasTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasTokenizer
:members: __call__, convert_logits_to_predictions, save_vocabulary
TapasModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasModel
:members: forward
TapasForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForMaskedLM
:members: forward
TapasForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForSequenceClassification
:members: forward
TapasForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForQuestionAnswering
:members: forward
TFTapasModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTapasModel
:members: call
TFTapasForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTapasForMaskedLM
:members: call
TFTapasForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTapasForSequenceClassification
:members: call
TFTapasForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTapasForQuestionAnswering
:members: call
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# TrOCR
## Overview
The TrOCR model was proposed in [TrOCR: Transformer-based Optical Character Recognition with Pre-trained
Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to
perform [optical character recognition (OCR)](https://en.wikipedia.org/wiki/Optical_character_recognition).
The abstract from the paper is the following:
*Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition
are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language
model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end
text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the
Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but
effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments
show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition
tasks.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/trocr_architecture.jpg"
alt="drawing" width="600"/>
<small> TrOCR architecture. Taken from the [original paper](https://arxiv.org/abs/2109.10282). </small>
Please refer to the [`VisionEncoderDecoder`] class on how to use this model.
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/microsoft/unilm/tree/6f60612e7cc86a2a1ae85c47231507a587ab4e01/trocr).
Tips:
- The quickest way to get started with TrOCR is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR), which show how to use the model
at inference time as well as fine-tuning on custom data.
- TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. It achieves state-of-the-art results
  on both printed (e.g. the [SROIE dataset](https://paperswithcode.com/dataset/sroie)) and handwritten (e.g. the [IAM
  Handwriting dataset](https://fki.tic.heia-fr.ch/databases/iam-handwriting-database)) text recognition tasks. For more
  information, see the [official models](https://huggingface.co/models?other=trocr).
- TrOCR is always used within the [VisionEncoderDecoder](./model_doc/visionencoderdecoder) framework.
## Inference
TrOCR's [`VisionEncoderDecoder`] model accepts images as input and makes use of
[`~generation_utils.GenerationMixin.generate`] to autoregressively generate text given the input image.
The [`ViTFeatureExtractor`] class is responsible for preprocessing the input image and
[`RobertaTokenizer`] decodes the generated target tokens to the target string. The
[`TrOCRProcessor`] wraps [`ViTFeatureExtractor`] and [`RobertaTokenizer`]
into a single instance to both extract the input features and decode the predicted token ids.
- Step-by-step Optical Character Recognition (OCR)
```py
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
See the [model hub](https://huggingface.co/models?filter=trocr) to look for TrOCR checkpoints.
## TrOCRConfig
[[autodoc]] TrOCRConfig
## TrOCRProcessor
[[autodoc]] TrOCRProcessor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode
- as_target_processor
## TrOCRForCausalLM
[[autodoc]] TrOCRForCausalLM
- forward
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
TrOCR
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The TrOCR model was proposed in `TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
<https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to
perform `optical character recognition (OCR) <https://en.wikipedia.org/wiki/Optical_character_recognition>`__.
Please refer to the :doc:`VisionEncoderDecoder <visionencoderdecoder>` class on how to use this model.
This model was contributed by `Niels Rogge <https://huggingface.co/nielsr>`__.
The original code can be found `here
<https://github.com/microsoft/unilm/tree/6f60612e7cc86a2a1ae85c47231507a587ab4e01/trocr>`__.
Tips:
- TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. It achieves state-of-the-art results
on both printed (e.g. the `SROIE dataset <https://paperswithcode.com/dataset/sroie>`__) and handwritten (e.g. the
`IAM Handwriting dataset <https://fki.tic.heia-fr.ch/databases/iam-handwriting-database>`__) text recognition tasks.
For more information, see the `official models <https://huggingface.co/models?other=trocr>`__.
- TrOCR is always used within the :doc:`VisionEncoderDecoder <visionencoderdecoder>` framework.
Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TrOCR's :class:`~transformers.VisionEncoderDecoderModel` model accepts images as input and makes use of
:func:`~transformers.generation_utils.GenerationMixin.generate` to autoregressively generate text given the input
image.
The :class:`~transformers.ViTFeatureExtractor` class is responsible for preprocessing the input image and
:class:`~transformers.RobertaTokenizer` decodes the generated target tokens to the target string. The
:class:`~transformers.TrOCRProcessor` wraps :class:`~transformers.ViTFeatureExtractor` and
:class:`~transformers.RobertaTokenizer` into a single instance to both extract the input features and decode the
predicted token ids.
- Step-by-step Optical Character Recognition (OCR)
.. code-block::
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
See the `model hub <https://huggingface.co/models?filter=trocr>`__ to look for TrOCR checkpoints.
TrOCRConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRConfig
:members:
TrOCRProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRProcessor
:members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor
TrOCRForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRForCausalLM
:members: forward
...@@ -43,6 +43,8 @@ substantially fewer computational resources to train.* ...@@ -43,6 +43,8 @@ substantially fewer computational resources to train.*
Tips: Tips:
- Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found `here
<https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__.
- To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, - To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be
used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of
......
...@@ -90,7 +90,8 @@ class PerceiverConfig(PretrainedConfig): ...@@ -90,7 +90,8 @@ class PerceiverConfig(PretrainedConfig):
samples_per_patch (:obj:`int`, `optional`, defaults to 16): samples_per_patch (:obj:`int`, `optional`, defaults to 16):
Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model. Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model.
output_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[1, 16, 224, 224]`): output_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[1, 16, 224, 224]`):
Shape of the output for the multimodal autoencoding model. Shape of the output (batch_size, num_frames, height, width) for the video decoder queries of the multimodal
autoencoding model. This excludes the channel dimension.
Example:: Example::
......
...@@ -1865,7 +1865,13 @@ class PerceiverAbstractDecoder(nn.Module, metaclass=abc.ABCMeta):
class PerceiverProjectionDecoder(PerceiverAbstractDecoder):
"""
Baseline projection decoder (no cross-attention).
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
super().__init__()
...@@ -1884,11 +1890,38 @@ class PerceiverProjectionDecoder(PerceiverAbstractDecoder):
class PerceiverBasicDecoder(PerceiverAbstractDecoder):
"""
Cross-attention-based decoder. This class can be used to decode the final hidden states of the latents using a
cross-attention operation, in which the latents produce keys and values.
The shape of the output of this class depends on how one defines the output queries (also called decoder queries).
Args:
config ([`PerceiverConfig`]):
Model configuration.
output_num_channels (:obj:`int`, `optional`):
The number of channels in the output. Will only be used in case `final_project` is set to `True`.
position_encoding_type (:obj:`str`, `optional`, defaults to "trainable"):
The type of position encoding to use. Can be either "trainable", "fourier", or "none".
output_index_dims (:obj:`int`, `optional`):
The number of dimensions of the output queries. Ignored if 'position_encoding_type' == 'none'.
num_channels (:obj:`int`, `optional`):
The number of channels of the decoder queries. Ignored if 'position_encoding_type' == 'none'.
qk_channels (:obj:`int`, `optional`):
The number of channels of the queries and keys in the cross-attention layer.
v_channels (:obj:`int`, `optional`, defaults to 128):
The number of channels of the values in the cross-attention layer.
num_heads (:obj:`int`, `optional`, defaults to 1):
The number of attention heads in the cross-attention layer.
widening_factor (:obj:`int`, `optional`, defaults to 1):
The widening factor of the cross-attention layer.
use_query_residual (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use a residual connection between the query and the output of the cross-attention layer.
concat_preprocessed_input (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to concatenate the preprocessed input to the query.
final_project (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to project the output of the cross-attention layer to a target dimension.
position_encoding_only (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to only use this class to define output queries.
""" """
def __init__( def __init__(
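For illustration, a sketch of instantiating such a decoder with trainable output queries; the argument values roughly follow the masked language modeling setup and are meant as an example rather than a prescription:
>>> from transformers import PerceiverConfig
>>> from transformers.models.perceiver.modeling_perceiver import PerceiverBasicDecoder
>>> config = PerceiverConfig()
>>> # with the default position_encoding_type="trainable", the decoder queries are learned embeddings
>>> decoder = PerceiverBasicDecoder(
...     config,
...     output_num_channels=config.d_latents,
...     output_index_dims=config.max_position_embeddings,
...     num_channels=config.d_latents,
...     qk_channels=8 * 32,
...     v_channels=config.d_latents,
...     num_heads=8,
...     use_query_residual=False,
...     final_project=False,
...     trainable_position_encoding_kwargs=dict(
...         num_channels=config.d_latents, index_dims=config.max_position_embeddings
...     ),
... )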
...@@ -2035,9 +2068,13 @@ class PerceiverBasicDecoder(PerceiverAbstractDecoder):
class PerceiverClassificationDecoder(PerceiverAbstractDecoder):
"""
Cross-attention based classification decoder. Light-weight wrapper of [`PerceiverBasicDecoder`] for logit output.
Will turn the output of the Perceiver encoder which is of shape (batch_size, num_latents, d_latents) to a tensor of
shape (batch_size, num_labels). The queries are of shape (batch_size, 1, num_labels).
Args:
config ([`PerceiverConfig`]):
Model configuration.
""" """
def __init__(self, config, **decoder_kwargs): def __init__(self, config, **decoder_kwargs):
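A sketch of plugging this decoder into `PerceiverModel` for text classification, mirroring the usage pattern of the library's Perceiver examples (the values are illustrative):
>>> import torch
>>> from transformers import PerceiverConfig, PerceiverTokenizer, PerceiverModel
>>> from transformers.models.perceiver.modeling_perceiver import (
...     PerceiverTextPreprocessor,
...     PerceiverClassificationDecoder,
... )
>>> config = PerceiverConfig()
>>> preprocessor = PerceiverTextPreprocessor(config)
>>> decoder = PerceiverClassificationDecoder(
...     config,
...     num_channels=config.d_latents,
...     trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
...     use_query_residual=True,
... )
>>> model = PerceiverModel(config, input_preprocessor=preprocessor, decoder=decoder)
>>> tokenizer = PerceiverTokenizer()
>>> inputs = tokenizer("hello world", return_tensors="pt").input_ids
>>> with torch.no_grad():
...     outputs = model(inputs=inputs)
>>> logits = outputs.logits  # shape (batch_size, num_labels)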
...@@ -2100,8 +2137,16 @@ class PerceiverOpticalFlowDecoder(PerceiverAbstractDecoder):
class PerceiverBasicVideoAutoencodingDecoder(PerceiverAbstractDecoder):
"""
Cross-attention based video-autoencoding decoder. Light-weight wrapper of [`PerceiverBasicDecoder`] with video
reshaping logic.
Args:
config ([`PerceiverConfig`]):
Model configuration.
output_shape (:obj:`List[int]`):
Shape of the output as (batch_size, num_frames, height, width), excluding the channel dimension.
position_encoding_type (:obj:`str`):
The type of position encoding to use. Can be either "trainable", "fourier", or "none".
""" """
def __init__(self, config, output_shape, position_encoding_type, **decoder_kwargs): def __init__(self, config, output_shape, position_encoding_type, **decoder_kwargs):
...@@ -2165,10 +2210,28 @@ def restructure(modality_sizes: ModalitySizeType, inputs: torch.Tensor) -> Mappi
class PerceiverMultimodalDecoder(PerceiverAbstractDecoder):
"""
Multimodal decoding by composing uni-modal decoders. The `modalities` argument of the constructor is a dictionary
mapping modality name to the decoder of that modality. That decoder will be used to construct queries for that
modality. Modality-specific queries are padded with trainable modality-specific parameters, after which they are
concatenated along the time dimension.
Next, there is a shared cross attention operation across all modalities.
Args:
config ([`PerceiverConfig`]):
Model configuration.
modalities (:obj:`Dict[str, PerceiverAbstractDecoder]`):
Dictionary mapping modality name to the decoder of that modality.
num_outputs (:obj:`int`):
The number of outputs of the decoder.
output_num_channels (:obj:`int`):
The number of channels in the output.
min_padding_size (:obj:`int`, `optional`, defaults to 2):
The minimum padding size for all modalities. The final output will have num_channels equal to the maximum
channels across all modalities plus min_padding_size.
subsampled_index_dims (:obj:`Dict[str, PerceiverAbstractDecoder]`, `optional`):
Dictionary mapping modality name to the subsampled index dimensions to use for the decoder query of that
modality.
""" """
def __init__( def __init__(
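To build intuition for the padding-and-concatenation step described above, a toy sketch in plain PyTorch (zeros stand in for the trainable modality-specific padding parameters; shapes are illustrative):
>>> import torch
>>> image_queries = torch.randn(1, 196, 256)  # (batch_size, index_dims, num_channels) for an "image" modality
>>> label_queries = torch.randn(1, 1, 512)  # queries for a "label" modality
>>> common_channels = max(256, 512) + 2  # max channels across modalities + min_padding_size
>>> def pad_channels(q):
...     return torch.nn.functional.pad(q, (0, common_channels - q.shape[-1]))
>>> decoder_queries = torch.cat([pad_channels(image_queries), pad_channels(label_queries)], dim=1)
>>> decoder_queries.shape
torch.Size([1, 197, 514])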
...@@ -2556,7 +2619,15 @@ class AbstractPreprocessor(nn.Module):
class PerceiverTextPreprocessor(AbstractPreprocessor):
"""
Text preprocessing for Perceiver Encoder. Can be used to embed `inputs` and add positional encodings.
The dimensionality of the embeddings is determined by the `d_model` attribute of the configuration.
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
super().__init__()
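A minimal sketch of the `d_model` point (assuming, as for the other preprocessors, that the embedding size is exposed via the `num_channels` property):
>>> from transformers import PerceiverConfig
>>> from transformers.models.perceiver.modeling_perceiver import PerceiverTextPreprocessor
>>> config = PerceiverConfig(d_model=512)
>>> preprocessor = PerceiverTextPreprocessor(config)
>>> preprocessor.num_channels  # token + position embeddings both have dimensionality d_model
512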
...@@ -2578,10 +2649,15 @@ class PerceiverTextPreprocessor(AbstractPreprocessor):
class PerceiverEmbeddingDecoder(nn.Module):
"""
Module to decode embeddings (for masked language modeling).
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
super().__init__()
self.config = config
self.vocab_size = config.vocab_size
...@@ -2597,7 +2673,8 @@ class PerceiverEmbeddingDecoder(nn.Module):
class PerceiverMultimodalPostprocessor(nn.Module):
"""
Multimodal postprocessing for Perceiver. Can be used to combine modality-specific postprocessors into a single
postprocessor.
Args:
modalities (:obj:`Dict[str, PostprocessorType]`):
...@@ -2633,7 +2710,7 @@ class PerceiverClassificationPostprocessor(nn.Module):
Classification postprocessing for Perceiver. Can be used to convert the decoder output to classification logits.
Args:
config ([`PerceiverConfig`]):
Model configuration.
in_channels (:obj:`int`):
Number of channels in the input.
...@@ -2653,7 +2730,7 @@ class PerceiverAudioPostprocessor(nn.Module):
Audio postprocessing for Perceiver. Can be used to convert the decoder output to audio features.
Args:
config ([`PerceiverConfig`]):
Model configuration.
in_channels (:obj:`int`):
Number of channels in the input.
...@@ -2678,8 +2755,8 @@ class PerceiverAudioPostprocessor(nn.Module):
class PerceiverProjectionPostprocessor(nn.Module):
"""
Projection postprocessing for Perceiver. Can be used to project the channels of the decoder output to a lower
dimension.
Args:
in_channels (:obj:`int`):
...@@ -2706,7 +2783,7 @@ class PerceiverImagePreprocessor(AbstractPreprocessor):
position encoding kwargs are set equal to the `out_channels`.
Args:
config ([`PerceiverConfig`]):
Model configuration.
prep_type (:obj:`str`, `optional`, defaults to :obj:`"conv"`):
Preprocessing type. Can be "conv1x1", "conv", "patches", "pixels".
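For illustration, a sketch of an image preprocessor that uses a 1x1 convolution and trainable position encodings; the keyword arguments mirror the library's image classification example and are illustrative:
>>> from transformers import PerceiverConfig
>>> from transformers.models.perceiver.modeling_perceiver import PerceiverImagePreprocessor
>>> config = PerceiverConfig(image_size=224)
>>> preprocessor = PerceiverImagePreprocessor(
...     config,
...     prep_type="conv1x1",
...     spatial_downsample=1,
...     out_channels=256,
...     position_encoding_type="trainable",
...     concat_or_add_pos="concat",
...     project_pos_dim=256,
...     trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size**2),
... )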
...@@ -2931,7 +3008,7 @@ class PerceiverOneHotPreprocessor(AbstractPreprocessor):
One-hot preprocessor for Perceiver Encoder. Can be used to add a dummy index dimension to the input.
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
...@@ -2957,7 +3034,7 @@ class PerceiverAudioPreprocessor(AbstractPreprocessor):
Audio preprocessing for Perceiver Encoder.
Args:
config ([`PerceiverConfig`]):
Model configuration.
prep_type (:obj:`str`, `optional`, defaults to :obj:`"patches"`):
Preprocessor type to use. Only "patches" is supported.
......