"...resnet50_tensorflow.git" did not exist on "0dfc57301021e37cdf1fc62bdabc77d3e21dd1d3"
Unverified commit 4c99e553, authored by NielsRogge, committed by GitHub

Improve documentation of some models (#14695)



* Migrate docs to mdx

* Update TAPAS docs

* Remove lines

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply some more suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add pt/tf switch to code examples

* More improvements

* Improve docstrings

* More improvements
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 32eb29fe
@@ -40,8 +40,15 @@ significantly outperforming from-scratch DeiT training (81.8%) with the same set
Tips:

- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
  outperform both the :doc:`original model (ViT) <vit>` as well as :doc:`Data-efficient Image Transformers (DeiT)
  <deit>` when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
  fine-tuning on custom data `here
  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__ (you can just replace
  :class:`~transformers.ViTFeatureExtractor` by :class:`~transformers.BeitFeatureExtractor` and
  :class:`~transformers.ViTForImageClassification` by :class:`~transformers.BeitForImageClassification`).
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
  performing masked image modeling. You can find it `here
  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT>`__.
- As the BEiT models expect each image to be of the same size (resolution), one can use
  :class:`~transformers.BeitFeatureExtractor` to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
......
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ImageGPT

## Overview

The ImageGPT model was proposed in [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt) by Mark
Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. ImageGPT (iGPT) is a GPT-2-like
model trained to predict the next pixel value, allowing for both unconditional and conditional image generation.
@@ -31,25 +28,29 @@ ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pr
competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0%
top-1 accuracy on a linear probe of our features.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png"
alt="drawing" width="600"/>

<small> Summary of the approach. Taken from the [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found
[here](https://github.com/openai/image-gpt).
Tips:
- Demo notebooks for ImageGPT can be found
  [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ImageGPT).
- ImageGPT is almost exactly the same as [GPT-2](./model_doc/gpt2), with the exception that a different activation
  function is used (namely "quick gelu"), and the layer normalization layers don't mean center the inputs. ImageGPT
  also doesn't have tied input- and output embeddings.
- As the time and memory requirements of the attention mechanism of Transformers scale quadratically in the sequence
  length, the authors pre-trained ImageGPT on smaller input resolutions, such as 32x32 and 64x64. However, feeding a
  sequence of 32x32x3=3072 tokens from 0..255 into a Transformer is still prohibitively large. Therefore, the authors
  applied k-means clustering to the (R,G,B) pixel values with k=512. This way, we only have a 32*32 = 1024-long
  sequence, but now of integers in the range 0..511. So we are shrinking the sequence length at the cost of a bigger
  embedding matrix. In other words, the vocabulary size of ImageGPT is 512, + 1 for a special "start of sentence" (SOS)
  token, used at the beginning of every sequence. One can use [`ImageGPTFeatureExtractor`] to prepare
  images for the model.
- Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly
  performant image features useful for downstream tasks, such as image classification. The authors showed that the
  features in the middle of the network are the most performant, and can be used as-is to train a linear model (such as
@@ -57,54 +58,43 @@ Tips:
  easily obtained by first forwarding the image through the model, then specifying `output_hidden_states=True`, and
  then average-pooling the hidden states at whatever layer you like (see the sketch after the table below).
- Alternatively, one can further fine-tune the entire model on a downstream dataset, similar to BERT. For this, you can
  use [`ImageGPTForImageClassification`].
- ImageGPT comes in different sizes: there's ImageGPT-small, ImageGPT-medium and ImageGPT-large. The authors also
  trained an XL variant, which they didn't release. The differences in size are summarized in the following table:
| **Model variant** | **Number of layers** | **Hidden size** | **Number of heads** | **# params** |
|---|---|---|---|---|
| iGPT-small | 24 | 512 | 8 | 76 million |
| iGPT-medium | 36 | 1024 | 8 | 455 million |
| iGPT-large | 48 | 1536 | 16 | 1.4 billion |
| iGPT-XL | 60 | 3072 | not specified | 6.8 billion |
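As referenced in the tips above, here is a minimal sketch of extracting averaged hidden-state features with ImageGPT for linear probing. The `openai/imagegpt-small` checkpoint, the local image path and the choice of middle layer are illustrative assumptions, not part of the original docs:

```py
>>> import torch
>>> from PIL import Image
>>> from transformers import ImageGPTFeatureExtractor, ImageGPTModel

>>> feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")
>>> model = ImageGPTModel.from_pretrained("openai/imagegpt-small")

>>> image = Image.open("path/to/your/image.png")  # hypothetical local image
>>> inputs = feature_extractor(images=image, return_tensors="pt")  # maps pixels to color-cluster ids ("input_ids")

>>> with torch.no_grad():
...     outputs = model(**inputs, output_hidden_states=True)

>>> # pick a layer in the middle of the network and average-pool over the sequence dimension
>>> middle = len(outputs.hidden_states) // 2
>>> features = outputs.hidden_states[middle].mean(dim=1)  # shape (batch_size, hidden_size)
```

The resulting `features` tensor can then be fed to any linear classifier (e.g. scikit-learn logistic regression), mirroring the linear-probe setup described above.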
## ImageGPTConfig

[[autodoc]] ImageGPTConfig

## ImageGPTFeatureExtractor

[[autodoc]] ImageGPTFeatureExtractor
    - __call__

## ImageGPTModel

[[autodoc]] ImageGPTModel
    - forward

## ImageGPTForCausalImageModeling

[[autodoc]] ImageGPTForCausalImageModeling
    - forward

## ImageGPTForImageClassification

[[autodoc]] ImageGPTForImageClassification
    - forward
@@ -74,6 +74,9 @@ Tips:
  head models by specifying ``task="entity_classification"``, ``task="entity_pair_classification"``, or
  ``task="entity_span_classification"``. Please refer to the example code of each head model.
A demo notebook on how to fine-tune :class:`~transformers.LukeForEntityPairClassification` for relation
classification can be found `here <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE>`__.
There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
the HuggingFace implementation of LUKE. They can be found `here
<https://github.com/studio-ousia/luke/tree/master/notebooks>`__.
......
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Perceiver
## Overview
The Perceiver IO model was proposed in [Perceiver IO: A General Architecture for Structured Inputs &
Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
Perceiver IO is a generalization of [Perceiver](https://arxiv.org/abs/2103.03206) to handle arbitrary outputs in
addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to
classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio.
This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is
linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process
inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example,
Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.
The abstract from the paper is the following:
*The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point
clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of
inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without
sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales
linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural language and visual understanding,
StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT
baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art
performance on Sintel optical flow estimation.*
Here's a TLDR explaining how Perceiver works:
The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale
quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512
tokens. Perceiver aims to solve this issue by performing self-attention not on the inputs themselves but on a set of
latent variables, and only using the inputs for cross-attention. In this way, the time and memory requirements don't
depend on the length of the inputs anymore, as one uses a fixed amount of latent variables, like 256 or 512. These are
randomly initialized, after which they are trained end-to-end using backpropagation.
Internally, [`PerceiverModel`] will create the latents, which is a tensor of shape `(batch_size, num_latents,
d_latents)`. One must provide `inputs` (which could be text, images, audio, you name it!) to the model, which it will
use to perform cross-attention with the latents. The output of the Perceiver encoder is a tensor of the same shape. One
can then, similar to BERT, convert the last hidden states of the latents to classification logits by averaging along
the sequence dimension, and placing a linear layer on top of that to project the `d_latents` to `num_labels`.
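As a concrete illustration of this classification setup, here is a minimal sketch using one of the released image-classification checkpoints; the `deepmind/vision-perceiver-learned` checkpoint and the COCO image URL are example choices, not requirements:

```py
>>> import requests
>>> from PIL import Image
>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
>>> model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits  # shape (batch_size, num_labels)
>>> print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
```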
This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up
work, PerceiverIO, they generalized it to let the model also produce outputs of arbitrary size. How, you might ask? The
idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the
last hidden states of the latents, using the outputs as queries, and the latents as keys and values.
So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input
length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes,
providing `inputs` of length 2048 to the model. If one now masks out certain of these 2048 tokens, one can define the
`outputs` as being of shape: `(batch_size, 2048, 768)`. Next, one performs cross-attention with the final hidden states
of the latents to update the `outputs` tensor. After cross-attention, one still has a tensor of shape `(batch_size,
2048, 768)`. One can then place a regular language modeling head on top, to project the last dimension to the
vocabulary size of the model, i.e. creating logits of shape `(batch_size, 2048, 262)` (as Perceiver uses a vocabulary
size of 262 byte IDs).
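To make the byte-level masked language modeling example above concrete, here is a minimal sketch assuming the `deepmind/language-perceiver` checkpoint; the masked byte span is an illustrative choice:

```py
>>> from transformers import PerceiverTokenizer, PerceiverForMaskedLM

>>> tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
>>> model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

>>> text = "This is an incomplete sentence where some words are missing."
>>> # the tokenizer operates directly on UTF-8 bytes, padding to the model's input length of 2048
>>> inputs = tokenizer(text, padding="max_length", return_tensors="pt").input_ids
>>> inputs[0, 52:61] = tokenizer.mask_token_id  # mask roughly the bytes of " missing." (illustrative span)

>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits  # shape (batch_size, 2048, 262): one prediction per byte position
```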
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg"
alt="drawing" width="600"/>
<small> Perceiver IO architecture. Taken from the [original paper](https://arxiv.org/abs/2107.14795). </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/deepmind/deepmind-research/tree/master/perceiver).
Tips:
- The quickest way to get started with the Perceiver is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver).
- Note that the models available in the library only showcase some examples of what you can do with the Perceiver.
There are many more use cases, including question answering,
named-entity recognition, object detection, audio classification, video classification, etc.
## Perceiver specific outputs
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverModelOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverDecoderOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassifierOutput
## PerceiverConfig
[[autodoc]] PerceiverConfig
## PerceiverTokenizer
[[autodoc]] PerceiverTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## PerceiverFeatureExtractor
[[autodoc]] PerceiverFeatureExtractor
- __call__
## PerceiverTextPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
## PerceiverImagePreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverImagePreprocessor
## PerceiverOneHotPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor
## PerceiverAudioPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor
## PerceiverMultimodalPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor
## PerceiverProjectionDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverProjectionDecoder
## PerceiverBasicDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverBasicDecoder
## PerceiverClassificationDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassificationDecoder
## PerceiverOpticalFlowDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder
## PerceiverBasicVideoAutoencodingDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverBasicVideoAutoencodingDecoder
## PerceiverMultimodalDecoder
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder
## PerceiverProjectionPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor
## PerceiverAudioPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor
## PerceiverClassificationPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor
## PerceiverMultimodalPostprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor
## PerceiverModel
[[autodoc]] PerceiverModel
- forward
## PerceiverForMaskedLM
[[autodoc]] PerceiverForMaskedLM
- forward
## PerceiverForSequenceClassification
[[autodoc]] PerceiverForSequenceClassification
- forward
## PerceiverForImageClassificationLearned
[[autodoc]] PerceiverForImageClassificationLearned
- forward
## PerceiverForImageClassificationFourier
[[autodoc]] PerceiverForImageClassificationFourier
- forward
## PerceiverForImageClassificationConvProcessing
[[autodoc]] PerceiverForImageClassificationConvProcessing
- forward
## PerceiverForOpticalFlow
[[autodoc]] PerceiverForOpticalFlow
- forward
## PerceiverForMultimodalAutoencoding
[[autodoc]] PerceiverForMultimodalAutoencoding
- forward
@@ -35,15 +35,15 @@ and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50
being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on
Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C.*

The figure below illustrates the architecture of SegFormer. Taken from the `original paper
<https://arxiv.org/abs/2105.15203>`__.

.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/segformer_architecture.png
  :width: 600

This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/NVlabs/SegFormer>`__.

Tips:

- SegFormer consists of a hierarchical Transformer encoder, and a lightweight all-MLP decode head.
......
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# TAPAS
## Overview
The TAPAS model was proposed in [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://www.aclweb.org/anthology/2020.acl-main.398)
by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. It's a BERT-based model specifically
designed (and pre-trained) for answering questions about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7
token types that encode tabular structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset comprising
millions of tables from English Wikipedia and corresponding texts.
For question answering, TAPAS has 2 heads on top: a cell selection head and an aggregation head, for (optionally) performing aggregations (such as counting or summing) among selected cells. TAPAS has been fine-tuned on several datasets:
- [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253) (Sequential Question Answering by Microsoft)
- [WTQ](https://github.com/ppasupat/WikiTableQuestions) (Wiki Table Questions by Stanford University)
- [WikiSQL](https://github.com/salesforce/WikiSQL) (by Salesforce).
It achieves state-of-the-art on both SQA and WTQ, while having comparable performance to SOTA on WikiSQL, with a much simpler architecture.
The abstract from the paper is the following:
*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.*
In addition, the authors have further pre-trained TAPAS to recognize **table entailment**, by creating a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning. The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM, and then on another dataset). They found that intermediate pre-training further improves performance on SQA, achieving a new state-of-the-art as well as state-of-the-art on [TabFact](https://github.com/wenhuchen/Table-Fact-Checking), a large-scale dataset with 16k Wikipedia tables for table entailment (a binary classification task). For more details, see their follow-up paper: [Understanding tables with intermediate pre-training](https://www.aclweb.org/anthology/2020.findings-emnlp.27/) by Julian Martin Eisenschlos, Syrine Krichene and Thomas Müller.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tapas_architecture.png"
alt="drawing" width="600"/>
<small> TAPAS architecture. Taken from the [official blog post](https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html). </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tensorflow version of this model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/tapas).
Tips:
- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell of the table). Note that this is something that was added after the publication of the original TAPAS paper. According to the authors, this usually results in slightly better performance, and allows you to encode longer sequences without running out of embeddings. This is reflected in the `reset_position_index_per_cell` parameter of [`TapasConfig`], which is set to `True` by default. The default versions of the models available on the [hub](https://huggingface.co/models?search=tapas) all use relative position embeddings. You can still use the ones with absolute position embeddings by passing in an additional argument `revision="no_reset"` when calling the `from_pretrained()` method (see the snippet after these tips). Note that it's usually advised to pad the inputs on the right rather than the left.
- TAPAS is based on BERT, so `TAPAS-base` for example corresponds to a `BERT-base` architecture. Of course, `TAPAS-large` will result in the best performance (the results reported in the paper are from `TAPAS-large`). Results of the various sized models are shown on the [original GitHub repository](https://github.com/google-research/tapas).
- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that case, you have to feed every table-question pair one by one to the model, such that the `prev_labels` token type ids can be overwritten by the model's predicted `labels` for the previous question. See the "Usage" section for more info.
- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard. Note that TAPAS can be used as an encoder in the EncoderDecoderModel framework, to combine it with an autoregressive text decoder such as GPT-2.
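To illustrate the choice between relative and absolute position embeddings mentioned above, here is a minimal loading sketch; the `TapasModel` class and the `google/tapas-base` checkpoint are just examples, and the snippet assumes the checkpoint exposes a `no_reset` revision as described in the tip:

```py
>>> from transformers import TapasModel

>>> # default: relative position embeddings (reset_position_index_per_cell=True)
>>> model = TapasModel.from_pretrained("google/tapas-base")

>>> # absolute position embeddings instead, via the "no_reset" revision
>>> model = TapasModel.from_pretrained("google/tapas-base", revision="no_reset")
```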
## Usage: fine-tuning
Here we explain how you can fine-tune [`TapasForQuestionAnswering`] on your own dataset.
**STEP 1: Choose one of the 3 ways in which you can use TAPAS - or experiment**
There are 3 different ways in which one can fine-tune [`TapasForQuestionAnswering`], corresponding to the different datasets on which TAPAS was fine-tuned:
1. SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example if you first ask "what's the name of the first actor?" then you can ask a follow-up question such as "how old is he?". Here, questions do not involve any aggregation (all questions are cell selection questions).
2. WTQ: if you're not interested in asking questions in a conversational set-up, but rather just asking questions related to a table, which might involve aggregation, such as counting a number of rows, summing up cell values or averaging cell values. You can then for example ask "what's the total number of goals Cristiano Ronaldo made in his career?". This case is also called **weak supervision**, since the model itself must learn the appropriate aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.
3. WikiSQL-supervised: this dataset is based on WikiSQL with the model being given the ground truth aggregation operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation operator is much easier.
To summarize:
| **Task** | **Example dataset** | **Description** |
|-------------------------------------|---------------------|---------------------------------------------------------------------------------------------------------|
| Conversational | SQA | Conversational, only cell selection questions |
| Weak supervision for aggregation | WTQ | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
| Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the gold aggregation operator |
Initializing a model with a pre-trained base and randomly initialized classification heads from the hub can be done as shown below. Be sure to have installed the
[torch-scatter](https://github.com/rusty1s/pytorch_scatter) dependency for your environment in case you're using PyTorch, or the [tensorflow_probability](https://github.com/tensorflow/probability)
dependency in case you're using TensorFlow:
```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
===PT-TF-SPLIT===
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
```
Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing [`TapasConfig`], and then create a [`TapasForQuestionAnswering`] based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, then you can do it this way. Here's an example:
```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
===PT-TF-SPLIT===
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
```
What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See [here](https://github.com/google-research/tapas/issues/91#issuecomment-735719340) for more info.
For a list of all pre-trained and fine-tuned TAPAS checkpoints available on HuggingFace's hub, see [here](https://huggingface.co/models?search=tapas).
**STEP 2: Prepare your data in the SQA format**
Second, no matter what you picked above, you should prepare your dataset in the [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253) format. This format is a TSV/CSV file with the following columns:
- `id`: optional, id of the table-question pair, for bookkeeping purposes.
- `annotator`: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
- `position`: integer indicating if the question is the first, second, third,... related to the table. Only required in case of conversational setup (SQA). You don't need this column in case you're going for WTQ/WikiSQL-supervised.
- `question`: string
- `table_file`: string, name of a csv file containing the tabular data
- `answer_coordinates`: list of one or more tuples (each tuple being a cell coordinate, i.e. row, column pair that is part of the answer)
- `answer_text`: list of one or more strings (each string being a cell value that is part of the answer)
- `aggregation_label`: index of the aggregation operator. Only required in case of strong supervision for aggregation (the WikiSQL-supervised case)
- `float_answer`: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)
The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ, WikiSQL) into the SQA format. The author explains this [here](https://github.com/google-research/tapas/issues/50#issuecomment-705465960). A conversion of this script that works with HuggingFace's implementation can be found [here](https://github.com/NielsRogge/tapas_utils). Interestingly, these conversion scripts are not perfect (the `answer_coordinates` and `float_answer` fields are populated based on the `answer_text`), meaning that WTQ and WikiSQL results could actually be improved.
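For illustration only, here is a hypothetical way to assemble such a TSV file with pandas; the column values mirror the tokenizer example further below, and the file and folder names are placeholders rather than part of the SQA release:

```py
>>> import pandas as pd

>>> data = {
...     "id": [0],
...     "annotator": [0],
...     "position": [0],  # first question about the table (only needed in the conversational/SQA case)
...     "question": ["What is the name of the first actor?"],
...     "table_file": ["table_001.csv"],  # stored in a separate folder together with all other table csv files
...     "answer_coordinates": [[(0, 0)]],
...     "answer_text": [["Brad Pitt"]],
...     "float_answer": [float("nan")],  # only needed in the weak supervision (WTQ/WikiSQL) case
... }
>>> pd.DataFrame(data).to_csv("train.tsv", sep="\t", index=False)
```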
**STEP 3: Convert your data into PyTorch/TensorFlow tensors using TapasTokenizer**
Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then use [`TapasTokenizer`] to convert table-question pairs into `input_ids`, `attention_mask`, `token_type_ids` and so on. Again, based on which of the three cases you picked above, [`TapasForQuestionAnswering`]/[`TFTapasForQuestionAnswering`] requires different
inputs to be fine-tuned:
| **Task** | **Required inputs** |
|------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| Conversational | `input_ids`, `attention_mask`, `token_type_ids`, `labels` |
| Weak supervision for aggregation | `input_ids`, `attention_mask`, `token_type_ids`, `labels`, `numeric_values`, `numeric_values_scale`, `float_answer` |
| Strong supervision for aggregation | `input_ids`, `attention_mask`, `token_type_ids`, `labels`, `aggregation_labels` |
[`TapasTokenizer`] creates the `labels`, `numeric_values` and `numeric_values_scale` based on the `answer_coordinates` and `answer_text` columns of the TSV file. The `float_answer` and `aggregation_labels` are already in the TSV file of step 2. Here's an example:
```py
>>> from transformers import TapasTokenizer
>>> import pandas as pd
>>> model_name = 'google/tapas-base'
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}
===PT-TF-SPLIT===
>>> from transformers import TapasTokenizer
>>> import pandas as pd
>>> model_name = 'google/tapas-base'
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='tf')
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}
```
Note that [`TapasTokenizer`] expects the data of the table to be **text-only**. You can use `.astype(str)` on a dataframe to turn it into text-only data.
Of course, this only shows how to encode a single training example. It is advised to create a dataloader to iterate over batches:
```py
>>> import torch
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset(torch.utils.data.Dataset):
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __getitem__(self, idx):
... item = self.data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="pt"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: val.squeeze(0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = torch.tensor(item.float_answer)
... return encoding
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
===PT-TF-SPLIT===
>>> import tensorflow as tf
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset:
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __iter__(self):
... for idx in range(self.__len__()):
... item = self.data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="tf"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: tf.squeeze(val, 0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = tf.convert_to_tensor(item.float_answer, dtype=tf.float32)
... yield encoding['input_ids'], encoding['attention_mask'], encoding['numeric_values'], \
... encoding['numeric_values_scale'], encoding['token_type_ids'], encoding['labels'], \
... encoding['float_answer']
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> output_signature = (
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.float32),
... tf.TensorSpec(shape=(512,), dtype=tf.float32),
... tf.TensorSpec(shape=(512, 7), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(), dtype=tf.float32))  # the float_answer of a single example is a scalar
>>> train_dataloader = tf.data.Dataset.from_generator(train_dataset, output_signature=output_signature).batch(32)
```
Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. In case your dataset involves conversational questions (such as in SQA), then you should first group together the `queries`, `answer_coordinates` and `answer_text` per table (in the order of their `position`
index) and batch encode each table with its questions. This will make sure that the `prev_labels` token types (see docs of [`TapasTokenizer`]) are set correctly. See [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) for more info (PyTorch) and [this notebook](https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) for the TensorFlow model.
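As a rough sketch of this grouping step (not the exact code of the notebooks above; it assumes the `tokenizer`, `tsv_path` and `table_csv_path` variables of the previous snippet, and how the list-valued columns need to be parsed depends on how your TSV was written):
```py
>>> import ast
>>> import pandas as pd

>>> data = pd.read_csv(tsv_path, sep="\t")
>>> # the list-valued columns may need to be parsed back into Python objects first,
>>> # e.g. with ast.literal_eval if they were written as Python literals
>>> data["answer_coordinates"] = data["answer_coordinates"].apply(ast.literal_eval)
>>> data["answer_text"] = data["answer_text"].apply(ast.literal_eval)

>>> for table_file, group in data.groupby("table_file"):
...     # keep the questions of a table in conversational order
...     group = group.sort_values("position")
...     table = pd.read_csv(table_csv_path + table_file).astype(str)
...     # batch encode the table together with all of its questions, so that the
...     # prev_labels token types are derived from the answers to the previous question
...     encoding = tokenizer(
...         table=table,
...         queries=group["question"].tolist(),
...         answer_coordinates=group["answer_coordinates"].tolist(),
...         answer_text=group["answer_text"].tolist(),
...         truncation=True,
...         padding="max_length",
...         return_tensors="pt",
...     )
```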
**STEP 4: Train (fine-tune) TapasForQuestionAnswering/TFTapasForQuestionAnswering**
You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnswering`] as follows (shown here for the weak supervision for aggregation case):
```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering, AdamW
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = AdamW(model.parameters(), lr=5e-5)
>>> model.train()
>>> for epoch in range(2): # loop over the dataset multiple times
... for batch in train_dataloader:
... # get the inputs;
... input_ids = batch["input_ids"]
... attention_mask = batch["attention_mask"]
... token_type_ids = batch["token_type_ids"]
... labels = batch["labels"]
... numeric_values = batch["numeric_values"]
... numeric_values_scale = batch["numeric_values_scale"]
... float_answer = batch["float_answer"]
... # zero the parameter gradients
... optimizer.zero_grad()
... # forward + backward + optimize
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer)
... loss = outputs.loss
... loss.backward()
... optimizer.step()
===PT-TF-SPLIT===
>>> import tensorflow as tf
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
>>> for epoch in range(2): # loop over the dataset multiple times
... for batch in train_dataloader:
... # get the inputs;
... input_ids = batch[0]
... attention_mask = batch[1]
... token_type_ids = batch[4]
... labels = batch[5]
... numeric_values = batch[2]
... numeric_values_scale = batch[3]
... float_answer = batch[6]
... # forward + backward + optimize
... with tf.GradientTape() as tape:
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer )
... grads = tape.gradient(outputs.loss, model.trainable_weights)
... optimizer.apply_gradients(zip(grads, model.trainable_weights))
```
## Usage: inference
Here we explain how you can use [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnswering`] for inference (i.e. making predictions on new data). For inference, only `input_ids`, `attention_mask` and `token_type_ids` (which you can obtain using [`TapasTokenizer`]) have to be provided to the model to obtain the logits. Next, you can use the handy [`~models.tapas.tokenization_tapas.TapasTokenizer.convert_logits_to_predictions`] method to convert these into predicted coordinates and optional aggregation indices.
However, note that inference is **different** depending on whether or not the setup is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that:
```py
>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits.detach(),
... outputs.logits_aggregation.detach()
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
===PT-TF-SPLIT===
>>> from transformers import TapasTokenizer, TFTapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TFTapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="tf")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits,
... outputs.logits_aggregation
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
```
In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such that the `prev_labels` token types can be overwritten by the predicted `labels` of the previous table-question pair. Again, more info can be found in [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for PyTorch) and [this notebook](https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for TensorFlow).
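A minimal sketch of such a sequential loop is shown below, using a checkpoint fine-tuned on SQA. It assumes that `prev_labels` is the fourth token type created by [`TapasTokenizer`] (segment, column, row, prev_labels, ...) and that row/column token type ids are 1-based for table cells; the notebooks linked above contain the complete, tested version.
```py
>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd

>>> model_name = 'google/tapas-base-finetuned-sqa'
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> table = pd.DataFrame.from_dict(data)
>>> queries = ["Who is the first actor?", "How many movies has he played in?"]

>>> prev_coordinates = None
>>> for query in queries:
...     inputs = tokenizer(table=table, queries=query, padding='max_length', return_tensors="pt")
...     if prev_coordinates is not None:
...         # overwrite the prev_labels token types (assumed to be at index 3) with the
...         # cells that were predicted as answer to the previous question
...         column_ids = inputs["token_type_ids"][0, :, 1]
...         row_ids = inputs["token_type_ids"][0, :, 2]
...         for row, column in prev_coordinates:
...             mask = (row_ids == row + 1) & (column_ids == column + 1)
...             inputs["token_type_ids"][0, mask, 3] = 1
...     outputs = model(**inputs)
...     predicted_answer_coordinates = tokenizer.convert_logits_to_predictions(inputs, outputs.logits.detach())[0]
...     prev_coordinates = predicted_answer_coordinates[0]
...     print(query, "->", [table.iat[coordinate] for coordinate in prev_coordinates])
```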
## TAPAS specific outputs
[[autodoc]] models.tapas.modeling_tapas.TableQuestionAnsweringOutput
## TapasConfig
[[autodoc]] TapasConfig
## TapasTokenizer
[[autodoc]] TapasTokenizer
- __call__
- convert_logits_to_predictions
- save_vocabulary
## TapasModel
[[autodoc]] TapasModel
- forward
## TapasForMaskedLM
[[autodoc]] TapasForMaskedLM
- forward
## TapasForSequenceClassification
[[autodoc]] TapasForSequenceClassification
- forward
## TapasForQuestionAnswering
[[autodoc]] TapasForQuestionAnswering
- forward
## TFTapasModel
[[autodoc]] TFTapasModel
- call
## TFTapasForMaskedLM
[[autodoc]] TFTapasForMaskedLM
- call
## TFTapasForSequenceClassification
[[autodoc]] TFTapasForSequenceClassification
- call
## TFTapasForQuestionAnswering
[[autodoc]] TFTapasForQuestionAnswering
- call
TAPAS
-----------------------------------------------------------------------------------------------------------------------
.. note::
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
breaking changes that will be fixed in the future.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The TAPAS model was proposed in `TAPAS: Weakly Supervised Table Parsing via Pre-training
<https://www.aclweb.org/anthology/2020.acl-main.398>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
Francesco Piccinno and Julian Martin Eisenschlos. It's a BERT-based model specifically designed (and pre-trained) for
answering questions about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types
that encode tabular structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset
comprising millions of tables from English Wikipedia and corresponding texts. For question answering, TAPAS has 2 heads
on top: a cell selection head and an aggregation head, for (optionally) performing aggregations (such as counting or
summing) among selected cells. TAPAS has been fine-tuned on several datasets: `SQA
<https://www.microsoft.com/en-us/download/details.aspx?id=54253>`__ (Sequential Question Answering by Microsoft), `WTQ
<https://github.com/ppasupat/WikiTableQuestions>`__ (Wiki Table Questions by Stanford University) and `WikiSQL
<https://github.com/salesforce/WikiSQL>`__ (by Salesforce). It achieves state-of-the-art on both SQA and WTQ, while
having comparable performance to SOTA on WikiSQL, with a much simpler architecture.
The abstract from the paper is the following:
*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the
collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations
instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition,
the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we
present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak
supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation
operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective
joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with
three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by
improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL
and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our
setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.*
In addition, the authors have further pre-trained TAPAS to recognize **table entailment**, by creating a balanced
dataset of millions of automatically created training examples which are learned in an intermediate step prior to
fine-tuning. The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first
pre-trained on MLM, and then on another dataset). They found that intermediate pre-training further improves
performance on SQA, achieving a new state-of-the-art as well as state-of-the-art on `TabFact
<https://github.com/wenhuchen/Table-Fact-Checking>`__, a large-scale dataset with 16k Wikipedia tables for table
entailment (a binary classification task). For more details, see their follow-up paper: `Understanding tables with
intermediate pre-training <https://www.aclweb.org/anthology/2020.findings-emnlp.27/>`__ by Julian Martin Eisenschlos,
Syrine Krichene and Thomas Müller.
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The Tensorflow version of this model was
contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
<https://github.com/google-research/tapas>`__.
Tips:
- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell
of the table). Note that this is something that was added after the publication of the original TAPAS paper.
According to the authors, this usually results in a slightly better performance, and allows you to encode longer
sequences without running out of embeddings. This is reflected in the ``reset_position_index_per_cell`` parameter of
:class:`~transformers.TapasConfig`, which is set to ``True`` by default. The default versions of the models available
in the `model hub <https://huggingface.co/models?search=tapas>`_ all use relative position embeddings. You can still
use the ones with absolute position embeddings by passing in an additional argument ``revision="no_reset"`` when
calling the ``.from_pretrained()`` method (see the example after these tips). Note that it's usually advised to pad the
inputs on the right rather than the left.
- TAPAS is based on BERT, so ``TAPAS-base`` for example corresponds to a ``BERT-base`` architecture. Of course,
TAPAS-large will result in the best performance (the results reported in the paper are from TAPAS-large). Results of
the various sized models are shown on the `original Github repository <https://github.com/google-research/tapas>`_.
- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a
conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the
previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that
case, you have to feed every table-question pair one by one to the model, such that the `prev_labels` token type ids
can be overwritten by the predicted `labels` of the model to the previous question. See "Usage" section for more
info.
- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
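As referenced in the first tip above, loading a checkpoint with absolute position embeddings could look as follows (a minimal sketch; it assumes the ``no_reset`` revision is available for the checkpoint you pick):
.. code-block::
>>> from transformers import TapasModel
>>> # default: relative position embeddings (reset_position_index_per_cell=True)
>>> model = TapasModel.from_pretrained("google/tapas-base")
>>> # absolute position embeddings: load the "no_reset" revision of the same checkpoint
>>> model = TapasModel.from_pretrained("google/tapas-base", revision="no_reset")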
Usage: fine-tuning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here we explain how you can fine-tune :class:`~transformers.TapasForQuestionAnswering` on your own dataset.
**STEP 1: Choose one of the 3 ways in which you can use TAPAS - or experiment**
Basically, there are 3 different ways in which one can fine-tune :class:`~transformers.TapasForQuestionAnswering`,
corresponding to the different datasets on which Tapas was fine-tuned:
1. SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example
if you first ask "what's the name of the first actor?" then you can ask a follow-up question such as "how old is
he?". Here, questions do not involve any aggregation (all questions are cell selection questions).
2. WTQ: if you're not interested in asking questions in a conversational set-up, but rather just asking questions
related to a table, which might involve aggregation, such as counting a number of rows, summing up cell values or
averaging cell values. You can then for example ask "what's the total number of goals Cristiano Ronaldo made in his
career?". This case is also called **weak supervision**, since the model itself must learn the appropriate
aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.
3. WikiSQL-supervised: this dataset is based on WikiSQL with the model being given the ground truth aggregation
operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation
operator is much easier.
To summarize:
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| **Task** | **Example dataset** | **Description** |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Conversational | SQA | Conversational, only cell selection questions |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Weak supervision for aggregation | WTQ | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the gold aggregation operator |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
Initializing a model with a pre-trained base and randomly initialized classification heads from the model hub can be
done as follows (be sure to have installed the `torch-scatter dependency <https://github.com/rusty1s/pytorch_scatter>`_
for your environment):
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
In TensorFlow, this can be done as follows (make sure to have installed the `tensorflow_probability dependency
<https://github.com/tensorflow/probability>`__ for your environment):
.. code-block::
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also
experiment by defining any hyperparameters you want when initializing :class:`~transformers.TapasConfig`, and then
create a :class:`~transformers.TapasForQuestionAnswering` based on that configuration. For example, if you have a
dataset that has both conversational questions and questions that might involve aggregation, then you can do it this
way. Here's an example:
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
And here is the equivalent code for TensorFlow:
.. code-block::
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned
checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See `here
<https://github.com/google-research/tapas/issues/91#issuecomment-735719340>`__ for more info.
For a list of all pre-trained and fine-tuned TAPAS checkpoints available in the HuggingFace model hub, see `here
<https://huggingface.co/models?search=tapas>`__.
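For example, a sketch of starting from a checkpoint that was already fine-tuned on SQA (any other fine-tuned checkpoint from the hub works the same way):
.. code-block::
>>> from transformers import TapasForQuestionAnswering
>>> # start from a checkpoint that was already fine-tuned on SQA, instead of the pre-trained base
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-finetuned-sqa')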
**STEP 2: Prepare your data in the SQA format**
Second, no matter what you picked above, you should prepare your dataset in the `SQA format
<https://www.microsoft.com/en-us/download/details.aspx?id=54253>`__. This format is a TSV/CSV file with the following
columns:
- ``id``: optional, id of the table-question pair, for bookkeeping purposes.
- ``annotator``: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
- ``position``: integer indicating if the question is the first, second, third,... related to the table. Only required
in case of conversational setup (SQA). You don't need this column in case you're going for WTQ/WikiSQL-supervised.
- ``question``: string
- ``table_file``: string, name of a csv file containing the tabular data
- ``answer_coordinates``: list of one or more tuples (each tuple being a cell coordinate, i.e. row, column pair that is
part of the answer)
- ``answer_text``: list of one or more strings (each string being a cell value that is part of the answer)
- ``aggregation_label``: index of the aggregation operator. Only required in case of strong supervision for aggregation
(the WikiSQL-supervised case)
- ``float_answer``: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of
weak supervision for aggregation (such as WTQ and WikiSQL)
The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the
TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ, WikiSQL) into the
SQA format. The author explains this `here
<https://github.com/google-research/tapas/issues/50#issuecomment-705465960>`__. Interestingly, these conversion scripts
are not perfect (the ``answer_coordinates`` and ``float_answer`` fields are populated based on the ``answer_text``),
meaning that WTQ and WikiSQL results could actually be improved.
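As an illustration, a single (hypothetical) training example in this format could be created with pandas as follows; the file names and values are made up and only meant to show the expected columns (weak supervision for aggregation case):
.. code-block::
>>> import pandas as pd
>>> example = {
...     'id': ['example-1'],
...     'annotator': [0],
...     'position': [0],
...     'question': ["What is the total number of movies?"],
...     'table_file': ['table_001.csv'],
...     'answer_coordinates': [[(0, 1), (1, 1), (2, 1)]],
...     'answer_text': [["209"]],
...     'float_answer': [209.0],
... }
>>> # note that list-valued columns are written as strings and have to be parsed back when reading the TSV
>>> pd.DataFrame(example).to_csv('train.tsv', sep='\t', index=False)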
**STEP 3: Convert your data into PyTorch/TensorFlow tensors using TapasTokenizer**
Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular
data), you can then use :class:`~transformers.TapasTokenizer` to convert table-question pairs into :obj:`input_ids`,
:obj:`attention_mask`, :obj:`token_type_ids` and so on. Again, based on which of the three cases you picked above,
:class:`~transformers.TapasForQuestionAnswering`/:class:`~transformers.TFTapasForQuestionAnswering` requires different
inputs to be fine-tuned:
+------------------------------------+----------------------------------------------------------------------------------------------+
| **Task** | **Required inputs** |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Conversational | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``labels`` |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Weak supervision for aggregation | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``labels``, ``numeric_values``, |
| | ``numeric_values_scale``, ``float_answer`` |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Strong supervision for aggregation | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``labels``, ``aggregation_labels``    |
+------------------------------------+----------------------------------------------------------------------------------------------+
:class:`~transformers.TapasTokenizer` creates the ``labels``, ``numeric_values`` and ``numeric_values_scale`` based on
the ``answer_coordinates`` and ``answer_text`` columns of the TSV file. The ``float_answer`` and ``aggregation_labels``
are already in the TSV file of step 2. Here's an example:
.. code-block::
>>> from transformers import TapasTokenizer
>>> import pandas as pd
>>> model_name = 'google/tapas-base'
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}
Set ``return_tensors='tf'`` when calling the tokenizer to prepare data for the TF models.
Note that :class:`~transformers.TapasTokenizer` expects the data of the table to be **text-only**. You can use
``.astype(str)`` on a dataframe to turn it into text-only data. Of course, this only shows how to encode a single
training example. It is advised to create a PyTorch dataset and a corresponding dataloader:
.. code-block::
>>> import torch
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset(torch.utils.data.Dataset):
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __getitem__(self, idx):
... item = self.data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="pt"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: val.squeeze(0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = torch.tensor(item.float_answer)
... return encoding
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
And here is the equivalent code for TensorFlow:
.. code-block::
>>> import tensorflow as tf
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset:
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __iter__(self):
... for idx in range(self.__len__()):
... item = self.data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="tf"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: tf.squeeze(val, 0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = tf.convert_to_tensor(item.float_answer, dtype=tf.float32)
... yield encoding['input_ids'], encoding['attention_mask'], encoding['numeric_values'], \
... encoding['numeric_values_scale'], encoding['token_type_ids'], encoding['labels'], \
... encoding['float_answer']
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> output_signature = (
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.float32),
... tf.TensorSpec(shape=(512,), dtype=tf.float32),
... tf.TensorSpec(shape=(512, 7), dtype=tf.int32),
... tf.TensorSpec(shape=(512,), dtype=tf.int32),
... tf.TensorSpec(shape=(), dtype=tf.float32))  # the float_answer of a single example is a scalar
>>> train_dataloader = tf.data.Dataset.from_generator(train_dataset, output_signature=output_signature).batch(32)
Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not
conversational**. In case your dataset involves conversational questions (such as in SQA), then you should first group
together the ``queries``, ``answer_coordinates`` and ``answer_text`` per table (in the order of their ``position``
index) and batch encode each table with its questions. This will make sure that the ``prev_labels`` token types (see
docs of :class:`~transformers.TapasTokenizer`) are set correctly. See `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
for more info (PyTorch). See `this notebook
<https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
for the TensorFlow model.
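As a rough sketch of this grouping step (not the exact code of the notebooks above; it assumes the ``tokenizer``, ``tsv_path`` and ``table_csv_path`` variables of the previous snippet, and that the list-valued columns have already been parsed back into Python objects):
.. code-block::
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> for table_file, group in data.groupby('table_file'):
...     # keep the questions of a table in conversational order
...     group = group.sort_values('position')
...     table = pd.read_csv(table_csv_path + table_file).astype(str)
...     # batch encode the table together with all of its questions, so that the
...     # prev_labels token types are derived from the answers to the previous question
...     encoding = tokenizer(table=table,
...                          queries=group['question'].tolist(),
...                          answer_coordinates=group['answer_coordinates'].tolist(),
...                          answer_text=group['answer_text'].tolist(),
...                          truncation=True,
...                          padding="max_length",
...                          return_tensors="pt")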
**STEP 4: Train (fine-tune) TapasForQuestionAnswering/TFTapasForQuestionAnswering**
You can then fine-tune :class:`~transformers.TapasForQuestionAnswering` using native PyTorch as follows (shown here for
the weak supervision for aggregation case):
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering, AdamW
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = AdamW(model.parameters(), lr=5e-5)
>>> for epoch in range(2): # loop over the dataset multiple times
... for idx, batch in enumerate(train_dataloader):
... # get the inputs;
... input_ids = batch["input_ids"]
... attention_mask = batch["attention_mask"]
... token_type_ids = batch["token_type_ids"]
... labels = batch["labels"]
... numeric_values = batch["numeric_values"]
... numeric_values_scale = batch["numeric_values_scale"]
... float_answer = batch["float_answer"]
... # zero the parameter gradients
... optimizer.zero_grad()
... # forward + backward + optimize
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer)
... loss = outputs.loss
... loss.backward()
... optimizer.step()
Equivalently, fine-tuning :class:`~transformers.TFTapasForQuestionAnswering` in native TensorFlow can be done as
follows (shown here for the weak supervision for aggregation case):
.. code-block::
>>> import tensorflow as tf
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
>>> for epoch in range(2): # loop over the dataset multiple times
... for idx, batch in enumerate(train_dataloader):
... # get the inputs;
... input_ids = batch[0]
... attention_mask = batch[1]
... token_type_ids = batch[4]
... labels = batch[5]
... numeric_values = batch[2]
... numeric_values_scale = batch[3]
... float_answer = batch[6]
... # forward + backward + optimize
... with tf.GradientTape() as tape:
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer )
... grads = tape.gradient(outputs.loss, model.trainable_weights)
... optimizer.apply_gradients(zip(grads, model.trainable_weights))
Usage: inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here we explain how you can use :class:`~transformers.TapasForQuestionAnswering` for inference (i.e. making predictions
on new data). For inference, only ``input_ids``, ``attention_mask`` and ``token_type_ids`` (which you can obtain using
:class:`~transformers.TapasTokenizer`) have to be provided to the model to obtain the logits. Next, you can use the
handy ``convert_logits_to_predictions`` method of :class:`~transformers.TapasTokenizer` to convert these into predicted
coordinates and optional aggregation indices.
However, note that inference is **different** depending on whether or not the setup is conversational. In a
non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example
of that:
.. code-block::
>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits.detach(),
... outputs.logits_aggregation.detach()
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
And here is the equivalent code for TensorFlow:
.. code-block::
>>> from transformers import TapasTokenizer, TFTapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TFTapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="tf")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits,
... outputs.logits_aggregation
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such
that the ``prev_labels`` token types can be overwritten by the predicted ``labels`` of the previous table-question
pair. Again, more info can be found in `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
(for PyTorch) and `this notebook
<https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
(for TensorFlow).
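A minimal sketch of such a sequential loop is shown below. It assumes that ``model`` and ``tokenizer`` correspond to a checkpoint fine-tuned on SQA (e.g. ``google/tapas-base-finetuned-sqa``) and that ``table`` is defined as above; it also assumes that ``prev_labels`` is the fourth token type created by the tokenizer and that row/column token type ids are 1-based for table cells. The notebooks linked above contain the complete, tested version.
.. code-block::
>>> queries = ["Who is the first actor?", "How many movies has he played in?"]
>>> prev_coordinates = None
>>> for query in queries:
...     inputs = tokenizer(table=table, queries=query, padding='max_length', return_tensors="pt")
...     if prev_coordinates is not None:
...         # overwrite the prev_labels token types with the cells predicted for the previous question
...         column_ids = inputs["token_type_ids"][0, :, 1]
...         row_ids = inputs["token_type_ids"][0, :, 2]
...         for row, column in prev_coordinates:
...             mask = (row_ids == row + 1) & (column_ids == column + 1)
...             inputs["token_type_ids"][0, mask, 3] = 1
...     outputs = model(**inputs)
...     predicted_answer_coordinates = tokenizer.convert_logits_to_predictions(inputs, outputs.logits.detach())[0]
...     prev_coordinates = predicted_answer_coordinates[0]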
Tapas specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.tapas.modeling_tapas.TableQuestionAnsweringOutput
:members:
TapasConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasConfig
:members:
TapasTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasTokenizer
:members: __call__, convert_logits_to_predictions, save_vocabulary
TapasModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasModel
:members: forward
TapasForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForMaskedLM
:members: forward
TapasForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForSequenceClassification
:members: forward
TapasForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForQuestionAnswering
:members: forward
TFTapasModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTapasModel
:members: call
TFTapasForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTapasForMaskedLM
:members: call
TFTapasForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTapasForSequenceClassification
:members: call
TFTapasForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTapasForQuestionAnswering
:members: call
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# TrOCR
## Overview
The TrOCR model was proposed in [TrOCR: Transformer-based Optical Character Recognition with Pre-trained
Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to
perform [optical character recognition (OCR)](https://en.wikipedia.org/wiki/Optical_character_recognition).
The abstract from the paper is the following:
*Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition
are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language
model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end
text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the
Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but
effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments
show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition
tasks.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/trocr_architecture.jpg"
alt="drawing" width="600"/>
<small> TrOCR architecture. Taken from the [original paper](https://arxiv.org/abs/2109.10282). </small>
Please refer to the [`VisionEncoderDecoder`] class on how to use this model.
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/microsoft/unilm/tree/6f60612e7cc86a2a1ae85c47231507a587ab4e01/trocr).
Tips:
- The quickest way to get started with TrOCR is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR), which show how to use the model
at inference time as well as fine-tuning on custom data.
- TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. It achieves state-of-the-art results
  on both printed (e.g. the [SROIE dataset](https://paperswithcode.com/dataset/sroie)) and handwritten (e.g. the [IAM
  Handwriting dataset](https://fki.tic.heia-fr.ch/databases/iam-handwriting-database)) text recognition tasks. For more
  information, see the [official models](https://huggingface.co/models?other=trocr).
- TrOCR is always used within the [VisionEncoderDecoder](./model_doc/visionencoderdecoder) framework.
## Inference
TrOCR's [`VisionEncoderDecoder`] model accepts images as input and makes use of
[`~generation_utils.GenerationMixin.generate`] to autoregressively generate text given the input image.
The [`ViTFeatureExtractor`] class is responsible for preprocessing the input image and
[`RobertaTokenizer`] decodes the generated target tokens to the target string. The
[`TrOCRProcessor`] wraps [`ViTFeatureExtractor`] and [`RobertaTokenizer`]
into a single instance to both extract the input features and decode the predicted token ids.
- Step-by-step Optical Character Recognition (OCR)
```py
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
See the [model hub](https://huggingface.co/models?filter=trocr) to look for TrOCR checkpoints.
## TrOCRConfig
[[autodoc]] TrOCRConfig
## TrOCRProcessor
[[autodoc]] TrOCRProcessor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode
- as_target_processor
## TrOCRForCausalLM
[[autodoc]] TrOCRForCausalLM
- forward
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
TrOCR
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The TrOCR model was proposed in `TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
<https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to
perform `optical character recognition (OCR) <https://en.wikipedia.org/wiki/Optical_character_recognition>`__.
Please refer to the :doc:`VisionEncoderDecoder <visionencoderdecoder>` class on how to use this model.
This model was contributed by `Niels Rogge <https://huggingface.co/nielsr>`__.
The original code can be found `here
<https://github.com/microsoft/unilm/tree/6f60612e7cc86a2a1ae85c47231507a587ab4e01/trocr>`__.
Tips:
- TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. It achieves state-of-the-art results
on both printed (e.g. the `SROIE dataset <https://paperswithcode.com/dataset/sroie>`__) and handwritten (e.g. the
`IAM Handwriting dataset <https://fki.tic.heia-fr.ch/databases/iam-handwriting-database>`__) text recognition tasks.
For more information, see the `official models <https://huggingface.co/models?other=trocr>`__.
- TrOCR is always used within the :doc:`VisionEncoderDecoder <visionencoderdecoder>` framework.
Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TrOCR's :class:`~transformers.VisionEncoderDecoderModel` model accepts images as input and makes use of
:func:`~transformers.generation_utils.GenerationMixin.generate` to autoregressively generate text given the input
image.
The :class:`~transformers.ViTFeatureExtractor` class is responsible for preprocessing the input image and
:class:`~transformers.RobertaTokenizer` decodes the generated target tokens to the target string. The
:class:`~transformers.TrOCRProcessor` wraps :class:`~transformers.ViTFeatureExtractor` and
:class:`~transformers.RobertaTokenizer` into a single instance to both extract the input features and decode the
predicted token ids.
- Step-by-step Optical Character Recognition (OCR)
.. code-block::
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
See the `model hub <https://huggingface.co/models?filter=trocr>`__ to look for TrOCR checkpoints.
TrOCRConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRConfig
:members:
TrOCRProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRProcessor
:members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor
TrOCRForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrOCRForCausalLM
:members: forward
...@@ -43,6 +43,8 @@ substantially fewer computational resources to train.* ...@@ -43,6 +43,8 @@ substantially fewer computational resources to train.*
Tips: Tips:
- Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found `here
<https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__.
- To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, - To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be
used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of
......
...@@ -90,7 +90,8 @@ class PerceiverConfig(PretrainedConfig): ...@@ -90,7 +90,8 @@ class PerceiverConfig(PretrainedConfig):
samples_per_patch (:obj:`int`, `optional`, defaults to 16): samples_per_patch (:obj:`int`, `optional`, defaults to 16):
Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model. Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model.
output_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[1, 16, 224, 224]`): output_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[1, 16, 224, 224]`):
Shape of the output for the multimodal autoencoding model. Shape of the output (batch_size, num_frames, height, width) for the video decoder queries of the multimodal
autoencoding model. This excludes the channel dimension.
Example:: Example::
......
...@@ -1865,7 +1865,13 @@ class PerceiverAbstractDecoder(nn.Module, metaclass=abc.ABCMeta):
class PerceiverProjectionDecoder(PerceiverAbstractDecoder):
"""
Baseline projection decoder (no cross-attention).
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
super().__init__()
...@@ -1884,11 +1890,38 @@ class PerceiverProjectionDecoder(PerceiverAbstractDecoder):
class PerceiverBasicDecoder(PerceiverAbstractDecoder):
"""
Cross-attention-based decoder. This class can be used to decode the final hidden states of the latents using a
cross-attention operation, in which the latents produce keys and values.
The shape of the output of this class depends on how one defines the output queries (also called decoder queries).
Args:
config ([`PerceiverConfig`]):
Model configuration.
output_num_channels (:obj:`int`, `optional`):
The number of channels in the output. Will only be used in case `final_project` is set to `True`.
position_encoding_type (:obj:`str`, `optional`, defaults to "trainable"):
The type of position encoding to use. Can be either "trainable", "fourier", or "none".
output_index_dims (:obj:`int`, `optional`):
The number of dimensions of the output queries. Ignored if 'position_encoding_type' == 'none'.
num_channels (:obj:`int`, `optional`):
The number of channels of the decoder queries. Ignored if 'position_encoding_type' == 'none'.
qk_channels (:obj:`int`, `optional`):
The number of channels of the queries and keys in the cross-attention layer.
v_channels (:obj:`int`, `optional`, defaults to 128):
The number of channels of the values in the cross-attention layer.
num_heads (:obj:`int`, `optional`, defaults to 1):
The number of attention heads in the cross-attention layer.
widening_factor (:obj:`int`, `optional`, defaults to 1):
The widening factor of the cross-attention layer.
use_query_residual (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use a residual connection between the query and the output of the cross-attention layer.
concat_preprocessed_input (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to concatenate the preprocessed input to the query.
final_project (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to project the output of the cross-attention layer to a target dimension.
position_encoding_only (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to only use this class to define output queries.
""" """
def __init__( def __init__(
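For illustration, a sketch of instantiating such a decoder with trainable output queries; the argument values roughly follow the masked language modeling setup and are meant as an example rather than a prescription:
>>> from transformers import PerceiverConfig
>>> from transformers.models.perceiver.modeling_perceiver import PerceiverBasicDecoder
>>> config = PerceiverConfig()
>>> # with the default position_encoding_type="trainable", the decoder queries are learned embeddings
>>> decoder = PerceiverBasicDecoder(
...     config,
...     output_num_channels=config.d_latents,
...     output_index_dims=config.max_position_embeddings,
...     num_channels=config.d_latents,
...     qk_channels=8 * 32,
...     v_channels=config.d_latents,
...     num_heads=8,
...     use_query_residual=False,
...     final_project=False,
...     trainable_position_encoding_kwargs=dict(
...         num_channels=config.d_latents, index_dims=config.max_position_embeddings
...     ),
... )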
...@@ -2035,9 +2068,13 @@ class PerceiverBasicDecoder(PerceiverAbstractDecoder):
class PerceiverClassificationDecoder(PerceiverAbstractDecoder):
"""
Cross-attention based classification decoder. Light-weight wrapper of [`PerceiverBasicDecoder`] for logit output.
Will turn the output of the Perceiver encoder which is of shape (batch_size, num_latents, d_latents) to a tensor of
shape (batch_size, num_labels). The queries are of shape (batch_size, 1, num_labels).
Args:
config ([`PerceiverConfig`]):
Model configuration.
""" """
def __init__(self, config, **decoder_kwargs): def __init__(self, config, **decoder_kwargs):
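A sketch of plugging this decoder into `PerceiverModel` for text classification, mirroring the usage pattern of the library's Perceiver examples (the values are illustrative):
>>> import torch
>>> from transformers import PerceiverConfig, PerceiverTokenizer, PerceiverModel
>>> from transformers.models.perceiver.modeling_perceiver import (
...     PerceiverTextPreprocessor,
...     PerceiverClassificationDecoder,
... )
>>> config = PerceiverConfig()
>>> preprocessor = PerceiverTextPreprocessor(config)
>>> decoder = PerceiverClassificationDecoder(
...     config,
...     num_channels=config.d_latents,
...     trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
...     use_query_residual=True,
... )
>>> model = PerceiverModel(config, input_preprocessor=preprocessor, decoder=decoder)
>>> tokenizer = PerceiverTokenizer()
>>> inputs = tokenizer("hello world", return_tensors="pt").input_ids
>>> with torch.no_grad():
...     outputs = model(inputs=inputs)
>>> logits = outputs.logits  # shape (batch_size, num_labels)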
...@@ -2100,8 +2137,16 @@ class PerceiverOpticalFlowDecoder(PerceiverAbstractDecoder):
class PerceiverBasicVideoAutoencodingDecoder(PerceiverAbstractDecoder):
"""
Cross-attention based video-autoencoding decoder. Light-weight wrapper of [`PerceiverBasicDecoder`] with video
reshaping logic.
Args:
config ([`PerceiverConfig`]):
Model configuration.
output_shape (:obj:`List[int]`):
Shape of the output as (batch_size, num_frames, height, width), excluding the channel dimension.
position_encoding_type (:obj:`str`):
The type of position encoding to use. Can be either "trainable", "fourier", or "none".
""" """
def __init__(self, config, output_shape, position_encoding_type, **decoder_kwargs): def __init__(self, config, output_shape, position_encoding_type, **decoder_kwargs):
...@@ -2165,10 +2210,28 @@ def restructure(modality_sizes: ModalitySizeType, inputs: torch.Tensor) -> Mappi
class PerceiverMultimodalDecoder(PerceiverAbstractDecoder):
"""
Multimodal decoding by composing uni-modal decoders. The `modalities` argument of the constructor is a dictionary
mapping modality name to the decoder of that modality. That decoder will be used to construct queries for that
modality. Modality-specific queries are padded with trainable modality-specific parameters, after which they are
concatenated along the time dimension.
Next, there is a shared cross attention operation across all modalities.
Args:
config ([`PerceiverConfig`]):
Model configuration.
modalities (:obj:`Dict[str, PerceiverAbstractDecoder]`):
Dictionary mapping modality name to the decoder of that modality.
num_outputs (:obj:`int`):
The number of outputs of the decoder.
output_num_channels (:obj:`int`):
The number of channels in the output.
min_padding_size (:obj:`int`, `optional`, defaults to 2):
The minimum padding size for all modalities. The final output will have num_channels equal to the maximum
channels across all modalities plus min_padding_size.
subsampled_index_dims (:obj:`Dict[str, PerceiverAbstractDecoder]`, `optional`):
Dictionary mapping modality name to the subsampled index dimensions to use for the decoder query of that
modality.
""" """
def __init__( def __init__(
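To build intuition for the padding-and-concatenation step described above, a toy sketch in plain PyTorch (zeros stand in for the trainable modality-specific padding parameters; shapes are illustrative):
>>> import torch
>>> image_queries = torch.randn(1, 196, 256)  # (batch_size, index_dims, num_channels) for an "image" modality
>>> label_queries = torch.randn(1, 1, 512)  # queries for a "label" modality
>>> common_channels = max(256, 512) + 2  # max channels across modalities + min_padding_size
>>> def pad_channels(q):
...     return torch.nn.functional.pad(q, (0, common_channels - q.shape[-1]))
>>> decoder_queries = torch.cat([pad_channels(image_queries), pad_channels(label_queries)], dim=1)
>>> decoder_queries.shape
torch.Size([1, 197, 514])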
...@@ -2556,7 +2619,15 @@ class AbstractPreprocessor(nn.Module):
class PerceiverTextPreprocessor(AbstractPreprocessor):
"""
Text preprocessing for Perceiver Encoder. Can be used to embed `inputs` and add positional encodings.
The dimensionality of the embeddings is determined by the `d_model` attribute of the configuration.
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
super().__init__()
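A minimal sketch of the `d_model` point (assuming, as for the other preprocessors, that the embedding size is exposed via the `num_channels` property):
>>> from transformers import PerceiverConfig
>>> from transformers.models.perceiver.modeling_perceiver import PerceiverTextPreprocessor
>>> config = PerceiverConfig(d_model=512)
>>> preprocessor = PerceiverTextPreprocessor(config)
>>> preprocessor.num_channels  # token + position embeddings both have dimensionality d_model
512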
...@@ -2578,10 +2649,15 @@ class PerceiverTextPreprocessor(AbstractPreprocessor):
class PerceiverEmbeddingDecoder(nn.Module):
"""
Module to decode embeddings (for masked language modeling).
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
def __init__(self, config):
super().__init__()
self.config = config
self.vocab_size = config.vocab_size
...@@ -2597,7 +2673,8 @@ class PerceiverEmbeddingDecoder(nn.Module):
class PerceiverMultimodalPostprocessor(nn.Module):
"""
Multimodal postprocessing for Perceiver. Can be used to combine modality-specific postprocessors into a single
postprocessor.
Args:
modalities (:obj:`Dict[str, PostprocessorType]`):
...@@ -2633,7 +2710,7 @@ class PerceiverClassificationPostprocessor(nn.Module):
Classification postprocessing for Perceiver. Can be used to convert the decoder output to classification logits.
Args:
config ([`PerceiverConfig`]):
Model configuration.
in_channels (:obj:`int`):
Number of channels in the input.
...@@ -2653,7 +2730,7 @@ class PerceiverAudioPostprocessor(nn.Module):
Audio postprocessing for Perceiver. Can be used to convert the decoder output to audio features.
Args:
config ([`PerceiverConfig`]):
Model configuration.
in_channels (:obj:`int`):
Number of channels in the input.
...@@ -2678,8 +2755,8 @@ class PerceiverAudioPostprocessor(nn.Module):
class PerceiverProjectionPostprocessor(nn.Module):
"""
Projection postprocessing for Perceiver. Can be used to project the channels of the decoder output to a lower
dimension.
Args:
in_channels (:obj:`int`):
...@@ -2706,7 +2783,7 @@ class PerceiverImagePreprocessor(AbstractPreprocessor):
position encoding kwargs are set equal to the `out_channels`.
Args:
config ([`PerceiverConfig`]):
Model configuration.
prep_type (:obj:`str`, `optional`, defaults to :obj:`"conv"`):
Preprocessing type. Can be "conv1x1", "conv", "patches", "pixels".
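For illustration, a sketch of an image preprocessor that uses a 1x1 convolution and trainable position encodings; the keyword arguments mirror the library's image classification example and are illustrative:
>>> from transformers import PerceiverConfig
>>> from transformers.models.perceiver.modeling_perceiver import PerceiverImagePreprocessor
>>> config = PerceiverConfig(image_size=224)
>>> preprocessor = PerceiverImagePreprocessor(
...     config,
...     prep_type="conv1x1",
...     spatial_downsample=1,
...     out_channels=256,
...     position_encoding_type="trainable",
...     concat_or_add_pos="concat",
...     project_pos_dim=256,
...     trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size**2),
... )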
...@@ -2931,7 +3008,7 @@ class PerceiverOneHotPreprocessor(AbstractPreprocessor):
One-hot preprocessor for Perceiver Encoder. Can be used to add a dummy index dimension to the input.
Args:
config ([`PerceiverConfig`]):
Model configuration.
"""
...@@ -2957,7 +3034,7 @@ class PerceiverAudioPreprocessor(AbstractPreprocessor):
Audio preprocessing for Perceiver Encoder.
Args:
config ([`PerceiverConfig`]):
Model configuration.
prep_type (:obj:`str`, `optional`, defaults to :obj:`"patches"`):
Preprocessor type to use. Only "patches" is supported.
......