Unverified Commit 65b20b73 authored by NielsRogge, committed by GitHub

Add Perceiver IO (#14487)

* First draft

* Style and remove mlm

* Make forward pass work

* More improvements

* More improvements

* Fix bug

* More improvements

* More improvements

* Add PerceiverTokenizer first draft

* Improve conversion script

* More improvements

* Make conversion script work for the encoder

* Make conversion script work with local pickle files

* Style & quality, fix-copies

* Add dummy input to conversion script

* Add absolute position embeddings to TextPreProcessor

* Make forward pass of encoder work

* More improvements

* Move text preprocessor to separate script

* More improvements

* More improvements

* Add post processor

* Make MLM model work

* Style

* Add PerceiverForMaskedLM

* Add PerceiverImagePreprocessor

* Make style

* Make PerceiverForImageClassification work

* More improvements

* More improvements

* Use tokenizer in conversion script

* Use PerceiverForMaskedLM in conversion script

* Define custom PerceiverModelOutput

* Improve PerceiverAttention to make it work for both MLM and image classification

* More improvements

* More improvements

* More improvements to the conversion script

* Make conversion script work for both MLM and image classification

* Add PerceiverFeatureExtractor

* More improvements

* Style and quality

* Add center cropping

* Fix bug

* Small fix

* Add print statement

* Fix bug in image preprocessor

* Fix bug with conversion script

* Make output position embeddings an nn.Parameter layer instead of nn.Embedding

* Comment out print statements

* Add position encoding classes

* More improvements

* Use position_encoding_kwargs

* Add PerceiverForImageClassificationFourier

* Make style & quality

* Add PerceiverForImageClassificationConvProcessing

* Style & quality

* Add flow model

* Move processors to modeling file

* Make position encodings modular

* Make basic decoder use modular position encodings

* Add PerceiverForOpticalFlow to conversion script

* Add AudioPreprocessor

* Make it possible for the basic decoder to use Fourier position embeddings

* Add PerceiverForMultimodalAutoencoding

* Improve model for optical flow

* Improve _build_network_inputs method

* Add print statement

* Fix device issue

* Fix device of Fourier embeddings

* Add print statements for debugging

* Add another print statement

* Add another print statement

* Add another print statement

* Add another print statement

* Improve PerceiverAudioPreprocessor

* Improve conversion script for multimodal model

* More improvements

* More improvements

* Improve multimodal model

* Make forward pass multimodal model work

* More improvements

* Improve tests

* Fix some more tests

* Add output dataclasses

* Make more tests pass

* Add print statements for debugging

* Add tests for image classification

* Add PerceiverClassifierOutput

* More improvements

* Make more tests pass for the optical flow model

* Make style & quality

* Small improvements

* Don't support training for optical flow model for now

* Fix _prepare_for_class for tests

* Make more tests pass, add some docs

* Add multimodal model to tests

* Minor fixes

* Fix tests

* Improve conversion script

* Make fixup

* Remove pos_dim argument

* Fix device issue

* Potential fix for OOM

* Revert previous commit

* Fix test_initialization

* Add print statements for debugging

* Fix print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Remove need for output_shape

* Comment out output_shape

* Remove unnecessary code

* Improve docs

* Fix make fixup

* Remove PerceiverTextProcessor from init

* Improve docs

* Small improvement

* Apply first batch of suggestions from code review

* Apply more suggestions from code review

* Update docstrings

* Define dicts beforehand for readability

* Rename task to architecture in conversion script, include PerceiverModel in tests

* Add print statements for debugging

* Fix tests on GPU

* Remove preprocessors, postprocessors and decoders from main init

* Add integration test

* Fix docs

* Replace einops by torch

* Update for new docs frontend

* Rename PerceiverForImageClassification

* Improve docs

* Improve docs

* Improve docs of PerceiverModel

* Fix some more tests

* Improve center_crop

* Add PerceiverForSequenceClassification

* Small improvements

* Fix tests

* Add integration test for optical flow model

* Clean up

* Add tests for tokenizer

* Fix tokenizer by adding special tokens properly

* Fix CI
parent 961732c2
@@ -286,6 +286,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
@@ -265,6 +265,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
@@ -289,6 +289,7 @@ conda install -c huggingface transformers
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (来自 Microsoft Research) 伴随论文 [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) 由 Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu 发布。
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (来自 Google AI) 伴随论文 [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) 由 Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel 发布。
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (来自 Google) 伴随论文 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 由 Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 发布。
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (来自 Deepmind) 伴随论文 [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) 由 Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira 发布。
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (来自 VinAI Research) 伴随论文 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 由 Dat Quoc Nguyen and Anh Tuan Nguyen 发布。
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。
......
@@ -301,6 +301,7 @@ conda install -c huggingface transformers
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
@@ -146,6 +146,7 @@ conversion utilities for the following models.
1. **[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
@@ -234,6 +235,7 @@ Flax), PyTorch, and/or TensorFlow.
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Pegasus | ✅ | ✅ | ✅ | ✅ | ✅ |
| Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ |
| ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
| QDQBert | ❌ | ❌ | ✅ | ❌ | ❌ |
| RAG | ✅ | ❌ | ✅ | ✅ | ❌ |
......
This diff is collapsed.
..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
Perceiver
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Perceiver IO model was proposed in `Perceiver IO: A General Architecture for Structured Inputs & Outputs
<https://arxiv.org/abs/2107.14795>`__ by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
Perceiver IO is a generalization of `Perceiver <https://arxiv.org/abs/2103.03206>`__ to handle arbitrary outputs in
addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to
classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio.
This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is
linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process
inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example,
Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.
The abstract from the paper is the following:
*The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point
clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of
inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without
sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales
linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural language and visual understanding,
StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT
baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art
performance on Sintel optical flow estimation.*
Here's a TLDR explaining how Perceiver works:
The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale
quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512
tokens. The Perceiver aims to solve this issue by performing self-attention not on the inputs themselves, but on a set
of latent variables, and by using the inputs only for cross-attention. In this way, the time and memory requirements no
longer depend on the length of the inputs, as one uses a fixed number of latent variables, such as 256 or 512. These are
randomly initialized, after which they are trained end-to-end using backpropagation.
Internally, :class:`~transformers.PerceiverModel` will create the latents, a tensor of shape
:obj:`(batch_size, num_latents, d_latents)`. One must provide :obj:`inputs` (which could be text, images, audio, you
name it!) to the model, which it will use to perform cross-attention with the latents. The output of the Perceiver
encoder is a tensor of the same shape. Similarly to BERT, one can then convert the last hidden states of the latents to
classification logits by averaging along the sequence dimension and placing a linear layer on top to project
:obj:`d_latents` to :obj:`num_labels`.
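For intuition, here is a minimal PyTorch sketch of this encoder idea (an illustration only, not the implementation in
:obj:`modeling_perceiver.py`; layer norms, feed-forward blocks and weight sharing are omitted, and the tensor names are
made up for the example):

.. code-block:: python

    import torch
    import torch.nn as nn

    batch_size, seq_len, d_model = 1, 2048, 768  # e.g. byte-level text inputs
    num_latents, d_latents = 256, 1280  # fixed-size latent array

    inputs = torch.randn(batch_size, seq_len, d_model)
    latents = nn.Parameter(torch.randn(num_latents, d_latents))  # randomly initialized, trained end-to-end

    # Cross-attention: the latents act as queries, the inputs as keys and values.
    cross_attention = nn.MultiheadAttention(
        embed_dim=d_latents, kdim=d_model, vdim=d_model, num_heads=8, batch_first=True
    )
    queries = latents.unsqueeze(0).expand(batch_size, -1, -1)
    hidden_states, _ = cross_attention(queries, inputs, inputs)  # (batch_size, num_latents, d_latents)

    # Self-attention operates on the latents only, so its cost is independent of seq_len.
    self_attention = nn.MultiheadAttention(embed_dim=d_latents, num_heads=8, batch_first=True)
    hidden_states, _ = self_attention(hidden_states, hidden_states, hidden_states)

    # BERT-style classification: average over the latent dimension and project to num_labels.
    logits = nn.Linear(d_latents, 2)(hidden_states.mean(dim=1))  # (batch_size, num_labels)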
This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up
work, Perceiver IO, the authors generalized it to let the model also produce outputs of arbitrary size. How, you might
ask? The idea is relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the
last hidden states of the latents, using the outputs as queries and the latents as keys and values.
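A self-contained sketch of this decoding step, in the same spirit as above (again illustrative only; the names
:obj:`output_queries` and :obj:`latent_states` are invented for the example, and the real decoder builds its queries
from position encodings):

.. code-block:: python

    import torch
    import torch.nn as nn

    batch_size, num_latents, d_latents = 1, 256, 1280
    latent_states = torch.randn(batch_size, num_latents, d_latents)  # final hidden states of the latents

    # Output queries of arbitrary length; here 2048 positions with 768 channels each.
    num_outputs, d_out = 2048, 768
    output_queries = torch.randn(batch_size, num_outputs, d_out)

    # Cross-attention: the output queries attend to the latents (keys and values).
    decoder_attention = nn.MultiheadAttention(
        embed_dim=d_out, kdim=d_latents, vdim=d_latents, num_heads=8, batch_first=True
    )
    decoded, _ = decoder_attention(output_queries, latent_states, latent_states)  # (batch_size, 2048, 768)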
So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input
length has no impact on the computation time of the self-attention layers, one can provide raw bytes, i.e.
:obj:`inputs` of length 2048, to the model. If one now masks out some of these 2048 tokens, one can define the
:obj:`outputs` as being of shape :obj:`(batch_size, 2048, 768)`. Next, one performs cross-attention with the final
hidden states of the latents to update the :obj:`outputs` tensor. After cross-attention, one still has a tensor of
shape :obj:`(batch_size, 2048, 768)`. One can then place a regular language modeling head on top to project the last
dimension to the vocabulary size of the model, creating logits of shape :obj:`(batch_size, 2048, 262)` (as Perceiver
uses a vocabulary size of 262 byte IDs).
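In terms of the classes added in this PR, such a masked language modeling forward pass could look roughly as follows (a
sketch; it assumes the :obj:`deepmind/language-perceiver` checkpoint referenced in this PR is available on the hub and
that its tokenizer pads to a maximum length of 2048):

.. code-block:: python

    import torch
    from transformers import PerceiverTokenizer, PerceiverForMaskedLM

    tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
    model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

    # The tokenizer operates directly on UTF-8 bytes; no subword vocabulary is involved.
    text = "This is an incomplete sentence where some words are missing."
    encoding = tokenizer(text, padding="max_length", return_tensors="pt")

    with torch.no_grad():
        outputs = model(inputs=encoding.input_ids, attention_mask=encoding.attention_mask)

    # One prediction over the 262-entry byte vocabulary per input position.
    print(outputs.logits.shape)  # expected: torch.Size([1, 2048, 262])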
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/deepmind/deepmind-research/tree/master/perceiver>`__.
Perceiver specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverDecoderOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput
:members:
PerceiverConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverConfig
:members:
PerceiverTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
PerceiverFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverFeatureExtractor
:members:
PerceiverTextPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
:members:
PerceiverImagePreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor
:members:
PerceiverOneHotPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor
:members:
PerceiverAudioPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor
:members:
PerceiverMultimodalPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor
:members:
PerceiverProjectionPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor
:members:
PerceiverAudioPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor
:members:
PerceiverClassificationPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor
:members:
PerceiverMultimodalPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor
:members:
PerceiverModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverModel
:members: forward
PerceiverForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForMaskedLM
:members: forward
PerceiverForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForSequenceClassification
:members: forward
PerceiverForImageClassificationLearned
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationLearned
:members: forward
PerceiverForImageClassificationFourier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationFourier
:members: forward
PerceiverForImageClassificationConvProcessing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationConvProcessing
:members: forward
PerceiverForOpticalFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForOpticalFlow
:members: forward
PerceiverForMultimodalAutoencoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForMultimodalAutoencoding
:members: forward
@@ -253,6 +253,7 @@ _import_structure = {
"models.mt5": ["MT5Config"],
"models.openai": ["OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "OpenAIGPTConfig", "OpenAIGPTTokenizer"],
"models.pegasus": ["PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP", "PegasusConfig", "PegasusTokenizer"],
"models.perceiver": ["PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PerceiverConfig", "PerceiverTokenizer"],
"models.phobert": ["PhobertTokenizer"],
"models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"],
"models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"],
@@ -502,6 +503,7 @@ if is_vision_available():
_import_structure["models.layoutlmv2"].append("LayoutLMv2FeatureExtractor")
_import_structure["models.layoutlmv2"].append("LayoutLMv2Processor")
_import_structure["models.layoutxlm"].append("LayoutXLMProcessor")
_import_structure["models.perceiver"].append("PerceiverFeatureExtractor")
_import_structure["models.segformer"].append("SegformerFeatureExtractor")
_import_structure["models.vit"].append("ViTFeatureExtractor")
else:
@@ -1144,6 +1146,21 @@ if is_torch_available():
_import_structure["models.pegasus"].extend(
["PegasusForCausalLM", "PegasusForConditionalGeneration", "PegasusModel", "PegasusPreTrainedModel"]
)
_import_structure["models.perceiver"].extend(
[
"PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST",
"PerceiverForImageClassificationConvProcessing",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationLearned",
"PerceiverForMaskedLM",
"PerceiverForMultimodalAutoencoding",
"PerceiverForOpticalFlow",
"PerceiverForSequenceClassification",
"PerceiverLayer",
"PerceiverModel",
"PerceiverPreTrainedModel",
]
)
_import_structure["models.prophetnet"].extend( _import_structure["models.prophetnet"].extend(
[ [
"PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST", "PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST",
...@@ -2263,6 +2280,7 @@ if TYPE_CHECKING: ...@@ -2263,6 +2280,7 @@ if TYPE_CHECKING:
from .models.mt5 import MT5Config from .models.mt5 import MT5Config
from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer
from .models.pegasus import PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusConfig, PegasusTokenizer from .models.pegasus import PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusConfig, PegasusTokenizer
from .models.perceiver import PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP, PerceiverConfig, PerceiverTokenizer
from .models.phobert import PhobertTokenizer
from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer
from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig
@@ -2470,6 +2488,7 @@ if TYPE_CHECKING:
from .models.imagegpt import ImageGPTFeatureExtractor
from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2Processor
from .models.layoutxlm import LayoutXLMProcessor
from .models.perceiver import PerceiverFeatureExtractor
from .models.segformer import SegformerFeatureExtractor
from .models.vit import ViTFeatureExtractor
else:
@@ -3006,6 +3025,19 @@ if TYPE_CHECKING:
PegasusModel,
PegasusPreTrainedModel,
)
from .models.perceiver import (
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverForSequenceClassification,
PerceiverLayer,
PerceiverModel,
PerceiverPreTrainedModel,
)
from .models.prophetnet import (
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST,
ProphetNetDecoder,
......
@@ -78,6 +78,7 @@ from . import (
mt5,
openai,
pegasus,
perceiver,
phobert,
prophetnet,
qdqbert,
......
@@ -37,6 +37,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("fnet", "FNetConfig"),
("segformer", "SegformerConfig"),
("vision-text-dual-encoder", "VisionTextDualEncoderConfig"),
("perceiver", "PerceiverConfig"),
("gptj", "GPTJConfig"),
("layoutlmv2", "LayoutLMv2Config"),
("beit", "BeitConfig"),
@@ -119,6 +120,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("fnet", "FNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("pegasus", "PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("perceiver", "PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlmv2", "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -194,6 +196,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("fnet", "FNet"),
("segformer", "SegFormer"),
("vision-text-dual-encoder", "VisionTextDualEncoder"),
("perceiver", "Perceiver"),
("gptj", "GPT-J"),
("beit", "BEiT"),
("rembert", "RemBERT"),
......
@@ -33,6 +33,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("fnet", "FNetModel"),
("segformer", "SegformerModel"),
("vision-text-dual-encoder", "VisionTextDualEncoderModel"),
("perceiver", "PerceiverModel"),
("gptj", "GPTJModel"),
("layoutlmv2", "LayoutLMv2Model"),
("beit", "BeitModel"),
@@ -247,6 +248,14 @@ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
("beit", "BeitForImageClassification"),
("segformer", "SegformerForImageClassification"),
("imagegpt", "ImageGPTForImageClassification"),
(
"perceiver",
(
"PerceiverForImageClassificationLearned",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationConvProcessing",
),
),
]
)
@@ -266,6 +275,7 @@ MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = OrderedDict(
MODEL_FOR_MASKED_LM_MAPPING_NAMES = OrderedDict(
[
# Model for Masked LM mapping
("perceiver", "PerceiverForMaskedLM"),
("qdqbert", "QDQBertForMaskedLM"), ("qdqbert", "QDQBertForMaskedLM"),
("fnet", "FNetForMaskedLM"), ("fnet", "FNetForMaskedLM"),
("rembert", "RemBertForMaskedLM"), ("rembert", "RemBertForMaskedLM"),
...@@ -337,6 +347,7 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict( ...@@ -337,6 +347,7 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict( MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
[ [
# Model for Sequence Classification mapping # Model for Sequence Classification mapping
("perceiver", "PerceiverForSequenceClassification"),
("qdqbert", "QDQBertForSequenceClassification"), ("qdqbert", "QDQBertForSequenceClassification"),
("fnet", "FNetForSequenceClassification"), ("fnet", "FNetForSequenceClassification"),
("gptj", "GPTJForSequenceClassification"), ("gptj", "GPTJForSequenceClassification"),
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...file_utils import _LazyModule, is_tokenizers_available, is_torch_available, is_vision_available
_import_structure = {
"configuration_perceiver": ["PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PerceiverConfig"],
"tokenization_perceiver": ["PerceiverTokenizer"],
}
if is_vision_available():
_import_structure["feature_extraction_perceiver"] = ["PerceiverFeatureExtractor"]
if is_torch_available():
_import_structure["modeling_perceiver"] = [
"PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST",
"PerceiverForImageClassificationConvProcessing",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationLearned",
"PerceiverForMaskedLM",
"PerceiverForMultimodalAutoencoding",
"PerceiverForOpticalFlow",
"PerceiverForSequenceClassification",
"PerceiverLayer",
"PerceiverModel",
"PerceiverPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_perceiver import PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP, PerceiverConfig
from .tokenization_perceiver import PerceiverTokenizer
if is_vision_available():
from .feature_extraction_perceiver import PerceiverFeatureExtractor
if is_torch_available():
from .modeling_perceiver import (
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverForSequenceClassification,
PerceiverLayer,
PerceiverModel,
PerceiverPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
# coding=utf-8
# Copyright Deepmind and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Perceiver model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"deepmind/language-perceiver": "https://huggingface.co/deepmind/language-perceiver/resolve/main/config.json",
# See all Perceiver models at https://huggingface.co/models?filter=perceiver
}
class PerceiverConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.PerceiverModel`. It is used
to instantiate a Perceiver model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the Perceiver
`deepmind/language-perceiver <https://huggingface.co/deepmind/language-perceiver>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
num_latents (:obj:`int`, `optional`, defaults to 256):
The number of latents.
d_latents (:obj:`int`, `optional`, defaults to 1280):
Dimension of the latent embeddings.
d_model (:obj:`int`, `optional`, defaults to 768):
Dimension of the inputs.
num_blocks (:obj:`int`, `optional`, defaults to 1):
Number of blocks in the Transformer encoder.
num_self_attends_per_block (:obj:`int`, `optional`, defaults to 26):
The number of self-attention layers per block.
num_self_attention_heads (:obj:`int`, `optional`, defaults to 8):
Number of attention heads for each self-attention layer in the Transformer encoder.
num_cross_attention_heads (:obj:`int`, `optional`, defaults to 8):
Number of attention heads for each cross-attention layer in the Transformer encoder.
qk_channels (:obj:`int`, `optional`):
Dimension to project the queries + keys before applying attention in the cross-attention and self-attention
layers of the encoder. Will default to preserving the dimension of the queries if not specified.
v_channels (:obj:`int`, `optional`):
Dimension to project the values before applying attention in the cross-attention and self-attention layers
of the encoder. Will default to preserving the dimension of the queries if not specified.
cross_attention_shape_for_attention (:obj:`str`, `optional`, defaults to :obj:`'kv'`):
Dimension to use when downsampling the queries and keys in the cross-attention layer of the encoder.
self_attention_widening_factor (:obj:`int`, `optional`, defaults to 1):
Widening factor of the feed-forward layer in the self-attention layers of the Transformer encoder.
cross_attention_widening_factor (:obj:`int`, `optional`, defaults to 1):
Widening factor of the feed-forward layer in the cross-attention layer of the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
use_query_residual (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to add a query residual in the cross-attention layer of the encoder.
vocab_size (:obj:`int`, `optional`, defaults to 262):
Vocabulary size for the masked language modeling model.
max_position_embeddings (:obj:`int`, `optional`, defaults to 2048):
The maximum sequence length that the masked language modeling model might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048).
image_size (:obj:`int`, `optional`, defaults to 56):
Size of the images after preprocessing, for :class:`~transformers.PerceiverForImageClassificationLearned`.
train_size (:obj:`List[int]`, `optional`, defaults to [368, 496]):
Training size of the images for the optical flow model.
num_frames (:obj:`int`, `optional`, defaults to 16):
Number of video frames used for the multimodal autoencoding model.
audio_samples_per_frame (:obj:`int`, `optional`, defaults to 1920):
Number of audio samples per frame for the multimodal autoencoding model.
samples_per_patch (:obj:`int`, `optional`, defaults to 16):
Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model.
output_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[1, 16, 224, 224]`):
Shape of the output for the multimodal autoencoding model.
Example::
>>> from transformers import PerceiverModel, PerceiverConfig
>>> # Initializing a Perceiver deepmind/language-perceiver style configuration
>>> configuration = PerceiverConfig()
>>> # Initializing a model from the deepmind/language-perceiver style configuration
>>> model = PerceiverModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "perceiver"
def __init__(
self,
num_latents=256,
d_latents=1280,
d_model=768,
num_blocks=1,
num_self_attends_per_block=26,
num_self_attention_heads=8,
num_cross_attention_heads=8,
qk_channels=None,
v_channels=None,
cross_attention_shape_for_attention="kv",
self_attention_widening_factor=1,
cross_attention_widening_factor=1,
hidden_act="gelu",
attention_probs_dropout_prob=0.1,
position_embedding_init_scale=0.02,
initializer_range=0.02,
layer_norm_eps=1e-12,
is_encoder_decoder=False,
use_query_residual=True,
vocab_size=262,
max_position_embeddings=2048,
image_size=56,
train_size=[368, 496],
num_frames=16,
audio_samples_per_frame=1920,
samples_per_patch=16,
output_shape=[1, 16, 224, 224],
**kwargs
):
super().__init__(**kwargs)
self.num_latents = num_latents
self.d_latents = d_latents
self.d_model = d_model
self.num_blocks = num_blocks
self.num_self_attends_per_block = num_self_attends_per_block
self.num_self_attention_heads = num_self_attention_heads
self.num_cross_attention_heads = num_cross_attention_heads
self.qk_channels = qk_channels
self.v_channels = v_channels
self.cross_attention_shape_for_attention = cross_attention_shape_for_attention
self.self_attention_widening_factor = self_attention_widening_factor
self.cross_attention_widening_factor = cross_attention_widening_factor
self.hidden_act = hidden_act
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.use_query_residual = use_query_residual
# masked language modeling attributes
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
# image classification attributes
self.image_size = image_size
# flow attributes
self.train_size = train_size
# multimodal autoencoding attributes
self.num_frames = num_frames
self.audio_samples_per_frame = audio_samples_per_frame
self.samples_per_patch = samples_per_patch
self.output_shape = output_shape
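As a quick illustration of how these attributes combine, here is a minimal sketch (assuming PyTorch is installed; the reduced latent sizes are arbitrary toy values, not recommended settings):
from transformers import PerceiverConfig, PerceiverModel
# Start from the default (deepmind/language-perceiver style) configuration and shrink the latent array
configuration = PerceiverConfig(num_latents=64, d_latents=256, num_self_attends_per_block=4)
model = PerceiverModel(configuration)
assert model.config.num_latents == 64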
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for Perceiver."""
from typing import Optional, Union
import numpy as np
from PIL import Image
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType
from ...image_utils import (
IMAGENET_DEFAULT_MEAN,
IMAGENET_DEFAULT_STD,
ImageFeatureExtractionMixin,
ImageInput,
is_torch_tensor,
)
from ...utils import logging
logger = logging.get_logger(__name__)
class PerceiverFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
r"""
Constructs a Perceiver feature extractor.
This feature extractor inherits from :class:`~transformers.ImageFeatureExtractionMixin` which contains most of the
main methods. Users should refer to this superclass for more information regarding those methods.
Args:
do_center_crop (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to crop the input at the center. The crop keeps a central square whose side equals :obj:`size /
crop_size` times the shorter edge of the image (87.5% of the shorter edge with the default values).
crop_size (:obj:`int`, `optional`, defaults to 256):
Reference size used, together with :obj:`size`, to compute the center-crop fraction. Only has an effect if
:obj:`do_center_crop` is set to :obj:`True`.
do_resize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to resize the input to a certain :obj:`size`.
size (:obj:`int` or :obj:`Tuple(int)`, `optional`, defaults to 224):
Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an
integer is provided, then the input will be resized to (size, size). Only has an effect if :obj:`do_resize`
is set to :obj:`True`.
resample (:obj:`int`, `optional`, defaults to :obj:`PIL.Image.BICUBIC`):
An optional resampling filter. This can be one of :obj:`PIL.Image.NEAREST`, :obj:`PIL.Image.BOX`,
:obj:`PIL.Image.BILINEAR`, :obj:`PIL.Image.HAMMING`, :obj:`PIL.Image.BICUBIC` or :obj:`PIL.Image.LANCZOS`.
Only has an effect if :obj:`do_resize` is set to :obj:`True`.
do_normalize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to normalize the input with :obj:`image_mean` and :obj:`image_std`.
image_mean (:obj:`List[float]`, `optional`, defaults to :obj:`[0.485, 0.456, 0.406]`):
The sequence of means for each channel, to be used when normalizing images.
image_std (:obj:`List[float]`, `optional`, defaults to :obj:`[0.229, 0.224, 0.225]`):
The sequence of standard deviations for each channel, to be used when normalizing images.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_center_crop=True,
crop_size=256,
do_resize=True,
size=224,
resample=Image.BICUBIC,
do_normalize=True,
image_mean=None,
image_std=None,
**kwargs
):
super().__init__(**kwargs)
self.do_center_crop = do_center_crop
self.crop_size = crop_size
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
def center_crop(self, image):
"""
Center-crops :obj:`image` to a square whose side equals :obj:`self.size / self.crop_size` times the shorter
edge of the image (87.5% of the shorter edge with the default values).
Args:
image (:obj:`PIL.Image.Image` or :obj:`np.ndarray` or :obj:`torch.Tensor`):
The image to center crop.
"""
if isinstance(image, Image.Image):
image = self.to_numpy_array(image)
image_height, image_width = image.shape[-2:]
padded_center_crop_size = (
(self.size / (self.crop_size)) * np.minimum(image_height, image_width).astype(np.float32)
).astype(np.int32)
offset_height = ((image_height - padded_center_crop_size) + 1) // 2
offset_width = ((image_width - padded_center_crop_size) + 1) // 2
crop_window = [offset_height, offset_width, padded_center_crop_size, padded_center_crop_size]
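# Example with the defaults (size=224, crop_size=256): a 480x640 image keeps a 420x420 window
# (int(0.875 * 480) = 420) starting at offsets (30, 110), i.e. centered on the image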
image = image[
:, crop_window[0] : crop_window[0] + crop_window[2], crop_window[1] : crop_window[1] + crop_window[3]
]
return image
def __call__(
self, images: ImageInput, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs
) -> BatchFeature:
"""
Main method to prepare for the model one or several image(s).
.. warning::
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so the most efficient is to pass
PIL images.
Args:
images (:obj:`PIL.Image.Image`, :obj:`np.ndarray`, :obj:`torch.Tensor`, :obj:`List[PIL.Image.Image]`, :obj:`List[np.ndarray]`, :obj:`List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In the case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is
the number of channels and H and W are the image height and width.
return_tensors (:obj:`str` or :class:`~transformers.file_utils.TensorType`, `optional`):
If set, will return tensors of a particular framework. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return NumPy :obj:`np.ndarray` objects.
* :obj:`'jax'`: Return JAX :obj:`jnp.ndarray` objects.
Returns:
:class:`~transformers.BatchFeature`: A :class:`~transformers.BatchFeature` with the following fields:
- **pixel_values** -- Pixel values to be fed to a model, of shape (batch_size, num_channels, height,
width).
"""
# Input type checking for clearer error
valid_images = False
# Check that images has a valid type
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
valid_images = True
elif isinstance(images, (list, tuple)):
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
valid_images = True
if not valid_images:
raise ValueError(
"Images must of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example),"
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
)
is_batched = bool(
isinstance(images, (list, tuple))
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
)
if not is_batched:
images = [images]
# transformations (center cropping + resizing + normalization)
if self.do_center_crop and self.crop_size is not None:
images = [self.center_crop(image) for image in images]
if self.do_resize and self.size is not None and self.resample is not None:
images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
if self.do_normalize:
images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
# return as BatchFeature
data = {"pixel_values": images}
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs
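For reference, a minimal usage sketch of the preprocessing pipeline defined above (the random PIL image is only a stand-in for real data; any RGB image is handled the same way):
import numpy as np
from PIL import Image
from transformers import PerceiverFeatureExtractor
feature_extractor = PerceiverFeatureExtractor()  # defaults: center crop, resize to 224, ImageNet normalization
image = Image.fromarray(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])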
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization class for Perceiver."""
from typing import Dict, List, Optional, Tuple
from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging
logger = logging.get_logger(__name__)
class PerceiverTokenizer(PreTrainedTokenizer):
"""
Construct a Perceiver tokenizer. The Perceiver simply uses raw bytes (UTF-8 encoding), so no vocabulary file is required.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"[BOS]"`):
The BOS token (reserved in the vocab, but not actually used).
eos_token (:obj:`str`, `optional`, defaults to :obj:`"[EOS]"`):
The end of sequence token (reserved in the vocab, but not actually used).
.. note::
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the :obj:`sep_token`.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The MASK token, useful for masked language modeling.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The CLS token (reserved in the vocab, but not actually used).
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from two sequences.
"""
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
pad_token="[PAD]",
bos_token="[BOS]",
eos_token="[EOS]",
mask_token="[MASK]",
cls_token="[CLS]",
sep_token="[SEP]",
model_max_length=2048,
**kwargs
) -> None:
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
mask_token = AddedToken(mask_token, lstrip=False, rstrip=False) if isinstance(mask_token, str) else mask_token
cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
super().__init__(
pad_token=pad_token,
bos_token=bos_token,
eos_token=eos_token,
mask_token=mask_token,
cls_token=cls_token,
sep_token=sep_token,
model_max_length=model_max_length,
**kwargs,
)
self._utf_vocab_size = 2 ** 8  # UTF-8 text is a sequence of bytes, so there are 256 possible byte values
# define special tokens dict
self.special_tokens_encoder: Dict[str, int] = {
self.pad_token: 0,
self.bos_token: 1,
self.eos_token: 2,
self.mask_token: 3,
self.cls_token: 4,
self.sep_token: 5,
}
self._num_special_tokens = len(self.special_tokens_encoder)
self.special_tokens_decoder: Dict[int, str] = {v: k for k, v in self.special_tokens_encoder.items()}
@property
def vocab_size(self):
return self._utf_vocab_size + self._num_special_tokens
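# 256 byte values + 6 special tokens = 262, which matches the default ``vocab_size`` of PerceiverConfig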
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
)
# normal case: some special tokens
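# e.g. a single sequence of 3 tokens yields [1, 0, 0, 0, 1]: the [CLS] and [SEP] positions are marked as special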
if token_ids_1 is None:
return [1] + [0] * len(token_ids_0) + [1]
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequences for sequence classification tasks. A sequence has the
following format:
- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
else:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + token_ids_1 + [self.sep_token_id]
def _tokenize(self, text: str) -> List[str]:
"""Take as input a string and return a list of strings (tokens) for words/sub-words"""
tokens = [chr(i) for i in text.encode("utf-8")]
return tokens
def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
if token in self.special_tokens_encoder:
token_id = self.special_tokens_encoder[token]
elif token in self.added_tokens_encoder:
token_id = self.added_tokens_encoder[token]
elif len(token) != 1:
token_id = self.unk_token_id
else:
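# single-character (byte) token: shift its codepoint past the 6 special tokens, e.g. "a" (97) becomes 97 + 6 = 103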
token_id = ord(token) + self._num_special_tokens
return token_id
def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
if index in self.special_tokens_decoder:
token = self.special_tokens_decoder[index]
elif index in self.added_tokens_decoder:
token = self.added_tokens_decoder[index]
else:
token = chr(index - self._num_special_tokens)
return token
def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (string) in a single string."""
bstring = b""
for token in tokens:
if token in self.special_tokens_decoder:
tok_string = self.special_tokens_decoder[token].encode("utf-8")
elif token in self.added_tokens_decoder:
tok_string = self.added_tokens_decoder[token].encode("utf-8")
elif token in self.special_tokens_encoder:
tok_string = token.encode("utf-8")
elif token in self.added_tokens_encoder:
tok_string = token.encode("utf-8")
else:
tok_string = bytes([ord(token)])
bstring += tok_string
string = bstring.decode("utf-8", errors="replace")
return string
# PerceiverTokenizer has no vocab file
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
return ()
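To make the byte-level scheme concrete, here is a minimal usage sketch (no pretrained files are needed since the tokenizer has no vocab file; the ids follow the "byte value + 6" rule from ``_convert_token_to_id``):
from transformers import PerceiverTokenizer
tokenizer = PerceiverTokenizer()
print(tokenizer.vocab_size)  # 262 = 256 byte values + 6 special tokens
encoding = tokenizer("hi")  # special tokens are added by default
print(encoding["input_ids"])  # [4, 110, 111, 5]: [CLS], "h" (104 + 6), "i" (105 + 6), [SEP]
print(tokenizer.decode(encoding["input_ids"], skip_special_tokens=True))  # "hi"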
...@@ -3753,6 +3753,87 @@ class PegasusPreTrainedModel:
requires_backends(self, ["torch"])
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST = None
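# Dummy placeholder objects: in environments without PyTorch, instantiating any of the classes below raises an
# informative error through requires_backends instead of failing at import time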
class PerceiverForImageClassificationConvProcessing:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForImageClassificationFourier:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForImageClassificationLearned:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForMaskedLM:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForMultimodalAutoencoding:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForOpticalFlow:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForSequenceClassification:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverLayer:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverPreTrainedModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST = None
...
...@@ -64,6 +64,11 @@ class LayoutXLMProcessor:
requires_backends(cls, ["vision"])
class PerceiverFeatureExtractor:
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class SegformerFeatureExtractor:
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
...