Unverified Commit ac227093 authored by NielsRogge, committed by GitHub

Add ViLT (#14895)



* First commit

* Add conversion script

* Make conversion script work for base model

* More improvements

* Update conversion script, works for vqa

* Add indexing argument to meshgrid

* Make conversion script work for ViltForPreTraining

* Add ViltForPreTraining to docs

* Fix device issue

* Add processor

* Add MinMaxResize to feature extractor

* Implement call method of ViltProcessor

* Fix tests

* Add integration test

* Add loss calculation for VQA

* Improve tests

* Improve some more tests

* Debug tests

* Small improvements

* Add support for attention_mask

* Remove mask_it

* Add pixel_mask

* Add tests for ViltFeatureExtractor

* Improve tests

* Add ViltForNaturalLanguageVisualReasoning

* Add ViltForNaturalLanguageVisualReasoning to conversion script

* Minor fixes

* Add support for image_embeds, update docstrings to markdown

* Update docs to markdown

* Improve conversion script

* Rename ViltForPreTraining to ViltForMaskedLM

* Improve conversion script

* Convert docstrings to markdown

* Fix code example of retrieval model

* Properly convert masked language model

* Add integration test for nlvr

* Fix code quality

* Apply suggestions from code review

* Add copied from statements

* Fix pretrained_config_archive_map

* Fix docs

* Add model to README

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply more suggestions from code review

* Make code more readable

* Add ViltForNaturalLanguageVisualReasoning to the tests

* Rename ViltForVisualQuestionAnswering to ViltForQuestionAnswering

* Replace pixel_values_2 by single tensor

* Add hidden_states and attentions

* Fix one more test

* Fix all tests

* Update year

* Fix rebase issues

* Fix another rebase issue

* Remove ViltForPreTraining from auto mapping

* Rename ViltForImageRetrievalTextRetrieval to ViltForImageAndTextRetrieval

* Make it possible to use BertTokenizerFast in the processor

* Use BertTokenizerFast by default

* Rename ViltForNaturalLanguageVisualReasoning, define custom model output
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 691878ee
......@@ -311,6 +311,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
......
......@@ -289,6 +289,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
......
......@@ -313,6 +313,7 @@ conda install -c huggingface transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
......
......@@ -325,6 +325,7 @@ conda install -c huggingface transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
......
......@@ -282,6 +282,8 @@
title: UniSpeech
- local: model_doc/unispeech-sat
title: UniSpeech-SAT
- local: model_doc/vilt
title: ViLT
- local: model_doc/vision-encoder-decoder
title: Vision Encoder Decoder Models
- local: model_doc/vision-text-dual-encoder
......
......@@ -170,6 +170,7 @@ conversion utilities for the following models.
1. **[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
......@@ -266,6 +267,7 @@ Flax), PyTorch, and/or TensorFlow.
| TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ |
| ViLT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Vision Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ |
| VisualBert | ❌ | ❌ | ✅ | ❌ | ❌ |
......
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ViLT
## Overview
The ViLT model was proposed in [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334)
by Wonjae Kim, Bokyung Son, Ildoo Kim. ViLT incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design
for Vision-and-Language Pre-training (VLP).
The abstract from the paper is the following:
*Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.
Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision
(e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we
find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more
computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive
power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model,
Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically
simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of
times faster than previous VLP models, yet with competitive or better downstream task performance.*
Tips:
- ViLT is a model that takes both `pixel_values` and `input_ids` as input. One can use [`ViltProcessor`] to prepare data for the model.
This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one; see the example below.
- ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to
under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates
which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you.
- The design of ViLT is very similar to that of a standard Vision Transformer (ViT). The only difference is that the model includes
additional embedding layers for the language modality.
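A minimal sketch of visual question answering with ViLT is shown below. It assumes the `dandelin/vilt-b32-finetuned-vqa` checkpoint released by the original authors is available on the Hub; any other fine-tuned ViLT VQA checkpoint would work the same way.

```python
from PIL import Image
import requests

from transformers import ViltForQuestionAnswering, ViltProcessor

# Assumed checkpoint name; swap in any ViLT VQA checkpoint you have access to.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "How many cats are there?"

# The processor tokenizes the text and resizes, normalizes and pads the image in one call.
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
predicted_idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_idx])
```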
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vilt_architecture.jpg"
alt="drawing" width="600"/>
<small> ViLT architecture. Taken from the <a href="https://arxiv.org/abs/2102.03334">original paper</a>. </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/dandelin/ViLT).
## ViltConfig
[[autodoc]] ViltConfig
## ViltFeatureExtractor
[[autodoc]] ViltFeatureExtractor
- __call__
## ViltProcessor
[[autodoc]] ViltProcessor
- __call__
## ViltModel
[[autodoc]] ViltModel
- forward
## ViltForMaskedLM
[[autodoc]] ViltForMaskedLM
- forward
## ViltForQuestionAnswering
[[autodoc]] ViltForQuestionAnswering
- forward
## ViltForImagesAndTextClassification
[[autodoc]] ViltForImagesAndTextClassification
- forward
## ViltForImageAndTextRetrieval
[[autodoc]] ViltForImageAndTextRetrieval
- forward
......@@ -308,6 +308,7 @@ _import_structure = {
"UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"UniSpeechSatConfig",
],
"models.vilt": ["VILT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViltConfig", "ViltFeatureExtractor", "ViltProcessor"],
"models.vision_encoder_decoder": ["VisionEncoderDecoderConfig"],
"models.vision_text_dual_encoder": ["VisionTextDualEncoderConfig", "VisionTextDualEncoderProcessor"],
"models.visual_bert": ["VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VisualBertConfig"],
......@@ -514,6 +515,8 @@ if is_vision_available():
_import_structure["models.layoutxlm"].append("LayoutXLMProcessor")
_import_structure["models.perceiver"].append("PerceiverFeatureExtractor")
_import_structure["models.segformer"].append("SegformerFeatureExtractor")
_import_structure["models.vilt"].append("ViltFeatureExtractor")
_import_structure["models.vilt"].append("ViltProcessor")
_import_structure["models.vit"].append("ViTFeatureExtractor")
else:
from .utils import dummy_vision_objects
......@@ -629,7 +632,6 @@ if is_torch_available():
_import_structure["modeling_utils"] = ["Conv1D", "PreTrainedModel", "apply_chunking_to_forward", "prune_layer"]
# PyTorch models structure
_import_structure["models.albert"].extend(
[
"ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
......@@ -1382,6 +1384,18 @@ if is_torch_available():
"UniSpeechSatPreTrainedModel",
]
)
_import_structure["models.vilt"].extend(
[
"VILT_PRETRAINED_MODEL_ARCHIVE_LIST",
"ViltForImageAndTextRetrieval",
"ViltForImagesAndTextClassification",
"ViltForMaskedLM",
"ViltForQuestionAnswering",
"ViltLayer",
"ViltModel",
"ViltPreTrainedModel",
]
)
_import_structure["models.vision_encoder_decoder"].extend(["VisionEncoderDecoderModel"])
_import_structure["models.vision_text_dual_encoder"].extend(["VisionTextDualEncoderModel"])
_import_structure["models.visual_bert"].extend(
......@@ -2409,6 +2423,7 @@ if TYPE_CHECKING:
from .models.trocr import TROCR_PRETRAINED_CONFIG_ARCHIVE_MAP, TrOCRConfig, TrOCRProcessor
from .models.unispeech import UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechConfig
from .models.unispeech_sat import UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechSatConfig
from .models.vilt import VILT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViltConfig, ViltFeatureExtractor, ViltProcessor
from .models.vision_encoder_decoder import VisionEncoderDecoderConfig
from .models.vision_text_dual_encoder import VisionTextDualEncoderConfig, VisionTextDualEncoderProcessor
from .models.visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig
......@@ -2585,6 +2600,7 @@ if TYPE_CHECKING:
from .models.layoutxlm import LayoutXLMProcessor
from .models.perceiver import PerceiverFeatureExtractor
from .models.segformer import SegformerFeatureExtractor
from .models.vilt import ViltFeatureExtractor, ViltProcessor
from .models.vit import ViTFeatureExtractor
else:
from .utils.dummy_vision_objects import *
......@@ -3302,6 +3318,16 @@ if TYPE_CHECKING:
UniSpeechSatModel,
UniSpeechSatPreTrainedModel,
)
from .models.vilt import (
VILT_PRETRAINED_MODEL_ARCHIVE_LIST,
ViltForImageAndTextRetrieval,
ViltForImagesAndTextClassification,
ViltForMaskedLM,
ViltForQuestionAnswering,
ViltLayer,
ViltModel,
ViltPreTrainedModel,
)
from .models.vision_encoder_decoder import VisionEncoderDecoderModel
from .models.vision_text_dual_encoder import VisionTextDualEncoderModel
from .models.visual_bert import (
......
......@@ -104,6 +104,7 @@ from . import (
trocr,
unispeech,
unispeech_sat,
vilt,
vision_encoder_decoder,
vision_text_dual_encoder,
visual_bert,
......
......@@ -30,6 +30,7 @@ logger = logging.get_logger(__name__)
CONFIG_MAPPING_NAMES = OrderedDict(
[
# Add configs here
("vilt", "ViltConfig"),
("vit_mae", "ViTMAEConfig"),
("realm", "RealmConfig"),
("nystromformer", "NystromformerConfig"),
......@@ -119,6 +120,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
[
# Add archive maps here
("vilt", "VILT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("realm", "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("nystromformer", "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
......@@ -196,6 +198,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
MODEL_NAMES_MAPPING = OrderedDict(
[
# Add full (and cased) model names here
("vilt", "ViLT"),
("vit_mae", "ViTMAE"),
("realm", "Realm"),
("nystromformer", "Nystromformer"),
......
......@@ -28,6 +28,7 @@ logger = logging.get_logger(__name__)
MODEL_MAPPING_NAMES = OrderedDict(
[
# Base model mapping
("vilt", "ViltModel"),
("vit_mae", "ViTMAEModel"),
("nystromformer", "NystromformerModel"),
("imagegpt", "ImageGPTModel"),
......
......@@ -297,12 +297,6 @@ class DeiTLayer(nn.Module):
# in DeiT, layernorm is also applied after self-attention
layer_output = self.layernorm_after(hidden_states)
# TODO feedforward chunking not working for now
# layer_output = apply_chunking_to_forward(
# self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, layer_output
# )
layer_output = self.intermediate(layer_output)
# second residual connection is done here
......@@ -312,11 +306,6 @@ class DeiTLayer(nn.Module):
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output)
return layer_output
# Copied from transformers.models.vit.modeling_vit.ViTEncoder with ViT->DeiT
class DeiTEncoder(nn.Module):
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
# rely on isort to merge the imports
from ...file_utils import _LazyModule, is_torch_available, is_vision_available
_import_structure = {
"configuration_vilt": ["VILT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViltConfig"],
}
if is_vision_available():
_import_structure["feature_extraction_vilt"] = ["ViltFeatureExtractor"]
_import_structure["processing_vilt"] = ["ViltProcessor"]
if is_torch_available():
_import_structure["modeling_vilt"] = [
"VILT_PRETRAINED_MODEL_ARCHIVE_LIST",
"ViltForImageAndTextRetrieval",
"ViltForImagesAndTextClassification",
"ViltForMaskedLM",
"ViltForQuestionAnswering",
"ViltLayer",
"ViltModel",
"ViltPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_vilt import VILT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViltConfig
if is_vision_available():
from .feature_extraction_vilt import ViltFeatureExtractor
from .processing_vilt import ViltProcessor
if is_torch_available():
from .modeling_vilt import (
VILT_PRETRAINED_MODEL_ARCHIVE_LIST,
ViltForImageAndTextRetrieval,
ViltForImagesAndTextClassification,
ViltForMaskedLM,
ViltForQuestionAnswering,
ViltLayer,
ViltModel,
ViltPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VilT model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
VILT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
# TODO
}
class ViltConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`ViltModel`]. It is used to instantiate a ViLT
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the ViLT
[dandelin/vilt-b32-mlm](https://huggingface.co/dandelin/vilt-b32-mlm) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the text part of the model. Defines the number of different tokens that can be
represented by the `input_ids` passed when calling [`ViltModel`].
type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the `token_type_ids` passed when calling [`ViltModel`]. This is used when encoding
text.
modality_type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the modalities passed when calling [`ViltModel`]. This is used after concatenating the
embeddings of the text and image modalities.
max_position_embeddings (`int`, *optional*, defaults to 40):
The maximum sequence length that this model might ever be used with.
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
image_size (`int`, *optional*, defaults to 384):
The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to 32):
The size (resolution) of each patch.
num_channels (`int`, *optional*, defaults to 3):
The number of input channels.
qkv_bias (`bool`, *optional*, defaults to `True`):
Whether to add a bias to the queries, keys and values.
max_image_length (`int`, *optional*, defaults to -1):
The maximum number of patches to take as input for the Transformer encoder. If set to a positive integer,
the encoder will sample `max_image_length` patches at maximum. If set to -1, will not be taken into
account.
num_images (`int`, *optional*, defaults to -1):
The number of images to use for natural language visual reasoning. If set to a positive integer, will be
used by [`ViltForImagesAndTextClassification`] for defining the classifier head.
Example:
```python
>>> from transformers import ViltModel, ViltConfig
>>> # Initializing a ViLT dandelin/vilt-b32-mlm style configuration
>>> configuration = ViltConfig()
>>> # Initializing a model from the dandelin/vilt-b32-mlm style configuration
>>> model = ViltModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "vilt"
def __init__(
self,
vocab_size=30522,
type_vocab_size=2,
modality_type_vocab_size=2,
max_position_embeddings=40,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.0,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
layer_norm_eps=1e-12,
is_encoder_decoder=False,
image_size=384,
patch_size=32,
num_channels=3,
qkv_bias=True,
max_image_length=-1,
tie_word_embeddings=False,
num_images=-1,
**kwargs
):
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
self.vocab_size = vocab_size
self.type_vocab_size = type_vocab_size
self.modality_type_vocab_size = modality_type_vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.image_size = image_size
self.patch_size = patch_size
self.num_channels = num_channels
self.qkv_bias = qkv_bias
self.max_image_length = max_image_length
self.num_images = num_images
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert ViLT checkpoints from the original Github repository."""
import argparse
import json
from pathlib import Path
import torch
from PIL import Image
import requests
from huggingface_hub import cached_download, hf_hub_url
from transformers import (
BertTokenizer,
ViltConfig,
ViltFeatureExtractor,
ViltForImageAndTextRetrieval,
ViltForImagesAndTextClassification,
ViltForMaskedLM,
ViltForQuestionAnswering,
ViltProcessor,
)
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
# here we list all keys to be renamed (original name on the left, our name on the right)
def create_rename_keys(config, vqa_model=False, nlvr_model=False, irtr_model=False):
rename_keys = []
for i in range(config.num_hidden_layers):
# encoder layers: output projection, 2 feedforward neural networks and 2 layernorms
rename_keys.append((f"transformer.blocks.{i}.norm1.weight", f"vilt.encoder.layer.{i}.layernorm_before.weight"))
rename_keys.append((f"transformer.blocks.{i}.norm1.bias", f"vilt.encoder.layer.{i}.layernorm_before.bias"))
rename_keys.append(
(f"transformer.blocks.{i}.attn.proj.weight", f"vilt.encoder.layer.{i}.attention.output.dense.weight")
)
rename_keys.append(
(f"transformer.blocks.{i}.attn.proj.bias", f"vilt.encoder.layer.{i}.attention.output.dense.bias")
)
rename_keys.append((f"transformer.blocks.{i}.norm2.weight", f"vilt.encoder.layer.{i}.layernorm_after.weight"))
rename_keys.append((f"transformer.blocks.{i}.norm2.bias", f"vilt.encoder.layer.{i}.layernorm_after.bias"))
rename_keys.append(
(f"transformer.blocks.{i}.mlp.fc1.weight", f"vilt.encoder.layer.{i}.intermediate.dense.weight")
)
rename_keys.append((f"transformer.blocks.{i}.mlp.fc1.bias", f"vilt.encoder.layer.{i}.intermediate.dense.bias"))
rename_keys.append((f"transformer.blocks.{i}.mlp.fc2.weight", f"vilt.encoder.layer.{i}.output.dense.weight"))
rename_keys.append((f"transformer.blocks.{i}.mlp.fc2.bias", f"vilt.encoder.layer.{i}.output.dense.bias"))
# embeddings
rename_keys.extend(
[
# text embeddings
("text_embeddings.word_embeddings.weight", "vilt.embeddings.text_embeddings.word_embeddings.weight"),
(
"text_embeddings.position_embeddings.weight",
"vilt.embeddings.text_embeddings.position_embeddings.weight",
),
("text_embeddings.position_ids", "vilt.embeddings.text_embeddings.position_ids"),
(
"text_embeddings.token_type_embeddings.weight",
"vilt.embeddings.text_embeddings.token_type_embeddings.weight",
),
("text_embeddings.LayerNorm.weight", "vilt.embeddings.text_embeddings.LayerNorm.weight"),
("text_embeddings.LayerNorm.bias", "vilt.embeddings.text_embeddings.LayerNorm.bias"),
# patch embeddings
("transformer.cls_token", "vilt.embeddings.cls_token"),
("transformer.patch_embed.proj.weight", "vilt.embeddings.patch_embeddings.projection.weight"),
("transformer.patch_embed.proj.bias", "vilt.embeddings.patch_embeddings.projection.bias"),
("transformer.pos_embed", "vilt.embeddings.position_embeddings"),
# token type embeddings
("token_type_embeddings.weight", "vilt.embeddings.token_type_embeddings.weight"),
]
)
# final layernorm + pooler
rename_keys.extend(
[
("transformer.norm.weight", "vilt.layernorm.weight"),
("transformer.norm.bias", "vilt.layernorm.bias"),
("pooler.dense.weight", "vilt.pooler.dense.weight"),
("pooler.dense.bias", "vilt.pooler.dense.bias"),
]
)
# classifier head(s)
if vqa_model:
# classification head
rename_keys.extend(
[
("vqa_classifier.0.weight", "classifier.0.weight"),
("vqa_classifier.0.bias", "classifier.0.bias"),
("vqa_classifier.1.weight", "classifier.1.weight"),
("vqa_classifier.1.bias", "classifier.1.bias"),
("vqa_classifier.3.weight", "classifier.3.weight"),
("vqa_classifier.3.bias", "classifier.3.bias"),
]
)
elif nlvr_model:
# classification head
rename_keys.extend(
[
("nlvr2_classifier.0.weight", "classifier.0.weight"),
("nlvr2_classifier.0.bias", "classifier.0.bias"),
("nlvr2_classifier.1.weight", "classifier.1.weight"),
("nlvr2_classifier.1.bias", "classifier.1.bias"),
("nlvr2_classifier.3.weight", "classifier.3.weight"),
("nlvr2_classifier.3.bias", "classifier.3.bias"),
]
)
else:
pass
return rename_keys
# we split up the matrix of each encoder layer into queries, keys and values
def read_in_q_k_v(state_dict, config):
for i in range(config.num_hidden_layers):
prefix = "vilt."
# read in weights + bias of input projection layer (in timm, this is a single matrix + bias)
in_proj_weight = state_dict.pop(f"transformer.blocks.{i}.attn.qkv.weight")
in_proj_bias = state_dict.pop(f"transformer.blocks.{i}.attn.qkv.bias")
# next, add query, keys and values (in that order) to the state dict
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.query.weight"] = in_proj_weight[
: config.hidden_size, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.query.bias"] = in_proj_bias[: config.hidden_size]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.key.weight"] = in_proj_weight[
config.hidden_size : config.hidden_size * 2, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.key.bias"] = in_proj_bias[
config.hidden_size : config.hidden_size * 2
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.value.weight"] = in_proj_weight[
-config.hidden_size :, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.value.bias"] = in_proj_bias[-config.hidden_size :]
def remove_classification_head_(state_dict):
ignore_keys = ["head.weight", "head.bias"]
for k in ignore_keys:
state_dict.pop(k, None)
def rename_key(dct, old, new):
val = dct.pop(old)
dct[new] = val
@torch.no_grad()
def convert_vilt_checkpoint(checkpoint_url, pytorch_dump_folder_path):
"""
Copy/paste/tweak model's weights to our ViLT structure.
"""
# define configuration and initialize HuggingFace model
config = ViltConfig(image_size=384, patch_size=32, tie_word_embeddings=False)
mlm_model = False
vqa_model = False
nlvr_model = False
irtr_model = False
if "vqa" in checkpoint_url:
vqa_model = True
config.num_labels = 3129
repo_id = "datasets/huggingface/label-files"
filename = "vqa2-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
model = ViltForQuestionAnswering(config)
elif "nlvr" in checkpoint_url:
nlvr_model = True
config.num_labels = 2
config.id2label = {0: "False", 1: "True"}
config.label2id = {v: k for k, v in config.id2label.items()}
config.modality_type_vocab_size = 3
model = ViltForImagesAndTextClassification(config)
elif "irtr" in checkpoint_url:
irtr_model = True
model = ViltForImageAndTextRetrieval(config)
elif "mlm_itm" in checkpoint_url:
mlm_model = True
model = ViltForMaskedLM(config)
else:
raise ValueError("Unknown model type")
# load state_dict of original model, remove and rename some keys
state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")["state_dict"]
rename_keys = create_rename_keys(config, vqa_model, nlvr_model, irtr_model)
for src, dest in rename_keys:
rename_key(state_dict, src, dest)
read_in_q_k_v(state_dict, config)
if mlm_model or irtr_model:
ignore_keys = ["itm_score.fc.weight", "itm_score.fc.bias"]
for k in ignore_keys:
state_dict.pop(k, None)
# load state dict into HuggingFace model
model.eval()
if mlm_model:
missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
assert missing_keys == ["mlm_score.decoder.bias"]
else:
model.load_state_dict(state_dict)
# Define processor
feature_extractor = ViltFeatureExtractor(size=384)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
processor = ViltProcessor(feature_extractor, tokenizer)
# Forward pass on example inputs (image + text)
if nlvr_model:
image1 = Image.open(requests.get("https://lil.nlp.cornell.edu/nlvr/exs/ex0_0.jpg", stream=True).raw)
image2 = Image.open(requests.get("https://lil.nlp.cornell.edu/nlvr/exs/ex0_0.jpg", stream=True).raw)
text = "The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing."
encoding_1 = processor(image1, text, return_tensors="pt")
encoding_2 = processor(image2, text, return_tensors="pt")
outputs = model(
input_ids=encoding_1.input_ids,
pixel_values=encoding_1.pixel_values,
pixel_values_2=encoding_2.pixel_values,
)
else:
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
if mlm_model:
text = "a bunch of [MASK] laying on a [MASK]."
else:
text = "How many cats are there?"
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
# Verify outputs
if mlm_model:
expected_shape = torch.Size([1, 11, 30522])
expected_slice = torch.tensor([-12.5061, -12.5123, -12.5174])
assert outputs.logits.shape == expected_shape
assert torch.allclose(outputs.logits[0, 0, :3], expected_slice, atol=1e-4)
# verify masked token prediction equals "cats"
predicted_id = outputs.logits[0, 4, :].argmax(-1).item()
assert tokenizer.decode([predicted_id]) == "cats"
elif vqa_model:
expected_shape = torch.Size([1, 3129])
expected_slice = torch.tensor([-15.9495, -18.1472, -10.3041])
assert outputs.logits.shape == expected_shape
assert torch.allclose(outputs.logits[0, :3], expected_slice, atol=1e-4)
# verify vqa prediction equals "2"
predicted_idx = outputs.logits.argmax(-1).item()
assert model.config.id2label[predicted_idx] == "2"
elif nlvr_model:
expected_shape = torch.Size([1, 2])
expected_slice = torch.tensor([-2.8721, 2.1291])
assert torch.allclose(outputs.logits[0, :3], expected_slice, atol=1e-4)
assert outputs.logits.shape == expected_shape
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
print(f"Saving model and processor to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
processor.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--checkpoint_url",
default="https://github.com/dandelin/ViLT/releases/download/200k/vilt_200k_mlm_itm.ckpt",
type=str,
help="URL of the checkpoint you'd like to convert.",
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
args = parser.parse_args()
convert_vilt_checkpoint(args.checkpoint_url, args.pytorch_dump_folder_path)
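As a standalone sanity check on the slicing performed by `read_in_q_k_v` above, the following sketch (hypothetical tensors, not part of the conversion script) shows how a timm-style fused `qkv` projection splits into the separate query/key/value matrices:

```python
import torch

# Hypothetical fused projection; ViLT-B/32 uses hidden_size = 768.
hidden_size = 768
fused_qkv_weight = torch.randn(3 * hidden_size, hidden_size)

# Same row slices as in read_in_q_k_v: [0:h] -> query, [h:2h] -> key, [2h:3h] -> value.
query_weight = fused_qkv_weight[:hidden_size, :]
key_weight = fused_qkv_weight[hidden_size : hidden_size * 2, :]
value_weight = fused_qkv_weight[-hidden_size:, :]

assert query_weight.shape == key_weight.shape == value_weight.shape == (hidden_size, hidden_size)
assert torch.equal(torch.cat([query_weight, key_weight, value_weight], dim=0), fused_qkv_weight)
```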
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for ViLT."""
from typing import List, Optional, Union
import numpy as np
from PIL import Image
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType, is_torch_available
from ...image_utils import (
IMAGENET_STANDARD_MEAN,
IMAGENET_STANDARD_STD,
ImageFeatureExtractionMixin,
ImageInput,
is_torch_tensor,
)
from ...utils import logging
if is_torch_available():
import torch
logger = logging.get_logger(__name__)
class ViltFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
r"""
Constructs a ViLT feature extractor.
This feature extractor inherits from [`FeatureExtractionMixin`] which contains most of the main methods. Users
should refer to this superclass for more information regarding those methods.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the input based on `size`.
size (`int`, *optional*, defaults to 384):
Resize the shorter side of the input to the given size. Should be an integer. The longer side will be
limited to under int((1333 / 800) * size) while preserving the aspect ratio. Only has an effect if
`do_resize` is set to `True`.
size_divisor (`int`, *optional*, defaults to 32):
The size by which to make sure both the height and width can be divided.
resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
An optional resampling filter. This can be one of `PIL.Image.NEAREST`, `PIL.Image.BOX`,
`PIL.Image.BILINEAR`, `PIL.Image.HAMMING`, `PIL.Image.BICUBIC` or `PIL.Image.LANCZOS`. Only has an effect
if `do_resize` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether or not to normalize the input with mean and standard deviation.
image_mean (`List[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
The sequence of means for each channel, to be used when normalizing images.
image_std (`List[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
The sequence of standard deviations for each channel, to be used when normalizing images.
"""
model_input_names = ["pixel_values", "pixel_mask"]
def __init__(
self,
do_resize=True,
size=384,
size_divisor=32,
resample=Image.BICUBIC,
do_normalize=True,
image_mean=None,
image_std=None,
**kwargs
):
super().__init__(**kwargs)
self.do_resize = do_resize
self.size = size
self.size_divisor = size_divisor
self.resample = resample
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
def _resize(self, image, shorter=800, longer=1333, size_divisor=32, resample=Image.BICUBIC):
"""
Resizes the shorter edge of `image` to `shorter` and limits the longer edge to under `longer`, while preserving
the aspect ratio. Also makes sure that both the height and width can be divided by `size_divisor`.
Based on original implementation:
https://github.com/dandelin/ViLT/blob/3db8b5035464afee84d951bf6322e1b27f1d072d/vilt/transforms/utils.py#L5
Args:
image (`PIL.Image`):
The image to resize.
shorter (`int`, *optional*, defaults to `800`):
The size to which to resize the shorter side of the image.
longer (`int`, *optional*, defaults to `1333`):
The size by which to limit the longer side of the image, while preserving the aspect ratio.
size_divisor (`int`, *optional*, defaults to `32`):
The size by which both the height and the width must be divisible.
resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
An optional resampling filter.
"""
if not isinstance(image, Image.Image):
image = self.to_pil_image(image)
w, h = image.size
min_size = shorter
max_size = longer
scale = min_size / min(w, h)
if h < w:
newh, neww = min_size, scale * w
else:
newh, neww = scale * h, min_size
if max(newh, neww) > max_size:
scale = max_size / max(newh, neww)
newh = newh * scale
neww = neww * scale
newh, neww = int(newh + 0.5), int(neww + 0.5)
newh, neww = newh // size_divisor * size_divisor, neww // size_divisor * size_divisor
return self.resize(image, size=(neww, newh), resample=resample)
def _max_by_axis(self, the_list):
# type: (List[List[int]]) -> List[int]
maxes = the_list[0]
for sublist in the_list[1:]:
for index, item in enumerate(sublist):
maxes[index] = max(maxes[index], item)
return maxes
def pad_and_create_pixel_mask(
self, pixel_values_list: List["torch.Tensor"], return_tensors: Optional[Union[str, TensorType]] = None
):
"""
Pad images up to the largest image in a batch and create a corresponding `pixel_mask`.
Args:
pixel_values_list (`List[torch.Tensor]`):
List of images (pixel values) to be padded. Each image should be a tensor of shape (C, H, W).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch `torch.Tensor`
objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **pixel_values** -- Pixel values to be fed to a model.
- **pixel_mask** -- Pixel mask to be fed to a model (when `pad_and_return_pixel_mask=True` or if
*"pixel_mask"* is in `self.model_input_names`).
"""
max_size = self._max_by_axis([list(image.shape) for image in pixel_values_list])
c, h, w = max_size
padded_images = []
pixel_mask = []
for image in pixel_values_list:
# create padded image
padded_image = np.zeros((c, h, w), dtype=np.float32)
padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
padded_images.append(padded_image)
# create pixel mask
mask = np.zeros((h, w), dtype=np.int64)
mask[: image.shape[1], : image.shape[2]] = True
pixel_mask.append(mask)
# return as BatchFeature
data = {"pixel_values": padded_images, "pixel_mask": pixel_mask}
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs
def __call__(
self,
images: ImageInput,
pad_and_return_pixel_mask: Optional[bool] = True,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs
) -> BatchFeature:
"""
Main method to prepare for the model one or several image(s).
<Tip warning={true}>
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so it is most efficient to pass
PIL images.
</Tip>
Args:
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
number of channels, H and W are image height and width.
pad_and_return_pixel_mask (`bool`, *optional*, defaults to `True`):
Whether or not to pad images up to the largest image in a batch and create a pixel mask.
If left to the default, will return a pixel mask that is:
- 1 for pixels that are real (i.e. **not masked**),
- 0 for pixels that are padding (i.e. **masked**).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*, defaults to `'np'`):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **pixel_values** -- Pixel values to be fed to a model, of shape (batch_size, num_channels, height,
width).
- **pixel_mask** -- Pixel mask to be fed to a model (when `return_pixel_mask=True` or if *"pixel_mask"* is
in `self.model_input_names`).
"""
# Input type checking for clearer error
valid_images = False
# Check that images has a valid type
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
valid_images = True
elif isinstance(images, (list, tuple)):
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
valid_images = True
if not valid_images:
raise ValueError(
"Images must of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example), "
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
)
is_batched = bool(
isinstance(images, (list, tuple))
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
)
if not is_batched:
images = [images]
# transformations (resizing + normalization)
if self.do_resize and self.size is not None:
longer = int((1333 / 800) * self.size)
images = [
self._resize(
image=image,
shorter=self.size,
longer=longer,
size_divisor=self.size_divisor,
resample=self.resample,
)
for image in images
]
if self.do_normalize:
images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
if pad_and_return_pixel_mask:
# pad images up to largest image in batch and create pixel_mask
max_size = self._max_by_axis([list(image.shape) for image in images])
c, h, w = max_size
padded_images = []
pixel_mask = []
for image in images:
# create padded image
padded_image = np.zeros((c, h, w), dtype=np.float32)
padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
padded_images.append(padded_image)
# create pixel mask
mask = np.zeros((h, w), dtype=np.int64)
mask[: image.shape[1], : image.shape[2]] = True
pixel_mask.append(mask)
images = padded_images
# return as BatchFeature
data = {}
data["pixel_values"] = images
if pad_and_return_pixel_mask:
data["pixel_mask"] = pixel_mask
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs
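A minimal sketch of the resizing and padding behaviour above, using two randomly generated images of different sizes (the sizes are arbitrary and chosen only for illustration):

```python
import numpy as np
from PIL import Image

from transformers import ViltFeatureExtractor

feature_extractor = ViltFeatureExtractor(size=384)

# Two dummy RGB images of different sizes (H, W, C).
image_small = Image.fromarray(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
image_large = Image.fromarray(np.random.randint(0, 256, (720, 540, 3), dtype=np.uint8))

encoding = feature_extractor([image_small, image_large], return_tensors="pt")

# Both images are resized (shorter edge -> 384, longer edge capped, dimensions divisible by 32),
# then padded to the largest resized image in the batch.
print(encoding.pixel_values.shape)  # (2, 3, padded_height, padded_width)
print(encoding.pixel_mask.shape)  # (2, padded_height, padded_width); 1 = real pixel, 0 = padding
```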
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for ViLT.
"""
from typing import List, Optional, Union
from transformers import BertTokenizerFast
from ...file_utils import TensorType
from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
from .feature_extraction_vilt import ViltFeatureExtractor
class ViltProcessor:
r"""
Constructs a ViLT processor which wraps a BERT tokenizer and ViLT feature extractor into a single processor.
[`ViltProcessor`] offers all the functionalities of [`ViltFeatureExtractor`] and [`BertTokenizerFast`]. See the
docstring of [`~ViltProcessor.__call__`] and [`~ViltProcessor.decode`] for more information.
Args:
feature_extractor (`ViltFeatureExtractor`):
An instance of [`ViltFeatureExtractor`]. The feature extractor is a required input.
tokenizer (`BertTokenizerFast`):
An instance of [`BertTokenizerFast`]. The tokenizer is a required input.
"""
def __init__(self, feature_extractor, tokenizer):
if not isinstance(feature_extractor, ViltFeatureExtractor):
raise ValueError(
f"`feature_extractor` has to be of type {ViltFeatureExtractor.__class__}, but is {type(feature_extractor)}"
)
if not isinstance(tokenizer, BertTokenizerFast):
raise ValueError(f"`tokenizer` has to be of type {BertTokenizerFast.__class__}, but is {type(tokenizer)}")
self.feature_extractor = feature_extractor
self.tokenizer = tokenizer
self.current_processor = self.feature_extractor
def save_pretrained(self, save_directory):
"""
Save a ViLT feature_extractor object and BERT tokenizer object to the directory `save_directory`, so that it
can be re-loaded using the [`~ViltProcessor.from_pretrained`] class method.
<Tip>
This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
[`~tokenization_utils_base.PreTrainedTokenizer.save_pretrained`]. Please refer to the docstrings of the methods
above for more information.
</Tip>
Args:
save_directory (`str` or `os.PathLike`):
Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
be created if it does not exist).
"""
self.feature_extractor.save_pretrained(save_directory)
self.tokenizer.save_pretrained(save_directory)
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
r"""
Instantiate a [`ViltProcessor`] from a pretrained ViLT processor.
<Tip>
This class method is simply calling ViltFeatureExtractor's
[`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`] and BertTokenizerFast's
[`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`]. Please refer to the docstrings of the methods
above for more information.
</Tip>
Args:
pretrained_model_name_or_path (`str` or `os.PathLike`):
This can be either:
- a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
- a path to a *directory* containing a feature extractor file saved using the
[`~SequenceFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
- a path or url to a saved feature extractor JSON *file*, e.g.,
`./my_model_directory/preprocessor_config.json`.
**kwargs
Additional keyword arguments passed along to both [`SequenceFeatureExtractor`] and
[`PreTrainedTokenizer`]
"""
feature_extractor = ViltFeatureExtractor.from_pretrained(pretrained_model_name_or_path, **kwargs)
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)
def __call__(
self,
images,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = False,
max_length: Optional[int] = None,
stride: int = 0,
pad_to_multiple_of: Optional[int] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs
) -> BatchEncoding:
"""
This method uses [`ViltFeatureExtractor.__call__`] to prepare image(s) for the model, and
[`BertTokenizerFast.__call__`] to prepare text for the model.
Please refer to the docstring of the above two methods for more information.
"""
encoding = self.tokenizer(
text=text,
add_special_tokens=add_special_tokens,
padding=padding,
truncation=truncation,
max_length=max_length,
stride=stride,
pad_to_multiple_of=pad_to_multiple_of,
return_token_type_ids=return_token_type_ids,
return_attention_mask=return_attention_mask,
return_overflowing_tokens=return_overflowing_tokens,
return_special_tokens_mask=return_special_tokens_mask,
return_offsets_mapping=return_offsets_mapping,
return_length=return_length,
verbose=verbose,
return_tensors=return_tensors,
**kwargs,
)
# add pixel_values + pixel_mask
encoding_feature_extractor = self.feature_extractor(images, return_tensors=return_tensors)
encoding.update(encoding_feature_extractor)
return encoding
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
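A minimal sketch of building the processor from its two components and round-tripping it through `save_pretrained`/`from_pretrained` (the local path `./vilt-processor` is an arbitrary example):

```python
from transformers import BertTokenizerFast, ViltFeatureExtractor, ViltProcessor

# Build the processor from its components; the tokenizer uses the standard BERT uncased vocabulary.
feature_extractor = ViltFeatureExtractor(size=384)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
processor = ViltProcessor(feature_extractor, tokenizer)

# Save and reload from an arbitrary local directory.
processor.save_pretrained("./vilt-processor")
reloaded_processor = ViltProcessor.from_pretrained("./vilt-processor")
```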
......@@ -326,12 +326,6 @@ class ViTLayer(nn.Module):
# in ViT, layernorm is also applied after self-attention
layer_output = self.layernorm_after(hidden_states)
# TODO feedforward chunking not working for now
# layer_output = apply_chunking_to_forward(
# self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, layer_output
# )
layer_output = self.intermediate(layer_output)
# second residual connection is done here
......@@ -341,11 +335,6 @@ class ViTLayer(nn.Module):
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output)
return layer_output
class ViTEncoder(nn.Module):
def __init__(self, config):
......
......@@ -3540,6 +3540,58 @@ class UniSpeechSatPreTrainedModel(metaclass=DummyObject):
requires_backends(self, ["torch"])
VILT_PRETRAINED_MODEL_ARCHIVE_LIST = None
class ViltForImageAndTextRetrieval(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltForImagesAndTextClassification(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltForMaskedLM(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltForQuestionAnswering(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltLayer(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisionEncoderDecoderModel(metaclass=DummyObject):
_backends = ["torch"]
......