Unverified Commit ac227093 authored by NielsRogge, committed by GitHub

Add ViLT (#14895)



* First commit

* Add conversion script

* Make conversion script work for base model

* More improvements

* Update conversion script, works for vqa

* Add indexing argument to meshgrid

* Make conversion script work for ViltForPreTraining

* Add ViltForPreTraining to docs

* Fix device issue

* Add processor

* Add MinMaxResize to feature extractor

* Implement call method of ViltProcessor

* Fix tests

* Add integration test

* Add loss calculation for VQA

* Improve tests

* Improve some more tests

* Debug tests

* Small improvements

* Add support for attention_mask

* Remove mask_it

* Add pixel_mask

* Add tests for ViltFeatureExtractor

* Improve tests

* Add ViltForNaturalLanguageVisualReasoning

* Add ViltForNaturalLanguageVisualReasoning to conversion script

* Minor fixes

* Add support for image_embeds, update docstrings to markdown

* Update docs to markdown

* Improve conversion script

* Rename ViltForPreTraining to ViltForMaskedLM

* Improve conversion script

* Convert docstrings to markdown

* Fix code example of retrieval model

* Properly convert masked language model

* Add integration test for nlvr

* Fix code quality

* Apply suggestions from code review

* Add copied from statements

* Fix pretrained_config_archive_map

* Fix docs

* Add model to README

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply more suggestions from code review

* Make code more readable

* Add ViltForNaturalLanguageVisualReasoning to the tests

* Rename ViltForVisualQuestionAnswering to ViltForQuestionAnswering

* Replace pixel_values_2 by single tensor

* Add hidden_states and attentions

* Fix one more test

* Fix all tests

* Update year

* Fix rebase issues

* Fix another rebase issue

* Remove ViltForPreTraining from auto mapping

* Rename ViltForImageRetrievalTextRetrieval to ViltForImageAndTextRetrieval

* Make it possible to use BertTokenizerFast in the processor

* Use BertTokenizerFast by default

* Rename ViltForNaturalLanguageVisualReasoning, define custom model output
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 691878ee
......@@ -311,6 +311,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
......
......@@ -289,6 +289,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
......
......@@ -313,6 +313,7 @@ conda install -c huggingface transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
......
......@@ -325,6 +325,7 @@ conda install -c huggingface transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
......
......@@ -282,6 +282,8 @@
title: UniSpeech
- local: model_doc/unispeech-sat
title: UniSpeech-SAT
- local: model_doc/vilt
title: ViLT
- local: model_doc/vision-encoder-decoder
title: Vision Encoder Decoder Models
- local: model_doc/vision-text-dual-encoder
......
......@@ -170,6 +170,7 @@ conversion utilities for the following models.
1. **[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
......@@ -266,6 +267,7 @@ Flax), PyTorch, and/or TensorFlow.
| TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ |
| ViLT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Vision Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ |
| VisualBert | ❌ | ❌ | ✅ | ❌ | ❌ |
......
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ViLT
## Overview
The ViLT model was proposed in [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334)
by Wonjae Kim, Bokyung Son, Ildoo Kim. ViLT incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design
for Vision-and-Language Pre-training (VLP).
The abstract from the paper is the following:
*Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.
Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision
(e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we
find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more
computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive
power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model,
Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically
simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of
times faster than previous VLP models, yet with competitive or better downstream task performance.*
Tips:
- ViLT is a model that takes both `pixel_values` and `input_ids` as input. One can use [`ViltProcessor`] to prepare data for the model.
This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one; see the example below.
- ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to
under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates
which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you.
- The design of ViLT is very similar to that of a standard Vision Transformer (ViT). The only difference is that the model includes
additional embedding layers for the language modality.
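A minimal sketch of visual question answering with ViLT is shown below. It assumes the `dandelin/vilt-b32-finetuned-vqa` checkpoint released by the original authors is available on the Hub; any other fine-tuned ViLT VQA checkpoint would work the same way.

```python
from PIL import Image
import requests

from transformers import ViltForQuestionAnswering, ViltProcessor

# Assumed checkpoint name; swap in any ViLT VQA checkpoint you have access to.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "How many cats are there?"

# The processor tokenizes the text and resizes, normalizes and pads the image in one call.
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
predicted_idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_idx])
```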
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vilt_architecture.jpg"
alt="drawing" width="600"/>
<small> ViLT architecture. Taken from the <a href="https://arxiv.org/abs/2102.03334">original paper</a>. </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/dandelin/ViLT).
## ViltConfig
[[autodoc]] ViltConfig
## ViltFeatureExtractor
[[autodoc]] ViltFeatureExtractor
- __call__
## ViltProcessor
[[autodoc]] ViltProcessor
- __call__
## ViltModel
[[autodoc]] ViltModel
- forward
## ViltForMaskedLM
[[autodoc]] ViltForMaskedLM
- forward
## ViltForQuestionAnswering
[[autodoc]] ViltForQuestionAnswering
- forward
## ViltForImagesAndTextClassification
[[autodoc]] ViltForImagesAndTextClassification
- forward
## ViltForImageAndTextRetrieval
[[autodoc]] ViltForImageAndTextRetrieval
- forward
......@@ -308,6 +308,7 @@ _import_structure = {
"UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"UniSpeechSatConfig",
],
"models.vilt": ["VILT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViltConfig", "ViltFeatureExtractor", "ViltProcessor"],
"models.vision_encoder_decoder": ["VisionEncoderDecoderConfig"],
"models.vision_text_dual_encoder": ["VisionTextDualEncoderConfig", "VisionTextDualEncoderProcessor"],
"models.visual_bert": ["VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VisualBertConfig"],
......@@ -514,6 +515,8 @@ if is_vision_available():
_import_structure["models.layoutxlm"].append("LayoutXLMProcessor")
_import_structure["models.perceiver"].append("PerceiverFeatureExtractor")
_import_structure["models.segformer"].append("SegformerFeatureExtractor")
_import_structure["models.vilt"].append("ViltFeatureExtractor")
_import_structure["models.vilt"].append("ViltProcessor")
_import_structure["models.vit"].append("ViTFeatureExtractor")
else:
from .utils import dummy_vision_objects
......@@ -629,7 +632,6 @@ if is_torch_available():
_import_structure["modeling_utils"] = ["Conv1D", "PreTrainedModel", "apply_chunking_to_forward", "prune_layer"]
# PyTorch models structure
_import_structure["models.albert"].extend(
[
"ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
......@@ -1382,6 +1384,18 @@ if is_torch_available():
"UniSpeechSatPreTrainedModel",
]
)
_import_structure["models.vilt"].extend(
[
"VILT_PRETRAINED_MODEL_ARCHIVE_LIST",
"ViltForImageAndTextRetrieval",
"ViltForImagesAndTextClassification",
"ViltForMaskedLM",
"ViltForQuestionAnswering",
"ViltLayer",
"ViltModel",
"ViltPreTrainedModel",
]
)
_import_structure["models.vision_encoder_decoder"].extend(["VisionEncoderDecoderModel"])
_import_structure["models.vision_text_dual_encoder"].extend(["VisionTextDualEncoderModel"])
_import_structure["models.visual_bert"].extend(
......@@ -2409,6 +2423,7 @@ if TYPE_CHECKING:
from .models.trocr import TROCR_PRETRAINED_CONFIG_ARCHIVE_MAP, TrOCRConfig, TrOCRProcessor
from .models.unispeech import UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechConfig
from .models.unispeech_sat import UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechSatConfig
from .models.vilt import VILT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViltConfig, ViltFeatureExtractor, ViltProcessor
from .models.vision_encoder_decoder import VisionEncoderDecoderConfig
from .models.vision_text_dual_encoder import VisionTextDualEncoderConfig, VisionTextDualEncoderProcessor
from .models.visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig
......@@ -2585,6 +2600,7 @@ if TYPE_CHECKING:
from .models.layoutxlm import LayoutXLMProcessor
from .models.perceiver import PerceiverFeatureExtractor
from .models.segformer import SegformerFeatureExtractor
from .models.vilt import ViltFeatureExtractor, ViltProcessor
from .models.vit import ViTFeatureExtractor
else:
from .utils.dummy_vision_objects import *
......@@ -3302,6 +3318,16 @@ if TYPE_CHECKING:
UniSpeechSatModel,
UniSpeechSatPreTrainedModel,
)
from .models.vilt import (
VILT_PRETRAINED_MODEL_ARCHIVE_LIST,
ViltForImageAndTextRetrieval,
ViltForImagesAndTextClassification,
ViltForMaskedLM,
ViltForQuestionAnswering,
ViltLayer,
ViltModel,
ViltPreTrainedModel,
)
from .models.vision_encoder_decoder import VisionEncoderDecoderModel
from .models.vision_text_dual_encoder import VisionTextDualEncoderModel
from .models.visual_bert import (
......
......@@ -104,6 +104,7 @@ from . import (
trocr,
unispeech,
unispeech_sat,
vilt,
vision_encoder_decoder,
vision_text_dual_encoder,
visual_bert,
......
......@@ -30,6 +30,7 @@ logger = logging.get_logger(__name__)
CONFIG_MAPPING_NAMES = OrderedDict(
[
# Add configs here
("vilt", "ViltConfig"),
("vit_mae", "ViTMAEConfig"),
("realm", "RealmConfig"),
("nystromformer", "NystromformerConfig"),
......@@ -119,6 +120,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
[
# Add archive maps here
("vilt", "VILT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("realm", "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("nystromformer", "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
......@@ -196,6 +198,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
MODEL_NAMES_MAPPING = OrderedDict(
[
# Add full (and cased) model names here
("vilt", "ViLT"),
("vit_mae", "ViTMAE"),
("realm", "Realm"),
("nystromformer", "Nystromformer"),
......
......@@ -28,6 +28,7 @@ logger = logging.get_logger(__name__)
MODEL_MAPPING_NAMES = OrderedDict(
[
# Base model mapping
("vilt", "ViltModel"),
("vit_mae", "ViTMAEModel"),
("nystromformer", "NystromformerModel"),
("imagegpt", "ImageGPTModel"),
......
......@@ -297,12 +297,6 @@ class DeiTLayer(nn.Module):
# in DeiT, layernorm is also applied after self-attention
layer_output = self.layernorm_after(hidden_states)
# TODO feedforward chunking not working for now
# layer_output = apply_chunking_to_forward(
# self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, layer_output
# )
layer_output = self.intermediate(layer_output)
# second residual connection is done here
......@@ -312,11 +306,6 @@ class DeiTLayer(nn.Module):
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output)
return layer_output
# Copied from transformers.models.vit.modeling_vit.ViTEncoder with ViT->DeiT
class DeiTEncoder(nn.Module):
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
# rely on isort to merge the imports
from ...file_utils import _LazyModule, is_torch_available, is_vision_available
_import_structure = {
"configuration_vilt": ["VILT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViltConfig"],
}
if is_vision_available():
_import_structure["feature_extraction_vilt"] = ["ViltFeatureExtractor"]
_import_structure["processing_vilt"] = ["ViltProcessor"]
if is_torch_available():
_import_structure["modeling_vilt"] = [
"VILT_PRETRAINED_MODEL_ARCHIVE_LIST",
"ViltForImageAndTextRetrieval",
"ViltForImagesAndTextClassification",
"ViltForMaskedLM",
"ViltForQuestionAnswering",
"ViltLayer",
"ViltModel",
"ViltPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_vilt import VILT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViltConfig
if is_vision_available():
from .feature_extraction_vilt import ViltFeatureExtractor
from .processing_vilt import ViltProcessor
if is_torch_available():
from .modeling_vilt import (
VILT_PRETRAINED_MODEL_ARCHIVE_LIST,
ViltForImageAndTextRetrieval,
ViltForImagesAndTextClassification,
ViltForMaskedLM,
ViltForQuestionAnswering,
ViltLayer,
ViltModel,
ViltPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VilT model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
VILT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
# TODO
}
class ViltConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`ViltModel`]. It is used to instantiate a ViLT
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the ViLT
[dandelin/vilt-b32-mlm](https://huggingface.co/dandelin/vilt-b32-mlm) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the text part of the model. Defines the number of different tokens that can be
represented by the `input_ids` passed when calling [`ViltModel`].
type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the `token_type_ids` passed when calling [`ViltModel`]. This is used when encoding
text.
modality_type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the modalities passed when calling [`ViltModel`]. This is used after concatenating the
embeddings of the text and image modalities.
max_position_embeddings (`int`, *optional*, defaults to 40):
The maximum sequence length that this model might ever be used with.
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
image_size (`int`, *optional*, defaults to 384):
The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to 32):
The size (resolution) of each patch.
num_channels (`int`, *optional*, defaults to 3):
The number of input channels.
qkv_bias (`bool`, *optional*, defaults to `True`):
Whether to add a bias to the queries, keys and values.
max_image_length (`int`, *optional*, defaults to -1):
The maximum number of patches to take as input for the Transformer encoder. If set to a positive integer,
the encoder will sample `max_image_length` patches at maximum. If set to -1, will not be taken into
account.
num_images (`int`, *optional*, defaults to -1):
The number of images to use for natural language visual reasoning. If set to a positive integer, will be
used by [`ViltForImagesAndTextClassification`] for defining the classifier head.
Example:
```python
>>> from transformers import ViltModel, ViltConfig
>>> # Initializing a ViLT dandelin/vilt-b32-mlm style configuration
>>> configuration = ViltConfig()
>>> # Initializing a model from the dandelin/vilt-b32-mlm style configuration
>>> model = ViltModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "vilt"
def __init__(
self,
vocab_size=30522,
type_vocab_size=2,
modality_type_vocab_size=2,
max_position_embeddings=40,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.0,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
layer_norm_eps=1e-12,
is_encoder_decoder=False,
image_size=384,
patch_size=32,
num_channels=3,
qkv_bias=True,
max_image_length=-1,
tie_word_embeddings=False,
num_images=-1,
**kwargs
):
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
self.vocab_size = vocab_size
self.type_vocab_size = type_vocab_size
self.modality_type_vocab_size = modality_type_vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.image_size = image_size
self.patch_size = patch_size
self.num_channels = num_channels
self.qkv_bias = qkv_bias
self.max_image_length = max_image_length
self.num_images = num_images
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert ViLT checkpoints from the original Github repository."""
import argparse
import json
from pathlib import Path
import torch
from PIL import Image
import requests
from huggingface_hub import cached_download, hf_hub_url
from transformers import (
BertTokenizer,
ViltConfig,
ViltFeatureExtractor,
ViltForImageAndTextRetrieval,
ViltForImagesAndTextClassification,
ViltForMaskedLM,
ViltForQuestionAnswering,
ViltProcessor,
)
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
# here we list all keys to be renamed (original name on the left, our name on the right)
def create_rename_keys(config, vqa_model=False, nlvr_model=False, irtr_model=False):
rename_keys = []
for i in range(config.num_hidden_layers):
# encoder layers: output projection, 2 feedforward neural networks and 2 layernorms
rename_keys.append((f"transformer.blocks.{i}.norm1.weight", f"vilt.encoder.layer.{i}.layernorm_before.weight"))
rename_keys.append((f"transformer.blocks.{i}.norm1.bias", f"vilt.encoder.layer.{i}.layernorm_before.bias"))
rename_keys.append(
(f"transformer.blocks.{i}.attn.proj.weight", f"vilt.encoder.layer.{i}.attention.output.dense.weight")
)
rename_keys.append(
(f"transformer.blocks.{i}.attn.proj.bias", f"vilt.encoder.layer.{i}.attention.output.dense.bias")
)
rename_keys.append((f"transformer.blocks.{i}.norm2.weight", f"vilt.encoder.layer.{i}.layernorm_after.weight"))
rename_keys.append((f"transformer.blocks.{i}.norm2.bias", f"vilt.encoder.layer.{i}.layernorm_after.bias"))
rename_keys.append(
(f"transformer.blocks.{i}.mlp.fc1.weight", f"vilt.encoder.layer.{i}.intermediate.dense.weight")
)
rename_keys.append((f"transformer.blocks.{i}.mlp.fc1.bias", f"vilt.encoder.layer.{i}.intermediate.dense.bias"))
rename_keys.append((f"transformer.blocks.{i}.mlp.fc2.weight", f"vilt.encoder.layer.{i}.output.dense.weight"))
rename_keys.append((f"transformer.blocks.{i}.mlp.fc2.bias", f"vilt.encoder.layer.{i}.output.dense.bias"))
# embeddings
rename_keys.extend(
[
# text embeddings
("text_embeddings.word_embeddings.weight", "vilt.embeddings.text_embeddings.word_embeddings.weight"),
(
"text_embeddings.position_embeddings.weight",
"vilt.embeddings.text_embeddings.position_embeddings.weight",
),
("text_embeddings.position_ids", "vilt.embeddings.text_embeddings.position_ids"),
(
"text_embeddings.token_type_embeddings.weight",
"vilt.embeddings.text_embeddings.token_type_embeddings.weight",
),
("text_embeddings.LayerNorm.weight", "vilt.embeddings.text_embeddings.LayerNorm.weight"),
("text_embeddings.LayerNorm.bias", "vilt.embeddings.text_embeddings.LayerNorm.bias"),
# patch embeddings
("transformer.cls_token", "vilt.embeddings.cls_token"),
("transformer.patch_embed.proj.weight", "vilt.embeddings.patch_embeddings.projection.weight"),
("transformer.patch_embed.proj.bias", "vilt.embeddings.patch_embeddings.projection.bias"),
("transformer.pos_embed", "vilt.embeddings.position_embeddings"),
# token type embeddings
("token_type_embeddings.weight", "vilt.embeddings.token_type_embeddings.weight"),
]
)
# final layernorm + pooler
rename_keys.extend(
[
("transformer.norm.weight", "vilt.layernorm.weight"),
("transformer.norm.bias", "vilt.layernorm.bias"),
("pooler.dense.weight", "vilt.pooler.dense.weight"),
("pooler.dense.bias", "vilt.pooler.dense.bias"),
]
)
# classifier head(s)
if vqa_model:
# classification head
rename_keys.extend(
[
("vqa_classifier.0.weight", "classifier.0.weight"),
("vqa_classifier.0.bias", "classifier.0.bias"),
("vqa_classifier.1.weight", "classifier.1.weight"),
("vqa_classifier.1.bias", "classifier.1.bias"),
("vqa_classifier.3.weight", "classifier.3.weight"),
("vqa_classifier.3.bias", "classifier.3.bias"),
]
)
elif nlvr_model:
# classification head
rename_keys.extend(
[
("nlvr2_classifier.0.weight", "classifier.0.weight"),
("nlvr2_classifier.0.bias", "classifier.0.bias"),
("nlvr2_classifier.1.weight", "classifier.1.weight"),
("nlvr2_classifier.1.bias", "classifier.1.bias"),
("nlvr2_classifier.3.weight", "classifier.3.weight"),
("nlvr2_classifier.3.bias", "classifier.3.bias"),
]
)
else:
pass
return rename_keys
# we split up the matrix of each encoder layer into queries, keys and values
def read_in_q_k_v(state_dict, config):
for i in range(config.num_hidden_layers):
prefix = "vilt."
# read in weights + bias of input projection layer (in timm, this is a single matrix + bias)
in_proj_weight = state_dict.pop(f"transformer.blocks.{i}.attn.qkv.weight")
in_proj_bias = state_dict.pop(f"transformer.blocks.{i}.attn.qkv.bias")
# next, add query, keys and values (in that order) to the state dict
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.query.weight"] = in_proj_weight[
: config.hidden_size, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.query.bias"] = in_proj_bias[: config.hidden_size]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.key.weight"] = in_proj_weight[
config.hidden_size : config.hidden_size * 2, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.key.bias"] = in_proj_bias[
config.hidden_size : config.hidden_size * 2
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.value.weight"] = in_proj_weight[
-config.hidden_size :, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.value.bias"] = in_proj_bias[-config.hidden_size :]
def remove_classification_head_(state_dict):
ignore_keys = ["head.weight", "head.bias"]
for k in ignore_keys:
state_dict.pop(k, None)
def rename_key(dct, old, new):
val = dct.pop(old)
dct[new] = val
@torch.no_grad()
def convert_vilt_checkpoint(checkpoint_url, pytorch_dump_folder_path):
"""
Copy/paste/tweak model's weights to our ViLT structure.
"""
# define configuration and initialize HuggingFace model
config = ViltConfig(image_size=384, patch_size=32, tie_word_embeddings=False)
mlm_model = False
vqa_model = False
nlvr_model = False
irtr_model = False
if "vqa" in checkpoint_url:
vqa_model = True
config.num_labels = 3129
repo_id = "datasets/huggingface/label-files"
filename = "vqa2-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
model = ViltForQuestionAnswering(config)
elif "nlvr" in checkpoint_url:
nlvr_model = True
config.num_labels = 2
config.id2label = {0: "False", 1: "True"}
config.label2id = {v: k for k, v in config.id2label.items()}
config.modality_type_vocab_size = 3
model = ViltForImagesAndTextClassification(config)
elif "irtr" in checkpoint_url:
irtr_model = True
model = ViltForImageAndTextRetrieval(config)
elif "mlm_itm" in checkpoint_url:
mlm_model = True
model = ViltForMaskedLM(config)
else:
raise ValueError("Unknown model type")
# load state_dict of original model, remove and rename some keys
state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")["state_dict"]
rename_keys = create_rename_keys(config, vqa_model, nlvr_model, irtr_model)
for src, dest in rename_keys:
rename_key(state_dict, src, dest)
read_in_q_k_v(state_dict, config)
if mlm_model or irtr_model:
ignore_keys = ["itm_score.fc.weight", "itm_score.fc.bias"]
for k in ignore_keys:
state_dict.pop(k, None)
# load state dict into HuggingFace model
model.eval()
if mlm_model:
missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
assert missing_keys == ["mlm_score.decoder.bias"]
else:
model.load_state_dict(state_dict)
# Define processor
feature_extractor = ViltFeatureExtractor(size=384)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
processor = ViltProcessor(feature_extractor, tokenizer)
# Forward pass on example inputs (image + text)
if nlvr_model:
image1 = Image.open(requests.get("https://lil.nlp.cornell.edu/nlvr/exs/ex0_0.jpg", stream=True).raw)
image2 = Image.open(requests.get("https://lil.nlp.cornell.edu/nlvr/exs/ex0_0.jpg", stream=True).raw)
text = "The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing."
encoding_1 = processor(image1, text, return_tensors="pt")
encoding_2 = processor(image2, text, return_tensors="pt")
outputs = model(
input_ids=encoding_1.input_ids,
pixel_values=encoding_1.pixel_values,
pixel_values_2=encoding_2.pixel_values,
)
else:
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
if mlm_model:
text = "a bunch of [MASK] laying on a [MASK]."
else:
text = "How many cats are there?"
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
# Verify outputs
if mlm_model:
expected_shape = torch.Size([1, 11, 30522])
expected_slice = torch.tensor([-12.5061, -12.5123, -12.5174])
assert outputs.logits.shape == expected_shape
assert torch.allclose(outputs.logits[0, 0, :3], expected_slice, atol=1e-4)
# verify masked token prediction equals "cats"
predicted_id = outputs.logits[0, 4, :].argmax(-1).item()
assert tokenizer.decode([predicted_id]) == "cats"
elif vqa_model:
expected_shape = torch.Size([1, 3129])
expected_slice = torch.tensor([-15.9495, -18.1472, -10.3041])
assert outputs.logits.shape == expected_shape
assert torch.allclose(outputs.logits[0, :3], expected_slice, atol=1e-4)
# verify vqa prediction equals "2"
predicted_idx = outputs.logits.argmax(-1).item()
assert model.config.id2label[predicted_idx] == "2"
elif nlvr_model:
expected_shape = torch.Size([1, 2])
expected_slice = torch.tensor([-2.8721, 2.1291])
assert torch.allclose(outputs.logits[0, :3], expected_slice, atol=1e-4)
assert outputs.logits.shape == expected_shape
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
print(f"Saving model and processor to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
processor.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--checkpoint_url",
default="https://github.com/dandelin/ViLT/releases/download/200k/vilt_200k_mlm_itm.ckpt",
type=str,
help="URL of the checkpoint you'd like to convert.",
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
args = parser.parse_args()
convert_vilt_checkpoint(args.checkpoint_url, args.pytorch_dump_folder_path)
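As a standalone sanity check on the slicing performed by `read_in_q_k_v` above, the following sketch (hypothetical tensors, not part of the conversion script) shows how a timm-style fused `qkv` projection splits into the separate query/key/value matrices:

```python
import torch

# Hypothetical fused projection; ViLT-B/32 uses hidden_size = 768.
hidden_size = 768
fused_qkv_weight = torch.randn(3 * hidden_size, hidden_size)

# Same row slices as in read_in_q_k_v: [0:h] -> query, [h:2h] -> key, [2h:3h] -> value.
query_weight = fused_qkv_weight[:hidden_size, :]
key_weight = fused_qkv_weight[hidden_size : hidden_size * 2, :]
value_weight = fused_qkv_weight[-hidden_size:, :]

assert query_weight.shape == key_weight.shape == value_weight.shape == (hidden_size, hidden_size)
assert torch.equal(torch.cat([query_weight, key_weight, value_weight], dim=0), fused_qkv_weight)
```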
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for ViLT."""
from typing import List, Optional, Union
import numpy as np
from PIL import Image
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType, is_torch_available
from ...image_utils import (
IMAGENET_STANDARD_MEAN,
IMAGENET_STANDARD_STD,
ImageFeatureExtractionMixin,
ImageInput,
is_torch_tensor,
)
from ...utils import logging
if is_torch_available():
import torch
logger = logging.get_logger(__name__)
class ViltFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
r"""
Constructs a ViLT feature extractor.
This feature extractor inherits from [`FeatureExtractionMixin`] which contains most of the main methods. Users
should refer to this superclass for more information regarding those methods.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the input based on `size`.
size (`int`, *optional*, defaults to 384):
Resize the shorter side of the input to the given size. Should be an integer. The longer side will be
limited to under int((1333 / 800) * size) while preserving the aspect ratio. Only has an effect if
`do_resize` is set to `True`.
size_divisor (`int`, *optional*, defaults to 32):
The size by which to make sure both the height and width can be divided.
resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
An optional resampling filter. This can be one of `PIL.Image.NEAREST`, `PIL.Image.BOX`,
`PIL.Image.BILINEAR`, `PIL.Image.HAMMING`, `PIL.Image.BICUBIC` or `PIL.Image.LANCZOS`. Only has an effect
if `do_resize` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether or not to normalize the input with mean and standard deviation.
image_mean (`List[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
The sequence of means for each channel, to be used when normalizing images.
image_std (`List[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
The sequence of standard deviations for each channel, to be used when normalizing images.
"""
model_input_names = ["pixel_values", "pixel_mask"]
def __init__(
self,
do_resize=True,
size=384,
size_divisor=32,
resample=Image.BICUBIC,
do_normalize=True,
image_mean=None,
image_std=None,
**kwargs
):
super().__init__(**kwargs)
self.do_resize = do_resize
self.size = size
self.size_divisor = size_divisor
self.resample = resample
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
def _resize(self, image, shorter=800, longer=1333, size_divisor=32, resample=Image.BICUBIC):
"""
Resizes the shorter edge of `image` to `shorter` and limits the longer edge to under `longer`, while preserving
the aspect ratio. Also makes sure that both the height and width can be divided by `size_divisor`.
Based on original implementation:
https://github.com/dandelin/ViLT/blob/3db8b5035464afee84d951bf6322e1b27f1d072d/vilt/transforms/utils.py#L5
Args:
image (`PIL.Image`):
The image to resize.
shorter (`int`, *optional*, defaults to `800`):
The size to which to resize the shorter side of the image.
longer (`int`, *optional*, defaults to `1333`):
The size by which to limit the longer side of the image, while preserving the aspect ratio.
size_divisor (`int`, *optional*, defaults to `32`):
The size by which both the height and the width must be divisible.
resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
An optional resampling filter.
"""
if not isinstance(image, Image.Image):
image = self.to_pil_image(image)
w, h = image.size
min_size = shorter
max_size = longer
scale = min_size / min(w, h)
if h < w:
newh, neww = min_size, scale * w
else:
newh, neww = scale * h, min_size
if max(newh, neww) > max_size:
scale = max_size / max(newh, neww)
newh = newh * scale
neww = neww * scale
newh, neww = int(newh + 0.5), int(neww + 0.5)
newh, neww = newh // size_divisor * size_divisor, neww // size_divisor * size_divisor
return self.resize(image, size=(neww, newh), resample=resample)
def _max_by_axis(self, the_list):
# type: (List[List[int]]) -> List[int]
maxes = the_list[0]
for sublist in the_list[1:]:
for index, item in enumerate(sublist):
maxes[index] = max(maxes[index], item)
return maxes
def pad_and_create_pixel_mask(
self, pixel_values_list: List["torch.Tensor"], return_tensors: Optional[Union[str, TensorType]] = None
):
"""
Pad images up to the largest image in a batch and create a corresponding `pixel_mask`.
Args:
pixel_values_list (`List[torch.Tensor]`):
List of images (pixel values) to be padded. Each image should be a tensor of shape (C, H, W).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch `torch.Tensor`
objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **pixel_values** -- Pixel values to be fed to a model.
- **pixel_mask** -- Pixel mask to be fed to a model (when `pad_and_return_pixel_mask=True` or if
*"pixel_mask"* is in `self.model_input_names`).
"""
max_size = self._max_by_axis([list(image.shape) for image in pixel_values_list])
c, h, w = max_size
padded_images = []
pixel_mask = []
for image in pixel_values_list:
# create padded image
padded_image = np.zeros((c, h, w), dtype=np.float32)
padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
padded_images.append(padded_image)
# create pixel mask
mask = np.zeros((h, w), dtype=np.int64)
mask[: image.shape[1], : image.shape[2]] = True
pixel_mask.append(mask)
# return as BatchFeature
data = {"pixel_values": padded_images, "pixel_mask": pixel_mask}
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs
def __call__(
self,
images: ImageInput,
pad_and_return_pixel_mask: Optional[bool] = True,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs
) -> BatchFeature:
"""
Main method to prepare for the model one or several image(s).
<Tip warning={true}>
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so it is most efficient to pass
PIL images.
</Tip>
Args:
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
number of channels, H and W are image height and width.
pad_and_return_pixel_mask (`bool`, *optional*, defaults to `True`):
Whether or not to pad images up to the largest image in a batch and create a pixel mask.
If left to the default, will return a pixel mask that is:
- 1 for pixels that are real (i.e. **not masked**),
- 0 for pixels that are padding (i.e. **masked**).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*, defaults to `'np'`):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **pixel_values** -- Pixel values to be fed to a model, of shape (batch_size, num_channels, height,
width).
- **pixel_mask** -- Pixel mask to be fed to a model (when `return_pixel_mask=True` or if *"pixel_mask"* is
in `self.model_input_names`).
"""
# Input type checking for clearer error
valid_images = False
# Check that images has a valid type
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
valid_images = True
elif isinstance(images, (list, tuple)):
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
valid_images = True
if not valid_images:
raise ValueError(
"Images must of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example), "
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
)
is_batched = bool(
isinstance(images, (list, tuple))
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
)
if not is_batched:
images = [images]
# transformations (resizing + normalization)
if self.do_resize and self.size is not None:
longer = int((1333 / 800) * self.size)
images = [
self._resize(
image=image,
shorter=self.size,
longer=longer,
size_divisor=self.size_divisor,
resample=self.resample,
)
for image in images
]
if self.do_normalize:
images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
if pad_and_return_pixel_mask:
# pad images up to largest image in batch and create pixel_mask
max_size = self._max_by_axis([list(image.shape) for image in images])
c, h, w = max_size
padded_images = []
pixel_mask = []
for image in images:
# create padded image
padded_image = np.zeros((c, h, w), dtype=np.float32)
padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
padded_images.append(padded_image)
# create pixel mask
mask = np.zeros((h, w), dtype=np.int64)
mask[: image.shape[1], : image.shape[2]] = True
pixel_mask.append(mask)
images = padded_images
# return as BatchFeature
data = {}
data["pixel_values"] = images
if pad_and_return_pixel_mask:
data["pixel_mask"] = pixel_mask
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs
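A minimal sketch of the resizing and padding behaviour above, using two randomly generated images of different sizes (the sizes are arbitrary and chosen only for illustration):

```python
import numpy as np
from PIL import Image

from transformers import ViltFeatureExtractor

feature_extractor = ViltFeatureExtractor(size=384)

# Two dummy RGB images of different sizes (H, W, C).
image_small = Image.fromarray(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
image_large = Image.fromarray(np.random.randint(0, 256, (720, 540, 3), dtype=np.uint8))

encoding = feature_extractor([image_small, image_large], return_tensors="pt")

# Both images are resized (shorter edge -> 384, longer edge capped, dimensions divisible by 32),
# then padded to the largest resized image in the batch.
print(encoding.pixel_values.shape)  # (2, 3, padded_height, padded_width)
print(encoding.pixel_mask.shape)  # (2, padded_height, padded_width); 1 = real pixel, 0 = padding
```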
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for ViLT.
"""
from typing import List, Optional, Union
from transformers import BertTokenizerFast
from ...file_utils import TensorType
from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
from .feature_extraction_vilt import ViltFeatureExtractor
class ViltProcessor:
r"""
Constructs a ViLT processor which wraps a BERT tokenizer and ViLT feature extractor into a single processor.
[`ViltProcessor`] offers all the functionalities of [`ViltFeatureExtractor`] and [`BertTokenizerFast`]. See the
docstring of [`~ViltProcessor.__call__`] and [`~ViltProcessor.decode`] for more information.
Args:
feature_extractor (`ViltFeatureExtractor`):
An instance of [`ViltFeatureExtractor`]. The feature extractor is a required input.
tokenizer (`BertTokenizerFast`):
An instance of [`BertTokenizerFast`]. The tokenizer is a required input.
"""
def __init__(self, feature_extractor, tokenizer):
if not isinstance(feature_extractor, ViltFeatureExtractor):
raise ValueError(
f"`feature_extractor` has to be of type {ViltFeatureExtractor.__class__}, but is {type(feature_extractor)}"
)
if not isinstance(tokenizer, BertTokenizerFast):
raise ValueError(f"`tokenizer` has to be of type {BertTokenizerFast.__class__}, but is {type(tokenizer)}")
self.feature_extractor = feature_extractor
self.tokenizer = tokenizer
self.current_processor = self.feature_extractor
def save_pretrained(self, save_directory):
"""
Save a ViLT feature_extractor object and BERT tokenizer object to the directory `save_directory`, so that it
can be re-loaded using the [`~ViltProcessor.from_pretrained`] class method.
<Tip>
This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
[`~tokenization_utils_base.PreTrainedTokenizer.save_pretrained`]. Please refer to the docstrings of the methods
above for more information.
</Tip>
Args:
save_directory (`str` or `os.PathLike`):
Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
be created if it does not exist).
"""
self.feature_extractor.save_pretrained(save_directory)
self.tokenizer.save_pretrained(save_directory)
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
r"""
Instantiate a [`ViltProcessor`] from a pretrained ViLT processor.
<Tip>
This class method is simply calling ViltFeatureExtractor's
[`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`] and BertTokenizerFast's
[`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`]. Please refer to the docstrings of the methods
above for more information.
</Tip>
Args:
pretrained_model_name_or_path (`str` or `os.PathLike`):
This can be either:
- a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
- a path to a *directory* containing a feature extractor file saved using the
[`~SequenceFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
- a path or url to a saved feature extractor JSON *file*, e.g.,
`./my_model_directory/preprocessor_config.json`.
**kwargs
Additional keyword arguments passed along to both [`SequenceFeatureExtractor`] and
[`PreTrainedTokenizer`]
"""
feature_extractor = ViltFeatureExtractor.from_pretrained(pretrained_model_name_or_path, **kwargs)
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)
def __call__(
self,
images,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = False,
max_length: Optional[int] = None,
stride: int = 0,
pad_to_multiple_of: Optional[int] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs
) -> BatchEncoding:
"""
This method uses [`ViltFeatureExtractor.__call__`] to prepare image(s) for the model, and
[`BertTokenizerFast.__call__`] to prepare text for the model.
Please refer to the docstring of the above two methods for more information.
"""
encoding = self.tokenizer(
text=text,
add_special_tokens=add_special_tokens,
padding=padding,
truncation=truncation,
max_length=max_length,
stride=stride,
pad_to_multiple_of=pad_to_multiple_of,
return_token_type_ids=return_token_type_ids,
return_attention_mask=return_attention_mask,
return_overflowing_tokens=return_overflowing_tokens,
return_special_tokens_mask=return_special_tokens_mask,
return_offsets_mapping=return_offsets_mapping,
return_length=return_length,
verbose=verbose,
return_tensors=return_tensors,
**kwargs,
)
# add pixel_values + pixel_mask
encoding_feature_extractor = self.feature_extractor(images, return_tensors=return_tensors)
encoding.update(encoding_feature_extractor)
return encoding
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
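A minimal sketch of building the processor from its two components and round-tripping it through `save_pretrained`/`from_pretrained` (the local path `./vilt-processor` is an arbitrary example):

```python
from transformers import BertTokenizerFast, ViltFeatureExtractor, ViltProcessor

# Build the processor from its components; the tokenizer uses the standard BERT uncased vocabulary.
feature_extractor = ViltFeatureExtractor(size=384)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
processor = ViltProcessor(feature_extractor, tokenizer)

# Save and reload from an arbitrary local directory.
processor.save_pretrained("./vilt-processor")
reloaded_processor = ViltProcessor.from_pretrained("./vilt-processor")
```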
......@@ -326,12 +326,6 @@ class ViTLayer(nn.Module):
# in ViT, layernorm is also applied after self-attention
layer_output = self.layernorm_after(hidden_states)
# TODO feedforward chunking not working for now
# layer_output = apply_chunking_to_forward(
# self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, layer_output
# )
layer_output = self.intermediate(layer_output)
# second residual connection is done here
......@@ -341,11 +335,6 @@ class ViTLayer(nn.Module):
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output)
return layer_output
class ViTEncoder(nn.Module):
def __init__(self, config):
......
......@@ -3540,6 +3540,58 @@ class UniSpeechSatPreTrainedModel(metaclass=DummyObject):
requires_backends(self, ["torch"])
VILT_PRETRAINED_MODEL_ARCHIVE_LIST = None
class ViltForImageAndTextRetrieval(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltForImagesAndTextClassification(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltForMaskedLM(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltForQuestionAnswering(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltLayer(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ViltPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisionEncoderDecoderModel(metaclass=DummyObject):
_backends = ["torch"]
......