Unverified Commit 88ca6a23 authored by Gunjan Chhablani, committed by GitHub

VisualBERT (#10534)



* Init VisualBERT

* Add cookie-cutter, Config, and Embeddings

* Add preliminary Model

* Add Bert analogous classes

* Add basic code for NLVR, VQA, Flickr

* Update Init

* Fix VisualBert Downstream Models

* Rename classifier to cls

* Comment position_ids buffer

* Remove sentence image predictor output

* Update output dicts

* Remove unnecessary files

* Fix Auto Modeling

* Fix transformers init

* Add conversion script

* Add conversion script

* Fix docs

* Update visualbert modelling

* Update configuration

* Style fixes

* Add model and integration tests

* Add all tests

* Update model mapping

* Add simple detector from original repository

* Update docs and configs

* Fix style

* Fix style

* Update docs

* Fix style

* Fix import issues in style

* Fix style

* Add changes from review

* Fix style

* Fix style

* Update docs

* Fix style

* Fix style

* Update docs/source/model_doc/visual_bert.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update tests/test_modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add changes from review

* Remove convert run script

* Add changes from review

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/visual_bert/modeling_visual_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add changes from review

* Add changes from review

* Add visual embedding example in docs

* Fix "copied from" comments

* Add changes from review

* Fix error, style, checkpoints

* Update docs

* Fix integration tests

* Fix style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 43f46aa7
@@ -251,6 +251,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[TAPAS](https://huggingface.co/transformers/model_doc/tapas.html)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1. **[Vision Transformer (ViT)](https://huggingface.co/transformers/model_doc/vit.html)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/transformers/model_doc/visual_bert.html)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/transformers/model_doc/xlmprophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
......
@@ -256,22 +256,25 @@ Supported models
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
54. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
55. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
Zhou, Abdelrahman Mohamed, Michael Auli.
56. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
57. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
58. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer and Veselin Stoyanov.
59. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
60. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
@@ -389,6 +392,8 @@ Flax), PyTorch, and/or TensorFlow.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ViT | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| VisualBert | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Wav2Vec2 | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
@@ -537,6 +542,7 @@ Flax), PyTorch, and/or TensorFlow.
model_doc/tapas
model_doc/transformerxl
model_doc/vit
model_doc/visual_bert
model_doc/wav2vec2
model_doc/xlm
model_doc/xlmprophetnet
......
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
VisualBERT
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The VisualBERT model was proposed in `VisualBERT: A Simple and Performant Baseline for Vision and Language
<https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
VisualBERT is a neural network trained on a variety of (image, text) pairs.
The abstract from the paper is the following:
*We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an
associated input image with self-attention. We further propose two visually-grounded language model objectives for
pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2,
and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly
simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any
explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between
verbs and image regions corresponding to their arguments.*
Tips:
1. Most of the checkpoints provided work with the :class:`~transformers.VisualBertForPreTraining` configuration. The
other checkpoints provided are fine-tuned for downstream tasks - VQA ('visualbert-vqa'), VCR ('visualbert-vcr') and
NLVR2 ('visualbert-nlvr2'). Hence, if you are not working on one of these downstream tasks, it is recommended that you
use the pretrained checkpoints (see the example right after these tips).
2. For the VCR task, the authors use a fine-tuned detector for generating visual embeddings for all the checkpoints.
We do not provide the detector and its weights as part of the package, but they will be made available in the research
projects, and the states can be loaded directly into the detector provided.
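For instance (a minimal sketch using the checkpoints released with this model), loading a pre-training checkpoint next
to a fine-tuned VQA checkpoint looks as follows:
.. code-block::
>>> from transformers import VisualBertModel, VisualBertForQuestionAnswering
>>> # Pre-trained checkpoint, recommended unless you are targeting one of the fine-tuned tasks
>>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
>>> # Checkpoint fine-tuned on VQA, loaded with the question-answering head
>>> vqa_model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")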
Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice,
visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare
embeddings for image-text pairs. Both the text and visual features are then projected to a latent space of the same
dimension.
To feed images to the model, each image is passed through a pre-trained object detector, and the regions and bounding
boxes are extracted. The authors use the features generated after passing these regions through a pre-trained CNN like
ResNet as visual embeddings. They also add absolute position embeddings, and feed the resulting sequence of vectors to
a standard BERT model. The text input is concatenated in front of the visual embeddings in the embedding layer, and is
expected to be bounded by the [CLS] and [SEP] tokens, as in BERT. The segment IDs must also be set appropriately for
the textual and visual parts.
The :class:`~transformers.BertTokenizer` is used to encode the text. A custom detector/feature extractor must be used
to get the visual embeddings. For an example on how to generate visual embeddings, see the `colab notebook
<https://colab.research.google.com/drive/1bLGxKdldwqnMVA5x4neY7-l_8fKGWQYI?usp=sharing>`__. The following example shows
how to get the last hidden state using :class:`~transformers.VisualBertModel`:
.. code-block::
>>> import torch
>>> from transformers import BertTokenizer, VisualBertModel
>>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> inputs = tokenizer("What is the man eating?", return_tensors="pt")
>>> # this is a custom function that returns the visual embeddings given the image path
>>> visual_embeds = get_visual_embeddings(image_path)
>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update({
...     "visual_embeds": visual_embeds,
...     "visual_token_type_ids": visual_token_type_ids,
...     "visual_attention_mask": visual_attention_mask,
... })
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
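If you do not have a detection pipeline at hand, a tensor of random features with the right shape can stand in for
``get_visual_embeddings`` while experimenting. The sketch below is only an illustration: it assumes a single image, 36
regions and 2048-dimensional region features (the dimensionality used by the VQA checkpoints); a real setup should use
features from a pre-trained object detector as described above.
.. code-block::
>>> import torch
>>> def get_visual_embeddings(image_path):
...     # Stand-in for a real detector: batch of 1 image, 36 regions, 2048-dim features.
...     return torch.randn(1, 36, 2048)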
This model was contributed by `gchhablani <https://huggingface.co/gchhablani>`__. The original code can be found `here
<https://github.com/uclanlp/visualbert>`__.
VisualBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.VisualBertConfig
:members:
VisualBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.VisualBertModel
:members: forward
VisualBertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.VisualBertForPreTraining
:members: forward
VisualBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.VisualBertForQuestionAnswering
:members: forward
VisualBertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.VisualBertForMultipleChoice
:members: forward
VisualBertForVisualReasoning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.VisualBertForVisualReasoning
:members: forward
VisualBertForRegionToPhraseAlignment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.VisualBertForRegionToPhraseAlignment
:members: forward
@@ -233,6 +233,7 @@ _import_structure = {
"TransfoXLCorpus",
"TransfoXLTokenizer",
],
"models.visual_bert": ["VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VisualBertConfig"],
"models.vit": ["VIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTConfig"],
"models.wav2vec2": [
"WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP",
@@ -996,6 +997,19 @@ if is_torch_available():
"load_tf_weights_in_transfo_xl",
]
)
_import_structure["models.visual_bert"].extend(
[
"VISUAL_BERT_PRETRAINED_MODEL_ARCHIVE_LIST",
"VisualBertForMultipleChoice",
"VisualBertForPreTraining",
"VisualBertForQuestionAnswering",
"VisualBertForRegionToPhraseAlignment",
"VisualBertForVisualReasoning",
"VisualBertLayer",
"VisualBertModel",
"VisualBertPreTrainedModel",
]
)
_import_structure["models.vit"].extend(
[
"VIT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -1702,6 +1716,7 @@ if TYPE_CHECKING:
TransfoXLCorpus,
TransfoXLTokenizer,
)
from .models.visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig
from .models.vit import VIT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTConfig
from .models.wav2vec2 import (
WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -2338,6 +2353,17 @@ if TYPE_CHECKING:
TransfoXLPreTrainedModel,
load_tf_weights_in_transfo_xl,
)
from .models.visual_bert import ( # load_tf_weights_in_visual_bert,
VISUAL_BERT_PRETRAINED_MODEL_ARCHIVE_LIST,
VisualBertForMultipleChoice,
VisualBertForPreTraining,
VisualBertForQuestionAnswering,
VisualBertForRegionToPhraseAlignment,
VisualBertForVisualReasoning,
VisualBertLayer,
VisualBertModel,
VisualBertPreTrainedModel,
)
from .models.vit import (
VIT_PRETRAINED_MODEL_ARCHIVE_LIST,
ViTForImageClassification,
......
@@ -74,6 +74,7 @@ from . import (
t5,
tapas,
transfo_xl,
visual_bert,
vit,
wav2vec2,
xlm,
......
@@ -77,6 +77,7 @@ from ..squeezebert.configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFI
from ..t5.configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from ..tapas.configuration_tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig
from ..transfo_xl.configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
from ..visual_bert.configuration_visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig
from ..vit.configuration_vit import VIT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTConfig
from ..wav2vec2.configuration_wav2vec2 import WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP, Wav2Vec2Config
from ..xlm.configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig
@@ -92,6 +93,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
(key, value)
for pretrained_map in [
# Add archive maps here
VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -148,6 +150,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
CONFIG_MAPPING = OrderedDict(
[
# Add configs here
("visual_bert", VisualBertConfig),
("roformer", RoFormerConfig),
("clip", CLIPConfig),
("bigbird_pegasus", BigBirdPegasusConfig),
@@ -210,6 +213,7 @@ CONFIG_MAPPING = OrderedDict(
MODEL_NAMES_MAPPING = OrderedDict(
[
# Add full (and cased) model names here
("visual_bert", "VisualBert"),
("roformer", "RoFormer"),
("clip", "CLIP"),
("bigbird_pegasus", "BigBirdPegasus"),
......
@@ -266,6 +266,7 @@ from ..tapas.modeling_tapas import (
TapasModel,
)
from ..transfo_xl.modeling_transfo_xl import TransfoXLForSequenceClassification, TransfoXLLMHeadModel, TransfoXLModel
from ..visual_bert.modeling_visual_bert import VisualBertForPreTraining, VisualBertModel
from ..vit.modeling_vit import ViTForImageClassification, ViTModel
from ..wav2vec2.modeling_wav2vec2 import Wav2Vec2ForMaskedLM, Wav2Vec2Model
from ..xlm.modeling_xlm import (
@@ -349,6 +350,7 @@ from .configuration_auto import (
T5Config,
TapasConfig,
TransfoXLConfig,
VisualBertConfig,
ViTConfig,
Wav2Vec2Config,
XLMConfig,
@@ -364,6 +366,7 @@ logger = logging.get_logger(__name__)
MODEL_MAPPING = OrderedDict(
[
# Base model mapping
(VisualBertConfig, VisualBertModel),
(RoFormerConfig, RoFormerModel),
(CLIPConfig, CLIPModel),
(BigBirdPegasusConfig, BigBirdPegasusModel),
@@ -425,6 +428,7 @@ MODEL_MAPPING = OrderedDict(
MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
[
# Model for pre-training mapping
(VisualBertConfig, VisualBertForPreTraining),
(LayoutLMConfig, LayoutLMForMaskedLM),
(RetriBertConfig, RetriBertModel),
(T5Config, T5ForConditionalGeneration),
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...file_utils import _BaseLazyModule, is_torch_available
_import_structure = {
"configuration_visual_bert": ["VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VisualBertConfig"],
}
if is_torch_available():
_import_structure["modeling_visual_bert"] = [
"VISUAL_BERT_PRETRAINED_MODEL_ARCHIVE_LIST",
"VisualBertForMultipleChoice",
"VisualBertForPreTraining",
"VisualBertForQuestionAnswering",
"VisualBertForRegionToPhraseAlignment",
"VisualBertForVisualReasoning",
"VisualBertLayer",
"VisualBertModel",
"VisualBertPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig
if is_torch_available():
from .modeling_visual_bert import (
VISUAL_BERT_PRETRAINED_MODEL_ARCHIVE_LIST,
VisualBertForMultipleChoice,
VisualBertForPreTraining,
VisualBertForQuestionAnswering,
VisualBertForRegionToPhraseAlignment,
VisualBertForVisualReasoning,
VisualBertLayer,
VisualBertModel,
VisualBertPreTrainedModel,
)
else:
import importlib
import os
import sys
class _LazyModule(_BaseLazyModule):
"""
Module class that surfaces all objects but only performs associated imports when the objects are requested.
"""
__file__ = globals()["__file__"]
__path__ = [os.path.dirname(__file__)]
def _get_module(self, module_name: str):
return importlib.import_module("." + module_name, self.__name__)
sys.modules[__name__] = _LazyModule(__name__, _import_structure)
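# Illustrative note (not part of the original file): with the lazy module pattern above,
# `transformers.models.visual_bert` only imports `modeling_visual_bert` (and therefore torch)
# the first time one of the model classes is actually requested from this package.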
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VisualBERT model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"uclanlp/visualbert-vqa": "https://huggingface.co/uclanlp/visualbert-vqa/resolve/main/config.json",
"uclanlp/visualbert-vqa-pre": "https://huggingface.co/uclanlp/visualbert-vqa-pre/resolve/main/config.json",
"uclanlp/visualbert-vqa-coco-pre": "https://huggingface.co/uclanlp/visualbert-vqa-coco-pre/resolve/main/config.json",
"uclanlp/visualbert-vcr": "https://huggingface.co/uclanlp/visualbert-vcr/resolve/main/config.json",
"uclanlp/visualbert-vcr-pre": "https://huggingface.co/uclanlp/visualbert-vcr-pre/resolve/main/config.json",
"uclanlp/visualbert-vcr-coco-pre": "https://huggingface.co/uclanlp/visualbert-vcr-coco-pre/resolve/main/config.json",
"uclanlp/visualbert-nlvr2": "https://huggingface.co/uclanlp/visualbert-nlvr2/resolve/main/config.json",
"uclanlp/visualbert-nlvr2-pre": "https://huggingface.co/uclanlp/visualbert-nlvr2-pre/resolve/main/config.json",
"uclanlp/visualbert-nlvr2-coco-pre": "https://huggingface.co/uclanlp/visualbert-nlvr2-coco-pre/resolve/main/config.json"
# See all VisualBERT models at https://huggingface.co/models?filter=visual_bert
}
class VisualBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.VisualBertModel`. It is used
to instantiate an VisualBERT model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the VisualBERT
`visualbert-vqa-coco-pre <https://huggingface.co/uclanlp/visualbert-vqa-coco-pre>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the VisualBERT model. Defines the number of different tokens that can be represented by
the :obj:`inputs_ids` passed when calling :class:`~transformers.VisualBertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
visual_embedding_dim (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the visual embeddings to be passed to the model.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling
:class:`~transformers.VisualBertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
bypass_transformer (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should bypass the transformer for the visual embeddings. If set to :obj:`True`,
the model directly concatenates the visual embeddings from :class:`~transformers.VisualBertEmbeddings` with
the text output from the transformer, and then passes it to a self-attention layer.
special_visual_initialize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the visual token type and position type embedding weights should be initialized the same as
the textual token type and position type embeddings. When set to :obj:`True`, the weights of the textual
token type and position type embeddings are copied to the respective visual embedding layers.
Example::
>>> from transformers import VisualBertModel, VisualBertConfig
>>> # Initializing a VisualBERT visualbert-vqa-coco-pre style configuration
>>> configuration = VisualBertConfig.from_pretrained('uclanlp/visualbert-vqa-coco-pre')
>>> # Initializing a model from the visualbert-vqa-coco-pre style configuration
>>> model = VisualBertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "visual_bert"
def __init__(
self,
vocab_size=30522,
hidden_size=768,
visual_embedding_dim=512,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
bypass_transformer=False,
special_visual_initialize=True,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
**kwargs
):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.visual_embedding_dim = visual_embedding_dim
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.type_vocab_size = type_vocab_size
self.layer_norm_eps = layer_norm_eps
self.bypass_transformer = bypass_transformer
self.special_visual_initialize = special_visual_initialize
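# Illustrative note (not part of the original file): `visual_embedding_dim` must match the detector
# features that are fed in. For example, the conversion script below builds the VQA fine-tuning
# configuration roughly as
#     config = VisualBertConfig(visual_embedding_dim=2048, num_labels=3129)
# while the VCR and NLVR2 checkpoints use 512- and 1024-dimensional visual features respectively.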
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert VisualBert checkpoint."""
import argparse
from collections import OrderedDict
from pathlib import Path
import torch
from transformers import (
VisualBertConfig,
VisualBertForMultipleChoice,
VisualBertForPreTraining,
VisualBertForQuestionAnswering,
VisualBertForVisualReasoning,
)
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
rename_keys_prefix = [
("bert.bert", "visual_bert"),
("bert.cls", "cls"),
("bert.classifier", "cls"),
("token_type_embeddings_visual", "visual_token_type_embeddings"),
("position_embeddings_visual", "visual_position_embeddings"),
("projection", "visual_projection"),
]
ACCEPTABLE_CHECKPOINTS = [
"nlvr2_coco_pre_trained.th",
"nlvr2_fine_tuned.th",
"nlvr2_pre_trained.th",
"vcr_coco_pre_train.th",
"vcr_fine_tune.th",
"vcr_pre_train.th",
"vqa_coco_pre_trained.th",
"vqa_fine_tuned.th",
"vqa_pre_trained.th",
]
def load_state_dict(checkpoint_path):
sd = torch.load(checkpoint_path, map_location="cpu")
return sd
def get_new_dict(d, config, rename_keys_prefix=rename_keys_prefix):
new_d = OrderedDict()
new_d["visual_bert.embeddings.position_ids"] = torch.arange(config.max_position_embeddings).expand((1, -1))
# detector_d = OrderedDict()
for key in d:
if "detector" in key:
# detector_d[key.replace('detector.','')] = d[key]
continue
new_key = key
for name_pair in rename_keys_prefix:
new_key = new_key.replace(name_pair[0], name_pair[1])
new_d[new_key] = d[key]
if key == "bert.cls.predictions.decoder.weight":
# The old BERT code did not have `decoder.bias`; it was added separately
new_d["cls.predictions.decoder.bias"] = new_d["cls.predictions.bias"]
return new_d
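# Illustrative example (not part of the original file): with the prefixes above, an original key such as
#     "bert.bert.encoder.layer.0.attention.self.query.weight"
# is renamed to the Transformers naming scheme
#     "visual_bert.encoder.layer.0.attention.self.query.weight"
# while detector weights are skipped entirely.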
@torch.no_grad()
def convert_visual_bert_checkpoint(checkpoint_path, pytorch_dump_folder_path):
"""
Copy/paste/tweak model's weights to our VisualBERT structure.
"""
assert (
checkpoint_path.split("/")[-1] in ACCEPTABLE_CHECKPOINTS
), f"The checkpoint provided must be in {ACCEPTABLE_CHECKPOINTS}."
# Get Config
if "pre" in checkpoint_path:
model_type = "pretraining"
if "vcr" in checkpoint_path:
config_params = {"visual_embedding_dim": 512}
elif "vqa_advanced" in checkpoint_path:
config_params = {"visual_embedding_dim": 2048}
elif "vqa" in checkpoint_path:
config_params = {"visual_embedding_dim": 2048}
elif "nlvr" in checkpoint_path:
config_params = {"visual_embedding_dim": 1024}
else:
raise NotImplementedError(f"No implementation found for `{checkpoint_path}`.")
else:
if "vcr" in checkpoint_path:
config_params = {"visual_embedding_dim": 512}
model_type = "multichoice"
elif "vqa_advanced" in checkpoint_path:
config_params = {"visual_embedding_dim": 2048}
model_type = "vqa_advanced"
elif "vqa" in checkpoint_path:
config_params = {"visual_embedding_dim": 2048, "num_labels": 3129}
model_type = "vqa"
elif "nlvr" in checkpoint_path:
config_params = {
"visual_embedding_dim": 1024,
"num_labels": 2,
}
model_type = "nlvr"
config = VisualBertConfig(**config_params)
# Load State Dict
state_dict = load_state_dict(checkpoint_path)
new_state_dict = get_new_dict(state_dict, config)
if model_type == "pretraining":
model = VisualBertForPreTraining(config)
elif model_type == "vqa":
model = VisualBertForQuestionAnswering(config)
elif model_type == "nlvr":
model = VisualBertForVisualReasoning(config)
elif model_type == "multichoice":
model = VisualBertForMultipleChoice(config)
model.load_state_dict(new_state_dict)
# Save Checkpoints
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
model.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument("orig_checkpoint_path", type=str, help="A path to .th on local filesystem.")
parser.add_argument("pytorch_dump_folder_path", type=str, help="Path to the output PyTorch model.")
args = parser.parse_args()
convert_visual_bert_checkpoint(args.orig_checkpoint_path, args.pytorch_dump_folder_path)
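# Example invocation (illustrative, assuming an original checkpoint from the VisualBERT repository
# has been downloaded locally):
#     python <path to this script> nlvr2_coco_pre_trained.th ./visualbert-nlvr2-coco-pre
# or, equivalently, in-process:
#     convert_visual_bert_checkpoint("nlvr2_coco_pre_trained.th", "./visualbert-nlvr2-coco-pre")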
# coding=utf-8
# Copyright 2021 The UCLA NLP Authors and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch VisualBERT model. """
import math
from dataclasses import dataclass
from typing import Optional, Tuple
import torch
import torch.utils.checkpoint
from torch import nn
from torch.nn import CrossEntropyLoss, KLDivLoss, LogSoftmax
from ...activations import ACT2FN
from ...file_utils import (
ModelOutput,
add_start_docstrings,
add_start_docstrings_to_model_forward,
replace_return_docstrings,
)
from ...modeling_outputs import (
BaseModelOutput,
BaseModelOutputWithPooling,
MultipleChoiceModelOutput,
SequenceClassifierOutput,
)
from ...modeling_utils import (
PreTrainedModel,
apply_chunking_to_forward,
find_pruneable_heads_and_indices,
prune_linear_layer,
)
from ...utils import logging
from .configuration_visual_bert import VisualBertConfig
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "VisualBertConfig"
VISUAL_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
"uclanlp/visualbert-vqa",
"uclanlp/visualbert-vqa-pre",
"uclanlp/visualbert-vqa-coco-pre",
"uclanlp/visualbert-vcr",
"uclanlp/visualbert-vcr-pre",
"uclanlp/visualbert-vcr-coco-pre",
"uclanlp/visualbert-nlvr2",
"uclanlp/visualbert-nlvr2-pre",
"uclanlp/visualbert-nlvr2-coco-pre"
# See all VisualBERT models at https://huggingface.co/models?filter=visual_bert
]
class VisualBertEmbeddings(nn.Module):
"""Construct the embeddings from word, position and token_type embeddings and visual embeddings."""
def __init__(self, config):
super().__init__()
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
# For Visual Features
# Token type and position embedding for image features
self.visual_token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
self.visual_position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
if config.special_visual_initialize:
self.visual_token_type_embeddings.weight.data = torch.nn.Parameter(
self.token_type_embeddings.weight.data.clone(), requires_grad=True
)
self.visual_position_embeddings.weight.data = torch.nn.Parameter(
self.position_embeddings.weight.data.clone(), requires_grad=True
)
self.visual_projection = nn.Linear(config.visual_embedding_dim, config.hidden_size)
def forward(
self,
input_ids=None,
token_type_ids=None,
position_ids=None,
inputs_embeds=None,
visual_embeds=None,
visual_token_type_ids=None,
image_text_alignment=None,
):
if input_ids is not None:
input_shape = input_ids.size()
else:
input_shape = inputs_embeds.size()[:-1]
seq_length = input_shape[1]
if position_ids is None:
position_ids = self.position_ids[:, :seq_length]
if inputs_embeds is None:
inputs_embeds = self.word_embeddings(input_ids)
if token_type_ids is None:
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = inputs_embeds + token_type_embeddings
# Absolute Position Embeddings
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings
if visual_embeds is not None:
if visual_token_type_ids is None:
visual_token_type_ids = torch.ones(
visual_embeds.size()[:-1], dtype=torch.long, device=self.position_ids.device
)
visual_embeds = self.visual_projection(visual_embeds)
visual_token_type_embeddings = self.visual_token_type_embeddings(visual_token_type_ids)
if image_text_alignment is not None:
# image_text_alignment = Batch x image_length x alignment_number.
# Each element denotes the position of the word corresponding to the image feature. -1 is the padding value.
dtype = token_type_embeddings.dtype
image_text_alignment_mask = (image_text_alignment != -1).long()
# Get rid of the -1.
image_text_alignment = image_text_alignment_mask * image_text_alignment
# Batch x image_length x alignment length x dim
visual_position_embeddings = self.position_embeddings(image_text_alignment)
visual_position_embeddings *= image_text_alignment_mask.to(dtype=dtype).unsqueeze(-1)
visual_position_embeddings = visual_position_embeddings.sum(2)
# We want to average along the alignment_number dimension.
image_text_alignment_mask = image_text_alignment_mask.to(dtype=dtype).sum(2)
if (image_text_alignment_mask == 0).sum() != 0:
image_text_alignment_mask[image_text_alignment_mask == 0] = 1 # Avoid divide by zero error
logger.warning(
"Found 0 values in `image_text_alignment_mask`. Setting them to 1 to avoid divide-by-zero error."
)
visual_position_embeddings = visual_position_embeddings / image_text_alignment_mask.unsqueeze(-1)
visual_position_ids = torch.zeros(
*visual_embeds.size()[:-1], dtype=torch.long, device=visual_embeds.device
)
# When fine-tuning the detector, the image_text_alignment is sometimes padded too long.
if visual_position_embeddings.size(1) != visual_embeds.size(1):
if visual_position_embeddings.size(1) < visual_embeds.size(1):
raise ValueError(
f"Visual position embeddings length: {visual_position_embeddings.size(1)}"
f"should be the same as `visual_embeds` length: {visual_embeds.size(1)}"
)
visual_position_embeddings = visual_position_embeddings[:, : visual_embeds.size(1), :]
visual_position_embeddings = visual_position_embeddings + self.visual_position_embeddings(
visual_position_ids
)
else:
visual_position_ids = torch.zeros(
*visual_embeds.size()[:-1], dtype=torch.long, device=visual_embeds.device
)
visual_position_embeddings = self.visual_position_embeddings(visual_position_ids)
visual_embeddings = visual_embeds + visual_position_embeddings + visual_token_type_embeddings
embeddings = torch.cat((embeddings, visual_embeddings), dim=1)
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
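# Shape note (illustrative, not part of the original file): for a text sequence of length L and V visual
# regions, `embeddings` returned above has shape (batch_size, L + V, hidden_size); the visual features are
# first projected from `visual_embedding_dim` to `hidden_size` by `self.visual_projection` before the
# concatenation along the sequence dimension.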
class VisualBertSelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
raise ValueError(
f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
f"heads ({config.num_attention_heads})"
)
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
output_attentions=False,
):
mixed_query_layer = self.query(hidden_states)
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
query_layer = self.transpose_for_scores(mixed_query_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
# Apply the attention mask (precomputed for all layers in the VisualBertModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask
context_layer = torch.matmul(attention_probs, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
return outputs
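# Shape note (illustrative, not part of the original file): for a combined text+visual sequence of length S,
# `attention_scores` is (batch_size, num_attention_heads, S, S) and `context_layer` is reshaped back to
# (batch_size, S, all_head_size) before being returned.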
# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->VisualBert
class VisualBertSelfOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
class VisualBertAttention(nn.Module):
def __init__(self, config):
super().__init__()
self.self = VisualBertSelfAttention(config)
self.output = VisualBertSelfOutput(config)
self.pruned_heads = set()
def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
output_attentions=False,
):
self_outputs = self.self(
hidden_states,
attention_mask,
head_mask,
output_attentions,
)
attention_output = self.output(self_outputs[0], hidden_states)
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
return outputs
# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->VisualBert
class VisualBertIntermediate(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertOutput with Bert->VisualBert
class VisualBertOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
class VisualBertLayer(nn.Module):
def __init__(self, config):
super().__init__()
self.chunk_size_feed_forward = config.chunk_size_feed_forward
self.seq_len_dim = 1
self.attention = VisualBertAttention(config)
self.intermediate = VisualBertIntermediate(config)
self.output = VisualBertOutput(config)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
output_attentions=False,
):
self_attention_outputs = self.attention(
hidden_states,
attention_mask,
head_mask,
output_attentions=output_attentions,
)
attention_output = self_attention_outputs[0]
outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
layer_output = apply_chunking_to_forward(
self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
)
outputs = (layer_output,) + outputs
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output
class VisualBertEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.layer = nn.ModuleList([VisualBertLayer(config) for _ in range(config.num_hidden_layers)])
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
output_attentions=False,
output_hidden_states=False,
return_dict=True,
):
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
for i, layer_module in enumerate(self.layer):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
layer_head_mask = head_mask[i] if head_mask is not None else None
if getattr(self.config, "gradient_checkpointing", False) and self.training:
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs, output_attentions)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(layer_module),
hidden_states,
attention_mask,
layer_head_mask,
)
else:
layer_outputs = layer_module(hidden_states, attention_mask, layer_head_mask, output_attentions)
hidden_states = layer_outputs[0]
if output_attentions:
all_self_attentions = all_self_attentions + (layer_outputs[1],)
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(
v
for v in [
hidden_states,
all_hidden_states,
all_self_attentions,
]
if v is not None
)
return BaseModelOutput(
last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_self_attentions
)
# Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->VisualBert
class VisualBertPooler(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output
# Copied from transformers.models.bert.modeling_bert.BertPredictionHeadTransform with Bert->VisualBert
class VisualBertPredictionHeadTransform(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
if isinstance(config.hidden_act, str):
self.transform_act_fn = ACT2FN[config.hidden_act]
else:
self.transform_act_fn = config.hidden_act
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.transform_act_fn(hidden_states)
hidden_states = self.LayerNorm(hidden_states)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertLMPredictionHead with Bert->VisualBert
class VisualBertLMPredictionHead(nn.Module):
def __init__(self, config):
super().__init__()
self.transform = VisualBertPredictionHeadTransform(config)
# The output weights are the same as the input embeddings, but there is
# an output-only bias for each token.
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.bias = nn.Parameter(torch.zeros(config.vocab_size))
# Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
self.decoder.bias = self.bias
def forward(self, hidden_states):
hidden_states = self.transform(hidden_states)
hidden_states = self.decoder(hidden_states)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertPreTrainingHeads with Bert->VisualBert
class VisualBertPreTrainingHeads(nn.Module):
def __init__(self, config):
super().__init__()
self.predictions = VisualBertLMPredictionHead(config)
self.seq_relationship = nn.Linear(config.hidden_size, 2)
def forward(self, sequence_output, pooled_output):
prediction_scores = self.predictions(sequence_output)
seq_relationship_score = self.seq_relationship(pooled_output)
return prediction_scores, seq_relationship_score
class VisualBertPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = VisualBertConfig
base_model_prefix = "visual_bert"
_keys_to_ignore_on_load_missing = [r"position_ids"]
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, (nn.Linear, nn.Embedding)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and module.bias is not None:
module.bias.data.zero_()
@dataclass
class VisualBertForPreTrainingOutput(ModelOutput):
"""
Output type of :class:`~transformers.VisualBertForPreTraining`.
Args:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided):
Total loss as the sum of the masked language modeling loss and the sentence-image prediction
(classification) loss.
prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the sentence-image prediction (classification) head (scores of whether the text matches
the image, before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
loss: Optional[torch.FloatTensor] = None
prediction_logits: torch.FloatTensor = None
seq_relationship_logits: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
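# NOTE: The usage examples in the docstrings below assume a helper like ``get_visual_embeddings(image)`` that returns
# region features of shape ``(visual_seq_length, visual_embedding_dim)`` from an object detector. No such helper is
# shipped with the library; a minimal, illustrative sketch (random features standing in for real detector output) of
# how the visual inputs can be assembled looks like this:
#
#     import torch
#
#     def get_visual_embeddings(image, visual_seq_length=36, visual_embedding_dim=2048):
#         # Placeholder: replace with pooled region features from an object detector backbone.
#         return torch.randn(visual_seq_length, visual_embedding_dim)
#
#     visual_embeds = get_visual_embeddings(image).unsqueeze(0)  # (1, visual_seq_length, visual_embedding_dim)
#     visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)  # the authors use all 1s
#     visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)  # attend to every region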
VISUAL_BERT_START_DOCSTRING = r"""
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
general usage and behavior.
Parameters:
config (:class:`~transformers.VisualBertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
VISUAL_BERT_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.BertTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
visual_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, visual_seq_length, visual_embedding_dim)`, `optional`):
The embedded representation of the visual inputs, generally derived using an object detector.
visual_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, visual_seq_length)`, `optional`):
Mask to avoid performing attention on visual embeddings. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
visual_token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, visual_seq_length)`, `optional`):
Segment token indices to indicate different portions of the visual embeds.
`What are token type IDs? <../glossary.html#token-type-ids>`_ The authors of VisualBERT set the
`visual_token_type_ids` to `1` for all tokens.
image_text_alignment (:obj:`torch.LongTensor` of shape :obj:`(batch_size, visual_seq_length, alignment_number)`, `optional`):
The image-text alignment pairs, used to decide the position IDs of the visual embeddings.
output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`):
Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`):
Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""
@add_start_docstrings(
"The bare VisualBert Model transformer outputting raw hidden-states without any specific head on top.",
VISUAL_BERT_START_DOCSTRING,
)
class VisualBertModel(VisualBertPreTrainedModel):
"""
The model can behave as an encoder (with only self-attention) following the architecture described in `Attention is
all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
"""
def __init__(self, config, add_pooling_layer=True):
super().__init__(config)
self.config = config
self.embeddings = VisualBertEmbeddings(config)
self.encoder = VisualBertEncoder(config)
self.pooler = VisualBertPooler(config) if add_pooling_layer else None
self.bypass_transformer = config.bypass_transformer
if self.bypass_transformer:
self.additional_layer = VisualBertLayer(config)
self.init_weights()
def get_input_embeddings(self):
return self.embeddings.word_embeddings
def set_input_embeddings(self, value):
self.embeddings.word_embeddings = value
def _prune_heads(self, heads_to_prune):
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_model_forward(VISUAL_BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
visual_embeds=None,
visual_attention_mask=None,
visual_token_type_ids=None,
image_text_alignment=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
Example::
>>> # Assumption: `get_visual_embeddings(image)` gets the visual embeddings of the image.
>>> from transformers import BertTokenizer, VisualBertModel
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = VisualBertModel.from_pretrained('uclanlp/visualbert-vqa-coco-pre')
>>> inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
>>> visual_embeds = get_visual_embeddings(image).unsqueeze(0)
>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) #example
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update({
... "visual_embeds": visual_embeds,
... "visual_token_type_ids": visual_token_type_ids,
... "visual_attention_mask": visual_attention_mask
... })
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
input_shape = input_ids.size()
batch_size, seq_length = input_shape
elif inputs_embeds is not None:
input_shape = inputs_embeds.size()[:-1]
batch_size, seq_length = input_shape
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
if visual_embeds is None:
raise ValueError(
f"`visual_embeds` cannot be of type {type(visual_embeds)} when using a VisualBert Model."
)
device = input_ids.device if input_ids is not None else inputs_embeds.device
visual_input_shape = visual_embeds.size()[:-1]
if attention_mask is None:
attention_mask = torch.ones(input_shape, device=device)
if visual_attention_mask is None:
visual_attention_mask = torch.ones(visual_input_shape, device=device)
# Concatenate the text and visual attention masks so that text and visual tokens can attend to one another,
# then extend the combined mask so it is broadcastable to all attention heads.
combined_attention_mask = torch.cat((attention_mask, visual_attention_mask), dim=-1)
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(
combined_attention_mask, [batch_size, input_shape + visual_input_shape], device
)
# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
embedding_output = self.embeddings(
input_ids=input_ids,
position_ids=position_ids,
token_type_ids=token_type_ids,
inputs_embeds=inputs_embeds,
visual_embeds=visual_embeds,
visual_token_type_ids=visual_token_type_ids,
image_text_alignment=image_text_alignment,
)
if self.bypass_transformer and visual_embeds is not None:
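# When `bypass_transformer` is set, only the text embeddings go through the full encoder; the visual embeddings
# skip it and are fused with the encoded text in the single `additional_layer` below.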
text_length = input_ids.size(1)
text_embedding_output = embedding_output[:, :text_length, :]
visual_embedding_output = embedding_output[:, text_length:, :]
text_extended_attention_mask = extended_attention_mask[:, :, :, :text_length]  # keep the broadcast dims; restrict the key length to the text tokens
encoded_outputs = self.encoder(
text_embedding_output,
attention_mask=text_extended_attention_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = encoded_outputs[0]
concatenated_input = torch.cat((sequence_output, visual_embedding_output), dim=1)
sequence_output = self.additional_layer(concatenated_input, extended_attention_mask)[0]  # VisualBertLayer returns a tuple; take the hidden states
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
else:
encoder_outputs = self.encoder(
embedding_output,
attention_mask=extended_attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
if not return_dict:
return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPooling(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
)
@add_start_docstrings(
"""
VisualBert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
`sentence-image prediction (classification)` head.
""",
VISUAL_BERT_START_DOCSTRING,
)
class VisualBertForPreTraining(VisualBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.visual_bert = VisualBertModel(config)
self.cls = VisualBertPreTrainingHeads(config)
self.init_weights()
def get_output_embeddings(self):
return self.cls.predictions.decoder
def set_output_embeddings(self, new_embeddings):
self.cls.predictions.decoder = new_embeddings
@add_start_docstrings_to_model_forward(VISUAL_BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=VisualBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
visual_embeds=None,
visual_attention_mask=None,
visual_token_type_ids=None,
image_text_alignment=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
sentence_image_labels=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape ``(batch_size, total_sequence_length)``, `optional`):
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
sentence_image_labels (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
Labels for computing the sentence-image prediction (classification) loss. Input should be a sequence
pair (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``:
- 0 indicates sequence B is a matching pair of sequence A for the given image,
- 1 indicates sequence B is a random sequence w.r.t A for the given image.
Returns:
Example::
>>> # Assumption: `get_visual_embeddings(image)` gets the visual embeddings of the image in the batch.
>>> from transformers import BertTokenizer, VisualBertForPreTraining
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = VisualBertForPreTraining.from_pretrained('uclanlp/visualbert-vqa-coco-pre')
>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
>>> visual_embeds = get_visual_embeddings(image).unsqueeze(0)
>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) #example
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update({
... "visual_embeds": visual_embeds,
... "visual_token_type_ids": visual_token_type_ids,
... "visual_attention_mask": visual_attention_mask
... })
>>> max_length = inputs["input_ids"].shape[-1]+visual_embeds.shape[-2]
>>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]
>>> sentence_image_labels = torch.tensor(1).unsqueeze(0) # Batch size 1
>>> outputs = model(**inputs, labels=labels, sentence_image_labels=sentence_image_labels)
>>> loss = outputs.loss
>>> prediction_logits = outputs.prediction_logits
>>> seq_relationship_logits = outputs.seq_relationship_logits
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.visual_bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
image_text_alignment=image_text_alignment,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output, pooled_output = outputs[:2]
prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)
total_loss = None
if labels is not None and sentence_image_labels is not None:
total_size = attention_mask.size(-1) + visual_attention_mask.size(-1)
if labels.size(-1) != total_size:
raise ValueError(
"The labels provided should have the same sequence length as the total attention mask. "
f"Found labels with sequence length {labels.size(-1)}, expected {total_size}."
)
loss_fct = CrossEntropyLoss()
masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
sentence_image_loss = loss_fct(seq_relationship_score.view(-1, 2), sentence_image_labels.view(-1))
total_loss = masked_lm_loss + sentence_image_loss
if labels is not None and sentence_image_labels is None:
total_size = attention_mask.size(-1) + visual_attention_mask.size(-1)
if labels.size(-1) != total_size:
raise ValueError(
"The labels provided should have the same sequence length as the total attention mask. "
f"Found labels with sequence length {labels.size(-1)}, expected {total_size}."
)
loss_fct = CrossEntropyLoss()
total_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
if not return_dict:
output = (prediction_scores, seq_relationship_score) + outputs[2:]
return ((total_loss,) + output) if total_loss is not None else output
return VisualBertForPreTrainingOutput(
loss=total_loss,
prediction_logits=prediction_scores,
seq_relationship_logits=seq_relationship_score,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""
VisualBert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and
a softmax) e.g. for VCR tasks.
""",
VISUAL_BERT_START_DOCSTRING,
)
class VisualBertForMultipleChoice(VisualBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.visual_bert = VisualBertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.cls = nn.Linear(config.hidden_size, 1)
self.init_weights()
@add_start_docstrings_to_model_forward(
VISUAL_BERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length")
)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
visual_embeds=None,
visual_attention_mask=None,
visual_token_type_ids=None,
image_text_alignment=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,
num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors.
(See :obj:`input_ids` above)
Example::
>>> from transformers import BertTokenizer, VisualBertForMultipleChoice
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = VisualBertForMultipleChoice.from_pretrained('uclanlp/visualbert-vcr')
>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> choice0 = "It is eaten with a fork and a knife."
>>> choice1 = "It is eaten while held in the hand."
>>> visual_embeds = get_visual_embeddings(image)
>>> # (batch_size, num_choices, visual_seq_length, visual_embedding_dim)
>>> visual_embeds = visual_embeds.expand(1, 2, *visual_embeds.shape)
>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> labels = torch.tensor(0).unsqueeze(0) # choice0 is correct (according to Wikipedia ;)), batch size 1
>>> encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)
>>> # batch size is 1
>>> inputs_dict = {k: v.unsqueeze(0) for k, v in encoding.items()}
>>> inputs_dict.update({
... "visual_embeds": visual_embeds,
... "visual_attention_mask": visual_attention_mask,
... "visual_token_type_ids": visual_token_type_ids,
... "labels": labels
... })
>>> outputs = model(**inputs_dict)
>>> loss = outputs.loss
>>> logits = outputs.logits
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
inputs_embeds = (
inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
if inputs_embeds is not None
else None
)
visual_embeds = (
visual_embeds.view(-1, visual_embeds.size(-2), visual_embeds.size(-1))
if visual_embeds is not None
else None
)
visual_attention_mask = (
visual_attention_mask.view(-1, visual_attention_mask.size(-1))
if visual_attention_mask is not None
else None
)
visual_token_type_ids = (
visual_token_type_ids.view(-1, visual_token_type_ids.size(-1))
if visual_token_type_ids is not None
else None
)
outputs = self.visual_bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
image_text_alignment=image_text_alignment,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
_, pooled_output = outputs[0], outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.cls(pooled_output)
reshaped_logits = logits.view(-1, num_choices)
loss = None
if labels is not None:
loss_fct = CrossEntropyLoss()
loss = loss_fct(reshaped_logits, labels)
if not return_dict:
output = (reshaped_logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return MultipleChoiceModelOutput(
loss=loss,
logits=reshaped_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""
VisualBert Model with a classification/regression head on top (a dropout and a linear layer on top of the pooled
output) for VQA.
""",
VISUAL_BERT_START_DOCSTRING,
)
class VisualBertForQuestionAnswering(VisualBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.visual_bert = VisualBertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.cls = nn.Linear(config.hidden_size, config.num_labels)
self.init_weights()
@add_start_docstrings_to_model_forward(VISUAL_BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
visual_embeds=None,
visual_attention_mask=None,
visual_token_type_ids=None,
image_text_alignment=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, total_sequence_length)`, `optional`):
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. A KLDivLoss is computed between the labels and the returned logits.
Example::
>>> # Assumption: `get_visual_embeddings(image)` gets the visual embeddings of the image in the batch.
>>> from transformers import BertTokenizer, VisualBertForQuestionAnswering
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = VisualBertForQuestionAnswering.from_pretrained('uclanlp/visualbert-vqa')
>>> text = "Who is eating the apple?"
>>> inputs = tokenizer(text, return_tensors='pt')
>>> visual_embeds = get_visual_embeddings(image).unsqueeze(0)
>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) #example
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update({
... "visual_embeds": visual_embeds,
... "visual_token_type_ids": visual_token_type_ids,
... "visual_attention_mask": visual_attention_mask
... })
>>> labels = torch.tensor([[0.0, 1.0]]) # Batch size 1, Num labels 2
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> scores = outputs.logits
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# Get the index of the last text token
index_to_gather = attention_mask.sum(1) - 2 # as in original code
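# (`attention_mask.sum(1)` counts the text tokens including [CLS] and the final [SEP]; subtracting 2 therefore
# points at the last regular text token, assuming a BERT-style input that ends with [SEP].)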
outputs = self.visual_bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
image_text_alignment=image_text_alignment,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
# TO-CHECK: From the original code
index_to_gather = (
index_to_gather.unsqueeze(-1).unsqueeze(-1).expand(index_to_gather.size(0), 1, sequence_output.size(-1))
)
pooled_output = torch.gather(sequence_output, 1, index_to_gather)
pooled_output = self.dropout(pooled_output)
logits = self.cls(pooled_output)
reshaped_logits = logits.view(-1, self.num_labels)
loss = None
if labels is not None:
loss_fct = torch.nn.KLDivLoss(reduction="batchmean")
log_softmax = torch.nn.LogSoftmax(dim=-1)
reshaped_logits = log_softmax(reshaped_logits)
loss = loss_fct(reshaped_logits, labels.contiguous())
if not return_dict:
output = (reshaped_logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return SequenceClassifierOutput(
loss=loss,
logits=reshaped_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""
VisualBert Model with a sequence classification head on top (a dropout and a linear layer on top of the pooled
output) for Visual Reasoning, e.g. for the NLVR2 task.
""",
VISUAL_BERT_START_DOCSTRING,
)
class VisualBertForVisualReasoning(VisualBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.visual_bert = VisualBertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.cls = nn.Linear(config.hidden_size, config.num_labels) # 2
self.init_weights()
@add_start_docstrings_to_model_forward(VISUAL_BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
visual_embeds=None,
visual_attention_mask=None,
visual_token_type_ids=None,
image_text_alignment=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. A classification loss is computed (Cross-Entropy) against these labels.
Example::
>>> # Assumption: `get_visual_embeddings(image)` gets the visual embeddings of the image in the batch.
>>> from transformers import BertTokenizer, VisualBertForVisualReasoning
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = VisualBertForVisualReasoning.from_pretrained('uclanlp/visualbert-nlvr2')
>>> text = "Who is eating the apple?"
>>> inputs = tokenizer(text, return_tensors='pt')
>>> visual_embeds = get_visual_embeddings(image).unsqueeze(0)
>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) #example
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update({
... "visual_embeds": visual_embeds,
... "visual_token_type_ids": visual_token_type_ids,
... "visual_attention_mask": visual_attention_mask
... })
>>> labels = torch.tensor(1).unsqueeze(0) # Batch size 1, Num choices 2
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> scores = outputs.logits
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.visual_bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
image_text_alignment=image_text_alignment,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
# sequence_output = outputs[0]
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.cls(pooled_output)
reshaped_logits = logits.contiguous()
loss = None
if labels is not None:
loss_fct = CrossEntropyLoss()
loss = loss_fct(reshaped_logits, labels.view(-1))
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return SequenceClassifierOutput(
loss=loss,
logits=reshaped_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
class VisualBertRegionToPhraseAttention(nn.Module):
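"""
Computes raw (unnormalized) attention scores between selected text positions (the queries) and the visual
features (the keys); these scores serve as the region-to-phrase alignment logits. A single attention head is
used here, regardless of `config.num_attention_heads`.
"""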
def __init__(self, config):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0:
raise ValueError(
f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
f"heads ({config.num_attention_heads})"
)
self.num_attention_heads = 1 # config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(self, query, key, attention_mask):
attention_mask = attention_mask.to(query.dtype)
attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
attention_mask = (1.0 - attention_mask) * -10000.0
mixed_query_layer = self.query(query)
mixed_key_layer = self.key(key)
query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
attention_scores = attention_scores + attention_mask
attention_scores = attention_scores.squeeze(1)
return attention_scores
@add_start_docstrings(
"""
VisualBert Model with a Masked Language Modeling head and an attention layer on top for Region-to-Phrase Alignment
e.g. for the Flickr30K Entities task.
""",
VISUAL_BERT_START_DOCSTRING,
)
class VisualBertForRegionToPhraseAlignment(VisualBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.visual_bert = VisualBertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.cls = VisualBertPreTrainingHeads(config)
self.attention = VisualBertRegionToPhraseAttention(config)
self.init_weights()
@add_start_docstrings_to_model_forward(VISUAL_BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
visual_embeds=None,
visual_attention_mask=None,
visual_token_type_ids=None,
image_text_alignment=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
region_to_phrase_position=None,
labels=None,
):
r"""
region_to_phrase_position (:obj:`torch.LongTensor` of shape ``(batch_size, total_sequence_length)``, `optional`):
The positions of the image embeddings that correspond to the textual tokens.
labels (:obj:`torch.LongTensor` of shape ``(batch_size, total_sequence_length, visual_sequence_length)``, `optional`):
Labels for computing the masked language modeling loss. KLDivLoss is computed against these labels and
the outputs from the attention layer.
Example::
>>> # Assumption: `get_visual_embeddings(image)` gets the visual embeddings of the image in the batch.
>>> from transformers import BertTokenizer, VisualBertForRegionToPhraseAlignment
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = VisualBertForRegionToPhraseAlignment.from_pretrained('uclanlp/visualbert-vqa-coco-pre')
>>> text = "Who is eating the apple?"
>>> inputs = tokenizer(text, return_tensors='pt')
>>> visual_embeds = get_visual_embeddings(image).unsqueeze(0)
>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) #example
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> region_to_phrase_position = torch.ones((1, inputs["input_ids"].shape[-1] + visual_embeds.shape[-2]), dtype=torch.long)
>>> inputs.update({
... "region_to_phrase_position": region_to_phrase_position,
... "visual_embeds": visual_embeds,
... "visual_token_type_ids": visual_token_type_ids,
... "visual_attention_mask": visual_attention_mask
... })
>>> labels = torch.ones((1, inputs["input_ids"].shape[-1]+visual_embeds.shape[-2], visual_embeds.shape[-2])) # Batch size 1
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> scores = outputs.logits
"""
if region_to_phrase_position is None:
raise ValueError("`region_to_phrase_position` should not be None when using the Flickr (region-to-phrase alignment) model.")
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.visual_bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
image_text_alignment=image_text_alignment,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
region_to_phrase_position_mask = (region_to_phrase_position != -1).long()
# Make the -1 become 0
region_to_phrase_position = region_to_phrase_position * region_to_phrase_position_mask
# Selected_positions = batch x selected position x dim
expanded_region_to_phrase_positions = region_to_phrase_position.unsqueeze(2).expand(
region_to_phrase_position.size(0), region_to_phrase_position.size(1), sequence_output.size(2)
)
selected_positions = sequence_output.gather(1, expanded_region_to_phrase_positions)
# Visual Features = batch x visual_feature_length x dim
# This will need separate image and visual masks.
visual_features = sequence_output[:, attention_mask.size(1) :]
if visual_features.size(1) != visual_attention_mask.size(1):
raise ValueError(
f"Visual features length ({visual_features.size(1)}) should match the visual attention mask "
f"length ({visual_attention_mask.size(1)})."
)
logits = self.attention(selected_positions, visual_features, visual_attention_mask)
loss = None
if labels is not None:
# scores = batch x selected position x visual_feature
# scores = selected_positions.bmm(visual_features.transpose(1,2))
# label = batch x selected_position x needed position
loss_fct = KLDivLoss(reduction="batchmean")
log_softmax = LogSoftmax(dim=-1)
scores = log_softmax(logits)
labels = labels.contiguous()
loss = loss_fct(scores, labels)
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return SequenceClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@@ -2864,6 +2864,65 @@ def load_tf_weights_in_transfo_xl(*args, **kwargs):
requires_backends(load_tf_weights_in_transfo_xl, ["torch"])
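# Dummy VisualBert placeholders used when the PyTorch backend is not available: instantiating them raises an
# informative error through `requires_backends`.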
VISUAL_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = None
class VisualBertForMultipleChoice:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisualBertForPreTraining:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisualBertForQuestionAnswering:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisualBertForRegionToPhraseAlignment:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisualBertForVisualReasoning:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisualBertLayer:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisualBertModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_backends(self, ["torch"])
class VisualBertPreTrainedModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_backends(self, ["torch"])
VIT_PRETRAINED_MODEL_ARCHIVE_LIST = None
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch VisualBERT model. """
import copy
import unittest
from tests.test_modeling_common import floats_tensor
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
if is_torch_available():
import torch
from transformers import (
VisualBertConfig,
VisualBertForMultipleChoice,
VisualBertForPreTraining,
VisualBertForQuestionAnswering,
VisualBertForRegionToPhraseAlignment,
VisualBertForVisualReasoning,
VisualBertModel,
)
from transformers.models.visual_bert.modeling_visual_bert import VISUAL_BERT_PRETRAINED_MODEL_ARCHIVE_LIST
class VisualBertModelTester:
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
visual_seq_length=5,
is_training=True,
use_attention_mask=True,
use_visual_attention_mask=True,
use_token_type_ids=True,
use_visual_token_type_ids=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
visual_embedding_dim=20,
type_vocab_size=16,
type_sequence_label_size=2,
initializer_range=0.02,
num_labels=3,
num_choices=4,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.visual_seq_length = visual_seq_length
self.is_training = is_training
self.use_attention_mask = use_attention_mask
self.use_visual_attention_mask = use_visual_attention_mask
self.use_token_type_ids = use_token_type_ids
self.use_visual_token_type_ids = use_visual_token_type_ids
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.visual_embedding_dim = visual_embedding_dim
self.type_vocab_size = type_vocab_size
self.type_sequence_label_size = type_sequence_label_size
self.initializer_range = initializer_range
self.num_labels = num_labels
self.num_choices = num_choices
self.scope = scope
def prepare_config(self):
return VisualBertConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
hidden_act=self.hidden_act,
hidden_dropout_prob=self.hidden_dropout_prob,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
type_vocab_size=self.type_vocab_size,
visual_embedding_dim=self.visual_embedding_dim,
num_labels=self.num_labels,
is_decoder=False,
initializer_range=self.initializer_range,
)
def prepare_config_and_inputs_for_common(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
visual_embeds = floats_tensor([self.batch_size, self.visual_seq_length, self.visual_embedding_dim])
attention_mask = None
if self.use_attention_mask:
attention_mask = torch.ones((self.batch_size, self.seq_length), dtype=torch.long, device=torch_device)
visual_attention_mask = None
if self.use_visual_attention_mask:
visual_attention_mask = torch.ones(
(self.batch_size, self.visual_seq_length), dtype=torch.long, device=torch_device
)
token_type_ids = None
if self.use_token_type_ids:
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
visual_token_type_ids = None
if self.use_visual_token_type_ids:
visual_token_type_ids = ids_tensor([self.batch_size, self.visual_seq_length], self.type_vocab_size)
config = self.prepare_config()
return config, {
"input_ids": input_ids,
"token_type_ids": token_type_ids,
"attention_mask": attention_mask,
"visual_embeds": visual_embeds,
"visual_token_type_ids": visual_token_type_ids,
"visual_attention_mask": visual_attention_mask,
}
def prepare_config_and_inputs_for_pretraining(self):
masked_lm_labels = None
sentence_image_labels = None
if self.use_labels:
masked_lm_labels = ids_tensor([self.batch_size, self.seq_length + self.visual_seq_length], self.vocab_size)
sentence_image_labels = ids_tensor(
[self.batch_size],
self.type_sequence_label_size,
)
config, input_dict = self.prepare_config_and_inputs_for_common()
input_dict.update({"labels": masked_lm_labels, "sentence_image_labels": sentence_image_labels})
return config, input_dict
def prepare_config_and_inputs_for_multiple_choice(self):
input_ids = ids_tensor([self.batch_size, self.num_choices, self.seq_length], self.vocab_size)
visual_embeds = floats_tensor(
[self.batch_size, self.num_choices, self.visual_seq_length, self.visual_embedding_dim]
)
attention_mask = None
if self.use_attention_mask:
attention_mask = torch.ones(
(self.batch_size, self.num_choices, self.seq_length), dtype=torch.long, device=torch_device
)
visual_attention_mask = None
if self.use_visual_attention_mask:
visual_attention_mask = torch.ones(
(self.batch_size, self.num_choices, self.visual_seq_length), dtype=torch.long, device=torch_device
)
token_type_ids = None
if self.use_token_type_ids:
token_type_ids = ids_tensor([self.batch_size, self.num_choices, self.seq_length], self.type_vocab_size)
visual_token_type_ids = None
if self.use_visual_token_type_ids:
visual_token_type_ids = ids_tensor(
[self.batch_size, self.num_choices, self.visual_seq_length], self.type_vocab_size
)
labels = None
if self.use_labels:
labels = ids_tensor([self.batch_size], self.num_choices)
config = self.prepare_config()
return config, {
"input_ids": input_ids,
"token_type_ids": token_type_ids,
"attention_mask": attention_mask,
"visual_embeds": visual_embeds,
"visual_token_type_ids": visual_token_type_ids,
"visual_attention_mask": visual_attention_mask,
"labels": labels,
}
def prepare_config_and_inputs_for_vqa(self):
vqa_labels = None
if self.use_labels:
vqa_labels = floats_tensor([self.batch_size, self.num_labels])
config, input_dict = self.prepare_config_and_inputs_for_common()
input_dict.update({"labels": vqa_labels})
return config, input_dict
def prepare_config_and_inputs_for_nlvr(self):
nlvr_labels = None
if self.use_labels:
nlvr_labels = ids_tensor([self.batch_size], self.num_labels)
config, input_dict = self.prepare_config_and_inputs_for_common()
input_dict.update({"labels": nlvr_labels})
return config, input_dict
def prepare_config_and_inputs_for_flickr(self):
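# Each text position points at a (random) visual region; the visual positions are filled with -1, which the
# model replaces with 0 before gathering.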
region_to_phrase_position = torch.cat(
(
ids_tensor([self.batch_size, self.seq_length], self.visual_seq_length),
torch.ones(self.batch_size, self.visual_seq_length, dtype=torch.long, device=torch_device) * -1,
),
dim=-1,
)
flickr_labels = None
if self.use_labels:
flickr_labels = floats_tensor(
[self.batch_size, self.seq_length + self.visual_seq_length, self.visual_seq_length]
)
config, input_dict = self.prepare_config_and_inputs_for_common()
input_dict.update({"region_to_phrase_position": region_to_phrase_position, "labels": flickr_labels})
return config, input_dict
def create_and_check_model(self, config, input_dict):
model = VisualBertModel(config=config)
model.to(torch_device)
model.eval()
result = model(**input_dict)
self.parent.assertEqual(
result.last_hidden_state.shape,
(self.batch_size, self.seq_length + self.visual_seq_length, self.hidden_size),
)
def create_and_check_for_pretraining(self, config, input_dict):
model = VisualBertForPreTraining(config=config)
model.to(torch_device)
model.eval()
result = model(**input_dict)
self.parent.assertEqual(
result.prediction_logits.shape,
(self.batch_size, self.seq_length + self.visual_seq_length, self.vocab_size),
)
def create_and_check_for_vqa(self, config, input_dict):
model = VisualBertForQuestionAnswering(config=config)
model.to(torch_device)
model.eval()
result = model(**input_dict)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_for_multiple_choice(self, config, input_dict):
model = VisualBertForMultipleChoice(config=config)
model.to(torch_device)
model.eval()
result = model(**input_dict)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_choices))
def create_and_check_for_nlvr(self, config, input_dict):
model = VisualBertForVisualReasoning(config=config)
model.to(torch_device)
model.eval()
result = model(**input_dict)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_for_flickr(self, config, input_dict):
model = VisualBertForRegionToPhraseAlignment(config=config)
model.to(torch_device)
model.eval()
result = model(**input_dict)
self.parent.assertEqual(
result.logits.shape, (self.batch_size, self.seq_length + self.visual_seq_length, self.visual_seq_length)
)
@require_torch
class VisualBertModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (
(
VisualBertModel,
VisualBertForMultipleChoice,
VisualBertForVisualReasoning,
VisualBertForRegionToPhraseAlignment,
VisualBertForQuestionAnswering,
VisualBertForPreTraining,
)
if is_torch_available()
else ()
)
test_torchscript = False
test_pruning = False
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = copy.deepcopy(inputs_dict)
if model_class == VisualBertForMultipleChoice:
for key in inputs_dict.keys():
value = inputs_dict[key]
if isinstance(value, torch.Tensor) and value.ndim > 1:
if key != "visual_embeds":
inputs_dict[key] = (
inputs_dict[key].unsqueeze(1).expand(-1, self.model_tester.num_choices, -1).contiguous()
)
else:
inputs_dict[key] = (
inputs_dict[key]
.unsqueeze(1)
.expand(-1, self.model_tester.num_choices, -1, self.model_tester.visual_embedding_dim)
.contiguous()
)
elif model_class == VisualBertForRegionToPhraseAlignment:
total_length = self.model_tester.seq_length + self.model_tester.visual_seq_length
batch_size = self.model_tester.batch_size
inputs_dict["region_to_phrase_position"] = torch.zeros(
(batch_size, total_length),
dtype=torch.long,
device=torch_device,
)
if return_labels:
if model_class == VisualBertForMultipleChoice:
inputs_dict["labels"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.long, device=torch_device
)
elif model_class == VisualBertForPreTraining:
total_length = self.model_tester.seq_length + self.model_tester.visual_seq_length
batch_size = self.model_tester.batch_size
inputs_dict["labels"] = torch.zeros(
(batch_size, total_length),
dtype=torch.long,
device=torch_device,
)
inputs_dict["sentence_image_labels"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.long, device=torch_device
)
# Flickr expects float labels
elif model_class == VisualBertForRegionToPhraseAlignment:
batch_size = self.model_tester.batch_size
total_length = self.model_tester.seq_length + self.model_tester.visual_seq_length
inputs_dict["labels"] = torch.ones(
(
batch_size,
total_length,
self.model_tester.visual_seq_length,
),
dtype=torch.float,
device=torch_device,
)
# VQA expects float labels
elif model_class == VisualBertForQuestionAnswering:
inputs_dict["labels"] = torch.ones(
(self.model_tester.batch_size, self.model_tester.num_labels),
dtype=torch.float,
device=torch_device,
)
elif model_class == VisualBertForVisualReasoning:
inputs_dict["labels"] = torch.zeros(
(self.model_tester.batch_size), dtype=torch.long, device=torch_device
)
return inputs_dict
def setUp(self):
self.model_tester = VisualBertModelTester(self)
self.config_tester = ConfigTester(self, config_class=VisualBertConfig, hidden_size=37)
def test_attention_outputs(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
seq_len = getattr(self.model_tester, "seq_length", None)
visual_seq_len = getattr(self.model_tester, "visual_seq_length", None)
encoder_seq_length = (seq_len if seq_len is not None else 0) + (
visual_seq_len if visual_seq_len is not None else 0
)
encoder_key_length = getattr(self.model_tester, "key_length", encoder_seq_length)
chunk_length = getattr(self.model_tester, "chunk_length", None)
if chunk_length is not None and hasattr(self.model_tester, "num_hashes"):
encoder_seq_length = encoder_seq_length * self.model_tester.num_hashes
for model_class in self.all_model_classes:
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
# check that output_attentions also work using config
del inputs_dict["output_attentions"]
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
if chunk_length is not None:
self.assertListEqual(
list(attentions[0].shape[-4:]),
[self.model_tester.num_attention_heads, encoder_seq_length, chunk_length, encoder_key_length],
)
else:
self.assertListEqual(
list(attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
)
out_len = len(outputs)
# Check attention is always last and order is fine
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
if hasattr(self.model_tester, "num_hidden_states_types"):
added_hidden_states = self.model_tester.num_hidden_states_types
elif self.is_encoder_decoder:
added_hidden_states = 2
else:
added_hidden_states = 1
self.assertEqual(out_len + added_hidden_states, len(outputs))
self_attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
if chunk_length is not None:
self.assertListEqual(
list(self_attentions[0].shape[-4:]),
[self.model_tester.num_attention_heads, encoder_seq_length, chunk_length, encoder_key_length],
)
else:
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
)
def test_hidden_states_output(self):
def check_hidden_states_output(inputs_dict, config, model_class):
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
hidden_states = outputs.encoder_hidden_states if config.is_encoder_decoder else outputs.hidden_states
expected_num_layers = getattr(
self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
)
self.assertEqual(len(hidden_states), expected_num_layers)
if hasattr(self.model_tester, "encoder_seq_length"):
seq_length = self.model_tester.encoder_seq_length
if hasattr(self.model_tester, "chunk_length") and self.model_tester.chunk_length > 1:
seq_length = seq_length * self.model_tester.chunk_length
else:
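# VisualBERT defines no encoder_seq_length, so fall back to the combined text + visual sequence length.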
seq_length = self.model_tester.seq_length + self.model_tester.visual_seq_length
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[seq_length, self.model_tester.hidden_size],
)
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
inputs_dict["output_hidden_states"] = True
check_hidden_states_output(inputs_dict, config, model_class)
# check that output_hidden_states also works when it is set via the config
del inputs_dict["output_hidden_states"]
config.output_hidden_states = True
check_hidden_states_output(inputs_dict, config, model_class)
def test_config(self):
self.config_tester.run_common_tests()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_model_various_embeddings(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
for embedding_type in ["absolute", "relative_key", "relative_key_query"]:
config_and_inputs[0].position_embedding_type = embedding_type
self.model_tester.create_and_check_model(*config_and_inputs)
def test_model_for_pretraining(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_pretraining()
self.model_tester.create_and_check_for_pretraining(*config_and_inputs)
def test_model_for_vqa(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_vqa()
self.model_tester.create_and_check_for_vqa(*config_and_inputs)
def test_model_for_nlvr(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_nlvr()
self.model_tester.create_and_check_for_nlvr(*config_and_inputs)
def test_model_for_multiple_choice(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_multiple_choice()
self.model_tester.create_and_check_for_multiple_choice(*config_and_inputs)
def test_model_for_flickr(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_flickr()
self.model_tester.create_and_check_for_flickr(*config_and_inputs)
@slow
def test_model_from_pretrained(self):
for model_name in VISUAL_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = VisualBertModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@require_torch
class VisualBertModelIntegrationTest(unittest.TestCase):
@slow
def test_inference_vqa_coco_pre(self):
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
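# The pre-training model returns masked-LM prediction logits and a sentence-image relationship score.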
input_ids = torch.tensor([1, 2, 3, 4, 5, 6], dtype=torch.long).reshape(1, -1)
token_type_ids = torch.tensor([0, 0, 0, 1, 1, 1], dtype=torch.long).reshape(1, -1)
visual_embeds = torch.ones(size=(1, 10, 2048), dtype=torch.float32) * 0.5
visual_token_type_ids = torch.ones(size=(1, 10), dtype=torch.long)
attention_mask = torch.tensor([1] * 6).reshape(1, -1)
visual_attention_mask = torch.tensor([1] * 10).reshape(1, -1)
output = model(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
)
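# 6 text tokens + 10 visual tokens give 16 positions in the prediction logits.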
vocab_size = 30522
expected_shape = torch.Size((1, 16, vocab_size))
self.assertEqual(output.prediction_logits.shape, expected_shape)
expected_slice = torch.tensor(
[[[-5.1858, -5.1903, -4.9142], [-6.2214, -5.9238, -5.8381], [-6.3027, -5.9939, -5.9297]]]
)
self.assertTrue(torch.allclose(output.prediction_logits[:, :3, :3], expected_slice, atol=1e-4))
expected_shape_2 = torch.Size((1, 2))
self.assertEqual(output.seq_relationship_logits.shape, expected_shape_2)
expected_slice_2 = torch.tensor([[0.7393, 0.1754]])
self.assertTrue(torch.allclose(output.seq_relationship_logits, expected_slice_2, atol=1e-4))
@slow
def test_inference_vqa(self):
model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")
input_ids = torch.tensor([1, 2, 3, 4, 5, 6], dtype=torch.long).reshape(1, -1)
token_type_ids = torch.tensor([0, 0, 0, 1, 1, 1], dtype=torch.long).reshape(1, -1)
visual_embeds = torch.ones(size=(1, 10, 2048), dtype=torch.float32) * 0.5
visual_token_type_ids = torch.ones(size=(1, 10), dtype=torch.long)
attention_mask = torch.tensor([1] * 6).reshape(1, -1)
visual_attention_mask = torch.tensor([1] * 10).reshape(1, -1)
output = model(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
)
# The VQA head scores 3129 candidate answers.
expected_shape = torch.Size((1, 3129))
self.assertEqual(output.logits.shape, expected_shape)
expected_slice = torch.tensor(
[[-8.9898, 3.0803, -1.8016, 2.4542, -8.3420, -2.0224, -3.3124, -4.4139, -3.1491, -3.8997]]
)
self.assertTrue(torch.allclose(output.logits[:, :10], expected_slice, atol=1e-4))
@slow
def test_inference_nlvr(self):
model = VisualBertForVisualReasoning.from_pretrained("uclanlp/visualbert-nlvr2")
input_ids = torch.tensor([1, 2, 3, 4, 5, 6], dtype=torch.long).reshape(1, -1)
token_type_ids = torch.tensor([0, 0, 0, 1, 1, 1], dtype=torch.long).reshape(1, -1)
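# The NLVR2 checkpoint expects 1024-dimensional visual features (the VQA checkpoints use 2048-dimensional features).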
visual_embeds = torch.ones(size=(1, 10, 1024), dtype=torch.float32) * 0.5
visual_token_type_ids = torch.ones(size=(1, 10), dtype=torch.long)
attention_mask = torch.tensor([1] * 6).reshape(1, -1)
visual_attention_mask = torch.tensor([1] * 10).reshape(1, -1)
output = model(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
)
# NLVR2 visual reasoning is a binary classification, so two logits are returned.
expected_shape = torch.Size((1, 2))
self.assertEqual(output.logits.shape, expected_shape)
expected_slice = torch.tensor([[-1.1436, 0.8900]])
self.assertTrue(torch.allclose(output.logits, expected_slice, atol=1e-4))
@slow
def test_inference_vcr(self):
model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")
input_ids = torch.tensor([[[1, 2, 3, 4, 5, 6] for i in range(4)]], dtype=torch.long)
attention_mask = torch.ones_like(input_ids)
token_type_ids = torch.ones_like(input_ids)
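# The VCR checkpoint uses 512-dimensional visual features, with one set of 10 regions per answer choice.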
visual_embeds = torch.ones(size=(1, 4, 10, 512), dtype=torch.float32) * 0.5
visual_token_type_ids = torch.ones(size=(1, 4, 10), dtype=torch.long)
visual_attention_mask = torch.ones_like(visual_token_type_ids)
output = model(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
visual_embeds=visual_embeds,
visual_attention_mask=visual_attention_mask,
visual_token_type_ids=visual_token_type_ids,
)
# Multiple choice (VCR) returns one logit per answer choice; four choices here.
expected_shape = torch.Size((1, 4))
self.assertEqual(output.logits.shape, expected_shape)
expected_slice = torch.tensor([[-7.7697, -7.7697, -7.7697, -7.7697]])
self.assertTrue(torch.allclose(output.logits, expected_slice, atol=1e-4))
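# For readers following along: a minimal usage sketch (not part of this diff) that mirrors the
# integration tests above. The tokenizer name and the dummy detector features are assumptions;
# in practice visual_embeds come from an object detector (2048-dimensional region features for
# the VQA checkpoint).
import torch
from transformers import BertTokenizer, VisualBertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")

inputs = tokenizer("What is the man eating?", return_tensors="pt")
visual_embeds = torch.zeros(1, 10, 2048)  # placeholder for 10 detected regions
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

with torch.no_grad():
    outputs = model(
        **inputs,
        visual_embeds=visual_embeds,
        visual_token_type_ids=visual_token_type_ids,
        visual_attention_mask=visual_attention_mask,
    )
# logits has shape (1, 3129): one score per candidate VQA answer.
predicted_answer_idx = outputs.logits.argmax(-1).item()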
...@@ -118,6 +118,10 @@ IGNORE_NON_AUTO_CONFIGURED = [
"XLMForQuestionAnswering",
"XLNetForQuestionAnswering",
"SeparableConv1D",
"VisualBertForRegionToPhraseAlignment",
"VisualBertForVisualReasoning",
"VisualBertForQuestionAnswering",
"VisualBertForMultipleChoice",
]
# This is to make sure the transformers module imported is the one in the repo.
...