.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

LXMERT
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
<https://arxiv.org/abs/1908.07490>`__ by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
data consists of multiple multi-modal datasets: MSCOCO, Visual Genome + Visual Genome Question Answering, VQA 2.0, and
GQA.

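Below is a minimal sketch of a forward pass through :class:`~transformers.LxmertModel`, assuming the
``unc-nlp/lxmert-base-uncased`` checkpoint; the random tensors stand in for the ROI features and normalized box
coordinates that would normally come from an external object detector such as Faster R-CNN.

.. code-block:: python

    import torch
    from transformers import LxmertModel, LxmertTokenizer

    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

    # Language inputs
    inputs = tokenizer("How many cats are in the picture?", return_tensors="pt")

    # Visual inputs: pre-extracted ROI features and their box coordinates are
    # expected; random tensors are used here purely as placeholders.
    num_boxes, feat_dim = 36, 2048
    visual_feats = torch.randn(1, num_boxes, feat_dim)
    visual_pos = torch.rand(1, num_boxes, 4)  # normalized coordinates in [0, 1]

    outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
    print(outputs.language_output.shape)  # (1, sequence_length, hidden_size)
    print(outputs.vision_output.shape)    # (1, num_boxes, hidden_size)
    print(outputs.pooled_output.shape)    # (1, hidden_size)
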
The abstract from the paper is the following:

*Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly,
the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality
Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we
build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR, and improve the previous
best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel
model components and pretraining strategies significantly contribute to our strong results; and also present several
attention visualizations for the different encoders.*

Tips:

- Bounding boxes are not required for the visual feature embeddings; any kind of visual-spatial features will work.
- Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the
  cross-modality layer, so they contain information from both modalities. To access a modality that only attends to
  itself, select the vision/language hidden states from the first entries of the corresponding tuple (see the sketch
  after this list).
- The bidirectional cross-modality encoder attention only returns attention values when the language modality is used
  as the input and the vision modality is used as the context vector. Further, while the cross-modality encoder
  contains self-attention for each respective modality as well as cross-attention, only the cross-attention is
  returned and both self-attention outputs are disregarded.
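
The sketch below (again with placeholder visual features and the assumed ``unc-nlp/lxmert-base-uncased`` checkpoint)
shows where these tips surface in the outputs of :class:`~transformers.LxmertModel`: the per-modality hidden-state
tuples and the cross-attention weights of the cross-modality encoder.

.. code-block:: python

    import torch
    from transformers import LxmertModel, LxmertTokenizer

    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

    inputs = tokenizer("A dog sitting on a couch.", return_tensors="pt")
    visual_feats = torch.randn(1, 36, 2048)  # placeholder ROI features
    visual_pos = torch.rand(1, 36, 4)        # placeholder normalized boxes

    outputs = model(
        **inputs,
        visual_feats=visual_feats,
        visual_pos=visual_pos,
        output_hidden_states=True,
        output_attentions=True,
    )

    # One tuple per modality; the first entries come from the modality-specific
    # encoders, later entries have passed through the cross-modality layers.
    language_hidden_states = outputs.language_hidden_states
    vision_hidden_states = outputs.vision_hidden_states

    # Only the cross-attention of the cross-modality encoder is returned
    # (language as input, vision as context); its self-attentions are not.
    cross_attentions = outputs.cross_encoder_attentions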

This model was contributed by `eltoto1219 <https://huggingface.co/eltoto1219>`__. The original code can be found `here
<https://github.com/airsplay/lxmert>`__.


LxmertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertConfig
    :members:


LxmertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertTokenizer
    :members:


LxmertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertTokenizerFast
    :members:


Lxmert specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertModelOutput
    :members:

.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForPreTrainingOutput
    :members:

.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForQuestionAnsweringOutput
    :members:

.. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertModelOutput
    :members:

.. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
    :members:


LxmertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertModel
    :members: forward

LxmertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertForPreTraining
    :members: forward

LxmertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertForQuestionAnswering
    :members: forward


TFLxmertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLxmertModel
    :members: call

TFLxmertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLxmertForPreTraining
    :members: call