.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

LXMERT
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
<https://arxiv.org/abs/1908.07490>`__ by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
was performed on multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0,
and GQA.

The abstract from the paper is the following:

*Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly,
the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality
Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we
build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous
best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel
model components and pretraining strategies significantly contribute to our strong results; and also present several
attention visualizations for the different encoders.*

Tips:

- Bounding boxes are not required for the visual feature embeddings; any kind of visual-spatial features will work.
- Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the
  cross-modality layer, so they contain information from both modalities. To access a modality that only attends to
  itself, select the vision/language hidden states from the first input in the tuple (see the sketch after these
  tips).
- The bidirectional cross-modality encoder attention only returns attention values when the language modality is used
  as the input and the vision modality is used as the context vector. Further, while the cross-modality encoder
  contains self-attention for each respective modality and cross-attention, only the cross-attention is returned and
  both self-attention outputs are disregarded.
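
The sketch below shows one way to run :class:`~transformers.LxmertModel` on a question together with pre-extracted
visual features. It is a minimal example rather than an official recipe: the ``unc-nlp/lxmert-base-uncased``
checkpoint name is assumed, and random tensors stand in for the ROI features and normalized bounding boxes that would
normally come from an external object detector such as Faster R-CNN.

.. code-block:: python

    import torch

    from transformers import LxmertModel, LxmertTokenizer

    # Assumed checkpoint name; any LXMERT checkpoint with compatible dimensions works.
    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

    inputs = tokenizer("What is the man riding?", return_tensors="pt")

    # Placeholder visual inputs: 36 regions, 2048-dim ROI features and 4-dim normalized
    # bounding boxes (the default ``visual_feat_dim`` and ``visual_pos_dim`` of LxmertConfig).
    visual_feats = torch.rand(1, 36, 2048)
    visual_pos = torch.rand(1, 36, 4)

    outputs = model(
        **inputs,
        visual_feats=visual_feats,
        visual_pos=visual_pos,
        output_attentions=True,
        output_hidden_states=True,
    )

    # Hidden states after the cross-modality layer: they contain information from both modalities.
    language_output = outputs.language_output
    vision_output = outputs.vision_output
    pooled_output = outputs.pooled_output

    # Per the tips above: the first entry of each hidden-state tuple is the modality attending
    # only to itself, and only the cross-attention weights are returned.
    language_only = outputs.language_hidden_states[0]
    vision_only = outputs.vision_hidden_states[0]
    cross_attentions = outputs.cross_encoder_attentions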

The original code can be found `here <https://github.com/airsplay/lxmert>`__.
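
For visual question answering, :class:`~transformers.LxmertForQuestionAnswering` scores a fixed answer vocabulary from
the same kind of inputs. The sketch below assumes a VQA fine-tuned checkpoint named ``unc-nlp/lxmert-vqa-uncased`` and
placeholder visual features; mapping the predicted index back to an answer string requires the answer vocabulary from
the original repository linked above.

.. code-block:: python

    import torch

    from transformers import LxmertForQuestionAnswering, LxmertTokenizer

    # Assumed VQA fine-tuned checkpoint name.
    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-vqa-uncased")
    model = LxmertForQuestionAnswering.from_pretrained("unc-nlp/lxmert-vqa-uncased")

    inputs = tokenizer("What color is the bus?", return_tensors="pt")
    visual_feats = torch.rand(1, 36, 2048)  # placeholder ROI features from a detector
    visual_pos = torch.rand(1, 36, 4)  # placeholder normalized bounding boxes

    outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)

    # ``question_answering_score`` holds one logit per answer in the answer vocabulary;
    # the index-to-answer mapping comes from the original repository's answer files.
    predicted_answer_id = outputs.question_answering_score.argmax(-1).item()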


LxmertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertConfig
    :members:


LxmertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertTokenizer
    :members:


LxmertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertTokenizerFast
    :members:


Lxmert specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertModelOutput
    :members:

.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForPreTrainingOutput
    :members:

.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForQuestionAnsweringOutput
    :members:

.. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertModelOutput
    :members:

.. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
    :members:


LxmertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertModel
    :members: forward

LxmertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertForPreTraining
    :members: forward

LxmertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LxmertForQuestionAnswering
    :members: forward


TFLxmertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLxmertModel
    :members: call

TFLxmertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLxmertForPreTraining
    :members: call