Merge pull request #2115 from suvrat96/add_mmbt_model

[WIP] Add MMBT Model to Transformers Repo

Merge pull request #2115 from suvrat96/add_mmbt_model
[WIP] Add MMBT Model to Transformers Repo
73f6e981 · Thomas Wolf · GitHub · 5b7fb6a4 · 344126fe · 73f6e981
Unverified Commit 73f6e981 authored Dec 21, 2019 by Thomas Wolf Committed by GitHub Dec 21, 2019
7 changed files
--- a/README.md
+++ b/README.md
@@ -148,7 +148,8 @@ At some point in the future, you'll be able to seamlessly move from pre-training
 11. **[ALBERT](https://github.com/google-research/ALBERT)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
 12. **[T5](https://github.com/google-research/text-to-text-transfer-transformer)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 13. **[XLM-RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/xlmr)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
-14. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
+14. **[MMBT](https://github.com/facebookresearch/mmbt/)** (from Facebook), released together with the paper a [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf) by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
+15. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.

 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).


--- a/examples/README.md
+++ b/examples/README.md
@@ -734,3 +734,28 @@ Training with the previously defined hyper-parameters yields the following resul
 ```bash
 acc = 0.7093812375249501
 ```
+
+## MM-IMDb
+
+Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/mm-imdb/run_mmimdb.py).
+
+[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata.
+
+### Training on MM-IMDb
+
+```
+python run_mmimdb.py \
+    --data_dir /path/to/mmimdb/dataset/ \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --output_dir /path/to/save/dir/ \
+    --do_train \
+    --do_eval \
+    --max_seq_len 512 \
+    --gradient_accumulation_steps 20 \
+    --num_image_embeds 3 \
+    --num_train_epochs 100 \
+    --patience 5
+```
+
+
--- a/examples/mm-imdb/run_mmimdb.py
+++ b/examples/mm-imdb/run_mmimdb.py
--- a/examples/mm-imdb/utils_mmimdb.py
+++ b/examples/mm-imdb/utils_mmimdb.py
+# coding=utf-8
+# Copyright (c) Facebook, Inc. and its affiliates.
+# Copyright (c) HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import os
+from collections import Counter
+from PIL import Image
+
+import torch
+import torch.nn as nn
+import torchvision
+import torchvision.transforms as transforms
+from torch.utils.data import Dataset
+
+POOLING_BREAKDOWN = {
+    1: (1, 1),
+    2: (2, 1),
+    3: (3, 1),
+    4: (2, 2),
+    5: (5, 1),
+    6: (3, 2),
+    7: (7, 1),
+    8: (4, 2),
+    9: (3, 3)
+}
+
+
+class ImageEncoder(nn.Module):
+    def __init__(self, args):
+        super(ImageEncoder, self).__init__()
+        model = torchvision.models.resnet152(pretrained=True)
+        modules = list(model.children())[:-2]
+        self.model = nn.Sequential(*modules)
+        self.pool = nn.AdaptiveAvgPool2d(POOLING_BREAKDOWN[args.num_image_embeds])
+
+    def forward(self, x):
+        # Bx3x224x224 -> Bx2048x7x7 -> Bx2048xN -> BxNx2048
+        out = self.pool(self.model(x))
+        out = torch.flatten(out, start_dim=2)
+        out = out.transpose(1, 2).contiguous()
+        return out  # BxNx2048
+
+
+
+class JsonlDataset(Dataset):
+    def __init__(self, data_path, tokenizer, transforms, labels, max_seq_length):
+        self.data = [json.loads(l) for l in open(data_path)]
+        self.data_dir = os.path.dirname(data_path)
+        self.tokenizer = tokenizer
+        self.labels = labels
+        self.n_classes = len(labels)
+        self.max_seq_length = max_seq_length
+
+        self.transforms = transforms
+
+    def __len__(self):
+        return len(self.data)
+
+    def __getitem__(self, index):
+        sentence = torch.LongTensor(self.tokenizer.encode(self.data[index]["text"], add_special_tokens=True))
+        start_token, sentence, end_token = sentence[0], sentence[1:-1], sentence[-1]
+        sentence = sentence[:self.max_seq_length]
+
+        label = torch.zeros(self.n_classes)
+        label[[self.labels.index(tgt) for tgt in self.data[index]["label"]]] = 1
+
+        image = Image.open(os.path.join(self.data_dir, self.data[index]["img"])).convert("RGB")
+        image = self.transforms(image)
+
+        return {"image_start_token": start_token, "image_end_token": end_token,
+                "sentence": sentence, "image": image, "label": label}
+
+    def get_label_frequencies(self):
+        label_freqs = Counter()
+        for row in self.data:
+            label_freqs.update(row["label"])
+        return label_freqs
+
+
+def collate_fn(batch):
+    lens = [len(row["sentence"]) for row in batch]
+    bsz, max_seq_len = len(batch), max(lens)
+
+    mask_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
+    text_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
+
+    for i_batch, (input_row, length) in enumerate(zip(batch, lens)):
+        text_tensor[i_batch, :length] = input_row["sentence"]
+        mask_tensor[i_batch, :length] = 1
+
+    img_tensor = torch.stack([row["image"] for row in batch])
+    tgt_tensor = torch.stack([row["label"] for row in batch])
+    img_start_token = torch.stack([row["image_start_token"] for row in batch])
+    img_end_token = torch.stack([row["image_end_token"] for row in batch])
+
+    return text_tensor, mask_tensor, img_tensor, img_start_token, img_end_token, tgt_tensor
+
+
+def get_mmimdb_labels():
+    return ['Crime', 'Drama', 'Thriller', 'Action', 'Comedy', 'Romance',
+            'Documentary', 'Short', 'Mystery', 'History', 'Family', 'Adventure',
+            'Fantasy', 'Sci-Fi', 'Western', 'Horror', 'Sport', 'War', 'Music',
+            'Musical', 'Animation', 'Biography', 'Film-Noir']
+
+
+def get_image_transforms():
+    return transforms.Compose(
+        [
+            transforms.Resize(256),
+            transforms.CenterCrop(224),
+            transforms.ToTensor(),
+            transforms.Normalize(
+                mean=[0.46777044, 0.44531429, 0.40661017],
+                std=[0.12221994, 0.12145835, 0.14380469],
+            ),
+        ]
+    )
--- a/transformers/__init__.py
+++ b/transformers/__init__.py
@@ -71,6 +71,7 @@ from .configuration_albert import AlbertConfig, ALBERT_PRETRAINED_CONFIG_ARCHIVE
 from .configuration_camembert import CamembertConfig, CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_t5 import T5Config, T5_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_xlm_roberta import XLMRobertaConfig, XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
+from .configuration_mmbt import MMBTConfig

 # Modeling
 if is_torch_available():
@@ -120,13 +121,12 @@ if is_torch_available():
    from .modeling_t5 import (T5PreTrainedModel, T5Model, T5WithLMHeadModel,
                              load_tf_weights_in_t5,
                              T5_PRETRAINED_MODEL_ARCHIVE_MAP)
-
    from .modeling_albert import (AlbertPreTrainedModel, AlbertModel, AlbertForMaskedLM, AlbertForSequenceClassification,
                                AlbertForQuestionAnswering,
                                load_tf_weights_in_albert, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
-
    from .modeling_xlm_roberta import (XLMRobertaForMaskedLM, XLMRobertaModel, XLMRobertaForMultipleChoice,
                                       XLMRobertaForSequenceClassification, XLMRobertaForTokenClassification)
+    from .modeling_mmbt import ModalEmbeddings, MMBTModel, MMBTForClassification

    # Optimization
    from .optimization import (AdamW, get_constant_schedule, get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup,

--- a/transformers/configuration_mmbt.py
+++ b/transformers/configuration_mmbt.py
+# coding=utf-8
+# Copyright (c) Facebook, Inc. and its affiliates.
+# Copyright (c) HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" MMBT configuration """
+
+from __future__ import (absolute_import, division, print_function,
+                        unicode_literals)
+
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+class MMBTConfig(object):
+    """Configuration class to store the configuration of a `MMBT Model`.
+
+    Args:
+        config: config of the underlying Transformer models. It's values are copied over to use a single config.
+        num_labels: Size of final Linear layer for classification.
+        modal_hidden_size: Embedding dimension of the non-text modality encoder.
+    """
+    def __init__(self, config, num_labels=None, modal_hidden_size=2048):
+        self.__dict__ = config.__dict__
+        self.modal_hidden_size = modal_hidden_size
+        if num_labels:
+            self.num_labels = num_labels
--- a/transformers/modeling_mmbt.py
+++ b/transformers/modeling_mmbt.py