Overview
========
Fairseq can be extended through user-supplied `plug-ins
<https://en.wikipedia.org/wiki/Plug-in_(computing)>`_. We support five kinds of
plug-ins:
- :ref:`Models` define the neural network architecture and encapsulate all of the
learnable parameters.
- :ref:`Criterions` compute the loss function given the model outputs and targets.
- :ref:`Tasks` store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
- :ref:`Optimizers` update the Model parameters based on the gradients.
- :ref:`Learning Rate Schedulers` update the learning rate over the course of
training.
**Training Flow**
Given a ``model``, ``criterion``, ``task``, ``optimizer`` and ``lr_scheduler``,
fairseq implements the following high-level training flow::
for epoch in range(num_epochs):
itr = task.get_batch_iterator(task.dataset('train'))
for num_updates, batch in enumerate(itr):
task.train_step(batch, model, criterion, optimizer)
average_and_clip_gradients()
optimizer.step()
lr_scheduler.step_update(num_updates)
lr_scheduler.step(epoch)
where the default implementation for ``task.train_step`` is roughly::
def train_step(self, batch, model, criterion, optimizer, **unused):
loss = criterion(model, batch)
optimizer.backward(loss)
return loss
**Registering new plug-ins**
New plug-ins are *registered* through a set of ``@register`` function
decorators, for example::
@register_model('my_lstm')
class MyLSTM(FairseqEncoderDecoderModel):
(...)
Once registered, new plug-ins can be used with the existing :ref:`Command-line
Tools`. See the Tutorial sections for more detailed walkthroughs of how to add
new plug-ins.
**Loading plug-ins from another directory**
New plug-ins can be defined in a custom module stored in the user system. In
order to import the module, and make the plugin available to *fairseq*, the
command line supports the ``--user-dir`` flag that can be used to specify a
custom location for additional modules to load into *fairseq*.
For example, assuming this directory tree::
/home/user/my-module/
└── __init__.py
with ``__init__.py``::
from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big
@register_model_architecture('transformer', 'my_transformer')
def transformer_mmt_big(args):
transformer_vaswani_wmt_en_de_big(args)
it is possible to invoke the :ref:`fairseq-train` script with the new architecture with::
fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
.. role:: hidden
:class: hidden-section
.. module:: fairseq.tasks
.. _Tasks:
Tasks
=====
Tasks store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
Tasks can be selected via the ``--task`` command-line argument. Once selected, a
task may expose additional command-line arguments for further configuration.
Example usage::
# setup the task (e.g., load dictionaries)
task = fairseq.tasks.setup_task(args)
# build model and criterion
model = task.build_model(args)
criterion = task.build_criterion(args)
# load datasets
task.load_dataset('train')
task.load_dataset('valid')
# iterate over mini-batches of data
batch_itr = task.get_batch_iterator(
task.dataset('train'), max_tokens=4096,
)
for batch in batch_itr:
# compute the loss
loss, sample_size, logging_output = task.get_loss(
model, criterion, batch,
)
loss.backward()
Translation
-----------
.. autoclass:: fairseq.tasks.translation.TranslationTask
.. _language modeling:
Language Modeling
-----------------
.. autoclass:: fairseq.tasks.language_modeling.LanguageModelingTask
Adding new tasks
----------------
.. autofunction:: fairseq.tasks.register_task
.. autoclass:: fairseq.tasks.FairseqTask
:members:
:undoc-members:
Tutorial: Classifying Names with a Character-Level RNN
======================================================
In this tutorial we will extend fairseq to support *classification* tasks. In
particular we will re-implement the PyTorch tutorial for `Classifying Names with
a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`_
in fairseq. It is recommended to quickly skim that tutorial before beginning
this one.
This tutorial covers:
1. **Preprocessing the data** to create dictionaries.
2. **Registering a new Model** that encodes an input sentence with a simple RNN
and predicts the output label.
3. **Registering a new Task** that loads our dictionaries and dataset.
4. **Training the Model** using the existing command-line tools.
5. **Writing an evaluation script** that imports fairseq and allows us to
interactively evaluate our model on new inputs.
1. Preprocessing the data
-------------------------
The original tutorial provides raw data, but we'll work with a modified version
of the data that is already tokenized into characters and split into separate
train, valid and test sets.
Download and extract the data from here:
`tutorial_names.tar.gz <https://dl.fbaipublicfiles.com/fairseq/data/tutorial_names.tar.gz>`_
Once extracted, let's preprocess the data using the :ref:`fairseq-preprocess`
command-line tool to create the dictionaries. While this tool is primarily
intended for sequence-to-sequence problems, we're able to reuse it here by
treating the label as a "target" sequence of length 1. We'll also output the
preprocessed files in "raw" format using the ``--dataset-impl`` option to
enhance readability:
.. code-block:: console
> fairseq-preprocess \
--trainpref names/train --validpref names/valid --testpref names/test \
--source-lang input --target-lang label \
--destdir names-bin --dataset-impl raw
After running the above command you should see a new directory,
:file:`names-bin/`, containing the dictionaries for *inputs* and *labels*.
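If you want to sanity-check the output, the generated dictionaries can be
loaded directly with fairseq's ``Dictionary`` class (a minimal sketch; the
exact type counts depend on your data)::

    from fairseq.data import Dictionary

    # Load the dictionaries produced by fairseq-preprocess.
    input_vocab = Dictionary.load('names-bin/dict.input.txt')
    label_vocab = Dictionary.load('names-bin/dict.label.txt')
    print('input types:', len(input_vocab))
    print('label types:', len(label_vocab))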
2. Registering a new Model
--------------------------
Next we'll register a new model in fairseq that will encode an input sentence
with a simple RNN and predict the output label. Compared to the original PyTorch
tutorial, our version will also work with batches of data and GPU Tensors.
First let's copy the simple RNN module implemented in the `PyTorch tutorial
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network>`_.
Create a new file named :file:`fairseq/models/rnn_classifier.py` with the
following contents::
import torch
import torch.nn as nn
class RNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super(RNN, self).__init__()
self.hidden_size = hidden_size
self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
self.i2o = nn.Linear(input_size + hidden_size, output_size)
self.softmax = nn.LogSoftmax(dim=1)
def forward(self, input, hidden):
combined = torch.cat((input, hidden), 1)
hidden = self.i2h(combined)
output = self.i2o(combined)
output = self.softmax(output)
return output, hidden
def initHidden(self):
return torch.zeros(1, self.hidden_size)
We must also *register* this model with fairseq using the
:func:`~fairseq.models.register_model` function decorator. Once the model is
registered we'll be able to use it with the existing :ref:`Command-line Tools`.
All registered models must implement the :class:`~fairseq.models.BaseFairseqModel`
interface, so we'll create a small wrapper class in the same file and register
it in fairseq with the name ``'rnn_classifier'``::
from fairseq.models import BaseFairseqModel, register_model
# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.
@register_model('rnn_classifier')
class FairseqRNNClassifier(BaseFairseqModel):
@staticmethod
def add_args(parser):
# Models can override this method to add new command-line arguments.
# Here we'll add a new command-line argument to configure the
# dimensionality of the hidden state.
parser.add_argument(
'--hidden-dim', type=int, metavar='N',
help='dimensionality of the hidden state',
)
@classmethod
def build_model(cls, args, task):
# Fairseq initializes models by calling the ``build_model()``
# function. This provides more flexibility, since the returned model
# instance can be of a different type than the one that was called.
# In this case we'll just return a FairseqRNNClassifier instance.
# Initialize our RNN module
rnn = RNN(
# We'll define the Task in the next section, but for now just
# notice that the task holds the dictionaries for the "source"
# (i.e., the input sentence) and "target" (i.e., the label).
input_size=len(task.source_dictionary),
hidden_size=args.hidden_dim,
output_size=len(task.target_dictionary),
)
# Return the wrapped version of the module
return FairseqRNNClassifier(
rnn=rnn,
input_vocab=task.source_dictionary,
)
def __init__(self, rnn, input_vocab):
super(FairseqRNNClassifier, self).__init__()
self.rnn = rnn
self.input_vocab = input_vocab
# The RNN module in the tutorial expects one-hot inputs, so we can
# precompute the identity matrix to help convert from indices to
# one-hot vectors. We register it as a buffer so that it is moved to
# the GPU when ``cuda()`` is called.
self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))
def forward(self, src_tokens, src_lengths):
# The inputs to the ``forward()`` function are determined by the
# Task, and in particular the ``'net_input'`` key in each
# mini-batch. We'll define the Task in the next section, but for
# now just know that *src_tokens* has shape `(batch, src_len)` and
# *src_lengths* has shape `(batch)`.
bsz, max_src_len = src_tokens.size()
# Initialize the RNN hidden state. Compared to the original PyTorch
# tutorial we'll also handle batched inputs and work on the GPU.
hidden = self.rnn.initHidden()
hidden = hidden.repeat(bsz, 1) # expand for batched inputs
hidden = hidden.to(src_tokens.device) # move to GPU
for i in range(max_src_len):
# WARNING: The inputs have padding, so we should mask those
# elements here so that padding doesn't affect the results.
# This is left as an exercise for the reader. The padding symbol
# is given by ``self.input_vocab.pad()`` and the unpadded length
# of each input is given by *src_lengths*.
# One-hot encode a batch of input characters.
input = self.one_hot_inputs[src_tokens[:, i].long()]
# Feed the input to our RNN.
output, hidden = self.rnn(input, hidden)
# Return the final output state for making a prediction
return output
Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch pytorch_tutorial_rnn``::
from fairseq.models import register_model_architecture
# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'rnn_classifier'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.
@register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
def pytorch_tutorial_rnn(args):
# We use ``getattr()`` to prioritize arguments that are explicitly given
# on the command-line, so that the defaults defined below are only used
# when no other value has been specified.
args.hidden_dim = getattr(args, 'hidden_dim', 128)
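Before wiring the model into a Task, you can smoke-test the RNN module above
with plain PyTorch tensors (a minimal sketch; the sizes below are placeholders,
since the real ones come from the Task's dictionaries in the next section)::

    import torch

    from fairseq.models.rnn_classifier import RNN

    vocab_size, hidden_dim, num_labels, bsz, src_len = 64, 128, 18, 4, 10

    rnn = RNN(input_size=vocab_size, hidden_size=hidden_dim, output_size=num_labels)
    one_hot = torch.eye(vocab_size)

    # Expand the initial hidden state for a batch, as the wrapper model does.
    hidden = rnn.initHidden().repeat(bsz, 1)
    tokens = torch.randint(vocab_size, (bsz, src_len))
    for i in range(src_len):
        output, hidden = rnn(one_hot[tokens[:, i]], hidden)
    print(output.shape)  # torch.Size([4, 18]), log-probabilities over labels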
3. Registering a new Task
-------------------------
Now we'll register a new :class:`~fairseq.tasks.FairseqTask` that will load our
dictionaries and dataset. Tasks can also control how the data is batched into
mini-batches, but in this tutorial we'll reuse the batching provided by
:class:`fairseq.data.LanguagePairDataset`.
Create a new file named :file:`fairseq/tasks/simple_classification.py` with the
following contents::
import os
import torch
from fairseq.data import Dictionary, LanguagePairDataset
from fairseq.tasks import LegacyFairseqTask, register_task
@register_task('simple_classification')
class SimpleClassificationTask(LegacyFairseqTask):
@staticmethod
def add_args(parser):
# Add some command-line arguments for specifying where the data is
# located and the maximum supported input length.
parser.add_argument('data', metavar='FILE',
help='file prefix for data')
parser.add_argument('--max-positions', default=1024, type=int,
help='max input length')
@classmethod
def setup_task(cls, args, **kwargs):
# Here we can perform any setup required for the task. This may include
# loading Dictionaries, initializing shared Embedding layers, etc.
# In this case we'll just load the Dictionaries.
input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
print('| [input] dictionary: {} types'.format(len(input_vocab)))
print('| [label] dictionary: {} types'.format(len(label_vocab)))
return SimpleClassificationTask(args, input_vocab, label_vocab)
def __init__(self, args, input_vocab, label_vocab):
super().__init__(args)
self.input_vocab = input_vocab
self.label_vocab = label_vocab
def load_dataset(self, split, **kwargs):
"""Load a given dataset split (e.g., train, valid, test)."""
prefix = os.path.join(self.args.data, '{}.input-label'.format(split))
# Read input sentences.
sentences, lengths = [], []
with open(prefix + '.input', encoding='utf-8') as file:
for line in file:
sentence = line.strip()
# Tokenize the sentence, splitting on spaces
tokens = self.input_vocab.encode_line(
sentence, add_if_not_exist=False,
)
sentences.append(tokens)
lengths.append(tokens.numel())
# Read labels.
labels = []
with open(prefix + '.label', encoding='utf-8') as file:
for line in file:
label = line.strip()
labels.append(
# Convert label to a numeric ID.
torch.LongTensor([self.label_vocab.add_symbol(label)])
)
assert len(sentences) == len(labels)
print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))
# We reuse LanguagePairDataset since classification can be modeled as a
# sequence-to-sequence task where the target sequence has length 1.
self.datasets[split] = LanguagePairDataset(
src=sentences,
src_sizes=lengths,
src_dict=self.input_vocab,
tgt=labels,
tgt_sizes=torch.ones(len(labels)), # targets have length 1
tgt_dict=self.label_vocab,
left_pad_source=False,
# Since our target is a single class label, there's no need for
# teacher forcing. If we set this to ``True`` then our Model's
# ``forward()`` method would receive an additional argument called
# *prev_output_tokens* that would contain a shifted version of the
# target sequence.
input_feeding=False,
)
def max_positions(self):
"""Return the max input length allowed by the task."""
# The source should be less than *args.max_positions* and the "target"
# has max length 1.
return (self.args.max_positions, 1)
@property
def source_dictionary(self):
"""Return the source :class:`~fairseq.data.Dictionary`."""
return self.input_vocab
@property
def target_dictionary(self):
"""Return the target :class:`~fairseq.data.Dictionary`."""
return self.label_vocab
# We could override this method if we wanted more control over how batches
# are constructed, but it's not necessary for this tutorial since we can
# reuse the batching provided by LanguagePairDataset.
#
# def get_batch_iterator(
# self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
# ignore_invalid_inputs=False, required_batch_size_multiple=1,
# seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
# data_buffer_size=0, disable_iterator_cache=False,
# ):
# (...)
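Mirroring the generic Task usage shown earlier, the new task can also be
exercised outside of :ref:`fairseq-train` (a minimal sketch, assuming the
preprocessed data from step 1 lives in :file:`names-bin/`)::

    import argparse

    from fairseq.tasks.simple_classification import SimpleClassificationTask

    # Hypothetical stand-in for the parsed command-line arguments.
    args = argparse.Namespace(data='names-bin', max_positions=1024)

    task = SimpleClassificationTask.setup_task(args)
    task.load_dataset('valid')

    # get_batch_iterator returns an epoch batch iterator.
    batch_itr = task.get_batch_iterator(task.dataset('valid'), max_tokens=1000)
    for batch in batch_itr.next_epoch_itr(shuffle=False):
        print(batch['net_input']['src_tokens'].shape)  # (batch, src_len)
        break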
4. Training the Model
---------------------
Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Task (``--task
simple_classification``) and Model architecture (``--arch
pytorch_tutorial_rnn``):
.. note::
You can also configure the dimensionality of the hidden state by passing the
``--hidden-dim`` argument to :ref:`fairseq-train`.
.. code-block:: console
> fairseq-train names-bin \
--task simple_classification \
--arch pytorch_tutorial_rnn \
--optimizer adam --lr 0.001 --lr-shrink 0.5 \
--max-tokens 1000
(...)
| epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
| epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
| done training in 31.6 seconds
The model files should appear in the :file:`checkpoints/` directory.
5. Writing an evaluation script
-------------------------------
Finally we can write a short script to evaluate our model on new inputs. Create
a new file named :file:`eval_classifier.py` with the following contents::
from fairseq import checkpoint_utils, data, options, tasks
# Parse command-line arguments for generation
parser = options.get_generation_parser(default_task='simple_classification')
args = options.parse_args_and_arch(parser)
# Setup task
task = tasks.setup_task(args)
# Load model
print('| loading model from {}'.format(args.path))
models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
model = models[0]
while True:
sentence = input('\nInput: ')
# Tokenize into characters
chars = ' '.join(list(sentence.strip()))
tokens = task.source_dictionary.encode_line(
chars, add_if_not_exist=False,
)
# Build mini-batch to feed to the model
batch = data.language_pair_dataset.collate(
samples=[{'id': -1, 'source': tokens}], # bsz = 1
pad_idx=task.source_dictionary.pad(),
eos_idx=task.source_dictionary.eos(),
left_pad_source=False,
input_feeding=False,
)
# Feed batch to the model and get predictions
preds = model(**batch['net_input'])
# Print top 3 predictions and their log-probabilities
top_scores, top_labels = preds[0].topk(k=3)
for score, label_idx in zip(top_scores, top_labels):
label_name = task.target_dictionary.string([label_idx])
print('({:.2f})\t{}'.format(score, label_name))
Now we can evaluate our model interactively. Note that we have included the
original data path (:file:`names-bin/`) so that the dictionaries can be loaded:
.. code-block:: console
> python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
| [input] dictionary: 64 types
| [label] dictionary: 24 types
| loading model from checkpoints/checkpoint_best.pt
Input: Satoshi
(-0.61) Japanese
(-1.20) Arabic
(-2.86) Italian
Input: Sinbad
(-0.30) Arabic
(-1.76) English
(-4.08) Russian
!*/*.sh
!*/*.md
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
runs
data
pretrained_models
projects/mmfusion_*
log_test
third-party
python_log
slurm_snapshot_code
lightning_logs
demos
### Config Files Explained
Take `projects/mfmmlm.yaml` as an example, which runs pretraining using a masked frame model (MFM) and a masked language model (MLM) on a single BERT:
```yaml
project_dir: mfmmlm # specify the project dir for this baseline.
run_task:
- how2.yaml # run pretraining on how2 when launching `projects/taskmfmmlm.yaml`
- [vtt.yaml, vttcap.yaml, vttqa.yaml, youcook.yaml, youcookcap.yaml, crosstask.yaml, coin.yaml] # run fine-tuning tasks.
base_dir: task # a global template folder to specify each training task.
task_group:
pretrain: # section for pretraining. Most baselines differ in this section.
task_list:
- how2.yaml # reconfig `projects/task/how2.yaml`
dataset:
aligner: MFMMLMAligner # overwrite the aligner for MFMMLM training task.
model:
model_cls: MMFusionMFMMLM # overwrite the model, which constructs negative examples for MFM on-the-fly.
loss:
loss_cls: MFMMLM # overwrite the loss as MFMMLM, which combines MFM and MLM together.
fairseq: # all fairseq args can be specified under this name.
dataset:
batch_size: 128
finetune: # section for fine-tuning tasks; mostly we don't need to change anything here, since we want to see how pretraining contributes to finetuning.
task_list: # specify the list of downstream tasks, e.g., copy `projects/task/vtt.yaml` to `projects/mfmmlm`.
- vtt.yaml
- vttqa.yaml
- youcook.yaml
- youcookcap.yaml
- crosstask.yaml
- coin.yaml
test: # section for testing.
task_list:
- test_vtt.yaml
- test_vttqa.yaml
- test_youcook.yaml
- test_youcookcap.yaml
- test_crosstask.yaml
- test_crosstask_zs.yaml
- test_coin.yaml
```
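The same project config can also be inspected programmatically with the helper used by `locallaunch.py` (a minimal sketch, assuming you run it from the repo root with the toolkit installed; `recursive_config` appears further down in `locallaunch.py`):

```python
from mmpt.utils import recursive_config

# Load the project-level config and peek at the pretraining overrides.
config = recursive_config("projects/mfmmlm.yaml")
print(config.project_dir)                           # mfmmlm
print(config.run_task)                              # stages: how2.yaml, then the fine-tuning tasks
print(config.task_group.pretrain.model.model_cls)   # MMFusionMFMMLM
```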
# Dataset
We understand that video data is challenging to download and process. For videos, we provide our preprocessing scripts under `scripts/video_feature_extractor` (deeply adapted from `https://github.com/antoine77340/video_feature_extractor`); for text, we provide pre-tokenization scripts under `scripts/text_token_extractor`.
### S3D Feature Extraction
We use pre-trained [S3D](https://github.com/antoine77340/S3D_HowTo100M) for video feature extraction. Please place the models as `pretrained_models/s3d_dict.npy` and `pretrained_models/s3d_howto100m.pth`.
We implement a `PathBuilder` to automatically track video ids and map source video paths to their feature locations (you may need `conda install -c anaconda pandas`). Decoding may need `pip install ffmpeg-python`.
### Howto100M
[Howto100M](https://www.di.ens.fr/willow/research/howto100m/) is a large-scale video pre-training dataset. You may download the videos yourself and run our preprocessing scripts.
Several key differences between our preprocessing and existing papers: (1) we use `raw_caption.json` instead of `caption.json` to have pure self-supervision on text (`caption.json` has manual removal of stop words); (2) we remove partially duplicated texts that were originally designed for real-time readability (see `mmpt/processors/dedupprocessor.py`); (3) we then shard video/text features using `ShardedTensor` in `mmpt/utils/shardedtensor.py` for fast loading during training (faster than `h5py`).
#### Steps
##### video
To extract video features, edit and run `bash scripts/video_feature_extractor/how2/s3d.sh` (consider running this on multiple machines; by default, we store features in fp16 to save space and for faster training).
Split the available video ids into `data/how2/how2_s3d_train.lst` and `data/how2/how2_s3d_val.lst`.
Lastly, pack the video features into a `ShardedTensor` using `python scripts/video_feature_extractor/shard_feature.py`.
##### text
Clean captions using `python -m mmpt.processors.dedupprocessor`.
Tokenize the deduplicated captions in `data/how2/raw_caption_dedup.pkl` into sharded numpy arrays:
```
python scripts/text_token_extractor/pretokenization.py scripts/text_token_extractor/configs/bert-base-uncased.yaml
```
### Youcook, MSRVTT etc.
We use the versions of Youcook and MSRVTT that come with Howto100M and MILNCE. Please download the data to `data/youcook` and `data/msrvtt` accordingly; see `projects/task/youcook.yaml`, `projects/task/vtt.yaml`, etc. for details.
We extract features for Youcook and MSRVTT similarly to the first step of Howto100M, but we read the text directly from the metadata and perform on-the-fly tokenization.
# VideoCLIP and VLM
You just found this toolkit for multimodal video understanding! It contains implementations of two recent multimodal video understanding papers, [VideoCLIP](https://arxiv.org/pdf/2109.14084.pdf) (EMNLP, 2021) and [VLM](https://aclanthology.org/2021.findings-acl.370.pdf) (ACL Findings, 2021), along with high-performance toolkits that are typically lacking in existing codebases. The toolkit is designed to contain generic, performance-tuned components that can potentially be adapted to other frameworks (we initially use fairseq).
VideoCLIP is a contrastive learning model for zero-shot transfer to retrieval/classification/sequence labeling style tasks.
<img src="videoclip.png" width="350" class="center">
VLM is a masked-language-model-style pre-training method using only one encoder with a masked modality model (MMM) for retrieval/generation/sequence labeling style tasks.
<img src="vlm.png" width="350" class="center">
### News
[Oct. 2021] Initial release of implementation for the following papers:
[VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/pdf/2109.14084.pdf) (Xu et al., EMNLP 2021)
[VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding](https://aclanthology.org/2021.findings-acl.370.pdf) (Xu et al., ACL Findings 2021)
### Installation
We aim to minimize this repo's dependencies on other packages.
We use fairseq as the main trainer (models/datasets have no dependency on fairseq; we will support other trainers in the future):
```
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e . # also optionally follow fairseq README for apex installation for fp16 training.
export MKL_THREADING_LAYER=GNU # fairseq may need this for numpy.
```
Then install this toolkit:
```
cd examples/MMPT # MMPT can be in any folder, not necessarily under fairseq/examples.
pip install -e .
```
The code was developed under Python 3.8.8, PyTorch 1.8, CUDA 11.0 with fairseq 1.0.0a0+af0389f, and tested under Python 3.8.8, PyTorch 1.9, CUDA 11.0, fairseq 1.0.0a0+8e7bc73 at the time of code release.
Most models require `transformers==3.4` for API compatibility (`pip install transformers==3.4`).
In addition, some downstream tasks may need `conda install pandas`.
### Usage
#### Download Checkpoints
We use pre-trained [S3D](https://github.com/antoine77340/S3D_HowTo100M) for video feature extraction. Please place the models as `pretrained_models/s3d_dict.npy` and `pretrained_models/s3d_howto100m.pth`.
Download VideoCLIP checkpoint `https://dl.fbaipublicfiles.com/MMPT/retri/videoclip/checkpoint_best.pt` to `runs/retri/videoclip` or VLM checkpoint `https://dl.fbaipublicfiles.com/MMPT/mtm/vlm/checkpoint_best.pt` to `runs/mtm/vlm`.
#### Demo of Inference
Run `python locallaunch.py projects/retri/videoclip.yaml --dryrun` to generate all the `.yaml` configs for VideoCLIP.
```python
import torch
from mmpt.models import MMPTModel
model, tokenizer, aligner = MMPTModel.from_pretrained(
"projects/retri/videoclip/how2.yaml")
model.eval()
# B, T, FPS, H, W, C (VideoCLIP is trained on 30 fps of s3d)
video_frames = torch.randn(1, 2, 30, 224, 224, 3)
caps, cmasks = aligner._build_text_seq(
tokenizer("some text", add_special_tokens=False)["input_ids"]
)
caps, cmasks = caps[None, :], cmasks[None, :] # bsz=1
with torch.no_grad():
output = model(video_frames, caps, cmasks, return_score=True)
print(output["score"]) # dot-product
```
#### Data Preparation
See [dataset](DATASET.md) for each dataset.
#### Global Config for Training Pipeline
We organize a global config file for a training/testing pipeline under `projects` (see the detailed [explanation](CONFIG.md)). For example, VideoCLIP is in `projects/retri/videoclip.yaml` and VLM is in `projects/mtm/vlm.yaml`.
We wrap all commands into `locallaunch.py` and `mmpt_cli/localjob.py`. You can inspect the concrete commands with `--dryrun` and then drop the flag for an actual run.
First, running `python locallaunch.py projects/retri/videoclip.yaml --dryrun` will generate the configs for pre-training, zero-shot evaluation, fine-tuning and testing of VideoCLIP under `projects/retri/videoclip`.
Then each training or evaluation process is configured by a concrete config file (for reproducibility, we save all complex arguments, including fairseq args, into the concrete config file). For example, to run zero-shot evaluation on Youcook:
```
python locallaunch.py projects/retri/videoclip/test_youcook_zs.yaml --jobtype local_predict # zero-shot evaluation.
python locallaunch.py projects/retri/videoclip/youcook_videoclip.yaml --jobtype local_single --dryrun # fine-tuning: use --dryrun to check cmds and drop it to make an actual run; local_small will run on two gpus (as in paper).
python locallaunch.py projects/retri/videoclip/test_youcook_videoclip.yaml --jobtype local_predict # testing on fine-tuned model.
```
Pretraining can be run as:
```
python locallaunch.py projects/retri/videoclip/how2.yaml --jobtype local_single --dryrun # check then drop --dryrun; the paper was run on local_big (8 GPUs).
```
You may need to change `--jobtype`; check/extend `LocalJob` in `mmpt_cli/localjob.py` for multi-GPU/multi-node pre-training.
Detailed instructions for pretraining and fine-tuning can be found in the [pretraining instructions](pretraining.md) and [finetuning instructions](endtask.md).
### Development
Several components of this toolkit can be re-used for future research (and also our ongoing research).
#### Framework Wrapper
We currently only support fairseq, but most components can easily fit into other frameworks such as HuggingFace. This repo is a fairseq `--user-dir` with fairseq wrappers. For example, `mmpt/tasks` includes a `FairseqMMTask`, which manages `mmpt/datasets` with `FairseqDataset`, `mmpt/models` with `FairseqModel`, and `mmpt/losses` with `FairseqCriterion`.
#### Processors
Multimodal research introduces complexity in modality alignment, from the different input sources all the way to the losses. Inspired by [MMF](https://github.com/facebookresearch/mmf), this toolkit leverages `mmpt/processors` to handle the various needs of data preprocessing and loading, alleviating the need for multiple `torch.utils.data.Dataset` classes (which can make ablation studies tricky).
Processors can also be decoupled from `torch.utils.data.Dataset` for offline preprocessing instead of on-the-fly data preprocessing.
We decouple `mmpt.MMDataset` into four types of processors: `MetaProcessor`, `VideoProcessor`, `TextProcessor` and `Aligner`. They can be configured in the `dataset` field of a config file (e.g., see `projects/task/how2.yaml`).
`MetaProcessor` is used to load the meta data of a dataset, e.g., all video ids of the how2 dataset.
`VideoProcessor` is used to load the video features of a dataset, e.g., S3D features for each second of a video.
`TextProcessor` is used to load the text (features), e.g., BERT pre-tokenized text clips for the how2 dataset (with `start`/`end` timestamps and `cap` for `token_ids`).
`Aligner` is the core class that prepares the training data for a given baseline, e.g., sampling a clip or masking tokens for MLM.
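The contract between these processors is small. Below is a sketch with hypothetical toy processors (not classes from the repo) that satisfy the interface implied by `mmpt.MMDataset` further down this page: the meta processor is indexable and returns `(video_id, text_id)`, the video/text processors are callables keyed by those ids, and the aligner combines the two features into one training example.

```python
import numpy as np

class ToyMetaProcessor:
    def __init__(self):
        self.split = "train"                 # MMDataset reads `.split`.
        self.video_ids = ["vid0", "vid1"]

    def __len__(self):
        return len(self.video_ids)

    def __getitem__(self, idx):
        video_id = self.video_ids[idx]
        return video_id, video_id            # (video_id, text_id)

class ToyVideoProcessor:
    def __call__(self, video_id):
        # e.g., 32 seconds of per-second features.
        return np.zeros((32, 512), dtype=np.float32)

class ToyTextProcessor:
    def __call__(self, text_id):
        return {"start": [0.0], "end": [3.0], "cap": [[101, 2023, 102]]}

class ToyAligner:
    def __call__(self, video_id, video_feature, text_feature):
        # Combine video and text into one training example.
        return {"video_id": video_id,
                "vfeats": video_feature,
                "caps": text_feature["cap"]}
```

These four objects can then be passed to `MMDataset(meta, video, text, aligner)`, as shown in `mmpt/datasets/mmdataset.py` further down this page.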
#### Performance-tuned Components
To speed up pre-training, this toolkit uses sharded features stored in memory-mapped numpy arrays, backed by `ShardedTensor` in `mmpt/utils/shardedtensor.py` (adopted from the MARGE paper). This reduces the I/O load of multi-GPU training by avoiding loading all features of a video into memory each time, and `ShardedTensor` ensures features are stored in contiguous disk space for near-random access. This is used for both How2 video features and texts in `mmpt/processors/how2processor.py`.
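The actual `ShardedTensor` implementation lives in `mmpt/utils/shardedtensor.py`; the sketch below only illustrates the underlying idea with plain numpy memmaps (variable-length per-video features packed contiguously on disk plus an offset index for near-random access) and is not the toolkit's API.

```python
import numpy as np

# Toy example: pack 3 videos' features back-to-back into one shard on disk.
feats = [np.random.rand(np.random.randint(20, 60), 512).astype(np.float16)
         for _ in range(3)]
offsets = np.cumsum([0] + [f.shape[0] for f in feats])

shard = np.memmap("shard0.dat", dtype=np.float16, mode="w+",
                  shape=(int(offsets[-1]), 512))
for i, f in enumerate(feats):
    shard[offsets[i]:offsets[i + 1]] = f
shard.flush()

# At training time, memmap the shard and slice out one video without
# loading the whole shard into memory.
shard = np.memmap("shard0.dat", dtype=np.float16, mode="r",
                  shape=(int(offsets[-1]), 512))
video_1 = np.asarray(shard[offsets[1]:offsets[2]])
```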
### Citation
If this codebase is useful for your work, please cite the following papers:
```BibTeX
@inproceedings{xu-etal-2021-videoclip,
title = "{VideoCLIP}: Contrastive Pre-training for\\Zero-shot Video-Text Understanding",
author = "Xu, Hu and
Ghosh, Gargi and
Huang, Po-Yao and
Okhonko, Dmytro and
Aghajanyan, Armen and
Metze, Florian and
Zettlemoyer, Luke and
Feichtenhofer, Christoph",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
}
@inproceedings{xu-etal-2021-vlm,
title = "{VLM}: Task-agnostic Video-Language Model Pre-training for Video Understanding",
author = "Xu, Hu and
Ghosh, Gargi and
Huang, Po-Yao and
Arora, Prahal and
Aminzadeh, Masoumeh and
Feichtenhofer, Christoph and
Metze, Florian and
Zettlemoyer, Luke",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.370",
doi = "10.18653/v1/2021.findings-acl.370",
pages = "4227--4239",
}
```
### Bug Reports
This repo is in its initial stage; bug reports are welcome at huxu@fb.com.
### Copyright
The majority of Multimodal Pre-training (MMPT) is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: Evaluation Codes/Models: Howto100M and HuggingFace Transformers are licensed under the Apache 2.0 license; COIN and NLG-eval are licensed under the MIT license; CrossTask is licensed under BSD-3; DiDeMo is licensed under the BSD-2 license.
# Zero-shot Transfer and Finetuning
(If you are new to the ideas of `mmpt.processors`, see [README](README.md) first.)
All finetuning datasets (specifically `processors`) are defined in `mmpt.processors.dsprocessor`.
Given the complexity of different types of finetuning tasks, each task may have its own meta/video/text/aligner processors and its own `mmpt/evaluators/{Predictor,Metric}`.
### Tasks
Currently, we support 5 end datasets: `MSRVTT`, `Youcook`, `COIN`, `Crosstask` and `DiDeMo` with the following tasks:
text-video retrieval: `MSRVTT`, `Youcook`, `DiDeMo`;
video captioning: `Youcook`;
video question answering: `MSRVTT-QA`.
To add your own dataset, specify the corresponding processors and configure them in the `dataset` field of a config file, such as `projects/task/vtt.yaml`.
### Zero-shot Transfer (no Training)
Zero-shot transfer runs the pre-trained model (e.g., VideoCLIP) directly on testing data. Configs with the pattern `projects/task/*_zs_*.yaml` are dedicated to zero-shot transfer.
### Fine-tuning
Training a downstream task is similar to pretraining, except that you may need to specify `restore_file` in `fairseq.checkpoint` and reset the optimizers; see `projects/task/ft.yaml`, which is included by `projects/task/vtt.yaml`.
We typically do finetuning on 2 GPUs (`local_small`).
### Testing
For each finetuning dataset, you may need to specify a testing config, similar to `projects/task/test_vtt.yaml`.
We define `mmpt.evaluators.Predictor` for different types of prediction. For example, `MSRVTT` and `Youcook` are video-retrieval tasks and are expected to use `RetrievalPredictor`. You may need to define your own type of predictor and specify it in the `predictor` field of a testing config.
Each task may also have its own metric for evaluation. This can be created in `mmpt.evaluators.Metric` and specified in the `metric` field of a testing config.
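A new metric only needs to subclass the `Metric` base class shown in `mmpt/evaluators/metric.py` later in this page and implement `compute_metrics`/`print_computed_metrics`; the `Evaluator` then looks it up by name via the `metric` field of the testing config (so it should live in, or be imported into, `mmpt/evaluators/metric.py`). The sketch below is a hypothetical example, not an existing class in the repo.

```python
import numpy as np

from mmpt.evaluators import Metric

class MedianRankMetric(Metric):
    """Hypothetical metric reporting only the median rank of a retrieval run."""

    def __init__(self, config, metric_names=["MR"]):
        super().__init__(config, metric_names)

    def compute_metrics(self, outputs, texts, **kwargs):
        # `outputs` is a text-by-video similarity matrix whose diagonal holds
        # the ground-truth pairs, as in RetrievalMetric.
        ranks = (-outputs).argsort(axis=1).argsort(axis=1).diagonal() + 1
        return {"MR": float(np.median(ranks))}

    def print_computed_metrics(self, metrics):
        print("Median R: {}".format(metrics["MR"]))
```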
Launching a test is as simple as training: specify the path of a testing config:
```python locallaunch.py projects/mfmmlm/test_vtt.yaml```
Testing will be launched locally by default since prediction is computationally less expensive.
### Third-party Libraries
The following finetuning tasks require third-party libraries:
Youcook captioning: `https://github.com/Maluuba/nlg-eval`
CrossTask: `https://github.com/DmZhukov/CrossTask`'s `dp` under `third-party/CrossTask` (`python setup.py build_ext --inplace`)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import os
from omegaconf import OmegaConf
from mmpt.utils import recursive_config, overwrite_dir
from mmpt_cli.localjob import LocalJob
class JobLauncher(object):
JOB_CONFIG = {
"local": LocalJob,
}
def __init__(self, yaml_file):
self.yaml_file = yaml_file
job_key = "local"
if yaml_file.endswith(".yaml"):
config = recursive_config(yaml_file)
if config.task_type is not None:
job_key = config.task_type.split("_")[0]
else:
raise ValueError("unknown extension of job file:", yaml_file)
self.job_key = job_key
def __call__(self, job_type=None, dryrun=False):
if job_type is not None:
self.job_key = job_type.split("_")[0]
print("[JobLauncher] job_key", self.job_key)
job = JobLauncher.JOB_CONFIG[self.job_key](
self.yaml_file, job_type=job_type, dryrun=dryrun)
return job.submit()
class Pipeline(object):
"""a job that loads yaml config."""
def __init__(self, fn):
"""
load a yaml config of a job and save generated configs as yaml for each task.
return: a list of files to run as specified by `run_task`.
"""
if fn.endswith(".py"):
# a python command.
self.backend = "python"
self.run_yamls = [fn]
return
job_config = recursive_config(fn)
if job_config.base_dir is None: # single file job config.
self.run_yamls = [fn]
return
self.project_dir = os.path.join("projects", job_config.project_dir)
self.run_dir = os.path.join("runs", job_config.project_dir)
if job_config.run_task is not None:
run_yamls = []
for stage in job_config.run_task:
# each stage can have multiple tasks running in parallel.
if OmegaConf.is_list(stage):
stage_yamls = []
for task_file in stage:
stage_yamls.append(
os.path.join(self.project_dir, task_file))
run_yamls.append(stage_yamls)
else:
run_yamls.append(os.path.join(self.project_dir, stage))
self.run_yamls = run_yamls
configs_to_save = self._overwrite_task(job_config)
self._save_configs(configs_to_save)
def __getitem__(self, idx):
yaml_files = self.run_yamls[idx]
if isinstance(yaml_files, list):
return [JobLauncher(yaml_file) for yaml_file in yaml_files]
return [JobLauncher(yaml_files)]
def __len__(self):
return len(self.run_yamls)
def _save_configs(self, configs_to_save: dict):
# save
os.makedirs(self.project_dir, exist_ok=True)
for config_file in configs_to_save:
config = configs_to_save[config_file]
print("saving", config_file)
OmegaConf.save(config=config, f=config_file)
def _overwrite_task(self, job_config):
configs_to_save = {}
self.base_project_dir = os.path.join("projects", job_config.base_dir)
self.base_run_dir = os.path.join("runs", job_config.base_dir)
for config_sets in job_config.task_group:
overwrite_config = job_config.task_group[config_sets]
if (
overwrite_config.task_list is None
or len(overwrite_config.task_list) == 0
):
print(
"[warning]",
job_config.task_group,
"has no task_list specified.")
# we don't want this added to a final config.
task_list = overwrite_config.pop("task_list", None)
for config_file in task_list:
config_file_path = os.path.join(
self.base_project_dir, config_file)
config = recursive_config(config_file_path)
# overwrite it.
if overwrite_config:
config = OmegaConf.merge(config, overwrite_config)
overwrite_dir(config, self.run_dir, basedir=self.base_run_dir)
save_file_path = os.path.join(self.project_dir, config_file)
configs_to_save[save_file_path] = config
return configs_to_save
def main(args):
job_type = args.jobtype if args.jobtype else None
# parse multiple pipelines.
pipelines = [Pipeline(fn) for fn in args.yamls.split(",")]
for pipe_id, pipeline in enumerate(pipelines):
if not hasattr(pipeline, "project_dir"):
for job in pipeline[0]:
job(job_type=job_type, dryrun=args.dryrun)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("yamls", type=str)
parser.add_argument(
"--dryrun",
action="store_true",
help="run config and prepare to submit without launch the job.",
)
parser.add_argument(
"--jobtype", type=str, default="",
help="force to run jobs as specified.")
args = parser.parse_args()
main(args)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
try:
# fairseq user dir
from .datasets import FairseqMMDataset
from .losses import FairseqCriterion
from .models import FairseqMMModel
from .tasks import FairseqMMTask
except ImportError:
pass
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from .mmdataset import *
try:
from .fairseqmmdataset import *
except ImportError:
pass
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
TODO (huxu): fairseq wrapper class for all dataset you defined: mostly MMDataset.
"""
from collections import OrderedDict
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate
from fairseq.data import FairseqDataset, data_utils
class FairseqMMDataset(FairseqDataset):
"""
A wrapper class for MMDataset for fairseq.
"""
def __init__(self, mmdataset):
if not isinstance(mmdataset, Dataset):
raise TypeError("mmdataset must be of type `torch.utils.data.Dataset`.")
self.mmdataset = mmdataset
def set_epoch(self, epoch, **unused):
super().set_epoch(epoch)
self.epoch = epoch
def __getitem__(self, idx):
with data_utils.numpy_seed(43211, self.epoch, idx):
return self.mmdataset[idx]
def __len__(self):
return len(self.mmdataset)
def collater(self, samples):
if hasattr(self.mmdataset, "collator"):
return self.mmdataset.collator(samples)
if len(samples) == 0:
return {}
if isinstance(samples[0], dict):
batch = OrderedDict()
for key in samples[0]:
if samples[0][key] is not None:
batch[key] = default_collate([sample[key] for sample in samples])
return batch
else:
return default_collate(samples)
def size(self, index):
"""dummy implementation: we don't use --max-tokens"""
return 1
def num_tokens(self, index):
"""dummy implementation: we don't use --max-tokens"""
return 1
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import torch
from collections import OrderedDict
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate
from ..utils import set_seed
class MMDataset(Dataset):
"""
A generic multi-modal dataset.
Args:
`meta_processor`: a meta processor,
handling loading of meta data; returns video_id and text_id.
`video_processor`: a video processor,
handling e.g., decoding and loading of .np files.
`text_processor`: a text processor,
handling e.g., tokenization.
`aligner`: combines the video and text features
into one training example.
"""
def __init__(
self,
meta_processor,
video_processor,
text_processor,
align_processor,
):
self.split = meta_processor.split
self.meta_processor = meta_processor
self.video_processor = video_processor
self.text_processor = text_processor
self.align_processor = align_processor
def __len__(self):
return len(self.meta_processor)
def __getitem__(self, idx):
if self.split == "test":
set_seed(idx)
video_id, text_id = self.meta_processor[idx]
video_feature = self.video_processor(video_id)
text_feature = self.text_processor(text_id)
output = self.align_processor(video_id, video_feature, text_feature)
# TODO (huxu): the following is for debug purpose.
output.update({"idx": idx})
return output
def collater(self, samples):
"""This collator is deprecated.
set self.collator = MMDataset.collater.
see collator in FairseqMMDataset.
"""
if len(samples) == 0:
return {}
if isinstance(samples[0], dict):
batch = OrderedDict()
for key in samples[0]:
if samples[0][key] is not None:
batch[key] = default_collate(
[sample[key] for sample in samples])
# if torch.is_tensor(batch[key]):
# print(key, batch[key].size())
# else:
# print(key, len(batch[key]))
return batch
else:
return default_collate(samples)
def print_example(self, output):
print("[one example]", output["video_id"])
if (
hasattr(self.align_processor, "subsampling")
and self.align_processor.subsampling is not None
and self.align_processor.subsampling > 1
):
for key in output:
if torch.is_tensor(output[key]):
output[key] = output[key][0]
# search tokenizer to translate ids back.
tokenizer = None
if hasattr(self.text_processor, "tokenizer"):
tokenizer = self.text_processor.tokenizer
elif hasattr(self.align_processor, "tokenizer"):
tokenizer = self.align_processor.tokenizer
if tokenizer is not None:
caps = output["caps"].tolist()
if isinstance(caps[0], list):
caps = caps[0]
print("caps", tokenizer.decode(caps))
print("caps", tokenizer.convert_ids_to_tokens(caps))
for key, value in output.items():
if torch.is_tensor(value):
if len(value.size()) >= 3: # attention_mask.
print(key, value.size())
print(key, "first", value[0, :, :])
print(key, "last", value[-1, :, :])
else:
print(key, value)
print("[end of one example]")
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from .metric import *
from .evaluator import *
# experimental.
try:
from .expmetric import *
except ImportError:
pass
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import glob
import numpy as np
from . import metric as metric_path
from . import predictor as predictor_path
class Evaluator(object):
"""
perform evaluation on a single (downstream) task.
make this both offline and online.
TODO(huxu) saving evaluation results.
"""
def __init__(self, config, eval_dataloader=None):
if config.metric is None:
raise ValueError("config.metric is", config.metric)
metric_cls = getattr(metric_path, config.metric)
self.metric = metric_cls(config)
if config.predictor is None:
raise ValueError("config.predictor is", config.predictor)
predictor_cls = getattr(predictor_path, config.predictor)
self.predictor = predictor_cls(config)
self.eval_dataloader = eval_dataloader
def __call__(self):
try:
print(self.predictor.pred_dir)
for pred_file in glob.glob(
self.predictor.pred_dir + "/*_merged.npy"):
outputs = np.load(pred_file)
results = self.metric.compute_metrics(outputs)
self.metric.print_computed_metrics(results)
outputs = np.load(os.path.join(
self.predictor.pred_dir, "merged.npy"))
results = self.metric.compute_metrics(outputs)
return {"results": results, "metric": self.metric}
except FileNotFoundError:
print("\n[missing]", self.predictor.pred_dir)
return {}
def evaluate(self, model, eval_dataloader=None, output_file="merged"):
if eval_dataloader is None:
eval_dataloader = self.eval_dataloader
outputs = self.predictor.predict_loop(
model, eval_dataloader, output_file)
results = self.metric.compute_metrics(**outputs)
return results
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import numpy as np
import json
class Metric(object):
def __init__(self, config, metric_names):
self.metric_names = metric_names
def best_metric(self, metric):
return metric[self.metric_names[0]]
def save_metrics(self, fn, metrics):
with open(fn, "w") as fw:
json.dump(metrics, fw)
def print_computed_metrics(self, metrics):
raise NotImplementedError
class RetrievalMetric(Metric):
"""
this is modified from `howto100m/metrics.py`.
History of changes:
refactor as a class.
add metric_key in __init__
"""
def __init__(self, config, metric_names=["R1", "R5", "R10", "MR"]):
super().__init__(config, metric_names)
self.error = False # TODO(huxu): add to config to print error.
def compute_metrics(self, outputs, texts, **kwargs):
x = outputs
sx = np.sort(-x, axis=1)
d = np.diag(-x)
d = d[:, np.newaxis]
ind = sx - d
ind = np.where(ind == 0)
ind = ind[1]
metrics = {}
metrics["R1"] = float(np.sum(ind == 0)) / len(ind)
metrics["R5"] = float(np.sum(ind < 5)) / len(ind)
metrics["R10"] = float(np.sum(ind < 10)) / len(ind)
metrics["MR"] = np.median(ind) + 1
max_idx = np.argmax(outputs, axis=1)
if self.error:
# print top-20 errors.
error = []
for ex_idx in range(20):
error.append((texts[ex_idx], texts[max_idx[ex_idx]]))
metrics["error"] = error
return metrics
def print_computed_metrics(self, metrics):
r1 = metrics["R1"]
r5 = metrics["R5"]
r10 = metrics["R10"]
mr = metrics["MR"]
print(
"R@1: {:.4f} - R@5: {:.4f} - R@10: {:.4f} - Median R: {}".format(
r1, r5, r10, mr
)
)
if "error" in metrics:
print(metrics["error"])
class DiDeMoMetric(Metric):
"""
History of changes:
python 2.x to python 3.x.
merge utils.py into eval to save one file.
reference: https://github.com/LisaAnne/LocalizingMoments/blob/master/utils/eval.py
Code to evaluate your results on the DiDeMo dataset.
"""
def __init__(self, config, metric_names=["rank1", "rank5", "miou"]):
super().__init__(config, metric_names)
def compute_metrics(self, outputs, targets, **kwargs):
assert len(outputs) == len(targets)
rank1, rank5, miou = self._eval_predictions(outputs, targets)
metrics = {
"rank1": rank1,
"rank5": rank5,
"miou": miou
}
return metrics
def print_computed_metrics(self, metrics):
rank1 = metrics["rank1"]
rank5 = metrics["rank5"]
miou = metrics["miou"]
# print("Average rank@1: %f" % rank1)
# print("Average rank@5: %f" % rank5)
# print("Average iou: %f" % miou)
print(
"Average rank@1: {:.4f} Average rank@5: {:.4f} Average iou: {:.4f}".format(
rank1, rank5, miou
)
)
def _iou(self, pred, gt):
intersection = max(0, min(pred[1], gt[1]) + 1 - max(pred[0], gt[0]))
union = max(pred[1], gt[1]) + 1 - min(pred[0], gt[0])
return float(intersection)/union
def _rank(self, pred, gt):
return pred.index(tuple(gt)) + 1
def _eval_predictions(self, segments, data):
'''
Inputs:
segments: For each item in the ground truth data, rank possible video segments given the description and video.
In DiDeMo, there are 21 possible moments extracted for each video, so the list of video segments will be of length 21.
The first video segment should be the video segment that best corresponds to the text query.
There are 4180 sentences in the validation data, so when evaluating a model on the val dataset,
segments should be a list of length 4180, and each item in segments should be a list of length 21.
data: ground truth data
'''
average_ranks = []
average_iou = []
for s, d in zip(segments, data):
pred = s[0]
ious = [self._iou(pred, t) for t in d['times']]
average_iou.append(np.mean(np.sort(ious)[-3:]))
ranks = [self._rank(s, t) for t in d['times'] if tuple(t) in s] # if t in s] is added for s, e not in prediction.
average_ranks.append(np.mean(np.sort(ranks)[:3]))
rank1 = np.sum(np.array(average_ranks) <= 1)/float(len(average_ranks))
rank5 = np.sum(np.array(average_ranks) <= 5)/float(len(average_ranks))
miou = np.mean(average_iou)
# print("Average rank@1: %f" % rank1)
# print("Average rank@5: %f" % rank5)
# print("Average iou: %f" % miou)
return rank1, rank5, miou
class NLGMetric(Metric):
def __init__(
self,
config,
metric_names=[
"Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4",
"METEOR", "ROUGE_L", "CIDEr"
]
):
super().__init__(config, metric_names)
# please install NLGEval from `https://github.com/Maluuba/nlg-eval`
from nlgeval import NLGEval
self.nlg = NLGEval()
def compute_metrics(self, outputs, targets, **kwargs):
return self.nlg.compute_metrics(
hyp_list=outputs, ref_list=targets)
def print_computed_metrics(self, metrics):
Bleu_1 = metrics["Bleu_1"]
Bleu_2 = metrics["Bleu_2"]
Bleu_3 = metrics["Bleu_3"]
Bleu_4 = metrics["Bleu_4"]
METEOR = metrics["METEOR"]
ROUGE_L = metrics["ROUGE_L"]
CIDEr = metrics["CIDEr"]
print(
"Bleu_1: {:.4f} - Bleu_2: {:.4f} - Bleu_3: {:.4f} - Bleu_4: {:.4f} - METEOR: {:.4f} - ROUGE_L: {:.4f} - CIDEr: {:.4f}".format(
Bleu_1, Bleu_2, Bleu_3, Bleu_4, METEOR, ROUGE_L, CIDEr
)
)
class QAMetric(Metric):
def __init__(
self,
config,
metric_names=["acc"]
):
super().__init__(config, metric_names)
def compute_metrics(self, outputs, targets, **kwargs):
from sklearn.metrics import accuracy_score
return {"acc": accuracy_score(targets, outputs)}
def print_computed_metrics(self, metrics):
print("acc: {:.4f}".format(metrics["acc"]))
class COINActionSegmentationMetric(Metric):
"""
COIN dataset listed 3 repos for Action Segmentation.
Action Sets, NeuralNetwork-Viterbi, TCFPN-ISBA.
The first and second are the same.
https://github.com/alexanderrichard/action-sets/blob/master/eval.py
Future reference for the third:
`https://github.com/Zephyr-D/TCFPN-ISBA/blob/master/utils/metrics.py`
"""
def __init__(self, config, metric_name=["frame_acc"]):
super().__init__(config, metric_name)
def compute_metrics(self, outputs, targets):
n_frames = 0
n_errors = 0
n_errors = sum(outputs != targets)
n_frames = len(targets)
return {"frame_acc": 1.0 - float(n_errors) / n_frames}
def print_computed_metrics(self, metrics):
fa = metrics["frame_acc"]
print("frame accuracy:", fa)
class CrossTaskMetric(Metric):
def __init__(self, config, metric_names=["recall"]):
super().__init__(config, metric_names)
def compute_metrics(self, outputs, targets, **kwargs):
"""refactored from line 166:
https://github.com/DmZhukov/CrossTask/blob/master/train.py"""
recalls = self._get_recalls(Y_true=targets, Y_pred=outputs)
results = {}
for task, rec in recalls.items():
results[str(task)] = rec
avg_recall = np.mean(list(recalls.values()))
results["recall"] = avg_recall
return results
def print_computed_metrics(self, metrics):
print('Recall: {0:0.3f}'.format(metrics["recall"]))
for task in metrics:
if task != "recall":
print('Task {0}. Recall = {1:0.3f}'.format(
task, metrics[task]))
def _get_recalls(self, Y_true, Y_pred):
"""refactored from
https://github.com/DmZhukov/CrossTask/blob/master/train.py"""
step_match = {task: 0 for task in Y_true.keys()}
step_total = {task: 0 for task in Y_true.keys()}
for task, ys_true in Y_true.items():
ys_pred = Y_pred[task]
for vid in set(ys_pred.keys()).intersection(set(ys_true.keys())):
y_true = ys_true[vid]
y_pred = ys_pred[vid]
step_total[task] += (y_true.sum(axis=0) > 0).sum()
step_match[task] += (y_true*y_pred).sum()
recalls = {
task: step_match[task] / n for task, n in step_total.items()}
return recalls
class ActionRecognitionMetric(Metric):
def __init__(
self,
config,
metric_names=["acc", "acc_splits", "r1_splits", "r5_splits", "r10_splits"]
):
super().__init__(config, metric_names)
def compute_metrics(self, outputs, targets, splits, **kwargs):
all_video_embd = outputs
labels = targets
split1, split2, split3 = splits
accs = []
r1s = []
r5s = []
r10s = []
for split in range(3):
if split == 0:
s = split1
elif split == 1:
s = split2
else:
s = split3
X_pred = all_video_embd[np.where(s == 2)[0]]
label_test = labels[np.where(s == 2)[0]]
logits = X_pred
X_pred = np.argmax(X_pred, axis=1)
acc = np.sum(X_pred == label_test) / float(len(X_pred))
accs.append(acc)
# compute recall.
sorted_pred = (-logits).argsort(axis=-1)
label_test_sp = label_test.reshape(-1, 1)
r1 = np.mean((sorted_pred[:, :1] == label_test_sp).sum(axis=1), axis=0)
r5 = np.mean((sorted_pred[:, :5] == label_test_sp).sum(axis=1), axis=0)
r10 = np.mean((sorted_pred[:, :10] == label_test_sp).sum(axis=1), axis=0)
r1s.append(r1)
r5s.append(r5)
r10s.append(r10)
return {"acc": accs[0], "acc_splits": accs, "r1_splits": r1s, "r5_splits": r5s, "r10_splits": r10s}
def print_computed_metrics(self, metrics):
for split, acc in enumerate(metrics["acc_splits"]):
print("Top 1 accuracy on split {}: {}; r1 {}; r5 {}; r10 {}".format(
split + 1, acc,
metrics["r1_splits"][split],
metrics["r5_splits"][split],
metrics["r10_splits"][split],
)
)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from .loss import *
from .nce import *
try:
from .fairseqmmloss import *
except ImportError:
pass
try:
from .expnce import *
except ImportError:
pass