Overview
========
Fairseq can be extended through user-supplied `plug-ins
<https://en.wikipedia.org/wiki/Plug-in_(computing)>`_. We support five kinds of
plug-ins:
- :ref:`Models` define the neural network architecture and encapsulate all of the
learnable parameters.
- :ref:`Criterions` compute the loss function given the model outputs and targets.
- :ref:`Tasks` store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
- :ref:`Optimizers` update the Model parameters based on the gradients.
- :ref:`Learning Rate Schedulers` update the learning rate over the course of
training.
**Training Flow**
Given a ``model``, ``criterion``, ``task``, ``optimizer`` and ``lr_scheduler``,
fairseq implements the following high-level training flow::
for epoch in range(num_epochs):
itr = task.get_batch_iterator(task.dataset('train'))
for num_updates, batch in enumerate(itr):
task.train_step(batch, model, criterion, optimizer)
average_and_clip_gradients()
optimizer.step()
lr_scheduler.step_update(num_updates)
lr_scheduler.step(epoch)
where the default implementation for ``task.train_step`` is roughly::
def train_step(self, batch, model, criterion, optimizer, **unused):
loss = criterion(model, batch)
optimizer.backward(loss)
return loss
**Registering new plug-ins**
New plug-ins are *registered* through a set of ``@register`` function
decorators, for example::
@register_model('my_lstm')
class MyLSTM(FairseqEncoderDecoderModel):
(...)
Once registered, new plug-ins can be used with the existing :ref:`Command-line
Tools`. See the Tutorial sections for more detailed walkthroughs of how to add
new plug-ins.
**Loading plug-ins from another directory**
New plug-ins can be defined in a custom module stored in the user system. In
order to import the module, and make the plugin available to *fairseq*, the
command line supports the ``--user-dir`` flag that can be used to specify a
custom location for additional modules to load into *fairseq*.
For example, assuming this directory tree::
/home/user/my-module/
└── __init__.py
with ``__init__.py``::
from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big
@register_model_architecture('transformer', 'my_transformer')
def transformer_mmt_big(args):
transformer_vaswani_wmt_en_de_big(args)
it is possible to invoke the :ref:`fairseq-train` script with the new architecture with::
fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
.. role:: hidden
:class: hidden-section
.. module:: fairseq.tasks
.. _Tasks:
Tasks
=====
Tasks store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
Tasks can be selected via the ``--task`` command-line argument. Once selected, a
task may expose additional command-line arguments for further configuration.
Example usage::
# setup the task (e.g., load dictionaries)
task = fairseq.tasks.setup_task(args)
# build model and criterion
model = task.build_model(args)
criterion = task.build_criterion(args)
# load datasets
task.load_dataset('train')
task.load_dataset('valid')
# iterate over mini-batches of data
batch_itr = task.get_batch_iterator(
task.dataset('train'), max_tokens=4096,
)
for batch in batch_itr:
# compute the loss
loss, sample_size, logging_output = task.get_loss(
model, criterion, batch,
)
loss.backward()
Translation
-----------
.. autoclass:: fairseq.tasks.translation.TranslationTask
.. _language modeling:
Language Modeling
-----------------
.. autoclass:: fairseq.tasks.language_modeling.LanguageModelingTask
Adding new tasks
----------------
.. autofunction:: fairseq.tasks.register_task
.. autoclass:: fairseq.tasks.FairseqTask
:members:
:undoc-members:
Tutorial: Classifying Names with a Character-Level RNN
======================================================
In this tutorial we will extend fairseq to support *classification* tasks. In
particular we will re-implement the PyTorch tutorial for `Classifying Names with
a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`_
in fairseq. It is recommended to quickly skim that tutorial before beginning
this one.
This tutorial covers:
1. **Preprocessing the data** to create dictionaries.
2. **Registering a new Model** that encodes an input sentence with a simple RNN
and predicts the output label.
3. **Registering a new Task** that loads our dictionaries and dataset.
4. **Training the Model** using the existing command-line tools.
5. **Writing an evaluation script** that imports fairseq and allows us to
interactively evaluate our model on new inputs.
1. Preprocessing the data
-------------------------
The original tutorial provides raw data, but we'll work with a modified version
of the data that is already tokenized into characters and split into separate
train, valid and test sets.
Download and extract the data from here:
`tutorial_names.tar.gz <https://dl.fbaipublicfiles.com/fairseq/data/tutorial_names.tar.gz>`_
Once extracted, let's preprocess the data using the :ref:`fairseq-preprocess`
command-line tool to create the dictionaries. While this tool is primarily
intended for sequence-to-sequence problems, we're able to reuse it here by
treating the label as a "target" sequence of length 1. We'll also output the
preprocessed files in "raw" format using the ``--dataset-impl`` option to
enhance readability:
.. code-block:: console
> fairseq-preprocess \
--trainpref names/train --validpref names/valid --testpref names/test \
--source-lang input --target-lang label \
--destdir names-bin --dataset-impl raw
After running the above command you should see a new directory,
:file:`names-bin/`, containing the dictionaries for *inputs* and *labels*.
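If you want to sanity-check the output, the generated dictionaries can be
loaded directly with fairseq's ``Dictionary`` class (a minimal sketch; the
exact type counts depend on your data)::

    from fairseq.data import Dictionary

    # Load the dictionaries produced by fairseq-preprocess.
    input_vocab = Dictionary.load('names-bin/dict.input.txt')
    label_vocab = Dictionary.load('names-bin/dict.label.txt')
    print('input types:', len(input_vocab))
    print('label types:', len(label_vocab))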
2. Registering a new Model
--------------------------
Next we'll register a new model in fairseq that will encode an input sentence
with a simple RNN and predict the output label. Compared to the original PyTorch
tutorial, our version will also work with batches of data and GPU Tensors.
First let's copy the simple RNN module implemented in the `PyTorch tutorial
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network>`_.
Create a new file named :file:`fairseq/models/rnn_classifier.py` with the
following contents::
import torch
import torch.nn as nn
class RNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super(RNN, self).__init__()
self.hidden_size = hidden_size
self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
self.i2o = nn.Linear(input_size + hidden_size, output_size)
self.softmax = nn.LogSoftmax(dim=1)
def forward(self, input, hidden):
combined = torch.cat((input, hidden), 1)
hidden = self.i2h(combined)
output = self.i2o(combined)
output = self.softmax(output)
return output, hidden
def initHidden(self):
return torch.zeros(1, self.hidden_size)
We must also *register* this model with fairseq using the
:func:`~fairseq.models.register_model` function decorator. Once the model is
registered we'll be able to use it with the existing :ref:`Command-line Tools`.
All registered models must implement the :class:`~fairseq.models.BaseFairseqModel`
interface, so we'll create a small wrapper class in the same file and register
it in fairseq with the name ``'rnn_classifier'``::
from fairseq.models import BaseFairseqModel, register_model
# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.
@register_model('rnn_classifier')
class FairseqRNNClassifier(BaseFairseqModel):
@staticmethod
def add_args(parser):
# Models can override this method to add new command-line arguments.
# Here we'll add a new command-line argument to configure the
# dimensionality of the hidden state.
parser.add_argument(
'--hidden-dim', type=int, metavar='N',
help='dimensionality of the hidden state',
)
@classmethod
def build_model(cls, args, task):
# Fairseq initializes models by calling the ``build_model()``
# function. This provides more flexibility, since the returned model
# instance can be of a different type than the one that was called.
# In this case we'll just return a FairseqRNNClassifier instance.
# Initialize our RNN module
rnn = RNN(
# We'll define the Task in the next section, but for now just
# notice that the task holds the dictionaries for the "source"
# (i.e., the input sentence) and "target" (i.e., the label).
input_size=len(task.source_dictionary),
hidden_size=args.hidden_dim,
output_size=len(task.target_dictionary),
)
# Return the wrapped version of the module
return FairseqRNNClassifier(
rnn=rnn,
input_vocab=task.source_dictionary,
)
def __init__(self, rnn, input_vocab):
super(FairseqRNNClassifier, self).__init__()
self.rnn = rnn
self.input_vocab = input_vocab
# The RNN module in the tutorial expects one-hot inputs, so we can
# precompute the identity matrix to help convert from indices to
# one-hot vectors. We register it as a buffer so that it is moved to
# the GPU when ``cuda()`` is called.
self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))
def forward(self, src_tokens, src_lengths):
# The inputs to the ``forward()`` function are determined by the
# Task, and in particular the ``'net_input'`` key in each
# mini-batch. We'll define the Task in the next section, but for
# now just know that *src_tokens* has shape `(batch, src_len)` and
# *src_lengths* has shape `(batch)`.
bsz, max_src_len = src_tokens.size()
# Initialize the RNN hidden state. Compared to the original PyTorch
# tutorial we'll also handle batched inputs and work on the GPU.
hidden = self.rnn.initHidden()
hidden = hidden.repeat(bsz, 1) # expand for batched inputs
hidden = hidden.to(src_tokens.device) # move to GPU
for i in range(max_src_len):
# WARNING: The inputs have padding, so we should mask those
# elements here so that padding doesn't affect the results.
# This is left as an exercise for the reader. The padding symbol
# is given by ``self.input_vocab.pad()`` and the unpadded length
# of each input is given by *src_lengths*.
# One-hot encode a batch of input characters.
input = self.one_hot_inputs[src_tokens[:, i].long()]
# Feed the input to our RNN.
output, hidden = self.rnn(input, hidden)
# Return the final output state for making a prediction
return output
Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch pytorch_tutorial_rnn``::
from fairseq.models import register_model_architecture
# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'rnn_classifier'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.
@register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
def pytorch_tutorial_rnn(args):
# We use ``getattr()`` to prioritize arguments that are explicitly given
# on the command-line, so that the defaults defined below are only used
# when no other value has been specified.
args.hidden_dim = getattr(args, 'hidden_dim', 128)
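Before wiring the model into a Task, you can smoke-test the RNN module above
with plain PyTorch tensors (a minimal sketch; the sizes below are placeholders,
since the real ones come from the Task's dictionaries in the next section)::

    import torch

    from fairseq.models.rnn_classifier import RNN

    vocab_size, hidden_dim, num_labels, bsz, src_len = 64, 128, 18, 4, 10

    rnn = RNN(input_size=vocab_size, hidden_size=hidden_dim, output_size=num_labels)
    one_hot = torch.eye(vocab_size)

    # Expand the initial hidden state for a batch, as the wrapper model does.
    hidden = rnn.initHidden().repeat(bsz, 1)
    tokens = torch.randint(vocab_size, (bsz, src_len))
    for i in range(src_len):
        output, hidden = rnn(one_hot[tokens[:, i]], hidden)
    print(output.shape)  # torch.Size([4, 18]), log-probabilities over labels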
3. Registering a new Task
-------------------------
Now we'll register a new :class:`~fairseq.tasks.FairseqTask` that will load our
dictionaries and dataset. Tasks can also control how the data is batched into
mini-batches, but in this tutorial we'll reuse the batching provided by
:class:`fairseq.data.LanguagePairDataset`.
Create a new file named :file:`fairseq/tasks/simple_classification.py` with the
following contents::
import os
import torch
from fairseq.data import Dictionary, LanguagePairDataset
from fairseq.tasks import LegacyFairseqTask, register_task
@register_task('simple_classification')
class SimpleClassificationTask(LegacyFairseqTask):
@staticmethod
def add_args(parser):
# Add some command-line arguments for specifying where the data is
# located and the maximum supported input length.
parser.add_argument('data', metavar='FILE',
help='file prefix for data')
parser.add_argument('--max-positions', default=1024, type=int,
help='max input length')
@classmethod
def setup_task(cls, args, **kwargs):
# Here we can perform any setup required for the task. This may include
# loading Dictionaries, initializing shared Embedding layers, etc.
# In this case we'll just load the Dictionaries.
input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
print('| [input] dictionary: {} types'.format(len(input_vocab)))
print('| [label] dictionary: {} types'.format(len(label_vocab)))
return SimpleClassificationTask(args, input_vocab, label_vocab)
def __init__(self, args, input_vocab, label_vocab):
super().__init__(args)
self.input_vocab = input_vocab
self.label_vocab = label_vocab
def load_dataset(self, split, **kwargs):
"""Load a given dataset split (e.g., train, valid, test)."""
prefix = os.path.join(self.args.data, '{}.input-label'.format(split))
# Read input sentences.
sentences, lengths = [], []
with open(prefix + '.input', encoding='utf-8') as file:
for line in file:
sentence = line.strip()
# Tokenize the sentence, splitting on spaces
tokens = self.input_vocab.encode_line(
sentence, add_if_not_exist=False,
)
sentences.append(tokens)
lengths.append(tokens.numel())
# Read labels.
labels = []
with open(prefix + '.label', encoding='utf-8') as file:
for line in file:
label = line.strip()
labels.append(
# Convert label to a numeric ID.
torch.LongTensor([self.label_vocab.add_symbol(label)])
)
assert len(sentences) == len(labels)
print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))
# We reuse LanguagePairDataset since classification can be modeled as a
# sequence-to-sequence task where the target sequence has length 1.
self.datasets[split] = LanguagePairDataset(
src=sentences,
src_sizes=lengths,
src_dict=self.input_vocab,
tgt=labels,
tgt_sizes=torch.ones(len(labels)), # targets have length 1
tgt_dict=self.label_vocab,
left_pad_source=False,
# Since our target is a single class label, there's no need for
# teacher forcing. If we set this to ``True`` then our Model's
# ``forward()`` method would receive an additional argument called
# *prev_output_tokens* that would contain a shifted version of the
# target sequence.
input_feeding=False,
)
def max_positions(self):
"""Return the max input length allowed by the task."""
# The source should be less than *args.max_positions* and the "target"
# has max length 1.
return (self.args.max_positions, 1)
@property
def source_dictionary(self):
"""Return the source :class:`~fairseq.data.Dictionary`."""
return self.input_vocab
@property
def target_dictionary(self):
"""Return the target :class:`~fairseq.data.Dictionary`."""
return self.label_vocab
# We could override this method if we wanted more control over how batches
# are constructed, but it's not necessary for this tutorial since we can
# reuse the batching provided by LanguagePairDataset.
#
# def get_batch_iterator(
# self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
# ignore_invalid_inputs=False, required_batch_size_multiple=1,
# seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
# data_buffer_size=0, disable_iterator_cache=False,
# ):
# (...)
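Mirroring the generic Task usage shown earlier, the new task can also be
exercised outside of :ref:`fairseq-train` (a minimal sketch, assuming the
preprocessed data from step 1 lives in :file:`names-bin/`)::

    import argparse

    from fairseq.tasks.simple_classification import SimpleClassificationTask

    # Hypothetical stand-in for the parsed command-line arguments.
    args = argparse.Namespace(data='names-bin', max_positions=1024)

    task = SimpleClassificationTask.setup_task(args)
    task.load_dataset('valid')

    # get_batch_iterator returns an epoch batch iterator.
    batch_itr = task.get_batch_iterator(task.dataset('valid'), max_tokens=1000)
    for batch in batch_itr.next_epoch_itr(shuffle=False):
        print(batch['net_input']['src_tokens'].shape)  # (batch, src_len)
        break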
4. Training the Model
---------------------
Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Task (``--task
simple_classification``) and Model architecture (``--arch
pytorch_tutorial_rnn``):
.. note::
You can also configure the dimensionality of the hidden state by passing the
``--hidden-dim`` argument to :ref:`fairseq-train`.
.. code-block:: console
> fairseq-train names-bin \
--task simple_classification \
--arch pytorch_tutorial_rnn \
--optimizer adam --lr 0.001 --lr-shrink 0.5 \
--max-tokens 1000
(...)
| epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
| epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
| done training in 31.6 seconds
The model files should appear in the :file:`checkpoints/` directory.
5. Writing an evaluation script
-------------------------------
Finally we can write a short script to evaluate our model on new inputs. Create
a new file named :file:`eval_classifier.py` with the following contents::
from fairseq import checkpoint_utils, data, options, tasks
# Parse command-line arguments for generation
parser = options.get_generation_parser(default_task='simple_classification')
args = options.parse_args_and_arch(parser)
# Setup task
task = tasks.setup_task(args)
# Load model
print('| loading model from {}'.format(args.path))
models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
model = models[0]
while True:
sentence = input('\nInput: ')
# Tokenize into characters
chars = ' '.join(list(sentence.strip()))
tokens = task.source_dictionary.encode_line(
chars, add_if_not_exist=False,
)
# Build mini-batch to feed to the model
batch = data.language_pair_dataset.collate(
samples=[{'id': -1, 'source': tokens}], # bsz = 1
pad_idx=task.source_dictionary.pad(),
eos_idx=task.source_dictionary.eos(),
left_pad_source=False,
input_feeding=False,
)
# Feed batch to the model and get predictions
preds = model(**batch['net_input'])
# Print top 3 predictions and their log-probabilities
top_scores, top_labels = preds[0].topk(k=3)
for score, label_idx in zip(top_scores, top_labels):
label_name = task.target_dictionary.string([label_idx])
print('({:.2f})\t{}'.format(score, label_name))
Now we can evaluate our model interactively. Note that we have included the
original data path (:file:`names-bin/`) so that the dictionaries can be loaded:
.. code-block:: console
> python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
| [input] dictionary: 64 types
| [label] dictionary: 24 types
| loading model from checkpoints/checkpoint_best.pt
Input: Satoshi
(-0.61) Japanese
(-1.20) Arabic
(-2.86) Italian
Input: Sinbad
(-0.30) Arabic
(-1.76) English
(-4.08) Russian
!*/*.sh
!*/*.md
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
runs
data
pretrained_models
projects/mmfusion_*
log_test
third-party
python_log
slurm_snapshot_code
lightning_logs
demos
### Config Files Explained
Take `projects/mfmmlm.yaml` as an example, which runs pretraining using a masked frame model (MFM) and a masked language model (MLM) on a single BERT:
```yaml
project_dir: mfmmlm # specify the project dir for this baseline.
run_task:
- how2.yaml # run pretraining on how2 when launching `projects/taskmfmmlm.yaml`
- [vtt.yaml, vttcap.yaml, vttqa.yaml, youcook.yaml, youcookcap.yaml, crosstask.yaml, coin.yaml] # run fine-tuning tasks.
base_dir: task # a global template folder to specify each training task.
task_group:
pretrain: # section for pretraining. Most baselines differ in this section.
task_list:
- how2.yaml # reconfig `projects/task/how2.yaml`
dataset:
aligner: MFMMLMAligner # overwrite the aligner for MFMMLM training task.
model:
model_cls: MMFusionMFMMLM # overwrite the model, which constructs negative examples for MFM on-the-fly.
loss:
loss_cls: MFMMLM # overwrite the loss as MFMMLM, which combines MFM and MLM together.
fairseq: # all fairseq args can be specified under this name.
dataset:
batch_size: 128
finetune: # section for fine-tuning tasks; mostly we don't need to change anything here, since we want to see how pretraining contributes to finetuning.
task_list: # specify the list of downstream tasks, e.g., copy `projects/task/vtt.yaml` to `projects/mfmmlm`.
- vtt.yaml
- vttqa.yaml
- youcook.yaml
- youcookcap.yaml
- crosstask.yaml
- coin.yaml
test: # section for testing.
task_list:
- test_vtt.yaml
- test_vttqa.yaml
- test_youcook.yaml
- test_youcookcap.yaml
- test_crosstask.yaml
- test_crosstask_zs.yaml
- test_coin.yaml
```
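The same project config can also be inspected programmatically with the helper used by `locallaunch.py` (a minimal sketch, assuming you run it from the repo root with the toolkit installed; `recursive_config` appears further down in `locallaunch.py`):

```python
from mmpt.utils import recursive_config

# Load the project-level config and peek at the pretraining overrides.
config = recursive_config("projects/mfmmlm.yaml")
print(config.project_dir)                           # mfmmlm
print(config.run_task)                              # stages: how2.yaml, then the fine-tuning tasks
print(config.task_group.pretrain.model.model_cls)   # MMFusionMFMMLM
```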
# Dataset
We understand that video data is challenging to download and process. For videos, we provide our preprocessing scripts under `scripts/video_feature_extractor` (deeply adapted from `https://github.com/antoine77340/video_feature_extractor`); for text, we provide pre-tokenization scripts under `scripts/text_token_extractor`.
### S3D Feature Extraction
We use pre-trained [S3D](https://github.com/antoine77340/S3D_HowTo100M) for video feature extraction. Please place the models as `pretrained_models/s3d_dict.npy` and `pretrained_models/s3d_howto100m.pth`.
We implement a `PathBuilder` to automatically track video ids and map source video paths to their feature locations (you may need `conda install -c anaconda pandas`). Decoding may need `pip install ffmpeg-python`.
### Howto100M
[Howto100M](https://www.di.ens.fr/willow/research/howto100m/) is a large-scale video pre-training dataset. You may download the videos yourself and run our preprocessing scripts.
Several key differences between our preprocessing and existing papers: (1) we use `raw_caption.json` instead of `caption.json` to have pure self-supervision on text (`caption.json` has manual removal of stop words); (2) we remove partially duplicated texts that were originally designed for real-time readability (see `mmpt/processors/dedupprocessor.py`); (3) we then shard video/text features using `ShardedTensor` in `mmpt/utils/shardedtensor.py` for fast loading during training (faster than `h5py`).
#### Steps
##### video
To extract video features, edit and run `bash scripts/video_feature_extractor/how2/s3d.sh` (consider running this on multiple machines; by default, we store features in fp16 to save space and for faster training).
Split the available video ids into `data/how2/how2_s3d_train.lst` and `data/how2/how2_s3d_val.lst`.
Lastly, pack the video features into a `ShardedTensor` using `python scripts/video_feature_extractor/shard_feature.py`.
##### text
Clean captions using `python -m mmpt.processors.dedupprocessor`.
Tokenize the deduplicated captions in `data/how2/raw_caption_dedup.pkl` into sharded numpy arrays:
```
python scripts/text_token_extractor/pretokenization.py scripts/text_token_extractor/configs/bert-base-uncased.yaml
```
### Youcook, MSRVTT etc.
We use the versions of Youcook and MSRVTT that come with Howto100M and MILNCE. Please download the data to `data/youcook` and `data/msrvtt` accordingly; see `projects/task/youcook.yaml`, `projects/task/vtt.yaml`, etc. for details.
We extract features for Youcook and MSRVTT similarly to the first step of Howto100M, but we read the text directly from the metadata and perform on-the-fly tokenization.
# VideoCLIP and VLM
You just found this toolkit for multimodal video understanding! It contains implementations of two recent multimodal video understanding papers, [VideoCLIP](https://arxiv.org/pdf/2109.14084.pdf) (EMNLP, 2021) and [VLM](https://aclanthology.org/2021.findings-acl.370.pdf) (ACL Findings, 2021), along with high-performance toolkits that are typically lacking in existing codebases. The toolkit is designed to contain generic, performance-tuned components that can potentially be adapted to other frameworks (we initially use fairseq).
VideoCLIP is a contrastive learning model for zero-shot transfer to retrieval/classification/sequence labeling style tasks.
<img src="videoclip.png" width="350" class="center">
VLM is a masked-language-model-style pre-training method using only one encoder with a masked modality model (MMM) for retrieval/generation/sequence labeling style tasks.
<img src="vlm.png" width="350" class="center">
### News
[Oct. 2021] Initial release of implementation for the following papers:
[VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/pdf/2109.14084.pdf) (Xu et al., EMNLP 2021)
[VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding](https://aclanthology.org/2021.findings-acl.370.pdf) (Xu et al., ACL Findings 2021)
### Installation
We aim to minimize this repo's dependencies on other packages.
We use fairseq as the main trainer (models/datasets have no dependency on fairseq; we will support other trainers in the future):
```
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e . # also optionally follow fairseq README for apex installation for fp16 training.
export MKL_THREADING_LAYER=GNU # fairseq may need this for numpy.
```
Then install this toolkit:
```
cd examples/MMPT # MMPT can be in any folder, not necessarily under fairseq/examples.
pip install -e .
```
The code was developed under Python 3.8.8, PyTorch 1.8, CUDA 11.0 with fairseq 1.0.0a0+af0389f, and tested under Python 3.8.8, PyTorch 1.9, CUDA 11.0, fairseq 1.0.0a0+8e7bc73 at the time of code release.
Most models require `transformers==3.4` for API compatibility (`pip install transformers==3.4`).
In addition, some downstream tasks may need `conda install pandas`.
### Usage
#### Download Checkpoints
We use pre-trained [S3D](https://github.com/antoine77340/S3D_HowTo100M) for video feature extraction. Please place the models as `pretrained_models/s3d_dict.npy` and `pretrained_models/s3d_howto100m.pth`.
Download VideoCLIP checkpoint `https://dl.fbaipublicfiles.com/MMPT/retri/videoclip/checkpoint_best.pt` to `runs/retri/videoclip` or VLM checkpoint `https://dl.fbaipublicfiles.com/MMPT/mtm/vlm/checkpoint_best.pt` to `runs/mtm/vlm`.
#### Demo of Inference
Run `python locallaunch.py projects/retri/videoclip.yaml --dryrun` to generate all the `.yaml` configs for VideoCLIP.
```python
import torch
from mmpt.models import MMPTModel
model, tokenizer, aligner = MMPTModel.from_pretrained(
"projects/retri/videoclip/how2.yaml")
model.eval()
# B, T, FPS, H, W, C (VideoCLIP is trained on 30 fps of s3d)
video_frames = torch.randn(1, 2, 30, 224, 224, 3)
caps, cmasks = aligner._build_text_seq(
tokenizer("some text", add_special_tokens=False)["input_ids"]
)
caps, cmasks = caps[None, :], cmasks[None, :] # bsz=1
with torch.no_grad():
output = model(video_frames, caps, cmasks, return_score=True)
print(output["score"]) # dot-product
```
#### Data Preparation
See [dataset](DATASET.md) for each dataset.
#### Global Config for Training Pipeline
We organize a global config file for a training/testing pipeline under `projects` (see the detailed [explanation](CONFIG.md)). For example, VideoCLIP is in `projects/retri/videoclip.yaml` and VLM is in `projects/mtm/vlm.yaml`.
We wrap all commands into `locallaunch.py` and `mmpt_cli/localjob.py`. You can inspect the concrete commands with `--dryrun` and then drop the flag for an actual run.
First, running `python locallaunch.py projects/retri/videoclip.yaml --dryrun` will generate the configs for pre-training, zero-shot evaluation, fine-tuning and testing of VideoCLIP under `projects/retri/videoclip`.
Then each training or evaluation process is configured by a concrete config file (for reproducibility, we save all complex arguments, including fairseq args, into the concrete config file). For example, to run zero-shot evaluation on Youcook:
```
python locallaunch.py projects/retri/videoclip/test_youcook_zs.yaml --jobtype local_predict # zero-shot evaluation.
python locallaunch.py projects/retri/videoclip/youcook_videoclip.yaml --jobtype local_single --dryrun # fine-tuning: use --dryrun to check cmds and drop it to make an actual run; local_small will run on two gpus (as in paper).
python locallaunch.py projects/retri/videoclip/test_youcook_videoclip.yaml --jobtype local_predict # testing on fine-tuned model.
```
Pretraining can be run as:
```
python locallaunch.py projects/retri/videoclip/how2.yaml --jobtype local_single --dryrun # check then drop --dryrun; the paper was run on local_big (8 GPUs).
```
You may need to change `--jobtype`; check/extend `LocalJob` in `mmpt_cli/localjob.py` for multi-GPU/multi-node pre-training.
Detailed instructions for pretraining and fine-tuning can be found in the [pretraining instructions](pretraining.md) and [finetuning instructions](endtask.md).
### Development
Several components of this toolkit can be re-used for future research (and also our ongoing research).
#### Framework Wrapper
We currently only support fairseq, but most components can easily fit into other frameworks such as HuggingFace. This repo is a fairseq `--user-dir` with fairseq wrappers. For example, `mmpt/tasks` includes a `FairseqMMTask`, which manages `mmpt/datasets` with `FairseqDataset`, `mmpt/models` with `FairseqModel`, and `mmpt/losses` with `FairseqCriterion`.
#### Processors
Multimodal research introduces complexity in modality alignment, from the different input sources all the way to the losses. Inspired by [MMF](https://github.com/facebookresearch/mmf), this toolkit leverages `mmpt/processors` to handle the various needs of data preprocessing and loading, alleviating the need for multiple `torch.utils.data.Dataset` classes (which can make ablation studies tricky).
Processors can also be decoupled from `torch.utils.data.Dataset` for offline preprocessing instead of on-the-fly data preprocessing.
We decouple `mmpt.MMDataset` into four types of processors: `MetaProcessor`, `VideoProcessor`, `TextProcessor` and `Aligner`. They can be configured in the `dataset` field of a config file (e.g., see `projects/task/how2.yaml`).
`MetaProcessor` is used to load the meta data of a dataset, e.g., all video ids of the how2 dataset.
`VideoProcessor` is used to load the video features of a dataset, e.g., S3D features for each second of a video.
`TextProcessor` is used to load the text (features), e.g., BERT pre-tokenized text clips for the how2 dataset (with `start`/`end` timestamps and `cap` for `token_ids`).
`Aligner` is the core class that prepares the training data for a given baseline, e.g., sampling a clip or masking tokens for MLM.
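The contract between these processors is small. Below is a sketch with hypothetical toy processors (not classes from the repo) that satisfy the interface implied by `mmpt.MMDataset` further down this page: the meta processor is indexable and returns `(video_id, text_id)`, the video/text processors are callables keyed by those ids, and the aligner combines the two features into one training example.

```python
import numpy as np

class ToyMetaProcessor:
    def __init__(self):
        self.split = "train"                 # MMDataset reads `.split`.
        self.video_ids = ["vid0", "vid1"]

    def __len__(self):
        return len(self.video_ids)

    def __getitem__(self, idx):
        video_id = self.video_ids[idx]
        return video_id, video_id            # (video_id, text_id)

class ToyVideoProcessor:
    def __call__(self, video_id):
        # e.g., 32 seconds of per-second features.
        return np.zeros((32, 512), dtype=np.float32)

class ToyTextProcessor:
    def __call__(self, text_id):
        return {"start": [0.0], "end": [3.0], "cap": [[101, 2023, 102]]}

class ToyAligner:
    def __call__(self, video_id, video_feature, text_feature):
        # Combine video and text into one training example.
        return {"video_id": video_id,
                "vfeats": video_feature,
                "caps": text_feature["cap"]}
```

These four objects can then be passed to `MMDataset(meta, video, text, aligner)`, as shown in `mmpt/datasets/mmdataset.py` further down this page.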
#### Performance-tuned Components
To speed up pre-training, this toolkit uses sharded features stored in memory-mapped numpy arrays, backed by `ShardedTensor` in `mmpt/utils/shardedtensor.py` (adopted from the MARGE paper). This reduces the I/O load of multi-GPU training by avoiding loading all features of a video into memory each time, and `ShardedTensor` ensures features are stored in contiguous disk space for near-random access. This is used for both How2 video features and texts in `mmpt/processors/how2processor.py`.
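The actual `ShardedTensor` implementation lives in `mmpt/utils/shardedtensor.py`; the sketch below only illustrates the underlying idea with plain numpy memmaps (variable-length per-video features packed contiguously on disk plus an offset index for near-random access) and is not the toolkit's API.

```python
import numpy as np

# Toy example: pack 3 videos' features back-to-back into one shard on disk.
feats = [np.random.rand(np.random.randint(20, 60), 512).astype(np.float16)
         for _ in range(3)]
offsets = np.cumsum([0] + [f.shape[0] for f in feats])

shard = np.memmap("shard0.dat", dtype=np.float16, mode="w+",
                  shape=(int(offsets[-1]), 512))
for i, f in enumerate(feats):
    shard[offsets[i]:offsets[i + 1]] = f
shard.flush()

# At training time, memmap the shard and slice out one video without
# loading the whole shard into memory.
shard = np.memmap("shard0.dat", dtype=np.float16, mode="r",
                  shape=(int(offsets[-1]), 512))
video_1 = np.asarray(shard[offsets[1]:offsets[2]])
```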
### Citation
If this codebase is useful for your work, please cite the following papers:
```BibTeX
@inproceedings{xu-etal-2021-videoclip,
title = "{VideoCLIP}: Contrastive Pre-training for\\Zero-shot Video-Text Understanding",
author = "Xu, Hu and
Ghosh, Gargi and
Huang, Po-Yao and
Okhonko, Dmytro and
Aghajanyan, Armen and
Metze, Florian and
Zettlemoyer, Luke and
Feichtenhofer, Christoph",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
}
@inproceedings{xu-etal-2021-vlm,
title = "{VLM}: Task-agnostic Video-Language Model Pre-training for Video Understanding",
author = "Xu, Hu and
Ghosh, Gargi and
Huang, Po-Yao and
Arora, Prahal and
Aminzadeh, Masoumeh and
Feichtenhofer, Christoph and
Metze, Florian and
Zettlemoyer, Luke",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.370",
doi = "10.18653/v1/2021.findings-acl.370",
pages = "4227--4239",
}
```
### Bug Reports
This repo is in its initial stage; bug reports are welcome at huxu@fb.com.
### Copyright
The majority of Multimodal Pre-training (MMPT) is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: Evaluation Codes/Models: Howto100M and HuggingFace Transformers are licensed under the Apache 2.0 license; COIN and NLG-eval are licensed under the MIT license; CrossTask is licensed under BSD-3; DiDeMo is licensed under the BSD-2 license.
# Zero-shot Transfer and Finetuning
(If you are new to the ideas of `mmpt.processors`, see [README](README.md) first.)
All finetuning datasets (specifically `processors`) are defined in `mmpt.processors.dsprocessor`.
Given the complexity of different types of finetuning tasks, each task may have its own meta/video/text/aligner processors and its own `mmpt/evaluators/{Predictor,Metric}`.
### Tasks
Currently, we support 5 end datasets: `MSRVTT`, `Youcook`, `COIN`, `Crosstask` and `DiDeMo` with the following tasks:
text-video retrieval: `MSRVTT`, `Youcook`, `DiDeMo`;
video captioning: `Youcook`;
video question answering: `MSRVTT-QA`.
To add your own dataset, specify the corresponding processors and configure them in the `dataset` field of a config file, such as `projects/task/vtt.yaml`.
### Zero-shot Transfer (no Training)
Zero-shot transfer runs the pre-trained model (e.g., VideoCLIP) directly on testing data. Configs with the pattern `projects/task/*_zs_*.yaml` are dedicated to zero-shot transfer.
### Fine-tuning
Training a downstream task is similar to pretraining, except that you may need to specify `restore_file` in `fairseq.checkpoint` and reset the optimizers; see `projects/task/ft.yaml`, which is included by `projects/task/vtt.yaml`.
We typically do finetuning on 2 GPUs (`local_small`).
### Testing
For each finetuning dataset, you may need to specify a testing config, similar to `projects/task/test_vtt.yaml`.
We define `mmpt.evaluators.Predictor` for different types of prediction. For example, `MSRVTT` and `Youcook` are video-retrieval tasks and are expected to use `RetrievalPredictor`. You may need to define your own type of predictor and specify it in the `predictor` field of a testing config.
Each task may also have its own metric for evaluation. This can be created in `mmpt.evaluators.Metric` and specified in the `metric` field of a testing config.
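A new metric only needs to subclass the `Metric` base class shown in `mmpt/evaluators/metric.py` later in this page and implement `compute_metrics`/`print_computed_metrics`; the `Evaluator` then looks it up by name via the `metric` field of the testing config (so it should live in, or be imported into, `mmpt/evaluators/metric.py`). The sketch below is a hypothetical example, not an existing class in the repo.

```python
import numpy as np

from mmpt.evaluators import Metric

class MedianRankMetric(Metric):
    """Hypothetical metric reporting only the median rank of a retrieval run."""

    def __init__(self, config, metric_names=["MR"]):
        super().__init__(config, metric_names)

    def compute_metrics(self, outputs, texts, **kwargs):
        # `outputs` is a text-by-video similarity matrix whose diagonal holds
        # the ground-truth pairs, as in RetrievalMetric.
        ranks = (-outputs).argsort(axis=1).argsort(axis=1).diagonal() + 1
        return {"MR": float(np.median(ranks))}

    def print_computed_metrics(self, metrics):
        print("Median R: {}".format(metrics["MR"]))
```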
Launching a test is as simple as training: specify the path of a testing config:
```python locallaunch.py projects/mfmmlm/test_vtt.yaml```
Testing will be launched locally by default since prediction is computationally less expensive.
### Third-party Libraries
The following finetuning tasks require third-party libraries:
Youcook captioning: `https://github.com/Maluuba/nlg-eval`
CrossTask: `https://github.com/DmZhukov/CrossTask`'s `dp` under `third-party/CrossTask` (`python setup.py build_ext --inplace`)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import os
from omegaconf import OmegaConf
from mmpt.utils import recursive_config, overwrite_dir
from mmpt_cli.localjob import LocalJob
class JobLauncher(object):
JOB_CONFIG = {
"local": LocalJob,
}
def __init__(self, yaml_file):
self.yaml_file = yaml_file
job_key = "local"
if yaml_file.endswith(".yaml"):
config = recursive_config(yaml_file)
if config.task_type is not None:
job_key = config.task_type.split("_")[0]
else:
raise ValueError("unknown extension of job file:", yaml_file)
self.job_key = job_key
def __call__(self, job_type=None, dryrun=False):
if job_type is not None:
self.job_key = job_type.split("_")[0]
print("[JobLauncher] job_key", self.job_key)
job = JobLauncher.JOB_CONFIG[self.job_key](
self.yaml_file, job_type=job_type, dryrun=dryrun)
return job.submit()
class Pipeline(object):
"""a job that loads yaml config."""
def __init__(self, fn):
"""
load a yaml config of a job and save generated configs as yaml for each task.
return: a list of files to run as specified by `run_task`.
"""
if fn.endswith(".py"):
# a python command.
self.backend = "python"
self.run_yamls = [fn]
return
job_config = recursive_config(fn)
if job_config.base_dir is None: # single file job config.
self.run_yamls = [fn]
return
self.project_dir = os.path.join("projects", job_config.project_dir)
self.run_dir = os.path.join("runs", job_config.project_dir)
if job_config.run_task is not None:
run_yamls = []
for stage in job_config.run_task:
# each stage can have multiple tasks running in parallel.
if OmegaConf.is_list(stage):
stage_yamls = []
for task_file in stage:
stage_yamls.append(
os.path.join(self.project_dir, task_file))
run_yamls.append(stage_yamls)
else:
run_yamls.append(os.path.join(self.project_dir, stage))
self.run_yamls = run_yamls
configs_to_save = self._overwrite_task(job_config)
self._save_configs(configs_to_save)
def __getitem__(self, idx):
yaml_files = self.run_yamls[idx]
if isinstance(yaml_files, list):
return [JobLauncher(yaml_file) for yaml_file in yaml_files]
return [JobLauncher(yaml_files)]
def __len__(self):
return len(self.run_yamls)
def _save_configs(self, configs_to_save: dict):
# save
os.makedirs(self.project_dir, exist_ok=True)
for config_file in configs_to_save:
config = configs_to_save[config_file]
print("saving", config_file)
OmegaConf.save(config=config, f=config_file)
def _overwrite_task(self, job_config):
configs_to_save = {}
self.base_project_dir = os.path.join("projects", job_config.base_dir)
self.base_run_dir = os.path.join("runs", job_config.base_dir)
for config_sets in job_config.task_group:
overwrite_config = job_config.task_group[config_sets]
if (
overwrite_config.task_list is None
or len(overwrite_config.task_list) == 0
):
print(
"[warning]",
job_config.task_group,
"has no task_list specified.")
# we don't want this added to a final config.
task_list = overwrite_config.pop("task_list", None)
for config_file in task_list:
config_file_path = os.path.join(
self.base_project_dir, config_file)
config = recursive_config(config_file_path)
# overwrite it.
if overwrite_config:
config = OmegaConf.merge(config, overwrite_config)
overwrite_dir(config, self.run_dir, basedir=self.base_run_dir)
save_file_path = os.path.join(self.project_dir, config_file)
configs_to_save[save_file_path] = config
return configs_to_save
def main(args):
job_type = args.jobtype if args.jobtype else None
# parse multiple pipelines.
pipelines = [Pipeline(fn) for fn in args.yamls.split(",")]
for pipe_id, pipeline in enumerate(pipelines):
if not hasattr(pipeline, "project_dir"):
for job in pipeline[0]:
job(job_type=job_type, dryrun=args.dryrun)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("yamls", type=str)
parser.add_argument(
"--dryrun",
action="store_true",
help="run config and prepare to submit without launch the job.",
)
parser.add_argument(
"--jobtype", type=str, default="",
help="force to run jobs as specified.")
args = parser.parse_args()
main(args)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
try:
# fairseq user dir
from .datasets import FairseqMMDataset
from .losses import FairseqCriterion
from .models import FairseqMMModel
from .tasks import FairseqMMTask
except ImportError:
pass
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from .mmdataset import *
try:
from .fairseqmmdataset import *
except ImportError:
pass
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
TODO (huxu): fairseq wrapper class for all dataset you defined: mostly MMDataset.
"""
from collections import OrderedDict
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate
from fairseq.data import FairseqDataset, data_utils
class FairseqMMDataset(FairseqDataset):
"""
A wrapper class for MMDataset for fairseq.
"""
def __init__(self, mmdataset):
if not isinstance(mmdataset, Dataset):
raise TypeError("mmdataset must be of type `torch.utils.data.Dataset`.")
self.mmdataset = mmdataset
def set_epoch(self, epoch, **unused):
super().set_epoch(epoch)
self.epoch = epoch
def __getitem__(self, idx):
with data_utils.numpy_seed(43211, self.epoch, idx):
return self.mmdataset[idx]
def __len__(self):
return len(self.mmdataset)
def collater(self, samples):
if hasattr(self.mmdataset, "collator"):
return self.mmdataset.collator(samples)
if len(samples) == 0:
return {}
if isinstance(samples[0], dict):
batch = OrderedDict()
for key in samples[0]:
if samples[0][key] is not None:
batch[key] = default_collate([sample[key] for sample in samples])
return batch
else:
return default_collate(samples)
def size(self, index):
"""dummy implementation: we don't use --max-tokens"""
return 1
def num_tokens(self, index):
"""dummy implementation: we don't use --max-tokens"""
return 1
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import torch
from collections import OrderedDict
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate
from ..utils import set_seed
class MMDataset(Dataset):
"""
A generic multi-modal dataset.
Args:
`meta_processor`: a meta processor,
handling loading of meta data; returns video_id and text_id.
`video_processor`: a video processor,
handling e.g., decoding and loading of .np files.
`text_processor`: a text processor,
handling e.g., tokenization.
`aligner`: combines the video and text features
into one training example.
"""
def __init__(
self,
meta_processor,
video_processor,
text_processor,
align_processor,
):
self.split = meta_processor.split
self.meta_processor = meta_processor
self.video_processor = video_processor
self.text_processor = text_processor
self.align_processor = align_processor
def __len__(self):
return len(self.meta_processor)
def __getitem__(self, idx):
if self.split == "test":
set_seed(idx)
video_id, text_id = self.meta_processor[idx]
video_feature = self.video_processor(video_id)
text_feature = self.text_processor(text_id)
output = self.align_processor(video_id, video_feature, text_feature)
# TODO (huxu): the following is for debug purpose.
output.update({"idx": idx})
return output
def collater(self, samples):
"""This collator is deprecated.
set self.collator = MMDataset.collater.
see collator in FairseqMMDataset.
"""
if len(samples) == 0:
return {}
if isinstance(samples[0], dict):
batch = OrderedDict()
for key in samples[0]:
if samples[0][key] is not None:
batch[key] = default_collate(
[sample[key] for sample in samples])
# if torch.is_tensor(batch[key]):
# print(key, batch[key].size())
# else:
# print(key, len(batch[key]))
return batch
else:
return default_collate(samples)
def print_example(self, output):
print("[one example]", output["video_id"])
if (
hasattr(self.align_processor, "subsampling")
and self.align_processor.subsampling is not None
and self.align_processor.subsampling > 1
):
for key in output:
if torch.is_tensor(output[key]):
output[key] = output[key][0]
# search tokenizer to translate ids back.
tokenizer = None
if hasattr(self.text_processor, "tokenizer"):
tokenizer = self.text_processor.tokenizer
elif hasattr(self.align_processor, "tokenizer"):
tokenizer = self.align_processor.tokenizer
if tokenizer is not None:
caps = output["caps"].tolist()
if isinstance(caps[0], list):
caps = caps[0]
print("caps", tokenizer.decode(caps))
print("caps", tokenizer.convert_ids_to_tokens(caps))
for key, value in output.items():
if torch.is_tensor(value):
if len(value.size()) >= 3: # attention_mask.
print(key, value.size())
print(key, "first", value[0, :, :])
print(key, "last", value[-1, :, :])
else:
print(key, value)
print("[end of one example]")
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from .metric import *
from .evaluator import *
# experimental.
try:
from .expmetric import *
except ImportError:
pass
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import glob
import numpy as np
from . import metric as metric_path
from . import predictor as predictor_path
class Evaluator(object):
"""
perform evaluation on a single (downstream) task.
make this both offline and online.
TODO(huxu) saving evaluation results.
"""
def __init__(self, config, eval_dataloader=None):
if config.metric is None:
raise ValueError("config.metric is", config.metric)
metric_cls = getattr(metric_path, config.metric)
self.metric = metric_cls(config)
if config.predictor is None:
raise ValueError("config.predictor is", config.predictor)
predictor_cls = getattr(predictor_path, config.predictor)
self.predictor = predictor_cls(config)
self.eval_dataloader = eval_dataloader
def __call__(self):
try:
print(self.predictor.pred_dir)
for pred_file in glob.glob(
self.predictor.pred_dir + "/*_merged.npy"):
outputs = np.load(pred_file)
results = self.metric.compute_metrics(outputs)
self.metric.print_computed_metrics(results)
outputs = np.load(os.path.join(
self.predictor.pred_dir, "merged.npy"))
results = self.metric.compute_metrics(outputs)
return {"results": results, "metric": self.metric}
except FileNotFoundError:
print("\n[missing]", self.predictor.pred_dir)
return {}
def evaluate(self, model, eval_dataloader=None, output_file="merged"):
if eval_dataloader is None:
eval_dataloader = self.eval_dataloader
outputs = self.predictor.predict_loop(
model, eval_dataloader, output_file)
results = self.metric.compute_metrics(**outputs)
return results
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import numpy as np
import json
class Metric(object):
def __init__(self, config, metric_names):
self.metric_names = metric_names
def best_metric(self, metric):
return metric[self.metric_names[0]]
def save_metrics(self, fn, metrics):
with open(fn, "w") as fw:
json.dump(metrics, fw)
def print_computed_metrics(self, metrics):
raise NotImplementedError
class RetrievalMetric(Metric):
"""
this is modified from `howto100m/metrics.py`.
History of changes:
refactor as a class.
add metric_key in __init__
"""
def __init__(self, config, metric_names=["R1", "R5", "R10", "MR"]):
super().__init__(config, metric_names)
self.error = False # TODO(huxu): add to config to print error.
def compute_metrics(self, outputs, texts, **kwargs):
x = outputs
sx = np.sort(-x, axis=1)
d = np.diag(-x)
d = d[:, np.newaxis]
ind = sx - d
ind = np.where(ind == 0)
ind = ind[1]
metrics = {}
metrics["R1"] = float(np.sum(ind == 0)) / len(ind)
metrics["R5"] = float(np.sum(ind < 5)) / len(ind)
metrics["R10"] = float(np.sum(ind < 10)) / len(ind)
metrics["MR"] = np.median(ind) + 1
max_idx = np.argmax(outputs, axis=1)
if self.error:
# print top-20 errors.
error = []
for ex_idx in range(20):
error.append((texts[ex_idx], texts[max_idx[ex_idx]]))
metrics["error"] = error
return metrics
def print_computed_metrics(self, metrics):
r1 = metrics["R1"]
r5 = metrics["R5"]
r10 = metrics["R10"]
mr = metrics["MR"]
print(
"R@1: {:.4f} - R@5: {:.4f} - R@10: {:.4f} - Median R: {}".format(
r1, r5, r10, mr
)
)
if "error" in metrics:
print(metrics["error"])
class DiDeMoMetric(Metric):
"""
History of changes:
python 2.x to python 3.x.
merge utils.py into eval to save one file.
reference: https://github.com/LisaAnne/LocalizingMoments/blob/master/utils/eval.py
Code to evaluate your results on the DiDeMo dataset.
"""
def __init__(self, config, metric_names=["rank1", "rank5", "miou"]):
super().__init__(config, metric_names)
def compute_metrics(self, outputs, targets, **kwargs):
assert len(outputs) == len(targets)
rank1, rank5, miou = self._eval_predictions(outputs, targets)
metrics = {
"rank1": rank1,
"rank5": rank5,
"miou": miou
}
return metrics
def print_computed_metrics(self, metrics):
rank1 = metrics["rank1"]
rank5 = metrics["rank5"]
miou = metrics["miou"]
# print("Average rank@1: %f" % rank1)
# print("Average rank@5: %f" % rank5)
# print("Average iou: %f" % miou)
print(
"Average rank@1: {:.4f} Average rank@5: {:.4f} Average iou: {:.4f}".format(
rank1, rank5, miou
)
)
def _iou(self, pred, gt):
intersection = max(0, min(pred[1], gt[1]) + 1 - max(pred[0], gt[0]))
union = max(pred[1], gt[1]) + 1 - min(pred[0], gt[0])
return float(intersection)/union
def _rank(self, pred, gt):
return pred.index(tuple(gt)) + 1
def _eval_predictions(self, segments, data):
'''
Inputs:
segments: For each item in the ground truth data, rank possible video segments given the description and video.
In DiDeMo, there are 21 possible moments extracted for each video, so the list of video segments will be of length 21.
The first video segment should be the video segment that best corresponds to the text query.
There are 4180 sentences in the validation data, so when evaluating a model on the val dataset,
segments should be a list of length 4180, and each item in segments should be a list of length 21.
data: ground truth data
'''
average_ranks = []
average_iou = []
for s, d in zip(segments, data):
pred = s[0]
ious = [self._iou(pred, t) for t in d['times']]
average_iou.append(np.mean(np.sort(ious)[-3:]))
ranks = [self._rank(s, t) for t in d['times'] if tuple(t) in s] # if t in s] is added for s, e not in prediction.
average_ranks.append(np.mean(np.sort(ranks)[:3]))
rank1 = np.sum(np.array(average_ranks) <= 1)/float(len(average_ranks))
rank5 = np.sum(np.array(average_ranks) <= 5)/float(len(average_ranks))
miou = np.mean(average_iou)
# print("Average rank@1: %f" % rank1)
# print("Average rank@5: %f" % rank5)
# print("Average iou: %f" % miou)
return rank1, rank5, miou
class NLGMetric(Metric):
def __init__(
self,
config,
metric_names=[
"Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4",
"METEOR", "ROUGE_L", "CIDEr"
]
):
super().__init__(config, metric_names)
# please install NLGEval from `https://github.com/Maluuba/nlg-eval`
from nlgeval import NLGEval
self.nlg = NLGEval()
def compute_metrics(self, outputs, targets, **kwargs):
return self.nlg.compute_metrics(
hyp_list=outputs, ref_list=targets)
def print_computed_metrics(self, metrics):
Bleu_1 = metrics["Bleu_1"]
Bleu_2 = metrics["Bleu_2"]
Bleu_3 = metrics["Bleu_3"]
Bleu_4 = metrics["Bleu_4"]
METEOR = metrics["METEOR"]
ROUGE_L = metrics["ROUGE_L"]
CIDEr = metrics["CIDEr"]
print(
"Bleu_1: {:.4f} - Bleu_2: {:.4f} - Bleu_3: {:.4f} - Bleu_4: {:.4f} - METEOR: {:.4f} - ROUGE_L: {:.4f} - CIDEr: {:.4f}".format(
Bleu_1, Bleu_2, Bleu_3, Bleu_4, METEOR, ROUGE_L, CIDEr
)
)
class QAMetric(Metric):
def __init__(
self,
config,
metric_names=["acc"]
):
super().__init__(config, metric_names)
def compute_metrics(self, outputs, targets, **kwargs):
from sklearn.metrics import accuracy_score
return {"acc": accuracy_score(targets, outputs)}
def print_computed_metrics(self, metrics):
print("acc: {:.4f}".format(metrics["acc"]))
class COINActionSegmentationMetric(Metric):
"""
COIN dataset listed 3 repos for Action Segmentation.
Action Sets, NeuralNetwork-Viterbi, TCFPN-ISBA.
The first and second are the same.
https://github.com/alexanderrichard/action-sets/blob/master/eval.py
Future reference for the third:
`https://github.com/Zephyr-D/TCFPN-ISBA/blob/master/utils/metrics.py`
"""
def __init__(self, config, metric_name=["frame_acc"]):
super().__init__(config, metric_name)
def compute_metrics(self, outputs, targets):
n_frames = 0
n_errors = 0
n_errors = sum(outputs != targets)
n_frames = len(targets)
return {"frame_acc": 1.0 - float(n_errors) / n_frames}
def print_computed_metrics(self, metrics):
fa = metrics["frame_acc"]
print("frame accuracy:", fa)
class CrossTaskMetric(Metric):
def __init__(self, config, metric_names=["recall"]):
super().__init__(config, metric_names)
def compute_metrics(self, outputs, targets, **kwargs):
"""refactored from line 166:
https://github.com/DmZhukov/CrossTask/blob/master/train.py"""
recalls = self._get_recalls(Y_true=targets, Y_pred=outputs)
results = {}
for task, rec in recalls.items():
results[str(task)] = rec
avg_recall = np.mean(list(recalls.values()))
results["recall"] = avg_recall
return results
def print_computed_metrics(self, metrics):
print('Recall: {0:0.3f}'.format(metrics["recall"]))
for task in metrics:
if task != "recall":
print('Task {0}. Recall = {1:0.3f}'.format(
task, metrics[task]))
def _get_recalls(self, Y_true, Y_pred):
"""refactored from
https://github.com/DmZhukov/CrossTask/blob/master/train.py"""
step_match = {task: 0 for task in Y_true.keys()}
step_total = {task: 0 for task in Y_true.keys()}
for task, ys_true in Y_true.items():
ys_pred = Y_pred[task]
for vid in set(ys_pred.keys()).intersection(set(ys_true.keys())):
y_true = ys_true[vid]
y_pred = ys_pred[vid]
step_total[task] += (y_true.sum(axis=0) > 0).sum()
step_match[task] += (y_true*y_pred).sum()
recalls = {
task: step_match[task] / n for task, n in step_total.items()}
return recalls
class ActionRecognitionMetric(Metric):
def __init__(
self,
config,
metric_names=["acc", "acc_splits", "r1_splits", "r5_splits", "r10_splits"]
):
super().__init__(config, metric_names)
def compute_metrics(self, outputs, targets, splits, **kwargs):
all_video_embd = outputs
labels = targets
split1, split2, split3 = splits
accs = []
r1s = []
r5s = []
r10s = []
for split in range(3):
if split == 0:
s = split1
elif split == 1:
s = split2
else:
s = split3
X_pred = all_video_embd[np.where(s == 2)[0]]
label_test = labels[np.where(s == 2)[0]]
logits = X_pred
X_pred = np.argmax(X_pred, axis=1)
acc = np.sum(X_pred == label_test) / float(len(X_pred))
accs.append(acc)
# compute recall.
sorted_pred = (-logits).argsort(axis=-1)
label_test_sp = label_test.reshape(-1, 1)
r1 = np.mean((sorted_pred[:, :1] == label_test_sp).sum(axis=1), axis=0)
r5 = np.mean((sorted_pred[:, :5] == label_test_sp).sum(axis=1), axis=0)
r10 = np.mean((sorted_pred[:, :10] == label_test_sp).sum(axis=1), axis=0)
r1s.append(r1)
r5s.append(r5)
r10s.append(r10)
return {"acc": accs[0], "acc_splits": accs, "r1_splits": r1s, "r5_splits": r5s, "r10_splits": r10s}
def print_computed_metrics(self, metrics):
for split, acc in enumerate(metrics["acc_splits"]):
print("Top 1 accuracy on split {}: {}; r1 {}; r5 {}; r10 {}".format(
split + 1, acc,
metrics["r1_splits"][split],
metrics["r5_splits"][split],
metrics["r10_splits"][split],
)
)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from .loss import *
from .nce import *
try:
from .fairseqmmloss import *
except ImportError:
pass
try:
from .expnce import *
except ImportError:
pass