Overview
========
Fairseq can be extended through user-supplied `plug-ins
<https://en.wikipedia.org/wiki/Plug-in_(computing)>`_. We support five kinds of
plug-ins:
- :ref:`Models` define the neural network architecture and encapsulate all of the
learnable parameters.
- :ref:`Criterions` compute the loss function given the model outputs and targets.
- :ref:`Tasks` store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
- :ref:`Optimizers` update the Model parameters based on the gradients.
- :ref:`Learning Rate Schedulers` update the learning rate over the course of
training.
**Training Flow**
Given a ``model``, ``criterion``, ``task``, ``optimizer`` and ``lr_scheduler``,
fairseq implements the following high-level training flow::
for epoch in range(num_epochs):
itr = task.get_batch_iterator(task.dataset('train'))
for num_updates, batch in enumerate(itr):
task.train_step(batch, model, criterion, optimizer)
average_and_clip_gradients()
optimizer.step()
lr_scheduler.step_update(num_updates)
lr_scheduler.step(epoch)
where the default implementation for ``task.train_step`` is roughly::
def train_step(self, batch, model, criterion, optimizer, **unused):
loss = criterion(model, batch)
optimizer.backward(loss)
return loss
**Registering new plug-ins**
New plug-ins are *registered* through a set of ``@register`` function
decorators, for example::
@register_model('my_lstm')
class MyLSTM(FairseqEncoderDecoderModel):
(...)
Once registered, new plug-ins can be used with the existing :ref:`Command-line
Tools`. See the Tutorial sections for more detailed walkthroughs of how to add
new plug-ins.
**Loading plug-ins from another directory**
New plug-ins can be defined in a custom module stored on the user's system. To
import the module and make the plug-in available to *fairseq*, the command line
supports the ``--user-dir`` flag, which specifies a custom location for
additional modules to load into *fairseq*.
For example, assuming this directory tree::
/home/user/my-module/
└── __init__.py
with ``__init__.py``::
from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big
@register_model_architecture('transformer', 'my_transformer')
def transformer_mmt_big(args):
transformer_vaswani_wmt_en_de_big(args)
it is possible to invoke the :ref:`fairseq-train` script with the new architecture::
fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
sphinx<2.0
sphinx-argparse
.. role:: hidden
:class: hidden-section
.. module:: fairseq.tasks
.. _Tasks:
Tasks
=====
Tasks store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
Tasks can be selected via the ``--task`` command-line argument. Once selected, a
task may expose additional command-line arguments for further configuration.
Example usage::
# setup the task (e.g., load dictionaries)
task = fairseq.tasks.setup_task(args)
# build model and criterion
model = task.build_model(args)
criterion = task.build_criterion(args)
# load datasets
task.load_dataset('train')
task.load_dataset('valid')
# iterate over mini-batches of data
batch_itr = task.get_batch_iterator(
task.dataset('train'), max_tokens=4096,
)
for batch in batch_itr:
# compute the loss
loss, sample_size, logging_output = task.get_loss(
model, criterion, batch,
)
loss.backward()
Translation
-----------
.. autoclass:: fairseq.tasks.translation.TranslationTask
.. _language modeling:
Language Modeling
-----------------
.. autoclass:: fairseq.tasks.language_modeling.LanguageModelingTask
Adding new tasks
----------------
.. autofunction:: fairseq.tasks.register_task
.. autoclass:: fairseq.tasks.FairseqTask
:members:
:undoc-members:
Tutorial: Classifying Names with a Character-Level RNN
======================================================
In this tutorial we will extend fairseq to support *classification* tasks. In
particular we will re-implement the PyTorch tutorial for `Classifying Names with
a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`_
in fairseq. It is recommended to quickly skim that tutorial before beginning
this one.
This tutorial covers:
1. **Preprocessing the data** to create dictionaries.
2. **Registering a new Model** that encodes an input sentence with a simple RNN
and predicts the output label.
3. **Registering a new Task** that loads our dictionaries and dataset.
4. **Training the Model** using the existing command-line tools.
5. **Writing an evaluation script** that imports fairseq and allows us to
interactively evaluate our model on new inputs.
1. Preprocessing the data
-------------------------
The original tutorial provides raw data, but we'll work with a modified version
of the data that is already tokenized into characters and split into separate
train, valid and test sets.
Download and extract the data from here:
`tutorial_names.tar.gz <https://dl.fbaipublicfiles.com/fairseq/data/tutorial_names.tar.gz>`_
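Each split provides an ``.input`` file, in which each name is tokenized into
space-separated characters, and a ``.label`` file with one language label per
line. The exact contents shown below are only illustrative:

.. code-block:: console

    > head -n 2 names/train.input
    S a t o s h i
    A b b a s
    > head -n 2 names/train.label
    Japanese
    Arabic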
Once extracted, let's preprocess the data using the :ref:`fairseq-preprocess`
command-line tool to create the dictionaries. While this tool is primarily
intended for sequence-to-sequence problems, we're able to reuse it here by
treating the label as a "target" sequence of length 1. We'll also output the
preprocessed files in "raw" format using the ``--dataset-impl`` option to
enhance readability:
.. code-block:: console
> fairseq-preprocess \
--trainpref names/train --validpref names/valid --testpref names/test \
--source-lang input --target-lang label \
--destdir names-bin --dataset-impl raw
After running the above command you should see a new directory,
:file:`names-bin/`, containing the dictionaries for *inputs* and *labels*.
2. Registering a new Model
--------------------------
Next we'll register a new model in fairseq that will encode an input sentence
with a simple RNN and predict the output label. Compared to the original PyTorch
tutorial, our version will also work with batches of data and GPU Tensors.
First let's copy the simple RNN module implemented in the `PyTorch tutorial
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network>`_.
Create a new file named :file:`fairseq/models/rnn_classifier.py` with the
following contents::
import torch
import torch.nn as nn
class RNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super(RNN, self).__init__()
self.hidden_size = hidden_size
self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
self.i2o = nn.Linear(input_size + hidden_size, output_size)
self.softmax = nn.LogSoftmax(dim=1)
def forward(self, input, hidden):
combined = torch.cat((input, hidden), 1)
hidden = self.i2h(combined)
output = self.i2o(combined)
output = self.softmax(output)
return output, hidden
def initHidden(self):
return torch.zeros(1, self.hidden_size)
We must also *register* this model with fairseq using the
:func:`~fairseq.models.register_model` function decorator. Once the model is
registered we'll be able to use it with the existing :ref:`Command-line Tools`.
All registered models must implement the :class:`~fairseq.models.BaseFairseqModel`
interface, so we'll create a small wrapper class in the same file and register
it in fairseq with the name ``'rnn_classifier'``::
from fairseq.models import BaseFairseqModel, register_model
# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.
@register_model('rnn_classifier')
class FairseqRNNClassifier(BaseFairseqModel):
@staticmethod
def add_args(parser):
# Models can override this method to add new command-line arguments.
# Here we'll add a new command-line argument to configure the
# dimensionality of the hidden state.
parser.add_argument(
'--hidden-dim', type=int, metavar='N',
help='dimensionality of the hidden state',
)
@classmethod
def build_model(cls, args, task):
# Fairseq initializes models by calling the ``build_model()``
# function. This provides more flexibility, since the returned model
# instance can be of a different type than the one that was called.
# In this case we'll just return a FairseqRNNClassifier instance.
# Initialize our RNN module
rnn = RNN(
# We'll define the Task in the next section, but for now just
# notice that the task holds the dictionaries for the "source"
# (i.e., the input sentence) and "target" (i.e., the label).
input_size=len(task.source_dictionary),
hidden_size=args.hidden_dim,
output_size=len(task.target_dictionary),
)
# Return the wrapped version of the module
return FairseqRNNClassifier(
rnn=rnn,
input_vocab=task.source_dictionary,
)
def __init__(self, rnn, input_vocab):
super(FairseqRNNClassifier, self).__init__()
self.rnn = rnn
self.input_vocab = input_vocab
# The RNN module in the tutorial expects one-hot inputs, so we can
# precompute the identity matrix to help convert from indices to
# one-hot vectors. We register it as a buffer so that it is moved to
# the GPU when ``cuda()`` is called.
self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))
def forward(self, src_tokens, src_lengths):
# The inputs to the ``forward()`` function are determined by the
# Task, and in particular the ``'net_input'`` key in each
# mini-batch. We'll define the Task in the next section, but for
# now just know that *src_tokens* has shape `(batch, src_len)` and
# *src_lengths* has shape `(batch)`.
bsz, max_src_len = src_tokens.size()
# Initialize the RNN hidden state. Compared to the original PyTorch
# tutorial we'll also handle batched inputs and work on the GPU.
hidden = self.rnn.initHidden()
hidden = hidden.repeat(bsz, 1) # expand for batched inputs
hidden = hidden.to(src_tokens.device) # move to GPU
for i in range(max_src_len):
# WARNING: The inputs have padding, so we should mask those
# elements here so that padding doesn't affect the results.
# This is left as an exercise for the reader. The padding symbol
# is given by ``self.input_vocab.pad()`` and the unpadded length
# of each input is given by *src_lengths*.
# One-hot encode a batch of input characters.
input = self.one_hot_inputs[src_tokens[:, i].long()]
# Feed the input to our RNN.
output, hidden = self.rnn(input, hidden)
# Return the final output state for making a prediction
return output
Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch pytorch_tutorial_rnn``::
from fairseq.models import register_model_architecture
# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'rnn_classifier'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.
@register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
def pytorch_tutorial_rnn(args):
# We use ``getattr()`` to prioritize arguments that are explicitly given
# on the command-line, so that the defaults defined below are only used
# when no other value has been specified.
args.hidden_dim = getattr(args, 'hidden_dim', 128)
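To double-check that both registrations took effect, you can inspect fairseq's
model registries from a Python shell (an optional sanity check; these registry
names are fairseq internals and may change between versions)::

    >>> import fairseq.models
    >>> 'rnn_classifier' in fairseq.models.MODEL_REGISTRY
    True
    >>> 'pytorch_tutorial_rnn' in fairseq.models.ARCH_MODEL_REGISTRY
    True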
3. Registering a new Task
-------------------------
Now we'll register a new :class:`~fairseq.tasks.FairseqTask` that will load our
dictionaries and dataset. Tasks can also control how the data is batched into
mini-batches, but in this tutorial we'll reuse the batching provided by
:class:`fairseq.data.LanguagePairDataset`.
Create a new file named :file:`fairseq/tasks/simple_classification.py` with the
following contents::
import os
import torch
from fairseq.data import Dictionary, LanguagePairDataset
from fairseq.tasks import LegacyFairseqTask, register_task
@register_task('simple_classification')
class SimpleClassificationTask(LegacyFairseqTask):
@staticmethod
def add_args(parser):
# Add some command-line arguments for specifying where the data is
# located and the maximum supported input length.
parser.add_argument('data', metavar='FILE',
help='file prefix for data')
parser.add_argument('--max-positions', default=1024, type=int,
help='max input length')
@classmethod
def setup_task(cls, args, **kwargs):
# Here we can perform any setup required for the task. This may include
# loading Dictionaries, initializing shared Embedding layers, etc.
# In this case we'll just load the Dictionaries.
input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
print('| [input] dictionary: {} types'.format(len(input_vocab)))
print('| [label] dictionary: {} types'.format(len(label_vocab)))
return SimpleClassificationTask(args, input_vocab, label_vocab)
def __init__(self, args, input_vocab, label_vocab):
super().__init__(args)
self.input_vocab = input_vocab
self.label_vocab = label_vocab
def load_dataset(self, split, **kwargs):
"""Load a given dataset split (e.g., train, valid, test)."""
prefix = os.path.join(self.args.data, '{}.input-label'.format(split))
# Read input sentences.
sentences, lengths = [], []
with open(prefix + '.input', encoding='utf-8') as file:
for line in file:
sentence = line.strip()
# Tokenize the sentence, splitting on spaces
tokens = self.input_vocab.encode_line(
sentence, add_if_not_exist=False,
)
sentences.append(tokens)
lengths.append(tokens.numel())
# Read labels.
labels = []
with open(prefix + '.label', encoding='utf-8') as file:
for line in file:
label = line.strip()
labels.append(
# Convert label to a numeric ID.
torch.LongTensor([self.label_vocab.add_symbol(label)])
)
assert len(sentences) == len(labels)
print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))
# We reuse LanguagePairDataset since classification can be modeled as a
# sequence-to-sequence task where the target sequence has length 1.
self.datasets[split] = LanguagePairDataset(
src=sentences,
src_sizes=lengths,
src_dict=self.input_vocab,
tgt=labels,
tgt_sizes=torch.ones(len(labels)), # targets have length 1
tgt_dict=self.label_vocab,
left_pad_source=False,
# Since our target is a single class label, there's no need for
# teacher forcing. If we set this to ``True`` then our Model's
# ``forward()`` method would receive an additional argument called
# *prev_output_tokens* that would contain a shifted version of the
# target sequence.
input_feeding=False,
)
def max_positions(self):
"""Return the max input length allowed by the task."""
# The source should be less than *args.max_positions* and the "target"
# has max length 1.
return (self.args.max_positions, 1)
@property
def source_dictionary(self):
"""Return the source :class:`~fairseq.data.Dictionary`."""
return self.input_vocab
@property
def target_dictionary(self):
"""Return the target :class:`~fairseq.data.Dictionary`."""
return self.label_vocab
# We could override this method if we wanted more control over how batches
# are constructed, but it's not necessary for this tutorial since we can
# reuse the batching provided by LanguagePairDataset.
#
# def get_batch_iterator(
# self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
# ignore_invalid_inputs=False, required_batch_size_multiple=1,
# seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
# data_buffer_size=0, disable_iterator_cache=False,
# ):
# (...)
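As an optional sanity check, the new task can also be set up directly from
Python. The following is a minimal sketch that assumes the preprocessing step
from Section 1 has already produced :file:`names-bin/`::

    import argparse
    from fairseq.tasks.simple_classification import SimpleClassificationTask

    args = argparse.Namespace(data='names-bin', max_positions=1024)
    task = SimpleClassificationTask.setup_task(args)
    task.load_dataset('valid')
    print(len(task.dataset('valid')))  # number of validation examples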
4. Training the Model
---------------------
Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Task (``--task
simple_classification``) and Model architecture (``--arch
pytorch_tutorial_rnn``):
.. note::
You can also configure the dimensionality of the hidden state by passing the
``--hidden-dim`` argument to :ref:`fairseq-train`.
.. code-block:: console
> fairseq-train names-bin \
--task simple_classification \
--arch pytorch_tutorial_rnn \
--optimizer adam --lr 0.001 --lr-shrink 0.5 \
--max-tokens 1000
(...)
| epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
| epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
| done training in 31.6 seconds
The model files should appear in the :file:`checkpoints/` directory.
5. Writing an evaluation script
-------------------------------
Finally we can write a short script to evaluate our model on new inputs. Create
a new file named :file:`eval_classifier.py` with the following contents::
from fairseq import checkpoint_utils, data, options, tasks
# Parse command-line arguments for generation
parser = options.get_generation_parser(default_task='simple_classification')
args = options.parse_args_and_arch(parser)
# Setup task
task = tasks.setup_task(args)
# Load model
print('| loading model from {}'.format(args.path))
models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
model = models[0]
while True:
sentence = input('\nInput: ')
# Tokenize into characters
chars = ' '.join(list(sentence.strip()))
tokens = task.source_dictionary.encode_line(
chars, add_if_not_exist=False,
)
# Build mini-batch to feed to the model
batch = data.language_pair_dataset.collate(
samples=[{'id': -1, 'source': tokens}], # bsz = 1
pad_idx=task.source_dictionary.pad(),
eos_idx=task.source_dictionary.eos(),
left_pad_source=False,
input_feeding=False,
)
# Feed batch to the model and get predictions
preds = model(**batch['net_input'])
# Print top 3 predictions and their log-probabilities
top_scores, top_labels = preds[0].topk(k=3)
for score, label_idx in zip(top_scores, top_labels):
label_name = task.target_dictionary.string([label_idx])
print('({:.2f})\t{}'.format(score, label_name))
Now we can evaluate our model interactively. Note that we have included the
original data path (:file:`names-bin/`) so that the dictionaries can be loaded:
.. code-block:: console
> python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
| [input] dictionary: 64 types
| [label] dictionary: 24 types
| loading model from checkpoints/checkpoint_best.pt
Input: Satoshi
(-0.61) Japanese
(-1.20) Arabic
(-2.86) Italian
Input: Sinbad
(-0.30) Arabic
(-1.76) English
(-4.08) Russian
Tutorial: Simple LSTM
=====================
In this tutorial we will extend fairseq by adding a new
:class:`~fairseq.models.FairseqEncoderDecoderModel` that encodes a source
sentence with an LSTM and then passes the final hidden state to a second LSTM
that decodes the target sentence (without attention).
This tutorial covers:
1. **Writing an Encoder and Decoder** to encode/decode the source/target
sentence, respectively.
2. **Registering a new Model** so that it can be used with the existing
:ref:`Command-line tools`.
3. **Training the Model** using the existing command-line tools.
4. **Making generation faster** by modifying the Decoder to use
:ref:`Incremental decoding`.
1. Building an Encoder and Decoder
----------------------------------
In this section we'll define a simple LSTM Encoder and Decoder. All Encoders
should implement the :class:`~fairseq.models.FairseqEncoder` interface and
Decoders should implement the :class:`~fairseq.models.FairseqDecoder` interface.
These interfaces themselves extend :class:`torch.nn.Module`, so FairseqEncoders
and FairseqDecoders can be written and used in the same ways as ordinary PyTorch
Modules.
Encoder
~~~~~~~
Our Encoder will embed the tokens in the source sentence, feed them to a
:class:`torch.nn.LSTM` and return the final hidden state. To create our encoder
save the following in a new file named :file:`fairseq/models/simple_lstm.py`::
import torch.nn as nn
from fairseq import utils
from fairseq.models import FairseqEncoder
class SimpleLSTMEncoder(FairseqEncoder):
def __init__(
self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1,
):
super().__init__(dictionary)
self.args = args
# Our encoder will embed the inputs before feeding them to the LSTM.
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
# We'll use a single-layer, unidirectional LSTM for simplicity.
self.lstm = nn.LSTM(
input_size=embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
batch_first=True,
)
def forward(self, src_tokens, src_lengths):
# The inputs to the ``forward()`` function are determined by the
# Task, and in particular the ``'net_input'`` key in each
# mini-batch. We discuss Tasks in the next tutorial, but for now just
# know that *src_tokens* has shape `(batch, src_len)` and *src_lengths*
# has shape `(batch)`.
# Note that the source is typically padded on the left. This can be
# configured by adding the `--left-pad-source "False"` command-line
# argument, but here we'll make the Encoder handle either kind of
# padding by converting everything to be right-padded.
if self.args.left_pad_source:
# Convert left-padding to right-padding.
src_tokens = utils.convert_padding_direction(
src_tokens,
padding_idx=self.dictionary.pad(),
left_to_right=True
)
# Embed the source.
x = self.embed_tokens(src_tokens)
# Apply dropout.
x = self.dropout(x)
# Pack the sequence into a PackedSequence object to feed to the LSTM.
x = nn.utils.rnn.pack_padded_sequence(x, src_lengths, batch_first=True)
# Get the output from the LSTM.
_outputs, (final_hidden, _final_cell) = self.lstm(x)
# Return the Encoder's output. This can be any object and will be
# passed directly to the Decoder.
return {
# this will have shape `(bsz, hidden_dim)`
'final_hidden': final_hidden.squeeze(0),
}
# Encoders are required to implement this method so that we can rearrange
# the order of the batch elements during inference (e.g., beam search).
def reorder_encoder_out(self, encoder_out, new_order):
"""
Reorder encoder output according to `new_order`.
Args:
encoder_out: output from the ``forward()`` method
new_order (LongTensor): desired order
Returns:
`encoder_out` rearranged according to `new_order`
"""
final_hidden = encoder_out['final_hidden']
return {
'final_hidden': final_hidden.index_select(0, new_order),
}
Decoder
~~~~~~~
Our Decoder will predict the next word, conditioned on the Encoder's final
hidden state and an embedded representation of the previous target word -- which
is sometimes called *teacher forcing*. More specifically, we'll use a
:class:`torch.nn.LSTM` to produce a sequence of hidden states that we'll project
to the size of the output vocabulary to predict each target word.
::
import torch
from fairseq.models import FairseqDecoder
class SimpleLSTMDecoder(FairseqDecoder):
def __init__(
self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
dropout=0.1,
):
super().__init__(dictionary)
# Our decoder will embed the inputs before feeding them to the LSTM.
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
# We'll use a single-layer, unidirectional LSTM for simplicity.
self.lstm = nn.LSTM(
# For the first layer we'll concatenate the Encoder's final hidden
# state with the embedded target tokens.
input_size=encoder_hidden_dim + embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
)
# Define the output projection.
self.output_projection = nn.Linear(hidden_dim, len(dictionary))
# During training Decoders are expected to take the entire target sequence
# (shifted right by one position) and produce logits over the vocabulary.
# The *prev_output_tokens* tensor begins with the end-of-sentence symbol,
# ``dictionary.eos()``, followed by the target sequence.
def forward(self, prev_output_tokens, encoder_out):
"""
Args:
prev_output_tokens (LongTensor): previous decoder outputs of shape
`(batch, tgt_len)`, for teacher forcing
encoder_out (Tensor, optional): output from the encoder, used for
encoder-side attention
Returns:
tuple:
- the last decoder layer's output of shape
`(batch, tgt_len, vocab)`
- the last decoder layer's attention weights of shape
`(batch, tgt_len, src_len)`
"""
bsz, tgt_len = prev_output_tokens.size()
# Extract the final hidden state from the Encoder.
final_encoder_hidden = encoder_out['final_hidden']
# Embed the target sequence, which has been shifted right by one
# position and now starts with the end-of-sentence symbol.
x = self.embed_tokens(prev_output_tokens)
# Apply dropout.
x = self.dropout(x)
# Concatenate the Encoder's final hidden state to *every* embedded
# target token.
x = torch.cat(
[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
dim=2,
)
# Using PackedSequence objects in the Decoder is harder than in the
# Encoder, since the targets are not sorted in descending length order,
# which is a requirement of ``pack_padded_sequence()``. Instead we'll
# feed nn.LSTM directly.
initial_state = (
final_encoder_hidden.unsqueeze(0), # hidden
torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
)
output, _ = self.lstm(
x.transpose(0, 1), # convert to shape `(tgt_len, bsz, dim)`
initial_state,
)
x = output.transpose(0, 1) # convert to shape `(bsz, tgt_len, hidden)`
# Project the outputs to the size of the vocabulary.
x = self.output_projection(x)
# Return the logits and ``None`` for the attention weights
return x, None
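Before registering the model, it can be handy to smoke-test the two modules
with a throwaway dictionary. The following is a minimal sketch (the tiny
dimensions and dummy symbols are only for illustration, and the import path
simply mirrors the file we just created)::

    import argparse
    import torch
    from fairseq.data import Dictionary
    from fairseq.models.simple_lstm import SimpleLSTMEncoder, SimpleLSTMDecoder

    # Build a tiny dictionary (fairseq adds its special symbols automatically).
    dictionary = Dictionary()
    for sym in 'abc':
        dictionary.add_symbol(sym)

    args = argparse.Namespace(left_pad_source=False)
    encoder = SimpleLSTMEncoder(args, dictionary, embed_dim=8, hidden_dim=8)
    decoder = SimpleLSTMDecoder(dictionary, encoder_hidden_dim=8, embed_dim=8, hidden_dim=8)

    src_tokens = torch.LongTensor([[dictionary.index('a'), dictionary.index('b')]])
    src_lengths = torch.LongTensor([2])
    encoder_out = encoder(src_tokens, src_lengths)

    prev_output_tokens = torch.LongTensor([[dictionary.eos(), dictionary.index('a')]])
    logits, _ = decoder(prev_output_tokens, encoder_out)
    print(logits.shape)  # (batch, tgt_len, vocab) == (1, 2, len(dictionary))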
2. Registering the Model
------------------------
Now that we've defined our Encoder and Decoder we must *register* our model with
fairseq using the :func:`~fairseq.models.register_model` function decorator.
Once the model is registered we'll be able to use it with the existing
:ref:`Command-line Tools`.
All registered models must implement the
:class:`~fairseq.models.BaseFairseqModel` interface. For sequence-to-sequence
models (i.e., any model with a single Encoder and Decoder), we can instead
implement the :class:`~fairseq.models.FairseqEncoderDecoderModel` interface.
Create a small wrapper class in the same file and register it in fairseq with
the name ``'simple_lstm'``::
from fairseq.models import FairseqEncoderDecoderModel, register_model
# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.
@register_model('simple_lstm')
class SimpleLSTMModel(FairseqEncoderDecoderModel):
@staticmethod
def add_args(parser):
# Models can override this method to add new command-line arguments.
# Here we'll add some new command-line arguments to configure dropout
# and the dimensionality of the embeddings and hidden states.
parser.add_argument(
'--encoder-embed-dim', type=int, metavar='N',
help='dimensionality of the encoder embeddings',
)
parser.add_argument(
'--encoder-hidden-dim', type=int, metavar='N',
help='dimensionality of the encoder hidden state',
)
parser.add_argument(
'--encoder-dropout', type=float, default=0.1,
help='encoder dropout probability',
)
parser.add_argument(
'--decoder-embed-dim', type=int, metavar='N',
help='dimensionality of the decoder embeddings',
)
parser.add_argument(
'--decoder-hidden-dim', type=int, metavar='N',
help='dimensionality of the decoder hidden state',
)
parser.add_argument(
'--decoder-dropout', type=float, default=0.1,
help='decoder dropout probability',
)
@classmethod
def build_model(cls, args, task):
# Fairseq initializes models by calling the ``build_model()``
# function. This provides more flexibility, since the returned model
# instance can be of a different type than the one that was called.
# In this case we'll just return a SimpleLSTMModel instance.
# Initialize our Encoder and Decoder.
encoder = SimpleLSTMEncoder(
args=args,
dictionary=task.source_dictionary,
embed_dim=args.encoder_embed_dim,
hidden_dim=args.encoder_hidden_dim,
dropout=args.encoder_dropout,
)
decoder = SimpleLSTMDecoder(
dictionary=task.target_dictionary,
encoder_hidden_dim=args.encoder_hidden_dim,
embed_dim=args.decoder_embed_dim,
hidden_dim=args.decoder_hidden_dim,
dropout=args.decoder_dropout,
)
model = SimpleLSTMModel(encoder, decoder)
# Print the model architecture.
print(model)
return model
# We could override the ``forward()`` if we wanted more control over how
# the encoder and decoder interact, but it's not necessary for this
# tutorial since we can inherit the default implementation provided by
# the FairseqEncoderDecoderModel base class, which looks like:
#
# def forward(self, src_tokens, src_lengths, prev_output_tokens):
# encoder_out = self.encoder(src_tokens, src_lengths)
# decoder_out = self.decoder(prev_output_tokens, encoder_out)
# return decoder_out
Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch tutorial_simple_lstm``::
from fairseq.models import register_model_architecture
# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'simple_lstm'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.
@register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
def tutorial_simple_lstm(args):
# We use ``getattr()`` to prioritize arguments that are explicitly given
# on the command-line, so that the defaults defined below are only used
# when no other value has been specified.
args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)
3. Training the Model
---------------------
Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Model architecture
(``--arch tutorial_simple_lstm``).
.. note::
Make sure you've already preprocessed the data from the IWSLT example in the
:file:`examples/translation/` directory.
.. code-block:: console
> fairseq-train data-bin/iwslt14.tokenized.de-en \
--arch tutorial_simple_lstm \
--encoder-dropout 0.2 --decoder-dropout 0.2 \
--optimizer adam --lr 0.005 --lr-shrink 0.5 \
--max-tokens 12000
(...)
| epoch 052 | loss 4.027 | ppl 16.30 | wps 420805 | ups 39.7 | wpb 9841 | bsz 400 | num_updates 20852 | lr 1.95313e-05 | gnorm 0.218 | clip 0% | oom 0 | wall 529 | train_wall 396
| epoch 052 | valid on 'valid' subset | valid_loss 4.74989 | valid_ppl 26.91 | num_updates 20852 | best 4.74954
The model files should appear in the :file:`checkpoints/` directory. While this
model architecture is not very good, we can use the :ref:`fairseq-generate` script to
generate translations and compute our BLEU score over the test set:
.. code-block:: console
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
4. Making generation faster
---------------------------
While autoregressive generation from sequence-to-sequence models is inherently
slow, our implementation above is especially slow because it recomputes the
entire sequence of Decoder hidden states for every output token (i.e., it is
``O(n^2)``). We can make this significantly faster by instead caching the
previous hidden states.
In fairseq this is called :ref:`Incremental decoding`. Incremental decoding is a
special mode at inference time where the Model only receives a single timestep
of input corresponding to the immediately previous output token (for teacher
forcing) and must produce the next output incrementally. Thus the model must
cache any long-term state that is needed about the sequence, e.g., hidden
states, convolutional states, etc.
To implement incremental decoding we will modify our model to implement the
:class:`~fairseq.models.FairseqIncrementalDecoder` interface. Compared to the
standard :class:`~fairseq.models.FairseqDecoder` interface, the incremental
decoder interface allows ``forward()`` methods to take an extra keyword argument
(*incremental_state*) that can be used to cache state across time-steps.
Let's replace our ``SimpleLSTMDecoder`` with an incremental one::
import torch
from fairseq.models import FairseqIncrementalDecoder
class SimpleLSTMDecoder(FairseqIncrementalDecoder):
def __init__(
self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
dropout=0.1,
):
# This remains the same as before.
super().__init__(dictionary)
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
self.lstm = nn.LSTM(
input_size=encoder_hidden_dim + embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
)
self.output_projection = nn.Linear(hidden_dim, len(dictionary))
# We now take an additional kwarg (*incremental_state*) for caching the
# previous hidden and cell states.
def forward(self, prev_output_tokens, encoder_out, incremental_state=None):
if incremental_state is not None:
# If the *incremental_state* argument is not ``None`` then we are
# in incremental inference mode. While *prev_output_tokens* will
# still contain the entire decoded prefix, we will only use the
# last step and assume that the rest of the state is cached.
prev_output_tokens = prev_output_tokens[:, -1:]
# This remains the same as before.
bsz, tgt_len = prev_output_tokens.size()
final_encoder_hidden = encoder_out['final_hidden']
x = self.embed_tokens(prev_output_tokens)
x = self.dropout(x)
x = torch.cat(
[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
dim=2,
)
# We will now check the cache and load the cached previous hidden and
# cell states, if they exist, otherwise we will initialize them to
# zeros (as before). We will use the ``utils.get_incremental_state()``
# and ``utils.set_incremental_state()`` helpers.
initial_state = utils.get_incremental_state(
self, incremental_state, 'prev_state',
)
if initial_state is None:
# first time initialization, same as the original version
initial_state = (
final_encoder_hidden.unsqueeze(0), # hidden
torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
)
# Run one step of our LSTM.
output, latest_state = self.lstm(x.transpose(0, 1), initial_state)
# Update the cache with the latest hidden and cell states.
utils.set_incremental_state(
self, incremental_state, 'prev_state', latest_state,
)
# This remains the same as before
x = output.transpose(0, 1)
x = self.output_projection(x)
return x, None
# The ``FairseqIncrementalDecoder`` interface also requires implementing a
# ``reorder_incremental_state()`` method, which is used during beam search
# to select and reorder the incremental state.
def reorder_incremental_state(self, incremental_state, new_order):
# Load the cached state.
prev_state = utils.get_incremental_state(
self, incremental_state, 'prev_state',
)
# Reorder batches according to *new_order*.
reordered_state = (
prev_state[0].index_select(1, new_order), # hidden
prev_state[1].index_select(1, new_order), # cell
)
# Update the cached state.
utils.set_incremental_state(
self, incremental_state, 'prev_state', reordered_state,
)
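To see how the cache is used at inference time, here is a minimal, hypothetical
greedy decoding loop. This is *not* fairseq's actual generator; it assumes an
already-built ``decoder``, an ``encoder_out`` for a single source sentence, a
target dictionary ``tgt_dict`` and a maximum length ``max_len``::

    import torch

    incremental_state = {}
    tokens = [tgt_dict.eos()]
    for _ in range(max_len):
        prev_output_tokens = torch.LongTensor([tokens])
        # Only the last token is actually consumed; the hidden and cell states
        # for the earlier steps come from *incremental_state*.
        logits, _ = decoder(
            prev_output_tokens, encoder_out,
            incremental_state=incremental_state,
        )
        next_token = logits[0, -1].argmax().item()
        tokens.append(next_token)
        if next_token == tgt_dict.eos():
            break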
Finally, we can rerun generation and observe the speedup:
.. code-block:: console
# Before
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
# After
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 5.5s (1225.54 sentences/s, 27802.94 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
try:
from fairseq.version import __version__ # noqa
except ImportError:
pass
# Adaptive Span
Adaptive Span is a novel self-attention mechanism that can learn its optimal
attention span. This allows us to significantly extend the maximum context size
used in Transformers while maintaining control over memory footprint and
computational time. It uses the Truncated BPTT technique for training,
as in [transformerXL](https://github.com/pytorch/fairseq/blob/main/examples/truncated_bptt/README.md).
Adaptive Span was introduced in the paper
[Adaptive Attention Span in Transformers](https://arxiv.org/abs/1905.07799),
which achieved state-of-the-art language modeling results at the time of publication.
We managed to reproduce their results in fairseq, keeping most of the
[original implementation](https://github.com/facebookresearch/adaptive-span) untouched.
You can also refer to their sweep file if any hyperparameter combination is unclear.
##### 0. Setup
First you need to process the Enwik8 dataset; we use the pre-tokenized dataset
from the [adaptive span paper](https://github.com/facebookresearch/adaptive-span/blob/master/get_data.sh).
Download the dataset, and then run:
```bash
fairseq-preprocess --only-source --trainpref ~/data/enwik8/train.txt \
--validpref ~/data/enwik8/valid.txt --testpref ~/data/enwik8/test.txt \
--destdir ~/data/enwik8/data-bin/ --joined-dictionary --workers 20
```
##### 1. Train an Adaptive Span model on Enwik8
We will train a 12-layer Adaptive Span model following the [hyperparameters
used in the original
paper](https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8.sh).
The following command assumes 4 GPUs, so that the total batch size is 64
sequences (4 x 16). Training should take 2-3 days on 4 V100 GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
--user-dir examples/adaptive_span \
--data ~/data/enwik8/data-bin/ \
--fp16 --fp16-no-flatten-grads --max-update 600000 \
--task truncated_bptt_lm --tokens-per-sample 512 --arch adaptive_span \
--n-layer 12 --d-model 512 --n-head 8 --d-inner 2048 --dropout 0.3 \
--attn-span 8192 --optimizer adagrad_with_grad_clip --adagrad-clip 0.03 \
--validate-interval-updates 1000 \
--lr-scheduler fixed --warmup-updates 32000 --batch-size-valid 32 \
--lr 0.07 --criterion adaptive_span_loss --batch-size 16 --update-freq 1 \
--seed 2 --log-format json --log-interval 25 --aux-loss-scaler 5e-07
```
This should land at around 1.05 bpc on validation and 1.03 bpc on test. You can
lower `--aux-loss-scaler` for better performance (longer spans); it gives roughly
a 0.03 bpc improvement over the Transformer-XL baseline here.
If training on a single GPU, set `--update-freq=4` to accumulate 4x gradients
and simulate training on 4 GPUs.
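For example, a single-GPU run could look like this (all other flags as in the
command above, elided here for brevity):
```bash
CUDA_VISIBLE_DEVICES=0 fairseq-train \
    --user-dir examples/adaptive_span \
    --data ~/data/enwik8/data-bin/ \
    ... \
    --batch-size 16 --update-freq 4
```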
You can also reproduce the Transformer-XL result on enwik8 using this code base.
It should land at around 1.06 bpc on test, matching the [original paper](https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_enwik8_base.sh).
You can try it with:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
--user-dir examples/truncated_bptt \
~/data/enwik8/data-bin/ \
--task truncated_bptt_lm --fp16 --max-update 400000 \
--tokens-per-sample 512 --arch transformer_xl --n-layer 12 \
--d-model 512 --n-head 8 --d-head 64 --d-inner 2048 --dropout 0.1 \
--dropatt 0.0 --mem-len 512 --optimizer adam --clip-norm 0.25 \
--lr-scheduler cosine --warmup-updates 0 \
--lr 0.00025 --batch-size 15 \
--update-freq 1 --seed 2 --log-format json --log-interval 25
```
##### 2. Evaluate
For Adaptive Span:
```bash
fairseq-eval-lm ~/data/enwik8/data-bin/ --path model/checkpoint_best.pt \
--user-dir examples/adaptive_span \
--task truncated_bptt_lm --batch-size 8 --tokens-per-sample 512 --gen-subset test
```
For Transformer-XL evaluation:
```bash
fairseq-eval-lm ~/data/enwik8/data-bin/ --path model/checkpoint_best.pt \
--user-dir examples/truncated_bptt/ --task truncated_bptt_lm --batch-size 8 \
--tokens-per-sample 80 \
--model-overrides '{"mem_len":2100,"clamp_len":820,"same_length":True}' \
--gen-subset valid
```
*Note:* During training the model saw 512 tokens of context
(``--tokens-per-sample=512``), with batch size 8. These settings match the evaluation
settings from [the original
paper](https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8.sh).
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import importlib
import os
# automatically import any Python files in the current directory
cur_dir = os.path.dirname(__file__)
for file in os.listdir(cur_dir):
path = os.path.join(cur_dir, file)
if (
not file.startswith("_")
and not file.startswith(".")
and (file.endswith(".py") or os.path.isdir(path))
):
mod_name = file[: file.find(".py")] if file.endswith(".py") else file
module = importlib.import_module(__name__ + "." + mod_name)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from torch.optim import Adagrad
from fairseq.optim import LegacyFairseqOptimizer, register_optimizer
@register_optimizer("adagrad_with_grad_clip")
class FairseqAdagradWithGradClip(LegacyFairseqOptimizer):
def __init__(self, args, params):
super().__init__(args)
self._optimizer = AdagradWithGradClip(params, **self.optimizer_config)
@staticmethod
def add_args(parser):
"""Add optimizer-specific arguments to the parser."""
# fmt: off
parser.add_argument('--weight-decay', '--wd', default=0.0, type=float, metavar='WD',
help='weight decay')
parser.add_argument('--adagrad-clip', default=0.0, type=float, metavar='D',
help='internal grad clip')
# fmt: on
@property
def optimizer_config(self):
"""
Return a kwarg dictionary that will be used to override optimizer
args stored in checkpoints. This allows us to load a checkpoint and
resume training using a different set of optimizer args, e.g., with a
different learning rate.
"""
return {
"lr": self.args.lr[0],
"weight_decay": self.args.weight_decay,
"grad_clip": self.args.adagrad_clip,
}
@property
def supports_flat_params(self):
return False
def _clip_grad(clr, grad, group_grad_clip):
if group_grad_clip > 0:
norm = grad.norm(2).item()
if norm > group_grad_clip:
clr *= group_grad_clip / (norm + 1e-10)
return clr
class AdagradWithGradClip(Adagrad):
"""Adagrad algorithm with custom gradient clipping"""
def __init__(
self,
params,
lr=1e-2,
lr_decay=0,
weight_decay=0,
initial_accumulator_value=0,
grad_clip=0,
):
Adagrad.__init__(
self,
params,
lr=lr,
lr_decay=lr_decay,
weight_decay=weight_decay,
initial_accumulator_value=initial_accumulator_value,
)
self.defaults["grad_clip"] = grad_clip
self.param_groups[0].setdefault("grad_clip", grad_clip)
def step(self, closure=None):
loss = None
if closure is not None:
loss = closure()
for group in self.param_groups:
for p in group["params"]:
if p.grad is None:
continue
grad = p.grad.data
state = self.state[p]
state["step"] += 1
if group["weight_decay"] != 0:
if p.grad.data.is_sparse:
raise RuntimeError(
"weight_decay option is "
"not compatible with sparse "
"gradients"
)
grad = grad.add(group["weight_decay"], p.data)
clr = group["lr"] / (1 + (state["step"] - 1) * group["lr_decay"])
# clip
clr = _clip_grad(clr=clr, grad=grad, group_grad_clip=group["grad_clip"])
if grad.is_sparse:
# the update is non-linear so indices must be unique
grad = grad.coalesce()
grad_indices = grad._indices()
grad_values = grad._values()
size = grad.size()
def make_sparse(values):
constructor = grad.new
if grad_indices.dim() == 0 or values.dim() == 0:
return constructor().resize_as_(grad)
return constructor(grad_indices, values, size)
state["sum"].add_(make_sparse(grad_values.pow(2)))
std = state["sum"]._sparse_mask(grad)
std_values = std._values().sqrt_().add_(1e-10)
p.data.add_(-clr, make_sparse(grad_values / std_values))
else:
state["sum"].addcmul_(1, grad, grad)
std = state["sum"].sqrt().add_(1e-10)
p.data.addcdiv_(-clr, grad, std)
return loss
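# Example usage (illustrative, outside of fairseq's optimizer wrapper):
#   opt = AdagradWithGradClip(model.parameters(), lr=0.07, grad_clip=0.03)
# When a parameter's gradient norm exceeds `grad_clip`, the effective learning
# rate for that update is scaled by grad_clip / norm, bounding the step size
# rather than modifying the gradient in place.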
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
class AdaptiveMask(nn.Module):
"""Soft masking function for adaptive size.
It masks out the last K values of an input. The masking value
goes from 1 to 0 gradually, so K can be learned with
back-propagation.
Args:
max_size: maximum size (i.e. input dimension)
ramp_size: size of the ramp going from 0 to 1
init_val: initial size proportion not to be masked out
shape: learn multiple sizes independent of each other
"""
def __init__(self, max_size, ramp_size, init_val=0, shape=(1,)):
nn.Module.__init__(self)
self._max_size = max_size
self._ramp_size = ramp_size
self.current_val = nn.Parameter(torch.zeros(*shape) + init_val)
mask_template = torch.linspace(1 - max_size, 0, steps=max_size)
self.register_buffer("mask_template", mask_template)
def forward(self, x):
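# Build a soft mask over the last (span) dimension: positions within the
# current span get weight 1, positions beyond it get 0, with a linear ramp
# of width `ramp_size` in between. Because `current_val` enters the mask
# linearly, the span length stays differentiable and can be learned by
# back-propagation.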
mask = self.mask_template.float() + self.current_val.float() * self._max_size
mask = mask / self._ramp_size + 1
mask = mask.clamp(0, 1)
if x.size(-1) < self._max_size:
# the input could have been trimmed beforehand to save computation
mask = mask.narrow(-1, self._max_size - x.size(-1), x.size(-1))
x = (x * mask).type_as(x)
return x
def get_current_max_size(self, include_ramp=True):
current_size = math.ceil(self.current_val.max().item() * self._max_size)
if include_ramp:
current_size += self._ramp_size
current_size = max(0, min(self._max_size, current_size))
return current_size
def get_current_avg_size(self, include_ramp=True):
current_size = math.ceil(
self.current_val.float().mean().item() * self._max_size
)
if include_ramp:
current_size += self._ramp_size
current_size = max(0, min(self._max_size, current_size))
return current_size
def clamp_param(self):
"""this need to be called after each update"""
self.current_val.data.clamp_(0, 1)
class AdaptiveSpan(nn.Module):
"""Adaptive attention span for Transformerself.
This module learns an attention span length from data for each
self-attention head.
Args:
attn_span: maximum attention span
adapt_span_loss: loss coefficient for the span length
adapt_span_ramp: length of the masking ramp
adapt_span_init: initial size ratio
adapt_span_cache: adapt cache size to reduce memory usage
"""
def __init__(
self,
attn_span,
adapt_span_ramp,
adapt_span_init,
n_head,
adapt_span_layer,
**kargs
):
nn.Module.__init__(self)
self._max_span = attn_span
self._n_head = n_head
self._adapt_span_layer = adapt_span_layer
if self._adapt_span_layer:
self._mask = AdaptiveMask(
max_size=self._max_span,
ramp_size=adapt_span_ramp,
init_val=adapt_span_init,
)
else:
self._mask = AdaptiveMask(
max_size=self._max_span,
ramp_size=adapt_span_ramp,
init_val=adapt_span_init,
shape=(n_head, 1, 1),
)
def forward(self, attn, normalize=True):
"""mask attention with the right span"""
# batch and head dimensions are merged together, so separate them first
self.clamp_param()
if self._adapt_span_layer:
attn = self._mask(attn)
else:
B = attn.size(0) # batch size
M = attn.size(1) # block size
attn = attn.reshape(B // self._n_head, self._n_head, M, -1)
attn = self._mask(attn)
attn = attn.view(B, M, -1)
return attn
def get_trim_len(self):
"""how much of memory can be trimmed to reduce computation"""
L = self._max_span
trim_len = min(L - 1, L - self._mask.get_current_max_size())
# too fine a granularity might be bad for memory management
trim_len = math.floor(trim_len / 64) * 64
return trim_len
def trim_memory(self, query, key, value, key_pe):
"""trim out unnecessary memory beforehand to reduce computation"""
trim_len = self.get_trim_len()
cache_size = key.size(1) - query.size(1)
trim_len_cache = trim_len - (self._max_span - cache_size)
if trim_len_cache > 0:
key = key[:, trim_len_cache:, :]
value = value[:, trim_len_cache:, :]
elif trim_len_cache < 0:
# cache is too short! this happens when validation resumes
# after a lot of updates.
key = F.pad(key, [0, 0, -trim_len_cache, 0])
value = F.pad(value, [0, 0, -trim_len_cache, 0])
if trim_len > 0:
if key_pe is not None:
key_pe = key_pe[:, :, trim_len:]
return key, value, key_pe
def get_cache_size(self):
"""determine how long the cache should be"""
trim_len = self.get_trim_len()
# give a buffer of 64 steps since a span might increase
# in future updates
return min(self._max_span, self._max_span - trim_len + 64)
def get_loss(self):
"""a loss term for regularizing the span length"""
return self._max_span * self._mask.current_val.float().mean()
def get_current_max_span(self):
return self._mask.get_current_max_size()
def get_current_avg_span(self):
return self._mask.get_current_avg_size()
def clamp_param(self):
self._mask.clamp_param()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import math
from dataclasses import dataclass
import torch.nn.functional as F
from fairseq import metrics, utils
from fairseq.criterions import register_criterion
from fairseq.criterions.cross_entropy import CrossEntropyCriterion
from fairseq.dataclass import FairseqDataclass
from omegaconf import II
@dataclass
class AdaptiveSpanCriterionConfig(FairseqDataclass):
sentence_avg: bool = II("optimization.sentence_avg")
@register_criterion("adaptive_span_loss", dataclass=AdaptiveSpanCriterionConfig)
class AdaptiveSpanCriterion(CrossEntropyCriterion):
def __init__(self, task, sentence_avg):
super().__init__(task, sentence_avg)
def forward(self, model, sample, reduce=True):
"""Compute the loss for the given sample.
Returns a tuple with three elements:
1) the loss here is summed, different from the adaptive span code
2) the sample size, which is used as the denominator for the gradient
3) logging outputs to display while training
"""
net_output = model(**sample["net_input"])
loss, aux_loss, avg_span, max_span = self.compute_loss(
model, net_output, sample, reduce=reduce
)
sample_size = (
sample["target"].size(0) if self.sentence_avg else sample["ntokens"]
)
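# The cross-entropy loss is normalized here and sample_size is then reset
# to 1, so the trainer will not divide the gradient by the token count a
# second time; the auxiliary span loss is added after this normalization.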
loss /= sample_size
total_loss = loss + aux_loss
sample_size = 1
logging_output = {
"loss": loss.data,
"ntokens": sample["ntokens"],
"nsentences": sample["target"].size(0),
"sample_size": sample_size,
"total_loss": total_loss.data,
"avg_span": avg_span * sample_size,
"max_span": max_span * sample_size,
}
return total_loss, sample_size, logging_output
def compute_loss(self, model, net_output, sample, reduce=True):
loss, _ = super().compute_loss(model, net_output, sample, reduce)
aux_loss = model.get_aux_loss()
avg_span = model.get_current_avg_span()
max_span = model.get_current_max_span()
return loss, aux_loss, avg_span, max_span
@staticmethod
def reduce_metrics(logging_outputs) -> None:
"""Aggregate logging outputs from data parallel training."""
loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
total_loss_sum = sum(log.get("total_loss", 0) for log in logging_outputs)
avg_span_sum = sum(log.get("avg_span", 0) for log in logging_outputs)
max_span_sum = sum(log.get("max_span", 0) for log in logging_outputs)
# we divide by log(2) to convert the loss from base e to base 2
metrics.log_scalar(
"loss", loss_sum / sample_size / math.log(2), sample_size, round=3
)
metrics.log_scalar("avg_span", avg_span_sum / sample_size, sample_size, round=3)
metrics.log_scalar("max_span", max_span_sum / sample_size, sample_size, round=3)
# total loss contains the L1 norm on adaptive-span
metrics.log_scalar(
"total_loss",
total_loss_sum / sample_size / math.log(2),
sample_size,
round=3,
)
if sample_size != ntokens:
metrics.log_scalar(
"nll_loss", loss_sum / ntokens / math.log(2), ntokens, round=3
)
metrics.log_derived(
"ppl", lambda meters: utils.get_perplexity(meters["nll_loss"].avg)
)
else:
metrics.log_derived(
"ppl", lambda meters: utils.get_perplexity(meters["loss"].avg)
)
@staticmethod
def logging_outputs_can_be_summed() -> bool:
"""
Whether the logging outputs returned by `forward` can be summed
across workers prior to calling `reduce_metrics`. Setting this
to True will improve distributed training speed.
"""
return True
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from fairseq.modules.layer_norm import LayerNorm
from .adaptive_span_attention import AdaptiveSpan
# Size notations:
# B = batch_size, H = d_model, M = block_size, L = attn_span
def _skew(X, pad_value):
"""shift every row 1 step to right"""
# X = B x M x L
B, M, L = X.size()
X = F.pad(X, (0, M + 1), value=pad_value) # B x M x (L+M+1)
X = X.view(B, -1) # B x ML+MM+M
X = X[:, :-M] # B x ML+MM
X = X.view(B, M, M + L) # B x M x L+M
return X
def _unskew(X):
"""reverse _skew operation"""
# X = B x M x L+M
B, M, L = X.size()
L -= M
X = X.view(B, -1) # B x ML+MM
X = F.pad(X, (0, M)) # B x ML+MM+M
X = X.view(B, M, M + L + 1) # B x M x L+M+1
X = X[:, :, :L] # B x M x L
return X
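# Round-trip check (illustrative): with pad_value=0, _unskew inverts _skew, e.g.
#   X = torch.arange(6.0).view(1, 2, 3)   # B x M x L
#   Y = _skew(X, pad_value=0)             # B x M x (M+L); row i is shifted right by i
#   assert torch.equal(_unskew(Y), X)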
class SeqAttention(nn.Module):
"""Sequential self-attention layer.
Each token will attend to its previous fixed number of steps.
Note that attention doesn't include the current step itself.
"""
def __init__(self, d_model, n_head, attn_span, dropout, adapt_span_layer, **kargs):
nn.Module.__init__(self)
self.dropout = nn.Dropout(dropout)
self.d_model = d_model # size of a single head
self.attn_span = attn_span
self.adaptive_span = AdaptiveSpan(
attn_span=attn_span,
n_head=n_head,
adapt_span_layer=adapt_span_layer,
**kargs
)
def forward(self, query, key, value, key_pe):
# query size = B x M x H
# key, value sizes = B x (M+L) x H
key, value, key_pe = self.adaptive_span.trim_memory(query, key, value, key_pe)
# compute attention from context
# B x M (dest) x (M+L) (src)
attn_cont = torch.matmul(query, key.transpose(-1, -2))
attn_cont = _unskew(attn_cont) # B x M x L
# compute the effect of position embedding
attn_pos = torch.matmul(query, key_pe) # B x M x L_pos
attn = attn_cont + attn_pos
attn = attn / math.sqrt(self.d_model) # B x M X L_pos
attn = F.softmax(attn.float(), dim=-1).type_as(attn)
# trim attention lengths according to the learned span
attn = self.adaptive_span(attn)
attn = self.dropout(attn) # B x M X L_pos
attn_cont = _skew(attn, 0) # B x M X (L+M)
out = torch.matmul(attn_cont, value) # B x M x H
return out
def get_cache_size(self):
return self.adaptive_span.get_cache_size()
class MultiHeadSeqAttention(nn.Module):
def __init__(self, d_model, n_head, **kargs):
nn.Module.__init__(self)
assert d_model % n_head == 0
self.n_head = n_head
self.head_dim = d_model // n_head
self.attn = SeqAttention(d_model=self.head_dim, n_head=n_head, **kargs)
self.proj_query = nn.Linear(d_model, d_model, bias=False)
nn.init.xavier_normal_(self.proj_query.weight)
self.proj_out = nn.Linear(d_model, d_model, bias=False)
nn.init.xavier_normal_(self.proj_out.weight)
self.proj_val = nn.Linear(d_model, d_model, bias=False)
nn.init.xavier_normal_(self.proj_val.weight)
self.proj_key = nn.Linear(d_model, d_model, bias=False)
nn.init.xavier_normal_(self.proj_key.weight)
def head_reshape(self, x):
K = self.n_head
D = self.head_dim
x = x.view(x.size()[:-1] + (K, D)) # B x (M+L) x K x D
x = x.transpose(1, 2).contiguous() # B x K x (M+L) x D
x = x.view(-1, x.size(-2), x.size(-1)) # B_K x (M+L) x D
return x
def forward(self, query, key, value, key_pe):
B = query.size(0)
K = self.n_head
D = self.head_dim
M = query.size(1)
query = self.proj_query(query)
query = self.head_reshape(query)
value = self.proj_val(value)
value = self.head_reshape(value)
key = self.proj_key(key)
key = self.head_reshape(key)
out = self.attn(query, key, value, key_pe) # B_K x M x D
out = out.view(B, K, M, D) # B x K x M x D
out = out.transpose(1, 2).contiguous() # B x M x K x D
out = out.view(B, M, -1) # B x M x K_D
out = self.proj_out(out)
return out
class FeedForwardLayer(nn.Module):
def __init__(self, d_model, d_inner, dropout, **kargs):
nn.Module.__init__(self)
self.fc1 = nn.Linear(d_model, d_inner)
self.fc2 = nn.Linear(d_inner, d_model)
nn.init.xavier_uniform_(self.fc1.weight)
nn.init.xavier_uniform_(self.fc2.weight)
self.dropout = nn.Dropout(dropout)
def forward(self, h):
h1 = F.relu(self.fc1(h))
h1 = self.dropout(h1)
h2 = self.fc2(h1)
return h2
class TransformerSeqLayer(nn.Module):
def __init__(self, d_model, **kargs):
nn.Module.__init__(self)
self.attn = MultiHeadSeqAttention(d_model=d_model, **kargs)
self.norm1 = LayerNorm(d_model)
self.ff = FeedForwardLayer(d_model=d_model, **kargs)
self.norm2 = LayerNorm(d_model)
def forward(self, h, h_cache, key_pe):
# h = B x M x H
# h_cache = B x L x H
h_all = torch.cat([h_cache, h], dim=1) # B x (M+L) x H
attn_out = self.attn(h, h_all, h_all, key_pe)
h = self.norm1(h + attn_out) # B x M x H
if self.ff is not None:
ff_out = self.ff(h)
out = self.norm2(h + ff_out) # B x M x H
else:
out = h
return out
def get_cache_size(self):
return self.attn.attn.get_cache_size()
class TransformerSeq(nn.Module):
def __init__(
self,
vocab_size,
d_model,
n_head,
n_layer,
attn_span,
emb_dropout,
aux_loss_scaler,
adapt_span_layer,
**kargs
):
nn.Module.__init__(self)
# token embeddings
self.in_emb = nn.Embedding(vocab_size, d_model)
nn.init.normal_(self.in_emb.weight, mean=0, std=d_model ** -0.5)
self.out_emb = nn.Linear(d_model, vocab_size)
self.aux_loss_scaler = aux_loss_scaler
if emb_dropout > 0:
self.emb_dropout = nn.Dropout(emb_dropout)
else:
self.emb_dropout = None
# position embeddings
self.key_pe = nn.Parameter(torch.randn(1, d_model // n_head, attn_span))
self.layers = nn.ModuleList()
self.layers.extend(
TransformerSeqLayer(
d_model=d_model,
n_head=n_head,
attn_span=attn_span,
adapt_span_layer=adapt_span_layer,
**kargs
)
for _ in range(n_layer)
)
def forward(self, x, h_cache, target=None):
# x size = B x M
block_size = x.size(1)
h = self.in_emb(x) # B x M x H
if self.emb_dropout is not None:
h = self.emb_dropout(h)
h_cache_next = []
for l, layer in enumerate(self.layers):
cache_size = layer.attn.attn.get_cache_size()
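            # build the hidden-state cache for the next block: keep only the most
            # recent `cache_size` positions, reusing part of the old cache when the
            # learned span is longer than the current block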
if cache_size > block_size:
h_cache_next_l = torch.cat(
[h_cache[l][:, -cache_size + block_size :, :], h], dim=1
).detach()
else:
h_cache_next_l = h[:, -cache_size:, :].detach()
h_cache_next.append(h_cache_next_l)
h = layer(h, h_cache[l], self.key_pe) # B x M x H
if self.emb_dropout is not None:
h = self.emb_dropout(h)
out = F.log_softmax(self.out_emb(h).float(), dim=-1).type_as(h)
dummy_loss = None
return out, h_cache_next, dummy_loss
def get_aux_loss(self):
loss = 0.0
for layer in self.layers:
loss += layer.attn.attn.adaptive_span.get_loss()
return self.aux_loss_scaler * loss
def get_current_max_span(self):
max_span = 0.0
for layer in self.layers:
max_span = max(
max_span, layer.attn.attn.adaptive_span.get_current_max_span()
)
return max_span
def get_current_avg_span(self):
avg_span = 0.0
for layer in self.layers:
avg_span += layer.attn.attn.adaptive_span.get_current_avg_span()
return avg_span / len(self.layers)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
from dataclasses import dataclass
from typing import Dict, List, Optional
import torch
from fairseq.dataclass import FairseqDataclass
from fairseq.models import (
FairseqIncrementalDecoder,
FairseqLanguageModel,
register_model,
)
from .adaptive_span_model import TransformerSeq as AdaptiveSpanTransformerModel
logger = logging.getLogger(__name__)
@dataclass
class AdaptiveSpanSmallConfig(FairseqDataclass):
# defaults come from https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8_small.sh
vocab_size: int = 50
d_model: int = 256
n_head: int = 4
d_inner: int = 1024
n_layer: int = 8
attn_span: int = 1024
dropout: float = 0.0
emb_dropout: float = 0.0
adapt_span_ramp: int = 32
adapt_span_init: float = 0.0
aux_loss_scaler: float = 0.000002
adapt_span_layer: bool = False
@register_model("adaptive_span", dataclass=AdaptiveSpanSmallConfig)
class AdaptiveSpanTransformer(FairseqLanguageModel):
@classmethod
def build_model(cls, cfg: AdaptiveSpanSmallConfig, task):
return cls(AdaptiveSpanDecoder(cfg, task))
def get_aux_loss(self):
return self.decoder.get_aux_loss()
def get_current_max_span(self):
return self.decoder.get_current_max_span()
def get_current_avg_span(self):
return self.decoder.get_current_avg_span()
class AdaptiveSpanDecoder(FairseqIncrementalDecoder):
def __init__(self, cfg, task):
super().__init__(task.target_dictionary)
self.config = cfg
config = AdaptiveSpanSmallConfig(
vocab_size=len(task.target_dictionary),
d_model=cfg.d_model,
n_head=cfg.n_head,
d_inner=cfg.d_inner,
n_layer=cfg.n_layer,
attn_span=cfg.attn_span,
dropout=cfg.dropout,
emb_dropout=cfg.emb_dropout,
adapt_span_ramp=cfg.adapt_span_ramp,
adapt_span_init=cfg.adapt_span_init,
aux_loss_scaler=cfg.aux_loss_scaler,
adapt_span_layer=cfg.adapt_span_layer,
)
logger.info(config)
self.model = AdaptiveSpanTransformerModel(**config.__dict__)
self._mems = None
def forward(
self,
src_tokens,
incremental_state: Optional[Dict[str, List[torch.Tensor]]] = None,
encoder_out=None,
):
bsz = src_tokens.size(0)
if incremental_state is not None: # used during inference
            mems = self.get_incremental_state(incremental_state, "mems")
src_tokens = src_tokens[:, -1:] # only keep the most recent token
else:
mems = self._mems
if mems is None:
# first time init
mems = self.init_hid_cache(bsz)
output = self.model(x=src_tokens, h_cache=mems,)
if incremental_state is not None:
self.set_incremental_state(incremental_state, "mems", output[1])
else:
self._mems = output[1]
return (output[0],)
def max_positions(self):
return self.config.attn_span
def init_hid_cache(self, batch_sz):
hid = []
for layer in self.model.layers:
param = next(self.model.parameters())
h = torch.zeros(
batch_sz,
layer.get_cache_size(),
self.config.d_model,
dtype=param.dtype,
device=param.device,
)
hid.append(h)
return hid
def get_aux_loss(self):
return self.model.get_aux_loss()
def get_current_max_span(self):
return self.model.get_current_max_span()
def get_current_avg_span(self):
return self.model.get_current_avg_span()
def reorder_incremental_state(
self,
incremental_state: Dict[str, Dict[str, Optional[torch.Tensor]]],
new_order: torch.Tensor,
):
"""Reorder incremental state.
This will be called when the order of the input has changed from the
previous time step. A typical use case is beam search, where the input
order changes between time steps based on the selection of beams.
"""
raise NotImplementedError("This is required for generation/beam search")
# mems = self.get_incremental_state(incremental_state, "mems")
# if mems is not None:
# new_mems = [mems_i.index_select(1, new_order) for mems_i in mems]
# self.set_incremental_state(incremental_state, "mems", new_mems)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
import os
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import torch
from fairseq import utils
from fairseq.data import (
Dictionary,
TokenBlockDataset,
data_utils,
iterators,
)
from fairseq.dataclass import FairseqDataclass
from fairseq.distributed import utils as dist_utils
from fairseq.tasks import FairseqTask, register_task
from omegaconf import II
logger = logging.getLogger(__name__)
@dataclass
class TruncatedBPTTLMConfig(FairseqDataclass):
data: str = field(default="???", metadata={"help": "path to data directory"})
tokens_per_sample: int = field(
default=1024,
metadata={"help": "max number of tokens per sequence"},
)
batch_size: int = II("dataset.batch_size")
# Some models use *max_target_positions* to know how many positional
# embeddings to learn. We use II(...) to make it default to
# *tokens_per_sample*, but in principle there could be more positional
# embeddings than tokens in a single batch. This may also be irrelevant for
# custom model implementations.
max_target_positions: int = II("task.tokens_per_sample")
# these will be populated automatically if not provided
data_parallel_rank: Optional[int] = None
data_parallel_size: Optional[int] = None
@register_task("truncated_bptt_lm", dataclass=TruncatedBPTTLMConfig)
class TruncatedBPTTLMTask(FairseqTask):
def __init__(self, cfg: TruncatedBPTTLMConfig):
super().__init__(cfg)
if cfg.data_parallel_rank is None or cfg.data_parallel_size is None:
if torch.distributed.is_initialized():
cfg.data_parallel_rank = dist_utils.get_data_parallel_rank()
cfg.data_parallel_size = dist_utils.get_data_parallel_world_size()
else:
cfg.data_parallel_rank = 0
cfg.data_parallel_size = 1
# load the dictionary
paths = utils.split_paths(cfg.data)
assert len(paths) > 0
self.dictionary = Dictionary.load(os.path.join(paths[0], "dict.txt"))
logger.info("dictionary: {} types".format(len(self.dictionary)))
def load_dataset(self, split, epoch=1, combine=False, **kwargs):
"""Load a given dataset split (e.g., train, valid, test)"""
# support sharded datasets
paths = utils.split_paths(self.cfg.data)
assert len(paths) > 0
data_path = paths[(epoch - 1) % len(paths)]
split_path = os.path.join(data_path, split)
# each element of *data* will be a tensorized line from the original
# text dataset, similar to ``open(split_path).readlines()``
data = data_utils.load_indexed_dataset(
split_path, self.dictionary, combine=combine
)
if data is None:
raise FileNotFoundError(
"Dataset not found: {} ({})".format(split, split_path)
)
# this is similar to ``data.view(-1).split(tokens_per_sample)``
data = TokenBlockDataset(
data,
data.sizes,
block_size=self.cfg.tokens_per_sample,
pad=None, # unused
eos=None, # unused
break_mode="none",
)
self.datasets[split] = TruncatedBPTTDataset(
data=data,
bsz_per_shard=self.cfg.batch_size,
shard_id=self.cfg.data_parallel_rank,
num_shards=self.cfg.data_parallel_size,
)
def dataset(self, split):
return self.datasets[split]
def get_batch_iterator(
self, dataset, num_workers=0, epoch=1, data_buffer_size=0, **kwargs
):
return iterators.EpochBatchIterator(
dataset=dataset,
collate_fn=self._collate_fn,
num_workers=num_workers,
epoch=epoch,
buffer_size=data_buffer_size,
# we don't use the batching functionality from EpochBatchIterator;
# instead every item in *dataset* is a whole batch
batch_sampler=[[i] for i in range(len(dataset))],
disable_shuffling=True,
)
def _collate_fn(self, items: List[List[torch.Tensor]]):
        # we don't use fairseq's batching functionality, so each "batch" here is
        # a single item from TruncatedBPTTDataset: an (id, List[torch.Tensor]) pair
assert len(items) == 1
# item will have shape B x T (the last batch may have length < T)
id, item = items[0]
item = data_utils.collate_tokens(item, pad_idx=self.source_dictionary.pad())
B, T = item.size()
# shift item one position over and append a padding token for the target
target = torch.nn.functional.pad(
item[:, 1:], (0, 1, 0, 0), value=self.target_dictionary.pad()
)
# fairseq expects batches to have the following structure
return {
"id": torch.tensor([id]*item.size(0)),
"net_input": {
"src_tokens": item,
},
"target": target,
"nsentences": item.size(0),
"ntokens": item.numel(),
}
def build_dataset_for_inference(
self, src_tokens: List[torch.Tensor], src_lengths: List[int], **kwargs
) -> torch.utils.data.Dataset:
eos = self.source_dictionary.eos()
dataset = TokenBlockDataset(
src_tokens,
src_lengths,
block_size=None, # ignored for "eos" break mode
pad=self.source_dictionary.pad(),
eos=eos,
break_mode="eos",
)
class Dataset(torch.utils.data.Dataset):
def __getitem__(self, i):
item = dataset[i]
if item[-1] == eos:
# remove eos to support generating with a prefix
item = item[:-1]
return (i, [item])
def __len__(self):
return len(dataset)
return Dataset()
def inference_step(
self, generator, models, sample, prefix_tokens=None, constraints=None
):
with torch.no_grad():
if constraints is not None:
raise NotImplementedError
# SequenceGenerator doesn't use *src_tokens* directly, we need to
# pass the *prefix_tokens* argument instead.
if prefix_tokens is None and sample["net_input"]["src_tokens"].nelement():
prefix_tokens = sample["net_input"]["src_tokens"]
# begin generation with the end-of-sentence token
bos_token = self.source_dictionary.eos()
return generator.generate(
models, sample, prefix_tokens=prefix_tokens, bos_token=bos_token
)
def eval_lm_dataloader(
self,
dataset,
max_tokens: Optional[int] = 36000,
batch_size: Optional[int] = None,
max_positions: Optional[int] = None,
num_shards: int = 1,
shard_id: int = 0,
num_workers: int = 1,
data_buffer_size: int = 10,
context_window: int = 0,
):
if context_window > 0:
raise NotImplementedError(
"Transformer-XL doesn't need --context-window, try "
"--model-overrides '{\"mem_len\":42}' instead "
)
return self.get_batch_iterator(
dataset=dataset,
max_tokens=max_tokens,
max_sentences=batch_size,
max_positions=max_positions,
ignore_invalid_inputs=True,
num_shards=num_shards,
shard_id=shard_id,
num_workers=num_workers,
data_buffer_size=data_buffer_size,
).next_epoch_itr(shuffle=False)
@property
def source_dictionary(self):
return self.dictionary
@property
def target_dictionary(self):
return self.dictionary
class TruncatedBPTTDataset(torch.utils.data.Dataset):
def __init__(
self,
data: List[torch.Tensor], # ordered list of items
        bsz_per_shard,  # number of items processed per GPU per forward pass
shard_id, # current GPU ID
num_shards, # number of GPUs
):
super().__init__()
self.data = data
def batchify(data, bsz):
# Work out how cleanly we can divide the dataset into bsz parts.
nbatch = data.size(0) // bsz
# Trim off any extra elements that wouldn't cleanly fit (remainders).
data = data.narrow(0, 0, nbatch * bsz)
# Evenly divide the data across the bsz batches.
data = data.view(bsz, -1).contiguous()
return data
# total number of sequences processed by all GPUs in each forward pass
global_batch_size = bsz_per_shard * num_shards
"""
With a 16 item dataset, bsz_per_shard=2 and num_shards=3,
*indices* might look like:
indices = [[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9],
[10, 11]]
The size of the TruncatedBPTTDataset instance will be 2,
and shard 1 will see items:
[(0, [data[4], data[6]]),
(1, [data[5], data[7]])]
"""
indices = batchify(torch.arange(len(data)), global_batch_size)
assert indices.size(0) == global_batch_size
self.my_indices = indices[
shard_id * bsz_per_shard : (shard_id + 1) * bsz_per_shard
]
assert self.my_indices.size(0) == bsz_per_shard
def __len__(self):
return self.my_indices.size(1)
def __getitem__(self, i) -> Tuple[int, List[torch.Tensor]]:
return (i, [self.data[idx] for idx in self.my_indices[:, i]])
# Understanding Back-Translation at Scale (Edunov et al., 2018)
This page includes pre-trained models from the paper [Understanding Back-Translation at Scale (Edunov et al., 2018)](https://arxiv.org/abs/1808.09381).
## Pre-trained models
Model | Description | Dataset | Download
---|---|---|---
`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
## Example usage (torch.hub)
We require a few additional Python dependencies for preprocessing:
```bash
pip install subword_nmt sacremoses
```
Then to generate translations from the full model ensemble:
```python
import torch
# List available models
torch.hub.list('pytorch/fairseq') # [..., 'transformer.wmt18.en-de', ... ]
# Load the WMT'18 En-De ensemble
en2de_ensemble = torch.hub.load(
'pytorch/fairseq', 'transformer.wmt18.en-de',
checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
tokenizer='moses', bpe='subword_nmt')
# The ensemble contains 5 models
len(en2de_ensemble.models)
# 5
# Translate
en2de_ensemble.translate('Hello world!')
# 'Hallo Welt!'
```
## Training your own model (WMT'18 English-German)
The following instructions can be adapted to reproduce the models from the paper.
#### Step 1. Prepare parallel data and optionally train a baseline (English-German) model
First download and preprocess the data:
```bash
# Download and prepare the data
cd examples/backtranslation/
bash prepare-wmt18en2de.sh
cd ../..
# Binarize the data
TEXT=examples/backtranslation/wmt18_en_de
fairseq-preprocess \
--joined-dictionary \
--source-lang en --target-lang de \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt18_en_de --thresholdtgt 0 --thresholdsrc 0 \
--workers 20
# Copy the BPE code into the data-bin directory for future use
cp examples/backtranslation/wmt18_en_de/code data-bin/wmt18_en_de/code
```
(Optionally) Train a baseline model (English-German) using just the parallel data:
```bash
CHECKPOINT_DIR=checkpoints_en_de_parallel
fairseq-train --fp16 \
data-bin/wmt18_en_de \
--source-lang en --target-lang de \
--arch transformer_wmt_en_de_big --share-all-embeddings \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--max-tokens 3584 --update-freq 16 \
--max-update 30000 \
--save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```
Average the last 10 checkpoints:
```bash
python scripts/average_checkpoints.py \
--inputs $CHECKPOINT_DIR \
--num-epoch-checkpoints 10 \
--output $CHECKPOINT_DIR/checkpoint.avg10.pt
```
Evaluate BLEU:
```bash
# tokenized BLEU on newstest2017:
bash examples/backtranslation/tokenized_bleu.sh \
wmt17 \
en-de \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU4 = 29.57, 60.9/35.4/22.9/15.5 (BP=1.000, ratio=1.014, syslen=63049, reflen=62152)
# compare to 29.46 in Table 1, which is also for tokenized BLEU
# generally it's better to report (detokenized) sacrebleu though:
bash examples/backtranslation/sacrebleu.sh \
wmt17 \
en-de \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 29.0 60.6/34.7/22.4/14.9 (BP = 1.000 ratio = 1.013 hyp_len = 62099 ref_len = 61287)
```
#### Step 2. Back-translate monolingual German data
Train a reverse model (German-English) to do the back-translation:
```bash
CHECKPOINT_DIR=checkpoints_de_en_parallel
fairseq-train --fp16 \
data-bin/wmt18_en_de \
--source-lang de --target-lang en \
--arch transformer_wmt_en_de_big --share-all-embeddings \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--max-tokens 3584 --update-freq 16 \
--max-update 30000 \
--save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```
Let's evaluate the back-translation (BT) model to make sure it is well trained:
```bash
bash examples/backtranslation/sacrebleu.sh \
wmt17 \
de-en \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint_best.pt
# BLEU+case.mixed+lang.de-en+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 34.9 66.9/41.8/28.5/19.9 (BP = 0.983 ratio = 0.984 hyp_len = 63342 ref_len = 64399)
# compare to the best system from WMT'17 which scored 35.1: http://matrix.statmt.org/matrix/systems_list/1868
```
Next prepare the monolingual data:
```bash
# Download and prepare the monolingual data
# By default the script samples 25M monolingual sentences, which after
# deduplication should be just over 24M sentences. These are split into 25
# shards, each with 1M sentences (except for the last shard).
cd examples/backtranslation/
bash prepare-de-monolingual.sh
cd ../..
# Binarize each shard of the monolingual data
TEXT=examples/backtranslation/wmt18_de_mono
for SHARD in $(seq -f "%02g" 0 24); do \
fairseq-preprocess \
--only-source \
--source-lang de --target-lang en \
--joined-dictionary \
--srcdict data-bin/wmt18_en_de/dict.de.txt \
--testpref $TEXT/bpe.monolingual.dedup.${SHARD} \
--destdir data-bin/wmt18_de_mono/shard${SHARD} \
--workers 20; \
cp data-bin/wmt18_en_de/dict.en.txt data-bin/wmt18_de_mono/shard${SHARD}/; \
done
```
Now we're ready to perform back-translation over the monolingual data. The
following command generates via sampling, but it's possible to use greedy
decoding (`--beam 1`), beam search (`--beam 5`),
top-k sampling (`--sampling --beam 1 --sampling-topk 10`), etc.:
```bash
mkdir backtranslation_output
for SHARD in $(seq -f "%02g" 0 24); do \
fairseq-generate --fp16 \
data-bin/wmt18_de_mono/shard${SHARD} \
--path $CHECKPOINT_DIR/checkpoint_best.pt \
--skip-invalid-size-inputs-valid-test \
--max-tokens 4096 \
--sampling --beam 1 \
> backtranslation_output/sampling.shard${SHARD}.out; \
done
```
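For instance, the top-k sampling variant mentioned above only changes the decoding
flags of the inner `fairseq-generate` call (shown here for a single shard; the
output filename is just an illustrative choice):
```bash
fairseq-generate --fp16 \
    data-bin/wmt18_de_mono/shard00 \
    --path $CHECKPOINT_DIR/checkpoint_best.pt \
    --skip-invalid-size-inputs-valid-test \
    --max-tokens 4096 \
    --sampling --beam 1 --sampling-topk 10 \
    > backtranslation_output/topk10.shard00.out
```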
After BT, use the `extract_bt_data.py` script to re-combine the shards, extract
the back-translations and apply length ratio filters:
```bash
python examples/backtranslation/extract_bt_data.py \
--minlen 1 --maxlen 250 --ratio 1.5 \
--output backtranslation_output/bt_data --srclang en --tgtlang de \
backtranslation_output/sampling.shard*.out
# Ensure lengths are the same:
# wc -l backtranslation_output/bt_data.{en,de}
# 21795614 backtranslation_output/bt_data.en
# 21795614 backtranslation_output/bt_data.de
# 43591228 total
```
Binarize the filtered BT data and combine it with the parallel data:
```bash
TEXT=backtranslation_output
fairseq-preprocess \
--source-lang en --target-lang de \
--joined-dictionary \
--srcdict data-bin/wmt18_en_de/dict.en.txt \
--trainpref $TEXT/bt_data \
--destdir data-bin/wmt18_en_de_bt \
--workers 20
# We want to train on the combined data, so we'll symlink the parallel + BT data
# in the wmt18_en_de_para_plus_bt directory. We link the parallel data as "train"
# and the BT data as "train1", so that fairseq will combine them automatically
# and so that we can use the `--upsample-primary` option to upsample the
# parallel data (if desired).
PARA_DATA=$(readlink -f data-bin/wmt18_en_de)
BT_DATA=$(readlink -f data-bin/wmt18_en_de_bt)
COMB_DATA=data-bin/wmt18_en_de_para_plus_bt
mkdir -p $COMB_DATA
for LANG in en de; do \
ln -s ${PARA_DATA}/dict.$LANG.txt ${COMB_DATA}/dict.$LANG.txt; \
for EXT in bin idx; do \
ln -s ${PARA_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train.en-de.$LANG.$EXT; \
ln -s ${BT_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train1.en-de.$LANG.$EXT; \
ln -s ${PARA_DATA}/valid.en-de.$LANG.$EXT ${COMB_DATA}/valid.en-de.$LANG.$EXT; \
ln -s ${PARA_DATA}/test.en-de.$LANG.$EXT ${COMB_DATA}/test.en-de.$LANG.$EXT; \
done; \
done
```
#### Step 3. Train an English-German model over the combined parallel + BT data
Finally we can train a model over the parallel + BT data:
```bash
CHECKPOINT_DIR=checkpoints_en_de_parallel_plus_bt
fairseq-train --fp16 \
data-bin/wmt18_en_de_para_plus_bt \
--upsample-primary 16 \
--source-lang en --target-lang de \
--arch transformer_wmt_en_de_big --share-all-embeddings \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.0007 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--max-tokens 3584 --update-freq 16 \
--max-update 100000 \
--save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```
Average the last 10 checkpoints:
```bash
python scripts/average_checkpoints.py \
--inputs $CHECKPOINT_DIR \
--num-epoch-checkpoints 10 \
--output $CHECKPOINT_DIR/checkpoint.avg10.pt
```
Evaluate BLEU:
```bash
# tokenized BLEU on newstest2017:
bash examples/backtranslation/tokenized_bleu.sh \
wmt17 \
en-de \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU4 = 32.35, 64.4/38.9/26.2/18.3 (BP=0.977, ratio=0.977, syslen=60729, reflen=62152)
# compare to 32.35 in Table 1, which is also for tokenized BLEU
# generally it's better to report (detokenized) sacrebleu:
bash examples/backtranslation/sacrebleu.sh \
wmt17 \
en-de \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 31.5 64.3/38.2/25.6/17.6 (BP = 0.971 ratio = 0.971 hyp_len = 59515 ref_len = 61287)
```
## Citation
```bibtex
@inproceedings{edunov2018backtranslation,
title = {Understanding Back-Translation at Scale},
author = {Edunov, Sergey and Ott, Myle and Auli, Michael and Grangier, David},
  booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = 2018,
}
```
#!/usr/bin/python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import fileinput
import hashlib
import sys
from multiprocessing import Pool
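# Reads the input files as raw bytes, hashes every line with MD5 and writes only
# the first occurrence of each distinct line to stdout. Illustrative usage
# (matching how prepare-de-monolingual.sh calls it):
#   python deduplicate_lines.py bpe.monolingual.25000000.de > bpe.monolingual.dedup.25000000.de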
def get_hashes_and_lines(raw_line):
hash = hashlib.md5(raw_line).hexdigest()
return hash, raw_line
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--workers", type=int, default=10)
parser.add_argument("files", nargs="*", help="input files")
args = parser.parse_args()
seen = set()
with fileinput.input(args.files, mode="rb") as h:
pool = Pool(args.workers)
results = pool.imap_unordered(get_hashes_and_lines, h, 1000)
for i, (hash, raw_line) in enumerate(results):
if hash not in seen:
seen.add(hash)
sys.stdout.buffer.write(raw_line)
if i % 1000000 == 0:
print(i, file=sys.stderr, end="", flush=True)
elif i % 100000 == 0:
print(".", file=sys.stderr, end="", flush=True)
print(file=sys.stderr, flush=True)
if __name__ == "__main__":
main()
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import fileinput
from tqdm import tqdm
def main():
parser = argparse.ArgumentParser(
description=(
"Extract back-translations from the stdout of fairseq-generate. "
"If there are multiply hypotheses for a source, we only keep the first one. "
)
)
parser.add_argument("--output", required=True, help="output prefix")
parser.add_argument(
"--srclang", required=True, help="source language (extracted from H-* lines)"
)
parser.add_argument(
"--tgtlang", required=True, help="target language (extracted from S-* lines)"
)
parser.add_argument("--minlen", type=int, help="min length filter")
parser.add_argument("--maxlen", type=int, help="max length filter")
parser.add_argument("--ratio", type=float, help="ratio filter")
parser.add_argument("files", nargs="*", help="input files")
args = parser.parse_args()
def validate(src, tgt):
srclen = len(src.split(" ")) if src != "" else 0
tgtlen = len(tgt.split(" ")) if tgt != "" else 0
if (
(args.minlen is not None and (srclen < args.minlen or tgtlen < args.minlen))
or (
args.maxlen is not None
and (srclen > args.maxlen or tgtlen > args.maxlen)
)
or (
args.ratio is not None
and (max(srclen, tgtlen) / float(min(srclen, tgtlen)) > args.ratio)
)
):
return False
return True
def safe_index(toks, index, default):
try:
return toks[index]
except IndexError:
return default
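    # fairseq-generate prints source lines as "S-<id>\t<source>" and hypothesis
    # lines as "H-<id>\t<score>\t<hypothesis>"; the loop below pairs each H- line
    # with the most recent S- line and keeps only the first hypothesis per source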
with open(args.output + "." + args.srclang, "w") as src_h, open(
args.output + "." + args.tgtlang, "w"
) as tgt_h:
for line in tqdm(fileinput.input(args.files)):
if line.startswith("S-"):
tgt = safe_index(line.rstrip().split("\t"), 1, "")
elif line.startswith("H-"):
if tgt is not None:
src = safe_index(line.rstrip().split("\t"), 2, "")
if validate(src, tgt):
print(src, file=src_h)
print(tgt, file=tgt_h)
tgt = None
if __name__ == "__main__":
main()
#!/bin/bash
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt/subword_nmt
BPE_CODE=wmt18_en_de/code
SUBSAMPLE_SIZE=25000000
LANG=de
OUTDIR=wmt18_${LANG}_mono
orig=orig
tmp=$OUTDIR/tmp
mkdir -p $OUTDIR $tmp
URLS=(
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2007.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2008.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2009.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2010.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2011.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.de.shuffled.gz"
"http://www.statmt.org/wmt15/training-monolingual-news-crawl-v2/news.2014.de.shuffled.v2.gz"
"http://data.statmt.org/wmt16/translation-task/news.2015.de.shuffled.gz"
"http://data.statmt.org/wmt17/translation-task/news.2016.de.shuffled.gz"
"http://data.statmt.org/wmt18/translation-task/news.2017.de.shuffled.deduped.gz"
)
FILES=(
"news.2007.de.shuffled.gz"
"news.2008.de.shuffled.gz"
"news.2009.de.shuffled.gz"
"news.2010.de.shuffled.gz"
"news.2011.de.shuffled.gz"
"news.2012.de.shuffled.gz"
"news.2013.de.shuffled.gz"
"news.2014.de.shuffled.v2.gz"
"news.2015.de.shuffled.gz"
"news.2016.de.shuffled.gz"
"news.2017.de.shuffled.deduped.gz"
)
cd $orig
for ((i=0;i<${#URLS[@]};++i)); do
file=${FILES[i]}
if [ -f $file ]; then
echo "$file already exists, skipping download"
else
url=${URLS[i]}
wget "$url"
fi
done
cd ..
if [ -f $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG} ]; then
echo "found monolingual sample, skipping shuffle/sample/tokenize"
else
gzip -c -d -k $(for FILE in "${FILES[@]}"; do echo $orig/$FILE; done) \
| shuf -n $SUBSAMPLE_SIZE \
| perl $NORM_PUNC $LANG \
| perl $REM_NON_PRINT_CHAR \
| perl $TOKENIZER -threads 8 -a -l $LANG \
> $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG}
fi
if [ -f $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG} ]; then
echo "found BPE monolingual sample, skipping BPE step"
else
python $BPEROOT/apply_bpe.py -c $BPE_CODE \
< $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG} \
> $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG}
fi
if [ -f $tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG} ]; then
echo "found deduplicated monolingual sample, skipping deduplication step"
else
python deduplicate_lines.py $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG} \
> $tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG}
fi
if [ -f $OUTDIR/bpe.monolingual.dedup.00.de ]; then
echo "found sharded data, skipping sharding step"
else
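    # split into 1M-line shards named bpe.monolingual.dedup.NN.de, which the
    # README then binarizes one shard at a time with fairseq-preprocess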
split --lines 1000000 --numeric-suffixes \
--additional-suffix .${LANG} \
$tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG} \
$OUTDIR/bpe.monolingual.dedup.
fi
#!/bin/bash
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt/subword_nmt
BPE_TOKENS=32000
URLS=(
"http://statmt.org/wmt13/training-parallel-europarl-v7.tgz"
"http://statmt.org/wmt13/training-parallel-commoncrawl.tgz"
"http://data.statmt.org/wmt18/translation-task/training-parallel-nc-v13.tgz"
"http://data.statmt.org/wmt18/translation-task/rapid2016.tgz"
"http://data.statmt.org/wmt17/translation-task/dev.tgz"
"http://statmt.org/wmt14/test-full.tgz"
)
FILES=(
"training-parallel-europarl-v7.tgz"
"training-parallel-commoncrawl.tgz"
"training-parallel-nc-v13.tgz"
"rapid2016.tgz"
"dev.tgz"
"test-full.tgz"
)
CORPORA=(
"training/europarl-v7.de-en"
"commoncrawl.de-en"
"training-parallel-nc-v13/news-commentary-v13.de-en"
"rapid2016.de-en"
)
if [ ! -d "$SCRIPTS" ]; then
echo "Please set SCRIPTS variable correctly to point to Moses scripts."
exit 1
fi
OUTDIR=wmt18_en_de
src=en
tgt=de
lang=en-de
prep=$OUTDIR
tmp=$prep/tmp
orig=orig
mkdir -p $orig $tmp $prep
cd $orig
for ((i=0;i<${#URLS[@]};++i)); do
file=${FILES[i]}
if [ -f $file ]; then
echo "$file already exists, skipping download"
else
url=${URLS[i]}
wget "$url"
if [ -f $file ]; then
echo "$url successfully downloaded."
else
echo "$url not successfully downloaded."
exit 1
fi
if [ ${file: -4} == ".tgz" ]; then
tar zxvf $file
elif [ ${file: -4} == ".tar" ]; then
tar xvf $file
fi
fi
done
cd ..
echo "pre-processing train data..."
for l in $src $tgt; do
rm $tmp/train.tags.$lang.tok.$l
for f in "${CORPORA[@]}"; do
cat $orig/$f.$l | \
perl $NORM_PUNC $l | \
perl $REM_NON_PRINT_CHAR | \
perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l
done
done
echo "pre-processing test data..."
for l in $src $tgt; do
if [ "$l" == "$src" ]; then
t="src"
else
t="ref"
fi
grep '<seg id' $orig/test-full/newstest2014-deen-$t.$l.sgm | \
sed -e 's/<seg id="[0-9]*">\s*//g' | \
sed -e 's/\s*<\/seg>\s*//g' | \
sed -e "s/\’/\'/g" | \
perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l
echo ""
done
echo "splitting train and valid..."
for l in $src $tgt; do
awk '{if (NR%100 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/valid.$l
awk '{if (NR%100 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l
done
TRAIN=$tmp/train.de-en
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
cat $tmp/train.$l >> $TRAIN
done
echo "learn_bpe.py on ${TRAIN}..."
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE
for L in $src $tgt; do
for f in train.$L valid.$L test.$L; do
echo "apply_bpe.py to ${f}..."
python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
done
done
perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250
perl $CLEAN -ratio 1.5 $tmp/bpe.valid $src $tgt $prep/valid 1 250
for L in $src $tgt; do
cp $tmp/bpe.test.$L $prep/test.$L
done