dcuai / dlexamples / Commits / c0f05c10

Commit c0f05c10, authored Nov 29, 2022 by hepj

Update transformer code

Parent: c056df78

Changes: 321. Showing 20 changed files with 2142 additions and 0 deletions (+2142, -0).
PyTorch/NLP/new-Transformer/docs/overview.rst (+74, -0)
PyTorch/NLP/new-Transformer/docs/requirements.txt (+2, -0)
PyTorch/NLP/new-Transformer/docs/tasks.rst (+61, -0)
PyTorch/NLP/new-Transformer/docs/tutorial_classifying_names.rst (+415, -0)
PyTorch/NLP/new-Transformer/docs/tutorial_simple_lstm.rst (+518, -0)
PyTorch/NLP/new-Transformer/env.sh (+39, -0)
PyTorch/NLP/new-Transformer/examples/.gitignore (+0, -0)
PyTorch/NLP/new-Transformer/examples/translation/README.md (+0, -0)
PyTorch/NLP/new-Transformer/examples/translation/prepare-iwslt14.sh (+0, -0)
PyTorch/NLP/new-Transformer/examples/translation/prepare-wmt14en2de.sh (+0, -0)
PyTorch/NLP/new-Transformer/examples/translation/prepare-wmt14en2fr.sh (+0, -0)
PyTorch/NLP/new-Transformer/fairseq/__init__.py (+45, -0)
PyTorch/NLP/new-Transformer/fairseq/benchmark/__init__.py (+7, -0)
PyTorch/NLP/new-Transformer/fairseq/benchmark/benchmark_multihead_attention.py (+172, -0)
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_dataset.py (+36, -0)
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_lm.py (+83, -0)
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_masked_lm.py (+94, -0)
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_model.py (+96, -0)
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_mt.py (+119, -0)
PyTorch/NLP/new-Transformer/fairseq/binarizer.py (+381, -0)
Too many changes to show. To preserve performance, only 321 of 321+ files are displayed.
PyTorch/NLP/new-Transformer/docs/overview.rst
0 → 100644
Overview
========

Fairseq can be extended through user-supplied `plug-ins
<https://en.wikipedia.org/wiki/Plug-in_(computing)>`_. We support five kinds of
plug-ins:

- :ref:`Models` define the neural network architecture and encapsulate all of the
  learnable parameters.
- :ref:`Criterions` compute the loss function given the model outputs and targets.
- :ref:`Tasks` store dictionaries and provide helpers for loading/iterating over
  Datasets, initializing the Model/Criterion and calculating the loss.
- :ref:`Optimizers` update the Model parameters based on the gradients.
- :ref:`Learning Rate Schedulers` update the learning rate over the course of
  training.

**Training Flow**

Given a ``model``, ``criterion``, ``task``, ``optimizer`` and ``lr_scheduler``,
fairseq implements the following high-level training flow::

    for epoch in range(num_epochs):
        itr = task.get_batch_iterator(task.dataset('train'))
        for num_updates, batch in enumerate(itr):
            task.train_step(batch, model, criterion, optimizer)
            average_and_clip_gradients()
            optimizer.step()
            lr_scheduler.step_update(num_updates)
        lr_scheduler.step(epoch)

where the default implementation for ``task.train_step`` is roughly::

    def train_step(self, batch, model, criterion, optimizer, **unused):
        loss = criterion(model, batch)
        optimizer.backward(loss)
        return loss

**Registering new plug-ins**

New plug-ins are *registered* through a set of ``@register`` function
decorators, for example::

    @register_model('my_lstm')
    class MyLSTM(FairseqEncoderDecoderModel):
        (...)

Once registered, new plug-ins can be used with the existing :ref:`Command-line
Tools`. See the Tutorial sections for more detailed walkthroughs of how to add
new plug-ins.

**Loading plug-ins from another directory**

New plug-ins can be defined in a custom module stored in the user system. In
order to import the module, and make the plugin available to *fairseq*, the
command line supports the ``--user-dir`` flag that can be used to specify a
custom location for additional modules to load into *fairseq*.

For example, assuming this directory tree::

    /home/user/my-module/
    └── __init__.py

with ``__init__.py``::

    from fairseq.models import register_model_architecture
    from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big

    @register_model_architecture('transformer', 'my_transformer')
    def transformer_mmt_big(args):
        transformer_vaswani_wmt_en_de_big(args)

it is possible to invoke the :ref:`fairseq-train` script with the new architecture with::

    fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
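
None of the files in this commit walk through a Criterion plug-in end to end, so
the following is only a rough, hedged sketch of the same registration pattern
applied to a criterion. The name ``toy_cross_entropy`` and the class body are
illustrative assumptions; only ``register_criterion``/``FairseqCriterion`` and
the model helpers (``get_normalized_probs``, ``get_targets``) come from fairseq
itself::

    import torch.nn.functional as F

    from fairseq.criterions import FairseqCriterion, register_criterion

    @register_criterion('toy_cross_entropy')
    class ToyCrossEntropyCriterion(FairseqCriterion):

        def forward(self, model, sample, reduce=True):
            # Run the model on the mini-batch prepared by the task.
            net_output = model(**sample['net_input'])
            lprobs = model.get_normalized_probs(net_output, log_probs=True)
            lprobs = lprobs.view(-1, lprobs.size(-1))
            target = model.get_targets(sample, net_output).view(-1)
            # Summed negative log-likelihood over non-padding target tokens.
            loss = F.nll_loss(
                lprobs, target,
                ignore_index=self.padding_idx,
                reduction='sum' if reduce else 'none',
            )
            sample_size = sample['ntokens']
            logging_output = {
                'loss': loss.data,
                'ntokens': sample['ntokens'],
                'sample_size': sample_size,
            }
            return loss, sample_size, logging_output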
PyTorch/NLP/new-Transformer/docs/requirements.txt
0 → 100644
sphinx<2.0
sphinx-argparse
PyTorch/NLP/new-Transformer/docs/tasks.rst
0 → 100644
.. role:: hidden
    :class: hidden-section

.. module:: fairseq.tasks

.. _Tasks:

Tasks
=====

Tasks store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.

Tasks can be selected via the ``--task`` command-line argument. Once selected, a
task may expose additional command-line arguments for further configuration.

Example usage::

    # setup the task (e.g., load dictionaries)
    task = fairseq.tasks.setup_task(args)

    # build model and criterion
    model = task.build_model(args)
    criterion = task.build_criterion(args)

    # load datasets
    task.load_dataset('train')
    task.load_dataset('valid')

    # iterate over mini-batches of data
    batch_itr = task.get_batch_iterator(
        task.dataset('train'), max_tokens=4096,
    )
    for batch in batch_itr:
        # compute the loss
        loss, sample_size, logging_output = task.get_loss(
            model, criterion, batch,
        )
        loss.backward()
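
The same selection works from the command line: once ``--task`` is given, that
task's extra arguments become available. A hedged illustration (the data
directory and the language pair are placeholders, and the remaining flags are
only one possible configuration):

.. code-block:: console

  > fairseq-train data-bin/my-corpus \
    --task translation --source-lang de --target-lang en \
    --arch transformer --optimizer adam --max-tokens 4096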
Translation
-----------

.. autoclass:: fairseq.tasks.translation.TranslationTask

.. _language modeling:

Language Modeling
-----------------

.. autoclass:: fairseq.tasks.language_modeling.LanguageModelingTask

Adding new tasks
----------------

.. autofunction:: fairseq.tasks.register_task
.. autoclass:: fairseq.tasks.FairseqTask
    :members:
    :undoc-members:
PyTorch/NLP/new-Transformer/docs/tutorial_classifying_names.rst
0 → 100644
Tutorial: Classifying Names with a Character-Level RNN
======================================================

In this tutorial we will extend fairseq to support *classification* tasks. In
particular we will re-implement the PyTorch tutorial for `Classifying Names
with a Character-Level RNN
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`_
in fairseq. It is recommended to quickly skim that tutorial before beginning
this one.

This tutorial covers:

1. **Preprocessing the data** to create dictionaries.
2. **Registering a new Model** that encodes an input sentence with a simple RNN
   and predicts the output label.
3. **Registering a new Task** that loads our dictionaries and dataset.
4. **Training the Model** using the existing command-line tools.
5. **Writing an evaluation script** that imports fairseq and allows us to
   interactively evaluate our model on new inputs.


1. Preprocessing the data
-------------------------

The original tutorial provides raw data, but we'll work with a modified version
of the data that is already tokenized into characters and split into separate
train, valid and test sets.

Download and extract the data from here:
`tutorial_names.tar.gz <https://dl.fbaipublicfiles.com/fairseq/data/tutorial_names.tar.gz>`_

Once extracted, let's preprocess the data using the :ref:`fairseq-preprocess`
command-line tool to create the dictionaries. While this tool is primarily
intended for sequence-to-sequence problems, we're able to reuse it here by
treating the label as a "target" sequence of length 1. We'll also output the
preprocessed files in "raw" format using the ``--dataset-impl`` option to
enhance readability:

.. code-block:: console

  > fairseq-preprocess \
    --trainpref names/train --validpref names/valid --testpref names/test \
    --source-lang input --target-lang label \
    --destdir names-bin --dataset-impl raw

After running the above command you should see a new directory,
:file:`names-bin/`, containing the dictionaries for *inputs* and *labels*.


2. Registering a new Model
--------------------------

Next we'll register a new model in fairseq that will encode an input sentence
with a simple RNN and predict the output label. Compared to the original PyTorch
tutorial, our version will also work with batches of data and GPU Tensors.

First let's copy the simple RNN module implemented in the `PyTorch tutorial
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network>`_.
Create a new file named :file:`fairseq/models/rnn_classifier.py` with the
following contents::

    import torch
    import torch.nn as nn

    class RNN(nn.Module):

        def __init__(self, input_size, hidden_size, output_size):
            super(RNN, self).__init__()

            self.hidden_size = hidden_size

            self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
            self.i2o = nn.Linear(input_size + hidden_size, output_size)
            self.softmax = nn.LogSoftmax(dim=1)

        def forward(self, input, hidden):
            combined = torch.cat((input, hidden), 1)
            hidden = self.i2h(combined)
            output = self.i2o(combined)
            output = self.softmax(output)
            return output, hidden

        def initHidden(self):
            return torch.zeros(1, self.hidden_size)

We must also *register* this model with fairseq using the
:func:`~fairseq.models.register_model` function decorator. Once the model is
registered we'll be able to use it with the existing :ref:`Command-line Tools`.

All registered models must implement the :class:`~fairseq.models.BaseFairseqModel`
interface, so we'll create a small wrapper class in the same file and register it
in fairseq with the name ``'rnn_classifier'``::

    from fairseq.models import BaseFairseqModel, register_model

    # Note: the register_model "decorator" should immediately precede the
    # definition of the Model class.

    @register_model('rnn_classifier')
    class FairseqRNNClassifier(BaseFairseqModel):

        @staticmethod
        def add_args(parser):
            # Models can override this method to add new command-line arguments.
            # Here we'll add a new command-line argument to configure the
            # dimensionality of the hidden state.
            parser.add_argument(
                '--hidden-dim', type=int, metavar='N',
                help='dimensionality of the hidden state',
            )

        @classmethod
        def build_model(cls, args, task):
            # Fairseq initializes models by calling the ``build_model()``
            # function. This provides more flexibility, since the returned model
            # instance can be of a different type than the one that was called.
            # In this case we'll just return a FairseqRNNClassifier instance.

            # Initialize our RNN module
            rnn = RNN(
                # We'll define the Task in the next section, but for now just
                # notice that the task holds the dictionaries for the "source"
                # (i.e., the input sentence) and "target" (i.e., the label).
                input_size=len(task.source_dictionary),
                hidden_size=args.hidden_dim,
                output_size=len(task.target_dictionary),
            )

            # Return the wrapped version of the module
            return FairseqRNNClassifier(
                rnn=rnn,
                input_vocab=task.source_dictionary,
            )

        def __init__(self, rnn, input_vocab):
            super(FairseqRNNClassifier, self).__init__()

            self.rnn = rnn
            self.input_vocab = input_vocab

            # The RNN module in the tutorial expects one-hot inputs, so we can
            # precompute the identity matrix to help convert from indices to
            # one-hot vectors. We register it as a buffer so that it is moved to
            # the GPU when ``cuda()`` is called.
            self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))

        def forward(self, src_tokens, src_lengths):
            # The inputs to the ``forward()`` function are determined by the
            # Task, and in particular the ``'net_input'`` key in each
            # mini-batch. We'll define the Task in the next section, but for
            # now just know that *src_tokens* has shape `(batch, src_len)` and
            # *src_lengths* has shape `(batch)`.
            bsz, max_src_len = src_tokens.size()

            # Initialize the RNN hidden state. Compared to the original PyTorch
            # tutorial we'll also handle batched inputs and work on the GPU.
            hidden = self.rnn.initHidden()
            hidden = hidden.repeat(bsz, 1)  # expand for batched inputs
            hidden = hidden.to(src_tokens.device)  # move to GPU

            for i in range(max_src_len):
                # WARNING: The inputs have padding, so we should mask those
                # elements here so that padding doesn't affect the results.
                # This is left as an exercise for the reader. The padding symbol
                # is given by ``self.input_vocab.pad()`` and the unpadded length
                # of each input is given by *src_lengths*.

                # One-hot encode a batch of input characters.
                input = self.one_hot_inputs[src_tokens[:, i].long()]

                # Feed the input to our RNN.
                output, hidden = self.rnn(input, hidden)

            # Return the final output state for making a prediction
            return output
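
The loop above leaves padding handling as an exercise. One hedged sketch of a
solution (not part of the original tutorial; it assumes right-padded inputs, as
configured by ``left_pad_source=False`` in the Task defined in the next section)
is to stop updating the hidden state once a sequence has run out of real tokens,
so padding can never change the final prediction::

    # Illustrative replacement for the loop in ``forward()`` above.
    output = None
    for i in range(max_src_len):
        input = self.one_hot_inputs[src_tokens[:, i].long()]
        step_output, step_hidden = self.rnn(input, hidden)
        # (batch, 1) mask: 1.0 while step i is a real token, 0.0 once we are
        # into the padding for that sequence (positions >= src_lengths).
        keep = (i < src_lengths).to(step_hidden.dtype).unsqueeze(1)
        hidden = keep * step_hidden + (1 - keep) * hidden
        output = step_output if output is None else \
            keep * step_output + (1 - keep) * output
    return output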
Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch pytorch_tutorial_rnn``::

    from fairseq.models import register_model_architecture

    # The first argument to ``register_model_architecture()`` should be the name
    # of the model we registered above (i.e., 'rnn_classifier'). The function we
    # register here should take a single argument *args* and modify it in-place
    # to match the desired architecture.

    @register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
    def pytorch_tutorial_rnn(args):
        # We use ``getattr()`` to prioritize arguments that are explicitly given
        # on the command-line, so that the defaults defined below are only used
        # when no other value has been specified.
        args.hidden_dim = getattr(args, 'hidden_dim', 128)


3. Registering a new Task
-------------------------

Now we'll register a new :class:`~fairseq.tasks.FairseqTask` that will load our
dictionaries and dataset. Tasks can also control how the data is batched into
mini-batches, but in this tutorial we'll reuse the batching provided by
:class:`fairseq.data.LanguagePairDataset`.

Create a new file named :file:`fairseq/tasks/simple_classification.py` with the
following contents::

    import os
    import torch

    from fairseq.data import Dictionary, LanguagePairDataset
    from fairseq.tasks import LegacyFairseqTask, register_task


    @register_task('simple_classification')
    class SimpleClassificationTask(LegacyFairseqTask):

        @staticmethod
        def add_args(parser):
            # Add some command-line arguments for specifying where the data is
            # located and the maximum supported input length.
            parser.add_argument('data', metavar='FILE',
                                help='file prefix for data')
            parser.add_argument('--max-positions', default=1024, type=int,
                                help='max input length')

        @classmethod
        def setup_task(cls, args, **kwargs):
            # Here we can perform any setup required for the task. This may include
            # loading Dictionaries, initializing shared Embedding layers, etc.
            # In this case we'll just load the Dictionaries.
            input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
            label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
            print('| [input] dictionary: {} types'.format(len(input_vocab)))
            print('| [label] dictionary: {} types'.format(len(label_vocab)))

            return SimpleClassificationTask(args, input_vocab, label_vocab)

        def __init__(self, args, input_vocab, label_vocab):
            super().__init__(args)
            self.input_vocab = input_vocab
            self.label_vocab = label_vocab

        def load_dataset(self, split, **kwargs):
            """Load a given dataset split (e.g., train, valid, test)."""

            prefix = os.path.join(self.args.data, '{}.input-label'.format(split))

            # Read input sentences.
            sentences, lengths = [], []
            with open(prefix + '.input', encoding='utf-8') as file:
                for line in file:
                    sentence = line.strip()

                    # Tokenize the sentence, splitting on spaces
                    tokens = self.input_vocab.encode_line(
                        sentence, add_if_not_exist=False,
                    )

                    sentences.append(tokens)
                    lengths.append(tokens.numel())

            # Read labels.
            labels = []
            with open(prefix + '.label', encoding='utf-8') as file:
                for line in file:
                    label = line.strip()
                    labels.append(
                        # Convert label to a numeric ID.
                        torch.LongTensor([self.label_vocab.add_symbol(label)])
                    )

            assert len(sentences) == len(labels)
            print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))

            # We reuse LanguagePairDataset since classification can be modeled as a
            # sequence-to-sequence task where the target sequence has length 1.
            self.datasets[split] = LanguagePairDataset(
                src=sentences,
                src_sizes=lengths,
                src_dict=self.input_vocab,
                tgt=labels,
                tgt_sizes=torch.ones(len(labels)),  # targets have length 1
                tgt_dict=self.label_vocab,
                left_pad_source=False,
                # Since our target is a single class label, there's no need for
                # teacher forcing. If we set this to ``True`` then our Model's
                # ``forward()`` method would receive an additional argument called
                # *prev_output_tokens* that would contain a shifted version of the
                # target sequence.
                input_feeding=False,
            )

        def max_positions(self):
            """Return the max input length allowed by the task."""
            # The source should be less than *args.max_positions* and the "target"
            # has max length 1.
            return (self.args.max_positions, 1)

        @property
        def source_dictionary(self):
            """Return the source :class:`~fairseq.data.Dictionary`."""
            return self.input_vocab

        @property
        def target_dictionary(self):
            """Return the target :class:`~fairseq.data.Dictionary`."""
            return self.label_vocab

        # We could override this method if we wanted more control over how batches
        # are constructed, but it's not necessary for this tutorial since we can
        # reuse the batching provided by LanguagePairDataset.
        #
        # def get_batch_iterator(
        #     self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
        #     ignore_invalid_inputs=False, required_batch_size_multiple=1,
        #     seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
        #     data_buffer_size=0, disable_iterator_cache=False,
        # ):
        #     (...)


4. Training the Model
---------------------

Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Task (``--task
simple_classification``) and Model architecture (``--arch pytorch_tutorial_rnn``):

.. note::

  You can also configure the dimensionality of the hidden state by passing the
  ``--hidden-dim`` argument to :ref:`fairseq-train`.

.. code-block:: console

  > fairseq-train names-bin \
    --task simple_classification \
    --arch pytorch_tutorial_rnn \
    --optimizer adam --lr 0.001 --lr-shrink 0.5 \
    --max-tokens 1000
    (...)
    | epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
    | epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
    | done training in 31.6 seconds

The model files should appear in the :file:`checkpoints/` directory.


5. Writing an evaluation script
-------------------------------

Finally we can write a short script to evaluate our model on new inputs. Create
a new file named :file:`eval_classifier.py` with the following contents::

    from fairseq import checkpoint_utils, data, options, tasks

    # Parse command-line arguments for generation
    parser = options.get_generation_parser(default_task='simple_classification')
    args = options.parse_args_and_arch(parser)

    # Setup task
    task = tasks.setup_task(args)

    # Load model
    print('| loading model from {}'.format(args.path))
    models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
    model = models[0]

    while True:
        sentence = input('\nInput: ')

        # Tokenize into characters
        chars = ' '.join(list(sentence.strip()))
        tokens = task.source_dictionary.encode_line(
            chars, add_if_not_exist=False,
        )

        # Build mini-batch to feed to the model
        batch = data.language_pair_dataset.collate(
            samples=[{'id': -1, 'source': tokens}],  # bsz = 1
            pad_idx=task.source_dictionary.pad(),
            eos_idx=task.source_dictionary.eos(),
            left_pad_source=False,
            input_feeding=False,
        )

        # Feed batch to the model and get predictions
        preds = model(**batch['net_input'])

        # Print top 3 predictions and their log-probabilities
        top_scores, top_labels = preds[0].topk(k=3)
        for score, label_idx in zip(top_scores, top_labels):
            label_name = task.target_dictionary.string([label_idx])
            print('({:.2f})\t{}'.format(score, label_name))

Now we can evaluate our model interactively. Note that we have included the
original data path (:file:`names-bin/`) so that the dictionaries can be loaded:

.. code-block:: console

  > python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
  | [input] dictionary: 64 types
  | [label] dictionary: 24 types
  | loading model from checkpoints/checkpoint_best.pt

  Input: Satoshi
  (-0.61) Japanese
  (-1.20) Arabic
  (-2.86) Italian

  Input: Sinbad
  (-0.30) Arabic
  (-1.76) English
  (-4.08) Russian
PyTorch/NLP/new-Transformer/docs/tutorial_simple_lstm.rst
0 → 100644
Tutorial: Simple LSTM
=====================

In this tutorial we will extend fairseq by adding a new
:class:`~fairseq.models.FairseqEncoderDecoderModel` that encodes a source
sentence with an LSTM and then passes the final hidden state to a second LSTM
that decodes the target sentence (without attention).

This tutorial covers:

1. **Writing an Encoder and Decoder** to encode/decode the source/target
   sentence, respectively.
2. **Registering a new Model** so that it can be used with the existing
   :ref:`Command-line tools`.
3. **Training the Model** using the existing command-line tools.
4. **Making generation faster** by modifying the Decoder to use
   :ref:`Incremental decoding`.


1. Building an Encoder and Decoder
----------------------------------

In this section we'll define a simple LSTM Encoder and Decoder. All Encoders
should implement the :class:`~fairseq.models.FairseqEncoder` interface and
Decoders should implement the :class:`~fairseq.models.FairseqDecoder` interface.
These interfaces themselves extend :class:`torch.nn.Module`, so FairseqEncoders
and FairseqDecoders can be written and used in the same ways as ordinary PyTorch
Modules.

Encoder
~~~~~~~

Our Encoder will embed the tokens in the source sentence, feed them to a
:class:`torch.nn.LSTM` and return the final hidden state. To create our encoder
save the following in a new file named :file:`fairseq/models/simple_lstm.py`::

    import torch.nn as nn
    from fairseq import utils
    from fairseq.models import FairseqEncoder

    class SimpleLSTMEncoder(FairseqEncoder):

        def __init__(
            self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1,
        ):
            super().__init__(dictionary)
            self.args = args

            # Our encoder will embed the inputs before feeding them to the LSTM.
            self.embed_tokens = nn.Embedding(
                num_embeddings=len(dictionary),
                embedding_dim=embed_dim,
                padding_idx=dictionary.pad(),
            )
            self.dropout = nn.Dropout(p=dropout)

            # We'll use a single-layer, unidirectional LSTM for simplicity.
            self.lstm = nn.LSTM(
                input_size=embed_dim,
                hidden_size=hidden_dim,
                num_layers=1,
                bidirectional=False,
                batch_first=True,
            )

        def forward(self, src_tokens, src_lengths):
            # The inputs to the ``forward()`` function are determined by the
            # Task, and in particular the ``'net_input'`` key in each
            # mini-batch. We discuss Tasks in the next tutorial, but for now just
            # know that *src_tokens* has shape `(batch, src_len)` and *src_lengths*
            # has shape `(batch)`.

            # Note that the source is typically padded on the left. This can be
            # configured by adding the `--left-pad-source "False"` command-line
            # argument, but here we'll make the Encoder handle either kind of
            # padding by converting everything to be right-padded.
            if self.args.left_pad_source:
                # Convert left-padding to right-padding.
                src_tokens = utils.convert_padding_direction(
                    src_tokens,
                    padding_idx=self.dictionary.pad(),
                    left_to_right=True
                )

            # Embed the source.
            x = self.embed_tokens(src_tokens)

            # Apply dropout.
            x = self.dropout(x)

            # Pack the sequence into a PackedSequence object to feed to the LSTM.
            x = nn.utils.rnn.pack_padded_sequence(x, src_lengths, batch_first=True)

            # Get the output from the LSTM.
            _outputs, (final_hidden, _final_cell) = self.lstm(x)

            # Return the Encoder's output. This can be any object and will be
            # passed directly to the Decoder.
            return {
                # this will have shape `(bsz, hidden_dim)`
                'final_hidden': final_hidden.squeeze(0),
            }

        # Encoders are required to implement this method so that we can rearrange
        # the order of the batch elements during inference (e.g., beam search).
        def reorder_encoder_out(self, encoder_out, new_order):
            """
            Reorder encoder output according to `new_order`.

            Args:
                encoder_out: output from the ``forward()`` method
                new_order (LongTensor): desired order

            Returns:
                `encoder_out` rearranged according to `new_order`
            """
            final_hidden = encoder_out['final_hidden']
            return {
                'final_hidden': final_hidden.index_select(0, new_order),
            }

Decoder
~~~~~~~

Our Decoder will predict the next word, conditioned on the Encoder's final
hidden state and an embedded representation of the previous target word -- which
is sometimes called *teacher forcing*. More specifically, we'll use a
:class:`torch.nn.LSTM` to produce a sequence of hidden states that we'll project
to the size of the output vocabulary to predict each target word.

::

    import torch
    from fairseq.models import FairseqDecoder

    class SimpleLSTMDecoder(FairseqDecoder):

        def __init__(
            self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
            dropout=0.1,
        ):
            super().__init__(dictionary)

            # Our decoder will embed the inputs before feeding them to the LSTM.
            self.embed_tokens = nn.Embedding(
                num_embeddings=len(dictionary),
                embedding_dim=embed_dim,
                padding_idx=dictionary.pad(),
            )
            self.dropout = nn.Dropout(p=dropout)

            # We'll use a single-layer, unidirectional LSTM for simplicity.
            self.lstm = nn.LSTM(
                # For the first layer we'll concatenate the Encoder's final hidden
                # state with the embedded target tokens.
                input_size=encoder_hidden_dim + embed_dim,
                hidden_size=hidden_dim,
                num_layers=1,
                bidirectional=False,
            )

            # Define the output projection.
            self.output_projection = nn.Linear(hidden_dim, len(dictionary))

        # During training Decoders are expected to take the entire target sequence
        # (shifted right by one position) and produce logits over the vocabulary.
        # The *prev_output_tokens* tensor begins with the end-of-sentence symbol,
        # ``dictionary.eos()``, followed by the target sequence.
        def forward(self, prev_output_tokens, encoder_out):
            """
            Args:
                prev_output_tokens (LongTensor): previous decoder outputs of shape
                    `(batch, tgt_len)`, for teacher forcing
                encoder_out (Tensor, optional): output from the encoder, used for
                    encoder-side attention

            Returns:
                tuple:
                    - the last decoder layer's output of shape
                      `(batch, tgt_len, vocab)`
                    - the last decoder layer's attention weights of shape
                      `(batch, tgt_len, src_len)`
            """
            bsz, tgt_len = prev_output_tokens.size()

            # Extract the final hidden state from the Encoder.
            final_encoder_hidden = encoder_out['final_hidden']

            # Embed the target sequence, which has been shifted right by one
            # position and now starts with the end-of-sentence symbol.
            x = self.embed_tokens(prev_output_tokens)

            # Apply dropout.
            x = self.dropout(x)

            # Concatenate the Encoder's final hidden state to *every* embedded
            # target token.
            x = torch.cat(
                [x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
                dim=2,
            )

            # Using PackedSequence objects in the Decoder is harder than in the
            # Encoder, since the targets are not sorted in descending length order,
            # which is a requirement of ``pack_padded_sequence()``. Instead we'll
            # feed nn.LSTM directly.
            initial_state = (
                final_encoder_hidden.unsqueeze(0),  # hidden
                torch.zeros_like(final_encoder_hidden).unsqueeze(0),  # cell
            )
            output, _ = self.lstm(
                x.transpose(0, 1),  # convert to shape `(tgt_len, bsz, dim)`
                initial_state,
            )
            x = output.transpose(0, 1)  # convert to shape `(bsz, tgt_len, hidden)`

            # Project the outputs to the size of the vocabulary.
            x = self.output_projection(x)

            # Return the logits and ``None`` for the attention weights
            return x, None


2. Registering the Model
------------------------

Now that we've defined our Encoder and Decoder we must *register* our model with
fairseq using the :func:`~fairseq.models.register_model` function decorator.
Once the model is registered we'll be able to use it with the existing
:ref:`Command-line Tools`.

All registered models must implement the
:class:`~fairseq.models.BaseFairseqModel` interface. For sequence-to-sequence
models (i.e., any model with a single Encoder and Decoder), we can instead
implement the :class:`~fairseq.models.FairseqEncoderDecoderModel` interface.

Create a small wrapper class in the same file and register it in fairseq with
the name ``'simple_lstm'``::

    from fairseq.models import FairseqEncoderDecoderModel, register_model

    # Note: the register_model "decorator" should immediately precede the
    # definition of the Model class.

    @register_model('simple_lstm')
    class SimpleLSTMModel(FairseqEncoderDecoderModel):

        @staticmethod
        def add_args(parser):
            # Models can override this method to add new command-line arguments.
            # Here we'll add some new command-line arguments to configure dropout
            # and the dimensionality of the embeddings and hidden states.
            parser.add_argument(
                '--encoder-embed-dim', type=int, metavar='N',
                help='dimensionality of the encoder embeddings',
            )
            parser.add_argument(
                '--encoder-hidden-dim', type=int, metavar='N',
                help='dimensionality of the encoder hidden state',
            )
            parser.add_argument(
                '--encoder-dropout', type=float, default=0.1,
                help='encoder dropout probability',
            )
            parser.add_argument(
                '--decoder-embed-dim', type=int, metavar='N',
                help='dimensionality of the decoder embeddings',
            )
            parser.add_argument(
                '--decoder-hidden-dim', type=int, metavar='N',
                help='dimensionality of the decoder hidden state',
            )
            parser.add_argument(
                '--decoder-dropout', type=float, default=0.1,
                help='decoder dropout probability',
            )

        @classmethod
        def build_model(cls, args, task):
            # Fairseq initializes models by calling the ``build_model()``
            # function. This provides more flexibility, since the returned model
            # instance can be of a different type than the one that was called.
            # In this case we'll just return a SimpleLSTMModel instance.

            # Initialize our Encoder and Decoder.
            encoder = SimpleLSTMEncoder(
                args=args,
                dictionary=task.source_dictionary,
                embed_dim=args.encoder_embed_dim,
                hidden_dim=args.encoder_hidden_dim,
                dropout=args.encoder_dropout,
            )
            decoder = SimpleLSTMDecoder(
                dictionary=task.target_dictionary,
                encoder_hidden_dim=args.encoder_hidden_dim,
                embed_dim=args.decoder_embed_dim,
                hidden_dim=args.decoder_hidden_dim,
                dropout=args.decoder_dropout,
            )
            model = SimpleLSTMModel(encoder, decoder)

            # Print the model architecture.
            print(model)

            return model

        # We could override the ``forward()`` if we wanted more control over how
        # the encoder and decoder interact, but it's not necessary for this
        # tutorial since we can inherit the default implementation provided by
        # the FairseqEncoderDecoderModel base class, which looks like:
        #
        # def forward(self, src_tokens, src_lengths, prev_output_tokens):
        #     encoder_out = self.encoder(src_tokens, src_lengths)
        #     decoder_out = self.decoder(prev_output_tokens, encoder_out)
        #     return decoder_out

Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch tutorial_simple_lstm``::

    from fairseq.models import register_model_architecture

    # The first argument to ``register_model_architecture()`` should be the name
    # of the model we registered above (i.e., 'simple_lstm'). The function we
    # register here should take a single argument *args* and modify it in-place
    # to match the desired architecture.

    @register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
    def tutorial_simple_lstm(args):
        # We use ``getattr()`` to prioritize arguments that are explicitly given
        # on the command-line, so that the defaults defined below are only used
        # when no other value has been specified.
        args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
        args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
        args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
        args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)


3. Training the Model
---------------------

Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Model architecture
(``--arch tutorial_simple_lstm``).

.. note::

  Make sure you've already preprocessed the data from the IWSLT example in the
  :file:`examples/translation/` directory.

.. code-block:: console

  > fairseq-train data-bin/iwslt14.tokenized.de-en \
    --arch tutorial_simple_lstm \
    --encoder-dropout 0.2 --decoder-dropout 0.2 \
    --optimizer adam --lr 0.005 --lr-shrink 0.5 \
    --max-tokens 12000
    (...)
    | epoch 052 | loss 4.027 | ppl 16.30 | wps 420805 | ups 39.7 | wpb 9841 | bsz 400 | num_updates 20852 | lr 1.95313e-05 | gnorm 0.218 | clip 0% | oom 0 | wall 529 | train_wall 396
    | epoch 052 | valid on 'valid' subset | valid_loss 4.74989 | valid_ppl 26.91 | num_updates 20852 | best 4.74954

The model files should appear in the :file:`checkpoints/` directory. While this
model architecture is not very good, we can use the :ref:`fairseq-generate`
script to generate translations and compute our BLEU score over the test set:

.. code-block:: console

  > fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 \
    --remove-bpe
    (...)
    | Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
    | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)


4. Making generation faster
---------------------------

While autoregressive generation from sequence-to-sequence models is inherently
slow, our implementation above is especially slow because it recomputes the
entire sequence of Decoder hidden states for every output token (i.e., it is
``O(n^2)``). We can make this significantly faster by instead caching the
previous hidden states.

In fairseq this is called :ref:`Incremental decoding`. Incremental decoding is a
special mode at inference time where the Model only receives a single timestep
of input corresponding to the immediately previous output token (for teacher
forcing) and must produce the next output incrementally. Thus the model must
cache any long-term state that is needed about the sequence, e.g., hidden
states, convolutional states, etc.

To implement incremental decoding we will modify our model to implement the
:class:`~fairseq.models.FairseqIncrementalDecoder` interface. Compared to the
standard :class:`~fairseq.models.FairseqDecoder` interface, the incremental
decoder interface allows ``forward()`` methods to take an extra keyword argument
(*incremental_state*) that can be used to cache state across time-steps.

Let's replace our ``SimpleLSTMDecoder`` with an incremental one::

    import torch
    from fairseq.models import FairseqIncrementalDecoder

    class SimpleLSTMDecoder(FairseqIncrementalDecoder):

        def __init__(
            self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
            dropout=0.1,
        ):
            # This remains the same as before.
            super().__init__(dictionary)
            self.embed_tokens = nn.Embedding(
                num_embeddings=len(dictionary),
                embedding_dim=embed_dim,
                padding_idx=dictionary.pad(),
            )
            self.dropout = nn.Dropout(p=dropout)
            self.lstm = nn.LSTM(
                input_size=encoder_hidden_dim + embed_dim,
                hidden_size=hidden_dim,
                num_layers=1,
                bidirectional=False,
            )
            self.output_projection = nn.Linear(hidden_dim, len(dictionary))

        # We now take an additional kwarg (*incremental_state*) for caching the
        # previous hidden and cell states.
        def forward(self, prev_output_tokens, encoder_out, incremental_state=None):
            if incremental_state is not None:
                # If the *incremental_state* argument is not ``None`` then we are
                # in incremental inference mode. While *prev_output_tokens* will
                # still contain the entire decoded prefix, we will only use the
                # last step and assume that the rest of the state is cached.
                prev_output_tokens = prev_output_tokens[:, -1:]

            # This remains the same as before.
            bsz, tgt_len = prev_output_tokens.size()
            final_encoder_hidden = encoder_out['final_hidden']
            x = self.embed_tokens(prev_output_tokens)
            x = self.dropout(x)
            x = torch.cat(
                [x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
                dim=2,
            )

            # We will now check the cache and load the cached previous hidden and
            # cell states, if they exist, otherwise we will initialize them to
            # zeros (as before). We will use the ``utils.get_incremental_state()``
            # and ``utils.set_incremental_state()`` helpers.
            initial_state = utils.get_incremental_state(
                self, incremental_state, 'prev_state',
            )
            if initial_state is None:
                # first time initialization, same as the original version
                initial_state = (
                    final_encoder_hidden.unsqueeze(0),  # hidden
                    torch.zeros_like(final_encoder_hidden).unsqueeze(0),  # cell
                )

            # Run one step of our LSTM.
            output, latest_state = self.lstm(x.transpose(0, 1), initial_state)

            # Update the cache with the latest hidden and cell states.
            utils.set_incremental_state(
                self, incremental_state, 'prev_state', latest_state,
            )

            # This remains the same as before
            x = output.transpose(0, 1)
            x = self.output_projection(x)
            return x, None

        # The ``FairseqIncrementalDecoder`` interface also requires implementing a
        # ``reorder_incremental_state()`` method, which is used during beam search
        # to select and reorder the incremental state.
        def reorder_incremental_state(self, incremental_state, new_order):
            # Load the cached state.
            prev_state = utils.get_incremental_state(
                self, incremental_state, 'prev_state',
            )

            # Reorder batches according to *new_order*.
            reordered_state = (
                prev_state[0].index_select(1, new_order),  # hidden
                prev_state[1].index_select(1, new_order),  # cell
            )

            # Update the cached state.
            utils.set_incremental_state(
                self, incremental_state, 'prev_state', reordered_state,
            )

Finally, we can rerun generation and observe the speedup:

.. code-block:: console

  # Before

  > fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 \
    --remove-bpe
    (...)
    | Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
    | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)

  # After

  > fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 \
    --remove-bpe
    (...)
    | Translated 6750 sentences (153132 tokens) in 5.5s (1225.54 sentences/s, 27802.94 tokens/s)
    | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
PyTorch/NLP/new-Transformer/env.sh
0 → 100644
#module load compiler/intel/2021.3.0
export ROCM_PATH=/work/home/hepj/app/dtk-22.04.2
echo $ROCM_PATH
export HIP_PATH=${ROCM_PATH}/hip
export AMDGPU_TARGETS="gfx900;gfx906"
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PATH
export LD_LIBRARY_PATH=${ROCM_PATH}/lib:${ROCM_PATH}/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${ROCM_PATH}/hip/lib:${ROCM_PATH}/llvm/lib:${ROCM_PATH}/opencl/lib/x86_64:$LD_LIBRARY_PATH
#export LD_LIBRARY_PATH=${ROCM_PATH}/hip/lib:${ROCM_PATH}/llvm/lib:$LD_LIBRARY_PATH
#export C_INCLUDE_PATH=${ROCM_PATH}/include:${ROCM_PATH}/llvm/include${C_INCLUDE_PATH:+:${C_INCLUDE_PATH}}
export C_INCLUDE_PATH=${ROCM_PATH}/include:${ROCM_PATH}/llvm/include:/opencl/include
export CPLUS_INCLUDE_PATH=${ROCM_PATH}/include:${ROCM_PATH}/llvm/include
export PATH=${ROCM_PATH}/miopen/bin:${ROCM_PATH}/rocblas/bin:${ROCM_PATH}/hipsparse/bin:$PATH
export LD_LIBRARY_PATH=${ROCM_PATH}/miopen/lib:${ROCM_PATH}/rocblas/lib:$LD_LIBRARY_PATH
export MIOPEN_SYSTEM_DB_PATH=${ROCM_PATH}/miopen/share/miopen/db/
export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/lib64:$LIBRARY_PATH
export C_INCLUDE_PATH=/public/software/apps/deeplearning-depend/gflags-2.1.2-build/include:/public/software/apps/DeepLearning/PyTorch/glog-build/include:$C_INCLUDE_PATH
export DEEP_PATH=/public/software/apps/deeplearning-depend
export LD_LIBRARY_PATH=/work/home/hepj/.pyenv/versions/3.7.0/envs/torch/lib/python3.7/site-packages/Pillow.libs/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/public/software/apps/deeplearning-depend/lmdb-0.9.24-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/public/software/apps/deeplearning-depend/opencv-2.4.13.6-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/glog-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/opencv-2.4.13.6-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/openblas-0.3.7-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/gflags-2.1.2-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/public/software/apps/DeepLearning/PyTorch/openmp-build/lib:$LD_LIBRARY_PATH
# paths added for using rocblas
export LD_LIBRARY_PATH=/work/home/hepj/app/dtk-22.04.2/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/work/home/hepj/app/dtk-22.04.2/rocblas/lib/benchmark_tool:$LD_LIBRARY_PATH
\ No newline at end of file
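
A hedged usage sketch (every DTK/ROCm and Python path in the script is specific
to this cluster, so treat them as assumptions on any other machine): source the
script in the job shell before launching training, then confirm that PyTorch can
see the DCU devices:

.. code-block:: console

  > source env.sh
  > python -c "import torch; print(torch.__version__, torch.cuda.is_available())"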
PyTorch/NLP/Transformer/examples/.gitignore → PyTorch/NLP/new-Transformer/examples/.gitignore (file moved)
PyTorch/NLP/Transformer/examples/translation/README.md → PyTorch/NLP/new-Transformer/examples/translation/README.md (file moved)
PyTorch/NLP/Transformer/examples/translation/prepare-iwslt14.sh → PyTorch/NLP/new-Transformer/examples/translation/prepare-iwslt14.sh (file moved)
PyTorch/NLP/Transformer/examples/translation/prepare-wmt14en2de.sh → PyTorch/NLP/new-Transformer/examples/translation/prepare-wmt14en2de.sh (file moved)
PyTorch/NLP/Transformer/examples/translation/prepare-wmt14en2fr.sh → PyTorch/NLP/new-Transformer/examples/translation/prepare-wmt14en2fr.sh (file moved)
PyTorch/NLP/new-Transformer/fairseq/__init__.py
0 → 100644
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""isort:skip_file"""

import os
import sys

try:
    from .version import __version__  # noqa
except ImportError:
    version_txt = os.path.join(os.path.dirname(__file__), "version.txt")
    with open(version_txt) as f:
        __version__ = f.read().strip()

__all__ = ["pdb"]

# backwards compatibility to support `from fairseq.X import Y`
from fairseq.distributed import utils as distributed_utils
from fairseq.logging import meters, metrics, progress_bar  # noqa

sys.modules["fairseq.distributed_utils"] = distributed_utils
sys.modules["fairseq.meters"] = meters
sys.modules["fairseq.metrics"] = metrics
sys.modules["fairseq.progress_bar"] = progress_bar

# initialize hydra
from fairseq.dataclass.initialize import hydra_init

hydra_init()

import fairseq.criterions  # noqa
import fairseq.distributed  # noqa
import fairseq.models  # noqa
import fairseq.modules  # noqa
import fairseq.optim  # noqa
import fairseq.optim.lr_scheduler  # noqa
import fairseq.pdb  # noqa
import fairseq.scoring  # noqa
import fairseq.tasks  # noqa
import fairseq.token_generation_constraints  # noqa

import fairseq.benchmark  # noqa
import fairseq.model_parallel  # noqa
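
The ``sys.modules`` assignments above are what keep pre-refactor import paths
working after the metering and distributed helpers moved into subpackages. As a
small hedged sketch of what they enable (``AverageMeter`` is used here as an
assumed member of ``fairseq.logging.meters``)::

    # Both spellings resolve to the same class, because the legacy module name
    # "fairseq.meters" is aliased to fairseq.logging.meters at import time.
    from fairseq.logging.meters import AverageMeter as new_style
    from fairseq.meters import AverageMeter as old_style  # legacy path

    assert new_style is old_style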
PyTorch/NLP/new-Transformer/fairseq/benchmark/__init__.py
0 → 100644
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

# import models/tasks to register them
from . import dummy_dataset, dummy_lm, dummy_masked_lm, dummy_model, dummy_mt  # noqa
PyTorch/NLP/new-Transformer/fairseq/benchmark/benchmark_multihead_attention.py
0 → 100644
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import itertools
import random

import torch
from torch.utils import benchmark

from fairseq.modules.multihead_attention import MultiheadAttention

BATCH = [20, 41, 97]
SEQ = 64
EMB = 48
HEADS = 4
DROP = 0.1
DEVICE = torch.device("cuda")
ATTN_MASK_DTYPE = [torch.uint8, torch.bool, torch.float]
KEY_PADDING_MASK_DTYPE = [torch.uint8, torch.bool]


def _reset_seeds():
    torch.manual_seed(0)
    random.seed(0)


def _get_mask(to_dtype: torch.dtype, dim0: int, dim1: int):
    if to_dtype == torch.float:
        mask = torch.randint(0, 2, (dim0, dim1)).to(dtype=torch.bool)
        return mask.to(dtype=to_dtype).masked_fill(mask, -float("inf"))
    return torch.randint(0, 2, (dim0, dim1)).to(dtype=to_dtype)


def benchmark_multihead_attention(
    label="",
    attn_dtype=torch.uint8,
    key_padding_dtype=torch.uint8,
    add_bias_kv=False,
    add_zero_attn=False,
    static_kv=False,
    batch_size=20,
    embedding=EMB,
    seq_len=SEQ,
    num_heads=HEADS,
):

    results = []
    # device = torch.device("cuda")

    xformers_att_config = '{"name": "scaled_dot_product"}'

    attn_mask = _get_mask(to_dtype=attn_dtype, dim0=seq_len, dim1=seq_len)
    key_padding_mask = _get_mask(
        to_dtype=key_padding_dtype, dim0=batch_size, dim1=seq_len
    )

    q = torch.rand(seq_len, batch_size, embedding, requires_grad=True)
    k = torch.rand(seq_len, batch_size, embedding, requires_grad=True)
    v = torch.rand(seq_len, batch_size, embedding, requires_grad=True)

    _reset_seeds()

    original_mha = MultiheadAttention(
        embedding,
        num_heads,
        dropout=0.0,
        xformers_att_config=None,
        add_bias_kv=add_bias_kv,
        add_zero_attn=add_zero_attn,
    )

    xformers_mha = MultiheadAttention(
        embedding,
        num_heads,
        dropout=0.0,
        xformers_att_config=xformers_att_config,
        add_bias_kv=add_bias_kv,
        add_zero_attn=add_zero_attn,
    )

    def original_bench_fw(q, k, v, key_padding_mask, attn_mask, static_kv):
        original_mha(
            query=q,
            key=k,
            value=v,
            key_padding_mask=key_padding_mask,
            attn_mask=attn_mask,
            static_kv=static_kv,
        )

    def xformers_bench_fw(q, k, v, key_padding_mask, attn_mask, static_kv):
        xformers_mha(
            query=q,
            key=k,
            value=v,
            key_padding_mask=key_padding_mask,
            attn_mask=attn_mask,
            static_kv=static_kv,
        )

    def original_bench_fw_bw(q, k, v, key_padding_mask, attn_mask, static_kv):
        output, _ = original_mha(
            query=q,
            key=k,
            value=v,
            key_padding_mask=key_padding_mask,
            attn_mask=attn_mask,
            static_kv=static_kv,
        )
        loss = torch.norm(output)
        loss.backward()

    def xformers_bench_fw_bw(q, k, v, key_padding_mask, attn_mask, static_kv):
        output, _ = xformers_mha(
            query=q,
            key=k,
            value=v,
            key_padding_mask=key_padding_mask,
            attn_mask=attn_mask,
            static_kv=static_kv,
        )
        loss = torch.norm(output)
        loss.backward()

    fns = [
        original_bench_fw,
        xformers_bench_fw,
        original_bench_fw_bw,
        xformers_bench_fw_bw,
    ]

    for fn in fns:
        results.append(
            benchmark.Timer(
                stmt="fn(q, k, v, key_padding_mask, attn_mask, static_kv)",
                globals={
                    "q": q,
                    "k": k,
                    "v": v,
                    "key_padding_mask": key_padding_mask,
                    "attn_mask": attn_mask,
                    "static_kv": static_kv,
                    "fn": fn,
                },
                label="multihead fw + bw",
                sub_label=f"{fn.__name__}",
                description=label,
            ).blocked_autorange(min_run_time=1)
        )

    compare = benchmark.Compare(results)
    compare.print()


def run_benchmarks():
    for attn_dtype, key_padding_dtype, add_bias_kv, add_zero_attn in itertools.product(
        ATTN_MASK_DTYPE, KEY_PADDING_MASK_DTYPE, [True, False], [True, False]
    ):
        label = f"attn_dtype {attn_dtype}, key_padding_dtype {key_padding_dtype}, \
            add_bias_kv {add_bias_kv}, add_zero_attn {add_zero_attn}"
        benchmark_multihead_attention(
            label=label,
            attn_dtype=attn_dtype,
            key_padding_dtype=key_padding_dtype,
            add_bias_kv=add_bias_kv,
            add_zero_attn=add_zero_attn,
        )


run_benchmarks()
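
Because the module ends with an unconditional ``run_benchmarks()`` call, the
whole sweep executes at import time. A hedged way to reproduce the comparison
(assuming a GPU build of PyTorch and an xformers installation that fairseq can
see) is simply:

.. code-block:: console

  > python -m fairseq.benchmark.benchmark_multihead_attention

``benchmark.Compare`` then prints forward and forward+backward timings for the
stock ``MultiheadAttention`` against the xformers-backed variant for every
mask-dtype combination in the sweep.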
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_dataset.py
0 → 100644
import numpy as np

from fairseq.data import FairseqDataset


class DummyDataset(FairseqDataset):
    def __init__(self, batch, num_items, item_size):
        super().__init__()
        self.batch = batch
        self.num_items = num_items
        self.item_size = item_size

    def __getitem__(self, index):
        return index

    def __len__(self):
        return self.num_items

    def collater(self, samples):
        return self.batch

    @property
    def sizes(self):
        return np.array([self.item_size] * self.num_items)

    def num_tokens(self, index):
        return self.item_size

    def size(self, index):
        return self.item_size

    def ordered_indices(self):
        return np.arange(self.num_items)

    @property
    def supports_prefetch(self):
        return False
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_lm.py
0 → 100644
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import logging
from dataclasses import dataclass, field
from typing import Optional

import torch

from .dummy_dataset import DummyDataset
from fairseq.data import Dictionary
from fairseq.dataclass import FairseqDataclass
from fairseq.tasks import FairseqTask, register_task
from omegaconf import II

logger = logging.getLogger(__name__)


@dataclass
class DummyLMConfig(FairseqDataclass):
    dict_size: int = 49996
    dataset_size: int = 100000
    tokens_per_sample: int = field(
        default=512, metadata={"help": "max sequence length"}
    )
    add_bos_token: bool = False
    batch_size: Optional[int] = II("dataset.batch_size")
    max_tokens: Optional[int] = II("dataset.max_tokens")
    max_target_positions: int = II("task.tokens_per_sample")


@register_task("dummy_lm", dataclass=DummyLMConfig)
class DummyLMTask(FairseqTask):
    def __init__(self, cfg: DummyLMConfig):
        super().__init__(cfg)

        # load dictionary
        self.dictionary = Dictionary()
        for i in range(cfg.dict_size):
            self.dictionary.add_symbol("word{}".format(i))
        self.dictionary.pad_to_multiple_(8)  # often faster if divisible by 8
        logger.info("dictionary: {} types".format(len(self.dictionary)))

        seq = torch.arange(cfg.tokens_per_sample + 1) + self.dictionary.pad() + 1

        self.dummy_src = seq[:-1]
        self.dummy_tgt = seq[1:]

    def load_dataset(self, split, epoch=1, combine=False, **kwargs):
        """Load a given dataset split.

        Args:
            split (str): name of the split (e.g., train, valid, test)
        """
        if self.cfg.batch_size is not None:
            bsz = self.cfg.batch_size
        else:
            bsz = max(1, self.cfg.max_tokens // self.cfg.tokens_per_sample)
        self.datasets[split] = DummyDataset(
            {
                "id": 1,
                "net_input": {
                    "src_tokens": torch.stack([self.dummy_src for _ in range(bsz)]),
                    "src_lengths": torch.full(
                        (bsz,), self.cfg.tokens_per_sample, dtype=torch.long
                    ),
                },
                "target": torch.stack([self.dummy_tgt for _ in range(bsz)]),
                "nsentences": bsz,
                "ntokens": bsz * self.cfg.tokens_per_sample,
            },
            num_items=self.cfg.dataset_size,
            item_size=self.cfg.tokens_per_sample,
        )

    @property
    def source_dictionary(self):
        return self.dictionary

    @property
    def target_dictionary(self):
        return self.dictionary
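
The dummy tasks in this directory fabricate identical batches in memory, so they
are typically used to measure raw trainer throughput without any dataset on
disk. A hedged example of driving this task (the architecture and the optimizer
flags below are illustrative and may differ between fairseq versions):

.. code-block:: console

  > fairseq-train --task dummy_lm --arch transformer_lm \
    --optimizer adam --lr 0.0001 --batch-size 8 --max-update 100 \
    --log-format simple --log-interval 10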
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_masked_lm.py
0 → 100644
View file @
c0f05c10
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import logging
from dataclasses import dataclass, field
from typing import Optional

import torch
from omegaconf import II

from .dummy_dataset import DummyDataset
from fairseq.data import Dictionary
from fairseq.dataclass import FairseqDataclass
from fairseq.tasks import FairseqTask, register_task

logger = logging.getLogger(__name__)


@dataclass
class DummyMaskedLMConfig(FairseqDataclass):
    dict_size: int = 49996
    dataset_size: int = 100000
    tokens_per_sample: int = field(
        default=512,
        metadata={
            "help": "max number of total tokens over all"
            " segments per sample for BERT dataset"
        },
    )
    batch_size: Optional[int] = II("dataset.batch_size")
    max_tokens: Optional[int] = II("dataset.max_tokens")
    max_target_positions: int = II("task.tokens_per_sample")


@register_task("dummy_masked_lm", dataclass=DummyMaskedLMConfig)
class DummyMaskedLMTask(FairseqTask):
    def __init__(self, cfg: DummyMaskedLMConfig):
        super().__init__(cfg)

        self.dictionary = Dictionary()
        for i in range(cfg.dict_size):
            self.dictionary.add_symbol("word{}".format(i))
        logger.info("dictionary: {} types".format(len(self.dictionary)))
        # add mask token
        self.mask_idx = self.dictionary.add_symbol("<mask>")
        self.dictionary.pad_to_multiple_(8)  # often faster if divisible by 8

        mask_idx = 0
        pad_idx = 1
        seq = torch.arange(cfg.tokens_per_sample) + pad_idx + 1
        mask = torch.arange(2, cfg.tokens_per_sample, 7)  # ~15%
        src = seq.clone()
        src[mask] = mask_idx
        tgt = torch.full_like(seq, pad_idx)
        tgt[mask] = seq[mask]

        self.dummy_src = src
        self.dummy_tgt = tgt

    def load_dataset(self, split, epoch=1, combine=False, **kwargs):
        """Load a given dataset split.

        Args:
            split (str): name of the split (e.g., train, valid, test)
        """
        if self.cfg.batch_size is not None:
            bsz = self.cfg.batch_size
        else:
            bsz = max(1, self.cfg.max_tokens // self.cfg.tokens_per_sample)
        self.datasets[split] = DummyDataset(
            {
                "id": 1,
                "net_input": {
                    "src_tokens": torch.stack([self.dummy_src for _ in range(bsz)]),
                    "src_lengths": torch.full(
                        (bsz,), self.cfg.tokens_per_sample, dtype=torch.long
                    ),
                },
                "target": torch.stack([self.dummy_tgt for _ in range(bsz)]),
                "nsentences": bsz,
                "ntokens": bsz * self.cfg.tokens_per_sample,
            },
            num_items=self.cfg.dataset_size,
            item_size=self.cfg.tokens_per_sample,
        )

    @property
    def source_dictionary(self):
        return self.dictionary

    @property
    def target_dictionary(self):
        return self.dictionary
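The deterministic masking pattern built in __init__ above (every 7th position starting at index 2) can be checked in isolation. This is a standalone sketch, not part of the commit; the values are illustrative and the "~15%" comment in the source is approximate.

import torch

tokens_per_sample = 512
mask_idx, pad_idx = 0, 1

seq = torch.arange(tokens_per_sample) + pad_idx + 1   # synthetic "token ids" 2..513
mask = torch.arange(2, tokens_per_sample, 7)          # every 7th position
src = seq.clone()
src[mask] = mask_idx                                  # masked model input
tgt = torch.full_like(seq, pad_idx)
tgt[mask] = seq[mask]                                 # targets only at masked positions

print(len(mask) / tokens_per_sample)                  # ~0.143, close to the 15% BERT masking rate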
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_model.py
0 → 100644
View file @
c0f05c10
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import torch.nn as nn
import torch.nn.functional as F

from fairseq.data import Dictionary
from fairseq.models import (
    FairseqDecoder,
    FairseqLanguageModel,
    register_model,
    register_model_architecture,
)


@register_model("dummy_model")
class DummyModel(FairseqLanguageModel):
    def __init__(self, args, encoder):
        super().__init__(encoder)
        self.args = args

    @staticmethod
    def add_args(parser):
        parser.add_argument("--num-layers", type=int, default=24)
        parser.add_argument("--embed-dim", type=int, default=1024)

    @classmethod
    def build_model(cls, args, task):
        encoder = DummyEncoder(
            num_embed=len(task.target_dictionary),
            embed_dim=args.embed_dim,
            num_layers=args.num_layers,
        )
        return cls(args, encoder)

    def forward(self, src_tokens, masked_tokens=None, **kwargs):
        return self.decoder(src_tokens, masked_tokens=masked_tokens)


class DummyEncoder(FairseqDecoder):
    def __init__(self, num_embed=50000, embed_dim=1024, num_layers=24):
        super().__init__(Dictionary())
        self.embed = nn.Embedding(
            num_embeddings=num_embed, embedding_dim=embed_dim, padding_idx=0
        )
        self.layers_a = nn.ModuleList(
            [
                nn.Sequential(
                    nn.LayerNorm(embed_dim),
                    nn.Linear(embed_dim, 3 * embed_dim),  # q, k, v input projection
                    nn.Linear(3 * embed_dim, embed_dim),  # skip self-attention
                    nn.Linear(embed_dim, embed_dim),  # output projection
                    nn.Dropout(),
                )
                for i in range(num_layers)
            ]
        )
        self.layers_b = nn.ModuleList(
            [
                nn.Sequential(
                    nn.LayerNorm(embed_dim),
                    nn.Linear(embed_dim, 4 * embed_dim),  # FFN
                    nn.ReLU(),
                    nn.Linear(4 * embed_dim, embed_dim),  # FFN
                    nn.Dropout(0.1),
                )
                for i in range(num_layers)
            ]
        )
        self.out_proj = nn.Linear(embed_dim, num_embed)

    def forward(self, tokens, masked_tokens=None):
        x = self.embed(tokens)
        for layer_a, layer_b in zip(self.layers_a, self.layers_b):
            x = x + layer_a(x)
            x = x + layer_b(x)
        x = self.out_proj(x)
        if masked_tokens is not None:
            x = x[masked_tokens]
        return (x,)

    def max_positions(self):
        return 1024

    def get_normalized_probs(self, net_output, log_probs, sample=None):
        logits = net_output[0].float()
        if log_probs:
            return F.log_softmax(logits, dim=-1)
        else:
            return F.softmax(logits, dim=-1)


@register_model_architecture("dummy_model", "dummy_model")
def base_architecture(args):
    pass
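DummyEncoder replaces real self-attention with attention-shaped linear projections plus a feed-forward block, each wrapped in a residual connection. Below is a minimal pure-PyTorch sketch of one such layer pair; the tiny sizes are chosen only for illustration and the code is not part of the commit.

import torch
import torch.nn as nn

embed_dim, vocab = 64, 100  # tiny sizes for a quick shape check

embed = nn.Embedding(vocab, embed_dim, padding_idx=0)
layer_a = nn.Sequential(    # attention-shaped projections; the attention itself is skipped
    nn.LayerNorm(embed_dim),
    nn.Linear(embed_dim, 3 * embed_dim),
    nn.Linear(3 * embed_dim, embed_dim),
    nn.Linear(embed_dim, embed_dim),
    nn.Dropout(),
)
layer_b = nn.Sequential(    # feed-forward block
    nn.LayerNorm(embed_dim),
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.ReLU(),
    nn.Linear(4 * embed_dim, embed_dim),
    nn.Dropout(0.1),
)
out_proj = nn.Linear(embed_dim, vocab)

tokens = torch.randint(1, vocab, (2, 16))  # (batch, seq_len)
x = embed(tokens)
x = x + layer_a(x)   # residual around the attention-shaped branch
x = x + layer_b(x)   # residual around the FFN branch
logits = out_proj(x)
print(logits.shape)  # torch.Size([2, 16, 100])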
PyTorch/NLP/new-Transformer/fairseq/benchmark/dummy_mt.py
0 → 100644
View file @
c0f05c10
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import logging

import numpy as np
import torch

from fairseq.data import Dictionary, FairseqDataset
from fairseq.tasks import LegacyFairseqTask, register_task

logger = logging.getLogger(__name__)


@register_task("dummy_mt")
class DummyMTTask(LegacyFairseqTask):
    @staticmethod
    def add_args(parser):
        """Add task-specific arguments to the parser."""
        parser.add_argument("--dict-size", default=49996, type=int)
        parser.add_argument("--dataset-size", default=100000, type=int)
        parser.add_argument("--src-len", default=30, type=int)
        parser.add_argument("--tgt-len", default=30, type=int)

    def __init__(self, args, dictionary):
        super().__init__(args)
        self.dictionary = dictionary
        self.seed = args.seed
        dictionary.pad_to_multiple_(8)  # often faster if divisible by 8

        self.dummy_src = torch.arange(args.src_len + 1) + dictionary.pad() + 1
        self.dummy_tgt = torch.arange(args.tgt_len + 1) + dictionary.pad() + 1

    @classmethod
    def setup_task(cls, args, **kwargs):
        """Setup the task."""
        dictionary = Dictionary()
        for i in range(args.dict_size):
            dictionary.add_symbol("word{}".format(i))
        logger.info("dictionary: {} types".format(len(dictionary)))

        args.max_source_positions = args.src_len + dictionary.pad() + 2
        args.max_target_positions = args.tgt_len + dictionary.pad() + 2

        return cls(args, dictionary)

    def load_dataset(self, split, epoch=1, combine=False, **kwargs):
        """Load a given dataset split.

        Args:
            split (str): name of the split (e.g., train, valid, test)
        """
        item_size = max(self.args.src_len, self.args.tgt_len)
        if self.args.batch_size is not None:
            bsz = self.args.batch_size
        else:
            bsz = max(1, self.args.max_tokens // item_size)
        tgt = torch.stack([self.dummy_tgt for _ in range(bsz)])
        self.datasets[split] = DummyDataset(
            {
                "id": 1,
                "net_input": {
                    "src_tokens": torch.stack([self.dummy_src for _ in range(bsz)]),
                    "src_lengths": torch.full(
                        (bsz,), self.args.src_len, dtype=torch.long
                    ),
                    "prev_output_tokens": tgt.clone(),
                },
                "target": tgt,
                "nsentences": bsz,
                "ntokens": bsz * self.args.tgt_len,
            },
            num_items=self.args.dataset_size,
            item_size=item_size,
        )

    @property
    def source_dictionary(self):
        return self.dictionary

    @property
    def target_dictionary(self):
        return self.dictionary


class DummyDataset(FairseqDataset):
    def __init__(self, batch, num_items, item_size):
        super().__init__()
        self.batch = batch
        self.num_items = num_items
        self.item_size = item_size

    def __getitem__(self, index):
        return index

    def __len__(self):
        return self.num_items

    def collater(self, samples):
        return self.batch

    @property
    def sizes(self):
        return np.array([self.item_size] * self.num_items)

    def num_tokens(self, index):
        return self.item_size

    def size(self, index):
        return self.item_size

    def ordered_indices(self):
        return np.arange(self.num_items)

    @property
    def supports_prefetch(self):
        return False
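The DummyDataset above ignores the sampled indices and always returns one pre-built batch from collater(), which is what makes these tasks useful for benchmarking compute without any I/O. A stripped-down stand-in (hypothetical class name, not part of the commit) behaves the same way:

import torch

class FixedBatchDataset:                       # hypothetical stand-in mirroring DummyDataset
    def __init__(self, batch, num_items, item_size):
        self.batch, self.num_items, self.item_size = batch, num_items, item_size

    def __getitem__(self, index):
        return index                           # items are just indices

    def __len__(self):
        return self.num_items

    def collater(self, samples):
        return self.batch                      # every "collated" batch is the same dict

batch = {"net_input": {"src_tokens": torch.zeros(8, 30, dtype=torch.long)}, "ntokens": 8 * 30}
ds = FixedBatchDataset(batch, num_items=100000, item_size=30)
print(len(ds), ds.collater([0, 1, 2])["ntokens"])   # 100000 240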
PyTorch/NLP/new-Transformer/fairseq/binarizer.py
0 → 100644
View file @
c0f05c10
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import logging
import os
import typing as tp
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass
from multiprocessing import Pool

import torch

from fairseq.data import Dictionary, indexed_dataset
from fairseq.file_chunker_utils import Chunker, find_offsets
from fairseq.file_io import PathManager
from fairseq.tokenizer import tokenize_line

logger = logging.getLogger("binarizer")


@dataclass
class BinarizeSummary:
    """
    Keep track of what's going on in the binarizer.
    """

    num_seq: int = 0
    replaced: tp.Optional[Counter] = None
    num_tok: int = 0

    @property
    def num_replaced(self) -> int:
        if self.replaced is None:
            return 0
        return sum(self.replaced.values())

    @property
    def replaced_percent(self) -> float:
        return 100 * self.num_replaced / self.num_tok

    def __str__(self) -> str:
        base = f"{self.num_seq} sents, {self.num_tok} tokens"
        if self.replaced is None:
            return base
        return f"{base}, {self.replaced_percent:.3}% replaced"

    def merge(self, other: "BinarizeSummary"):
        replaced = None
        if self.replaced is not None:
            replaced = self.replaced
        if other.replaced is not None:
            if replaced is None:
                replaced = other.replaced
            else:
                replaced += other.replaced
        self.replaced = replaced
        self.num_seq += other.num_seq
        self.num_tok += other.num_tok


class Binarizer(ABC):
    """
    A binarizer describes how to take a string and build a tensor out of it.
    """

    @abstractmethod
    def binarize_line(
        self,
        line: str,
        summary: BinarizeSummary,
    ) -> torch.IntTensor:
        ...


def _worker_prefix(output_prefix: str, worker_id: int):
    return f"{output_prefix}.pt{worker_id}"


class FileBinarizer:
    """
    A file binarizer can take a file, tokenize it, and binarize each line into a tensor.
    """

    @classmethod
    def multiprocess_dataset(
        cls,
        input_file: str,
        dataset_impl: str,
        binarizer: Binarizer,
        output_prefix: str,
        vocab_size=None,
        num_workers=1,
    ) -> BinarizeSummary:
        final_summary = BinarizeSummary()

        offsets = find_offsets(input_file, num_workers)
        # find_offsets returns a list of positions [pos1, pos2, pos3, pos4], but we want pairs:
        # [(pos1, pos2), (pos2, pos3), (pos3, pos4)] to process the chunks with start/end info.
        # We zip the list with itself shifted by one to get all the pairs.
        (first_chunk, *more_chunks) = zip(offsets, offsets[1:])
        pool = None
        if num_workers > 1:
            pool = Pool(processes=num_workers - 1)
            worker_results = [
                pool.apply_async(
                    cls._binarize_chunk_and_finalize,
                    args=(
                        binarizer,
                        input_file,
                        start_offset,
                        end_offset,
                        _worker_prefix(
                            output_prefix,
                            worker_id,
                        ),
                        dataset_impl,
                    ),
                    kwds={
                        "vocab_size": vocab_size,
                    }
                    if vocab_size is not None
                    else {},
                )
                for worker_id, (start_offset, end_offset) in enumerate(
                    more_chunks, start=1
                )
            ]

            pool.close()
            pool.join()
            for r in worker_results:
                summ = r.get()
                final_summary.merge(summ)

        # do not close the bin file as we need to merge the worker results in
        final_ds, summ = cls._binarize_file_chunk(
            binarizer,
            input_file,
            offset_start=first_chunk[0],
            offset_end=first_chunk[1],
            output_prefix=output_prefix,
            dataset_impl=dataset_impl,
            vocab_size=vocab_size if vocab_size is not None else None,
        )
        final_summary.merge(summ)

        if num_workers > 1:
            for worker_id in range(1, num_workers):
                # merge the worker outputs
                worker_output_prefix = _worker_prefix(
                    output_prefix,
                    worker_id,
                )
                final_ds.merge_file_(worker_output_prefix)
                try:
                    os.remove(indexed_dataset.data_file_path(worker_output_prefix))
                    os.remove(indexed_dataset.index_file_path(worker_output_prefix))
                except Exception as e:
                    logger.error(
                        f"couldn't remove {worker_output_prefix}.*", exc_info=e
                    )

        # now we can close the file
        idx_file = indexed_dataset.index_file_path(output_prefix)
        final_ds.finalize(idx_file)
        return final_summary

    @staticmethod
    def _binarize_file_chunk(
        binarizer: Binarizer,
        filename: str,
        offset_start: int,
        offset_end: int,
        output_prefix: str,
        dataset_impl: str,
        vocab_size=None,
    ) -> tp.Tuple[tp.Any, BinarizeSummary]:  # (dataset builder, BinarizeSummary)
        """
        Creates a dataset builder and appends binarized items to it. This function does not
        finalize the builder, which is useful if you want to do other things with your bin file,
        like appending/merging other files.
        """
        bin_file = indexed_dataset.data_file_path(output_prefix)
        ds = indexed_dataset.make_builder(
            bin_file,
            impl=dataset_impl,
            vocab_size=vocab_size,
        )
        summary = BinarizeSummary()

        with Chunker(
            PathManager.get_local_path(filename), offset_start, offset_end
        ) as line_iterator:
            for line in line_iterator:
                ds.add_item(binarizer.binarize_line(line, summary))

        return ds, summary

    @classmethod
    def _binarize_chunk_and_finalize(
        cls,
        binarizer: Binarizer,
        filename: str,
        offset_start: int,
        offset_end: int,
        output_prefix: str,
        dataset_impl: str,
        vocab_size=None,
    ):
        """
        Same as above, but also finalizes the builder.
        """
        ds, summ = cls._binarize_file_chunk(
            binarizer,
            filename,
            offset_start,
            offset_end,
            output_prefix,
            dataset_impl,
            vocab_size=vocab_size,
        )

        idx_file = indexed_dataset.index_file_path(output_prefix)
        ds.finalize(idx_file)

        return summ


class VocabularyDatasetBinarizer(Binarizer):
    """
    Takes a Dictionary/Vocabulary and assigns ids to each
    token using the dictionary's encode_line function.
    """

    def __init__(
        self,
        dict: Dictionary,
        tokenize: tp.Callable[[str], tp.List[str]] = tokenize_line,
        append_eos: bool = True,
        reverse_order: bool = False,
        already_numberized: bool = False,
    ) -> None:
        self.dict = dict
        self.tokenize = tokenize
        self.append_eos = append_eos
        self.reverse_order = reverse_order
        self.already_numberized = already_numberized
        super().__init__()

    def binarize_line(
        self,
        line: str,
        summary: BinarizeSummary,
    ):
        if summary.replaced is None:
            summary.replaced = Counter()

        def replaced_consumer(word, idx):
            if idx == self.dict.unk_index and word != self.dict.unk_word:
                summary.replaced.update([word])

        if self.already_numberized:
            id_strings = line.strip().split()
            id_list = [int(id_string) for id_string in id_strings]
            if self.reverse_order:
                id_list.reverse()
            if self.append_eos:
                id_list.append(self.dict.eos())
            ids = torch.IntTensor(id_list)
        else:
            ids = self.dict.encode_line(
                line=line,
                line_tokenizer=self.tokenize,
                add_if_not_exist=False,
                consumer=replaced_consumer,
                append_eos=self.append_eos,
                reverse_order=self.reverse_order,
            )

        summary.num_seq += 1
        summary.num_tok += len(ids)
        return ids


class AlignmentDatasetBinarizer(Binarizer):
    """
    Binarize by parsing a set of alignments and packing
    them into a tensor (see utils.parse_alignment).
    """

    def __init__(
        self,
        alignment_parser: tp.Callable[[str], torch.IntTensor],
    ) -> None:
        super().__init__()
        self.alignment_parser = alignment_parser

    def binarize_line(
        self,
        line: str,
        summary: BinarizeSummary,
    ):
        ids = self.alignment_parser(line)
        summary.num_seq += 1
        summary.num_tok += len(ids)
        return ids


class LegacyBinarizer:
    @classmethod
    def binarize(
        cls,
        filename: str,
        dico: Dictionary,
        consumer: tp.Callable[[torch.IntTensor], None],
        tokenize: tp.Callable[[str], tp.List[str]] = tokenize_line,
        append_eos: bool = True,
        reverse_order: bool = False,
        offset: int = 0,
        end: int = -1,
        already_numberized: bool = False,
    ) -> tp.Dict[str, int]:
        binarizer = VocabularyDatasetBinarizer(
            dict=dico,
            tokenize=tokenize,
            append_eos=append_eos,
            reverse_order=reverse_order,
            already_numberized=already_numberized,
        )
        return cls._consume_file(
            filename,
            binarizer,
            consumer,
            offset_start=offset,
            offset_end=end,
        )

    @classmethod
    def binarize_alignments(
        cls,
        filename: str,
        alignment_parser: tp.Callable[[str], torch.IntTensor],
        consumer: tp.Callable[[torch.IntTensor], None],
        offset: int = 0,
        end: int = -1,
    ) -> tp.Dict[str, int]:
        binarizer = AlignmentDatasetBinarizer(alignment_parser)
        return cls._consume_file(
            filename,
            binarizer,
            consumer,
            offset_start=offset,
            offset_end=end,
        )

    @staticmethod
    def _consume_file(
        filename: str,
        binarizer: Binarizer,
        consumer: tp.Callable[[torch.IntTensor], None],
        offset_start: int,
        offset_end: int,
    ) -> tp.Dict[str, int]:
        summary = BinarizeSummary()

        with Chunker(
            PathManager.get_local_path(filename), offset_start, offset_end
        ) as line_iterator:
            for line in line_iterator:
                consumer(binarizer.binarize_line(line, summary))

        return {
            "nseq": summary.num_seq,
            "nunk": summary.num_replaced,
            "ntok": summary.num_tok,
            "replaced": summary.replaced,
        }
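A short usage sketch for the vocabulary binarizer, assuming fairseq is installed so that fairseq.binarizer and fairseq.data.Dictionary are importable; it only exercises calls that appear in the file above, and the example tokens are made up for illustration.

from fairseq.binarizer import BinarizeSummary, VocabularyDatasetBinarizer
from fairseq.data import Dictionary

dico = Dictionary()
for w in ["hello", "world"]:
    dico.add_symbol(w)

binarizer = VocabularyDatasetBinarizer(dict=dico, append_eos=True)
summary = BinarizeSummary()

ids = binarizer.binarize_line("hello world oov_token", summary)
print(ids)      # IntTensor of ids; "oov_token" maps to <unk> and is counted in summary.replaced
print(summary)  # e.g. "1 sents, 4 tokens, ...% replaced"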