"INSTALL/vscode:/vscode.git/clone" did not exist on "f3e267a09e22f9f2db70a04b81a30fefe0824bbe"
Commit 7143f128 authored by sunxx1's avatar sunxx1
Browse files

Merge branch 'hepj-test' into 'main'

更新transformer代码

See merge request dcutoolkit/deeplearing/dlexamples_new!47
parents a30b77fe c0f05c10
Overview
========
Fairseq can be extended through user-supplied `plug-ins
<https://en.wikipedia.org/wiki/Plug-in_(computing)>`_. We support five kinds of
plug-ins:
- :ref:`Models` define the neural network architecture and encapsulate all of the
learnable parameters.
- :ref:`Criterions` compute the loss function given the model outputs and targets.
- :ref:`Tasks` store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
- :ref:`Optimizers` update the Model parameters based on the gradients.
- :ref:`Learning Rate Schedulers` update the learning rate over the course of
training.
**Training Flow**
Given a ``model``, ``criterion``, ``task``, ``optimizer`` and ``lr_scheduler``,
fairseq implements the following high-level training flow::
for epoch in range(num_epochs):
itr = task.get_batch_iterator(task.dataset('train'))
for num_updates, batch in enumerate(itr):
task.train_step(batch, model, criterion, optimizer)
average_and_clip_gradients()
optimizer.step()
lr_scheduler.step_update(num_updates)
lr_scheduler.step(epoch)
where the default implementation for ``task.train_step`` is roughly::
def train_step(self, batch, model, criterion, optimizer, **unused):
loss = criterion(model, batch)
optimizer.backward(loss)
return loss
**Registering new plug-ins**
New plug-ins are *registered* through a set of ``@register`` function
decorators, for example::
@register_model('my_lstm')
class MyLSTM(FairseqEncoderDecoderModel):
(...)
Once registered, new plug-ins can be used with the existing :ref:`Command-line
Tools`. See the Tutorial sections for more detailed walkthroughs of how to add
new plug-ins.
**Loading plug-ins from another directory**
New plug-ins can be defined in a custom module stored in the user system. In
order to import the module, and make the plugin available to *fairseq*, the
command line supports the ``--user-dir`` flag that can be used to specify a
custom location for additional modules to load into *fairseq*.
For example, assuming this directory tree::
/home/user/my-module/
└── __init__.py
with ``__init__.py``::
from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big
@register_model_architecture('transformer', 'my_transformer')
def transformer_mmt_big(args):
transformer_vaswani_wmt_en_de_big(args)
it is possible to invoke the :ref:`fairseq-train` script with the new architecture with::
fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
.. role:: hidden
:class: hidden-section
.. module:: fairseq.tasks
.. _Tasks:
Tasks
=====
Tasks store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
Tasks can be selected via the ``--task`` command-line argument. Once selected, a
task may expose additional command-line arguments for further configuration.
Example usage::
# setup the task (e.g., load dictionaries)
task = fairseq.tasks.setup_task(args)
# build model and criterion
model = task.build_model(args)
criterion = task.build_criterion(args)
# load datasets
task.load_dataset('train')
task.load_dataset('valid')
# iterate over mini-batches of data
batch_itr = task.get_batch_iterator(
task.dataset('train'), max_tokens=4096,
)
for batch in batch_itr:
# compute the loss
loss, sample_size, logging_output = task.get_loss(
model, criterion, batch,
)
loss.backward()
Translation
-----------
.. autoclass:: fairseq.tasks.translation.TranslationTask
.. _language modeling:
Language Modeling
-----------------
.. autoclass:: fairseq.tasks.language_modeling.LanguageModelingTask
Adding new tasks
----------------
.. autofunction:: fairseq.tasks.register_task
.. autoclass:: fairseq.tasks.FairseqTask
:members:
:undoc-members:
Tutorial: Classifying Names with a Character-Level RNN
======================================================
In this tutorial we will extend fairseq to support *classification* tasks. In
particular we will re-implement the PyTorch tutorial for `Classifying Names with
a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`_
in fairseq. It is recommended to quickly skim that tutorial before beginning
this one.
This tutorial covers:
1. **Preprocessing the data** to create dictionaries.
2. **Registering a new Model** that encodes an input sentence with a simple RNN
and predicts the output label.
3. **Registering a new Task** that loads our dictionaries and dataset.
4. **Training the Model** using the existing command-line tools.
5. **Writing an evaluation script** that imports fairseq and allows us to
interactively evaluate our model on new inputs.
1. Preprocessing the data
-------------------------
The original tutorial provides raw data, but we'll work with a modified version
of the data that is already tokenized into characters and split into separate
train, valid and test sets.
Download and extract the data from here:
`tutorial_names.tar.gz <https://dl.fbaipublicfiles.com/fairseq/data/tutorial_names.tar.gz>`_
Once extracted, let's preprocess the data using the :ref:`fairseq-preprocess`
command-line tool to create the dictionaries. While this tool is primarily
intended for sequence-to-sequence problems, we're able to reuse it here by
treating the label as a "target" sequence of length 1. We'll also output the
preprocessed files in "raw" format using the ``--dataset-impl`` option to
enhance readability:
.. code-block:: console
> fairseq-preprocess \
--trainpref names/train --validpref names/valid --testpref names/test \
--source-lang input --target-lang label \
--destdir names-bin --dataset-impl raw
After running the above command you should see a new directory,
:file:`names-bin/`, containing the dictionaries for *inputs* and *labels*.
2. Registering a new Model
--------------------------
Next we'll register a new model in fairseq that will encode an input sentence
with a simple RNN and predict the output label. Compared to the original PyTorch
tutorial, our version will also work with batches of data and GPU Tensors.
First let's copy the simple RNN module implemented in the `PyTorch tutorial
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network>`_.
Create a new file named :file:`fairseq/models/rnn_classifier.py` with the
following contents::
import torch
import torch.nn as nn
class RNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super(RNN, self).__init__()
self.hidden_size = hidden_size
self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
self.i2o = nn.Linear(input_size + hidden_size, output_size)
self.softmax = nn.LogSoftmax(dim=1)
def forward(self, input, hidden):
combined = torch.cat((input, hidden), 1)
hidden = self.i2h(combined)
output = self.i2o(combined)
output = self.softmax(output)
return output, hidden
def initHidden(self):
return torch.zeros(1, self.hidden_size)
We must also *register* this model with fairseq using the
:func:`~fairseq.models.register_model` function decorator. Once the model is
registered we'll be able to use it with the existing :ref:`Command-line Tools`.
All registered models must implement the :class:`~fairseq.models.BaseFairseqModel`
interface, so we'll create a small wrapper class in the same file and register
it in fairseq with the name ``'rnn_classifier'``::
from fairseq.models import BaseFairseqModel, register_model
# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.
@register_model('rnn_classifier')
class FairseqRNNClassifier(BaseFairseqModel):
@staticmethod
def add_args(parser):
# Models can override this method to add new command-line arguments.
# Here we'll add a new command-line argument to configure the
# dimensionality of the hidden state.
parser.add_argument(
'--hidden-dim', type=int, metavar='N',
help='dimensionality of the hidden state',
)
@classmethod
def build_model(cls, args, task):
# Fairseq initializes models by calling the ``build_model()``
# function. This provides more flexibility, since the returned model
# instance can be of a different type than the one that was called.
# In this case we'll just return a FairseqRNNClassifier instance.
# Initialize our RNN module
rnn = RNN(
# We'll define the Task in the next section, but for now just
# notice that the task holds the dictionaries for the "source"
# (i.e., the input sentence) and "target" (i.e., the label).
input_size=len(task.source_dictionary),
hidden_size=args.hidden_dim,
output_size=len(task.target_dictionary),
)
# Return the wrapped version of the module
return FairseqRNNClassifier(
rnn=rnn,
input_vocab=task.source_dictionary,
)
def __init__(self, rnn, input_vocab):
super(FairseqRNNClassifier, self).__init__()
self.rnn = rnn
self.input_vocab = input_vocab
# The RNN module in the tutorial expects one-hot inputs, so we can
# precompute the identity matrix to help convert from indices to
# one-hot vectors. We register it as a buffer so that it is moved to
# the GPU when ``cuda()`` is called.
self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))
def forward(self, src_tokens, src_lengths):
# The inputs to the ``forward()`` function are determined by the
# Task, and in particular the ``'net_input'`` key in each
# mini-batch. We'll define the Task in the next section, but for
# now just know that *src_tokens* has shape `(batch, src_len)` and
# *src_lengths* has shape `(batch)`.
bsz, max_src_len = src_tokens.size()
# Initialize the RNN hidden state. Compared to the original PyTorch
# tutorial we'll also handle batched inputs and work on the GPU.
hidden = self.rnn.initHidden()
hidden = hidden.repeat(bsz, 1) # expand for batched inputs
hidden = hidden.to(src_tokens.device) # move to GPU
for i in range(max_src_len):
# WARNING: The inputs have padding, so we should mask those
# elements here so that padding doesn't affect the results.
# This is left as an exercise for the reader. The padding symbol
# is given by ``self.input_vocab.pad()`` and the unpadded length
# of each input is given by *src_lengths*.
# One-hot encode a batch of input characters.
input = self.one_hot_inputs[src_tokens[:, i].long()]
# Feed the input to our RNN.
output, hidden = self.rnn(input, hidden)
# Return the final output state for making a prediction
return output
Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch pytorch_tutorial_rnn``::
from fairseq.models import register_model_architecture
# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'rnn_classifier'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.
@register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
def pytorch_tutorial_rnn(args):
# We use ``getattr()`` to prioritize arguments that are explicitly given
# on the command-line, so that the defaults defined below are only used
# when no other value has been specified.
args.hidden_dim = getattr(args, 'hidden_dim', 128)
3. Registering a new Task
-------------------------
Now we'll register a new :class:`~fairseq.tasks.FairseqTask` that will load our
dictionaries and dataset. Tasks can also control how the data is batched into
mini-batches, but in this tutorial we'll reuse the batching provided by
:class:`fairseq.data.LanguagePairDataset`.
Create a new file named :file:`fairseq/tasks/simple_classification.py` with the
following contents::
import os
import torch
from fairseq.data import Dictionary, LanguagePairDataset
from fairseq.tasks import FairseqTask, register_task
@register_task('simple_classification')
class SimpleClassificationTask(LegacyFairseqTask):
@staticmethod
def add_args(parser):
# Add some command-line arguments for specifying where the data is
# located and the maximum supported input length.
parser.add_argument('data', metavar='FILE',
help='file prefix for data')
parser.add_argument('--max-positions', default=1024, type=int,
help='max input length')
@classmethod
def setup_task(cls, args, **kwargs):
# Here we can perform any setup required for the task. This may include
# loading Dictionaries, initializing shared Embedding layers, etc.
# In this case we'll just load the Dictionaries.
input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
print('| [input] dictionary: {} types'.format(len(input_vocab)))
print('| [label] dictionary: {} types'.format(len(label_vocab)))
return SimpleClassificationTask(args, input_vocab, label_vocab)
def __init__(self, args, input_vocab, label_vocab):
super().__init__(args)
self.input_vocab = input_vocab
self.label_vocab = label_vocab
def load_dataset(self, split, **kwargs):
"""Load a given dataset split (e.g., train, valid, test)."""
prefix = os.path.join(self.args.data, '{}.input-label'.format(split))
# Read input sentences.
sentences, lengths = [], []
with open(prefix + '.input', encoding='utf-8') as file:
for line in file:
sentence = line.strip()
# Tokenize the sentence, splitting on spaces
tokens = self.input_vocab.encode_line(
sentence, add_if_not_exist=False,
)
sentences.append(tokens)
lengths.append(tokens.numel())
# Read labels.
labels = []
with open(prefix + '.label', encoding='utf-8') as file:
for line in file:
label = line.strip()
labels.append(
# Convert label to a numeric ID.
torch.LongTensor([self.label_vocab.add_symbol(label)])
)
assert len(sentences) == len(labels)
print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))
# We reuse LanguagePairDataset since classification can be modeled as a
# sequence-to-sequence task where the target sequence has length 1.
self.datasets[split] = LanguagePairDataset(
src=sentences,
src_sizes=lengths,
src_dict=self.input_vocab,
tgt=labels,
tgt_sizes=torch.ones(len(labels)), # targets have length 1
tgt_dict=self.label_vocab,
left_pad_source=False,
# Since our target is a single class label, there's no need for
# teacher forcing. If we set this to ``True`` then our Model's
# ``forward()`` method would receive an additional argument called
# *prev_output_tokens* that would contain a shifted version of the
# target sequence.
input_feeding=False,
)
def max_positions(self):
"""Return the max input length allowed by the task."""
# The source should be less than *args.max_positions* and the "target"
# has max length 1.
return (self.args.max_positions, 1)
@property
def source_dictionary(self):
"""Return the source :class:`~fairseq.data.Dictionary`."""
return self.input_vocab
@property
def target_dictionary(self):
"""Return the target :class:`~fairseq.data.Dictionary`."""
return self.label_vocab
# We could override this method if we wanted more control over how batches
# are constructed, but it's not necessary for this tutorial since we can
# reuse the batching provided by LanguagePairDataset.
#
# def get_batch_iterator(
# self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
# ignore_invalid_inputs=False, required_batch_size_multiple=1,
# seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
# data_buffer_size=0, disable_iterator_cache=False,
# ):
# (...)
4. Training the Model
---------------------
Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Task (``--task
simple_classification``) and Model architecture (``--arch
pytorch_tutorial_rnn``):
.. note::
You can also configure the dimensionality of the hidden state by passing the
``--hidden-dim`` argument to :ref:`fairseq-train`.
.. code-block:: console
> fairseq-train names-bin \
--task simple_classification \
--arch pytorch_tutorial_rnn \
--optimizer adam --lr 0.001 --lr-shrink 0.5 \
--max-tokens 1000
(...)
| epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
| epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
| done training in 31.6 seconds
The model files should appear in the :file:`checkpoints/` directory.
5. Writing an evaluation script
-------------------------------
Finally we can write a short script to evaluate our model on new inputs. Create
a new file named :file:`eval_classifier.py` with the following contents::
from fairseq import checkpoint_utils, data, options, tasks
# Parse command-line arguments for generation
parser = options.get_generation_parser(default_task='simple_classification')
args = options.parse_args_and_arch(parser)
# Setup task
task = tasks.setup_task(args)
# Load model
print('| loading model from {}'.format(args.path))
models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
model = models[0]
while True:
sentence = input('\nInput: ')
# Tokenize into characters
chars = ' '.join(list(sentence.strip()))
tokens = task.source_dictionary.encode_line(
chars, add_if_not_exist=False,
)
# Build mini-batch to feed to the model
batch = data.language_pair_dataset.collate(
samples=[{'id': -1, 'source': tokens}], # bsz = 1
pad_idx=task.source_dictionary.pad(),
eos_idx=task.source_dictionary.eos(),
left_pad_source=False,
input_feeding=False,
)
# Feed batch to the model and get predictions
preds = model(**batch['net_input'])
# Print top 3 predictions and their log-probabilities
top_scores, top_labels = preds[0].topk(k=3)
for score, label_idx in zip(top_scores, top_labels):
label_name = task.target_dictionary.string([label_idx])
print('({:.2f})\t{}'.format(score, label_name))
Now we can evaluate our model interactively. Note that we have included the
original data path (:file:`names-bin/`) so that the dictionaries can be loaded:
.. code-block:: console
> python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
| [input] dictionary: 64 types
| [label] dictionary: 24 types
| loading model from checkpoints/checkpoint_best.pt
Input: Satoshi
(-0.61) Japanese
(-1.20) Arabic
(-2.86) Italian
Input: Sinbad
(-0.30) Arabic
(-1.76) English
(-4.08) Russian
Tutorial: Simple LSTM
=====================
In this tutorial we will extend fairseq by adding a new
:class:`~fairseq.models.FairseqEncoderDecoderModel` that encodes a source
sentence with an LSTM and then passes the final hidden state to a second LSTM
that decodes the target sentence (without attention).
This tutorial covers:
1. **Writing an Encoder and Decoder** to encode/decode the source/target
sentence, respectively.
2. **Registering a new Model** so that it can be used with the existing
:ref:`Command-line tools`.
3. **Training the Model** using the existing command-line tools.
4. **Making generation faster** by modifying the Decoder to use
:ref:`Incremental decoding`.
1. Building an Encoder and Decoder
----------------------------------
In this section we'll define a simple LSTM Encoder and Decoder. All Encoders
should implement the :class:`~fairseq.models.FairseqEncoder` interface and
Decoders should implement the :class:`~fairseq.models.FairseqDecoder` interface.
These interfaces themselves extend :class:`torch.nn.Module`, so FairseqEncoders
and FairseqDecoders can be written and used in the same ways as ordinary PyTorch
Modules.
Encoder
~~~~~~~
Our Encoder will embed the tokens in the source sentence, feed them to a
:class:`torch.nn.LSTM` and return the final hidden state. To create our encoder
save the following in a new file named :file:`fairseq/models/simple_lstm.py`::
import torch.nn as nn
from fairseq import utils
from fairseq.models import FairseqEncoder
class SimpleLSTMEncoder(FairseqEncoder):
def __init__(
self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1,
):
super().__init__(dictionary)
self.args = args
# Our encoder will embed the inputs before feeding them to the LSTM.
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
# We'll use a single-layer, unidirectional LSTM for simplicity.
self.lstm = nn.LSTM(
input_size=embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
batch_first=True,
)
def forward(self, src_tokens, src_lengths):
# The inputs to the ``forward()`` function are determined by the
# Task, and in particular the ``'net_input'`` key in each
# mini-batch. We discuss Tasks in the next tutorial, but for now just
# know that *src_tokens* has shape `(batch, src_len)` and *src_lengths*
# has shape `(batch)`.
# Note that the source is typically padded on the left. This can be
# configured by adding the `--left-pad-source "False"` command-line
# argument, but here we'll make the Encoder handle either kind of
# padding by converting everything to be right-padded.
if self.args.left_pad_source:
# Convert left-padding to right-padding.
src_tokens = utils.convert_padding_direction(
src_tokens,
padding_idx=self.dictionary.pad(),
left_to_right=True
)
# Embed the source.
x = self.embed_tokens(src_tokens)
# Apply dropout.
x = self.dropout(x)
# Pack the sequence into a PackedSequence object to feed to the LSTM.
x = nn.utils.rnn.pack_padded_sequence(x, src_lengths, batch_first=True)
# Get the output from the LSTM.
_outputs, (final_hidden, _final_cell) = self.lstm(x)
# Return the Encoder's output. This can be any object and will be
# passed directly to the Decoder.
return {
# this will have shape `(bsz, hidden_dim)`
'final_hidden': final_hidden.squeeze(0),
}
# Encoders are required to implement this method so that we can rearrange
# the order of the batch elements during inference (e.g., beam search).
def reorder_encoder_out(self, encoder_out, new_order):
"""
Reorder encoder output according to `new_order`.
Args:
encoder_out: output from the ``forward()`` method
new_order (LongTensor): desired order
Returns:
`encoder_out` rearranged according to `new_order`
"""
final_hidden = encoder_out['final_hidden']
return {
'final_hidden': final_hidden.index_select(0, new_order),
}
Decoder
~~~~~~~
Our Decoder will predict the next word, conditioned on the Encoder's final
hidden state and an embedded representation of the previous target word -- which
is sometimes called *teacher forcing*. More specifically, we'll use a
:class:`torch.nn.LSTM` to produce a sequence of hidden states that we'll project
to the size of the output vocabulary to predict each target word.
::
import torch
from fairseq.models import FairseqDecoder
class SimpleLSTMDecoder(FairseqDecoder):
def __init__(
self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
dropout=0.1,
):
super().__init__(dictionary)
# Our decoder will embed the inputs before feeding them to the LSTM.
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
# We'll use a single-layer, unidirectional LSTM for simplicity.
self.lstm = nn.LSTM(
# For the first layer we'll concatenate the Encoder's final hidden
# state with the embedded target tokens.
input_size=encoder_hidden_dim + embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
)
# Define the output projection.
self.output_projection = nn.Linear(hidden_dim, len(dictionary))
# During training Decoders are expected to take the entire target sequence
# (shifted right by one position) and produce logits over the vocabulary.
# The *prev_output_tokens* tensor begins with the end-of-sentence symbol,
# ``dictionary.eos()``, followed by the target sequence.
def forward(self, prev_output_tokens, encoder_out):
"""
Args:
prev_output_tokens (LongTensor): previous decoder outputs of shape
`(batch, tgt_len)`, for teacher forcing
encoder_out (Tensor, optional): output from the encoder, used for
encoder-side attention
Returns:
tuple:
- the last decoder layer's output of shape
`(batch, tgt_len, vocab)`
- the last decoder layer's attention weights of shape
`(batch, tgt_len, src_len)`
"""
bsz, tgt_len = prev_output_tokens.size()
# Extract the final hidden state from the Encoder.
final_encoder_hidden = encoder_out['final_hidden']
# Embed the target sequence, which has been shifted right by one
# position and now starts with the end-of-sentence symbol.
x = self.embed_tokens(prev_output_tokens)
# Apply dropout.
x = self.dropout(x)
# Concatenate the Encoder's final hidden state to *every* embedded
# target token.
x = torch.cat(
[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
dim=2,
)
# Using PackedSequence objects in the Decoder is harder than in the
# Encoder, since the targets are not sorted in descending length order,
# which is a requirement of ``pack_padded_sequence()``. Instead we'll
# feed nn.LSTM directly.
initial_state = (
final_encoder_hidden.unsqueeze(0), # hidden
torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
)
output, _ = self.lstm(
x.transpose(0, 1), # convert to shape `(tgt_len, bsz, dim)`
initial_state,
)
x = output.transpose(0, 1) # convert to shape `(bsz, tgt_len, hidden)`
# Project the outputs to the size of the vocabulary.
x = self.output_projection(x)
# Return the logits and ``None`` for the attention weights
return x, None
2. Registering the Model
------------------------
Now that we've defined our Encoder and Decoder we must *register* our model with
fairseq using the :func:`~fairseq.models.register_model` function decorator.
Once the model is registered we'll be able to use it with the existing
:ref:`Command-line Tools`.
All registered models must implement the
:class:`~fairseq.models.BaseFairseqModel` interface. For sequence-to-sequence
models (i.e., any model with a single Encoder and Decoder), we can instead
implement the :class:`~fairseq.models.FairseqEncoderDecoderModel` interface.
Create a small wrapper class in the same file and register it in fairseq with
the name ``'simple_lstm'``::
from fairseq.models import FairseqEncoderDecoderModel, register_model
# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.
@register_model('simple_lstm')
class SimpleLSTMModel(FairseqEncoderDecoderModel):
@staticmethod
def add_args(parser):
# Models can override this method to add new command-line arguments.
# Here we'll add some new command-line arguments to configure dropout
# and the dimensionality of the embeddings and hidden states.
parser.add_argument(
'--encoder-embed-dim', type=int, metavar='N',
help='dimensionality of the encoder embeddings',
)
parser.add_argument(
'--encoder-hidden-dim', type=int, metavar='N',
help='dimensionality of the encoder hidden state',
)
parser.add_argument(
'--encoder-dropout', type=float, default=0.1,
help='encoder dropout probability',
)
parser.add_argument(
'--decoder-embed-dim', type=int, metavar='N',
help='dimensionality of the decoder embeddings',
)
parser.add_argument(
'--decoder-hidden-dim', type=int, metavar='N',
help='dimensionality of the decoder hidden state',
)
parser.add_argument(
'--decoder-dropout', type=float, default=0.1,
help='decoder dropout probability',
)
@classmethod
def build_model(cls, args, task):
# Fairseq initializes models by calling the ``build_model()``
# function. This provides more flexibility, since the returned model
# instance can be of a different type than the one that was called.
# In this case we'll just return a SimpleLSTMModel instance.
# Initialize our Encoder and Decoder.
encoder = SimpleLSTMEncoder(
args=args,
dictionary=task.source_dictionary,
embed_dim=args.encoder_embed_dim,
hidden_dim=args.encoder_hidden_dim,
dropout=args.encoder_dropout,
)
decoder = SimpleLSTMDecoder(
dictionary=task.target_dictionary,
encoder_hidden_dim=args.encoder_hidden_dim,
embed_dim=args.decoder_embed_dim,
hidden_dim=args.decoder_hidden_dim,
dropout=args.decoder_dropout,
)
model = SimpleLSTMModel(encoder, decoder)
# Print the model architecture.
print(model)
return model
# We could override the ``forward()`` if we wanted more control over how
# the encoder and decoder interact, but it's not necessary for this
# tutorial since we can inherit the default implementation provided by
# the FairseqEncoderDecoderModel base class, which looks like:
#
# def forward(self, src_tokens, src_lengths, prev_output_tokens):
# encoder_out = self.encoder(src_tokens, src_lengths)
# decoder_out = self.decoder(prev_output_tokens, encoder_out)
# return decoder_out
Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch tutorial_simple_lstm``::
from fairseq.models import register_model_architecture
# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'simple_lstm'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.
@register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
def tutorial_simple_lstm(args):
# We use ``getattr()`` to prioritize arguments that are explicitly given
# on the command-line, so that the defaults defined below are only used
# when no other value has been specified.
args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)
3. Training the Model
---------------------
Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Model architecture
(``--arch tutorial_simple_lstm``).
.. note::
Make sure you've already preprocessed the data from the IWSLT example in the
:file:`examples/translation/` directory.
.. code-block:: console
> fairseq-train data-bin/iwslt14.tokenized.de-en \
--arch tutorial_simple_lstm \
--encoder-dropout 0.2 --decoder-dropout 0.2 \
--optimizer adam --lr 0.005 --lr-shrink 0.5 \
--max-tokens 12000
(...)
| epoch 052 | loss 4.027 | ppl 16.30 | wps 420805 | ups 39.7 | wpb 9841 | bsz 400 | num_updates 20852 | lr 1.95313e-05 | gnorm 0.218 | clip 0% | oom 0 | wall 529 | train_wall 396
| epoch 052 | valid on 'valid' subset | valid_loss 4.74989 | valid_ppl 26.91 | num_updates 20852 | best 4.74954
The model files should appear in the :file:`checkpoints/` directory. While this
model architecture is not very good, we can use the :ref:`fairseq-generate` script to
generate translations and compute our BLEU score over the test set:
.. code-block:: console
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
4. Making generation faster
---------------------------
While autoregressive generation from sequence-to-sequence models is inherently
slow, our implementation above is especially slow because it recomputes the
entire sequence of Decoder hidden states for every output token (i.e., it is
``O(n^2)``). We can make this significantly faster by instead caching the
previous hidden states.
In fairseq this is called :ref:`Incremental decoding`. Incremental decoding is a
special mode at inference time where the Model only receives a single timestep
of input corresponding to the immediately previous output token (for teacher
forcing) and must produce the next output incrementally. Thus the model must
cache any long-term state that is needed about the sequence, e.g., hidden
states, convolutional states, etc.
To implement incremental decoding we will modify our model to implement the
:class:`~fairseq.models.FairseqIncrementalDecoder` interface. Compared to the
standard :class:`~fairseq.models.FairseqDecoder` interface, the incremental
decoder interface allows ``forward()`` methods to take an extra keyword argument
(*incremental_state*) that can be used to cache state across time-steps.
Let's replace our ``SimpleLSTMDecoder`` with an incremental one::
import torch
from fairseq.models import FairseqIncrementalDecoder
class SimpleLSTMDecoder(FairseqIncrementalDecoder):
def __init__(
self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
dropout=0.1,
):
# This remains the same as before.
super().__init__(dictionary)
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
self.lstm = nn.LSTM(
input_size=encoder_hidden_dim + embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
)
self.output_projection = nn.Linear(hidden_dim, len(dictionary))
# We now take an additional kwarg (*incremental_state*) for caching the
# previous hidden and cell states.
def forward(self, prev_output_tokens, encoder_out, incremental_state=None):
if incremental_state is not None:
# If the *incremental_state* argument is not ``None`` then we are
# in incremental inference mode. While *prev_output_tokens* will
# still contain the entire decoded prefix, we will only use the
# last step and assume that the rest of the state is cached.
prev_output_tokens = prev_output_tokens[:, -1:]
# This remains the same as before.
bsz, tgt_len = prev_output_tokens.size()
final_encoder_hidden = encoder_out['final_hidden']
x = self.embed_tokens(prev_output_tokens)
x = self.dropout(x)
x = torch.cat(
[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
dim=2,
)
# We will now check the cache and load the cached previous hidden and
# cell states, if they exist, otherwise we will initialize them to
# zeros (as before). We will use the ``utils.get_incremental_state()``
# and ``utils.set_incremental_state()`` helpers.
initial_state = utils.get_incremental_state(
self, incremental_state, 'prev_state',
)
if initial_state is None:
# first time initialization, same as the original version
initial_state = (
final_encoder_hidden.unsqueeze(0), # hidden
torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
)
# Run one step of our LSTM.
output, latest_state = self.lstm(x.transpose(0, 1), initial_state)
# Update the cache with the latest hidden and cell states.
utils.set_incremental_state(
self, incremental_state, 'prev_state', latest_state,
)
# This remains the same as before
x = output.transpose(0, 1)
x = self.output_projection(x)
return x, None
# The ``FairseqIncrementalDecoder`` interface also requires implementing a
# ``reorder_incremental_state()`` method, which is used during beam search
# to select and reorder the incremental state.
def reorder_incremental_state(self, incremental_state, new_order):
# Load the cached state.
prev_state = utils.get_incremental_state(
self, incremental_state, 'prev_state',
)
# Reorder batches according to *new_order*.
reordered_state = (
prev_state[0].index_select(1, new_order), # hidden
prev_state[1].index_select(1, new_order), # cell
)
# Update the cached state.
utils.set_incremental_state(
self, incremental_state, 'prev_state', reordered_state,
)
Finally, we can rerun generation and observe the speedup:
.. code-block:: console
# Before
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
# After
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 5.5s (1225.54 sentences/s, 27802.94 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
#module load compiler/intel/2021.3.0
export ROCM_PATH=/work/home/hepj/app/dtk-22.04.2
echo $ROCM_PATH
export HIP_PATH=${ROCM_PATH}/hip
export AMDGPU_TARGETS="gfx900;gfx906"
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PATH
export LD_LIBRARY_PATH=${ROCM_PATH}/lib:${ROCM_PATH}/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${ROCM_PATH}/hip/lib:${ROCM_PATH}/llvm/lib:${ROCM_PATH}/opencl/lib/x86_64:$LD_LIBRARY_PATH
#export LD_LIBRARY_PATH=${ROCM_PATH}/hip/lib:${ROCM_PATH}/llvm/lib:$LD_LIBRARY_PATH
#export C_INCLUDE_PATH=${ROCM_PATH}/include:${ROCM_PATH}/llvm/include${C_INCLUDE_PATH:+:${C_INCLUDE_PATH}}
export C_INCLUDE_PATH=${ROCM_PATH}/include:${ROCM_PATH}/llvm/include:/opencl/include
export CPLUS_INCLUDE_PATH=${ROCM_PATH}/include:${ROCM_PATH}/llvm/include
export PATH=${ROCM_PATH}/miopen/bin:${ROCM_PATH}/rocblas/bin:${ROCM_PATH}/hipsparse/bin:$PATH
export LD_LIBRARY_PATH=${ROCM_PATH}/miopen/lib:${ROCM_PATH}/rocblas/lib:$LD_LIBRARY_PATH
export MIOPEN_SYSTEM_DB_PATH=${ROCM_PATH}/miopen/share/miopen/db/
export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/lib64:$LIBRARY_PATH
export C_INCLUDE_PATH=/public/software/apps/deeplearning-depend/gflags-2.1.2-build/include:/public/software/apps/DeepLearning/PyTorch/glog-build/include:$C_INCLUDE_PATH
export DEEP_PATH=/public/software/apps/deeplearning-depend
export LD_LIBRARY_PATH=/work/home/hepj/.pyenv/versions/3.7.0/envs/torch/lib/python3.7/site-packages/Pillow.libs/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/public/software/apps/deeplearning-depend/lmdb-0.9.24-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/public/software/apps/deeplearning-depend/opencv-2.4.13.6-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/glog-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/opencv-2.4.13.6-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/openblas-0.3.7-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/gflags-2.1.2-build/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${DEEP_PATH}/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/public/software/apps/DeepLearning/PyTorch/openmp-build/lib:$LD_LIBRARY_PATH
#使用rocblas添加的路径
export LD_LIBRARY_PATH=/work/home/hepj/app/dtk-22.04.2/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/work/home/hepj/app/dtk-22.04.2/rocblas/lib/benchmark_tool:$LD_LIBRARY_PATH
\ No newline at end of file
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""isort:skip_file"""
import os
import sys
try:
from .version import __version__ # noqa
except ImportError:
version_txt = os.path.join(os.path.dirname(__file__), "version.txt")
with open(version_txt) as f:
__version__ = f.read().strip()
__all__ = ["pdb"]
# backwards compatibility to support `from fairseq.X import Y`
from fairseq.distributed import utils as distributed_utils
from fairseq.logging import meters, metrics, progress_bar # noqa
sys.modules["fairseq.distributed_utils"] = distributed_utils
sys.modules["fairseq.meters"] = meters
sys.modules["fairseq.metrics"] = metrics
sys.modules["fairseq.progress_bar"] = progress_bar
# initialize hydra
from fairseq.dataclass.initialize import hydra_init
hydra_init()
import fairseq.criterions # noqa
import fairseq.distributed # noqa
import fairseq.models # noqa
import fairseq.modules # noqa
import fairseq.optim # noqa
import fairseq.optim.lr_scheduler # noqa
import fairseq.pdb # noqa
import fairseq.scoring # noqa
import fairseq.tasks # noqa
import fairseq.token_generation_constraints # noqa
import fairseq.benchmark # noqa
import fairseq.model_parallel # noqa
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
# import models/tasks to register them
from . import dummy_dataset, dummy_lm, dummy_masked_lm, dummy_model, dummy_mt # noqa
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import itertools
import random
import torch
from torch.utils import benchmark
from fairseq.modules.multihead_attention import MultiheadAttention
BATCH = [20, 41, 97]
SEQ = 64
EMB = 48
HEADS = 4
DROP = 0.1
DEVICE = torch.device("cuda")
ATTN_MASK_DTYPE = [torch.uint8, torch.bool, torch.float]
KEY_PADDING_MASK_DTYPE = [torch.uint8, torch.bool]
def _reset_seeds():
torch.manual_seed(0)
random.seed(0)
def _get_mask(to_dtype: torch.dtype, dim0: int, dim1: int):
if to_dtype == torch.float:
mask = torch.randint(0, 2, (dim0, dim1)).to(dtype=torch.bool)
return mask.to(dtype=to_dtype).masked_fill(mask, -float("inf"))
return torch.randint(0, 2, (dim0, dim1)).to(dtype=to_dtype)
def benchmark_multihead_attention(
label="",
attn_dtype=torch.uint8,
key_padding_dtype=torch.uint8,
add_bias_kv=False,
add_zero_attn=False,
static_kv=False,
batch_size=20,
embedding=EMB,
seq_len=SEQ,
num_heads=HEADS,
):
results = []
# device = torch.device("cuda")
xformers_att_config = '{"name": "scaled_dot_product"}'
attn_mask = _get_mask(to_dtype=attn_dtype, dim0=seq_len, dim1=seq_len)
key_padding_mask = _get_mask(
to_dtype=key_padding_dtype, dim0=batch_size, dim1=seq_len
)
q = torch.rand(seq_len, batch_size, embedding, requires_grad=True)
k = torch.rand(seq_len, batch_size, embedding, requires_grad=True)
v = torch.rand(seq_len, batch_size, embedding, requires_grad=True)
_reset_seeds()
original_mha = MultiheadAttention(
embedding,
num_heads,
dropout=0.0,
xformers_att_config=None,
add_bias_kv=add_bias_kv,
add_zero_attn=add_zero_attn,
)
xformers_mha = MultiheadAttention(
embedding,
num_heads,
dropout=0.0,
xformers_att_config=xformers_att_config,
add_bias_kv=add_bias_kv,
add_zero_attn=add_zero_attn,
)
def original_bench_fw(q, k, v, key_padding_mask, attn_mask, static_kv):
original_mha(
query=q,
key=k,
value=v,
key_padding_mask=key_padding_mask,
attn_mask=attn_mask,
static_kv=static_kv,
)
def xformers_bench_fw(q, k, v, key_padding_mask, attn_mask, static_kv):
xformers_mha(
query=q,
key=k,
value=v,
key_padding_mask=key_padding_mask,
attn_mask=attn_mask,
static_kv=static_kv,
)
def original_bench_fw_bw(q, k, v, key_padding_mask, attn_mask, static_kv):
output, _ = original_mha(
query=q,
key=k,
value=v,
key_padding_mask=key_padding_mask,
attn_mask=attn_mask,
static_kv=static_kv,
)
loss = torch.norm(output)
loss.backward()
def xformers_bench_fw_bw(q, k, v, key_padding_mask, attn_mask, static_kv):
output, _ = xformers_mha(
query=q,
key=k,
value=v,
key_padding_mask=key_padding_mask,
attn_mask=attn_mask,
static_kv=static_kv,
)
loss = torch.norm(output)
loss.backward()
fns = [
original_bench_fw,
xformers_bench_fw,
original_bench_fw_bw,
xformers_bench_fw_bw,
]
for fn in fns:
results.append(
benchmark.Timer(
stmt="fn(q, k, v, key_padding_mask, attn_mask, static_kv)",
globals={
"q": q,
"k": k,
"v": v,
"key_padding_mask": key_padding_mask,
"attn_mask": attn_mask,
"static_kv": static_kv,
"fn": fn,
},
label="multihead fw + bw",
sub_label=f"{fn.__name__}",
description=label,
).blocked_autorange(min_run_time=1)
)
compare = benchmark.Compare(results)
compare.print()
def run_benchmarks():
for attn_dtype, key_padding_dtype, add_bias_kv, add_zero_attn in itertools.product(
ATTN_MASK_DTYPE, KEY_PADDING_MASK_DTYPE, [True, False], [True, False]
):
label = f"attn_dtype {attn_dtype}, key_padding_dtype {key_padding_dtype}, \
add_bias_kv {add_bias_kv}, add_zero_attn {add_zero_attn}"
benchmark_multihead_attention(
label=label,
attn_dtype=attn_dtype,
key_padding_dtype=key_padding_dtype,
add_bias_kv=add_bias_kv,
add_zero_attn=add_zero_attn,
)
run_benchmarks()
import numpy as np
from fairseq.data import FairseqDataset
class DummyDataset(FairseqDataset):
def __init__(self, batch, num_items, item_size):
super().__init__()
self.batch = batch
self.num_items = num_items
self.item_size = item_size
def __getitem__(self, index):
return index
def __len__(self):
return self.num_items
def collater(self, samples):
return self.batch
@property
def sizes(self):
return np.array([self.item_size] * self.num_items)
def num_tokens(self, index):
return self.item_size
def size(self, index):
return self.item_size
def ordered_indices(self):
return np.arange(self.num_items)
@property
def supports_prefetch(self):
return False
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
from dataclasses import dataclass, field
from typing import Optional
import torch
from .dummy_dataset import DummyDataset
from fairseq.data import Dictionary
from fairseq.dataclass import FairseqDataclass
from fairseq.tasks import FairseqTask, register_task
from omegaconf import II
logger = logging.getLogger(__name__)
@dataclass
class DummyLMConfig(FairseqDataclass):
dict_size: int = 49996
dataset_size: int = 100000
tokens_per_sample: int = field(
default=512, metadata={"help": "max sequence length"}
)
add_bos_token: bool = False
batch_size: Optional[int] = II("dataset.batch_size")
max_tokens: Optional[int] = II("dataset.max_tokens")
max_target_positions: int = II("task.tokens_per_sample")
@register_task("dummy_lm", dataclass=DummyLMConfig)
class DummyLMTask(FairseqTask):
def __init__(self, cfg: DummyLMConfig):
super().__init__(cfg)
# load dictionary
self.dictionary = Dictionary()
for i in range(cfg.dict_size):
self.dictionary.add_symbol("word{}".format(i))
self.dictionary.pad_to_multiple_(8) # often faster if divisible by 8
logger.info("dictionary: {} types".format(len(self.dictionary)))
seq = torch.arange(cfg.tokens_per_sample + 1) + self.dictionary.pad() + 1
self.dummy_src = seq[:-1]
self.dummy_tgt = seq[1:]
def load_dataset(self, split, epoch=1, combine=False, **kwargs):
"""Load a given dataset split.
Args:
split (str): name of the split (e.g., train, valid, test)
"""
if self.cfg.batch_size is not None:
bsz = self.cfg.batch_size
else:
bsz = max(1, self.cfg.max_tokens // self.cfg.tokens_per_sample)
self.datasets[split] = DummyDataset(
{
"id": 1,
"net_input": {
"src_tokens": torch.stack([self.dummy_src for _ in range(bsz)]),
"src_lengths": torch.full(
(bsz,), self.cfg.tokens_per_sample, dtype=torch.long
),
},
"target": torch.stack([self.dummy_tgt for _ in range(bsz)]),
"nsentences": bsz,
"ntokens": bsz * self.cfg.tokens_per_sample,
},
num_items=self.cfg.dataset_size,
item_size=self.cfg.tokens_per_sample,
)
@property
def source_dictionary(self):
return self.dictionary
@property
def target_dictionary(self):
return self.dictionary
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
from dataclasses import dataclass, field
from typing import Optional
import torch
from omegaconf import II
from .dummy_dataset import DummyDataset
from fairseq.data import Dictionary
from fairseq.dataclass import FairseqDataclass
from fairseq.tasks import FairseqTask, register_task
logger = logging.getLogger(__name__)
@dataclass
class DummyMaskedLMConfig(FairseqDataclass):
dict_size: int = 49996
dataset_size: int = 100000
tokens_per_sample: int = field(
default=512,
metadata={
"help": "max number of total tokens over all"
" segments per sample for BERT dataset"
},
)
batch_size: Optional[int] = II("dataset.batch_size")
max_tokens: Optional[int] = II("dataset.max_tokens")
max_target_positions: int = II("task.tokens_per_sample")
@register_task("dummy_masked_lm", dataclass=DummyMaskedLMConfig)
class DummyMaskedLMTask(FairseqTask):
def __init__(self, cfg: DummyMaskedLMConfig):
super().__init__(cfg)
self.dictionary = Dictionary()
for i in range(cfg.dict_size):
self.dictionary.add_symbol("word{}".format(i))
logger.info("dictionary: {} types".format(len(self.dictionary)))
# add mask token
self.mask_idx = self.dictionary.add_symbol("<mask>")
self.dictionary.pad_to_multiple_(8) # often faster if divisible by 8
mask_idx = 0
pad_idx = 1
seq = torch.arange(cfg.tokens_per_sample) + pad_idx + 1
mask = torch.arange(2, cfg.tokens_per_sample, 7) # ~15%
src = seq.clone()
src[mask] = mask_idx
tgt = torch.full_like(seq, pad_idx)
tgt[mask] = seq[mask]
self.dummy_src = src
self.dummy_tgt = tgt
def load_dataset(self, split, epoch=1, combine=False, **kwargs):
"""Load a given dataset split.
Args:
split (str): name of the split (e.g., train, valid, test)
"""
if self.cfg.batch_size is not None:
bsz = self.cfg.batch_size
else:
bsz = max(1, self.cfg.max_tokens // self.cfg.tokens_per_sample)
self.datasets[split] = DummyDataset(
{
"id": 1,
"net_input": {
"src_tokens": torch.stack([self.dummy_src for _ in range(bsz)]),
"src_lengths": torch.full(
(bsz,), self.cfg.tokens_per_sample, dtype=torch.long
),
},
"target": torch.stack([self.dummy_tgt for _ in range(bsz)]),
"nsentences": bsz,
"ntokens": bsz * self.cfg.tokens_per_sample,
},
num_items=self.cfg.dataset_size,
item_size=self.cfg.tokens_per_sample,
)
@property
def source_dictionary(self):
return self.dictionary
@property
def target_dictionary(self):
return self.dictionary
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import torch.nn as nn
import torch.nn.functional as F
from fairseq.data import Dictionary
from fairseq.models import (
FairseqDecoder,
FairseqLanguageModel,
register_model,
register_model_architecture,
)
@register_model("dummy_model")
class DummyModel(FairseqLanguageModel):
def __init__(self, args, encoder):
super().__init__(encoder)
self.args = args
@staticmethod
def add_args(parser):
parser.add_argument("--num-layers", type=int, default=24)
parser.add_argument("--embed-dim", type=int, default=1024)
@classmethod
def build_model(cls, args, task):
encoder = DummyEncoder(
num_embed=len(task.target_dictionary),
embed_dim=args.embed_dim,
num_layers=args.num_layers,
)
return cls(args, encoder)
def forward(self, src_tokens, masked_tokens=None, **kwargs):
return self.decoder(src_tokens, masked_tokens=masked_tokens)
class DummyEncoder(FairseqDecoder):
def __init__(self, num_embed=50000, embed_dim=1024, num_layers=24):
super().__init__(Dictionary())
self.embed = nn.Embedding(
num_embeddings=num_embed, embedding_dim=embed_dim, padding_idx=0
)
self.layers_a = nn.ModuleList(
[
nn.Sequential(
nn.LayerNorm(embed_dim),
nn.Linear(embed_dim, 3 * embed_dim), # q, k, v input projection
nn.Linear(3 * embed_dim, embed_dim), # skip self-attention
nn.Linear(embed_dim, embed_dim), # output projection
nn.Dropout(),
)
for i in range(num_layers)
]
)
self.layers_b = nn.ModuleList(
[
nn.Sequential(
nn.LayerNorm(embed_dim),
nn.Linear(embed_dim, 4 * embed_dim), # FFN
nn.ReLU(),
nn.Linear(4 * embed_dim, embed_dim), # FFN
nn.Dropout(0.1),
)
for i in range(num_layers)
]
)
self.out_proj = nn.Linear(embed_dim, num_embed)
def forward(self, tokens, masked_tokens=None):
x = self.embed(tokens)
for layer_a, layer_b in zip(self.layers_a, self.layers_b):
x = x + layer_a(x)
x = x + layer_b(x)
x = self.out_proj(x)
if masked_tokens is not None:
x = x[masked_tokens]
return (x,)
def max_positions(self):
return 1024
def get_normalized_probs(self, net_output, log_probs, sample=None):
logits = net_output[0].float()
if log_probs:
return F.log_softmax(logits, dim=-1)
else:
return F.softmax(logits, dim=-1)
@register_model_architecture("dummy_model", "dummy_model")
def base_architecture(args):
pass
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
import numpy as np
import torch
from fairseq.data import Dictionary, FairseqDataset
from fairseq.tasks import LegacyFairseqTask, register_task
logger = logging.getLogger(__name__)
@register_task("dummy_mt")
class DummyMTTask(LegacyFairseqTask):
@staticmethod
def add_args(parser):
"""Add task-specific arguments to the parser."""
parser.add_argument("--dict-size", default=49996, type=int)
parser.add_argument("--dataset-size", default=100000, type=int)
parser.add_argument("--src-len", default=30, type=int)
parser.add_argument("--tgt-len", default=30, type=int)
def __init__(self, args, dictionary):
super().__init__(args)
self.dictionary = dictionary
self.seed = args.seed
dictionary.pad_to_multiple_(8) # often faster if divisible by 8
self.dummy_src = torch.arange(args.src_len + 1) + dictionary.pad() + 1
self.dummy_tgt = torch.arange(args.tgt_len + 1) + dictionary.pad() + 1
@classmethod
def setup_task(cls, args, **kwargs):
"""Setup the task."""
dictionary = Dictionary()
for i in range(args.dict_size):
dictionary.add_symbol("word{}".format(i))
logger.info("dictionary: {} types".format(len(dictionary)))
args.max_source_positions = args.src_len + dictionary.pad() + 2
args.max_target_positions = args.tgt_len + dictionary.pad() + 2
return cls(args, dictionary)
def load_dataset(self, split, epoch=1, combine=False, **kwargs):
"""Load a given dataset split.
Args:
split (str): name of the split (e.g., train, valid, test)
"""
item_size = max(self.args.src_len, self.args.tgt_len)
if self.args.batch_size is not None:
bsz = self.args.batch_size
else:
bsz = max(1, self.args.max_tokens // item_size)
tgt = torch.stack([self.dummy_tgt for _ in range(bsz)])
self.datasets[split] = DummyDataset(
{
"id": 1,
"net_input": {
"src_tokens": torch.stack([self.dummy_src for _ in range(bsz)]),
"src_lengths": torch.full(
(bsz,), self.args.src_len, dtype=torch.long
),
"prev_output_tokens": tgt.clone(),
},
"target": tgt,
"nsentences": bsz,
"ntokens": bsz * self.args.tgt_len,
},
num_items=self.args.dataset_size,
item_size=item_size,
)
@property
def source_dictionary(self):
return self.dictionary
@property
def target_dictionary(self):
return self.dictionary
class DummyDataset(FairseqDataset):
def __init__(self, batch, num_items, item_size):
super().__init__()
self.batch = batch
self.num_items = num_items
self.item_size = item_size
def __getitem__(self, index):
return index
def __len__(self):
return self.num_items
def collater(self, samples):
return self.batch
@property
def sizes(self):
return np.array([self.item_size] * self.num_items)
def num_tokens(self, index):
return self.item_size
def size(self, index):
return self.item_size
def ordered_indices(self):
return np.arange(self.num_items)
@property
def supports_prefetch(self):
return False
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
import os
import typing as tp
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass
from multiprocessing import Pool
import torch
from fairseq.data import Dictionary, indexed_dataset
from fairseq.file_chunker_utils import Chunker, find_offsets
from fairseq.file_io import PathManager
from fairseq.tokenizer import tokenize_line
logger = logging.getLogger("binarizer")
@dataclass
class BinarizeSummary:
"""
Keep track of what's going on in the binarizer
"""
num_seq: int = 0
replaced: tp.Optional[Counter] = None
num_tok: int = 0
@property
def num_replaced(self) -> int:
if self.replaced is None:
return 0
return sum(self.replaced.values())
@property
def replaced_percent(self) -> float:
return 100 * self.num_replaced / self.num_tok
def __str__(self) -> str:
base = f"{self.num_seq} sents, {self.num_tok} tokens"
if self.replaced is None:
return base
return f"{base}, {self.replaced_percent:.3}% replaced"
def merge(self, other: "BinarizeSummary"):
replaced = None
if self.replaced is not None:
replaced = self.replaced
if other.replaced is not None:
if replaced is None:
replaced = other.replaced
else:
replaced += other.replaced
self.replaced = replaced
self.num_seq += other.num_seq
self.num_tok += other.num_tok
class Binarizer(ABC):
"""
a binarizer describes how to take a string and build a tensor out of it
"""
@abstractmethod
def binarize_line(
self,
line: str,
summary: BinarizeSummary,
) -> torch.IntTensor:
...
def _worker_prefix(output_prefix: str, worker_id: int):
return f"{output_prefix}.pt{worker_id}"
class FileBinarizer:
"""
An file binarizer can take a file, tokenize it, and binarize each line to a tensor
"""
@classmethod
def multiprocess_dataset(
cls,
input_file: str,
dataset_impl: str,
binarizer: Binarizer,
output_prefix: str,
vocab_size=None,
num_workers=1,
) -> BinarizeSummary:
final_summary = BinarizeSummary()
offsets = find_offsets(input_file, num_workers)
# find_offsets returns a list of position [pos1, pos2, pos3, pos4] but we would want pairs:
# [(pos1, pos2), (pos2, pos3), (pos3, pos4)] to process the chunks with start/end info
# we zip the list with itself shifted by one to get all the pairs.
(first_chunk, *more_chunks) = zip(offsets, offsets[1:])
pool = None
if num_workers > 1:
pool = Pool(processes=num_workers - 1)
worker_results = [
pool.apply_async(
cls._binarize_chunk_and_finalize,
args=(
binarizer,
input_file,
start_offset,
end_offset,
_worker_prefix(
output_prefix,
worker_id,
),
dataset_impl,
),
kwds={
"vocab_size": vocab_size,
}
if vocab_size is not None
else {},
)
for worker_id, (start_offset, end_offset) in enumerate(
more_chunks, start=1
)
]
pool.close()
pool.join()
for r in worker_results:
summ = r.get()
final_summary.merge(summ)
# do not close the bin file as we need to merge the worker results in
final_ds, summ = cls._binarize_file_chunk(
binarizer,
input_file,
offset_start=first_chunk[0],
offset_end=first_chunk[1],
output_prefix=output_prefix,
dataset_impl=dataset_impl,
vocab_size=vocab_size if vocab_size is not None else None,
)
final_summary.merge(summ)
if num_workers > 1:
for worker_id in range(1, num_workers):
# merge the worker outputs
worker_output_prefix = _worker_prefix(
output_prefix,
worker_id,
)
final_ds.merge_file_(worker_output_prefix)
try:
os.remove(indexed_dataset.data_file_path(worker_output_prefix))
os.remove(indexed_dataset.index_file_path(worker_output_prefix))
except Exception as e:
logger.error(
f"couldn't remove {worker_output_prefix}.*", exc_info=e
)
# now we can close the file
idx_file = indexed_dataset.index_file_path(output_prefix)
final_ds.finalize(idx_file)
return final_summary
@staticmethod
def _binarize_file_chunk(
binarizer: Binarizer,
filename: str,
offset_start: int,
offset_end: int,
output_prefix: str,
dataset_impl: str,
vocab_size=None,
) -> tp.Tuple[tp.Any, BinarizeSummary]: # (dataset builder, BinarizeSummary)
"""
creates a dataset builder and append binarized items to it. This function does not
finalize the builder, this is useful if you want to do other things with your bin file
like appending/merging other files
"""
bin_file = indexed_dataset.data_file_path(output_prefix)
ds = indexed_dataset.make_builder(
bin_file,
impl=dataset_impl,
vocab_size=vocab_size,
)
summary = BinarizeSummary()
with Chunker(
PathManager.get_local_path(filename), offset_start, offset_end
) as line_iterator:
for line in line_iterator:
ds.add_item(binarizer.binarize_line(line, summary))
return ds, summary
@classmethod
def _binarize_chunk_and_finalize(
cls,
binarizer: Binarizer,
filename: str,
offset_start: int,
offset_end: int,
output_prefix: str,
dataset_impl: str,
vocab_size=None,
):
"""
same as above, but also finalizes the builder
"""
ds, summ = cls._binarize_file_chunk(
binarizer,
filename,
offset_start,
offset_end,
output_prefix,
dataset_impl,
vocab_size=vocab_size,
)
idx_file = indexed_dataset.index_file_path(output_prefix)
ds.finalize(idx_file)
return summ
class VocabularyDatasetBinarizer(Binarizer):
"""
Takes a Dictionary/Vocabulary, assign ids to each
token using the dictionary encode_line function.
"""
def __init__(
self,
dict: Dictionary,
tokenize: tp.Callable[[str], tp.List[str]] = tokenize_line,
append_eos: bool = True,
reverse_order: bool = False,
already_numberized: bool = False,
) -> None:
self.dict = dict
self.tokenize = tokenize
self.append_eos = append_eos
self.reverse_order = reverse_order
self.already_numberized = already_numberized
super().__init__()
def binarize_line(
self,
line: str,
summary: BinarizeSummary,
):
if summary.replaced is None:
summary.replaced = Counter()
def replaced_consumer(word, idx):
if idx == self.dict.unk_index and word != self.dict.unk_word:
summary.replaced.update([word])
if self.already_numberized:
id_strings = line.strip().split()
id_list = [int(id_string) for id_string in id_strings]
if self.reverse_order:
id_list.reverse()
if self.append_eos:
id_list.append(self.dict.eos())
ids = torch.IntTensor(id_list)
else:
ids = self.dict.encode_line(
line=line,
line_tokenizer=self.tokenize,
add_if_not_exist=False,
consumer=replaced_consumer,
append_eos=self.append_eos,
reverse_order=self.reverse_order,
)
summary.num_seq += 1
summary.num_tok += len(ids)
return ids
class AlignmentDatasetBinarizer(Binarizer):
"""
binarize by parsing a set of alignments and packing
them in a tensor (see utils.parse_alignment)
"""
def __init__(
self,
alignment_parser: tp.Callable[[str], torch.IntTensor],
) -> None:
super().__init__()
self.alignment_parser = alignment_parser
def binarize_line(
self,
line: str,
summary: BinarizeSummary,
):
ids = self.alignment_parser(line)
summary.num_seq += 1
summary.num_tok += len(ids)
return ids
class LegacyBinarizer:
@classmethod
def binarize(
cls,
filename: str,
dico: Dictionary,
consumer: tp.Callable[[torch.IntTensor], None],
tokenize: tp.Callable[[str], tp.List[str]] = tokenize_line,
append_eos: bool = True,
reverse_order: bool = False,
offset: int = 0,
end: int = -1,
already_numberized: bool = False,
) -> tp.Dict[str, int]:
binarizer = VocabularyDatasetBinarizer(
dict=dico,
tokenize=tokenize,
append_eos=append_eos,
reverse_order=reverse_order,
already_numberized=already_numberized,
)
return cls._consume_file(
filename,
binarizer,
consumer,
offset_start=offset,
offset_end=end,
)
@classmethod
def binarize_alignments(
cls,
filename: str,
alignment_parser: tp.Callable[[str], torch.IntTensor],
consumer: tp.Callable[[torch.IntTensor], None],
offset: int = 0,
end: int = -1,
) -> tp.Dict[str, int]:
binarizer = AlignmentDatasetBinarizer(alignment_parser)
return cls._consume_file(
filename,
binarizer,
consumer,
offset_start=offset,
offset_end=end,
)
@staticmethod
def _consume_file(
filename: str,
binarizer: Binarizer,
consumer: tp.Callable[[torch.IntTensor], None],
offset_start: int,
offset_end: int,
) -> tp.Dict[str, int]:
summary = BinarizeSummary()
with Chunker(
PathManager.get_local_path(filename), offset_start, offset_end
) as line_iterator:
for line in line_iterator:
consumer(binarizer.binarize_line(line, summary))
return {
"nseq": summary.num_seq,
"nunk": summary.num_replaced,
"ntok": summary.num_tok,
"replaced": summary.replaced,
}
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment