ModelZoo / SpeechT5_pytorch · Commits

Commit 12c90639, authored Sep 28, 2024 by “change”

    init

parent 417b607b

Changes: 350 files; showing 20 changed files with 2088 additions and 0 deletions (+2088, -0)
SpeechT5/fairseq/docs/fairseq.gif                              +0    -0
SpeechT5/fairseq/docs/fairseq_logo.png                         +0    -0
SpeechT5/fairseq/docs/getting_started.rst                      +216  -0
SpeechT5/fairseq/docs/hydra_integration.md                     +284  -0
SpeechT5/fairseq/docs/index.rst                                +49   -0
SpeechT5/fairseq/docs/lr_scheduler.rst                         +34   -0
SpeechT5/fairseq/docs/make.bat                                 +36   -0
SpeechT5/fairseq/docs/models.rst                               +104  -0
SpeechT5/fairseq/docs/modules.rst                              +9    -0
SpeechT5/fairseq/docs/optim.rst                                +38   -0
SpeechT5/fairseq/docs/overview.rst                             +74   -0
SpeechT5/fairseq/docs/requirements.txt                         +2    -0
SpeechT5/fairseq/docs/tasks.rst                                +61   -0
SpeechT5/fairseq/docs/tutorial_classifying_names.rst           +415  -0
SpeechT5/fairseq/docs/tutorial_simple_lstm.rst                 +518  -0
SpeechT5/fairseq/examples/.gitignore                           +2    -0
SpeechT5/fairseq/examples/__init__.py                          +9    -0
SpeechT5/fairseq/examples/adaptive_span/README.md              +90   -0
SpeechT5/fairseq/examples/adaptive_span/__init__.py            +19   -0
SpeechT5/fairseq/examples/adaptive_span/adagrad_with_grad_clip.py  +128  -0
Too many changes to show. To preserve performance only 350 of 350+ files are displayed.
SpeechT5/fairseq/docs/fairseq.gif (new file, mode 100644; 2.54 MB)
SpeechT5/fairseq/docs/fairseq_logo.png (new file, mode 100644; 71.3 KB)
SpeechT5/fairseq/docs/getting_started.rst (new file, mode 100644)
Evaluating Pre-trained Models
=============================
First, download a pre-trained model along with its vocabularies:
.. code-block:: console
> curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
This model uses a `Byte Pair Encoding (BPE)
vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
the encoding to the source text before it can be translated. This can be
done with the
`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py>`__
script using the ``wmt14.en-fr.fconv-cuda/bpecodes`` file. ``@@`` is
used as a continuation marker and the original text can be easily
recovered with e.g. ``sed s/@@ //g`` or by passing the ``--remove-bpe``
flag to :ref:`fairseq-generate`. Prior to BPE, input text needs to be tokenized
using ``tokenizer.perl`` from
`mosesdecoder <https://github.com/moses-smt/mosesdecoder>`__.
Let's use :ref:`fairseq-interactive` to generate translations interactively.
Here, we use a beam size of 5 and preprocess the input with the Moses
tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically
remove the BPE continuation markers and detokenize the output.
.. code-block:: console
> MODEL_DIR=wmt14.en-fr.fconv-py
> fairseq-interactive \
--path $MODEL_DIR/model.pt $MODEL_DIR \
--beam 5 --source-lang en --target-lang fr \
--tokenizer moses \
--bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
Why is it rare to discover new marine mammal species?
S-0 Why is it rare to discover new marine mam@@ mal species ?
H-0 -0.0643349438905716 Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0 -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
This generation script produces three types of outputs: a line prefixed
with *O* is a copy of the original source sentence; *H* is the
hypothesis along with an average log-likelihood; and *P* is the
positional score per token position, including the
end-of-sentence marker which is omitted from the text.
Other types of output lines you might see are *D*, the detokenized hypothesis,
*T*, the reference target, *A*, alignment info, and *E*, the history of generation steps.
See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
full list of pre-trained models available.
Training a New Model
====================
The following tutorial is for machine translation. For an example of how
to use Fairseq for other tasks, such as :ref:`language modeling`, please see the
``examples/`` directory.
Data Pre-processing
-------------------
Fairseq contains example pre-processing scripts for several translation
datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT
2014 (English-German). To pre-process and binarize the IWSLT dataset:
.. code-block:: console
> cd examples/translation/
> bash prepare-iwslt14.sh
> cd ../..
> TEXT=examples/translation/iwslt14.tokenized.de-en
> fairseq-preprocess --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/iwslt14.tokenized.de-en
This will write binarized data that can be used for model training to
``data-bin/iwslt14.tokenized.de-en``.
Training
--------
Use :ref:`fairseq-train` to train a new model. Here are a few example settings that work
well for the IWSLT 2014 dataset:
.. code-block:: console
> mkdir -p checkpoints/fconv
> CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
--optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv
By default, :ref:`fairseq-train` will use all available GPUs on your machine. Use the
``CUDA_VISIBLE_DEVICES`` environment variable to select specific GPUs and/or to
change the number of GPU devices that will be used.
Also note that the batch size is specified in terms of the maximum
number of tokens per batch (``--max-tokens``). You may need to use a
smaller value depending on the available GPU memory on your system.
Generation
----------
Once your model is trained, you can generate translations using
:ref:`fairseq-generate` **(for binarized data)** or
:ref:`fairseq-interactive` **(for raw text)**:
.. code-block:: console
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/fconv/checkpoint_best.pt \
--batch-size 128 --beam 5
| [de] dictionary: 35475 types
| [en] dictionary: 24739 types
| data-bin/iwslt14.tokenized.de-en test 6750 examples
| model fconv
| loaded checkpoint trainings/fconv/checkpoint_best.pt
S-721 danke .
T-721 thank you .
...
To generate translations with only a CPU, use the ``--cpu`` flag. BPE
continuation markers can be removed with the ``--remove-bpe`` flag.
Advanced Training Options
=========================
Large mini-batch training with delayed updates
----------------------------------------------
The ``--update-freq`` option can be used to accumulate gradients from
multiple mini-batches and delay updating, creating a larger effective
batch size. Delayed updates can also improve training speed by reducing
inter-GPU communication costs and by saving idle time caused by variance
in workload across GPUs. See `Ott et al.
(2018) <https://arxiv.org/abs/1806.00187>`__ for more details.
To train on a single GPU with an effective batch size that is equivalent
to training on 8 GPUs:
.. code-block:: console
> CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
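Conceptually, ``--update-freq 8`` corresponds to the following plain-PyTorch
gradient accumulation pattern (a rough illustrative sketch with toy stand-ins,
not fairseq's actual trainer code)::

    import torch
    import torch.nn as nn

    # Toy stand-ins, only to illustrate the accumulation pattern.
    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    data = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(16)]

    update_freq = 8
    optimizer.zero_grad()
    for i, (x, y) in enumerate(data):
        loss = nn.functional.cross_entropy(model(x), y)
        (loss / update_freq).backward()   # accumulate scaled gradients
        if (i + 1) % update_freq == 0:
            optimizer.step()              # one parameter update per 8 mini-batches
            optimizer.zero_grad()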
Training with half precision floating point (FP16)
--------------------------------------------------
.. note::
FP16 training requires a Volta GPU and CUDA 9.1 or greater
Recent GPUs enable efficient half precision floating point computation,
e.g., using `Nvidia Tensor Cores
<https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__.
Fairseq supports FP16 training with the ``--fp16`` flag:
.. code-block:: console
> fairseq-train --fp16 (...)
Distributed training
--------------------
Distributed training in fairseq is implemented on top of ``torch.distributed``.
The easiest way to launch jobs is with the `torch.distributed.launch
<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.
For example, to train a large English-German Transformer model on 2 nodes each
with 8 GPUs (in total 16 GPUs), run the following command on each node,
replacing ``node_rank=0`` with ``node_rank=1`` on the second node and making
sure to update ``--master_addr`` to the IP address of the first node:
.. code-block:: console
> python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
--master_port=12345 \
$(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--max-epoch 70 \
--fp16
On SLURM clusters, fairseq will automatically detect the number of nodes and
GPUs, but a port number must be provided:
.. code-block:: console
> salloc --gpus=16 --nodes 2 (...)
> srun fairseq-train --distributed-port 12345 (...)
Sharding very large datasets
----------------------------
It can be challenging to train over very large datasets, particularly if your
machine does not have much system RAM. Most tasks in fairseq support training
over "sharded" datasets, in which the original dataset has been preprocessed
into non-overlapping chunks (or "shards").
For example, instead of preprocessing all your data into a single "data-bin"
directory, you can split the data and create "data-bin1", "data-bin2", etc.
Then you can adapt your training command like so:
.. code-block:: console
> fairseq-train data-bin1:data-bin2:data-bin3 (...)
Training will now iterate over each shard, one by one, with each shard
corresponding to an "epoch", thus reducing system memory usage.
SpeechT5/fairseq/docs/hydra_integration.md (new file, mode 100644)
## Hydra

[Hydra](https://github.com/facebookresearch/hydra) is an open-source Python
framework that simplifies the development of research and other complex
applications. The key feature is the ability to dynamically create a
hierarchical configuration by composition and override it through config files
and the command line. The name Hydra comes from its ability to run multiple
similar jobs - much like a Hydra with multiple heads.

## Motivation

Until recently, all components in fairseq were configured through a shared
`args` namespace that was created at application startup. Components declared
their own `add_args` method to update the argparse parser, hoping that the names
would not clash with arguments from other components. While this model works for
smaller applications, as fairseq grew and became integrated into other
applications, this became problematic. In order to determine how to configure
each component, one needed to a) examine what args were added by this component,
and b) read the code to figure out what shared arguments it is using that were
added in other places. Reproducing models involved sharing commands that often
contained dozens of command line switches.

The model described above is still supported by fairseq for backward
compatibility, but will be deprecated some time in the future.

New components in fairseq should now create a dataclass that encapsulates all
parameters required to configure this component. The dataclass is registered
along with the component, and fairseq takes care of constructing and providing
this configuration object to the component's constructor. Note that sharing
parameters can optionally still work, but one has to explicitly point to the
"source of truth" (see inheritance example below). These changes make components
in fairseq more independent and re-usable by other applications: all that is
needed to create a component is to initialize its dataclass and overwrite some
of the defaults.

While configuring fairseq through command line (using either the legacy argparse
based or the new Hydra based entry points) is still fully supported, you can now
take advantage of configuring fairseq completely or piece-by-piece through
hierarchical YAML configuration files. These files can also be shipped as
examples that others can use to run an identically configured job.

Additionally, Hydra has a rich and growing [library of
plugins](https://github.com/facebookresearch/hydra/tree/master/plugins) that
provide functionality such as hyperparameter sweeping (including using Bayesian
optimization through the [Ax](https://github.com/facebook/Ax) library), job
launching across various platforms, and more.

## Creating or migrating components

In general, each new (or updated) component should provide a companion
[dataclass](https://www.python.org/dev/peps/pep-0557/). These dataclasses are
typically located in the same file as the component and are passed as arguments
to the `register_*()` functions. Top-level configs that should be present in
every fairseq application are placed in the
[global](fairseq/dataclass/configs.py) config file and added to the
`FairseqConfig` object.

Each dataclass is a plain-old-data object, similar to a `NamedTuple`. These
classes are decorated with a `@dataclass` decorator, and typically inherit from
`FairseqDataclass` (which adds some functionality for backward compatibility).
Each field must have a type, and generally has metadata (such as a help string)
and a default value. Only primitive types or other config objects are allowed as
data types for each field.
#### Example:

```python
from dataclasses import dataclass, field

from fairseq.dataclass import FairseqDataclass


@dataclass
class InteractiveConfig(FairseqDataclass):
    buffer_size: int = field(
        default=0,
        metadata={"help": "read this many sentences into a buffer before processing them"},
    )
    input: str = field(
        default="-",
        metadata={"help": "file to read from; use - for stdin"},
    )
```
### Inheriting values

Some components require sharing a value. For example, a learning rate scheduler
and an optimizer may both need to know the initial learning rate value. One can
declare a field that, by default, will inherit its value from another config
node in the same hierarchy:

```python
@dataclass
class FairseqAdamConfig(FairseqDataclass):
    ...
    lr: List[float] = II("optimization.lr")
    ...
```

`II("optimization.lr")` is syntactic sugar for `"${optimization.lr}"`, which is
the value one can use in a YAML config file or through command line to achieve
the same effect. Note that this assumes that there is an "optimization" config
object in the root config and it has a field called "lr".
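For illustration, here is a small self-contained sketch of how such an interpolation
resolves, using plain `omegaconf` outside of fairseq (the `OptimizationConfig` and
`AdamConfig` names below are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List

from omegaconf import II, OmegaConf


@dataclass
class OptimizationConfig:
    lr: List[float] = field(default_factory=lambda: [0.25])


@dataclass
class AdamConfig:
    # inherits its value from optimization.lr unless overridden explicitly
    lr: List[float] = II("optimization.lr")


cfg = OmegaConf.create({
    "optimization": OmegaConf.structured(OptimizationConfig()),
    "adam": OmegaConf.structured(AdamConfig()),
})
print(cfg.adam.lr)  # [0.25], resolved from optimization.lr at access time
```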
### Tasks and Models

Creating Tasks and Models works the same as before, except that legacy
implementations now inherit from `LegacyFairseq*` base classes, while new
components inherit from `FairseqTask` and `FairseqModel` and provide a dataclass
to the `register_*()` functions.

#### Task example:

```python
@dataclass
class LanguageModelingConfig(FairseqDataclass):
    data: Optional[str] = field(
        default=None, metadata={"help": "path to data directory"}
    )
    ...


@register_task("language_modeling", dataclass=LanguageModelingConfig)
class LanguageModelingTask(FairseqTask):
    ...

    @classmethod
    def setup_task(cls, cfg: LanguageModelingConfig):
        ...
```
#### Model example:

```python
@dataclass
class TransformerLanguageModelConfig(FairseqDataclass):
    activation_fn: ChoiceEnum(utils.get_available_activation_fns()) = field(
        default="relu", metadata={"help": "activation function to use"}
    )
    dropout: float = field(default=0.1, metadata={"help": "dropout probability"})
    ...


@register_model("transformer_lm", dataclass=TransformerLanguageModelConfig)
class TransformerLanguageModel(FairseqLanguageModel):
    ...

    @classmethod
    def build_model(cls, cfg: TransformerLanguageModelConfig, task: FairseqTask):
        ...
```
### Other components

Other components work as before, but they now take their configuration dataclass
as the only constructor argument:

```python
@dataclass
class MosesTokenizerConfig(FairseqDataclass):
    source_lang: str = field(default="en", metadata={"help": "source language"})
    ...


@register_tokenizer("moses", dataclass=MosesTokenizerConfig)
class MosesTokenizer(object):
    def __init__(self, cfg: MosesTokenizerConfig):
        ...
```

Note that if you are adding a new registry for a new set of components, you need
to add it to the `FairseqConfig` object in `fairseq/dataclass/configs.py`:

```python
@dataclass
class FairseqConfig(object):
    ...
    my_new_registry: Any = None
```
## Training with `fairseq-hydra-train`

To fully take advantage of configuration flexibility offered by Hydra, you may
want to train new models using the `fairseq-hydra-train` entry point. Legacy CLI
tools such as `fairseq-train` will remain supported for the foreseeable future
but will be deprecated eventually.

On startup, Hydra will create a configuration object that contains a hierarchy
of all the necessary dataclasses populated with their default values in the
code. The default values are overwritten by values found in YAML files in the
`fairseq/config` directory (which currently sets minimal defaults) and then
further overwritten by values provided through command line arguments.

Some of the most common use cases are shown below:

### 1. Override default values through command line:

```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    dataset.batch_size=2 \
    task.data=data-bin \
    model=transformer_lm/transformer_lm_gpt \
    task=language_modeling \
    optimization.max_update=5000
```

Note that along with explicitly providing values for parameters such as
`dataset.batch_size`, this also tells Hydra to overlay configuration found in
`fairseq/config/model/transformer_lm/transformer_lm_gpt.yaml` over the default
values in the dataclass. If you want to train a model without specifying a
particular architecture you can simply specify `model=transformer_lm`. This only
works for migrated tasks and models.
### 2. Replace bundled configs with an external config:

```shell script
$ fairseq-hydra-train \
    --config-dir /path/to/external/configs \
    --config-name wiki103
```

where `/path/to/external/configs/wiki103.yaml` contains:

```yaml
# @package _group_

model:
  _name: transformer_lm
distributed_training:
  distributed_world_size: 1
dataset:
  batch_size: 2
task:
  _name: language_modeling
  data: /path/to/data
  add_bos_token: false
  max_target_positions: 1024
optimization:
  max_update: 50000
  lr: [0.25]
criterion: cross_entropy
optimizer: adam
lr_scheduler:
  _name: cosine
```
Note that here bundled configs from the `fairseq/config` directory are not used,
however the defaults from each dataclass will still be used (unless overwritten
by your external config).

Additionally you can choose to break up your configs by creating a directory
structure in the same location as your main config file, with the names of the
top-level fields (such as "model", "dataset", etc), and placing config files
with meaningful names that would populate that specific section of your
top-level config file (for example, you might have
`model/small_transformer_lm.yaml`, `model/big_transformer_lm.yaml`, etc). You
can then specify the correct configuration via command line, defaults in the
main config, or even launch all of them as a sweep (see Hydra documentation on
how to do this).

### 3. Add an external config directory to Hydra search path:

This allows combining default configuration (including using any bundled config
files), while specifying your own config files for some parts of the
configuration.

```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    dataset.batch_size=2 \
    task.data=/path/to/data/ \
    model=transformer_lm/2_layers \
    task=language_modeling \
    optimization.max_update=5000 \
    --config-dir /path/to/external/configs
```

where `/path/to/external/configs` has the following structure:

```
.
+-- model
|   +-- transformer_lm
|   |   +-- 2_layers.yaml
```

and `2_layers.yaml` contains a copy of `transformer_lm_gpt.yaml` but with
`decoder_layers` set to 2. You can add other configs to configure other
components as well.
SpeechT5/fairseq/docs/index.rst (new file, mode 100644)
.. fairseq documentation master file, created by
sphinx-quickstart on Fri Aug 17 21:45:30 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
:github_url: https://github.com/pytorch/fairseq
fairseq documentation
=====================
Fairseq is a sequence modeling toolkit written in `PyTorch
<http://pytorch.org/>`_ that allows researchers and developers to
train custom models for translation, summarization, language modeling and other
text generation tasks.
.. toctree::
:maxdepth: 1
:caption: Getting Started
getting_started
command_line_tools
.. toctree::
:maxdepth: 1
:caption: Extending Fairseq
overview
tutorial_simple_lstm
tutorial_classifying_names
.. toctree::
:maxdepth: 2
:caption: Library Reference
tasks
models
criterions
optim
lr_scheduler
data
modules
Indices and tables
==================
* :ref:`genindex`
* :ref:`search`
SpeechT5/fairseq/docs/lr_scheduler.rst (new file, mode 100644)
.. role:: hidden
:class: hidden-section
.. _Learning Rate Schedulers:
Learning Rate Schedulers
========================
Learning Rate Schedulers update the learning rate over the course of training.
Learning rates can be updated after each update via :func:`step_update` or at
epoch boundaries via :func:`step`.
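As an illustration, the two hooks fit into the high-level training flow described
in the Overview roughly as follows (a simplified sketch, not the exact trainer code)::

    for epoch in range(num_epochs):
        for num_updates, batch in enumerate(itr):
            task.train_step(batch, model, criterion, optimizer)
            optimizer.step()
            # called after every parameter update (e.g., warmup schedules
            # such as inverse_sqrt adjust the learning rate here)
            lr_scheduler.step_update(num_updates)
        # called once per epoch (e.g., fixed or reduce_lr_on_plateau schedules)
        lr_scheduler.step(epoch)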
.. automodule:: fairseq.optim.lr_scheduler
:members:
.. autoclass:: fairseq.optim.lr_scheduler.FairseqLRScheduler
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.cosine_lr_scheduler.CosineSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.fixed_schedule.FixedSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.inverse_square_root_schedule.InverseSquareRootSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.reduce_lr_on_plateau.ReduceLROnPlateau
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.triangular_lr_scheduler.TriangularSchedule
:members:
:undoc-members:
SpeechT5/fairseq/docs/make.bat (new file, mode 100644)
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=python -msphinx
)
set SOURCEDIR=.
set BUILDDIR=_build
set SPHINXPROJ=fairseq

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The Sphinx module was not found. Make sure you have Sphinx installed,
	echo.then set the SPHINXBUILD environment variable to point to the full
	echo.path of the 'sphinx-build' executable. Alternatively you may add the
	echo.Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%

:end
popd
SpeechT5/fairseq/docs/models.rst (new file, mode 100644)
.. role:: hidden
:class: hidden-section
.. module:: fairseq.models
.. _Models:
Models
======
A Model defines the neural network's ``forward()`` method and encapsulates all
of the learnable parameters in the network. Each model also provides a set of
named *architectures* that define the precise network configuration (e.g.,
embedding dimension, number of layers, etc.).
Both the model type and architecture are selected via the ``--arch``
command-line argument. Once selected, a model may expose additional command-line
arguments for further configuration.
.. note::
All fairseq Models extend :class:`BaseFairseqModel`, which in turn extends
:class:`torch.nn.Module`. Thus any fairseq Model can be used as a
stand-alone Module in other PyTorch code.
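For example, standard :class:`torch.nn.Module` operations apply directly (a minimal
sketch; ``model`` stands for any fairseq model instance, e.g. one returned by
``task.build_model(args)`` as shown in the Tasks documentation)::

    import torch

    assert isinstance(model, torch.nn.Module)

    model.eval()                                   # usual nn.Module methods work
    num_params = sum(p.numel() for p in model.parameters())
    state = model.state_dict()                     # save/load like any other Module
    model.to('cuda' if torch.cuda.is_available() else 'cpu')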
Convolutional Neural Networks (CNN)
-----------------------------------
.. module:: fairseq.models.fconv
.. autoclass:: fairseq.models.fconv.FConvModel
:members:
.. autoclass:: fairseq.models.fconv.FConvEncoder
:members:
:undoc-members:
.. autoclass:: fairseq.models.fconv.FConvDecoder
:members:
Long Short-Term Memory (LSTM) networks
--------------------------------------
.. module:: fairseq.models.lstm
.. autoclass:: fairseq.models.lstm.LSTMModel
:members:
.. autoclass:: fairseq.models.lstm.LSTMEncoder
:members:
.. autoclass:: fairseq.models.lstm.LSTMDecoder
:members:
Transformer (self-attention) networks
-------------------------------------
.. module:: fairseq.models.transformer
.. autoclass:: fairseq.models.transformer.TransformerModel
:members:
.. autoclass:: fairseq.models.transformer.TransformerEncoder
:members:
.. autoclass:: fairseq.models.transformer.TransformerEncoderLayer
:members:
.. autoclass:: fairseq.models.transformer.TransformerDecoder
:members:
.. autoclass:: fairseq.models.transformer.TransformerDecoderLayer
:members:
Adding new models
-----------------
.. currentmodule:: fairseq.models
.. autofunction:: fairseq.models.register_model
.. autofunction:: fairseq.models.register_model_architecture
.. autoclass:: fairseq.models.BaseFairseqModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoderDecoderModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoderModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqLanguageModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqMultiModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoder
:members:
.. autoclass:: fairseq.models.CompositeEncoder
:members:
.. autoclass:: fairseq.models.FairseqDecoder
:members:
.. _Incremental decoding:
Incremental decoding
--------------------
.. autoclass:: fairseq.models.FairseqIncrementalDecoder
:members:
:undoc-members:
SpeechT5/fairseq/docs/modules.rst (new file, mode 100644)
Modules
=======
Fairseq provides several stand-alone :class:`torch.nn.Module` classes that may
be helpful when implementing a new :class:`~fairseq.models.BaseFairseqModel`.
.. automodule:: fairseq.modules
:members:
:undoc-members:
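For example (a minimal sketch; it assumes ``MultiheadAttention`` is exposed from
``fairseq.modules`` in your fairseq version, so check the module listing above),
such a module can be used in ordinary PyTorch code::

    import torch
    from fairseq.modules import MultiheadAttention

    # Self-attention over a toy batch; fairseq modules are plain nn.Modules.
    attn = MultiheadAttention(embed_dim=64, num_heads=4)

    x = torch.randn(10, 2, 64)  # (seq_len, batch, embed_dim) layout used by fairseq
    out, attn_weights = attn(query=x, key=x, value=x)
    print(out.shape)  # torch.Size([10, 2, 64])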
SpeechT5/fairseq/docs/optim.rst (new file, mode 100644)
.. role:: hidden
:class: hidden-section
.. _optimizers:
Optimizers
==========
Optimizers update the Model parameters based on the gradients.
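In the training flow shown in the Overview this happens through the optimizer's
``backward()`` and ``step()`` calls (a simplified sketch of that flow, not the
exact trainer code)::

    loss = criterion(model, batch)
    optimizer.backward(loss)   # wraps loss.backward()
    optimizer.step()           # apply the (averaged and clipped) gradients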
.. automodule:: fairseq.optim
:members:
.. autoclass:: fairseq.optim.FairseqOptimizer
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adadelta.Adadelta
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adagrad.Adagrad
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adafactor.FairseqAdafactor
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adam.FairseqAdam
:members:
:undoc-members:
.. autoclass:: fairseq.optim.fp16_optimizer.FP16Optimizer
:members:
:undoc-members:
.. autoclass:: fairseq.optim.nag.FairseqNAG
:members:
:undoc-members:
.. autoclass:: fairseq.optim.sgd.SGD
:members:
:undoc-members:
SpeechT5/fairseq/docs/overview.rst (new file, mode 100644)
Overview
========
Fairseq can be extended through user-supplied `plug-ins
<https://en.wikipedia.org/wiki/Plug-in_(computing)>`_. We support five kinds of
plug-ins:
- :ref:`Models` define the neural network architecture and encapsulate all of the
learnable parameters.
- :ref:`Criterions` compute the loss function given the model outputs and targets.
- :ref:`Tasks` store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
- :ref:`Optimizers` update the Model parameters based on the gradients.
- :ref:`Learning Rate Schedulers` update the learning rate over the course of
training.
**Training Flow**
Given a ``model``, ``criterion``, ``task``, ``optimizer`` and ``lr_scheduler``,
fairseq implements the following high-level training flow::
for epoch in range(num_epochs):
itr = task.get_batch_iterator(task.dataset('train'))
for num_updates, batch in enumerate(itr):
task.train_step(batch, model, criterion, optimizer)
average_and_clip_gradients()
optimizer.step()
lr_scheduler.step_update(num_updates)
lr_scheduler.step(epoch)
where the default implementation for ``task.train_step`` is roughly::
def train_step(self, batch, model, criterion, optimizer, **unused):
loss = criterion(model, batch)
optimizer.backward(loss)
return loss
**Registering new plug-ins**
New plug-ins are *registered* through a set of ``@register`` function
decorators, for example::
@register_model('my_lstm')
class MyLSTM(FairseqEncoderDecoderModel):
(...)
Once registered, new plug-ins can be used with the existing :ref:`Command-line
Tools`. See the Tutorial sections for more detailed walkthroughs of how to add
new plug-ins.
**Loading plug-ins from another directory**
New plug-ins can be defined in a custom module stored in the user system. In
order to import the module, and make the plugin available to *fairseq*, the
command line supports the ``--user-dir`` flag that can be used to specify a
custom location for additional modules to load into *fairseq*.
For example, assuming this directory tree::
/home/user/my-module/
└── __init__.py
with ``__init__.py``::
from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big
@register_model_architecture('transformer', 'my_transformer')
def transformer_mmt_big(args):
transformer_vaswani_wmt_en_de_big(args)
it is possible to invoke the :ref:`fairseq-train` script with the new architecture with::
fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
SpeechT5/fairseq/docs/requirements.txt (new file, mode 100644)
sphinx<2.0
sphinx-argparse
SpeechT5/fairseq/docs/tasks.rst (new file, mode 100644)
.. role:: hidden
    :class: hidden-section

.. module:: fairseq.tasks

.. _Tasks:

Tasks
=====

Tasks store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.

Tasks can be selected via the ``--task`` command-line argument. Once selected, a
task may expose additional command-line arguments for further configuration.

Example usage::

    # setup the task (e.g., load dictionaries)
    task = fairseq.tasks.setup_task(args)

    # build model and criterion
    model = task.build_model(args)
    criterion = task.build_criterion(args)

    # load datasets
    task.load_dataset('train')
    task.load_dataset('valid')

    # iterate over mini-batches of data
    batch_itr = task.get_batch_iterator(
        task.dataset('train'), max_tokens=4096,
    )
    for batch in batch_itr:
        # compute the loss
        loss, sample_size, logging_output = task.get_loss(
            model, criterion, batch,
        )
        loss.backward()


Translation
-----------

.. autoclass:: fairseq.tasks.translation.TranslationTask

.. _language modeling:

Language Modeling
-----------------

.. autoclass:: fairseq.tasks.language_modeling.LanguageModelingTask

Adding new tasks
----------------

.. autofunction:: fairseq.tasks.register_task
.. autoclass:: fairseq.tasks.FairseqTask
    :members:
    :undoc-members:
SpeechT5/fairseq/docs/tutorial_classifying_names.rst (new file, mode 100644)
Tutorial: Classifying Names with a Character-Level RNN
======================================================

In this tutorial we will extend fairseq to support *classification* tasks. In
particular we will re-implement the PyTorch tutorial for `Classifying Names with
a Character-Level RNN
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`_
in fairseq. It is recommended to quickly skim that tutorial before beginning
this one.

This tutorial covers:

1. **Preprocessing the data** to create dictionaries.
2. **Registering a new Model** that encodes an input sentence with a simple RNN
   and predicts the output label.
3. **Registering a new Task** that loads our dictionaries and dataset.
4. **Training the Model** using the existing command-line tools.
5. **Writing an evaluation script** that imports fairseq and allows us to
   interactively evaluate our model on new inputs.

1. Preprocessing the data
-------------------------

The original tutorial provides raw data, but we'll work with a modified version
of the data that is already tokenized into characters and split into separate
train, valid and test sets.

Download and extract the data from here:
`tutorial_names.tar.gz <https://dl.fbaipublicfiles.com/fairseq/data/tutorial_names.tar.gz>`_

Once extracted, let's preprocess the data using the :ref:`fairseq-preprocess`
command-line tool to create the dictionaries. While this tool is primarily
intended for sequence-to-sequence problems, we're able to reuse it here by
treating the label as a "target" sequence of length 1. We'll also output the
preprocessed files in "raw" format using the ``--dataset-impl`` option to
enhance readability:

.. code-block:: console

    > fairseq-preprocess \
      --trainpref names/train --validpref names/valid --testpref names/test \
      --source-lang input --target-lang label \
      --destdir names-bin --dataset-impl raw

After running the above command you should see a new directory,
:file:`names-bin/`, containing the dictionaries for *inputs* and *labels*.

2. Registering a new Model
--------------------------

Next we'll register a new model in fairseq that will encode an input sentence
with a simple RNN and predict the output label. Compared to the original PyTorch
tutorial, our version will also work with batches of data and GPU Tensors.

First let's copy the simple RNN module implemented in the `PyTorch tutorial
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network>`_.
Create a new file named :file:`fairseq/models/rnn_classifier.py` with the
following contents::

    import torch
    import torch.nn as nn

    class RNN(nn.Module):

        def __init__(self, input_size, hidden_size, output_size):
            super(RNN, self).__init__()

            self.hidden_size = hidden_size

            self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
            self.i2o = nn.Linear(input_size + hidden_size, output_size)
            self.softmax = nn.LogSoftmax(dim=1)

        def forward(self, input, hidden):
            combined = torch.cat((input, hidden), 1)
            hidden = self.i2h(combined)
            output = self.i2o(combined)
            output = self.softmax(output)
            return output, hidden

        def initHidden(self):
            return torch.zeros(1, self.hidden_size)

We must also *register* this model with fairseq using the
:func:`~fairseq.models.register_model` function decorator. Once the model is
registered we'll be able to use it with the existing :ref:`Command-line Tools`.

All registered models must implement the :class:`~fairseq.models.BaseFairseqModel`
interface, so we'll create a small wrapper class in the same file and register it
in fairseq with the name ``'rnn_classifier'``::

    from fairseq.models import BaseFairseqModel, register_model

    # Note: the register_model "decorator" should immediately precede the
    # definition of the Model class.

    @register_model('rnn_classifier')
    class FairseqRNNClassifier(BaseFairseqModel):

        @staticmethod
        def add_args(parser):
            # Models can override this method to add new command-line arguments.
            # Here we'll add a new command-line argument to configure the
            # dimensionality of the hidden state.
            parser.add_argument(
                '--hidden-dim', type=int, metavar='N',
                help='dimensionality of the hidden state',
            )

        @classmethod
        def build_model(cls, args, task):
            # Fairseq initializes models by calling the ``build_model()``
            # function. This provides more flexibility, since the returned model
            # instance can be of a different type than the one that was called.
            # In this case we'll just return a FairseqRNNClassifier instance.

            # Initialize our RNN module
            rnn = RNN(
                # We'll define the Task in the next section, but for now just
                # notice that the task holds the dictionaries for the "source"
                # (i.e., the input sentence) and "target" (i.e., the label).
                input_size=len(task.source_dictionary),
                hidden_size=args.hidden_dim,
                output_size=len(task.target_dictionary),
            )

            # Return the wrapped version of the module
            return FairseqRNNClassifier(
                rnn=rnn,
                input_vocab=task.source_dictionary,
            )

        def __init__(self, rnn, input_vocab):
            super(FairseqRNNClassifier, self).__init__()

            self.rnn = rnn
            self.input_vocab = input_vocab

            # The RNN module in the tutorial expects one-hot inputs, so we can
            # precompute the identity matrix to help convert from indices to
            # one-hot vectors. We register it as a buffer so that it is moved to
            # the GPU when ``cuda()`` is called.
            self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))

        def forward(self, src_tokens, src_lengths):
            # The inputs to the ``forward()`` function are determined by the
            # Task, and in particular the ``'net_input'`` key in each
            # mini-batch. We'll define the Task in the next section, but for
            # now just know that *src_tokens* has shape `(batch, src_len)` and
            # *src_lengths* has shape `(batch)`.
            bsz, max_src_len = src_tokens.size()

            # Initialize the RNN hidden state. Compared to the original PyTorch
            # tutorial we'll also handle batched inputs and work on the GPU.
            hidden = self.rnn.initHidden()
            hidden = hidden.repeat(bsz, 1)  # expand for batched inputs
            hidden = hidden.to(src_tokens.device)  # move to GPU

            for i in range(max_src_len):
                # WARNING: The inputs have padding, so we should mask those
                # elements here so that padding doesn't affect the results.
                # This is left as an exercise for the reader. The padding symbol
                # is given by ``self.input_vocab.pad()`` and the unpadded length
                # of each input is given by *src_lengths*.

                # One-hot encode a batch of input characters.
                input = self.one_hot_inputs[src_tokens[:, i].long()]

                # Feed the input to our RNN.
                output, hidden = self.rnn(input, hidden)

            # Return the final output state for making a prediction
            return output
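The masking mentioned in the warning above is left as an exercise in the original
tutorial; as a rough sketch (assuming right-padded inputs, i.e.
``left_pad_source=False`` as in the Task below), one could keep the previous
state wherever position ``i`` is padding::

    pad_mask = src_tokens.ne(self.input_vocab.pad())  # (bsz, max_src_len)
    for i in range(max_src_len):
        input = self.one_hot_inputs[src_tokens[:, i].long()]
        new_output, new_hidden = self.rnn(input, hidden)
        # only advance the state for sequences that still have real tokens here
        keep = pad_mask[:, i].type_as(new_hidden).unsqueeze(1)
        hidden = keep * new_hidden + (1 - keep) * hidden
        output = new_output if i == 0 else keep * new_output + (1 - keep) * output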
Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g., ``--arch pytorch_tutorial_rnn``::

    from fairseq.models import register_model_architecture

    # The first argument to ``register_model_architecture()`` should be the name
    # of the model we registered above (i.e., 'rnn_classifier'). The function we
    # register here should take a single argument *args* and modify it in-place
    # to match the desired architecture.

    @register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
    def pytorch_tutorial_rnn(args):
        # We use ``getattr()`` to prioritize arguments that are explicitly given
        # on the command-line, so that the defaults defined below are only used
        # when no other value has been specified.
        args.hidden_dim = getattr(args, 'hidden_dim', 128)

3. Registering a new Task
-------------------------

Now we'll register a new :class:`~fairseq.tasks.FairseqTask` that will load our
dictionaries and dataset. Tasks can also control how the data is batched into
mini-batches, but in this tutorial we'll reuse the batching provided by
:class:`fairseq.data.LanguagePairDataset`.

Create a new file named :file:`fairseq/tasks/simple_classification.py` with the
following contents::

    import os
    import torch

    from fairseq.data import Dictionary, LanguagePairDataset
    from fairseq.tasks import LegacyFairseqTask, register_task

    @register_task('simple_classification')
    class SimpleClassificationTask(LegacyFairseqTask):

        @staticmethod
        def add_args(parser):
            # Add some command-line arguments for specifying where the data is
            # located and the maximum supported input length.
            parser.add_argument('data', metavar='FILE',
                                help='file prefix for data')
            parser.add_argument('--max-positions', default=1024, type=int,
                                help='max input length')

        @classmethod
        def setup_task(cls, args, **kwargs):
            # Here we can perform any setup required for the task. This may include
            # loading Dictionaries, initializing shared Embedding layers, etc.
            # In this case we'll just load the Dictionaries.
            input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
            label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
            print('| [input] dictionary: {} types'.format(len(input_vocab)))
            print('| [label] dictionary: {} types'.format(len(label_vocab)))

            return SimpleClassificationTask(args, input_vocab, label_vocab)

        def __init__(self, args, input_vocab, label_vocab):
            super().__init__(args)
            self.input_vocab = input_vocab
            self.label_vocab = label_vocab

        def load_dataset(self, split, **kwargs):
            """Load a given dataset split (e.g., train, valid, test)."""

            prefix = os.path.join(self.args.data, '{}.input-label'.format(split))

            # Read input sentences.
            sentences, lengths = [], []
            with open(prefix + '.input', encoding='utf-8') as file:
                for line in file:
                    sentence = line.strip()

                    # Tokenize the sentence, splitting on spaces
                    tokens = self.input_vocab.encode_line(
                        sentence, add_if_not_exist=False,
                    )

                    sentences.append(tokens)
                    lengths.append(tokens.numel())

            # Read labels.
            labels = []
            with open(prefix + '.label', encoding='utf-8') as file:
                for line in file:
                    label = line.strip()
                    labels.append(
                        # Convert label to a numeric ID.
                        torch.LongTensor([self.label_vocab.add_symbol(label)])
                    )

            assert len(sentences) == len(labels)
            print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))

            # We reuse LanguagePairDataset since classification can be modeled as a
            # sequence-to-sequence task where the target sequence has length 1.
            self.datasets[split] = LanguagePairDataset(
                src=sentences,
                src_sizes=lengths,
                src_dict=self.input_vocab,
                tgt=labels,
                tgt_sizes=torch.ones(len(labels)),  # targets have length 1
                tgt_dict=self.label_vocab,
                left_pad_source=False,
                # Since our target is a single class label, there's no need for
                # teacher forcing. If we set this to ``True`` then our Model's
                # ``forward()`` method would receive an additional argument called
                # *prev_output_tokens* that would contain a shifted version of the
                # target sequence.
                input_feeding=False,
            )

        def max_positions(self):
            """Return the max input length allowed by the task."""
            # The source should be less than *args.max_positions* and the "target"
            # has max length 1.
            return (self.args.max_positions, 1)

        @property
        def source_dictionary(self):
            """Return the source :class:`~fairseq.data.Dictionary`."""
            return self.input_vocab

        @property
        def target_dictionary(self):
            """Return the target :class:`~fairseq.data.Dictionary`."""
            return self.label_vocab

        # We could override this method if we wanted more control over how batches
        # are constructed, but it's not necessary for this tutorial since we can
        # reuse the batching provided by LanguagePairDataset.
        #
        # def get_batch_iterator(
        #     self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
        #     ignore_invalid_inputs=False, required_batch_size_multiple=1,
        #     seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
        #     data_buffer_size=0, disable_iterator_cache=False,
        # ):
        #     (...)

4. Training the Model
---------------------

Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Task (``--task
simple_classification``) and Model architecture (``--arch pytorch_tutorial_rnn``):

.. note::

    You can also configure the dimensionality of the hidden state by passing the
    ``--hidden-dim`` argument to :ref:`fairseq-train`.

.. code-block:: console

    > fairseq-train names-bin \
      --task simple_classification \
      --arch pytorch_tutorial_rnn \
      --optimizer adam --lr 0.001 --lr-shrink 0.5 \
      --max-tokens 1000
    (...)
    | epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
    | epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
    | done training in 31.6 seconds

The model files should appear in the :file:`checkpoints/` directory.

5. Writing an evaluation script
-------------------------------

Finally we can write a short script to evaluate our model on new inputs. Create
a new file named :file:`eval_classifier.py` with the following contents::

    from fairseq import checkpoint_utils, data, options, tasks

    # Parse command-line arguments for generation
    parser = options.get_generation_parser(default_task='simple_classification')
    args = options.parse_args_and_arch(parser)

    # Setup task
    task = tasks.setup_task(args)

    # Load model
    print('| loading model from {}'.format(args.path))
    models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
    model = models[0]

    while True:
        sentence = input('\nInput: ')

        # Tokenize into characters
        chars = ' '.join(list(sentence.strip()))
        tokens = task.source_dictionary.encode_line(
            chars, add_if_not_exist=False,
        )

        # Build mini-batch to feed to the model
        batch = data.language_pair_dataset.collate(
            samples=[{'id': -1, 'source': tokens}],  # bsz = 1
            pad_idx=task.source_dictionary.pad(),
            eos_idx=task.source_dictionary.eos(),
            left_pad_source=False,
            input_feeding=False,
        )

        # Feed batch to the model and get predictions
        preds = model(**batch['net_input'])

        # Print top 3 predictions and their log-probabilities
        top_scores, top_labels = preds[0].topk(k=3)
        for score, label_idx in zip(top_scores, top_labels):
            label_name = task.target_dictionary.string([label_idx])
            print('({:.2f})\t{}'.format(score, label_name))

Now we can evaluate our model interactively. Note that we have included the
original data path (:file:`names-bin/`) so that the dictionaries can be loaded:

.. code-block:: console

    > python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
    | [input] dictionary: 64 types
    | [label] dictionary: 24 types
    | loading model from checkpoints/checkpoint_best.pt

    Input: Satoshi
    (-0.61) Japanese
    (-1.20) Arabic
    (-2.86) Italian

    Input: Sinbad
    (-0.30) Arabic
    (-1.76) English
    (-4.08) Russian
SpeechT5/fairseq/docs/tutorial_simple_lstm.rst (new file, mode 100644)
Tutorial: Simple LSTM
=====================

In this tutorial we will extend fairseq by adding a new
:class:`~fairseq.models.FairseqEncoderDecoderModel` that encodes a source
sentence with an LSTM and then passes the final hidden state to a second LSTM
that decodes the target sentence (without attention).

This tutorial covers:

1. **Writing an Encoder and Decoder** to encode/decode the source/target
   sentence, respectively.
2. **Registering a new Model** so that it can be used with the existing
   :ref:`Command-line tools`.
3. **Training the Model** using the existing command-line tools.
4. **Making generation faster** by modifying the Decoder to use
   :ref:`Incremental decoding`.

1. Building an Encoder and Decoder
----------------------------------

In this section we'll define a simple LSTM Encoder and Decoder. All Encoders
should implement the :class:`~fairseq.models.FairseqEncoder` interface and
Decoders should implement the :class:`~fairseq.models.FairseqDecoder` interface.
These interfaces themselves extend :class:`torch.nn.Module`, so FairseqEncoders
and FairseqDecoders can be written and used in the same ways as ordinary PyTorch
Modules.

Encoder
~~~~~~~

Our Encoder will embed the tokens in the source sentence, feed them to a
:class:`torch.nn.LSTM` and return the final hidden state. To create our encoder
save the following in a new file named :file:`fairseq/models/simple_lstm.py`::

    import torch.nn as nn
    from fairseq import utils
    from fairseq.models import FairseqEncoder

    class SimpleLSTMEncoder(FairseqEncoder):

        def __init__(
            self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1,
        ):
            super().__init__(dictionary)
            self.args = args

            # Our encoder will embed the inputs before feeding them to the LSTM.
            self.embed_tokens = nn.Embedding(
                num_embeddings=len(dictionary),
                embedding_dim=embed_dim,
                padding_idx=dictionary.pad(),
            )
            self.dropout = nn.Dropout(p=dropout)

            # We'll use a single-layer, unidirectional LSTM for simplicity.
            self.lstm = nn.LSTM(
                input_size=embed_dim,
                hidden_size=hidden_dim,
                num_layers=1,
                bidirectional=False,
                batch_first=True,
            )

        def forward(self, src_tokens, src_lengths):
            # The inputs to the ``forward()`` function are determined by the
            # Task, and in particular the ``'net_input'`` key in each
            # mini-batch. We discuss Tasks in the next tutorial, but for now just
            # know that *src_tokens* has shape `(batch, src_len)` and *src_lengths*
            # has shape `(batch)`.

            # Note that the source is typically padded on the left. This can be
            # configured by adding the `--left-pad-source "False"` command-line
            # argument, but here we'll make the Encoder handle either kind of
            # padding by converting everything to be right-padded.
            if self.args.left_pad_source:
                # Convert left-padding to right-padding.
                src_tokens = utils.convert_padding_direction(
                    src_tokens,
                    padding_idx=self.dictionary.pad(),
                    left_to_right=True
                )

            # Embed the source.
            x = self.embed_tokens(src_tokens)

            # Apply dropout.
            x = self.dropout(x)

            # Pack the sequence into a PackedSequence object to feed to the LSTM.
            x = nn.utils.rnn.pack_padded_sequence(x, src_lengths, batch_first=True)

            # Get the output from the LSTM.
            _outputs, (final_hidden, _final_cell) = self.lstm(x)

            # Return the Encoder's output. This can be any object and will be
            # passed directly to the Decoder.
            return {
                # this will have shape `(bsz, hidden_dim)`
                'final_hidden': final_hidden.squeeze(0),
            }

        # Encoders are required to implement this method so that we can rearrange
        # the order of the batch elements during inference (e.g., beam search).
        def reorder_encoder_out(self, encoder_out, new_order):
            """
            Reorder encoder output according to `new_order`.

            Args:
                encoder_out: output from the ``forward()`` method
                new_order (LongTensor): desired order

            Returns:
                `encoder_out` rearranged according to `new_order`
            """
            final_hidden = encoder_out['final_hidden']
            return {
                'final_hidden': final_hidden.index_select(0, new_order),
            }
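As a quick sanity check (not part of the original tutorial; the toy dictionary
and batch below are made up for illustration), the encoder can be exercised like
any other PyTorch module::

    import torch
    from argparse import Namespace
    from fairseq.data import Dictionary

    dictionary = Dictionary()
    for sym in 'abcd':
        dictionary.add_symbol(sym)

    args = Namespace(left_pad_source=False)
    encoder = SimpleLSTMEncoder(args, dictionary, embed_dim=16, hidden_dim=16)

    src_tokens = torch.randint(dictionary.nspecial, len(dictionary), (2, 5))
    src_lengths = torch.tensor([5, 5])
    out = encoder(src_tokens, src_lengths)
    print(out['final_hidden'].shape)  # torch.Size([2, 16])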
Decoder
~~~~~~~
Our
Decoder
will
predict
the
next
word
,
conditioned
on
the
Encoder
's final
hidden state and an embedded representation of the previous target word -- which
is sometimes called *teacher forcing*. More specifically, we'
ll
use
a
:
class
:`
torch
.
nn
.
LSTM
`
to
produce
a
sequence
of
hidden
states
that
we
'll project
to the size of the output vocabulary to predict each target word.
::
import torch
from fairseq.models import FairseqDecoder
class SimpleLSTMDecoder(FairseqDecoder):
def __init__(
self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
dropout=0.1,
):
super().__init__(dictionary)
# Our decoder will embed the inputs before feeding them to the LSTM.
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
# We'
ll
use
a
single
-
layer
,
unidirectional
LSTM
for
simplicity
.
self
.
lstm
=
nn
.
LSTM
(
#
For
the
first
layer
we
'll concatenate the Encoder'
s
final
hidden
#
state
with
the
embedded
target
tokens
.
input_size
=
encoder_hidden_dim
+
embed_dim
,
hidden_size
=
hidden_dim
,
num_layers
=
1
,
bidirectional
=
False
,
)
#
Define
the
output
projection
.
self
.
output_projection
=
nn
.
Linear
(
hidden_dim
,
len
(
dictionary
))
#
During
training
Decoders
are
expected
to
take
the
entire
target
sequence
#
(
shifted
right
by
one
position
)
and
produce
logits
over
the
vocabulary
.
#
The
*
prev_output_tokens
*
tensor
begins
with
the
end
-
of
-
sentence
symbol
,
#
``
dictionary
.
eos
()``,
followed
by
the
target
sequence
.
def
forward
(
self
,
prev_output_tokens
,
encoder_out
):
"""
Args:
prev_output_tokens (LongTensor): previous decoder outputs of shape
`(batch, tgt_len)`, for teacher forcing
encoder_out (Tensor, optional): output from the encoder, used for
encoder-side attention
Returns:
tuple:
- the last decoder layer's output of shape
`(batch, tgt_len, vocab)`
- the last decoder layer's attention weights of shape
`(batch, tgt_len, src_len)`
"""
bsz
,
tgt_len
=
prev_output_tokens
.
size
()
#
Extract
the
final
hidden
state
from
the
Encoder
.
final_encoder_hidden
=
encoder_out
[
'final_hidden'
]
#
Embed
the
target
sequence
,
which
has
been
shifted
right
by
one
#
position
and
now
starts
with
the
end
-
of
-
sentence
symbol
.
x
=
self
.
embed_tokens
(
prev_output_tokens
)
#
Apply
dropout
.
x
=
self
.
dropout
(
x
)
#
Concatenate
the
Encoder
's final hidden state to *every* embedded
# target token.
x = torch.cat(
[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
dim=2,
)
# Using PackedSequence objects in the Decoder is harder than in the
# Encoder, since the targets are not sorted in descending length order,
# which is a requirement of ``pack_padded_sequence()``. Instead we'
ll
#
feed
nn
.
LSTM
directly
.
initial_state
=
(
final_encoder_hidden
.
unsqueeze
(
0
),
#
hidden
torch
.
zeros_like
(
final_encoder_hidden
).
unsqueeze
(
0
),
#
cell
)
output
,
_
=
self
.
lstm
(
x
.
transpose
(
0
,
1
),
#
convert
to
shape
`(
tgt_len
,
bsz
,
dim
)`
initial_state
,
)
x
=
output
.
transpose
(
0
,
1
)
#
convert
to
shape
`(
bsz
,
tgt_len
,
hidden
)`
#
Project
the
outputs
to
the
size
of
the
vocabulary
.
x
=
self
.
output_projection
(
x
)
#
Return
the
logits
and
``
None
``
for
the
attention
weights
return
x
,
None
2.
Registering
the
Model
------------------------
Now
that
we
've defined our Encoder and Decoder we must *register* our model with
fairseq using the :func:`~fairseq.models.register_model` function decorator.
Once the model is registered we'
ll
be
able
to
use
it
with
the
existing
:
ref
:`
Command
-
line
Tools
`.
All
registered
models
must
implement
the
:
class
:`~
fairseq
.
models
.
BaseFairseqModel
`
interface
.
For
sequence
-
to
-
sequence
models
(
i
.
e
.,
any
model
with
a
single
Encoder
and
Decoder
),
we
can
instead
implement
the
:
class
:`~
fairseq
.
models
.
FairseqEncoderDecoderModel
`
interface
.
Create
a
small
wrapper
class
in
the
same
file
and
register
it
in
fairseq
with
the
name
``
'simple_lstm'
``::

    from fairseq.models import FairseqEncoderDecoderModel, register_model

    # Note: the register_model "decorator" should immediately precede the
    # definition of the Model class.

    @register_model('simple_lstm')
    class SimpleLSTMModel(FairseqEncoderDecoderModel):

        @staticmethod
        def add_args(parser):
            # Models can override this method to add new command-line arguments.
            # Here we'll add some new command-line arguments to configure dropout
            # and the dimensionality of the embeddings and hidden states.
            parser.add_argument(
                '--encoder-embed-dim', type=int, metavar='N',
                help='dimensionality of the encoder embeddings',
            )
            parser.add_argument(
                '--encoder-hidden-dim', type=int, metavar='N',
                help='dimensionality of the encoder hidden state',
            )
            parser.add_argument(
                '--encoder-dropout', type=float, default=0.1,
                help='encoder dropout probability',
            )
            parser.add_argument(
                '--decoder-embed-dim', type=int, metavar='N',
                help='dimensionality of the decoder embeddings',
            )
            parser.add_argument(
                '--decoder-hidden-dim', type=int, metavar='N',
                help='dimensionality of the decoder hidden state',
            )
            parser.add_argument(
                '--decoder-dropout', type=float, default=0.1,
                help='decoder dropout probability',
            )

        @classmethod
        def build_model(cls, args, task):
            # Fairseq initializes models by calling the ``build_model()``
            # function. This provides more flexibility, since the returned model
            # instance can be of a different type than the one that was called.
            # In this case we'll just return a SimpleLSTMModel instance.

            # Initialize our Encoder and Decoder.
            encoder = SimpleLSTMEncoder(
                args=args,
                dictionary=task.source_dictionary,
                embed_dim=args.encoder_embed_dim,
                hidden_dim=args.encoder_hidden_dim,
                dropout=args.encoder_dropout,
            )
            decoder = SimpleLSTMDecoder(
                dictionary=task.target_dictionary,
                encoder_hidden_dim=args.encoder_hidden_dim,
                embed_dim=args.decoder_embed_dim,
                hidden_dim=args.decoder_hidden_dim,
                dropout=args.decoder_dropout,
            )
            model = SimpleLSTMModel(encoder, decoder)

            # Print the model architecture.
            print(model)

            return model

        # We could override the ``forward()`` if we wanted more control over how
        # the encoder and decoder interact, but it's not necessary for this
        # tutorial since we can inherit the default implementation provided by
        # the FairseqEncoderDecoderModel base class, which looks like:
        #
        # def forward(self, src_tokens, src_lengths, prev_output_tokens):
        #     encoder_out = self.encoder(src_tokens, src_lengths)
        #     decoder_out = self.decoder(prev_output_tokens, encoder_out)
        #     return decoder_out

Finally let's define a *named architecture* with the configuration for our
model. This is done with the
:func:`~fairseq.models.register_model_architecture` function decorator.
Thereafter this named architecture can be used with the ``--arch`` command-line
argument, e.g., ``--arch tutorial_simple_lstm``::

    from fairseq.models import register_model_architecture

    # The first argument to ``register_model_architecture()`` should be the name
    # of the model we registered above (i.e., 'simple_lstm'). The function we
    # register here should take a single argument *args* and modify it in-place
    # to match the desired architecture.

    @register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
    def tutorial_simple_lstm(args):
        # We use ``getattr()`` to prioritize arguments that are explicitly given
        # on the command-line, so that the defaults defined below are only used
        # when no other value has been specified.
        args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
        args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
        args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
        args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)
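
To illustrate the same mechanism, a second named architecture could be
registered on top of the same ``'simple_lstm'`` model, changing only the
default sizes. The name and dimensions below are hypothetical and not part of
the tutorial's reference code::

    @register_model_architecture('simple_lstm', 'tutorial_simple_lstm_big')
    def tutorial_simple_lstm_big(args):
        # Hypothetical larger variant: reuse the model above and only raise the
        # default dimensions; command-line values still take precedence.
        args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 512)
        args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 512)
        args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 512)
        args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 512)

It would then be selectable with ``--arch tutorial_simple_lstm_big``.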

3. Training the Model
---------------------
Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Model architecture
(``--arch tutorial_simple_lstm``).

.. note::

    Make sure you've already preprocessed the data from the IWSLT example in the
    :file:`examples/translation/` directory.
.. code-block:: console

    > fairseq-train data-bin/iwslt14.tokenized.de-en \
      --arch tutorial_simple_lstm \
      --encoder-dropout 0.2 --decoder-dropout 0.2 \
      --optimizer adam --lr 0.005 --lr-shrink 0.5 \
      --max-tokens 12000
    (...)
    | epoch 052 | loss 4.027 | ppl 16.30 | wps 420805 | ups 39.7 | wpb 9841 | bsz 400 | num_updates 20852 | lr 1.95313e-05 | gnorm 0.218 | clip 0% | oom 0 | wall 529 | train_wall 396
    | epoch 052 | valid on 'valid' subset | valid_loss 4.74989 | valid_ppl 26.91 | num_updates 20852 | best 4.74954
The model files should appear in the :file:`checkpoints/` directory. While this
model architecture is not very good, we can use the :ref:`fairseq-generate`
script to generate translations and compute our BLEU score over the test set:
.. code-block:: console

    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
      --path checkpoints/checkpoint_best.pt \
      --beam 5 \
      --remove-bpe
    (...)
    | Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
    | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)

4. Making generation faster
---------------------------
While autoregressive generation from sequence-to-sequence models is inherently
slow, our implementation above is especially slow because it recomputes the
entire sequence of Decoder hidden states for every output token (i.e., it is
``O(n^2)``). We can make this significantly faster by instead caching the
previous hidden states. In fairseq this is called :ref:`Incremental decoding`.
Incremental decoding is a special mode at inference time where the Model only
receives a single timestep of input corresponding to the immediately previous
output token (for teacher forcing) and must produce the next output
incrementally. Thus the model must cache any long-term state that is needed
about the sequence, e.g., hidden states, convolutional states, etc.

To implement incremental decoding we will modify our model to implement the
:class:`~fairseq.models.FairseqIncrementalDecoder` interface. Compared to the
standard :class:`~fairseq.models.FairseqDecoder` interface, the incremental
decoder interface allows ``forward()`` methods to take an extra keyword
argument (*incremental_state*) that can be used to cache state across
time-steps.
Let's replace our ``SimpleLSTMDecoder`` with an incremental one::

    import torch
    import torch.nn as nn
    from fairseq import utils
    from fairseq.models import FairseqIncrementalDecoder

    class SimpleLSTMDecoder(FairseqIncrementalDecoder):

        def __init__(
            self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
            dropout=0.1,
        ):
            # This remains the same as before.
            super().__init__(dictionary)
            self.embed_tokens = nn.Embedding(
                num_embeddings=len(dictionary),
                embedding_dim=embed_dim,
                padding_idx=dictionary.pad(),
            )
            self.dropout = nn.Dropout(p=dropout)
            self.lstm = nn.LSTM(
                input_size=encoder_hidden_dim + embed_dim,
                hidden_size=hidden_dim,
                num_layers=1,
                bidirectional=False,
            )
            self.output_projection = nn.Linear(hidden_dim, len(dictionary))

        # We now take an additional kwarg (*incremental_state*) for caching the
        # previous hidden and cell states.
        def forward(self, prev_output_tokens, encoder_out, incremental_state=None):
            if incremental_state is not None:
                # If the *incremental_state* argument is not ``None`` then we are
                # in incremental inference mode. While *prev_output_tokens* will
                # still contain the entire decoded prefix, we will only use the
                # last step and assume that the rest of the state is cached.
                prev_output_tokens = prev_output_tokens[:, -1:]

            # This remains the same as before.
            bsz, tgt_len = prev_output_tokens.size()
            final_encoder_hidden = encoder_out['final_hidden']
            x = self.embed_tokens(prev_output_tokens)
            x = self.dropout(x)
            x = torch.cat(
                [x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
                dim=2,
            )

            # We will now check the cache and load the cached previous hidden and
            # cell states, if they exist, otherwise we will initialize them to
            # zeros (as before). We will use the ``utils.get_incremental_state()``
            # and ``utils.set_incremental_state()`` helpers.
            initial_state = utils.get_incremental_state(
                self, incremental_state, 'prev_state',
            )
            if initial_state is None:
                # first time initialization, same as the original version
                initial_state = (
                    final_encoder_hidden.unsqueeze(0),  # hidden
                    torch.zeros_like(final_encoder_hidden).unsqueeze(0),  # cell
                )

            # Run one step of our LSTM.
            output, latest_state = self.lstm(x.transpose(0, 1), initial_state)

            # Update the cache with the latest hidden and cell states.
            utils.set_incremental_state(
                self, incremental_state, 'prev_state', latest_state,
            )

            # This remains the same as before
            x = output.transpose(0, 1)
            x = self.output_projection(x)
            return x, None

        # The ``FairseqIncrementalDecoder`` interface also requires implementing a
        # ``reorder_incremental_state()`` method, which is used during beam search
        # to select and reorder the incremental state.
        def reorder_incremental_state(self, incremental_state, new_order):
            # Load the cached state.
            prev_state = utils.get_incremental_state(
                self, incremental_state, 'prev_state',
            )

            # Reorder batches according to *new_order*.
            reordered_state = (
                prev_state[0].index_select(1, new_order),  # hidden
                prev_state[1].index_select(1, new_order),  # cell
            )

            # Update the cached state.
            utils.set_incremental_state(
                self, incremental_state, 'prev_state', reordered_state,
            )
Finally, we can rerun generation and observe the speedup:
.. code-block:: console

    # Before
    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
      --path checkpoints/checkpoint_best.pt \
      --beam 5 \
      --remove-bpe
    (...)
    | Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
    | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)

    # After
    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
      --path checkpoints/checkpoint_best.pt \
      --beam 5 \
      --remove-bpe
    (...)
    | Translated 6750 sentences (153132 tokens) in 5.5s (1225.54 sentences/s, 27802.94 tokens/s)
    | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
SpeechT5/fairseq/examples/.gitignore
0 → 100644
View file @
12c90639
!*/*.sh
!*/*.md
SpeechT5/fairseq/examples/__init__.py
0 → 100644
View file @
12c90639
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
try:
    from fairseq.version import __version__  # noqa
except ImportError:
    pass
SpeechT5/fairseq/examples/adaptive_span/README.md
0 → 100644
View file @
12c90639
# Adaptive Span
Adaptive Span is a self-attention mechanism that can learn its optimal
attention span. This allows us to significantly extend the maximum context size
used in Transformers, while maintaining control over their memory footprint
and computational time. It uses the Truncated BPTT technique for training,
as in [transformerXL](https://github.com/pytorch/fairseq/blob/master/examples/truncated_bptt/README.md).

Adaptive Span was introduced in the paper
[Adaptive Attention Span in Transformers](https://arxiv.org/abs/1905.07799),
which achieved state-of-the-art language modeling results at the time of publication.

We manage to reproduce their result in fairseq and keep most of the
[original implementation](https://github.com/facebookresearch/adaptive-span) untouched.
You can also refer to their sweep file if any hyperparameter combination is not clear.
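
As a schematic of the core idea, each attention head learns a span `z`, and an
attention weight at token distance `x` is scaled by a soft mask that ramps down
to zero over `R` extra tokens. The snippet below is an illustration of the
paper's soft masking function, not the fairseq implementation; the function
name and the `ramp` default are arbitrary:

```python
import torch

def soft_span_mask(distances, z, ramp=32):
    # m_z(x) = clamp((ramp + z - x) / ramp, 0, 1): weights within the learned
    # span z pass through unchanged and decay linearly to zero over `ramp` tokens.
    return torch.clamp((ramp + z - distances) / ramp, min=0.0, max=1.0)

# Example: with a learned span of 40, attention beyond ~40 + 32 tokens is masked out.
distances = torch.arange(0, 128, dtype=torch.float)
mask = soft_span_mask(distances, z=torch.tensor(40.0))
```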
##### 0. Setup
First you need to process the Enwik8 dataset; we use the pre-tokenized dataset
from the [adaptive span paper](https://github.com/facebookresearch/adaptive-span/blob/master/get_data.sh).
You can download the dataset and then run:
```bash
fairseq-preprocess --only-source --trainpref ~/data/enwik8/train.txt \
    --validpref ~/data/enwik8/valid.txt --testpref ~/data/enwik8/test.txt \
    --destdir ~/data/enwik8/data-bin/ --joined-dictionary --workers 20
```
##### 1. Train an Adaptive Span model on Enwik8
We will train a 12-layer Adaptive Span model following the
[hyperparameters used in the original paper](https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8.sh).
The following command assumes 4 GPUs, so that the total batch size is 64
sequences (4 x 16). Training should take 2-3 days on 4 V100 GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
    --user-dir examples/adaptive_span \
    --data ~/data/enwik8/data-bin/ \
    --fp16 --fp16-no-flatten-grads --max-update 600000 \
    --task truncated_bptt_lm --tokens-per-sample 512 --arch adaptive_span \
    --n-layer 12 --d-model 512 --n-head 8 --d-inner 2048 --dropout 0.3 \
    --attn-span 8192 --optimizer adagrad_with_grad_clip --adagrad-clip 0.03 \
    --validate-interval-updates 1000 \
    --lr-scheduler fixed --warmup-updates 32000 --batch-size-valid 32 \
    --lr 0.07 --criterion adaptive_span_loss --batch-size 16 --update-freq 1 \
    --seed 2 --log-format json --log-interval 25 --aux-loss-scaler 5e-07
```
This should land around 1.05 on validation, 1.03 on test. You can lower
`--aux-loss-scaler` for better performance (longer span). It gives a ~0.03 bpc
improvement over the transformerXL baseline here.

If training on a single GPU, set `--update-freq=4` to accumulate 4x gradients
and simulate training on 4 GPUs.

You can also reproduce the transformerXL result on enwik8 using this code base.
It should land around 1.06 on test, matching the
[original paper](https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_enwik8_base.sh).
You can try it with:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
    --user-dir examples/truncated_bptt \
    ~/data/enwik8/data-bin/ \
    --task truncated_bptt_lm --fp16 --max-update 400000 \
    --tokens-per-sample 512 --arch transformer_xl --n-layer 12 \
    --d-model 512 --n-head 8 --d-head 64 --d-inner 2048 --dropout 0.1 \
    --dropatt 0.0 --mem-len 512 --optimizer adam --clip-norm 0.25 \
    --lr-scheduler cosine --warmup-updates 0 \
    --lr 0.0 --lr 0.00025 --batch-size 15 \
    --update-freq 1 --seed 2 --log-format json --log-interval 25 \
    --fp16
```
##### 2. Evaluate
For Adaptive Span:
```bash
fairseq-eval-lm ~/data/enwik8/data-bin/ --path model/checkpoint_best.pt \
    --user-dir examples/adaptive_span \
    --task truncated_bptt_lm --batch-size 8 --tokens-per-sample 512 --gen-subset test
```
For Transformer-XL evaluation:
```bash
fairseq-eval-lm ~/data/enwik8/data-bin/ --path model/checkpoint_best.pt \
    --user-dir examples/truncated_bptt/ --task truncated_bptt_lm --batch-size 8 \
    --tokens-per-sample 80 \
    --model-overrides '{"mem_len":2100,"clamp_len":820,"same_length":True}' \
    --gen-subset valid
```
*Note:* During training the model saw 512 tokens of context
(``--tokens-per-sample=512``), with batch size 8. These settings match the evaluation
settings from [the original paper](https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8.sh).
SpeechT5/fairseq/examples/adaptive_span/__init__.py
0 → 100644
View file @
12c90639
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import importlib
import os

# automatically import any Python files in the current directory
cur_dir = os.path.dirname(__file__)
for file in os.listdir(cur_dir):
    path = os.path.join(cur_dir, file)
    if (
        not file.startswith("_")
        and not file.startswith(".")
        and (file.endswith(".py") or os.path.isdir(path))
    ):
        mod_name = file[: file.find(".py")] if file.endswith(".py") else file
        module = importlib.import_module(__name__ + "." + mod_name)
SpeechT5/fairseq/examples/adaptive_span/adagrad_with_grad_clip.py
0 → 100644
View file @
12c90639
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from torch.optim import Adagrad

from fairseq.optim import LegacyFairseqOptimizer, register_optimizer


@register_optimizer("adagrad_with_grad_clip")
class FairseqAdagradWithGradClip(LegacyFairseqOptimizer):
    def __init__(self, args, params):
        super().__init__(args)
        self._optimizer = AdagradWithGradClip(params, **self.optimizer_config)

    @staticmethod
    def add_args(parser):
        """Add optimizer-specific arguments to the parser."""
        # fmt: off
        parser.add_argument('--weight-decay', '--wd', default=0.0, type=float, metavar='WD',
                            help='weight decay')
        parser.add_argument('--adagrad-clip', default=0.0, type=float, metavar='D',
                            help='internal grad clip')
        # fmt: on

    @property
    def optimizer_config(self):
        """
        Return a kwarg dictionary that will be used to override optimizer
        args stored in checkpoints. This allows us to load a checkpoint and
        resume training using a different set of optimizer args, e.g., with a
        different learning rate.
        """
        return {
            "lr": self.args.lr[0],
            "weight_decay": self.args.weight_decay,
            "grad_clip": self.args.adagrad_clip,
        }

    @property
    def supports_flat_params(self):
        return False


def _clip_grad(clr, grad, group_grad_clip):
    if group_grad_clip > 0:
        norm = grad.norm(2).item()
        if norm > group_grad_clip:
            clr *= group_grad_clip / (norm + 1e-10)
    return clr


class AdagradWithGradClip(Adagrad):
    """Adagrad algorithm with custom gradient clipping"""

    def __init__(
        self,
        params,
        lr=1e-2,
        lr_decay=0,
        weight_decay=0,
        initial_accumulator_value=0,
        grad_clip=0,
    ):
        Adagrad.__init__(
            self,
            params,
            lr=lr,
            lr_decay=lr_decay,
            weight_decay=weight_decay,
            initial_accumulator_value=initial_accumulator_value,
        )
        self.defaults["grad_clip"] = grad_clip
        self.param_groups[0].setdefault("grad_clip", grad_clip)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue

                grad = p.grad.data
                state = self.state[p]

                state["step"] += 1

                if group["weight_decay"] != 0:
                    if p.grad.data.is_sparse:
                        raise RuntimeError(
                            "weight_decay option is "
                            "not compatible with sparse "
                            "gradients"
                        )
                    grad = grad.add(group["weight_decay"], p.data)

                clr = group["lr"] / (1 + (state["step"] - 1) * group["lr_decay"])

                # clip
                clr = _clip_grad(clr=clr, grad=grad, group_grad_clip=group["grad_clip"])

                if grad.is_sparse:
                    # the update is non-linear so indices must be unique
                    grad = grad.coalesce()
                    grad_indices = grad._indices()
                    grad_values = grad._values()
                    size = grad.size()

                    def make_sparse(values):
                        constructor = grad.new
                        if grad_indices.dim() == 0 or values.dim() == 0:
                            return constructor().resize_as_(grad)
                        return constructor(grad_indices, values, size)

                    state["sum"].add_(make_sparse(grad_values.pow(2)))
                    std = state["sum"]._sparse_mask(grad)
                    std_values = std._values().sqrt_().add_(1e-10)
                    p.data.add_(-clr, make_sparse(grad_values / std_values))
                else:
                    state["sum"].addcmul_(1, grad, grad)
                    std = state["sum"].sqrt().add_(1e-10)
                    p.data.addcdiv_(-clr, grad, std)

        return loss