Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks. It provides reference implementations of various sequence-to-sequence models, including:
- **Convolutional Neural Networks (CNN)**
  - [Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](https://arxiv.org/abs/1612.08083)
  - [Gehring et al. (2017): Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122)
  - [Edunov et al. (2018): Classical Structured Prediction Losses for Sequence to Sequence Learning](https://arxiv.org/abs/1711.04956)
  - **_New_** [Fan et al. (2018): Hierarchical Neural Story Generation](https://arxiv.org/abs/1805.04833)
- **Long Short-Term Memory (LSTM) networks**
  - [Luong et al. (2015): Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
  - [Wiseman and Rush (2016): Sequence-to-Sequence Learning as Beam-Search Optimization](https://arxiv.org/abs/1606.02960)
- **Transformer (self-attention) networks**
  - [Vaswani et al. (2017): Attention Is All You Need](https://arxiv.org/abs/1706.03762)
  - **_New_** [Ott et al. (2018): Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187)
Fairseq features:
- multi-GPU (distributed) training on one machine or across multiple machines
- fast beam search generation on both CPU and GPU
- large mini-batch training (even on a single GPU) via delayed updates
- fast half-precision floating point (FP16) training
We also provide [pre-trained models](#pre-trained-models) for several benchmark translation datasets.

...
...
Check [below](#pre-trained-models) for a full list of pre-trained models available.
## Training a New Model
The following tutorial is for machine translation.
For an example of how to use Fairseq for other tasks, such as [language modeling](examples/language_model/README.md), please see the `examples/` directory.
The `--update-freq` option can be used to accumulate gradients from multiple mini-batches and delay updating,
creating a larger effective batch size.
Delayed updates can also improve training speed by reducing inter-GPU communication costs and by saving idle time caused by variance in workload across GPUs.
See [Ott et al. (2018)](https://arxiv.org/abs/1806.00187) for more details.
To train on a single GPU with an effective batch size that is equivalent to training on 8 GPUs:
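(A sketch: as in the FP16 example below, `(...)` stands in for the usual dataset and model arguments, and `--update-freq 8` accumulates gradients over 8 mini-batches before each parameter update.)
```
CUDA_VISIBLE_DEVICES=0 python train.py --update-freq 8 (...)
```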
# Training with half precision floating point (FP16)
> Note: FP16 training requires a Volta GPU and CUDA 9.1 or greater
Recent GPUs enable efficient half precision floating point computation, e.g., using [Nvidia Tensor Cores](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).
Fairseq supports FP16 training with the `--fp16` flag:
```
python train.py --fp16 (...)
```
# Distributed training
Distributed training in fairseq is implemented on top of [torch.distributed](http://pytorch.org/docs/master/distributed.html).
Training begins by launching one worker process per GPU.
These workers discover each other via a unique host and port (required) that can be used to establish an initial connection.
Additionally, each worker is assigned a rank: a unique number from 0 to n-1, where n is the total number of GPUs.
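As a rough sketch (not fairseq's actual launcher code), each worker joins the same group through `torch.distributed` along these lines; the host, port, world size, and rank shown are placeholders:
```
# Sketch: every worker process calls this with the shared host:port and its own rank.
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                        # typical backend for multi-GPU training
    init_method="tcp://192.168.1.1:9218",  # shared host:port for the initial connection (placeholder)
    world_size=128,                        # n = total number of GPUs across all nodes
    rank=0,                                # this worker's unique rank in [0, n-1]
)
```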
If you run on a cluster managed by [SLURM](https://slurm.schedmd.com/), you can train a large English-French model on the WMT 2014 dataset on 16 nodes with 8 GPUs each (in total 128 GPUs) using this command:
```
$ DATA=... # path to the preprocessed dataset, must be visible from all nodes
$ PORT=9218 # any available TCP port that can be used by the trainer to establish an initial connection