# Introduction

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks. It provides reference implementations of various sequence-to-sequence models, including:
- **Convolutional Neural Networks (CNN)**
  - [Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](https://arxiv.org/abs/1612.08083)
  - [Gehring et al. (2017): Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122)
  - **_New_** [Edunov et al. (2018): Classical Structured Prediction Losses for Sequence to Sequence Learning](https://arxiv.org/abs/1711.04956)
  - **_New_** [Fan et al. (2018): Hierarchical Neural Story Generation](https://arxiv.org/abs/1805.04833)
- **Long Short-Term Memory (LSTM) networks**
  - [Luong et al. (2015): Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
  - [Wiseman and Rush (2016): Sequence-to-Sequence Learning as Beam-Search Optimization](https://arxiv.org/abs/1606.02960)
- **Transformer (self-attention) networks**
  - [Vaswani et al. (2017): Attention Is All You Need](https://arxiv.org/abs/1706.03762)
  - **_New_** [Ott et al. (2018): Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187)

Fairseq features:
- multi-GPU (distributed) training on one machine or across multiple machines
- fast beam search generation on both CPU and GPU
- large mini-batch training (even on a single GPU) via delayed updates
- fast half-precision floating point (FP16) training

We also provide [pre-trained models](#pre-trained-models) for several benchmark translation datasets.

![Model](fairseq.gif)

# Requirements and Installation
* A [PyTorch installation](http://pytorch.org/)
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* Python version 3.6

Currently fairseq requires PyTorch version >= 0.4.0.
Please follow the instructions here: https://github.com/pytorch/pytorch#installation.

If you use Docker, make sure to increase the shared memory size, either with `--ipc=host` or `--shm-size`, as command-line options to `nvidia-docker run`.
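
For example, a container could be started like this (a sketch only; `fairseq-image` is a placeholder for whatever image you use):
```
# Share the host's IPC namespace so PyTorch data loader workers get enough shared memory.
$ nvidia-docker run --ipc=host -it --rm fairseq-image bash
# Alternatively, set an explicit shared memory size:
$ nvidia-docker run --shm-size=8g -it --rm fairseq-image bash
```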

After PyTorch is installed, you can install fairseq with:
```
pip install -r requirements.txt
python setup.py build
python setup.py develop
```

# Quick Start

The following command-line tools are provided:
* `python preprocess.py`: Data pre-processing: build vocabularies and binarize training data
* `python train.py`: Train a new model on one or multiple GPUs
* `python generate.py`: Translate pre-processed data with a trained model
* `python interactive.py`: Translate raw text with a trained model
* `python score.py`: BLEU scoring of generated translations against reference translations
* `python eval_lm.py`: Language model evaluation
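
Each of these tools accepts a `--help` flag that lists its available options, for example:
```
$ python train.py --help
$ python generate.py --help
```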

## Evaluating Pre-trained Models
First, download a pre-trained model along with its vocabularies:
```
$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
```

This model uses a [Byte Pair Encoding (BPE) vocabulary](https://arxiv.org/abs/1508.07909), so we'll have to apply the encoding to the source text before it can be translated.
This can be done with the [apply_bpe.py](https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py) script using the `wmt14.en-fr.fconv-py/bpecodes` file.
`@@` is used as a continuation marker, and the original text can be easily recovered with e.g. `sed 's/@@ //g'` or by passing the `--remove-bpe` flag to `generate.py`.
Prior to BPE, input text needs to be tokenized using `tokenizer.perl` from [mosesdecoder](https://github.com/moses-smt/mosesdecoder).
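
For example, raw English input could be prepared roughly as follows (a sketch only; the paths to the `mosesdecoder` and `subword-nmt` checkouts and the `input.*` file names are placeholders):
```
# Tokenize the raw text with the Moses tokenizer (placeholder paths/file names).
$ perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < input.en > input.tok.en
# Apply the model's BPE codes to the tokenized text.
$ python subword-nmt/apply_bpe.py -c wmt14.en-fr.fconv-py/bpecodes < input.tok.en > input.bpe.en
```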

Let's use `python interactive.py` to generate translations interactively.
Here, we use a beam size of 5:
```
$ MODEL_DIR=wmt14.en-fr.fconv-py
$ python interactive.py \
 --path $MODEL_DIR/model.pt $MODEL_DIR \
 --beam 5
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
> Why is it rare to discover new marine mam@@ mal species ?
O       Why is it rare to discover new marine mam@@ mal species ?
H       -0.06429661810398102    Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins ?
A       0 1 3 3 5 6 6 8 8 8 7 11 12
```

This generation script produces four types of outputs: a line prefixed with *S* shows the supplied source sentence after applying the vocabulary; *O* is a copy of the original source sentence; *H* is the hypothesis along with an average log-likelihood; and *A* is the attention maxima for each word in the hypothesis, including the end-of-sentence marker which is omitted from the text.

Check [below](#pre-trained-models) for a full list of pre-trained models available.

## Training a New Model

The following tutorial is for machine translation.
For an example of how to use Fairseq for other tasks, such as [language modeling](examples/language_model/README.md), please see the `examples/` directory.

### Data Pre-processing

Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German).
To pre-process and binarize the IWSLT dataset:
```
$ cd examples/translation/
$ bash prepare-iwslt14.sh
$ cd ../..
$ TEXT=examples/translation/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en
```
This will write binarized data that can be used for model training to `data-bin/iwslt14.tokenized.de-en`.

### Training
Use `python train.py` to train a new model.
Here are a few example settings that work well for the IWSLT 2014 dataset:
```
$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
```

By default, `python train.py` will use all available GPUs on your machine.
Use the [CUDA_VISIBLE_DEVICES](http://acceleware.com/blog/cudavisibledevices-masking-gpus) environment variable to select specific GPUs and/or to change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum number of tokens per batch (`--max-tokens`).
You may need to use a smaller value depending on the available GPU memory on your system.
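
For example (the values here are illustrative only), to restrict training to two specific GPUs and halve the per-batch token budget:
```
$ CUDA_VISIBLE_DEVICES=0,1 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 2000 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
```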

### Generation
Once your model is trained, you can generate translations using `python generate.py` **(for binarized data)** or `python interactive.py` **(for raw text)**:
```
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5
  | [de] dictionary: 35475 types
  | [en] dictionary: 24739 types
  | data-bin/iwslt14.tokenized.de-en test 6750 examples
  | model fconv
  | loaded checkpoint checkpoints/fconv/checkpoint_best.pt
  S-721   danke .
  T-721   thank you .
  ...
```

To generate translations with only a CPU, use the `--cpu` flag.
BPE continuation markers can be removed with the `--remove-bpe` flag.
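
For example (illustrative only), the same IWSLT model can be decoded on the CPU with BPE markers stripped:
```
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5 --cpu --remove-bpe
```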

# Pre-trained Models

We provide the following pre-trained models and pre-processed, binarized test sets:

### Translation

Description | Dataset | Model | Test set(s)
---|---|---|---
Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2) <br> newstest2012/2013: <br> [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.ntst1213.tar.bz2)
Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-German](https://nlp.stanford.edu/projects/nmt) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-de.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-de.newstest2014.tar.bz2)
Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) | newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/models/wmt16.en-de.joined-dict.transformer.tar.bz2) | newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)

### Language models

Description | Dataset | Model | Test set(s)
---|---|---|---
Convolutional <br> ([Dauphin et al., 2017](https://arxiv.org/abs/1612.08083)) | [Google Billion Words](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/models/gbw_fconv_lm.tar.bz2) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/data/gbw_test_lm.tar.bz2)
Convolutional <br> ([Dauphin et al., 2017](https://arxiv.org/abs/1612.08083)) | [WikiText-103](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/models/wiki103_fconv_lm.tar.bz2) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/data/wiki103_test_lm.tar.bz2)
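
These language models can be evaluated with `eval_lm.py` on the matching binarized test set. A hedged sketch for WikiText-103 (the extracted directory name `wiki103_test_lm` and checkpoint path `wiki103_fconv_lm/model.pt` are assumptions; check the actual contents of the archives after extraction):
```
# Download the WikiText-103 model and test set (the extracted names below are assumptions).
$ curl https://s3.amazonaws.com/fairseq-py/models/wiki103_fconv_lm.tar.bz2 | tar xvjf -
$ curl https://s3.amazonaws.com/fairseq-py/data/wiki103_test_lm.tar.bz2 | tar xvjf -
# Report perplexity of the pre-trained model on the binarized test data.
$ python eval_lm.py wiki103_test_lm --path wiki103_fconv_lm/model.pt
```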

### Stories

Description | Dataset | Model | Test set(s)
---|---|---|---
Stories with Convolutional Model <br> ([Fan et al., 2018](https://arxiv.org/abs/1805.04833)) | [WritingPrompts](https://arxiv.org/abs/1805.04833) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/models/stories_checkpoint.tar.bz2) | [download (.tar.bz2)](https://s3.amazonaws.com/fairseq-py/data/stories_test.tar.bz2)


### Usage

Generation with the binarized test sets can be run in batch mode as follows, e.g. for WMT 2014 English-French on a GTX-1080ti:
```
$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
$ curl https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
$ python generate.py data-bin/wmt14.en-fr.newstest2014  \
  --path data-bin/wmt14.en-fr.fconv-py/model.pt \
  --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
...
| Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
| Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)

# Scoring with score.py:
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
```

# Large mini-batch training with delayed updates

The `--update-freq` option can be used to accumulate gradients from multiple mini-batches and delay updating,
creating a larger effective batch size.
Delayed updates can also improve training speed by reducing inter-GPU communication costs and by saving idle time caused by variance in workload across GPUs.
See [Ott et al. (2018)](https://arxiv.org/abs/1806.00187) for more details.

To train on a single GPU with an effective batch size that is equivalent to training on 8 GPUs:
```
CUDA_VISIBLE_DEVICES=0 python train.py --update-freq 8 (...)
```

# Training with half precision floating point (FP16)

> Note: FP16 training requires a Volta GPU and CUDA 9.1 or greater

Recent GPUs enable efficient half precision floating point computation, e.g., using [Nvidia Tensor Cores](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).

Fairseq supports FP16 training with the `--fp16` flag:
```
python train.py --fp16 (...)
```

# Distributed training

Distributed training in fairseq is implemented on top of [torch.distributed](http://pytorch.org/docs/master/distributed.html).
Training begins by launching one worker process per GPU.
These workers discover each other via a unique host and port (required) that can be used to establish an initial connection.
Additionally, each worker is assigned a rank: a unique number from 0 to n-1, where n is the total number of GPUs.

If you run on a cluster managed by [SLURM](https://slurm.schedmd.com/) you can train a large English-French model on the WMT 2014 dataset on 16 nodes with 8 GPUs each (in total 128 GPUs) using this command:

```
$ DATA=...   # path to the preprocessed dataset, must be visible from all nodes
$ PORT=9218  # any available TCP port that can be used by the trainer to establish initial connection
$ sbatch --job-name fairseq-py --gres gpu:8 --cpus-per-task 10 \
    --nodes 16 --ntasks-per-node 8 \
    --wrap 'srun --output train.log.node%t --error train.stderr.node%t.%j \
    python train.py $DATA \
    --distributed-world-size 128 \
    --distributed-port $PORT \
    --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
    --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
    --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 --wd 0.0001'
```

Alternatively, you can manually start one process per GPU:
```
$ DATA=...  # path to the preprocessed dataset, must be visible from all nodes
$ HOST_PORT=master.devserver.com:9218  # one of the hosts used by the job
$ RANK=...  # the rank of this process, from 0 to 127 in case of 128 GPUs
$ python train.py $DATA \
    --distributed-world-size 128 \
    --distributed-init-method "tcp://$HOST_PORT" \
    --distributed-rank $RANK \
    --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
    --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
    --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 --wd 0.0001
```

# Join the fairseq community

* Facebook page: https://www.facebook.com/groups/fairseq.users
* Google group: https://groups.google.com/forum/#!forum/fairseq-users

# Citation

If you use the code in your paper, then please cite it as:

```
@inproceedings{gehring2017convs2s,
  author    = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
  title     = "{Convolutional Sequence to Sequence Learning}",
  booktitle = {Proc. of ICML},
  year      = 2017,
}
```

# License
fairseq(-py) is BSD-licensed.
The license applies to the pre-trained models as well.
We also provide an additional patent grant.

# Credits
This is a PyTorch version of [fairseq](https://github.com/facebookresearch/fairseq), a sequence-to-sequence learning toolkit from Facebook AI Research. The original authors of this reimplementation are (in no particular order) Sergey Edunov, Myle Ott, and Sam Gross.