# Introduction

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization and other text generation tasks. It provides reference implementations of various sequence-to-sequence models, including:
- **Convolutional Neural Networks (CNN)**
  - [Gehring et al. (2017): Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122)
  - [Edunov et al. (2018): Classical Structured Prediction Losses for Sequence to Sequence Learning](https://arxiv.org/abs/1711.04956)
- **Long Short-Term Memory (LSTM) networks**
  - [Luong et al. (2015): Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
  - [Wiseman and Rush (2016): Sequence-to-Sequence Learning as Beam-Search Optimization](https://arxiv.org/abs/1606.02960)
 
Fairseq features multi-GPU (distributed) training on one machine or across multiple machines, fast beam search generation on both CPU and GPU, and includes pre-trained models for several benchmark translation datasets.

![Model](fairseq.gif)

# Requirements and Installation
* A [PyTorch installation](http://pytorch.org/)
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* Python version 3.6

Currently fairseq requires PyTorch version >= 0.4.0.
Please follow the instructions here: https://github.com/pytorch/pytorch#installation.

If you use Docker, make sure to increase the shared memory size, either with `--ipc=host` or `--shm-size`, as command-line options to `nvidia-docker run`.
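
For example, a minimal invocation might look like the following (the `pytorch/pytorch` image name is only a placeholder; use whatever image you normally run):

```
$ nvidia-docker run --ipc=host -it --rm pytorch/pytorch bash
```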

After PyTorch is installed, you can install fairseq with:
```
pip install -r requirements.txt
python setup.py build
python setup.py develop
```

# Quick Start

The following command-line tools are provided:
* `python preprocess.py`: Data pre-processing: build vocabularies and binarize training data
* `python train.py`: Train a new model on one or multiple GPUs
* `python generate.py`: Translate pre-processed data with a trained model
* `python interactive.py`: Translate raw text with a trained model
* `python score.py`: BLEU scoring of generated translations against reference translations

## Evaluating Pre-trained Models
First, download a pre-trained model along with its vocabularies:
```
$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
```

This model uses a [Byte Pair Encoding (BPE) vocabulary](https://arxiv.org/abs/1508.07909), so we'll have to apply the encoding to the source text before it can be translated.
This can be done with the [apply_bpe.py](https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py) script using the `wmt14.en-fr.fconv-py/bpecodes` file.
`@@` is used as a continuation marker, and the original text can easily be recovered with e.g. `sed 's/@@ //g'` or by passing the `--remove-bpe` flag to `generate.py`.
Prior to BPE, the input text needs to be tokenized with `tokenizer.perl` from [mosesdecoder](https://github.com/moses-smt/mosesdecoder).
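
As a rough sketch, the full pipeline might look like this (assuming the [mosesdecoder](https://github.com/moses-smt/mosesdecoder) and [subword-nmt](https://github.com/rsennrich/subword-nmt) repositories are cloned next to the extracted model directory; adjust the paths to your setup):

```
$ echo "Why is it rare to discover new marine mammal species?" \
  | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
  | python subword-nmt/apply_bpe.py -c wmt14.en-fr.fconv-py/bpecodes
Why is it rare to discover new marine mam@@ mal species ?
```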

Let's use `python interactive.py` to generate translations interactively.
Here, we use a beam size of 5:
```
$ MODEL_DIR=wmt14.en-fr.fconv-py
$ python interactive.py \
 --path $MODEL_DIR/model.pt $MODEL_DIR \
 --beam 5
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
> Why is it rare to discover new marine mam@@ mal species ?
O       Why is it rare to discover new marine mam@@ mal species ?
H       -0.06429661810398102    Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins ?
A       0 1 3 3 5 6 6 8 8 8 7 11 12
```

The generation script produces four types of output: a line prefixed with *S* shows the supplied source sentence after applying the vocabulary; *O* is a copy of the original source sentence; *H* is the hypothesis together with an average log-likelihood; and *A* shows the attention maxima for each word in the hypothesis, including the end-of-sentence marker, which is omitted from the text.

Check [below](#pre-trained-models) for a full list of pre-trained models available.


## Training a New Model

### Data Pre-processing
Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German).
To pre-process and binarize the IWSLT dataset:
```
$ cd data/
$ bash prepare-iwslt14.sh
$ cd ..
$ TEXT=data/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en
```
This will write binarized data that can be used for model training to `data-bin/iwslt14.tokenized.de-en`.

### Training
Use `python train.py` to train a new model.
Here are a few example settings that work well for the IWSLT 2014 dataset:
```
$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
```

By default, `python train.py` will use all available GPUs on your machine.
Use the [CUDA_VISIBLE_DEVICES](http://acceleware.com/blog/cudavisibledevices-masking-gpus) environment variable to select specific GPUs and/or to change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum number of tokens per batch (`--max-tokens`).
You may need to use a smaller value depending on the available GPU memory on your system.
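
For example, to restrict training to the first two GPUs and reduce the per-batch token budget (the values here are purely illustrative):

```
$ CUDA_VISIBLE_DEVICES=0,1 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 2000 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
```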

### Generation
Once your model is trained, you can generate translations using `python generate.py` **(for binarized data)** or `python interactive.py` **(for raw text)**:
```
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5
  | [de] dictionary: 35475 types
  | [en] dictionary: 24739 types
  | data-bin/iwslt14.tokenized.de-en test 6750 examples
  | model fconv
  | loaded checkpoint checkpoints/fconv/checkpoint_best.pt
  S-721   danke .
  T-721   thank you .
  ...
```

To generate translations with only a CPU, use the `--cpu` flag.
BPE continuation markers can be removed with the `--remove-bpe` flag.
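
For example, to re-run the generation step above entirely on the CPU and strip the BPE markers from the output:

```
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5 --cpu --remove-bpe
```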

# Pre-trained Models

We provide the following pre-trained fully convolutional sequence-to-sequence models:

* [wmt14.en-fr.fconv-py.tar.bz2](https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2): Pre-trained model for [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) including vocabularies
* [wmt14.en-de.fconv-py.tar.bz2](https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-de.fconv-py.tar.bz2): Pre-trained model for [WMT14 English-German](https://nlp.stanford.edu/projects/nmt) including vocabularies

In addition, we provide pre-processed and binarized test sets for the models above:
* [wmt14.en-fr.newstest2014.tar.bz2](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2): newstest2014 test set for WMT14 English-French
* [wmt14.en-fr.ntst1213.tar.bz2](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.ntst1213.tar.bz2): newstest2012 and newstest2013 test sets for WMT14 English-French
* [wmt14.en-de.newstest2014.tar.bz2](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-de.newstest2014.tar.bz2): newstest2014 test set for WMT14 English-German

Generation with the binarized test sets can be run in batch mode as follows, e.g. for WMT 2014 English-French on a GTX-1080ti:
```
$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
$ curl https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
$ python generate.py data-bin/wmt14.en-fr.newstest2014  \
  --path data-bin/wmt14.en-fr.fconv-py/model.pt \
  --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
...
| Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
| Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)

# Scoring with score.py:
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
```

# Distributed Training

Distributed training in fairseq is implemented on top of [torch.distributed](http://pytorch.org/docs/master/distributed.html).
Training begins by launching one worker process per GPU.
The workers discover each other through a unique host and port (required) that is used to establish an initial connection.
Additionally, each worker is assigned a rank: a unique number from 0 to n-1, where n is the total number of GPUs.

If you run on a cluster managed by [SLURM](https://slurm.schedmd.com/), you can train a large English-French model on the WMT 2014 dataset on 16 nodes with 8 GPUs each (128 GPUs in total) using the following command:

```
$ DATA=... # path to the preprocessed dataset, must be visible from all nodes
$ PORT=9218 # any available TCP port that the trainers can use to establish an initial connection
$ sbatch --job-name fairseq-py --gres gpu:8 --nodes 16 --ntasks-per-node 8 \
    --cpus-per-task 10 --no-requeue --wrap 'srun --output train.log.node%t \
    --error train.stderr.node%t.%j python train.py $DATA --distributed-world-size 128 \
    --distributed-port $PORT --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
    --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
    --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 --wd 0.0001'
```

Alternatively, you can start one process per GPU manually:
```
$ DATA=... # path to the preprocessed dataset, must be visible from all nodes
$ HOST_PORT=your.devserver.com:9218 # one of the hosts that will be used by the job; the port on that host must be available
$ RANK=... # the rank of this process; ranges from 0 to 127 in the case of 128 GPUs
$ python train.py $DATA --distributed-world-size 128 \
      --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
      --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
      --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
      --label-smoothing 0.1 --wd 0.0001 \
      --distributed-init-method="tcp://$HOST_PORT" --distributed-rank=$RANK
```

# Join the fairseq community

* Facebook page: https://www.facebook.com/groups/fairseq.users
* Google group: https://groups.google.com/forum/#!forum/fairseq-users

# Citation

If you use the code in your paper, please cite it as:

```
@inproceedings{gehring2017convs2s,
  author    = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
  title     = "{Convolutional Sequence to Sequence Learning}",
  booktitle = {Proc. of ICML},
  year      = 2017,
}
```

# License
fairseq(-py) is BSD-licensed.
The license applies to the pre-trained models as well.
We also provide an additional patent grant.

# Credits
This is a PyTorch version of [fairseq](https://github.com/facebookresearch/fairseq), a sequence-to-sequence learning toolkit from Facebook AI Research. The original authors of this reimplementation are (in no particular order) Sergey Edunov, Myle Ott, and Sam Gross.