# Introduction
FAIR Sequence-to-Sequence Toolkit (PyTorch)

This is a PyTorch version of [fairseq](https://github.com/facebookresearch/fairseq), a sequence-to-sequence learning toolkit from Facebook AI Research. The original authors of this reimplementation are (in no particular order) Sergey Edunov, Myle Ott, and Sam Gross. The toolkit implements the fully convolutional model described in [Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122) and features multi-GPU training on a single machine as well as fast beam search generation on both CPU and GPU. We provide pre-trained models for English to French and English to German translation.

![Model](fairseq.gif)

# Citation

If you use the code in your paper, please cite it as:

```
@inproceedings{gehring2017convs2s,
  author    = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
  title     = "{Convolutional Sequence to Sequence Learning}",
  booktitle = {Proc. of ICML},
  year      = 2017,
}
```

# Requirements and Installation
* A computer running macOS or Linux
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* Python version 3.6
* A [PyTorch installation](http://pytorch.org/)

Currently fairseq-py requires PyTorch version >= 0.3.0.
Please follow the instructions here: https://github.com/pytorch/pytorch#installation.

If you use Docker, make sure to increase the shared memory size by passing either `--ipc=host` or `--shm-size` as a command-line option to `nvidia-docker run`.
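A minimal sketch of such an invocation (the image name and mounted directory are placeholders, not something this repository provides):
```
# hypothetical image name and mount path; --ipc=host raises the container's shared memory limit
$ nvidia-docker run --ipc=host -it --rm -v $PWD:/workspace <your-pytorch-image> bash
```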

After PyTorch is installed, you can install fairseq-py with:
```
pip install -r requirements.txt
python setup.py build
python setup.py develop
```

# Quick Start

The following command-line tools are available:
* `python preprocess.py`: Data pre-processing: build vocabularies and binarize training data
* `python train.py`: Train a new model on one or multiple GPUs
* `python generate.py`: Translate pre-processed data with a trained model
* `python interactive.py`: Translate raw text with a trained model
* `python score.py`: BLEU scoring of generated translations against reference translations
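Each of these scripts should print its full list of options when invoked with `--help`, for example:
```
$ python train.py --help
```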

## Evaluating Pre-trained Models
First, download a pre-trained model along with its vocabularies:
```
$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
```

This model uses a [Byte Pair Encoding (BPE) vocabulary](https://arxiv.org/abs/1508.07909), so we'll have to apply the encoding to the source text before it can be translated.
This can be done with the [apply_bpe.py](https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py) script using the `wmt14.en-fr.fconv-py/bpecodes` file.
`@@` is used as a continuation marker, and the original text can be easily recovered with e.g. `sed 's/@@ //g'` or by passing the `--remove-bpe` flag to `generate.py`.
Prior to BPE, the input text needs to be tokenized using `tokenizer.perl` from [mosesdecoder](https://github.com/moses-smt/mosesdecoder).
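A rough sketch of this pre-processing, assuming `mosesdecoder` and `subword-nmt` are checked out alongside fairseq-py (the input/output file names are placeholders):
```
# assumed repository locations and placeholder file names
$ perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < input.en > input.tok.en
$ python subword-nmt/apply_bpe.py -c wmt14.en-fr.fconv-py/bpecodes < input.tok.en > input.bpe.en
```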

Let's use `python interactive.py` to generate translations interactively.
Here, we use a beam size of 5:
```
$ MODEL_DIR=wmt14.en-fr.fconv-py
$ python interactive.py \
 --path $MODEL_DIR/model.pt $MODEL_DIR \
 --beam 5
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
> Why is it rare to discover new marine mam@@ mal species ?
O       Why is it rare to discover new marine mam@@ mal species ?
H       -0.06429661810398102    Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins ?
A       0 1 3 3 5 6 6 8 8 8 7 11 12
```

This generation script produces four types of outputs: a line prefixed with *S* shows the supplied source sentence after applying the vocabulary; *O* is a copy of the original source sentence; *H* is the hypothesis along with an average log-likelihood; and *A* is the attention maxima for each word in the hypothesis, including the end-of-sentence marker which is omitted from the text.

Check [below](#pre-trained-models) for a full list of available pre-trained models.


## Training a New Model

### Data Pre-processing
The fairseq-py source distribution contains an example pre-processing script for
the IWSLT 2014 German-English corpus.
Pre-process and binarize the data as follows:
```
$ cd data/
$ bash prepare-iwslt14.sh
$ cd ..
$ TEXT=data/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en
```
This will write binarized data that can be used for model training to `data-bin/iwslt14.tokenized.de-en`.

### Training
Use `python train.py` to train a new model.
Here are a few example settings that work well for the IWSLT 2014 dataset:
```
$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
```

By default, `python train.py` will use all available GPUs on your machine.
Use the [CUDA_VISIBLE_DEVICES](http://acceleware.com/blog/cudavisibledevices-masking-gpus) environment variable to select specific GPUs and/or to change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum number of tokens per batch (`--max-tokens`).
You may need to use a smaller value depending on the available GPU memory on your system.
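For example, a run restricted to two specific GPUs with a smaller token budget might look like the following; the device ids and the `--max-tokens` value are illustrative only, so pick values that fit your hardware:
```
# illustrative only: train on GPUs 0 and 1 with a smaller per-batch token budget
$ CUDA_VISIBLE_DEVICES=0,1 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 2000 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
```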

### Generation
Once your model is trained, you can generate translations using `python generate.py` **(for binarized data)** or `python interactive.py` **(for raw text)**:
```
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5
  | [de] dictionary: 35475 types
  | [en] dictionary: 24739 types
  | data-bin/iwslt14.tokenized.de-en test 6750 examples
  | model fconv
  | loaded checkpoint checkpoints/fconv/checkpoint_best.pt
  S-721   danke .
  T-721   thank you .
  ...
```

To generate translations with only a CPU, use the `--cpu` flag.
BPE continuation markers can be removed with the `--remove-bpe` flag.
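As an illustration of the two flags (reusing the generation command from above):
```
# illustrative only: CPU-only generation with BPE markers stripped
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5 --cpu --remove-bpe
```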

# Pre-trained Models

We provide the following pre-trained fully convolutional sequence-to-sequence models:

* [wmt14.en-fr.fconv-py.tar.bz2](https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2): Pre-trained model for [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) including vocabularies
* [wmt14.en-de.fconv-py.tar.bz2](https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-de.fconv-py.tar.bz2): Pre-trained model for [WMT14 English-German](https://nlp.stanford.edu/projects/nmt) including vocabularies

In addition, we provide pre-processed and binarized test sets for the models above:
* [wmt14.en-fr.newstest2014.tar.bz2](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2): newstest2014 test set for WMT14 English-French
* [wmt14.en-fr.ntst1213.tar.bz2](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.ntst1213.tar.bz2): newstest2012 and newstest2013 test sets for WMT14 English-French
* [wmt14.en-de.newstest2014.tar.bz2](https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-de.newstest2014.tar.bz2): newstest2014 test set for WMT14 English-German

Generation with the binarized test sets can be run in batch mode as follows, e.g. for English-French on a GTX-1080ti:
```
$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
$ curl https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
$ python generate.py data-bin/wmt14.en-fr.newstest2014  \
  --path data-bin/wmt14.en-fr.fconv-py/model.pt \
  --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
...
| Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
| Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)

# Scoring with score.py:
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
```

# Join the fairseq community

* Facebook page: https://www.facebook.com/groups/fairseq.users
* Google group: https://groups.google.com/forum/#!forum/fairseq-users

# License
fairseq-py is BSD-licensed.
The license applies to the pre-trained models as well.
We also provide an additional patent grant.