# Scaling Neural Machine Translation (Ott et al., 2018)

This page includes instructions for reproducing results from the paper [Scaling Neural Machine Translation (Ott et al., 2018)](https://arxiv.org/abs/1806.00187).

## Pre-trained models

Model | Description | Dataset | Download
---|---|---|---
`transformer.wmt14.en-fr` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
`transformer.wmt16.en-de` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)

## Training a new model on WMT'16 En-De

First download the [preprocessed WMT'16 En-De data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8).
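
If you prefer to do this from the command line, one option is the third-party [gdown](https://github.com/wkentaro/gdown) tool; this is just a convenience sketch (the `gdown` dependency and the output filename are not part of the original instructions):

```bash
# Assumes `pip install gdown`; name the archive to match the extraction step below.
gdown 'https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8' \
    -O wmt16_en_de.tar.gz
```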

Then:

##### 1. Extract the WMT'16 En-De data
```bash
TEXT=wmt16_en_de_bpe32k
mkdir -p $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT
```

##### 2. Preprocess the dataset with a joined dictionary
```bash
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train.tok.clean.bpe.32000 \
    --validpref $TEXT/newstest2013.tok.bpe.32000 \
    --testpref $TEXT/newstest2014.tok.bpe.32000 \
    --destdir data-bin/wmt16_en_de_bpe32k \
    --nwordssrc 32768 --nwordstgt 32768 \
    --joined-dictionary \
    --workers 20
```
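
Since `--joined-dictionary` builds a single vocabulary shared by source and target, a quick sanity check (assuming fairseq's usual layout of `dict.<lang>.txt` files in the destination directory) is that the two dictionary files come out identical:

```bash
# With a joined dictionary, the source and target dictionaries should match exactly.
cmp data-bin/wmt16_en_de_bpe32k/dict.en.txt data-bin/wmt16_en_de_bpe32k/dict.de.txt \
    && echo "joined dictionary OK"
```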

##### 3. Train a model
```bash
fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --fp16
```

Note that the `--fp16` flag requires CUDA 9.1 or greater and a Volta (or newer) GPU.
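
If you are unsure whether your GPU qualifies, a quick check (a sketch assuming PyTorch is installed; Volta corresponds to CUDA compute capability 7.0) is:

```bash
# Prints the compute capability of GPU 0; (7, 0) or higher means Volta or newer.
python -c "import torch; print(torch.cuda.get_device_capability(0))"
```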

***IMPORTANT:*** You will get better performance by training with big batches and
increasing the learning rate. If you want to train the above model with big batches
(assuming your machine has 8 GPUs):
- add `--update-freq 16` to simulate training on 8x16=128 GPUs
- increase the learning rate; 0.001 works well for big batches (see the sketch after this list)
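
For reference, a sketch of the adjusted command (identical to step 3 except for `--lr` and the added `--update-freq`):

```bash
fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --update-freq 16 \
    --fp16
```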

##### 4. Evaluate

Now we can evaluate our trained model.

Note that the original [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
paper used a couple of tricks to achieve better BLEU scores. We use these same tricks in
the Scaling NMT paper, so it's important to apply them when reproducing our results.

First, use the [average_checkpoints.py](/scripts/average_checkpoints.py) script to
average the last few checkpoints. Averaging the last 5-10 checkpoints is usually
good, but you may need to adjust this depending on how long you've trained:
```bash
python scripts/average_checkpoints.py \
    --inputs /path/to/checkpoints \
    --num-epoch-checkpoints 5 \
    --output checkpoint.avg5.pt
```

Next, generate translations using a beam width of 4 and length penalty of 0.6:
```bash
fairseq-generate \
    data-bin/wmt16_en_de_bpe32k \
    --path checkpoint.avg5.pt \
    --beam 4 --lenpen 0.6 --remove-bpe
```
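
To score the translations separately, one option (a sketch that assumes fairseq's usual `H-`/`T-` prefixed output lines and the `fairseq-score` utility; `gen.out` is just a placeholder name) is to redirect the generation output to a file and split it into hypotheses and references:

```bash
# Redirect the fairseq-generate command above into gen.out, then:
grep ^H gen.out | cut -f3- > gen.out.sys   # hypothesis lines: H-<id> <score> <text>
grep ^T gen.out | cut -f2- > gen.out.ref   # reference lines:  T-<id> <text>
fairseq-score --sys gen.out.sys --ref gen.out.ref
```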

## Citation

```bibtex
@inproceedings{ott2018scaling,
  title = {Scaling Neural Machine Translation},
  author = {Ott, Myle and Edunov, Sergey and Grangier, David and Auli, Michael},
  booktitle = {Proceedings of the Third Conference on Machine Translation (WMT)},
  year = 2018,
}
```