# Example usage for Neural Machine Translation

These scripts provide an example of pre-processing data for the NMT task
and instructions for how to replicate the results from the paper [Scaling Neural Machine Translation (Ott et al., 2018)](https://arxiv.org/abs/1806.00187).

## Preprocessing

### prepare-iwslt14.sh

Provides an example of pre-processing for the IWSLT'14 German to English translation task: ["Report on the 11th IWSLT evaluation campaign" by Cettolo et al.](http://workshop2014.iwslt.org/downloads/proceeding.pdf)

Example usage:
```
$ cd examples/translation/
$ bash prepare-iwslt14.sh
$ cd ../..

# Binarize the dataset:
$ TEXT=examples/translation/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en

# Train the model (better for a single GPU setup):
$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 200 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

# Generate:
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5 --remove-bpe

```
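You can also translate raw text interactively with the trained model. This is only a sketch: it assumes your fairseq checkout includes `interactive.py` and that it accepts the same data directory, `--path`, `--beam` and `--remove-bpe` arguments as `generate.py`; input typed on stdin must be tokenized and BPE-encoded the same way as the training data:

```
# Hypothetical interactive translation with the IWSLT model trained above
$ python interactive.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --beam 5 --remove-bpe
```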


### prepare-wmt14en2de.sh

Provides an example of pre-processing for the WMT'14 English to German translation task. By default it produces a dataset modeled after the one used in ["Attention Is All You Need" by Vaswani et al.](https://arxiv.org/abs/1706.03762), which includes the news-commentary-v12 data.

To use only the data available in WMT'14, or to replicate the results of the original paper ["Convolutional Sequence to Sequence Learning" by Gehring et al.](https://arxiv.org/abs/1705.03122), run the script with the `--icml17` flag instead:

```
$ bash prepare-wmt14en2de.sh --icml17
```

Example usage:

```
$ cd examples/translation/
$ bash prepare-wmt14en2de.sh
$ cd ../..

# Binarize the dataset:
$ TEXT=examples/translation/wmt14_en_de
$ python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_de --thresholdtgt 0 --thresholdsrc 0

# Train the model:
# If it runs out of memory, try setting --max-tokens 1500 instead
$ mkdir -p checkpoints/fconv_wmt_en_de
$ python train.py data-bin/wmt14_en_de \
  --lr 0.5 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 50 \
  --arch fconv_wmt_en_de --save-dir checkpoints/fconv_wmt_en_de

# Generate:
$ python generate.py data-bin/wmt14_en_de \
  --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt --beam 5 --remove-bpe

```

### prepare-wmt14en2fr.sh

Provides an example of pre-processing for the WMT'14 English to French translation task.

Example usage:

```
$ cd examples/translation/
$ bash prepare-wmt14en2fr.sh
$ cd ../..

# Binarize the dataset:
$ TEXT=examples/translation/wmt14_en_fr
$ python preprocess.py --source-lang en --target-lang fr \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0

# Train the model:
# If it runs out of memory, try setting --max-tokens 1000 instead
$ mkdir -p checkpoints/fconv_wmt_en_fr
$ python train.py data-bin/wmt14_en_fr \
  --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 50 \
  --arch fconv_wmt_en_fr --save-dir checkpoints/fconv_wmt_en_fr

# Generate:
$ python generate.py data-bin/wmt14_en_fr \
  --path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt --beam 5 --remove-bpe

```

## Replicating results from "Scaling Neural Machine Translation"

To replicate results from the paper [Scaling Neural Machine Translation (Ott et al., 2018)](https://arxiv.org/abs/1806.00187):

1. Prepare the WMT'14 En-De data with a BPE vocab of 32k:
```
$ BPE_TOKENS=32764 bash prepare-wmt14en2de.sh
$ cd ../..
```
2. Preprocess the dataset with a joined dictionary:
```
$ TEXT=examples/translation/wmt14_en_de
$ python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_de_joined_dict \
  --nwordssrc 32768 --nwordstgt 32768 \
  --joined-dictionary
```
3. Train a model (the `--share-all-embeddings` flag requires the joined dictionary built in step 2):
```
$ python train.py data-bin/wmt14_en_de_joined_dict \
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0005 --min-lr 1e-09 \
  --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 3584 \
  --fp16
```

Note that the `--fp16` flag requires CUDA 9.1 or greater and a Volta GPU.
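After training (with or without the large-batch settings described below), you can evaluate the model following the same generation pattern as the sections above. This is only a sketch: it assumes the default `--save-dir` of `checkpoints/` (the training command above does not override it), and the generation settings used in the paper may differ from the values shown here:

```
# Sketch only: the checkpoint location and beam size are assumptions
$ python generate.py data-bin/wmt14_en_de_joined_dict \
  --path checkpoints/checkpoint_best.pt \
  --batch-size 128 --beam 5 --remove-bpe
```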

If you want to train the above model with big batches (assuming your machine has 8 GPUs):
- add `--update-freq 16` to simulate training on 8*16=128 GPUs
- increase the learning rate; 0.001 works well for big batches (see the combined command sketched below)
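Putting both changes together, a sketch of the modified training command (all other hyperparameters are kept from the command above):

```
$ python train.py data-bin/wmt14_en_de_joined_dict \
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.001 --min-lr 1e-09 \
  --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 3584 --update-freq 16 \
  --fp16
```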