"vscode:/vscode.git/clone" did not exist on "c9064e6fd9a5356ee579e9d452bfad725f8e6f2c"
README.md 6.15 KB
Newer Older
1
# Example usage for Neural Machine Translation

These scripts provide an example of pre-processing data for the NMT task, along with instructions for replicating the results from the paper [Scaling Neural Machine Translation (Ott et al., 2018)](https://arxiv.org/abs/1806.00187).

## Preprocessing

### prepare-iwslt14.sh

Provides an example of pre-processing for the IWSLT'14 German to English translation task: ["Report on the 11th IWSLT evaluation campaign" by Cettolo et al.](http://workshop2014.iwslt.org/downloads/proceeding.pdf)
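
Roughly speaking, the script downloads the IWSLT'14 de-en corpus, tokenizes it with the Moses tokenizer, and applies byte pair encoding with subword-nmt before writing out the train/valid/test splits. The following is only a minimal hand-rolled sketch of that pipeline, not the script's exact settings: the file names and BPE size are illustrative, and mosesdecoder and subword-nmt are assumed to be cloned alongside.
```
# Tokenize both sides with the Moses tokenizer (illustrative file names).
$ perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < train.de > train.tok.de
$ perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en

# Learn a joint BPE model on the concatenated training data (size is illustrative),
# then apply it to each side.
$ cat train.tok.de train.tok.en | python subword-nmt/learn_bpe.py -s 10000 > bpe.codes
$ python subword-nmt/apply_bpe.py -c bpe.codes < train.tok.de > train.bpe.de
$ python subword-nmt/apply_bpe.py -c bpe.codes < train.tok.en > train.bpe.en
```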

Example usage:
```
$ cd examples/translation/
$ bash prepare-iwslt14.sh
$ cd ../..

# Binarize the dataset:
$ TEXT=examples/translation/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en

# Train the model (better for a single GPU setup):
$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 200 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

# Generate:
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5 --remove-bpe
```

To train a transformer model on IWSLT'14 German to English:
```
# Preparation steps are the same as for fconv model.

# Train the model (better for a single GPU setup):
$ mkdir -p checkpoints/transformer
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  -a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s de -t en \
  --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
  --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --max-update 50000 \
  --warmup-updates 4000 --warmup-init-lr '1e-07' \
  --adam-betas '(0.9, 0.98)' --save-dir checkpoints/transformer

# Average 10 latest checkpoints:
$ python scripts/average_checkpoints.py --inputs checkpoints/transformer \
   --num-epoch-checkpoints 10 --output checkpoints/transformer/model.pt

# Generate:
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/transformer/model.pt \
  --batch-size 128 --beam 5 --remove-bpe

```

### prepare-wmt14en2de.sh

Provides an example of pre-processing for the WMT'14 English to German translation task. By default it will produce a dataset modeled after ["Attention Is All You Need" by Vaswani et al.](https://arxiv.org/abs/1706.03762), which includes news-commentary-v12 data.

To use only data available in WMT'14, or to replicate the results of the original paper ["Convolutional Sequence to Sequence Learning" by Gehring et al.](https://arxiv.org/abs/1705.03122), run it with `--icml17` instead:

```
$ bash prepare-wmt14en2de.sh --icml17
```

Example usage:

```
$ cd examples/translation/
$ bash prepare-wmt14en2de.sh
$ cd ../..

# Binarize the dataset:
$ TEXT=examples/translation/wmt14_en_de
$ python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_de --thresholdtgt 0 --thresholdsrc 0

# Train the model:
# If it runs out of memory, try to set --max-tokens 1500 instead
$ mkdir -p checkpoints/fconv_wmt_en_de
$ python train.py data-bin/wmt14_en_de \
  --lr 0.5 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 50 \
  --arch fconv_wmt_en_de --save-dir checkpoints/fconv_wmt_en_de

# Generate:
$ python generate.py data-bin/wmt14_en_de \
  --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt --beam 5 --remove-bpe
```

### prepare-wmt14en2fr.sh

Provides an example of pre-processing for the WMT'14 English to French translation task.

Example usage:

```
$ cd examples/translation/
$ bash prepare-wmt14en2fr.sh
$ cd ../..

# Binarize the dataset:
$ TEXT=examples/translation/wmt14_en_fr
$ python preprocess.py --source-lang en --target-lang fr \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0

# Train the model:
# If it runs out of memory, try to set --max-tokens 1000 instead
$ mkdir -p checkpoints/fconv_wmt_en_fr
$ python train.py data-bin/wmt14_en_fr \
  --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 50 \
  --arch fconv_wmt_en_fr --save-dir checkpoints/fconv_wmt_en_fr

# Generate:
$ python generate.py data-bin/wmt14_en_fr \
  --path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt --beam 5 --remove-bpe
```

## Replicating results from "Scaling Neural Machine Translation"

To replicate results from the paper [Scaling Neural Machine Translation (Ott et al., 2018)](https://arxiv.org/abs/1806.00187):

1. Prepare the WMT'14 En-De data with a BPE vocab of 32k:
```
$ cd examples/translation/
$ bash prepare-wmt14en2de.sh --scaling18
$ cd ../..
```
2. Preprocess the dataset with a joined dictionary:
```
$ TEXT=examples/translation/wmt14_en_de
$ python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_de_joined_dict \
  --nwordssrc 32768 --nwordstgt 32768 \
  --joined-dictionary
```
3. Train a model:
```
$ python train.py data-bin/wmt14_en_de_joined_dict \
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0005 --min-lr 1e-09 \
  --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 3584 \
  --fp16
```

Note that the `--fp16` flag requires CUDA 9.1 or greater and a Volta GPU.
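
If you are unsure whether your setup meets these requirements, a quick check (assuming PyTorch is already installed) is to print the CUDA version PyTorch was built against and the GPU's compute capability; Volta corresponds to compute capability 7.0:
```
# Prints e.g. "9.2 (7, 0)" -- the second value is the compute capability.
$ python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability(0))"
```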

If you want to train the above model with big batches (assuming your machine has 8 GPUs), make the following changes; a combined command sketch is given after this list:
- add `--update-freq 16` to simulate training on 8*16=128 GPUs
- increase the learning rate; 0.001 works well for big batches
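
Putting those two changes together with the training command from step 3 yields something like the following sketch (all other flags are unchanged):
```
$ python train.py data-bin/wmt14_en_de_joined_dict \
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.001 --min-lr 1e-09 \
  --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 3584 --update-freq 16 \
  --fp16
```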