Sample data processing scripts for the FAIR Sequence-to-Sequence Toolkit

These scripts provide an example of pre-processing data for neural machine translation (NMT) tasks.

# prepare-iwslt14.sh

Provides an example of pre-processing for the IWSLT'14 German to English translation task: ["Report on the 11th IWSLT evaluation campaign" by Cettolo et al.](http://workshop2014.iwslt.org/downloads/proceeding.pdf)

Example usage:
```
$ cd data/
$ bash prepare-iwslt14.sh
$ cd ..

# Binarize the dataset:
$ TEXT=data/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en
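# (preprocess.py writes the source/target dictionaries and the binarized
# train/valid/test splits into --destdir; exact file names may vary by version.)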

# Train the model (better for a single GPU setup):
$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 200 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

# Generate:
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5 --remove-bpe

```
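
The BLEU score of the generated output can be computed with the toolkit's `score.py`. A sketch, assuming `generate.py` prefixes hypotheses with `H-` and references with `T-` in its log, and that `score.py` takes `--sys`/`--ref` arguments (check `python score.py --help` for your checkout):

```
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5 --remove-bpe | tee /tmp/gen.out

# Extract the hypothesis and reference text columns from the log:
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.ref
$ python score.py --sys /tmp/gen.sys --ref /tmp/gen.ref
```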


# prepare-wmt14en2de.sh

Provides an example of pre-processing for the WMT'14 English to German translation task. By default it will produce a dataset modeled after ["Attention Is All You Need" by Vaswani et al.](https://arxiv.org/abs/1706.03762), which includes the news-commentary-v12 data.

To use only the data available in WMT'14, or to replicate the results of the original paper ["Convolutional Sequence to Sequence Learning" by Gehring et al.](https://arxiv.org/abs/1705.03122), run the script with `--icml17` instead:

```
$ bash prepare-wmt14en2de.sh --icml17
```

Example usage:

```
$ cd data/
$ bash prepare-wmt14en2de.sh
$ cd ..

# Binarize the dataset:
$ TEXT=data/wmt14_en_de
$ python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_de --thresholdtgt 0 --thresholdsrc 0
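# (--thresholdsrc/--thresholdtgt set the minimum word frequency; words seen
# fewer times are mapped to unknown, so 0 keeps the full vocabulary.)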

# Train the model:
# If training runs out of GPU memory, try setting --max-tokens 1500 instead
$ mkdir -p checkpoints/fconv_wmt_en_de
$ python train.py data-bin/wmt14_en_de \
  --lr 0.5 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 50 \
  --arch fconv_wmt_en_de --save-dir checkpoints/fconv_wmt_en_de

# Generate:
$ python generate.py data-bin/wmt14_en_de \
  --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt --beam 5 --remove-bpe

```
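
Raw text can also be translated interactively from stdin instead of decoding the binarized test set. A sketch, assuming the toolkit's `-i` interactive flag (check `python generate.py --help` for your version); the input must be tokenized and BPE-encoded the same way as the training data:

```
$ python generate.py -i data-bin/wmt14_en_de \
  --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt --beam 5 --remove-bpe
```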

# prepare-wmt14en2fr.sh

Provides an example of pre-processing for the WMT'14 English to French translation task.

Example usage:

```
$ cd data/
$ bash prepare-wmt14en2fr.sh
$ cd ..

# Binarize the dataset:
$ TEXT=data/wmt14_en_fr
$ python preprocess.py --source-lang en --target-lang fr \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0

# Train the model:
# If training runs out of GPU memory, try setting --max-tokens 1000 instead
$ mkdir -p checkpoints/fconv_wmt_en_fr
$ python train.py data-bin/wmt14_en_fr \
  --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 50 \
  --arch fconv_wmt_en_fr --save-dir checkpoints/fconv_wmt_en_fr

# Generate:
$ python generate.py data-bin/wmt14_en_fr \
  --path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt --beam 5 --remove-bpe

```
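
The IWSLT example above pins training to a single GPU with `CUDA_VISIBLE_DEVICES`; the same environment variable can be used to limit any of these runs to specific devices. A sketch (the device IDs are illustrative):

```
# Restrict the WMT'14 En-Fr training run to GPUs 0 and 1:
$ CUDA_VISIBLE_DEVICES=0,1 python train.py data-bin/wmt14_en_fr \
  --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 50 \
  --arch fconv_wmt_en_fr --save-dir checkpoints/fconv_wmt_en_fr
```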