# Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)

This page includes instructions for reproducing results from the paper [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](https://arxiv.org/abs/1902.07816).

## Download data

First, follow the [instructions to download and preprocess the WMT'17 En-De dataset](../translation#prepare-wmt14en2desh).
Make sure to learn a joint vocabulary by passing the `--joined-dictionary` option to `fairseq-preprocess`.
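
For example, the preprocessing step might look like this (a sketch; the `$TEXT` path follows the preparation script linked above, so adjust it if your data lives elsewhere):
```
$ TEXT=examples/translation/wmt17_en_de
$ fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt17_en_de \
    --joined-dictionary --workers 16
```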

## Train a model

Then we can train a mixture of experts model using the `translation_moe` task.
Use the `--method` flag to choose the MoE variant; we support hard mixtures with a learned or uniform prior (`--method hMoElp` and `hMoEup`, respectively) and soft mixtures (`--method sMoElp` and `sMoEup`).
The model is trained with online responsibility assignment and shared parameterization.
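
As a rough sketch of what "online responsibility assignment" means for the hard mixtures (our paraphrase of the paper's hard-EM objective, not code from this repository): each training pair is assigned on the fly to the expert that currently explains it best, and only that expert's loss is minimized,

```
% E-step: pick the best-fitting expert for this sentence pair
z^\ast = \arg\max_{z \in \{1,\dots,K\}} \; p(z \mid x)\, p(y \mid z, x)

% M-step: gradient step on the chosen expert (and on the prior, if it is learned)
\mathcal{L}(x, y) = -\log p(z^\ast \mid x) - \log p(y \mid z^\ast, x)
```

For the uniform-prior variants (`hMoEup`, `sMoEup`) the `p(z | x)` term is a constant `1/K` and drops out; the soft variants replace the hard argmax with posterior-weighted responsibilities over all experts.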

The following command will train a `hMoElp` model with `3` experts:
```
$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
  --max-update 100000 \
  --task translation_moe \
  --method hMoElp --mean-pool-gating-network \
  --num-experts 3 \
  --arch transformer_vaswani_wmt_en_de --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0007 --min-lr 1e-09 \
  --dropout 0.1 --weight-decay 0.0 --criterion cross_entropy \
  --max-tokens 3584 \
  --update-freq 8
```

**Note**: the above command assumes 1 GPU, but accumulates gradients from 8 fwd/bwd passes to simulate training on 8 GPUs.
You can accelerate training on up to 8 GPUs by adjusting the `CUDA_VISIBLE_DEVICES` and `--update-freq` options accordingly (e.g., with 8 GPUs set `--update-freq 1`); keeping the product of GPUs and `--update-freq` constant leaves the effective batch size unchanged.

## Translate

Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
For example, to generate from expert 0:
```
$ fairseq-generate data-bin/wmt17_en_de \
  --path checkpoints/checkpoint_best.pt \
  --beam 1 --remove-bpe \
  --task translation_moe \
  --method hMoElp --mean-pool-gating-network \
  --num-experts 3 \
  --gen-expert 0
```

## Evaluate

First download a tokenized version of the WMT'14 En-De test set with multiple references:
```
$ wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
```

Next apply BPE on the fly and run generation for each expert:
```
$ BPEROOT=examples/translation/subword-nmt/
$ BPE_CODE=examples/translation/wmt17_en_de/code
$ for EXPERT in $(seq 0 2); do \
    cat wmt14-en-de.extra_refs.tok | grep ^S | cut -f 2 | \
      python $BPEROOT/apply_bpe.py -c $BPE_CODE | \
      fairseq-interactive data-bin/wmt17_en_de \
        --path checkpoints/checkpoint_best.pt \
        --beam 1 --remove-bpe \
        --buffer-size 500 --max-tokens 6000 \
        --task translation_moe \
        --method hMoElp --mean-pool-gating-network \
        --num-experts 3 \
        --gen-expert $EXPERT ; \
  done > wmt14-en-de.extra_refs.tok.gen.3experts
```
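
As a quick sanity check (assuming fairseq's standard `H-` prefix for hypothesis lines in the generation output), the combined file should contain one hypothesis per source sentence per expert, i.e. 3x the number of input sentences:
```
$ grep -c '^H-' wmt14-en-de.extra_refs.tok.gen.3experts
```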

Finally, use `score.py` to compute pairwise BLEU and average oracle BLEU:
```
$ python examples/translation_moe/score.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
pairwise BLEU: 48.26
#refs covered: 2.11
multi-reference BLEU (leave-one-out): 59.46
```
This matches row 3 from Table 7 in the paper.

## Citation

```bibtex
@article{shen2019mixture,
  title = {Mixture Models for Diverse Machine Translation: Tricks of the Trade},
  author = {Tianxiao Shen and Myle Ott and Michael Auli and Marc'Aurelio Ranzato},
  journal = {International Conference on Machine Learning},
  year = 2019,
}
```