Commit 392bdd6c authored by Myle Ott, committed by Facebook Github Bot

Update README for Mixture of Experts paper

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/522

Differential Revision: D14194672

Pulled By: myleott

fbshipit-source-id: 4ff669826c4313de6f12076915cfb1bd15289ef0
parent 4294c4f6
@@ -17,7 +17,7 @@ of various sequence-to-sequence models, including:
- **Transformer (self-attention) networks**
- [Vaswani et al. (2017): Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Ott et al. (2018): Scaling Neural Machine Translation](examples/scaling_nmt/README.md)
- [Edunov et al. (2018): Understanding Back-Translation at Scale](https://arxiv.org/abs/1808.09381)
- [Edunov et al. (2018): Understanding Back-Translation at Scale](examples/backtranslation/README.md)
- **_New_** [Shen et al. (2019): Mixture Models for Diverse Machine Translation: Tricks of the Trade](examples/translation_moe/README.md)
Fairseq features:
@@ -77,6 +77,7 @@ as well as example training and evaluation commands.
We also have more detailed READMEs to reproduce results from specific papers:
- [Shen et al. (2019): Mixture Models for Diverse Machine Translation: Tricks of the Trade](examples/translation_moe/README.md)
- [Wu et al. (2019): Pay Less Attention with Lightweight and Dynamic Convolutions](examples/pay_less_attention_paper/README.md)
- [Edunov et al. (2018): Understanding Back-Translation at Scale](examples/backtranslation/README.md)
- [Edunov et al. (2018): Classical Structured Prediction Losses for Sequence to Sequence Learning](https://github.com/pytorch/fairseq/tree/classic_seqlevel)
- [Fan et al. (2018): Hierarchical Neural Story Generation](examples/stories/README.md)
- [Ott et al. (2018): Scaling Neural Machine Translation](examples/scaling_nmt/README.md)
@@ -2,15 +2,18 @@
This page includes instructions for reproducing results from the paper [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](https://arxiv.org/abs/1902.07816).
## Training a new model on WMT'17 En-De
## Download data
First, follow the [instructions to download and preprocess the WMT'17 En-De dataset](../translation#prepare-wmt14en2desh).
Make sure to learn a joint vocabulary by passing the `--joined-dictionary` option to `fairseq-preprocess`.
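A preprocessing command along the following lines should produce the joint-vocabulary binaries (a sketch, not the official recipe; the `$TEXT` prefix and train/valid/test file names are assumptions based on where the linked preparation script typically leaves the BPE-encoded data):
```
# Sketch: binarize the BPE-encoded WMT'17 En-De data with a joint vocabulary.
# The $TEXT location and file prefixes are assumptions; adjust to your setup.
$ TEXT=examples/translation/wmt17_en_de
$ fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --joined-dictionary \
    --destdir data-bin/wmt17_en_de --workers 16
```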
## Train a model
Then we can train a mixture of experts model using the `translation_moe` task.
Use the `--method` option to choose the MoE variant; we support hard mixtures with a learned or uniform prior (`--method hMoElp` and `hMoEup`, respectively) and soft mixtures (`--method sMoElp` and `sMoEup`).
Use the `--method` flag to choose the MoE variant; we support hard mixtures with a learned or uniform prior (`--method hMoElp` and `hMoEup`, respectively) and soft mixtures (`--method sMoElp` and `sMoEup`).
The model is trained with online responsibility assignment and shared parameterization.
To train a hard mixture of experts model with a learned prior (`hMoElp`) on 1 GPU:
The following command will train a `hMoElp` model with `3` experts:
```
$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
--max-update 100000 \
@@ -29,11 +32,13 @@ $ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
**Note**: the above command assumes 1 GPU, but accumulates gradients from 8 fwd/bwd passes to simulate training on 8 GPUs.
You can accelerate training on up to 8 GPUs by adjusting the `CUDA_VISIBLE_DEVICES` and `--update-freq` options accordingly.
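Concretely, the 8-GPU variant would look something like the sketch below: only the environment variable and `--update-freq` change, and every other `fairseq-train` flag from the command above is kept as-is. In general, keep the product of GPU count and `--update-freq` at 8 so the effective batch size stays the same.
```
# Sketch: expose all 8 GPUs and drop the gradient accumulation (8 GPUs x update-freq 1).
# Keep every other flag from the single-GPU fairseq-train command above unchanged.
$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train data-bin/wmt17_en_de \
    --max-update 100000 \
    --update-freq 1
```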
## Translate
Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
For example, to generate from expert 0:
```
$ fairseq-generate data-bin/wmt17_en_de \
--path checkpoints/checkpoint_best.pt
--path checkpoints/checkpoint_best.pt \
--beam 1 --remove-bpe \
--task translation_moe \
--method hMoElp --mean-pool-gating-network \
@@ -41,8 +46,9 @@ $ fairseq-generate data-bin/wmt17_en_de \
--gen-expert 0 \
```
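To dump translations from every expert in turn, the same command can be wrapped in a small loop over the expert index (a sketch, assuming the 3-expert model trained above; the `gen.expert*` output names are placeholders, and any flags elided from the command above should be copied over verbatim):
```
# Sketch: run the generation command above once per expert, one output file each.
# Copy any flags omitted here (e.g. the expert count) from the command above.
$ for EXPERT in $(seq 0 2); do \
    fairseq-generate data-bin/wmt17_en_de \
      --path checkpoints/checkpoint_best.pt \
      --beam 1 --remove-bpe \
      --task translation_moe \
      --method hMoElp --mean-pool-gating-network \
      --gen-expert $EXPERT > gen.expert$EXPERT ; \
  done
```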
You can also use `scripts/score_moe.py` to compute pairwise BLEU and average oracle BLEU.
We'll first download a tokenized version of the multi-reference WMT'14 En-De dataset:
## Evaluate
First download a tokenized version of the WMT'14 En-De test set with multiple references:
```
$ wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
```
@@ -65,15 +71,14 @@ $ for EXPERT in $(seq 0 2); do \
done > wmt14-en-de.extra_refs.tok.gen.3experts
```
Finally compute pairwise BLEU and average oracle BLEU:
Finally use `scripts/score_moe.py` to compute pairwise BLEU and average oracle BLEU:
```
$ python scripts/score_moe.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
pairwise BLEU: 48.26
avg oracle BLEU: 49.50
#refs covered: 2.11
```
This reproduces row 3 from Table 7 in the paper.
This matches row 3 from Table 7 in the paper.
## Citation