Commit 392bdd6c authored by Myle Ott, committed by Facebook Github Bot

Update README for Mixture of Experts paper

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/522

Differential Revision: D14194672

Pulled By: myleott

fbshipit-source-id: 4ff669826c4313de6f12076915cfb1bd15289ef0
parent 4294c4f6
README.md

@@ -17,7 +17,7 @@ of various sequence-to-sequence models, including:
 - **Transformer (self-attention) networks**
 - [Vaswani et al. (2017): Attention Is All You Need](https://arxiv.org/abs/1706.03762)
 - [Ott et al. (2018): Scaling Neural Machine Translation](examples/scaling_nmt/README.md)
-- [Edunov et al. (2018): Understanding Back-Translation at Scale](https://arxiv.org/abs/1808.09381)
+- [Edunov et al. (2018): Understanding Back-Translation at Scale](examples/backtranslation/README.md)
 - **_New_** [Shen et al. (2019) Mixture Models for Diverse Machine Translation: Tricks of the Trade](examples/translation_moe/README.md)
 Fairseq features:
@@ -77,6 +77,7 @@ as well as example training and evaluation commands.
 We also have more detailed READMEs to reproduce results from specific papers:
 - [Shen et al. (2019) Mixture Models for Diverse Machine Translation: Tricks of the Trade](examples/translation_moe/README.md)
 - [Wu et al. (2019): Pay Less Attention with Lightweight and Dynamic Convolutions](examples/pay_less_attention_paper/README.md)
+- [Edunov et al. (2018): Understanding Back-Translation at Scale](examples/backtranslation/README.md)
 - [Edunov et al. (2018): Classical Structured Prediction Losses for Sequence to Sequence Learning](https://github.com/pytorch/fairseq/tree/classic_seqlevel)
 - [Fan et al. (2018): Hierarchical Neural Story Generation](examples/stories/README.md)
 - [Ott et al. (2018): Scaling Neural Machine Translation](examples/scaling_nmt/README.md)
examples/translation_moe/README.md

@@ -2,15 +2,18 @@
 This page includes instructions for reproducing results from the paper [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](https://arxiv.org/abs/1902.07816).
-## Training a new model on WMT'17 En-De
+## Download data
 First, follow the [instructions to download and preprocess the WMT'17 En-De dataset](../translation#prepare-wmt14en2desh).
 Make sure to learn a joint vocabulary by passing the `--joined-dictionary` option to `fairseq-preprocess`.
+## Train a model
 Then we can train a mixture of experts model using the `translation_moe` task.
-Use the `--method` option to choose the MoE variant; we support hard mixtures with a learned or uniform prior (`--method hMoElp` and `hMoEup`, respectively) and soft mixtures (`--method sMoElp` and `sMoEup`).
+Use the `--method` flag to choose the MoE variant; we support hard mixtures with a learned or uniform prior (`--method hMoElp` and `hMoEup`, respectively) and soft mixtures (`--method sMoElp` and `sMoEup`).
+The model is trained with online responsibility assignment and shared parameterization.
-To train a hard mixture of experts model with a learned prior (`hMoElp`) on 1 GPU:
+The following command will train a `hMoElp` model with `3` experts:
 ```
 $ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
 --max-update 100000 \
@@ -29,11 +32,13 @@ $ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
 **Note**: the above command assumes 1 GPU, but accumulates gradients from 8 fwd/bwd passes to simulate training on 8 GPUs.
 You can accelerate training on up to 8 GPUs by adjusting the `CUDA_VISIBLE_DEVICES` and `--update-freq` options accordingly.
+## Translate
 Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
 For example, to generate from expert 0:
 ```
 $ fairseq-generate data-bin/wmt17_en_de \
---path checkpoints/checkpoint_best.pt
+--path checkpoints/checkpoint_best.pt \
 --beam 1 --remove-bpe \
 --task translation_moe \
 --method hMoElp --mean-pool-gating-network \
@@ -41,8 +46,9 @@ $ fairseq-generate data-bin/wmt17_en_de \
 --gen-expert 0 \
 ```
-You can also use `scripts/score_moe.py` to compute pairwise BLEU and average oracle BLEU.
-We'll first download a tokenized version of the multi-reference WMT'14 En-De dataset:
+## Evaluate
+First download a tokenized version of the WMT'14 En-De test set with multiple references:
 ```
 $ wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
 ```
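As a side note on the `--method` flag described earlier in this README's diff: the four values form a 2x2 grid (hard vs. soft mixture, learned vs. uniform prior). Below is a minimal sketch of switching variants, assuming the remaining training flags (elided from this diff) stay exactly as in the `fairseq-train` command above, and that learned-prior variants keep the `--mean-pool-gating-network` flag shown in the generation command.

```bash
# Hedged sketch, not the paper's exact command:
#   hMoElp / hMoEup : hard mixture, learned / uniform prior
#   sMoElp / sMoEup : soft mixture, learned / uniform prior
# Swap only the method (here: soft mixture with a learned prior) and append
# every other flag from the full training command above, which this diff elides.
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
  --max-update 100000 \
  --task translation_moe \
  --method sMoElp --mean-pool-gating-network
```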
@@ -65,15 +71,14 @@ $ for EXPERT in $(seq 0 2); do \
 done > wmt14-en-de.extra_refs.tok.gen.3experts
 ```
-Finally compute pairwise BLEU and average oracle BLEU:
+Finally use `scripts/score_moe.py` to compute pairwise BLEU and average oracle BLEU:
 ```
 $ python scripts/score_moe.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
 pairwise BLEU: 48.26
 avg oracle BLEU: 49.50
 #refs covered: 2.11
 ```
-This matches row 3 from Table 7 in the paper.
+This reproduces row 3 from Table 7 in the paper.
 ## Citation
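Finally, regarding the **Note** about gradient accumulation in the training hunk of this diff: the effective batch scales with the number of GPUs times `--update-freq`, so 1 GPU with `--update-freq 8` matches 8 GPUs with `--update-freq 1`. A hedged sketch of the same recipe on 4 GPUs, assuming all other flags (including those elided from this diff) are kept unchanged:

```bash
# Sketch only: 4 GPUs x --update-freq 2 preserves the effective batch size of
# 1 GPU x --update-freq 8 used in the README's command. Append the remaining
# training flags from that command; they are not shown in this diff.
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train data-bin/wmt17_en_de \
  --max-update 100000 \
  --update-freq 2
```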