Commit 392bdd6c authored by Myle Ott, committed by Facebook Github Bot

Update README for Mixture of Experts paper

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/522

Differential Revision: D14194672

Pulled By: myleott

fbshipit-source-id: 4ff669826c4313de6f12076915cfb1bd15289ef0
parent 4294c4f6
@@ -17,7 +17,7 @@ of various sequence-to-sequence models, including:
- **Transformer (self-attention) networks**
- [Vaswani et al. (2017): Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Ott et al. (2018): Scaling Neural Machine Translation](examples/scaling_nmt/README.md)
- [Edunov et al. (2018): Understanding Back-Translation at Scale](https://arxiv.org/abs/1808.09381)
- [Edunov et al. (2018): Understanding Back-Translation at Scale](examples/backtranslation/README.md)
- **_New_** [Shen et al. (2019): Mixture Models for Diverse Machine Translation: Tricks of the Trade](examples/translation_moe/README.md)
Fairseq features:
@@ -77,6 +77,7 @@ as well as example training and evaluation commands.
We also have more detailed READMEs to reproduce results from specific papers:
- [Shen et al. (2019): Mixture Models for Diverse Machine Translation: Tricks of the Trade](examples/translation_moe/README.md)
- [Wu et al. (2019): Pay Less Attention with Lightweight and Dynamic Convolutions](examples/pay_less_attention_paper/README.md)
- [Edunov et al. (2018): Understanding Back-Translation at Scale](examples/backtranslation/README.md)
- [Edunov et al. (2018): Classical Structured Prediction Losses for Sequence to Sequence Learning](https://github.com/pytorch/fairseq/tree/classic_seqlevel)
- [Fan et al. (2018): Hierarchical Neural Story Generation](examples/stories/README.md)
- [Ott et al. (2018): Scaling Neural Machine Translation](examples/scaling_nmt/README.md)
@@ -2,15 +2,18 @@
This page includes instructions for reproducing results from the paper [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](https://arxiv.org/abs/1902.07816).
## Training a new model on WMT'17 En-De
## Download data
First, follow the [instructions to download and preprocess the WMT'17 En-De dataset](../translation#prepare-wmt14en2desh).
Make sure to learn a joint vocabulary by passing the `--joined-dictionary` option to `fairseq-preprocess`.
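A preprocessing command along the following lines should produce the joint-vocabulary binaries (a sketch, not the official recipe; the `$TEXT` prefix and train/valid/test file names are assumptions based on where the linked preparation script typically leaves the BPE-encoded data):
```
# Sketch: binarize the BPE-encoded WMT'17 En-De data with a joint vocabulary.
# The $TEXT location and file prefixes are assumptions; adjust to your setup.
$ TEXT=examples/translation/wmt17_en_de
$ fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --joined-dictionary \
    --destdir data-bin/wmt17_en_de --workers 16
```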
## Train a model
Then we can train a mixture of experts model using the `translation_moe` task.
Use the `--method` option to choose the MoE variant; we support hard mixtures with a learned or uniform prior (`--method hMoElp` and `hMoEup`, respectively) and soft mixtures (`--method sMoElp` and `sMoEup`).
Use the `--method` flag to choose the MoE variant; we support hard mixtures with a learned or uniform prior (`--method hMoElp` and `hMoEup`, respectively) and soft mixtures (`--method sMoElp` and `sMoEup`).
The model is trained with online responsibility assignment and shared parameterization.
To train a hard mixture of experts model with a learned prior (`hMoElp`) on 1 GPU:
The following command will train a `hMoElp` model with `3` experts:
```
$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
--max-update 100000 \
@@ -29,11 +32,13 @@ $ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
**Note**: the above command assumes 1 GPU, but accumulates gradients from 8 fwd/bwd passes to simulate training on 8 GPUs.
You can accelerate training on up to 8 GPUs by adjusting the `CUDA_VISIBLE_DEVICES` and `--update-freq` options accordingly.
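Concretely, the 8-GPU variant would look something like the sketch below: only the environment variable and `--update-freq` change, and every other `fairseq-train` flag from the command above is kept as-is. In general, keep the product of GPU count and `--update-freq` at 8 so the effective batch size stays the same.
```
# Sketch: expose all 8 GPUs and drop the gradient accumulation (8 GPUs x update-freq 1).
# Keep every other flag from the single-GPU fairseq-train command above unchanged.
$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train data-bin/wmt17_en_de \
    --max-update 100000 \
    --update-freq 1
```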
## Translate
Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
For example, to generate from expert 0:
```
$ fairseq-generate data-bin/wmt17_en_de \
--path checkpoints/checkpoint_best.pt
--path checkpoints/checkpoint_best.pt \
--beam 1 --remove-bpe \
--task translation_moe \
--method hMoElp --mean-pool-gating-network \
@@ -41,8 +46,9 @@ $ fairseq-generate data-bin/wmt17_en_de \
--gen-expert 0 \
```
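To dump translations from every expert in turn, the same command can be wrapped in a small loop over the expert index (a sketch, assuming the 3-expert model trained above; the `gen.expert*` output names are placeholders, and any flags elided from the command above should be copied over verbatim):
```
# Sketch: run the generation command above once per expert, one output file each.
# Copy any flags omitted here (e.g. the expert count) from the command above.
$ for EXPERT in $(seq 0 2); do \
    fairseq-generate data-bin/wmt17_en_de \
      --path checkpoints/checkpoint_best.pt \
      --beam 1 --remove-bpe \
      --task translation_moe \
      --method hMoElp --mean-pool-gating-network \
      --gen-expert $EXPERT > gen.expert$EXPERT ; \
  done
```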
You can also use `scripts/score_moe.py` to compute pairwise BLEU and average oracle BLEU.
We'll first download a tokenized version of the multi-reference WMT'14 En-De dataset:
## Evaluate
First download a tokenized version of the WMT'14 En-De test set with multiple references:
```
$ wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
```
@@ -65,15 +71,14 @@ $ for EXPERT in $(seq 0 2); do \
done > wmt14-en-de.extra_refs.tok.gen.3experts
```
Finally compute pairwise BLEU and average oracle BLEU:
Finally use `scripts/score_moe.py` to compute pairwise BLEU and average oracle BLEU:
```
$ python scripts/score_moe.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
pairwise BLEU: 48.26
avg oracle BLEU: 49.50
#refs covered: 2.11
```
This reproduces row 3 from Table 7 in the paper.
This matches row 3 from Table 7 in the paper.
## Citation