Commit 181dc58e authored by Myle Ott, committed by Facebook Github Bot

Documentation fixes

Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/927

Differential Revision: D18691521

Pulled By: myleott

fbshipit-source-id: a79cb0a7614a30be765e741761819263d9fb5047
parent 5d9392df
@@ -59,17 +59,9 @@ To use the model without GLU, please set `--encoder-glu 0 --decoder-glu 0`.
 For LightConv, please use `--encoder-conv-type lightweight --decoder-conv-type lightweight`, otherwise the default is DynamicConv.
 For best BLEU results, lenpen may need to be manually tuned.
-To use the CUDA kernels, first install the PyTorch modules using the commands below
-```sh
-# to install lightconv
-python fairseq/modules/lightconv_layer/cuda_function_gen.py
-python fairseq/modules/lightconv_layer/setup.py install
-# to install dynamicconv
-python fairseq/modules/dynamicconv_layer/cuda_function_gen.py
-python fairseq/modules/dynamicconv_layer/setup.py install
-```
-Once the CUDA modules are installed, they will automatically be used instead of the PyTorch modules.
+To use the CUDA kernels, first install the PyTorch modules using the commands
+above. Once the CUDA modules are installed, they will automatically be used
+instead of the PyTorch modules.
 ### IWSLT14 De-En
 Training and evaluating DynamicConv (without GLU) on a GPU:
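The "commands above" that the new wording points to are the CUDA kernel build steps, presumably the same ones this hunk removes from this spot in the README. For reference, that install sequence (taken from the removed block) is:
```sh
# Generate and install the fused CUDA kernels (same steps as the removed block)
# lightconv
python fairseq/modules/lightconv_layer/cuda_function_gen.py
python fairseq/modules/lightconv_layer/setup.py install
# dynamicconv
python fairseq/modules/dynamicconv_layer/cuda_function_gen.py
python fairseq/modules/dynamicconv_layer/setup.py install
```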
@@ -50,15 +50,35 @@ fairseq-train \
 Note that the `--fp16` flag requires you have CUDA 9.1 or greater and a Volta GPU or newer.
-If you want to train the above model with big batches (assuming your machine has 8 GPUs):
+***IMPORTANT:*** You will get better performance by training with big batches and
+increasing the learning rate. If you want to train the above model with big batches
+(assuming your machine has 8 GPUs):
 - add `--update-freq 16` to simulate training on 8x16=128 GPUs
 - increase the learning rate; 0.001 works well for big batches
 ##### 4. Evaluate
+Now we can evaluate our trained model.
+Note that the original [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+paper used a couple tricks to achieve better BLEU scores. We use these same tricks in
+the Scaling NMT paper, so it's important to apply them when reproducing our results.
+First, use the [average_checkpoints.py](/scripts/average_checkpoints.py) script to
+average the last few checkpoints. Averaging the last 5-10 checkpoints is usually
+good, but you may need to adjust this depending on how long you've trained:
+```bash
+python scripts/average_checkpoints.py \
+    --inputs /path/to/checkpoints \
+    --num-epoch-checkpoints 5 \
+    --output checkpoint.avg5.pt
+```
+Next, generate translations using a beam width of 4 and length penalty of 0.6:
 ```bash
 fairseq-generate \
     data-bin/wmt16_en_de_bpe32k \
-    --path checkpoints/checkpoint_best.pt \
+    --path checkpoint.avg5.pt \
     --beam 4 --lenpen 0.6 --remove-bpe
 ```
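The big-batch recipe added in this hunk boils down to two extra flags on the `fairseq-train` command that sits just above it in the README: `--update-freq 16` and `--lr 0.001`. A minimal sketch of how they combine, assuming the WMT'16 En-De setup used in this README; the other flags stand in for whatever the full command already specifies and are not taken from this commit:
```bash
# Hedged sketch only, not the README's exact command: on an 8-GPU machine,
# --update-freq 16 accumulates gradients over 16 steps per GPU, simulating
# 8 x 16 = 128 GPUs, and the learning rate is raised to 0.001 to match.
fairseq-train data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 3584 --fp16 \
    --update-freq 16
```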