Commit ac66df47 authored by Myle Ott, committed by Facebook Github Bot

Update README

Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/826

Differential Revision: D16830402

Pulled By: myleott

fbshipit-source-id: 25afaa6d9de7b51cc884e3f417c8e6b349f5a7bc
parent 1d44cc85
@@ -2,7 +2,7 @@

https://arxiv.org/abs/1907.11692

## Introduction

RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.
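The dynamic-masking point is easiest to see in code. The sketch below is purely illustrative (it is not fairseq's implementation, and the toy `VOCAB` list is made up): it re-samples the masked positions every time a sentence is drawn, so each epoch sees a different masking pattern, whereas static masking fixes the pattern once during preprocessing.

```python
import random

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary (assumption)

def dynamic_mask(tokens, mask_prob=0.15):
    """Return a freshly masked copy of `tokens` using the BERT-style
    80/10/10 split (mask token / random token / keep original)."""
    out = list(tokens)
    for i in range(len(out)):
        if random.random() < mask_prob:
            r = random.random()
            if r < 0.8:
                out[i] = MASK                  # 80%: replace with <mask>
            elif r < 0.9:
                out[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # else: 10%: keep the original token
    return out

sentence = "the cat sat on the mat".split()
print(dynamic_mask(sentence))  # a different masking pattern on every call
```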
@@ -10,7 +10,7 @@ RoBERTa iterates on BERT's pretraining procedure, including training the model l

- August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

## Pre-trained models

Model | Description | # params | Download
---|---|---|---
@@ -19,9 +19,10 @@ Model | Description | # params | Download
`roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](wsc/README.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)

## Results

**[GLUE (Wang et al., 2019)](https://gluebenchmark.com/)**
_(dev set, single model, single-task finetuning)_

Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
---|---|---|---|---|---|---|---|---
@@ -29,26 +30,51 @@ Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
`roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4
`roberta.large.mnli` | 90.2 | - | - | - | - | - | - | -

**[SuperGLUE (Wang et al., 2019)](https://super.gluebenchmark.com/)**
_(dev set, single model, single-task finetuning)_
Model | BoolQ | CB | COPA | MultiRC | RTE | WiC | WSC
---|---|---|---|---|---|---|---
`roberta.large` | 86.9 | 98.2 | 94.0 | 85.7 | 89.5 | 75.6 | -
`roberta.large.wsc` | - | - | - | - | - | - | 91.3

**[SQuAD (Rajpurkar et al., 2018)](https://rajpurkar.github.io/SQuAD-explorer/)**
_(dev set, no additional data used)_
Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1
---|---|---
`roberta.large` | 88.9/94.6 | 86.5/89.4

**[RACE (Lai et al., 2017)](http://www.qizhexie.com/data/RACE_leaderboard.html)**
_(test set)_
Model | Accuracy | Middle | High
---|---|---|---
`roberta.large` | 83.2 | 86.5 | 81.3

**[HellaSwag (Zellers et al., 2019)](https://rowanzellers.com/hellaswag/)**
_(test set)_
Model | Overall | In-domain | Zero-shot | ActivityNet | WikiHow
---|---|---|---|---|---
`roberta.large` | 85.2 | 87.3 | 83.1 | 74.6 | 90.9
**[Commonsense QA (Talmor et al., 2019)](https://www.tau-nlp.org/commonsenseqa)**
_(test set)_
Model | Accuracy
---|---
`roberta.large` (single model) | 72.1
`roberta.large` (ensemble) | 72.5
**[Winogrande (Sakaguchi et al., 2019)](https://arxiv.org/abs/1907.10641)**
_(test set)_
Model | Accuracy
---|---
`roberta.large` | 78.1
## Example usage
##### Load RoBERTa from torch.hub (PyTorch >= 1.1):

```python
@@ -124,7 +150,7 @@ roberta.cuda()
roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```
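For orientation, here is a minimal sketch of the torch.hub workflow referenced by the heading above, assuming the `pytorch/fairseq` hub entry and the `roberta.large` model name from the pre-trained models table; the weights are downloaded on first use.

```python
import torch

# Download (on first use) and load the pre-trained model
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout for deterministic outputs

# Encode with the GPT-2 BPE and extract the final-layer features
tokens = roberta.encode('Hello world!')      # LongTensor of token ids, incl. <s> and </s>
features = roberta.extract_features(tokens)  # shape: (1, number_of_tokens, 1024) for roberta.large
print(tokens.tolist(), features.shape)
```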
## Advanced usage

#### Filling masks:

@@ -216,7 +242,7 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
```
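The mask-filling utility named in the `Filling masks` heading above can be exercised with a short snippet like the following. This is a sketch: `fill_mask` is the hub-interface method fairseq exposes for this, but the exact shape of its return value may vary between versions.

```python
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()

# Ask the masked-LM head for the top-3 completions of the <mask> position
results = roberta.fill_mask('The first Star Wars film was released in <mask>.', topk=3)
for filled_sentence, score, predicted_token in results:
    print(f'{score:.3f}\t{predicted_token}\t{filled_sentence}')
```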
## Finetuning

- [Finetuning on GLUE](README.glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md)
@@ -224,11 +250,11 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
- [Finetuning on Commonsense QA (CQA)](commonsense_qa/README.md)
- Finetuning on SQuAD: coming soon

## Pretraining using your own data

See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

## Citation

```bibtex
@article{liu2019roberta,
...
@@ -2,7 +2,7 @@

This tutorial will walk you through pretraining RoBERTa over your own data.

### 1) Preprocess the data

Data should be preprocessed following the [language modeling format](/examples/language_model).
...
@@ -11,45 +11,57 @@ Model | Description | Dataset | Download

## Training a new model on WMT'16 En-De

First download the [preprocessed WMT'16 En-De data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8).

Then:

##### 1. Extract the WMT'16 En-De data

```bash
TEXT=wmt16_en_de_bpe32k
mkdir -p $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT
```
##### 2. Preprocess the dataset with a joined dictionary

```bash
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train.tok.clean.bpe.32000 \
    --validpref $TEXT/newstest2013.tok.bpe.32000 \
    --testpref $TEXT/newstest2014.tok.bpe.32000 \
    --destdir data-bin/wmt16_en_de_bpe32k \
    --nwordssrc 32768 --nwordstgt 32768 \
    --joined-dictionary \
    --workers 20
```
##### 3. Train a model

```bash
fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --fp16
```
Note that the `--fp16` flag requires CUDA 9.1 or greater and a Volta GPU or newer.

If you want to train the above model with big batches (assuming your machine has 8 GPUs):
- add `--update-freq 16` to simulate training on 8x16=128 GPUs (see the sketch below)
- increase the learning rate; 0.001 works well for big batches
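To make the `--update-freq` arithmetic concrete, the sketch below computes the effective batch size: gradients are accumulated over `update_freq` forward/backward passes on each GPU before a single parameter update, so each update sees roughly `max_tokens x num_gpus x update_freq` tokens.

```python
# Effective batch size when simulating a large cluster with gradient accumulation.
max_tokens = 3584    # tokens per GPU per forward/backward pass (from the command above)
num_gpus = 8         # physical GPUs on the machine
update_freq = 16     # accumulated steps per parameter update

simulated_gpus = num_gpus * update_freq
tokens_per_update = max_tokens * num_gpus * update_freq
print(f'{simulated_gpus} simulated GPUs, ~{tokens_per_update:,} tokens per update')
# -> 128 simulated GPUs, ~458,752 tokens per update
```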
##### 4. Evaluate
```bash
fairseq-generate \
data-bin/wmt16_en_de_bpe32k \
--path checkpoints/checkpoint_best.pt \
--beam 4 --lenpen 0.6 --remove-bpe
```
## Citation

```bibtex
...
# Neural Machine Translation
This README contains instructions for [using pretrained translation models](#example-usage-torchhub)
as well as [training new models](#training-a-new-model).
## Pre-trained models

Model | Description | Dataset | Download
@@ -56,132 +59,119 @@ fairseq-score --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
# BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
```
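As a hedged sketch of the "using pretrained translation models" workflow mentioned in the introduction: the hub entry name and the tokenizer/BPE arguments below are assumptions, so list the available entries first and pick the one matching the model you want from the table above.

```python
import torch

# See which models the fairseq hub entry point currently exposes
print(torch.hub.list('pytorch/fairseq'))

# Load an English-German Transformer (the entry name and the tokenizer/bpe
# arguments are assumptions -- adjust them to match an entry from the list above)
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt16.en-de',
                       tokenizer='moses', bpe='subword_nmt')
en2de.eval()
print(en2de.translate('Machine learning is great!'))
```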
## Training a new model

### IWSLT'14 German to English (Transformer)

The following instructions can be used to train a Transformer model on the [IWSLT'14 German to English dataset](http://workshop2014.iwslt.org/downloads/proceeding.pdf).

First download and preprocess the data:

```bash
# Download and prepare the data
cd examples/translation/
bash prepare-iwslt14.sh
cd ../..

# Preprocess/binarize the data
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 20
```

Next we'll train a Transformer translation model over this data:

```bash
CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096
```
Finally we can evaluate our trained model:

```bash
fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe
```
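Once training has produced `checkpoints/checkpoint_best.pt`, the model can also be queried interactively from Python. This is a sketch built on fairseq's `from_pretrained` helper; the BPE code path is an assumption about where `prepare-iwslt14.sh` writes its codes, so adjust it to your setup.

```python
from fairseq.models.transformer import TransformerModel

# Load the checkpoint trained above together with its binarized dictionaries
de2en = TransformerModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/iwslt14.tokenized.de-en',
    tokenizer='moses',
    bpe='subword_nmt',
    bpe_codes='examples/translation/iwslt14.tokenized.de-en/code',  # assumed location of the BPE codes
)
de2en.eval()
print(de2en.translate('Maschinelles Lernen ist großartig!'))
```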
### WMT'14 English to German (Convolutional)

The following instructions can be used to train a Convolutional translation model on the WMT English to German dataset.
See the [Scaling NMT README](../scaling_nmt/README.md) for instructions to train a Transformer translation model on this data.

The WMT English to German dataset can be preprocessed using the `prepare-wmt14en2de.sh` script.
By default it will produce a dataset that was modeled after [Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762), but with additional news-commentary-v12 data from WMT'17.

To use only data available in WMT'14 or to replicate results obtained in the original [Convolutional Sequence to Sequence Learning (Gehring et al., 2017)](https://arxiv.org/abs/1705.03122) paper, please use the `--icml17` option.

```bash
# Download and prepare the data
cd examples/translation/
# WMT'17 data:
bash prepare-wmt14en2de.sh
# or to use WMT'14 data:
# bash prepare-wmt14en2de.sh --icml17
cd ../..

# Binarize the dataset
TEXT=examples/translation/wmt17_en_de
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

# Train the model
mkdir -p checkpoints/fconv_wmt_en_de
fairseq-train \
    data-bin/wmt17_en_de \
    --arch fconv_wmt_en_de \
    --lr 0.5 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr-scheduler fixed --force-anneal 50 \
    --save-dir checkpoints/fconv_wmt_en_de

# Evaluate
fairseq-generate data-bin/wmt17_en_de \
    --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt \
    --beam 5 --remove-bpe
```
### WMT'14 English to French

```bash
# Download and prepare the data
cd examples/translation/
bash prepare-wmt14en2fr.sh
cd ../..

# Binarize the dataset
TEXT=examples/translation/wmt14_en_fr
fairseq-preprocess \
    --source-lang en --target-lang fr \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0 \
    --workers 60

# Train the model
mkdir -p checkpoints/fconv_wmt_en_fr
fairseq-train \
    data-bin/wmt14_en_fr \
    --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr-scheduler fixed --force-anneal 50 \
    --arch fconv_wmt_en_fr \
    --save-dir checkpoints/fconv_wmt_en_fr

# Evaluate (the binarized data lives in data-bin/wmt14_en_fr, created above)
fairseq-generate \
    data-bin/wmt14_en_fr \
    --path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt \
    --beam 5 --remove-bpe
```
## Multilingual Translation
@@ -253,7 +243,8 @@ grep ^H iwslt17.test.${SRC}-en.en.sys | cut -f3 \
  | sacrebleu --test-set iwslt17 --language-pair ${SRC}-en
```

##### Argument format during inference

During inference it is required to specify a single `--source-lang` and
`--target-lang`, which indicate the inference language direction.
`--lang-pairs`, `--encoder-langtok`, `--decoder-langtok` have to be set to
...