Commit b8704686 authored by Myle Ott, committed by Facebook Github Bot

Update READMEs

Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/823

Differential Revision: D16804995

Pulled By: myleott

fbshipit-source-id: abac5dc0ed6b7bfe2309ba273456e54b37340b2c
parent ffffe04e
@@ -6,10 +6,10 @@ modeling and other text generation tasks.
### What's New:
- August 2019: [WMT'19 models released](examples/wmt19/README.md)
- July 2019: fairseq relicensed under MIT license
- July 2019: [RoBERTa models and code released](examples/roberta/README.md)
- June 2019: [wav2vec models and code released](examples/wav2vec/README.md)
- April 2019: [fairseq demo paper @ NAACL 2019](https://arxiv.org/abs/1904.01038)
### Features:
@@ -31,6 +31,7 @@ Fairseq provides reference implementations of various sequence-to-sequence model
- [Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)](examples/language_model/transformer_lm/README.md)
- [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
- [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
**Additionally:**
- multi-GPU (distributed) training on one machine or across multiple machines
@@ -49,38 +50,33 @@ translation and language modeling datasets.
# Requirements and Installation
* [PyTorch](http://pytorch.org/) version >= 1.1.0
* Python version >= 3.5
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* **For faster training** install NVIDIA's [apex](https://github.com/NVIDIA/apex) library with the `--cuda_ext` option
To install fairseq:
```bash
pip install fairseq
```
On MacOS:
```bash
CFLAGS="-stdlib=libc++" pip install fairseq
```
If you use Docker make sure to increase the shared memory size either with
`--ipc=host` or `--shm-size` as command line options to `nvidia-docker run`.
**Installing from source**
To install fairseq from source and develop locally:
```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
```
# Getting Started
The [full documentation](https://fairseq.readthedocs.io/) contains instructions
@@ -93,9 +89,10 @@ We provide pre-trained models and pre-processed, binarized test sets for several
as well as example training and evaluation commands.
- [Translation](examples/translation/README.md): convolutional and transformer models are available
- [Language Modeling](examples/language_model/README.md): convolutional and transformer models are available
We also have more detailed READMEs to reproduce results from specific papers:
- [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
- [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](examples/wav2vec/README.md)
- [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
......
@@ -27,58 +27,57 @@ en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperatur
# "Barack Obama is coming to Sydney and New Zealand (...)"
```
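The diff elides the top of this example, which loads the English LM through torch.hub. A hedged reconstruction follows; the hub model name (`transformer_lm.wmt19.en`) and keyword arguments are assumed from fairseq's WMT'19 release rather than taken verbatim from this diff:
```python
import torch

# Assumed hub entry point; see the WMT'19 release notes for the exact names.
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en',
                       tokenizer='moses', bpe='fastbpe')
en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
# "Barack Obama is coming to Sydney and New Zealand (...)"
```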
## Training a transformer language model with the CLI tools
### 1) Preprocess the data
First download and prepare the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):
```bash
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
```
Next preprocess/binarize the data:
```bash
TEXT=examples/language_model/wikitext-103
fairseq-preprocess \
--only-source \
--trainpref $TEXT/wiki.train.tokens \
--validpref $TEXT/wiki.valid.tokens \
--testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103 \
--workers 20
```
### 2) Train a language model
Next we'll train a transformer language model using [adaptive inputs](transformer_lm/README.md):
```bash
fairseq-train --task language_modeling \
data-bin/wikitext-103 \
--save-dir checkpoints/transformer_wikitext-103 \
--arch transformer_lm_wiki103 \
--max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
--warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
--sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d
```
If the above command runs out of memory, try reducing `--max-tokens` (max number
of tokens per batch) or `--tokens-per-sample` (max sequence length). You can
also increase `--update-freq` to accumulate gradients and simulate training on
more GPUs.
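To make the `--update-freq` trick concrete, here is a schematic of gradient accumulation (a toy loop with stand-in model and data, not fairseq's actual trainer):
```python
import torch

# Toy stand-ins for the real language model and data loader.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [torch.randn(8, 10) for _ in range(6)]
update_freq = 3  # accumulate gradients over 3 batches per optimizer step

for i, batch in enumerate(batches):
    loss = model(batch).pow(2).mean()
    (loss / update_freq).backward()  # scale so the sum matches one large batch
    if (i + 1) % update_freq == 0:
        optimizer.step()
        optimizer.zero_grad()
```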
### 3) Evaluate
```bash
fairseq-eval-lm data-bin/wikitext-103 \
    --path checkpoints/transformer_wikitext-103/checkpoint_best.pt \
--sample-break-mode complete --max-tokens 3072 \
--context-window 2560 --softmax-batch 1024
```
## Convolutional language models
Please see the [convolutional LM README](conv_lm/README.md) for instructions to
train convolutional language models.
@@ -2,8 +2,27 @@
## Example usage
First download and preprocess the data following the main [language modeling
README](../README.md).
Then to train a convolutional LM using the `fconv_lm_dauphin_wikitext103`
architecture:
```bash
fairseq-train --task language_modeling \
data-bin/wikitext-103 \
--save-dir checkpoints/fconv_wikitext-103 \
--arch fconv_lm_dauphin_wikitext103 \
    --max-epoch 35 --optimizer nag \
--lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
--ddp-backend=no_c10d
```
And evaluate with:
```bash
fairseq-eval-lm data-bin/wikitext-103 --path checkpoints/fconv_wikitext-103/checkpoint_best.pt
```
## Citation
......
# Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)
## Pre-trained models
......
@@ -8,7 +8,7 @@ representations through a fully-connected layer to predict the correct answer.
We train with a standard cross-entropy loss.
We also found it helpful to prepend a prefix of `Q:` to the question and `A:` to
the answer. The complete input format is:
```
<s> Q: Where would I not want a fox? </s> A: hen house </s>
```
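For illustration only, here is a hypothetical helper that maps a question/candidate pair onto this format (the finetuning task builds these inputs internally):
```python
def format_example(question: str, answer: str) -> str:
    # <s> and </s> are RoBERTa's BOS/EOS symbols.
    return f"<s> Q: {question} </s> A: {answer} </s>"

print(format_example("Where would I not want a fox?", "hen house"))
# <s> Q: Where would I not want a fox? </s> A: hen house </s>
```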
@@ -18,7 +18,7 @@ Our final submission is based on a hyperparameter search over the learning rate
4000) and random seed. We selected the model with the best performance on the
development set after 100 trials.
### 1) Download data from the Commonsense QA website (https://www.tau-nlp.org/commonsenseqa)
```bash
bash examples/roberta/commonsense_qa/download_cqa_data.sh
```
......
@@ -2,20 +2,24 @@
https://arxiv.org/abs/1907.11692
### Introduction
RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.
### What's New:
- August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).
### Pre-trained models
Model | Description | # params | Download
---|---|---|---
`roberta.base` | RoBERTa using the BERT-base architecture | 125M | [roberta.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz)
`roberta.large` | RoBERTa using the BERT-large architecture | 355M | [roberta.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz)
`roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](README.wsc.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)
### Results
##### Results on GLUE tasks (dev set, single model, single-task finetuning)
@@ -44,7 +48,7 @@ Model | Accuracy | Middle | High
---|---|---|---
`roberta.large` | 83.2 | 86.5 | 81.3
### Example usage
##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```
##### Load RoBERTa (for PyTorch 1.0 or custom models):
```python
# Download roberta.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
tar -xzvf roberta.large.tar.gz
# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large', checkpoint_file='model.pt')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```
@@ -120,7 +124,7 @@ roberta.cuda()
roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```
### Advanced usage
#### Filling masks:
@@ -212,8 +216,7 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
```
### Finetuning
- [Finetuning on GLUE](README.glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md)
@@ -221,15 +224,11 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
- [Finetuning on Commonsense QA (CQA)](README.cqa.md)
- Finetuning on SQuAD: coming soon
### Pretraining using your own data
See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).
### Citation
```bibtex
@article{liu2019roberta,
......
# Pretraining RoBERTa using your own data
This tutorial will walk you through pretraining RoBERTa over your own data.
### 1) Preprocess the data
Data should be preprocessed following the [language modeling format](/examples/language_model).
We'll use the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
to demonstrate how to preprocess raw text data with the GPT-2 BPE. Of course
this dataset is quite small, so the resulting pretrained model will perform
poorly, but it gives the general idea.
First download the dataset:
```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
Next encode it with the GPT-2 BPE:
```bash
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json gpt2_bpe/encoder.json \
--vocab-bpe gpt2_bpe/vocab.bpe \
--inputs wikitext-103-raw/wiki.${SPLIT}.raw \
--outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
--keep-empty \
--workers 60; \
done
```
Finally preprocess/binarize the data using the GPT-2 fairseq dictionary:
```bash
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
--only-source \
--srcdict gpt2_bpe/dict.txt \
--trainpref wikitext-103-raw/wiki.train.bpe \
--validpref wikitext-103-raw/wiki.valid.bpe \
--testpref wikitext-103-raw/wiki.test.bpe \
--destdir data-bin/wikitext-103 \
--workers 60
```
### 2) Train RoBERTa base
```bash
TOTAL_UPDATES=125000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16 # Number of sequences per batch (batch size)
UPDATE_FREQ=16 # Increase the batch size 16x
DATA_DIR=data-bin/wikitext-103
fairseq-train --fp16 $DATA_DIR \
--task masked_lm --criterion masked_lm \
--arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1
```
The above command assumes training on 8x32GB V100 GPUs. Each GPU uses a batch
size of 16 sequences (`$MAX_SENTENCES`) and accumulates gradients to further
increase the batch size by 16x (`$UPDATE_FREQ`), for a total batch size of 2048
sequences. If you have fewer GPUs or GPUs with less memory you may need to
reduce `$MAX_SENTENCES` and increase `$UPDATE_FREQ` to compensate. Alternatively
if you have more GPUs you can decrease `$UPDATE_FREQ` accordingly to increase
training speed.
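As a back-of-the-envelope check of those numbers (plain arithmetic, not part of the training command):
```python
num_gpus = 8
max_sentences = 16  # $MAX_SENTENCES: sequences per GPU per forward pass
update_freq = 16    # $UPDATE_FREQ: gradient accumulation steps

effective_batch = num_gpus * max_sentences * update_freq
assert effective_batch == 2048  # total sequences per optimizer step

# With only 2 GPUs, keep the same effective batch size by raising update_freq:
update_freq_2gpus = effective_batch // (2 * max_sentences)
assert update_freq_2gpus == 64
```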
Also note that the learning rate and batch size are tightly connected and need
to be adjusted together. We generally recommend increasing the learning rate as
you increase the batch size according to the following table (although it's also
dataset dependent, so don't rely on the following values too closely):
batch size | peak learning rate
---|---
256 | 0.0001
2048 | 0.0005
8192 | 0.0007
### 3) Load your pretrained model
```python
import torch
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data')
assert isinstance(roberta.model, torch.nn.Module)
```
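As a quick smoke test of the loaded checkpoint, you can extract features for a sentence; `encode` and `extract_features` are assumed here from the hub-interface examples in the main RoBERTa README:
```python
# Continues from the snippet above; assumed hub-interface methods.
tokens = roberta.encode('Hello world!')      # BPE-encode and add <s>/</s>
features = roberta.extract_features(tokens)  # last-layer representations
print(features.shape)                        # e.g. torch.Size([1, 5, 768]) for roberta_base
```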
@@ -4,21 +4,23 @@ The following instructions can be used to finetune RoBERTa on the WSC training
data provided by [SuperGLUE](https://super.gluebenchmark.com/).
Note that there is high variance in the results. For our GLUE/SuperGLUE
submission we swept over the learning rate (1e-5, 2e-5, 3e-5), batch size (16,
32, 64) and total number of updates (500, 1000, 2000, 3000), as well as the
random seed. Out of ~100 runs we chose the best 7 models and ensembled them.
**Approach:** The instructions below use a slightly different loss function than
what's described in the original RoBERTa arXiv paper. In particular,
[Kocijan et al. (2019)](https://arxiv.org/abs/1905.06290) introduce a margin
ranking loss between `(query, candidate)` pairs with tunable hyperparameters
alpha and beta. This is supported in our code as well with the `--wsc-alpha` and
`--wsc-beta` arguments. However, we achieved slightly better (and more robust)
results on the development set by instead using a single cross entropy loss term
over the log-probabilities for the query and all mined candidates. **The
candidates are mined using spaCy from each input sentence in isolation, so the
approach remains strictly pointwise.** This reduces the number of
hyperparameters and our best model achieved 92.3% development set accuracy,
compared to ~90% accuracy for the margin loss. Later versions of the RoBERTa
arXiv paper will describe this updated formulation.
### 1) Download the WSC data from the SuperGLUE website:
```bash
......