Commit abb7ed4c authored by Myle Ott's avatar Myle Ott Committed by Facebook Github Bot

Update READMEs for torch.hub

Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/795

Differential Revision: D16620488

Pulled By: myleott

fbshipit-source-id: 1998a9ccd8816fc7f590861fb4898f910a36bc1e
parent 5f342527
......@@ -4,29 +4,32 @@ This page includes pre-trained models from the paper [Understanding Back-Transla
## Pre-trained models
Model | Description | Dataset | Download
---|---|---|---
`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
## Example usage
Interactive generation from the full ensemble via PyTorch Hub:
```python
import torch
# List available models
torch.hub.list('pytorch/fairseq') # [..., 'transformer.wmt18.en-de', ... ]
# Load the WMT'18 En-De ensemble
en2de_ensemble = torch.hub.load(
'pytorch/fairseq', 'transformer.wmt18.en-de',
checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
tokenizer='moses', bpe='subword_nmt')
# The ensemble contains 5 models
len(en2de_ensemble.models)
# 5
# Translate
en2de_ensemble.translate('Hello world!')
# 'Hallo Welt!'
```
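Generation options are forwarded to the underlying generator, so beam size and similar settings can be changed at call time (the `beam=10` value below is only illustrative):
```python
# Generation kwargs are passed through translate(); beam=10 is an
# illustrative value, not a recommended setting
en2de_ensemble.translate('Hello world!', beam=10)
```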
## Citation
......
......@@ -2,36 +2,30 @@
## Pre-trained models
Model | Description | Dataset | Download
---|---|---|---
`transformer_lm.gbw.adaptive_huge` | Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) <br> 1026M params | [Google Billion Words](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_gbw_huge.tar.bz2)
`transformer_lm.wiki103.adaptive` | Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) <br> 247M params | [WikiText-103](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_wiki103.tar.bz2)
`transformer_lm.wmt19.en` | English LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
`transformer_lm.wmt19.de` | German LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
`transformer_lm.wmt19.ru` | Russian LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
## Example usage
Available models are listed in the `hub_models()` method of each model file, for example [transformer_lm.py](https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer_lm.py).

Sampling from a language model using PyTorch Hub:
```python
import torch
# List available models
torch.hub.list('pytorch/fairseq') # [..., 'transformer_lm.wmt19.en', ...]
# Load an English LM trained on WMT'19 News Crawl data
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
# Sample from the language model
en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
# "Barack Obama is coming to Sydney and New Zealand (...)"
```
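The German and Russian LMs listed above can be loaded and sampled from in the same way; for example:
```python
# Load the German LM and sample with the same settings as above
de_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.de', tokenizer='moses', bpe='fastbpe')
de_lm.sample('Maschinelles lernen ist', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
```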
## Training a new model with the CLI tools
......@@ -44,47 +38,47 @@ Provides an example of pre-processing for [WikiText-103 language modeling task](
Example usage:
Prepare data:
```bash
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
# Binarize the dataset:
TEXT=examples/language_model/wikitext-103
fairseq-preprocess --only-source \
--trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103
```
Train a transformer language model with adaptive inputs ([Baevski and Auli (2018): Adaptive Input Representations for Neural Language Modeling](transformer_lm/README.md)):
```bash
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
mkdir -p checkpoints/transformer_wikitext-103
fairseq-train --task language_modeling data-bin/wikitext-103 \
--save-dir checkpoints/transformer_wikitext-103 --arch transformer_lm_wiki103 \
--max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
--warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
--sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d
# Evaluate:
fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/transformer_wikitext-103/checkpoint_best.pt' \
--sample-break-mode complete --max-tokens 3072 --context-window 2560 --softmax-batch 1024
```
Train a convolutional language model ([Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](conv_lm/README.md)):
```bash
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
mkdir -p checkpoints/fconv_wikitext-103
fairseq-train --task language_modeling data-bin/wikitext-103 \
--save-dir checkpoints/fconv_wikitext-103 \
--max-epoch 35 --arch fconv_lm_dauphin_wikitext103 --optimizer nag \
--lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
--ddp-backend=no_c10d
# Evaluate:
fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/fconv_wikitext-103/checkpoint_best.pt'
```
# Finetuning RoBERTa on a custom classification task
This example shows how to finetune RoBERTa on the IMDB dataset, but should illustrate the process for most classification tasks.
### 1) Get the data
```bash
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
### 2) Format data
The `IMDB` dataset has one sample per file; the Python snippet below converts it into a single file for each of the train and valid splits, for easier processing.
```python
import argparse
import os
import random
......@@ -42,79 +44,78 @@ if __name__ == '__main__':
main(args)
```
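Most of that snippet is collapsed in this diff. As a rough guide, a minimal version could look like the sketch below; the output file names (`train.input0`, `train.label`, `dev.input0`, `dev.label`) match the preprocessing steps that follow, but the 90/10 train/valid split and the handling of the IMDB directory layout are assumptions rather than the exact script from this commit.
```python
# Minimal sketch (not the exact script from this commit): flatten the
# one-file-per-review IMDB layout into <split>.input0 / <split>.label pairs.
import os
import random

random.seed(0)
samples = []
for label in ['pos', 'neg']:
    dirname = os.path.join('aclImdb/train', label)
    for fname in os.listdir(dirname):
        with open(os.path.join(dirname, fname)) as f:
            # keep each review on a single line
            text = f.read().replace('<br />', ' ').strip()
        samples.append((text, label))
random.shuffle(samples)

num_dev = int(0.1 * len(samples))  # assumed 90/10 train/valid split
splits = {'train': samples[num_dev:], 'dev': samples[:num_dev]}
for split, data in splits.items():
    with open('aclImdb/{}.input0'.format(split), 'w') as f_in, \
            open('aclImdb/{}.label'.format(split), 'w') as f_lab:
        for text, label in data:
            f_in.write(text + '\n')
            f_lab.write(label + '\n')
```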
### 3) BPE Encode
Run `multiprocessing_bpe_encoder`; you could also do this in the previous step for each sample, but that may be slower.
```bash
# Download encoder.json and vocab.bpe
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
for SPLIT in train dev; do
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json encoder.json \
--vocab-bpe vocab.bpe \
--inputs "aclImdb/$SPLIT.input0" \
--outputs "aclImdb/$SPLIT.input0.bpe" \
--workers 60 \
--keep-empty
done
```
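An optional sanity check is to confirm that the BPE-encoded files have the same number of lines as the raw inputs:
```bash
# Line counts of raw and BPE-encoded files should match
wc -l aclImdb/train.input0 aclImdb/train.input0.bpe
```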
### 4) Preprocess data
```bash
# Download fairseq dictionary.
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'
fairseq-preprocess \
--only-source \
--trainpref "aclImdb/train.input0.bpe" \
--validpref "aclImdb/dev.input0.bpe" \
--destdir "IMDB-bin/input0" \
--workers 60 \
--srcdict dict.txt
fairseq-preprocess \
--only-source \
--trainpref "aclImdb/train.label" \
--validpref "aclImdb/dev.label" \
--destdir "IMDB-bin/label" \
--workers 60
```
### 5) Run Training
```bash
TOTAL_NUM_UPDATES=7812 # 10 epochs through IMDB for bsz 32
WARMUP_UPDATES=469 # 6 percent of the number of updates
LR=1e-05 # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=8 # Batch size.
ROBERTA_PATH=/path/to/roberta/model.pt
CUDA_VISIBLE_DEVICES=0 python train.py IMDB-bin/ \
--restore-file $ROBERTA_PATH \
--max-positions 512 \
--max-sentences $MAX_SENTENCES \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
--max-epoch 10 \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--truncate-sequence \
--update-freq 4
```
The above command trains with an effective batch size of `32`; it was tested on a single NVIDIA `V100` GPU with `32GB` of memory.
The expected best validation accuracy after `10` epochs is `~96.5%`.
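Once training has finished you can load the finetuned checkpoint and classify new text. The sketch below assumes the default `checkpoints/` save directory used by the training command above and the default head name registered by the `sentence_prediction` task (`sentence_classification_head`):
```python
from fairseq.models.roberta import RobertaModel

# Load the finetuned checkpoint together with the binarized IMDB data
roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='IMDB-bin'
)
roberta.eval()  # disable dropout

tokens = roberta.encode('This movie was a wonderful surprise.')
# The predicted index corresponds to the entries of IMDB-bin/label/dict.txt
roberta.predict('sentence_classification_head', tokens).argmax()
```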
# Finetuning RoBERTa on GLUE tasks
### 1) Download the data from the GLUE website (https://gluebenchmark.com/tasks) using the following commands:
```bash
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python download_glue_data.py --data_dir glue_data --tasks all
```
### 2) Preprocess GLUE task data:
```bash
./examples/roberta/preprocess_GLUE_tasks.sh glue_data <glue_task_name>
```
`glue_task_name` is one of the following:
`{ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}`
Use `ALL` to preprocess all GLUE tasks.
### 3) Fine-tuning on GLUE task:
Example fine-tuning command for the `RTE` task:
```bash
TOTAL_NUM_UPDATES=2036 # 10 epochs through RTE for bsz 16
WARMUP_UPDATES=122 # 6 percent of the number of updates
LR=2e-05 # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=16 # Batch size.
ROBERTA_PATH=/path/to/roberta/model.pt
CUDA_VISIBLE_DEVICES=0 python train.py RTE-bin/ \
--restore-file $ROBERTA_PATH \
--max-positions 512 \
--max-sentences $MAX_SENTENCES \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
--max-epoch 10 \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
```
For each GLUE task, use the following command-line arguments:
Argument | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
---|---|---|---|---|---|---|---|---
`--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
`--lr` | 1e-5 | 1e-5 | 1e-5 | 2e-5 | 1e-5 | 1e-5 | 1e-5 | 2e-5
`--max-sentences` | 32 | 32 | 32 | 16 | 32 | 16 | 16 | 16
`--total-num-update` | 123873 | 33112 | 113272 | 2036 | 20935 | 2296 | 5336 | 3598
`--warmup-updates` | 7432 | 1986 | 28318 | 122 | 1256 | 137 | 320 | 214
For `STS-B` additionally add `--regression-target --best-checkpoint-metric loss` and remove `--maximize-best-checkpoint-metric`.
**Note:**
a) `--total-num-update` is used by the `polynomial_decay` scheduler and is calculated for `--max-epoch=10` and `--max-sentences=16/32`, depending on the task.
b) The above arguments and hyperparameters were tested on a single NVIDIA `V100` GPU with `32GB` of memory for each task. Depending on the GPU memory available to you, you can increase `--update-freq` and reduce `--max-sentences`.
c) The settings in the table above are suggested values based on our hyperparameter search within a fixed search space (for careful comparison across models). You may find better metrics with a wider hyperparameter search.
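After fine-tuning, the checkpoint can be loaded for inference like any other RoBERTa model. A sketch for `RTE` follows (the `checkpoints/` save directory and the `sentence_classification_head` head name are assumptions matching the defaults above):
```python
from fairseq.models.roberta import RobertaModel

# Load the RTE-finetuned checkpoint together with the binarized RTE data
roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='RTE-bin'
)
roberta.eval()  # disable dropout
tokens = roberta.encode('The cat sat on the mat.', 'There is a cat on the mat.')
# Predicted index maps to the entries of RTE-bin/label/dict.txt
roberta.predict('sentence_classification_head', tokens).argmax()
```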
......@@ -39,85 +39,83 @@ Model | Accuracy | Middle | High
## Example usage
##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```
##### Load RoBERTa (for PyTorch 1.0):
```python
# Download roberta.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
tar -xzvf roberta.large.tar.gz
# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```
##### Apply Byte-Pair Encoding (BPE) to input text:
```python
tokens = roberta.encode('Hello world!')
assert tokens.tolist() == [0, 31414, 232, 328, 2]
roberta.decode(tokens) # 'Hello world!'
```
##### Extract features from RoBERTa:
```python
# Extract the last layer's features
last_layer_features = roberta.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 5, 1024])
# Extract all layer's features (layer 0 is the embedding layer)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
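The extracted features can also be pooled into a fixed-size sentence representation; using the final-layer state of the leading `<s>` token is one common convention (shown here for illustration, not an official fairseq API):
```python
# Take the final-layer representation of the leading <s> token as a simple
# sentence embedding (a common convention, illustrative only)
sentence_embedding = last_layer_features[0, 0]
assert sentence_embedding.shape == torch.Size([1024])
```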
##### Use RoBERTa for sentence-pair classification tasks:
```python
# Download RoBERTa already finetuned for MNLI
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval() # disable dropout for evaluation
# Encode a pair of sentences and make a prediction
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax() # 0: contradiction
# Encode another pair of sentences
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.')
roberta.predict('mnli', tokens).argmax() # 2: entailment
```
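The integer predictions correspond to MNLI classes and can be mapped back to label strings with the same mapping used in the evaluation snippet further down:
```python
# Map predicted indices to MNLI label strings
label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
prediction = roberta.predict('mnli', tokens).argmax().item()
print(label_map[prediction])  # 'entailment' for the second sentence pair above
```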
##### Register a new (randomly initialized) classification head:
```python
roberta.register_classification_head('new_task', num_classes=3)
logprobs = roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], grad_fn=<LogSoftmaxBackward>)
```
##### Batched prediction:
```python
from fairseq.data.data_utils import collate_tokens
sentences = ['Hello world.', 'Another unrelated sentence.']
batch = collate_tokens([roberta.encode(sent) for sent in sentences], pad_idx=1)
logprobs = roberta.predict('new_task', batch)
assert logprobs.size() == torch.Size([2, 3])
```
##### Using the GPU:
```python
roberta.cuda()
roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```
##### Evaluating the `roberta.large.mnli` model
An example Python snippet for evaluating accuracy on the MNLI dev_matched set:
```python
label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
ncorrect, nsamples = 0, 0
roberta.cuda()
......@@ -137,79 +135,11 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
```
## Finetuning
- [Finetuning on GLUE](README.finetune_glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.finetune_custom_classification.md)
- Finetuning on SQuAD: coming soon
## Pretraining using your own data
......@@ -223,11 +153,11 @@ A more detailed tutorial is coming soon.
```bibtex
@article{liu2019roberta,
title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
Luke Zettlemoyer and Veselin Stoyanov},
journal={arXiv preprint arXiv:1907.11692},
year = {2019},
}
```
......@@ -4,10 +4,10 @@ This page includes instructions for reproducing results from the paper [Scaling
## Pre-trained models
Model | Description | Dataset | Download
---|---|---|---
`transformer.wmt14.en-fr` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
`transformer.wmt16.en-de` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
## Training a new model on WMT'16 En-De
......@@ -15,33 +15,33 @@ Please first download the [preprocessed WMT'16 En-De data provided by Google](ht
Then:
1. Extract the WMT'16 En-De data:
```bash
TEXT=wmt16_en_de_bpe32k
mkdir $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT
```
2. Preprocess the dataset with a joined dictionary:
```bash
fairseq-preprocess --source-lang en --target-lang de \
--trainpref $TEXT/train.tok.clean.bpe.32000 \
--validpref $TEXT/newstest2013.tok.bpe.32000 \
--testpref $TEXT/newstest2014.tok.bpe.32000 \
--destdir data-bin/wmt16_en_de_bpe32k \
--nwordssrc 32768 --nwordstgt 32768 \
--joined-dictionary
```
3. Train a model:
```bash
fairseq-train data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16
```
Note that the `--fp16` flag requires CUDA 9.1 or greater and a Volta GPU.
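Once training has converged, the test set can be translated with `fairseq-generate`; the beam size and length penalty below are common settings for this setup and are illustrative rather than prescribed by this README:
```bash
fairseq-generate data-bin/wmt16_en_de_bpe32k \
    --path checkpoints/checkpoint_best.pt \
    --beam 4 --lenpen 0.6 --remove-bpe
```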
......
......@@ -14,7 +14,7 @@ We provide sample stories generated by the [convolutional seq2seq model](https:/
The dataset can be downloaded like this:
```bash
cd examples/stories
curl https://dl.fbaipublicfiles.com/fairseq/data/writingPrompts.tar.gz | tar xvzf -
```
......@@ -23,28 +23,28 @@ and contains a train, test, and valid split. The dataset is described here: http
## Example usage
First we will preprocess the dataset. Note that the dataset release is the full data, but the paper models the first 1000 words of each story. Here is example code that trims the dataset to the first 1000 words of each story:
```python
data = ["train", "test", "valid"]
for name in data:
with open(name + ".wp_target") as f:
stories = f.readlines()
stories = [" ".join(i.split()[0:1000]) for i in stories]
with open(name + ".wp_target", "w") as o:
for line in stories:
o.write(line.strip() + "\n")
```
Once we've trimmed the data we can binarize it and train our model:
```bash
# Binarize the dataset:
export TEXT=examples/stories/writingPrompts
fairseq-preprocess --source-lang wp_source --target-lang wp_target \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/writingPrompts --padding-factor 1 --thresholdtgt 10 --thresholdsrc 10
# Train the model:
fairseq-train data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True --self-attention True --project-input True --pretrained False
# Train a fusion model:
# add the arguments: --pretrained True --pretrained-checkpoint path/to/checkpoint
......@@ -52,7 +52,7 @@ $ fairseq-train data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-no
# Generate:
# Note: to load the pretrained model at generation time, you need to pass in a model-override argument to communicate to the fusion model at generation time where you have placed the pretrained checkpoint. By default, it will load the exact path of the fusion model's pretrained model from training time. You should use model-override if you have moved the pretrained model (or are using our provided models). If you are generating from a non-fusion model, the model-override argument is not necessary.
fairseq-generate data-bin/writingPrompts --path /path/to/trained/model/checkpoint_best.pt --batch-size 32 --beam 1 --sampling --sampling-topk 10 --sampling-temperature 0.8 --nbest 1 --model-overrides "{'pretrained_checkpoint':'/path/to/pretrained/model/checkpoint'}"
```
## Citation
......
......@@ -14,47 +14,47 @@ Use the `--method` flag to choose the MoE variant; we support hard mixtures with
The model is trained with online responsibility assignment and shared parameterization.
The following command will train a `hMoElp` model with `3` experts:
```bash
fairseq-train --ddp-backend='no_c10d' \
data-bin/wmt17_en_de \
--max-update 100000 \
--task translation_moe \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0007 --min-lr 1e-09 \
--dropout 0.1 --weight-decay 0.0 --criterion cross_entropy \
--max-tokens 3584
```
## Translate
Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
For example, to generate from expert 0:
```bash
fairseq-generate data-bin/wmt17_en_de \
--path checkpoints/checkpoint_best.pt \
--beam 1 --remove-bpe \
--task translation_moe \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--gen-expert 0
```
## Evaluate
First download a tokenized version of the WMT'14 En-De test set with multiple references:
```bash
wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
```
Next apply BPE on the fly and run generation for each expert:
```bash
BPEROOT=examples/translation/subword-nmt/
BPE_CODE=examples/translation/wmt17_en_de/code
for EXPERT in $(seq 0 2); do \
cat wmt14-en-de.extra_refs.tok \
| grep ^S | cut -f 2 \
| fairseq-interactive data-bin/wmt17_en_de \
......@@ -66,15 +66,15 @@ $ for EXPERT in $(seq 0 2); do \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--gen-expert $EXPERT ; \
done > wmt14-en-de.extra_refs.tok.gen.3experts
```
Finally, use `score.py` to compute pairwise BLEU and average oracle BLEU:
```bash
python examples/translation_moe/score.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
# pairwise BLEU: 48.26
# #refs covered: 2.11
# multi-reference BLEU (leave-one-out): 59.46
```
This matches row 3 from Table 7 in the paper.
......
......@@ -4,86 +4,52 @@ This page provides pointers to the models of Facebook-FAIR's WMT'19 news transla
## Pre-trained models
Model | Description | Download
---|---|---
`transformer.wmt19.en-de` | En->De Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz)
`transformer.wmt19.de-en` | De->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz)
`transformer.wmt19.en-ru` | En->Ru Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz)
`transformer.wmt19.ru-en` | Ru->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ensemble.tar.gz)
`transformer_lm.wmt19.en` | En Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
`transformer_lm.wmt19.de` | De Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
`transformer_lm.wmt19.ru` | Ru Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
## Example usage (torch.hub)
```python
import torch
# English to German translation
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
tokenizer='moses', bpe='fastbpe')
en2de.translate("Machine learning is great!") # 'Maschinelles Lernen ist großartig!'
# German to English translation
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
tokenizer='moses', bpe='fastbpe')
de2en.translate("Maschinelles Lernen ist großartig!") # 'Machine learning is great!'
# English to Russian translation
en2ru = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-ru', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
tokenizer='moses', bpe='fastbpe')
en2ru.translate("Machine learning is great!") # 'Машинное обучение - это здорово!'
# Russian to English translation
ru2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.ru-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
tokenizer='moses', bpe='fastbpe')
ru2en.translate("Машинное обучение - это здорово!") # 'Machine learning is great!'
# Sample from the English LM
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
en_lm.sample("Machine learning is") # 'Machine learning is the future of computing, says Microsoft boss Satya Nadella ...'
# Sample from the German LM
de_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.de', tokenizer='moses', bpe='fastbpe')
de_lm.sample("Maschinelles lernen ist") # 'Maschinelles lernen ist das A und O (neues-deutschland.de) Die Arbeitsbedingungen für Lehrerinnen und Lehrer sind seit Jahren verbesserungswürdig ...'
# Sample from the Russian LM
ru_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.ru', tokenizer='moses', bpe='fastbpe')
ru_lm.sample("машинное обучение это") # 'машинное обучение это то, что мы называем "искусственным интеллектом".'
```
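Loading all four checkpoints of an ensemble is slow and memory-hungry; for quick experiments a single model can be loaded by passing just one checkpoint file (translation quality will be somewhat lower than with the full ensemble):
```python
# Load only the first checkpoint of the En-De ensemble for faster inference
en2de_single = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                              checkpoint_file='model1.pt', tokenizer='moses', bpe='fastbpe')
en2de_single.translate("Machine learning is great!")
```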
## Citation
......
......@@ -12,9 +12,9 @@ class MosesTokenizer(object):
@staticmethod
def add_args(parser):
# fmt: off
parser.add_argument('--moses-source-lang', metavar='SRC',
help='source language')
parser.add_argument('--moses-target-lang', metavar='TARGET',
help='target language')
parser.add_argument('--moses-no-dash-splits', action='store_true', default=False,
help='don\'t apply dash split rules')
......@@ -24,6 +24,12 @@ class MosesTokenizer(object):
def __init__(self, args):
self.args = args
if getattr(args, 'moses_source_lang', None) is None:
args.moses_source_lang = getattr(args, 'source_lang', 'en')
if getattr(args, 'moses_target_lang', None) is None:
args.moses_target_lang = getattr(args, 'target_lang', 'en')
try:
from sacremoses import MosesTokenizer, MosesDetokenizer
self.tok = MosesTokenizer(args.moses_source_lang)
......
......@@ -97,12 +97,15 @@ class GeneratorHubInterface(nn.Module):
def device(self):
return self._float_tensor.device
def translate(self, sentence: str, beam: int = 5, verbose: bool = False, **kwargs) -> str:
return self.sample(sentence, beam, verbose, **kwargs)
def sample(self, sentence: str, beam: int = 1, verbose: bool = False, **kwargs) -> str:
input = self.encode(sentence)
hypo = self.generate(input, beam, verbose, **kwargs)[0]['tokens']
return self.decode(hypo)
def generate(self, tokens: torch.LongTensor, beam: int = 5, verbose: bool = False, **kwargs) -> torch.LongTensor:
sample = self._build_sample(tokens)
# build generator using current args as well as any kwargs
......@@ -117,20 +120,24 @@ class GeneratorHubInterface(nn.Module):
src_str_with_unk = self.string(tokens)
print('S\t{}'.format(src_str_with_unk))
def getarg(name, default):
return getattr(gen_args, name, getattr(self.args, name, default))
# Process top predictions
hypos = translations[0]
if verbose:
for hypo in hypos:
hypo_str = self.decode(hypo['tokens'])
print('H\t{}\t{}'.format(hypo['score'], hypo_str))
print('P\t{}'.format(
' '.join(map(lambda x: '{:.4f}'.format(x), hypo['positional_scores'].tolist()))
))
if hypo['alignment'] is not None and getarg('print_alignment', False):
print('A\t{}'.format(
' '.join(map(lambda x: str(utils.item(x)), hypo['alignment'].int().cpu()))
))
return hypos
def encode(self, sentence: str) -> torch.LongTensor:
sentence = self.tokenize(sentence)
......
......@@ -11,6 +11,7 @@ from fairseq.models import MODEL_REGISTRY
dependencies = [
'fastBPE',
'regex',
'requests',
'sacremoses',
......
......@@ -44,7 +44,9 @@ setup(
long_description_content_type='text/markdown',
install_requires=[
'cffi',
'fastBPE',
'numpy',
'regex',
'sacrebleu',
'torch',
'tqdm',
......