Commit abb7ed4c authored by Myle Ott, committed by Facebook GitHub Bot

Update READMEs for torch.hub

Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/795

Differential Revision: D16620488

Pulled By: myleott

fbshipit-source-id: 1998a9ccd8816fc7f590861fb4898f910a36bc1e
parent 5f342527
@@ -4,29 +4,32 @@ This page includes pre-trained models from the paper [Understanding Back-Translation at Scale ...
## Pre-trained models
Model | Description | Dataset | Download
---|---|---|---
`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
## Example usage
Interactive generation from the full ensemble via PyTorch Hub:
```python
import torch

# List available models
torch.hub.list('pytorch/fairseq')  # [..., 'transformer.wmt18.en-de', ... ]

# Load the WMT'18 En-De ensemble
en2de_ensemble = torch.hub.load(
    'pytorch/fairseq', 'transformer.wmt18.en-de',
    checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
    tokenizer='moses', bpe='subword_nmt')

# The ensemble contains 5 models
len(en2de_ensemble.models)
# 5

# Translate
en2de_ensemble.translate('Hello world!')
# 'Hallo Welt!'
```
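Generation options can also be passed straight through `translate()` (the `beam` argument is forwarded by the `GeneratorHubInterface` updated later in this commit); a small sketch, with the wider beam value chosen purely for illustration:
```python
# Beam search with a larger beam; other generation kwargs are forwarded as well
en2de_ensemble.translate('Hello world!', beam=10)
```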
## Citation
@@ -2,36 +2,30 @@
## Pre-trained models
Model | Description | Dataset | Download
---|---|---|---
`transformer_lm.gbw.adaptive_huge` | Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) <br> 1026M params | [Google Billion Words](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_gbw_huge.tar.bz2)
`transformer_lm.wiki103.adaptive` | Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) <br> 247M params | [WikiText-103](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_wiki103.tar.bz2)
`transformer_lm.wmt19.en` | English LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
`transformer_lm.wmt19.de` | German LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
`transformer_lm.wmt19.ru` | Russian LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
## Example usage
Sampling from a language model using PyTorch Hub:
```python
import torch

# List available models
torch.hub.list('pytorch/fairseq')  # [..., 'transformer_lm.wmt19.en', ...]

# Load an English LM trained on WMT'19 News Crawl data
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')

# Sample from the language model
en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
# "Barack Obama is coming to Sydney and New Zealand (...)"
```
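The adaptive-input models from the table can be loaded the same way; a sketch for the WikiText-103 model (the tokenizer setting here mirrors the pre-update example and is an assumption, it may need adjusting):
```python
import torch

# Load the adaptive-input LM trained on WikiText-103 (word-level, no BPE)
wiki_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wiki103.adaptive', tokenizer='moses')

# Sample a continuation
wiki_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
```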
## Training a new model with the CLI tools
@@ -44,47 +38,47 @@ Provides an example of pre-processing for [WikiText-103 language modeling task](...
Example usage:
Prepare data:
```bash
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..

# Binarize the dataset:
TEXT=examples/language_model/wikitext-103
fairseq-preprocess --only-source \
--trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103
```
Train a transformer language model with adaptive inputs ([Baevski and Auli (2018): Adaptive Input Representations for Neural Language Modeling](transformer_lm/README.md)):
```bash
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
mkdir -p checkpoints/transformer_wikitext-103
fairseq-train --task language_modeling data-bin/wikitext-103 \
--save-dir checkpoints/transformer_wikitext-103 --arch transformer_lm_wiki103 \
--max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
--warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
--sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d

# Evaluate:
fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/transformer_wiki103/checkpoint_best.pt' \
--sample-break-mode complete --max-tokens 3072 --context-window 2560 --softmax-batch 1024
```
Train a convolutional language model ([Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](conv_lm/README.md)):
```bash
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
mkdir -p checkpoints/fconv_wikitext-103
fairseq-train --task language_modeling data-bin/wikitext-103 \
--save-dir checkpoints/fconv_wikitext-103 \
--max-epoch 35 --arch fconv_lm_dauphin_wikitext103 --optimizer nag \
--lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
--ddp-backend=no_c10d

# Evaluate:
fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/fconv_wiki103/checkpoint_best.pt'
```
# Finetuning RoBERTa on a custom classification task
This example shows how to finetune RoBERTa on the IMDB dataset, but should illustrate the process for most classification tasks.
### 1) Get the data
```bash
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
### 2) Format data
The `IMDB` data has one sample per file; the Python snippet below converts it into a single file for each of the train and valid splits, for ease of processing.
```python
import argparse
import os
import random
```
@@ -42,79 +44,78 @@
```python
if __name__ == '__main__':
    main(args)
```
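The body of the conversion script is collapsed in this diff view. For reference, a minimal sketch of such a script (the argument names, the 90/10 train/valid split, and the output file names are assumptions chosen to match the steps below, not the repository's exact code):
```python
import argparse
import os
import random


def main(args):
    random.seed(args.seed)
    samples = []
    # Read the raw per-file reviews from the labelled training portion of IMDB
    for label in ['pos', 'neg']:
        folder = os.path.join(args.datadir, 'train', label)
        for fname in os.listdir(folder):
            with open(os.path.join(folder, fname), encoding='utf-8') as f:
                samples.append((f.read().replace('\n', ' '), label))
    random.shuffle(samples)
    num_dev = int(len(samples) * args.dev_fraction)
    splits = {'dev': samples[:num_dev], 'train': samples[num_dev:]}
    # Write one .input0 (text) and one .label file per split
    for split, data in splits.items():
        with open(os.path.join(args.datadir, split + '.input0'), 'w', encoding='utf-8') as f_in, \
                open(os.path.join(args.datadir, split + '.label'), 'w', encoding='utf-8') as f_lab:
            for text, label in data:
                f_in.write(text + '\n')
                f_lab.write(label + '\n')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--datadir', default='aclImdb')
    parser.add_argument('--dev-fraction', type=float, default=0.1)
    parser.add_argument('--seed', type=int, default=0)
    main(parser.parse_args())
```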
### 3) BPE Encode
Run `multiprocessing_bpe_encoder`; you could also do this in the previous step for each sample, but that would likely be slower.
```bash
# Download encoder.json and vocab.bpe
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'

for SPLIT in train dev; do
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json encoder.json \
        --vocab-bpe vocab.bpe \
        --inputs "aclImdb/$SPLIT.input0" \
        --outputs "aclImdb/$SPLIT.input0.bpe" \
        --workers 60 \
        --keep-empty
done
```
### 4) Preprocess data
```bash
# Download fairseq dictionary.
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

fairseq-preprocess \
--only-source \
--trainpref "aclImdb/train.input0.bpe" \
--validpref "aclImdb/dev.input0.bpe" \
--destdir "IMDB-bin/input0" \
--workers 60 \
--srcdict dict.txt

fairseq-preprocess \
--only-source \
--trainpref "aclImdb/train.label" \
--validpref "aclImdb/dev.label" \
--destdir "IMDB-bin/label" \
--workers 60
```
### 5) Run Training
```bash
TOTAL_NUM_UPDATES=7812  # 10 epochs through IMDB for bsz 32
WARMUP_UPDATES=469      # 6 percent of the number of updates
LR=1e-05                # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=8         # Batch size.
ROBERTA_PATH=/path/to/roberta/model.pt

CUDA_VISIBLE_DEVICES=0 python train.py IMDB-bin/ \
--restore-file $ROBERTA_PATH \
--max-positions 512 \
--max-sentences $MAX_SENTENCES \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
--max-epoch 10 \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--truncate-sequence \
--update-freq 4
```
The above command trains with an effective batch size of `32` and was tested on a single NVIDIA `V100 32GB` GPU.
The expected `best-validation-accuracy` after `10` epochs is `~96.5%`.
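After training, the checkpoint can be loaded back for prediction. A minimal sketch (the `checkpoints/` directory, the example sentence, and the head name `sentence_classification_head` are assumptions based on fairseq defaults and the training command above):
```python
from fairseq.models.roberta import RobertaModel

# Load the finetuned checkpoint together with the binarized data directory
roberta = RobertaModel.from_pretrained(
    'checkpoints',                        # default --save-dir used by train.py
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='IMDB-bin'
)
roberta.eval()  # disable dropout

tokens = roberta.encode('A surprisingly touching film with terrific performances.')
label_idx = roberta.predict('sentence_classification_head', tokens).argmax().item()
# label_idx indexes into the label dictionary built from IMDB-bin/label
```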
# Finetuning RoBERTa on GLUE tasks
### 1) Download the data from the GLUE website (https://gluebenchmark.com/tasks) using the following commands:
```bash
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python download_glue_data.py --data_dir glue_data --tasks all
```
### 2) Preprocess GLUE task data:
```bash
./examples/roberta/preprocess_GLUE_tasks.sh glue_data <glue_task_name>
```
`glue_task_name` is one of the following:
`{ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}`
Use `ALL` to preprocess all the GLUE tasks.
### 3) Fine-tuning on a GLUE task
Example fine-tuning command for the `RTE` task:
```bash
TOTAL_NUM_UPDATES=2036 # 10 epochs through RTE for bsz 16
WARMUP_UPDATES=122 # 6 percent of the number of updates
LR=2e-05 # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=16 # Batch size.
ROBERTA_PATH=/path/to/roberta/model.pt
CUDA_VISIBLE_DEVICES=0 python train.py RTE-bin/ \
--restore-file $ROBERTA_PATH \
--max-positions 512 \
--max-sentences $MAX_SENTENCES \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
--max-epoch 10 \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric
```
For each GLUE task, you will need to use the following command-line arguments:
Hyperparameter | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
---|---|---|---|---|---|---|---|---
`--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
`--lr` | 1e-5 | 1e-5 | 1e-5 | 2e-5 | 1e-5 | 1e-5 | 1e-5 | 2e-5
`--max-sentences` | 32 | 32 | 32 | 16 | 32 | 16 | 16 | 16
`--total-num-update` | 123873 | 33112 | 113272 | 2036 | 20935 | 2296 | 5336 | 3598
`--warmup-updates` | 7432 | 1986 | 28318 | 122 | 1256 | 137 | 320 | 214
For `STS-B` additionally add `--regression-target --best-checkpoint-metric loss` and remove `--maximize-best-checkpoint-metric`.
**Note:**
a) `--total-num-update` is used by the `polynomial_decay` scheduler and is calculated for `--max-epoch=10` and `--max-sentences=16/32`, depending on the task.
b) The above command-line arguments and hyperparameters were tested on a single NVIDIA `V100` GPU with `32GB` of memory for each task. Depending on the GPU memory available to you, you can increase `--update-freq` and reduce `--max-sentences`.
c) The settings in the table above are suggested settings from our hyperparameter search within a fixed search space (for careful comparison across models). You may find better metrics with a wider hyperparameter search.
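For example, to adapt the `RTE` command above to `MNLI`, point it at `MNLI-bin/` and substitute the values from the table; a sketch of just the changed variables:
```bash
# MNLI values from the table above; reuse the remaining arguments from the RTE command.
TOTAL_NUM_UPDATES=123873
WARMUP_UPDATES=7432
LR=1e-05
NUM_CLASSES=3
MAX_SENTENCES=32
```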
@@ -39,85 +39,83 @@ Model | Accuracy | Middle | High
## Example usage
##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```
##### Load RoBERTa (for PyTorch 1.0):
```bash
# Download roberta.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
tar -xzvf roberta.large.tar.gz
```
```python
# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```
##### Apply Byte-Pair Encoding (BPE) to input text:
```python
tokens = roberta.encode('Hello world!')
assert tokens.tolist() == [0, 31414, 232, 328, 2]
roberta.decode(tokens)  # 'Hello world!'
```
##### Extract features from RoBERTa:
```python
# Extract the last layer's features
last_layer_features = roberta.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 5, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
##### Use RoBERTa for sentence-pair classification tasks:
```python
# Download RoBERTa already finetuned for MNLI
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()  # disable dropout for evaluation

# Encode a pair of sentences and make a prediction
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction

# Encode another pair of sentences
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.')
roberta.predict('mnli', tokens).argmax()  # 2: entailment
```
##### Register a new (randomly initialized) classification head:
```python
roberta.register_classification_head('new_task', num_classes=3)
logprobs = roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], grad_fn=<LogSoftmaxBackward>)
```
##### Batched prediction:
```python
from fairseq.data.data_utils import collate_tokens
sentences = ['Hello world.', 'Another unrelated sentence.']
batch = collate_tokens([roberta.encode(sent) for sent in sentences], pad_idx=1)
logprobs = roberta.predict('new_task', batch)
assert logprobs.size() == torch.Size([2, 3])
```
##### Using the GPU:
```python
roberta.cuda()
roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```
##### Evaluating the `roberta.large.mnli` model
Example Python code snippet to evaluate accuracy on the MNLI dev_matched set.
```python
label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
ncorrect, nsamples = 0, 0
roberta.cuda()
```
@@ -137,79 +135,11 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
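The body of the evaluation loop is collapsed in this diff view. A hedged sketch of how it could look (the `glue_data/MNLI/dev_matched.tsv` path and the TSV column indices are assumptions about the GLUE data layout, not taken from the collapsed code):
```python
with open('glue_data/MNLI/dev_matched.tsv') as fin:
    fin.readline()  # skip the header row
    for line in fin:
        fields = line.strip().split('\t')
        sent1, sent2, target = fields[8], fields[9], fields[-1]
        tokens = roberta.encode(sent1, sent2)
        prediction = roberta.predict('mnli', tokens).argmax().item()
        ncorrect += int(label_map[prediction] == target)
        nsamples += 1
print('| Accuracy: ', float(ncorrect) / float(nsamples))
```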
## Finetuning
- [Finetuning on GLUE](README.finetune_glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.finetune_custom_classification.md)
- Finetuning on SQuAD: coming soon
## Pretraining using your own data
@@ -223,11 +153,11 @@ A more detailed tutorial is coming soon.
```bibtex
@article{liu2019roberta,
  title   = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
  author  = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
             Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
             Luke Zettlemoyer and Veselin Stoyanov},
  journal = {arXiv preprint arXiv:1907.11692},
  year    = {2019},
}
```
@@ -4,10 +4,10 @@ This page includes instructions for reproducing results from the paper [Scaling Neural Machine Translation ...
## Pre-trained models
Model | Description | Dataset | Download
---|---|---|---
`transformer.wmt14.en-fr` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
`transformer.wmt16.en-de` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
## Training a new model on WMT'16 En-De
@@ -15,33 +15,33 @@ Please first download the [preprocessed WMT'16 En-De data provided by Google](...
Then:
1. Extract the WMT'16 En-De data:
```bash
TEXT=wmt16_en_de_bpe32k
mkdir $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT
```
2. Preprocess the dataset with a joined dictionary:
```bash
fairseq-preprocess --source-lang en --target-lang de \
--trainpref $TEXT/train.tok.clean.bpe.32000 \
--validpref $TEXT/newstest2013.tok.bpe.32000 \
--testpref $TEXT/newstest2014.tok.bpe.32000 \
--destdir data-bin/wmt16_en_de_bpe32k \
--nwordssrc 32768 --nwordstgt 32768 \
--joined-dictionary
```
3. Train a model:
```bash
fairseq-train data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16
```
Note that the `--fp16` flag requires CUDA 9.1 or greater and a Volta GPU.
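To evaluate a trained model you would typically run generation on the binarized test set; a hedged sketch using common fairseq options (the beam size and length penalty here are illustrative, not values from this README):
```bash
fairseq-generate data-bin/wmt16_en_de_bpe32k \
--path checkpoints/checkpoint_best.pt \
--beam 4 --lenpen 0.6 --remove-bpe
```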
@@ -14,7 +14,7 @@ We provide sample stories generated by the [convolutional seq2seq model](https:/...
The dataset can be downloaded like this:
```bash
cd examples/stories
curl https://dl.fbaipublicfiles.com/fairseq/data/writingPrompts.tar.gz | tar xvzf -
```
@@ -23,28 +23,28 @@ and contains a train, test, and valid split. The dataset is described here: http...
## Example usage
First we will preprocess the dataset. Note that the dataset release is the full data, but the paper models the first 1000 words of each story. Here is example code that trims the dataset to the first 1000 words of each story:
```python
data = ["train", "test", "valid"]
for name in data:
    with open(name + ".wp_target") as f:
        stories = f.readlines()
    stories = [" ".join(i.split()[0:1000]) for i in stories]
    with open(name + ".wp_target", "w") as o:
        for line in stories:
            o.write(line.strip() + "\n")
```
Once we've trimmed the data we can binarize it and train our model:
```bash
# Binarize the dataset:
export TEXT=examples/stories/writingPrompts
fairseq-preprocess --source-lang wp_source --target-lang wp_target \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/writingPrompts --padding-factor 1 --thresholdtgt 10 --thresholdsrc 10

# Train the model:
fairseq-train data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True --self-attention True --project-input True --pretrained False

# Train a fusion model:
# add the arguments: --pretrained True --pretrained-checkpoint path/to/checkpoint
```
@@ -52,7 +52,7 @@
```bash
# Generate:
# Note: to load the pretrained model at generation time, you need to pass in a model-override argument to communicate to the fusion model at generation time where you have placed the pretrained checkpoint. By default, it will load the exact path of the fusion model's pretrained model from training time. You should use model-override if you have moved the pretrained model (or are using our provided models). If you are generating from a non-fusion model, the model-override argument is not necessary.
fairseq-generate data-bin/writingPrompts --path /path/to/trained/model/checkpoint_best.pt --batch-size 32 --beam 1 --sampling --sampling-topk 10 --sampling-temperature 0.8 --nbest 1 --model-overrides "{'pretrained_checkpoint':'/path/to/pretrained/model/checkpoint'}"
```
## Citation
@@ -14,47 +14,47 @@ Use the `--method` flag to choose the MoE variant; we support hard mixtures with...
The model is trained with online responsibility assignment and shared parameterization.
The following command will train a `hMoElp` model with `3` experts:
```bash
fairseq-train --ddp-backend='no_c10d' \
data-bin/wmt17_en_de \
--max-update 100000 \
--task translation_moe \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0007 --min-lr 1e-09 \
--dropout 0.1 --weight-decay 0.0 --criterion cross_entropy \
--max-tokens 3584
```
## Translate
Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
For example, to generate from expert 0:
```bash
fairseq-generate data-bin/wmt17_en_de \
--path checkpoints/checkpoint_best.pt \
--beam 1 --remove-bpe \
--task translation_moe \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--gen-expert 0
```
## Evaluate
First download a tokenized version of the WMT'14 En-De test set with multiple references:
```bash
wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
```
Next apply BPE on the fly and run generation for each expert:
```bash
BPEROOT=examples/translation/subword-nmt/
BPE_CODE=examples/translation/wmt17_en_de/code
for EXPERT in $(seq 0 2); do \
    cat wmt14-en-de.extra_refs.tok \
    | grep ^S | cut -f 2 \
    | fairseq-interactive data-bin/wmt17_en_de \
```
@@ -66,15 +66,15 @@
```bash
        --method hMoElp --mean-pool-gating-network \
        --num-experts 3 \
        --gen-expert $EXPERT ; \
done > wmt14-en-de.extra_refs.tok.gen.3experts
```
Finally use `score.py` to compute pairwise BLEU and average oracle BLEU:
```bash
python examples/translation_moe/score.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
# pairwise BLEU: 48.26
# #refs covered: 2.11
# multi-reference BLEU (leave-one-out): 59.46
```
This matches row 3 from Table 7 in the paper.
@@ -4,86 +4,52 @@ This page provides pointers to the models of Facebook-FAIR's WMT'19 news translation task submission ...
## Pre-trained models
Model | Description | Download
---|---|---
`transformer.wmt19.en-de` | En->De Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz)
`transformer.wmt19.de-en` | De->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz)
`transformer.wmt19.en-ru` | En->Ru Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz)
`transformer.wmt19.ru-en` | Ru->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ensemble.tar.gz)
`transformer_lm.wmt19.en` | En Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
`transformer_lm.wmt19.de` | De Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
`transformer_lm.wmt19.ru` | Ru Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
## Example usage (torch.hub)
```python
import torch

# English to German translation
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
                       tokenizer='moses', bpe='fastbpe')
en2de.translate("Machine learning is great!")  # 'Maschinelles Lernen ist großartig!'

# German to English translation
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
                       tokenizer='moses', bpe='fastbpe')
de2en.translate("Maschinelles Lernen ist großartig!")  # 'Machine learning is great!'

# English to Russian translation
en2ru = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-ru', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
                       tokenizer='moses', bpe='fastbpe')
en2ru.translate("Machine learning is great!")  # 'Машинное обучение - это здорово!'

# Russian to English translation
ru2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.ru-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
                       tokenizer='moses', bpe='fastbpe')
ru2en.translate("Машинное обучение - это здорово!")  # 'Machine learning is great!'

# Sample from the English LM
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
en_lm.sample("Machine learning is")  # 'Machine learning is the future of computing, says Microsoft boss Satya Nadella ...'

# Sample from the German LM
de_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.de', tokenizer='moses', bpe='fastbpe')
de_lm.sample("Maschinelles lernen ist")  # 'Maschinelles lernen ist das A und O (neues-deutschland.de) Die Arbeitsbedingungen für Lehrerinnen und Lehrer sind seit Jahren verbesserungswürdig ...'

# Sample from the Russian LM
ru_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.ru', tokenizer='moses', bpe='fastbpe')
ru_lm.sample("машинное обучение это")  # 'машинное обучение это то, что мы называем "искусственным интеллектом".'
```
## Citation
@@ -12,9 +12,9 @@ class MosesTokenizer(object):
    @staticmethod
    def add_args(parser):
        # fmt: off
        parser.add_argument('--moses-source-lang', metavar='SRC',
                            help='source language')
        parser.add_argument('--moses-target-lang', metavar='TARGET',
                            help='target language')
        parser.add_argument('--moses-no-dash-splits', action='store_true', default=False,
                            help='don\'t apply dash split rules')
@@ -24,6 +24,12 @@ class MosesTokenizer(object):
    def __init__(self, args):
        self.args = args

        if getattr(args, 'moses_source_lang', None) is None:
            args.moses_source_lang = getattr(args, 'source_lang', 'en')
        if getattr(args, 'moses_target_lang', None) is None:
            args.moses_target_lang = getattr(args, 'target_lang', 'en')

        try:
            from sacremoses import MosesTokenizer, MosesDetokenizer
            self.tok = MosesTokenizer(args.moses_source_lang)
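A small illustration of the fallback behavior added above (the `Namespace` here just stands in for parsed arguments; purely illustrative, not part of the change):
```python
from argparse import Namespace

# Tokenizer languages were not given explicitly, but the translation task
# defines source_lang/target_lang, so the tokenizer now falls back to them.
args = Namespace(moses_source_lang=None, moses_target_lang=None,
                 source_lang='de', target_lang='en')
if getattr(args, 'moses_source_lang', None) is None:
    args.moses_source_lang = getattr(args, 'source_lang', 'en')
if getattr(args, 'moses_target_lang', None) is None:
    args.moses_target_lang = getattr(args, 'target_lang', 'en')
assert (args.moses_source_lang, args.moses_target_lang) == ('de', 'en')
```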
@@ -97,12 +97,15 @@ class GeneratorHubInterface(nn.Module):
    def device(self):
        return self._float_tensor.device

    def translate(self, sentence: str, beam: int = 5, verbose: bool = False, **kwargs) -> str:
        return self.sample(sentence, beam, verbose, **kwargs)

    def sample(self, sentence: str, beam: int = 1, verbose: bool = False, **kwargs) -> str:
        input = self.encode(sentence)
        hypo = self.generate(input, beam, verbose, **kwargs)[0]['tokens']
        return self.decode(hypo)

    def generate(self, tokens: torch.LongTensor, beam: int = 5, verbose: bool = False, **kwargs) -> torch.LongTensor:
        sample = self._build_sample(tokens)
        # build generator using current args as well as any kwargs
@@ -117,20 +120,24 @@ class GeneratorHubInterface(nn.Module):
            src_str_with_unk = self.string(tokens)
            print('S\t{}'.format(src_str_with_unk))

        def getarg(name, default):
            return getattr(gen_args, name, getattr(self.args, name, default))

        # Process top predictions
        hypos = translations[0]
        if verbose:
            for hypo in hypos:
                hypo_str = self.decode(hypo['tokens'])
                print('H\t{}\t{}'.format(hypo['score'], hypo_str))
                print('P\t{}'.format(
                    ' '.join(map(lambda x: '{:.4f}'.format(x), hypo['positional_scores'].tolist()))
                ))
                if hypo['alignment'] is not None and getarg('print_alignment', False):
                    print('A\t{}'.format(
                        ' '.join(map(lambda x: str(utils.item(x)), hypo['alignment'].int().cpu()))
                    ))

        return hypos

    def encode(self, sentence: str) -> torch.LongTensor:
        sentence = self.tokenize(sentence)
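Given the signatures above, hub models now expose `translate()` (beam search, beam 5 by default), `sample()` (beam 1 by default, typically combined with sampling options), and a lower-level `generate()` that returns all hypothesis dicts. A rough usage sketch (model and checkpoint names follow the READMEs above; outputs are illustrative):
```python
import torch

# Any translation hub model exposes this interface; En-De is used for illustration
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt', tokenizer='moses', bpe='fastbpe')

# translate(): beam search, returns a single decoded string
en2de.translate('Hello world!', beam=10)

# sample(): beam=1 by default, typically combined with sampling options
en2de.sample('Hello world!', sampling=True, sampling_topk=10, temperature=0.8)

# generate(): lower-level, returns all hypotheses as dicts with 'tokens' and 'score'
tokens = en2de.encode('Hello world!')
hypos = en2de.generate(tokens, beam=5)
best_translation = en2de.decode(hypos[0]['tokens'])
```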
@@ -11,6 +11,7 @@ from fairseq.models import MODEL_REGISTRY
dependencies = [
    'fastBPE',
    'regex',
    'requests',
    'sacremoses',
@@ -44,7 +44,9 @@ setup(
    long_description_content_type='text/markdown',
    install_requires=[
        'cffi',
        'fastBPE',
        'numpy',
        'regex',
        'sacrebleu',
        'torch',
        'tqdm',