Commit b8704686 authored by Myle Ott, committed by Facebook Github Bot

Update READMEs

Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/823

Differential Revision: D16804995

Pulled By: myleott

fbshipit-source-id: abac5dc0ed6b7bfe2309ba273456e54b37340b2c
parent ffffe04e
@@ -6,10 +6,10 @@ modeling and other text generation tasks.
### What's New:
- August 2019: [WMT'19 models released](examples/wmt19/README.md)
- July 2019: fairseq relicensed under MIT license
- July 2019: [RoBERTa models and code released](examples/roberta/README.md)
- June 2019: [wav2vec models and code released](examples/wav2vec/README.md)
- April 2019: [fairseq demo paper @ NAACL 2019](https://arxiv.org/abs/1904.01038)
### Features:
@@ -31,6 +31,7 @@ Fairseq provides reference implementations of various sequence-to-sequence model
- [Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)](examples/language_model/transformer_lm/README.md)
- [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
- [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
**Additionally:**
- multi-GPU (distributed) training on one machine or across multiple machines
@@ -49,38 +50,33 @@ translation and language modeling datasets.
# Requirements and Installation
* [PyTorch](http://pytorch.org/) version >= 1.1.0
* Python version >= 3.5
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* **For faster training** install NVIDIA's [apex](https://github.com/NVIDIA/apex) library with the `--cuda_ext` option
To install fairseq:
```bash
pip install fairseq
```
On MacOS:
```bash
CFLAGS="-stdlib=libc++" pip install fairseq
```
If you use Docker make sure to increase the shared memory size either with
`--ipc=host` or `--shm-size` as command line options to `nvidia-docker run`.
**Installing from source**
To install fairseq from source and develop locally:
```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
```
# Getting Started
The [full documentation](https://fairseq.readthedocs.io/) contains instructions
@@ -93,9 +89,10 @@ We provide pre-trained models and pre-processed, binarized test sets for several
as well as example training and evaluation commands.
- [Translation](examples/translation/README.md): convolutional and transformer models are available
- [Language Modeling](examples/language_model/README.md): convolutional and transformer models are available
We also have more detailed READMEs to reproduce results from specific papers:
- [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
- [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](examples/wav2vec/README.md)
- [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
......
@@ -27,58 +27,57 @@ en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperatur
# "Barack Obama is coming to Sydney and New Zealand (...)"
```
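The diff elides the top of this example, which loads the English LM through torch.hub. A hedged reconstruction follows; the hub model name (`transformer_lm.wmt19.en`) and keyword arguments are assumed from fairseq's WMT'19 release rather than taken verbatim from this diff:
```python
import torch

# Assumed hub entry point; see the WMT'19 release notes for the exact names.
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en',
                       tokenizer='moses', bpe='fastbpe')
en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
# "Barack Obama is coming to Sydney and New Zealand (...)"
```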
## Training a transformer language model with the CLI tools
### 1) Preprocess the data
First download and prepare the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):
```bash
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
```
Next preprocess/binarize the data:
```bash
TEXT=examples/language_model/wikitext-103
fairseq-preprocess \
--only-source \
--trainpref $TEXT/wiki.train.tokens \
--validpref $TEXT/wiki.valid.tokens \
--testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103 \
--workers 20
```
### 2) Train a language model
Next we'll train a transformer language model using [adaptive inputs](transformer_lm/README.md):
```bash
fairseq-train --task language_modeling \
data-bin/wikitext-103 \
--save-dir checkpoints/transformer_wikitext-103 \
--arch transformer_lm_wiki103 \
--max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
--warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
--sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d
```
If the above command runs out of memory, try reducing `--max-tokens` (max number
of tokens per batch) or `--tokens-per-sample` (max sequence length). You can
also increase `--update-freq` to accumulate gradients and simulate training on
more GPUs.
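To make the `--update-freq` trick concrete, here is a schematic of gradient accumulation (a toy loop with stand-in model and data, not fairseq's actual trainer):
```python
import torch

# Toy stand-ins for the real language model and data loader.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [torch.randn(8, 10) for _ in range(6)]
update_freq = 3  # accumulate gradients over 3 batches per optimizer step

for i, batch in enumerate(batches):
    loss = model(batch).pow(2).mean()
    (loss / update_freq).backward()  # scale so the sum matches one large batch
    if (i + 1) % update_freq == 0:
        optimizer.step()
        optimizer.zero_grad()
```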
### 3) Evaluate
```bash
fairseq-eval-lm data-bin/wikitext-103 \
    --path checkpoints/transformer_wikitext-103/checkpoint_best.pt \
--sample-break-mode complete --max-tokens 3072 \
--context-window 2560 --softmax-batch 1024
```
## Convolutional language models
Please see the [convolutional LM README](conv_lm/README.md) for instructions to
train convolutional language models.
@@ -2,8 +2,27 @@
## Example usage
First download and preprocess the data following the main [language modeling
README](../README.md).
Then to train a convolutional LM using the `fconv_lm_dauphin_wikitext103`
architecture:
```bash
fairseq-train --task language_modeling \
data-bin/wikitext-103 \
--save-dir checkpoints/fconv_wikitext-103 \
--arch fconv_lm_dauphin_wikitext103 \
    --max-epoch 35 --optimizer nag \
--lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
--ddp-backend=no_c10d
```
And evaluate with:
```bash
fairseq-eval-lm data-bin/wikitext-103 --path checkpoints/fconv_wikitext-103/checkpoint_best.pt
```
## Citation
......
# Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)
## Pre-trained models
......
@@ -8,7 +8,7 @@ representations through a fully-connected layer to predict the correct answer.
We train with a standard cross-entropy loss.
We also found it helpful to prepend a prefix of `Q:` to the question and `A:` to
the answer. The complete input format is:
```
<s> Q: Where would I not want a fox? </s> A: hen house </s>
```
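For illustration only, here is a hypothetical helper that maps a question/candidate pair onto this format (the finetuning task builds these inputs internally):
```python
def format_example(question: str, answer: str) -> str:
    # <s> and </s> are RoBERTa's BOS/EOS symbols.
    return f"<s> Q: {question} </s> A: {answer} </s>"

print(format_example("Where would I not want a fox?", "hen house"))
# <s> Q: Where would I not want a fox? </s> A: hen house </s>
```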
@@ -18,7 +18,7 @@ Our final submission is based on a hyperparameter search over the learning rate
4000) and random seed. We selected the model with the best performance on the
development set after 100 trials.
### 1) Download data from the Commonsense QA website (https://www.tau-nlp.org/commonsenseqa)
```bash
bash examples/roberta/commonsense_qa/download_cqa_data.sh
```
......
@@ -2,20 +2,24 @@
https://arxiv.org/abs/1907.11692
### Introduction
RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.
### What's New:
- August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).
### Pre-trained models
Model | Description | # params | Download
---|---|---|---
`roberta.base` | RoBERTa using the BERT-base architecture | 125M | [roberta.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz)
`roberta.large` | RoBERTa using the BERT-large architecture | 355M | [roberta.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz)
`roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](README.wsc.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)
### Results
##### Results on GLUE tasks (dev set, single model, single-task finetuning)
@@ -44,7 +48,7 @@ Model | Accuracy | Middle | High
---|---|---|---
`roberta.large` | 83.2 | 86.5 | 81.3
### Example usage
##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```
##### Load RoBERTa (for PyTorch 1.0 or custom models):
```python
# Download roberta.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
tar -xzvf roberta.large.tar.gz
# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large', checkpoint_file='model.pt')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```
@@ -120,7 +124,7 @@ roberta.cuda()
roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```
### Advanced usage
#### Filling masks:
@@ -212,8 +216,7 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
```
### Finetuning
- [Finetuning on GLUE](README.glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md)
@@ -221,15 +224,11 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
- [Finetuning on Commonsense QA (CQA)](README.cqa.md)
- Finetuning on SQuAD: coming soon
### Pretraining using your own data
See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).
### Citation
```bibtex
@article{liu2019roberta,
......
# Pretraining RoBERTa using your own data
This tutorial will walk you through pretraining RoBERTa over your own data.
### 1) Preprocess the data
Data should be preprocessed following the [language modeling format](/examples/language_model).
We'll use the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
to demonstrate how to preprocess raw text data with the GPT-2 BPE. Of course
this dataset is quite small, so the resulting pretrained model will perform
poorly, but it gives the general idea.
First download the dataset:
```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
Next encode it with the GPT-2 BPE:
```bash
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json gpt2_bpe/encoder.json \
--vocab-bpe gpt2_bpe/vocab.bpe \
--inputs wikitext-103-raw/wiki.${SPLIT}.raw \
--outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
--keep-empty \
--workers 60; \
done
```
Finally preprocess/binarize the data using the GPT-2 fairseq dictionary:
```bash
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
--only-source \
--srcdict gpt2_bpe/dict.txt \
--trainpref wikitext-103-raw/wiki.train.bpe \
--validpref wikitext-103-raw/wiki.valid.bpe \
--testpref wikitext-103-raw/wiki.test.bpe \
--destdir data-bin/wikitext-103 \
--workers 60
```
### 2) Train RoBERTa base
```bash
TOTAL_UPDATES=125000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16 # Number of sequences per batch (batch size)
UPDATE_FREQ=16 # Increase the batch size 16x
DATA_DIR=data-bin/wikitext-103
fairseq-train --fp16 $DATA_DIR \
--task masked_lm --criterion masked_lm \
--arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1
```
The above command assumes training on 8x32GB V100 GPUs. Each GPU uses a batch
size of 16 sequences (`$MAX_SENTENCES`) and accumulates gradients to further
increase the batch size by 16x (`$UPDATE_FREQ`), for a total batch size of 2048
sequences. If you have fewer GPUs or GPUs with less memory you may need to
reduce `$MAX_SENTENCES` and increase `$UPDATE_FREQ` to compensate. Alternatively
if you have more GPUs you can decrease `$UPDATE_FREQ` accordingly to increase
training speed.
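As a back-of-the-envelope check of those numbers (plain arithmetic, not part of the training command):
```python
num_gpus = 8
max_sentences = 16  # $MAX_SENTENCES: sequences per GPU per forward pass
update_freq = 16    # $UPDATE_FREQ: gradient accumulation steps

effective_batch = num_gpus * max_sentences * update_freq
assert effective_batch == 2048  # total sequences per optimizer step

# With only 2 GPUs, keep the same effective batch size by raising update_freq:
update_freq_2gpus = effective_batch // (2 * max_sentences)
assert update_freq_2gpus == 64
```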
Also note that the learning rate and batch size are tightly connected and need
to be adjusted together. We generally recommend increasing the learning rate as
you increase the batch size according to the following table (although it's also
dataset dependent, so don't rely on the following values too closely):
batch size | peak learning rate
---|---
256 | 0.0001
2048 | 0.0005
8192 | 0.0007
### 3) Load your pretrained model
```python
import torch
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data')
assert isinstance(roberta.model, torch.nn.Module)
```
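As a quick smoke test of the loaded checkpoint, you can extract features for a sentence; `encode` and `extract_features` are assumed here from the hub-interface examples in the main RoBERTa README:
```python
# Continues from the snippet above; assumed hub-interface methods.
tokens = roberta.encode('Hello world!')      # BPE-encode and add <s>/</s>
features = roberta.extract_features(tokens)  # last-layer representations
print(features.shape)                        # e.g. torch.Size([1, 5, 768]) for roberta_base
```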
@@ -4,21 +4,23 @@ The following instructions can be used to finetune RoBERTa on the WSC training
data provided by [SuperGLUE](https://super.gluebenchmark.com/).
Note that there is high variance in the results. For our GLUE/SuperGLUE
submission we swept over the learning rate (1e-5, 2e-5, 3e-5), batch size (16,
32, 64) and total number of updates (500, 1000, 2000, 3000), as well as the
random seed. Out of ~100 runs we chose the best 7 models and ensembled them.
**Approach:** The instructions below use a slightly different loss function than
what's described in the original RoBERTa arXiv paper. In particular,
[Kocijan et al. (2019)](https://arxiv.org/abs/1905.06290) introduce a margin
ranking loss between `(query, candidate)` pairs with tunable hyperparameters
alpha and beta. This is supported in our code as well with the `--wsc-alpha` and
`--wsc-beta` arguments. However, we achieved slightly better (and more robust)
results on the development set by instead using a single cross entropy loss term
over the log-probabilities for the query and all mined candidates. **The
candidates are mined using spaCy from each input sentence in isolation, so the
approach remains strictly pointwise.** This reduces the number of
hyperparameters and our best model achieved 92.3% development set accuracy,
compared to ~90% accuracy for the margin loss. Later versions of the RoBERTa
arXiv paper will describe this updated formulation.
### 1) Download the WSC data from the SuperGLUE website:
```bash
......