"vscode:/vscode.git/clone" did not exist on "ceb105a7802329165a14337eb46c6da6e094f124"
# RoBERTa: A Robustly Optimized BERT Pretraining Approach

https://arxiv.org/abs/1907.11692

## Introduction

**RoBERTa** iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.
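
To illustrate the last point, here is a minimal sketch of dynamic masking in plain Python. It is not fairseq's implementation: the function name `dynamic_mask`, its parameters, and the token ids used are hypothetical, and it assumes the usual BERT recipe (select roughly 15% of tokens; of those, 80% become the mask token, 10% a random token, 10% stay unchanged) with a simplified independent per-token selection.

```python
import random

def dynamic_mask(token_ids, mask_id, vocab_size, special_ids=(), mask_prob=0.15, rng=random):
    """Return (masked_ids, targets) with a freshly sampled mask.

    targets[i] holds the original token at positions chosen for prediction
    and None everywhere else.
    """
    masked = list(token_ids)
    targets = [None] * len(masked)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select the rest with probability mask_prob.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, targets

# Hypothetical sequence: 0 and 2 stand in for the begin/end-of-sentence tokens.
sequence = [0] + list(range(100, 120)) + [2]

# Static masking would sample the mask once during preprocessing and reuse it.
# Dynamic masking samples a new mask every time the sequence is served.
for epoch in range(3):
    masked, targets = dynamic_mask(sequence, mask_id=3, vocab_size=1000, special_ids=(0, 2))
```

Because the mask is resampled on every call, the model sees different prediction targets for the same sentence across epochs.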

## Pre-trained models

Model | Description | # params | Download
---|---|---|---
`roberta.base` | RoBERTa using the BERT-base architecture | 125M | [roberta.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz)
`roberta.large` | RoBERTa using the BERT-large architecture | 355M | [roberta.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz)
`roberta.large.mnli` | `roberta.large` finetuned on MNLI | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)

## Results

##### Results on GLUE tasks (dev set, single model, single-task finetuning)

Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
---|---|---|---|---|---|---|---|---
`roberta.base` | 87.6 | 92.8 | 91.9 | 78.7 | 94.8 | 90.2 | 63.6 | 91.2
`roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4
`roberta.large.mnli` | 90.2 | - | - | - | - | - | - | -

##### Results on SQuAD (dev set)

Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1
---|---|---
`roberta.large` | 88.9/94.6 | 86.5/89.4

##### Results on Reading Comprehension (RACE, test set)

Model | Accuracy | Middle | High
---|---|---|---
`roberta.large` | 83.2 | 86.5 | 81.3

## Example usage

##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Load RoBERTa (for PyTorch 1.0):
```python
# First download and extract the roberta.large model (shell commands):
#   wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
#   tar -xzvf roberta.large.tar.gz

# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Apply Byte-Pair Encoding (BPE) to input text:
```python
tokens = roberta.encode('Hello world!')
assert tokens.tolist() == [0, 31414, 232, 328, 2]
roberta.decode(tokens)  # 'Hello world!'
```

##### Extract features from RoBERTa:
```python
# Extract the last layer's features
last_layer_features = roberta.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 5, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```

##### Use RoBERTa for sentence-pair classification tasks:
```python
# Download RoBERTa already finetuned for MNLI
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()  # disable dropout for evaluation

# Encode a pair of sentences and make a prediction
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction

# Encode another pair of sentences
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.')
roberta.predict('mnli', tokens).argmax()  # 2: entailment
```

##### Register a new (randomly initialized) classification head:
```python
roberta.register_classification_head('new_task', num_classes=3)
logprobs = roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], grad_fn=<LogSoftmaxBackward>)
```
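
The `LogSoftmaxBackward` in the output indicates that `predict` returns log-probabilities. As a small sketch in plain Python (no model required), the sample values above can be converted to probabilities by exponentiating. With a randomly initialized 3-class head they sit close to the uniform value `log(1/3) ≈ -1.0986`, so the predicted class is essentially arbitrary until the head is finetuned.

```python
import math

# Log-probabilities copied from the sample output above.
logprobs = [-1.1050, -1.0672, -1.1245]

# exp() turns log-probabilities into probabilities that sum to ~1.
probs = [math.exp(lp) for lp in logprobs]
total = sum(probs)

# Index of the most likely class, the plain-Python analogue of .argmax().
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```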

##### Batched prediction:
```python
from fairseq.data.data_utils import collate_tokens
sentences = ['Hello world.', 'Another unrelated sentence.']
batch = collate_tokens([roberta.encode(sent) for sent in sentences], pad_idx=1)
logprobs = roberta.predict('new_task', batch)
assert logprobs.size() == torch.Size([2, 3])
```
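
For intuition, batching requires all sequences to share one length, so shorter ones are filled with the padding index (`pad_idx=1` above). The following is a simplified pure-Python sketch of that idea, assuming right padding; `pad_batch` is a hypothetical helper rather than fairseq's `collate_tokens`, and the ids in the second sequence are made up.

```python
def pad_batch(sequences, pad_idx):
    """Right-pad every sequence with pad_idx up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_idx] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch(
    [
        [0, 31414, 232, 328, 2],  # 'Hello world!' as encoded earlier
        [0, 100, 2],              # hypothetical shorter sequence
    ],
    pad_idx=1,
)
```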

##### Using the GPU:
```python
roberta.cuda()
roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```

##### Evaluating the `roberta.large.mnli` model

Example Python snippet to evaluate accuracy on the MNLI `dev_matched` set:
```python
label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
ncorrect, nsamples = 0, 0
roberta.cuda()
roberta.eval()
with open('glue_data/MNLI/dev_matched.tsv') as fin:
    fin.readline()  # skip the header row
    for line in fin:
        fields = line.strip().split('\t')
        # Columns 8 and 9 hold the sentence pair; the last column is the gold label
        sent1, sent2, target = fields[8], fields[9], fields[-1]
        tokens = roberta.encode(sent1, sent2)
        prediction = roberta.predict('mnli', tokens).argmax().item()
        prediction_label = label_map[prediction]
        ncorrect += int(prediction_label == target)
        nsamples += 1
print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
```
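
The same loop can be factored so that the TSV parsing and accuracy bookkeeping are testable without a GPU or the model. This is only a sketch: `mnli_accuracy` and the `predict_label` callback are hypothetical names, and the column positions (8, 9, and last) are taken from the snippet above.

```python
def mnli_accuracy(lines, predict_label):
    """Compute accuracy over MNLI-style TSV rows.

    lines: iterable of tab-separated rows, header row first.
    predict_label: hypothetical callback mapping (sent1, sent2) to a label string.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header row
    ncorrect = nsamples = 0
    for line in rows:
        fields = line.rstrip('\n').split('\t')
        sent1, sent2, target = fields[8], fields[9], fields[-1]
        ncorrect += int(predict_label(sent1, sent2) == target)
        nsamples += 1
    return ncorrect / nsamples if nsamples else 0.0
```

With the model loaded as above, the callback could be `lambda a, b: label_map[roberta.predict('mnli', roberta.encode(a, b)).argmax().item()]`, passing the open file object as `lines`.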

## Finetuning

- [Finetuning on GLUE](README.finetune_glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.finetune_custom_classification.md)
- Finetuning on SQuAD: coming soon

## Pretraining using your own data

You can use the [`masked_lm` task](/fairseq/tasks/masked_lm.py) to pretrain RoBERTa from scratch, or to continue pretraining RoBERTa starting from one of the released checkpoints.

Data should be preprocessed following the [language modeling example](/examples/language_model).

A more detailed tutorial is coming soon.

## Citation

```bibtex
@article{liu2019roberta,
    title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
    author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
              Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
              Luke Zettlemoyer and Veselin Stoyanov},
    journal={arXiv preprint arXiv:1907.11692},
    year = {2019},
}
```