# RoBERTa: A Robustly Optimized BERT Pretraining Approach

https://arxiv.org/abs/1907.11692

### Introduction

RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.

### What's New:

- August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

### Pre-trained models

Model | Description | # params | Download
---|---|---|---
`roberta.base` | RoBERTa using the BERT-base architecture | 125M | [roberta.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz)
`roberta.large` | RoBERTa using the BERT-large architecture | 355M | [roberta.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz)
`roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](README.wsc.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)

### Results

##### Results on GLUE tasks (dev set, single model, single-task finetuning)

Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
---|---|---|---|---|---|---|---|---
`roberta.base` | 87.6 | 92.8 | 91.9 | 78.7 | 94.8 | 90.2 | 63.6 | 91.2
`roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4
`roberta.large.mnli` | 90.2 | - | - | - | - | - | - | -

##### Results on SuperGLUE tasks (dev set, single model, single-task finetuning)

Model | BoolQ | CB | COPA | MultiRC | RTE | WiC | WSC
---|---|---|---|---|---|---|---
`roberta.large` | 86.9 | 98.2 | 94.0 | 85.7 | 89.5 | 75.6 | -
`roberta.large.wsc` | - | - | - | - | - | - | 91.3

##### Results on SQuAD (dev set)

Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1
---|---|---
`roberta.large` | 88.9/94.6 | 86.5/89.4

##### Results on Reading Comprehension (RACE, test set)

Model | Accuracy | Middle | High
---|---|---|---
`roberta.large` | 83.2 | 86.5 | 81.3

### Example usage

##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Load RoBERTa (for PyTorch 1.0 or custom models):
```bash
# Download roberta.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
tar -xzvf roberta.large.tar.gz
```

```python
# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large', checkpoint_file='model.pt')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Apply Byte-Pair Encoding (BPE) to input text:
```python
tokens = roberta.encode('Hello world!')
assert tokens.tolist() == [0, 31414, 232, 328, 2]
roberta.decode(tokens)  # 'Hello world!'
```
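
The ids at the start and end of the encoded tensor above are special tokens. The following is a minimal sketch of that layout, not fairseq's implementation: it assumes `0` is `<s>` and `2` is `</s>` (as the assertion above suggests) and that sentence pairs are laid out as `<s> a </s> </s> b </s>`. The helper name `add_special_tokens` and the second id list are hypothetical.

```python
# Special-token ids as they appear in the encode() output above.
BOS, EOS = 0, 2  # assumed to correspond to <s> and </s>


def add_special_tokens(first, second=None):
    """Wrap BPE ids as <s> a </s>, or <s> a </s> </s> b </s> for a pair."""
    ids = [BOS] + list(first) + [EOS]
    if second is not None:
        # Assumed pair layout: an extra </s> separates the two segments.
        ids += [EOS] + list(second) + [EOS]
    return ids


hello_world = [31414, 232, 328]  # BPE ids for 'Hello world!' from the example above
print(add_special_tokens(hello_world))
# [0, 31414, 232, 328, 2]

# Hypothetical second segment, reusing one id purely for illustration.
print(add_special_tokens(hello_world, [31414]))
# [0, 31414, 232, 328, 2, 2, 31414, 2]
```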

##### Extract features from RoBERTa:
```python
# Extract the last layer's features
last_layer_features = roberta.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 5, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```

##### Use RoBERTa for sentence-pair classification tasks:
```python
# Download RoBERTa already finetuned for MNLI
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()  # disable dropout for evaluation

# Encode a pair of sentences and make a prediction
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction

# Encode another pair of sentences
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.')
roberta.predict('mnli', tokens).argmax()  # 2: entailment
```
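
The `argmax()` calls above return a class index. A small sketch of turning the model output into a label and a probability follows. It assumes `predict` returns log-probabilities and uses the label order from the MNLI evaluation snippet in this README; the helper name `describe_prediction` and the input values are hypothetical.

```python
import math

# Label order taken from the MNLI evaluation snippet in this README.
LABEL_MAP = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}


def describe_prediction(logprobs):
    """Return (label, probability) for the most likely class."""
    probs = [math.exp(lp) for lp in logprobs]  # log-probabilities -> probabilities
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]


# Hypothetical log-probabilities, e.g. roberta.predict('mnli', tokens)[0].tolist()
example = [math.log(0.9), math.log(0.07), math.log(0.03)]
label, prob = describe_prediction(example)
print(label, round(prob, 2))  # contradiction 0.9
```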

##### Register a new (randomly initialized) classification head:
```python
roberta.register_classification_head('new_task', num_classes=3)
logprobs = roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], grad_fn=<LogSoftmaxBackward>)
```

##### Batched prediction:
```python
from fairseq.data.data_utils import collate_tokens
sentences = ['Hello world.', 'Another unrelated sentence.']
batch = collate_tokens([roberta.encode(sent) for sent in sentences], pad_idx=1)
logprobs = roberta.predict('new_task', batch)
assert logprobs.size() == torch.Size([2, 3])
```
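
Batching requires every sequence in the batch to have the same length. As a rough illustration of the idea (a plain-Python sketch, not fairseq's `collate_tokens`), shorter sequences can be right-padded with the pad index. The helper name `pad_batch` and the second id list are hypothetical.

```python
PAD_IDX = 1  # same pad index passed to collate_tokens in the example above


def pad_batch(sequences, pad_idx=PAD_IDX):
    """Right-pad every sequence with pad_idx up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_idx] * (max_len - len(seq)) for seq in sequences]


# The first list is the 'Hello world!' encoding; the second is made up.
batch = pad_batch([[0, 31414, 232, 328, 2], [0, 31414, 2]])
print(batch)  # [[0, 31414, 232, 328, 2], [0, 31414, 2, 1, 1]]
```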

##### Using the GPU:
```python
roberta.cuda()
roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```

### Advanced usage

#### Filling masks:

RoBERTa can be used to fill `<mask>` tokens in the input. Some examples from the
[Natural Questions dataset](https://ai.google.com/research/NaturalQuestions/):
```python
roberta.fill_mask('The first Star wars movie came out in <mask>', topk=3)
# [('The first Star wars movie came out in 1977', 0.9504712224006653), ('The first Star wars movie came out in 1978', 0.009986752644181252), ('The first Star wars movie came out in 1979', 0.00957468245178461)]

roberta.fill_mask('Vikram samvat calender is official in <mask>', topk=3)
# [('Vikram samvat calender is official in India', 0.21878768503665924), ('Vikram samvat calender is official in Delhi', 0.08547217398881912), ('Vikram samvat calender is official in Gujarat', 0.07556255906820297)]

roberta.fill_mask('<mask> is the common currency of the European Union', topk=3)
# [('Euro is the common currency of the European Union', 0.945650577545166), ('euro is the common currency of the European Union', 0.025747718289494514), ('€ is the common currency of the European Union', 0.011183015070855618)]
```
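
The results above are lists of `(filled sentence, score)` tuples. One possible way to consume them is sketched below: keep the top completion only when the model is reasonably confident. The helper name `best_completion` and the `min_score` threshold are hypothetical, not part of fairseq.

```python
# Output copied from the first fill_mask example above.
results = [
    ('The first Star wars movie came out in 1977', 0.9504712224006653),
    ('The first Star wars movie came out in 1978', 0.009986752644181252),
    ('The first Star wars movie came out in 1979', 0.00957468245178461),
]


def best_completion(results, min_score=0.5):
    """Return the highest-scoring sentence, or None if it scores below min_score."""
    sentence, score = max(results, key=lambda pair: pair[1])
    return sentence if score >= min_score else None


print(best_completion(results))
# The first Star wars movie came out in 1977
```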

#### Pronoun disambiguation (Winograd Schema Challenge):

RoBERTa can be used to disambiguate pronouns. First install spaCy and download the English-language model:
```bash
pip install spacy
python -m spacy download en_core_web_lg
```

Next load the `roberta.large.wsc` model and call the `disambiguate_pronoun`
function. The pronoun should be surrounded by square brackets (`[]`) and the
query referent surrounded by underscores (`_`), or left blank to return the
predicted candidate text directly:
```python
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.wsc', user_dir='examples/roberta/wsc')
roberta.cuda()  # use the GPU (optional)

roberta.disambiguate_pronoun('The _trophy_ would not fit in the brown suitcase because [it] was too big.')
# True
roberta.disambiguate_pronoun('The trophy would not fit in the brown _suitcase_ because [it] was too big.')
# False

roberta.disambiguate_pronoun('The city councilmen refused the demonstrators a permit because [they] feared violence.')
# 'The city councilmen'
roberta.disambiguate_pronoun('The city councilmen refused the demonstrators a permit because [they] advocated violence.')
# 'demonstrators'
```
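
To make the input markup concrete, here is a small standalone sketch that splits an annotated sentence into plain text, the bracketed pronoun, and the optional underscored query. It is an illustration only and not the parser used by the model; the helper name `parse_wsc_markup` is hypothetical, and it assumes underscores appear only as query delimiters.

```python
import re

PRONOUN_RE = re.compile(r'\[([^\]]+)\]')  # e.g. [it]
QUERY_RE = re.compile(r'_([^_]+)_')       # e.g. _trophy_


def parse_wsc_markup(sentence):
    """Return (plain_text, pronoun, query); query is None when not marked."""
    pronoun = PRONOUN_RE.search(sentence)
    if pronoun is None:
        raise ValueError('expected a pronoun in square brackets')
    query = QUERY_RE.search(sentence)
    # Strip both kinds of markers to recover the plain sentence.
    plain = QUERY_RE.sub(r'\1', PRONOUN_RE.sub(r'\1', sentence))
    return plain, pronoun.group(1), query.group(1) if query else None


print(parse_wsc_markup('The _trophy_ would not fit in the brown suitcase because [it] was too big.'))
# ('The trophy would not fit in the brown suitcase because it was too big.', 'it', 'trophy')
```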

See the [RoBERTa Winograd Schema Challenge (WSC) README](README.wsc.md) for more details on how to train this model.

#### Extract features aligned to words:

By default RoBERTa outputs one feature vector per BPE token. You can instead
realign the features to match [spaCy's word-level tokenization](https://spacy.io/usage/linguistic-features#tokenization)
with the `extract_features_aligned_to_words` method. This will compute a
weighted average of the BPE-level features for each word and expose them in
spaCy's `Token.vector` attribute:
```python
doc = roberta.extract_features_aligned_to_words('I said, "hello RoBERTa."')
assert len(doc) == 10
for tok in doc:
    print('{:10}{} (...)'.format(str(tok), tok.vector[:5]))
# <s>       tensor([-0.1316, -0.0386, -0.0832, -0.0477,  0.1943], grad_fn=<SliceBackward>) (...)
# I         tensor([ 0.0559,  0.1541, -0.4832,  0.0880,  0.0120], grad_fn=<SliceBackward>) (...)
# said      tensor([-0.1565, -0.0069, -0.8915,  0.0501, -0.0647], grad_fn=<SliceBackward>) (...)
# ,         tensor([-0.1318, -0.0387, -0.0834, -0.0477,  0.1944], grad_fn=<SliceBackward>) (...)
# "         tensor([-0.0486,  0.1818, -0.3946, -0.0553,  0.0981], grad_fn=<SliceBackward>) (...)
# hello     tensor([ 0.0079,  0.1799, -0.6204, -0.0777, -0.0923], grad_fn=<SliceBackward>) (...)
# RoBERTa   tensor([-0.2339, -0.1184, -0.7343, -0.0492,  0.5829], grad_fn=<SliceBackward>) (...)
# .         tensor([-0.1341, -0.1203, -0.1012, -0.0621,  0.1892], grad_fn=<SliceBackward>) (...)
# "         tensor([-0.1341, -0.1203, -0.1012, -0.0621,  0.1892], grad_fn=<SliceBackward>) (...)
# </s>      tensor([-0.0930, -0.0392, -0.0821,  0.0158,  0.0649], grad_fn=<SliceBackward>) (...)
```
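
The realignment idea can be sketched without the model. The example below pools BPE-level vectors into one vector per word using uniform weights, which is a simplification: the weights the library actually uses are not reproduced here. The helper name `align_to_words`, the toy vectors, and the alignment are hypothetical.

```python
def align_to_words(bpe_features, alignment):
    """Average the BPE-level vectors that belong to each word.

    alignment[i] lists the indices of the BPE tokens that make up word i.
    """
    word_features = []
    for bpe_indices in alignment:
        dim = len(bpe_features[bpe_indices[0]])
        mean = [
            sum(bpe_features[j][d] for j in bpe_indices) / len(bpe_indices)
            for d in range(dim)
        ]
        word_features.append(mean)
    return word_features


# Toy 2-dimensional features for three BPE tokens; word 1 spans tokens 1 and 2.
bpe_features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
alignment = [[0], [1, 2]]
print(align_to_words(bpe_features, alignment))  # [[1.0, 0.0], [4.0, 3.0]]
```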

#### Evaluating the `roberta.large.mnli` model:

Example Python code snippet to evaluate accuracy on the MNLI `dev_matched` set:
```python
label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
ncorrect, nsamples = 0, 0
roberta.cuda()
roberta.eval()
with open('glue_data/MNLI/dev_matched.tsv') as fin:
    fin.readline()
    for index, line in enumerate(fin):
        tokens = line.strip().split('\t')
        sent1, sent2, target = tokens[8], tokens[9], tokens[-1]
        tokens = roberta.encode(sent1, sent2)
        prediction = roberta.predict('mnli', tokens).argmax().item()
        prediction_label = label_map[prediction]
        ncorrect += int(prediction_label == target)
        nsamples += 1
print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
```
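
The row parsing and the accuracy computation in the loop above can be checked without a GPU or a model. The sketch below factors them into two small helpers, assuming the same column layout as the snippet (sentences in columns 8 and 9, gold label in the last column). The helper names and the sample row are hypothetical.

```python
def parse_mnli_row(line):
    """Return (sentence1, sentence2, gold_label) from one TSV row."""
    fields = line.strip().split('\t')
    return fields[8], fields[9], fields[-1]


def accuracy(predicted_labels, gold_labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)


# Made-up row with 12 tab-separated fields, for illustration only.
row = '\t'.join(['x'] * 8 + ['A premise.', 'A hypothesis.', 'x', 'neutral'])
print(parse_mnli_row(row))  # ('A premise.', 'A hypothesis.', 'neutral')
print(accuracy(['neutral', 'entailment'], ['neutral', 'contradiction']))  # 0.5
```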

### Finetuning

- [Finetuning on GLUE](README.glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md)
- [Finetuning on Winograd Schema Challenge (WSC)](README.wsc.md)
- [Finetuning on Commonsense QA (CQA)](README.cqa.md)
- Finetuning on SQuAD: coming soon

### Pretraining using your own data

See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

### Citation

```bibtex
@article{liu2019roberta,
    title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
    author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
              Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
              Luke Zettlemoyer and Veselin Stoyanov},
    journal = {arXiv preprint arXiv:1907.11692},
    year = {2019},
}
```