<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# SQuAD

Based on the script [`run_qa.py`](https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_qa.py).

**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library), as it
uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
[this table](https://huggingface.co/transformers/index.html#supported-frameworks); if it doesn't, you can still use the old version
of the script.

The old version of this script can be found [here](https://github.com/huggingface/transformers/tree/master/examples/legacy/question-answering).

`run_qa.py` allows you to fine-tune any model from our [hub](https://huggingface.co/models) (as long as its architecture has a `ForQuestionAnswering` version in the library) on the SQuAD dataset or another question-answering dataset of the `datasets` library, or on your own csv/jsonlines files, as long as they are structured the same way as SQuAD. You might need to tweak the data processing inside the script if your data is structured differently.

Note that if your dataset contains samples with no possible answers (like SQuAD version 2), you need to pass along the flag `--version_2_with_negative`.
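Concretely, each example should carry the fields below, mirroring how the `datasets` library exposes SQuAD. This is an illustrative sketch (the id and text here are made up, not a record from the real dataset):

```python
# Illustrative sketch of one SQuAD-style example; custom csv/jsonlines files
# should provide the same fields.
example = {
    "id": "0001",  # hypothetical id
    "title": "Super_Bowl_50",
    "context": "Super Bowl 50 was an American football game won by the Denver Broncos.",
    "question": "Which NFL team won Super Bowl 50?",
    "answers": {
        "text": ["Denver Broncos"],
        # character offset of each answer inside `context`
        "answer_start": [55],
    },
}

# The answer span can be recovered from the character offset:
start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
assert example["context"][start:start + len(text)] == text
```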

## Trainer-based scripts

### Fine-tuning BERT on SQuAD1.0

This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single Tesla V100 16GB.

```bash
python run_qa.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
```
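`--max_seq_length` and `--doc_stride` control how contexts longer than the maximum length are split into overlapping windows: `doc_stride` is the number of tokens shared by consecutive windows. A simplified sketch of the resulting spans (the real preprocessing also reserves room in each window for the question and special tokens):

```python
# Simplified sketch of how a long tokenized context is split into
# overlapping windows of at most `max_len` tokens.
def chunk_spans(n_tokens, max_len=384, overlap=128):
    step = max_len - overlap  # each new window starts 256 tokens further on
    spans, start = [], 0
    while True:
        spans.append((start, min(start + max_len, n_tokens)))
        if start + max_len >= n_tokens:
            break
        start += step
    return spans

print(chunk_spans(800))  # → [(0, 384), (256, 640), (512, 800)]
```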

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 88.52
exact_match = 81.22
```

### Fine-tuning T5 on SQuAD2.0

This example code fine-tunes T5 on the SQuAD2.0 dataset.

```bash
python run_seq2seq_qa.py \
  --model_name_or_path t5-small \
  --dataset_name squad_v2 \
  --context_column context \
  --question_column question \
  --answer_column answers \
  --version_2_with_negative \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_seq2seq_squad/
```
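Under the hood, the seq2seq script casts extractive QA as text-to-text: the question and context are joined into a single input string and the target is simply the answer text. A rough sketch of this kind of formatting (the exact template lives in the script):

```python
def format_qa_input(question: str, context: str) -> str:
    # T5-style prompt: prefix each field so the model can tell them apart.
    return f"question: {question.lstrip()} context: {context.lstrip()}"

print(format_qa_input("Who won?", "The Broncos won."))
# → question: Who won? context: The Broncos won.
```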


#### Distributed training

Here is an example using distributed training on 8 V100 GPUs and the `bert-large-uncased-whole-word-masking` model to reach an F1 > 93 on SQuAD1.1:

```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/pytorch/question-answering/run_qa.py \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
    --per_device_eval_batch_size=3 \
    --per_device_train_batch_size=3
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 93.15
exact_match = 86.91
```

This fine-tuned model is available as a checkpoint under the reference
[`bert-large-uncased-whole-word-masking-finetuned-squad`](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad).

#### Fine-tuning XLNet with beam search on SQuAD

This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets.

##### Command for SQuAD1.0:

```bash
python run_qa_beam_search.py \
    --model_name_or_path xlnet-large-cased \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./wwm_cased_finetuned_squad/ \
    --per_device_eval_batch_size=4 \
    --per_device_train_batch_size=4 \
    --save_steps 5000
```

##### Command for SQuAD2.0:

```bash
python run_qa_beam_search.py \
    --model_name_or_path xlnet-large-cased \
    --dataset_name squad_v2 \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --learning_rate 3e-5 \
    --num_train_epochs 4 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./wwm_cased_finetuned_squad/ \
    --per_device_eval_batch_size=2 \
    --per_device_train_batch_size=2 \
    --save_steps 5000
```

## With Accelerate

Based on the scripts `run_qa_no_trainer.py` and `run_qa_beam_search_no_trainer.py`.

Like `run_qa.py` and `run_qa_beam_search.py`, these scripts allow you to fine-tune any of the supported models on
SQuAD or a similar dataset. The main difference is that they expose the bare training loop so you can quickly
experiment and add any customization you would like.

They offer fewer options than the `Trainer`-based scripts (in exchange, you can easily change the options for the
optimizer or the dataloaders directly in the script), but they still run in a distributed setup or on TPU and support
mixed precision by means of the [🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the
scripts normally after installing it:

```bash
pip install accelerate
```

then

```bash
python run_qa_no_trainer.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name squad \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ~/tmp/debug_squad
```

You can then use your usual launchers to run it in a distributed environment, but the easiest way is to run

```bash
accelerate config
```

and reply to the questions asked. Then

```bash
accelerate test
```

which will check that everything is ready for training. Finally, you can launch training with

```bash
accelerate launch run_qa_no_trainer.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name squad \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ~/tmp/debug_squad
```

This command is the same and will work for:

- a CPU-only setup
- a setup with one GPU
- a distributed training with several GPUs (single or multi node)
- a training on TPUs
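
For reference, `accelerate config` saves your answers to a yaml file (by default under `~/.cache/huggingface/accelerate/default_config.yaml`). An illustrative single-machine, single-GPU setup might look roughly like the following; the exact keys vary with the `Accelerate` version:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: "NO"    # MULTI_GPU or TPU for the other setups
fp16: true                # mixed precision
machine_rank: 0
num_machines: 1
num_processes: 1
```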

Note that this library is in alpha release, so your feedback is more than welcome if you encounter any problem using it.


## Results

A larger batch size may improve the performance while costing more memory.

##### Results for SQuAD1.0 with the previously defined hyper-parameters:

```python
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}
```

##### Results for SQuAD2.0 with the previously defined hyper-parameters:

```python
{
"exact": 80.4177545691906,
"f1": 84.07154997729623,
"total": 11873,
"HasAns_exact": 76.73751686909581,
"HasAns_f1": 84.05558584352873,
"HasAns_total": 5928,
"NoAns_exact": 84.0874684608915,
"NoAns_f1": 84.0874684608915,
"NoAns_total": 5945
}
```
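As a quick sanity check on any SQuAD2.0 metrics dict, the overall `exact` score is the average of the `HasAns` and `NoAns` scores weighted by their example counts:

```python
# Reproduce the overall exact-match score from the per-subset numbers above.
has_exact, has_total = 76.73751686909581, 5928
no_exact, no_total = 84.0874684608915, 5945

overall = (has_exact * has_total + no_exact * no_total) / (has_total + no_total)
print(round(overall, 4))  # → 80.4178
```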

#### Fine-tuning BERT on SQuAD1.0 with relative position embeddings

The following examples show how to fine-tune BERT models with different relative position embeddings. The BERT model
`bert-base-uncased` was pretrained with default absolute position embeddings. We provide the following pretrained
models, which were pre-trained on the same training data (BooksCorpus and English Wikipedia) as in the BERT model
training, but with different relative position embeddings.

* `zhiheng-huang/bert-base-uncased-embedding-relative-key`, trained from scratch with the relative embedding proposed by
Shaw et al., [Self-Attention with Relative Position Representations](https://arxiv.org/abs/1803.02155)
* `zhiheng-huang/bert-base-uncased-embedding-relative-key-query`, trained from scratch with relative embedding method 4
from Huang et al., [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
* `zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query`, fine-tuned from
`bert-large-uncased-whole-word-masking` for 3 additional epochs with relative embedding method 4 from Huang et al.,
[Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
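
For intuition, here is a rough NumPy sketch of the relative-key idea from Shaw et al.; the key-query "method 4" variant and the exact checkpoint implementations differ in detail. Attention logits gain an extra term from an embedding of the clipped relative distance between query and key positions:

```python
import numpy as np

def relative_key_scores(q, k, rel_emb, max_distance=16):
    """Attention logits with a Shaw-style relative-key term.

    q, k: (seq_len, head_dim); rel_emb: (2 * max_distance + 1, head_dim),
    one embedding per clipped relative distance.
    """
    n = q.shape[0]
    scores = q @ k.T  # standard content-content term
    # relative distance j - i between key and query positions, clipped
    dist = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                   -max_distance, max_distance)
    # content-position term: query i also attends to the embedding of dist(i, j)
    scores += np.einsum("id,ijd->ij", q, rel_emb[dist + max_distance])
    return scores

q, k = np.random.randn(10, 8), np.random.randn(10, 8)
rel = np.random.randn(2 * 16 + 1, 8)
assert relative_key_scores(q, k, rel).shape == (10, 10)
```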


##### Base models fine-tuning

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/pytorch/question-answering/run_qa.py \
    --model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 512 \
    --doc_stride 128 \
    --output_dir relative_squad \
    --per_device_eval_batch_size=60 \
    --per_device_train_batch_size=6
```

Training with the above command leads to the following results, boosting the default BERT f1 score of 88.52 to 90.54.

```bash
'exact': 83.6802270577105, 'f1': 90.54772098174814
```

Changing `max_seq_length` from 512 to 384 in the above command leads to an f1 score of 90.34. Replacing the model
`zhiheng-huang/bert-base-uncased-embedding-relative-key-query` with
`zhiheng-huang/bert-base-uncased-embedding-relative-key` leads to an f1 score of 89.51. Training on a single GPU
instead of 8 leads to an f1 score of 90.71.

##### Large models fine-tuning

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/pytorch/question-answering/run_qa.py \
    --model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 512 \
    --doc_stride 128 \
    --output_dir relative_squad \
    --per_device_eval_batch_size=6 \
    --per_device_train_batch_size=2 \
    --gradient_accumulation_steps 3
```

Training with the above command leads to an f1 score of 93.52, which is slightly better than the f1 score of 93.15 for
`bert-large-uncased-whole-word-masking`.
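
The large model only fits a small per-device batch, which is why gradient accumulation is used; with the settings above, each optimizer step still sees a reasonably large effective batch:

```python
# Effective batch size for the large-model command above.
n_gpus, per_device_batch, grad_accum_steps = 8, 2, 3
effective_batch_size = n_gpus * per_device_batch * grad_accum_steps
print(effective_batch_size)  # → 48
```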