# Examples

This section brings together a few examples. All of them work for several models, making use of the very
similar API shared by the different models.

**Important**
To run the latest versions of the examples, you have to install the library from source and install some example-specific requirements.
Execute the following steps in a new virtual environment:

```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt
```
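
If you want to make sure the source install was picked up correctly before moving on, a quick sanity check (not part of the official setup steps) is to import the library and print its version:

```python
# Optional sanity check, not part of the setup steps above: verify that the
# source install is importable and print the version you are running.
import transformers

print(transformers.__version__)
```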

| Section                    | Description                                                                                                                                                |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| [TensorFlow 2.0 models on GLUE](#tensorflow-20-bert-models-on-glue) | Examples running the BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet and CTRL. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019). |

## TensorFlow 2.0 Bert models on GLUE

Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).

Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the  MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).

This script has an option for mixed precision (Automatic Mixed Precision / AMP), which runs models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware, and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using the `USE_XLA` or `USE_AMP` variables in the script.
These options and the benchmark below are provided by @tlkh.
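
For reference, toggling XLA and AMP in a TensorFlow 2.0 script usually comes down to a couple of global configuration calls. The snippet below is a hedged sketch of how `USE_XLA`/`USE_AMP`-style flags are typically applied; it is not a verbatim excerpt of `run_tf_glue.py`.

```python
# Hedged sketch of how XLA and AMP are typically toggled in a TF 2.0 script;
# the flag names mirror the USE_XLA / USE_AMP variables mentioned above.
import tensorflow as tf

USE_XLA = False
USE_AMP = False

# Compile the graph with the XLA compiler to reduce model runtime.
tf.config.optimizer.set_jit(USE_XLA)
# Run eligible ops in float16 on Tensor Cores (Automatic Mixed Precision).
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})
```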

Quick benchmarks from the script (no other modifications):

| GPU    | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100    | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100    | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |

Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).

## Running on TPUs

You can accelerate your workloads on Google's TPUs. For information on how to set up your TPU environment, refer to this
[README](https://github.com/pytorch/xla/blob/master/README.md).

The following are some examples of running the `*_tpu.py` fine-tuning scripts on TPUs. All steps for data preparation are
identical to your normal GPU + Hugging Face setup.

### GLUE

Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

To run your GLUE task on the MNLI dataset, you can run something like the following:

```bash
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
export GLUE_DIR=/path/to/glue
export TASK_NAME=MNLI

python run_glue_tpu.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 3e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME \
  --overwrite_output_dir \
  --logging_steps 50 \
  --save_steps 200 \
  --num_cores=8 \
  --only_log_master
```


## Language model training

Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).

Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
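
As a quick illustration of the causal objective, the minimal sketch below (assuming a PyTorch install of the library; it is not taken from the training script) shows that passing the inputs as labels to a GPT-2 LM head model is enough to obtain the CLM loss, since the model shifts the labels internally. For BERT/RoBERTa, the loss is instead computed only on randomly masked positions; see the masking sketch in the RoBERTa/BERT section below.

```python
# Minimal sketch of the causal LM objective (not the training script itself):
# with labels == inputs, GPT-2 shifts the labels internally and returns the CLM loss.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = torch.tensor([tokenizer.encode("Language modeling is fun")])
outputs = model(input_ids, labels=input_ids)  # labels == inputs -> causal LM loss
loss = outputs[0]
print(loss.item())
```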

Before running the following example, you should get a file that contains text on which the language model will be
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).

We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.

### GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE
```

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
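
As a reminder, the perplexity reported by the script is simply the exponential of the average evaluation cross-entropy loss, so a perplexity of ~20 corresponds to an evaluation loss of about 3:

```python
# Perplexity is exp(mean cross-entropy); the loss value below is a placeholder,
# not an actual result of the script.
import math

eval_loss = 3.0
print(math.exp(eval_loss))  # ~20.1
```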

### RoBERTa/BERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.

In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may therefore converge
slightly more slowly (over-fitting takes more epochs).
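
The sketch below illustrates the idea (simplified, and not a verbatim excerpt of the script): a fresh random mask is drawn every time a batch is built, instead of being fixed once during preprocessing. The actual script additionally keeps some selected tokens unchanged or replaces them with random tokens, following BERT's 80/10/10 scheme.

```python
# Simplified sketch of dynamic masking: the mask is resampled for every batch,
# so the same sentence is masked differently across epochs. BERT's 80/10/10
# replacement scheme is omitted here for brevity.
import torch

def mask_tokens(inputs: torch.Tensor, mask_token_id: int, mlm_probability: float = 0.15):
    labels = inputs.clone()
    masked_indices = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked_indices] = -100            # only masked positions contribute to the loss
    inputs = inputs.clone()
    inputs[masked_indices] = mask_token_id    # replace masked positions with [MASK]
    return inputs, labels

# Called on every batch inside the training loop, rather than once on the dataset.
```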

We use the `--mlm` flag so that the script may change its loss function.

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
    --output_dir=output \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm
```

## Language generation

Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).

Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
can try out the different models available in the library.

Example usage:

```bash
python run_generation.py \
    --model_type=gpt2 \
    --model_name_or_path=gpt2
```
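
If you prefer to drive generation from Python rather than from the script, the snippet below is a rough sketch using the library's `generate()` method (assuming a version recent enough to provide it); the prompt and sampling parameters are illustrative only.

```python
# Rough sketch of conditional generation in Python (not the run_generation.py
# script itself); the prompt and sampling parameters are only illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = torch.tensor([tokenizer.encode("The Hugging Face library is")])
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=50, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output_ids[0].tolist()))
```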

## GLUE

Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py).

Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.

GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on a single V100 GPU with a total train
batch size between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
between different runs. We report the median of 5 runs (with different seeds) for each of the metrics.

| Task  | Metric                       | Result      |
|-------|------------------------------|-------------|
| CoLA  | Matthew's corr               | 49.23       |
| SST-2 | Accuracy                     | 91.97       |
| MRPC  | F1/Accuracy                  | 89.47/85.29 |
| STS-B | Pearson/Spearman corr.       | 83.95/83.70 |
| QQP   | Accuracy/F1                  | 88.40/84.31 |
| MNLI  | Matched acc./Mismatched acc. | 80.61/81.08 |
| QNLI  | Accuracy                     | 87.46       |
| RTE   | Accuracy                     | 61.73       |
| WNLI  | Accuracy                     | 45.07       |

Some of these results are significantly different from the ones reported on the test set
of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.

Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

```bash
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
```

where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be present within the text file `eval_results.txt` in the specified `output_dir`.
In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
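
The file is a plain list of `metric = value` lines, so it can be inspected directly or parsed with a few lines of Python. The small convenience sketch below is not part of the example scripts, and the path assumes `TASK_NAME=MRPC` as above:

```python
# Convenience sketch (not part of the example scripts): parse the
# "metric = value" lines of an eval_results.txt file into a dictionary.
results = {}
with open("/tmp/MRPC/eval_results.txt") as f:
    for line in f:
        key, sep, value = line.partition(" = ")
        if sep:
            results[key.strip()] = float(value)
print(results)
```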

The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks as well,
since the data processor for each task inherits from the base class `DataProcessor`.

### MRPC

#### Fine-tuning example

The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less
than 10 minutes on a single K-80 and in 27 seconds (!) on a single Tesla V100 16GB with Apex installed.

Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

```bash
export GLUE_DIR=/path/to/glue

python run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/
```

Our test ran on a few seeds with [the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) and gave evaluation
results between 84% and 88%.

#### Using Apex and mixed-precision

Using Apex and 16-bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
[apex](https://github.com/NVIDIA/apex), then run the following example:

```bash
export GLUE_DIR=/path/to/glue

python run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/ \
  --fp16
```

#### Distributed training

Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model and it
reaches F1 > 92 on MRPC.

```bash
export GLUE_DIR=/path/to/glue

python -m torch.distributed.launch \
    --nproc_per_node 8 run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --task_name MRPC \
    --do_train \
    --do_eval \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/
```

Training with these hyper-parameters gave us the following results:

```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```

### MNLI

The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.

```bash
export GLUE_DIR=/path/to/glue

python -m torch.distributed.launch \
    --nproc_per_node 8 run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --task_name mnli \
    --do_train \
    --do_eval \
    --data_dir $GLUE_DIR/MNLI/ \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir output_dir
```

The results are the following:

```bash
***** Eval results *****
  acc = 0.8679706601466992
  eval_loss = 0.4911287787382479
  global_step = 18408
  loss = 0.04755385363816904

***** Eval results *****
  acc = 0.8747965825874695
  eval_loss = 0.45516540421714036
  global_step = 18408
  loss = 0.04755385363816904
```

## Multiple Choice

Based on the script [`run_multiple_choice.py`](https://github.com/huggingface/transformers/blob/master/examples/run_multiple_choice.py).

#### Fine-tuning on SWAG

Download the [SWAG](https://github.com/rowanz/swagaf/tree/master/data) data.

```bash
# Training on 4 Tesla V100 (16GB) GPUs
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
--model_type roberta \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output_dir
```
Training with the defined hyper-parameters yields the following results:
```bash
***** Eval results *****
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```

## SQuAD

Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).

#### Fine-tuning BERT on SQuAD1.0

This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single Tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
`$SQUAD_DIR` directory.

* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

And for SQuAD2.0, you need to download:

* [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
* [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
* [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 88.52
exact_match = 81.22
```

#### Distributed training


Here is an example using distributed training on 8 V100 GPUs and the BERT whole-word-masking uncased model to reach an F1 > 93 on SQuAD1.1:

```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
    --per_gpu_eval_batch_size=3   \
    --per_gpu_train_batch_size=3   \
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 93.15
exact_match = 86.91
```

This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.
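
As a quick illustration of how that checkpoint can be used for inference, here is a hedged sketch of extractive question answering with it; the question/context pair is made up for the example.

```python
# Hedged sketch of extractive QA with the fine-tuned checkpoint; the question
# and context below are made up for illustration.
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "Who wrote the play?"
context = "The play was written by William Shakespeare in the early 1600s."
inputs = tokenizer.encode_plus(question, context, return_tensors="pt")

with torch.no_grad():
    start_scores, end_scores = model(**inputs)

start, end = torch.argmax(start_scores), torch.argmax(end_scores) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end].tolist()))
```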

#### Fine-tuning XLNet on SQuAD

This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets. See above to download the data for SQuAD.

##### Command for SQuAD1.0:

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./wwm_cased_finetuned_squad/ \
    --per_gpu_eval_batch_size=4  \
    --per_gpu_train_batch_size=4   \
    --save_steps 5000
```

##### Command for SQuAD2.0:

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --train_file $SQUAD_DIR/train-v2.0.json \
    --predict_file $SQUAD_DIR/dev-v2.0.json \
    --learning_rate 3e-5 \
    --num_train_epochs 4 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./wwm_cased_finetuned_squad/ \
    --per_gpu_eval_batch_size=2  \
    --per_gpu_train_batch_size=2   \
    --save_steps 5000
```

A larger batch size may improve the performance while costing more memory.

##### Results for SQuAD1.0 with the previously defined hyper-parameters:

```python
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}
```

##### Results for SQuAD2.0 with the previously defined hyper-parameters:

```python
{
"exact": 80.4177545691906,
"f1": 84.07154997729623,
"total": 11873,
"HasAns_exact": 76.73751686909581,
"HasAns_f1": 84.05558584352873,
"HasAns_total": 5928,
"NoAns_exact": 84.0874684608915,
"NoAns_f1": 84.0874684608915,
"NoAns_total": 5945
}
```
## XNLI

Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).

[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

#### Fine-tuning on XNLI

This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 minutes
on a single Tesla V100 16GB. The data for XNLI can be downloaded with the following links and should both be saved (and unzipped) in a
`$XNLI_DIR` directory.

* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)

```bash
export XNLI_DIR=/path/to/XNLI

python run_xnli.py \
  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --language de \
  --train_language en \
  --do_train \
  --do_eval \
  --data_dir $XNLI_DIR \
  --per_gpu_train_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 128 \
  --output_dir /tmp/debug_xnli/ \
  --save_steps -1
```

Training with the previously defined hyper-parameters yields the following results on the **test** set:

```bash
acc = 0.7093812375249501
```

## MM-IMDb

Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/mm-imdb/run_mmimdb.py).

[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a multimodal dataset with around 26,000 movies including images, plots and other metadata.

### Training on MM-IMDb

```bash
python run_mmimdb.py \
    --data_dir /path/to/mmimdb/dataset/ \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --output_dir /path/to/save/dir/ \
    --do_train \
    --do_eval \
    --max_seq_len 512 \
    --gradient_accumulation_steps 20 \
    --num_image_embeds 3 \
    --num_train_epochs 100 \
    --patience 5
```

## Adversarial evaluation of model performances

Here is an example of evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was graciously provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).

The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).

This is an example of using `test_hans.py`:

```bash
export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py

python examples/hans/test_hans.py \
        --task_name hans \
        --model_type $MODEL_TYPE \
        --do_eval \
        --data_dir $HANS_DIR \
        --model_name_or_path $MODEL_PATH \
        --max_seq_length 128 \
        --output_dir $MODEL_PATH
```

This will create the `hans_predictions.txt` file in `MODEL_PATH`, which can then be evaluated using `hans/evaluate_heur_output.py` from the HANS dataset.

The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset are as follows:

```bash
Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962

Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118
```