Examples
================================================

.. list-table::
   :header-rows: 1

   * - Sub-section
     - Description
   * - `Training large models: introduction, tools and examples <#introduction>`_
     - How to use gradient accumulation, multi-GPU training, distributed training, optimization on CPU and 16-bits training to train BERT models
   * - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_
     - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
   * - `Fine-tuning with OpenAI GPT, Transformer-XL, GPT-2 as well as BERT and RoBERTa <#fine-tuning>`_
     - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py``\ , ``run_gpt2.py`` and ``run_lm_finetuning.py``
   * - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_
     - How to fine-tune ``BERT large``

.. _introduction:

Training large models: introduction, tools and examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BERT-base and BERT-large are respectively 110M and 340M parameter models, and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most cases a batch size of 32).

To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ : gradient accumulation, multi-GPU training, distributed training and 16-bits training. For more details on how to use these techniques you can read `the tips on training large batches in PyTorch <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_ that I published earlier this year.

Here is how to use these techniques in our scripts:


* **Gradient Accumulation**\ : Gradient accumulation can be used by supplying an integer greater than 1 to the ``--gradient_accumulation_steps`` argument. The batch at each step will be divided by this integer and gradients will be accumulated over ``gradient_accumulation_steps`` steps before each optimizer update (see the sketch below).
* **Multi-GPU**\ : Multi-GPU training is automatically activated when several GPUs are detected, and the batches are split across the GPUs.
* **Distributed training**\ : Distributed training can be activated by supplying an integer greater than or equal to 0 to the ``--local_rank`` argument (see below).
* **16-bits training**\ : 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision, which roughly lets you double the batch size. If you have a recent GPU (starting from the NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to mixed-precision training can be found `here <https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/>`__ and the full documentation is `here <https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__. In our scripts, this option can be activated by setting the ``--fp16`` flag, and you can play with loss scaling using the ``--loss_scale`` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero, in which case the scale is dynamically adjusted, or a positive power of two, in which case the scaling is static.
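
For illustration, here is a minimal sketch of how gradient accumulation and a static loss scale combine in a training loop. It is not the exact code of the fine-tuning scripts; the model, optimizer and data loader below are tiny stand-ins so that the snippet runs on its own.

.. code-block:: python

    import torch

    # Stand-ins so the sketch is self-contained; the real scripts build BERT,
    # BertAdam and a DataLoader over the task's features instead.
    model = torch.nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    train_dataloader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(20)]
    loss_fn = torch.nn.MSELoss()

    gradient_accumulation_steps = 4   # mirrors --gradient_accumulation_steps
    loss_scale = 128.0                # mirrors a static --loss_scale (use 1.0 without --fp16)

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(train_dataloader):
        loss = loss_fn(model(inputs), targets)        # forward pass on a smaller sub-batch
        loss = loss / gradient_accumulation_steps     # average over the accumulated sub-batches
        (loss * loss_scale).backward()                # scale up to avoid fp16 gradient underflow
        if (step + 1) % gradient_accumulation_steps == 0:
            for p in model.parameters():              # un-scale the gradients before the update
                if p.grad is not None:
                    p.grad.div_(loss_scale)
            optimizer.step()                          # one optimizer update per accumulated batch
            optimizer.zero_grad()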

To use 16-bits training and distributed training, you need to install NVIDIA's apex extension `as detailed here <https://github.com/nvidia/apex>`__. You will find more information regarding the internals of ``apex`` and how to use ``apex`` in `the doc and the associated repository <https://github.com/nvidia/apex>`_. The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in `the relevant PR of the present repository <https://github.com/huggingface/pytorch-pretrained-BERT/pull/116>`_.

Note: To use *Distributed Training*\ , you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see `the above-mentioned blog post <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_ for more details):

.. code-block:: bash

    python -m torch.distributed.launch \
        --nproc_per_node=4 \
        --nnodes=2 \
        --node_rank=$THIS_MACHINE_INDEX \
        --master_addr="192.168.1.1" \
        --master_port=1234 run_bert_classifier.py \
        (--arg1 --arg2 --arg3 and all other arguments of the run_bert_classifier script)

Where ``$THIS_MACHINE_INDEX`` is a sequential index assigned to each of your machines (0, 1, 2...), and the machine with rank 0 has the IP address ``192.168.1.1`` and an open port ``1234``.
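
To make the role of ``--local_rank`` and the multi-GPU fallback more concrete, here is a minimal sketch of the setup a script launched this way typically performs. It is not the exact code of the fine-tuning scripts, and the small linear layer is only a stand-in for the BERT model they build.

.. code-block:: python

    # Minimal sketch of the setup driven by --local_rank when launched with
    # torch.distributed.launch; the real scripts follow the same pattern but load BERT.
    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()

    if args.local_rank != -1:
        torch.cuda.set_device(args.local_rank)                 # one process per GPU
        torch.distributed.init_process_group(backend="nccl")   # uses MASTER_ADDR/MASTER_PORT set by the launcher
        device = torch.device("cuda", args.local_rank)
    else:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(16, 2).to(device)  # stand-in for the BERT model built by the script

    if args.local_rank != -1:
        # Each process holds a replica; gradients are all-reduced across GPUs and machines.
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank
        )
    elif torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)   # single-node multi-GPU: batches are split across GPUs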

.. _fine-tuning-bert-examples:

Fine-tuning with BERT: running the examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We showcase several fine-tuning examples based on (and extended from) `the original implementation <https://github.com/google-research/bert/>`_\ :


* a *sequence-level classifier* on nine different GLUE tasks,
* a *token-level classifier* on the question answering dataset SQuAD,
* a *sequence-level multiple-choice classifier* on the SWAG classification corpus, and
* a *BERT language model* on another target corpus.

GLUE results on dev set
~~~~~~~~~~~~~~~~~~~~~~~

We get the following results on the dev set of the GLUE benchmark with an uncased BERT-base
model. All experiments were run on a P100 GPU with a batch size of 32.

.. list-table::
   :header-rows: 1

   * - Task
     - Metric
     - Result
   * - CoLA
     - Matthew's corr.
     - 57.29
   * - SST-2
     - accuracy
     - 93.00
   * - MRPC
     - F1/accuracy
     - 88.85/83.82
   * - STS-B
     - Pearson/Spearman corr.
     - 89.70/89.37
   * - QQP
     - accuracy/F1
     - 90.72/87.41
   * - MNLI
     - matched acc./mismatched acc.
     - 83.95/84.39
   * - QNLI
     - accuracy
     - 89.04
   * - RTE
     - accuracy
     - 61.01
   * - WNLI
     - accuracy
     - 53.52


Some of these results are significantly different from the ones reported on the test set
of the GLUE benchmark on the website. For QQP and WNLI, please refer to `FAQ #12 <https://gluebenchmark.com/faq>`_ on the website.

Before running any of these GLUE tasks you should download the
`GLUE data <https://gluebenchmark.com/tasks>`_ by running
`this script <https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e>`_
and unpack it to some directory ``$GLUE_DIR``.

.. code-block:: shell

   export GLUE_DIR=/path/to/glue
   export TASK_NAME=MRPC

   python run_bert_classifier.py \
     --task_name $TASK_NAME \
     --do_train \
     --do_eval \
     --do_lower_case \
     --data_dir $GLUE_DIR/$TASK_NAME \
     --bert_model bert-base-uncased \
     --max_seq_length 128 \
     --train_batch_size 32 \
     --learning_rate 2e-5 \
     --num_train_epochs 3.0 \
     --output_dir /tmp/$TASK_NAME/

where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be present in the text file ``eval_results.txt`` in the specified ``output_dir``. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder called ``/tmp/MNLI-MM/`` in addition to ``/tmp/MNLI/``.

The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA and SST-2. The following section provides details on how to run half-precision training with MRPC. That being said, there shouldn't be any issue in running half-precision training with the remaining GLUE tasks as well, since the data processor for each task inherits from the base class ``DataProcessor``.

MRPC
~~~~

This example code fine-tunes BERT on the Microsoft Research Paraphrase
Corpus (MRPC) and runs in less than 10 minutes on a single K-80, or in 27 seconds (!) on a single Tesla V100 16GB with apex installed.

Before running this example you should download the
`GLUE data <https://gluebenchmark.com/tasks>`_ by running
`this script <https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e>`_
and unpack it to some directory ``$GLUE_DIR``.

.. code-block:: shell

   export GLUE_DIR=/path/to/glue

   python run_bert_classifier.py \
     --task_name MRPC \
     --do_train \
     --do_eval \
     --do_lower_case \
     --data_dir $GLUE_DIR/MRPC/ \
     --bert_model bert-base-uncased \
     --max_seq_length 128 \
     --train_batch_size 32 \
     --learning_rate 2e-5 \
     --num_train_epochs 3.0 \
     --output_dir /tmp/mrpc_output/

Our test ran on a few seeds with `the original implementation hyper-parameters <https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks>`__ and gave evaluation results between 84% and 88%.

**Fast run with apex and 16-bit precision: fine-tuning on MRPC in 27 seconds!**
First install apex as indicated `here <https://github.com/NVIDIA/apex>`__.
Then run:

.. code-block:: shell

   export GLUE_DIR=/path/to/glue

   python run_bert_classifier.py \
     --task_name MRPC \
     --do_train \
     --do_eval \
     --do_lower_case \
     --data_dir $GLUE_DIR/MRPC/ \
     --bert_model bert-base-uncased \
     --max_seq_length 128 \
     --train_batch_size 32 \
     --learning_rate 2e-5 \
     --num_train_epochs 3.0 \
     --output_dir /tmp/mrpc_output/ \
     --fp16

**Distributed training**
Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking model to reach an F1 > 92 on MRPC:

.. code-block:: bash

    python -m torch.distributed.launch \
        --nproc_per_node 8 run_bert_classifier.py \
        --bert_model bert-large-uncased-whole-word-masking \
        --task_name MRPC \
        --do_train \
        --do_eval \
        --do_lower_case \
        --data_dir $GLUE_DIR/MRPC/ \
        --max_seq_length 128 \
        --train_batch_size 8 \
        --learning_rate 2e-5 \
        --num_train_epochs 3.0 \
        --output_dir /tmp/mrpc_output/

Training with these hyper-parameters gave us the following results:

.. code-block:: bash

     acc = 0.8823529411764706
     acc_and_f1 = 0.901702786377709
     eval_loss = 0.3418912578906332
     f1 = 0.9210526315789473
     global_step = 174
     loss = 0.07231863956341798

Here is an example on MNLI:

.. code-block:: bash

    python -m torch.distributed.launch \
        --nproc_per_node 8 run_bert_classifier.py \
        --bert_model bert-large-uncased-whole-word-masking \
        --task_name mnli \
        --do_train \
        --do_eval \
        --do_lower_case \
        --data_dir /datadrive/bert_data/glue_data/MNLI/ \
        --max_seq_length 128 \
        --train_batch_size 8 \
        --learning_rate 2e-5 \
        --num_train_epochs 3.0 \
        --output_dir ../models/wwm-uncased-finetuned-mnli/ \
        --overwrite_output_dir

.. code-block:: bash

   ***** Eval results *****
     acc = 0.8679706601466992
     eval_loss = 0.4911287787382479
     global_step = 18408
     loss = 0.04755385363816904

   ***** Eval results *****
     acc = 0.8747965825874695
     eval_loss = 0.45516540421714036
     global_step = 18408
     loss = 0.04755385363816904

This run corresponds to the ``bert-large-uncased-whole-word-masking-finetuned-mnli`` model.

SQuAD
~~~~~

This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single Tesla V100 16GB.

The data for SQuAD can be downloaded with the following links and should be saved in a ``$SQUAD_DIR`` directory.


* `train-v1.1.json <https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json>`_
* `dev-v1.1.json <https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json>`_
* `evaluate-v1.1.py <https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py>`_

.. code-block:: shell

   export SQUAD_DIR=/path/to/SQUAD

   python run_bert_squad.py \
     --bert_model bert-base-uncased \
     --do_train \
     --do_predict \
     --do_lower_case \
     --train_file $SQUAD_DIR/train-v1.1.json \
     --predict_file $SQUAD_DIR/dev-v1.1.json \
     --train_batch_size 12 \
     --learning_rate 3e-5 \
     --num_train_epochs 2.0 \
     --max_seq_length 384 \
     --doc_stride 128 \
     --output_dir /tmp/debug_squad/

Training with the previous hyper-parameters gave us the following results:

.. code-block:: bash

   python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json /tmp/debug_squad/predictions.json
   {"f1": 88.52381567990474, "exact_match": 81.22043519394512}

**Distributed training**

Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking uncased model to reach an F1 > 93 on SQuAD:

.. code-block:: bash

   python -m torch.distributed.launch --nproc_per_node=8 \
    run_bert_squad.py \
    --bert_model bert-large-uncased-whole-word-masking  \
    --do_train \
    --do_predict \
    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
    --train_batch_size 24 \
    --gradient_accumulation_steps 12

Training with these hyper-parameters gave us the following results:

.. code-block:: bash

   python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
   {"exact_match": 86.91579943235573, "f1": 93.1532499015869}

This is the model provided as ``bert-large-uncased-whole-word-masking-finetuned-squad``.

And here is the model provided as ``bert-large-cased-whole-word-masking-finetuned-squad``\ :

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=8  run_bert_squad.py \
        --bert_model bert-large-cased-whole-word-masking \
        --do_train \
        --do_predict \
        --do_lower_case \
        --train_file $SQUAD_DIR/train-v1.1.json \
        --predict_file $SQUAD_DIR/dev-v1.1.json \
        --learning_rate 3e-5 \
        --num_train_epochs 2 \
        --max_seq_length 384 \
        --doc_stride 128 \
        --output_dir ../models/wwm_cased_finetuned_squad/ \
        --train_batch_size 24 \
        --gradient_accumulation_steps 12

Training with these hyper-parameters gave us the following results:

.. code-block:: bash

   python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_cased_finetuned_squad/predictions.json
   {"exact_match": 84.18164616840113, "f1": 91.58645594850135}

SWAG
~~~~

The data for SWAG can be downloaded by cloning the following `repository <https://github.com/rowanz/swagaf>`_.

.. code-block:: shell

   export SWAG_DIR=/path/to/SWAG

   python run_bert_swag.py \
     --bert_model bert-base-uncased \
     --do_train \
     --do_lower_case \
     --do_eval \
     --data_dir $SWAG_DIR/data \
     --train_batch_size 16 \
     --learning_rate 2e-5 \
     --num_train_epochs 3.0 \
     --max_seq_length 80 \
     --output_dir /tmp/swag_output/ \
     --gradient_accumulation_steps 4

Training with the previous hyper-parameters on a single GPU gave us the following results:

.. code-block:: bash

   eval_accuracy = 0.8062081375587323
   eval_loss = 0.5966546792367169
   global_step = 13788
   loss = 0.06423990014260186

LM Fine-tuning
~~~~~~~~~~~~~~

The data should be a text file in the same format as `sample_text.txt <./samples/sample_text.txt>`_ (one sentence per line, documents separated by an empty line).
You can download an `exemplary training corpus <https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt>`_ generated from Wikipedia articles and split into ~500k sentences with spaCy.
Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with ``train_batch_size=200`` and ``max_seq_length=128``.
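
If you want to build such a corpus yourself, here is a hypothetical snippet (not part of the repository's scripts) that writes raw documents in the expected format, using spaCy for sentence splitting and assuming the ``en_core_web_sm`` model is installed:

.. code-block:: python

    # Hypothetical corpus-preparation snippet: one sentence per line and an empty
    # line between documents, as expected by the LM fine-tuning scripts.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

    def write_corpus(documents, output_path):
        """documents: iterable of raw text strings, one string per document."""
        with open(output_path, "w", encoding="utf-8") as f:
            for doc_text in documents:
                for sent in nlp(doc_text).sents:
                    line = sent.text.strip()
                    if line:
                        f.write(line + "\n")   # one sentence per line
                f.write("\n")                  # empty line separates documents

    write_corpus(["First document. It has two sentences.", "Second document."], "corpus.txt")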

Thanks to the work of @Rocketknight1 and @tholor there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (combination of masked language modeling and next sentence prediction loss). These scripts are detailed in the `README <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning/README.md>`_ of the `examples/lm_finetuning/ <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning/>`_ folder.

.. _fine-tuning:

OpenAI GPT, Transformer-XL and GPT-2: running the examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We provide several examples of scripts for OpenAI GPT, Transformer-XL, OpenAI GPT-2, BERT and RoBERTa based on (and extended from) the respective original implementations:


* fine-tuning OpenAI GPT on the ROCStories dataset
* evaluating Transformer-XL on Wikitext 103
* unconditional and conditional generation from a pre-trained OpenAI GPT-2 model
* fine-tuning GPT/GPT-2 on a causal language modeling task and BERT/RoBERTa on a masked language modeling task

Fine-tuning OpenAI GPT on the RocStories dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This example code fine-tunes OpenAI GPT on the RocStories dataset.

Before running this example you should download the
`RocStories dataset <https://github.com/snigdhac/StoryComprehension_EMNLP/tree/master/Dataset/RoCStories>`_ and unpack it to some directory ``$ROC_STORIES_DIR``.

.. code-block:: shell

   export ROC_STORIES_DIR=/path/to/RocStories

   python run_openai_gpt.py \
     --model_name openai-gpt \
     --do_train \
     --do_eval \
     --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
     --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
     --output_dir ../log \
     --train_batch_size 16

This command runs in about 10 min on a single K-80 and gives an evaluation accuracy of about 87.7% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%).

Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This example code evaluates the pre-trained Transformer-XL on the WikiText 103 dataset.
This command will download a pre-processed version of the WikiText 103 dataset in which the vocabulary has been computed.

.. code-block:: shell

   python run_transfo_xl.py --work_dir ../log

This command runs in about 1 min on a V100 and gives an evaluation perplexity of 18.22 on WikiText-103 (the authors report a perplexity of about 18.3 on this dataset with the TensorFlow code).

Unconditional and conditional generation from OpenAI's GPT-2 model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This example code is identical to the original unconditional and conditional generation codes.

Conditional generation:

.. code-block:: shell

   python run_gpt2.py

Unconditional generation:

.. code-block:: shell

   python run_gpt2.py --unconditional

The same options as in the original scripts are provided; please refer to the code of these examples and the original repository of OpenAI for details.


Causal LM fine-tuning on GPT/GPT-2, Masked LM fine-tuning on BERT/RoBERTa
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before running the following examples you should download the `WikiText-2 dataset <https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/>`__ and unpack it to some directory ``$WIKITEXT_2_DATASET``.
The following results were obtained using the ``raw`` WikiText-2 (no tokens were replaced before the tokenization).

This example fine-tunes GPT-2 on the WikiText-2 dataset. The loss function is a causal language modeling loss (perplexity).

.. code-block:: bash

    export WIKITEXT_2_DATASET=/path/to/wikitext_dataset

    python run_lm_finetuning.py \
        --output_dir=output \
        --model_type=gpt2 \
        --model_name_or_path=gpt2 \
        --do_train \
        --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw \
        --do_eval \
        --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run.
It reaches a score of about 20 perplexity once fine-tuned on the dataset.

This example fine-tunes RoBERTa on the WikiText-2 dataset. The loss function is a masked language modeling loss (masked perplexity).
The ``--mlm`` flag is necessary to fine-tune BERT/RoBERTa on masked language modeling (a sketch of the masking it applies is shown after the command).

.. code-block:: bash

    export WIKITEXT_2_DATASET=/path/to/wikitext_dataset

    python run_lm_finetuning.py \
        --output_dir=output \
        --model_type=roberta \
        --model_name_or_path=roberta-base \
        --do_train \
        --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw \
        --do_eval \
        --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw \
        --mlm
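
For illustration, the masking applied under ``--mlm`` follows the BERT pretraining recipe. Here is a simplified sketch of what such a masking function can look like; the script's actual implementation may differ in details:

.. code-block:: python

    # Simplified sketch of the masked LM masking (BERT recipe): 15% of tokens are
    # selected; of those, 80% become [MASK], 10% become a random token and 10% are
    # left unchanged. The loss is only computed on the selected positions.
    import torch

    def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
        """inputs: LongTensor of token ids, shape (batch_size, seq_len)."""
        labels = inputs.clone()

        # Sample the positions to mask.
        probability_matrix = torch.full(labels.shape, mlm_probability)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -1  # ignored by the loss

        # 80% of the selected positions are replaced by the [MASK] token.
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

        # Half of the remaining selected positions get a random token, the rest stay unchanged.
        indices_random = (torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
                          & masked_indices & ~indices_replaced)
        random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        return inputs, labels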

.. _fine-tuning-BERT-large:

Fine-tuning BERT-large on GPUs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The options listed above make it possible to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.

For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 K-80s (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):

.. code-block:: bash

   {"exact_match": 84.56953642384106, "f1": 91.04028647786927}

To get these results we used a combination of:


* multi-GPU training (automatically activated on a multi-GPU server),
* 2 steps of gradient accumulation, and
* performing the optimization step on CPU to store Adam's averages in RAM (see the sketch below).
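
Here is a minimal sketch of the optimize-on-CPU idea: the FP32 master weights and Adam's moving averages live in CPU RAM, while only the forward/backward pass runs on the GPU. This is an illustration of the technique, not the exact code used in ``run_bert_squad.py``.

.. code-block:: python

    # Illustrative sketch of optimizing on CPU: parameters and Adam state live in RAM.
    import torch

    def build_cpu_optimizer(model, lr=3e-5):
        # FP32 master copies of the parameters, kept in CPU RAM together with Adam's averages.
        cpu_params = [p.detach().clone().float().cpu().requires_grad_(True)
                      for p in model.parameters()]
        optimizer = torch.optim.Adam(cpu_params, lr=lr)
        return cpu_params, optimizer

    def cpu_optimizer_step(model, cpu_params, optimizer):
        # Copy the gradients to the CPU master weights, update there, copy the weights back.
        for cpu_p, gpu_p in zip(cpu_params, model.parameters()):
            cpu_p.grad = gpu_p.grad.detach().float().cpu()
        optimizer.step()
        optimizer.zero_grad()
        for cpu_p, gpu_p in zip(cpu_params, model.parameters()):
            gpu_p.data.copy_(cpu_p.data.to(gpu_p.device))
        model.zero_grad()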

Here is the full list of hyper-parameters for this run:

.. code-block:: bash

   export SQUAD_DIR=/path/to/SQUAD

   python ./run_bert_squad.py \
     --bert_model bert-large-uncased \
     --do_train \
     --do_predict \
     --do_lower_case \
     --train_file $SQUAD_DIR/train-v1.1.json \
     --predict_file $SQUAD_DIR/dev-v1.1.json \
     --learning_rate 3e-5 \
     --num_train_epochs 2 \
     --max_seq_length 384 \
     --doc_stride 128 \
     --output_dir /tmp/debug_squad/ \
     --train_batch_size 24 \
     --gradient_accumulation_steps 2

If you have a recent GPU (starting from NVIDIA Volta series), you should try **16-bit fine-tuning** (FP16).

Here is an example of hyper-parameters for a FP16 run we tried:

.. code-block:: bash

   export SQUAD_DIR=/path/to/SQUAD

   python ./run_bert_squad.py \
     --bert_model bert-large-uncased \
     --do_train \
     --do_predict \
     --do_lower_case \
     --train_file $SQUAD_DIR/train-v1.1.json \
     --predict_file $SQUAD_DIR/dev-v1.1.json \
     --learning_rate 3e-5 \
     --num_train_epochs 2 \
     --max_seq_length 384 \
     --doc_stride 128 \
     --output_dir /tmp/debug_squad/ \
     --train_batch_size 24 \
     --fp16 \
     --loss_scale 128

The results were similar to the above FP32 results (actually slightly higher):

.. code-block:: bash

   {"exact_match": 84.65468306527909, "f1": 91.238669287002}

Here is an example with the recent ``bert-large-uncased-whole-word-masking``\ :

.. code-block:: bash

   python -m torch.distributed.launch --nproc_per_node=8 \
     run_bert_squad.py \
     --bert_model bert-large-uncased-whole-word-masking \
     --do_train \
     --do_predict \
     --do_lower_case \
     --train_file $SQUAD_DIR/train-v1.1.json \
     --predict_file $SQUAD_DIR/dev-v1.1.json \
     --learning_rate 3e-5 \
     --num_train_epochs 2 \
     --max_seq_length 384 \
     --doc_stride 128 \
     --output_dir /tmp/debug_squad/ \
     --train_batch_size 24 \
     --gradient_accumulation_steps 2

Fine-tuning XLNet
^^^^^^^^^^^^^^^^^

STS-B
~~~~~

This example code fine-tunes XLNet on the STS-B corpus.

Before running this example you should download the
`GLUE data <https://gluebenchmark.com/tasks>`_ by running
`this script <https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e>`_
and unpack it to some directory ``$GLUE_DIR``.

.. code-block:: shell

   export GLUE_DIR=/path/to/glue

   python run_xlnet_classifier.py \
    --task_name STS-B \
    --do_train \
    --do_eval \
    --data_dir $GLUE_DIR/STS-B/ \
    --max_seq_length 128 \
    --train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/

Our test ran on a few seeds with `the original implementation hyper-parameters <https://github.com/zihangdai/xlnet#1-sts-b-sentence-pair-relevance-regression-with-gpus>`__ and gave evaluation results between 84% and 88%.

**Distributed training**
Here is an example using distributed training on 8 V100 GPUs to reach XXXX:

.. code-block:: bash

   python -m torch.distributed.launch --nproc_per_node 8 \
    run_xlnet_classifier.py \
    --task_name STS-B \
    --do_train \
    --do_eval \
    --data_dir $GLUE_DIR/STS-B/ \
    --max_seq_length 128 \
    --train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/

Training with these hyper-parameters gave us the following results:

.. code-block:: bash

     acc = 0.8823529411764706
     acc_and_f1 = 0.901702786377709
     eval_loss = 0.3418912578906332
     f1 = 0.9210526315789473
     global_step = 174
     loss = 0.07231863956341798

Here is an example on MNLI:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node 8 run_bert_classifier.py \
        --bert_model bert-large-uncased-whole-word-masking \
        --task_name mnli \
        --do_train \
        --do_eval \
        --data_dir /datadrive/bert_data/glue_data/MNLI/ \
        --max_seq_length 128 \
        --train_batch_size 8 \
        --learning_rate 2e-5 \
        --num_train_epochs 3.0 \
        --output_dir ../models/wwm-uncased-finetuned-mnli/ \
        --overwrite_output_dir

.. code-block:: bash

   ***** Eval results *****
     acc = 0.8679706601466992
     eval_loss = 0.4911287787382479
     global_step = 18408
     loss = 0.04755385363816904

   ***** Eval results *****
     acc = 0.8747965825874695
     eval_loss = 0.45516540421714036
     global_step = 18408
     loss = 0.04755385363816904

This run corresponds to the ``bert-large-uncased-whole-word-masking-finetuned-mnli`` model.