Commit f4719791 authored by Ananya Harsh Jha

added GLUE dev set results and details on how to run GLUE tasks

parent e5b63fb5
@@ -927,11 +927,60 @@ Where `$THIS_MACHINE_INDEX` is a sequential index assigned to each of your mach
We showcase several fine-tuning examples based on (and extended from) [the original implementation](https://github.com/google-research/bert/):
- a *sequence-level classifier* on nine different GLUE tasks,
- a *token-level classifier* on the question answering dataset SQuAD,
- a *sequence-level multiple-choice classifier* on the SWAG classification corpus, and
- a *BERT language model* on another target corpus.
#### GLUE results on dev set

We get the following results on the dev set of the GLUE benchmark with an uncased BERT base
model. All experiments were run on a P100 GPU with a batch size of 32.
| Task | Metric | Result |
|-|-|-|
| CoLA | Matthew's corr. | 57.29 |
| SST-2 | accuracy | 93.00 |
| MRPC | F1/accuracy | 88.85/83.82 |
| STS-B | Pearson/Spearman corr. | 89.70/89.37 |
| QQP | accuracy/F1 | 90.72/87.41 |
| MNLI | matched acc./mismatched acc.| 83.95/84.39 |
| QNLI | accuracy | 89.04 |
| RTE | accuracy | 61.01 |
| WNLI | accuracy | 53.52 |
Some of these results are significantly different from the ones reported on the test set
of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
Before running any of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
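For example, assuming the gist has been saved locally as `download_glue_data.py` (the script name and flags below are taken from that gist and may change, so check it for the exact usage), the download step might look like:

```shell
# hedged sketch: fetch and unpack all GLUE tasks into the target directory
# using the gist linked above, saved here as download_glue_data.py
python download_glue_data.py --data_dir /path/to/glue --tasks all
```

You can then fine-tune BERT on a single GLUE task with `run_classifier.py`: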
```shell
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python run_classifier.py \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
```
where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
The dev set results will be written to the text file `eval_results.txt` in the specified `output_dir`. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
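If you want to collect these numbers programmatically, `eval_results.txt` is a plain list of `name = value` lines, so it can be read back with a few lines of Python (a minimal sketch, assuming that layout):

```python
# minimal sketch: read eval_results.txt back into a dict,
# assuming each line has the form "metric_name = value"
results = {}
with open("/tmp/MRPC/eval_results.txt") as f:
    for line in f:
        name, value = line.strip().split(" = ", 1)
        results[name] = float(value)

print(results)
```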
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA and SST-2. The following section provides details on how to run half-precision training with MRPC. That said, there should not be any issues running half-precision training on the remaining GLUE tasks either, since the data processor for each task inherits from the base class `DataProcessor`.
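In practice, trying half-precision on one of the untested tasks should amount to adding the `--fp16` flag (the flag used in the apex/MRPC instructions below) to the same command. A sketch, with RTE chosen purely for illustration and not verified:

```shell
# hedged sketch: the fine-tuning command above with apex half-precision enabled;
# only MRPC, MNLI, CoLA and SST-2 have actually been tested with --fp16
python run_classifier.py \
  --task_name RTE \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/RTE \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --fp16 \
  --output_dir /tmp/RTE/
```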
#### MRPC

This example code fine-tunes BERT on the Microsoft Research Paraphrase