Unverified commit 2e5052d4 authored by Sylvain Gugger, committed by GitHub

New run glue script (#7917)



* Start simplification

* More progress

* Finished script

* Address comments and update tests instructions

* Wrong test

* Accept files as inputs and fix test

* Update src/transformers/trainer_utils.py
Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Fix labels and add combined score

* Add special labels

* Update TPU command

* Revert to old label strategy

* Use model labels

* Fix for STS-B

* Styling

* Apply suggestions from code review
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Code styling

* Fix review comments
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
parent 18ce6b8f
@@ -67,10 +67,10 @@ class ExamplesTests(TestCasePlus):
         testargs = f"""
             run_glue.py
             --model_name_or_path distilbert-base-uncased
-            --data_dir ./tests/fixtures/tests_samples/MRPC/
             --output_dir {tmp_dir}
             --overwrite_output_dir
-            --task_name mrpc
+            --train_file ./tests/fixtures/tests_samples/MRPC/train.csv
+            --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv
             --do_train
             --do_eval
             --per_device_train_batch_size=2
......
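The updated test exercises the new file-based data path: `--task_name` now downloads a GLUE task, while `--train_file`/`--validation_file` point at local files. As a rough, hedged illustration of that pattern (not the actual run_glue.py code), here is a minimal sketch assuming the 🤗 `datasets` library; the helper name `get_raw_datasets` is invented for this sketch:

```python
# Minimal sketch: accept either a GLUE task name or local csv/json files,
# mirroring the CLI flags shown in the test above.
from typing import Optional

from datasets import load_dataset


def get_raw_datasets(task_name: Optional[str] = None,
                     train_file: Optional[str] = None,
                     validation_file: Optional[str] = None):
    if task_name is not None:
        # Downloads and caches the GLUE task (e.g. "mrpc") from the Hub.
        return load_dataset("glue", task_name)
    # Otherwise use the user-provided files; the extension picks the loader.
    extension = train_file.split(".")[-1]  # "csv" or "json"
    return load_dataset(extension, data_files={"train": train_file,
                                               "validation": validation_file})
```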
@@ -44,8 +44,7 @@ class TorchXLAExamplesTests(unittest.TestCase):
             transformers/examples/text-classification/run_glue.py
             --do_train
             --do_eval
-            --task_name=MRPC
-            --data_dir=/datasets/glue_data/MRPC
+            --task_name=mrpc
             --cache_dir=./cache_dir
             --num_train_epochs=1
             --max_seq_length=128
......
@@ -74,18 +74,10 @@ between different runs. We report the median on 5 runs (with different seeds) fo
 | WNLI | Accuracy | 45.07 |
 Some of these results are significantly different from the ones reported on the test set
-of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite.
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running the following lines at the root of the repo
-```
-python utils/download_glue_data.py --data_dir /path/to/glue --tasks all
-```
-after replacing *path/to/glue* with a value that you like. Then you can run
+of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the
+website.
 ```bash
-export GLUE_DIR=/path/to/glue
 export TASK_NAME=MRPC
 python run_glue.py \
@@ -93,7 +85,6 @@ python run_glue.py \
   --task_name $TASK_NAME \
   --do_train \
   --do_eval \
-  --data_dir $GLUE_DIR/$TASK_NAME \
   --max_seq_length 128 \
   --per_device_train_batch_size 32 \
   --learning_rate 2e-5 \
@@ -114,69 +105,33 @@ since the data processor for each task inherits from the base class DataProcesso
 ## Running on TPUs in PyTorch
 **Update**: read the more up-to-date [Running on TPUs](../README.md#running-on-tpus) in the main README.md instead.
-Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on how to setup your TPU environment refer to the
+Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on
+how to setup your TPU environment refer to the
 [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
-The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
-identical to your normal GPU + Huggingface setup.
-For running your GLUE task on MNLI dataset you can run something like the following:
+For running your GLUE task on MNLI dataset you can run something like the following from the root of the transformers
+repo:
 ```
 export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
-export GLUE_DIR=/path/to/glue
-export TASK_NAME=MNLI
-python run_glue_tpu.py \
-  --model_name_or_path bert-base-cased \
-  --task_name $TASK_NAME \
+python examples/xla_spawn.py \
+  --num_cores=8 \
+  transformers/examples/text-classification/run_glue.py \
   --do_train \
   --do_eval \
-  --data_dir $GLUE_DIR/$TASK_NAME \
-  --max_seq_length 128 \
-  --train_batch_size 32 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/$TASK_NAME \
+  --task_name=mrpc \
+  --num_train_epochs=3 \
+  --max_seq_length=128 \
+  --learning_rate=5e-5 \
+  --output_dir=/tmp/mrpc \
   --overwrite_output_dir \
-  --logging_steps 50 \
-  --save_steps 200 \
-  --num_cores=8
+  --logging_steps=5 \
+  --save_steps=5 \
+  --tpu_metrics_debug \
+  --model_name_or_path=bert-base-cased \
+  --per_device_train_batch_size=64 \
+  --per_device_eval_batch_size=64
 ```
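The new command launches the plain `run_glue.py` through `examples/xla_spawn.py` instead of a dedicated `run_glue_tpu.py`; the launcher plays roughly the role `torch.distributed.launch` plays for GPUs. A stripped-down sketch of such a launcher follows; the real script resolves the target script path and `--num_cores` from its arguments, and the `run_glue` import here is only illustrative:

```python
# Stripped-down sketch of a TPU launcher in the spirit of examples/xla_spawn.py.
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Each TPU process runs the target script's main(); the Trainer detects
    # its TPU ordinal through torch_xla on its own.
    import run_glue  # illustrative: assumes run_glue.py is on sys.path
    run_glue.main()


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8)  # corresponds to --num_cores=8
```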
 ### MRPC
 #### Fine-tuning example
 The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less
 than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
 ```bash
-export GLUE_DIR=/path/to/glue
 python run_glue.py \
   --model_name_or_path bert-base-cased \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_device_train_batch_size 32 \
   --learning_rate 2e-5 \
   --num_train_epochs 3.0 \
   --output_dir /tmp/mrpc_output/
 ```
 Our test ran on a few seeds with [the original implementation hyper-
 parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
 results between 84% and 88%.
 #### Using Apex and mixed-precision
@@ -184,14 +139,12 @@ Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds.
 [apex](https://github.com/NVIDIA/apex), then run the following example:
 ```bash
-export GLUE_DIR=/path/to/glue
 python run_glue.py \
   --model_name_or_path bert-base-cased \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_device_train_batch_size 32 \
   --learning_rate 2e-5 \
@@ -206,15 +159,13 @@ Here is an example using distributed training on 8 V100 GPUs. The model used is
 reaches F1 > 92 on MRPC.
 ```bash
-export GLUE_DIR=/path/to/glue
 python -m torch.distributed.launch \
     --nproc_per_node 8 run_glue.py \
     --model_name_or_path bert-base-cased \
-    --task_name MRPC \
+    --task_name mrpc \
     --do_train \
     --do_eval \
-    --data_dir $GLUE_DIR/MRPC/ \
     --max_seq_length 128 \
     --per_device_train_batch_size 8 \
     --learning_rate 2e-5 \
@@ -246,7 +197,6 @@ python -m torch.distributed.launch \
     --task_name mnli \
     --do_train \
     --do_eval \
-    --data_dir $GLUE_DIR/MNLI/ \
     --max_seq_length 128 \
     --per_device_train_batch_size 8 \
     --learning_rate 2e-5 \
@@ -272,7 +222,9 @@ The results are the following:
 # Run PyTorch version using PyTorch-Lightning
-Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that will automatically download, pre-process the data and run the specified models. Logs are saved in `lightning_logs` directory.
+Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in
+`examples/requirements.txt`. It is a shell pipeline that will automatically download, preprocess the data and run the
+specified models. Logs are saved in `lightning_logs` directory.
 Pass `--gpus` flag to change the number of GPUs. Default uses 1. At the end, the expected results are:
......
This diff is collapsed.
@@ -854,8 +854,6 @@ class Trainer:
                     metrics = self.evaluate()
                     self._report_to_hp_search(trial, epoch, metrics)
-                    self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics)
                 if self.control.should_save:
                     self._save_checkpoint(model, trial, metrics=metrics)
                     self.control = self.callback_handler.on_save(self.args, self.state, self.control)
@@ -1173,7 +1171,7 @@ class Trainer:
             xm.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
         else:
             self.model.save_pretrained(output_dir)
-        if self.tokenizer is not None:
+        if self.tokenizer is not None and self.is_world_process_zero():
             self.tokenizer.save_pretrained(output_dir)

     def _save(self, output_dir: Optional[str] = None):
@@ -1188,7 +1186,7 @@ class Trainer:
             torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
         else:
             self.model.save_pretrained(output_dir)
-        if self.tokenizer is not None:
+        if self.tokenizer is not None and self.is_world_process_zero():
             self.tokenizer.save_pretrained(output_dir)

         # Good practice: save your training arguments together with the trained model
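Both save paths now guard the tokenizer save with `is_world_process_zero()`, so in distributed runs only one process writes those files and workers don't race on the same output directory. A minimal sketch of that kind of guard, assuming `torch.distributed` (a standalone illustration, not the Trainer method itself):

```python
# Illustrative rank-zero guard: only the main process touches shared files.
import torch.distributed as dist


def is_world_process_zero(local_rank: int = -1) -> bool:
    # When torch.distributed is initialized, trust its global rank;
    # otherwise fall back to the launcher-provided local_rank.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return local_rank in (-1, 0)
```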
@@ -1272,6 +1270,7 @@ class Trainer:
             # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
             xm.master_print(met.metrics_report())

+        self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics)
         return output.metrics

     def predict(self, test_dataset: Dataset) -> PredictionOutput:
......
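With `on_evaluate` now fired from inside `evaluate()` itself (rather than only in the training loop, per the hunks above), callbacks see every evaluation, including standalone ones. A hedged example of a callback reacting to that event; the class name and the 0.1 threshold are invented for illustration:

```python
# Hypothetical callback: stop training once eval loss is low enough.
from transformers import TrainerCallback


class StopOnLowEvalLoss(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # `metrics` is the dict evaluate() returns, e.g. {"eval_loss": ...}.
        if metrics is not None and metrics.get("eval_loss", float("inf")) < 0.1:
            control.should_training_stop = True
        return control
```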
@@ -22,7 +22,7 @@ from typing import Any, Dict, NamedTuple, Optional, Tuple, Union

 import numpy as np

-from .file_utils import is_tf_available, is_torch_available
+from .file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
 from .tokenization_utils_base import ExplicitEnum
@@ -157,3 +157,30 @@ default_hp_space = {
     HPSearchBackend.OPTUNA: default_hp_space_optuna,
     HPSearchBackend.RAY: default_hp_space_ray,
 }
+
+
+def is_main_process(local_rank):
+    """
+    Whether or not the current process is the local process, based on `xm.get_ordinal()` (for TPUs) first, then on
+    `local_rank`.
+    """
+    if is_torch_tpu_available():
+        import torch_xla.core.xla_model as xm
+
+        return xm.get_ordinal() == 0
+    return local_rank in [-1, 0]
+
+
+def total_processes_number(local_rank):
+    """
+    Return the number of processes launched in parallel. Works with `torch.distributed` and TPUs.
+    """
+    if is_torch_tpu_available():
+        import torch_xla.core.xla_model as xm
+
+        return xm.xrt_world_size()
+    elif local_rank != -1 and is_torch_available():
+        import torch
+
+        return torch.distributed.get_world_size()
+    return 1
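These helpers let scripts keep noisy logging to the main process only. A short usage sketch; the logging setup shown is a typical pattern, not a required one:

```python
# Example: only the main process logs at INFO; replicas stay quiet.
import logging

from transformers.trainer_utils import is_main_process

local_rank = -1  # e.g. training_args.local_rank in a real script
logging.basicConfig(level=logging.INFO if is_main_process(local_rank) else logging.WARN)
```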
tests/fixtures/tests_samples/MRPC/dev.csv (new file)
@@ -0,0 +1,7 @@
+label,sentence1,sentence2
+equivalent,He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .,""" The foodservice pie business does not fit our long-term growth strategy ."
+not_equivalent,Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .,"His wife said he was "" 100 percent behind George Bush "" and looked forward to using his years of training in the war ."
+not_equivalent,"The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .","The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent ."
+equivalent,The AFL-CIO is waiting until October to decide if it will endorse a candidate .,The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .
+not_equivalent,No dates have been set for the civil or the criminal trial .,"No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty ."
+equivalent,Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed .,It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status .
tests/fixtures/tests_samples/MRPC/train.csv (new file)
@@ -0,0 +1,7 @@
+label,sentence1,sentence2
+equivalent,He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .,""" The foodservice pie business does not fit our long-term growth strategy ."
+not_equivalent,Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .,"His wife said he was "" 100 percent behind George Bush "" and looked forward to using his years of training in the war ."
+not_equivalent,"The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .","The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent ."
+equivalent,The AFL-CIO is waiting until October to decide if it will endorse a candidate .,The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .
+not_equivalent,No dates have been set for the civil or the criminal trial .,"No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty ."
+equivalent,Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed .,It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status .
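For reference, these fixtures are exactly what the updated test feeds to `run_glue.py` via `--train_file`/`--validation_file`; they can be loaded directly with the `datasets` csv loader:

```python
# Load the MRPC fixture files; column names come from the csv header row.
from datasets import load_dataset

data_files = {
    "train": "./tests/fixtures/tests_samples/MRPC/train.csv",
    "validation": "./tests/fixtures/tests_samples/MRPC/dev.csv",
}
raw = load_dataset("csv", data_files=data_files)
print(raw["train"].column_names)  # ['label', 'sentence1', 'sentence2']
```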