Unverified commit 2e5052d4 authored by Sylvain Gugger, committed by GitHub

New run glue script (#7917)



* Start simplification

* More progress

* Finished script

* Address comments and update tests instructions

* Wrong test

* Accept files as inputs and fix test

* Update src/transformers/trainer_utils.py
Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Fix labels and add combined score

* Add special labels

* Update TPU command

* Revert to old label strategy

* Use model labels

* Fix for STS-B

* Styling

* Apply suggestions from code review
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Code styling

* Fix review comments
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
parent 18ce6b8f
@@ -67,10 +67,10 @@ class ExamplesTests(TestCasePlus):
         testargs = f"""
             run_glue.py
             --model_name_or_path distilbert-base-uncased
-            --data_dir ./tests/fixtures/tests_samples/MRPC/
             --output_dir {tmp_dir}
             --overwrite_output_dir
-            --task_name mrpc
+            --train_file ./tests/fixtures/tests_samples/MRPC/train.csv
+            --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv
             --do_train
             --do_eval
             --per_device_train_batch_size=2
......
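The updated test exercises the new file-based data path: `--task_name` now downloads a GLUE task, while `--train_file`/`--validation_file` point at local files. As a rough, hedged illustration of that pattern (not the actual run_glue.py code), here is a minimal sketch assuming the 🤗 `datasets` library; the helper name `get_raw_datasets` is invented for this sketch:

```python
# Minimal sketch: accept either a GLUE task name or local csv/json files,
# mirroring the CLI flags shown in the test above.
from typing import Optional

from datasets import load_dataset


def get_raw_datasets(task_name: Optional[str] = None,
                     train_file: Optional[str] = None,
                     validation_file: Optional[str] = None):
    if task_name is not None:
        # Downloads and caches the GLUE task (e.g. "mrpc") from the Hub.
        return load_dataset("glue", task_name)
    # Otherwise use the user-provided files; the extension picks the loader.
    extension = train_file.split(".")[-1]  # "csv" or "json"
    return load_dataset(extension, data_files={"train": train_file,
                                               "validation": validation_file})
```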
@@ -44,8 +44,7 @@ class TorchXLAExamplesTests(unittest.TestCase):
             transformers/examples/text-classification/run_glue.py
             --do_train
             --do_eval
-            --task_name=MRPC
-            --data_dir=/datasets/glue_data/MRPC
+            --task_name=mrpc
             --cache_dir=./cache_dir
             --num_train_epochs=1
             --max_seq_length=128
......
@@ -74,18 +74,10 @@ between different runs. We report the median on 5 runs (with different seeds) fo
 | WNLI | Accuracy | 45.07 |
 Some of these results are significantly different from the ones reported on the test set
-of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite.
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running the following lines at the root of the repo
-```
-python utils/download_glue_data.py --data_dir /path/to/glue --tasks all
-```
-after replacing *path/to/glue* with a value that you like. Then you can run
+of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the
+website.
 ```bash
-export GLUE_DIR=/path/to/glue
 export TASK_NAME=MRPC
 python run_glue.py \
@@ -93,7 +85,6 @@ python run_glue.py \
   --task_name $TASK_NAME \
   --do_train \
   --do_eval \
-  --data_dir $GLUE_DIR/$TASK_NAME \
   --max_seq_length 128 \
   --per_device_train_batch_size 32 \
   --learning_rate 2e-5 \
@@ -114,69 +105,33 @@ since the data processor for each task inherits from the base class DataProcesso
 ## Running on TPUs in PyTorch
 **Update**: read the more up-to-date [Running on TPUs](../README.md#running-on-tpus) in the main README.md instead.
-Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on how to setup your TPU environment refer to the
+Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on
+how to setup your TPU environment refer to the
 [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
-The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
-identical to your normal GPU + Huggingface setup.
-For running your GLUE task on MNLI dataset you can run something like the following:
+For running your GLUE task on MNLI dataset you can run something like the following from the root of the transformers
+repo:
 ```
 export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
-export GLUE_DIR=/path/to/glue
-export TASK_NAME=MNLI
-python run_glue_tpu.py \
-  --model_name_or_path bert-base-cased \
-  --task_name $TASK_NAME \
+python examples/xla_spawn.py \
+  --num_cores=8 \
+  transformers/examples/text-classification/run_glue.py \
   --do_train \
   --do_eval \
-  --data_dir $GLUE_DIR/$TASK_NAME \
-  --max_seq_length 128 \
-  --train_batch_size 32 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/$TASK_NAME \
+  --task_name=mrpc \
+  --num_train_epochs=3 \
+  --max_seq_length=128 \
+  --learning_rate=5e-5 \
+  --output_dir=/tmp/mrpc \
   --overwrite_output_dir \
-  --logging_steps 50 \
-  --save_steps 200 \
-  --num_cores=8
+  --logging_steps=5 \
+  --save_steps=5 \
+  --tpu_metrics_debug \
+  --model_name_or_path=bert-base-cased \
+  --per_device_train_batch_size=64 \
+  --per_device_eval_batch_size=64
 ```
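The new command launches the plain `run_glue.py` through `examples/xla_spawn.py` instead of a dedicated `run_glue_tpu.py`; the launcher plays roughly the role `torch.distributed.launch` plays for GPUs. A stripped-down sketch of such a launcher follows; the real script resolves the target script path and `--num_cores` from its arguments, and the `run_glue` import here is only illustrative:

```python
# Stripped-down sketch of a TPU launcher in the spirit of examples/xla_spawn.py.
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Each TPU process runs the target script's main(); the Trainer detects
    # its TPU ordinal through torch_xla on its own.
    import run_glue  # illustrative: assumes run_glue.py is on sys.path
    run_glue.main()


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8)  # corresponds to --num_cores=8
```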
 ### MRPC
 #### Fine-tuning example
 The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less
 than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
 ```bash
-export GLUE_DIR=/path/to/glue
 python run_glue.py \
   --model_name_or_path bert-base-cased \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_device_train_batch_size 32 \
   --learning_rate 2e-5 \
   --num_train_epochs 3.0 \
   --output_dir /tmp/mrpc_output/
 ```
 Our test ran on a few seeds with [the original implementation hyper-
 parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
 results between 84% and 88%.
 #### Using Apex and mixed-precision
@@ -184,14 +139,12 @@ Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds.
 [apex](https://github.com/NVIDIA/apex), then run the following example:
 ```bash
-export GLUE_DIR=/path/to/glue
 python run_glue.py \
   --model_name_or_path bert-base-cased \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_device_train_batch_size 32 \
   --learning_rate 2e-5 \
@@ -206,15 +159,13 @@ Here is an example using distributed training on 8 V100 GPUs. The model used is
 reaches F1 > 92 on MRPC.
 ```bash
-export GLUE_DIR=/path/to/glue
 python -m torch.distributed.launch \
     --nproc_per_node 8 run_glue.py \
     --model_name_or_path bert-base-cased \
-    --task_name MRPC \
+    --task_name mrpc \
     --do_train \
     --do_eval \
-    --data_dir $GLUE_DIR/MRPC/ \
     --max_seq_length 128 \
     --per_device_train_batch_size 8 \
     --learning_rate 2e-5 \
@@ -246,7 +197,6 @@ python -m torch.distributed.launch \
     --task_name mnli \
     --do_train \
     --do_eval \
-    --data_dir $GLUE_DIR/MNLI/ \
     --max_seq_length 128 \
     --per_device_train_batch_size 8 \
     --learning_rate 2e-5 \
@@ -272,7 +222,9 @@ The results are the following:
 # Run PyTorch version using PyTorch-Lightning
-Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that will automatically download, pre-process the data and run the specified models. Logs are saved in `lightning_logs` directory.
+Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in
+`examples/requirements.txt`. It is a shell pipeline that will automatically download, preprocess the data and run the
+specified models. Logs are saved in `lightning_logs` directory.
 Pass `--gpus` flag to change the number of GPUs. Default uses 1. At the end, the expected results are:
......
This diff is collapsed.
@@ -854,8 +854,6 @@ class Trainer:
                     metrics = self.evaluate()
                     self._report_to_hp_search(trial, epoch, metrics)
-                    self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics)
                 if self.control.should_save:
                     self._save_checkpoint(model, trial, metrics=metrics)
                     self.control = self.callback_handler.on_save(self.args, self.state, self.control)
@@ -1173,7 +1171,7 @@ class Trainer:
             xm.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
         else:
             self.model.save_pretrained(output_dir)
-        if self.tokenizer is not None:
+        if self.tokenizer is not None and self.is_world_process_zero():
             self.tokenizer.save_pretrained(output_dir)

     def _save(self, output_dir: Optional[str] = None):
@@ -1188,7 +1186,7 @@ class Trainer:
             torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
         else:
             self.model.save_pretrained(output_dir)
-        if self.tokenizer is not None:
+        if self.tokenizer is not None and self.is_world_process_zero():
             self.tokenizer.save_pretrained(output_dir)

         # Good practice: save your training arguments together with the trained model
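Both save paths now guard the tokenizer save with `is_world_process_zero()`, so in distributed runs only one process writes those files and workers don't race on the same output directory. A minimal sketch of that kind of guard, assuming `torch.distributed` (a standalone illustration, not the Trainer method itself):

```python
# Illustrative rank-zero guard: only the main process touches shared files.
import torch.distributed as dist


def is_world_process_zero(local_rank: int = -1) -> bool:
    # When torch.distributed is initialized, trust its global rank;
    # otherwise fall back to the launcher-provided local_rank.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return local_rank in (-1, 0)
```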
@@ -1272,6 +1270,7 @@ class Trainer:
             # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
             xm.master_print(met.metrics_report())

+        self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics)
         return output.metrics

     def predict(self, test_dataset: Dataset) -> PredictionOutput:
......
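With `on_evaluate` now fired from inside `evaluate()` itself (rather than only in the training loop, per the hunks above), callbacks see every evaluation, including standalone ones. A hedged example of a callback reacting to that event; the class name and the 0.1 threshold are invented for illustration:

```python
# Hypothetical callback: stop training once eval loss is low enough.
from transformers import TrainerCallback


class StopOnLowEvalLoss(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # `metrics` is the dict evaluate() returns, e.g. {"eval_loss": ...}.
        if metrics is not None and metrics.get("eval_loss", float("inf")) < 0.1:
            control.should_training_stop = True
        return control
```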
@@ -22,7 +22,7 @@ from typing import Any, Dict, NamedTuple, Optional, Tuple, Union

 import numpy as np

-from .file_utils import is_tf_available, is_torch_available
+from .file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
 from .tokenization_utils_base import ExplicitEnum
@@ -157,3 +157,30 @@ default_hp_space = {
     HPSearchBackend.OPTUNA: default_hp_space_optuna,
     HPSearchBackend.RAY: default_hp_space_ray,
 }
+
+
+def is_main_process(local_rank):
+    """
+    Whether or not the current process is the local process, based on `xm.get_ordinal()` (for TPUs) first, then on
+    `local_rank`.
+    """
+    if is_torch_tpu_available():
+        import torch_xla.core.xla_model as xm
+
+        return xm.get_ordinal() == 0
+    return local_rank in [-1, 0]
+
+
+def total_processes_number(local_rank):
+    """
+    Return the number of processes launched in parallel. Works with `torch.distributed` and TPUs.
+    """
+    if is_torch_tpu_available():
+        import torch_xla.core.xla_model as xm
+
+        return xm.xrt_world_size()
+    elif local_rank != -1 and is_torch_available():
+        import torch
+
+        return torch.distributed.get_world_size()
+    return 1
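These helpers let scripts keep noisy logging to the main process only. A short usage sketch; the logging setup shown is a typical pattern, not a required one:

```python
# Example: only the main process logs at INFO; replicas stay quiet.
import logging

from transformers.trainer_utils import is_main_process

local_rank = -1  # e.g. training_args.local_rank in a real script
logging.basicConfig(level=logging.INFO if is_main_process(local_rank) else logging.WARN)
```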
tests/fixtures/tests_samples/MRPC/dev.csv (new file)
@@ -0,0 +1,7 @@
+label,sentence1,sentence2
+equivalent,He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .,""" The foodservice pie business does not fit our long-term growth strategy ."
+not_equivalent,Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .,"His wife said he was "" 100 percent behind George Bush "" and looked forward to using his years of training in the war ."
+not_equivalent,"The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .","The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent ."
+equivalent,The AFL-CIO is waiting until October to decide if it will endorse a candidate .,The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .
+not_equivalent,No dates have been set for the civil or the criminal trial .,"No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty ."
+equivalent,Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed .,It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status .
tests/fixtures/tests_samples/MRPC/train.csv (new file)
@@ -0,0 +1,7 @@
+label,sentence1,sentence2
+equivalent,He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .,""" The foodservice pie business does not fit our long-term growth strategy ."
+not_equivalent,Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .,"His wife said he was "" 100 percent behind George Bush "" and looked forward to using his years of training in the war ."
+not_equivalent,"The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .","The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent ."
+equivalent,The AFL-CIO is waiting until October to decide if it will endorse a candidate .,The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .
+not_equivalent,No dates have been set for the civil or the criminal trial .,"No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty ."
+equivalent,Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed .,It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status .
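For reference, these fixtures are exactly what the updated test feeds to `run_glue.py` via `--train_file`/`--validation_file`; they can be loaded directly with the `datasets` csv loader:

```python
# Load the MRPC fixture files; column names come from the csv header row.
from datasets import load_dataset

data_files = {
    "train": "./tests/fixtures/tests_samples/MRPC/train.csv",
    "validation": "./tests/fixtures/tests_samples/MRPC/dev.csv",
}
raw = load_dataset("csv", data_files=data_files)
print(raw["train"].column_names)  # ['label', 'sentence1', 'sentence2']
```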