"lightx2v/text2v/models/vscode:/vscode.git/clone" did not exist on "f21528e750271dfb5b8271128b1d017815034d8a"
Unverified Commit dabeb152 authored by Sylvain Gugger, committed by GitHub

Examples reorg (#11350)



* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments and clean
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent ca7ff64f
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Multiple Choice
## Fine-tuning on SWAG
```bash
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/multiple-choice/run_tf_multiple_choice.py \
--task_name swag \
--model_name_or_path bert-base-cased \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_device_eval_batch_size 16 \
--per_device_train_batch_size 16 \
--logging_dir logs \
--gradient_accumulation_steps 2 \
--overwrite_output_dir
```
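Note that with `--per_device_train_batch_size 16` and `--gradient_accumulation_steps 2`, the effective training batch size is 16 × 2 = 32 per device (multiplied by the number of GPUs when training on several devices).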
Requirements for this example:

```
sentencepiece != 0.1.92
protobuf
tensorflow >= 2.3
```
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## SQuAD with the TensorFlow Trainer
```bash
python run_tf_squad.py \
--model_name_or_path bert-base-uncased \
--output_dir model \
--max_seq_length 384 \
--num_train_epochs 2 \
--per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 16 \
--do_train \
--logging_dir logs \
--logging_steps 10 \
--learning_rate 3e-5 \
--doc_stride 128
```
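Since SQuAD contexts are often longer than `--max_seq_length 384`, the script splits them into overlapping windows; `--doc_stride 128` is the stride between consecutive windows, so an answer span near a window boundary is still fully contained in at least one window.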
For the moment, only training is available in the TensorFlow Trainer; evaluation is not supported yet.
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Text classification examples
## GLUE tasks
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/tensorflow/text-classification/run_tf_glue.py).
This example fine-tunes the library's TensorFlow 2.0 BERT model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP), which runs models on Tensor Cores (NVIDIA Volta/Turing GPUs and newer hardware), and an option for XLA, which uses the XLA compiler to reduce model runtime.
Both options are toggled with the `USE_XLA` and `USE_AMP` variables in the script (a short sketch of what these toggles do appears after the benchmark table below).
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
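As a rough illustration of what the `USE_XLA` / `USE_AMP` toggles do, here is a minimal sketch of the corresponding TensorFlow 2 settings (assumed wiring for illustration, not code copied from `run_tf_glue.py`):

```python
# Minimal sketch of how USE_XLA / USE_AMP toggles can be wired in TensorFlow 2.
# Illustrative only; the exact code in run_tf_glue.py may differ.
import tensorflow as tf

USE_XLA = False  # compile the graph with the XLA compiler
USE_AMP = True   # run eligible ops in float16 on Tensor Cores

tf.config.optimizer.set_jit(USE_XLA)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})
```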
## Run generic text classification script in TensorFlow
The script [run_tf_text_classification.py](https://github.com/huggingface/transformers/blob/master/examples/tensorflow/text-classification/run_tf_text_classification.py) allows users to run text classification on their own CSV files. For now there are a few restrictions: the CSV files must have a header corresponding to the column names, and no more than three columns: one column for the id, one for the text, and an optional third column for a second piece of text (for example, for entailment classification). A sample CSV layout is shown after the command below.
To use the script, run the following command line:
```bash
# --train_file: training dataset file location (mandatory if running with --do_train)
# --dev_file: development dataset file location (mandatory if running with --do_eval)
# --test_file: test dataset file location (mandatory if running with --do_predict)
# --label_column_id: which column corresponds to the labels
python run_tf_text_classification.py \
--train_file train.csv \
--dev_file dev.csv \
--test_file test.csv \
--label_column_id 0 \
--model_name_or_path bert-base-multilingual-uncased \
--output_dir model \
--num_train_epochs 4 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 32 \
--do_train \
--do_eval \
--do_predict \
--logging_steps 10 \
--evaluation_strategy steps \
--save_steps 10 \
--overwrite_output_dir \
--max_seq_length 128
```
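For illustration, a minimal `train.csv` compatible with the command above could look like the following (the column names are made up; `--label_column_id 0` points at the first column, and `dev.csv`/`test.csv` follow the same layout):

```csv
label,sentence1,sentence2
1,"A man is playing a guitar.","Someone is playing an instrument."
0,"A dog runs through the snow.","A cat sleeps on the couch."
```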
Requirements for this example:

```
accelerate
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
tensorflow >= 2.3
```
@@ -87,7 +87,7 @@ class GlueDataset(Dataset):
warnings.warn(
"This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py",
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py",
FutureWarning,
)
self.args = args
@@ -53,7 +53,7 @@ class TextDataset(Dataset):
):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py"
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
@@ -119,7 +119,7 @@ class LineByLineTextDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py"
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
@@ -151,7 +151,7 @@ class LineByLineWithRefDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, ref_path: str):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm_wwm.py"
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm_wwm.py"
),
FutureWarning,
)
@@ -193,7 +193,7 @@ class LineByLineWithSOPTextDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, file_dir: str, block_size: int):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py"
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
@@ -348,7 +348,7 @@ class TextDatasetForNextSentencePrediction(Dataset):
):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py"
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
@@ -28,7 +28,7 @@ if is_sklearn_available():
DEPRECATION_WARNING = (
"This metric will be removed from the library soon, metrics should be handled with the 🤗 Datasets "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py"
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py"
)
@@ -35,7 +35,7 @@ logger = logging.get_logger(__name__)
DEPRECATION_WARNING = (
"This {0} will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py"
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py"
)
@@ -536,7 +536,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
remove_args_str: str = None,
):
max_len = 32
-data_dir = self.examples_dir / "test_data/wmt_en_ro"
+data_dir = self.test_file_dir / "../fixtures/tests_samples/wmt_en_ro"
output_dir = self.get_auto_remove_tmp_dir()
args = f"""
--model_name_or_path {model_name}
@@ -594,7 +594,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
args = [x for x in args if x not in remove_args]
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
script = [f"{self.examples_dir_str}/seq2seq/run_translation.py"]
script = [f"{self.examples_dir_str}/pytorch/translation/run_translation.py"]
launcher = self.get_launcher(distributed)
cmd = launcher + script + args + ds_args
@@ -629,7 +629,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
""".split()
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
script = [f"{self.examples_dir_str}/language-modeling/run_clm.py"]
script = [f"{self.examples_dir_str}/pytorch/language-modeling/run_clm.py"]
launcher = self.get_launcher(distributed=True)
cmd = launcher + script + args + ds_args
@@ -35,7 +35,7 @@ from transformers.trainer_utils import set_seed
bindir = os.path.abspath(os.path.dirname(__file__))
with ExtendSysPath(f"{bindir}/../../examples/seq2seq"):
with ExtendSysPath(f"{bindir}/../../examples/pytorch/translation"):
from run_translation import main # noqa
@@ -181,7 +181,7 @@ class TestTrainerExt(TestCasePlus):
extra_args_str: str = None,
predict_with_generate: bool = True,
):
-data_dir = self.examples_dir / "test_data/wmt_en_ro"
+data_dir = self.test_file_dir / "../fixtures/tests_samples/wmt_en_ro"
output_dir = self.get_auto_remove_tmp_dir()
args = f"""
--model_name_or_path {model_name}
@@ -226,7 +226,7 @@ class TestTrainerExt(TestCasePlus):
distributed_args = f"""
-m torch.distributed.launch
--nproc_per_node={n_gpu}
-{self.examples_dir_str}/seq2seq/run_translation.py
+{self.examples_dir_str}/pytorch/translation/run_translation.py
""".split()
cmd = [sys.executable] + distributed_args + args
execute_subprocess_async(cmd, env=self.get_env())