Unverified Commit dabeb152 authored by Sylvain Gugger, committed by GitHub

Examples reorg (#11350)



* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments and clean
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent ca7ff64f
@@ -14,9 +14,9 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# SQuAD

Based on the script [`run_qa.py`](https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_qa.py).

**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library) as it
uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
@@ -29,7 +29,9 @@ The old version of this script can be found [here](https://github.com/huggingfac
Note that if your dataset contains samples with no possible answers (like SQuAD version 2), you need to pass along the flag `--version_2_with_negative`.
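For instance, here is a minimal sketch of a SQuAD v2.0 run; the `squad_v2` dataset name and the hyper-parameters are illustrative, so adjust them to your setup:

```bash
python run_qa.py \
--model_name_or_path bert-base-uncased \
--dataset_name squad_v2 \
--version_2_with_negative \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad_v2/
```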
## Trainer-based scripts
### Fine-tuning BERT on SQuAD1.0
This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single Tesla V100 16GB.
@@ -57,7 +59,6 @@ exact_match = 81.22
#### Distributed training

Here is an example using distributed training on 8 V100 GPUs and the Bert Whole Word Masking uncased model to reach an F1 > 93 on SQuAD1.1:

```bash
@@ -128,6 +129,71 @@ python run_qa_beam_search.py \
--save_steps 5000
```
## With Accelerate

Based on the scripts `run_qa_no_trainer.py` and `run_qa_beam_search_no_trainer.py`.

Like `run_qa.py` and `run_qa_beam_search.py`, these scripts allow you to fine-tune any of the supported models on
SQuAD or a similar dataset. The main difference is that they expose the bare training loop, so you can quickly
experiment and add any customization you would like. They offer fewer options than the `Trainer`-based scripts (though
you can easily change the options for the optimizer or the dataloaders directly in the script), but they still run in a
distributed setup, on TPU, and support mixed precision by means of the
[🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the scripts normally after installing
it:
```bash
pip install accelerate
```
then
```bash
python run_qa_no_trainer.py \
--model_name_or_path bert-base-uncased \
--dataset_name squad \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ~/tmp/debug_squad
```
You can then use your usual launchers to run it in a distributed environment, but the easiest way is to run
```bash
accelerate config
```
and reply to the questions asked. Then
```bash
accelerate test
```
that will check everything is ready for training. Finally, you can launch training with
```bash
accelerate launch run_qa_no_trainer.py \
--model_name_or_path bert-base-uncased \
--dataset_name squad \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ~/tmp/debug_squad
```
This command is the same and will work for:
- a CPU-only setup
- a setup with one GPU
- a distributed training with several GPUs (single or multi node)
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.
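If you prefer to use your usual distributed launcher instead of `accelerate launch`, here is a minimal sketch assuming a single machine with 8 GPUs; it relies on the launcher exporting the rank environment variables (hence `--use_env` with `torch.distributed.launch`), which Accelerate picks up:

```bash
python -m torch.distributed.launch --nproc_per_node 8 --use_env run_qa_no_trainer.py \
--model_name_or_path bert-base-uncased \
--dataset_name squad \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ~/tmp/debug_squad
```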
## Results
Larger batch size may improve the performance while costing more memory.

##### Results for SQuAD1.0 with the previously defined hyper-parameters:
@@ -223,22 +289,3 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answer
```
Training with the above command leads to the f1 score of 93.52, which is slightly better than the f1 score of 93.15 for
`bert-large-uncased-whole-word-masking`.
## SQuAD with the TensorFlow Trainer
```bash
python run_tf_squad.py \
--model_name_or_path bert-base-uncased \
--output_dir model \
--max_seq_length 384 \
--num_train_epochs 2 \
--per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 16 \
--do_train \
--logging_dir logs \
--logging_steps 10 \
--learning_rate 3e-5 \
--doc_stride 128
```
For the moment, only training is available in the TensorFlow Trainer; evaluation is not supported yet.
@@ -14,9 +14,9 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
## Summarization

This directory contains examples for finetuning and evaluating transformers on summarization tasks.
Please tag @patil-suraj with any issues/unexpected behaviors, or send a PR!
For deprecated `bertabs` instructions, see [`bertabs/README.md`](https://github.com/huggingface/transformers/blob/master/examples/research_projects/bertabs/README.md).
For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2seq`](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq).
@@ -30,16 +30,16 @@ For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2s
- `PegasusForConditionalGeneration`
- `T5ForConditionalGeneration`

`run_summarization.py` is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files
and you will also find examples of these below.
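For reference, a jsonlines file is simply one JSON object per line. Here is a minimal sketch of a training file; the `text` and `summary` keys are illustrative and should match whatever you pass as `--text_column` and `--summary_column`:

```bash
# Write a tiny jsonlines training file (illustrative field names and content).
cat <<'EOF' > train.json
{"text": "The full document to summarize goes here.", "summary": "A short reference summary."}
{"text": "Another document to summarize.", "summary": "Its summary."}
EOF
```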
## With Trainer

Here is an example on a summarization task:

```bash
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
@@ -63,7 +63,7 @@ And here is how you would use it on your own files, after adjusting the values f
`--train_file`, `--validation_file`, `--text_column` and `--summary_column` to match your setup:

```bash
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
@@ -134,115 +134,64 @@ And as with the CSV files, you can specify which values to select from the file,
--summary_column summary \
```
## With Accelerate

Based on the script [`run_summarization_no_trainer.py`](https://github.com/huggingface/transformers/blob/master/examples/pytorch/summarization/run_summarization_no_trainer.py).

Like `run_summarization.py`, this script allows you to fine-tune any of the supported models on a
summarization task. The main difference is that this script exposes the bare training loop, to allow you to quickly
experiment and add any customization you would like. It offers fewer options than the script with `Trainer` (though
you can easily change the options for the optimizer or the dataloaders directly in the script), but it still runs in a
distributed setup, on TPU, and supports mixed precision by means of the
[🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the script normally after installing
it:

```bash
pip install accelerate
```

then

```bash
python run_summarization_no_trainer.py \
--model_name_or_path t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir ~/tmp/tst-summarization
```

You can then use your usual launchers to run it in a distributed environment, but the easiest way is to run

```bash
accelerate config
```

and reply to the questions asked. Then

```bash
accelerate test
```

that will check everything is ready for training. Finally, you can launch training with

```bash
accelerate launch run_summarization_no_trainer.py \
--model_name_or_path t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir ~/tmp/tst-summarization
```

This command is the same and will work for:
- a CPU-only setup
- a setup with one GPU
- a distributed training with several GPUs (single or multi node)
- a training on TPUs

Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.

### Translation

Here is an example of a translation fine-tuning with a MarianMT model:

```bash
python examples/seq2seq/run_translation.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```

MBart and some T5 models require special handling.

T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example:

```bash
python examples/seq2seq/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```

If you get a terrible BLEU score, make sure that you didn't forget to use the `--source_prefix` argument.

For the aforementioned group of T5 models it's important to remember that if you switch to a different language pair, make sure to adjust the source and target values in all 3 language-specific command line arguments: `--source_lang`, `--target_lang` and `--source_prefix`.

MBart models require a different format for `--source_lang` and `--target_lang` values, e.g. instead of `en` it expects `en_XX`, for `ro` it expects `ro_RO`. The full MBart specification for language codes can be found [here](https://huggingface.co/facebook/mbart-large-cc25). For example:

```bash
python examples/seq2seq/run_translation.py \
--model_name_or_path facebook/mbart-large-en-ro \
--do_train \
--do_eval \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--source_lang en_XX \
--target_lang ro_RO \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```

And here is how you would use the translation fine-tuning on your own files, after adjusting the
values for the arguments `--train_file`, `--validation_file` to match your setup:

```bash
python examples/seq2seq/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--train_file path_to_jsonlines_file \
--validation_file path_to_jsonlines_file \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```

The task of translation supports only custom JSONLINES files, with each line being a dictionary with a key `"translation"` whose value is another dictionary whose keys are the language pair. For example:

```json
{ "translation": { "en": "Others have dismissed him as a joke.", "ro": "Alții l-au numit o glumă." } }
{ "translation": { "en": "And some are holding out for an implosion.", "ro": "Iar alții așteaptă implozia." } }
```

Here the languages are Romanian (`ro`) and English (`en`).

If you want to use a pre-processed dataset that leads to high BLEU scores, but for the `en-de` language pair, you can use `--dataset_name stas/wmt14-en-de-pre-processed`, as follows:

```bash
python examples/seq2seq/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang de \
--source_prefix "translate English to German: " \
--dataset_name stas/wmt14-en-de-pre-processed \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
rouge-score
nltk
py7zr
torch >= 1.3
@@ -36,7 +36,8 @@ SRC_DIRS = [
"language-modeling",
"multiple-choice",
"question-answering",
"summarization",
"translation",
]
]
sys.path.extend(SRC_DIRS)
...
@@ -16,7 +16,7 @@ limitations under the License.

# Text classification examples

## GLUE tasks

Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).

@@ -129,7 +129,7 @@ and reply to the questions asked. Then
accelerate test
```
that will check everything is ready for training. Finally, you can launch training with
```bash
export TASK_NAME=mrpc
@@ -152,84 +152,3 @@ This command is the same and will work for:
- a training on TPUs

Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.
## TensorFlow 2.0 version
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).
Fine-tuning the library's TensorFlow 2.0 BERT model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
## Run generic text classification script in TensorFlow
The script [run_tf_text_classification.py](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_text_classification.py) allows users to run a text classification on their own CSV files. For now there are a few restrictions: the CSV files must have a header corresponding to the column names and no more than three columns: one column for the id, one column for the text, and another column for a second piece of text (in the case of an entailment classification, for example).

To use the script, one has to run the following command line:
```bash
# train_file: training dataset file location (mandatory if running with the --do_train option)
# dev_file: development dataset file location (mandatory if running with the --do_eval option)
# test_file: test dataset file location (mandatory if running with the --do_predict option)
# label_column_id: which column corresponds to the labels
python run_tf_text_classification.py \
--train_file train.csv \
--dev_file dev.csv \
--test_file test.csv \
--label_column_id 0 \
--model_name_or_path bert-base-multilingual-uncased \
--output_dir model \
--num_train_epochs 4 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 32 \
--do_train \
--do_eval \
--do_predict \
--logging_steps 10 \
--evaluation_strategy steps \
--save_steps 10 \
--overwrite_output_dir \
--max_seq_length 128
```
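For reference, here is a minimal sketch of what such a `train.csv` could look like under the restrictions above; the header names, the position of the label column (column 0 here, to match `--label_column_id 0`), and the values are all illustrative:

```bash
# Write a tiny CSV file with a header row (illustrative column names and content).
cat <<'EOF' > train.csv
label,sentence1,sentence2
1,The cat sat on the mat.,A cat is sitting on a mat.
0,He bought a new car.,She sold her bicycle.
EOF
```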
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
#### Fine-tuning on XNLI
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins on a single Tesla V100 16GB.
```bash
python run_xnli.py \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \
--output_dir /tmp/debug_xnli/ \
--save_steps -1
```
Training with the previously defined hyper-parameters yields the following results on the **test** set:
```bash
acc = 0.7093812375249501
```
@@ -2,3 +2,4 @@ accelerate
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
torch >= 1.3