Unverified Commit ce315081 authored by Peter Pan, committed by GitHub

docs: replace torch.distributed.run by torchrun (#27528)



* docs: replace torch.distributed.run by torchrun

 `transformers` now officially supports PyTorch >= 1.10.
 The entrypoint `torchrun` is present from 1.10 onwards (see the equivalence sketch below).
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
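For context (not part of the commit message), a minimal sketch of how the launchers relate; the script name `train.py` and its arguments are placeholders, not taken from this repository:

```bash
# Placeholder script/args, for illustration only.
# Deprecated module form (by default passes --local_rank to the script as a CLI argument):
python -m torch.distributed.launch --nproc_per_node=8 train.py --output_dir out

# Newer launcher, module form (the local rank is exposed via the LOCAL_RANK env var):
python -m torch.distributed.run --nproc_per_node=8 train.py --output_dir out

# Same launcher through the console-script entrypoint shipped with PyTorch >= 1.10:
torchrun --nproc_per_node=8 train.py --output_dir out
```

All three start a single-node job with eight worker processes; `torchrun` is simply the installed entrypoint for `torch.distributed.run`.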

* Update src/transformers/trainer.py

with @ArthurZucker's suggestion
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
parent c832bcb8
@@ -18,7 +18,7 @@ in Huang et al. [Improve Transformer Models with Better Relative Position Embedd
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
+torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \
--dataset_name squad \
--do_train \
@@ -46,7 +46,7 @@ gpu training leads to the f1 score of 90.71.
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
+torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \
--dataset_name squad \
--do_train \
@@ -68,7 +68,7 @@ Training with the above command leads to the f1 score of 93.52, which is slightl
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
+torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
......
@@ -140,7 +140,7 @@ python finetune_trainer.py --help
For multi-gpu training use `torch.distributed.launch`, e.g. with 2 gpus:
```bash
-python -m torch.distributed.launch --nproc_per_node=2 finetune_trainer.py ...
+torchrun --nproc_per_node=2 finetune_trainer.py ...
```
**At the moment, `Seq2SeqTrainer` does not support *with teacher* distillation.**
@@ -214,7 +214,7 @@ because it uses SortishSampler to minimize padding. You can also use it on 1 GPU
`{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all clargs.
```bash
-python -m torch.distributed.launch --nproc_per_node=8 run_distributed_eval.py \
+torchrun --nproc_per_node=8 run_distributed_eval.py \
--model_name sshleifer/distilbart-large-xsum-12-3 \
--save_dir xsum_generations \
--data_dir xsum \
......
@@ -98,7 +98,7 @@ the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)
use the following command:
```bash
-python -m torch.distributed.launch \
+torchrun \
--nproc_per_node number_of_gpu_you_have path_to_script.py \
--all_arguments_of_the_script
```
@@ -107,7 +107,7 @@ As an example, here is how you would fine-tune the BERT large model (with whole
classification MNLI task using the `run_glue` script, with 8 GPUs:
```bash
-python -m torch.distributed.launch \
+torchrun \
--nproc_per_node 8 pytorch/text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
......
@@ -100,7 +100,7 @@ of **0.35**.
The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision.
```bash
-python -m torch.distributed.launch \
+torchrun \
--nproc_per_node 8 run_speech_recognition_ctc.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
@@ -147,7 +147,7 @@ However, the `--shuffle_buffer_size` argument controls how many examples we can
```bash
-**python -m torch.distributed.launch \
+**torchrun \
--nproc_per_node 4 run_speech_recognition_ctc_streaming.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-xls-r-300m" \
@@ -404,7 +404,7 @@ If training on a different language, you should be sure to change the `language`
#### Multi GPU Whisper Training
The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 2 GPU devices in half-precision:
```bash
-python -m torch.distributed.launch \
+torchrun \
--nproc_per_node 2 run_speech_recognition_seq2seq.py \
--model_name_or_path="openai/whisper-small" \
--dataset_name="mozilla-foundation/common_voice_11_0" \
@@ -572,7 +572,7 @@ cross-entropy loss of **0.405** and word error rate of **0.0728**.
The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision.
```bash
-python -m torch.distributed.launch \
+torchrun \
--nproc_per_node 8 run_speech_recognition_seq2seq.py \
--dataset_name="librispeech_asr" \
--model_name_or_path="./" \
......
@@ -1595,7 +1595,7 @@ class Trainer:
# references registered here no longer work on other gpus, breaking the module
raise ValueError(
"Currently --debug underflow_overflow is not supported under DP. Please use DDP"
" (torch.distributed.launch)."
" (torchrun or torch.distributed.launch (deprecated))."
)
else:
debug_overflow = DebugUnderflowOverflow(self.model) # noqa
......
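Related usage note (not from the commit): the error above is raised when several GPUs are visible and `Trainer` falls back to `DataParallel`. Below is a hedged sketch of launching one process per GPU with `torchrun` so that DDP is used and `--debug underflow_overflow` is accepted; the script path and arguments are placeholders for illustration:

```bash
# Placeholder script and arguments, for illustration only.
# One process per GPU => DistributedDataParallel, under which
# --debug underflow_overflow is supported (it is rejected under DataParallel).
torchrun --nproc_per_node=2 pytorch/text-classification/run_glue.py \
  --model_name_or_path bert-base-uncased \
  --task_name mnli \
  --do_train \
  --output_dir /tmp/mnli_debug \
  --debug underflow_overflow
```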