Unverified Commit 98848623 authored by Zachary Mueller, committed by GitHub

Deprecate xpu_backend in favor of ddp_backend (#23085)



* Deprecate xpu_backend in favor of ddp_backend

* Typo

* Only do a minor deprecation, no need for major
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 95cf3725
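In short, after this change the distributed backend for the `Trainer` is selected with `ddp_backend`, and `xpu_backend` survives only as a deprecated alias. A minimal usage sketch of the renamed argument (not part of the diff below; the output directory and the `no_cuda` flag are illustrative, and it assumes `torch` and `accelerate` are installed):

```python
# Minimal sketch of the renamed argument; only `ddp_backend` and its accepted
# values ("nccl", "gloo", "mpi", "ccl") come from this commit.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/debug_squad/",  # placeholder path
    no_cuda=True,                    # CPU-only run, matching the docs changed below
    ddp_backend="ccl",               # previously: xpu_backend="ccl"
)
print(args.ddp_backend)  # "ccl"
```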
@@ -73,7 +73,7 @@ The following "Usage in Trainer" takes mpirun in Intel® MPI library as an examp
## Usage in Trainer
- To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--xpu_backend ccl`** in the command arguments.
+ To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--ddp_backend ccl`** in the command arguments.
Let's see an example with the [question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)
@@ -95,7 +95,7 @@ The following command enables training with 2 processes on one Xeon node, with o
--doc_stride 128 \
--output_dir /tmp/debug_squad/ \
--no_cuda \
- --xpu_backend ccl \
+ --ddp_backend ccl \
--use_ipex
```
The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
@@ -124,7 +124,7 @@ Now, run the following command in node0 and **4DDP** will be enabled in node0 an
--doc_stride 128 \
--output_dir /tmp/debug_squad/ \
--no_cuda \
- --xpu_backend ccl \
+ --ddp_backend ccl \
--use_ipex \
--bf16
```
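The `--ddp_backend ccl` flag in the commands above reaches `TrainingArguments` through `HfArgumentParser`, which the example scripts use to build their arguments. A small sketch of that mapping, with placeholder values for everything except the flag itself:

```python
# Sketch: the CLI flag --ddp_backend maps onto TrainingArguments.ddp_backend.
from transformers import HfArgumentParser, TrainingArguments

parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses(
    ["--output_dir", "/tmp/debug_squad/", "--no_cuda", "--ddp_backend", "ccl"]
)
print(training_args.ddp_backend)  # "ccl"
```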
@@ -76,7 +76,7 @@ The following "Usage in Trainer" takes mpirun in the Intel® MPI library as an example
## Usage in Trainer
- To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--xpu_backend ccl`** in the command arguments.
+ To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--ddp_backend ccl`** in the command arguments.
Let's see an example with the [question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)
@@ -98,7 +98,7 @@ The following command enables two processes on the Xeon node, with one process running
--doc_stride 128 \
--output_dir /tmp/debug_squad/ \
--no_cuda \
- --xpu_backend ccl \
+ --ddp_backend ccl \
--use_ipex
```
@@ -131,7 +131,7 @@ Now, run the following command in node0 and **4DDP** will be enabled
--doc_stride 128 \
--output_dir /tmp/debug_squad/ \
--no_cuda \
- --xpu_backend ccl \
+ --ddp_backend ccl \
--use_ipex \
--bf16
```
@@ -325,8 +325,8 @@ class TrainingArguments:
experimental API and it may change.
local_rank (`int`, *optional*, defaults to -1):
Rank of the process during distributed training.
- xpu_backend (`str`, *optional*):
- The backend to use for xpu distributed training. Must be one of `"mpi"` or `"ccl"` or `"gloo"`.
+ ddp_backend (`str`, *optional*):
+ The backend to use for distributed training. Must be one of `"nccl"`, `"mpi"`, `"ccl"`, `"gloo"`.
tpu_num_cores (`int`, *optional*):
When training on TPU, the number of TPU cores (automatically passed by launcher script).
dataloader_drop_last (`bool`, *optional*, defaults to `False`):
@@ -822,11 +822,11 @@ class TrainingArguments:
},
)
local_rank: int = field(default=-1, metadata={"help": "For distributed training: local_rank"})
- xpu_backend: Optional[str] = field(
+ ddp_backend: Optional[str] = field(
default=None,
metadata={
"help": "The backend to be used for distributed training on Intel XPU.",
"choices": ["mpi", "ccl", "gloo"],
"help": "The backend to be used for distributed training",
"choices": ["nccl", "gloo", "mpi", "ccl"],
},
)
tpu_num_cores: Optional[int] = field(
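Because the `choices` list above is handed to argparse by `HfArgumentParser`, an unsupported backend passed on the command line is rejected before training starts. A hedged sketch of that behaviour (placeholder values; argparse signals the rejection by raising `SystemExit`):

```python
# Sketch: argparse enforces the "choices" list for --ddp_backend.
from transformers import HfArgumentParser, TrainingArguments

parser = HfArgumentParser(TrainingArguments)
try:
    parser.parse_args_into_dataclasses(
        ["--output_dir", "/tmp/debug_squad/", "--ddp_backend", "not-a-backend"]
    )
except SystemExit:
    print("unsupported --ddp_backend value rejected")
```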
@@ -1123,6 +1123,14 @@ class TrainingArguments:
},
)
+ xpu_backend: Optional[str] = field(
+     default=None,
+     metadata={
+         "help": "The backend to be used for distributed training on Intel XPU.",
+         "choices": ["mpi", "ccl", "gloo"],
+     },
+ )
def __post_init__(self):
# expand paths, if not os.makedirs("~/bar") will make directory
# in the current directory instead of the actual home
@@ -1146,6 +1154,14 @@ class TrainingArguments:
# Go back to the underlying string or we won't be able to instantiate `IntervalStrategy` on it.
self.evaluation_strategy = self.evaluation_strategy.value
+ if self.xpu_backend is not None:
+     warnings.warn(
+         "using `xpu_backend` is deprecated and will be removed in version 4.31"
+         " of 🤗 Transformers. Use `ddp_backend` instead",
+         FutureWarning,
+     )
+     self.ddp_backend = self.xpu_backend
self.evaluation_strategy = IntervalStrategy(self.evaluation_strategy)
self.logging_strategy = IntervalStrategy(self.logging_strategy)
self.save_strategy = IntervalStrategy(self.save_strategy)
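A small sketch of the deprecation path added above, with placeholder values: passing the old `xpu_backend` argument still works, but it now emits a `FutureWarning` and its value is copied into `ddp_backend`.

```python
# Sketch of the __post_init__ deprecation shim introduced in this commit
# (output_dir is a placeholder; torch and accelerate are assumed installed).
import warnings

from transformers import TrainingArguments

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    args = TrainingArguments(output_dir="/tmp/debug_squad/", no_cuda=True, xpu_backend="ccl")

print(args.ddp_backend)  # "ccl", copied over from the deprecated xpu_backend
print(any(issubclass(w.category, FutureWarning) for w in caught))  # True
```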
@@ -1544,7 +1560,7 @@ class TrainingArguments:
"Using the `Trainer` with `PyTorch` requires `accelerate`: Run `pip install --upgrade accelerate`"
)
if self.no_cuda:
- self.distributed_state = PartialState(cpu=True)
+ self.distributed_state = PartialState(cpu=True, backend=self.ddp_backend)
self._n_gpu = 0
elif is_sagemaker_mp_enabled():
local_rank = smp.local_rank()
@@ -1558,7 +1574,7 @@ class TrainingArguments:
del os.environ["ACCELERATE_USE_DEEPSPEED"]
self._n_gpu = 1
else:
- self.distributed_state = PartialState(backend=self.xpu_backend)
+ self.distributed_state = PartialState(backend=self.ddp_backend)
self._n_gpu = 1
if not is_sagemaker_mp_enabled():
device = self.distributed_state.device
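The two hunks above forward the chosen backend to 🤗 Accelerate's `PartialState`, for both the CPU (`no_cuda`) path and the default path. A minimal sketch of that underlying call, assuming a CPU setup; the backend choice only takes effect when the script runs under a distributed launcher such as mpirun or torchrun:

```python
# Sketch of the Accelerate call TrainingArguments now makes for CPU runs;
# "gloo" is an illustrative backend choice.
from accelerate import PartialState

state = PartialState(cpu=True, backend="gloo")
print(state.device, state.distributed_type)
```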