fix(docs): use static rdzv backend in multi-node troubleshooting script (#34784)

Signed-off-by: machov <mv1742@nyu.edu>

Commit 28028dff, authored Mar 04, 2026 by Manrique Vargas; committed by GitHub Mar 04, 2026

Showing 1 changed file with 8 additions and 10 deletions

docs/usage/troubleshooting.md +8 -10

--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -155,26 +155,24 @@ If you are testing with a single node, adjust `--nproc-per-node` to the number o
 NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
 ```

-If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
+If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address and port of the master node (e.g., `10.0.0.1:29400`), reachable from all nodes. Then, run:

 ```bash
 NCCL_DEBUG=TRACE torchrun --nnodes 2 \
    --nproc-per-node=2 \
-    --rdzv_backend=c10d \
-    --rdzv_endpoint=$MASTER_ADDR test.py
+    --rdzv_backend=static \
+    --rdzv_endpoint=$MASTER_ADDR \
+    --node-rank $NODE_RANK test.py
 ```

-If the script runs successfully, you should see the message `sanity check is successful!`.
-
-If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
+Set `MASTER_ADDR` to the IP address and port of the master node (e.g., `10.0.0.1:29400`), reachable from all nodes. Set `NODE_RANK` to `0` on the master node and `1`, `2`, ... on the workers. Adjust `--nproc-per-node` and `--nnodes` according to your setup.

 !!! note
-    A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
+    We use `--rdzv_backend=static` instead of `c10d` because the `c10d` rendezvous backend can fail with DNS resolution errors in multi-node setups (see [pytorch/pytorch#85300](https://github.com/pytorch/pytorch/issues/85300)). The `static` backend avoids this by requiring explicit node ranks.

-    - In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
-    - In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
+If the script runs successfully, you should see the message `sanity check is successful!`.

-    Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
+If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.

 ## Python multiprocessing