"...ssh:/git@developer.sourcefind.cn:2222/OpenDAS/dynamo.git" did not exist on "eb675ae38945b9b663b61777cc82460e0f9cc6d0"
Unverified Commit 28028dff authored by Manrique Vargas's avatar Manrique Vargas Committed by GitHub
Browse files

fix(docs): use static rdzv backend in multi-node troubleshooting script (#34784)


Signed-off-by: default avatarmachov <mv1742@nyu.edu>
parent 3417ba56
...@@ -155,26 +155,24 @@ If you are testing with a single node, adjust `--nproc-per-node` to the number o ...@@ -155,26 +155,24 @@ If you are testing with a single node, adjust `--nproc-per-node` to the number o
NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
``` ```
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run: If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address and port of the master node (e.g., `10.0.0.1:29400`), reachable from all nodes. Then, run:
```bash ```bash
NCCL_DEBUG=TRACE torchrun --nnodes 2 \ NCCL_DEBUG=TRACE torchrun --nnodes 2 \
--nproc-per-node=2 \ --nproc-per-node=2 \
--rdzv_backend=c10d \ --rdzv_backend=static \
--rdzv_endpoint=$MASTER_ADDR test.py --rdzv_endpoint=$MASTER_ADDR \
--node-rank $NODE_RANK test.py
``` ```
If the script runs successfully, you should see the message `sanity check is successful!`. Set `MASTER_ADDR` to the IP address and port of the master node (e.g., `10.0.0.1:29400`), reachable from all nodes. Set `NODE_RANK` to `0` on the master node and `1`, `2`, ... on the workers. Adjust `--nproc-per-node` and `--nnodes` according to your setup.
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
!!! note !!! note
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments: We use `--rdzv_backend=static` instead of `c10d` because the `c10d` rendezvous backend can fail with DNS resolution errors in multi-node setups (see [pytorch/pytorch#85300](https://github.com/pytorch/pytorch/issues/85300)). The `static` backend avoids this by requiring explicit node ranks.
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`. If the script runs successfully, you should see the message `sanity check is successful!`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes. If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
## Python multiprocessing ## Python multiprocessing
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment