Unverified commit fd3112bc authored by Enze Xie, committed by GitHub

[Doc] Add documentation for multi-node train with pytorch original ddp (#1296)



* update mn_train

* update

* Fix typos
Co-authored-by: Tai-Wang <tab_wang@outlook.com>
parent 33de2083
@@ -201,7 +201,27 @@ GPUS=16 ./tools/slurm_train.sh dev pp_kitti_3class hv_pointpillars_secfpn_6x8_16
You can check [slurm_train.sh](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) for full arguments and environment variables.
You can also use PyTorch's original DDP with the script `tools/sh_train.sh` (this script also supports single-machine training).
On each machine, run:
```shell
./tools/sh_train.sh ${CONFIG_FILE} ${NODE_NUM} ${NODE_RANK} ${MASTER_NODE_IP}
```
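Note that the script launches 8 processes per node (it hard-codes `--nproc_per_node=8`), so the total distributed world size is `8 * NODE_NUM`, and `NODE_RANK` is 0-based. A quick sanity check of that arithmetic:

```shell
# With 8 GPUs per node, the distributed world size is 8 * NODE_NUM.
NODE_NUM=2
echo $((8 * NODE_NUM))   # → 16
```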
Here is an example using 16 GPUs across 2 nodes, where the master node IP is 10.10.10.10.
Run on node 0:
```shell
./tools/sh_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 0 10.10.10.10
```
Run on node 1:
```shell
./tools/sh_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 1 10.10.10.10
```
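If port 29500 is already in use on the master node, the launcher script lets you override the rendezvous port through the `PORT` environment variable (it defaults to 29500 via `${PORT:-29500}`). A minimal demo of that default-value expansion:

```shell
# ${PORT:-29500} expands to $PORT if it is set, otherwise to 29500.
unset PORT
echo "${PORT:-29500}"   # → 29500
PORT=29501
echo "${PORT:-29500}"   # → 29501
```

For example, run `PORT=29501 ./tools/sh_train.sh ${CONFIG_FILE} ${NODE_NUM} ${NODE_RANK} ${MASTER_NODE_IP}` and use the same port on every node.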
If you have multiple machines connected simply with Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html).
Training is usually slow if you do not have high-speed networking like InfiniBand.
The launcher script added by this commit:

```shell
#!/usr/bin/env bash
# Launch (multi-)node training with PyTorch's native torch.distributed.launch.
set -e
set -x

CONFIG=$1            # path to the config file
NODE_NUM=$2          # total number of nodes
NODE_RANK=$3         # rank of this node, starting from 0
MASTER_ADDR=$4       # IP address of the rank-0 (master) node
PORT=${PORT:-29500}  # rendezvous port, override with PORT=xxxx

PYTHONPATH="$(dirname "$0")/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=8 --master_port=$PORT \
    --nnodes=$NODE_NUM --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR \
    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:5}
```
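The trailing `${@:5}` forwards every argument after the first four straight to `train.py`. A small sketch of how that slice behaves (the `--seed`/`--work-dir` flags here are just illustrative extras):

```shell
# ${@:5} expands to the positional arguments starting at the 5th one.
set -- cfg.py 2 0 10.10.10.10 --seed 0 --work-dir ./out
echo "${@:5}"   # → --seed 0 --work-dir ./out
```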