Unverified commit fd3112bc authored by Enze Xie, committed by GitHub

[Doc] Add documentation for multi-node train with pytorch original ddp (#1296)



* update mn_train

* update

* Fix typos
Co-authored-by: Tai-Wang <tab_wang@outlook.com>
parent 33de2083
@@ -201,7 +201,27 @@ GPUS=16 ./tools/slurm_train.sh dev pp_kitti_3class hv_pointpillars_secfpn_6x8_16
You can check [slurm_train.sh](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) for full arguments and environment variables.
You can also use PyTorch's original DDP with the script `tools/sh_train.sh` (this script also supports single-machine training).
On each machine, run:
```shell
./tools/sh_train.sh ${CONFIG_FILE} ${NODE_NUM} ${NODE_RANK} ${MASTER_NODE_IP}
```
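Note that the script launches 8 processes per node (it hard-codes `--nproc_per_node=8`), so the total distributed world size is `8 * NODE_NUM`, and `NODE_RANK` is 0-based. A quick sanity check of that arithmetic:

```shell
# With 8 GPUs per node, the distributed world size is 8 * NODE_NUM.
NODE_NUM=2
echo $((8 * NODE_NUM))   # → 16
```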
Here is an example using 16 GPUs across 2 nodes, where the master node IP is 10.10.10.10.
Run on node 0:
```shell
./tools/sh_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 0 10.10.10.10
```
Run on node 1:
```shell
./tools/sh_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 1 10.10.10.10
```
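If port 29500 is already in use on the master node, the launcher script lets you override the rendezvous port through the `PORT` environment variable (it defaults to 29500 via `${PORT:-29500}`). A minimal demo of that default-value expansion:

```shell
# ${PORT:-29500} expands to $PORT if it is set, otherwise to 29500.
unset PORT
echo "${PORT:-29500}"   # → 29500
PORT=29501
echo "${PORT:-29500}"   # → 29501
```

For example, run `PORT=29501 ./tools/sh_train.sh ${CONFIG_FILE} ${NODE_NUM} ${NODE_RANK} ${MASTER_NODE_IP}` and use the same port on every node.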
If you have multiple machines connected simply with Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html).
Training is usually slow if you do not have high-speed networking like InfiniBand.
The launcher script added by this commit:

```shell
#!/usr/bin/env bash
# Launch (multi-)node training with PyTorch's native torch.distributed.launch.
set -e
set -x

CONFIG=$1            # path to the config file
NODE_NUM=$2          # total number of nodes
NODE_RANK=$3         # rank of this node, starting from 0
MASTER_ADDR=$4       # IP address of the rank-0 (master) node
PORT=${PORT:-29500}  # rendezvous port, override with PORT=xxxx

PYTHONPATH="$(dirname "$0")/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=8 --master_port=$PORT \
    --nnodes=$NODE_NUM --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR \
    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:5}
```
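The trailing `${@:5}` forwards every argument after the first four straight to `train.py`. A small sketch of how that slice behaves (the `--seed`/`--work-dir` flags here are just illustrative extras):

```shell
# ${@:5} expands to the positional arguments starting at the 5th one.
set -- cfg.py 2 0 10.10.10.10 --seed 0 --work-dir ./out
echo "${@:5}"   # → --seed 0 --work-dir ./out
```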