Unverified commit 9c7270d0, authored by ChaimZhu, committed by GitHub

add multi-machine dist_train (#1303)

parent fd3112bc
In the English documentation:

@@ -201,30 +201,23 @@ GPUS=16 ./tools/slurm_train.sh dev pp_kitti_3class hv_pointpillars_secfpn_6x8_16
You can check [slurm_train.sh](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) for full arguments and environment variables.
If you launch with multiple machines simply connected via Ethernet, you can run the following commands.

On the first machine:
```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR ./tools/dist_train.sh $CONFIG $GPUS
```
On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR ./tools/dist_train.sh $CONFIG $GPUS
```
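For example, with the sample setup the old instructions used (16 GPUs over 2 nodes, 8 GPUs each, master IP 10.10.10.10), the new launcher would be invoked as below; the port is simply the script's default, and the config name is abbreviated as in that old example:

```shell
# node 0 (NODE_RANK=0 must be the machine reachable at MASTER_ADDR)
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.10.10.10 \
    ./tools/dist_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 8

# node 1
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.10.10.10 \
    ./tools/dist_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 8
```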
Usually it is slow if you do not have high-speed networking like InfiniBand.
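One optional aside, not part of this commit: on Ethernet clusters with several network interfaces, NCCL can bind to the wrong one, which makes things even slower. A hedged sketch of pinning it explicitly (`eth0` is a placeholder for your actual interface name):

```shell
# assumption: your Ethernet interface is named eth0 (check with `ip addr`)
export NCCL_SOCKET_IFNAME=eth0
# optional: have NCCL log its choices so you can verify the interface used
export NCCL_DEBUG=INFO
```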
### Launch multiple jobs on a single machine
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
...
In the Chinese documentation:

@@ -198,7 +198,21 @@ GPUS=16 ./tools/slurm_train.sh dev pp_kitti_3class hv_pointpillars_secfpn_6x8_16
You can check [slurm_train.sh](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) for all the arguments and environment variables.
If you want to use multiple machines connected by Ethernet, you can run the following commands.

On the first machine:
```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR ./tools/dist_train.sh $CONFIG $GPUS
```
On the second machine:
```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR ./tools/dist_train.sh $CONFIG $GPUS
```
However, training will be very slow unless the machines are connected with a high-speed network.
### Launch multiple jobs on a single machine
...
In `tools/dist_test.sh`:

@@ -3,8 +3,20 @@
```shell
CONFIG=$1
CHECKPOINT=$2
GPUS=$3
# multi-node settings; all can be overridden from the environment,
# and the defaults preserve the old single-machine behaviour
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/test.py \
    $CONFIG \
    $CHECKPOINT \
    --launcher pytorch \
    ${@:4}
```
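As a usage sketch (the `--eval mAP` flag is illustrative of KITTI-style evaluation, not something this commit prescribes), multi-machine testing mirrors training:

```shell
# run on each node with the appropriate NODE_RANK (0 on the master node)
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.10.10.10 \
    ./tools/dist_test.sh $CONFIG $CHECKPOINT 8 --eval mAP
```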
In `tools/dist_train.sh`:

@@ -2,8 +2,19 @@
```shell
CONFIG=$1
GPUS=$2
# multi-node settings; all can be overridden from the environment,
# and the defaults preserve the old single-machine behaviour
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py \
    $CONFIG \
    --seed 0 \
    --launcher pytorch ${@:3}
```
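Because every multi-node variable uses the `${VAR:-default}` expansion, an ordinary single-machine launch keeps working unchanged; for instance:

```shell
# single machine, 8 GPUs: NNODES=1, NODE_RANK=0 and MASTER_ADDR=127.0.0.1
# are filled in automatically by the defaults above
./tools/dist_train.sh $CONFIG 8
```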
Deleted: the old multi-node launcher (the `multinode_train.sh`/`sh_train.sh` script referenced by the previous docs), which hard-coded 8 processes per node and took the node count, rank, and master address positionally:

```shell
#!/usr/bin/env bash
set -e
set -x

CONFIG=$1
NODE_NUM=$2
NODE_RANK=$3
MASTER_ADDR=$4
PORT=${PORT:-29500}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=8 --master_port=$PORT \
    --nnodes=$NODE_NUM --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR \
    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:5}
```
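For reference, the previous documentation invoked it as `./tools/sh_train.sh ${CONFIG_FILE} ${NODE_NUM} ${NODE_RANK} ${MASTER_NODE_IP}`, e.g. for 16 GPUs on 2 nodes with master IP 10.10.10.10:

```shell
# node 0
./tools/sh_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 0 10.10.10.10
# node 1
./tools/sh_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 1 10.10.10.10
```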