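# Introduction
This test case measures the performance of the MaskRCNN object detection model on the ROCm platform. The test procedure is described below.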

# Test procedure
## Enter the working directory
	cd references/detection  
## Dataset preparation

The test uses the COCO2017 dataset.
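
Before a long training run it can help to confirm that the dataset directory is laid out as the training script expects. The sketch below assumes the layout used by the upstream torchvision detection reference (`train2017/`, `val2017/`, and `annotations/instances_{train,val}2017.json`); the function name `check_coco_layout` is hypothetical and not part of this repository.

```shell
# Hypothetical helper: verify that a COCO2017 directory contains the
# entries the upstream torchvision detection reference reads.
# The expected layout is an assumption; adjust it if your copy differs.
check_coco_layout() {
    root="$1"
    status=0
    for p in train2017 val2017 \
             annotations/instances_train2017.json \
             annotations/instances_val2017.json; do
        if [ ! -e "$root/$p" ]; then
            echo "missing: $root/$p" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_coco_layout /path/to/{COCO2017_data_dir}` returns a non-zero status and lists each missing entry on stderr.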

## Run commands

### Single GPU
	export HIP_VISIBLE_DEVICES=0
	
	python3 train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 \
	     --lr-steps 16 22 --aspect-ratio-group-factor 3 \
	     --data-path /path/to/{COCO2017_data_dir}  
If the run fails with the error `Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to .cache/torch/checkpoints/resnet50-19c8e357.pth`, download resnet50-19c8e357.pth in advance and copy it to .cache/torch/checkpoints/.  
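
The copy step can be scripted. This is a minimal sketch that assumes the checkpoint was already fetched on a machine with network access and that the cache directory sits under `$HOME`; the function name `install_weights` is hypothetical, and the destination can be overridden with a second argument.

```shell
# Hypothetical helper: place a manually downloaded checkpoint into the
# cache directory named in the error message.
# Assumption: the cache lives under $HOME unless a second argument is given.
install_weights() {
    src="$1"
    dest_dir="${2:-$HOME/.cache/torch/checkpoints}"
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}
```

For example, `install_weights ./resnet50-19c8e357.pth` creates the directory if needed and copies the file into it.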

### Single node, multiple GPUs

1) Launching with PyTorch

	export HIP_VISIBLE_DEVICES=0,1,2,3
	export NGPUS=4
	export OMP_NUM_THREADS=1
	python3 -m torch.distributed.launch --nproc_per_node=${NGPUS} --use_env train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01 --data-path /path/to/{COCO2017_data_dir} > train_4gpu_lr0.01.log 2>&1 &
Note: for multi-GPU runs, the learning rate scales with the number of GPUs as 0.02/8*$NGPUS. For example, lr_4gpu=0.01, lr_2gpu=0.005, lr_1gpu=0.0025.  
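
The scaling rule can be computed in the shell instead of by hand. This is a minimal sketch; the function name `scaled_lr` is hypothetical, and it uses `awk` because POSIX shell arithmetic has no floating point.

```shell
# Hypothetical helper: learning rate for a given GPU count,
# following the rule stated in the note: lr = 0.02 / 8 * NGPUS.
scaled_lr() {
    awk -v n="$1" 'BEGIN { printf "%g\n", 0.02 / 8 * n }'
}

NGPUS=4
LR=$(scaled_lr "$NGPUS")
echo "NGPUS=$NGPUS lr=$LR"
```

The result can then be passed to the training command as `--lr "$LR"`, which keeps the learning rate consistent with the GPU count.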

2) Launching with MPI

```
cd references/detection 
mpirun -np 4  --bind-to none single_process.sh localhost
```

### Multiple nodes, multiple GPUs

Launching with MPI

```
mpirun -np $np --hostfile hostfile --bind-to none single_process.sh $dist_url
```

Here, $dist_url is the IP address of the master node. For multi-node runs, set it according to the format used in the hostfile.

# References

[https://github.com/pytorch/vision/tree/master/references/detection](https://github.com/pytorch/vision/tree/master/references/detection)