# Fast-Rcnn 

[TOC]



## Introduction

This test case is used to test Faster R-CNN, a PyTorch object detection model.

## Dataset Preparation

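The Objects365 dataset is used. On the Kunshan platform, Objects365 is located at `/public/software/apps/DeepLearning/Data/objects365`.

In the parser in train.py, set `--data-path` to the dataset path. The number of epochs, the learning rate, the output location, and other parameters can also be set there.

Parameter notes:

- In train.py, pay particular attention to `--batch-size` and `-j`;
- `--save-path` is the checkpoint save path and must be an existing folder;
- `--data-path` is the dataset location.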
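
The flags above can be sketched as a small `argparse` parser. This is a hypothetical illustration, not the parser in train.py: the long name `--workers`, the defaults, and the `check_output_dir` helper are assumptions. The commands in this README pass the output location as `--output-dir`, so the sketch uses that name and applies the "must already exist" rule to it.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser that mirrors the flags used in this README."""
    parser = argparse.ArgumentParser(description="Detection training (sketch)")
    # Dataset root, e.g. the Objects365 folder
    parser.add_argument("--data-path", required=True, help="dataset location")
    # Assumed defaults, copied from the single-GPU command in this README
    parser.add_argument("--batch-size", type=int, default=2)
    # Data-loading worker count; the long name --workers is an assumption
    parser.add_argument("-j", "--workers", type=int, default=8)
    parser.add_argument("--epochs", type=int, default=26)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--output-dir", default=".", help="where checkpoints go")
    return parser


def check_output_dir(path):
    """Fail early if the checkpoint folder does not exist yet."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"output folder does not exist: {path}")
    return path


args = build_parser().parse_args(
    ["--data-path", "/path/to/datasets/folder", "--batch-size", "2", "-j", "8"]
)
print(args.data_path, args.batch_size, args.workers, args.output_dir)
```

Checking the folder before training starts avoids losing a long run to a failed checkpoint write.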

## Environment Setup

Download the dtk22.04.1 installation package: [download link](https://cancon.hpccube.com:65024/1/main/DTK-22.04.1/centos7.6)

After extracting it, run the following command to load the dtk22.04.1 environment:

```
source env.sh
```

Prepare a Python 3.7 environment:

```
conda create --name Fast-Rcnn python=3.7

conda env list

conda activate Fast-Rcnn
```

Install the remaining dependencies (because of network issues, using a mirror is recommended):

```
pip3 install torch-1.10.0a0+git450cdd1.dtk22.4-cp37-cp37m-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple

pip3 install torchvision-0.10.0a0_dtk22.04_300a8a4-cp37-cp37m-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple

pip3 install pycocotools -i https://pypi.tuna.tsinghua.edu.cn/simple
```
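
After installation, a quick check that the three packages are visible to the interpreter can save a failed training launch. The snippet below is a minimal sketch: the helper names are hypothetical, and the use of `torch.cuda.is_available()` assumes the DTK build exposes the accelerator through the `torch.cuda` API, as ROCm builds of PyTorch do.

```python
import importlib.util

REQUIRED = ("torch", "torchvision", "pycocotools")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def describe_torch():
    """Report the torch version and device visibility (call only if torch is installed)."""
    import torch

    # Assumption: the DTK build reports its devices through torch.cuda
    return f"torch {torch.__version__}, device available: {torch.cuda.is_available()}"


print("missing:", missing_packages())
```

If the list is empty, calling `describe_torch()` inside the activated `Fast-Rcnn` environment shows whether a device is visible.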

## Training

### Single GPU

    export HIP_VISIBLE_DEVICES=0
    python3 train.py  --batch-size=2 -j 8 --epochs=26 --data-path=/path/to/datasets/folder --output-dir=/path/to/result/save/folder
### Single Node, Multiple GPUs

single_process.sh

```
#!/bin/bash

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE

APP="python3 `pwd`/train.py  --batch-size=8 -j 16 --epochs=12 --dist-url tcp://${1}:34567 --world-size=${comm_size} --rank=${comm_rank} --output-dir `pwd`/../output_dir_4card --data-path=/public/software/apps/DeepLearning/Data/objects365  --lr 0.01"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP} 
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
```
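
The `case` statement in the script follows one pattern: local rank N is bound to NUMA node N and to network device `mlx5_N`, while every rank sees all four GPUs. The Python sketch below restates that mapping for illustration only. The function name `binding_for` is hypothetical, and unlike the script, which silently launches nothing for a local rank outside 0 to 3, the sketch raises an error.

```python
def binding_for(local_rank):
    """Return (env, numactl_prefix) matching the script's branch for one local rank."""
    if local_rank not in range(4):
        # The shell script has no branch for other ranks; raising here is a choice of this sketch
        raise ValueError(f"no binding defined for local rank {local_rank}")
    nic = f"mlx5_{local_rank}"
    env = {
        "HIP_VISIBLE_DEVICES": "0,1,2,3",  # identical for every rank in the script
        "UCX_NET_DEVICES": f"{nic}:1",
        "UCX_IB_PCI_BW": f"{nic}:50Gbs",
    }
    prefix = ["numactl", f"--cpunodebind={local_rank}", f"--membind={local_rank}"]
    return env, prefix


env, prefix = binding_for(1)
print(env["UCX_NET_DEVICES"], " ".join(prefix))
```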

run_multi.sh

    mpirun -np 4 --bind-to none single_process.sh localhost
    
    #mpirun -np 4 --hostfile hostfile --bind-to none `pwd`/single_process.sh localhost
Run:

```
nohup $WORK_PATH/run_multi.sh > train_4card_32.log   2>&1  &
```

### Multiple Nodes, Multiple GPUs

    mpirun -np $np --hostfile hostfile --bind-to none `pwd`/single_process.sh ${master_ip}
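
The command above needs a hostfile, a process count `$np`, and a `${master_ip}`. The sketch below shows one way these could fit together, under stated assumptions: the host names `node1` and `node2` are hypothetical, each node is assumed to run four processes as in the single-node script, and the first host is assumed to serve as the master. The `host slots=N` line format is the one Open MPI hostfiles use.

```python
def hostfile_text(hosts, slots_per_host=4):
    """Build Open MPI hostfile content: one 'host slots=N' line per node."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def mpirun_command(hosts, slots_per_host=4,
                   script="./single_process.sh", hostfile="hostfile"):
    """Assemble the multi-node launch command as an argument list."""
    total_procs = len(hosts) * slots_per_host  # value passed as -np
    master_ip = hosts[0]  # assumption: the first host acts as the master
    return ["mpirun", "-np", str(total_procs), "--hostfile", hostfile,
            "--bind-to", "none", script, master_ip]


hosts = ["node1", "node2"]  # hypothetical host names
print(hostfile_text(hosts), end="")
print(" ".join(mpirun_command(hosts)))
```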

## References
[https://github.com/pytorch/vision/tree/master/references/detection](https://github.com/pytorch/vision/tree/master/references/detection)