# MMClassification Benchmark Tests

## Preparation

### Dataset preparation

Use the ImageNet-pytorch dataset.

### Environment setup

```bash
yum install python3
yum install libquadmath
yum install numactl
yum install openmpi3
yum install glog
yum install lmdb-libs
yum install opencv-core
yum install opencv
yum install openblas-serial
pip3 install --upgrade pip
pip3 install opencv-python
```

### Install Python dependencies

```bash
pip3 install torch-1.10.0a0+git2040069.dtk2210-cp37-cp37m-manylinux2014_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install torchvision-0.10.0a0+e04d001.dtk2210-cp37-cp37m-manylinux2014_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install mmcv_full-1.6.1+gitdebbc80.dtk2210-cp37-cp37m-manylinux2014_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
# Install mmcls from source in editable mode
cd mmclassification-0.24.1
pip3 install -e .
```

Note: when testing a different DTK version, install the whl packages built for that version.

## ResNet18 Tests
### Single-precision tests

### Single-GPU test (single precision)

```bash
./sing_test.sh configs/resnet/resnet18_b32x8_imagenet.py
```
#### Parameters

In `configs/_base_/datasets/imagenet_bs32.py`, batch_size = samples_per_gpu × number of GPUs. Throughput is computed as batch_size / time.
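The throughput formula above can be sketched as a small helper. This is only an illustration: it assumes `time` is the per-iteration time in seconds reported in the training log, and the function name and the sample value `0.5` are made up for the example rather than measured results.

```python
def throughput(samples_per_gpu, num_gpus, iter_time_s):
    """Return images per second, following batch_size / time.

    Assumption: iter_time_s is the per-iteration `time` value (in seconds)
    taken from the training log.
    """
    if iter_time_s <= 0:
        raise ValueError("iter_time_s must be positive")
    batch_size = samples_per_gpu * num_gpus  # global batch size
    return batch_size / iter_time_s


# Illustrative numbers only: 32 samples per GPU, 8 GPUs, 0.5 s per iteration.
print(throughput(32, 8, 0.5))  # 512.0
```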

#### Performance metric: time

### Multi-GPU tests (single precision)
#### Single-node multi-GPU training

1. PyTorch single-node multi-GPU training

```bash
./tools/dist_train.sh configs/resnet/resnet18_b32x8_imagenet.py $GPUS
```

2. mpirun single-node multi-GPU training

`mpirun --allow-run-as-root --bind-to none -np 4 single_process.sh a03r3n15`

Here `a03r3n15` is the address of the master node.

#### Multi-node multi-GPU training

1. PyTorch multi-node multi-GPU training

On the first machine:

`NODES=2 NODE_RANK=0 PORT=12345 MASTER_ADDR=10.1.3.56 sh tools/dist_train.sh configs/resnet/resnet18_b32x8_imagenet.py $GPUS`

On the second machine:

`NODES=2 NODE_RANK=1 PORT=12345 MASTER_ADDR=10.1.3.56 sh tools/dist_train.sh configs/resnet/resnet18_b32x8_imagenet.py $GPUS`

2. mpirun multi-node multi-GPU training

`mpirun --allow-run-as-root --hostfile hostfile --bind-to none -np 4 single_process.sh a03r3n15`

Here `a03r3n15` is the address of the master node.

Contents of `hostfile`:

a03r3n15 slots=4

e10r4n04 slots=4
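
When choosing a value for `-np`, it can help to know how many slots the hostfile provides in total. The sketch below is a hypothetical helper (not part of MMClassification or Open MPI) that sums the `slots=` values in hostfile text. It assumes every host line states its slots explicitly and ignores lines that do not.

```python
def total_slots(hostfile_text):
    """Sum the explicit `slots=N` values in Open MPI style hostfile text.

    Assumption: hosts without an explicit `slots=` entry are not counted.
    """
    total = 0
    for raw_line in hostfile_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        for field in line.split()[1:]:  # fields after the host name
            if field.startswith("slots="):
                total += int(field[len("slots="):])
    return total


HOSTFILE = """\
a03r3n15 slots=4

e10r4n04 slots=4
"""

print(total_slots(HOSTFILE))  # 8
```

For the hostfile shown above this gives 8 slots across the two nodes, while the example command starts 4 processes.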

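### Half-precision tests
Edit the config file and add `fp16 = dict(loss_scale=512.)`. The single-node and multi-node multi-GPU tests are then run the same way as the single-precision tests.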

### Testing other models

Other models follow the same steps as ResNet18; only the config file changes. The table below lists the config file for each model:

| Model         | Config file                                                  |
| ------------- | ------------------------------------------------------------ |
| ResNet34      | configs/resnet/resnet34_b32x8_imagenet.py                    |
| ResNet50      | configs/resnet/resnet50_b32x8_imagenet.py                    |
| ResNet152     | configs/resnet/resnet152_b32x8_imagenet.py                   |
| VGG11         | configs/vgg/vgg11_b32x8_imagenet.py                          |
| VGG16         | configs/vgg/vgg16_b32x8_imagenet.py                          |
| SE-ResNet50   | configs/seresnet/seresnet50_b32x8_imagenet.py                |
| ResNeXt50     | configs/resnext/resnext50_32x4d_b32x8_imagenet.py            |
| MobileNet-v2  | configs/mobilenet_v2/mobilenet_v2_b32x8_imagenet.py          |
| ShuffleNet-v1 | configs/shufflenet_v1/shufflenet_v1_1x_b64x16_linearlr_bn_nowd_imagenet.py |
| ShuffleNet-v2 | configs/shufflenet_v2/shufflenet_v2_1x_b64x16_linearlr_bn_nowd_imagenet.py |
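
To run several models in sequence, the launch commands can be generated from the table. The sketch below only builds and prints the command lines for `./tools/dist_train.sh`. The helper name `build_command`, the subset of models, and the GPU count of 4 are assumptions made for illustration.

```python
# A few config paths copied from the table above; extend as needed.
CONFIGS = {
    "ResNet34": "configs/resnet/resnet34_b32x8_imagenet.py",
    "ResNet50": "configs/resnet/resnet50_b32x8_imagenet.py",
    "Vgg16": "configs/vgg/vgg16_b32x8_imagenet.py",
    "MobileNet-v2": "configs/mobilenet_v2/mobilenet_v2_b32x8_imagenet.py",
}


def build_command(config, gpus):
    """Return the argument list for one single-node multi-GPU run."""
    return ["./tools/dist_train.sh", config, str(gpus)]


# Print one command per model; 4 GPUs is an illustrative choice.
for model, config in CONFIGS.items():
    print(model + ": " + " ".join(build_command(config, 4)))
```

Each argument list can be handed to `subprocess.run` to actually launch the run.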