# ResNet50

## Paper
`Deep Residual Learning for Image Recognition`
- https://arxiv.org/abs/1512.03385

## Model Structure
The ResNet50 network contains 49 convolutional layers and 1 fully connected layer, among other layers.

![img](./doc/ResNet50.png)
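
The 49 + 1 layer count can be checked with a little arithmetic. The sketch below assumes the standard ResNet-50 layout from the paper: one 7x7 stem convolution followed by four stages of 3, 4, 6, and 3 bottleneck blocks, each with three convolutions. Shortcut projection convolutions are conventionally left out of the count. All names are illustrative and not taken from this repository.

```python
# Standard ResNet-50 layout (He et al., 2015); names here are illustrative.
STEM_CONVS = 1                    # the initial 7x7 convolution
BLOCKS_PER_STAGE = (3, 4, 6, 3)   # bottleneck blocks in stages conv2_x..conv5_x
CONVS_PER_BLOCK = 3               # 1x1 -> 3x3 -> 1x1 in each bottleneck block
FC_LAYERS = 1                     # final fully connected classifier


def count_resnet50_layers():
    """Return (conv_layers, fc_layers), ignoring shortcut projections."""
    conv_layers = STEM_CONVS + sum(BLOCKS_PER_STAGE) * CONVS_PER_BLOCK
    return conv_layers, FC_LAYERS


conv, fc = count_resnet50_layers()
print(conv, fc, conv + fc)  # 49 1 50
```
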
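## Algorithm
ResNet50 stacks multiple residual blocks, each with a residual (skip) connection, to mitigate the vanishing or exploding gradient problem and allow the network to be made much deeper.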

![img](./doc/Residual_Block.png)
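
A residual block computes `y = F(x) + x`, where `F` is the learned branch and `x` is carried over unchanged by the skip connection. The toy sketch below uses plain Python lists rather than tensors to show the idea. `residual_block`, `zero_transform`, and `Vector` are hypothetical names, not part of this repository.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return transform(x) + x, the skip connection used by residual blocks."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input shape")
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x: Vector) -> Vector:
    """A branch that has learned nothing yet: it outputs all zeros."""
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
# With a zero residual branch the block reduces to the identity mapping,
# so the input passes through unchanged.
print(residual_block(x, zero_transform))  # [1.0, -2.0, 3.0]
```

Because the identity path contributes a constant term to the derivative of `y` with respect to `x`, gradients can still reach earlier layers even when the learned branch contributes little.
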
## Environment Setup
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.7.0-centos7.6-dtk-22.10.1-py38-latest
# Replace <Your Image ID> with the ID of the Docker image pulled above
docker run --shm-size 16g --network=host --name=ResNet50-TensorFlow2x --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $PWD/ResNet50-TensorFlow2x:/home/ResNet50-TensorFlow2x -it <Your Image ID> bash
pip install -r requirements.txt
```
### Dockerfile (Option 2)
```
cd ResNet50-TensorFlow2x/docker
# Docker image names must be lowercase
docker build --no-cache -t resnet50-tensorflow2x:latest .
docker run --rm --shm-size 16g --network=host --name=ResNet50-TensorFlow2x --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $PWD/ResNet50-TensorFlow2x:/home/ResNet50-TensorFlow2x -it resnet50-tensorflow2x:latest bash
```
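### Anaconda (Option 3)
1. The special deep learning libraries that this project requires for DCU cards can be downloaded from the developer community and installed: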
https://developer.hpccube.com/tool/
```
DTK version: dtk22.10.1
python:  3.8
tensorflow: 2.9
tf-models-official: 2.7
keras: 2.7
tensorboard: 2.7
```
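`Tips: the versions of dtk, python, tensorflow, and the other DCU-related tools listed above must match one another exactly`

2. Install the remaining, non-special libraries from requirements.txt: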
```
pip3 install -r requirements.txt
```
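
Since the versions listed above have to line up, a quick check of what is actually installed can save a failed run. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). The distribution names are assumptions and may differ for the DCU-specific wheels. `check_versions` and `installed_version` are hypothetical helpers, not part of this repository.

```python
from importlib import metadata

# Versions listed in this README; the distribution names are assumptions and
# may differ for the DCU-specific wheels.
EXPECTED = {
    "tensorflow": "2.9",
    "tf-models-official": "2.7",
    "keras": "2.7",
    "tensorboard": "2.7",
}


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED, lookup=installed_version):
    """Map each package to (installed, expected_prefix, matches)."""
    report = {}
    for name, prefix in expected.items():
        found = lookup(name)
        ok = found is not None and (found == prefix or found.startswith(prefix + "."))
        report[name] = (found, prefix, ok)
    return report


for name, (found, prefix, ok) in check_versions().items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={found} expected={prefix}.x {status}")
```
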

## Dataset
Training uses the ImageNet dataset, which must first be converted to TFRecord format.

The ImageNet dataset can be downloaded from the [official website](https://image-net.org/ "ImageNet dataset official website"), found through a Baidu search, or obtained by contacting us.

To convert the ImageNet dataset to TFRecord format, refer to this [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py) and its [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy).
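
After conversion it can be worth confirming that no shard is missing before starting a long run. The sketch below assumes the shard naming `train-XXXXX-of-01024` and `validation-XXXXX-of-00128`, with all files placed directly in the data directory. This layout is an assumption about the conversion output, so adjust the counts and names if yours differs. `missing_shards` is a hypothetical helper.

```python
import os


def missing_shards(data_dir, train_shards=1024, val_shards=128):
    """List expected TFRecord shard files that are absent from data_dir.

    Assumes the `train-XXXXX-of-NNNNN` / `validation-XXXXX-of-NNNNN` naming
    and the 1024/128 shard counts; adjust both if your conversion differs.
    """
    expected = [f"train-{i:05d}-of-{train_shards:05d}" for i in range(train_shards)]
    expected += [f"validation-{i:05d}-of-{val_shards:05d}" for i in range(val_shards)]
    present = set(os.listdir(data_dir))
    return [name for name in expected if name not in present]


# Example (path is a placeholder):
# print(len(missing_shards("/path/to/ImageNet-tensorflow_data_dir")), "shard(s) missing")
```
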

## Training
### Environment Setup
Pull the training Docker image from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details):

Training image: `docker pull image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.7.0-centos7.6-dtk-22.10.1-py37-latest`

Install the Python dependencies:

    pip3 install -r requirements.txt


### FP32 Training
#### Single-node, single-card training commands:

Without XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH  
    python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1  --use_synthetic_data=false --dtype=fp32

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1  --use_synthetic_data=false --dtype=fp32

#### Single-node, four-card training commands:

Without XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4  --use_synthetic_data=false --dtype=fp32

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4  --use_synthetic_data=false --dtype=fp32
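
The single-node commands above differ only in the card count, batch size, precision, and whether `TF_XLA_FLAGS` is set. A small Python helper can assemble them and avoid copy-paste mistakes. This is a sketch: `build_training_command` is a hypothetical helper, and it uses only the flags that appear in the commands above.

```python
import shlex

SCRIPT = "official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py"


def build_training_command(data_dir, model_dir, num_gpus=1, dtype="fp32",
                           batch_size_per_gpu=128, use_xla=False):
    """Return (argv, env_overrides) mirroring the commands in this README."""
    if dtype not in ("fp32", "fp16"):
        raise ValueError("dtype must be 'fp32' or 'fp16'")
    argv = [
        "python3", SCRIPT,
        f"--data_dir={data_dir}",
        f"--model_dir={model_dir}",
        f"--batch_size={batch_size_per_gpu * num_gpus}",
        f"--num_gpus={num_gpus}",
        "--use_synthetic_data=false",
        f"--dtype={dtype}",
    ]
    # XLA auto-clustering is switched on through an environment variable.
    env = {"TF_XLA_FLAGS": "--tf_xla_auto_jit=2"} if use_xla else {}
    return argv, env


argv, env = build_training_command("/path/to/data", "/path/to/model",
                                   num_gpus=4, dtype="fp32", use_xla=True)
prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
print(prefix, " ".join(shlex.quote(a) for a in argv))
# To launch (after `import os, subprocess`):
# subprocess.run(argv, env={**os.environ, **env}, check=True)
```
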

#### Multi-node, multi-card training commands (using four processes on a single four-card node as an example):

The sed command only needs to be run once. It inserts the contents of `configfile` after line 100 of the training script, adding the code that enables multi-card (multi-process) execution:

    sed -i '100 r configfile' models-master/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py

Without XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile  -mca btl self,tcp  --allow-run-as-root  --bind-to none scripts-run/single_process.sh

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile  -mca btl self,tcp  --allow-run-as-root  --bind-to none scripts-run/single_process_xla.sh
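
The contents of `configfile` are not shown in this README. The reference list points to `tf.distribute.MultiWorkerMirroredStrategy`, which discovers its peers through the `TF_CONFIG` environment variable, so one plausible shape for per-process setup is sketched below. It assumes Open MPI, which exports `OMPI_COMM_WORLD_RANK` to each launched process. The host names, ports, and `build_tf_config` are hypothetical and may not match what `configfile` actually does.

```python
import json
import os


def build_tf_config(workers, rank):
    """Build the TF_CONFIG JSON for one worker in a multi-worker job.

    workers: list of "host:port" strings, identical on every process.
    rank:    index of this process within that list.
    """
    if not 0 <= rank < len(workers):
        raise ValueError(f"rank {rank} out of range for {len(workers)} workers")
    return json.dumps({
        "cluster": {"worker": list(workers)},
        "task": {"type": "worker", "index": rank},
    })


# Hypothetical example: four processes on one node, one port per process.
workers = [f"localhost:{40000 + i}" for i in range(4)]
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
os.environ["TF_CONFIG"] = build_tf_config(workers, rank)
# TF_CONFIG must be set before the strategy is created, e.g.
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
```
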

### FP16 Training
#### Single-node, single-card training commands:

Without XLA:
   
    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1  --use_synthetic_data=false --dtype=fp16

With XLA:
  
    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1  --use_synthetic_data=false --dtype=fp16

#### Single-node, four-card training commands:

Without XLA:
  
    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4  --use_synthetic_data=false --dtype=fp16

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4  --use_synthetic_data=false --dtype=fp16
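
The commands use `--batch_size=128` on one card and `--batch_size=512` on four. Assuming, as in the upstream TensorFlow Model Garden, that `--batch_size` is the global batch size split evenly across cards, each card processes the same 128 images per step in both setups. The sketch below works through that arithmetic. The function names are illustrative.

```python
NUM_TRAIN_IMAGES = 1281167  # ImageNet-1k (ILSVRC-2012) training set size


def per_card_batch(global_batch, num_cards):
    """Split a global batch evenly across cards."""
    if global_batch % num_cards != 0:
        raise ValueError("global batch size must be divisible by the card count")
    return global_batch // num_cards


def steps_per_epoch(global_batch, num_images=NUM_TRAIN_IMAGES):
    """Number of full batches in one pass over the training set."""
    return num_images // global_batch


for cards, batch in [(1, 128), (4, 512)]:
    print(cards, per_card_batch(batch, cards), steps_per_epoch(batch))
# 1 128 10009
# 4 128 2502
```
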

#### Multi-node, multi-card training commands (using four processes on a single four-card node as an example):

The sed command only needs to be run once. It inserts the contents of `configfile` after line 100 of the training script, adding the code that enables multi-card (multi-process) execution:

    sed -i '100 r configfile' models-master/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py

In scripts-run/single_process.sh and scripts-run/single_process_xla.sh, set the dtype flag to --dtype=fp16.

Without XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile  -mca btl self,tcp  --allow-run-as-root  --bind-to none scripts-run/single_process.sh

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile  -mca btl self,tcp  --allow-run-as-root  --bind-to none scripts-run/single_process_xla.sh

## Performance and Accuracy Data
Test data: the [ImageNet test dataset](https://image-net.org/ "ImageNet dataset official website"). Accelerator card used: DCU-Z00-16G.

| Cards | Batch size | Precision | Accuracy | XLA enabled | Processes |
| :------: | :------: | :------: | :------: | :------: | :------: |
| 4 | 512 | fp32 | 0.7628 | No | Single process |
| 4 | 512 | fp16 | 0.7616 | No | Single process |
| 4 | 512 | fp32 | 0.7608 | No | Four processes |
| 4 | 512 | fp16 | 0.7615 | No | Four processes |

## Source Repository and Issue Reporting
* https://developer.hpccube.com/codes/modelzoo/resnet50_tensorflow

## References
* https://github.com/tensorflow/models/tree/master
* https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy