# ResNet50

## Paper
`Deep Residual Learning for Image Recognition`
- https://arxiv.org/abs/1512.03385
## Model Structure
The ResNet50 network contains 49 convolutional layers and 1 fully connected layer, among other components.

![img](./doc/ResNet50.png)
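## Algorithm Principle
ResNet50 stacks multiple residual blocks, each with a residual (shortcut) connection, to mitigate vanishing or exploding gradients and to allow the network to grow deeper.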

![img](./doc/Residual_Block.png)
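
To make the residual idea concrete, here is a minimal pure-Python sketch. It is not the repository's TensorFlow implementation: `residual_block`, `relu`, and the toy transforms are hypothetical names used only for illustration. The block computes relu(F(x) + x), so when F contributes nothing the input passes through unchanged, which is the shortcut path that keeps signals and gradients flowing in deep networks.

```python
from typing import Callable, List

Vector = List[float]


def relu(values: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]


def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return relu(F(x) + x), where F stands in for the block's learned layers."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the identity shortcut")
    return relu([f + s for f, s in zip(fx, x)])


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    # A transform that outputs zeros: the block reduces to the identity mapping.
    print(residual_block(x, lambda v: [0.0] * len(v)))
    # A toy transform standing in for the convolutional layers of a real block.
    print(residual_block(x, lambda v: [0.5 * e for e in v]))
```

In the real network F is a stack of convolution, batch normalization, and activation layers, and a projection is used on the shortcut when the shapes differ.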
## Environment Setup
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.13.1-ubuntu20.04-dtk24.04.1-py3.10
# Replace <Your Image ID> with the ID of the Docker image pulled above
docker run --shm-size 16g --network=host --name=resnet50_tensorFlow --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /opt/hyhal:/opt/hyhal:ro -v $PWD/resnet50_tensorflow:/home/resnet50_tensorflow -it <Your Image ID> bash
pip install -r requirements.txt --no-deps
```
### Dockerfile (Method 2)
```
cd resnet50_tensorflow/docker
docker build --no-cache -t resnet50_tensorflow:latest .
docker run --rm --shm-size 16g --network=host --name=resnet50_tensorflow --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /opt/hyhal:/opt/hyhal:ro -v $PWD/../../resnet50_tensorflow:/home/resnet50_tensorflow -it resnet50_tensorflow:latest bash
```
### Anaconda (Method 3)
1. The special deep learning libraries that this project requires for DCU cards can be downloaded and installed from the developer community:
https://developer.hpccube.com/tool/
```
DTK version: dtk24.04.1
python: 3.10
tensorflow: 2.13.1
tf-models-official: 2.13.1
keras: 2.13.1
tensorboard: 2.13
hyhal
```
`Tips: the versions of the DCU-related tools listed above (dtk, python, tensorflow, etc.) must correspond to each other exactly.`

2. Install the other, non-DCU-specific libraries according to requirements.txt:
```
pip3 install -r requirements.txt  --no-deps
```
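
Because the versions must line up exactly, it can help to check the installed Python packages against the pinned list before training. The sketch below is one possible check using only the standard library. It assumes the pip distribution names match the names in the list above and that a DCU build may append a local suffix (for example `+...`) to the base version; both are assumptions, and `check_versions` is a hypothetical helper. DTK and hyhal are not checked here.

```python
from importlib import metadata
from typing import Callable, Dict

# Versions taken from the list above.
PINNED = {
    "tensorflow": "2.13.1",
    "tf-models-official": "2.13.1",
    "keras": "2.13.1",
    "tensorboard": "2.13",
}


def version_matches(found: str, wanted: str) -> bool:
    """True if `found` equals `wanted` or extends it with a '.' or '+' suffix."""
    return found == wanted or found.startswith(wanted + ".") or found.startswith(wanted + "+")


def check_versions(
    pinned: Dict[str, str],
    lookup: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Return {package: problem} for every package that is missing or mismatched."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not version_matches(found, wanted):
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_versions(PINNED)
    for name, problem in issues.items():
        print(f"{name}: {problem}")
    print("OK" if not issues else "Version mismatch detected")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed; in normal use the default `importlib.metadata.version` is enough.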

## Dataset

1. Real data

Training uses the ImageNet dataset, which must first be converted to TFRecord format.

ImageNet can be downloaded from the [official website](https://image-net.org/ "ImageNet official website") or through the SCNet fast download channel [imagenet](http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-2012); you can also search for it on Baidu or contact us.

To convert ImageNet to TFRecord format, refer to this [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py) and its [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy).

The resulting TFRecord data is laid out as follows:
```
tfrecord-imagenet
                |
                train-00000-of-01024
                train-00001-of-01024
                ...
                train-01022-of-01024
                train-01023-of-01024
                validation-00000-of-00128
                validation-00001-of-00128
                ...
                validation-00126-of-00128
                validation-00127-of-00128
```
2. Synthetic data

Training can also run on randomly generated synthetic data, in which case the ImageNet dataset does not need to be downloaded. Simply set --use_synthetic_data to true in the training command.
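
When training on real data, a missing shard usually only shows up once the input pipeline reaches it. The following sketch checks a directory against the TFRecord layout listed above (1024 training shards and 128 validation shards). `expected_shards` and `missing_shards` are hypothetical helper names, and the default directory name is simply the one used in the listing.

```python
import sys
from pathlib import Path
from typing import List


def expected_shards(prefix: str, total: int) -> List[str]:
    """Shard names follow <prefix>-<index>-of-<total>, zero-padded to 5 digits."""
    return [f"{prefix}-{i:05d}-of-{total:05d}" for i in range(total)]


def missing_shards(data_dir: str) -> List[str]:
    """List the expected shard files that are absent from data_dir."""
    root = Path(data_dir)
    names = expected_shards("train", 1024) + expected_shards("validation", 128)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # Pass the TFRecord directory as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "tfrecord-imagenet"
    missing = missing_shards(target)
    print(f"{len(missing)} shard(s) missing in {target}")
    for name in missing[:10]:
        print("  ", name)
```

The check only looks at file names; it does not open the shards or validate their contents.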

## Training
### fp32 training
#### Single-node, single-card training command:
Without XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    python3 official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1 --use_synthetic_data=false --train_epochs=90 --dtype=fp32

With XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=1" python3 official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1 --use_synthetic_data=false --train_epochs=90 --dtype=fp32

#### Single-node, four-card training command:
Without XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    python3 official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4 --use_synthetic_data=false --train_epochs=90 --dtype=fp32

With XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=1" python3 official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4 --train_epochs=90 --use_synthetic_data=false --dtype=fp32
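
The commands above scale --batch_size with the number of cards: 128 for one card and 512 for four. Assuming the flag is the global batch size that is split evenly across devices (which is how the TensorFlow Model Garden flag is documented, but treat it as an assumption for this port), each card still processes 128 samples per step. The hypothetical helper below illustrates that arithmetic.

```python
def per_device_batch_size(global_batch_size: int, num_gpus: int) -> int:
    """Split a global batch evenly across devices; reject uneven splits."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the number of devices")
    return global_batch_size // num_gpus


if __name__ == "__main__":
    for batch, gpus in [(128, 1), (512, 4)]:
        share = per_device_batch_size(batch, gpus)
        print(f"--batch_size={batch} --num_gpus={gpus}: {share} samples per card per step")
```

Keeping the per-card share constant when changing the card count is a common starting point, since it keeps the memory footprint per card unchanged.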

#### Multi-node, multi-card training command (example: four processes on four cards, simulated on a single node with four cards):

Run the sed command only once. It inserts the code that enables multi-card runs into the training script; running it again would insert that code a second time.

    sed -i '100 r configfile' official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py
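
The sed expression `100 r configfile` appends the contents of `configfile` after line 100 of the target file, and it has no guard against being applied twice. As an illustration only, the sketch below performs the same insertion in Python and skips it when the snippet is already present. `insert_file_after_line` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def insert_file_after_line(target: str, snippet: str, line_no: int) -> bool:
    """Insert the contents of `snippet` after line `line_no` of `target`.

    Returns False without touching the file if the snippet text is already
    present, so a second run does not duplicate the inserted code.
    """
    target_path = Path(target)
    snippet_text = Path(snippet).read_text()
    if not snippet_text.endswith("\n"):
        snippet_text += "\n"
    original = target_path.read_text()
    if snippet_text in original:
        return False
    lines = original.splitlines(keepends=True)
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"{target} has only {len(lines)} lines")
    head, tail = lines[:line_no], lines[line_no:]
    if not head[-1].endswith("\n"):
        head[-1] += "\n"
    target_path.write_text("".join(head) + snippet_text + "".join(tail))
    return True


if __name__ == "__main__":
    script = "official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py"
    # Only act when run from the repository root, where both files exist.
    if Path(script).is_file() and Path("configfile").is_file():
        changed = insert_file_after_line(script, "configfile", 100)
        print("inserted" if changed else "already present, nothing to do")
```

The presence check is a plain substring match, so it only detects an earlier insertion if the inserted text has not been edited since.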

Without XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile -mca btl self,tcp --allow-run-as-root --bind-to none scripts-run/single_process.sh

With XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile -mca btl self,tcp --allow-run-as-root --bind-to none scripts-run/single_process_xla.sh
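
The contents of `configfile` are not shown in this README. As background only: `tf.distribute.MultiWorkerMirroredStrategy` (linked in the references) reads its cluster layout from the `TF_CONFIG` environment variable, and Open MPI exposes each process's rank as `OMPI_COMM_WORLD_RANK`. The sketch below shows one plausible way to derive `TF_CONFIG` for each of the four processes. The worker addresses and ports are hypothetical, `build_tf_config` is an illustrative name, and the code actually inserted by the sed step may differ.

```python
import json
import os
from typing import List


def build_tf_config(workers: List[str], rank: int) -> str:
    """Return the TF_CONFIG JSON string for the worker with the given rank."""
    if not 0 <= rank < len(workers):
        raise ValueError(f"rank {rank} is outside the worker list")
    config = {
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": rank},
    }
    return json.dumps(config)


if __name__ == "__main__":
    # Hypothetical addresses: four processes on one machine, one port each.
    workers = [f"localhost:{12345 + i}" for i in range(4)]
    # Open MPI sets this variable for every launched process; default to 0 otherwise.
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    os.environ["TF_CONFIG"] = build_tf_config(workers, rank)
    print(os.environ["TF_CONFIG"])
```

`TF_CONFIG` has to be set before the strategy object is created, which is why such code normally sits near the top of the training script.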
    
### fp16 training
#### Single-node, single-card training command

Without XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    python3 official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1 --use_synthetic_data=false --train_epochs=90 --dtype=fp16

With XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=1" python3 official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1 --train_epochs=90 --use_synthetic_data=false --dtype=fp16

#### Single-node, four-card training command

Without XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    python3 official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4 --train_epochs=90 --use_synthetic_data=false --dtype=fp16

With XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=1" python3 official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4 --train_epochs=90 --use_synthetic_data=false --dtype=fp16
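
The single-node commands in this README differ only in card count, batch size, precision, and whether XLA is enabled. The sketch below assembles them from those parameters, using only the flags and the `TF_XLA_FLAGS` value that appear above. `build_training_command` is a hypothetical helper, and the default of 128 samples per card simply mirrors the batch sizes used in the examples.

```python
import shlex
from typing import Dict, List, Tuple

SCRIPT = "official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py"


def build_training_command(
    data_dir: str,
    model_dir: str,
    num_gpus: int,
    dtype: str = "fp32",
    use_xla: bool = False,
    synthetic: bool = False,
    batch_per_card: int = 128,
    epochs: int = 90,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, extra_env) for one single-node training run."""
    if dtype not in ("fp32", "fp16"):
        raise ValueError("dtype must be 'fp32' or 'fp16'")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    argv = [
        "python3",
        SCRIPT,
        f"--data_dir={data_dir}",
        f"--model_dir={model_dir}",
        f"--batch_size={batch_per_card * num_gpus}",
        f"--num_gpus={num_gpus}",
        f"--use_synthetic_data={'true' if synthetic else 'false'}",
        f"--train_epochs={epochs}",
        f"--dtype={dtype}",
    ]
    # XLA auto-clustering is switched on through an environment variable.
    extra_env = {"TF_XLA_FLAGS": "--tf_xla_auto_jit=1"} if use_xla else {}
    return argv, extra_env


if __name__ == "__main__":
    argv, extra_env = build_training_command(
        "/path/to/data", "/path/to/model", num_gpus=4, dtype="fp16", use_xla=True
    )
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in extra_env.items())
    print((prefix + " " if prefix else "") + shlex.join(argv))
```

To launch the run instead of printing it, the pair can be passed to `subprocess.run(argv, env={**os.environ, **extra_env})` from the repository root with `PYTHONPATH` set as shown above.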

#### Multi-node, multi-card training command (example: four processes on four cards, simulated on a single node with four cards)

Run the sed command only once. It inserts the code that enables multi-card runs, so do not run it again if you already ran it for fp32 training.

    sed -i '100 r configfile' official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py

In scripts-run/single_process.sh and scripts-run/single_process_xla.sh, change the dtype flag to --dtype=fp16.
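
The dtype change can be made by hand or scripted. The sketch below rewrites the flag in both launcher scripts, assuming each script contains a literal `--dtype=<value>` argument (their contents are not shown in this README). `set_dtype` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Iterable, List

DTYPE_FLAG = re.compile(r"--dtype=\w+")


def set_dtype(script_paths: Iterable[str], dtype: str) -> List[str]:
    """Rewrite every --dtype=<value> occurrence; return the files that changed."""
    if dtype not in ("fp32", "fp16"):
        raise ValueError("dtype must be 'fp32' or 'fp16'")
    changed = []
    for path in map(Path, script_paths):
        text = path.read_text()
        updated = DTYPE_FLAG.sub(f"--dtype={dtype}", text)
        if updated != text:
            path.write_text(updated)
            changed.append(str(path))
    return changed


if __name__ == "__main__":
    scripts = ["scripts-run/single_process.sh", "scripts-run/single_process_xla.sh"]
    # Skip scripts that are not present, e.g. when run outside the repository root.
    existing = [s for s in scripts if Path(s).is_file()]
    print("updated:", set_dtype(existing, "fp16"))
```

Calling `set_dtype(existing, "fp32")` restores the scripts for fp32 training.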

Without XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile -mca btl self,tcp --allow-run-as-root --bind-to none scripts-run/single_process.sh

With XLA:

    export PYTHONPATH=/home/resnet50_tensorflow:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile -mca btl self,tcp --allow-run-as-root --bind-to none scripts-run/single_process_xla.sh

### Result
![img](./doc/ILSVRC2012_val_00001915.PNG)
![img](./doc/ILSVRC2012_val_00003386.PNG)

## Accuracy
Test data: [ImageNet test dataset](https://image-net.org/ "ImageNet official website"); accelerator used: DCU-Z100-16G

| Cards | Batch size | Precision | Accuracy | XLA enabled | Processes |
| :------: | :------: | :------: | :------: | :------: | :------: |
| 4 | 512 | fp32 | 0.763 | No | Single process |
| 4 | 512 | fp16 | 0.764 | No | Single process |
| 4 | 512 | fp32 | 0.764 | No | Four processes |
| 4 | 512 | fp16 | 0.763 | No | Four processes |

## Application Scenarios
### Algorithm category
`Image classification`
### Key application industries
`Manufacturing, government, healthcare, scientific research`

## Source Repository and Issue Feedback
* https://developer.hpccube.com/codes/modelzoo/resnet50_tensorflow

## References
* https://github.com/tensorflow/models/tree/master
* https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy