# ResNet50
## Paper
`Deep Residual Learning for Image Recognition`

- [https://arxiv.org/abs/1512.03385](https://arxiv.org/abs/1512.03385)

## Model Structure
The ResNet50 network contains 49 convolutional layers and 1 fully connected layer, among other components.

<img src="http://developer.sourcefind.cn/codes/modelzoo/resnet50_oneflow/-/raw/main/ResNet50%E6%A8%A1%E5%9E%8B%E7%BB%93%E6%9E%84.png" alt="ResNet50模型结构.png" style="zoom:67%;" />
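The layer count above can be checked with a little arithmetic. The sketch below assumes the standard ResNet-50 layout from the paper (one stem convolution followed by four stages of 3, 4, 6, and 3 bottleneck blocks, each with three convolutions) and does not count the 1x1 projection convolutions on the shortcut paths. The names `BLOCKS_PER_STAGE` and `count_resnet50_layers` are illustrative and are not part of this repository.

```python
# Bottleneck blocks in each of the four stages of ResNet-50 (assumed from the paper).
BLOCKS_PER_STAGE = [3, 4, 6, 3]
# Each bottleneck block stacks a 1x1, a 3x3, and a 1x1 convolution.
CONVS_PER_BOTTLENECK = 3


def count_resnet50_layers():
    """Return (conv_layers, fc_layers), ignoring shortcut projection convolutions."""
    stem_convs = 1  # the initial 7x7 convolution
    block_convs = sum(BLOCKS_PER_STAGE) * CONVS_PER_BOTTLENECK
    return stem_convs + block_convs, 1


conv_layers, fc_layers = count_resnet50_layers()
print(conv_layers, fc_layers)  # 49 1
```

Together, the 49 convolutional layers and the single fully connected layer give the "50" in the model name.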

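## Algorithm Principle

ResNet50 stacks multiple residual blocks, each with a residual (skip) connection, to mitigate vanishing or exploding gradients, which allows the network to be made deeper.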

<img src="http://developer.sourcefind.cn/codes/modelzoo/resnet50_oneflow/-/raw/main/Residual_Block.png" alt="Residual_Block.png" style="zoom:67%;" />
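To make the residual connection concrete, here is a minimal pure-Python sketch of the idea `y = relu(F(x) + x)`. It is a toy under stated assumptions: it uses small dense layers on plain lists instead of the convolutions and batch normalization that the real ResNet50 blocks use, and the helper names (`relu`, `linear`, `residual_block`) are illustrative rather than taken from this repository.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = w2 @ relu(w1 @ x)."""
    hidden = relu(linear(x, w1))
    residual = linear(hidden, w2)
    # The skip connection adds the unchanged input back onto F(x).
    return relu([r + xi for r, xi in zip(residual, x)])


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block(x, zeros, zeros))  # [1.0, 2.0]
```

Even when the learned branch `F` contributes nothing (all-zero weights here), the input still passes through the skip connection, which is why stacking many such blocks does not block the signal.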

## Environment Setup

### Docker

```plaintext
docker pull image.sourcefind.cn:5000/dcu/admin/base/oneflow:0.9.1-centos7.6-dtk-22.10.1-py39-latest
# Replace <Your Image ID> with the ID of the Docker image pulled above
docker run --shm-size 16g --network=host --name=resnet50_oneflow --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $PWD/resnet50_oneflow:/home/resnet50_oneflow -it <Your Image ID> bash
cd /home/resnet50_oneflow
pip install -r requirements.txt
```

### Conda

1. Create a conda virtual environment:

```plaintext
conda create -n resnet python=3.9
```

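The toolkits and deep learning libraries that this project needs for DCU cards can all be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.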

- [DTK-22.10.1](https://cancon.hpccube.com:65024/1/main/DTK-22.10.1)

- [Oneflow-0.9](https://cancon.hpccube.com:65024/4/main/oneflow/dtk22.10)

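  Tips: The versions of the DTK driver, Python, and the other tools listed above must match one another exactly.

Install the remaining dependencies from requirements.txt: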

```plaintext
cd resnet50_oneflow
pip install -r requirements.txt
```

## Dataset

Because the ImageNet dataset is very large, this project uses the small mini-imagenet dataset, [tiny-imagenet-200](http://cs231n.stanford.edu/tiny-imagenet-200.zip), so that users can quickly train and validate ResNet50 with OneFlow. The original dataset can be downloaded from [imagenet-2012](https://image-net.org/download.php). To use the original data, convert it to OFRecord format by following https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/resnet50#prepare-ofrecord-for-the-full-imagenet-dataset


    OFRECORD_PATH="./mini-imagenet/ofrecord"

The finished OFRecord data is laid out as follows:

```plaintext
ofrecord-imagenet
                | 
                train-00000-of-01024
                train-00001-of-01024
                ...
                train-01022-of-01024
                train-01023-of-01024
                validation-00000-of-00128
                validation-00001-of-00128
                ...
                validation-00126-of-00128
                validation-00127-of-00128
```
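Before training, it can help to confirm that every part file in the listing above is present. The following is a minimal sketch, assuming the part files sit directly under the dataset directory and follow the `<split>-<index>-of-<total>` naming shown above. The default counts (1024 and 128) are taken from that listing and may differ for a mini-imagenet conversion. The function names `expected_parts` and `missing_parts` are illustrative and are not part of this repository.

```python
from pathlib import Path


def expected_parts(prefix, num_parts):
    """Build the part file names, e.g. train-00000-of-01024."""
    return [f"{prefix}-{i:05d}-of-{num_parts:05d}" for i in range(num_parts)]


def missing_parts(ofrecord_dir, train_parts=1024, val_parts=128):
    """Return the expected part file names that are absent from ofrecord_dir."""
    root = Path(ofrecord_dir)
    expected = expected_parts("train", train_parts) + expected_parts("validation", val_parts)
    return [name for name in expected if not (root / name).is_file()]


print(expected_parts("train", 1024)[-1])  # train-01023-of-01024
```

Calling `missing_parts("./mini-imagenet/ofrecord")` would then return an empty list when the directory is complete, assuming that path and those part counts apply to your data.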

## Training

### fp32 Training

Single-node, single-card training command:

    bash examples/train_graph_distributed_fp32.sh

Set DEVICE_NUM_PER_NODE=4 in examples/train_graph_distributed_fp32.sh, then run the single-node, four-card training command:

    bash examples/train_graph_distributed_fp32.sh
### fp16 Training
Single-node, single-card training command:

    bash examples/train_graph_distributed_fp16.sh
Set DEVICE_NUM_PER_NODE=4 in examples/train_graph_distributed_fp16.sh, then run the single-node, four-card training command:

    bash examples/train_graph_distributed_fp16.sh
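
Editing `DEVICE_NUM_PER_NODE` by hand works fine. For repeated switching between one and four cards, the edit could also be scripted. The sketch below assumes each training script contains a plain `DEVICE_NUM_PER_NODE=<n>` assignment on its own line; the helper name `set_device_num` and the `MASTER_ADDR` line in the sample text are hypothetical and only serve as illustration.

```python
import re


def set_device_num(script_text, num_devices):
    """Return script_text with the DEVICE_NUM_PER_NODE assignment set to num_devices."""
    pattern = re.compile(r"^([ \t]*DEVICE_NUM_PER_NODE=).*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{num_devices}", script_text)
    if count == 0:
        # Fail loudly rather than silently leaving the script unchanged.
        raise ValueError("DEVICE_NUM_PER_NODE assignment not found")
    return new_text


# Hypothetical script contents, used only to demonstrate the substitution.
sample = "DEVICE_NUM_PER_NODE=1\nMASTER_ADDR=127.0.0.1\n"
print(set_device_num(sample, 4))
```

To apply this to a real script, read its text with `pathlib.Path.read_text()`, pass it through `set_device_num`, and write the result back with `write_text()`.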

## Inference
Run the inference command:

    bash examples/infer_graph.sh
## Result

<img src="http://developer.sourcefind.cn/codes/modelzoo/resnet50_oneflow/-/raw/main/result.png" alt="result.png" style="zoom:50%;" />

### Accuracy

Test data: mini-imagenet. Accelerator cards used: 4 × DCU-Z100-16G.

| Cards | Batch size | Type | Accuracy (%) |
| :------: | :------: | :------: | -------- |
| 1 | 128 | fp32 | 76.5 (300 epochs) |
| 1 | 128 | fp16 | 76.3 (300 epochs) |
| 4 | 128 | fp32 | 76.5 (300 epochs) |
| 4 | 128 | fp16 | 76.3 (300 epochs) |
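## Application Scenarios

### Algorithm Category

`Image classification`

### Key Application Industries

`Manufacturing, government, healthcare, scientific research`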

## Source Repository and Issue Feedback

- https://developer.sourcefind.cn/codes/modelzoo/resnet50_oneflow

## References

* https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/resnet50
* https://github.com/Oneflow-Inc/oneflow