# MLPerf_ResNet50

## Paper

Deep Residual Learning for Image Recognition

* https://arxiv.org/abs/1512.03385

## Model Architecture

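ResNet50 is a deep neural network for image recognition, composed of a series of convolutional layers, pooling layers, a global average pooling layer, and a fully connected layer. What distinguishes the model is its stack of residual blocks, each consisting of several convolutional layers and a skip connection.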

![img](ResNet50.png)

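## Algorithm

In ResNet50, the input image first passes through convolutional and pooling layers for feature extraction, and then through multiple residual blocks for deep feature learning. Each residual block contains several convolutional layers and a skip connection. The skip connections let information flow both within and between residual blocks, which mitigates the vanishing-gradient problem in deep neural networks. Finally, a global average pooling layer maps the features to a fixed-length vector, which a fully connected layer then uses for classification, regression, or similar tasks.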

![img](Residual_Block.png)
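
The skip connection described above can be written as a short formula. This is a sketch in the notation of the cited paper, where `x` is the block input and `F` is the stack of convolutional layers with weights `W_i`. The loss symbol `L` is introduced here only for illustration, and the sketch assumes an identity shortcut and ignores the activation applied after the addition:

```math
y = F(x, \{W_i\}) + x
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( \frac{\partial F}{\partial x} + I \right)
```

The identity term `I` gives the gradient a direct path back to earlier layers even when the contribution through `F` is small, which is the sense in which the skip connections help with vanishing gradients.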

## Environment Setup

**Docker (Option 1)**

Pull the training Docker image provided by [光源](https://www.sourcefind.cn/#/service-details):

    docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:mlperf-resnet50-mpirun-latest
    # <Image ID>: replace with the ID of the image pulled above
    # <Host Path>: path on the host
    # <Container Path>: path the host path is mapped to inside the container
    docker run -it --name mlperf_resnet50 --shm-size=32G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
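
The `docker run` command passes `/dev/kfd` and `/dev/dri` into the container, so it can help to confirm on the host that both paths exist before starting it. The helper below is a minimal sketch for that check; the function name `missing_devices` is hypothetical and not part of the image or the repository.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of the repository): print every
# path from the argument list that does not exist on this host.
missing_devices() {
    local path
    for path in "$@"; do
        # -e is true for any existing file, directory, or device node
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}

# The two device paths passed to `docker run --device=...` above.
missing="$(missing_devices /dev/kfd /dev/dri)"
if [ -n "$missing" ]; then
    printf 'Missing on this host:\n%s\n' "$missing" >&2
else
    echo "Both device paths are present."
fi
```

If a path is reported as missing, the accelerator driver is probably not loaded on the host, and the container would start without access to the cards.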

**Dockerfile (Option 2)**

    docker build --no-cache -t mlperf_resnet50:latest .
    docker run -it --name mlperf_resnet50 --shm-size=32G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
    # <Image ID>: replace with the ID of the image built above (listed by `docker images`)
    # <Host Path>: path on the host
    # <Container Path>: path the host path is mapped to inside the container

Image version dependencies:

* DTK driver: dtk22.10.1
* Python: python3.8.2

Note: this image currently supports only Z100/Z100L series cards.

Test directory:

```
/root/resnet50
```

## Dataset

Training uses the ImageNet dataset, available at http://image-net.org/challenges/LSVRC/2012/2012-downloads (an account is required).

For the preprocessing steps, see https://github.com/mlcommons/training/tree/master/image_classification

The prepared pretrained model files are laid out as follows. They are already included in the image, so no additional download is needed.

    mlperf_resnet50
    ├── checkpoint
    ├── ckpt-0.data-00000-of-00001
    ├── ckpt-0.index
    ├── ckpt-500.data-00000-of-00001
    └── ckpt-500.index
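
To confirm that a container actually holds the files listed above, a small check like the following can be used. It is only a sketch: the function name `check_ckpt_dir` is hypothetical, and the location of `mlperf_resnet50` inside the container is an assumption, so pass the real path as the argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): report which of the
# files listed above are absent from the directory given as the first argument.
check_ckpt_dir() {
    local dir="$1" name status=0
    for name in checkpoint \
                ckpt-0.data-00000-of-00001 ckpt-0.index \
                ckpt-500.data-00000-of-00001 ckpt-500.index; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name"
            status=1
        fi
    done
    # 0 when every file is present, 1 otherwise
    return "$status"
}

# Example call; ./mlperf_resnet50 is an assumed default location.
check_ckpt_dir "${1:-./mlperf_resnet50}" || echo "checkpoint directory is incomplete"
```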

SCNet fast download link: [http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-2012](http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-2012)

## Training

### Single Node, Multiple Cards

Run the performance and accuracy test on a single node with 8 cards:

```
bash 8dcu_multi.sh >& output.log &
```
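
The command above returns immediately and sends both stdout and stderr to `output.log` (in bash, `>& file` is shorthand for `> file 2>&1`), so the shell gives no signal when training ends or whether it succeeded. When the run is driven from another script, one possible approach is a wrapper that waits for the job and returns its exit status. The function name `run_and_wait` is hypothetical and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the repository): launch a command in the
# background with stdout and stderr captured in a log file, wait for it to
# finish, and return its exit status.
run_and_wait() {
    local log="$1"
    shift
    "$@" > "$log" 2>&1 &
    local pid=$!
    echo "started PID $pid, logging to $log"
    local status=0
    # `|| status=$?` records a failure without aborting under `set -e`
    wait "$pid" || status=$?
    echo "PID $pid finished with exit status $status"
    return "$status"
}

# Usage, run from the directory that contains the script:
#   run_and_wait output.log bash 8dcu_multi.sh
```

While the job is running, `tail -f output.log` in a second terminal follows the log as it is written.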

## Results

![result](result.png)

### Accuracy

With the input data described above and 8 Z100L accelerator cards, training meets the official convergence requirement, reaching the target classification accuracy of 75.90%.

| Cards | Precision type  | Processes | Accuracy reached      |
| ----- | --------------- | --------- | --------------------- |
| 8     | Mixed precision | 8         | 75.90% classification |

## Application Scenarios

### Algorithm Category

`Image classification`

### Key Application Industries

`Manufacturing, government, healthcare, scientific research`

## Source Repository and Issue Feedback

* https://developer.hpccube.com/codes/modelzoo/mlperf_resnet50_tensorflow

## References

* https://mlcommons.org/en/
* https://github.com/mlcommons
* https://github.com/mlcommons/training/tree/master/image_classification/tensorflow2