# Model Name
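## Model Introduction
Train ResNet50 with TensorFlow 2.
## Model Structure
The ResNet50 network contains 49 convolutional layers and 1 fully connected layer, among other components.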
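
The layer count can be checked with simple arithmetic. The sketch below assumes the standard ResNet50 layout: one stem convolution followed by four stages of 3, 4, 6, and 3 bottleneck blocks, each with three convolutions. The variable names are illustrative, and the projection convolutions on the shortcut paths are not counted.

```python
# Standard ResNet50 layout (assumed): bottleneck blocks per stage.
BLOCKS_PER_STAGE = [3, 4, 6, 3]
CONVS_PER_BLOCK = 3   # 1x1, 3x3, 1x1 convolutions in each bottleneck block
STEM_CONVS = 1        # the initial 7x7 convolution
FC_LAYERS = 1         # the final fully connected classifier

conv_layers = STEM_CONVS + CONVS_PER_BLOCK * sum(BLOCKS_PER_STAGE)
total_layers = conv_layers + FC_LAYERS

print(conv_layers, total_layers)  # prints "49 50"
```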
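## Dataset
Training uses the ImageNet dataset, which must be converted to TFRecord format.

The ImageNet dataset can be downloaded from the [official website](https://image-net.org/ "ImageNet official website"), found through a Baidu search, or obtained by contacting us.

To convert ImageNet to TFRecord format, see this [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py) and its [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy).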
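
Before training, it can help to confirm that the converted shards are all present. The sketch below is a minimal check, not part of this repository. It assumes the conversion produces 1024 training shards and 128 validation shards named like `train-00000-of-01024`, and that the training script expects them directly under `--data_dir`. Verify both assumptions against the script versions you use. The function name `missing_shards` is hypothetical.

```python
import os
import tempfile

# Assumed shard layout produced by imagenet_to_gcs.py (check your script version).
EXPECTED_SHARDS = {"train": 1024, "validation": 128}


def missing_shards(data_dir):
    """Return the expected TFRecord shard names that are absent from data_dir."""
    present = set(os.listdir(data_dir))
    missing = []
    for split, count in EXPECTED_SHARDS.items():
        for index in range(count):
            name = f"{split}-{index:05d}-of-{count:05d}"
            if name not in present:
                missing.append(name)
    return missing


# Demo on a temporary directory that holds only the first training shard.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "train-00000-of-01024"), "w").close()
    demo_missing = missing_shards(demo_dir)
    print(len(demo_missing), "shards missing")
```

On a real dataset, call `missing_shards("/path/to/{ImageNet-tensorflow_data_dir}")` and expect an empty list.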

## Training
### Environment setup
Pull the training Docker image from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details):
Training image: `docker pull image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.7.0-centos7.6-dtk-22.10.1-py37-latest`

Install the Python dependencies:

    pip install -r requirement.txt
### fp32 training
#### Single-node, single-card training command

Without XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH  
    python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1  --use_synthetic_data=false --dtype=fp32

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1  --use_synthetic_data=false --dtype=fp32

#### Single-node, four-card training command
Without XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4  --use_synthetic_data=false --dtype=fp32

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4  --use_synthetic_data=false --dtype=fp32

#### Multi-node, multi-card training command (example: four processes on a single node with four cards)

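Run the sed command only once. It adds the code required for multi-card runs by inserting the contents of `configfile` after line 100 of the training script.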

    sed -i '100 r configfile' models-master/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py
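
The reason for running the command only once can be seen from a small model of what `sed '100 r configfile'` does: it appends the contents of one file after a given line of another, and it has no way to know the block is already there. The Python sketch below mimics that behaviour on toy data. The function name and the toy contents are hypothetical, and the sketch assumes the line number lies within the file.

```python
def insert_after_line(lines, line_number, extra_lines):
    """Return a copy of lines with extra_lines inserted after the 1-based line_number."""
    return lines[:line_number] + extra_lines + lines[line_number:]


# Toy stand-ins for the training script and configfile (hypothetical contents).
script = [f"line {i}" for i in range(1, 6)]
config = ["# inserted A", "# inserted B"]

once = insert_after_line(script, 3, config)
twice = insert_after_line(once, 3, config)

print(once)
print(twice.count("# inserted A"))  # a second run inserts the block again
```

Because the second run duplicates the inserted block, restore the original script before repeating the command.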

Without XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile  -mca btl self,tcp  --allow-run-as-root  --bind-to none scripts-run/single_process.sh

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile  -mca btl self,tcp  --allow-run-as-root  --bind-to none scripts-run/single_process_xla.sh
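
`MultiWorkerMirroredStrategy` (see the references) reads the cluster layout from the `TF_CONFIG` environment variable, so each process started by `mpirun` needs its own task index. The contents of `configfile` and `scripts-run/single_process.sh` are not shown in this README, so the sketch below is only one possible way to derive such a value from the rank variables that Open MPI exports. The host name, base port, and function name are assumptions.

```python
import json
import os


def build_tf_config(rank, world_size, host="localhost", base_port=12345):
    """Build a TF_CONFIG value for one worker; host and base_port are illustrative."""
    workers = [f"{host}:{base_port + i}" for i in range(world_size)]
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": rank},
    })


# Open MPI exports these variables to every launched process; default to a single worker.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))

os.environ["TF_CONFIG"] = build_tf_config(rank, world_size)
print(os.environ["TF_CONFIG"])
```

`TF_CONFIG` has to be set before the strategy object is created. On a real multi-node run, the worker list would hold one address per process across all hosts.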
    
### fp16 training
#### Single-node, single-card training command

Without XLA:
   
    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1  --use_synthetic_data=false --dtype=fp16

With XLA:
  
    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=128 --num_gpus=1  --use_synthetic_data=false --dtype=fp16

#### Single-node, four-card training command

Without XLA:
  
    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4  --use_synthetic_data=false --dtype=fp16

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py --data_dir=/path/to/{ImageNet-tensorflow_data_dir} --model_dir=/path/to/{model_save_dir} --batch_size=512 --num_gpus=4  --use_synthetic_data=false --dtype=fp16
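
The single-node commands in this README differ only in `--num_gpus`, `--batch_size`, `--dtype`, and whether `TF_XLA_FLAGS` is set. The sketch below builds any of them from those choices. It is a convenience helper, not part of the repository. The function name and the per-card batch size constant are assumptions inferred from the commands above (128 for one card, 512 for four).

```python
import shlex

SCRIPT = "official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py"
PER_CARD_BATCH = 128  # inferred: 128 for one card and 512 for four cards


def build_command(num_gpus, dtype, xla, data_dir, model_dir):
    """Return (extra_env, argv) for one single-node training run."""
    if dtype not in ("fp32", "fp16"):
        raise ValueError(f"unsupported dtype: {dtype}")
    # XLA auto-clustering is switched on through an environment variable.
    extra_env = {"TF_XLA_FLAGS": "--tf_xla_auto_jit=2"} if xla else {}
    argv = [
        "python3", SCRIPT,
        f"--data_dir={data_dir}",
        f"--model_dir={model_dir}",
        f"--batch_size={PER_CARD_BATCH * num_gpus}",
        f"--num_gpus={num_gpus}",
        "--use_synthetic_data=false",
        f"--dtype={dtype}",
    ]
    return extra_env, argv


env, argv = build_command(4, "fp16", True, "/path/to/data", "/path/to/model")
prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
print(prefix, " ".join(shlex.quote(arg) for arg in argv))
```

To launch the run from Python instead of printing it, pass `argv` to `subprocess.run` with `env={**os.environ, **env}`.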

#### Multi-node, multi-card training command (example: four processes on a single node with four cards)

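Run the sed command only once. It adds the code required for multi-card runs by inserting the contents of `configfile` after line 100 of the training script.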
    
    sed -i '100 r configfile' models-master/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py

In `scripts-run/single_process.sh` and `scripts-run/single_process_xla.sh`, set `--dtype=fp16`.

Without XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile  -mca btl self,tcp  --allow-run-as-root  --bind-to none scripts-run/single_process.sh

With XLA:

    export PYTHONPATH=/path/to/ResNet50_TensorFlow2:$PYTHONPATH
    mpirun -np 4 --hostfile hostfile  -mca btl self,tcp  --allow-run-as-root  --bind-to none scripts-run/single_process_xla.sh


## Performance and Accuracy
Test data: the [ImageNet test dataset](https://image-net.org/ "ImageNet official website"). Accelerator: DCU-Z00-16G.

Results by configuration:

| Cards | Batch size | Precision | Throughput | Accuracy | XLA enabled | Processes |
| :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| 4 | 512 | fp32 | 843 examples/second | 0.7628 | No | Single process |
| 4 | 512 | fp16 | - | 0.7616 | No | Single process |
| 4 | 512 | fp32 | - | 0.7608 | No | Four processes |
| 4 | 512 | fp16 | - | 0.7615 | No | Four processes |
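
The one reported throughput figure can be turned into per-card and per-epoch estimates. The sketch below is a rough calculation, assuming the rate of 843 examples/second is sustained and that training runs over the full ImageNet-1k training set of 1,281,167 images. The variable names are illustrative.

```python
THROUGHPUT = 843.0        # examples/second, fp32 four-card single-process row
NUM_CARDS = 4
GLOBAL_BATCH = 512
TRAIN_IMAGES = 1_281_167  # ImageNet-1k (ILSVRC 2012) training images

per_card_throughput = THROUGHPUT / NUM_CARDS
steps_per_epoch = TRAIN_IMAGES // GLOBAL_BATCH
epoch_minutes = TRAIN_IMAGES / THROUGHPUT / 60

print(f"{per_card_throughput:.2f} examples/second per card")  # 210.75
print(f"{steps_per_epoch} steps per epoch")                    # 2502
print(f"{epoch_minutes:.1f} minutes per epoch")                # 25.3
```
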
## References
* https://github.com/tensorflow/models/tree/master
* https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy