# Introduction
* Train a Mask R-CNN model with TensorFlow 2.7 on the COCO 2017 dataset.

# Environment setup

## 1) Install dependencies

- Install TensorFlow 2.7 in the DTK 22.04.1 environment

```
conda create -n Maskrcnn_tf python='3.7'
conda env list
conda activate Maskrcnn_tf
pip3 install tensorflow-2.7.0_dtk22.04-cp37-cp37m-linux_x86_64.whl
```

- Install pycocotools

```
pip3 install pycocotools -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com 
```

- Upgrade pandas

```
pip3 install -U pandas -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com  
```

- Install dllogger

```
git clone --recursive https://github.com/NVIDIA/dllogger.git
cd dllogger
python3 setup.py install
```

- Install Horovod

```
wget https://files.pythonhosted.org/packages/a7/8b/fcf373aade011f88dafe48628f7d348471d2891974eb5762ceeff3f13920/horovod-0.25.0.tar.gz

tar -zxvf horovod-0.25.0.tar.gz

cd horovod-0.25.0

python3 setup.py build --force develop
```
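
The Horovod steps above can be wrapped in a small script. This is a minimal sketch, assuming `wget`, `tar` and the conda environment's `python3` are on `PATH`; it derives the build directory from the archive name (the source release unpacks into `horovod-0.25.0`). The script name `build_horovod.sh`, the `build` argument and the helper names are illustrative, and setting `HOROVOD_WITH_TENSORFLOW=1` is an optional addition that is not part of the original steps.

```shell
#!/usr/bin/env bash
# Sketch: fetch, unpack and build the Horovod 0.25.0 source release.
set -euo pipefail

HOROVOD_URL="https://files.pythonhosted.org/packages/a7/8b/fcf373aade011f88dafe48628f7d348471d2891974eb5762ceeff3f13920/horovod-0.25.0.tar.gz"

# Map the archive name to the directory tar extracts it into:
# horovod-0.25.0.tar.gz -> horovod-0.25.0
src_dir_for() {
  local archive
  archive=$(basename "$1")
  printf '%s\n' "${archive%.tar.gz}"
}

build_horovod() {
  local url=$1 archive dir
  archive=$(basename "$url")
  dir=$(src_dir_for "$url")
  [ -f "$archive" ] || wget "$url"   # skip the download if already present
  tar -zxf "$archive"
  # Build inside the extracted directory, not the .tar.gz file.
  # HOROVOD_WITH_TENSORFLOW=1 (optional) makes the build fail instead of
  # silently skipping the TensorFlow extension.
  (cd "$dir" && HOROVOD_WITH_TENSORFLOW=1 python3 setup.py build --force develop)
}

# Build only when asked explicitly: bash build_horovod.sh build
if [ "${1:-}" = "build" ]; then
  build_horovod "$HOROVOD_URL"
fi
```

Keeping the directory name derived from the URL means bumping the version only requires changing `HOROVOD_URL`.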

- Install other dependencies

```
pip3 install opencv-python
pip3 install pyyaml
```
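
Before moving on, it can be worth confirming that everything installed above actually imports. The sketch below tries each import and reports failures. The import names (`cv2` for opencv-python, `yaml` for pyyaml, `horovod.tensorflow` for Horovod's TensorFlow binding) are the usual ones; the script name, the `check` argument and the `PYTHON` override are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: report which of the packages installed above fail to import.
set -uo pipefail

PYTHON=${PYTHON:-python3}
MODULES=(tensorflow pycocotools pandas dllogger horovod.tensorflow cv2 yaml)

# Prints "missing: <module>" for each failed import; returns 1 if any failed.
check_imports() {
  local m status=0
  for m in "$@"; do
    if ! "$PYTHON" -c "import $m" >/dev/null 2>&1; then
      echo "missing: $m"
      status=1
    fi
  done
  return "$status"
}

# Run the check only when asked: bash check_env.sh check
if [ "${1:-}" = "check" ]; then
  check_imports "${MODULES[@]}" && echo "all imports OK"
fi
```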

## 2) Data preparation (train and val)

```  
cd dataset/  
git clone https://github.com/tensorflow/models tf-models
cd tf-models/research
wget -O protobuf.zip https://github.com/google/protobuf/releases/download/v3.0.0/protoc-3.0.0-linux-x86_64.zip
unzip protobuf.zip
./bin/protoc object_detection/protos/*.proto --python_out=.
```
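
A quick way to confirm the protoc step worked: `protoc --python_out=.` writes a `*_pb2.py` module next to each compiled `.proto`, so any `.proto` without one was not compiled. This is a minimal sketch meant to be run from `tf-models/research`; the function and script names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: list any .proto file that has no generated *_pb2.py next to it.
# protoc --python_out=. turns object_detection/protos/foo.proto into
# object_detection/protos/foo_pb2.py.
set -uo pipefail

missing_pb2() {
  local dir=${1:-object_detection/protos} proto
  for proto in "$dir"/*.proto; do
    [ -e "$proto" ] || continue          # the glob matched nothing
    [ -f "${proto%.proto}_pb2.py" ] || echo "$proto"
  done
}

# bash check_protos.sh check   (exits 1 if anything is missing)
if [ "${1:-}" = "check" ]; then
  missing=$(missing_pb2)
  if [ -n "$missing" ]; then
    echo "not compiled:"
    echo "$missing"
    exit 1
  fi
  echo "all protos compiled"
fi
```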
Return to the `dataset` directory, open `create_coco_tf_record.py` (for example with `vim create_coco_tf_record.py`) and comment out lines 310 and 316.
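
For a non-interactive edit, the sketch below comments out given lines with `sed`, keeping their indentation and saving a `.bak` backup. It assumes "310 316" means the two individual lines 310 and 316 of an unmodified `create_coco_tf_record.py`; check those lines in an editor first, because commenting out the only statement of a Python block leaves a syntax error. The function name, script name and `apply` argument are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: comment out specific lines of a file in place, keeping their
# indentation; the original is kept as <file>.bak.
set -euo pipefail

comment_out_lines() {
  local file=$1 n
  shift
  local -a args=()
  for n in "$@"; do
    # 310s/^\([[:space:]]*\)/\1# /  turns "    foo()" into "    # foo()"
    args+=(-e "${n}s/^\\([[:space:]]*\\)/\\1# /")
  done
  sed -i.bak "${args[@]}" "$file"
}

# bash comment_lines.sh apply   (run from the dataset directory)
if [ "${1:-}" = "apply" ]; then
  comment_out_lines create_coco_tf_record.py 310 316
fi
```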

```
PYTHONPATH="tf-models:tf-models/research" python3 create_coco_tf_record.py \
  --logtostderr \
  --include_masks \
  --train_image_dir=/path/to/COCO2017/images/train2017 \
  --val_image_dir=/path/to/COCO2017/images/val2017 \
  --train_object_annotations_file=/path/to/COCO2017/annotations/instances_train2017.json \
  --val_object_annotations_file=/path/to/COCO2017/annotations/instances_val2017.json \
  --train_caption_annotations_file=/path/to/COCO2017/annotations/captions_train2017.json \
  --val_caption_annotations_file=/path/to/COCO2017/annotations/captions_val2017.json \
  --output_dir=coco2017_tfrecord  
```
This generates the `coco2017_tfrecord` folder.
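
The conversion takes a long time, so it can help to check its inputs first. This is a minimal sketch, assuming the COCO 2017 layout implied by the `/path/to/COCO2017/...` arguments above (`images/train2017`, `images/val2017` and the four annotation JSON files); the function and script names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: check that the COCO 2017 inputs used by create_coco_tf_record.py
# exist before starting the conversion.
set -uo pipefail

# Prints each missing path; returns 1 if anything is missing.
check_coco_inputs() {
  local root=$1 p status=0
  for p in images/train2017 images/val2017; do
    [ -d "$root/$p" ] || { echo "missing dir: $root/$p"; status=1; }
  done
  for p in instances_train2017 instances_val2017 captions_train2017 captions_val2017; do
    [ -f "$root/annotations/$p.json" ] || { echo "missing file: $root/annotations/$p.json"; status=1; }
  done
  return "$status"
}

# bash check_coco.sh /path/to/COCO2017
if [ -n "${1:-}" ]; then
  check_coco_inputs "$1" && echo "COCO inputs look complete"
fi
```
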
## 3) Download pretrained models
The model files are organized as follows, with the download URLs listed beneath the entries they belong to:

``` 
weights/
>mask-rcnn/1555659850/  
https://storage.googleapis.com/cloud-tpu-checkpoints/mask-rcnn/1555659850/saved_model.pb 
>>variables/  
https://storage.googleapis.com/cloud-tpu-checkpoints/mask-rcnn/1555659850/variables/variables.data-00000-of-00001  
https://storage.googleapis.com/cloud-tpu-checkpoints/mask-rcnn/1555659850/variables/variables.index  
>resnet/
>>extracted_from_maskrcnn/
>>resnet-nhwc-2018-02-07/  
https://storage.googleapis.com/cloud-tpu-checkpoints/retinanet/resnet50-checkpoint-2018-02-07/checkpoint 
>>>model.ckpt-112603/  
https://storage.googleapis.com/cloud-tpu-checkpoints/retinanet/resnet50-checkpoint-2018-02-07/model.ckpt-112603.data-00000-of-00001  
https://storage.googleapis.com/cloud-tpu-checkpoints/retinanet/resnet50-checkpoint-2018-02-07/model.ckpt-112603.index  
https://storage.googleapis.com/cloud-tpu-checkpoints/retinanet/resnet50-checkpoint-2018-02-07/model.ckpt-112603.meta  
>>resnet-nhwc-2018-10-14/
```
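
The files above can be fetched with a short script. This sketch makes two assumptions: the weights live in `./weights` (matching `--weights_dir weights` in the commands below), and `model.ckpt-112603` is a TensorFlow checkpoint prefix, so its `.data`, `.index` and `.meta` files are saved next to the `checkpoint` file rather than in a subdirectory; adjust the paths if your scripts expect the latter. The entries without a URL (`extracted_from_maskrcnn/`, `resnet-nhwc-2018-10-14/`) are not handled. `WEIGHTS_DIR`, the function names and the `download` argument are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: download the pretrained weights listed above into $WEIGHTS_DIR.
# Assumption: model.ckpt-112603 is a checkpoint prefix, so its files are
# stored next to the "checkpoint" file, not in a subdirectory.
set -euo pipefail

BASE=https://storage.googleapis.com/cloud-tpu-checkpoints
WEIGHTS_DIR=${WEIGHTS_DIR:-weights}

# One "<local path> <url>" pair per line, relative to $WEIGHTS_DIR.
manifest() {
  local m=mask-rcnn/1555659850
  local r=resnet/resnet-nhwc-2018-02-07
  local src=$BASE/retinanet/resnet50-checkpoint-2018-02-07
  local f
  for f in saved_model.pb variables/variables.data-00000-of-00001 variables/variables.index; do
    echo "$m/$f $BASE/$m/$f"
  done
  for f in checkpoint model.ckpt-112603.data-00000-of-00001 model.ckpt-112603.index model.ckpt-112603.meta; do
    echo "$r/$f $src/$f"
  done
}

download_weights() {
  local path url
  manifest | while read -r path url; do
    mkdir -p "$WEIGHTS_DIR/$(dirname "$path")"
    wget -q -O "$WEIGHTS_DIR/$path" "$url"
  done
}

# bash download_weights.sh download
if [ "${1:-}" = "download" ]; then
  download_weights
fi
```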

# Testing

## Single-card training
```
# Without XLA
export HIP_VISIBLE_DEVICES=0
python3 scripts/benchmark_training.py --gpus 1 --batch_size 2 --model_dir save_model --data_dir /path/to/MaskRCNN_tf2/dataset/coco2017_tfrecord --weights_dir weights

# With XLA
export HIP_VISIBLE_DEVICES=0
python3 scripts/benchmark_training_xla.py --gpus 1 --batch_size 2 --model_dir save_model --data_dir /path/to/MaskRCNN_tf2/dataset/coco2017_tfrecord --weights_dir weights
```
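
To avoid editing the command for each run, the two single-card variants can share a launcher that picks the XLA or non-XLA script and reads the data path from a variable. This is a minimal sketch; `DATA_DIR`, `MODEL_DIR`, `BATCH_SIZE` and `USE_XLA` are illustrative names, and the default `dataset/coco2017_tfrecord` assumes the launcher is started from the repository root, where the data-preparation step wrote its output.

```shell
#!/usr/bin/env bash
# Sketch: single-card training launcher with an XLA switch.
set -euo pipefail

DATA_DIR=${DATA_DIR:-dataset/coco2017_tfrecord}
MODEL_DIR=${MODEL_DIR:-save_model}
BATCH_SIZE=${BATCH_SIZE:-2}
USE_XLA=${USE_XLA:-0}

# Print the training script for a given XLA setting (1 = XLA, else non-XLA).
training_script() {
  if [ "$1" = "1" ]; then
    echo scripts/benchmark_training_xla.py
  else
    echo scripts/benchmark_training.py
  fi
}

if [ "${1:-}" = "run" ]; then
  export HIP_VISIBLE_DEVICES=0
  python3 "$(training_script "$USE_XLA")" --gpus 1 --batch_size "$BATCH_SIZE" \
    --model_dir "$MODEL_DIR" --data_dir "$DATA_DIR" --weights_dir weights
fi
```

For example, `USE_XLA=1 bash train_1card.sh run` would start the XLA variant (the script name is again illustrative).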
## Multi-card training
``` 
# Without XLA
export HIP_VISIBLE_DEVICES=0,1
python3 scripts/benchmark_training.py --gpus 2 --batch_size 4 --model_dir save_model_2dcu --data_dir /path/to/MaskRCNN_tf2/dataset/coco2017_tfrecord --weights_dir weights

# With XLA
export HIP_VISIBLE_DEVICES=0,1
python3 scripts/benchmark_training_xla.py --gpus 2 --batch_size 4 --model_dir save_model_2dcu --data_dir /path/to/MaskRCNN_tf2/dataset/coco2017_tfrecord --weights_dir weights
```
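
With more cards, `HIP_VISIBLE_DEVICES` and `--gpus` have to agree. The sketch below derives the device list from a single card count. `NUM_CARDS`, `BATCH_SIZE` and `DATA_DIR` are illustrative names; the batch size stays a separate setting because this README does not say whether `--batch_size` is global or per card (it uses 2 for one card and 4 for two).

```shell
#!/usr/bin/env bash
# Sketch: keep HIP_VISIBLE_DEVICES consistent with --gpus.
set -euo pipefail

NUM_CARDS=${NUM_CARDS:-2}
BATCH_SIZE=${BATCH_SIZE:-4}
DATA_DIR=${DATA_DIR:-dataset/coco2017_tfrecord}

# device_list 3 -> 0,1,2
device_list() {
  local n=$1 i list=""
  for ((i = 0; i < n; i++)); do
    list+="${list:+,}$i"
  done
  echo "$list"
}

if [ "${1:-}" = "run" ]; then
  HIP_VISIBLE_DEVICES=$(device_list "$NUM_CARDS")
  export HIP_VISIBLE_DEVICES
  python3 scripts/benchmark_training.py --gpus "$NUM_CARDS" --batch_size "$BATCH_SIZE" \
    --model_dir "save_model_${NUM_CARDS}dcu" --data_dir "$DATA_DIR" --weights_dir weights
fi
```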

## Inference
```
# Without XLA
python3 scripts/benchmark_inference.py --batch_size 2 --model_dir save_model --data_dir /path/to/MaskRCNN_tf2/dataset/coco2017_tfrecord --weights_dir weights

# With XLA
python3 scripts/benchmark_inference_xla.py --batch_size 2 --model_dir save_model --data_dir /path/to/MaskRCNN_tf2/dataset/coco2017_tfrecord --weights_dir weights
```
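
To compare the two inference variants, both benchmarks can be run back to back with each run's output saved to its own log. This is a minimal sketch; `DATA_DIR`, `LOG_DIR` and the log file names are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: run the non-XLA and XLA inference benchmarks and log each run.
set -euo pipefail

DATA_DIR=${DATA_DIR:-dataset/coco2017_tfrecord}
LOG_DIR=${LOG_DIR:-logs}

# log_name scripts/benchmark_inference_xla.py -> benchmark_inference_xla.log
log_name() {
  local base
  base=$(basename "$1")
  echo "${base%.py}.log"
}

run_inference() {
  local script
  mkdir -p "$LOG_DIR"
  for script in scripts/benchmark_inference.py scripts/benchmark_inference_xla.py; do
    python3 "$script" --batch_size 2 --model_dir save_model \
      --data_dir "$DATA_DIR" --weights_dir weights 2>&1 | tee "$LOG_DIR/$(log_name "$script")"
  done
}

if [ "${1:-}" = "run" ]; then
  run_inference
fi
```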

# References
[https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN)