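# 1. Introduction

This script is a functional test case built on the SSD_ResNet34 object detection model and follows the MLPerf project. Once mAP reaches 0.23, the model is considered converged and the job finishes successfully.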

# 2. Running

## Install dependencies
    Cython==0.28.4
    mlperf-compliance==0.0.10
    cycler==0.10.0
    kiwisolver==1.0.1
    matplotlib==2.2.2
    numpy==1.14.5
    Pillow==5.2.0
    pyparsing==2.2.0
    python-dateutil==2.7.3
    pytz==2018.5
    six==1.11.0
    torchvision (skip if already installed)
    apex (skip if already installed)
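
One way to install the pinned packages is to write them to a requirements file and hand it to pip. The sketch below assumes `pip3` is on the PATH; the file name `requirements.txt` and the `RUN_INSTALL` switch are illustrative choices, not part of this repository. torchvision and apex are left out because they are only needed when not already installed.

```shell
# Write the pinned versions listed above to a requirements file.
# The file name is an illustrative choice, not part of the repository.
cat > requirements.txt <<'EOF'
Cython==0.28.4
mlperf-compliance==0.0.10
cycler==0.10.0
kiwisolver==1.0.1
matplotlib==2.2.2
numpy==1.14.5
Pillow==5.2.0
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.5
six==1.11.0
EOF

# RUN_INSTALL is a hypothetical switch so the file can be inspected first.
if [ "${RUN_INSTALL:-0}" = "1" ]; then
    pip3 install -r requirements.txt
fi
```

Running it with `RUN_INSTALL=1` performs the actual installation.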


## Download the dataset
   
    bash download_dataset.sh


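## Run the training scripts
* config_singlenode.sh sets the environment and system hyperparameters for single-node runs; adjust it to match your setup.
* config_multinode.sh sets the environment and system hyperparameters for multi-node runs; adjust it to match your setup.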
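
The commands in this section read their settings from shell variables. As a rough sketch of what such a configuration could define, the block below sets each variable used in the commands; every value is a hypothetical placeholder, and the real ones should come from config_singlenode.sh or config_multinode.sh.

```shell
# Hypothetical defaults; the real values live in the config scripts.
# ${VAR:=value} keeps any value the caller has already set.
: "${DATASET_DIR:=/data/coco2017}"   # placeholder dataset path
: "${NUMEPOCHS:=80}"                 # placeholder epoch count
: "${BATCH_SIZE:=32}"                # placeholder per-process batch size
: "${LR:=2.5e-3}"                    # placeholder learning rate
: "${WARMUP:=300}"                   # placeholder warmup setting
: "${NSOCKET:=1}"                    # passed to --nsockets_per_node
: "${SOCKETCORES:=16}"               # passed to --ncores_per_socket
: "${NTASKS_PER_NODE:=4}"            # passed to --nproc_per_node
export DATASET_DIR NUMEPOCHS BATCH_SIZE LR WARMUP
export NSOCKET SOCKETCORES NTASKS_PER_NODE
```

Because of the `:=` form, a value set before sourcing (for example `BATCH_SIZE=64`) takes precedence over the default.
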
### Single node, single card (FP32)
    python3 train_fp32.py \
                      --epochs "${NUMEPOCHS}" \
                      --warmup-factor 0 \
                      --lr "${LR}" \
                      --no-save \
                      --threshold=0.23 \
                      --data "${DATASET_DIR}" \
                      --batch-size "${BATCH_SIZE}" \
                      --warmup "${WARMUP}"
### Single node, multiple cards (FP32)

    python3 -m bind_launch --nsockets_per_node "${NSOCKET}" \
                      --ncores_per_socket "${SOCKETCORES}" \
                      --nproc_per_node "${NTASKS_PER_NODE}" \
                      --no_hyperthreads \
                      --no_membind \
                      train_fp32.py \
                      --epochs "${NUMEPOCHS}" \
                      --warmup-factor 0 \
                      --lr "${LR}" \
                      --no-save \
                      --threshold=0.23 \
                      --data "${DATASET_DIR}" \
                      --batch-size "${BATCH_SIZE}" \
                      --warmup "${WARMUP}"
* See the job submission script run_fp32_single.sh for reference.
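
A submission script could also assemble the flag list one group at a time, which avoids relying on line continuations. The following is only a sketch of that pattern and not the contents of run_fp32_single.sh; the default values and the `DRY_RUN` switch are hypothetical.

```shell
# Hypothetical defaults so the sketch runs on its own; real values would
# come from config_singlenode.sh.
NSOCKET="${NSOCKET:-1}"
SOCKETCORES="${SOCKETCORES:-16}"
NTASKS_PER_NODE="${NTASKS_PER_NODE:-4}"
NUMEPOCHS="${NUMEPOCHS:-80}"
LR="${LR:-2.5e-3}"
BATCH_SIZE="${BATCH_SIZE:-32}"
WARMUP="${WARMUP:-300}"
DATASET_DIR="${DATASET_DIR:-/data/coco2017}"

# Append flags one group at a time; each line is a complete command, so a
# missing line continuation cannot silently drop arguments.
set -- -m bind_launch
set -- "$@" --nsockets_per_node "$NSOCKET" --ncores_per_socket "$SOCKETCORES"
set -- "$@" --nproc_per_node "$NTASKS_PER_NODE" --no_hyperthreads --no_membind
set -- "$@" train_fp32.py
set -- "$@" --epochs "$NUMEPOCHS" --warmup-factor 0 --lr "$LR" --no-save
set -- "$@" --threshold=0.23 --data "$DATASET_DIR"
set -- "$@" --batch-size "$BATCH_SIZE" --warmup "$WARMUP"

# Print the assembled command for inspection.
LAUNCH_CMD="python3 $*"
echo "$LAUNCH_CMD"

# DRY_RUN is a hypothetical switch; set DRY_RUN=0 to actually launch.
if [ "${DRY_RUN:-1}" != "1" ]; then
    python3 "$@"
fi
```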

### Multiple nodes, multiple cards (FP32)
   
    sh run_fp32_multi.sh

* See run_fp32_multi.sh; the hostfile it uses has the following format:
        
        node1 slots=4  
        node2 slots=4
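
A hostfile in this format can be generated from a node list. The sketch below uses hypothetical node names and slot counts, and assumes the Open MPI-style meaning of `slots` (the number of processes a node may run); check which launcher run_fp32_multi.sh actually uses.

```shell
# Hypothetical node names and slot count; replace with real cluster values.
NODES="node1 node2"
SLOTS_PER_NODE=4

: > hostfile    # create the file, or truncate an existing one
for node in $NODES; do
    echo "${node} slots=${SLOTS_PER_NODE}" >> hostfile
done

# Sum the slots to see how many processes the job can start in total.
TOTAL_SLOTS=$(awk -F'slots=' '{ sum += $2 } END { print sum }' hostfile)
echo "total slots: ${TOTAL_SLOTS}"
```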
    
### Single node, single card (FP16)
    python3 train_fp16.py \
                      --epochs "${NUMEPOCHS}" \
                      --warmup-factor 0 \
                      --lr "${LR}" \
                      --no-save \
                      --threshold=0.23 \
                      --data ${DATASET_DIR} \
                      --opt-level O3 --loss-scale="dynamic" --keep-batchnorm-fp32 True \
                      --batch-size 180 \
                      --warmup ${WARMUP}

### Single node, multiple cards (FP16)
    python3 -m bind_launch --nsockets_per_node ${NSOCKET} \
                      --ncores_per_socket ${SOCKETCORES} \
                      --nproc_per_node ${NTASKS_PER_NODE} \
                      --no_hyperthreads \
                      --no_membind \
                      train_fp16.py \
                      --epochs "${NUMEPOCHS}" \
                      --warmup-factor 0 \
                      --lr "${LR}" \
                      --no-save \
                      --threshold=0.23 \
                      --data ${DATASET_DIR} \
                      --opt-level O3 --loss-scale="dynamic" --keep-batchnorm-fp32 True \
                      --batch-size 180 \
                      --warmup ${WARMUP}
* See the job submission script run_fp16_single.sh for reference.

### Multiple nodes, multiple cards (FP16)
    sh run_fp16_multi.sh
* As with FP32, see the hostfile format described above.


# 3. Dataset


### Publication/Attribution
Microsoft COCO: COmmon Objects in Context. 2017.

### Training and test data separation
Train on the COCO 2017 train set and compute mAP on the COCO 2017 val set.

# 4. Model
### Publication/Attribution
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. In the Proceedings of the European Conference on Computer Vision (ECCV), 2016.

Backbone is ResNet34 pretrained on ILSVRC 2012 (from torchvision). Modifications to the backbone network: remove conv_5x residual blocks, change the first 3x3 convolution of the conv_4x block from stride 2 to stride 1 (this increases the resolution of the feature map to which detector heads are attached), attach all 6 detector heads to the output of the last conv_4x residual block. Thus detections are attached to 38x38, 19x19, 10x10, 5x5, 3x3, and 1x1 feature maps.
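
To get a feel for how many candidate detections these six feature maps produce, the sketch below sums the default boxes over all locations. The feature-map sizes come from the paragraph above; the boxes-per-location values (4, 6, 6, 6, 4, 4) are an assumption taken from the SSD300 configuration in the SSD paper and should be checked against the model code.

```shell
# Feature-map side lengths from the description above.
SIZES="38 19 10 5 3 1"
# Assumed default boxes per location (SSD300 configuration in the SSD paper).
BOXES="4 6 6 6 4 4"

TOTAL=0
set -- $BOXES                 # walk BOXES in step with SIZES
for s in $SIZES; do
    TOTAL=$(( TOTAL + s * s * $1 ))
    shift
done
echo "default boxes: ${TOTAL}"
```

Under these assumptions the 38x38 map alone contributes 5776 of the boxes, which is why the stride change in conv_4x has a large effect on detection density.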

# 5. Evaluation metrics
### Quality metric
Metric is COCO box mAP (averaged over IoU of 0.5:0.95), computed over 2017 COCO val data.
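
In COCO notation, 0.5:0.95 denotes ten IoU thresholds from 0.50 to 0.95 in steps of 0.05; AP is computed at each threshold and the results are averaged. As a small illustration, the thresholds can be enumerated like this (the variable names are arbitrary):

```shell
# Step through integers 50..95 and divide by 100 so that floating-point
# drift cannot drop the last threshold.
THRESHOLDS=$(awk 'BEGIN { for (i = 50; i <= 95; i += 5) printf "%.2f\n", i / 100 }')
COUNT=$(echo "$THRESHOLDS" | grep -c .)

echo "$THRESHOLDS"
echo "number of IoU thresholds: ${COUNT}"
```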

### Quality target
mAP of 0.23
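
The training scripts apply this target through `--threshold=0.23`. For checking a reported mAP outside the training script, a comparison along the following lines could be used; the function name `reached_target` and the sample value are hypothetical, and treating a value equal to the target as a pass is an assumption.

```shell
TARGET_MAP=0.23

# Succeeds (exit status 0) when the given mAP is at or above the target.
# awk does the comparison because POSIX shell arithmetic is integer-only.
reached_target() {
    awk -v m="$1" -v t="$TARGET_MAP" 'BEGIN { exit !(m + 0 >= t + 0) }'
}

# Hypothetical sample value, for illustration only.
if reached_target 0.231; then
    echo "converged"
fi
```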

### Evaluation frequency

### Evaluation thoroughness
All the images in COCO 2017 val data set.

# 6. References
[https://github.com/mlperf/training/tree/master/single_stage_detector/ssd](https://github.com/mlperf/training/tree/master/single_stage_detector/ssd)