### 1. Benchmark Download

transformer: https://github.com/pytorch/fairseq.git

The current code on GitHub differs from the version used here; it is best to use the older local copy of the code, otherwise the HIP build of torch may not be recognized.
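A quick way to confirm that the installed wheel is the HIP build (`torch.version.hip` is populated only on ROCm/HIP builds of torch; the wheel itself is installed in section 2.2.2):

```
python3 -c "import torch; print(torch.__version__); print(torch.version.hip)"
```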
### 2. Dataset and Environment Preparation

#### 2.1 Dataset Preparation
Prepare the environment:

```
pip3 install fastBPE sacremoses subword_nmt
```

Download the dataset:

```
cd examples/translation/
bash prepare-wmt14en2de.sh --icml17
```

Preprocess the data:

```
DATA_DIR=~/data/wmt14_en_de_joined_dict
TEXT=`pwd`/examples/translation/wmt14_en_de

fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir $DATA_DIR --nwordssrc 32768 --nwordstgt 32768 \
    --joined-dictionary --workers 20
```

Relevant parameters:

- `--source-lang`  source language
- `--target-lang`  target language
- `--trainpref`    train file prefix (also used to build dictionaries)
- `--validpref`    comma-separated valid file prefixes (words missing from the train set are replaced with `<unk>`)
- `--testpref`     comma-separated test file prefixes (words missing from the train set are replaced with `<unk>`)
- `--destdir`      destination dir, default: `data-bin`

Dataset path:

```
DATA_PATH=~/data/wmt14_en_de_joined_dict
```
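If preprocessing succeeds, `$DATA_DIR` should contain the joined dictionary and the binarized splits. A quick sanity check (file names assume fairseq's default output naming):

```
ls ~/data/wmt14_en_de_joined_dict
# expected, roughly: dict.en.txt dict.de.txt preprocess.log
#   train.en-de.{en,de}.{bin,idx} valid.en-de.{en,de}.{bin,idx} test.en-de.{en,de}.{bin,idx}
```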



#### 2.2 Environment Deployment

##### 2.2.1 Create a virtual environment for testing

```
virtualenv -p python3 venv
source venv/bin/activate
```

##### 2.2.2 Install the Python 3.7 dependencies

```
pip3 install --upgrade pip
pip3 install typing
pip3 install sacremoses
pip3 install numpy
pip3 install torch-1.10.0a0+gitd8cde89.atomic.dtk22042-cp37-cp37m-linux_x86_64.whl
pip3 install apex-0.1_dtk22.04-cp37-cp37m-linux_x86_64.whl
pip3 install setuptools==59.5.0
pip3 install protobuf==3.20.0
```
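After installing the wheels, a quick check that torch can see the DCU devices (the ROCm/HIP build of torch reuses the `torch.cuda` API):

```
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```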

##### 2.2.3 Install fairseq

```
# git clone https://github.com/pytorch/fairseq.git
# cd fairseq
# The transformer directory in this repo is a copy of fairseq from GitHub, so it can be installed directly:
pip3 install --editable ./
```

Note: before installing, you can comment out the torch and torchaudio entries in setup.py so that pip does not overwrite the DTK wheels installed above.
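A minimal sketch of how to do that, assuming the dependencies appear as `"torch"` and `"torchaudio..."` entries in the `install_requires` list of setup.py (check the exact strings in your copy first):

```
# comment out the torch / torchaudio requirements in setup.py
sed -i 's/^\( *\)"torch"/\1# "torch"/; s/^\( *\)"torchaudio/\1# "torchaudio/' setup.py
```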

##### 2.2.4 Set the environment variables in env.sh
```
WORK_PATH=`pwd`
source env.sh
```
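The contents of env.sh are not included in this README; presumably it consumes `WORK_PATH` set above. A minimal sketch of what such a file typically sets on a DTK/ROCm system (all paths here are assumptions; adjust to your installation):

```
#!/bin/bash
# hypothetical env.sh: point the toolchain at the DTK/ROCm installation
export ROCM_PATH=/opt/rocm
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$ROCM_PATH/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=$WORK_PATH:$PYTHONPATH
```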

### 3. transformer Tests (Kunshan)

#### 3.1 Single-card test (FP32)

##### 3.1.1 run.sh
```
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
export HIP_VISIBLE_DEVICES=0
export TOKEN=2560
export DATA_PATH=~/data/wmt14_en_de_joined_dict

export HIP_LAUNCH_BLOCKING=1
export ROCBLAS_ATOMICS_MOD=1
python3 train.py \
    $DATA_PATH \
    --arch transformer_wmt_en_de \
    --share-decoder-input-output-embed \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 \
    --lr 5e-4 \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 \
    --dropout 0.3 \
    --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens ${TOKEN} \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric \
    --max-epoch 1
```

##### 3.1.2 Run

```
./run.sh
```

#### 3.2 Four-card test (FP32)

##### 3.2.1 single_process.sh

```
#!/bin/bash
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export NCCL_SOCKET_IFNAME=eno1
export HSA_USERPTR_FOR_PAGED_MEM=0
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
TOKENS=2560
DATA_PATH=~/fairseq/examples/translation/data-bin/wmt14_en_de_joined_dict
APP="python3 ~/fairseq/train.py $DATA_PATH --arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas (0.9,0.98) --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens $TOKENS --eval-bleu --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10} --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --distributed-rank ${comm_rank} --distributed-world-size ${comm_size} --device-id ${lrank} --local_rank ${lrank} --distributed-init-method tcp://${1}:34567 --distributed-no-spawn --max-epoch 1"
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
```
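The rank-to-NUMA-node and rank-to-NIC mapping in the case statement assumes a node with four NUMA domains and four mlx5 HCAs. Before editing the bindings for different hardware, the actual topology can be checked with:

```
numactl --hardware   # list NUMA nodes and their CPUs/memory
ibdev2netdev         # map mlx5_* devices to network interfaces (part of Mellanox OFED, if installed)
```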

##### 3.2.2 run4.sh

```
#!/usr/bin/env bash
#SBATCH -J distribute
#SBATCH -p wzhdtest
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:4
set -x
hostfile=./$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}
for i in `cat $hostfile`
do
    echo ${i} slots=4 >> `pwd`/hostfile-$SLURM_JOB_ID
    ((num_node=${num_node}+1))
done
num_dcu=$((${num_node}*4))
echo $num_dcu
nodename=$(cat $hostfile |sed -n "1p")
echo $nodename
dist_url=`echo $nodename | awk '{print $1}'`
export HSA_USERPTR_FOR_PAGED_MEM=0
mpirun -np ${num_dcu} --hostfile hostfile-$SLURM_JOB_ID single_process.sh $dist_url
```
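The loop above produces a hostfile with one `slots=4` line per allocated node. For a single-node job on a hypothetical node named node001, `hostfile-$SLURM_JOB_ID` would contain:

```
node001 slots=4
```

and `num_dcu` evaluates to 4, so mpirun starts four ranks on that node.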
##### 3.2.3 Run

```
sbatch run4.sh
```

##### 3.2.4 Parameter notes

- In single_process.sh above, the key setting is `--max-tokens`;
- `--arch` selects the network to test, e.g. `transformer_wmt_en_de`, as shown in the sketch after this list;
- The mpirun command in run4.sh above launches training on 4 DCU accelerator cards.
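For example, to try a different model size or per-card token budget, only these values in single_process.sh would change (`transformer_wmt_en_de_big` is a standard fairseq architecture name; the token value is illustrative):

```
TOKENS=1280
# and inside APP, replace the architecture flag:
#   --arch transformer_wmt_en_de_big
```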

#### 3.3 Single-card test (FP16)

##### 3.3.1 fp16_run_transformer.sh

```
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
export HIP_VISIBLE_DEVICES=0
export DATA_PATH=~/data/wmt14_en_de_joined_dict
export TOKEN=2560
python3 train.py \
    $DATA_PATH \
    --save-dir module-fp16_2560 \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens ${TOKEN} \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --max-epoch 1 --fp16
```

##### 3.3.2 Run

```
./fp16_run_transformer.sh
```

##### 3.3.3 Parameter notes

`--max-tokens`  sets the batch size as a token budget

`--fp16`  enables half-precision training

#### 3.4 Four-card test (FP16)

##### 3.4.1 fp16_single_process.sh

```
#!/bin/bash
export HIP_VISIBLE_DEVICES=0
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export NCCL_SOCKET_IFNAME=eno1
export HSA_USERPTR_FOR_PAGED_MEM=0
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
TOKENS=2560
DATA_PATH=~/data/wmt14_en_de_joined_dict
APP="python3 ~/fairseq/train.py $DATA_PATH --arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas (0.9,0.98) --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens $TOKENS --eval-bleu --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10} --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --distributed-rank ${comm_rank} --distributed-world-size ${comm_size} --device-id ${lrank} --local_rank ${lrank} --distributed-init-method tcp://${1}:34567 --distributed-no-spawn --max-epoch 1 --fp16"
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
```

##### 3.4.2 fp16_run_transformer_4dcus.sh

```
#!/usr/bin/env bash
#SBATCH -J transformer
#SBATCH -p wzhdtest
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:4
set -x
hostfile=./$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}
for i in `cat $hostfile`
do
    echo ${i} slots=4 >> `pwd`/hostfile-$SLURM_JOB_ID
    ((num_node=${num_node}+1))
done
num_dcu=$((${num_node}*4))
echo $num_dcu
nodename=$(cat $hostfile |sed -n "1p")
echo $nodename
dist_url=`echo $nodename | awk '{print $1}'`
export HSA_USERPTR_FOR_PAGED_MEM=0
mpirun -np ${num_dcu} --hostfile hostfile-$SLURM_JOB_ID fp16_single_process.sh $dist_url
```

##### 3.4.3 Run

```
sbatch fp16_run_transformer_4dcus.sh
```

##### 3.4.4 Parameter notes

- In fp16_single_process.sh above, the key setting is `--max-tokens`;
- `--arch` selects the network to test, e.g. `transformer_wmt_en_de`;
- The mpirun command in fp16_run_transformer_4dcus.sh above launches training on 4 DCU accelerator cards.

#### 3.5 Known Issues

##### 3.5.1 Format error

The error message is as follows:

```
  File "~/virturlenv-test/venv/lib/python3.6/site-packages/sacrebleu/metrics/bleu.py", line 103, in __init__
    self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:d} "
 ...

ValueError: Unknown format code 'd' for object of type 'float'
```

Fix: in the bleu.py file named in the traceback, change the `d` format codes on lines 103 and 104 to `.0f`:

```
# after the fix
self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:.0f} "
self._verbose += f"ref_len = {self.ref_len:.0f}"
```



##### 3.5.2 JSON parse error

The error message is as follows:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Looking at the value that fails to parse, this is a quote-nesting problem.

The incorrect value, as printed:

```
'eval_bleu_args': "'{beam:5,max_len_a:1.2,max_len_b:10}'
```

The correct value, as printed:

```
'eval_bleu_args': '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'
```

Change the argument in the shell command to:

```
" --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10}"
```
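The escaped argument string can be sanity-checked before launching with a quick one-liner (standard library only):

```
python3 -c 'import json; print(json.loads("{\"beam\": 5, \"max_len_a\": 1.2, \"max_len_b\": 10}"))'
# prints: {'beam': 5, 'max_len_a': 1.2, 'max_len_b': 10}
```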