### 1. Benchmark download

transformer: https://github.com/pytorch/fairseq.git

The latest code on GitHub differs from the version used here. Prefer the local, older copy of the code; otherwise the HIP build of torch may not be recognized.

### 2.1 Dataset preparation
```
Environment preparation:

pip3 install fastBPE sacremoses subword_nmt

Download the dataset:

cd examples/translation/
bash prepare-wmt14en2de.sh --icml17

Preprocess the data:

DATA_DIR=`pwd`/data-bin/wmt14_en_de_joined_dict
TEXT=`pwd`/wmt14_en_de
fairseq-preprocess --source-lang en --target-lang de --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test --destdir $DATA_DIR --nwordssrc 32768 --nwordstgt 32768 --joined-dictionary --workers 20

Parameter notes:

--source-lang   source language
--target-lang   target language
--trainpref     train file prefix (also used to build dictionaries)
--validpref     comma-separated valid file prefixes (words missing from the train set are replaced with <unk>)
--testpref      comma-separated test file prefixes (words missing from the train set are replaced with <unk>)
--destdir       destination dir, default: "data-bin"

Dataset path:

DATA_PATH=`pwd`/data-bin/wmt14_en_de_joined_dict
```
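If preprocessing succeeds, the destination directory should contain the joined dictionaries and the binarized splits. A small sketch that checks for them; the file layout is assumed from fairseq's usual binarized-dataset naming (`dict.<lang>.txt`, `<split>.<src>-<tgt>.<lang>.bin/.idx`), so adjust it if your fairseq version writes different names:

```python
import os

# Expected outputs of fairseq-preprocess (assumed naming convention:
# dict.<lang>.txt plus <split>.<src>-<tgt>.<lang>.bin/.idx per split).
def expected_outputs(dest, src="en", tgt="de"):
    names = [f"dict.{src}.txt", f"dict.{tgt}.txt"]
    for split in ("train", "valid", "test"):
        for lang in (src, tgt):
            names.append(f"{split}.{src}-{tgt}.{lang}.bin")
            names.append(f"{split}.{src}-{tgt}.{lang}.idx")
    return [os.path.join(dest, n) for n in names]

if __name__ == "__main__":
    data_dir = "data-bin/wmt14_en_de_joined_dict"
    missing = [p for p in expected_outputs(data_dir) if not os.path.exists(p)]
    print("missing files:", missing or "none")
```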



#### 2.2. Environment setup

##### 2.2.1. Create a virtual environment for testing

```
virtualenv -p python3 venv
source venv/bin/activate
```

##### 2.2.2. Install the dependencies for Python 3.6

```
pip3 install --upgrade pip
pip3 install typing
pip3 install sacremoses
pip3 install numpy
pip3 install torch-1.10.0a0+gitcc7c9c7-cp36-cp36m-linux_x86_64.whl
pip3 install apex-0.1-cp36-cp36m-linux_x86_64.whl
```

##### 2.2.3. Install fairseq

```
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip3 install --editable ./

Note: you can first comment out the torch and torch-audio install requirements in setup.py.
```

##### 2.2.4. Set environment variables via env.sh

(On the Kunshan platform the dependency libraries are already configured; just set the environment variables and use them.)

```
module unload compiler/rocm/2.9
module load compiler/rocm/dtk-21.10.1

export PYTORCH_DEP=/public/software/apps/DeepLearning/PyTorch
export C_INCLUDE_PATH=${PYTORCH_DEP}/gflags-2.1.2-build/include:${PYTORCH_DEP}/glog-build/include:$C_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=${PYTORCH_DEP}/gflags-2.1.2-build/include:${PYTORCH_DEP}/glog-build/include:$CPLUS_INCLUDE_PATH
export LD_LIBRARY_PATH=${PYTORCH_DEP}/glog-build/lib/:${PYTORCH_DEP}/lmdb-0.9.24-build/lib/:${PYTORCH_DEP}/opencv-2.4.13.6-build/lib/:${PYTORCH_DEP}/openblas-0.3.7-build/lib/:${PYTORCH_DEP}/gflags-2.1.2-build/lib/:${PYTORCH_DEP}/lib/:${PYTORCH_DEP}/openmp-build/lib:$LD_LIBRARY_PATH

export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
```

### 3. transformer test (Kunshan)

```
WORK_PATH=~/fairseq
source env.sh
```

#### 3.1. Single-card test (FP32)

##### 3.1.1.run_transformer_single.sh

```
export HIP_VISIBLE_DEVICES=0

python3 $WORK_PATH/train.py \
   $DATA_PATH \
   --arch transformer_wmt_en_de --share-decoder-input-output-embed \
   --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
   --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
   --dropout 0.3 --weight-decay 0.0001 \
   --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
   --max-tokens 2560 \
   --eval-bleu \
   --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
   --eval-bleu-detok moses \
   --eval-bleu-remove-bpe \
   --eval-bleu-print-samples \
   --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --max-epoch 1
```
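Note that --max-tokens caps the batch size in tokens rather than in sentences, so each batch carries roughly the same amount of work regardless of sentence length. A simplified sketch of the idea (not fairseq's actual batching code, which additionally accounts for padding to the longest sentence in a batch):

```python
# Group sentence indices into batches whose total token count stays
# under max_tokens (simplified illustration of token-based batching).
def batch_by_tokens(lengths, max_tokens=2560):
    batches, current, current_tokens = [], [], 0
    for idx, n_tokens in enumerate(lengths):
        if current and current_tokens + n_tokens > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

print(batch_by_tokens([1000, 1000, 1000, 500], max_tokens=2560))
```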

##### 3.1.2. Run
```
./run_transformer_single.sh
```

#### 3.2. Four-card test (FP32)

##### 3.2.1.single_process.sh

```
#!/bin/bash

export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export NCCL_SOCKET_IFNAME=eno1
export HSA_USERPTR_FOR_PAGED_MEM=0

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE

TOKENS=2560
DATA_PATH=~/fairseq/examples/translation/data-bin/wmt14_en_de_joined_dict

APP="python3 ~/fairseq/train.py $DATA_PATH --arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas (0.9,0.98) --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens $TOKENS --eval-bleu --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10} --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --distributed-rank ${comm_rank} --distributed-world-size ${comm_size} --device-id ${lrank} --local_rank ${lrank} --distributed-init-method tcp://${1}:34567 --distributed-no-spawn --max-epoch 1"

case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
```
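The four case branches above differ only in the index used: local rank N is pinned to NUMA node N and to InfiniBand device mlx5_N (this assumes one HCA per NUMA domain, as on the Kunshan nodes). The mapping the script encodes, as a sketch:

```python
# Per-local-rank binding encoded by the case statement above:
# rank n -> NUMA node n, InfiniBand device mlx5_n (assumed node topology).
def rank_binding(lrank):
    return {
        "UCX_NET_DEVICES": f"mlx5_{lrank}:1",
        "UCX_IB_PCI_BW": f"mlx5_{lrank}:50Gbs",
        "numactl": f"--cpunodebind={lrank} --membind={lrank}",
    }

print(rank_binding(2))
```

With a mapping like this, the four near-identical branches could in principle collapse into a single parameterized block.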

##### 3.2.2.run_transformer_4dcus.sh

```
#!/usr/bin/env bash

#SBATCH -J distribute
#SBATCH -p kshdtest
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:4

set -x
source env.sh

hostfile=./$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}

for i in `cat $hostfile`
do
    echo ${i} slots=4 >> `pwd`/hostfile-$SLURM_JOB_ID
    ((num_node=${num_node}+1))
done

num_dcu=$((${num_node}*4))
echo $num_dcu

nodename=$(cat $hostfile | sed -n "1p")
echo $nodename
dist_url=`echo $nodename | awk '{print $1}'`

export HSA_USERPTR_FOR_PAGED_MEM=0
mpirun -np ${num_dcu} --hostfile hostfile-$SLURM_JOB_ID single_process.sh $dist_url
```
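The loop above writes one `slots=4` line per allocated node and counts the nodes, so mpirun launches four ranks (one per DCU) on each node. The same bookkeeping, as a sketch:

```python
# One hostfile line per node, four slots (DCUs) each;
# total MPI ranks = number of nodes * slots per node.
def hostfile_lines(hostnames, slots_per_node=4):
    return [f"{h} slots={slots_per_node}" for h in hostnames]

def total_ranks(hostnames, slots_per_node=4):
    return len(hostnames) * slots_per_node

print(hostfile_lines(["node1"]), total_ranks(["node1"]))
```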

##### 3.2.3. Run
```
sbatch run_transformer_4dcus.sh
```

##### 3.2.4. Parameter notes

- In single_process.sh above, pay attention to --max-tokens.
- --arch selects the network to test, e.g. transformer_wmt_en_de.
- The mpirun command in run_transformer_4dcus.sh above trains on four DCU accelerator cards.

#### 3.3. Single-card test (FP16)

##### 3.3.1.fp16_run_transformer.sh

```
export HIP_VISIBLE_DEVICES=2

python3 $WORK_PATH/train.py \
    $DATA_PATH \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 2560 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --max-epoch 1 --fp16
```

##### 3.3.2. Run

```
./fp16_run_transformer.sh
```

##### 3.3.3. Parameter notes

--max-tokens   sets the batch size by token count

--fp16         trains in half precision
#### 3.4. Four-card test (FP16)

##### 3.4.1.fp16_single_process.sh

```
#!/bin/bash

export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export NCCL_SOCKET_IFNAME=eno1
export HSA_USERPTR_FOR_PAGED_MEM=0

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE

TOKENS=2560
DATA_PATH=~/fairseq/examples/translation/data-bin/wmt14_en_de_joined_dict

APP="python3 ~/fairseq/train.py $DATA_PATH --arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas (0.9,0.98) --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens $TOKENS --eval-bleu --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10} --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --distributed-rank ${comm_rank} --distributed-world-size ${comm_size} --device-id ${lrank} --local_rank ${lrank} --distributed-init-method tcp://${1}:34567 --distributed-no-spawn --max-epoch 1 --fp16"

case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
```

##### 3.4.2.fp16_run_transformer_4dcus.sh

```
#!/usr/bin/env bash

#SBATCH -J distribute
#SBATCH -p kshdtest
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:4

set -x
source env.sh

hostfile=./$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}

for i in `cat $hostfile`
do
    echo ${i} slots=4 >> `pwd`/hostfile-$SLURM_JOB_ID
    ((num_node=${num_node}+1))
done

num_dcu=$((${num_node}*4))
echo $num_dcu

nodename=$(cat $hostfile | sed -n "1p")
echo $nodename
dist_url=`echo $nodename | awk '{print $1}'`

export HSA_USERPTR_FOR_PAGED_MEM=0
mpirun -np ${num_dcu} --hostfile hostfile-$SLURM_JOB_ID fp16_single_process.sh $dist_url
```

##### 3.4.3. Run

```
sbatch fp16_run_transformer_4dcus.sh
```

##### 3.4.4. Parameter notes

- In fp16_single_process.sh above, pay attention to --max-tokens.
- --arch selects the network to test, e.g. transformer_wmt_en_de.
- The mpirun command in fp16_run_transformer_4dcus.sh above trains on four DCU accelerator cards.

#### 3.5. Known issues

##### 3.5.1. Format-code error

The error message is as follows:

```
  File "~/virturlenv-test/venv/lib/python3.6/site-packages/sacrebleu/metrics/bleu.py", line 103, in __init__
    self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:d} "
 ...

ValueError: Unknown format code 'd' for object of type 'float'
```

Fix: edit the bleu.py shown in the traceback, changing the d format code to .0f on lines 103 and 104.

```
# after the fix
self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:.0f} "
self._verbose += f"ref_len = {self.ref_len:.0f}"
```
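The root cause is that the d presentation type only accepts integers, while sys_len arrives here as a float; .0f prints the same digits but accepts floats. A minimal reproduction:

```python
sys_len = 62076.0  # a float, as in the failing sacrebleu call

try:
    f"hyp_len = {sys_len:d}"      # 'd' requires an int -> ValueError
except ValueError as e:
    print(e)

print(f"hyp_len = {sys_len:.0f}")  # '.0f' accepts the float
```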



##### 3.5.2 JSON parse error

The error message is as follows:

```
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```

From the point where parsing fails, this is clearly a quote-nesting problem.

Incorrectly printed value:

```
'eval_bleu_args': "'{beam:5,max_len_a:1.2,max_len_b:10}'
```

Correctly printed value:

```
'eval_bleu_args': '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'
```

Change the argument to:

```
" --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10}"
```
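The difference can be reproduced directly with the json module: when the extra single quotes survive shell expansion they end up inside the string, which is no longer valid JSON, while the escaped double-quote form parses cleanly:

```python
import json

bad = "'{beam:5,max_len_a:1.2,max_len_b:10}'"             # stray quotes inside the value
good = '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'   # valid JSON

try:
    json.loads(bad)
except json.JSONDecodeError as e:
    print("bad value:", e)    # Expecting value: line 1 column 1 (char 0)

print(json.loads(good))
```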