### 1. Benchmark download address

transformer: https://github.com/pytorch/fairseq.git

The latest code on GitHub differs from the version used here. It is best to use the older local copy of the code; otherwise the HIP build of torch may not be recognized.

### 2.1 Dataset preparation

```
# Environment setup:
pip3 install fastBPE sacremoses subword_nmt

# Dataset download:
cd examples/translation/
bash prepare-wmt14en2de.sh --icml17
# Return to the fairseq root so that the TEXT path below resolves
cd ../..

# Data preprocessing:
DATA_DIR=~/data/wmt14_en_de_joined_dict
TEXT=`pwd`/examples/translation/wmt14_en_de
fairseq-preprocess --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir $DATA_DIR \
  --nwordssrc 32768 --nwordstgt 32768 \
  --joined-dictionary --workers 20

# Parameter notes:
#   --source-lang   source language
#   --target-lang   target language
#   --trainpref     train file prefix (also used to build dictionaries)
#   --validpref     comma separated, valid file prefixes (words missing from train set are replaced with <unk>)
#   --testpref      comma separated, test file prefixes (words missing from train set are replaced with <unk>)
#   --destdir       destination dir, Default: "data-bin"

# Dataset path:
DATA_PATH=~/data/wmt14_en_de_joined_dict
```

#### 2.2. Environment deployment

##### 2.2.1. Create a virtual environment for testing

```
virtualenv -p python3 venv
source venv/bin/activate
```

##### 2.2.2. Install the dependencies for Python 3.7

```
pip3 install --upgrade pip
pip3 install typing
pip3 install sacremoses
pip3 install numpy
pip3 install torch-1.10.0a0+gitd8cde89.atomic.dtk22042-cp37-cp37m-linux_x86_64.whl
pip3 install apex-0.1_dtk22.04-cp37-cp37m-linux_x86_64.whl
pip3 install setuptools==59.5.0
pip3 install protobuf==3.20.0
```

##### 2.2.3. Install fairseq

```
#git clone https://github.com/pytorch/fairseq.git
#cd fairseq
# The whole transformer folder is a copy of fairseq taken from GitHub, so it can be installed directly
pip3 install --editable ./
# Note: you can first comment out the torch and torchaudio requirements in setup.py
```

##### 2.2.4. Set the environment variables with env.sh

```
WORK_PATH=`pwd`
source env.sh
```

### 3. transformer tests (Kunshan)

#### 3.1. Single-card test (single precision)

##### 3.1.1. run.sh

```
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
export HIP_VISIBLE_DEVICES=0
export TOKEN=2560
export DATA_PATH=~/data/wmt14_en_de_joined_dict
export HIP_LAUNCH_BLOCKING=1
export ROCBLAS_ATOMICS_MOD=1
python3 train.py \
  $DATA_PATH \
  --arch transformer_wmt_en_de \
  --share-decoder-input-output-embed \
  --optimizer adam \
  --adam-betas '(0.9, 0.98)' \
  --clip-norm 0.0 \
  --lr 5e-4 \
  --lr-scheduler inverse_sqrt \
  --warmup-updates 4000 \
  --dropout 0.3 \
  --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --max-tokens ${TOKEN} \
  --eval-bleu \
  --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
  --eval-bleu-detok moses \
  --eval-bleu-remove-bpe \
  --eval-bleu-print-samples \
  --best-checkpoint-metric bleu \
  --maximize-best-checkpoint-metric \
  --max-epoch 1
```

##### 3.1.2. Run

```
./run.sh
```

#### 3.2. Four-card test (single precision)

##### 3.2.1. single_process.sh

```
#!/bin/bash
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export NCCL_SOCKET_IFNAME=eno1
export HSA_USERPTR_FOR_PAGED_MEM=0
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
TOKENS=2560
DATA_PATH=~/fairseq/examples/translation/data-bin/wmt14_en_de_joined_dict
# $HOME is used for the script path because ~ is not expanded inside double quotes
APP="python3 $HOME/fairseq/train.py $DATA_PATH \
  --arch transformer_wmt_en_de \
  --share-decoder-input-output-embed \
  --optimizer adam --adam-betas (0.9,0.98) --clip-norm 0.0 \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens $TOKENS \
  --eval-bleu \
  --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10} \
  --eval-bleu-detok moses \
  --eval-bleu-remove-bpe \
  --eval-bleu-print-samples \
  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
  --distributed-rank ${comm_rank} --distributed-world-size ${comm_size} \
  --device-id ${lrank} --local_rank ${lrank} \
  --distributed-init-method tcp://${1}:34567 \
  --distributed-no-spawn \
  --max-epoch 1"
# Each local rank binds to its own NUMA node and InfiniBand device
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
```

##### 3.2.2. run4.sh

```
#!/usr/bin/env bash
#SBATCH -J distribute
#SBATCH -p wzhdtest
#SBATCH -N 1
# Note: the next line is spelled SBARCH, so Slurm treats it as a plain comment and ignores it
#SBARCH -n 32
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:4
set -x

# Build an MPI hostfile with 4 slots per allocated node
hostfile=./$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}
for i in `cat $hostfile`
do
  echo ${i} slots=4 >> `pwd`/hostfile-$SLURM_JOB_ID
  ((num_node=${num_node}+1))
done
num_dcu=$((${num_node}*4))
echo $num_dcu

# The first node in the list serves as the rendezvous address
nodename=$(cat $hostfile |sed -n "1p")
echo $nodename
dist_url=`echo $nodename | awk '{print $1}'`
export HSA_USERPTR_FOR_PAGED_MEM=0
mpirun -np ${num_dcu} --hostfile hostfile-$SLURM_JOB_ID single_process.sh $dist_url
```

##### 3.2.3. Run

```
sbatch run4.sh
```

##### 3.2.4. Parameter notes

- In single_process.sh above, pay attention to --max-tokens;
- Use --arch to select the network to test, e.g. transformer_wmt_en_de;
- The mpirun command in run4.sh above trains on 4 DCU accelerator cards.

#### 3.3. Single-card test (half precision)

##### 3.3.1. fp16_run_transformer.sh

```
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
export HIP_VISIBLE_DEVICES=0
export DATA_PATH=~/data/wmt14_en_de_joined_dict
export TOKEN=2560
python3 train.py \
  $DATA_PATH \
  --save-dir module-fp16_2560 \
  --arch transformer_wmt_en_de --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens ${TOKEN} \
  --eval-bleu \
  --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
  --eval-bleu-detok moses \
  --eval-bleu-remove-bpe \
  --eval-bleu-print-samples \
  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
  --max-epoch 1 \
  --fp16
```

##### 3.3.2. Run

./fp16_run_transformer.sh

##### 3.3.3. Parameter notes

- --max-tokens sets the batch size in terms of tokens;
- --fp16 enables half-precision training.

#### 3.4. Four-card test (half precision)

##### 3.4.1. fp16_single_process.sh

```
#!/bin/bash
export HIP_VISIBLE_DEVICES=0
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export NCCL_SOCKET_IFNAME=eno1
export HSA_USERPTR_FOR_PAGED_MEM=0
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
TOKENS=2560
DATA_PATH=~/data/wmt14_en_de_joined_dict
# $HOME is used for the script path because ~ is not expanded inside double quotes.
# --adam-betas is left unquoted: quotes nested inside this string would reach the program literally.
APP="python3 $HOME/fairseq/train.py $DATA_PATH \
  --arch transformer_wmt_en_de \
  --share-decoder-input-output-embed \
  --optimizer adam --adam-betas (0.9,0.98) --clip-norm 0.0 \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens $TOKENS \
  --eval-bleu \
  --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10} \
  --eval-bleu-detok moses \
  --eval-bleu-remove-bpe \
  --eval-bleu-print-samples \
  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
  --distributed-rank ${comm_rank} --distributed-world-size ${comm_size} \
  --device-id ${lrank} --local_rank ${lrank} \
  --distributed-init-method tcp://${1}:34567 \
  --distributed-no-spawn \
  --max-epoch 1 \
  --fp16"
# Each local rank binds to its own NUMA node and InfiniBand device
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
```

##### 3.4.2. fp16_run_transformer_4dcus.sh

```
#!/usr/bin/env bash
#SBATCH -J transformer
#SBATCH -p wzhdtest
#SBATCH -N 1
# Note: the next line is spelled SBARCH, so Slurm treats it as a plain comment and ignores it
#SBARCH -n 32
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:4
set -x

# Build an MPI hostfile with 4 slots per allocated node
hostfile=./$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}
for i in `cat $hostfile`
do
  echo ${i} slots=4 >> `pwd`/hostfile-$SLURM_JOB_ID
  ((num_node=${num_node}+1))
done
num_dcu=$((${num_node}*4))
echo $num_dcu

# The first node in the list serves as the rendezvous address
nodename=$(cat $hostfile |sed -n "1p")
echo $nodename
dist_url=`echo $nodename | awk '{print $1}'`
export HSA_USERPTR_FOR_PAGED_MEM=0
mpirun -np ${num_dcu} --hostfile hostfile-$SLURM_JOB_ID fp16_single_process.sh $dist_url
```

##### 3.4.3. Run

sbatch fp16_run_transformer_4dcus.sh

##### 3.4.4. Parameter notes

- In fp16_single_process.sh above, pay attention to --max-tokens;
- Use --arch to select the network to test, e.g. transformer_wmt_en_de;
- The mpirun command in fp16_run_transformer_4dcus.sh above trains on 4 DCU accelerator cards.

#### 3.5. Notes on some known issues

##### 3.5.1. Format error

The error message is as follows:

```
File "~/virturlenv-test/venv/lib/python3.6/site-packages/sacrebleu/metrics/bleu.py", line 103, in __init__
    self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:d} "
...
ValueError: Unknown format code 'd' for object of type 'float'
```

Fix: in the bleu.py named in the traceback, change the `d` format code on lines 103 and 104 to `.0f`.

```
# After the change
self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:.0f} "
self._verbose += f"ref_len = {self.ref_len:.0f}"
```

##### 3.5.2 JSON parsing error

The error message is as follows:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Inspecting the value at the point where parsing fails shows that the cause is nested quotes.

Incorrect printed value:

```
'eval_bleu_args': "'{beam:5,max_len_a:1.2,max_len_b:10}'
```

Correct printed value:

```
'eval_bleu_args': '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'
```

Change the argument to:

```
" --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10}"
```
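
The quoting problem can be reproduced outside the training script. The sketch below assumes the argument string is parsed with Python's `json.loads`, which the `JSONDecodeError` in the traceback suggests. The names `try_parse`, `bad_args`, and `good_args` are hypothetical and used only for illustration.

```python
import json

# Value received when the shell quotes are nested incorrectly: the single
# quotes become part of the string and the inner double quotes are lost.
bad_args = "'{beam:5,max_len_a:1.2,max_len_b:10}'"

# Value received when the inner double quotes are escaped correctly.
good_args = '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'


def try_parse(raw):
    """Return the parsed dict, or None when raw is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        print(f"cannot parse {raw!r}: {err}")
        return None


bad_result = try_parse(bad_args)    # fails at the leading single quote
good_result = try_parse(good_args)  # parses into a dict
print(good_result)
```

Printing the received argument with `repr`, as `try_parse` does on failure, makes stray quote characters visible before a long training run is launched.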