Commit 8ddf66c6 authored by sunxx1

Merge branch 'hepj-test' into 'main'

Update the README, add training scripts, and improve the model conversion code

See merge request dcutoolkit/deeplearing/dlexamples_new!38
parents 0200794c bedf3c0c
# Introduction
This example trains the BERT network with the PyTorch framework.
* BERT training comes in two forms, pre-training and fine-tuning; pre-training is itself split into two phases.
* BERT inference accuracy can be validated on different datasets.
* See [README.md](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/develop/PyTorch/NLP/BERT/scripts/README.md) for details on data generation and model conversion.
# Running the Examples: BERT Compute Benchmark
Code examples are currently provided for the two pre-training phases on the English Wikipedia dataset and for fine-tuning on the SQuAD dataset.
## 1. Dataset preparation
The most recent pre-training data is the wiki20220401 dump. The archive is close to 20 GB compressed and roughly 300 GB uncompressed, so the download is slow and decompression needs a lot of disk space. enwiki-20220401-pages-articles-multistream.xml.bz2 can be downloaded from:
https://dumps.wikimedia.org/enwiki/20220401/
This example instead uses the copy of the Wikipedia dataset that is already downloaded and pre-processed on the servers; the pre-training data is split into PHRASE1 (phase 1) and PHRASE2 (phase 2). The cluster paths are listed under Multi-card below.
SQuAD 1.1 question-answering data:
[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
## Pre-training, phase 1
|Parameter|Description|Example|
|:---:|:---:|:---:|
|PATH_PHRASE1|Path to the phase-1 training dataset|/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
|OUTPUT_DIR|Output path|/workspace/results|
|PATH_CONFIG|Path to the config directory|/workspace/bert_large_uncased|
|PATH_PHRASE2|Path to the phase-2 training dataset|/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
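The commands below expect these variables to be set in the shell. A minimal sketch using the illustrative values from the table (adjust to your own layout; note that the commands concatenate `bert_config.json` directly onto `PATH_CONFIG`, so keep the trailing slash):
```
export PATH_PHRASE1=/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
export PATH_PHRASE2=/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
export OUTPUT_DIR=/workspace/results
export PATH_CONFIG=/workspace/bert_large_uncased/
```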
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints1 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
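Training progress is written to the file passed via `--json-summary`; one simple way to follow it while the job runs:
```
tail -f dllogger.json
```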
### Multi-card
* Method 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
On the Kunshan cluster the pre-training data is available at:
```
PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
PATH_PHRASE2=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
```
On the Wuzhen cluster the corresponding paths are:
```
Wuzhen PHRASE1:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
Wuzhen PHRASE2:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
```
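With this hostfile, a multi-node run is typically driven through mpirun. A sketch only, assuming Open MPI and the per-rank payload scripts shipped with this example; the packaged scripts/run_pretrain.sh below wraps the same idea:
```
# 2 nodes x 4 slots = 8 ranks; the payload's rank/world-size arguments must match the total rank count
mpirun --allow-run-as-root --hostfile hostfile -np 8 ./single_pre1_4.sh
```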
```
# scripts/run_pretrain.sh assumes four cards per node by default
cd scripts; bash run_pretrain.sh
```
## 2. Test environment
Note: the dtk, Python, torch and apex versions must all match one another.
```
# 1. Create and activate a Python virtual environment
virtualenv --python=~/package/Python-3.6.8/build/bin/python3 venv_dtk21.10.1_torch1.10
source venv_dtk21.10.1_torch1.10/bin/activate
# 2. Install the dependencies
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install torch-1.10.0a0+gitcc7c9c7-cp36-cp36m-linux_x86_64.whl
pip install torchvision-0.10.0a0+300a8a4-cp36-cp36m-linux_x86_64.whl
pip install apex-0.1-cp36-cp36m-linux_x86_64.whl
# 3. Set the environment variables
module rm compiler/rocm/2.9
export ROCM_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1
export HIP_PATH=${ROCM_PATH}/hip
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PATH
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
# Variables referenced by the mpirun launcher scripts:
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
```
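Once the environment is active, a quick check that PyTorch can see the DCUs (the ROCm/DTK build reports them through the torch.cuda API, so this should print True and the number of visible cards):
```
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```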
## Pre-training, phase 2
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
### Multi-card
* Method 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
```
```
# scripts/run_pretrain2.sh assumes four cards per node by default
cd scripts; bash run_pretrain2.sh
```
## 3. SQuAD test
### 1. Model conversion
```
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```
The model conversion still has problems, possibly because the downloaded TF model differs from model.ckpt-28252, or because of torch/apex version compatibility; this is still being investigated. You can use the already converted model directly for SQuAD fine-tuning. (The PHRASE tests are not affected: PHRASE is pre-training and only needs the training data and the network definition, without loading a checkpoint.)
[Converted model (extraction code: vs8d)](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)
### 2. Parameter description
```
--train_file            training data
--predict_file          prediction file
--init_checkpoint       model checkpoint to load
--vocab_file            vocabulary file
--output_dir            output directory
--config_file           model configuration file
--json-summary          output JSON summary file
--bert_model            BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--do_train              run training
--do_predict            run prediction
--train_batch_size      training batch size
--predict_batch_size    prediction batch size
--gpus_per_node         number of cards used per node
--local_rank            local_rank for distributed training (set to -1 for a single card)
--fp16                  mixed-precision training
--amp                   mixed-precision training
```
### 3. Run
```
# Single card
./bert_squad.sh         # FP32: edit the APP setting in single_squad.sh to match your own paths
./bert_squad_fp16.sh    # FP16: edit the APP setting in single_squad_fp16.sh to match your own paths
# Multi-card
./bert_squad4.sh        # FP32: edit the APP setting in single_squad4.sh to match your own paths
./bert_squad4_fp16.sh   # FP16: edit the APP setting in single_squad4_fp16.sh to match your own paths
```
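After the SQuAD training above completes, predictions on the dev set can be produced with the prediction flags listed in the parameter description. A sketch only; the checkpoint path is a placeholder and the exact output files depend on run_squad_v1.py:
```
python3 run_squad_v1.py \
--predict_file squad/v1.1/dev-v1.1.json \
--init_checkpoint <path-to-finetuned-checkpoint> \
--vocab_file vocab.txt \
--config_file bert_config.json \
--bert_model=bert-large-uncased \
--output_dir SQuAD \
--do_predict \
--predict_batch_size 1
```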
## Fine-tuning (SQuAD)
### Single card
```
python3 run_squad_v1.py \
--train_file squad/v1.1/train-v1.1.json \
--init_checkpoint model.ckpt-28252.pt \
--vocab_file vocab.txt \
--output_dir SQuAD \
--config_file bert_config.json \
--bert_model=bert-large-uncased \
--do_train \
--train_batch_size 1 \
--gpus_per_node 1
```
### Multi-card
hostfile:
```
node1 slots=4
node2 slots=4
```
```
# scripts/run_squad_1.sh assumes four cards per node by default
bash run_squad_1.sh
```
## 4. PHRASE test
### 1. Parameter description
```
--input_dir                    input data directory
--output_dir                   output directory
--config_file                  model configuration file
--bert_model                   BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--train_batch_size             training batch size
--max_seq_length=128           maximum sequence length (must match the training data)
--max_predictions_per_seq      maximum total number of masked tokens per input sequence
--max_steps                    maximum number of training steps
--warmup_proportion            proportion of training used for linear learning-rate warmup
--num_steps_per_checkpoint     how many steps between checkpoint saves
--learning_rate                learning rate
--seed                         random seed
--gradient_accumulation_steps  number of update steps to accumulate before a backward/update pass
--allreduce_post_accumulation  perform the allreduce only after the gradient-accumulation steps
--do_train                     run training
--fp16                         mixed-precision training
--amp                          mixed-precision training
--json-summary                 output JSON summary file
```
### 2. PHRASE1
```
# Single card
./bert_pre1.sh          # FP32: edit the APP setting in single_pre1_1.sh to match your own paths
./bert_pre1_fp16.sh     # FP16: edit the APP setting in single_pre1_1_fp16.sh to match your own paths
# Multi-card
./bert_pre1_4.sh        # FP32: edit the APP setting in single_pre1_4.sh to match your own paths
./bert_pre1_4_fp16.sh   # FP16: edit the APP setting in single_pre1_4_fp16.sh to match your own paths
```
### 3. PHRASE2
```
# Single card
./bert_pre2.sh          # FP32: edit the APP setting in single_pre2_1.sh to match your own paths
./bert_pre2_fp16.sh     # FP16: edit the APP setting in single_pre2_1_fp16.sh to match your own paths
# Multi-card
./bert_pre2_4.sh        # FP32: edit the APP setting in single_pre2_4.sh to match your own paths
./bert_pre2_4_fp16.sh   # FP16: edit the APP setting in single_pre2_4_fp16.sh to match your own paths
```
# References
[https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch](https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch)
[https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT)
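The launcher scripts referenced in the SQuAD and PHRASE sections above are thin wrappers: each bert_*.sh simply starts the matching single_*.sh payload under mpirun.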
#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1_fp16.sh

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4.sh

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4_fp16.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1_fp16.sh

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4.sh

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4_fp16.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad_fp16.sh

#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4.sh

#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4_fp16.sh
#!/bin/bash
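# Per-rank payload for phase-1 pre-training in FP32 (presumably single_pre1_1.sh):
# binds this MPI local rank to one DCU, one InfiniBand HCA and one NUMA node, then runs run_pretraining_v1.py.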
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The section below was modified
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
--config_file=./bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--gpus_per_node 1 \
--do_train \
--json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json
"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
#echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
#GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP}
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
#!/bin/bash
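# Per-rank payload for phase-1 pre-training with mixed precision (--fp16/--amp), presumably single_pre1_1_fp16.sh;
# same per-rank DCU/HCA/NUMA binding as the FP32 variant.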
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The section below was modified
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
--config_file=./bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20 \
--learning_rate=4.0e-4 \
--seed=12439 \
--fp16 \
--amp \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--gpus_per_node 1 \
--do_train \
--json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json
"
#--fp16 \
# --amp \
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
#echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
#GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP}
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
#!/bin/bash
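# 4-rank phase-1 pre-training payload (presumably single_pre1_4.sh): passes the MPI rank as --local_rank
# and rendezvouses over tcp://localhost:34567 via run_pretraining_v4.py.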
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The section below was modified
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v4.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
--config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--use_env \
--local_rank ${comm_rank} \
--world_size 4 \
--gpus_per_node 1 \
--dist_url tcp://localhost:34567 \
--json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json
"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP}
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
#!/bin/bash
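# Same 4-rank phase-1 payload with mixed precision enabled (--fp16/--amp), presumably single_pre1_4_fp16.sh.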
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The section below was modified
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v4.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
--config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--fp16 \
--amp \
--use_env \
--local_rank ${comm_rank} \
--world_size 4 \
--gpus_per_node 1 \
--dist_url tcp://localhost:34567 \
--json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json
"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
#echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
#GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP}
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac