dcuai / dlexamples · Commits

Commit bedf3c0c authored Sep 16, 2022 by hepj
Update the README, add training scripts, and improve the model-conversion code.
parent 49afe744
Showing 20 changed files with 2627 additions and 142 deletions
PyTorch/NLP/BERT/README.md (+119 −142)
PyTorch/NLP/BERT/README_old.md (+181 −0)
PyTorch/NLP/BERT/bert_per1_4_fp16.sh (+4 −0)
PyTorch/NLP/BERT/bert_pre1.sh (+2 −0)
PyTorch/NLP/BERT/bert_pre1_4.sh (+4 −0)
PyTorch/NLP/BERT/bert_pre1_fp16.sh (+3 −0)
PyTorch/NLP/BERT/bert_pre2.sh (+2 −0)
PyTorch/NLP/BERT/bert_pre2_4.sh (+4 −0)
PyTorch/NLP/BERT/bert_pre2_4_fp16.sh (+4 −0)
PyTorch/NLP/BERT/bert_pre2_fp16.sh (+2 −0)
PyTorch/NLP/BERT/bert_squad.sh (+5 −0)
PyTorch/NLP/BERT/bert_squad4.sh (+9 −0)
PyTorch/NLP/BERT/bert_squad4_fp16.sh (+9 −0)
PyTorch/NLP/BERT/bert_squad_fp16.sh (+5 −0)
PyTorch/NLP/BERT/run_pretraining_v4.py (+709 −0)
PyTorch/NLP/BERT/run_squad_v4.py (+1242 −0)
PyTorch/NLP/BERT/single_pre1_1.sh (+87 −0)
PyTorch/NLP/BERT/single_pre1_1_fp16.sh (+91 −0)
PyTorch/NLP/BERT/single_pre1_4.sh (+70 −0)
PyTorch/NLP/BERT/single_pre1_4_fp16.sh (+75 −0)
PyTorch/NLP/BERT/README.md
# Introduction
Runs the BERT network with the PyTorch framework.
* BERT training comes in two flavors, pre-training and fine-tuning; pre-training itself runs in two phases.
* BERT inference accuracy can be validated against different datasets.
* For details on data generation and model conversion, see [README.md](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/develop/PyTorch/NLP/BERT/scripts/README.md).

# Running the examples
# **BERT compute benchmark**
Code examples are currently provided for the two pre-training phases on the English wiki dataset and for fine-tuning on the SQuAD dataset.
## 1. Dataset preparation
### pre-train phase 1
Pre-training data: the most recent dump is wiki20220401, but the archive is nearly 20 GB compressed and about 300 GB unpacked, so it is slow to download and takes a lot of disk space. enwiki-20220401-pages-articles-multistream.xml.bz2 can be downloaded from:
https://dumps.wikimedia.org/enwiki/20220401/
Here we use the wiki dataset already downloaded and preprocessed on the servers; the pre-training data is split into PHRASE1 and PHRASE2.

Kunshan wiki dataset paths (PHRASE1, PHRASE2):
PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
PATH_PHRASE2=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

Wuzhen wiki dataset paths (PHRASE1, PHRASE2):
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
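If the server copies are unavailable and you need to fetch the dump yourself, a minimal sketch (assuming `wget` and `bunzip2` are installed and roughly 320 GB of disk is free):
```
# Sketch only: download the 20220401 English wiki dump and unpack it.
# The unpacked XML is roughly 300 GB; check disk space first.
wget https://dumps.wikimedia.org/enwiki/20220401/enwiki-20220401-pages-articles-multistream.xml.bz2
bunzip2 -k enwiki-20220401-pages-articles-multistream.xml.bz2
```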
|Parameter|Meaning|Example|
|:---:|:---:|:---:|
|PATH_PHRASE1|phase-1 training dataset path|/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
|OUTPUT_DIR|output path|/workspace/results|
|PATH_CONFIG|config path|/workspace/bert_large_uncased|
|PATH_PHRASE2|phase-2 training dataset path|/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
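The commands below reference these variables; for example, using the sample values from the table (substitute your own paths):
```
# Sample values from the table above; adjust to your installation.
export PATH_PHRASE1=/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
export PATH_PHRASE2=/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
export OUTPUT_DIR=/workspace/results
export PATH_CONFIG=/workspace/bert_large_uncased
```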
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints1 \
--config_file=${PATH_CONFIG}/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
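`--json-summary` makes dllogger write step metrics to dllogger.json; a quick way to watch progress while training runs (a sketch, assuming the file is appended as JSON lines):
```
# Follow the dllogger summary as training appends to it.
tail -f dllogger.json
```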
### Multi-card
* Option 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints \
--config_file=${PATH_CONFIG}/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
* Option 2
hostfile:
```
node1 slots=4
node2 slots=4
```
Then run (scripts/run_pretrain.sh assumes four cards per node):
cd scripts; bash run_pretrain.sh
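With the two-node hostfile above (four slots each), run_pretrain.sh launches one rank per card. An equivalent manual launch might look like the following sketch; the wrapper name and flag style follow the bert_*.sh scripts added in this commit:
```
# Sketch: 8 ranks across node1 and node2, one rank per card.
mpirun --allow-run-as-root --hostfile hostfile -np 8 single_pre1_4.sh
```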
SQuAD 1.1 question-answering data:
[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
## 2. Test environment
Note: the dtk, python, torch, and apex versions must all match one another.
```
1. Create and activate a python virtual environment
virtualenv --python=~/package/Python-3.6.8/build/bin/python3 venv_dtk21.10.1_torch1.10
source venv_dtk21.10.1_torch1.10/bin/activate

2. Install the dependencies
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install torch-1.10.0a0+gitcc7c9c7-cp36-cp36m-linux_x86_64.whl
pip install torchvision-0.10.0a0+300a8a4-cp36-cp36m-linux_x86_64.whl
pip install apex-0.1-cp36-cp36m-linux_x86_64.whl

3. Set the environment variables
module rm compiler/rocm/2.9
export ROCM_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1
export HIP_PATH=${ROCM_PATH}/hip
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PATH
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"

# MPI rank variables used by the single_*.sh launch scripts
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
```
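Because the dtk/torch/apex combination is version-sensitive, it is worth sanity-checking the install before launching anything (a sketch; the exact version strings depend on your wheels):
```
# Confirm the torch and apex wheels import cleanly against this dtk/ROCm.
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "from apex import amp; print('apex OK')"
```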
## 3. SQuAD test
### 1. Model conversion
```
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```
Model conversion still has an open issue, possibly because the downloaded TF model differs from model.ckpt-28252, or because of torch/apex version incompatibilities; this is under investigation. In the meantime, you can use the already-converted model below for SQuAD fine-tuning. (The PHRASE tests are unaffected: pre-training needs only the training data and the network definition and loads no checkpoint.)
[converted model, extraction code: vs8d](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)
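Before fine-tuning, you can at least confirm that the converted (or downloaded) checkpoint deserializes; a sketch (the key layout inside the file depends on the converter):
```
# Quick integrity check on the converted checkpoint.
python3 - <<'EOF'
import torch
ckpt = torch.load("model.ckpt-28252.pt", map_location="cpu")
print(type(ckpt))
EOF
```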
### 2. Parameters
```
--train_file            training data
--predict_file          prediction file
--init_checkpoint       initial model checkpoint
--vocab_file            vocabulary file
--output_dir            output directory
--config_file           model configuration file
--json-summary          output json file
--bert_model            BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--do_train              run training
--do_predict            run prediction
--train_batch_size      training batch size
--predict_batch_size    prediction batch size
--gpus_per_node         number of GPUs per node
--local_rank            local_rank for GPU-based distributed training (set to -1 for single card)
--fp16                  mixed-precision training
--amp                   mixed-precision training
```
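For accuracy validation after fine-tuning, the same script can run prediction on dev-v1.1. A hedged example combining the flags above (all paths are illustrative placeholders):
```
# Sketch: run prediction with a fine-tuned checkpoint. Paths are placeholders.
python3 run_squad_v1.py \
  --do_predict \
  --predict_file squad/v1.1/dev-v1.1.json \
  --init_checkpoint SQuAD/your_finetuned_checkpoint.pt \
  --vocab_file vocab.txt \
  --config_file bert_config.json \
  --bert_model=bert-large-uncased \
  --output_dir SQuAD \
  --predict_batch_size 4
```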
### 3. Run
```
#single card
./bert_squad.sh        #fp32 (edit the APP settings in single_squad.sh to match your paths)
./bert_squad_fp16.sh   #fp16 (edit the APP settings in single_squad_fp16.sh to match your paths)
#multi card
./bert_squad4.sh       #fp32 (edit the APP settings in single_squad4.sh to match your paths)
./bert_squad4_fp16.sh  #fp16 (edit the APP settings in single_squad4_fp16.sh to match your paths)
```
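The predictions can then be scored with the official SQuAD 1.1 evaluation script. This assumes the NVIDIA-style run_squad writes a predictions.json into the output directory, which may differ in this fork:
```
# Sketch: score predictions with the official evaluate script from the SQuAD site.
python3 evaluate-v1.1.py squad/v1.1/dev-v1.1.json SQuAD/predictions.json
```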
## 4. **PHRASE test**
### 1. Parameters
```
--input_dir                    input data directory
--output_dir                   output directory
--config_file                  model configuration file
--bert_model                   BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--train_batch_size             training batch size
--max_seq_length=128           maximum sequence length (must match the training data)
--max_predictions_per_seq      maximum number of masked tokens per input sequence
--max_steps                    maximum number of training steps
--warmup_proportion            fraction of training used for linear learning-rate warmup
--num_steps_per_checkpoint     number of steps between checkpoint saves
--learning_rate                learning rate
--seed                         random seed
--gradient_accumulation_steps  number of update steps to accumulate before a backward/update pass
--allreduce_post_accumulation  whether to do all-reduces during gradient accumulation steps
--do_train                     run training
--fp16                         mixed-precision training
--amp                          mixed-precision training
--json-summary                 output json file
```
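A worked example of how these flags interact: the effective global batch size is train_batch_size × gradient_accumulation_steps × number of ranks. For the phase-1 multi-card settings above:
```
# 16 sequences per card, no accumulation, 4 ranks -> 64 sequences per optimizer step.
TRAIN_BATCH_SIZE=16
GRAD_ACC_STEPS=1
WORLD_SIZE=4
echo $(( TRAIN_BATCH_SIZE * GRAD_ACC_STEPS * WORLD_SIZE ))
```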
### 2. PHRASE1
```
#single card
./bert_pre1.sh         #fp32 (edit the APP settings in single_pre1_1.sh to match your paths)
./bert_pre1_fp16.sh    #fp16 (edit the APP settings in single_pre1_1_fp16.sh to match your paths)
#multi card
./bert_pre1_4.sh       #fp32 (edit the APP settings in single_pre1_4.sh to match your paths)
./bert_pre1_4_fp16.sh  #fp16 (edit the APP settings in single_pre1_4_fp16.sh to match your paths)
```
### 3. PHRASE2
```
#single card
./bert_pre2.sh         #fp32 (edit the APP settings in single_pre2_1.sh to match your paths)
./bert_pre2_fp16.sh    #fp16 (edit the APP settings in single_pre2_1_fp16.sh to match your paths)
#multi card
./bert_pre2_4.sh       #fp32 (edit the APP settings in single_pre2_4.sh to match your paths)
./bert_pre2_4_fp16.sh  #fp16 (edit the APP settings in single_pre2_4_fp16.sh to match your paths)
```
# References
[https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch](https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch)
[https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT)
PyTorch/NLP/BERT/README_old.md (new file)
# Introduction
Runs the BERT network with the PyTorch framework.
* BERT training comes in two flavors, pre-training and fine-tuning; pre-training itself runs in two phases.
* BERT inference accuracy can be validated against different datasets.
* For details on data generation and model conversion, see [README.md](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/develop/PyTorch/NLP/BERT/scripts/README.md).

# Running the examples
Code examples are currently provided for the two pre-training phases on the English wiki dataset and for fine-tuning on the SQuAD dataset.
## pre-train phase 1
|Parameter|Meaning|Example|
|:---:|:---:|:---:|
|PATH_PHRASE1|phase-1 training dataset path|/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
|OUTPUT_DIR|output path|/workspace/results|
|PATH_CONFIG|config path|/workspace/bert_large_uncased|
|PATH_PHRASE2|phase-2 training dataset path|/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints1 \
--config_file=${PATH_CONFIG}/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
### Multi-card
* Option 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints \
--config_file=${PATH_CONFIG}/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
* Option 2
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_pretrain.sh defaults to four cards per node
cd scripts; bash run_pretrain.sh
```
## pre-train phase 2
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
### Multi-card
* Option 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
* Option 2
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_pretrain2.sh defaults to four cards per node
cd scripts; bash run_pretrain2.sh
```
## fine-tune training
### Single card
```
python3 run_squad_v1.py \
--train_file squad/v1.1/train-v1.1.json \
--init_checkpoint model.ckpt-28252.pt \
--vocab_file vocab.txt \
--output_dir SQuAD \
--config_file bert_config.json \
--bert_model=bert-large-uncased \
--do_train \
--train_batch_size 1 \
--gpus_per_node 1
```
### Multi-card
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_squad_1.sh defaults to four cards per node
bash run_squad_1.sh
```
# References
[https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch](https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch)
[https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT)
PyTorch/NLP/BERT/bert_per1_4_fp16.sh (new file)

export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4_fp16.sh
PyTorch/NLP/BERT/bert_pre1.sh (new file)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1.sh
PyTorch/NLP/BERT/bert_pre1_4.sh (new file)

export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4.sh
PyTorch/NLP/BERT/bert_pre1_fp16.sh (new file)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1_fp16.sh
PyTorch/NLP/BERT/bert_pre2.sh (new file)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1.sh
PyTorch/NLP/BERT/bert_pre2_4.sh (new file)

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4.sh
PyTorch/NLP/BERT/bert_pre2_4_fp16.sh (new file)

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4_fp16.sh
PyTorch/NLP/BERT/bert_pre2_fp16.sh (new file)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1_fp16.sh
PyTorch/NLP/BERT/bert_squad.sh (new file)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad.sh
PyTorch/NLP/BERT/bert_squad4.sh (new file)

#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4.sh
PyTorch/NLP/BERT/bert_squad4_fp16.sh (new file)

#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4_fp16.sh
PyTorch/NLP/BERT/bert_squad_fp16.sh (new file)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad_fp16.sh
PyTorch/NLP/BERT/run_pretraining_v4.py (new file; diff collapsed, contents not shown)
PyTorch/NLP/BERT/run_squad_v4.py (new file; diff collapsed, contents not shown)
PyTorch/NLP/BERT/single_pre1_1.sh (new file)

#!/bin/bash
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# local changes start here
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v1.py \
  --input_dir=${PATH_PHRASE1} \
  --output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
  --config_file=./bert_config.json \
  --bert_model=bert-large-uncased \
  --train_batch_size=16 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --max_steps=100000 \
  --warmup_proportion=0.0 \
  --num_steps_per_checkpoint=20000 \
  --learning_rate=4.0e-4 \
  --seed=12439 \
  --gradient_accumulation_steps=1 \
  --allreduce_post_accumulation \
  --gpus_per_node 1 \
  --do_train \
  --json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json"
# Bind each local MPI rank to one DCU, one IB device, and the matching NUMA node.
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
PyTorch/NLP/BERT/single_pre1_1_fp16.sh (new file)

#!/bin/bash
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# local changes start here
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v1.py \
  --input_dir=${PATH_PHRASE1} \
  --output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
  --config_file=./bert_config.json \
  --bert_model=bert-large-uncased \
  --train_batch_size=16 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --max_steps=100000 \
  --warmup_proportion=0.0 \
  --num_steps_per_checkpoint=20 \
  --learning_rate=4.0e-4 \
  --seed=12439 \
  --fp16 \
  --amp \
  --gradient_accumulation_steps=1 \
  --allreduce_post_accumulation \
  --gpus_per_node 1 \
  --do_train \
  --json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json"
#--fp16 \
# --amp \
# Bind each local MPI rank to one DCU, one IB device, and the matching NUMA node.
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
PyTorch/NLP/BERT/single_pre1_4.sh (new file)

#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# local changes start here
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v4.py \
  --input_dir=${PATH_PHRASE1} \
  --output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
  --config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
  --bert_model=bert-large-uncased \
  --train_batch_size=16 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --max_steps=100000 \
  --warmup_proportion=0.0 \
  --num_steps_per_checkpoint=20000 \
  --learning_rate=4.0e-4 \
  --seed=12439 \
  --gradient_accumulation_steps=1 \
  --allreduce_post_accumulation \
  --do_train \
  --use_env \
  --local_rank ${comm_rank} \
  --world_size 4 \
  --gpus_per_node 1 \
  --dist_url tcp://localhost:34567 \
  --json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json"
# Bind each local MPI rank to one DCU, one IB device, and the matching NUMA node.
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
PyTorch/NLP/BERT/single_pre1_4_fp16.sh (new file)

#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# local changes start here
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v4.py \
  --input_dir=${PATH_PHRASE1} \
  --output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
  --config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
  --bert_model=bert-large-uncased \
  --train_batch_size=16 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --max_steps=100000 \
  --warmup_proportion=0.0 \
  --num_steps_per_checkpoint=20000 \
  --learning_rate=4.0e-4 \
  --seed=12439 \
  --gradient_accumulation_steps=1 \
  --allreduce_post_accumulation \
  --do_train \
  --fp16 \
  --amp \
  --use_env \
  --local_rank ${comm_rank} \
  --world_size 4 \
  --gpus_per_node 1 \
  --dist_url tcp://localhost:34567 \
  --json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json"
# Bind each local MPI rank to one DCU, one IB device, and the matching NUMA node.
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac