Commit 8ddf66c6 authored by sunxx1

Merge branch 'hepj-test' into 'main'

Update the README, add training scripts, and improve the model conversion code

See merge request dcutoolkit/deeplearing/dlexamples_new!38
parents 0200794c bedf3c0c
# Introduction
This example trains the BERT network with the PyTorch framework.
* BERT training comes in two forms, pre-training and fine-tuning; pre-training is itself split into two phases.
* BERT inference accuracy can be validated on different datasets.
* See [README.md](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/develop/PyTorch/NLP/BERT/scripts/README.md) for details on data generation and model conversion.
# Running the Examples: BERT Compute Benchmark
Code examples are currently provided for the two pre-training phases on the English Wikipedia dataset and for fine-tuning on the SQuAD dataset.
## 1. Dataset preparation
The most recent pre-training data is the wiki20220401 dump. The archive is close to 20 GB compressed and roughly 300 GB uncompressed, so the download is slow and decompression needs a lot of disk space. enwiki-20220401-pages-articles-multistream.xml.bz2 can be downloaded from:
https://dumps.wikimedia.org/enwiki/20220401/
This example instead uses the copy of the Wikipedia dataset that is already downloaded and pre-processed on the servers; the pre-training data is split into PHRASE1 (phase 1) and PHRASE2 (phase 2). The cluster paths are listed under Multi-card below.
SQuAD 1.1 question-answering data:
[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
## Pre-training, phase 1
|Parameter|Description|Example|
|:---:|:---:|:---:|
|PATH_PHRASE1|Path to the phase-1 training dataset|/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
|OUTPUT_DIR|Output path|/workspace/results|
|PATH_CONFIG|Path to the config directory|/workspace/bert_large_uncased|
|PATH_PHRASE2|Path to the phase-2 training dataset|/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
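The commands below expect these variables to be set in the shell. A minimal sketch using the illustrative values from the table (adjust to your own layout; note that the commands concatenate `bert_config.json` directly onto `PATH_CONFIG`, so keep the trailing slash):
```
export PATH_PHRASE1=/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
export PATH_PHRASE2=/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
export OUTPUT_DIR=/workspace/results
export PATH_CONFIG=/workspace/bert_large_uncased/
```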
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints1 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
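Training progress is written to the file passed via `--json-summary`; one simple way to follow it while the job runs:
```
tail -f dllogger.json
```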
### Multi-card
* Method 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
On the Kunshan cluster the pre-training data is available at:
```
PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
PATH_PHRASE2=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
```
On the Wuzhen cluster the corresponding paths are:
```
Wuzhen PHRASE1:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
Wuzhen PHRASE2:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
```
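With this hostfile, a multi-node run is typically driven through mpirun. A sketch only, assuming Open MPI and the per-rank payload scripts shipped with this example; the packaged scripts/run_pretrain.sh below wraps the same idea:
```
# 2 nodes x 4 slots = 8 ranks; the payload's rank/world-size arguments must match the total rank count
mpirun --allow-run-as-root --hostfile hostfile -np 8 ./single_pre1_4.sh
```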
```
# scripts/run_pretrain.sh assumes four cards per node by default
cd scripts; bash run_pretrain.sh
```
## 2. Test environment
Note: the dtk, Python, torch and apex versions must all match one another.
```
# 1. Create and activate a Python virtual environment
virtualenv --python=~/package/Python-3.6.8/build/bin/python3 venv_dtk21.10.1_torch1.10
source venv_dtk21.10.1_torch1.10/bin/activate
# 2. Install the dependencies
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install torch-1.10.0a0+gitcc7c9c7-cp36-cp36m-linux_x86_64.whl
pip install torchvision-0.10.0a0+300a8a4-cp36-cp36m-linux_x86_64.whl
pip install apex-0.1-cp36-cp36m-linux_x86_64.whl
# 3. Set the environment variables
module rm compiler/rocm/2.9
export ROCM_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1
export HIP_PATH=${ROCM_PATH}/hip
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PATH
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
# Variables referenced by the mpirun launcher scripts:
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
```
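Once the environment is active, a quick check that PyTorch can see the DCUs (the ROCm/DTK build reports them through the torch.cuda API, so this should print True and the number of visible cards):
```
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```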
## Pre-training, phase 2
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
### Multi-card
* Method 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
```
```
# scripts/run_pretrain2.sh assumes four cards per node by default
cd scripts; bash run_pretrain2.sh
```
## 3. SQuAD test
### 1. Model conversion
```
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```
The model conversion still has problems, possibly because the downloaded TF model differs from model.ckpt-28252, or because of torch/apex version compatibility; this is still being investigated. You can use the already converted model directly for SQuAD fine-tuning. (The PHRASE tests are not affected: PHRASE is pre-training and only needs the training data and the network definition, without loading a checkpoint.)
[Converted model (extraction code: vs8d)](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)
### 2. Parameter description
```
--train_file            training data
--predict_file          prediction file
--init_checkpoint       model checkpoint to load
--vocab_file            vocabulary file
--output_dir            output directory
--config_file           model configuration file
--json-summary          output JSON summary file
--bert_model            BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--do_train              run training
--do_predict            run prediction
--train_batch_size      training batch size
--predict_batch_size    prediction batch size
--gpus_per_node         number of cards used per node
--local_rank            local_rank for distributed training (set to -1 for a single card)
--fp16                  mixed-precision training
--amp                   mixed-precision training
```
### 3. Run
```
# Single card
./bert_squad.sh         # FP32: edit the APP setting in single_squad.sh to match your own paths
./bert_squad_fp16.sh    # FP16: edit the APP setting in single_squad_fp16.sh to match your own paths
# Multi-card
./bert_squad4.sh        # FP32: edit the APP setting in single_squad4.sh to match your own paths
./bert_squad4_fp16.sh   # FP16: edit the APP setting in single_squad4_fp16.sh to match your own paths
```
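After the SQuAD training above completes, predictions on the dev set can be produced with the prediction flags listed in the parameter description. A sketch only; the checkpoint path is a placeholder and the exact output files depend on run_squad_v1.py:
```
python3 run_squad_v1.py \
--predict_file squad/v1.1/dev-v1.1.json \
--init_checkpoint <path-to-finetuned-checkpoint> \
--vocab_file vocab.txt \
--config_file bert_config.json \
--bert_model=bert-large-uncased \
--output_dir SQuAD \
--do_predict \
--predict_batch_size 1
```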
## Fine-tuning (SQuAD)
### Single card
```
python3 run_squad_v1.py \
--train_file squad/v1.1/train-v1.1.json \
--init_checkpoint model.ckpt-28252.pt \
--vocab_file vocab.txt \
--output_dir SQuAD \
--config_file bert_config.json \
--bert_model=bert-large-uncased \
--do_train \
--train_batch_size 1 \
--gpus_per_node 1
```
### Multi-card
hostfile:
```
node1 slots=4
node2 slots=4
```
```
# scripts/run_squad_1.sh assumes four cards per node by default
bash run_squad_1.sh
```
## 4. PHRASE test
### 1. Parameter description
```
--input_dir                    input data directory
--output_dir                   output directory
--config_file                  model configuration file
--bert_model                   BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--train_batch_size             training batch size
--max_seq_length=128           maximum sequence length (must match the training data)
--max_predictions_per_seq      maximum total number of masked tokens per input sequence
--max_steps                    maximum number of training steps
--warmup_proportion            proportion of training used for linear learning-rate warmup
--num_steps_per_checkpoint     how many steps between checkpoint saves
--learning_rate                learning rate
--seed                         random seed
--gradient_accumulation_steps  number of update steps to accumulate before a backward/update pass
--allreduce_post_accumulation  perform the allreduce only after the gradient-accumulation steps
--do_train                     run training
--fp16                         mixed-precision training
--amp                          mixed-precision training
--json-summary                 output JSON summary file
```
### 2. PHRASE1
```
# Single card
./bert_pre1.sh          # FP32: edit the APP setting in single_pre1_1.sh to match your own paths
./bert_pre1_fp16.sh     # FP16: edit the APP setting in single_pre1_1_fp16.sh to match your own paths
# Multi-card
./bert_pre1_4.sh        # FP32: edit the APP setting in single_pre1_4.sh to match your own paths
./bert_pre1_4_fp16.sh   # FP16: edit the APP setting in single_pre1_4_fp16.sh to match your own paths
```
### 3. PHRASE2
```
# Single card
./bert_pre2.sh          # FP32: edit the APP setting in single_pre2_1.sh to match your own paths
./bert_pre2_fp16.sh     # FP16: edit the APP setting in single_pre2_1_fp16.sh to match your own paths
# Multi-card
./bert_pre2_4.sh        # FP32: edit the APP setting in single_pre2_4.sh to match your own paths
./bert_pre2_4_fp16.sh   # FP16: edit the APP setting in single_pre2_4_fp16.sh to match your own paths
```
# References
[https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch](https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch)
[https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT)
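The launcher scripts referenced in the SQuAD and PHRASE sections above are thin wrappers: each bert_*.sh simply starts the matching single_*.sh payload under mpirun.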
#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1_fp16.sh

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4.sh

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4_fp16.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1_fp16.sh

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4.sh

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4_fp16.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad.sh

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad_fp16.sh

#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4.sh

#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4_fp16.sh
#!/bin/bash
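# Per-rank payload for phase-1 pre-training in FP32 (presumably single_pre1_1.sh):
# binds this MPI local rank to one DCU, one InfiniBand HCA and one NUMA node, then runs run_pretraining_v1.py.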
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The section below was modified
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
--config_file=./bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--gpus_per_node 1 \
--do_train \
--json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json
"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
#echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
#GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP}
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
#!/bin/bash
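# Per-rank payload for phase-1 pre-training with mixed precision (--fp16/--amp), presumably single_pre1_1_fp16.sh;
# same per-rank DCU/HCA/NUMA binding as the FP32 variant.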
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The section below was modified
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
--config_file=./bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20 \
--learning_rate=4.0e-4 \
--seed=12439 \
--fp16 \
--amp \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--gpus_per_node 1 \
--do_train \
--json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json
"
#--fp16 \
# --amp \
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
#echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
#GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP}
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
#!/bin/bash
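# 4-rank phase-1 pre-training payload (presumably single_pre1_4.sh): passes the MPI rank as --local_rank
# and rendezvouses over tcp://localhost:34567 via run_pretraining_v4.py.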
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The section below was modified
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v4.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
--config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--use_env \
--local_rank ${comm_rank} \
--world_size 4 \
--gpus_per_node 1 \
--dist_url tcp://localhost:34567 \
--json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json
"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP}
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
#!/bin/bash
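# Same 4-rank phase-1 payload with mixed precision enabled (--fp16/--amp), presumably single_pre1_4_fp16.sh.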
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The section below was modified
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v4.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
--config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--fp16 \
--amp \
--use_env \
--local_rank ${comm_rank} \
--world_size 4 \
--gpus_per_node 1 \
--dist_url tcp://localhost:34567 \
--json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json
"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo numactl --cpunodebind=0 --membind=0 ${APP}
numactl --cpunodebind=0 --membind=0 ${APP}
#echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
#GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo numactl --cpunodebind=1 --membind=1 ${APP}
numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo numactl --cpunodebind=2 --membind=2 ${APP}
numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo numactl --cpunodebind=3 --membind=3 ${APP}
numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac