dcuai / dlexamples / Commits / bedf3c0c

Commit bedf3c0c, authored Sep 16, 2022 by hepj
Update the README, add training scripts, and improve the model conversion code
Parent: 49afe744

Showing 20 changed files with 2627 additions and 142 deletions (+2627, -142)
- PyTorch/NLP/BERT/README.md (+119, -142)
- PyTorch/NLP/BERT/README_old.md (+181, -0)
- PyTorch/NLP/BERT/bert_per1_4_fp16.sh (+4, -0)
- PyTorch/NLP/BERT/bert_pre1.sh (+2, -0)
- PyTorch/NLP/BERT/bert_pre1_4.sh (+4, -0)
- PyTorch/NLP/BERT/bert_pre1_fp16.sh (+3, -0)
- PyTorch/NLP/BERT/bert_pre2.sh (+2, -0)
- PyTorch/NLP/BERT/bert_pre2_4.sh (+4, -0)
- PyTorch/NLP/BERT/bert_pre2_4_fp16.sh (+4, -0)
- PyTorch/NLP/BERT/bert_pre2_fp16.sh (+2, -0)
- PyTorch/NLP/BERT/bert_squad.sh (+5, -0)
- PyTorch/NLP/BERT/bert_squad4.sh (+9, -0)
- PyTorch/NLP/BERT/bert_squad4_fp16.sh (+9, -0)
- PyTorch/NLP/BERT/bert_squad_fp16.sh (+5, -0)
- PyTorch/NLP/BERT/run_pretraining_v4.py (+709, -0)
- PyTorch/NLP/BERT/run_squad_v4.py (+1242, -0)
- PyTorch/NLP/BERT/single_pre1_1.sh (+87, -0)
- PyTorch/NLP/BERT/single_pre1_1_fp16.sh (+91, -0)
- PyTorch/NLP/BERT/single_pre1_4.sh (+70, -0)
- PyTorch/NLP/BERT/single_pre1_4_fp16.sh (+75, -0)
PyTorch/NLP/BERT/README.md
# Introduction
Runs the BERT network with the PyTorch framework.
* BERT training comes in two forms, pre-training and fine-tuning; pre-training is split into two phases.
* BERT inference accuracy can be validated on different datasets.
* For details on data generation and model conversion, see [README.md](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/develop/PyTorch/NLP/BERT/scripts/README.md)
# Running the examples
# **BERT compute benchmark**
Code examples are currently provided for the two pre-training phases on the English Wikipedia dataset and for fine-tuning on the SQuAD dataset.
## 1. Dataset preparation
The latest pre-training data is the wiki 20220401 dump, but the archive is close to 20 GB compressed and roughly 300 GB after extraction, so the download is slow and extraction takes a lot of disk space. enwiki-20220401-pages-articles-multistream.xml.bz2 can be downloaded from:

https://dumps.wikimedia.org/enwiki/20220401/

Here we use the wiki dataset that has already been downloaded and pre-processed on the servers; the pre-training data is split into PHRASE1 and PHRASE2.

Kunshan wiki dataset path, PHRASE1:
PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

Kunshan wiki dataset path, PHRASE2:
PATH_PHRASE2=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

Wuzhen wiki path, PHRASE1:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en

Wuzhen wiki path, PHRASE2:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en

SQuAD 1.1 question-answering data:
[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
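If you do need to prepare the raw data from scratch rather than using the pre-processed copies on the Kunshan/Wuzhen clusters, the files referenced above could be fetched roughly as follows (a sketch, assuming wget is available; target directories are placeholders of your own choosing):
```
mkdir -p data/wikipedia data/squad/v1.1
# Wikipedia dump (~20 GB compressed, ~300 GB extracted)
wget -c https://dumps.wikimedia.org/enwiki/20220401/enwiki-20220401-pages-articles-multistream.xml.bz2 -P data/wikipedia
# SQuAD 1.1 train/dev sets
wget -c https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -P data/squad/v1.1
wget -c https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P data/squad/v1.1
```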
## 2. Test environment
Note that the dtk, python, torch and apex versions must be matched to each other.
```
1. Create and activate a python virtual environment
virtualenv --python=~/package/Python-3.6.8/build/bin/python3 venv_dtk21.10.1_torch1.10
source venv_dtk21.10.1_torch1.10/bin/activate

2. Install the dependencies
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install torch-1.10.0a0+gitcc7c9c7-cp36-cp36m-linux_x86_64.whl
pip install torchvision-0.10.0a0+300a8a4-cp36-cp36m-linux_x86_64.whl
pip install apex-0.1-cp36-cp36m-linux_x86_64.whl

3. Set the environment variables
module rm compiler/rocm/2.9
export ROCM_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1
export HIP_PATH=${ROCM_PATH}/hip
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PATH
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
```
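Before launching any jobs it can help to confirm that the environment actually sees the accelerators. A minimal sanity check (a sketch, assuming the wheels above installed cleanly into the active virtual environment) is:
```
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python3 -c "from apex import amp; print('apex ok')"
```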
## 3. SQuAD test
### 1. Model conversion
```
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```
Model conversion still has problems at the moment. The cause may be that the downloaded TF model differs from model.ckpt-28252, or a torch/apex version compatibility issue; this is still being investigated. You can use an already converted model directly for SQuAD fine-tuning. The PHRASE tests are not affected: PHRASE is pre-training, which only needs the training data and the network structure and does not load a model.

[Converted model, extraction code: vs8d](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)
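If you download the converted checkpoint, a quick check that the .pt file deserializes at all (a sketch; the exact key layout is an assumption based on how run_pretraining_v4.py and run_squad_v4.py read checkpoints) is:
```
python3 -c "import torch; ckpt = torch.load('model.ckpt-28252.pt', map_location='cpu'); print(type(ckpt), list(ckpt)[:5])"
```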
### 2. Parameters
```
--train_file          training data
--predict_file        prediction file
--init_checkpoint     model checkpoint
--vocab_file          vocabulary file
--output_dir          output directory
--config_file         model configuration file
--json-summary        output json file
--bert_model          BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--do_train            whether to train
--do_predict          whether to predict
--train_batch_size    training batch size
--predict_batch_size  prediction batch size
--gpus_per_node       number of GPUs per node
--local_rank          local_rank for GPU-based distributed training (set to -1 for a single card)
--fp16                mixed-precision training
--amp                 mixed-precision training
```
A sample fine-tuning command assembled from these flags is sketched after the list.
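For orientation only, a single-card FP32 fine-tuning command built from the flags above might look as follows (a sketch with placeholder paths and batch sizes; the bert_squad*.sh wrappers below remain the supported entry point):
```
python3 run_squad_v4.py \
    --train_file squad/v1.1/train-v1.1.json \
    --predict_file squad/v1.1/dev-v1.1.json \
    --init_checkpoint model.ckpt-28252.pt \
    --vocab_file vocab.txt \
    --config_file bert_config.json \
    --output_dir SQuAD \
    --bert_model bert-large-uncased \
    --do_train \
    --do_predict \
    --train_batch_size 4 \
    --predict_batch_size 4 \
    --gpus_per_node 1 \
    --json-summary dllogger.json
```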
### 3. Running
```
# single card
./bert_squad.sh        # FP32 (adjust the APP setting in single_squad.sh to your own paths)
./bert_squad_fp16.sh   # FP16 (adjust the APP setting in single_squad_fp16.sh to your own paths)
# multi card
./bert_squad4.sh       # FP32 (adjust the APP setting in single_squad4.sh to your own paths)
./bert_squad4_fp16.sh  # FP16 (adjust the APP setting in single_squad4_fp16.sh to your own paths)
```
## 4. **PHRASE test**
### 1. Parameters
```
--input_dir                    input data directory
--output_dir                   output directory
--config_file                  model configuration file
--bert_model                   BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--train_batch_size             training batch size
--max_seq_length=128           maximum sequence length (must match the training data)
--max_predictions_per_seq      maximum total number of masked tokens in an input sequence
--max_steps                    maximum number of training steps
--warmup_proportion            proportion of training used for linear learning-rate warmup
--num_steps_per_checkpoint     number of steps between model checkpoints
--learning_rate                learning rate
--seed                         random seed
--gradient_accumulation_steps  number of update steps to accumulate before performing a backward/update pass
--allreduce_post_accumulation  whether to perform the allreduce during gradient accumulation steps
--do_train                     whether to train
--fp16                         mixed-precision training
--amp                          mixed-precision training
--json-summary                 output json file
```
A single-card command assembled from these flags is sketched after the list.
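As a reference point, the flags above can be combined into a single-card phrase1 command along these lines (a sketch modelled on the older run_pretraining_v1.py example; step counts and batch size are placeholders, and note that run_pretraining_v4.py pins HIP_VISIBLE_DEVICES to 0,1,2,3 internally):
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v4.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=${OUTPUT_DIR}/checkpoints1 \
    --config_file=${PATH_CONFIG}bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=1000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=200 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --do_train \
    --json-summary dllogger.json
```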
### 2. PHRASE1
```
# single card
./bert_pre1.sh        # FP32 (adjust the APP setting in single_pre1_1.sh to your own paths)
./bert_pre1_fp16.sh   # FP16 (adjust the APP setting in single_pre1_1_fp16.sh to your own paths)
# multi card
./bert_pre1_4.sh      # FP32 (adjust the APP setting in single_pre1_4.sh to your own paths)
./bert_pre1_4_fp16.sh # FP16 (adjust the APP setting in single_pre1_4_fp16.sh to your own paths)
```
### 3. PHRASE2
```
# single card
./bert_pre2.sh        # FP32 (adjust the APP setting in single_pre2_1.sh to your own paths)
./bert_pre2_fp16.sh   # FP16 (adjust the APP setting in single_pre2_1_fp16.sh to your own paths)
# multi card
./bert_pre2_4.sh      # FP32 (adjust the APP setting in single_pre2_4.sh to your own paths)
./bert_pre2_4_fp16.sh # FP16 (adjust the APP setting in single_pre2_4_fp16.sh to your own paths)
```
# References
[https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch](https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch)
[https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT)
PyTorch/NLP/BERT/README_old.md (new file, mode 100644)
# Introduction
Runs the BERT network with the PyTorch framework.
* BERT training comes in two forms, pre-training and fine-tuning; pre-training is split into two phases.
* BERT inference accuracy can be validated on different datasets.
* For details on data generation and model conversion, see [README.md](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/develop/PyTorch/NLP/BERT/scripts/README.md)
# Running the examples
Code examples are currently provided for the two pre-training phases on the English Wikipedia dataset and for fine-tuning on the SQuAD dataset.
## pre-train phrase1
|Parameter|Description|Example|
|:---:|:---:|:---:|
|PATH_PHRASE1|phase-1 training dataset path|/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.<br>15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
|OUTPUT_DIR|output path|/workspace/results|
|PATH_CONFIG|config path|/workspace/bert_large_uncased|
|PATH_PHRASE2|phase-2 training dataset path|/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.<br>15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints1 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
### Multi-card
* Method 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_pretrain.sh defaults to four cards per node
cd scripts; bash run_pretrain.sh
```
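With a hostfile like the one above, a two-node launch would typically hand 8 ranks to mpirun. The following is only a sketch of a standard Open MPI invocation of a per-rank worker script (the actual command used in this repository lives inside scripts/run_pretrain.sh):
```
mpirun --allow-run-as-root --hostfile hostfile -np 8 bash single_pre1_4.sh
```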
## pre-train phrase2
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
### Multi-card
* Method 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_pretrain2.sh defaults to four cards per node
cd scripts; bash run_pretrain2.sh
```
## fine-tune training
### Single card
```
python3 run_squad_v1.py \
--train_file squad/v1.1/train-v1.1.json \
--init_checkpoint model.ckpt-28252.pt \
--vocab_file vocab.txt \
--output_dir SQuAD \
--config_file bert_config.json \
--bert_model=bert-large-uncased \
--do_train \
--train_batch_size 1 \
--gpus_per_node 1
```
### Multi-card
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_squad_1.sh defaults to four cards per node
bash run_squad_1.sh
```
# References
[https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch](https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch)
[https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT)
PyTorch/NLP/BERT/bert_per1_4_fp16.sh (new file, mode 100644)

export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4_fp16.sh
PyTorch/NLP/BERT/bert_pre1.sh (new file, mode 100644)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1.sh
PyTorch/NLP/BERT/bert_pre1_4.sh (new file, mode 100644)

export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4.sh
PyTorch/NLP/BERT/bert_pre1_fp16.sh (new file, mode 100644)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1_fp16.sh
PyTorch/NLP/BERT/bert_pre2.sh (new file, mode 100644)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1.sh
PyTorch/NLP/BERT/bert_pre2_4.sh (new file, mode 100644)

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4.sh
PyTorch/NLP/BERT/bert_pre2_4_fp16.sh (new file, mode 100644)

#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4_fp16.sh
PyTorch/NLP/BERT/bert_pre2_fp16.sh (new file, mode 100644)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1_fp16.sh
PyTorch/NLP/BERT/bert_squad.sh (new file, mode 100644)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad.sh
PyTorch/NLP/BERT/bert_squad4.sh (new file, mode 100644)

#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4.sh
PyTorch/NLP/BERT/bert_squad4_fp16.sh (new file, mode 100644)

#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4_fp16.sh
PyTorch/NLP/BERT/bert_squad_fp16.sh (new file, mode 100644)

#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad_fp16.sh
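The single_*.sh worker scripts that these wrappers launch are not shown in this commit view. Purely as an illustration of the pattern (a hypothetical worker, not the repository's actual script), each MPI rank would typically read its rank from the Open MPI environment variables mentioned in the README and pass them to the training script:
```
#!/bin/bash
# Hypothetical per-rank worker: derive rank information from Open MPI and run training.
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# run_pretraining_v4.py selects the device as local_rank % 4 and sets HIP_VISIBLE_DEVICES itself.
python3 run_pretraining_v4.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=${OUTPUT_DIR}/checkpoints1 \
    --config_file=${PATH_CONFIG}bert_config.json \
    --bert_model=bert-large-uncased \
    --do_train \
    --local_rank=$comm_rank \
    --world_size=$comm_size \
    --gpus_per_node=4 \
    --dist_url=tcp://127.0.0.1:23456 \
    --json-summary dllogger_rank${comm_rank}.json
```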
PyTorch/NLP/BERT/run_pretraining_v4.py (new file, mode 100644)
# coding=utf-8
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Copyright 2018 The Google AI Language Team Authors and The HugginFace Inc. team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

# ==================
import csv
import os
import time
import argparse
import random
import h5py
from tqdm import tqdm, trange
import os
import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Dataset
from torch.utils.data.distributed import DistributedSampler
import math
from apex import amp
import multiprocessing

from tokenization import BertTokenizer
import modeling
from apex.optimizers import FusedLAMB
from schedulers import PolyWarmUpScheduler

from file_utils import PYTORCH_PRETRAINED_BERT_CACHE
from utils import is_main_process, format_step, get_world_size, get_rank
from apex.parallel import DistributedDataParallel as DDP
from schedulers import LinearWarmUpScheduler
from apex.parallel.distributed import flat_dist_call
import amp_C
import apex_C
from apex.amp import _amp_state

import dllogger
from concurrent.futures import ProcessPoolExecutor

os.environ["HIP_VISIBLE_DEVICES"] = "0,1,2,3"

torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)

skipped_steps = 0

# Track whether a SIGTERM (cluster time up) has been handled
timeout_sent = False

import signal


# handle SIGTERM sent from the scheduler and mark so we
# can gracefully save & exit
def signal_handler(sig, frame):
    global timeout_sent
    timeout_sent = True

signal.signal(signal.SIGTERM, signal_handler)


# Workaround because python functions are not picklable
class WorkerInitObj(object):
    def __init__(self, seed):
        self.seed = seed

    def __call__(self, id):
        np.random.seed(seed=self.seed + id)
        random.seed(self.seed + id)


def create_pretraining_dataset(input_file, max_pred_length, shared_list, args, worker_init):
    train_data = pretraining_dataset(input_file=input_file, max_pred_length=max_pred_length)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler,
                                  batch_size=args.train_batch_size * args.n_gpu,
                                  num_workers=1, worker_init_fn=worker_init,
                                  pin_memory=True)
    return train_dataloader, input_file
class pretraining_dataset(Dataset):

    def __init__(self, input_file, max_pred_length):
        self.input_file = input_file
        self.max_pred_length = max_pred_length
        f = h5py.File(input_file, "r")
        keys = ['input_ids', 'input_mask', 'segment_ids', 'masked_lm_positions',
                'masked_lm_ids', 'next_sentence_labels']
        self.inputs = [np.asarray(f[key][:]) for key in keys]
        f.close()

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.inputs[0])

    def __getitem__(self, index):
        [input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, next_sentence_labels] = [
            torch.from_numpy(input[index].astype(np.int64)) if indice < 5 else torch.from_numpy(
                np.asarray(input[index].astype(np.int64))) for indice, input in enumerate(self.inputs)]

        masked_lm_labels = torch.ones(input_ids.shape, dtype=torch.long) * -1
        index = self.max_pred_length
        # store number of masked tokens in index
        padded_mask_indices = (masked_lm_positions == 0).nonzero()
        if len(padded_mask_indices) != 0:
            index = padded_mask_indices[0].item()
        masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index]

        return [input_ids, segment_ids, input_mask, masked_lm_labels, next_sentence_labels]
class BertPretrainingCriterion(torch.nn.Module):
    def __init__(self, vocab_size):
        super(BertPretrainingCriterion, self).__init__()
        self.loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-1)
        self.vocab_size = vocab_size

    def forward(self, prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels):
        masked_lm_loss = self.loss_fn(prediction_scores.view(-1, self.vocab_size), masked_lm_labels.view(-1))
        next_sentence_loss = self.loss_fn(seq_relationship_score.view(-1, 2), next_sentence_labels.view(-1))
        total_loss = masked_lm_loss + next_sentence_loss
        return total_loss
def parse_arguments():

    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument("--input_dir", default=None, type=str, required=True,
                        help="The input data dir. Should contain .hdf5 files for the task.")
    parser.add_argument("--config_file", default=None, type=str, required=True,
                        help="The BERT model config")
    parser.add_argument("--bert_model", default="bert-large-uncased", type=str,
                        help="Bert pre-trained model selected in the list: bert-base-uncased, "
                             "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")
    parser.add_argument("--output_dir", default=None, type=str, required=True,
                        help="The output directory where the model checkpoints will be written.")

    ## Other parameters
    parser.add_argument("--init_checkpoint", default=None, type=str,
                        help="The initial checkpoint to start training from.")
    parser.add_argument("--max_seq_length", default=512, type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. \n"
                             "Sequences longer than this will be truncated, and sequences shorter \n"
                             "than this will be padded.")
    parser.add_argument("--max_predictions_per_seq", default=80, type=int,
                        help="The maximum total of masked tokens in input sequence")
    parser.add_argument("--train_batch_size", default=32, type=int,
                        help="Total batch size for training.")
    parser.add_argument("--learning_rate", default=5e-5, type=float,
                        help="The initial learning rate for Adam.")
    parser.add_argument("--num_train_epochs", default=3.0, type=float,
                        help="Total number of training epochs to perform.")
    parser.add_argument("--max_steps", default=1000, type=float,
                        help="Total number of training steps to perform.")
    parser.add_argument("--warmup_proportion", default=0.01, type=float,
                        help="Proportion of training to perform linear learning rate warmup for. "
                             "E.g., 0.1 = 10%% of training.")
    parser.add_argument("--local_rank", type=int, default=os.getenv('LOCAL_RANK', -1),
                        help="local_rank for distributed training on gpus")
    parser.add_argument('--seed', type=int, default=42,
                        help="random seed for initialization")
    parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
                        help="Number of updates steps to accumualte before performing a backward/update pass.")
    parser.add_argument('--fp16', default=False, action='store_true',
                        help="Mixed precision training")
    parser.add_argument('--amp', default=False, action='store_true',
                        help="Mixed precision training")
    parser.add_argument('--loss_scale', type=float, default=0.0,
                        help='Loss scaling, positive power of 2 values can improve fp16 convergence.')
    parser.add_argument('--log_freq', type=float, default=1.0,
                        help='frequency of logging loss.')
    parser.add_argument('--checkpoint_activations', default=False, action='store_true',
                        help="Whether to use gradient checkpointing")
    parser.add_argument("--resume_from_checkpoint", default=False, action='store_true',
                        help="Whether to resume training from checkpoint.")
    parser.add_argument('--resume_step', type=int, default=-1,
                        help="Step to resume training from.")
    parser.add_argument('--num_steps_per_checkpoint', type=int, default=100,
                        help="Number of update steps until a model checkpoint is saved to disk.")
    parser.add_argument('--skip_checkpoint', default=False, action='store_true',
                        help="Whether to save checkpoints")
    parser.add_argument('--phase2', default=False, action='store_true',
                        help="Whether to train with seq len 512")
    parser.add_argument('--allreduce_post_accumulation', default=False, action='store_true',
                        help="Whether to do allreduces during gradient accumulation steps.")
    parser.add_argument('--allreduce_post_accumulation_fp16', default=False, action='store_true',
                        help="Whether to do fp16 allreduce post accumulation.")
    parser.add_argument('--phase1_end_step', type=int, default=7038,
                        help="Number of training steps in Phase1 - seq len 128")
    parser.add_argument('--init_loss_scale', type=int, default=2**20,
                        help="Initial loss scaler value")
    parser.add_argument("--do_train", default=False, action='store_true',
                        help="Whether to run training.")
    parser.add_argument('--json-summary', type=str, default="results/dllogger.json",
                        help='If provided, the json summary will be written to'
                             'the specified file.')
    parser.add_argument("--use_env", action='store_true',
                        help="Whether to read local rank from ENVVAR")
    parser.add_argument('--disable_progress_bar', default=False, action='store_true',
                        help='Disable tqdm progress bar')
    parser.add_argument('--steps_this_run', type=int, default=-1,
                        help='If provided, only run this many steps before exiting')
    parser.add_argument("--dist_url", default='tcp://224.66.41.62:23456', type=str,
                        help='url used to set up distributed training')
    parser.add_argument("--gpus_per_node", type=int, default=4,
                        help='num of gpus per node')
    parser.add_argument("--world_size", type=int, default=1,
                        help="number of process")

    args = parser.parse_args()
    args.fp16 = args.fp16 or args.amp

    if args.steps_this_run < 0:
        args.steps_this_run = args.max_steps

    return args
def setup_training(args):

    assert (torch.cuda.is_available())

    if args.local_rank == -1:
        device = torch.device("cuda")
        args.n_gpu = torch.cuda.device_count()
        args.allreduce_post_accumulation = False
        args.allreduce_post_accumulation_fp16 = False
    else:
        #torch.cuda.set_device(args.local_rank)
        #device = torch.device("cuda", args.local_rank)
        # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        #torch.distributed.init_process_group(backend='nccl', init_method='env://')
        #xuan
        device_n = args.local_rank % 4
        torch.cuda.set_device(device_n)
        device = torch.device("cuda", device_n)
        torch.distributed.init_process_group(backend='nccl', init_method=args.dist_url,
                                             world_size=args.world_size, rank=args.local_rank)
        args.n_gpu = 1

    if args.gradient_accumulation_steps == 1:
        args.allreduce_post_accumulation = False
        args.allreduce_post_accumulation_fp16 = False

    if is_main_process():
        dllogger.init(backends=[dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE,
                                                           filename=args.json_summary),
                                dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE,
                                                       step_format=format_step)])
    else:
        dllogger.init(backends=[])

    print("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
        device, args.n_gpu, bool(args.local_rank != -1), args.fp16))

    if args.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
            args.gradient_accumulation_steps))
    if args.train_batch_size % args.gradient_accumulation_steps != 0:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, batch size {} should be divisible".format(
            args.gradient_accumulation_steps, args.train_batch_size))

    args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

    if not args.do_train:
        raise ValueError(" `do_train` must be True.")

    if not args.resume_from_checkpoint and os.path.exists(args.output_dir) and (
            os.listdir(args.output_dir) and any([i.startswith('ckpt') for i in os.listdir(args.output_dir)])):
        raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))

    if (not args.resume_from_checkpoint or not os.path.exists(args.output_dir)) and is_main_process():
        os.makedirs(args.output_dir, exist_ok=True)

    return device, args
def prepare_model_and_optimizer(args, device):

    # Prepare model
    config = modeling.BertConfig.from_json_file(args.config_file)

    # Padding for divisibility by 8
    if config.vocab_size % 8 != 0:
        config.vocab_size += 8 - (config.vocab_size % 8)

    modeling.ACT2FN["bias_gelu"] = modeling.bias_gelu_training
    model = modeling.BertForPreTraining(config)

    checkpoint = None
    if not args.resume_from_checkpoint:
        global_step = 0
    else:
        if args.resume_step == -1 and not args.init_checkpoint:
            model_names = [f for f in os.listdir(args.output_dir) if f.endswith(".pt")]
            args.resume_step = max([int(x.split('.pt')[0].split('_')[1].strip()) for x in model_names])

        global_step = args.resume_step if not args.init_checkpoint else 0

        if not args.init_checkpoint:
            checkpoint = torch.load(os.path.join(args.output_dir, "ckpt_{}.pt".format(global_step)),
                                    map_location="cpu")
        else:
            checkpoint = torch.load(args.init_checkpoint, map_location="cpu")

        model.load_state_dict(checkpoint['model'], strict=False)

        if args.phase2 and not args.init_checkpoint:
            global_step -= args.phase1_end_step
        if is_main_process():
            print("resume step from ", args.resume_step)

    model.to(device)
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'gamma', 'beta', 'LayerNorm']

    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]

    optimizer = FusedLAMB(optimizer_grouped_parameters, lr=args.learning_rate)
    lr_scheduler = PolyWarmUpScheduler(optimizer,
                                       warmup=args.warmup_proportion,
                                       total_steps=args.max_steps)
    if args.fp16:
        if args.loss_scale == 0:
            model, optimizer = amp.initialize(model, optimizer, opt_level="O2", loss_scale="dynamic",
                                              cast_model_outputs=torch.float16)
        else:
            model, optimizer = amp.initialize(model, optimizer, opt_level="O2", loss_scale=args.loss_scale,
                                              cast_model_outputs=torch.float16)
        amp._amp_state.loss_scalers[0]._loss_scale = args.init_loss_scale

    model.checkpoint_activations(args.checkpoint_activations)

    if args.resume_from_checkpoint:
        if args.phase2 or args.init_checkpoint:
            keys = list(checkpoint['optimizer']['state'].keys())
            # Override hyperparameters from previous checkpoint
            for key in keys:
                checkpoint['optimizer']['state'][key]['step'] = global_step
            for iter, item in enumerate(checkpoint['optimizer']['param_groups']):
                checkpoint['optimizer']['param_groups'][iter]['step'] = global_step
                checkpoint['optimizer']['param_groups'][iter]['t_total'] = args.max_steps
                checkpoint['optimizer']['param_groups'][iter]['warmup'] = args.warmup_proportion
                checkpoint['optimizer']['param_groups'][iter]['lr'] = args.learning_rate
        optimizer.load_state_dict(checkpoint['optimizer'])  # , strict=False)

        # Restore AMP master parameters
        if args.fp16:
            optimizer._lazy_init_maybe_master_weights()
            optimizer._amp_stash.lazy_init_called = True
            optimizer.load_state_dict(checkpoint['optimizer'])
            for param, saved_param in zip(amp.master_params(optimizer), checkpoint['master params']):
                param.data.copy_(saved_param.data)

    if args.local_rank != -1:
        if not args.allreduce_post_accumulation:
            model = DDP(model, message_size=250000000, gradient_predivide_factor=get_world_size())
        else:
            flat_dist_call([param.data for param in model.parameters()], torch.distributed.broadcast, (0,))
    elif args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    criterion = BertPretrainingCriterion(config.vocab_size)

    return model, optimizer, lr_scheduler, checkpoint, global_step, criterion
def take_optimizer_step(args, optimizer, model, overflow_buf, global_step):

    global skipped_steps
    if args.allreduce_post_accumulation:
        # manually allreduce gradients after all accumulation steps
        # check for Inf/NaN
        # 1. allocate an uninitialized buffer for flattened gradient
        loss_scale = _amp_state.loss_scalers[0].loss_scale() if args.fp16 else 1
        master_grads = [p.grad for p in amp.master_params(optimizer) if p.grad is not None]
        flat_grad_size = sum(p.numel() for p in master_grads)
        allreduce_dtype = torch.float16 if args.allreduce_post_accumulation_fp16 else torch.float32
        flat_raw = torch.empty(flat_grad_size, device='cuda', dtype=allreduce_dtype)
        # 2. combine unflattening and predivision of unscaled 'raw' gradient
        allreduced_views = apex_C.unflatten(flat_raw, master_grads)
        overflow_buf.zero_()
        amp_C.multi_tensor_scale(65536,
                                 overflow_buf,
                                 [master_grads, allreduced_views],
                                 loss_scale / (get_world_size() * args.gradient_accumulation_steps))
        # 3. sum gradient across ranks. Because of the predivision, this averages the gradient
        torch.distributed.all_reduce(flat_raw)
        # 4. combine unscaling and unflattening of allreduced gradient
        overflow_buf.zero_()
        amp_C.multi_tensor_scale(65536,
                                 overflow_buf,
                                 [allreduced_views, master_grads],
                                 1. / loss_scale)
        # 5. update loss scale
        if args.fp16:
            scaler = _amp_state.loss_scalers[0]
            old_overflow_buf = scaler._overflow_buf
            scaler._overflow_buf = overflow_buf
            had_overflow = scaler.update_scale()
            scaler._overfloat_buf = old_overflow_buf
        else:
            had_overflow = 0
        # 6. call optimizer step function
        if had_overflow == 0:
            optimizer.step()
            global_step += 1
        else:
            # Overflow detected, print message and clear gradients
            skipped_steps += 1
            if is_main_process():
                scaler = _amp_state.loss_scalers[0]
                dllogger.log(step="PARAMETER", data={"loss_scale": scaler.loss_scale()})
            if _amp_state.opt_properties.master_weights:
                for param in optimizer._amp_stash.all_fp32_from_fp16_params:
                    param.grad = None
        for param in model.parameters():
            param.grad = None
    else:
        optimizer.step()
        #optimizer.zero_grad()
        for param in model.parameters():
            param.grad = None
        global_step += 1

    return global_step
def main():
    global timeout_sent

    args = parse_arguments()

    random.seed(args.seed + args.local_rank)
    np.random.seed(args.seed + args.local_rank)
    torch.manual_seed(args.seed + args.local_rank)
    torch.cuda.manual_seed(args.seed + args.local_rank)
    worker_init = WorkerInitObj(args.seed + args.local_rank)

    device, args = setup_training(args)
    dllogger.log(step="PARAMETER", data={"Config": [str(args)]})

    # Prepare optimizer
    model, optimizer, lr_scheduler, checkpoint, global_step, criterion = prepare_model_and_optimizer(args, device)

    if is_main_process():
        dllogger.log(step="PARAMETER", data={"SEED": args.seed})

    raw_train_start = None
    if args.do_train:
        if is_main_process():
            dllogger.log(step="PARAMETER", data={"train_start": True})
            dllogger.log(step="PARAMETER", data={"batch_size_per_gpu": args.train_batch_size})
            dllogger.log(step="PARAMETER", data={"learning_rate": args.learning_rate})

        model.train()
        most_recent_ckpts_paths = []
        average_loss = 0.0  # averaged loss every args.log_freq steps
        epoch = 0
        training_steps = 0

        pool = ProcessPoolExecutor(1)

        # Note: We loop infinitely over epochs, termination is handled via iteration count
        while True:
            thread = None
            restored_data_loader = None
            if not args.resume_from_checkpoint or epoch > 0 or (args.phase2 and global_step < 1) or args.init_checkpoint:
                files = [os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir) if
                         os.path.isfile(os.path.join(args.input_dir, f)) and 'training' in f]
                files.sort()
                num_files = len(files)
                random.Random(args.seed + epoch).shuffle(files)
                f_start_id = 0
            else:
                f_start_id = checkpoint['files'][0]
                files = checkpoint['files'][1:]
                args.resume_from_checkpoint = False
                num_files = len(files)
                # may not exist in all checkpoints
                epoch = checkpoint.get('epoch', 0)
                restored_dataloader = checkpoint.get('data_loader', None)

            shared_file_list = {}

            if torch.distributed.is_initialized() and get_world_size() > num_files:
                remainder = get_world_size() % num_files
                data_file = files[(f_start_id * get_world_size() + get_rank() + remainder * f_start_id) % num_files]
            else:
                data_file = files[(f_start_id * get_world_size() + get_rank()) % num_files]

            previous_file = data_file

            if restored_data_loader is None:
                train_data = pretraining_dataset(data_file, args.max_predictions_per_seq)
                if args.local_rank == -1:
                    train_sampler = RandomSampler(train_data)
                else:
                    train_sampler = DistributedSampler(train_data)
                train_dataloader = DataLoader(train_data, sampler=train_sampler,
                                              batch_size=args.train_batch_size * args.n_gpu,
                                              num_workers=4, worker_init_fn=worker_init,
                                              pin_memory=True)
                # shared_file_list["0"] = (train_dataloader, data_file)
            else:
                train_dataloader = restored_data_loader
                restored_data_loader = None

            overflow_buf = None
            if args.allreduce_post_accumulation:
                overflow_buf = torch.cuda.IntTensor([0])

            for f_id in range(f_start_id + 1, len(files)):
                if get_world_size() > num_files:
                    data_file = files[(f_id * get_world_size() + get_rank() + remainder * f_id) % num_files]
                else:
                    data_file = files[(f_id * get_world_size() + get_rank()) % num_files]

                previous_file = data_file

                dataset_future = pool.submit(create_pretraining_dataset, data_file, args.max_predictions_per_seq,
                                             shared_file_list, args, worker_init)

                train_iter = tqdm(train_dataloader, desc="Iteration",
                                  disable=args.disable_progress_bar) if is_main_process() else train_dataloader

                if raw_train_start is None:
                    raw_train_start = time.time()
                for step, batch in enumerate(train_iter):

                    training_steps += 1
                    batch = [t.to(device) for t in batch]
                    input_ids, segment_ids, input_mask, masked_lm_labels, next_sentence_labels = batch
                    prediction_scores, seq_relationship_score = model(input_ids=input_ids,
                                                                      token_type_ids=segment_ids,
                                                                      attention_mask=input_mask)
                    loss = criterion(prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels)
                    if args.n_gpu > 1:
                        loss = loss.mean()  # mean() to average on multi-gpu.

                    divisor = args.gradient_accumulation_steps
                    if args.gradient_accumulation_steps > 1:
                        if not args.allreduce_post_accumulation:
                            # this division was merged into predivision
                            loss = loss / args.gradient_accumulation_steps
                            divisor = 1.0
                    if args.fp16:
                        with amp.scale_loss(loss, optimizer,
                                            delay_overflow_check=args.allreduce_post_accumulation) as scaled_loss:
                            scaled_loss.backward()
                    else:
                        loss.backward()
                    average_loss += loss.item()

                    if training_steps % args.gradient_accumulation_steps == 0:
                        lr_scheduler.step()  # learning rate warmup
                        global_step = take_optimizer_step(args, optimizer, model, overflow_buf, global_step)

                    if global_step >= args.steps_this_run or timeout_sent:
                        train_time_raw = time.time() - raw_train_start
                        last_num_steps = int(training_steps / args.gradient_accumulation_steps) % args.log_freq
                        last_num_steps = args.log_freq if last_num_steps == 0 else last_num_steps
                        average_loss = torch.tensor(average_loss, dtype=torch.float32).cuda()
                        average_loss = average_loss / (last_num_steps * divisor)
                        if (torch.distributed.is_initialized()):
                            average_loss /= get_world_size()
                            torch.distributed.all_reduce(average_loss)
                        final_loss = average_loss.item()
                        if is_main_process():
                            dllogger.log(step=(epoch, global_step, ), data={"final_loss": final_loss})
                    elif training_steps % (args.log_freq * args.gradient_accumulation_steps) == 0:
                        if is_main_process():
                            dllogger.log(step=(epoch, global_step, ),
                                         data={"average_loss": average_loss / (args.log_freq * divisor),
                                               "step_loss": loss.item() * args.gradient_accumulation_steps / divisor,
                                               "learning_rate": optimizer.param_groups[0]['lr']})
                        average_loss = 0

                    if global_step >= args.steps_this_run or training_steps % (
                            args.num_steps_per_checkpoint * args.gradient_accumulation_steps) == 0 or timeout_sent:
                        if is_main_process() and not args.skip_checkpoint:
                            # Save a trained model
                            dllogger.log(step="PARAMETER", data={"checkpoint_step": global_step})
                            model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self
                            if args.resume_step < 0 or not args.phase2:
                                output_save_file = os.path.join(args.output_dir, "ckpt_{}.pt".format(global_step))
                            else:
                                output_save_file = os.path.join(args.output_dir,
                                                                "ckpt_{}.pt".format(global_step + args.phase1_end_step))
                            if args.do_train:
                                torch.save({'model': model_to_save.state_dict(),
                                            'optimizer': optimizer.state_dict(),
                                            'master params': list(amp.master_params(optimizer)),
                                            'files': [f_id] + files,
                                            'epoch': epoch,
                                            'data_loader': None if global_step >= args.max_steps else train_dataloader},
                                           output_save_file)

                                most_recent_ckpts_paths.append(output_save_file)
                                if len(most_recent_ckpts_paths) > 3:
                                    ckpt_to_be_removed = most_recent_ckpts_paths.pop(0)
                                    os.remove(ckpt_to_be_removed)

                        # Exiting the training due to hitting max steps, or being sent a
                        # timeout from the cluster scheduler
                        if global_step >= args.steps_this_run or timeout_sent:
                            del train_dataloader
                            # thread.join()
                            return args, final_loss, train_time_raw, global_step

                del train_dataloader
                # thread.join()
                # Make sure pool has finished and switch train_dataloader
                # NOTE: Will block until complete
                train_dataloader, data_file = dataset_future.result(timeout=None)

            epoch += 1
if __name__ == "__main__":

    now = time.time()
    args, final_loss, train_time_raw, global_step = main()
    gpu_count = args.n_gpu
    global_step += args.phase1_end_step if (args.phase2 and args.resume_step > 0) else 0
    if args.resume_step == -1:
        args.resume_step = 0
    if torch.distributed.is_initialized():
        gpu_count = get_world_size()
    if is_main_process():
        e2e_time = time.time() - now
        training_perf = args.train_batch_size * args.gradient_accumulation_steps * gpu_count \
                        * (global_step - args.resume_step + skipped_steps) / train_time_raw
        dllogger.log(step=tuple(), data={"e2e_train_time": e2e_time,
                                         "training_sequences_per_second": training_perf,
                                         "final_loss": final_loss,
                                         "raw_train_time": train_time_raw})
    dllogger.flush()
PyTorch/NLP/BERT/run_squad_v4.py (new file, mode 100644)
# coding=utf-8
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Copyright 2018 The Google AI Language Team Authors and The HugginFace Inc. team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Run BERT on SQuAD."""
from __future__ import absolute_import, division, print_function

import argparse
import collections
import json
import logging
import math
import os
import random
import sys
from io import open

import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange

from apex import amp
from schedulers import LinearWarmUpScheduler
from file_utils import PYTORCH_PRETRAINED_BERT_CACHE
import modeling
from optimization import BertAdam, warmup_linear
from tokenization import (BasicTokenizer, BertTokenizer, whitespace_tokenize)
from utils import is_main_process, format_step
import dllogger, time

os.environ["HIP_VISIBLE_DEVICES"] = "0,1,2,3"
torch._C._jit_set_profiling_mode(False)
#torch._C._jit_set_profiling_executor(False)

if sys.version_info[0] == 2:
    import cPickle as pickle
else:
    import pickle

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)
class SquadExample(object):
    """
    A single training/test example for the Squad dataset.
    For examples without an answer, the start and end position are -1.
    """

    def __init__(self,
                 qas_id,
                 question_text,
                 doc_tokens,
                 orig_answer_text=None,
                 start_position=None,
                 end_position=None,
                 is_impossible=None):
        self.qas_id = qas_id
        self.question_text = question_text
        self.doc_tokens = doc_tokens
        self.orig_answer_text = orig_answer_text
        self.start_position = start_position
        self.end_position = end_position
        self.is_impossible = is_impossible

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        s = ""
        s += "qas_id: %s" % (self.qas_id)
        s += ", question_text: %s" % (self.question_text)
        s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens))
        if self.start_position:
            s += ", start_position: %d" % (self.start_position)
        if self.end_position:
            s += ", end_position: %d" % (self.end_position)
        if self.is_impossible:
            s += ", is_impossible: %r" % (self.is_impossible)
        return s
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self,
                 unique_id,
                 example_index,
                 doc_span_index,
                 tokens,
                 token_to_orig_map,
                 token_is_max_context,
                 input_ids,
                 input_mask,
                 segment_ids,
                 start_position=None,
                 end_position=None,
                 is_impossible=None):
        self.unique_id = unique_id
        self.example_index = example_index
        self.doc_span_index = doc_span_index
        self.tokens = tokens
        self.token_to_orig_map = token_to_orig_map
        self.token_is_max_context = token_is_max_context
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.start_position = start_position
        self.end_position = end_position
        self.is_impossible = is_impossible
def read_squad_examples(input_file, is_training, version_2_with_negative):
    """Read a SQuAD json file into a list of SquadExample."""
    with open(input_file, "r", encoding='utf-8') as reader:
        input_data = json.load(reader)["data"]

    def is_whitespace(c):
        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
            return True
        return False

    examples = []
    for entry in input_data:
        for paragraph in entry["paragraphs"]:
            paragraph_text = paragraph["context"]
            doc_tokens = []
            char_to_word_offset = []
            prev_is_whitespace = True
            for c in paragraph_text:
                if is_whitespace(c):
                    prev_is_whitespace = True
                else:
                    if prev_is_whitespace:
                        doc_tokens.append(c)
                    else:
                        doc_tokens[-1] += c
                    prev_is_whitespace = False
                char_to_word_offset.append(len(doc_tokens) - 1)

            for qa in paragraph["qas"]:
                qas_id = qa["id"]
                question_text = qa["question"]
                start_position = None
                end_position = None
                orig_answer_text = None
                is_impossible = False
                if is_training:
                    if version_2_with_negative:
                        is_impossible = qa["is_impossible"]
                    if (len(qa["answers"]) != 1) and (not is_impossible):
                        raise ValueError(
                            "For training, each question should have exactly 1 answer.")
                    if not is_impossible:
                        answer = qa["answers"][0]
                        orig_answer_text = answer["text"]
                        answer_offset = answer["answer_start"]
                        answer_length = len(orig_answer_text)
                        start_position = char_to_word_offset[answer_offset]
                        end_position = char_to_word_offset[answer_offset + answer_length - 1]
                        # Only add answers where the text can be exactly recovered from the
                        # document. If this CAN'T happen it's likely due to weird Unicode
                        # stuff so we will just skip the example.
                        #
                        # Note that this means for training mode, every example is NOT
                        # guaranteed to be preserved.
                        actual_text = " ".join(doc_tokens[start_position:(end_position + 1)])
                        cleaned_answer_text = " ".join(whitespace_tokenize(orig_answer_text))
                        if actual_text.find(cleaned_answer_text) == -1:
                            logger.warning("Could not find answer: '%s' vs. '%s'",
                                           actual_text, cleaned_answer_text)
                            continue
                    else:
                        start_position = -1
                        end_position = -1
                        orig_answer_text = ""

                example = SquadExample(
                    qas_id=qas_id,
                    question_text=question_text,
                    doc_tokens=doc_tokens,
                    orig_answer_text=orig_answer_text,
                    start_position=start_position,
                    end_position=end_position,
                    is_impossible=is_impossible)
                examples.append(example)
    return examples
def convert_examples_to_features(examples, tokenizer, max_seq_length,
                                 doc_stride, max_query_length, is_training):
    """Loads a data file into a list of `InputBatch`s."""

    unique_id = 1000000000

    features = []
    for (example_index, example) in enumerate(examples):
        query_tokens = tokenizer.tokenize(example.question_text)

        if len(query_tokens) > max_query_length:
            query_tokens = query_tokens[0:max_query_length]

        tok_to_orig_index = []
        orig_to_tok_index = []
        all_doc_tokens = []
        for (i, token) in enumerate(example.doc_tokens):
            orig_to_tok_index.append(len(all_doc_tokens))
            sub_tokens = tokenizer.tokenize(token)
            for sub_token in sub_tokens:
                tok_to_orig_index.append(i)
                all_doc_tokens.append(sub_token)

        tok_start_position = None
        tok_end_position = None
        if is_training and example.is_impossible:
            tok_start_position = -1
            tok_end_position = -1
        if is_training and not example.is_impossible:
            tok_start_position = orig_to_tok_index[example.start_position]
            if example.end_position < len(example.doc_tokens) - 1:
                tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
            else:
                tok_end_position = len(all_doc_tokens) - 1
            (tok_start_position, tok_end_position) = _improve_answer_span(
                all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
                example.orig_answer_text)

        # The -3 accounts for [CLS], [SEP] and [SEP]
        max_tokens_for_doc = max_seq_length - len(query_tokens) - 3

        # We can have documents that are longer than the maximum sequence length.
        # To deal with this we do a sliding window approach, where we take chunks
        # of the up to our max length with a stride of `doc_stride`.
        _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
            "DocSpan", ["start", "length"])
        doc_spans = []
        start_offset = 0
        while start_offset < len(all_doc_tokens):
            length = len(all_doc_tokens) - start_offset
            if length > max_tokens_for_doc:
                length = max_tokens_for_doc
            doc_spans.append(_DocSpan(start=start_offset, length=length))
            if start_offset + length == len(all_doc_tokens):
                break
            start_offset += min(length, doc_stride)

        for (doc_span_index, doc_span) in enumerate(doc_spans):
            tokens = []
            token_to_orig_map = {}
            token_is_max_context = {}
            segment_ids = []
            tokens.append("[CLS]")
            segment_ids.append(0)
            for token in query_tokens:
                tokens.append(token)
                segment_ids.append(0)
            tokens.append("[SEP]")
            segment_ids.append(0)

            for i in range(doc_span.length):
                split_token_index = doc_span.start + i
                token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index]

                is_max_context = _check_is_max_context(doc_spans, doc_span_index,
                                                       split_token_index)
                token_is_max_context[len(tokens)] = is_max_context
                tokens.append(all_doc_tokens[split_token_index])
                segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)

            input_ids = tokenizer.convert_tokens_to_ids(tokens)

            # The mask has 1 for real tokens and 0 for padding tokens. Only real
            # tokens are attended to.
            input_mask = [1] * len(input_ids)

            # Zero-pad up to the sequence length.
            while len(input_ids) < max_seq_length:
                input_ids.append(0)
                input_mask.append(0)
                segment_ids.append(0)

            assert len(input_ids) == max_seq_length
            assert len(input_mask) == max_seq_length
            assert len(segment_ids) == max_seq_length

            start_position = None
            end_position = None
            if is_training and not example.is_impossible:
                # For training, if our document chunk does not contain an annotation
                # we throw it out, since there is nothing to predict.
                doc_start = doc_span.start
                doc_end = doc_span.start + doc_span.length - 1
                out_of_span = False
                if not (tok_start_position >= doc_start and
                        tok_end_position <= doc_end):
                    out_of_span = True
                if out_of_span:
                    start_position = 0
                    end_position = 0
                else:
                    doc_offset = len(query_tokens) + 2
                    start_position = tok_start_position - doc_start + doc_offset
                    end_position = tok_end_position - doc_start + doc_offset
            if is_training and example.is_impossible:
                start_position = 0
                end_position = 0

            features.append(
                InputFeatures(
                    unique_id=unique_id,
                    example_index=example_index,
                    doc_span_index=doc_span_index,
                    tokens=tokens,
                    token_to_orig_map=token_to_orig_map,
                    token_is_max_context=token_is_max_context,
                    input_ids=input_ids,
                    input_mask=input_mask,
                    segment_ids=segment_ids,
                    start_position=start_position,
                    end_position=end_position,
                    is_impossible=example.is_impossible))
            unique_id += 1

    return features
def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
                         orig_answer_text):
    """Returns tokenized answer spans that better match the annotated answer."""

    # The SQuAD annotations are character based. We first project them to
    # whitespace-tokenized words. But then after WordPiece tokenization, we can
    # often find a "better match". For example:
    #
    # Question: What year was John Smith born?
    # Context: The leader was John Smith (1895-1943).
    # Answer: 1895
    #
    # The original whitespace-tokenized answer will be "(1895-1943).". However
    # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match
    # the exact answer, 1895.
    #
    # However, this is not always possible. Consider the following:
    #
    # Question: What country is the top exporter of electornics?
    # Context: The Japanese electronics industry is the lagest in the world.
    # Answer: Japan
    #
    # In this case, the annotator chose "Japan" as a character sub-span of
    # the word "Japanese". Since our WordPiece tokenizer does not split
    # "Japanese", we just use "Japanese" as the annotation. This is fairly rare
    # in SQuAD, but does happen.
    tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))

    for new_start in range(input_start, input_end + 1):
        for new_end in range(input_end, new_start - 1, -1):
            text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
            if text_span == tok_answer_text:
                return (new_start, new_end)

    return (input_start, input_end)
def _check_is_max_context(doc_spans, cur_span_index, position):
    """Check if this is the 'max context' doc span for the token."""

    # Because of the sliding window approach taken to scoring documents, a single
    # token can appear in multiple documents. E.g.
    #   Doc: the man went to the store and bought a gallon of milk
    #   Span A: the man went to the
    #   Span B: to the store and bought
    #   Span C: and bought a gallon of
    #   ...
    #
    # Now the word 'bought' will have two scores from spans B and C. We only
    # want to consider the score with "maximum context", which we define as
    # the *minimum* of its left and right context (the *sum* of left and
    # right context will always be the same, of course).
    #
    # In the example the maximum context for 'bought' would be span C since
    # it has 1 left context and 3 right context, while span B has 4 left context
    # and 0 right context.
    best_score = None
    best_span_index = None
    for (span_index, doc_span) in enumerate(doc_spans):
        end = doc_span.start + doc_span.length - 1
        if position < doc_span.start:
            continue
        if position > end:
            continue
        num_left_context = position - doc_span.start
        num_right_context = end - position
        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
        if best_score is None or score > best_score:
            best_score = score
            best_span_index = span_index

    return cur_span_index == best_span_index
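
# Worked example of the scoring above (illustrative only; it assumes spans built
# from the same kind of start/length namedtuple used by convert_examples_to_features).
# With the docstring's document, token 'bought' sits at position 7:
#   Span B: start=3, length=5 -> left context 4, right context 0 -> score min(4, 0) + 0.01*5 = 0.05
#   Span C: start=6, length=5 -> left context 1, right context 3 -> score min(1, 3) + 0.01*5 = 1.05
# so _check_is_max_context(spans, cur_span_index_of_C, position=7) is True for span C
# and False for span B.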

RawResult = collections.namedtuple("RawResult",
                                   ["unique_id", "start_logits", "end_logits"])

def get_answers(examples, features, results, args):
    predictions = collections.defaultdict(list)  # it is possible that one example corresponds to multiple features
    Prediction = collections.namedtuple('Prediction', ['text', 'start_logit', 'end_logit'])

    if args.version_2_with_negative:
        null_vals = collections.defaultdict(lambda: (float("inf"), 0, 0))
    for ex, feat, result in match_results(examples, features, results):
        start_indices = _get_best_indices(result.start_logits, args.n_best_size)
        end_indices = _get_best_indices(result.end_logits, args.n_best_size)
        prelim_predictions = get_valid_prelim_predictions(start_indices, end_indices,
                                                          feat, result, args)
        prelim_predictions = sorted(
            prelim_predictions,
            key=lambda x: (x.start_logit + x.end_logit),
            reverse=True)
        if args.version_2_with_negative:
            score = result.start_logits[0] + result.end_logits[0]
            if score < null_vals[ex.qas_id][0]:
                null_vals[ex.qas_id] = (score, result.start_logits[0], result.end_logits[0])

        curr_predictions = []
        seen_predictions = []
        for pred in prelim_predictions:
            if len(curr_predictions) == args.n_best_size:
                break
            if pred.start_index > 0:  # this is a non-null prediction TODO: this probably is irrelevant
                final_text = get_answer_text(ex, feat, pred, args)
                if final_text in seen_predictions:
                    continue
            else:
                final_text = ""

            seen_predictions.append(final_text)
            curr_predictions.append(Prediction(final_text, pred.start_logit, pred.end_logit))
        predictions[ex.qas_id] += curr_predictions

    # Add empty prediction
    if args.version_2_with_negative:
        for qas_id in predictions.keys():
            predictions[qas_id].append(Prediction('',
                                                  null_vals[qas_id][1],
                                                  null_vals[qas_id][2]))

    nbest_answers = collections.defaultdict(list)
    answers = {}
    for qas_id, preds in predictions.items():
        nbest = sorted(preds,
                       key=lambda x: (x.start_logit + x.end_logit),
                       reverse=True)[:args.n_best_size]

        # In very rare edge cases we could only have single null prediction.
        # So we just create a nonce prediction in this case to avoid failure.
        if not nbest:
            nbest.append(Prediction(text="empty", start_logit=0.0, end_logit=0.0))

        total_scores = []
        best_non_null_entry = None
        for entry in nbest:
            total_scores.append(entry.start_logit + entry.end_logit)
            if not best_non_null_entry and entry.text:
                best_non_null_entry = entry
        probs = _compute_softmax(total_scores)
        for (i, entry) in enumerate(nbest):
            output = collections.OrderedDict()
            output["text"] = entry.text
            output["probability"] = probs[i]
            output["start_logit"] = entry.start_logit
            output["end_logit"] = entry.end_logit
            nbest_answers[qas_id].append(output)
        if args.version_2_with_negative:
            score_diff = null_vals[qas_id][0] - best_non_null_entry.start_logit - best_non_null_entry.end_logit
            if score_diff > args.null_score_diff_threshold:
                answers[qas_id] = ""
            else:
                answers[qas_id] = best_non_null_entry.text
        else:
            answers[qas_id] = nbest_answers[qas_id][0]['text']

    return answers, nbest_answers

def get_answer_text(example, feature, pred, args):
    tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1)]
    orig_doc_start = feature.token_to_orig_map[pred.start_index]
    orig_doc_end = feature.token_to_orig_map[pred.end_index]
    orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end + 1)]
    tok_text = " ".join(tok_tokens)

    # De-tokenize WordPieces that have been split off.
    tok_text = tok_text.replace(" ##", "")
    tok_text = tok_text.replace("##", "")

    # Clean whitespace
    tok_text = tok_text.strip()
    tok_text = " ".join(tok_text.split())
    orig_text = " ".join(orig_tokens)

    final_text = get_final_text(tok_text, orig_text, args.do_lower_case, args.verbose_logging)
    return final_text

def get_valid_prelim_predictions(start_indices, end_indices, feature, result, args):

    _PrelimPrediction = collections.namedtuple(
        "PrelimPrediction",
        ["start_index", "end_index", "start_logit", "end_logit"])
    prelim_predictions = []
    for start_index in start_indices:
        for end_index in end_indices:
            if start_index >= len(feature.tokens):
                continue
            if end_index >= len(feature.tokens):
                continue
            if start_index not in feature.token_to_orig_map:
                continue
            if end_index not in feature.token_to_orig_map:
                continue
            if not feature.token_is_max_context.get(start_index, False):
                continue
            if end_index < start_index:
                continue
            length = end_index - start_index + 1
            if length > args.max_answer_length:
                continue
            prelim_predictions.append(
                _PrelimPrediction(
                    start_index=start_index,
                    end_index=end_index,
                    start_logit=result.start_logits[start_index],
                    end_logit=result.end_logits[end_index]))
    return prelim_predictions

def match_results(examples, features, results):
    unique_f_ids = set([f.unique_id for f in features])
    unique_r_ids = set([r.unique_id for r in results])
    matching_ids = unique_f_ids & unique_r_ids
    features = [f for f in features if f.unique_id in matching_ids]
    results = [r for r in results if r.unique_id in matching_ids]
    features.sort(key=lambda x: x.unique_id)
    results.sort(key=lambda x: x.unique_id)

    for f, r in zip(features, results):
        # original code assumes strict ordering of examples. TODO: rewrite this
        yield examples[f.example_index], f, r

def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False):
    """Project the tokenized prediction back to the original text."""

    # When we created the data, we kept track of the alignment between original
    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So
    # now `orig_text` contains the span of our original text corresponding to the
    # span that we predicted.
    #
    # However, `orig_text` may contain extra characters that we don't want in
    # our prediction.
    #
    # For example, let's say:
    #   pred_text = steve smith
    #   orig_text = Steve Smith's
    #
    # We don't want to return `orig_text` because it contains the extra "'s".
    #
    # We don't want to return `pred_text` because it's already been normalized
    # (the SQuAD eval script also does punctuation stripping/lower casing but
    # our tokenizer does additional normalization like stripping accent
    # characters).
    #
    # What we really want to return is "Steve Smith".
    #
    # Therefore, we have to apply a semi-complicated alignment heuristic between
    # `pred_text` and `orig_text` to get a character-to-character alignment. This
    # can fail in certain cases in which case we just return `orig_text`.

    def _strip_spaces(text):
        ns_chars = []
        ns_to_s_map = collections.OrderedDict()
        for (i, c) in enumerate(text):
            if c == " ":
                continue
            ns_to_s_map[len(ns_chars)] = i
            ns_chars.append(c)
        ns_text = "".join(ns_chars)
        return (ns_text, ns_to_s_map)

    # We first tokenize `orig_text`, strip whitespace from the result
    # and `pred_text`, and check if they are the same length. If they are
    # NOT the same length, the heuristic has failed. If they are the same
    # length, we assume the characters are one-to-one aligned.
    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)

    tok_text = " ".join(tokenizer.tokenize(orig_text))

    start_position = tok_text.find(pred_text)
    if start_position == -1:
        if verbose_logging:
            logger.info(
                "Unable to find text: '%s' in '%s'" % (pred_text, orig_text))
        return orig_text
    end_position = start_position + len(pred_text) - 1

    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)
    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)

    if len(orig_ns_text) != len(tok_ns_text):
        if verbose_logging:
            logger.info("Length not equal after stripping spaces: '%s' vs '%s'",
                        orig_ns_text, tok_ns_text)
        return orig_text

    # We then project the characters in `pred_text` back to `orig_text` using
    # the character-to-character alignment.
    tok_s_to_ns_map = {}
    for (i, tok_index) in tok_ns_to_s_map.items():
        tok_s_to_ns_map[tok_index] = i

    orig_start_position = None
    if start_position in tok_s_to_ns_map:
        ns_start_position = tok_s_to_ns_map[start_position]
        if ns_start_position in orig_ns_to_s_map:
            orig_start_position = orig_ns_to_s_map[ns_start_position]

    if orig_start_position is None:
        if verbose_logging:
            logger.info("Couldn't map start position")
        return orig_text

    orig_end_position = None
    if end_position in tok_s_to_ns_map:
        ns_end_position = tok_s_to_ns_map[end_position]
        if ns_end_position in orig_ns_to_s_map:
            orig_end_position = orig_ns_to_s_map[ns_end_position]

    if orig_end_position is None:
        if verbose_logging:
            logger.info("Couldn't map end position")
        return orig_text

    output_text = orig_text[orig_start_position:(orig_end_position + 1)]
    return output_text

def _get_best_indices(logits, n_best_size):
    """Get the n-best logits from a list."""
    index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)

    best_indices = []
    for i in range(len(index_and_score)):
        if i >= n_best_size:
            break
        best_indices.append(index_and_score[i][0])
    return best_indices

def _compute_softmax(scores):
    """Compute softmax probability over raw logits."""
    if not scores:
        return []

    max_score = None
    for score in scores:
        if max_score is None or score > max_score:
            max_score = score

    exp_scores = []
    total_sum = 0.0
    for score in scores:
        x = math.exp(score - max_score)
        exp_scores.append(x)
        total_sum += x

    probs = []
    for score in exp_scores:
        probs.append(score / total_sum)
    return probs
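
# Quick sanity check (illustrative only): _compute_softmax([1.0, 2.0]) returns
# approximately [0.2689, 0.7311], i.e. exp(x_i - max(x)) / sum_j exp(x_j - max(x)).
# Subtracting the maximum logit first keeps the exponentials numerically stable
# without changing the resulting probabilities.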

from apex.multi_tensor_apply import multi_tensor_applier


class GradientClipper:
    """
    Clips gradient norm of an iterable of parameters.
    """
    def __init__(self, max_grad_norm):
        self.max_norm = max_grad_norm
        if multi_tensor_applier.available:
            import amp_C
            self._overflow_buf = torch.cuda.IntTensor([0])
            self.multi_tensor_l2norm = amp_C.multi_tensor_l2norm
            self.multi_tensor_scale = amp_C.multi_tensor_scale
        else:
            raise RuntimeError('Gradient clipping requires cuda extensions')

    def step(self, parameters):
        l = [p.grad for p in parameters if p.grad is not None]
        total_norm, _ = multi_tensor_applier(self.multi_tensor_l2norm, self._overflow_buf, [l], False)
        total_norm = total_norm.item()
        if (total_norm == float('inf')):
            return
        clip_coef = self.max_norm / (total_norm + 1e-6)
        if clip_coef < 1:
            multi_tensor_applier(self.multi_tensor_scale, self._overflow_buf, [l, l], clip_coef)
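
# Usage sketch (mirrors how main() below uses the clipper): after the backward
# pass, step() rescales the gradients in place whenever their global L2 norm
# exceeds max_grad_norm, and only then does the optimizer update the weights.
#
#   gradClipper = GradientClipper(max_grad_norm=1.0)
#   loss.backward()                                  # or amp.scale_loss(...) when fp16 is enabled
#   gradClipper.step(amp.master_params(optimizer))   # clip before optimizer.step()
#   optimizer.step()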

def main():
    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument("--bert_model", default=None, type=str, required=True,
                        help="Bert pre-trained model selected in the list: bert-base-uncased, "
                             "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, "
                             "bert-base-multilingual-cased, bert-base-chinese.")
    parser.add_argument("--output_dir", default=None, type=str, required=True,
                        help="The output directory where the model checkpoints and predictions will be written.")
    parser.add_argument("--init_checkpoint", default=None, type=str, required=True,
                        help="The checkpoint file from pretraining")

    ## Other parameters
    parser.add_argument("--train_file", default=None, type=str,
                        help="SQuAD json for training. E.g., train-v1.1.json")
    parser.add_argument("--predict_file", default=None, type=str,
                        help="SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
    parser.add_argument("--max_seq_length", default=384, type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. Sequences "
                             "longer than this will be truncated, and sequences shorter than this will be padded.")
    parser.add_argument("--doc_stride", default=128, type=int,
                        help="When splitting up a long document into chunks, how much stride to take between chunks.")
    parser.add_argument("--max_query_length", default=64, type=int,
                        help="The maximum number of tokens for the question. Questions longer than this will "
                             "be truncated to this length.")
    parser.add_argument("--do_train", action='store_true', help="Whether to run training.")
    parser.add_argument("--do_predict", action='store_true', help="Whether to run eval on the dev set.")
    parser.add_argument("--train_batch_size", default=16, type=int, help="Total batch size for training.")
    parser.add_argument("--predict_batch_size", default=8, type=int, help="Total batch size for predictions.")
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--num_train_epochs", default=3.0, type=float,
                        help="Total number of training epochs to perform.")
    parser.add_argument("--max_steps", default=-1.0, type=float,
                        help="Total number of training steps to perform.")
    parser.add_argument("--warmup_proportion", default=0.1, type=float,
                        help="Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10%% "
                             "of training.")
    parser.add_argument("--n_best_size", default=20, type=int,
                        help="The total number of n-best predictions to generate in the nbest_predictions.json "
                             "output file.")
    parser.add_argument("--max_answer_length", default=30, type=int,
                        help="The maximum length of an answer that can be generated. This is needed because the start "
                             "and end predictions are not conditioned on one another.")
    parser.add_argument("--verbose_logging", action='store_true',
                        help="If true, all of the warnings related to data processing will be printed. "
                             "A number of warnings are expected for a normal SQuAD evaluation.")
    parser.add_argument("--no_cuda", action='store_true',
                        help="Whether not to use CUDA when available")
    parser.add_argument('--seed', type=int, default=42,
                        help="random seed for initialization")
    parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
                        help="Number of updates steps to accumulate before performing a backward/update pass.")
    parser.add_argument("--do_lower_case", action='store_true',
                        help="Whether to lower case the input text. True for uncased models, False for cased models.")
    parser.add_argument("--local_rank", type=int, default=os.getenv('LOCAL_RANK', -1),
                        help="local_rank for distributed training on gpus")
    parser.add_argument('--fp16', default=False, action='store_true',
                        help="Mixed precision training")
    parser.add_argument('--amp', default=False, action='store_true',
                        help="Mixed precision training")
    parser.add_argument('--loss_scale', type=float, default=0,
                        help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
                             "0 (default value): dynamic loss scaling.\n"
                             "Positive power of 2: static loss scaling value.\n")
    parser.add_argument('--version_2_with_negative', action='store_true',
                        help='If true, the SQuAD examples contain some that do not have an answer.')
    parser.add_argument('--null_score_diff_threshold', type=float, default=0.0,
                        help="If null_score - best_non_null is greater than the threshold predict null.")
    parser.add_argument('--vocab_file', type=str, default=None, required=True,
                        help="Vocabulary mapping/file BERT was pretrained on")
    parser.add_argument("--config_file", default=None, type=str, required=True,
                        help="The BERT model config")
    parser.add_argument('--log_freq', type=int, default=50, help='frequency of logging loss.')
    parser.add_argument('--json-summary', type=str, default="results/dllogger.json",
                        help='If provided, the json summary will be written to the specified file.')
    parser.add_argument("--eval_script", help="Script to evaluate squad predictions",
                        default="evaluate.py", type=str)
    parser.add_argument("--do_eval", action='store_true',
                        help="Whether to evaluate the accuracy of predictions")
    parser.add_argument("--use_env", action='store_true',
                        help="Whether to read local rank from ENVVAR")
    parser.add_argument('--skip_checkpoint', default=False, action='store_true',
                        help="Whether to save checkpoints")
    parser.add_argument('--disable-progress-bar', default=False, action='store_true',
                        help='Disable tqdm progress bar')
    parser.add_argument("--skip_cache", default=False, action='store_true',
                        help="Whether to cache train features")
    parser.add_argument("--cache_dir", default=None, type=str,
                        help="Location to cache train features. Will default to the dataset directory")
    parser.add_argument("--dist_url", default='tcp://224.66.41.62:23456', type=str,
                        help='url used to set up distributed training')
    parser.add_argument("--gpus_per_node", type=int, default=4, help='num of gpus per node')
    parser.add_argument("--world_size", type=int, default=1, help="number of process")

    args = parser.parse_args()
    args.fp16 = args.fp16 or args.amp

    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        print("n_gpu:", torch.cuda.device_count())
        device_n = args.local_rank % 4
        print("=" * 20)
        print("device:", device_n)
        torch.cuda.set_device(device_n)
        print("=" * 20)
        print("torch.cuda.set_device:", torch.cuda.set_device(device_n))
        device = torch.device("cuda", device_n)
        print("=" * 20)
        print("torch.device:", torch.device("cuda", device_n))
        print("device:", device)
        # torch.cuda.set_device(args.local_rank)
        # device = torch.device("cuda", args.local_rank)
        # device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
        # torch.distributed.init_process_group(backend='gloo', init_method='env://')
        # xuan
        # if args.world_size > 1:
        #     args.local_rank = args.local_rank * args.gpus_per_node
        torch.distributed.init_process_group(backend='nccl',
                                             init_method=args.dist_url,
                                             world_size=args.world_size,
                                             rank=args.local_rank)
        n_gpu = 1

    if is_main_process():
        dllogger.init(backends=[dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE,
                                                           filename=args.json_summary),
                                dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE,
                                                       step_format=format_step)])
    else:
        dllogger.init(backends=[])

    print("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
        device, n_gpu, bool(args.local_rank != -1), args.fp16))

    dllogger.log(step="PARAMETER", data={"Config": [str(args)]})

    if args.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
            args.gradient_accumulation_steps))

    args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    dllogger.log(step="PARAMETER", data={"SEED": args.seed})

    if n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)
    if not args.do_train and not args.do_predict:
        raise ValueError("At least one of `do_train` or `do_predict` must be True.")

    if args.do_train:
        if not args.train_file:
            raise ValueError(
                "If `do_train` is True, then `train_file` must be specified.")
    if args.do_predict:
        if not args.predict_file:
            raise ValueError(
                "If `do_predict` is True, then `predict_file` must be specified.")

    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and os.listdir(args.output_dir) != ['logfile.txt']:
        print("WARNING: Output directory {} already exists and is not empty.".format(args.output_dir),
              os.listdir(args.output_dir))
    if not os.path.exists(args.output_dir) and is_main_process():
        os.makedirs(args.output_dir)

    tokenizer = BertTokenizer(args.vocab_file, do_lower_case=args.do_lower_case, max_len=512)  # for bert large
    # tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

    train_examples = None
    num_train_optimization_steps = None
    if args.do_train:
        train_examples = read_squad_examples(
            input_file=args.train_file, is_training=True, version_2_with_negative=args.version_2_with_negative)
        num_train_optimization_steps = int(
            len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
        if args.local_rank != -1:
            num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

    # Prepare model
    config = modeling.BertConfig.from_json_file(args.config_file)
    # Padding for divisibility by 8
    if config.vocab_size % 8 != 0:
        config.vocab_size += 8 - (config.vocab_size % 8)

    modeling.ACT2FN["bias_gelu"] = modeling.bias_gelu_training
    model = modeling.BertForQuestionAnswering(config)
    # model = modeling.BertForQuestionAnswering.from_pretrained(args.bert_model,
    #             cache_dir=os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank)))
    dllogger.log(step="PARAMETER", data={"loading_checkpoint": True})
    # model.load_state_dict(torch.load(args.init_checkpoint, map_location='cpu')["model"], strict=False)
    model.load_state_dict(torch.load(args.init_checkpoint, map_location='cpu'), strict=False)
    dllogger.log(step="PARAMETER", data={"loaded_checkpoint": True})
    model.to(device)
    # model = model.cuda()
    num_weights = sum([p.numel() for p in model.parameters() if p.requires_grad])
    dllogger.log(step="PARAMETER", data={"model_weights_num": num_weights})

    # Prepare optimizer
    param_optimizer = list(model.named_parameters())

    # hack to remove the pooler, which is not used and would
    # otherwise produce a None grad that breaks apex
    param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]

    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']

    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    if args.do_train:
        if args.fp16:
            try:
                from apex.optimizers import FusedAdam
            except ImportError:
                raise ImportError(
                    "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
            optimizer = FusedAdam(optimizer_grouped_parameters,
                                  lr=args.learning_rate,
                                  bias_correction=False)

            if args.loss_scale == 0:
                model, optimizer = amp.initialize(model, optimizer, opt_level="O2",
                                                  keep_batchnorm_fp32=False, loss_scale="dynamic")
            else:
                model, optimizer = amp.initialize(model, optimizer, opt_level="O2",
                                                  keep_batchnorm_fp32=False, loss_scale=args.loss_scale)
            if args.do_train:
                scheduler = LinearWarmUpScheduler(optimizer, warmup=args.warmup_proportion,
                                                  total_steps=num_train_optimization_steps)
        else:
            optimizer = BertAdam(optimizer_grouped_parameters,
                                 lr=args.learning_rate,
                                 warmup=args.warmup_proportion,
                                 t_total=num_train_optimization_steps)

    if args.local_rank != -1:
        try:
            from apex.parallel import DistributedDataParallel as DDP
        except ImportError:
            raise ImportError(
                "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
        model = DDP(model)
        # model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device_n])
    elif n_gpu > 1:
        model = torch.nn.DataParallel(model)
    global_step = 0
    if args.do_train:

        if args.cache_dir is None:
            cached_train_features_file = args.train_file + '_{0}_{1}_{2}_{3}'.format(
                list(filter(None, args.bert_model.split('/'))).pop(),
                str(args.max_seq_length), str(args.doc_stride), str(args.max_query_length))
        else:
            cached_train_features_file = args.cache_dir.strip('/') + '/' + args.train_file.split('/')[-1] + '_{0}_{1}_{2}_{3}'.format(
                list(filter(None, args.bert_model.split('/'))).pop(),
                str(args.max_seq_length), str(args.doc_stride), str(args.max_query_length))

        train_features = None
        try:
            with open(cached_train_features_file, "rb") as reader:
                train_features = pickle.load(reader)
        except:
            train_features = convert_examples_to_features(
                examples=train_examples,
                tokenizer=tokenizer,
                max_seq_length=args.max_seq_length,
                doc_stride=args.doc_stride,
                max_query_length=args.max_query_length,
                is_training=True)

            if not args.skip_cache and is_main_process():
                dllogger.log(step="PARAMETER", data={"Cached_train features_file": cached_train_features_file})
                with open(cached_train_features_file, "wb") as writer:
                    pickle.dump(train_features, writer)

        dllogger.log(step="PARAMETER", data={"train_start": True})
        dllogger.log(step="PARAMETER", data={"training_samples": len(train_examples)})
        dllogger.log(step="PARAMETER", data={"training_features": len(train_features)})
        dllogger.log(step="PARAMETER", data={"train_batch_size": args.train_batch_size})
        dllogger.log(step="PARAMETER", data={"steps": num_train_optimization_steps})
        all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
        all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
        all_start_positions = torch.tensor([f.start_position for f in train_features], dtype=torch.long)
        all_end_positions = torch.tensor([f.end_position for f in train_features], dtype=torch.long)

        train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                                   all_start_positions, all_end_positions)
        if args.local_rank == -1:
            train_sampler = RandomSampler(train_data)
        else:
            train_sampler = DistributedSampler(train_data)
        train_dataloader = DataLoader(train_data, sampler=train_sampler,
                                      batch_size=args.train_batch_size * n_gpu)

        model.train()
        gradClipper = GradientClipper(max_grad_norm=1.0)
        final_loss = None
        train_start = time.time()
        for epoch in range(int(args.num_train_epochs)):
            train_iter = tqdm(train_dataloader, desc="Iteration",
                              disable=args.disable_progress_bar) if is_main_process() else train_dataloader
            for step, batch in enumerate(train_iter):
                # Terminate early for benchmarking
                print("step is ", step, " ")
                if args.max_steps > 0 and global_step > args.max_steps:
                    break

                if n_gpu == 1:
                    batch = tuple(t.to(device) for t in batch)  # multi-gpu does scattering it-self
                input_ids, input_mask, segment_ids, start_positions, end_positions = batch
                start_logits, end_logits = model(input_ids, segment_ids, input_mask)
                print("+++++++++++++++++++++++++++++++++++++1")
                # If we are on multi-GPU, split adds a dimension
                if len(start_positions.size()) > 1:
                    start_positions = start_positions.squeeze(-1)
                if len(end_positions.size()) > 1:
                    end_positions = end_positions.squeeze(-1)
                # sometimes the start/end positions are outside our model inputs, we ignore these terms
                ignored_index = start_logits.size(1)
                start_positions.clamp_(0, ignored_index)
                end_positions.clamp_(0, ignored_index)
                print("+++++++++++++++++++++++++++++++++++++2")
                loss_fct = torch.nn.CrossEntropyLoss(ignore_index=ignored_index)
                start_loss = loss_fct(start_logits, start_positions)
                end_loss = loss_fct(end_logits, end_positions)
                loss = (start_loss + end_loss) / 2
                if n_gpu > 1:
                    loss = loss.mean()  # mean() to average on multi-gpu.
                if args.gradient_accumulation_steps > 1:
                    loss = loss / args.gradient_accumulation_steps
                if args.fp16:
                    with amp.scale_loss(loss, optimizer) as scaled_loss:
                        scaled_loss.backward()
                else:
                    print("compute loss back")
                    loss.backward()
                print("+++++++++++++++++++++++++++++++++++++3")
                # gradient clipping
                gradClipper.step(amp.master_params(optimizer))

                if (step + 1) % args.gradient_accumulation_steps == 0:
                    if args.fp16:
                        # modify learning rate with special warm up for BERT which FusedAdam doesn't do
                        scheduler.step()
                    optimizer.step()
                    optimizer.zero_grad()
                    global_step += 1

                final_loss = loss.item()
                if step % args.log_freq == 0:
                    dllogger.log(step=(epoch, global_step,),
                                 data={"step_loss": final_loss,
                                       "learning_rate": optimizer.param_groups[0]['lr']})

        time_to_train = time.time() - train_start
    if args.do_train and is_main_process() and not args.skip_checkpoint:
        # Save a trained model and the associated configuration
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self
        output_model_file = os.path.join(args.output_dir, modeling.WEIGHTS_NAME)
        torch.save({"model": model_to_save.state_dict()}, output_model_file)
        output_config_file = os.path.join(args.output_dir, modeling.CONFIG_NAME)
        with open(output_config_file, 'w') as f:
            f.write(model_to_save.config.to_json_string())
    if args.do_predict and (args.local_rank == -1 or is_main_process()):

        if not args.do_train and args.fp16:
            model.half()

        eval_examples = read_squad_examples(
            input_file=args.predict_file, is_training=False, version_2_with_negative=args.version_2_with_negative)
        eval_features = convert_examples_to_features(
            examples=eval_examples,
            tokenizer=tokenizer,
            max_seq_length=args.max_seq_length,
            doc_stride=args.doc_stride,
            max_query_length=args.max_query_length,
            is_training=False)

        dllogger.log(step="PARAMETER", data={"infer_start": True})
        dllogger.log(step="PARAMETER", data={"eval_samples": len(eval_examples)})
        dllogger.log(step="PARAMETER", data={"eval_features": len(eval_features)})
        dllogger.log(step="PARAMETER", data={"predict_batch_size": args.predict_batch_size})

        all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
        all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
        all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
        eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_example_index)
        # Run prediction for full data
        eval_sampler = SequentialSampler(eval_data)
        eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.predict_batch_size)

        infer_start = time.time()
        model.eval()
        all_results = []
        dllogger.log(step="PARAMETER", data={"eval_start": True})
        for input_ids, input_mask, segment_ids, example_indices in tqdm(
                eval_dataloader, desc="Evaluating", disable=args.disable_progress_bar):
            if len(all_results) % 1000 == 0:
                dllogger.log(step="PARAMETER", data={"sample_number": len(all_results)})
            input_ids = input_ids.to(device)
            input_mask = input_mask.to(device)
            segment_ids = segment_ids.to(device)
            with torch.no_grad():
                batch_start_logits, batch_end_logits = model(input_ids, segment_ids, input_mask)
            for i, example_index in enumerate(example_indices):
                start_logits = batch_start_logits[i].detach().cpu().tolist()
                end_logits = batch_end_logits[i].detach().cpu().tolist()
                eval_feature = eval_features[example_index.item()]
                unique_id = int(eval_feature.unique_id)
                all_results.append(RawResult(unique_id=unique_id,
                                             start_logits=start_logits,
                                             end_logits=end_logits))

        time_to_infer = time.time() - infer_start
        output_prediction_file = os.path.join(args.output_dir, "predictions.json")
        output_nbest_file = os.path.join(args.output_dir, "nbest_predictions.json")

        answers, nbest_answers = get_answers(eval_examples, eval_features, all_results, args)
        with open(output_prediction_file, "w") as f:
            f.write(json.dumps(answers, indent=4) + "\n")
        with open(output_nbest_file, "w") as f:
            f.write(json.dumps(nbest_answers, indent=4) + "\n")
        # output_null_log_odds_file = os.path.join(args.output_dir, "null_odds.json")
        # write_predictions(eval_examples, eval_features, all_results,
        #                   args.n_best_size, args.max_answer_length,
        #                   args.do_lower_case, output_prediction_file,
        #                   output_nbest_file, output_null_log_odds_file, args.verbose_logging,
        #                   args.version_2_with_negative, args.null_score_diff_threshold)

        # if args.do_eval and is_main_process():
        if args.do_eval:
            import sys
            import subprocess
            eval_out = subprocess.check_output([sys.executable, args.eval_script,
                                                args.predict_file, args.output_dir + "/predictions.json"])
            scores = str(eval_out).strip()
            exact_match = float(scores.split(":")[1].split(",")[0])
            f1 = float(scores.split(":")[2].split("}")[0])
            # check whether f1 and exact_match are defined
            print('-' * 20)
            print('f1:', f1, "exact_match:", exact_match)
            print('-' * 20)
    if args.do_train:
        gpu_count = n_gpu
        if torch.distributed.is_initialized():
            gpu_count = torch.distributed.get_world_size()

        if args.max_steps == -1:
            dllogger.log(step=tuple(), data={"e2e_train_time": time_to_train,
                                             "training_sequences_per_second": len(train_features) * args.num_train_epochs / time_to_train,
                                             "final_loss": final_loss})
        else:
            dllogger.log(step=tuple(), data={"e2e_train_time": time_to_train,
                                             "training_sequences_per_second": args.train_batch_size * args.gradient_accumulation_steps
                                                                              * args.max_steps * gpu_count / time_to_train,
                                             "final_loss": final_loss})
    if args.do_predict and is_main_process():
        dllogger.log(step=tuple(), data={"e2e_inference_time": time_to_infer,
                                         "inference_sequences_per_second": len(eval_features) / time_to_infer})
    if args.do_eval and is_main_process():
        # global exact_match
        # global f1
        dllogger.log(step=tuple(), data={"exact_match": exact_match, "F1": f1})


if __name__ == "__main__":
    main()
    dllogger.flush()
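
# Example invocation (a minimal single-process fine-tuning sketch; every path
# below is a placeholder, and the pretraining checkpoint, vocab.txt,
# bert_config.json and evaluate.py must be supplied by the user):
#
#   python3 run_squad_v4.py \
#       --bert_model=bert-large-uncased \
#       --init_checkpoint=/path/to/pretrained_ckpt.pt \
#       --config_file=/path/to/bert_config.json \
#       --vocab_file=/path/to/vocab.txt \
#       --train_file=/path/to/train-v1.1.json \
#       --predict_file=/path/to/dev-v1.1.json \
#       --eval_script=/path/to/evaluate.py \
#       --do_train --do_predict --do_eval --do_lower_case \
#       --output_dir=/path/to/results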
PyTorch/NLP/BERT/single_pre1_1.sh
View file @ bedf3c0c
#!/bin/bash
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE

# The lines below were modified
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

APP="python3 run_pretraining_v1.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
    --config_file=./bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20000 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --gpus_per_node 1 \
    --do_train \
    --json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json"

case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
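
# Launch note (a sketch, not part of the original file): the script reads its rank
# from OMPI_COMM_WORLD_LOCAL_RANK, so it is meant to be started through mpirun
# (or an equivalent MPI launcher) rather than executed directly, e.g.
#
#   mpirun -np 1 ./single_pre1_1.sh    # single-card phase-1 pretraining
#
# The hostfile, queue system and module environment are site-specific assumptions.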
PyTorch/NLP/BERT/single_pre1_1_fp16.sh
View file @ bedf3c0c
#!/bin/bash
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE

# The lines below were modified
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

APP="python3 run_pretraining_v1.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
    --config_file=./bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --fp16 \
    --amp \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --gpus_per_node 1 \
    --do_train \
    --json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json"
#--fp16 \
# --amp \

case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
PyTorch/NLP/BERT/single_pre1_4.sh
View file @ bedf3c0c
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE

# The lines below were modified
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

APP="python3 run_pretraining_v4.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
    --config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20000 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --do_train \
    --use_env \
    --local_rank ${comm_rank} \
    --world_size 4 \
    --gpus_per_node 1 \
    --dist_url tcp://localhost:34567 \
    --json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json"

case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
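
# Launch sketch (an assumption based on --world_size 4 and the OMPI rank variables
# above, not part of the original file): one process per DCU on a single node, e.g.
#
#   mpirun -np 4 ./single_pre1_4.sh
#
# Each rank passes its own --local_rank=${comm_rank}, and tcp://localhost:34567 from
# --dist_url is used as the rendezvous address for torch.distributed.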
PyTorch/NLP/BERT/single_pre1_4_fp16.sh
View file @ bedf3c0c
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE

# The lines below were modified
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

APP="python3 run_pretraining_v4.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
    --config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20000 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --do_train \
    --fp16 \
    --amp \
    --use_env \
    --local_rank ${comm_rank} \
    --world_size 4 \
    --gpus_per_node 1 \
    --dist_url tcp://localhost:34567 \
    --json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json"

case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac