dcuai / dlexamples / Commits / bedf3c0c
Commit bedf3c0c, authored Sep 16, 2022 by hepj
Update the README, add training scripts, and improve the model conversion code
Parent: 49afe744
Changes: 89
Showing 20 changed files with 2627 additions and 142 deletions
PyTorch/NLP/BERT/README.md +119 -142
PyTorch/NLP/BERT/README_old.md +181 -0
PyTorch/NLP/BERT/bert_per1_4_fp16.sh +4 -0
PyTorch/NLP/BERT/bert_pre1.sh +2 -0
PyTorch/NLP/BERT/bert_pre1_4.sh +4 -0
PyTorch/NLP/BERT/bert_pre1_fp16.sh +3 -0
PyTorch/NLP/BERT/bert_pre2.sh +2 -0
PyTorch/NLP/BERT/bert_pre2_4.sh +4 -0
PyTorch/NLP/BERT/bert_pre2_4_fp16.sh +4 -0
PyTorch/NLP/BERT/bert_pre2_fp16.sh +2 -0
PyTorch/NLP/BERT/bert_squad.sh +5 -0
PyTorch/NLP/BERT/bert_squad4.sh +9 -0
PyTorch/NLP/BERT/bert_squad4_fp16.sh +9 -0
PyTorch/NLP/BERT/bert_squad_fp16.sh +5 -0
PyTorch/NLP/BERT/run_pretraining_v4.py +709 -0
PyTorch/NLP/BERT/run_squad_v4.py +1242 -0
PyTorch/NLP/BERT/single_pre1_1.sh +87 -0
PyTorch/NLP/BERT/single_pre1_1_fp16.sh +91 -0
PyTorch/NLP/BERT/single_pre1_4.sh +70 -0
PyTorch/NLP/BERT/single_pre1_4_fp16.sh +75 -0
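PyTorch/NLP/BERT/README.md
# Introduction
Runs the BERT network with the PyTorch framework.
* BERT training consists of pre-training and fine-tuning; pre-training is split into two phases.
* BERT inference can be validated for accuracy on different datasets.
* For details on data generation and model conversion, see [README.md](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/develop/PyTorch/NLP/BERT/scripts/README.md)
# Running the examples
# **BERT compute performance test**
Code examples are currently provided for the two pre-training phases on the English wiki dataset and for fine-tuning on the SQuAD dataset.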
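## 1. Dataset preparation
## pre-train phase1
pre_train data: the latest release is wiki20220401, but the dataset is nearly 20 GB compressed and about 300 GB after extraction, so the download is slow and extraction takes a lot of disk space. The download link for enwiki-20220401-pages-articles-multistream.xml.bz2 is given below: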
|Parameter|Description|Example|
|:---:|:---:|:---:|
|PATH_PHRASE1|Path to the phase 1 training dataset|/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.<br>15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
|OUTPUT_DIR|Output path|/workspace/results|
|PATH_CONFIG|Config path|/workspace/bert_large_uncased|
|PATH_PHRASE2|Path to the phase 2 training dataset|/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.<br>15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
<br>
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints1 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
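### Multiple cards
https://dumps.wikimedia.org/enwiki/20220401/
A wiki dataset that has already been downloaded and preprocessed is available on the server and is used here. The pre-training data is split into PHRASE1 and PHRASE2.
* Method 1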
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
Kunshan wiki dataset path for PHRASE1:
PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
Kunshan wiki dataset path for PHRASE2:
PATH_PHRASE2=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
Wuzhen wiki dataset path for PHRASE1:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
Wuzhen wiki dataset path for PHRASE2:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
```
SQuAD 1.1 question-answering data:
[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
## 2. Test environment
Note: the versions of dtk, python, torch, apex, and related packages must be aligned with each other.
```
#scripts/run_pretrain.sh assumes four cards per node by default
cd scripts; bash run_pretrain.sh
1. Create a Python virtual environment and activate it
virtualenv --python=~/package/Python-3.6.8/build/bin/python3 venv_dtk21.10.1_torch1.10
source venv_dtk21.10_torch1.10/bin/activate
2. Install the dependencies
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install torch-1.10.0a0+gitcc7c9c7-cp36-cp36m-linux_x86_64.whl
pip install torchvision-0.10.0a0+300a8a4-cp36-cp36m-linux_x86_64.whl
pip install apex-0.1-cp36-cp36m-linux_x86_64.whl
3. Set the environment variables
module rm compiler/rocm/2.9
export ROCM_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1
export HIP_PATH=${ROCM_PATH}/hip
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PAT
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZ
```
## 3. SQuAD test
## pre-train phase2
### 1. Model conversion
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```
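### Multiple cards
Model conversion still has problems, possibly because the downloaded TF model differs from model.ckpt-28252, or because of torch/apex version compatibility issues; this is still being investigated. The already-converted model can be used directly for SQuAD fine-tuning. (The PHRASE tests are not affected: PHRASE is pre-training, which needs only the training data and the network structure and does not load a model.)
[Converted model, extraction code: vs8d](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)
### 2. Parameter description
* Method 1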
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
--train_file training data
--predict_file prediction file
--init_checkpoint model file
--vocab_file vocabulary file
--output_dir output directory
--config_file model configuration file
--json-summary output JSON file
--bert_model BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--do_train whether to train
--do_predict whether to predict
--train_batch_size training batch size
--predict_batch_size prediction batch size
--gpus_per_node number of GPUs used per node
--local_rank local_rank for GPU-based distributed training (set to -1 for a single card)
--fp16 mixed-precision training
--amp mixed-precision training
```
* Method 2
hostfile:
### 3. Running
```
node1 slots=4
node2 slots=4
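# Single card
./bert_squad.sh # single precision (edit the APP setting in single_squad.sh to match your own paths)
./bert_squad_fp16.sh # half precision (edit the APP setting in single_squad_fp16.sh to match your own paths)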
```
```
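#scripts/run_pretrain2.sh assumes four cards per node by default
cd scripts; bash run_pretrain2.sh
# Multiple cards
./bert_squad4.sh # single precision (edit the APP setting in single_squad4.sh to match your own paths)
./bert_squad4_fp16.sh # half precision (edit the APP setting in single_squad4_fp16.sh to match your own paths)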
```
## 4. **PHRASE test**
### 1. Parameter description
## Fine-tune training
### Single card
```
python3 run_squad_v1.py \
--train_file squad/v1.1/train-v1.1.json \
--init_checkpoint model.ckpt-28252.pt \
--vocab_file vocab.txt \
--output_dir SQuAD \
--config_file bert_config.json \
--bert_model=bert-large-uncased \
--do_train \
--train_batch_size 1 \
--gpus_per_node 1
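--input_dir input data directory
--output_dir output directory
--config_file model configuration file
--bert_model BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
--train_batch_size training batch size
--max_seq_length=128 maximum sequence length (must match the training data)
--max_predictions_per_seq maximum number of masked tokens in an input sequence
--max_steps maximum number of training steps
--warmup_proportion proportion of training used for linear learning-rate warmup
--num_steps_per_checkpoint number of steps between checkpoint saves
--learning_rate learning rate
--seed random seed
--gradient_accumulation_steps number of update steps to accumulate before performing a backward/update pass
--allreduce_post_accumulation whether to do allreduces during gradient accumulation steps
--do_train whether to train
--fp16 mixed-precision training
--amp mixed-precision training
--json-summary output JSON file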
```
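The `--warmup_proportion`, `--max_steps`, and `--learning_rate` options listed above interact during the warmup stage. The sketch below is only an illustration of the linear warmup part, under the assumption that the learning rate ramps from zero to the configured value over `warmup_proportion * max_steps` steps. The function `warmup_lr` is a hypothetical helper, not the scheduler used by the training script, and the decay applied after warmup is deliberately left out.

```python
def warmup_lr(step, base_lr, warmup_proportion, max_steps):
    """Learning rate at `step` under linear warmup (hypothetical helper).

    Assumption: after the warmup stage the rate is simply held at base_lr;
    the real training script applies its own decay schedule instead.
    """
    warmup_steps = warmup_proportion * max_steps
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup stage.
        return base_lr * step / warmup_steps
    return base_lr


if __name__ == "__main__":
    # Values taken from the phase 2 command line in this README.
    for step in (0, 25600, 51200, 400000):
        print(step, warmup_lr(step, 4e-3, 0.128, 400000))
```

With the phase 1 setting `--warmup_proportion=0.0` the warmup stage is empty, so the configured learning rate applies from the first step.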
### Multiple cards
hostfile:
```
node1 slots=4
node2 slots=4
```
### 2.PHRASE1
```
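#scripts/run_squad_1.sh assumes four cards per node by default
bash run_squad_1.sh
# Single card
./bert_pre1.sh # single precision (edit the APP setting in single_pre1_1.sh to match your own paths)
./bert_pre1_fp16.sh # half precision (edit the APP setting in single_pre1_1_fp16.sh to match your own paths)
# Multiple cards
./bert_pre1_4.sh # single precision (edit the APP setting in single_pre1_4.sh to match your own paths)
./bert_pre1_4_fp16.sh # half precision (edit the APP setting in single_pre1_4_fp16.sh to match your own paths)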
```
### 3.PHRASE2
```
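# Single card
./bert_pre2.sh # single precision (edit the APP setting in single_pre2_1.sh to match your own paths)
./bert_pre2_fp16.sh # half precision (edit the APP setting in single_pre2_1_fp16.sh to match your own paths)
# Multiple cards
./bert_pre2_4.sh # single precision (edit the APP setting in single_pre2_4.sh to match your own paths)
./bert_pre2_4_fp16.sh # half precision (edit the APP setting in single_pre2_4_fp16.sh to match your own paths)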
```
# References
[https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch](https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch)
[https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT)
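PyTorch/NLP/BERT/README_old.md
0 → 100644
# Introduction
Runs the BERT network with the PyTorch framework.
* BERT training consists of pre-training and fine-tuning; pre-training is split into two phases.
* BERT inference can be validated for accuracy on different datasets.
* For details on data generation and model conversion, see [README.md](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/develop/PyTorch/NLP/BERT/scripts/README.md)
# Running the examples
Code examples are currently provided for the two pre-training phases on the English wiki dataset and for fine-tuning on the SQuAD dataset.
## pre-train phase1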
|Parameter|Description|Example|
|:---:|:---:|:---:|
|PATH_PHRASE1|Path to the phase 1 training dataset|/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.<br>15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
|OUTPUT_DIR|Output path|/workspace/results|
|PATH_CONFIG|Config path|/workspace/bert_large_uncased|
|PATH_PHRASE2|Path to the phase 2 training dataset|/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.<br>15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10|
<br>
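The example dataset paths in the table encode the preprocessing settings in the directory name (`seq_len_128`, `max_pred_20`, and so on), and the commands below pass matching `--max_seq_length` and `--max_predictions_per_seq` values. A minimal sketch of a consistency check is shown here; the helpers `parse_dataset_dir` and `check_args_match` are hypothetical and not part of this repository, and the regular expressions assume the naming scheme shown in the table.

```python
import re


def parse_dataset_dir(path):
    """Return (seq_len, max_pred) parsed from a dataset directory name.

    Hypothetical helper: assumes the name contains 'seq_len_<N>' and
    'max_pred_<M>' as in the example paths of this README.
    """
    seq = re.search(r"seq_len_(\d+)", path)
    pred = re.search(r"max_pred_(\d+)", path)
    if seq is None or pred is None:
        raise ValueError("unexpected dataset directory name: " + path)
    return int(seq.group(1)), int(pred.group(1))


def check_args_match(path, max_seq_length, max_predictions_per_seq):
    """True when the command-line values agree with the directory name."""
    return parse_dataset_dir(path) == (max_seq_length, max_predictions_per_seq)


PHASE1_EXAMPLE = (
    "/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15"
    "_random_seed_12345_dupe_factor_5_shard_1472_test_split_10"
)

if __name__ == "__main__":
    print(parse_dataset_dir(PHASE1_EXAMPLE))
    print(check_args_match(PHASE1_EXAMPLE, 128, 20))
```

A check like this could run before launching a long job, so that a phase 2 command accidentally pointed at phase 1 data fails early.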
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints1 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
### Multiple cards
* Method 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE1} \
--output_dir=${OUTPUT_DIR}/checkpoints \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=16 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--max_steps=100000 \
--warmup_proportion=0.0 \
--num_steps_per_checkpoint=20000 \
--learning_rate=4.0e-4 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--json-summary dllogger.json
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_pretrain.sh assumes four cards per node by default
cd scripts; bash run_pretrain.sh
```
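The hostfile above lists two nodes with four slots each, which matches the script's default of four cards per node. As a rough illustration of how the total number of ranks follows from such a file, the sketch below parses the `host slots=N` lines and sums the slots. `parse_hostfile` is a hypothetical helper, not part of the repository, and treating a host without a `slots=` field as a single slot is an assumption made here for simplicity.

```python
def parse_hostfile(text):
    """Return {hostname: slots} for lines such as 'node1 slots=4'.

    Hypothetical helper for illustration only.
    """
    hosts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        slots = 1  # assumption: one slot when no slots= field is given
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field[len("slots="):])
        hosts[fields[0]] = slots
    return hosts


HOSTFILE = "node1 slots=4\nnode2 slots=4\n"

if __name__ == "__main__":
    hosts = parse_hostfile(HOSTFILE)
    print(hosts, "total ranks:", sum(hosts.values()))
```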
## pre-train phase2
### Single card
```
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
### Multiple cards
* Method 1
```
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
--input_dir=${PATH_PHRASE2} \
--output_dir=${OUTPUT_DIR}/checkpoints2 \
--config_file=${PATH_CONFIG}bert_config.json \
--bert_model=bert-large-uncased \
--train_batch_size=4 \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--max_steps=400000 \
--warmup_proportion=0.128 \
--num_steps_per_checkpoint=200000 \
--learning_rate=4e-3 \
--seed=12439 \
--gradient_accumulation_steps=1 \
--allreduce_post_accumulation \
--do_train \
--phase2 \
--phase1_end_step=0 \
--json-summary dllogger.json
```
* Method 2
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_pretrain2.sh assumes four cards per node by default
cd scripts; bash run_pretrain2.sh
```
## Fine-tune training
### Single card
```
python3 run_squad_v1.py \
--train_file squad/v1.1/train-v1.1.json \
--init_checkpoint model.ckpt-28252.pt \
--vocab_file vocab.txt \
--output_dir SQuAD \
--config_file bert_config.json \
--bert_model=bert-large-uncased \
--do_train \
--train_batch_size 1 \
--gpus_per_node 1
```
### Multiple cards
hostfile:
```
node1 slots=4
node2 slots=4
```
```
#scripts/run_squad_1.sh assumes four cards per node by default
bash run_squad_1.sh
```
# References
[https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch](https://github.com/mlperf/training_results_v0.7/blob/master/NVIDIA/benchmarks/bert/implementations/pytorch)
[https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT)
PyTorch/NLP/BERT/bert_per1_4_fp16.sh
0 → 100644
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4_fp16.sh
PyTorch/NLP/BERT/bert_pre1.sh
0 → 100644
#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1.sh
PyTorch/NLP/BERT/bert_pre1_4.sh
0 → 100644
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre1_4.sh
PyTorch/NLP/BERT/bert_pre1_fp16.sh
0 → 100644
#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre1_1_fp16.sh
PyTorch/NLP/BERT/bert_pre2.sh
0 → 100644
#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1.sh
PyTorch/NLP/BERT/bert_pre2_4.sh
0 → 100644
#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4.sh
PyTorch/NLP/BERT/bert_pre2_4_fp16.sh
0 → 100644
#!/bin/bash
export HIP_LAUNCH_BLOCKING=1
mpirun --allow-run-as-root -np 4 single_pre2_4_fp16.sh
PyTorch/NLP/BERT/bert_pre2_fp16.sh
0 → 100644
#!/bin/bash
mpirun --allow-run-as-root -np 1 single_pre2_1_fp16.sh
PyTorch/NLP/BERT/bert_squad.sh
0 → 100644
#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad.sh
PyTorch/NLP/BERT/bert_squad4.sh
0 → 100644
#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4.sh
PyTorch/NLP/BERT/bert_squad4_fp16.sh
0 → 100644
#!/bin/bash
#export LD_LIBRARY_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1/lib
mpirun --allow-run-as-root -np 4 single_squad4_fp16.sh
PyTorch/NLP/BERT/bert_squad_fp16.sh
0 → 100644
#!/bin/bash
mpirun --allow-run-as-root -np 1 single_squad_fp16.sh
PyTorch/NLP/BERT/run_pretraining_v4.py
0 → 100644
# coding=utf-8
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Copyright 2018 The Google AI Language Team Authors and The HugginFace Inc. team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

# ==================
import csv
import os
import time
import argparse
import random
import h5py
from tqdm import tqdm, trange
import os
import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Dataset
from torch.utils.data.distributed import DistributedSampler
import math
from apex import amp
import multiprocessing

from tokenization import BertTokenizer
import modeling
from apex.optimizers import FusedLAMB
from schedulers import PolyWarmUpScheduler

from file_utils import PYTORCH_PRETRAINED_BERT_CACHE
from utils import is_main_process, format_step, get_world_size, get_rank
from apex.parallel import DistributedDataParallel as DDP
from schedulers import LinearWarmUpScheduler
from apex.parallel.distributed import flat_dist_call
import amp_C
import apex_C
from apex.amp import _amp_state

import dllogger
from concurrent.futures import ProcessPoolExecutor
os.environ["HIP_VISIBLE_DEVICES"] = "0,1,2,3"
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)

skipped_steps = 0

# Track whether a SIGTERM (cluster time up) has been handled
timeout_sent = False

import signal
# handle SIGTERM sent from the scheduler and mark so we
# can gracefully save & exit
def signal_handler(sig, frame):
    global timeout_sent
    timeout_sent = True

signal.signal(signal.SIGTERM, signal_handler)
#Workaround because python functions are not picklable
class WorkerInitObj(object):
    def __init__(self, seed):
        self.seed = seed
    def __call__(self, id):
        np.random.seed(seed=self.seed + id)
        random.seed(self.seed + id)
def create_pretraining_dataset(input_file, max_pred_length, shared_list, args, worker_init):
    train_data = pretraining_dataset(input_file=input_file, max_pred_length=max_pred_length)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler,
                                  batch_size=args.train_batch_size * args.n_gpu,
                                  num_workers=1, worker_init_fn=worker_init,
                                  pin_memory=True)
    return train_dataloader, input_file
class pretraining_dataset(Dataset):

    def __init__(self, input_file, max_pred_length):
        self.input_file = input_file
        self.max_pred_length = max_pred_length
        f = h5py.File(input_file, "r")
        keys = ['input_ids', 'input_mask', 'segment_ids', 'masked_lm_positions', 'masked_lm_ids',
                'next_sentence_labels']
        self.inputs = [np.asarray(f[key][:]) for key in keys]
        f.close()

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.inputs[0])

    def __getitem__(self, index):

        [input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, next_sentence_labels] = [
            torch.from_numpy(input[index].astype(np.int64)) if indice < 5 else torch.from_numpy(
                np.asarray(input[index].astype(np.int64))) for indice, input in enumerate(self.inputs)]

        masked_lm_labels = torch.ones(input_ids.shape, dtype=torch.long) * -1
        index = self.max_pred_length
        # store number of masked tokens in index
        padded_mask_indices = (masked_lm_positions == 0).nonzero()
        if len(padded_mask_indices) != 0:
            index = padded_mask_indices[0].item()
        masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index]

        return [input_ids, segment_ids, input_mask,
                masked_lm_labels, next_sentence_labels]
class BertPretrainingCriterion(torch.nn.Module):
    def __init__(self, vocab_size):
        super(BertPretrainingCriterion, self).__init__()
        self.loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-1)
        self.vocab_size = vocab_size
    def forward(self, prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels):
        masked_lm_loss = self.loss_fn(prediction_scores.view(-1, self.vocab_size), masked_lm_labels.view(-1))
        next_sentence_loss = self.loss_fn(seq_relationship_score.view(-1, 2), next_sentence_labels.view(-1))
        total_loss = masked_lm_loss + next_sentence_loss
        return total_loss
def parse_arguments():

    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument("--input_dir",
                        default=None,
                        type=str,
                        required=True,
                        help="The input data dir. Should contain .hdf5 files for the task.")

    parser.add_argument("--config_file",
                        default=None,
                        type=str,
                        required=True,
                        help="The BERT model config")

    parser.add_argument("--bert_model", default="bert-large-uncased", type=str,
                        help="Bert pre-trained model selected in the list: bert-base-uncased, "
                             "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")

    parser.add_argument("--output_dir",
                        default=None,
                        type=str,
                        required=True,
                        help="The output directory where the model checkpoints will be written.")

    ## Other parameters
    parser.add_argument("--init_checkpoint",
                        default=None,
                        type=str,
                        help="The initial checkpoint to start training from.")

    parser.add_argument("--max_seq_length",
                        default=512,
                        type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. \n"
                             "Sequences longer than this will be truncated, and sequences shorter \n"
                             "than this will be padded.")
    parser.add_argument("--max_predictions_per_seq",
                        default=80,
                        type=int,
                        help="The maximum total of masked tokens in input sequence")
    parser.add_argument("--train_batch_size",
                        default=32,
                        type=int,
                        help="Total batch size for training.")
    parser.add_argument("--learning_rate",
                        default=5e-5,
                        type=float,
                        help="The initial learning rate for Adam.")
    parser.add_argument("--num_train_epochs",
                        default=3.0,
                        type=float,
                        help="Total number of training epochs to perform.")
    parser.add_argument("--max_steps",
                        default=1000,
                        type=float,
                        help="Total number of training steps to perform.")
    parser.add_argument("--warmup_proportion",
                        default=0.01,
                        type=float,
                        help="Proportion of training to perform linear learning rate warmup for. "
                             "E.g., 0.1 = 10%% of training.")
    parser.add_argument("--local_rank",
                        type=int,
                        default=os.getenv('LOCAL_RANK', -1),
                        help="local_rank for distributed training on gpus")
    parser.add_argument('--seed',
                        type=int,
                        default=42,
                        help="random seed for initialization")
    parser.add_argument('--gradient_accumulation_steps',
                        type=int,
                        default=1,
                        help="Number of updates steps to accumualte before performing a backward/update pass.")
    parser.add_argument('--fp16',
                        default=False,
                        action='store_true',
                        help="Mixed precision training")
    parser.add_argument('--amp',
                        default=False,
                        action='store_true',
                        help="Mixed precision training")
    parser.add_argument('--loss_scale',
                        type=float, default=0.0,
                        help='Loss scaling, positive power of 2 values can improve fp16 convergence.')
    parser.add_argument('--log_freq',
                        type=float, default=1.0,
                        help='frequency of logging loss.')
    parser.add_argument('--checkpoint_activations',
                        default=False,
                        action='store_true',
                        help="Whether to use gradient checkpointing")
    parser.add_argument("--resume_from_checkpoint",
                        default=False,
                        action='store_true',
                        help="Whether to resume training from checkpoint.")
    parser.add_argument('--resume_step',
                        type=int,
                        default=-1,
                        help="Step to resume training from.")
    parser.add_argument('--num_steps_per_checkpoint',
                        type=int,
                        default=100,
                        help="Number of update steps until a model checkpoint is saved to disk.")
    parser.add_argument('--skip_checkpoint',
                        default=False,
                        action='store_true',
                        help="Whether to save checkpoints")
    parser.add_argument('--phase2',
                        default=False,
                        action='store_true',
                        help="Whether to train with seq len 512")
    parser.add_argument('--allreduce_post_accumulation',
                        default=False,
                        action='store_true',
                        help="Whether to do allreduces during gradient accumulation steps.")
    parser.add_argument('--allreduce_post_accumulation_fp16',
                        default=False,
                        action='store_true',
                        help="Whether to do fp16 allreduce post accumulation.")
    parser.add_argument('--phase1_end_step',
                        type=int,
                        default=7038,
                        help="Number of training steps in Phase1 - seq len 128")
    parser.add_argument('--init_loss_scale',
                        type=int,
                        default=2**20,
                        help="Initial loss scaler value")
    parser.add_argument("--do_train",
                        default=False,
                        action='store_true',
                        help="Whether to run training.")
    parser.add_argument('--json-summary', type=str, default="results/dllogger.json",
                        help='If provided, the json summary will be written to'
                             'the specified file.')
    parser.add_argument("--use_env",
                        action='store_true',
                        help="Whether to read local rank from ENVVAR")
    parser.add_argument('--disable_progress_bar',
                        default=False,
                        action='store_true',
                        help='Disable tqdm progress bar')
    parser.add_argument('--steps_this_run', type=int, default=-1,
                        help='If provided, only run this many steps before exiting')
    parser.add_argument("--dist_url", default='tcp://224.66.41.62:23456', type=str,
                        help='url used to set up distributed training')
    parser.add_argument("--gpus_per_node", type=int, default=4, help='num of gpus per node')
    parser.add_argument("--world_size", type=int, default=1, help="number of process")

    args = parser.parse_args()
    args.fp16 = args.fp16 or args.amp

    if args.steps_this_run < 0:
        args.steps_this_run = args.max_steps

    return args
def setup_training(args):

    assert (torch.cuda.is_available())

    if args.local_rank == -1:
        device = torch.device("cuda")
        args.n_gpu = torch.cuda.device_count()
        args.allreduce_post_accumulation = False
        args.allreduce_post_accumulation_fp16 = False
    else:
        #torch.cuda.set_device(args.local_rank)
        #device = torch.device("cuda", args.local_rank)
        # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        #torch.distributed.init_process_group(backend='nccl', init_method='env://')
        #xuan
        device_n = args.local_rank % 4
        torch.cuda.set_device(device_n)
        device = torch.device("cuda", device_n)
        torch.distributed.init_process_group(backend='nccl', init_method=args.dist_url,
                                             world_size=args.world_size, rank=args.local_rank)
        args.n_gpu = 1

        if args.gradient_accumulation_steps == 1:
            args.allreduce_post_accumulation = False
            args.allreduce_post_accumulation_fp16 = False

    if is_main_process():
        dllogger.init(backends=[dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE,
                                                           filename=args.json_summary),
                                dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE, step_format=format_step)])
    else:
        dllogger.init(backends=[])

    print("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
        device, args.n_gpu, bool(args.local_rank != -1), args.fp16))

    if args.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
            args.gradient_accumulation_steps))
    if args.train_batch_size % args.gradient_accumulation_steps != 0:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, batch size {} should be divisible".format(
            args.gradient_accumulation_steps, args.train_batch_size))

    args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

    if not args.do_train:
        raise ValueError(" `do_train` must be True.")

    if not args.resume_from_checkpoint and os.path.exists(args.output_dir) and (
            os.listdir(args.output_dir) and any([i.startswith('ckpt') for i in os.listdir(args.output_dir)])):
        raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))

    if (not args.resume_from_checkpoint or not os.path.exists(args.output_dir)) and is_main_process():
        os.makedirs(args.output_dir, exist_ok=True)

    return device, args
def
prepare_model_and_optimizer
(
args
,
device
):
# Prepare model
config
=
modeling
.
BertConfig
.
from_json_file
(
args
.
config_file
)
# Padding for divisibility by 8
if
config
.
vocab_size
%
8
!=
0
:
config
.
vocab_size
+=
8
-
(
config
.
vocab_size
%
8
)
modeling
.
ACT2FN
[
"bias_gelu"
]
=
modeling
.
bias_gelu_training
model
=
modeling
.
BertForPreTraining
(
config
)
checkpoint
=
None
if
not
args
.
resume_from_checkpoint
:
global_step
=
0
else
:
if
args
.
resume_step
==
-
1
and
not
args
.
init_checkpoint
:
model_names
=
[
f
for
f
in
os
.
listdir
(
args
.
output_dir
)
if
f
.
endswith
(
".pt"
)]
args
.
resume_step
=
max
([
int
(
x
.
split
(
'.pt'
)[
0
].
split
(
'_'
)[
1
].
strip
())
for
x
in
model_names
])
global_step
=
args
.
resume_step
if
not
args
.
init_checkpoint
else
0
if
not
args
.
init_checkpoint
:
checkpoint
=
torch
.
load
(
os
.
path
.
join
(
args
.
output_dir
,
"ckpt_{}.pt"
.
format
(
global_step
)),
map_location
=
"cpu"
)
else
:
checkpoint
=
torch
.
load
(
args
.
init_checkpoint
,
map_location
=
"cpu"
)
model
.
load_state_dict
(
checkpoint
[
'model'
],
strict
=
False
)
if
args
.
phase2
and
not
args
.
init_checkpoint
:
global_step
-=
args
.
phase1_end_step
if
is_main_process
():
print
(
"resume step from "
,
args
.
resume_step
)
model
.
to
(
device
)
param_optimizer
=
list
(
model
.
named_parameters
())
no_decay
=
[
'bias'
,
'gamma'
,
'beta'
,
'LayerNorm'
]
optimizer_grouped_parameters
=
[
{
'params'
:
[
p
for
n
,
p
in
param_optimizer
if
not
any
(
nd
in
n
for
nd
in
no_decay
)],
'weight_decay'
:
0.01
},
{
'params'
:
[
p
for
n
,
p
in
param_optimizer
if
any
(
nd
in
n
for
nd
in
no_decay
)],
'weight_decay'
:
0.0
}]
optimizer
=
FusedLAMB
(
optimizer_grouped_parameters
,
lr
=
args
.
learning_rate
)
lr_scheduler
=
PolyWarmUpScheduler
(
optimizer
,
warmup
=
args
.
warmup_proportion
,
total_steps
=
args
.
max_steps
)
if
args
.
fp16
:
if
args
.
loss_scale
==
0
:
model
,
optimizer
=
amp
.
initialize
(
model
,
optimizer
,
opt_level
=
"O2"
,
loss_scale
=
"dynamic"
,
cast_model_outputs
=
torch
.
float16
)
else
:
model
,
optimizer
=
amp
.
initialize
(
model
,
optimizer
,
opt_level
=
"O2"
,
loss_scale
=
args
.
loss_scale
,
cast_model_outputs
=
torch
.
float16
)
amp
.
_amp_state
.
loss_scalers
[
0
].
_loss_scale
=
args
.
init_loss_scale
model
.
checkpoint_activations
(
args
.
checkpoint_activations
)
if
args
.
resume_from_checkpoint
:
if
args
.
phase2
or
args
.
init_checkpoint
:
keys
=
list
(
checkpoint
[
'optimizer'
][
'state'
].
keys
())
#Override hyperparameters from previous checkpoint
for
key
in
keys
:
checkpoint
[
'optimizer'
][
'state'
][
key
][
'step'
]
=
global_step
for
iter
,
item
in
enumerate
(
checkpoint
[
'optimizer'
][
'param_groups'
]):
checkpoint
[
'optimizer'
][
'param_groups'
][
iter
][
'step'
]
=
global_step
checkpoint
[
'optimizer'
][
'param_groups'
][
iter
][
't_total'
]
=
args
.
max_steps
checkpoint
[
'optimizer'
][
'param_groups'
][
iter
][
'warmup'
]
=
args
.
warmup_proportion
checkpoint
[
'optimizer'
][
'param_groups'
][
iter
][
'lr'
]
=
args
.
learning_rate
optimizer
.
load_state_dict
(
checkpoint
[
'optimizer'
])
# , strict=False)
# Restore AMP master parameters
if
args
.
fp16
:
optimizer
.
_lazy_init_maybe_master_weights
()
optimizer
.
_amp_stash
.
lazy_init_called
=
True
optimizer
.
load_state_dict
(
checkpoint
[
'optimizer'
])
for
param
,
saved_param
in
zip
(
amp
.
master_params
(
optimizer
),
checkpoint
[
'master params'
]):
param
.
data
.
copy_
(
saved_param
.
data
)
if
args
.
local_rank
!=
-
1
:
if
not
args
.
allreduce_post_accumulation
:
model
=
DDP
(
model
,
message_size
=
250000000
,
gradient_predivide_factor
=
get_world_size
())
else
:
flat_dist_call
([
param
.
data
for
param
in
model
.
parameters
()],
torch
.
distributed
.
broadcast
,
(
0
,)
)
elif
args
.
n_gpu
>
1
:
model
=
torch
.
nn
.
DataParallel
(
model
)
criterion
=
BertPretrainingCriterion
(
config
.
vocab_size
)
return
model
,
optimizer
,
lr_scheduler
,
checkpoint
,
global_step
,
criterion
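Note on the resume logic above: when no explicit step is given, the script takes the highest step among files named `ckpt_<step>.pt` in the output directory. A minimal standalone sketch of that lookup follows. The helper name `latest_checkpoint_step` is hypothetical and not part of the script, and the sketch assumes every `.pt` file in the directory follows the `ckpt_<step>.pt` pattern.

```python
import os
import tempfile


def latest_checkpoint_step(output_dir):
    """Return the largest <step> among files named ckpt_<step>.pt, or None if there are none."""
    steps = []
    for name in os.listdir(output_dir):
        if not name.endswith(".pt"):
            continue
        # "ckpt_1200.pt" -> "ckpt_1200" -> "1200"
        stem = name[: -len(".pt")]
        steps.append(int(stem.split("_")[1]))
    return max(steps) if steps else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for step in (100, 1200, 900):
            open(os.path.join(d, "ckpt_{}.pt".format(step)), "w").close()
        print(latest_checkpoint_step(d))
```

Returning `None` for an empty directory is a choice made for this sketch; the script itself calls `max()` on the list directly, which would raise on an empty directory.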
def
take_optimizer_step
(
args
,
optimizer
,
model
,
overflow_buf
,
global_step
):
global
skipped_steps
if
args
.
allreduce_post_accumulation
:
# manually allreduce gradients after all accumulation steps
# check for Inf/NaN
# 1. allocate an uninitialized buffer for flattened gradient
loss_scale
=
_amp_state
.
loss_scalers
[
0
].
loss_scale
()
if
args
.
fp16
else
1
master_grads
=
[
p
.
grad
for
p
in
amp
.
master_params
(
optimizer
)
if
p
.
grad
is
not
None
]
flat_grad_size
=
sum
(
p
.
numel
()
for
p
in
master_grads
)
allreduce_dtype
=
torch
.
float16
if
args
.
allreduce_post_accumulation_fp16
else
torch
.
float32
flat_raw
=
torch
.
empty
(
flat_grad_size
,
device
=
'cuda'
,
dtype
=
allreduce_dtype
)
# 2. combine unflattening and predivision of unscaled 'raw' gradient
allreduced_views
=
apex_C
.
unflatten
(
flat_raw
,
master_grads
)
overflow_buf
.
zero_
()
amp_C
.
multi_tensor_scale
(
65536
,
overflow_buf
,
[
master_grads
,
allreduced_views
],
loss_scale
/
(
get_world_size
()
*
args
.
gradient_accumulation_steps
))
# 3. sum gradient across ranks. Because of the predivision, this averages the gradient
torch
.
distributed
.
all_reduce
(
flat_raw
)
# 4. combine unscaling and unflattening of allreduced gradient
overflow_buf
.
zero_
()
amp_C
.
multi_tensor_scale
(
65536
,
overflow_buf
,
[
allreduced_views
,
master_grads
],
1.
/
loss_scale
)
# 5. update loss scale
if
args
.
fp16
:
scaler
=
_amp_state
.
loss_scalers
[
0
]
old_overflow_buf
=
scaler
.
_overflow_buf
scaler
.
_overflow_buf
=
overflow_buf
had_overflow
=
scaler
.
update_scale
()
scaler
.
_overflow_buf
=
old_overflow_buf
else
:
had_overflow
=
0
# 6. call optimizer step function
if
had_overflow
==
0
:
optimizer
.
step
()
global_step
+=
1
else
:
# Overflow detected, print message and clear gradients
skipped_steps
+=
1
if
is_main_process
():
scaler
=
_amp_state
.
loss_scalers
[
0
]
dllogger
.
log
(
step
=
"PARAMETER"
,
data
=
{
"loss_scale"
:
scaler
.
loss_scale
()})
if
_amp_state
.
opt_properties
.
master_weights
:
for
param
in
optimizer
.
_amp_stash
.
all_fp32_from_fp16_params
:
param
.
grad
=
None
for
param
in
model
.
parameters
():
param
.
grad
=
None
else
:
optimizer
.
step
()
#optimizer.zero_grad()
for
param
in
model
.
parameters
():
param
.
grad
=
None
global_step
+=
1
return
global_step
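Note on steps 2 to 4 of `take_optimizer_step`: they rely on a small piece of arithmetic. If every rank multiplies its accumulated gradient by `loss_scale / (world_size * gradient_accumulation_steps)` before the all-reduce, then summing across ranks and dividing by `loss_scale` yields the mean gradient. The sketch below checks this with plain Python floats standing in for per-rank gradient tensors. It does not use `torch.distributed` or apex, and the function name is illustrative only.

```python
def allreduce_with_predivision(per_rank_grads, accumulation_steps, loss_scale):
    """Simulate pre-divide, sum across ranks, then unscale, for scalar gradients."""
    world_size = len(per_rank_grads)
    # Each rank rescales its accumulated gradient before communication.
    prescaled = [g * loss_scale / (world_size * accumulation_steps) for g in per_rank_grads]
    # all_reduce with the default SUM op: every rank ends up with the same total.
    reduced = sum(prescaled)
    # Remove the loss scale again after the reduction.
    return reduced / loss_scale


if __name__ == "__main__":
    # Two ranks, each holding the sum of 4 micro-batch gradients.
    grads = [8.0, 16.0]
    print(allreduce_with_predivision(grads, accumulation_steps=4, loss_scale=1024.0))
```

With these numbers the result equals `(8 + 16) / (2 * 4)`, the average over both ranks and all micro-batches.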
def
main
():
global
timeout_sent
args
=
parse_arguments
()
random
.
seed
(
args
.
seed
+
args
.
local_rank
)
np
.
random
.
seed
(
args
.
seed
+
args
.
local_rank
)
torch
.
manual_seed
(
args
.
seed
+
args
.
local_rank
)
torch
.
cuda
.
manual_seed
(
args
.
seed
+
args
.
local_rank
)
worker_init
=
WorkerInitObj
(
args
.
seed
+
args
.
local_rank
)
device
,
args
=
setup_training
(
args
)
dllogger
.
log
(
step
=
"PARAMETER"
,
data
=
{
"Config"
:
[
str
(
args
)]})
# Prepare optimizer
model
,
optimizer
,
lr_scheduler
,
checkpoint
,
global_step
,
criterion
=
prepare_model_and_optimizer
(
args
,
device
)
if
is_main_process
():
dllogger
.
log
(
step
=
"PARAMETER"
,
data
=
{
"SEED"
:
args
.
seed
})
raw_train_start
=
None
if
args
.
do_train
:
if
is_main_process
():
dllogger
.
log
(
step
=
"PARAMETER"
,
data
=
{
"train_start"
:
True
})
dllogger
.
log
(
step
=
"PARAMETER"
,
data
=
{
"batch_size_per_gpu"
:
args
.
train_batch_size
})
dllogger
.
log
(
step
=
"PARAMETER"
,
data
=
{
"learning_rate"
:
args
.
learning_rate
})
model
.
train
()
most_recent_ckpts_paths
=
[]
average_loss
=
0.0
# averaged loss every args.log_freq steps
epoch
=
0
training_steps
=
0
pool
=
ProcessPoolExecutor
(
1
)
# Note: We loop infinitely over epochs, termination is handled via iteration count
while
True
:
thread
=
None
restored_data_loader
=
None
if
not
args
.
resume_from_checkpoint
or
epoch
>
0
or
(
args
.
phase2
and
global_step
<
1
)
or
args
.
init_checkpoint
:
files
=
[
os
.
path
.
join
(
args
.
input_dir
,
f
)
for
f
in
os
.
listdir
(
args
.
input_dir
)
if
os
.
path
.
isfile
(
os
.
path
.
join
(
args
.
input_dir
,
f
))
and
'training'
in
f
]
files
.
sort
()
num_files
=
len
(
files
)
random
.
Random
(
args
.
seed
+
epoch
).
shuffle
(
files
)
f_start_id
=
0
else
:
f_start_id
=
checkpoint
[
'files'
][
0
]
files
=
checkpoint
[
'files'
][
1
:]
args
.
resume_from_checkpoint
=
False
num_files
=
len
(
files
)
# may not exist in all checkpoints
epoch
=
checkpoint
.
get
(
'epoch'
,
0
)
restored_data_loader
=
checkpoint
.
get
(
'data_loader'
,
None
)
shared_file_list
=
{}
if
torch
.
distributed
.
is_initialized
()
and
get_world_size
()
>
num_files
:
remainder
=
get_world_size
()
%
num_files
data_file
=
files
[(
f_start_id
*
get_world_size
()
+
get_rank
()
+
remainder
*
f_start_id
)
%
num_files
]
else
:
data_file
=
files
[(
f_start_id
*
get_world_size
()
+
get_rank
())
%
num_files
]
previous_file
=
data_file
if
restored_data_loader
is
None
:
train_data
=
pretraining_dataset
(
data_file
,
args
.
max_predictions_per_seq
)
if
args
.
local_rank
==
-
1
:
train_sampler
=
RandomSampler
(
train_data
)
else
:
train_sampler
=
DistributedSampler
(
train_data
)
train_dataloader
=
DataLoader
(
train_data
,
sampler
=
train_sampler
,
batch_size
=
args
.
train_batch_size
*
args
.
n_gpu
,
num_workers
=
4
,
worker_init_fn
=
worker_init
,
pin_memory
=
True
)
# shared_file_list["0"] = (train_dataloader, data_file)
else
:
train_dataloader
=
restored_data_loader
restored_data_loader
=
None
overflow_buf
=
None
if
args
.
allreduce_post_accumulation
:
overflow_buf
=
torch
.
cuda
.
IntTensor
([
0
])
for
f_id
in
range
(
f_start_id
+
1
,
len
(
files
)):
if
get_world_size
()
>
num_files
:
data_file
=
files
[(
f_id
*
get_world_size
()
+
get_rank
()
+
remainder
*
f_id
)
%
num_files
]
else
:
data_file
=
files
[(
f_id
*
get_world_size
()
+
get_rank
())
%
num_files
]
previous_file
=
data_file
dataset_future
=
pool
.
submit
(
create_pretraining_dataset
,
data_file
,
args
.
max_predictions_per_seq
,
shared_file_list
,
args
,
worker_init
)
train_iter
=
tqdm
(
train_dataloader
,
desc
=
"Iteration"
,
disable
=
args
.
disable_progress_bar
)
if
is_main_process
()
else
train_dataloader
if
raw_train_start
is
None
:
raw_train_start
=
time
.
time
()
for
step
,
batch
in
enumerate
(
train_iter
):
training_steps
+=
1
batch
=
[
t
.
to
(
device
)
for
t
in
batch
]
input_ids
,
segment_ids
,
input_mask
,
masked_lm_labels
,
next_sentence_labels
=
batch
prediction_scores
,
seq_relationship_score
=
model
(
input_ids
=
input_ids
,
token_type_ids
=
segment_ids
,
attention_mask
=
input_mask
)
loss
=
criterion
(
prediction_scores
,
seq_relationship_score
,
masked_lm_labels
,
next_sentence_labels
)
if
args
.
n_gpu
>
1
:
loss
=
loss
.
mean
()
# mean() to average on multi-gpu.
divisor
=
args
.
gradient_accumulation_steps
if
args
.
gradient_accumulation_steps
>
1
:
if
not
args
.
allreduce_post_accumulation
:
# this division was merged into predivision
loss
=
loss
/
args
.
gradient_accumulation_steps
divisor
=
1.0
if
args
.
fp16
:
with
amp
.
scale_loss
(
loss
,
optimizer
,
delay_overflow_check
=
args
.
allreduce_post_accumulation
)
as
scaled_loss
:
scaled_loss
.
backward
()
else
:
loss
.
backward
()
average_loss
+=
loss
.
item
()
if
training_steps
%
args
.
gradient_accumulation_steps
==
0
:
lr_scheduler
.
step
()
# learning rate warmup
global_step
=
take_optimizer_step
(
args
,
optimizer
,
model
,
overflow_buf
,
global_step
)
if
global_step
>=
args
.
steps_this_run
or
timeout_sent
:
train_time_raw
=
time
.
time
()
-
raw_train_start
last_num_steps
=
int
(
training_steps
/
args
.
gradient_accumulation_steps
)
%
args
.
log_freq
last_num_steps
=
args
.
log_freq
if
last_num_steps
==
0
else
last_num_steps
average_loss
=
torch
.
tensor
(
average_loss
,
dtype
=
torch
.
float32
).
cuda
()
average_loss
=
average_loss
/
(
last_num_steps
*
divisor
)
if
(
torch
.
distributed
.
is_initialized
()):
average_loss
/=
get_world_size
()
torch
.
distributed
.
all_reduce
(
average_loss
)
final_loss
=
average_loss
.
item
()
if
is_main_process
():
dllogger
.
log
(
step
=
(
epoch
,
global_step
,
),
data
=
{
"final_loss"
:
final_loss
})
elif
training_steps
%
(
args
.
log_freq
*
args
.
gradient_accumulation_steps
)
==
0
:
if
is_main_process
():
dllogger
.
log
(
step
=
(
epoch
,
global_step
,
),
data
=
{
"average_loss"
:
average_loss
/
(
args
.
log_freq
*
divisor
),
"step_loss"
:
loss
.
item
()
*
args
.
gradient_accumulation_steps
/
divisor
,
"learning_rate"
:
optimizer
.
param_groups
[
0
][
'lr'
]})
average_loss
=
0
if
global_step
>=
args
.
steps_this_run
or
training_steps
%
(
args
.
num_steps_per_checkpoint
*
args
.
gradient_accumulation_steps
)
==
0
or
timeout_sent
:
if
is_main_process
()
and
not
args
.
skip_checkpoint
:
# Save a trained model
dllogger
.
log
(
step
=
"PARAMETER"
,
data
=
{
"checkpoint_step"
:
global_step
})
model_to_save
=
model
.
module
if
hasattr
(
model
,
'module'
)
else
model
# Only save the model itself
if
args
.
resume_step
<
0
or
not
args
.
phase2
:
output_save_file
=
os
.
path
.
join
(
args
.
output_dir
,
"ckpt_{}.pt"
.
format
(
global_step
))
else
:
output_save_file
=
os
.
path
.
join
(
args
.
output_dir
,
"ckpt_{}.pt"
.
format
(
global_step
+
args
.
phase1_end_step
))
if
args
.
do_train
:
torch
.
save
({
'model'
:
model_to_save
.
state_dict
(),
'optimizer'
:
optimizer
.
state_dict
(),
'master params'
:
list
(
amp
.
master_params
(
optimizer
)),
'files'
:
[
f_id
]
+
files
,
'epoch'
:
epoch
,
'data_loader'
:
None
if
global_step
>=
args
.
max_steps
else
train_dataloader
},
output_save_file
)
most_recent_ckpts_paths
.
append
(
output_save_file
)
if
len
(
most_recent_ckpts_paths
)
>
3
:
ckpt_to_be_removed
=
most_recent_ckpts_paths
.
pop
(
0
)
os
.
remove
(
ckpt_to_be_removed
)
# Exiting the training due to hitting max steps, or being sent a
# timeout from the cluster scheduler
if
global_step
>=
args
.
steps_this_run
or
timeout_sent
:
del
train_dataloader
# thread.join()
return
args
,
final_loss
,
train_time_raw
,
global_step
del
train_dataloader
# thread.join()
# Make sure pool has finished and switch train_dataloader
# NOTE: Will block until complete
train_dataloader
,
data_file
=
dataset_future
.
result
(
timeout
=
None
)
epoch
+=
1
if
__name__
==
"__main__"
:
now
=
time
.
time
()
args
,
final_loss
,
train_time_raw
,
global_step
=
main
()
gpu_count
=
args
.
n_gpu
global_step
+=
args
.
phase1_end_step
if
(
args
.
phase2
and
args
.
resume_step
>
0
)
else
0
if
args
.
resume_step
==
-
1
:
args
.
resume_step
=
0
if
torch
.
distributed
.
is_initialized
():
gpu_count
=
get_world_size
()
if
is_main_process
():
e2e_time
=
time
.
time
()
-
now
training_perf
=
args
.
train_batch_size
*
args
.
gradient_accumulation_steps
*
gpu_count
\
*
(
global_step
-
args
.
resume_step
+
skipped_steps
)
/
train_time_raw
dllogger
.
log
(
step
=
tuple
(),
data
=
{
"e2e_train_time"
:
e2e_time
,
"training_sequences_per_second"
:
training_perf
,
"final_loss"
:
final_loss
,
"raw_train_time"
:
train_time_raw
})
dllogger
.
flush
()
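Note on the final log entry of `run_pretraining_v4.py`: throughput is reported as sequences processed per second of raw training time. A small worked example of that formula follows. The function name and the numbers are illustrative assumptions, not measurements from this repository.

```python
def training_sequences_per_second(batch_size_per_gpu, accumulation_steps, gpu_count,
                                  global_step, resume_step, skipped_steps, train_time_raw):
    """Sequences per optimizer step, times steps taken in this run, over elapsed seconds."""
    sequences_per_step = batch_size_per_gpu * accumulation_steps * gpu_count
    # Steps skipped after an fp16 overflow still consumed data, so they are counted.
    steps_this_run = global_step - resume_step + skipped_steps
    return sequences_per_step * steps_this_run / train_time_raw


if __name__ == "__main__":
    # 16 sequences per GPU, 4 accumulation steps, 4 GPUs, 100 steps in 200 seconds.
    print(training_sequences_per_second(16, 4, 4, global_step=110, resume_step=10,
                                        skipped_steps=0, train_time_raw=200.0))
```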
PyTorch/NLP/BERT/run_squad_v4.py
0 → 100644
# coding=utf-8
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Run BERT on SQuAD."""
from
__future__
import
absolute_import
,
division
,
print_function
import
argparse
import
collections
import
json
import
logging
import
math
import
os
import
random
import
sys
from
io
import
open
import
numpy
as
np
import
torch
from
torch.utils.data
import
(
DataLoader
,
RandomSampler
,
SequentialSampler
,
TensorDataset
)
from
torch.utils.data.distributed
import
DistributedSampler
from
tqdm
import
tqdm
,
trange
from
apex
import
amp
from
schedulers
import
LinearWarmUpScheduler
from
file_utils
import
PYTORCH_PRETRAINED_BERT_CACHE
import
modeling
from
optimization
import
BertAdam
,
warmup_linear
from
tokenization
import
(
BasicTokenizer
,
BertTokenizer
,
whitespace_tokenize
)
from
utils
import
is_main_process
,
format_step
import
dllogger
,
time
os
.
environ
[
"HIP_VISIBLE_DEVICES"
]
=
"0,1,2,3"
torch
.
_C
.
_jit_set_profiling_mode
(
False
)
#torch._C._jit_set_profiling_executor(False)
if
sys
.
version_info
[
0
]
==
2
:
import
cPickle
as
pickle
else
:
import
pickle
logging
.
basicConfig
(
format
=
'%(asctime)s - %(levelname)s - %(name)s - %(message)s'
,
datefmt
=
'%m/%d/%Y %H:%M:%S'
,
level
=
logging
.
INFO
)
logger
=
logging
.
getLogger
(
__name__
)
class
SquadExample
(
object
):
"""
A single training/test example for the Squad dataset.
For examples without an answer, the start and end position are -1.
"""
def
__init__
(
self
,
qas_id
,
question_text
,
doc_tokens
,
orig_answer_text
=
None
,
start_position
=
None
,
end_position
=
None
,
is_impossible
=
None
):
self
.
qas_id
=
qas_id
self
.
question_text
=
question_text
self
.
doc_tokens
=
doc_tokens
self
.
orig_answer_text
=
orig_answer_text
self
.
start_position
=
start_position
self
.
end_position
=
end_position
self
.
is_impossible
=
is_impossible
def
__str__
(
self
):
return
self
.
__repr__
()
def
__repr__
(
self
):
s
=
""
s
+=
"qas_id: %s"
%
(
self
.
qas_id
)
s
+=
", question_text: %s"
%
(
self
.
question_text
)
s
+=
", doc_tokens: [%s]"
%
(
" "
.
join
(
self
.
doc_tokens
))
if
self
.
start_position
is
not
None
:
s
+=
", start_position: %d"
%
(
self
.
start_position
)
if
self
.
end_position
is
not
None
:
s
+=
", end_position: %d"
%
(
self
.
end_position
)
if
self
.
is_impossible
:
s
+=
", is_impossible: %r"
%
(
self
.
is_impossible
)
return
s
class
InputFeatures
(
object
):
"""A single set of features of data."""
def
__init__
(
self
,
unique_id
,
example_index
,
doc_span_index
,
tokens
,
token_to_orig_map
,
token_is_max_context
,
input_ids
,
input_mask
,
segment_ids
,
start_position
=
None
,
end_position
=
None
,
is_impossible
=
None
):
self
.
unique_id
=
unique_id
self
.
example_index
=
example_index
self
.
doc_span_index
=
doc_span_index
self
.
tokens
=
tokens
self
.
token_to_orig_map
=
token_to_orig_map
self
.
token_is_max_context
=
token_is_max_context
self
.
input_ids
=
input_ids
self
.
input_mask
=
input_mask
self
.
segment_ids
=
segment_ids
self
.
start_position
=
start_position
self
.
end_position
=
end_position
self
.
is_impossible
=
is_impossible
def
read_squad_examples
(
input_file
,
is_training
,
version_2_with_negative
):
"""Read a SQuAD json file into a list of SquadExample."""
with
open
(
input_file
,
"r"
,
encoding
=
'utf-8'
)
as
reader
:
input_data
=
json
.
load
(
reader
)[
"data"
]
def
is_whitespace
(
c
):
if
c
==
" "
or
c
==
"\t"
or
c
==
"\r"
or
c
==
"\n"
or
ord
(
c
)
==
0x202F
:
return
True
return
False
examples
=
[]
for
entry
in
input_data
:
for
paragraph
in
entry
[
"paragraphs"
]:
paragraph_text
=
paragraph
[
"context"
]
doc_tokens
=
[]
char_to_word_offset
=
[]
prev_is_whitespace
=
True
for
c
in
paragraph_text
:
if
is_whitespace
(
c
):
prev_is_whitespace
=
True
else
:
if
prev_is_whitespace
:
doc_tokens
.
append
(
c
)
else
:
doc_tokens
[
-
1
]
+=
c
prev_is_whitespace
=
False
char_to_word_offset
.
append
(
len
(
doc_tokens
)
-
1
)
for
qa
in
paragraph
[
"qas"
]:
qas_id
=
qa
[
"id"
]
question_text
=
qa
[
"question"
]
start_position
=
None
end_position
=
None
orig_answer_text
=
None
is_impossible
=
False
if
is_training
:
if
version_2_with_negative
:
is_impossible
=
qa
[
"is_impossible"
]
if
(
len
(
qa
[
"answers"
])
!=
1
)
and
(
not
is_impossible
):
raise
ValueError
(
"For training, each question should have exactly 1 answer."
)
if
not
is_impossible
:
answer
=
qa
[
"answers"
][
0
]
orig_answer_text
=
answer
[
"text"
]
answer_offset
=
answer
[
"answer_start"
]
answer_length
=
len
(
orig_answer_text
)
start_position
=
char_to_word_offset
[
answer_offset
]
end_position
=
char_to_word_offset
[
answer_offset
+
answer_length
-
1
]
# Only add answers where the text can be exactly recovered from the
# document. If this CAN'T happen it's likely due to weird Unicode
# stuff so we will just skip the example.
#
# Note that this means for training mode, every example is NOT
# guaranteed to be preserved.
actual_text
=
" "
.
join
(
doc_tokens
[
start_position
:(
end_position
+
1
)])
cleaned_answer_text
=
" "
.
join
(
whitespace_tokenize
(
orig_answer_text
))
if
actual_text
.
find
(
cleaned_answer_text
)
==
-
1
:
logger
.
warning
(
"Could not find answer: '%s' vs. '%s'"
,
actual_text
,
cleaned_answer_text
)
continue
else
:
start_position
=
-
1
end_position
=
-
1
orig_answer_text
=
""
example
=
SquadExample
(
qas_id
=
qas_id
,
question_text
=
question_text
,
doc_tokens
=
doc_tokens
,
orig_answer_text
=
orig_answer_text
,
start_position
=
start_position
,
end_position
=
end_position
,
is_impossible
=
is_impossible
)
examples
.
append
(
example
)
return
examples
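Note on `read_squad_examples`: SQuAD answer offsets are character-based, and the function maps them to word indices through `char_to_word_offset`. The sketch below reproduces that mapping on a short string so the lookup is easy to follow. It treats the same whitespace characters as the loop above, and the helper name `split_with_offsets` is made up for this illustration.

```python
def split_with_offsets(text):
    """Split text on whitespace and record, for every character, the index of its word."""
    whitespace = {" ", "\t", "\r", "\n", "\u202f"}
    doc_tokens = []
    char_to_word_offset = []
    prev_is_whitespace = True
    for c in text:
        if c in whitespace:
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(c)      # start a new word
            else:
                doc_tokens[-1] += c       # extend the current word
            prev_is_whitespace = False
        # A whitespace character maps to the word before it (-1 if there is none yet).
        char_to_word_offset.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word_offset


if __name__ == "__main__":
    context = "John Smith was born in 1895."
    answer_text = "1895"
    answer_start = context.index(answer_text)
    tokens, offsets = split_with_offsets(context)
    start = offsets[answer_start]
    end = offsets[answer_start + len(answer_text) - 1]
    print(tokens[start:end + 1])
```

The recovered word is `1895.` with the trailing period, which is why the script later refines the span at the WordPiece level.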
def
convert_examples_to_features
(
examples
,
tokenizer
,
max_seq_length
,
doc_stride
,
max_query_length
,
is_training
):
"""Converts a list of `SquadExample`s into a list of `InputFeatures`."""
unique_id
=
1000000000
features
=
[]
for
(
example_index
,
example
)
in
enumerate
(
examples
):
query_tokens
=
tokenizer
.
tokenize
(
example
.
question_text
)
if
len
(
query_tokens
)
>
max_query_length
:
query_tokens
=
query_tokens
[
0
:
max_query_length
]
tok_to_orig_index
=
[]
orig_to_tok_index
=
[]
all_doc_tokens
=
[]
for
(
i
,
token
)
in
enumerate
(
example
.
doc_tokens
):
orig_to_tok_index
.
append
(
len
(
all_doc_tokens
))
sub_tokens
=
tokenizer
.
tokenize
(
token
)
for
sub_token
in
sub_tokens
:
tok_to_orig_index
.
append
(
i
)
all_doc_tokens
.
append
(
sub_token
)
tok_start_position
=
None
tok_end_position
=
None
if
is_training
and
example
.
is_impossible
:
tok_start_position
=
-
1
tok_end_position
=
-
1
if
is_training
and
not
example
.
is_impossible
:
tok_start_position
=
orig_to_tok_index
[
example
.
start_position
]
if
example
.
end_position
<
len
(
example
.
doc_tokens
)
-
1
:
tok_end_position
=
orig_to_tok_index
[
example
.
end_position
+
1
]
-
1
else
:
tok_end_position
=
len
(
all_doc_tokens
)
-
1
(
tok_start_position
,
tok_end_position
)
=
_improve_answer_span
(
all_doc_tokens
,
tok_start_position
,
tok_end_position
,
tokenizer
,
example
.
orig_answer_text
)
# The -3 accounts for [CLS], [SEP] and [SEP]
max_tokens_for_doc
=
max_seq_length
-
len
(
query_tokens
)
-
3
# We can have documents that are longer than the maximum sequence length.
# To deal with this we do a sliding window approach, where we take chunks
# of up to our max length with a stride of `doc_stride`.
_DocSpan
=
collections
.
namedtuple
(
# pylint: disable=invalid-name
"DocSpan"
,
[
"start"
,
"length"
])
doc_spans
=
[]
start_offset
=
0
while
start_offset
<
len
(
all_doc_tokens
):
length
=
len
(
all_doc_tokens
)
-
start_offset
if
length
>
max_tokens_for_doc
:
length
=
max_tokens_for_doc
doc_spans
.
append
(
_DocSpan
(
start
=
start_offset
,
length
=
length
))
if
start_offset
+
length
==
len
(
all_doc_tokens
):
break
start_offset
+=
min
(
length
,
doc_stride
)
for
(
doc_span_index
,
doc_span
)
in
enumerate
(
doc_spans
):
tokens
=
[]
token_to_orig_map
=
{}
token_is_max_context
=
{}
segment_ids
=
[]
tokens
.
append
(
"[CLS]"
)
segment_ids
.
append
(
0
)
for
token
in
query_tokens
:
tokens
.
append
(
token
)
segment_ids
.
append
(
0
)
tokens
.
append
(
"[SEP]"
)
segment_ids
.
append
(
0
)
for
i
in
range
(
doc_span
.
length
):
split_token_index
=
doc_span
.
start
+
i
token_to_orig_map
[
len
(
tokens
)]
=
tok_to_orig_index
[
split_token_index
]
is_max_context
=
_check_is_max_context
(
doc_spans
,
doc_span_index
,
split_token_index
)
token_is_max_context
[
len
(
tokens
)]
=
is_max_context
tokens
.
append
(
all_doc_tokens
[
split_token_index
])
segment_ids
.
append
(
1
)
tokens
.
append
(
"[SEP]"
)
segment_ids
.
append
(
1
)
input_ids
=
tokenizer
.
convert_tokens_to_ids
(
tokens
)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask
=
[
1
]
*
len
(
input_ids
)
# Zero-pad up to the sequence length.
while
len
(
input_ids
)
<
max_seq_length
:
input_ids
.
append
(
0
)
input_mask
.
append
(
0
)
segment_ids
.
append
(
0
)
assert
len
(
input_ids
)
==
max_seq_length
assert
len
(
input_mask
)
==
max_seq_length
assert
len
(
segment_ids
)
==
max_seq_length
start_position
=
None
end_position
=
None
if
is_training
and
not
example
.
is_impossible
:
# For training, if our document chunk does not contain an annotation
# we throw it out, since there is nothing to predict.
doc_start
=
doc_span
.
start
doc_end
=
doc_span
.
start
+
doc_span
.
length
-
1
out_of_span
=
False
if
not
(
tok_start_position
>=
doc_start
and
tok_end_position
<=
doc_end
):
out_of_span
=
True
if
out_of_span
:
start_position
=
0
end_position
=
0
else
:
doc_offset
=
len
(
query_tokens
)
+
2
start_position
=
tok_start_position
-
doc_start
+
doc_offset
end_position
=
tok_end_position
-
doc_start
+
doc_offset
if
is_training
and
example
.
is_impossible
:
start_position
=
0
end_position
=
0
features
.
append
(
InputFeatures
(
unique_id
=
unique_id
,
example_index
=
example_index
,
doc_span_index
=
doc_span_index
,
tokens
=
tokens
,
token_to_orig_map
=
token_to_orig_map
,
token_is_max_context
=
token_is_max_context
,
input_ids
=
input_ids
,
input_mask
=
input_mask
,
segment_ids
=
segment_ids
,
start_position
=
start_position
,
end_position
=
end_position
,
is_impossible
=
example
.
is_impossible
))
unique_id
+=
1
return
features
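Note on the sliding window in `convert_examples_to_features`: documents longer than the model input are covered by overlapping spans of at most `max_tokens_for_doc` tokens, each starting `doc_stride` tokens after the previous one. A minimal sketch of just that span computation follows. The function name `make_doc_spans` is an assumption for this example, and it expects `doc_stride` to be positive.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def make_doc_spans(num_tokens, max_tokens_for_doc, doc_stride):
    """Cover num_tokens with windows of at most max_tokens_for_doc tokens, advancing by doc_stride."""
    doc_spans = []
    start_offset = 0
    while start_offset < num_tokens:
        length = min(num_tokens - start_offset, max_tokens_for_doc)
        doc_spans.append(DocSpan(start=start_offset, length=length))
        if start_offset + length == num_tokens:
            break  # this window reaches the end of the document
        start_offset += min(length, doc_stride)
    return doc_spans


if __name__ == "__main__":
    for span in make_doc_spans(num_tokens=12, max_tokens_for_doc=5, doc_stride=3):
        print(span)
```

A stride smaller than the window length makes consecutive spans overlap, so an answer near a window boundary still appears with context in at least one span.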
def
_improve_answer_span
(
doc_tokens
,
input_start
,
input_end
,
tokenizer
,
orig_answer_text
):
"""Returns tokenized answer spans that better match the annotated answer."""
# The SQuAD annotations are character based. We first project them to
# whitespace-tokenized words. But then after WordPiece tokenization, we can
# often find a "better match". For example:
#
# Question: What year was John Smith born?
# Context: The leader was John Smith (1895-1943).
# Answer: 1895
#
# The original whitespace-tokenized answer will be "(1895-1943).". However
# after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match
# the exact answer, 1895.
#
# However, this is not always possible. Consider the following:
#
# Question: What country is the top exporter of electronics?
# Context: The Japanese electronics industry is the largest in the world.
# Answer: Japan
#
# In this case, the annotator chose "Japan" as a character sub-span of
# the word "Japanese". Since our WordPiece tokenizer does not split
# "Japanese", we just use "Japanese" as the annotation. This is fairly rare
# in SQuAD, but does happen.
tok_answer_text
=
" "
.
join
(
tokenizer
.
tokenize
(
orig_answer_text
))
for
new_start
in
range
(
input_start
,
input_end
+
1
):
for
new_end
in
range
(
input_end
,
new_start
-
1
,
-
1
):
text_span
=
" "
.
join
(
doc_tokens
[
new_start
:(
new_end
+
1
)])
if
text_span
==
tok_answer_text
:
return
(
new_start
,
new_end
)
return
(
input_start
,
input_end
)
def
_check_is_max_context
(
doc_spans
,
cur_span_index
,
position
):
"""Check if this is the 'max context' doc span for the token."""
# Because of the sliding window approach taken to scoring documents, a single
# token can appear in multiple document spans. E.g.
# Doc: the man went to the store and bought a gallon of milk
# Span A: the man went to the
# Span B: to the store and bought
# Span C: and bought a gallon of
# ...
#
# Now the word 'bought' will have two scores from spans B and C. We only
# want to consider the score with "maximum context", which we define as
# the *minimum* of its left and right context (the *sum* of left and
# right context will always be the same, of course).
#
# In the example the maximum context for 'bought' would be span C since
# it has 1 left context and 3 right context, while span B has 4 left context
# and 0 right context.
best_score
=
None
best_span_index
=
None
for
(
span_index
,
doc_span
)
in
enumerate
(
doc_spans
):
end
=
doc_span
.
start
+
doc_span
.
length
-
1
if
position
<
doc_span
.
start
:
continue
if
position
>
end
:
continue
num_left_context
=
position
-
doc_span
.
start
num_right_context
=
end
-
position
score
=
min
(
num_left_context
,
num_right_context
)
+
0.01
*
doc_span
.
length
if
best_score
is
None
or
score
>
best_score
:
best_score
=
score
best_span_index
=
span_index
return
cur_span_index
==
best_span_index
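Note on `_check_is_max_context`: the 'bought' example from the comments can be checked directly. The sketch below recomputes the score `min(left, right) + 0.01 * length` for each span and returns the index of the winner. The helper name `max_context_span` and the span boundaries chosen for A, B and C are assumptions that match the example sentence, not code from the script.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def max_context_span(doc_spans, position):
    """Return the index of the span in which `position` has the most surrounding context."""
    best_score, best_index = None, None
    for index, span in enumerate(doc_spans):
        end = span.start + span.length - 1
        if position < span.start or position > end:
            continue  # this span does not contain the token
        left, right = position - span.start, end - position
        # Same score as the function above, including the small length bonus.
        score = min(left, right) + 0.01 * span.length
        if best_score is None or score > best_score:
            best_score, best_index = score, index
    return best_index


if __name__ == "__main__":
    words = "the man went to the store and bought a gallon of milk".split()
    spans = [DocSpan(0, 5), DocSpan(3, 5), DocSpan(6, 5)]  # spans A, B, C
    print(max_context_span(spans, words.index("bought")))
```

For 'bought' (position 7) span B scores 0.05 and span C scores 1.05, so the function returns index 2, which is span C.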
RawResult
=
collections
.
namedtuple
(
"RawResult"
,
[
"unique_id"
,
"start_logits"
,
"end_logits"
])
def
get_answers
(
examples
,
features
,
results
,
args
):
predictions
=
collections
.
defaultdict
(
list
)
# One example may correspond to multiple features (one per document span)
Prediction
=
collections
.
namedtuple
(
'Prediction'
,
[
'text'
,
'start_logit'
,
'end_logit'
])
if
args
.
version_2_with_negative
:
null_vals
=
collections
.
defaultdict
(
lambda
:
(
float
(
"inf"
),
0
,
0
))
for
ex
,
feat
,
result
in
match_results
(
examples
,
features
,
results
):
start_indices
=
_get_best_indices
(
result
.
start_logits
,
args
.
n_best_size
)
end_indices
=
_get_best_indices
(
result
.
end_logits
,
args
.
n_best_size
)
prelim_predictions
=
get_valid_prelim_predictions
(
start_indices
,
end_indices
,
feat
,
result
,
args
)
prelim_predictions
=
sorted
(
prelim_predictions
,
key
=
lambda
x
:
(
x
.
start_logit
+
x
.
end_logit
),
reverse
=
True
)
if
args
.
version_2_with_negative
:
score
=
result
.
start_logits
[
0
]
+
result
.
end_logits
[
0
]
if
score
<
null_vals
[
ex
.
qas_id
][
0
]:
null_vals
[
ex
.
qas_id
]
=
(
score
,
result
.
start_logits
[
0
],
result
.
end_logits
[
0
])
curr_predictions
=
[]
seen_predictions
=
[]
for
pred
in
prelim_predictions
:
if
len
(
curr_predictions
)
==
args
.
n_best_size
:
break
if
pred
.
start_index
>
0
:
# this is a non-null prediction TODO: this probably is irrelevant
final_text
=
get_answer_text
(
ex
,
feat
,
pred
,
args
)
if
final_text
in
seen_predictions
:
continue
else
:
final_text
=
""
seen_predictions
.
append
(
final_text
)
curr_predictions
.
append
(
Prediction
(
final_text
,
pred
.
start_logit
,
pred
.
end_logit
))
predictions
[
ex
.
qas_id
]
+=
curr_predictions
#Add empty prediction
if
args
.
version_2_with_negative
:
for
qas_id
in
predictions
.
keys
():
predictions
[
qas_id
].
append
(
Prediction
(
''
,
null_vals
[
qas_id
][
1
],
null_vals
[
qas_id
][
2
]))
nbest_answers
=
collections
.
defaultdict
(
list
)
answers
=
{}
for
qas_id
,
preds
in
predictions
.
items
():
nbest
=
sorted
(
preds
,
key
=
lambda
x
:
(
x
.
start_logit
+
x
.
end_logit
),
reverse
=
True
)[:
args
.
n_best_size
]
# In very rare edge cases we could have only a single null prediction.
# So we just create a nonce prediction in this case to avoid failure.
if
not
nbest
:
nbest
.
append
(
Prediction
(
text
=
"empty"
,
start_logit
=
0.0
,
end_logit
=
0.0
))
total_scores
=
[]
best_non_null_entry
=
None
for
entry
in
nbest
:
total_scores
.
append
(
entry
.
start_logit
+
entry
.
end_logit
)
if
not
best_non_null_entry
and
entry
.
text
:
best_non_null_entry
=
entry
probs
=
_compute_softmax
(
total_scores
)
for
(
i
,
entry
)
in
enumerate
(
nbest
):
output
=
collections
.
OrderedDict
()
output
[
"text"
]
=
entry
.
text
output
[
"probability"
]
=
probs
[
i
]
output
[
"start_logit"
]
=
entry
.
start_logit
output
[
"end_logit"
]
=
entry
.
end_logit
nbest_answers
[
qas_id
].
append
(
output
)
if
args
.
version_2_with_negative
:
score_diff
=
null_vals
[
qas_id
][
0
]
-
best_non_null_entry
.
start_logit
-
best_non_null_entry
.
end_logit
if
score_diff
>
args
.
null_score_diff_threshold
:
answers
[
qas_id
]
=
""
else
:
answers
[
qas_id
]
=
best_non_null_entry
.
text
else
:
answers
[
qas_id
]
=
nbest_answers
[
qas_id
][
0
][
'text'
]
return
answers
,
nbest_answers
def
get_answer_text
(
example
,
feature
,
pred
,
args
):
tok_tokens
=
feature
.
tokens
[
pred
.
start_index
:(
pred
.
end_index
+
1
)]
orig_doc_start
=
feature
.
token_to_orig_map
[
pred
.
start_index
]
orig_doc_end
=
feature
.
token_to_orig_map
[
pred
.
end_index
]
orig_tokens
=
example
.
doc_tokens
[
orig_doc_start
:(
orig_doc_end
+
1
)]
tok_text
=
" "
.
join
(
tok_tokens
)
# De-tokenize WordPieces that have been split off.
tok_text
=
tok_text
.
replace
(
" ##"
,
""
)
tok_text
=
tok_text
.
replace
(
"##"
,
""
)
# Clean whitespace
tok_text
=
tok_text
.
strip
()
tok_text
=
" "
.
join
(
tok_text
.
split
())
orig_text
=
" "
.
join
(
orig_tokens
)
final_text
=
get_final_text
(
tok_text
,
orig_text
,
args
.
do_lower_case
,
args
.
verbose_logging
)
return
final_text
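Note on `get_answer_text`: before aligning with the original text, the predicted WordPiece tokens are merged back into words by removing the `##` continuation markers. A standalone sketch of that step follows. The function name `join_wordpieces` is hypothetical, and the tokens in the usage example are invented for illustration.

```python
def join_wordpieces(tok_tokens):
    """Merge WordPiece tokens back into whitespace-separated words."""
    text = " ".join(tok_tokens)
    # A "##" prefix marks a piece that continues the previous token.
    text = text.replace(" ##", "")
    text = text.replace("##", "")
    # Collapse any repeated whitespace.
    return " ".join(text.strip().split())


if __name__ == "__main__":
    print(join_wordpieces(["un", "##believ", "##able", "story"]))
```

The result is still lower-cased and normalized like the tokenizer output, which is why the script then passes it through `get_final_text` to recover the original casing and punctuation.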
def
get_valid_prelim_predictions
(
start_indices
,
end_indices
,
feature
,
result
,
args
):
_PrelimPrediction
=
collections
.
namedtuple
(
"PrelimPrediction"
,
[
"start_index"
,
"end_index"
,
"start_logit"
,
"end_logit"
])
prelim_predictions
=
[]
for
start_index
in
start_indices
:
for
end_index
in
end_indices
:
if
start_index
>=
len
(
feature
.
tokens
):
continue
if
end_index
>=
len
(
feature
.
tokens
):
continue
if
start_index
not
in
feature
.
token_to_orig_map
:
continue
if
end_index
not
in
feature
.
token_to_orig_map
:
continue
if
not
feature
.
token_is_max_context
.
get
(
start_index
,
False
):
continue
if
end_index
<
start_index
:
continue
length
=
end_index
-
start_index
+
1
if
length
>
args
.
max_answer_length
:
continue
prelim_predictions
.
append
(
_PrelimPrediction
(
start_index
=
start_index
,
end_index
=
end_index
,
start_logit
=
result
.
start_logits
[
start_index
],
end_logit
=
result
.
end_logits
[
end_index
]))
return
prelim_predictions
def
match_results
(
examples
,
features
,
results
):
unique_f_ids
=
set
([
f
.
unique_id
for
f
in
features
])
unique_r_ids
=
set
([
r
.
unique_id
for
r
in
results
])
matching_ids
=
unique_f_ids
&
unique_r_ids
features
=
[
f
for
f
in
features
if
f
.
unique_id
in
matching_ids
]
results
=
[
r
for
r
in
results
if
r
.
unique_id
in
matching_ids
]
features
.
sort
(
key
=
lambda
x
:
x
.
unique_id
)
results
.
sort
(
key
=
lambda
x
:
x
.
unique_id
)
for
f
,
r
in
zip
(
features
,
results
):
#original code assumes strict ordering of examples. TODO: rewrite this
yield
examples
[
f
.
example_index
],
f
,
r
def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False):
    """Project the tokenized prediction back to the original text."""

    # When we created the data, we kept track of the alignment between original
    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So
    # now `orig_text` contains the span of our original text corresponding to the
    # span that we predicted.
    #
    # However, `orig_text` may contain extra characters that we don't want in
    # our prediction.
    #
    # For example, let's say:
    #   pred_text = steve smith
    #   orig_text = Steve Smith's
    #
    # We don't want to return `orig_text` because it contains the extra "'s".
    #
    # We don't want to return `pred_text` because it's already been normalized
    # (the SQuAD eval script also does punctuation stripping/lower casing but
    # our tokenizer does additional normalization like stripping accent
    # characters).
    #
    # What we really want to return is "Steve Smith".
    #
    # Therefore, we have to apply a semi-complicated alignment heuristic between
    # `pred_text` and `orig_text` to get a character-to-character alignment. This
    # can fail in certain cases in which case we just return `orig_text`.

    def _strip_spaces(text):
        ns_chars = []
        ns_to_s_map = collections.OrderedDict()
        for (i, c) in enumerate(text):
            if c == " ":
                continue
            ns_to_s_map[len(ns_chars)] = i
            ns_chars.append(c)
        ns_text = "".join(ns_chars)
        return (ns_text, ns_to_s_map)

    # We first tokenize `orig_text`, strip whitespace from the result
    # and `pred_text`, and check if they are the same length. If they are
    # NOT the same length, the heuristic has failed. If they are the same
    # length, we assume the characters are one-to-one aligned.

    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)

    tok_text = " ".join(tokenizer.tokenize(orig_text))

    start_position = tok_text.find(pred_text)
    if start_position == -1:
        if verbose_logging:
            logger.info(
                "Unable to find text: '%s' in '%s'" % (pred_text, orig_text))
        return orig_text
    end_position = start_position + len(pred_text) - 1

    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)
    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)

    if len(orig_ns_text) != len(tok_ns_text):
        if verbose_logging:
            logger.info("Length not equal after stripping spaces: '%s' vs '%s'",
                        orig_ns_text, tok_ns_text)
        return orig_text

    # We then project the characters in `pred_text` back to `orig_text` using
    # the character-to-character alignment.
    tok_s_to_ns_map = {}
    for (i, tok_index) in tok_ns_to_s_map.items():
        tok_s_to_ns_map[tok_index] = i

    orig_start_position = None
    if start_position in tok_s_to_ns_map:
        ns_start_position = tok_s_to_ns_map[start_position]
        if ns_start_position in orig_ns_to_s_map:
            orig_start_position = orig_ns_to_s_map[ns_start_position]

    if orig_start_position is None:
        if verbose_logging:
            logger.info("Couldn't map start position")
        return orig_text

    orig_end_position = None
    if end_position in tok_s_to_ns_map:
        ns_end_position = tok_s_to_ns_map[end_position]
        if ns_end_position in orig_ns_to_s_map:
            orig_end_position = orig_ns_to_s_map[ns_end_position]

    if orig_end_position is None:
        if verbose_logging:
            logger.info("Couldn't map end position")
        return orig_text

    output_text = orig_text[orig_start_position:(orig_end_position + 1)]
    return output_text
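
The alignment heuristic in `get_final_text` can be reproduced without the tokenizer. The following is a minimal sketch under stated assumptions: `strip_spaces` and `project` are hypothetical helper names, and the tokenized string `"steve smith ' s"` is supplied by hand as an assumed stand-in for what `BasicTokenizer` would return for `Steve Smith's` with lower-casing enabled.

```python
import collections


def strip_spaces(text):
    """Drop spaces and remember the original index of every kept character."""
    ns_chars = []
    ns_to_s_map = collections.OrderedDict()
    for i, c in enumerate(text):
        if c == " ":
            continue
        ns_to_s_map[len(ns_chars)] = i
        ns_chars.append(c)
    return "".join(ns_chars), ns_to_s_map


def project(pred_text, tok_text, orig_text):
    """Map a span found in tok_text back to the matching span of orig_text."""
    start = tok_text.find(pred_text)
    if start == -1:
        return orig_text  # prediction not found: fall back to the original
    end = start + len(pred_text) - 1
    orig_ns, orig_map = strip_spaces(orig_text)
    tok_ns, tok_map = strip_spaces(tok_text)
    if len(orig_ns) != len(tok_ns):
        return orig_text  # heuristic failed: characters are not 1:1 aligned
    tok_s_to_ns = {s: ns for ns, s in tok_map.items()}
    if start not in tok_s_to_ns or end not in tok_s_to_ns:
        return orig_text
    return orig_text[orig_map[tok_s_to_ns[start]]:orig_map[tok_s_to_ns[end]] + 1]


# Hand-written tokenized text; a real run would obtain it from the tokenizer.
result = project("steve smith", "steve smith ' s", "Steve Smith's")
print(result)
```

The projection keeps the original casing while dropping the trailing `'s`, which is the behaviour the comment block in `get_final_text` describes.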
def _get_best_indices(logits, n_best_size):
    """Get the n-best logits from a list."""
    index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)

    best_indices = []
    for i in range(len(index_and_score)):
        if i >= n_best_size:
            break
        best_indices.append(index_and_score[i][0])
    return best_indices
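
As a quick illustration of the n-best selection above, here is an equivalent standalone sketch. `best_indices` is a hypothetical name and the logits are made-up values.

```python
def best_indices(logits, n_best_size):
    """Return the indices of the n_best_size largest logits, best first."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in ranked[:n_best_size]]


# Made-up logits; positions 1 and 3 tie for the highest score.
top = best_indices([0.1, 2.5, -1.0, 2.5, 0.7], 3)
print(top)
```

Python's sort is stable even with `reverse=True`, so tied logits keep their original relative order.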
def _compute_softmax(scores):
    """Compute softmax probability over raw logits."""
    if not scores:
        return []

    max_score = None
    for score in scores:
        if max_score is None or score > max_score:
            max_score = score

    exp_scores = []
    total_sum = 0.0
    for score in scores:
        x = math.exp(score - max_score)
        exp_scores.append(x)
        total_sum += x

    probs = []
    for score in exp_scores:
        probs.append(score / total_sum)
    return probs
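
`_compute_softmax` subtracts the maximum score before exponentiating. The sketch below, with the hypothetical name `stable_softmax`, shows why that matters: the shifted version handles logits for which a direct `math.exp` call would overflow.

```python
import math


def stable_softmax(scores):
    """Softmax that shifts by the maximum score to avoid overflow."""
    if not scores:
        return []
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


# math.exp(1000.0) alone raises OverflowError; the shifted form does not.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
print(probs)
```

Shifting all scores by a constant leaves the resulting probabilities unchanged, so the subtraction is purely a numerical safeguard.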
from apex.multi_tensor_apply import multi_tensor_applier
class GradientClipper:
    """
    Clips gradient norm of an iterable of parameters.
    """
    def __init__(self, max_grad_norm):
        self.max_norm = max_grad_norm
        if multi_tensor_applier.available:
            import amp_C
            self._overflow_buf = torch.cuda.IntTensor([0])
            self.multi_tensor_l2norm = amp_C.multi_tensor_l2norm
            self.multi_tensor_scale = amp_C.multi_tensor_scale
        else:
            raise RuntimeError('Gradient clipping requires cuda extensions')

    def step(self, parameters):
        l = [p.grad for p in parameters if p.grad is not None]
        total_norm, _ = multi_tensor_applier(self.multi_tensor_l2norm, self._overflow_buf, [l], False)
        total_norm = total_norm.item()
        if (total_norm == float('inf')):
            return
        clip_coef = self.max_norm / (total_norm + 1e-6)
        if clip_coef < 1:
            multi_tensor_applier(self.multi_tensor_scale, self._overflow_buf, [l, l], clip_coef)
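
The arithmetic behind `GradientClipper.step` can be shown without apex or a GPU. This is a pure-Python sketch of global-norm clipping, not the fused kernel path used above: `clip_by_global_norm` is a hypothetical name and gradients are modelled as plain lists of floats.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if math.isinf(total_norm):
        return grads, total_norm  # overflow: leave the gradients untouched
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1:
        grads = [[g * clip_coef for g in vec] for vec in grads]
    return grads, total_norm


# Two "parameters" whose joint gradient norm is 5.0.
clipped, norm = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=1.0)
print(norm, clipped)
```

The norm is computed over all parameters jointly, so every gradient is scaled by the same coefficient and the update direction is preserved.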
def main():
    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument("--bert_model", default=None, type=str, required=True,
                        help="Bert pre-trained model selected in the list: bert-base-uncased, "
                             "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, "
                             "bert-base-multilingual-cased, bert-base-chinese.")
    parser.add_argument("--output_dir", default=None, type=str, required=True,
                        help="The output directory where the model checkpoints and predictions will be written.")
    parser.add_argument("--init_checkpoint",
                        default=None,
                        type=str,
                        required=True,
                        help="The checkpoint file from pretraining")

    ## Other parameters
    parser.add_argument("--train_file", default=None, type=str, help="SQuAD json for training. E.g., train-v1.1.json")
    parser.add_argument("--predict_file", default=None, type=str,
                        help="SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
    parser.add_argument("--max_seq_length", default=384, type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. Sequences "
                             "longer than this will be truncated, and sequences shorter than this will be padded.")
    parser.add_argument("--doc_stride", default=128, type=int,
                        help="When splitting up a long document into chunks, how much stride to take between chunks.")
    parser.add_argument("--max_query_length", default=64, type=int,
                        help="The maximum number of tokens for the question. Questions longer than this will "
                             "be truncated to this length.")
    parser.add_argument("--do_train", action='store_true', help="Whether to run training.")
    parser.add_argument("--do_predict", action='store_true', help="Whether to run eval on the dev set.")
    parser.add_argument("--train_batch_size", default=16, type=int, help="Total batch size for training.")
    parser.add_argument("--predict_batch_size", default=8, type=int, help="Total batch size for predictions.")
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--num_train_epochs", default=3.0, type=float,
                        help="Total number of training epochs to perform.")
    parser.add_argument("--max_steps", default=-1.0, type=float,
                        help="Total number of training steps to perform.")
    parser.add_argument("--warmup_proportion", default=0.1, type=float,
                        help="Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10%% "
                             "of training.")
    parser.add_argument("--n_best_size", default=20, type=int,
                        help="The total number of n-best predictions to generate in the nbest_predictions.json "
                             "output file.")
    parser.add_argument("--max_answer_length", default=30, type=int,
                        help="The maximum length of an answer that can be generated. This is needed because the start "
                             "and end predictions are not conditioned on one another.")
    parser.add_argument("--verbose_logging", action='store_true',
                        help="If true, all of the warnings related to data processing will be printed. "
                             "A number of warnings are expected for a normal SQuAD evaluation.")
    parser.add_argument("--no_cuda",
                        action='store_true',
                        help="Do not use CUDA even when it is available")
    parser.add_argument('--seed',
                        type=int,
                        default=42,
                        help="random seed for initialization")
    parser.add_argument('--gradient_accumulation_steps',
                        type=int,
                        default=1,
                        help="Number of update steps to accumulate before performing a backward/update pass.")
    parser.add_argument("--do_lower_case",
                        action='store_true',
                        help="Whether to lower case the input text. True for uncased models, False for cased models.")
    parser.add_argument("--local_rank",
                        type=int,
                        default=os.getenv('LOCAL_RANK', -1),
                        help="local_rank for distributed training on gpus")
    parser.add_argument('--fp16',
                        default=False,
                        action='store_true',
                        help="Mixed precision training")
    parser.add_argument('--amp',
                        default=False,
                        action='store_true',
                        help="Mixed precision training")
    parser.add_argument('--loss_scale',
                        type=float, default=0,
                        help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
                             "0 (default value): dynamic loss scaling.\n"
                             "Positive power of 2: static loss scaling value.\n")
    parser.add_argument('--version_2_with_negative',
                        action='store_true',
                        help='If true, the SQuAD examples contain some that do not have an answer.')
    parser.add_argument('--null_score_diff_threshold',
                        type=float, default=0.0,
                        help="If null_score - best_non_null is greater than the threshold predict null.")
    parser.add_argument('--vocab_file',
                        type=str, default=None, required=True,
                        help="Vocabulary mapping/file BERT was pretrained on")
    parser.add_argument("--config_file",
                        default=None,
                        type=str,
                        required=True,
                        help="The BERT model config")
    parser.add_argument('--log_freq',
                        type=int, default=50,
                        help='frequency of logging loss.')
    parser.add_argument('--json-summary', type=str, default="results/dllogger.json",
                        help='If provided, the json summary will be written to '
                             'the specified file.')
    parser.add_argument("--eval_script",
                        help="Script to evaluate squad predictions",
                        default="evaluate.py",
                        type=str)
    parser.add_argument("--do_eval",
                        action='store_true',
                        help="Whether to evaluate the accuracy of predictions")
    parser.add_argument("--use_env",
                        action='store_true',
                        help="Whether to read local rank from ENVVAR")
    parser.add_argument('--skip_checkpoint',
                        default=False,
                        action='store_true',
                        help="Skip saving checkpoints")
    parser.add_argument('--disable-progress-bar',
                        default=False,
                        action='store_true',
                        help='Disable tqdm progress bar')
    parser.add_argument("--skip_cache",
                        default=False,
                        action='store_true',
                        help="Skip caching train features")
    parser.add_argument("--cache_dir",
                        default=None,
                        type=str,
                        help="Location to cache train features. Will default to the dataset directory")
    parser.add_argument("--dist_url",
                        default='tcp://224.66.41.62:23456',
                        type=str,
                        help='url used to set up distributed training')
    parser.add_argument("--gpus_per_node",
                        type=int,
                        default=4,
                        help='num of gpus per node')
    parser.add_argument("--world_size",
                        type=int,
                        default=1,
                        help="number of processes")

    args = parser.parse_args()
    args.fp16 = args.fp16 or args.amp

    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        print("n_gpu:", torch.cuda.device_count())
        device_n = args.local_rank % 4
        print("=" * 20)
        print("device:", device_n)
        torch.cuda.set_device(device_n)
        print("=" * 20)
        print("torch.cuda.set_device:", torch.cuda.set_device(device_n))
        device = torch.device("cuda", device_n)
        print("=" * 20)
        print("torch.device:", torch.device("cuda", device_n))
        print("device:", device)
        #torch.cuda.set_device(args.local_rank)
        #device = torch.device("cuda", args.local_rank)
        #device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
        #torch.distributed.init_process_group(backend='gloo', init_method='env://')
        #xuan
        #if args.world_size > 1:
        #    args.local_rank = args.local_rank * args.gpus_per_node
        torch.distributed.init_process_group(backend='nccl', init_method=args.dist_url,
                                             world_size=args.world_size, rank=args.local_rank)
        n_gpu = 1

    if is_main_process():
        dllogger.init(backends=[dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE,
                                                           filename=args.json_summary),
                                dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE, step_format=format_step)])
    else:
        dllogger.init(backends=[])

    print("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
        device, n_gpu, bool(args.local_rank != -1), args.fp16))

    dllogger.log(step="PARAMETER", data={"Config": [str(args)]})

    if args.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
            args.gradient_accumulation_steps))

    args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    dllogger.log(step="PARAMETER", data={"SEED": args.seed})

    if n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

    if not args.do_train and not args.do_predict:
        raise ValueError("At least one of `do_train` or `do_predict` must be True.")

    if args.do_train:
        if not args.train_file:
            raise ValueError(
                "If `do_train` is True, then `train_file` must be specified.")
    if args.do_predict:
        if not args.predict_file:
            raise ValueError(
                "If `do_predict` is True, then `predict_file` must be specified.")

    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and os.listdir(args.output_dir) != ['logfile.txt']:
        print("WARNING: Output directory {} already exists and is not empty.".format(args.output_dir), os.listdir(args.output_dir))
    if not os.path.exists(args.output_dir) and is_main_process():
        os.makedirs(args.output_dir)

    tokenizer = BertTokenizer(args.vocab_file, do_lower_case=args.do_lower_case, max_len=512)  # for bert large
    # tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

    train_examples = None
    num_train_optimization_steps = None
    if args.do_train:
        train_examples = read_squad_examples(
            input_file=args.train_file, is_training=True, version_2_with_negative=args.version_2_with_negative)
        num_train_optimization_steps = int(
            len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
        if args.local_rank != -1:
            num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

    # Prepare model
    config = modeling.BertConfig.from_json_file(args.config_file)
    # Padding for divisibility by 8
    if config.vocab_size % 8 != 0:
        config.vocab_size += 8 - (config.vocab_size % 8)

    modeling.ACT2FN["bias_gelu"] = modeling.bias_gelu_training
    model = modeling.BertForQuestionAnswering(config)
    # model = modeling.BertForQuestionAnswering.from_pretrained(args.bert_model,
    # cache_dir=os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank)))
    dllogger.log(step="PARAMETER", data={"loading_checkpoint": True})
    #model.load_state_dict(torch.load(args.init_checkpoint, map_location='cpu')["model"], strict=False)
    model.load_state_dict(torch.load(args.init_checkpoint, map_location='cpu'), strict=False)
    dllogger.log(step="PARAMETER", data={"loaded_checkpoint": True})
    model.to(device)
    #model = model.cuda()
    num_weights = sum([p.numel() for p in model.parameters() if p.requires_grad])
    dllogger.log(step="PARAMETER", data={"model_weights_num": num_weights})

    # Prepare optimizer
    param_optimizer = list(model.named_parameters())

    # hack to remove the pooler, which is not used
    # and would therefore produce None grads that break apex
    param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]

    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    if args.do_train:
        if args.fp16:
            try:
                from apex.optimizers import FusedAdam
            except ImportError:
                raise ImportError(
                    "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
            optimizer = FusedAdam(optimizer_grouped_parameters,
                                  lr=args.learning_rate,
                                  bias_correction=False)

            if args.loss_scale == 0:
                model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False,
                                                  loss_scale="dynamic")
            else:
                model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False,
                                                  loss_scale=args.loss_scale)
            if args.do_train:
                scheduler = LinearWarmUpScheduler(optimizer, warmup=args.warmup_proportion,
                                                  total_steps=num_train_optimization_steps)
        else:
            optimizer = BertAdam(optimizer_grouped_parameters,
                                 lr=args.learning_rate,
                                 warmup=args.warmup_proportion,
                                 t_total=num_train_optimization_steps)

    if args.local_rank != -1:
        try:
            from apex.parallel import DistributedDataParallel as DDP
        except ImportError:
            raise ImportError(
                "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
        model = DDP(model)
        # model = torch.nn.parallel.DistributedDataParallel(model,device_ids=[device_n])
    elif n_gpu > 1:
        model = torch.nn.DataParallel(model)

    global_step = 0
    if args.do_train:

        if args.cache_dir is None:
            cached_train_features_file = args.train_file + '_{0}_{1}_{2}_{3}'.format(
                list(filter(None, args.bert_model.split('/'))).pop(), str(args.max_seq_length), str(args.doc_stride),
                str(args.max_query_length))
        else:
            cached_train_features_file = args.cache_dir.strip('/') + '/' + args.train_file.split('/')[-1] + '_{0}_{1}_{2}_{3}'.format(
                list(filter(None, args.bert_model.split('/'))).pop(), str(args.max_seq_length), str(args.doc_stride),
                str(args.max_query_length))

        train_features = None
        try:
            with open(cached_train_features_file, "rb") as reader:
                train_features = pickle.load(reader)
        except Exception:
            # Cache is missing or unreadable: rebuild the features from the examples.
            train_features = convert_examples_to_features(
                examples=train_examples,
                tokenizer=tokenizer,
                max_seq_length=args.max_seq_length,
                doc_stride=args.doc_stride,
                max_query_length=args.max_query_length,
                is_training=True)

            if not args.skip_cache and is_main_process():
                dllogger.log(step="PARAMETER", data={"Cached_train features_file": cached_train_features_file})
                with open(cached_train_features_file, "wb") as writer:
                    pickle.dump(train_features, writer)

        dllogger.log(step="PARAMETER", data={"train_start": True})
        dllogger.log(step="PARAMETER", data={"training_samples": len(train_examples)})
        dllogger.log(step="PARAMETER", data={"training_features": len(train_features)})
        dllogger.log(step="PARAMETER", data={"train_batch_size": args.train_batch_size})
        dllogger.log(step="PARAMETER", data={"steps": num_train_optimization_steps})
        all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
        all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
        all_start_positions = torch.tensor([f.start_position for f in train_features], dtype=torch.long)
        all_end_positions = torch.tensor([f.end_position for f in train_features], dtype=torch.long)
        train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                                   all_start_positions, all_end_positions)
        if args.local_rank == -1:
            train_sampler = RandomSampler(train_data)
        else:
            train_sampler = DistributedSampler(train_data)
        train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size * n_gpu)

        model.train()
        gradClipper = GradientClipper(max_grad_norm=1.0)
        final_loss = None
        train_start = time.time()
        for epoch in range(int(args.num_train_epochs)):
            train_iter = tqdm(train_dataloader, desc="Iteration", disable=args.disable_progress_bar) if is_main_process() else train_dataloader
            for step, batch in enumerate(train_iter):
                # Terminate early for benchmarking
                print("step is ", step, " ")
                if args.max_steps > 0 and global_step > args.max_steps:
                    break

                if n_gpu == 1:
                    batch = tuple(t.to(device) for t in batch)  # multi-gpu does scattering it-self
                input_ids, input_mask, segment_ids, start_positions, end_positions = batch
                start_logits, end_logits = model(input_ids, segment_ids, input_mask)
                print("+++++++++++++++++++++++++++++++++++++1")
                # If we are on multi-GPU, split add a dimension
                if len(start_positions.size()) > 1:
                    start_positions = start_positions.squeeze(-1)
                if len(end_positions.size()) > 1:
                    end_positions = end_positions.squeeze(-1)
                # sometimes the start/end positions are outside our model inputs, we ignore these terms
                ignored_index = start_logits.size(1)
                start_positions.clamp_(0, ignored_index)
                end_positions.clamp_(0, ignored_index)
                print("+++++++++++++++++++++++++++++++++++++2")
                loss_fct = torch.nn.CrossEntropyLoss(ignore_index=ignored_index)
                start_loss = loss_fct(start_logits, start_positions)
                end_loss = loss_fct(end_logits, end_positions)
                loss = (start_loss + end_loss) / 2
                if n_gpu > 1:
                    loss = loss.mean()  # mean() to average on multi-gpu.
                if args.gradient_accumulation_steps > 1:
                    loss = loss / args.gradient_accumulation_steps
                if args.fp16:
                    with amp.scale_loss(loss, optimizer) as scaled_loss:
                        scaled_loss.backward()
                else:
                    print("compute loss back")
                    loss.backward()
                print("+++++++++++++++++++++++++++++++++++++3")
                # gradient clipping
                gradClipper.step(amp.master_params(optimizer))

                if (step + 1) % args.gradient_accumulation_steps == 0:
                    if args.fp16:
                        # modify learning rate with special warm up for BERT which FusedAdam doesn't do
                        scheduler.step()
                    optimizer.step()
                    optimizer.zero_grad()
                    global_step += 1

                final_loss = loss.item()
                if step % args.log_freq == 0:
                    dllogger.log(step=(epoch, global_step,), data={"step_loss": final_loss,
                                                                   "learning_rate": optimizer.param_groups[0]['lr']})

        time_to_train = time.time() - train_start

    if args.do_train and is_main_process() and not args.skip_checkpoint:
        # Save a trained model and the associated configuration
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self
        output_model_file = os.path.join(args.output_dir, modeling.WEIGHTS_NAME)
        torch.save({"model": model_to_save.state_dict()}, output_model_file)
        output_config_file = os.path.join(args.output_dir, modeling.CONFIG_NAME)
        with open(output_config_file, 'w') as f:
            f.write(model_to_save.config.to_json_string())

    if args.do_predict and (args.local_rank == -1 or is_main_process()):

        if not args.do_train and args.fp16:
            model.half()

        eval_examples = read_squad_examples(
            input_file=args.predict_file, is_training=False, version_2_with_negative=args.version_2_with_negative)
        eval_features = convert_examples_to_features(
            examples=eval_examples,
            tokenizer=tokenizer,
            max_seq_length=args.max_seq_length,
            doc_stride=args.doc_stride,
            max_query_length=args.max_query_length,
            is_training=False)

        dllogger.log(step="PARAMETER", data={"infer_start": True})
        dllogger.log(step="PARAMETER", data={"eval_samples": len(eval_examples)})
        dllogger.log(step="PARAMETER", data={"eval_features": len(eval_features)})
        dllogger.log(step="PARAMETER", data={"predict_batch_size": args.predict_batch_size})

        all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
        all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
        all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
        eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_example_index)
        # Run prediction for full data
        eval_sampler = SequentialSampler(eval_data)
        eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.predict_batch_size)

        infer_start = time.time()
        model.eval()
        all_results = []
        dllogger.log(step="PARAMETER", data={"eval_start": True})
        for input_ids, input_mask, segment_ids, example_indices in tqdm(eval_dataloader, desc="Evaluating", disable=args.disable_progress_bar):
            if len(all_results) % 1000 == 0:
                dllogger.log(step="PARAMETER", data={"sample_number": len(all_results)})
            input_ids = input_ids.to(device)
            input_mask = input_mask.to(device)
            segment_ids = segment_ids.to(device)
            with torch.no_grad():
                batch_start_logits, batch_end_logits = model(input_ids, segment_ids, input_mask)
            for i, example_index in enumerate(example_indices):
                start_logits = batch_start_logits[i].detach().cpu().tolist()
                end_logits = batch_end_logits[i].detach().cpu().tolist()
                eval_feature = eval_features[example_index.item()]
                unique_id = int(eval_feature.unique_id)
                all_results.append(RawResult(unique_id=unique_id,
                                             start_logits=start_logits,
                                             end_logits=end_logits))

        time_to_infer = time.time() - infer_start
        output_prediction_file = os.path.join(args.output_dir, "predictions.json")
        output_nbest_file = os.path.join(args.output_dir, "nbest_predictions.json")

        answers, nbest_answers = get_answers(eval_examples, eval_features, all_results, args)
        with open(output_prediction_file, "w") as f:
            f.write(json.dumps(answers, indent=4) + "\n")
        with open(output_nbest_file, "w") as f:
            f.write(json.dumps(nbest_answers, indent=4) + "\n")

        # output_null_log_odds_file = os.path.join(args.output_dir, "null_odds.json")
        # write_predictions(eval_examples, eval_features, all_results,
        #                   args.n_best_size, args.max_answer_length,
        #                   args.do_lower_case, output_prediction_file,
        #                   output_nbest_file, output_null_log_odds_file, args.verbose_logging,
        #                   args.version_2_with_negative, args.null_score_diff_threshold)

        #if args.do_eval and is_main_process():
        if args.do_eval:
            import sys
            import subprocess
            eval_out = subprocess.check_output([sys.executable, args.eval_script,
                                                args.predict_file, args.output_dir + "/predictions.json"])
            scores = str(eval_out).strip()
            exact_match = float(scores.split(":")[1].split(",")[0])
            f1 = float(scores.split(":")[2].split("}")[0])
            # Check whether the scores are defined
            print('-' * 20)
            print('f1:', f1, "exact_match:", exact_match)
            print('-' * 20)

    if args.do_train:
        gpu_count = n_gpu
        if torch.distributed.is_initialized():
            gpu_count = torch.distributed.get_world_size()

        if args.max_steps == -1:
            dllogger.log(step=tuple(), data={"e2e_train_time": time_to_train,
                                             "training_sequences_per_second": len(train_features) * args.num_train_epochs / time_to_train,
                                             "final_loss": final_loss})
        else:
            dllogger.log(step=tuple(), data={"e2e_train_time": time_to_train,
                                             "training_sequences_per_second": args.train_batch_size * args.gradient_accumulation_steps \
                                             * args.max_steps * gpu_count / time_to_train,
                                             "final_loss": final_loss})
    if args.do_predict and is_main_process():
        dllogger.log(step=tuple(), data={"e2e_inference_time": time_to_infer,
                                         "inference_sequences_per_second": len(eval_features) / time_to_infer})
    if args.do_eval and is_main_process():
        # global exact_match
        # global f1
        dllogger.log(step=tuple(), data={"exact_match": exact_match, "F1": f1})

if __name__ == "__main__":
    main()
    dllogger.flush()
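
The `--do_eval` branch of `main()` extracts the scores by splitting the stringified output of the evaluation script on `:` and `,`. A more robust alternative is sketched below, assuming the script prints a single JSON object with `exact_match` and `f1` keys; that output format is an assumption inferred from the split-based parsing, and `parse_scores` is a hypothetical helper.

```python
import json


def parse_scores(eval_out):
    """Extract (exact_match, f1) from the evaluation script's raw stdout bytes."""
    scores = json.loads(eval_out.decode("utf-8"))
    return float(scores["exact_match"]), float(scores["f1"])


# Made-up sample output in the assumed format.
exact_match, f1 = parse_scores(b'{"exact_match": 81.25, "f1": 88.5}\n')
print(exact_match, f1)
```

Decoding the bytes and parsing JSON avoids depending on key order or on the `b'...'` wrapper that `str()` adds to a bytes object.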
PyTorch/NLP/BERT/single_pre1_1.sh (new file, mode 100644)
#!/bin/bash
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# Modified below
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v1.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
    --config_file=./bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20000 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --gpus_per_node 1 \
    --do_train \
    --json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json
    "
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=1
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=2
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
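
The `case` statement in `single_pre1_1.sh` repeats the same pattern for every local rank: GPU, InfiniBand device and NUMA node all share the rank's index. That mapping can be sketched as a small Python function. `launch_plan` is a hypothetical helper for illustration only and is not part of the repository.

```python
def launch_plan(local_rank, app):
    """Return (environment overrides, command) for one MPI local rank."""
    env = {
        "HIP_VISIBLE_DEVICES": str(local_rank),
        "UCX_NET_DEVICES": "mlx5_{}:1".format(local_rank),
        "UCX_IB_PCI_BW": "mlx5_{}:50Gbs".format(local_rank),
    }
    # Pin both CPU and memory to the NUMA node with the same index as the GPU.
    command = ["numactl",
               "--cpunodebind={}".format(local_rank),
               "--membind={}".format(local_rank)] + app.split()
    return env, command


env, command = launch_plan(2, "python3 run_pretraining_v1.py --do_train")
print(env)
print(" ".join(command))
```

Deriving the settings from the rank keeps the per-rank branches identical by construction, at the cost of assuming the one-to-one GPU/NIC/NUMA layout that the script hard-codes.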
PyTorch/NLP/BERT/single_pre1_1_fp16.sh (new file, mode 100644)
#!/bin/bash
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=ib0
#export HSA_USERPTR_FOR_PAGED_MEM=0
#export MIOPEN_DEBUG_DISABLE_FIND_DB=1
#export NCCL_SOCKET_IFNAME=eno1
export HSA_FORCE_FINE_GRAIN_PCIE=1
#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
export MIOPEN_FIND_MODE=1
#export MIOPEN_ENABLE_LOGGING_CMD=1
#export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_3X3U=0
#export MIOPEN_DEBUG_CONV_DIRECT_ASM_WRW3X3=0
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The lines below were modified
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#HSA_FORCE_FINE_GRAIN_PCIE=1 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
#module load compiler/rocm/3.9.1
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v1.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=/public/home/hepj/outdir/torch/pre_wiki/phrase1 \
    --config_file=./bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --fp16 \
    --amp \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --gpus_per_node 1 \
    --do_train \
    --json-summary /public/home/hepj/outdir/torch/pre_wiki/phrase1/dllogger.json
    "
#--fp16 \
# --amp \
case ${lrank} in
[0])
    export HIP_VISIBLE_DEVICES=0
    export UCX_NET_DEVICES=mlx5_0:1
    export UCX_IB_PCI_BW=mlx5_0:50Gbs
    echo numactl --cpunodebind=0 --membind=0 ${APP}
    numactl --cpunodebind=0 --membind=0 ${APP}
    #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
    #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
    ;;
[1])
    export HIP_VISIBLE_DEVICES=1
    export UCX_NET_DEVICES=mlx5_1:1
    export UCX_IB_PCI_BW=mlx5_1:50Gbs
    echo numactl --cpunodebind=1 --membind=1 ${APP}
    numactl --cpunodebind=1 --membind=1 ${APP}
    ;;
[2])
    export HIP_VISIBLE_DEVICES=2
    export UCX_NET_DEVICES=mlx5_2:1
    export UCX_IB_PCI_BW=mlx5_2:50Gbs
    echo numactl --cpunodebind=2 --membind=2 ${APP}
    numactl --cpunodebind=2 --membind=2 ${APP}
    ;;
[3])
    export HIP_VISIBLE_DEVICES=3
    export UCX_NET_DEVICES=mlx5_3:1
    export UCX_IB_PCI_BW=mlx5_3:50Gbs
    echo numactl --cpunodebind=3 --membind=3 ${APP}
    numactl --cpunodebind=3 --membind=3 ${APP}
    ;;
esac
PyTorch/NLP/BERT/single_pre1_4.sh
0 → 100644
View file @
bedf3c0c
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The lines below were modified
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v4.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
    --config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20000 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --do_train \
    --use_env \
    --local_rank ${comm_rank} \
    --world_size 4 \
    --gpus_per_node 1 \
    --dist_url tcp://localhost:34567 \
    --json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json
    "
case ${lrank} in
[0])
    export HIP_VISIBLE_DEVICES=0
    export UCX_NET_DEVICES=mlx5_0:1
    export UCX_IB_PCI_BW=mlx5_0:50Gbs
    echo numactl --cpunodebind=0 --membind=0 ${APP}
    numactl --cpunodebind=0 --membind=0 ${APP}
    ;;
[1])
    export HIP_VISIBLE_DEVICES=1
    export UCX_NET_DEVICES=mlx5_1:1
    export UCX_IB_PCI_BW=mlx5_1:50Gbs
    echo numactl --cpunodebind=1 --membind=1 ${APP}
    numactl --cpunodebind=1 --membind=1 ${APP}
    ;;
[2])
    export HIP_VISIBLE_DEVICES=2
    export UCX_NET_DEVICES=mlx5_2:1
    export UCX_IB_PCI_BW=mlx5_2:50Gbs
    echo numactl --cpunodebind=2 --membind=2 ${APP}
    numactl --cpunodebind=2 --membind=2 ${APP}
    ;;
[3])
    export HIP_VISIBLE_DEVICES=3
    export UCX_NET_DEVICES=mlx5_3:1
    export UCX_IB_PCI_BW=mlx5_3:50Gbs
    echo numactl --cpunodebind=3 --membind=3 ${APP}
    numactl --cpunodebind=3 --membind=3 ${APP}
    ;;
esac
PyTorch/NLP/BERT/single_pre1_4_fp16.sh
0 → 100644
View file @
bedf3c0c
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
# The lines below were modified
export HIP_VISIBLE_DEVICES=0,1,2,3
export PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
APP="python3 run_pretraining_v4.py \
    --input_dir=${PATH_PHRASE1} \
    --output_dir=/public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32 \
    --config_file=/public/home/hepj/model_source/pytorch_bert/bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20000 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --do_train \
    --fp16 \
    --amp \
    --use_env \
    --local_rank ${comm_rank} \
    --world_size 4 \
    --gpus_per_node 1 \
    --dist_url tcp://localhost:34567 \
    --json-summary /public/home/hepj/outdir/torch/pre_wiki4/phrase1/fp32/dllogger.json
    "
case ${lrank} in
[0])
    export HIP_VISIBLE_DEVICES=0
    export UCX_NET_DEVICES=mlx5_0:1
    export UCX_IB_PCI_BW=mlx5_0:50Gbs
    echo numactl --cpunodebind=0 --membind=0 ${APP}
    numactl --cpunodebind=0 --membind=0 ${APP}
    #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
    #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
    ;;
[1])
    export HIP_VISIBLE_DEVICES=1
    export UCX_NET_DEVICES=mlx5_1:1
    export UCX_IB_PCI_BW=mlx5_1:50Gbs
    echo numactl --cpunodebind=1 --membind=1 ${APP}
    numactl --cpunodebind=1 --membind=1 ${APP}
    ;;
[2])
    export HIP_VISIBLE_DEVICES=2
    export UCX_NET_DEVICES=mlx5_2:1
    export UCX_IB_PCI_BW=mlx5_2:50Gbs
    echo numactl --cpunodebind=2 --membind=2 ${APP}
    numactl --cpunodebind=2 --membind=2 ${APP}
    ;;
[3])
    export HIP_VISIBLE_DEVICES=3
    export UCX_NET_DEVICES=mlx5_3:1
    export UCX_IB_PCI_BW=mlx5_3:50Gbs
    echo numactl --cpunodebind=3 --membind=3 ${APP}
    numactl --cpunodebind=3 --membind=3 ${APP}
    ;;
esac
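Note on the per-rank branches: in each of these scripts the four `case` branches differ only in one index, which is used for the GPU, the `mlx5` device, and the NUMA node alike. As a minimal sketch, assuming that this index always equals the local rank (as it does in the branches above), the settings could be derived from the rank instead of being written out four times. The helper name `rank_env` is hypothetical and not part of this commit; it only prints the settings and does not launch anything.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): print the settings that the
# case branches assign by hand, assuming the GPU index, the mlx5 device index
# and the NUMA node all equal the local rank.
rank_env() {
    local rank="$1"
    echo "HIP_VISIBLE_DEVICES=${rank}"
    echo "UCX_NET_DEVICES=mlx5_${rank}:1"
    echo "UCX_IB_PCI_BW=mlx5_${rank}:50Gbs"
    echo "numactl --cpunodebind=${rank} --membind=${rank}"
}

# Fall back to rank 0 when the script is not started through Open MPI.
rank_env "${OMPI_COMM_WORLD_LOCAL_RANK:-0}"
```

If the mapping between GPUs, network devices and NUMA nodes differs on another machine, the explicit branches remain the safer choice.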