README.md

# **Bert算力测试**

## 1.数据集准备

pre_train 数据，目前最新的是wiki20220401的数据，但数据集压缩后近20GB，解压后300GB下载速度慢，解压占大量空间。enwiki-20220401-pages-articles-multistream.xml.bz2下载链接如下：

https://dumps.wikimedia.org/enwiki/20220401/ 

这里使用服务器已有的wiki数据集服务器上有已经下载处理好的数据，预训练数据分为PHRASE1、PHRASE2

```
昆山wiki数据集地址PHRASE1:
PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

昆山wiki数据集地址PHRASE2:
PATH_PHRASE2=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
```

```
乌镇wiki地址PHRASE1:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
乌镇wiki地址PHRASE2:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
```

问答SQUAD1.1数据：

[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)

[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

## 2.测试环境

注意dtk python torch apex 等版本要对齐

```
1.创建python虚拟环境并进入
virtualenv --python=~/package/Python-3.6.8/build/bin/python3 venv_dtk21.10.1_torch1.10
source venv_dtk21.10_torch1.10/bin/activate

2.安装依赖包
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip install torch-1.10.0a0+gitcc7c9c7-cp36-cp36m-linux_x86_64.whl
pip install torchvision-0.10.0a0+300a8a4-cp36-cp36m-linux_x86_64.whl
pip install apex-0.1-cp36-cp36m-linux_x86_64.whl

3.环境变量设置
module rm compiler/rocm/2.9 
export ROCM_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1
export HIP_PATH=${ROCM_PATH}/hip
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PAT
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZ
```

## 3.squad测试

### 1.模型转化

```
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```

目前模型转换还存在问题，可能是由于下载的TF模型与model.ckpt-28252不同导致，或torch 、apex版本兼容性问题，还在排查当中，可以直接使用转换好的模型进行squad任务的微调训练（PHRASE的测试则不受此影响，PHRASE为预训练只需要训练数据与网络结构即可，不需要加载模型）

[转换好的模型  提取密码：vs8d](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)

### 2.参数说明

```
  --train_file  训练数据
  --predict_file  预测文件
  --init_checkpoint  模型文件
  --vocab_file  词向量文件
  --output_dir  输出文件夹
  --config_file  模型配置文件
  --json-summary  输出json文件
  --bert_model bert模型类型可选： bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,bert-base-multilingual-cased, bert-base-chinese
  --do_train 是否训练
  --do_predict 是否预测
  --train_batch_size  训练batch_size
  --predict_batch_size 预测batch_size
  --gpus_per_node  使用gpu节点数
  --local_rank 基于GPU的分布式训练的local_rank（单卡设置为-1）
  --fp16 混合精度训练
  --amp 混合精度训练
```

### 3.运行

```
#单卡
./bert_squad.sh #单精度 （按自己路径对single_squad.sh里APP设置进行修改）
./bert_squad_fp16.sh  #半精度 （按自己路径对single_squad_fp16.sh里APP设置进行修改）
```

```
#多卡
./bert_squad4.sh #单精度  （按自己路径对single_squad4.sh里APP设置进行修改）
./bert_squad4_fp16.sh #半精度  （按自己路径对single_squad4_fp16.sh里APP设置进行修改）
```

```
#多机多卡
cd 2node-run-squad
sbatch run_bert_squad_4dcu.sh （按照自己情况对#SBATCH -p、#SBATCH -J进行修改；需要fp16可以在相应single文件APP中增加 --fp16 与 --amp参数,运行结果保存在相应的slurm文件中）
```


## 4.**PHRASE测试**

### 1.参数说明

```
    --input_dir  输入数据文件夹
    --output_dir 输出保存文件夹
    --config_file 模型配置文件
    --bert_model  bert模型类型可选： bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,bert-base-multilingual-cased, bert-base-chinese
    --train_batch_size 训练batch_size
    --max_seq_length=128 最大长度（需要和训练数据相匹配）
    --max_predictions_per_seq 输入序列中屏蔽标记的最大总数 
    --max_steps 最大步长
    --warmup_proportion 进行线性学习率热身的训练比例
    --num_steps_per_checkpoint 多少步保存一次模型
    --learning_rate 学习率
    --seed 随机种子
    --gradient_accumulation_steps 在执行向后/更新过程之前，Accumulte的更新步骤数
    --allreduce_post_accumulation 是否在梯度累积步骤期间执行所有减少
    --do_train 是否训练
    --fp16 混合精度训练
    --amp 混合精度训练
    --json-summary 输出json文件
```

### 2.PHRASE1

```
#单卡
./bert_pre1.sh #单精度 （按自己路径对single_pre1_1.sh里APP设置进行修改）
./bert_pre1_fp16.sh  #半精度 （按自己路径对single_pre1_1_fp16.sh里APP设置进行修改）
#多卡
./bert_pre1_4.sh #单精度 （按自己路径对single_pre1_4.sh里APP设置进行修改）
./bert_pre1_4_fp16.sh   #半精度 （按自己路径对single_pre1_4_fp16.sh里APP设置进行修改）
#多机多卡
cd 2node-run-pre
sbatch run_bert_pre1_4dcu.sh （按照自己情况对#SBATCH -p、#SBATCH -J进行修改；需要fp16可以在相应single文件APP中增加 --fp16 与 --amp参数,运行结果保存在相应的slurm文件中）
```

### 3.PHRASE2

```
#单卡
./bert_pre2.sh  #单精度 （按自己路径对single_pre2_1.sh里APP设置进行修改）
./bert_pre2_fp16.sh  #半精度 （按自己路径对single_pre2_1_fp16.sh里APP设置进行修改）
#多卡
./bert_pre2_4.sh  #单精度 （按自己路径对single_pre2_4.sh里APP设置进行修改）
./bert_pre2_4_fp16.sh  #半精度 （按自己路径对single_pre2_4_fp16.sh里APP设置进行修改）
#多机多卡
cd 2node-run-pre
sbatch run_bert_pre2_4dcu.sh （按照自己情况对#SBATCH -p、#SBATCH -J进行修改；需要fp16可以在相应single文件APP中增加 --fp16 与 --amp参数,运行结果保存在相应的slurm文件中）
```