# **BERT Compute Benchmark**

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

[BERT paper (PDF)](https://arxiv.org/pdf/1810.04805.pdf)

## Model Architecture

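BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language representation model. Rather than pretraining with a traditional unidirectional language model, or with a shallow concatenation of two unidirectional language models, it pretrains with a masked language model (MLM) objective, which lets it learn deep bidirectional language representations.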
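The MLM idea can be illustrated with a minimal sketch. It assumes the 80/10/10 replacement rule from the BERT paper, and takes its defaults (masking probability 0.15, at most 20 predictions, seed 12345) from the names of the preprocessed dataset directories used later in this document. The function name `mask_tokens` and the toy vocabulary are hypothetical and not part of this repository.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, masked_lm_prob=0.15, max_predictions=20, seed=12345):
    """Return (masked_tokens, labels); labels maps position -> original token."""
    rng = random.Random(seed)
    # Special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    rng.shuffle(candidates)
    num_to_mask = min(max_predictions, max(1, round(len(tokens) * masked_lm_prob)))

    masked = list(tokens)
    labels = {}
    for pos in sorted(candidates[:num_to_mask]):
        labels[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog near the river bank".split()
    sentence = ["[CLS]"] + words + ["[SEP]"]
    masked, labels = mask_tokens(sentence, vocab=words)
    print(masked)
    print(labels)
```

The model is then trained to predict the original token at each position in `labels`, using context from both the left and the right.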


![bert_model](bert_model.png)

## Algorithm

![bert](bert.png)

BERT does not use the full Transformer architecture (encoder + decoder). It uses only the encoder, and stacks multiple encoder layers to form its backbone network.
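To make the stacked-encoder structure concrete, here is a rough parameter count in plain Python. It is a sketch under assumptions that are not stated in this document: the standard uncased English configuration (vocabulary of 30522 tokens, 512 positions, 2 segment types), a feed-forward width of 4x the hidden size, and a final pooler layer. The function names are hypothetical.

```python
def encoder_layer_params(hidden, intermediate):
    """Parameters in one Transformer encoder layer (weights + biases)."""
    attention = 4 * (hidden * hidden + hidden)          # Q, K, V and output projections
    ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                      # two LayerNorms, scale + bias each
    return attention + ffn + layer_norms


def bert_params(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Total parameters of an encoder-only BERT body under the assumed config."""
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    encoder = num_layers * encoder_layer_params(hidden, 4 * hidden)
    pooler = hidden * hidden + hidden
    return embeddings + encoder + pooler


if __name__ == "__main__":
    print(bert_params(num_layers=12, hidden=768))    # base-sized:  109482240
    print(bert_params(num_layers=24, hidden=1024))   # large-sized: 335141888
```

Almost all of the parameters sit in the repeated encoder layers, which is why the layer count and hidden size dominate the compute cost of the benchmark.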

## Environment Setup

Note: the versions of DTK, Python, torch, apex, and related packages must match each other.

```
# 1. Create and activate a Python virtual environment
virtualenv --python="$HOME/package/Python-3.6.8/build/bin/python3" venv_dtk21.10.1_torch1.10
source venv_dtk21.10.1_torch1.10/bin/activate

# 2. Install dependencies
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip install torch-1.10.0a0+gitcc7c9c7-cp36-cp36m-linux_x86_64.whl
pip install torchvision-0.10.0a0+300a8a4-cp36-cp36m-linux_x86_64.whl
pip install apex-0.1-cp36-cp36m-linux_x86_64.whl

# 3. Set environment variables
module rm compiler/rocm/2.9
export ROCM_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1
export HIP_PATH=${ROCM_PATH}/hip
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PATH
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
# Rank information provided by Open MPI when launched through mpirun
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
```
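The last three shell variables pick up the rank information that Open MPI exports for each launched process. As a rough illustration, the same values could be read from Python as follows. The helper name `read_mpi_ranks` and the single-process fallback values are assumptions for this sketch, not code from this repository.

```python
import os


def read_mpi_ranks(env=None):
    """Read Open MPI rank variables, falling back to a single-process layout."""
    env = os.environ if env is None else env
    return {
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),  # rank within the node
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),              # global rank
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),        # total process count
    }


if __name__ == "__main__":
    print(read_mpi_ranks())
```

When the script is not started through `mpirun`, the variables are absent and the sketch reports rank 0 of 1.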


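## Dataset

Pretraining data: the latest dump at the time of writing is wiki20220401. The compressed dataset is nearly 20 GB and about 300 GB once extracted, so the download is slow and extraction takes a lot of disk space. The download link for enwiki-20220401-pages-articles-multistream.xml.bz2 is:

https://dumps.wikimedia.org/enwiki/20220401/

This test uses the wiki dataset that has already been downloaded and preprocessed on the server. The pretraining data is split into PHRASE1 and PHRASE2, with sequence lengths 128 and 512 respectively, as the directory names show.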

```
# Kunshan wiki dataset path, PHRASE1:
PATH_PHRASE1=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training

# Kunshan wiki dataset path, PHRASE2:
PATH_PHRASE2=/public/software/apps/DeepLearning/Data/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en/training
```

```
# Wuzhen wiki dataset path, PHRASE1:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
# Wuzhen wiki dataset path, PHRASE2:
/public/DL_DATA/wikicorpus_en/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/wikicorpus_en
```
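The directory names above encode the preprocessing settings, and the sequence length in particular has to agree with the `--max_seq_length` value used for pretraining. The following sketch parses those settings out of a path so they can be checked before launching a run. The helper `parse_dataset_dir` is hypothetical and only assumes the naming pattern visible in the paths above.

```python
import re

_PATTERN = re.compile(
    r"lower_case_(?P<lower_case>\d+)"
    r"_seq_len_(?P<seq_len>\d+)"
    r"_max_pred_(?P<max_pred>\d+)"
    r"_masked_lm_prob_(?P<masked_lm_prob>[\d.]+)"
    r"_random_seed_(?P<random_seed>\d+)"
    r"_dupe_factor_(?P<dupe_factor>\d+)"
    r"_shard_(?P<shard>\d+)"
    r"_test_split_(?P<test_split>\d+)"
)


def parse_dataset_dir(path):
    """Extract preprocessing settings from a dataset directory name."""
    match = _PATTERN.search(path)
    if match is None:
        raise ValueError("unrecognised dataset directory name: %r" % path)
    settings = {k: int(v) for k, v in match.groupdict().items() if k != "masked_lm_prob"}
    settings["masked_lm_prob"] = float(match.group("masked_lm_prob"))
    return settings


if __name__ == "__main__":
    name = ("lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15"
            "_random_seed_12345_dupe_factor_5_shard_1472_test_split_10")
    print(parse_dataset_dir(name))
```

A launch script could compare `settings["seq_len"]` and `settings["max_pred"]` against the values it is about to pass on the command line and stop early on a mismatch.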

`wiki dataset structure`

```
wikicorpus_en
├── train
│   ├── wikicorpus_en_training_0.tfrecord.hdf5
│   ├── wikicorpus_en_training_1000.tfrecord.hdf5
│   └── ...
└── test
    ├── wikicorpus_en_test_99.tfrecord.hdf5
    ├── wikicorpus_en_test_9.tfrecord.hdf5
    └── ...
```
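Because the shard index is embedded in each file name, a plain alphabetical listing puts shard 1000 before shard 9. The sketch below lists the shards in numeric order instead. It assumes only the `*_<index>.tfrecord.hdf5` naming shown in the tree, and the helper name `list_shards` is hypothetical.

```python
import re
from pathlib import Path

_SHARD_RE = re.compile(r"_(\d+)\.tfrecord\.hdf5$")


def list_shards(directory):
    """Return the .tfrecord.hdf5 shard paths in a directory, sorted by shard index."""
    indexed = []
    for path in Path(directory).glob("*.tfrecord.hdf5"):
        match = _SHARD_RE.search(path.name)
        if match:
            indexed.append((int(match.group(1)), path))
    indexed.sort(key=lambda item: item[0])
    return [path for _, path in indexed]


if __name__ == "__main__":
    for shard in list_shards("."):
        print(shard.name)
```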

SQuAD v1.1 question-answering data:

[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)

[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

`SQuAD v1.1 data structure`

```
├── dev-v1.1.json
└── train-v1.1.json
```
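Both files share the SQuAD v1.1 JSON layout: a top-level `data` list of articles, each with `paragraphs`, each of which carries a `context` and a list of `qas`. A small sanity check after downloading might look like the sketch below. The helper names are hypothetical, and the file path in the usage part is whatever location you saved the file to.

```python
import json


def load_squad(path):
    """Load a SQuAD v1.1 JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def squad_stats(dataset):
    """Count articles, paragraphs and questions in a loaded SQuAD v1.1 dataset."""
    articles = dataset["data"]
    paragraphs = [p for article in articles for p in article["paragraphs"]]
    questions = [qa for p in paragraphs for qa in p["qas"]]
    return {
        "articles": len(articles),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
    }


if __name__ == "__main__":
    import os
    if os.path.exists("dev-v1.1.json"):
        print(squad_stats(load_squad("dev-v1.1.json")))
```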

## Training

### SQuAD training

#### 1. Model conversion

```
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```

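Model conversion does not work correctly yet. The cause may be that the downloaded TF model differs from model.ckpt-28252, or a torch/apex version incompatibility; this is still being investigated. In the meantime, you can fine-tune on SQuAD directly from the already converted model. The PHRASE tests are not affected: PHRASE is pretraining, which needs only the training data and the network definition and does not load a model checkpoint.

[Converted model (extraction code: vs8d)](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)

#### 2. Parameters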

```
  --train_file  training data file
  --predict_file  prediction (evaluation) data file
  --init_checkpoint  model checkpoint file
  --vocab_file  vocabulary file
  --output_dir  output directory
  --config_file  model configuration file
  --json-summary  output JSON summary file
  --bert_model  BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
  --do_train  run training
  --do_predict  run prediction
  --train_batch_size  training batch size
  --predict_batch_size  prediction batch size
  --gpus_per_node  number of GPUs per node
  --local_rank  local_rank for distributed training on GPUs (set to -1 for a single card)
  --fp16  mixed-precision training
  --amp  mixed-precision training
```
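As an illustration of how these flags combine, the sketch below assembles a command line from a dictionary of options. The script name `run_squad.py`, the helper `build_squad_command`, and every file path in the example are placeholders chosen for this sketch; check the `APP` setting in the launch scripts for the actual entry point and paths.

```python
import shlex


def build_squad_command(options, script="run_squad.py", train=True, predict=True, fp16=False):
    """Build an argument list from the flags documented above."""
    cmd = ["python3", script]
    if train:
        cmd.append("--do_train")
    if predict:
        cmd.append("--do_predict")
    for name, value in options.items():
        cmd += ["--" + name, str(value)]
    if fp16:
        # Mixed precision uses both flags together.
        cmd += ["--fp16", "--amp"]
    return cmd


if __name__ == "__main__":
    # All paths below are placeholders.
    options = {
        "train_file": "squad/train-v1.1.json",
        "predict_file": "squad/dev-v1.1.json",
        "init_checkpoint": "model.ckpt-28252.pt",
        "vocab_file": "vocab.txt",
        "config_file": "bert_config.json",
        "output_dir": "results",
        "train_batch_size": 16,
    }
    cmd = build_squad_command(options, fp16=True)
    print(" ".join(shlex.quote(part) for part in cmd))
```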

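#### 3. Run

```
# Single card
./bert_squad.sh       # single precision (edit the APP setting in single_squad.sh to match your paths)
./bert_squad_fp16.sh  # half precision (edit the APP setting in single_squad_fp16.sh to match your paths)
```

```
# Multiple cards
./bert_squad4.sh       # single precision (edit the APP setting in single_squad4.sh to match your paths)
./bert_squad4_fp16.sh  # half precision (edit the APP setting in single_squad4_fp16.sh to match your paths)
```

```
# Multiple nodes, multiple cards
# Adjust #SBATCH -p and #SBATCH -J for your cluster. For fp16, add --fp16 and --amp to APP
# in the corresponding single_* script. Results are saved in the corresponding slurm output file.
cd 2node-run-squad
sbatch run_bert_squad_4dcu.sh
```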


### PHRASE test

#### 1. Parameters

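```
    --input_dir  input data directory
    --output_dir  output directory
    --config_file  model configuration file
    --bert_model  BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
    --train_batch_size  training batch size
    --max_seq_length=128  maximum sequence length (must match the training data)
    --max_predictions_per_seq  maximum number of masked tokens per input sequence
    --max_steps  maximum number of training steps
    --warmup_proportion  fraction of training used for linear learning-rate warmup
    --num_steps_per_checkpoint  number of steps between checkpoint saves
    --learning_rate  learning rate
    --seed  random seed
    --gradient_accumulation_steps  number of update steps to accumulate before performing a backward/update pass
    --allreduce_post_accumulation  whether to perform all-reduce during gradient accumulation steps
    --do_train  run training
    --fp16  mixed-precision training
    --amp  mixed-precision training
    --json-summary  output JSON summary file
```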
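To show how `--warmup_proportion`, `--max_steps` and `--learning_rate` interact, here is a minimal sketch of a linear warmup schedule. It covers only the warmup ramp; holding the rate constant afterwards is a simplifying assumption, and the training script may apply a decay instead. The function name `warmup_lr` is hypothetical.

```python
def warmup_lr(step, max_steps, learning_rate, warmup_proportion):
    """Learning rate at a zero-based step under linear warmup.

    The rate ramps linearly to `learning_rate` over the first
    `warmup_proportion * max_steps` steps, then stays constant
    (an assumption of this sketch).
    """
    warmup_steps = max(1, int(max_steps * warmup_proportion))
    if step < warmup_steps:
        return learning_rate * (step + 1) / warmup_steps
    return learning_rate


if __name__ == "__main__":
    for step in (0, 49, 99, 500):
        print(step, warmup_lr(step, max_steps=1000, learning_rate=1e-3, warmup_proportion=0.1))
```

With `max_steps=1000` and `warmup_proportion=0.1`, the first 100 steps ramp up and every later step uses the full learning rate.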

#### 2.PHRASE1

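```
# Single card
./bert_pre1.sh       # single precision (edit the APP setting in single_pre1_1.sh to match your paths)
./bert_pre1_fp16.sh  # half precision (edit the APP setting in single_pre1_1_fp16.sh to match your paths)
# Multiple cards
./bert_pre1_4.sh       # single precision (edit the APP setting in single_pre1_4.sh to match your paths)
./bert_pre1_4_fp16.sh  # half precision (edit the APP setting in single_pre1_4_fp16.sh to match your paths)
# Multiple nodes, multiple cards
# Adjust #SBATCH -p and #SBATCH -J for your cluster. For fp16, add --fp16 and --amp to APP
# in the corresponding single_* script. Results are saved in the corresponding slurm output file.
cd 2node-run-pre
sbatch run_bert_pre1_4dcu.sh
```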

#### 3.PHRASE2

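```
# Single card
./bert_pre2.sh       # single precision (edit the APP setting in single_pre2_1.sh to match your paths)
./bert_pre2_fp16.sh  # half precision (edit the APP setting in single_pre2_1_fp16.sh to match your paths)
# Multiple cards
./bert_pre2_4.sh       # single precision (edit the APP setting in single_pre2_4.sh to match your paths)
./bert_pre2_4_fp16.sh  # half precision (edit the APP setting in single_pre2_4_fp16.sh to match your paths)
# Multiple nodes, multiple cards
# Adjust #SBATCH -p and #SBATCH -J for your cluster. For fp16, add --fp16 and --amp to APP
# in the corresponding single_* script. Results are saved in the corresponding slurm output file.
cd 2node-run-pre
sbatch run_bert_pre2_4dcu.sh
```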

## Accuracy Data

| Training | Cards | Batch size | Iterations | Accuracy metric (as reported)  |
| -------- | ----- | ---------- | ---------- | ------------------------------ |
| PHRASE1  | 1     | 16         | 634 steps  | 9.7421875                      |
| SQUAD    | 1     | 16         | 3 epochs   | final_loss : 3.897481918334961 |

## Source Repository and Issue Feedback

https://developer.hpccube.com/codes/modelzoo/bert-pytorch

## References

https://github.com/mlperf/training_results_v0.7/tree/master/NVIDIA/benchmarks/bert/implementations/pytorch

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT