
# bert-large Training

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

[BERT paper (PDF)](https://arxiv.org/pdf/1810.04805.pdf)

## Environment Setup

### Docker

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04.1-py3.10
```

Install the remaining dependencies from requirements.txt:

```
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/  
```
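
After installation, a quick import check can confirm that the environment is usable. The snippet below is a minimal sketch, not part of this repository: the package names `torch` and `h5py` are assumptions (the image ships PyTorch and the dataset shards are `.hdf5` files), so adjust the list to match what requirements.txt actually contains.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed package names; replace with the entries from requirements.txt.
    required = ["torch", "h5py"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```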

## Dataset

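Pre-training data: this project uses the wiki20220401 dump. The dataset is nearly 20 GB compressed and about 300 GB after extraction, so it is slow to download and takes a large amount of disk space to extract. Because the wiki dataset is updated frequently and the official site does not keep older versions, download links to the preprocessed seq128 and seq512 datasets are provided here.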

(seq128, used for PHRASE1) Link: https://pan.baidu.com/s/13GA-Jmfr2qXrChjiM2UfFQ?pwd=l30u  Extraction code: l30u

(seq512, used for PHRASE2) Link: https://pan.baidu.com/s/1MBFjYNsGQzlnc8aEb7Pg4w?pwd=6ap2  Extraction code: 6ap2

**This guide uses the wiki dataset that has already been downloaded and preprocessed on the server. The pre-training data is split into PHRASE1 and PHRASE2.**

`wiki dataset structure`

```
├── wikicorpus_en_128
│   ├── training
│   │   ├── wikicorpus_en_training_0.tfrecord.hdf5
│   │   ├── wikicorpus_en_training_1000.tfrecord.hdf5
│   │   └── ...
│   └── test
│       ├── wikicorpus_en_test_99.tfrecord.hdf5
│       ├── wikicorpus_en_test_9.tfrecord.hdf5
│       └── ...
└── wikicorpus_en_512
    ├── training
    │   ├── wikicorpus_en_training_0.tfrecord.hdf5
    │   ├── wikicorpus_en_training_1000.tfrecord.hdf5
    │   └── ...
    └── test
        ├── wikicorpus_en_test_99.tfrecord.hdf5
        ├── wikicorpus_en_test_9.tfrecord.hdf5
        └── ...
```
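
Before starting a long run, it can help to confirm that a dataset folder matches the layout above. The following is a minimal sketch using only the standard library; the helper name `count_shards` is hypothetical and not part of this repository. It only counts `*.hdf5` files in the `training` and `test` subfolders and does not open or validate the shards.

```python
from pathlib import Path


def count_shards(dataset_root):
    """Count *.hdf5 shard files in the training/ and test/ subfolders."""
    root = Path(dataset_root)
    counts = {}
    for split in ("training", "test"):
        split_dir = root / split
        # A missing subfolder is reported as 0 shards instead of raising.
        if split_dir.is_dir():
            counts[split] = len(list(split_dir.glob("*.hdf5")))
        else:
            counts[split] = 0
    return counts
```

For example, `count_shards("wikicorpus_en_128")` returns a dict with the shard counts for `training` and `test`; a count of 0 usually means the path passed to `--input_dir` is wrong or the archive was not fully extracted.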


SQuAD 1.1 question-answering data:

[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)

[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

`squadv1.1 data structure`

```
├── dev-v1.1.json
└── train-v1.1.json
```
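
To sanity-check the downloaded JSON files, the sketch below counts articles, paragraphs, and questions. It relies on the published SQuAD v1.1 layout (`data` → `paragraphs` → `qas`); the helper names `load_squad` and `squad_stats` are hypothetical and not part of this repository.

```python
import json


def load_squad(path):
    """Load a SQuAD JSON file such as train-v1.1.json into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def squad_stats(squad):
    """Count articles, paragraphs and questions in a parsed SQuAD v1.1 dict."""
    articles = squad["data"]
    paragraphs = [p for article in articles for p in article["paragraphs"]]
    questions = [qa for p in paragraphs for qa in p["qas"]]
    return {
        "articles": len(articles),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
    }
```

Typical usage would be `squad_stats(load_squad("train-v1.1.json"))`.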

## Training

#### 1. Parameter description

```
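    --input_dir  input data directory
    --output_dir directory where outputs are saved
    --config_file model configuration file
    --bert_model  BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
    --train_batch_size training batch size
    --max_seq_length=128 maximum sequence length (must match the training data)
    --max_predictions_per_seq maximum total number of masked tokens in an input sequence
    --max_steps maximum number of training steps
    --warmup_proportion proportion of training used for linear learning-rate warmup
    --num_steps_per_checkpoint number of steps between checkpoint saves
    --learning_rate learning rate
    --seed random seed
    --gradient_accumulation_steps number of update steps to accumulate before performing a backward/update pass
    --allreduce_post_accumulation whether to perform all-reduce during gradient accumulation steps
    --do_train whether to run training
    --fp16 mixed-precision training
    --amp mixed-precision training
    --json-summary output JSON file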
```
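
Several of these parameters interact. The sketch below shows how `--max_steps`, `--warmup_proportion`, `--train_batch_size`, and `--gradient_accumulation_steps` might combine. It is an illustration under stated assumptions, not the training script's actual code: it assumes the warmup length is `max_steps * warmup_proportion`, that `train_batch_size` is split evenly across accumulation steps, and it does not model any learning-rate decay after warmup. All function names are hypothetical.

```python
def warmup_steps(max_steps, warmup_proportion):
    """Number of steps spent in linear warmup (assumed: proportion of max_steps)."""
    return int(max_steps * warmup_proportion)


def micro_batch_size(train_batch_size, gradient_accumulation_steps):
    """Per-pass batch size, assuming the batch is split across accumulation steps."""
    if train_batch_size % gradient_accumulation_steps != 0:
        raise ValueError("train_batch_size must be divisible by "
                         "gradient_accumulation_steps")
    return train_batch_size // gradient_accumulation_steps


def linear_warmup_lr(step, learning_rate, n_warmup):
    """Learning rate at `step` during linear warmup.

    After warmup this returns the base rate unchanged; any decay schedule
    used by the real script is intentionally not modelled here.
    """
    if n_warmup <= 0 or step >= n_warmup:
        return learning_rate
    return learning_rate * step / n_warmup
```

Checking these derived values before launching helps catch settings such as a batch size that is not divisible by the accumulation steps.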

#### 2. PHRASE1

```
# multi-card
bash bert_pre1_4.sh        # single precision (edit the APP setting in single_pre1_4.sh to match your own paths)
#bash bert_pre1_4_fp16.sh   # half precision (edit the APP setting in single_pre1_4_fp16.sh to match your own paths)
```

#### 3. PHRASE2

```
# multi-card
bash bert_pre2_4.sh       # single precision (edit the APP setting in single_pre2_4.sh to match your own paths)
#bash bert_pre2_4_fp16.sh  # half precision (edit the APP setting in single_pre2_4_fp16.sh to match your own paths)
```