# **BERT**

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

[BERT paper (PDF)](https://arxiv.org/pdf/1810.04805.pdf)

## Model Structure

```
BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language-representation model. Rather than pretraining with a traditional unidirectional language model, or a shallow concatenation of two unidirectional models, it uses a masked language model (MLM) objective, which allows it to learn deep bidirectional language representations.
```


![bert_model](bert_model.png)
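
To make the MLM objective concrete, here is a minimal sketch of BERT-style token masking (the standard 80%/10%/10% scheme from the paper; the function and example ids are illustrative, not this repository's preprocessing code):

```
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select ~15% of tokens as MLM prediction targets."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # unmasked positions are ignored by the loss
    # 80% of masked positions are replaced with the [MASK] token
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id
    # 10% get a random token; the remaining 10% keep the original token
    rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[rand] = torch.randint(vocab_size, input_ids.shape)[rand]
    return input_ids, labels

# e.g. mask_token_id=103 is [MASK] in the bert-base-uncased vocabulary
ids, labels = mask_tokens(torch.randint(5, 1000, (2, 16)), mask_token_id=103, vocab_size=30522)
```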

## Algorithm

![bert](bert.png)

```
BERT does not use the full Transformer architecture (Encoder + Decoder); it uses only the Encoder part, stacking multiple Encoder layers to form its basic network structure.
```
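
The encoder-only design can be sketched with PyTorch built-ins (illustrative only, using BERT-base sizes L=12, H=768, A=12; the repository's actual model code is more involved):

```
import torch
import torch.nn as nn

# One Transformer Encoder layer with BERT-base hyperparameters and GELU activation
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                                   activation="gelu", batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)  # 12 stacked Encoders, no Decoder

hidden = torch.randn(2, 128, 768)  # (batch, seq_len, hidden): embedded input tokens
out = encoder(hidden)              # contextualized representations, same shape
```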

## Environment Setup

`Note: the dtk, python, torch, apex, etc. versions must match each other.`

### Docker (Option 1)

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
# Start the container, then install the missing dependencies inside it
docker run -dit --network=host --name=bert-pytorch --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G -v /opt/hyhal:/opt/hyhal:ro --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker exec -it bert-pytorch /bin/bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
# The tensorflow wheel is downloaded from the 光合 community (the Conda section lists the same link)
wget https://cancon.hpccube.com:65024/directlink/4/tensorflow/DAS1.1/tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl
pip install tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### Dockerfile (Option 2)

```
docker build -t bert:latest .
docker run -dit --network=host --name=bert-pytorch --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G -v /opt/hyhal:/opt/hyhal:ro --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 bert:latest
docker exec -it bert-pytorch /bin/bash
# For the tensorflow wheel, see the download link in the Conda section
pip install tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### Conda (Option 3)

```
# Create a conda virtual environment
conda create -n bert-pytorch python=3.10
```

The toolkits and deep-learning libraries required by this project for DCU cards can all be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.

[apex](https://cancon.hpccube.com:65024/directlink/4/apex/DAS1.1/apex-1.1.0+das1.1.gitf477a3a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl)

[pytorch2.1.0](https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.1/torch-2.1.0+das1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl)

[tensorflow2.13.1](https://cancon.hpccube.com:65024/directlink/4/tensorflow/DAS1.1/tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl)

[DTK24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)

Install the remaining dependencies from requirements.txt:

```
pip install -r requirements.txt
```
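
After installation, a quick sanity check (not part of the project scripts) confirms that the DCU build of torch can see the device; on DCU/ROCm builds the `torch.cuda` API maps to the device backend:

```
import torch

print(torch.__version__)           # should report the das/dtk build of 2.1.0
print(torch.cuda.is_available())   # True when the DCU is visible in the container
print(torch.cuda.device_count())   # number of visible cards
```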



## Dataset

Pretraining data: this project uses the wiki20220401 dump. The compressed dataset is nearly 20 GB (around 300 GB uncompressed), so it is slow to download and consumes a lot of disk space. Because the wiki dataset is updated frequently and the official site does not keep old dumps, network-disk links to the already-processed seq128 and seq512 datasets are provided below.

(seq128, corresponding to PHASE1) Link: https://pan.baidu.com/s/13GA-Jmfr2qXrChjiM2UfFQ?pwd=l30u  Access code: l30u

(seq512, corresponding to PHASE2) Link: https://pan.baidu.com/s/1MBFjYNsGQzlnc8aEb7Pg4w?pwd=6ap2  Access code: 6ap2

Dataset quick-download center: [SCNet AIDatasets](http://113.200.138.88:18080/aidatasets). The datasets used in this project can be fetched from the quick-download channel: [seq128](http://113.200.138.88:18080/aidatasets/project-dependency/bert_pytorch/-/raw/master/wikicorpus_en(128).zip) and [seq512](http://113.200.138.88:18080/aidatasets/project-dependency/bert_pytorch/-/raw/master/wikicorpus_en%20(512).zip).

The pretraining data is divided into PHASE1 and PHASE2; if your server already has the downloaded and processed wiki dataset, it can be used directly.

`wiki dataset layout`

```
wikicorpus_en
├── training
│   ├── wikicorpus_en_training_0.tfrecord.hdf5
│   ├── wikicorpus_en_training_1000.tfrecord.hdf5
│   └── ...
└── test
    ├── wikicorpus_en_test_9.tfrecord.hdf5
    ├── wikicorpus_en_test_99.tfrecord.hdf5
    └── ...
```
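
Each shard is an HDF5 file of pre-tokenized sequences. The sketch below inspects one shard with h5py; the key names are an assumption based on the common MLPerf/NVIDIA BERT HDF5 layout, so check `f.keys()` against your own files:

```
import h5py

path = "wikicorpus_en/training/wikicorpus_en_training_0.tfrecord.hdf5"
with h5py.File(path, "r") as f:
    print(list(f.keys()))        # assumed keys: input_ids, input_mask, segment_ids, ...
    print(f["input_ids"].shape)  # (num_samples, max_seq_length): 128 for PHASE1, 512 for PHASE2
```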

```
# Example: downloading and preprocessing the wiki dataset
cd cleanup_scripts  
mkdir -p wiki  
cd wiki  
wget https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles-multistream.xml.bz2    # Optionally use curl instead  
bzip2 -d enwiki-20200101-pages-articles-multistream.xml.bz2  
cd ..    # back to bert/cleanup_scripts  
git clone https://github.com/attardi/wikiextractor.git  
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml    # Results are placed in bert/cleanup_scripts/text  
./process_wiki.sh '<text/*/wiki_??'  
```


SQuAD 1.1 question-answering data:

[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)

[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

`SQuAD v1.1 data layout`

```
├── dev-v1.1.json
└── train-v1.1.json
```
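
A quick way to verify the download (the `data -> paragraphs -> qas` nesting is the official SQuAD JSON format):

```
import json

with open("train-v1.1.json") as f:
    squad = json.load(f)

# Count questions across all articles and paragraphs
n_q = sum(len(p["qas"]) for a in squad["data"] for p in a["paragraphs"])
print(f"articles={len(squad['data'])}, questions={n_q}")  # train-v1.1 has roughly 87k questions
```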

### Model weights download

[bert-large-uncased model for SQuAD training (already converted, ready to use; access code: vs8d)](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)

[bert-large-uncased_L-24_H-1024_A-16 (conversion required)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)

[bert-base-uncased_L-12_H-768_A-12 (conversion required)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)

## Training

### SQuAD training

#### 1. Model conversion

```
# If the downloaded model is in .ckpt format, it must be converted to .ckpt.pt
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint uncased_L-24_H-1024_A-16/bert_model.ckpt --bert_config_path uncased_L-24_H-1024_A-16/bert_config.json --output_checkpoint uncased_L-24_H-1024_A-16/model.ckpt.pt
```
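
To confirm the conversion produced a loadable PyTorch checkpoint, a small hedged check (the exact dictionary layout depends on the conversion script):

```
import torch

ckpt = torch.load("uncased_L-24_H-1024_A-16/model.ckpt.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):  # peek at a few entries of the (state) dict
    for k in list(ckpt)[:5]:
        v = ckpt[k]
        print(k, getattr(v, "shape", type(v)))
```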

#### 2. Arguments

```
  --train_file            training data
  --predict_file          prediction file
  --init_checkpoint       model checkpoint file
  --vocab_file            vocabulary file
  --output_dir            output directory
  --config_file           model configuration file
  --json-summary          output JSON summary file
  --bert_model            BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
  --do_train              whether to train
  --do_predict            whether to predict
  --train_batch_size      training batch size
  --predict_batch_size    prediction batch size
  --gpus_per_node         number of GPUs used per node
  --local_rank            local_rank for GPU-based distributed training (set to -1 for single card)
  --fp16                  mixed-precision training
  --amp                   mixed-precision training
```
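
`--fp16`/`--amp` enable mixed-precision training: the forward and backward passes run largely in fp16 with loss scaling to avoid gradient underflow. A self-contained `torch.cuda.amp` sketch of the idea (the repository's scripts may rely on apex instead; the tiny linear model is just a stand-in for BERT):

```
import torch
import torch.nn as nn

model = nn.Linear(128, 2).cuda()                   # stand-in for the BERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(16, 128, device="cuda")
y = torch.randint(0, 2, (16,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                    # ops run in fp16 where it is safe
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()                      # scale loss so fp16 grads don't underflow
scaler.step(optimizer)                             # unscale grads, skip the step on inf/nan
scaler.update()
```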

#### 3. Run

```
# Single card
bash bert_squad.sh         # FP32 (edit the APP setting in single_squad.sh to match your paths)
bash bert_squad_fp16.sh    # FP16 (edit the APP setting in single_squad_fp16.sh to match your paths)
# For --init_checkpoint, use model.ckpt-28252.pt or your own converted model.ckpt.pt

# Tips:
# 1. If a shell script such as bert_squad.sh is reported as not executable, run it with:
#    bash bert_squad.sh
# 2. If the run complains about an illegal node name such as K100_AI, fix it inside the
#    container with `sudo hostname xxx` (a name without underscores, e.g. k21), then exit
#    and re-enter the container.
```

```
# Multi-card
bash bert_squad4.sh        # FP32 (edit the APP setting in single_squad4.sh to match your paths)
bash bert_squad4_fp16.sh   # FP16 (edit the APP setting in single_squad4_fp16.sh to match your paths)
# For --init_checkpoint, use model.ckpt-28252.pt or your own converted model.ckpt.pt
```

```
# Multi-node, multi-card
# On node 1, edit the hostfile for your environment. Both nodes must have identical file
# paths and configuration. Set the NIC name in the hostfile to the interface carrying the
# node IP (check with `ip a`); the NUMA binding can be adjusted to the node topology.
cd 2node-run-squad
sh run_bert_squad_4dcu.sh  # for fp16, add --fp16 and --amp to APP in the corresponding single_* script
# For --init_checkpoint, use model.ckpt-28252.pt or your own converted model.ckpt.pt
```



### **PHASE pretraining**

#### 1. Arguments

```
    --input_dir                    input data directory
    --output_dir                   output directory
    --config_file                  model configuration file
    --bert_model                   BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
    --train_batch_size             training batch size
    --max_seq_length=128           maximum sequence length (must match the training data)
    --max_predictions_per_seq      maximum total number of masked tokens per input sequence
    --max_steps                    maximum number of training steps
    --warmup_proportion            proportion of training used for linear learning-rate warmup
    --num_steps_per_checkpoint     how many steps between checkpoint saves
    --learning_rate                learning rate
    --seed                         random seed
    --gradient_accumulation_steps  number of update steps to accumulate before performing a backward/update pass
    --allreduce_post_accumulation  whether to perform allreduce only after the gradient-accumulation steps
    --do_train                     whether to train
    --fp16                         mixed-precision training
    --amp                          mixed-precision training
    --json-summary                 output JSON summary file
```
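
`--gradient_accumulation_steps` spreads one effective batch over several micro-batches: gradients are accumulated for N backward passes before a single optimizer update, emulating an N-times-larger batch size. An illustrative loop (not the repository's trainer):

```
import torch
import torch.nn as nn

model = nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4                                    # e.g. --gradient_accumulation_steps 4

optimizer.zero_grad()
for step in range(8):
    x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps  # average over micro-batches
    loss.backward()                                # grads accumulate in .grad across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one update per accum_steps micro-batches
        optimizer.zero_grad()
```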

#### 2. PHASE1

```
# Single card
bash bert_pre1.sh          # FP32 (edit the APP setting in single_pre1_1.sh to match your paths)
bash bert_pre1_fp16.sh     # FP16 (edit the APP setting in single_pre1_1_fp16.sh to match your paths)
# Multi-card
bash bert_pre1_4.sh        # FP32 (edit the APP setting in single_pre1_4.sh to match your paths)
bash bert_pre1_4_fp16.sh   # FP16 (edit the APP setting in single_pre1_4_fp16.sh to match your paths)

# Multi-node, multi-card
# On node 1, edit the hostfile for your environment. Both nodes must have identical file
# paths and configuration. Set the NIC name in the hostfile to the interface carrying the
# node IP (check with `ip a`); the NUMA binding can be adjusted to the node topology.
cd 2node-run-pre
sh run_bert_pre1_4dcu.sh   # for fp16, add --fp16 and --amp to APP in the corresponding single_* script
```

#### 3. PHASE2

```
# Single card
bash bert_pre2.sh          # FP32 (edit the APP setting in single_pre2_1.sh to match your paths)
bash bert_pre2_fp16.sh     # FP16 (edit the APP setting in single_pre2_1_fp16.sh to match your paths)
# Multi-card
bash bert_pre2_4.sh        # FP32 (edit the APP setting in single_pre2_4.sh to match your paths)
bash bert_pre2_4_fp16.sh   # FP16 (edit the APP setting in single_pre2_4_fp16.sh to match your paths)

# Multi-node, multi-card
# On node 1, edit the hostfile for your environment. Both nodes must have identical file
# paths and configuration. Set the NIC name in the hostfile to the interface carrying the
# node IP (check with `ip a`); the NUMA binding can be adjusted to the node topology.
cd 2node-run-pre
sh run_bert_pre2_4dcu.sh   # for fp16, add --fp16 and --amp to APP in the corresponding single_* script
```

## Results

![result](result.jpg)

### Accuracy

| Training | Cards | Batch size | Iterations | Loss                          |
| -------- | ----- | ---------- | ---------- | ----------------------------- |
| PHASE1   | 1     | 16         | 634 steps  | 9.7421875                     |
| SQuAD    | 1     | 16         | 3 epochs   | final_loss: 3.897481918334961 |

## Application Scenarios

### Algorithm category

`Dialogue and question answering`

### Key industries

`Internet, education, scientific research`

## Pretrained Weights

Pretrained-weight quick-download center: [SCNet AIModels](http://113.200.138.88:18080/aimodels). The pretrained weights for this project can be fetched from the quick-download channel: [bert-large-uncased](http://113.200.138.88:18080/aidatasets/project-dependency/bert_pytorch/-/raw/master/bs64k_32k_ckpt.tar.gz), [bert-large-uncased_L-24_H-1024_A-16](http://113.200.138.88:18080/aidatasets/project-dependency/bert-large-uncased/-/raw/master/uncased_L-24_H-1024_A-16.zip), [bert-base-uncased_L-12_H-768_A-12](http://113.200.138.88:18080/aidatasets/project-dependency/bert-large-uncased/-/raw/master/uncased_L-12_H-768_A-12.zip).

## Source Repository & Issue Reporting

https://developer.hpccube.com/codes/modelzoo/bert-pytorch

## References

https://github.com/mlperf/training_results_v0.7/tree/master/NVIDIA/benchmarks/bert/implementations/pytorch

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT