# **BERT**

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

[BERT paper (PDF)](https://arxiv.org/pdf/1810.04805.pdf)

## Model Architecture

```
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model. Rather than pre-training with a traditional unidirectional language model, or a shallow concatenation of two unidirectional models, it uses a masked language model (MLM) objective, which allows it to learn deep bidirectional language representations.
```


![bert_model](bert_model.png)
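The MLM objective mentioned above can be sketched in plain Python. This is a simplified illustration of BERT's masking recipe (15% of positions chosen; of those, 80% become `[MASK]`, 10% a random token, 10% left unchanged — the rates follow the paper, the helper itself is hypothetical):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM masking: select ~15% of positions as prediction
    targets; 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}  # position -> original token (what the model must predict)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token at this position
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, vocab=tokens)
print(masked, labels)
```

Because the model must recover the original token from both left and right context, the learned representations are deeply bidirectional, unlike a left-to-right language model.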

## Algorithm Overview

![bert](bert.png)

```
BERT does not use the full Transformer architecture (encoder + decoder); it uses only the encoder. Multiple encoder layers are stacked together to form BERT's basic network structure.
```
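The encoder stacking can be illustrated with a toy self-attention pass in plain Python. This is purely didactic: it omits the learned Q/K/V projections, multiple heads, feed-forward sublayers, residual connections, and LayerNorm of a real encoder layer, and only shows how each layer's output feeds the next:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    """Single-head self-attention over a sequence of vectors X
    (weights omitted: Q = K = V = X, to show only the data flow)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)  # each position attends to every position
        out.append([sum(wj * vj[i] for wj, vj in zip(w, X)) for i in range(d)])
    return out

def encoder_stack(X, num_layers=2):
    """BERT's trunk: the output of one encoder layer is the input of the next."""
    for _ in range(num_layers):
        X = self_attention(X)
    return X

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(encoder_stack(seq))
```

Because every position attends to all others in both directions, there is no causal mask — which is exactly why the decoder half of the Transformer is unnecessary here.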

## Environment Setup

`Note: the dtk, python, torch, and apex versions must be mutually compatible.`

### Docker (Option 1)

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10

# Start a container and install any missing dependencies inside it
docker run -dit --network=host --name=bert-pytorch --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G -v /opt/hyhal:/opt/hyhal:ro --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker exec -it bert-pytorch /bin/bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

# The tensorflow wheel is downloaded from the 光合 (SourceFind) developer community; see the Conda section for the download address
wget https://cancon.hpccube.com:65024/directlink/4/tensorflow/DAS1.1/tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl
pip install tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### Dockerfile (Option 2)

```
docker build -t bert:latest .
docker run -dit --network=host --name=bert-pytorch --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G -v /opt/hyhal:/opt/hyhal:ro --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 bert:latest
docker exec -it bert-pytorch /bin/bash
# Download the tensorflow wheel from the address given in the Conda section
pip install tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### Conda (Option 3)

```
# Create a virtual environment
conda create -n bert-pytorch python=3.10
```

The toolkits and deep-learning libraries this project needs for DCU cards can all be downloaded from the [光合](https://developer.sourcefind.cn/tool/) (SourceFind) developer community.

[apex](https://cancon.hpccube.com:65024/directlink/4/apex/DAS1.1/apex-1.1.0+das1.1.gitf477a3a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl)

[pytorch2.1.0](https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.1/torch-2.1.0+das1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl)

[tensorflow2.13.1](https://cancon.hpccube.com:65024/directlink/4/tensorflow/DAS1.1/tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl)

[DTK24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)

Install the other dependencies from requirements.txt:

```
pip install -r requirements.txt
```



## Dataset

Pre-training data: this project uses the wiki20220401 dump. The compressed dataset is nearly 20 GB and expands to roughly 300 GB, so downloading is slow and extraction consumes a lot of disk space. Because the wiki dataset is updated frequently and the official site does not keep old versions, pre-processed seq128 and seq512 datasets are provided via the netdisk links below.

(seq128, used for PHRASE1): https://pan.baidu.com/s/13GA-Jmfr2qXrChjiM2UfFQ?pwd=l30u  access code: l30u

(seq512, used for PHRASE2): https://pan.baidu.com/s/1MBFjYNsGQzlnc8aEb7Pg4w?pwd=6ap2  access code: 6ap2

If the wiki dataset has already been downloaded and pre-processed on your server, it can be used directly; the pre-training data is split into PHRASE1 and PHRASE2.

`wiki dataset layout`

```
└── wikicorpus_en
    ├── training
    │   ├── wikicorpus_en_training_0.tfrecord.hdf5
    │   ├── wikicorpus_en_training_1000.tfrecord.hdf5
    │   └── ...
    └── test
        ├── wikicorpus_en_test_99.tfrecord.hdf5
        ├── wikicorpus_en_test_9.tfrecord.hdf5
        └── ...
```

```
# Example: downloading and processing the wiki dataset
cd cleanup_scripts  
mkdir -p wiki  
cd wiki  
wget https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles-multistream.xml.bz2    # Optionally use curl instead  
bzip2 -d enwiki-20200101-pages-articles-multistream.xml.bz2  
cd ..    # back to bert/cleanup_scripts  
git clone https://github.com/attardi/wikiextractor.git  
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml    # Results are placed in bert/cleanup_scripts/text  
./process_wiki.sh '<text/*/wiki_??'  
```


SQuAD 1.1 question-answering data:

[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)

[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

`SQuAD v1.1 data layout`

```
├── dev-v1.1.json
└── train-v1.1.json
```
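Both JSON files follow the standard SQuAD v1.1 schema (`data → paragraphs → qas → answers`, where `answer_start` is a character offset into the paragraph's context). A minimal traversal sketch — the sample record here is made up, only the schema matches the real files:

```python
# A miniature record in the same schema as train-v1.1.json / dev-v1.1.json
sample = {
    "version": "1.1",
    "data": [{
        "title": "BERT",
        "paragraphs": [{
            "context": "BERT was introduced by Google in 2018.",
            "qas": [{
                "id": "q1",
                "question": "Who introduced BERT?",
                "answers": [{"text": "Google", "answer_start": 23}],
            }],
        }],
    }],
}

def iter_examples(squad):
    """Yield (question, context, answer_text, answer_start) tuples."""
    for article in squad["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    yield qa["question"], para["context"], ans["text"], ans["answer_start"]

for q, ctx, text, start in iter_examples(sample):
    # answer_start is a character offset: the answer text must appear there
    assert ctx[start:start + len(text)] == text
```

With the real files you would first do `squad = json.load(open("train-v1.1.json"))` and iterate the same way.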

### Downloading Model Weights

[bert-large-uncased model for SQuAD training (already converted, ready to use); access code: vs8d](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)

[bert-large-uncased_L-24_H-1024_A-16 (conversion required)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)

[bert-base-uncased_L-12_H-768_A-12 (conversion required)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)

## Training

### SQuAD Training

#### 1. Model Conversion

```
# If the downloaded model is in .ckpt format, it must be converted to .ckpt.pt
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint uncased_L-24_H-1024_A-16/bert_model.ckpt --bert_config_path uncased_L-24_H-1024_A-16/bert_config.json --output_checkpoint uncased_L-24_H-1024_A-16/model.ckpt.pt
```
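What a TF-to-PyTorch conversion of this kind does, in essence, is map TensorFlow variable names onto PyTorch parameter names (and transpose dense-layer kernels). The sketch below only illustrates that idea; it is not the code of `tf_to_torch/convert_tf_checkpoint.py`, whose actual rules may differ:

```python
def tf_name_to_torch(tf_name):
    """Map a TF BERT checkpoint variable name to a PyTorch-style parameter
    name (illustrative only; the repo's converter holds the real rules)."""
    name = tf_name.replace("bert/", "bert.")
    name = name.replace("encoder/layer_", "encoder.layer.")
    name = name.replace("/", ".")
    # TF dense kernels map to PyTorch weights (and also need a transpose)
    name = name.replace("kernel", "weight")
    # TF LayerNorm uses gamma/beta; PyTorch uses weight/bias
    name = name.replace("gamma", "weight").replace("beta", "bias")
    return name

print(tf_name_to_torch("bert/encoder/layer_0/attention/self/query/kernel"))
```

The converter script also loads each TF tensor and saves the renamed (and, for kernels, transposed) tensors into a single PyTorch checkpoint file.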

#### 2. Parameter Reference

```
  --train_file  training data
  --predict_file  prediction file
  --init_checkpoint  model checkpoint file
  --vocab_file  vocabulary file
  --output_dir  output directory
  --config_file  model configuration file
  --json-summary  output JSON summary file
  --bert_model  BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
  --do_train  run training
  --do_predict  run prediction
  --train_batch_size  training batch size
  --predict_batch_size  prediction batch size
  --gpus_per_node  number of GPUs per node
  --local_rank  local_rank for GPU-based distributed training (set to -1 for a single card)
  --fp16  mixed-precision training
  --amp  mixed-precision training
```
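For sanity-checking a command line before launching, the flag list above can be mirrored with a stdlib `argparse` parser. This is a hypothetical mirror for illustration, not the project's actual argument parser:

```python
import argparse

# Hypothetical mirror of the SQuAD run flags listed above
parser = argparse.ArgumentParser(description="SQuAD run flags (illustrative)")
parser.add_argument("--train_file")
parser.add_argument("--predict_file")
parser.add_argument("--init_checkpoint")
parser.add_argument("--vocab_file")
parser.add_argument("--output_dir")
parser.add_argument("--config_file")
parser.add_argument("--bert_model", choices=[
    "bert-base-uncased", "bert-large-uncased", "bert-base-cased",
    "bert-large-cased", "bert-base-multilingual-uncased",
    "bert-base-multilingual-cased", "bert-base-chinese"])
parser.add_argument("--do_train", action="store_true")
parser.add_argument("--do_predict", action="store_true")
parser.add_argument("--train_batch_size", type=int, default=16)
parser.add_argument("--predict_batch_size", type=int, default=16)
parser.add_argument("--local_rank", type=int, default=-1)  # -1 = single card
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--amp", action="store_true")

args = parser.parse_args([
    "--bert_model", "bert-large-uncased",
    "--do_train", "--train_batch_size", "16", "--fp16", "--amp",
])
print(args.bert_model, args.train_batch_size, args.fp16)
```

Note that `--fp16` and `--amp` are passed together in this repo's scripts to enable mixed-precision training.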

#### 3. Run

```
# Single card
bash bert_squad.sh       # FP32 (adjust the APP settings in single_squad.sh to your own paths)
bash bert_squad_fp16.sh  # FP16 (adjust the APP settings in single_squad_fp16.sh to your own paths)
# For --init_checkpoint, use model.ckpt-28252.pt or a model.ckpt.pt you converted yourself

# Tips:
# 1. If a shell script such as bert_squad.sh is reported as not executable, run it with:
#    bash bert_squad.sh
# 2. If the run reports an illegal node name such as K100_AI, fix it inside the container:
#    sudo hostname xxx   # replace xxx with a name without underscores, e.g. k21,
#    then exit and re-enter the container
```

```
# Multi-card
bash bert_squad4.sh       # FP32 (adjust the APP settings in single_squad4.sh to your own paths)
bash bert_squad4_fp16.sh  # FP16 (adjust the APP settings in single_squad4_fp16.sh to your own paths)
# For --init_checkpoint, use model.ckpt-28252.pt or a model.ckpt.pt you converted yourself
```

```
# Multi-node, multi-card
# On node 1, edit hostfile for your environment; both nodes need identical file paths and configuration.
# Set the interface in hostfile to the NIC name matching each node's IP (see `ip a`);
# numa bindings can be adjusted to the node topology.
cd 2node-run-squad
sh run_bert_squad_4dcu.sh   # for fp16, add --fp16 and --amp to the APP settings in the corresponding single_* file
# For --init_checkpoint, use model.ckpt-28252.pt or a model.ckpt.pt you converted yourself
```



### **PHRASE Pre-training Test**

#### 1. Parameter Reference

```
    --input_dir  input data directory
    --output_dir  output directory
    --config_file  model configuration file
    --bert_model  BERT model type, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
    --train_batch_size  training batch size
    --max_seq_length=128  maximum sequence length (must match the training data)
    --max_predictions_per_seq  maximum total number of masked tokens per input sequence
    --max_steps  maximum number of training steps
    --warmup_proportion  proportion of training used for linear learning-rate warmup
    --num_steps_per_checkpoint  number of steps between checkpoint saves
    --learning_rate  learning rate
    --seed  random seed
    --gradient_accumulation_steps  number of update steps to accumulate before a backward/update pass
    --allreduce_post_accumulation  whether to do all-reduces during gradient accumulation steps
    --do_train  run training
    --fp16  mixed-precision training
    --amp  mixed-precision training
    --json-summary  output JSON summary file
```
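Two of the flags above interact in a way worth spelling out: `--gradient_accumulation_steps` spreads one optimizer update across several micro-batches, and `--warmup_proportion` ramps the learning rate linearly before it decays. A plain-Python sketch of both mechanics under a toy scalar "model" (everything here is illustrative, not the repo's training loop):

```python
def warmup_lr(step, max_steps, base_lr, warmup_proportion):
    """Linear warmup to base_lr, then linear decay to 0,
    as commonly used for BERT pre-training."""
    warmup_steps = int(max_steps * warmup_proportion)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * (max_steps - step) / max(1, max_steps - warmup_steps)

def train(micro_batches, grad_accum_steps, base_lr, max_steps, warmup_proportion):
    w, grad_sum, update = 0.0, 0.0, 0
    for i, batch in enumerate(micro_batches):
        grad_sum += sum(batch) / len(batch)        # toy "gradient" of this micro-batch
        if (i + 1) % grad_accum_steps == 0:        # one optimizer step per accumulation window
            lr = warmup_lr(update, max_steps, base_lr, warmup_proportion)
            w -= lr * grad_sum / grad_accum_steps  # average the accumulated gradients
            grad_sum, update = 0.0, update + 1
    return w
```

The effective batch size is therefore `train_batch_size × gradient_accumulation_steps` (times the number of cards), which is why large accumulation values let seq512 training fit in memory.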

#### 2. PHRASE1

```
# Single card
bash bert_pre1.sh        # FP32 (adjust the APP settings in single_pre1_1.sh to your own paths)
bash bert_pre1_fp16.sh   # FP16 (adjust the APP settings in single_pre1_1_fp16.sh to your own paths)

# Multi-card
bash bert_pre1_4.sh       # FP32 (adjust the APP settings in single_pre1_4.sh to your own paths)
bash bert_pre1_4_fp16.sh  # FP16 (adjust the APP settings in single_pre1_4_fp16.sh to your own paths)

# Multi-node, multi-card
# On node 1, edit hostfile for your environment; both nodes need identical file paths and configuration.
# Set the interface in hostfile to the NIC name matching each node's IP (see `ip a`);
# numa bindings can be adjusted to the node topology.
cd 2node-run-pre
sh run_bert_pre1_4dcu.sh   # for fp16, add --fp16 and --amp to the APP settings in the corresponding single_* file
```

#### 3. PHRASE2

```
# Single card
bash bert_pre2.sh        # FP32 (adjust the APP settings in single_pre2_1.sh to your own paths)
bash bert_pre2_fp16.sh   # FP16 (adjust the APP settings in single_pre2_1_fp16.sh to your own paths)

# Multi-card
bash bert_pre2_4.sh       # FP32 (adjust the APP settings in single_pre2_4.sh to your own paths)
bash bert_pre2_4_fp16.sh  # FP16 (adjust the APP settings in single_pre2_4_fp16.sh to your own paths)

# Multi-node, multi-card
# On node 1, edit hostfile for your environment; both nodes need identical file paths and configuration.
# Set the interface in hostfile to the NIC name matching each node's IP (see `ip a`);
# numa bindings can be adjusted to the node topology.
cd 2node-run-pre
sh run_bert_pre2_4dcu.sh   # for fp16, add --fp16 and --amp to the APP settings in the corresponding single_* file
```

## Results

![result](result.jpg)

### Accuracy

| Training | Cards | Batch size | Iterations | Loss                          |
| -------- | ----- | ---------- | ---------- | ----------------------------- |
| PHRASE1  | 1     | 16         | 634 steps  | 9.7421875                     |
| SQUAD    | 1     | 16         | 3 epochs   | final_loss: 3.897481918334961 |

## Application Scenarios

### Algorithm Category

`Conversational Question Answering`

### Key Industries

`Internet, education, scientific research`

## Pre-trained Weights


## Source Repository & Issue Feedback

https://developer.sourcefind.cn/codes/modelzoo/bert-pytorch

## References

https://github.com/mlperf/training_results_v0.7/tree/master/NVIDIA/benchmarks/bert/implementations/pytorch

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT