"git@developer.sourcefind.cn:OpenDAS/pytorch3d.git" did not exist on "bbc12e70c4fdb2d4534cdf04089f2f3e098580e0"
README.md 11.8 KB
Newer Older
hepj987's avatar
hepj987 committed
1
# **BERT**

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

[BERT paper (PDF)](https://arxiv.org/pdf/1810.04805.pdf)

## Model Structure
The full name of BERT is Bidirectional Encoder Representations from Transformers; it is a pre-trained language representation model. Instead of the traditional unidirectional language models, or the shallow concatenation of two unidirectional models, used in earlier pre-training approaches, BERT adopts a masked language model (MLM) objective, which allows it to learn deep bidirectional language representations.


![bert_model](bert_model.png)
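
To make the MLM objective concrete, below is a minimal sketch of the standard BERT masking rule (15% of tokens are selected; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged). This is illustrative only, not this repository's implementation; `mask_token_id` and `vocab_size` are placeholders.

```
import random

def mask_tokens(token_ids, vocab_size, mask_token_id, mask_prob=0.15):
    """Illustrative BERT-style MLM masking (80/10/10 rule)."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)      # -100 marks positions that are not scored
    for i in range(len(inputs)):
        if random.random() < mask_prob:
            labels[i] = inputs[i]      # the model must predict the original token
            r = random.random()
            if r < 0.8:                # 80%: replace with [MASK]
                inputs[i] = mask_token_id
            elif r < 0.9:              # 10%: replace with a random token
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token
    return inputs, labels
```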

## Algorithm

![bert](bert.png)

BERT does not use the full Transformer (encoder plus decoder); it uses only the encoder part of the Transformer, stacking multiple encoder layers to form its basic network structure.
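
As an illustration of this encoder-only design, the sketch below stacks BERT-base-sized encoder layers using PyTorch's built-in modules. It is a schematic, not the network defined in this repository (which adds token/segment/position embeddings and the MLM/NSP heads):

```
import torch
import torch.nn as nn

# BERT-base-like settings: 12 layers, hidden size 768, 12 attention heads
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                   dim_feedforward=3072, activation="gelu")
encoder = nn.TransformerEncoder(layer, num_layers=12)

embeddings = torch.randn(128, 2, 768)  # (seq_len, batch, hidden) dummy input
contextual = encoder(embeddings)       # deep bidirectional representations
print(contextual.shape)                # torch.Size([128, 2, 768])
```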

## Environment Setup

`Note: the versions of dtk, python, torch, apex, etc. must be aligned.`

### Docker (Option 1)

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-22.10.1-py37-latest
# Enter the container and install the missing dependencies
docker run -dit --network=host --name=bert-pytorch --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-22.10.1-py37-latest
docker exec -it bert-pytorch /bin/bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
# For the tensorflow wheel, see the download link under the Conda option
pip install tensorflow-2.7.0+git67f0ade9.dtk2210-cp37-cp37m-manylinux2014_x86_64.whl
```

### Dockerfile (Option 2)

```
docker build -t bert:latest .
docker run -dit --network=host --name=bert-pytorch --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 bert:latest
docker exec -it bert-pytorch /bin/bash
# For the tensorflow wheel, see the download link under the Conda option
pip install tensorflow-2.7.0+git67f0ade9.dtk2210-cp37-cp37m-manylinux2014_x86_64.whl
```

### Conda (Option 3)

```
# Create a virtual environment
conda create -n bert-pytorch python=3.7
```

The toolkits and deep learning libraries this project requires for DCU cards can all be downloaded from the [光合 developer community](https://developer.hpccube.com/tool/).

[apex](https://cancon.hpccube.com:65024/directlink/4/apex/dtk22.10/apex-0.1+gitdb7007a.dtk2210-cp37-cp37m-manylinux2014_x86_64.whl)

[pytorch1.10](https://cancon.hpccube.com:65024/directlink/4/pytorch/dtk22.10/torch-1.10.0a0+git2040069.dtk2210-cp37-cp37m-manylinux2014_x86_64.whl)

[tensorflow2.7](https://cancon.hpccube.com:65024/directlink/4/tensorflow/dtk22.10/tensorflow-2.7.0+git67f0ade9.dtk2210-cp37-cp37m-manylinux2014_x86_64.whl)

[DTK22.10](https://cancon.hpccube.com:65024/directlink/1/DTK-22.10.1/CentOS7.6/DTK-22.10.1-CentOS7.6-x86_64.tar.gz)

Install the remaining dependencies according to requirements.txt:

```
pip install -r requirements.txt
```
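
After installing the wheels, a quick sanity check that torch was built for the DCU stack and can see the cards may help (a minimal sketch; DTK/ROCm builds of PyTorch expose the devices through the `torch.cuda` API):

```
import torch

print(torch.__version__)          # should include the dtk build suffix
print(torch.cuda.is_available())  # True if the DCU devices are visible
print(torch.cuda.device_count())  # number of usable cards
```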



## Dataset

Pre-training data: this project uses the wiki20220401 dump. The compressed dataset is nearly 20 GB and around 300 GB after extraction, so it is slow to download and takes a lot of disk space. Since the wiki dataset is updated frequently and the official site does not keep old versions, netdisk download links for the pre-processed seq128 and seq512 datasets are provided here.

(seq128, used for PHRASE1) link: https://pan.baidu.com/s/13GA-Jmfr2qXrChjiM2UfFQ?pwd=l30u  extraction code: l30u

(seq512, used for PHRASE2) link: https://pan.baidu.com/s/1MBFjYNsGQzlnc8aEb7Pg4w?pwd=6ap2  extraction code: 6ap2

Dataset quick-download center: [SCNet AIDatasets](http://113.200.138.88:18080/aidatasets). The project's datasets can be fetched from the quick-download channel: [seq128](http://113.200.138.88:18080/aidatasets/project-dependency/bert_pytorch/-/raw/master/wikicorpus_en(128).zip), [seq512](http://113.200.138.88:18080/aidatasets/project-dependency/bert_pytorch/-/raw/master/wikicorpus_en%20(512).zip).

Alternatively, use the wiki dataset already on the server: the data there has already been downloaded and processed, and the pre-training data is divided into PHRASE1 and PHRASE2.

`wiki dataset layout`

```
└── wikicorpus_en
    ├── training
    │   ├── wikicorpus_en_training_0.tfrecord.hdf5
    │   ├── wikicorpus_en_training_1000.tfrecord.hdf5
    │   └── ...
    └── test
        ├── wikicorpus_en_test_99.tfrecord.hdf5
        ├── wikicorpus_en_test_9.tfrecord.hdf5
        └── ...
```
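
Each shard is an HDF5 file holding the pre-tokenized training features. To inspect what one shard contains, a minimal sketch (assumes `h5py` is installed; adjust the path to your own layout):

```
import h5py

# List every dataset in one shard together with its shape and dtype
path = "wikicorpus_en/training/wikicorpus_en_training_0.tfrecord.hdf5"
with h5py.File(path, "r") as f:
    for name, ds in f.items():
        print(name, ds.shape, ds.dtype)
```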

```
# Example: downloading and processing the wiki dataset
cd cleanup_scripts  
mkdir -p wiki  
cd wiki  
wget https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles-multistream.xml.bz2    # Optionally use curl instead  
bzip2 -d enwiki-20200101-pages-articles-multistream.xml.bz2  
cd ..    # back to bert/cleanup_scripts  
git clone https://github.com/attardi/wikiextractor.git  
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml    # Results are placed in bert/cleanup_scripts/text  
./process_wiki.sh '<text/*/wiki_??'  
```


SQuAD 1.1 question-answering data:

[train-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)

[dev-v1.1](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

`SQuAD v1.1 data layout`

```
├── dev-v1.1.json
└── train-v1.1.json
```
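
The SQuAD v1.1 files use the standard nested JSON schema (articles → paragraphs → question/answer pairs). A minimal sketch for peeking at one training example:

```
import json

with open("train-v1.1.json") as f:
    squad = json.load(f)

article = squad["data"][0]
para = article["paragraphs"][0]
qa = para["qas"][0]
print("title:   ", article["title"])
print("context: ", para["context"][:80], "...")
print("question:", qa["question"])
print("answer:  ", qa["answers"][0]["text"])
```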

### Model Weights

[bert-large-uncased model for SQuAD training (already converted, ready to use); extraction code: vs8d](https://pan.baidu.com/share/init?surl=V8kFpgsLQe8tOAeft-5UpQ)

[bert-large-uncased_L-24_H-1024_A-16 (requires conversion)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)

[bert-base-uncased_L-12_H-768_A-12 (requires conversion)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)

## Training

### SQuAD Training

#### 1. Model Conversion

```
# If the downloaded model is in .ckpt format, it must be converted to .ckpt.pt
python3 tf_to_torch/convert_tf_checkpoint.py --tf_checkpoint uncased_L-24_H-1024_A-16/bert_model.ckpt --bert_config_path uncased_L-24_H-1024_A-16/bert_config.json --output_checkpoint uncased_L-24_H-1024_A-16/model.ckpt.pt
```
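
To sanity-check the converted checkpoint before training, you can load it and list its top-level entries (a minimal sketch; the exact key layout depends on the conversion script, so treat the printed names as informational):

```
import torch

ckpt = torch.load("uncased_L-24_H-1024_A-16/model.ckpt.pt", map_location="cpu")
# The converter typically saves a dict; print whatever top-level keys it holds
if isinstance(ckpt, dict):
    for key in list(ckpt)[:10]:
        print(key)
else:
    print(type(ckpt))
```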

#### 2. Parameters

```
  --train_file            training data
  --predict_file          prediction file
  --init_checkpoint       initial model checkpoint
  --vocab_file            vocabulary file
  --output_dir            output directory
  --config_file           model configuration file
  --json-summary          output json summary file
  --bert_model            BERT model variant, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
  --do_train              run training
  --do_predict            run prediction
  --train_batch_size      training batch size
  --predict_batch_size    prediction batch size
  --gpus_per_node         number of GPUs per node
  --local_rank            local_rank for GPU-based distributed training (set to -1 for a single card)
  --fp16                  mixed-precision training
  --amp                   mixed-precision training
```

#### 3. Run

```
# Single card
./bert_squad.sh        # FP32 (edit the APP setting in single_squad.sh for your own paths)
./bert_squad_fp16.sh   # FP16 (edit the APP setting in single_squad_fp16.sh for your own paths)
# For --init_checkpoint, use model.ckpt-28252.pt or a model.ckpt.pt converted by yourself
```

```
# Multi-card
./bert_squad4.sh        # FP32 (edit the APP setting in single_squad4.sh for your own paths)
./bert_squad4_fp16.sh   # FP16 (edit the APP setting in single_squad4_fp16.sh for your own paths)
# For --init_checkpoint, use model.ckpt-28252.pt or a model.ckpt.pt converted by yourself
```

```
# Multi-node, multi-card
# On node 1, edit hostfile for your environment; both nodes must have identical file paths and configuration. Set the NIC name in hostfile to the one matching each node's IP (check with ip a); numa bindings can be adjusted to the current node topology.
cd 2node-run-squad
sh run_bert_squad_4dcu.sh   # for fp16, add --fp16 and --amp to the APP setting in the corresponding single_* file
# For --init_checkpoint, use model.ckpt-28252.pt or a model.ckpt.pt converted by yourself
```



### **PHRASE Testing**

#### 1. Parameters

```
    --input_dir                      input data directory
    --output_dir                     output directory
    --config_file                    model configuration file
    --bert_model                     BERT model variant, one of: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese
    --train_batch_size               training batch size
    --max_seq_length=128             maximum sequence length (must match the training data)
    --max_predictions_per_seq        maximum total number of masked tokens per input sequence
    --max_steps                      maximum number of training steps
    --warmup_proportion              proportion of training for linear learning-rate warmup
    --num_steps_per_checkpoint       number of steps between checkpoint saves
    --learning_rate                  learning rate
    --seed                           random seed
    --gradient_accumulation_steps    number of update steps to accumulate before performing a backward/update pass
    --allreduce_post_accumulation    whether to perform all-reduce during gradient-accumulation steps
    --do_train                       run training
    --fp16                           mixed-precision training
    --amp                            mixed-precision training
    --json-summary                   output json summary file
```
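
To clarify how `--gradient_accumulation_steps` and `--warmup_proportion` interact, here is a schematic training loop (illustrative only; the real scripts use apex fused optimizers and control the distributed all-reduce with `--allreduce_post_accumulation`, and `model(inputs, labels)` returning the loss is an assumption):

```
import torch

def warmup_lr(step, max_steps, base_lr, warmup_proportion):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = int(max_steps * warmup_proportion)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * (max_steps - step) / max(1, max_steps - warmup_steps)

def train(model, loader, optimizer, max_steps, base_lr,
          grad_accum_steps=1, warmup_proportion=0.01):
    step = 0
    for i, (inputs, labels) in enumerate(loader):
        loss = model(inputs, labels) / grad_accum_steps  # scale for accumulation
        loss.backward()                                  # gradients accumulate
        if (i + 1) % grad_accum_steps == 0:              # one update per
            for group in optimizer.param_groups:         # accumulated batch
                group["lr"] = warmup_lr(step, max_steps, base_lr,
                                        warmup_proportion)
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step >= max_steps:
                break
```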

#### 2. PHRASE1

```
# Single card
./bert_pre1.sh          # FP32 (edit the APP setting in single_pre1_1.sh for your own paths)
./bert_pre1_fp16.sh     # FP16 (edit the APP setting in single_pre1_1_fp16.sh for your own paths)
# Multi-card
./bert_pre1_4.sh        # FP32 (edit the APP setting in single_pre1_4.sh for your own paths)
./bert_pre1_4_fp16.sh   # FP16 (edit the APP setting in single_pre1_4_fp16.sh for your own paths)

# Multi-node, multi-card
# On node 1, edit hostfile for your environment; both nodes must have identical file paths and configuration. Set the NIC name in hostfile to the one matching each node's IP (check with ip a); numa bindings can be adjusted to the current node topology.
cd 2node-run-pre
sh run_bert_pre1_4dcu.sh   # for fp16, add --fp16 and --amp to the APP setting in the corresponding single_* file
```

#### 3. PHRASE2

```
# Single card
./bert_pre2.sh          # FP32 (edit the APP setting in single_pre2_1.sh for your own paths)
./bert_pre2_fp16.sh     # FP16 (edit the APP setting in single_pre2_1_fp16.sh for your own paths)
# Multi-card
./bert_pre2_4.sh        # FP32 (edit the APP setting in single_pre2_4.sh for your own paths)
./bert_pre2_4_fp16.sh   # FP16 (edit the APP setting in single_pre2_4_fp16.sh for your own paths)

# Multi-node, multi-card
# On node 1, edit hostfile for your environment; both nodes must have identical file paths and configuration. Set the NIC name in hostfile to the one matching each node's IP (check with ip a); numa bindings can be adjusted to the current node topology.
cd 2node-run-pre
sh run_bert_pre2_4dcu.sh   # for fp16, add --fp16 and --amp to the APP setting in the corresponding single_* file
```

## Results

![result](result.jpg)

### Accuracy

| Training | Cards | Batch size | Iterations | Metric                         |
| ------- | ---- | ---------- | ---------- | ------------------------------ |
| PHRASE1 | 1    | 16         | 634 steps  | 9.7421875                      |
| SQUAD   | 1    | 16         | 3 epochs   | final_loss: 3.897481918334961  |

## Application Scenarios

### Algorithm Category

`Dialogue and question answering`

### Key Industries

`Internet, education, scientific research`

## Pre-trained Weights
Pre-trained-weight quick-download center: [SCNet AIModels](http://113.200.138.88:18080/aimodels). The project's pre-trained weights can be fetched from the quick-download channel: [bert-large-uncased](http://113.200.138.88:18080/aidatasets/project-dependency/bert_pytorch/-/raw/master/bs64k_32k_ckpt.tar.gz), [bert-large-uncased_L-24_H-1024_A-16](http://113.200.138.88:18080/aidatasets/project-dependency/bert-large-uncased/-/raw/master/uncased_L-24_H-1024_A-16.zip), [bert-base-uncased_L-12_H-768_A-12](http://113.200.138.88:18080/aidatasets/project-dependency/bert-large-uncased/-/raw/master/uncased_L-12_H-768_A-12.zip).

## Source Repository and Issue Feedback

https://developer.hpccube.com/codes/modelzoo/bert-pytorch

## References

https://github.com/mlperf/training_results_v0.7/tree/master/NVIDIA/benchmarks/bert/implementations/pytorch

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT