# bert

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

[BERT paper (PDF)](https://arxiv.org/pdf/1810.04805.pdf)

## Model Architecture

![bert_model](bert_model.png)

```
BERT, short for Bidirectional Encoder Representations from Transformers, is a pre-trained language representation model. Rather than the traditional unidirectional language models, or shallow concatenations of two unidirectional models, used in earlier pre-training, it adopts the masked language model (MLM) objective, which lets it learn deep bidirectional language representations.
```
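
To make the MLM objective concrete, here is a minimal masking sketch (an illustration only, not this project's preprocessing code; the 15% selection rate and the 80/10/10 replacement split follow the paper):

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """BERT-style MLM masking sketch (the real code operates on WordPiece ids)."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < select_prob:
            targets.append(tok)                      # model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            targets.append(None)                     # position is not predicted
    return inputs, targets

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, vocab=tokens))
```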

## Algorithm

![bert](bert.png)

```
Earlier pre-trained models were constrained by unidirectional language modeling (left-to-right or right-to-left), which limits their representational power: each position can only draw on context from one direction. BERT instead pre-trains with MLM and builds the model from deep bidirectional Transformer blocks (a unidirectional Transformer is usually called a Transformer decoder: each token attends only to the tokens to its left; a bidirectional Transformer is called a Transformer encoder: each token attends to all tokens), so the final representations fuse context from both the left and the right.
```
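
The encoder/decoder distinction above comes down to the attention mask. A minimal sketch (NumPy assumed; sequence length 4):

```python
import numpy as np

seq_len = 4
# Transformer encoder (BERT): every token attends to every token.
encoder_mask = np.ones((seq_len, seq_len), dtype=int)
# Transformer decoder: token i attends only to positions j <= i.
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print(encoder_mask)  # all ones: full left and right context
print(decoder_mask)  # lower triangular: left context only
```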

## Datasets

MNLI classification dataset: [MNLI](https://dl.fbaipublicfiles.com/glue/data/MNLI.zip)


SQuAD question answering dataset: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json), [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

SQuAD v1.1 eval script: [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

Dataset quick-download mirror: [SCNet AIDatasets](http://113.200.138.88:18080/aidatasets). The dataset used in this project can also be fetched from the mirror: [MNLI](http://113.200.138.88:18080/aidatasets/project-dependency/mnli/-/raw/master/MNLI.zip)

### Pre-trained models

[bert-base-uncased (used for MNLI classification)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)

[bert-large-uncased (used for SQuAD question answering)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)

Pre-trained weight quick-download mirror: [SCNet AIModels](http://113.200.138.88:18080/aimodels). The weights used in this project can also be fetched from the mirror:
[bert-base-uncased (used for MNLI classification)](http://113.200.138.88:18080/aidatasets/project-dependency/bert_models/-/raw/main/uncased_L-12_H-768_A-12.zip), [bert-large-uncased (used for SQuAD question answering)](http://113.200.138.88:18080/aidatasets/project-dependency/bert_models/-/raw/main/uncased_L-24_H-1024_A-16.zip?ref_type=heads)

`MNLI dataset layout`

```
├── original
│	├── multinli_1.0_dev_matched.jsonl
│	├── multinli_1.0_dev_matched.txt
│	├── multinli_1.0_dev_mismatched.jsonl
│	├── multinli_1.0_dev_mismatched.txt
│	├── multinli_1.0_train.jsonl
│	└── multinli_1.0_train.txt
├── dev_matched.tsv
├── dev_mismatched.tsv
├── README.txt
├── test_matched.tsv
├── test_mismatched.tsv
└── train.tsv
```

`SQuAD v1.1 data layout`

```
├── dev-v1.1.json
└── train-v1.1.json
```



## Environment Setup

Running in Docker is recommended. An image from the [SourceFind (光源)](https://www.sourcefind.cn/#/main-page) registry is provided and can be pulled with docker pull.

### Docker (option 1)

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.13.1-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --network=host --name=bert_tensorflow --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.13.1-ubuntu20.04-dtk24.04.1-py3.10
docker exec -it bert_tensorflow /bin/bash
pip install -r requirements.txt
```

### Dockerfile (option 2)

```
docker build -t bert:latest .
docker run -dit --network=host --name=bert_tensorflow --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 bert:latest
docker exec -it bert_tensorflow /bin/bash
pip install -r requirements.txt
```

### Conda (option 3)

```
conda create -n bert_tensorflow python=3.10
pip install -r requirements.txt
```

Installing the requirements may overwrite the DCU build of TensorFlow; the matching DCU packages can be downloaded from the [developer community](https://developer.hpccube.com/tool/):

[tensorflow2.13.1](https://cancon.hpccube.com:65024/directlink/4/tensorflow/DAS1.1/tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl)

[DTK24.04.1](https://cancon.hpccube.com:65024/directlink/1/latest/Ubuntu22.04/DTK-24.04.1-Ubuntu22.04-x86_64.tar.gz)

### Python version compatibility

```
If you hit AttributeError: module 'typing' has no attribute '_ClassVar' while setting up the environment, the cause is the attribute name having changed in newer Python versions.
Edit line 550 of python3.7/site-packages/dataclasses.py:
return type(a_type) is typing._ClassVar
change it to
return type(a_type) is typing.ClassVar
```
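
The same fix can also be applied programmatically; a minimal sketch (the path is an example, point it at the dataclasses.py named in your traceback; it writes a .bak backup first):

```python
import pathlib

# Example path; use the dataclasses.py reported in your traceback.
target = pathlib.Path("python3.7/site-packages/dataclasses.py")
src = target.read_text()
target.with_name(target.name + ".bak").write_text(src)  # keep a backup
target.write_text(src.replace("typing._ClassVar", "typing.ClassVar"))
```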



## Training

### Data conversion - MNLI

TF2 reads training data in tf_record format, so the raw data must be converted first.

```
python create_finetuning_data.py \
 --input_data_dir=/public/home/hepj/data/MNLI \
 --vocab_file=/public/home/hepj/model_source/uncased_L-12_H-768_A-12/vocab.txt \
 --train_data_output_path=/public/home/hepj/MNLI/train.tf_record \
 --eval_data_output_path=/public/home/hepj/MNLI/eval.tf_record \
 --meta_data_file_path=/public/home/hepj/MNLI/meta_data \
 --fine_tuning_task_type=classification \
 --max_seq_length=32 \
 --classification_task_name=MNLI
 
 #Parameter description
 --input_data_dir              training data path
 --vocab_file                  vocab file path
 --train_data_output_path      output path for the converted training data
 --eval_data_output_path       output path for the converted eval data
 --fine_tuning_task_type       fine-tuning task type
 --do_lower_case               whether to lowercase the input
 --max_seq_length              maximum sequence length
 --classification_task_name    classification task name
```
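
To sanity-check the conversion, the generated records can be counted and one example decoded. A minimal sketch (the path and the int64 feature layout are the usual ones for BERT fine-tuning data; adjust to your own output paths):

```python
import tensorflow as tf

path = "/public/home/hepj/MNLI/train.tf_record"  # example path from above
dataset = tf.data.TFRecordDataset(path)

print("records:", sum(1 for _ in dataset))  # full pass; fine for a sanity check

# Decode the first example and list its feature names and lengths.
for raw in dataset.take(1):
    example = tf.train.Example.FromString(raw.numpy())
    for name, feature in example.features.feature.items():
        print(name, len(feature.int64_list.value))
```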

### Model conversion - MNLI

TF 2.13.1 and TF 1.15.0 store and load checkpoints in different formats; the official BERT checkpoints are TF1-based, so they must be converted before use.

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_model.ckpt \
--converted_checkpoint_path /public/home/hepj/model_source/bert-base-TF2/bert_model.ckpt

#Parameter description
--bert_config_file            bert model config file
--checkpoint_to_convert       path of the checkpoint to convert
--converted_checkpoint_path   path for the converted checkpoint

After conversion, rename bert_model.ckpt-1.data-00000-of-00001 to bert_model.ckpt.data-00000-of-00001
and bert_model.ckpt-1.index to bert_model.ckpt.index

If you get an error like 'no attribute experimental', delete experimental from the offending line
```
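
The renaming step can also be scripted; a minimal sketch (the directory is the example output path from above):

```python
import os

ckpt_dir = "/public/home/hepj/model_source/bert-base-TF2"  # example path
for fname in os.listdir(ckpt_dir):
    if fname.startswith("bert_model.ckpt-1."):
        # bert_model.ckpt-1.data-00000-of-00001 -> bert_model.ckpt.data-00000-of-00001
        new_name = fname.replace("bert_model.ckpt-1.", "bert_model.ckpt.", 1)
        os.rename(os.path.join(ckpt_dir, fname), os.path.join(ckpt_dir, new_name))
```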

### Single-card run - MNLI

```
sh bert_class.sh
  #Parameter description
  --mode                       run mode: train_and_eval, export_only, or predict
  --input_meta_data_path       meta data for training and evaluation
  --train_data_path            training data path
  --eval_data_path             eval data path
  --bert_config_file           bert model config file
  --init_checkpoint            initial checkpoint path
  --train_batch_size           training batch size
  --eval_batch_size            eval batch size
  --steps_per_loop             logging interval (steps)
  --learning_rate              learning rate
  --num_train_epochs           number of training epochs
  --model_dir                  directory for saving the model
  --distribution_strategy      distribution strategy
  --num_gpus                   number of GPUs to use

Pin these library versions:
absl==1.4.0
tf-models-official==2.6.0
```

### Multi-card run - MNLI

```
sh bert_class_gpus.sh
```

### Data conversion - SQuAD 1.1

TF2 reads training data in tf_record format, so the raw data must be converted first.

```
python3 create_finetuning_data.py \
 --squad_data_file=/public/home/hepj/model/model_source/sq1.1/train-v1.1.json \
 --vocab_file=/public/home/hepj/model_source/bert-large-uncased-TF2/uncased_L-24_H-1024_A-16/vocab.txt \
 --train_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/train_new.tf_record \
 --meta_data_file_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/meta_data \
 --eval_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/eval_new.tf_record \
 --fine_tuning_task_type=squad \
 --do_lower_case=False \
 --max_seq_length=384
 
#Parameter description
 --squad_data_file             training file path
 --vocab_file                  vocab file path
 --train_data_output_path      output path for the converted training data
 --eval_data_output_path       output path for the converted eval data
 --fine_tuning_task_type       fine-tuning task type
 --do_lower_case               whether to lowercase the input
 --max_seq_length              maximum sequence length
```
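
The raw SQuAD v1.1 files are nested JSON (articles → paragraphs → question-answer pairs). A quick inspection sketch for the input before conversion (the file path is an example):

```python
import json

with open("train-v1.1.json") as f:  # example path
    squad = json.load(f)

num_questions = sum(len(para["qas"])
                    for article in squad["data"]
                    for para in article["paragraphs"])
print("version:", squad["version"])
print("articles:", len(squad["data"]), "questions:", num_questions)
```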

### Model conversion - SQuAD 1.1

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_model.ckpt \
--converted_checkpoint_path  /public/home/hepj/model_source/bert-large-TF2/bert_model.ckpt

#Parameter description
--bert_config_file            bert model config file
--checkpoint_to_convert       path of the checkpoint to convert
--converted_checkpoint_path   path for the converted checkpoint

After conversion, rename bert_model.ckpt-1.data-00000-of-00001 to bert_model.ckpt.data-00000-of-00001
and bert_model.ckpt-1.index to bert_model.ckpt.index
```

### Single-card run - SQuAD 1.1

```
sh bert_squad.sh
  #Parameter description
  --mode                       run mode: train_and_eval, export_only, or predict
  --vocab_file                 vocab file path
  --input_meta_data_path       meta data for training and evaluation
  --train_data_path            training data path
  --eval_data_path             eval data path
  --bert_config_file           bert model config file
  --init_checkpoint            initial checkpoint path
  --train_batch_size           total training batch size
  --predict_file               file to run prediction on
  --eval_batch_size            eval batch size
  --steps_per_loop             logging interval (steps)
  --learning_rate              learning rate
  --num_train_epochs           number of training epochs
  --model_dir                  directory for saving the model
  --distribution_strategy      distribution strategy
  --num_gpus                   number of GPUs to use
```
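
After a `predict` run has produced a prediction file, the official SQuAD v1.1 metric can be computed with the eval script listed in the Datasets section, along the lines of `python evaluate-v1.1.py dev-v1.1.json predictions.json` (the prediction file name here is a placeholder; pass whatever your run actually writes out).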

### Multi-card run - SQuAD 1.1
```
sh bert_squad_gpus.sh
```

## Results

```
#MNLI classification example
Input: "premise": "Conceptually cream skimming has two basic dimensions - product and geography.", "hypothesis": "Product and geography are what make cream skimming work."
Output: "label": 1
```



### Accuracy

Convergence on a single Z100 card is shown below:

|        Model task        |  Training accuracy   |
| :----------------------: | :------------------: |
| MNLI-class (single card) | val_accuracy: 0.7387 |
|  squad1.1 (single card)  | F1-score: 0.916378   |

## Application Scenarios

### Algorithm category

`Dialogue question answering`

### Key application industries

`E-commerce, research, education`

## Source Repository and Issue Feedback

https://developer.hpccube.com/codes/modelzoo/bert-tf2

## References

https://github.com/tensorflow/models/tree/v2.3.0/official/nlp