# BERT

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

[BERT paper (PDF)](https://arxiv.org/pdf/1810.04805.pdf)

## Model Architecture

![bert_model](bert_model.png)

BERT stands for Bidirectional Encoder Representations from Transformers and is a pre-trained language-representation model. Its key departure from earlier work is that it does not pre-train with a traditional unidirectional language model, or with a shallow concatenation of two unidirectional models; instead it uses a masked language model (MLM) objective, which lets it learn deep bidirectional language representations.
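
To make the MLM objective concrete, here is a minimal toy sketch (illustrative only, not this repository's preprocessing code; the actual BERT recipe additionally replaces chosen tokens with random or unchanged tokens in an 80/10/10 split):

```python
import random

# Toy masked-language-model (MLM) corruption: choose ~15% of tokens and
# replace them with [MASK]; the model is trained to recover the originals.
def mask_tokens(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")  # loss is computed at this position
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)      # no loss on unmasked positions
    return masked, labels

print(mask_tokens("the cat sat on the mat".split()))
```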

## Algorithm

![bert](bert.png)
Earlier pre-trained models were constrained by unidirectional language modeling (left-to-right or right-to-left), which limits representational power: each position only sees context from one direction. BERT instead pre-trains with MLM on a stack of deep bidirectional Transformer blocks. A unidirectional Transformer is usually called a Transformer decoder, where each token attends only to the tokens to its left; a bidirectional Transformer is called a Transformer encoder, where each token attends to all tokens. The resulting deep bidirectional representations therefore fuse context from both sides.
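
The encoder/decoder distinction above comes down to the attention mask. A minimal NumPy sketch (illustrative only, not this repository's code):

```python
import numpy as np

seq_len = 4

# Decoder-style (unidirectional) mask: position i may attend only to j <= i.
decoder_mask = np.tril(np.ones((seq_len, seq_len)))

# Encoder-style (bidirectional) mask, as in BERT: every position attends to
# every position, so left and right context both flow into each token.
encoder_mask = np.ones((seq_len, seq_len))

print(decoder_mask)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```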

## Dataset

MNLI classification dataset: [MNLI](https://dl.fbaipublicfiles.com/glue/data/MNLI.zip)


SQuAD question-answering dataset: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json), [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

SQuAD v1.1 eval script: [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

Dataset mirror: [SCNet AIDatasets](http://113.200.138.88:18080/aidatasets); the dataset used in this project can also be fetched via the fast-download channel: [MNLI](http://113.200.138.88:18080/aidatasets/project-dependency/mnli/-/raw/master/MNLI.zip)

### Pre-trained models

[bert-base-uncased (used for MNLI classification)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)

[bert-large-uncased (used for SQuAD QA)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)

Pre-trained weight mirror: [SCNet AIModels](http://113.200.138.88:18080/aimodels); the weights used in this project can also be fetched via the fast-download channel:
[bert-base-uncased (used for MNLI classification)](http://113.200.138.88:18080/aidatasets/project-dependency/bert_models/-/raw/main/uncased_L-12_H-768_A-12.zip), [bert-large-uncased (used for SQuAD QA)](http://113.200.138.88:18080/aidatasets/project-dependency/bert_models/-/raw/main/uncased_L-24_H-1024_A-16.zip?ref_type=heads)

`MNLI dataset layout`

```
├── original
│	├── multinli_1.0_dev_matched.jsonl
│	├── multinli_1.0_dev_matched.txt
│	├── multinli_1.0_dev_mismatched.jsonl
│	├── multinli_1.0_dev_mismatched.txt
│	├── multinli_1.0_train.jsonl
│	└── multinli_1.0_train.txt
├── dev_matched.tsv
├── dev_mismatched.tsv
├── README.txt
├── test_matched.tsv
├── test_mismatched.tsv
└── train.tsv
```

`SQuAD v1.1 data layout`

```
├── dev-v1.1.json
└── train-v1.1.json
```



## Environment Setup

Running in Docker is recommended; prebuilt [光源](https://www.sourcefind.cn/#/main-page) images are provided and can be pulled with `docker pull`.

### Docker (Option 1)

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.13.1-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --network=host --name=bert_tensorflow --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.13.1-ubuntu20.04-dtk24.04.1-py3.10
docker exec -it bert_tensorflow /bin/bash
pip install -r requirements.txt
pip install tf-models-official==2.4.0  tensorflow_addons==0.16.1 tensorflow_hub==0.16.1 typeguard==4.3.0 typing_extensions==4.12.2 --no-deps

```

### Dockerfile (Option 2)

```
docker build -t bert:latest .
docker run -dit --network=host --name=bert_tensorflow --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 bert:latest
docker exec -it bert_tensorflow /bin/bash
pip install -r requirements.txt
pip install tf-models-official==2.4.0  tensorflow_addons==0.16.1 tensorflow_hub==0.16.1 typeguard==4.3.0 typing_extensions==4.12.2 --no-deps

```

### Conda (Option 3)

```
conda create -n bert_tensorflow python=3.10
conda activate bert_tensorflow
pip install -r requirements.txt
pip install tf-models-official==2.4.0  tensorflow_addons==0.16.1 tensorflow_hub==0.16.1 typeguard==4.3.0 typing_extensions==4.12.2 --no-deps

```

Installing these packages may overwrite the DCU build of TensorFlow; the matching DCU packages can be re-downloaded from the [developer community](https://developer.hpccube.com/tool/):

[tensorflow2.13.1](https://cancon.hpccube.com:65024/directlink/4/tensorflow/DAS1.1/tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp311-cp311-manylinux_2_31_x86_64.whl)

[DTK24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)

### Python version compatibility

If environment setup fails with `AttributeError: module 'typing' has no attribute '_ClassVar'`, the cause is that newer Python versions renamed the attribute. Edit line 550 of `python3.7/site-packages/dataclasses.py`, changing

```
return type(a_type) is typing._ClassVar
```

to

```
return type(a_type) is typing.ClassVar
```



## Training

### TF 2.13 compatibility changes

```
/usr/local/lib/python3.10/site-packages/keras/src/optimizers/optimizer.py, line 1234, change to:
if (self.weight_decay is None) or self.weight_decay=="AdamWeightDecay":


/usr/local/lib/python3.10/site-packages/official/modeling/performance.py, line 53, change to:
tf.keras.mixed_precision.set_global_policy('float32')

/usr/local/lib/python3.10/site-packages/official/nlp/keras_nlp/layers/transformer_encoder_block.py, line 167, change to:
tf.keras.mixed_precision.global_policy()


/usr/local/lib/python3.10/site-packages/official/nlp/modeling/networks/classification.py, line 70, change to:
policy = tf.keras.mixed_precision.global_policy()

/usr/local/lib/python3.10/site-packages/official/nlp/bert/model_training_utils.py, line 346, change to:
tf.keras.mixed_precision.LossScaleOptimizer):
```



### Data conversion - MNLI

TF2 reads training data in `tf_record` format, so the raw dataset must be converted first; a sketch for sanity-checking the converted file follows the command below.

```
python create_finetuning_data.py \
 --input_data_dir=/public/home/hepj/data/MNLI \
 --vocab_file=/public/home/hepj/model_source/uncased_L-12_H-768_A-12/vocab.txt \
 --train_data_output_path=/public/home/hepj/MNLI/train.tf_record \
 --eval_data_output_path=/public/home/hepj/MNLI/eval.tf_record \
 --meta_data_file_path=/public/home/hepj/MNLI/meta_data \
 --fine_tuning_task_type=classification \
 --max_seq_length=32 \
 --classification_task_name=MNLI
 
 #Parameter notes
 --input_data_dir				training data directory
 --vocab_file					vocab file path
 --train_data_output_path		output path for converted training data
 --eval_data_output_path		output path for converted eval data
 --fine_tuning_task_type 		fine-tuning task type
 --do_lower_case				whether to lowercase the input
 --max_seq_length				maximum sequence length
 --classification_task_name		classification task name
```
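
A quick sanity check of the generated file (a sketch; the path comes from the command above, and the feature names in the comment are the ones the TF official BERT pipeline typically writes, so verify against your own output):

```python
import tensorflow as tf

path = "/public/home/hepj/MNLI/train.tf_record"  # output path from the command above
for raw in tf.data.TFRecordDataset(path).take(1):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    # Typically input_ids, input_mask, segment_ids, label_ids
    print(list(example.features.feature.keys()))
```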

### Model conversion - MNLI

TF2.x and TF1.x store and load checkpoints in different formats, and the official BERT checkpoints are TF1-based, so they must be converted first.

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_model.ckpt \
--converted_checkpoint_path /public/home/hepj/model_source/bert-base-TF2/bert_model.ckpt

#Parameter notes
--bert_config_file			BERT config file
--checkpoint_to_convert		path of the checkpoint to convert
--converted_checkpoint_path	output path for the converted checkpoint

After conversion, rename bert_model.ckpt-1.data-00000-of-00001 to bert_model.ckpt.data-00000-of-00001
and bert_model.ckpt-1.index to bert_model.ckpt.index.

If you hit an error like 'no attribute experimental', delete `experimental` from the offending line.
```
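
The rename step can also be scripted; a minimal sketch (assumes the converter appended the `-1` global-step suffix as described above; adjust the directory to your layout):

```python
import os

ckpt_dir = "/public/home/hepj/model_source/bert-base-TF2"  # adjust as needed
for suffix in ("data-00000-of-00001", "index"):
    src = os.path.join(ckpt_dir, f"bert_model.ckpt-1.{suffix}")
    dst = os.path.join(ckpt_dir, f"bert_model.ckpt.{suffix}")
    os.rename(src, dst)  # strip the "-1" step suffix added by the converter
```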

### Single-card run - MNLI

```
sh bert_class.sh
  #Parameter notes
  --mode						run mode: train_and_eval, export_only, or predict
  --input_meta_data_path		metadata used for training and evaluation
  --train_data_path				training data path
  --eval_data_path				eval data path
  --bert_config_file			BERT config file
  --init_checkpoint				initial checkpoint path
  --train_batch_size			training batch size
  --eval_batch_size				eval batch size
  --steps_per_loop				logging interval (steps)
  --learning_rate				learning rate
  --num_train_epochs			number of training epochs
  --model_dir					directory for saving checkpoints
  --distribution_strategy		distribution strategy
  --num_gpus					number of GPUs to use



  Change these library versions:
absl==1.4.0
tf-models-official==2.6.0

```

### Multi-card run - MNLI

```
sh bert_class_gpus.sh
```

### Data conversion - SQuAD1.1

TF2 reads training data in `tf_record` format, so the raw dataset must be converted first; a sketch for peeking at the raw SQuAD JSON follows the command below.

```
python3 create_finetuning_data.py \
 --squad_data_file=/public/home/hepj/model/model_source/sq1.1/train-v1.1.json \
 --vocab_file=/public/home/hepj/model_source/bert-large-uncased-TF2/uncased_L-24_H-1024_A-16/vocab.txt \
 --train_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/train_new.tf_record \
 --meta_data_file_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/meta_data \
 --eval_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/eval_new.tf_record \
 --fine_tuning_task_type=squad \
 --do_lower_case=False \
 --max_seq_length=384
 
#Parameter notes
 --squad_data_file				training file path
 --vocab_file					vocab file path
 --train_data_output_path		output path for converted training data
 --eval_data_output_path		output path for converted eval data
 --fine_tuning_task_type 		fine-tuning task type
 --do_lower_case				whether to lowercase the input
 --max_seq_length				maximum sequence length
```
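
If conversion fails, it helps to inspect the raw SQuAD JSON first; a small sketch of its data -> paragraphs -> qas nesting (path taken from the command above, adjust to your layout):

```python
import json

with open("/public/home/hepj/model/model_source/sq1.1/train-v1.1.json") as f:
    squad = json.load(f)

# SQuAD v1.1 nests articles -> paragraphs -> question/answer pairs.
article = squad["data"][0]
qa = article["paragraphs"][0]["qas"][0]
print(article["title"])
print(qa["question"])
print(qa["answers"][0]["text"])  # answer span copied from the paragraph context
```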

### Model conversion - SQuAD1.1

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_model.ckpt \
--converted_checkpoint_path  /public/home/hepj/model_source/bert-large-TF2/bert_model.ckpt

#Parameter notes
--bert_config_file			BERT config file
--checkpoint_to_convert		path of the checkpoint to convert
--converted_checkpoint_path	output path for the converted checkpoint

After conversion, rename bert_model.ckpt-1.data-00000-of-00001 to bert_model.ckpt.data-00000-of-00001
and bert_model.ckpt-1.index to bert_model.ckpt.index.
```

### Single-card run - SQuAD1.1

```
sh bert_squad.sh
  #Parameter notes
  --mode						run mode: train_and_eval, export_only, or predict
  --vocab_file					vocab file path
  --input_meta_data_path		metadata used for training and evaluation
  --train_data_path				training data path
  --eval_data_path				eval data path
  --bert_config_file			BERT config file
  --init_checkpoint				initial checkpoint path
  --train_batch_size			total training batch size
  --predict_file				prediction file path
  --eval_batch_size				eval batch size
  --steps_per_loop				logging interval (steps)
  --learning_rate				learning rate
  --num_train_epochs			number of training epochs
  --model_dir					directory for saving checkpoints
  --distribution_strategy		distribution strategy
  --num_gpus					number of GPUs to use
```

### Multi-card run - SQuAD1.1

```
sh bert_squad_gpus.sh
```

## Results

```
#MNLI classification example
Input:  "premise": "Conceptually cream skimming has two basic dimensions - product and geography.", "hypothesis": "Product and geography are what make cream skimming work."
Output: "label": 1
```
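
To interpret the integer label, the MNLI processor in the TF official models orders classes roughly as below (an assumption based on upstream `classifier_data_lib`; verify against the version you installed):

```python
# Assumed label order of the TF official MNLI processor; verify locally.
MNLI_LABELS = ["contradiction", "entailment", "neutral"]
print(MNLI_LABELS[1])  # -> "entailment"
```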



### Accuracy

Convergence results on a single Z100 card:

|        Task         |       Accuracy       |
| :-----------------: | :------------------: |
| MNLI-class (1 card) | val_accuracy: 0.7387 |
|  squad1.1 (1 card)  |  F1-score: 0.916378  |

## Application Scenarios

### Algorithm category

`Dialogue and question answering`

### Key application industries

`E-commerce, scientific research, education`

## Source Repository and Issue Feedback

https://developer.hpccube.com/codes/modelzoo/bert-tf2

## References

https://github.com/tensorflow/models/tree/v2.3.0/official/nlp