# BERT

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

[BERT paper (PDF)](https://arxiv.org/pdf/1810.04805.pdf)

## Model Architecture
![bert_model](bert_model.png)

BERT (Bidirectional Encoder Representation from Transformers) is a pre-trained language representation model. Instead of the traditional unidirectional language models, or shallow concatenations of two unidirectional models, used in earlier work, it is pre-trained with a masked language model (MLM) objective, which lets it learn deep bidirectional language representations.
## Algorithm

![bert](bert.png)
Earlier pre-trained models were constrained by unidirectional language modeling (left-to-right or right-to-left), which limited their representational power: each token could only see context from one direction. BERT instead pre-trains with MLM and stacks deep bidirectional Transformer blocks. (A unidirectional Transformer is usually called a Transformer decoder: each token attends only to the tokens to its left. A bidirectional Transformer is called a Transformer encoder: each token attends to all tokens.) The result is a deep bidirectional language representation that fuses left and right context.
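To make the MLM objective concrete, here is a minimal masking sketch (a hypothetical helper, not part of this repo; real BERT preprocessing operates on WordPiece ids and also caps the number of predictions per sequence):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM masking: each position becomes a prediction target
    with probability mask_prob; of the chosen positions, ~80% are replaced
    with [MASK], ~10% with a random vocab token, and ~10% left unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position is not predicted
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)
            # otherwise keep the original token unchanged
    return masked, labels
```

During pre-training, the loss is computed only at the positions where `labels` is set, so the model has to predict the original tokens from bidirectional context.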
## Datasets
MNLI classification dataset: [MNLI](https://dl.fbaipublicfiles.com/glue/data/MNLI.zip)

SQuAD question-answering dataset: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json), [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

SQuAD v1.1 evaluation script: [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

### Pre-trained Models

[bert-base-uncased (use this model for MNLI classification)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)

[bert-large-uncased (use this model for SQuAD question answering)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)

`MNLI dataset structure`

```
├── original
│	├── multinli_1.0_dev_matched.jsonl
│	├── multinli_1.0_dev_matched.txt
│	├── multinli_1.0_dev_mismatched.jsonl
│	├── multinli_1.0_dev_mismatched.txt
│	├── multinli_1.0_train.jsonl
│	└── multinli_1.0_train.txt
├── dev_matched.tsv
├── dev_mismatched.tsv
├── README.txt
├── test_matched.tsv
├── test_mismatched.tsv
└── train.tsv
```

`SQuAD v1.1 dataset structure`

```
├── dev-v1.1.json
└── train-v1.1.json
```
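The SQuAD v1.1 JSON files above share a fixed article → paragraph → question hierarchy. A small sketch of walking it (the toy record below is made up for illustration, it is not taken from the real file):

```python
import json

# Toy record shaped like train-v1.1.json (contents invented for illustration).
toy = json.loads("""
{
  "version": "1.1",
  "data": [{
    "title": "BERT",
    "paragraphs": [{
      "context": "BERT is a pre-trained language representation model.",
      "qas": [{
        "id": "q1",
        "question": "What is BERT?",
        "answers": [{"text": "a pre-trained language representation model",
                     "answer_start": 8}]
      }]
    }]
  }]
}
""")

def count_questions(squad):
    """Walk the article -> paragraph -> question hierarchy."""
    return sum(len(para["qas"])
               for article in squad["data"]
               for para in article["paragraphs"])

print(count_questions(toy))
```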



## Environment Setup

Running in Docker is recommended. A prebuilt image is available from [SourceFind](https://www.sourcefind.cn/#/main-page) and can be fetched with `docker pull`.
### Docker (Option 1)

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.13.1-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --network=host --name=bert_tensorflow --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.13.1-ubuntu20.04-dtk24.04.1-py3.10
docker exec -it bert_tensorflow /bin/bash
pip install -r requirements.txt
pip install tf-models-official==2.4.0  tensorflow_addons==0.16.1 tensorflow_hub==0.16.1 typeguard==4.3.0 typing_extensions==4.12.2 --no-deps

```

### Dockerfile (Option 2)

```
docker build -t bert:latest .
docker run -dit --network=host --name=bert_tensorflow --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 bert:latest
docker exec -it bert_tensorflow /bin/bash
pip install -r requirements.txt
pip install tf-models-official==2.4.0  tensorflow_addons==0.16.1 tensorflow_hub==0.16.1 typeguard==4.3.0 typing_extensions==4.12.2 --no-deps

```

### Conda (Option 3)

```
conda create -n bert_tensorflow python=3.10
conda activate bert_tensorflow
pip install -r requirements.txt
pip install tf-models-official==2.4.0  tensorflow_addons==0.16.1 tensorflow_hub==0.16.1 typeguard==4.3.0 typing_extensions==4.12.2 --no-deps

```

The installation may replace the DCU build of TensorFlow. Matching DCU packages can be downloaded from the [developer community](https://developer.sourcefind.cn/tool/):

[tensorflow2.13.1](https://cancon.hpccube.com:65024/directlink/4/tensorflow/DAS1.1/tensorflow-2.13.1+das1.1.git56b06c8.abi1.dtk2404-cp311-cp311-manylinux_2_31_x86_64.whl)

[DTK24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)

### Python Version Compatibility

```
If environment setup fails with AttributeError: module 'typing' has no attribute '_ClassVar',
the cause is an attribute that was renamed in newer Python versions.

Edit python3.7/site-packages/dataclasses.py, line 550:
return type(a_type) is typing._ClassVar
change it to:
return type(a_type) is typing.ClassVar
```
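The one-line fix above can also be applied with a short script, assuming the file is writable (the path in the usage comment is illustrative; point it at your own site-packages copy, and back the file up first):

```python
def patch_dataclasses(path):
    """Replace the renamed typing attribute in-place."""
    with open(path) as f:
        src = f.read()
    with open(path, "w") as f:
        f.write(src.replace("typing._ClassVar", "typing.ClassVar"))

# Example (hypothetical path):
# patch_dataclasses("/usr/lib/python3.7/site-packages/dataclasses.py")
```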

## Training

### TF 2.13 Compatibility Changes

```
/usr/local/lib/python3.10/site-packages/keras/src/optimizers/optimizer.py line 1234
change to:
if (self.weight_decay is None) or self.weight_decay=="AdamWeightDecay":


/usr/local/lib/python3.10/site-packages/official/modeling/performance.py line 53
change to:
tf.keras.mixed_precision.set_global_policy('float32')

/usr/local/lib/python3.10/site-packages/official/nlp/keras_nlp/layers/transformer_encoder_block.py line 167
change to:
tf.keras.mixed_precision.global_policy()


/usr/local/lib/python3.10/site-packages/official/nlp/modeling/networks/classification.py line 70
change to:
policy = tf.keras.mixed_precision.global_policy()

/usr/local/lib/python3.10/site-packages/official/nlp/bert/model_training_utils.py line 346
change to:
tf.keras.mixed_precision.LossScaleOptimizer):
```



### Data Conversion: MNLI
TF 2.x reads data as tf_record files, so the raw data must be converted first.

```
python create_finetuning_data.py \
 --input_data_dir=/public/home/hepj/data/MNLI \
 --vocab_file=/public/home/hepj/model_source/uncased_L-12_H-768_A-12/vocab.txt \
 --train_data_output_path=/public/home/hepj/MNLI/train.tf_record \
 --eval_data_output_path=/public/home/hepj/MNLI/eval.tf_record \
 --meta_data_file_path=/public/home/hepj/MNLI/meta_data \
 --fine_tuning_task_type=classification \
 --max_seq_length=32 \
 --classification_task_name=MNLI
 
 #Parameter description
 --input_data_dir				path to the training data
 --vocab_file					path to the vocab file
 --train_data_output_path		where to write the converted training data
 --eval_data_output_path		where to write the converted eval data
 --fine_tuning_task_type 		fine-tuning task type
 --do_lower_case				whether to lowercase the input
 --max_seq_length				maximum sequence length
 --classification_task_name		classification task name
```

### Model Conversion: MNLI

TF 2.x and TF 1.15.0 use different formats for saving and loading checkpoints. The official BERT checkpoints are TF 1.x models and must be converted first.

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_model.ckpt \
--converted_checkpoint_path /public/home/hepj/model_source/bert-base-TF2/bert_model.ckpt

#Parameter description
--bert_config_file			BERT model config file
--checkpoint_to_convert		path of the checkpoint to convert
--converted_checkpoint_path	path of the converted checkpoint

After conversion, rename bert_model.ckpt-1.data-00000-of-00001 to bert_model.ckpt.data-00000-of-00001
and bert_model.ckpt-1.index to bert_model.ckpt.index

If you hit an error like 'no attribute experimental', delete experimental from the line reported in the error.
```
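The rename step above can be scripted. A minimal sketch, assuming the argument is the directory that `--converted_checkpoint_path` wrote into:

```python
import os

def rename_converted_checkpoint(ckpt_dir):
    """Drop the '-1' step suffix: bert_model.ckpt-1.* -> bert_model.ckpt.*"""
    prefix = "bert_model.ckpt-1."
    for name in os.listdir(ckpt_dir):
        if name.startswith(prefix):
            os.rename(os.path.join(ckpt_dir, name),
                      os.path.join(ckpt_dir, "bert_model.ckpt." + name[len(prefix):]))
```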

### Single-Card Run: MNLI

```
sh bert_class.sh
  #Parameter description
  --mode						model mode: train_and_eval, export_only, predict
  --input_meta_data_path		metadata used for training and evaluation
  --train_data_path				path to the training data
  --eval_data_path				path to the eval data
  --bert_config_file			BERT model config file
  --init_checkpoint				initial checkpoint path
  --train_batch_size			training batch size
  --eval_batch_size				eval batch size
  --steps_per_loop				logging interval in steps
  --learning_rate				learning rate
  --num_train_epochs			number of training epochs
  --model_dir					directory for saving the model
  --distribution_strategy		distribution strategy
  --num_gpus					number of GPUs to use



  Change these library versions:
absl==1.4.0
tf-models-official==2.6.0

```

### Multi-Card Run: MNLI

```
sh bert_class_gpus.sh
```

### Data Conversion: SQuAD 1.1

TF 2.x reads data as tf_record files, so the raw data must be converted first.

```
python3 create_finetuning_data.py \
 --squad_data_file=/public/home/hepj/model/model_source/sq1.1/train-v1.1.json \
 --vocab_file=/public/home/hepj/model_source/bert-large-uncased-TF2/uncased_L-24_H-1024_A-16/vocab.txt \
 --train_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/train_new.tf_record \
 --meta_data_file_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/meta_data \
 --eval_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/eval_new.tf_record \
 --fine_tuning_task_type=squad \
 --do_lower_case=False \
 --max_seq_length=384
 
#Parameter description
 --squad_data_file				path to the training file
 --vocab_file					path to the vocab file
 --train_data_output_path		where to write the converted training data
 --eval_data_output_path		where to write the converted eval data
 --fine_tuning_task_type 		fine-tuning task type
 --do_lower_case				whether to lowercase the input
 --max_seq_length				maximum sequence length
```

### Model Conversion: SQuAD 1.1

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_model.ckpt \
--converted_checkpoint_path  /public/home/hepj/model_source/bert-large-TF2/bert_model.ckpt

#Parameter description
--bert_config_file			BERT model config file
--checkpoint_to_convert		path of the checkpoint to convert
--converted_checkpoint_path	path of the converted checkpoint

After conversion, rename bert_model.ckpt-1.data-00000-of-00001 to bert_model.ckpt.data-00000-of-00001
and bert_model.ckpt-1.index to bert_model.ckpt.index
```

### Single-Card Run: SQuAD 1.1

```
sh bert_squad.sh
  #Parameter description
  --mode						model mode: train_and_eval, export_only, predict
  --vocab_file					path to the vocab file
  --input_meta_data_path		metadata used for training and evaluation
  --train_data_path				path to the training data
  --eval_data_path				path to the eval data
  --bert_config_file			BERT model config file
  --init_checkpoint				initial checkpoint path
  --train_batch_size			total training batch size
  --predict_file				path to the prediction file
  --eval_batch_size				eval batch size
  --steps_per_loop				logging interval in steps
  --learning_rate				learning rate
  --num_train_epochs			number of training epochs
  --model_dir					directory for saving the model
  --distribution_strategy		distribution strategy
  --num_gpus					number of GPUs to use
```

### Multi-Card Run: SQuAD 1.1

```
sh bert_squad_gpus.sh
```

## Results

```
#MNLI classification example
Input: "premise": "Conceptually cream skimming has two basic dimensions - product and geography.", "hypothesis": "Product and geography are what make cream skimming work."
Output: "label": 1
```


### Accuracy

Convergence on a single Z100 card:

|        Model Task        |  Training Accuracy   |
| :----------------------: | :------------------: |
| MNLI-class (single card) | val_accuracy: 0.7387 |
|  squad1.1 (single card)  | F1-score: 0.916378   |

## Application Scenarios

### Algorithm Category

`Dialogue and Question Answering`

### Key Application Industries

`E-commerce, Research, Education`

## Source Repository and Issue Reporting

https://developer.sourcefind.cn/codes/modelzoo/bert-tf2

## References

https://github.com/tensorflow/models/tree/v2.3.0/official/nlp