# BERT Training with the TF2 Framework

## Model Introduction

BERT stands for Bidirectional Encoder Representations from Transformers and is a pre-trained language representation model. Rather than pre-training with a traditional unidirectional language model, or with a shallow concatenation of two unidirectional language models, it uses a new masked language model (MLM) objective, which allows it to learn deep bidirectional language representations.
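
To make the MLM objective concrete, here is a minimal sketch in plain Python (an illustration only, not code from this repo): a fraction of tokens (typically ~15%) is selected for prediction and, per the BERT paper, each selected token is replaced with `[MASK]` 80% of the time, with a random token 10% of the time, and kept unchanged 10% of the time; the model is trained to recover the original tokens at those positions.

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption; returns (corrupted tokens, labels).

    labels[i] is the original token at positions chosen for prediction,
    and None elsewhere (those positions contribute no loss).
    """
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")      # 80%: mask it
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)           # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mlm_mask(tokens, vocab=tokens, rng=random.Random(42))
```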

## Model Architecture

Earlier pre-trained models were constrained by unidirectional language modeling (left-to-right or right-to-left), which limits their representational power: each position can only see context from one direction. BERT instead pre-trains with MLM and builds the model from deep bidirectional Transformer components. (A unidirectional Transformer is usually called a Transformer decoder: each token attends only to the tokens to its left. A bidirectional Transformer is called a Transformer encoder: each token attends to all tokens.) The resulting representations therefore fuse left and right context at every layer.
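
The encoder/decoder distinction above comes down to the attention mask. A minimal sketch (illustration only) of the two mask patterns, where entry `[i][j] == 1` means token `i` may attend to token `j`:

```python
def causal_mask(n):
    """Decoder-style mask: token i attends only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def full_mask(n):
    """Encoder-style (BERT) mask: every token attends to every token."""
    return [[1] * n for _ in range(n)]

dec = causal_mask(4)  # lower-triangular
enc = full_mask(4)    # all ones
```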

## Model Download

[bert-base-uncased (used for MNLI classification)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)

[bert-large-uncased (used for SQuAD question answering)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)

## Dataset Preparation

MNLI classification dataset: [MNLI](https://dl.fbaipublicfiles.com/glue/data/MNLI.zip)
SQuAD question-answering dataset: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) and [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

SQuAD v1.1 eval script: [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
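
For reference, the metrics that the eval script reports can be sketched in pure Python (a hedged re-implementation of the exact-match and token-level F1 idea, with answer normalization similar to the official `evaluate-v1.1.py`: lowercasing, dropping punctuation and articles, collapsing whitespace; this is not a substitute for the script above):

```python
import re
import string
from collections import Counter

def normalize(s):
    """Normalize an answer string before comparison."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop English articles
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 if normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```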

## Environment Setup

Running inside Docker is recommended. A [SourceFind (光源)](https://www.sourcefind.cn/#/main-page) image is provided and can be fetched with `docker pull`:

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/tensorflow:2.7.0-centos7.6-dtk-22.10.1-py37-latest
```

## Install Dependencies

```
pip install -r requirements.txt
```

# MNLI Classification Test

## Data Conversion

To be read by the TF2 version, the data must first be converted to tf_record format:

```
python create_finetuning_data.py \
 --input_data_dir=/public/home/hepj/data/MNLI \
 --vocab_file=/public/home/hepj/model/tf2.7.0_Bert/pre_tf2x/vocab.txt \
 --train_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/train.tf_record \
 --eval_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/eval.tf_record \
 --meta_data_file_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/meta_data \
 --fine_tuning_task_type=classification \
 --max_seq_length=32 \
 --classification_task_name=MNLI
 
 #Parameter description
 --input_data_dir              path to the training data
 --vocab_file                  path to the vocab file
 --train_data_output_path      where to save the converted training data
 --eval_data_output_path       where to save the converted eval data
 --meta_data_file_path         where to save the task meta data
 --fine_tuning_task_type       fine-tuning task type
 --do_lower_case               whether to lowercase the input
 --max_seq_length              maximum sequence length
 --classification_task_name    name of the classification task
```
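
The converted examples follow the standard BERT sentence-pair packing: `[CLS] premise [SEP] hypothesis [SEP]`, with segment ids (0 for the first sentence, 1 for the second), an input mask, and zero padding up to `--max_seq_length`. A minimal sketch of that packing (illustration only, assuming already-tokenized input; `pack_pair` is a hypothetical helper, not part of this repo):

```python
def pack_pair(tokens_a, tokens_b, max_seq_length=32):
    """Pack a tokenized sentence pair BERT-style.

    Returns (tokens, segment_ids, input_mask), each of length
    max_seq_length. segment_ids: 0 for [CLS]+A+[SEP], 1 for B+[SEP].
    """
    # Truncate the longer sequence until the pair plus the three
    # special tokens fits.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        (tokens_a if len(tokens_a) > len(tokens_b) else tokens_b).pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)  # 1 = real token, 0 = padding
    pad = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    input_mask += [0] * pad
    return tokens, segment_ids, input_mask
```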

## Model Conversion

TF 2.7 and TF 1.15 store and read checkpoints in different formats. The official BERT checkpoints are generally TF1-based, so they must be converted before use:

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_model.ckpt \
--converted_checkpoint_path pre_tf2x/

#Parameter description
--bert_config_file           BERT model config file
--checkpoint_to_convert      path of the checkpoint to convert
--converted_checkpoint_path  path for the converted checkpoint
```

## Single-Card Run

```
sh bert_class.sh
  #Parameter description
  --mode                      run mode: train_and_eval, export_only, predict
  --input_meta_data_path      meta data used for training and evaluation
  --train_data_path           path to the training data
  --eval_data_path            path to the eval data
  --bert_config_file          BERT model config file
  --init_checkpoint           initial checkpoint path
  --train_batch_size          training batch size
  --eval_batch_size           eval batch size
  --steps_per_loop            logging interval in steps
  --learning_rate             learning rate
  --num_train_epochs          number of training epochs
  --model_dir                 directory for saving the model
  --distribution_strategy     distribution strategy
  --num_gpus                  number of GPUs to use
```

## Multi-Card Run

```
sh bert_class4.sh
```

# SQuAD 1.1 Question-Answering Test

### Data Conversion

To be read by the TF2 version, the data must first be converted to tf_record format:

```
python3 create_finetuning_data.py \
 --squad_data_file=/public/home/hepj/model/model_source/sq1.1/train-v1.1.json \
 --vocab_file=/public/home/hepj/model_source/bert-large-uncased-TF2/uncased_L-24_H-1024_A-16/vocab.txt \
 --train_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/train_new.tf_record \
 --meta_data_file_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/meta_data_new \
 --eval_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/eval_new.tf_record \
 --fine_tuning_task_type=squad \
 --do_lower_case=False \
 --max_seq_length=384
 
#Parameter description
 --squad_data_file             path to the training file
 --vocab_file                  path to the vocab file
 --train_data_output_path      where to save the converted training data
 --eval_data_output_path       where to save the converted eval data
 --meta_data_file_path         where to save the task meta data
 --fine_tuning_task_type       fine-tuning task type
 --do_lower_case               whether to lowercase the input
 --max_seq_length              maximum sequence length
```

### Model Conversion

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_model.ckpt \
--converted_checkpoint_path  /public/home/hepj/model_source/bert-large-uncased-TF2/

#Parameter description
--bert_config_file           BERT model config file
--checkpoint_to_convert      path of the checkpoint to convert
--converted_checkpoint_path  path for the converted checkpoint
```

### Single-Card Run

```
sh bert_squad.sh
  #Parameter description
  --mode                      run mode: train_and_eval, export_only, predict
  --vocab_file                path to the vocab file
  --input_meta_data_path      meta data used for training and evaluation
  --train_data_path           path to the training data
  --eval_data_path            path to the eval data
  --bert_config_file          BERT model config file
  --init_checkpoint           initial checkpoint path
  --train_batch_size          training batch size
  --predict_file              path to the prediction file
  --eval_batch_size           eval batch size
  --steps_per_loop            logging interval in steps
  --learning_rate             learning rate
  --num_train_epochs          number of training epochs
  --model_dir                 directory for saving the model
  --distribution_strategy     distribution strategy
  --num_gpus                  number of GPUs to use
```

### Multi-Card Run

```
sh bert_squad4.sh
```

## Model Accuracy



## Source Repository and Issue Reporting

https://developer.hpccube.com/codes/modelzoo/bert-tf2

## References

https://github.com/tensorflow/models/tree/v2.3.0/official/nlp