# Pre-test Preparation

## 1. Dataset Preparation

Download the GLUE dataset: https://pan.baidu.com/s/1tLd8opr08Nw5PzUBh7lXsQ (extraction code: fyvy)

The classification tests use the MNLI dataset from this archive.

Question-answering data:

[train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)

[dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

[evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
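
Both SQuAD files share the same JSON layout: `data` → articles → `paragraphs` → `qas`, where each question carries `answers` with a `text` string and a character offset `answer_start`. A minimal sketch of walking that structure, using an inline miniature document rather than the real file:

```python
import json

# Miniature document in the SQuAD v1.1 layout: data -> paragraphs -> qas.
sample = json.loads("""
{
  "version": "1.1",
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "BERT was introduced by Google in 2018.",
          "qas": [
            {
              "id": "q1",
              "question": "Who introduced BERT?",
              "answers": [{"text": "Google", "answer_start": 23}]
            }
          ]
        }
      ]
    }
  ]
}
""")

def count_questions(squad):
    """Count question/answer pairs across all articles and paragraphs."""
    return sum(len(p["qas"]) for art in squad["data"] for p in art["paragraphs"])

print(count_questions(sample))  # 1
```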

## 2. Environment Setup

Create and activate a Python virtual environment:

```
virtualenv -p python3 --system-site-packages venv_2
source venv_2/bin/activate
```

Install the Python dependencies:

```
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install tensorflow-2.7.0-cp36-cp36m-linux_x86_64.whl
pip install horovod-0.21.3-cp36-cp36m-linux_x86_64.whl
pip install apex-0.1-cp36-cp36m-linux_x86_64.whl
```

Set the environment variables:

```
module rm compiler/rocm/2.9
export ROCM_PATH=/public/home/hepj/job_env/apps/dtk-21.10.1
export HIP_PATH=${ROCM_PATH}/hip
export AMDGPU_TARGETS="gfx900;gfx906"
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${ROCM_PATH}/hcc/bin:${ROCM_PATH}/hip/bin:$PATH
```

## 3. MNLI Classification Test

### 3.1 Single-Card Test (Single Precision)

#### 3.1.1 Data Conversion

TF 2.x reads data differently from TF 1.x, so the dataset must be converted to the tf_record format:

```
python ../data/create_finetuning_data.py \
 --input_data_dir=/public/home/hepj/data/MNLI \
 --vocab_file=/public/home/hepj/model/tf2.7.0_Bert/pre_tf2x/vocab.txt \
 --train_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/train.tf_record \
 --eval_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/eval.tf_record \
 --meta_data_file_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/meta_data \
 --fine_tuning_task_type=classification \
 --max_seq_length=32 \
 --classification_task_name=MNLI
```
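
The script writes each example as a serialized `tf.train.Example` inside a TFRecord container. The container framing itself is simple; the sketch below iterates records with plain `struct`, skipping the CRC checks that real readers perform (and writing zeroed CRCs, which TensorFlow itself would reject — this is for inspection only):

```python
import struct, io

def iter_tfrecords(stream):
    """Yield raw serialized records from a TFRecord byte stream.

    Each record is framed as: uint64 length, uint32 length-CRC,
    payload bytes, uint32 payload-CRC (CRCs are ignored here).
    """
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)              # masked crc32c of the length (skipped)
        payload = stream.read(length)
        stream.read(4)              # masked crc32c of the payload (skipped)
        yield payload

def write_tfrecord(stream, payload):
    """Write one record with zeroed CRC fields (enough for this demo)."""
    stream.write(struct.pack("<Q", len(payload)))
    stream.write(b"\x00" * 4)
    stream.write(payload)
    stream.write(b"\x00" * 4)

buf = io.BytesIO()
write_tfrecord(buf, b"example-1")
write_tfrecord(buf, b"example-2")
buf.seek(0)
print([r.decode() for r in iter_tfrecords(buf)])  # ['example-1', 'example-2']
```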

#### 3.1.2 Model Conversion

TF 2.7 and TF 1.15 store and load checkpoints in different formats. The officially published BERT checkpoints are generally TF 1.x models, so they must be converted:

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model_source/uncased_L-12_H-768_A-12/bert_model.ckpt \
--converted_checkpoint_path pre_tf2x/
```
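
Conceptually, the converter walks the TF 1.x checkpoint and re-maps each variable name onto the TF 2.x Keras layout. The real mapping lives inside `tf2_encoder_checkpoint_converter.py`; the rules below are purely illustrative of the technique, not the actual table:

```python
import re

# Illustrative-only rename rules; the converter's real mapping differs.
RENAME_RULES = [
    (r"^bert/embeddings/word_embeddings$",
     r"encoder/word_embeddings/embeddings"),
    (r"^bert/encoder/layer_(\d+)/attention/self/query/kernel$",
     r"encoder/transformer/layer_\1/self_attention/query/kernel"),
]

def convert_name(tf1_name):
    """Apply the first matching rename rule, else keep the name unchanged."""
    for pattern, replacement in RENAME_RULES:
        new_name, n = re.subn(pattern, replacement, tf1_name)
        if n:
            return new_name
    return tf1_name

print(convert_name("bert/encoder/layer_0/attention/self/query/kernel"))
```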

#### 3.1.3 bert_class.sh

```
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
python3 run_classifier.py \
  --mode=train_and_eval \
  --input_meta_data_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/meta_data \
  --train_data_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/train.tf_record \
  --eval_data_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/eval.tf_record \
  --bert_config_file=/public/home/hepj/model/tf2.7.0_Bert/pre_tf2x/bert_config.json \
  --init_checkpoint=/public/home/hepj/model/tf2.7.0_Bert/pre_tf2x/bert_model.ckpt \
  --train_batch_size=320 \
  --eval_batch_size=32 \
  --steps_per_loop=1000 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --model_dir=/public/home/hepj/model/tf2/out1 \
  --distribution_strategy=mirrored
```
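
With `--steps_per_loop=1000` it helps to know how many steps one epoch takes. Assuming the standard MNLI train split of 392,702 examples (verify against your copy of the data), the arithmetic is:

```python
import math

# Assumed: the standard MNLI train split has 392,702 examples.
MNLI_TRAIN_EXAMPLES = 392_702
TRAIN_BATCH_SIZE = 320          # matches --train_batch_size above
NUM_EPOCHS = 3                  # matches --num_train_epochs

steps_per_epoch = math.ceil(MNLI_TRAIN_EXAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS
print(steps_per_epoch, total_steps)  # 1228 3684
```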

#### 3.1.4 Run

sh bert_class.sh

### 3.2 Four-Card Test (Single Precision)

#### 3.2.1 Data Conversion

Same as the single-card test (3.1.1).

#### 3.2.2 Model Conversion

Same as the single-card test (3.1.2).

#### 3.2.3 bert_class4.sh

```
# --train_batch_size here is the global train batch size
# Launching multiple cards via mpirun still has some issues
export HIP_VISIBLE_DEVICES=0,1,2,3
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
python3 run_classifier.py \
  --mode=train_and_eval \
  --input_meta_data_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/meta_data  \
  --train_data_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/train.tf_record \
  --eval_data_path=/public/home/hepj/model/tf2.7.0_Bert/MNLI/eval.tf_record  \
  --bert_config_file=/public/home/hepj/model/tf2.7.0_Bert/pre_tf2x/bert_config.json \
  --init_checkpoint=/public/home/hepj/model/tf2.7.0_Bert/pre_tf2x/bert_model.ckpt \
  --train_batch_size=1280 \
  --eval_batch_size=32 \
  --steps_per_loop=10 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --num_gpus=4 \
  --model_dir=/public/home/hepj/outdir/tf2/class4 \
  --distribution_strategy=mirrored
```
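
Under `--distribution_strategy=mirrored` the `--train_batch_size` is global, as the comment in the script notes: each of the 4 GPUs receives an equal share. A quick check that the global batch of 1280 reduces to the single-card batch of 320 per device:

```python
GLOBAL_BATCH_SIZE = 1280   # --train_batch_size in bert_class4.sh
NUM_GPUS = 4               # --num_gpus

# MirroredStrategy splits the global batch evenly across replicas.
assert GLOBAL_BATCH_SIZE % NUM_GPUS == 0, "global batch must divide evenly"
per_replica_batch = GLOBAL_BATCH_SIZE // NUM_GPUS
print(per_replica_batch)  # 320
```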

#### 3.2.4 Run

```
sh bert_class4.sh
```



## 4. SQuAD 1.1 Question-Answering Test

### 4.1 Single-Card Test (Single Precision)

#### 4.1.1 Data Conversion

```
python3 create_finetuning_data.py \
 --squad_data_file=/public/home/hepj/model/model_source/sq1.1/train-v1.1.json \
 --vocab_file=/public/home/hepj/model_source/bert-large-uncased-TF2/uncased_L-24_H-1024_A-16/vocab.txt \
 --train_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/train_new.tf_record \
 --meta_data_file_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/meta_data_new \
 --eval_data_output_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/eval_new.tf_record \
 --fine_tuning_task_type=squad \
 --do_lower_case=False \
 --max_seq_length=384
```
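
For SQuAD, `create_finetuning_data.py` must map each answer's character offset onto token positions before writing the tf_record. A simplified sketch of that idea using whitespace tokens (the real pipeline uses WordPiece tokenization, so its indices differ):

```python
def char_span_to_token_span(context, answer_text, answer_start):
    """Map a character-level answer span to (start, end) token indices.

    Simplified: whitespace tokens only; the real code uses WordPiece.
    """
    # Record each whitespace token with its starting character offset.
    tokens, offsets = [], []
    pos = 0
    for tok in context.split():
        pos = context.index(tok, pos)
        tokens.append(tok)
        offsets.append(pos)
        pos += len(tok)
    answer_end = answer_start + len(answer_text) - 1
    # Start/end are the last tokens beginning at or before each boundary.
    start = max(i for i, off in enumerate(offsets) if off <= answer_start)
    end = max(i for i, off in enumerate(offsets) if off <= answer_end)
    return tokens, start, end

ctx = "BERT was pre-trained on Wikipedia and BooksCorpus."
tokens, s, e = char_span_to_token_span(ctx, "Wikipedia", ctx.index("Wikipedia"))
print(tokens[s:e + 1])  # ['Wikipedia']
```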

#### 4.1.2 Model Conversion

```
python3 tf2_encoder_checkpoint_converter.py \
--bert_config_file /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_config.json \
--checkpoint_to_convert /public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_model.ckpt \
--converted_checkpoint_path  /public/home/hepj/model_source/bert-large-uncased-TF2/
```

#### 4.1.3 bert_squad.sh

```
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
export MIOPEN_ENABLE_LOGGING_CMD=1
export ROCBLAS_LAYER=3
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
python3 run_squad_xuan.py \
--mode=train_and_eval \
--vocab_file=/public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/vocab.txt \
--bert_config_file=/public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_config.json \
--input_meta_data_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/meta_data \
--train_data_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/train.tf_record \
--predict_file=/public/home/hepj/model/model_source/sq1.1/dev-v1.1.json \
--init_checkpoint=/public/home/hepj/model_source/bert-large-uncased-TF2/bert_model.ckpt \
--train_batch_size=4 \
--predict_batch_size=4 \
--learning_rate=2e-5 \
--log_steps=1 \
--num_gpus=1 \
--distribution_strategy=mirrored \
--model_dir=/public/home/hepj/model/tf2/squad1 \
--run_eagerly=False
```
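
The downloaded `evaluate-v1.1.py` scores predictions by normalized exact match and token-level F1. A compact re-implementation of the core idea (the normalization mirrors the official script's lowercasing, punctuation removal, and article stripping):

```python
import re, string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    return normalize(prediction) == normalize(ground_truth)

def f1_score(prediction, ground_truth):
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Google", "google"))             # True
print(round(f1_score("in the year 2018", "2018"), 2))  # 0.5
```

The official script additionally takes the maximum score over all reference answers for each question and averages over the dataset.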

#### 4.1.4 Run

```
sh bert_squad.sh
```

### 4.2 Four-Card Test (Single Precision)

#### 4.2.1 Data Conversion

Same as the single-card test (4.1.1).

#### 4.2.2 Model Conversion

Same as the single-card test (4.1.2).

#### 4.2.3 bert_squad4.sh

```
# --train_batch_size here is the global train batch size
# Launching multiple cards via mpirun still has some issues
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=3
module unload compiler/rocm/2.9
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_squad_xuan.py \
  --mode=train_and_eval \
  --vocab_file=/public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=/public/home/hepj/model/model_source/uncased_L-24_H-1024_A-16/bert_config.json \
  --input_meta_data_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/meta_data \
  --train_data_path=/public/home/hepj/model/tf2.7.0_Bert/squad1.1/train.tf_record \
  --predict_file=/public/home/hepj/model/model_source/sq1.1/dev-v1.1.json \
  --init_checkpoint=/public/home/hepj/model_source/bert-large-uncased-TF2/bert_model.ckpt \
  --train_batch_size=16 \
  --predict_batch_size=4 \
  --learning_rate=2e-5 \
  --log_steps=1 \
  --num_gpus=4 \
  --distribution_strategy=mirrored \
  --model_dir=/public/home/hepj/outdir/tf2/squad4 \
  --run_eagerly=False
```

#### 4.2.4 Run

```
sh bert_squad4.sh
```
