# Generative Pre-Training 2 (GPT2)

## Paper

`Language Models are Unsupervised Multitask Learners`

-   https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

## Model Introduction

```
GPT2: the second-generation generative pre-training model (Generative Pre-Training 2).
```

## Model Structure

![gpt2](gpt2.jpg)

```
GPT2 uses the Transformer decoder structure, with a few modifications to the original Transformer decoder: layer normalization is moved to the input of each block, an additional layer normalization is added after the final self-attention block, and the vocabulary size is increased.
```

## Algorithm Overview

![image-gpt](image-gpt.png)

## Dataset

`oscar-1GB`

-   https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz

```
# Download the dataset
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
# Download the vocab and merges files
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
xz -d oscar-1GB.jsonl.xz

# Preprocessing arguments
--input             input dataset path, i.e. the decompressed oscar-1GB.jsonl
--output-prefix     output path prefix; the _text_document suffix is appended automatically
--vocab             path to the downloaded gpt2-vocab.json vocabulary file
--dataset-impl      dataset implementation type
--tokenizer-type    tokenizer type
--merge-file        path to the downloaded gpt2-merges.txt file
--append-eod        append an end-of-document token
--workers           number of worker processes

# Preprocess the dataset
sh creat-data.sh
```
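
For reference, a minimal sketch of what `creat-data.sh` might wrap, assuming it calls the upstream Megatron-DeepSpeed `tools/preprocess_data.py` entry point (the `mmap` dataset type, `GPT2BPETokenizer`, and the worker count below are illustrative choices, not necessarily the script's actual values):

```
# Sketch only: assumes the upstream tools/preprocess_data.py entry point
python tools/preprocess_data.py \
       --input oscar-1GB.jsonl \
       --output-prefix my-gpt2 \
       --vocab gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod \
       --workers 8
```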

```
# Directory layout after preprocessing
├── my-gpt2_text_document.bin
├── my-gpt2_text_document.idx
└── oscar-1GB.jsonl
```

## Environment Setup

Running in Docker is recommended. A prebuilt image can be pulled from [光源](https://www.sourcefind.cn/):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-23.04-py37-latest
```
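
To start a container from the image, something along these lines should work (the device mappings and volume mount are illustrative and depend on your system; DCU containers typically need `/dev/kfd` and `/dev/dri`):

```
docker run -it --network=host --ipc=host --shm-size=16G \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    -v /path/to/workspace:/workspace \
    image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-23.04-py37-latest /bin/bash
```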

Inside the container, install the Python dependencies:

```
pip install -r requirements.txt  -i https://mirrors.aliyun.com/pypi/simple/  --trusted-host mirrors.aliyun.com
```

## GPT2 Pre-training

### GPT2 Single-node Training

```
# np is the number of processes launched and must equal the number of GPUs in use,
# with TP*PP <= np. With 4 cards you can use TP=2/PP=2, TP=1/PP=4, or TP=4/PP=1;
# using TP within a node gives better performance.

mpirun -np 4 run-one-node.sh    # single node, four cards
```

```
# Key parameters
MODEL_NAME          model name (user-defined)
CHECKPOINT_PATH     checkpoint save/load path
DATA_PATH           dataset path (after preprocessing)
TENSORBOARD_PATH    tensorboard output path
CODECARBON_PATH     codecarbon output path

N_GPUS              number of accelerator cards used
TP_SIZE             tensor-parallel size
PP_SIZE             pipeline-parallel size
MICRO_BATCH_SIZE    micro batch size
GLOBAL_BATCH_SIZE   global batch size
NLAYERS             number of model layers
NHIDDEN             hidden dimension
NHEADS              number of attention heads
SEQ_LEN             maximum sequence length
SAVE_INTERVAL       checkpoint save interval

--train-samples     number of training samples
--eval-interval     evaluation interval
--eval-iters        evaluation iterations
```
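
The parallel and batch settings are linked: the data-parallel size is `N_GPUS / (TP_SIZE * PP_SIZE)`, and `GLOBAL_BATCH_SIZE` must be a multiple of `MICRO_BATCH_SIZE` times the data-parallel size. An illustrative sketch for the four-card case (the values here are examples, not the script's actual defaults):

```
N_GPUS=4
TP_SIZE=2                                   # tensor parallelism inside the node
PP_SIZE=2                                   # pipeline parallelism
DP_SIZE=$((N_GPUS / (TP_SIZE * PP_SIZE)))   # data-parallel size: 4 / (2*2) = 1

MICRO_BATCH_SIZE=4
GLOBAL_BATCH_SIZE=16                        # multiple of MICRO_BATCH_SIZE * DP_SIZE
# gradient accumulation steps = GLOBAL_BATCH_SIZE / (MICRO_BATCH_SIZE * DP_SIZE) = 4
```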

### GPT2 16B Multi-node Training

The DCU cluster must have the corresponding virtual environment configured, with the Python dependencies installed.

The following dependencies must be installed as DTK-compiled builds, available from the [光合开发者社区](https://cancon.hpccube.com:65024/4/main/):

```
pytorch
deepspeed
apex
torchaudio
colossalai
faiss
mmcv-full
torchvision
tensorflow
```

Taking DTK 23.04, Python 3.7, and torch 1.10 as an example: on the [光合开发者社区](https://cancon.hpccube.com:65024/4/main/) site, navigate to pytorch -> dtk23.04 and download torch-1.10.0+gite378c3c.abi0.dtk2304-cp37-cp37m-manylinux2014_x86_64.whl. Then set up the environment as follows:

```
# Create a virtual environment
export PYTHON3_LIB_PATH=/python_lib_path
virtualenv -p /python_bin_path/python3 --system-site-packages venv_gpt2
# Activate the venv_gpt2 virtual environment
source venv_gpt2/bin/activate
# Load DTK and other environment settings
source env.sh
# Install the DTK builds of torch and deepspeed
pip install torch-1.10.0+gite378c3c.abi0.dtk2304-cp37-cp37m-manylinux2014_x86_64.whl
pip install deepspeed-0.9.2+git25d5540.abi0.dtk2304.torch1.10.0-cp37-cp37m-manylinux2014_x86_64.whl
# Install the remaining dependencies
pip install -r requirements.txt  -i http://pypi.tuna.tsinghua.edu.cn/simple  --trusted-host pypi.tuna.tsinghua.edu.cn
```

```
# Multi-node run
sbatch run-16B.sh    # main parameters are in single-16B.sh
```

```
# Key parameters
MODEL_NAME          model name (user-defined)
CHECKPOINT_PATH     checkpoint save/load path
DATA_PATH           dataset path (after preprocessing)
TENSORBOARD_PATH    tensorboard output path
CODECARBON_PATH     codecarbon output path

TP_SIZE             tensor-parallel size
PP_SIZE             pipeline-parallel size
MICRO_BATCH_SIZE    micro batch size
GLOBAL_BATCH_SIZE   global batch size
NLAYERS             number of layers
NHIDDEN             hidden dimension
NHEADS              number of attention heads
SEQ_LEN             maximum sequence length
SAVE_INTERVAL       checkpoint save interval

--train_iters       number of training iterations
--eval-interval     evaluation interval
--eval-iters        evaluation iterations
```
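
The same data-parallel relation applies at cluster scale. For example, with 32 nodes of 4 DCUs each (128 cards), an illustrative TP=4 / PP=8 split leaves a data-parallel size of 4 (the actual split lives in single-16B.sh and may differ):

```
N_GPUS=128                                  # 32 nodes x 4 DCUs
TP_SIZE=4                                   # TP within each 4-card node
PP_SIZE=8
DP_SIZE=$((N_GPUS / (TP_SIZE * PP_SIZE)))   # 128 / 32 = 4
```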

### 16B Model Training Loss

|   Cards   |   lm loss    |
| :-------: | :----------: |
| 32 x 4DCU | 1.965622E+00 |

### 16B Model Validation

|   Cards   | lm loss value | lm loss PPL  |
| :-------: | :-----------: | :----------: |
| 32 x 4DCU | 4.299443E+00  | 7.365877E+01 |

## GPT2 Text Generation

### Conversion for Multi-card Inference

```
# Trained checkpoints are saved in DeepSpeed format. To use them for inference
# they must be converted to Megatron format; a deepspeed -> megatron conversion
# requires the TP size to stay the same before and after.
# Conversion script
sh conver-model_to_megatron.sh
```

```
# Key parameters
Add the project path to PYTHONPATH,
e.g.: export PYTHONPATH=/home/megatron-deepspeed_dtk23.04:$PYTHONPATH

CHECKPOINT_PATH   path of the checkpoint to convert (down to the saved global_step directory)
output_folder     output path for the converted model
target_tp         TP size after conversion; keep it the same as in training, or set it to 1
target_pp         PP size after conversion; keep it the same as in training, or set it to 1
```
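
For reference, a sketch of the call `conver-model_to_megatron.sh` likely makes, assuming the upstream Megatron-DeepSpeed converter `tools/convert_checkpoint/deepspeed_to_megatron.py` (the global_step directory and the TP/PP values are illustrative):

```
# Sketch only: assumes the upstream deepspeed_to_megatron.py converter
python tools/convert_checkpoint/deepspeed_to_megatron.py \
       --input_folder $CHECKPOINT_PATH/global_step1000 \
       --output_folder $output_folder \
       --target_tp 4 \
       --target_pp 1
```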

### Conversion for Single-card Inference

```
# The original checkpoint is saved in DeepSpeed format, and a deepspeed -> megatron
# conversion requires the TP size to stay the same. So first convert
# deepspeed -> deepspeed (changing TP to 1), then deepspeed -> megatron to get an
# inference-ready checkpoint.

# Conversion script
sh conver-model-1tp.sh
```
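
A sketch of the two-step flow, assuming the repo mirrors the upstream `tools/convert_checkpoint` scripts (the script names, paths, and global_step directory are assumptions; check `conver-model-1tp.sh` for the actual calls):

```
# Step 1 (assumed script): reshape the DeepSpeed checkpoint to TP=1
python tools/convert_checkpoint/deepspeed_to_deepspeed.py \
       --input_folder $CHECKPOINT_PATH/global_step1000 \
       --output_folder ./ckpt_tp1 \
       --target_tp 1 \
       --target_pp 1

# Step 2: convert the reshaped checkpoint to Megatron format
python tools/convert_checkpoint/deepspeed_to_megatron.py \
       --input_folder ./ckpt_tp1/global_step1000 \
       --output_folder ./ckpt_megatron \
       --target_tp 1 \
       --target_pp 1
```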


### Unconditional Text Generation

```
# Multi-card inference
mpirun -np 4 run-inf-gpus.sh
# Single-card inference
mpirun -np 1 run-inf.sh
```

```
# At generation time the model parameters must match those used in training (TP included)
--micro-batch-size    micro batch size
--out-seq-length      length of the generated output text
--genfile             file where generated text is saved
--num-samples         number of samples to generate
```
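
For reference, a sketch of the generation call the run-inf scripts likely make, assuming the upstream `tools/generate_samples_gpt.py` entry point (paths and values are illustrative; the remaining model and tokenizer flags must mirror the training configuration):

```
# Sketch only: assumes the upstream generate_samples_gpt.py entry point
python tools/generate_samples_gpt.py \
       --load ./ckpt_megatron \
       --micro-batch-size 2 \
       --out-seq-length 256 \
       --num-samples 10 \
       --genfile samples.json
# plus the model/tokenizer flags used at training time
# (NLAYERS, NHIDDEN, NHEADS, SEQ_LEN, vocab and merge files, etc.)
```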

## Application Scenarios

### Algorithm Category

`Text Generation`

### Key Application Industries



## Result

Convergence of the 16B model on the oscar dataset:



![image-20230524143710566](image-gpt-loss.png)

![image-20230524143830580](image-gpt-loss2.png)

## Source Repository & Issue Feedback

https://developer.hpccube.com/codes/modelzoo/gpt2-pytorch/

## References

https://github.com/bigscience-workshop/Megatron-DeepSpeed