# Generative Pre-Training 2 (GPT2)

## Paper

`Language Models are Unsupervised Multitask Learners`

-   https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

## Model Architecture

![gpt2](gpt2.jpg)

GPT2 uses the Transformer decoder structure, with a few changes to the original Transformer decoder: the normalization layer is moved to the input of each block, an additional normalization layer is added after the final self-attention block, and the vocabulary is enlarged.

## Algorithm Principle

![image-gpt](image-gpt.png)

## Dataset

`oscar-1GB`

-   https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz

```
# Download the dataset
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
# Download the vocab and merges files
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
# Decompress the dataset
xz -d oscar-1GB.jsonl.xz

# Dataset preprocessing arguments
--input				path to the input dataset, i.e. the file produced by decompressing oscar-1GB.jsonl.xz
--output-prefix		output path prefix; the suffix _text_document is appended automatically
--vocab				path to the downloaded gpt2-vocab.json vocabulary file
--dataset-impl		dataset implementation type
--tokenizer-type 	tokenizer type
--merge-file		path to the downloaded gpt2-merges.txt file
--append-eod		append an end-of-document token
--workers			number of worker processes

# Preprocess the dataset
sh creat-data.sh
```
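
For reference, a hedged sketch of what `creat-data.sh` might contain, assuming the standard Megatron-DeepSpeed `tools/preprocess_data.py` entry point with the arguments listed above (paths and the worker count are placeholders):

```
# Hypothetical sketch of creat-data.sh; adjust paths to your setup
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix my-gpt2 \
    --vocab gpt2-vocab.json \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8
```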

```
# Layout after preprocessing
├── my-gpt2_text_document.bin
├── my-gpt2_text_document.idx
└── oscar-1GB.jsonl
```

## Environment Setup

We recommend running inside Docker; an image can be pulled from [光源](https://www.sourcefind.cn/):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-23.04-py37-latest
```
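
A hedged sketch of starting the container (the device and volume flags are assumptions for a typical DCU host and may need adjusting):

```
# Hypothetical docker run invocation; device/volume flags may differ on your host
docker run -it --name gpt2 \
    --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined \
    -v $PWD:/workspace -w /workspace \
    image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-23.04-py37-latest /bin/bash
```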

Inside the container, install the Python dependencies:

```
pip install -r requirements.txt  -i https://mirrors.aliyun.com/pypi/simple/  --trusted-host mirrors.aliyun.com
```

## GPT2 Pretraining

### GPT2 Single-Node Training

```
# np is the number of processes launched and must equal the number of GPUs in use;
# TP*PP must evenly divide np. With 4 cards you can set TP=2/PP=2, TP=1/PP=4, or
# TP=4/PP=1; within a single node, TP generally gives better performance.

mpirun -np 4 run-one-node.sh    # single node with 4 cards
```

```
# Key parameters
MODEL_NAME 					model name (user-defined)
CHECKPOINT_PATH				checkpoint save & load path
DATA_PATH					dataset path (after preprocessing)
TENSORBOARD_PATH			TensorBoard output path
CODECARBON_PATH				CodeCarbon output path

N_GPUS         				number of accelerator cards
TP_SIZE  	 				tensor-parallel size
PP_SIZE      				pipeline-parallel size
MICRO_BATCH_SIZE			micro batch size
GLOBAL_BATCH_SIZE           global batch size
NLAYERS 					number of model layers
NHIDDEN						hidden dimension
NHEADS						number of attention heads
SEQ_LEN						maximum sequence length
SAVE_INTERVAL				checkpoint save interval

--train-samples				number of training samples
--eval-interval				evaluation interval
--eval-iters				iterations per evaluation
```
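
For orientation, a hedged illustration of the kind of settings `run-one-node.sh` presumably wires into `pretrain_gpt.py` (the actual script contents may differ; values are examples for 4 cards):

```
# Hypothetical fragment of run-one-node.sh
N_GPUS=4
TP_SIZE=2           # tensor parallelism
PP_SIZE=2           # pipeline parallelism; data parallel = N_GPUS/(TP_SIZE*PP_SIZE)
MICRO_BATCH_SIZE=4
GLOBAL_BATCH_SIZE=16

python pretrain_gpt.py \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    --micro-batch-size $MICRO_BATCH_SIZE \
    --global-batch-size $GLOBAL_BATCH_SIZE
    # ...plus the model-size, data, and logging flags listed above
```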

### GPT2 16B Multi-Node Training

This requires a DCU cluster with the corresponding virtual environment configured and the Python dependencies installed.

The following dependencies must be installed as DTK-compiled builds, which can be downloaded from the [光合开发者社区](https://cancon.hpccube.com:65024/4/main/):

```
pytorch
deepspeed
apex
torchaudio
colossalai
faiss
mmcv-full
torchvision
tensorflow
```

Taking DTK 23.04, Python 3.7, and torch 1.10 as an example: in the [光合开发者社区](https://cancon.hpccube.com:65024/4/main/), navigate to pytorch -> dtk23.04 and download torch-1.10.0+gite378c3c.abi0.dtk2304-cp37-cp37m-manylinux2014_x86_64.whl. Then set up the environment along the following lines:

```
# Create a virtual environment
export PYTHON3_LIB_PATH=/python_lib_path
virtualenv -p /python_bin_path/python3 --system-site-packages venv_gpt2
# Activate the venv_gpt2 virtual environment
source venv_gpt2/bin/activate
# Load DTK and other environment settings
source env.sh
# Install the DTK-built dependencies
pip install torch-1.10.0+gite378c3c.abi0.dtk2304-cp37-cp37m-manylinux2014_x86_64.whl
pip install deepspeed-0.9.2+git25d5540.abi0.dtk2304.torch1.10.0-cp37-cp37m-manylinux2014_x86_64.whl
# Install the remaining dependencies
pip install -r requirements.txt  -i http://pypi.tuna.tsinghua.edu.cn/simple  --trusted-host pypi.tuna.tsinghua.edu.cn
```
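
After installation, a quick sanity check that the DTK build of torch is importable (on DCU the HIP backend reports through the `torch.cuda` API):

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```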

```
# Multi-node run. The key parameters live in single-16B.sh; training defaults to
# fp32 precision. To train in fp16 instead, run: sbatch run-16B-fp16.sh
sbatch run-16B.sh
```

```
# Key parameters
MODEL_NAME 					model name (user-defined)
CHECKPOINT_PATH				checkpoint save & load path
DATA_PATH					dataset path (after preprocessing)
TENSORBOARD_PATH			TensorBoard output path
CODECARBON_PATH				CodeCarbon output path


TP_SIZE  	 				tensor-parallel size
PP_SIZE      				pipeline-parallel size
MICRO_BATCH_SIZE			micro batch size
GLOBAL_BATCH_SIZE           global batch size
NLAYERS 					number of layers
NHIDDEN						hidden dimension
NHEADS						number of attention heads
SEQ_LEN						maximum sequence length
SAVE_INTERVAL				checkpoint save interval

--train_iters				number of training iterations
--eval-interval				evaluation interval
--eval-iters				iterations per evaluation
```
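
For orientation, a hedged sketch of the Slurm wrapper in `run-16B.sh` (the header below is an assumption; the node count follows the 32 x 4DCU setup in the tables below):

```
#!/bin/bash
# Hypothetical sketch of the Slurm wrapper in run-16B.sh; the real script may differ
#SBATCH --nodes=32              # 32 nodes x 4 DCUs each
#SBATCH --ntasks-per-node=4     # one rank per DCU

srun bash single-16B.sh         # per-node launcher holding the model parameters
```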

### 16B Model Training Loss

|   Cards   |   lm loss    |
| :-------: | :----------: |
| 32 x 4DCU | 1.965622E+00 |

### 16B Model Validation

|   Cards   | lm loss value | lm loss PPL  |
| :-------: | :-----------: | :----------: |
| 32 x 4DCU | 4.299443E+00  | 7.365877E+01 |

## GPT2 Text Generation

### Conversion for Multi-Card Inference

```
# Trained checkpoints are saved in DeepSpeed format. To use them for inference,
# they must be converted to Megatron format; a deepspeed -> megatron conversion
# requires the TP size to stay the same before and after.
# Conversion script
sh conver-model_to_megatron.sh
```

```
# Key parameters
# The project path must be added to PYTHONPATH, for example:
export PYTHONPATH=/home/megatron-deepspeed_dtk23.04:$PYTHONPATH

CHECKPOINT_PATH  path of the checkpoint to convert (down to the saved global_step directory)
output_folder	 output path for the converted model
target_tp		 TP size after conversion; keep the same as in training, or set to 1
target_pp		 PP size after conversion; keep the same as in training, or set to 1
```
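
A hedged sketch of what `conver-model_to_megatron.sh` might invoke, assuming the `tools/convert_checkpoint/deepspeed_to_megatron.py` converter shipped with upstream Megatron-DeepSpeed (the global_step directory and parallel sizes are placeholders):

```
# Hypothetical sketch; converter path and flags follow upstream Megatron-DeepSpeed
python tools/convert_checkpoint/deepspeed_to_megatron.py \
    --input_folder  $CHECKPOINT_PATH/global_step1000 \
    --output_folder ./megatron-ckpt \
    --target_tp 2 \
    --target_pp 1
```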

### Conversion for Single-Card Inference

```
# The original checkpoint is saved in DeepSpeed format, and a deepspeed -> megatron
# conversion requires the TP size to stay the same before and after. So first convert
# deepspeed -> deepspeed (changing TP to 1), then convert deepspeed -> megatron to
# obtain an inference-ready format.

# Conversion script
sh conver-model-1tp.sh
```
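
The two-step flow, sketched under the assumption that a DeepSpeed-to-DeepSpeed reshaping script sits alongside the Megatron converter (the script name `deepspeed_to_deepspeed.py` is an assumption; the global_step directory is a placeholder):

```
# Hypothetical sketch of the two-step conversion in conver-model-1tp.sh
# Step 1: deepspeed -> deepspeed, reshaping TP down to 1
python tools/convert_checkpoint/deepspeed_to_deepspeed.py \
    --input_folder  $CHECKPOINT_PATH/global_step1000 \
    --output_folder ./ds-ckpt-1tp \
    --target_tp 1 --target_pp 1
# Step 2: deepspeed -> megatron, with TP unchanged (now 1)
python tools/convert_checkpoint/deepspeed_to_megatron.py \
    --input_folder  ./ds-ckpt-1tp/global_step1000 \
    --output_folder ./megatron-ckpt-1tp \
    --target_tp 1 --target_pp 1
```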


### Unconditional Text Generation

```
# Multi-card inference
mpirun -np 4 run-inf-gpus.sh
# Single-card inference
mpirun -np 1 run-inf.sh
```

```
# At generation time, all model parameters must match those used in training
# (including the TP size)
--micro-batch-size  	micro batch size
--out-seq-length		output text length
--genfile				file in which generated text is saved
--num-samples			number of samples to generate
```
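
A hedged sketch of the generation command inside `run-inf.sh`, assuming the standard `tools/generate_samples_gpt.py` entry point from upstream Megatron-DeepSpeed (paths and values are placeholders):

```
# Hypothetical sketch of run-inf.sh; flags follow upstream Megatron-DeepSpeed
python tools/generate_samples_gpt.py \
    --load ./megatron-ckpt-1tp \
    --tensor-model-parallel-size 1 \
    --micro-batch-size 2 \
    --out-seq-length 256 \
    --num-samples 10 \
    --genfile unconditional_samples.json
    # ...plus the NLAYERS/NHIDDEN/NHEADS/SEQ_LEN flags used in training
```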

## Application Scenarios

### Algorithm Category

`Text Generation`

### Key Application Industries

`Internet`

## Results

The convergence of the 16B model on the oscar dataset is shown below:



![image-20230524143710566](image-gpt-loss.png)

![image-20230524143830580](image-gpt-loss2.png)

## Source Repository & Issue Reporting

https://developer.hpccube.com/codes/modelzoo/gpt2-pytorch/

## References

https://github.com/bigscience-workshop/Megatron-DeepSpeed