# Contents
- [Contents](#contents)
- [Environment Setup](#environment-setup)
- [Pretraining](#pretraining)
  - [GPT](#gpt)
    - [Download the vocabulary files](#download-the-vocabulary-files)
    - [Download the training data](#download-the-training-data)
    - [Data preprocessing](#data-preprocessing)
    - [GPT pretraining](#gpt-pretraining)
  - [Llama](#llama)
    - [Download the tokenizer files](#download-the-tokenizer-files)
    - [Download the training data](#download-the-training-data-1)
    - [Data preprocessing](#data-preprocessing-1)
    - [Llama pretraining](#llama-pretraining)
- [References](#references)

# Changelog

2025.4.14: updated the lightop norm operators.

2025.4.1: updated the legacy fixed-length FA interface.

2025.3.14: adapted to the latest code; the shell launch scripts are in the corresponding model directories under `examples`. Model-related datasets can be downloaded from: https://r0ddbu55vzx.feishu.cn/drive/folder/ZxHHfCoX4lg75td2hTqcmiAin3g

2024.12.16: added torch profiler (prof) support.

Usage: add the following parameters to the launch script to collect the corresponding profiling information.

```shell
# collect a torch profile
mpirun -np 8 --allow-run-as-root train_mixtral_8x7B_1nodes.sh localhost --profiling=torch
```

```bash
# profiler-related arguments
TORCH_PROFIE_ARGS=(
    --profile                 # enable profiling
    --profile-step-start 4    # skip the first 3 iterations, warm up on the 4th
    --profile-step-end 5      # profile the 5th iteration
    --use-pytorch-profiler    # use the pytorch profiler
    --profile-ranks 0 3       # profile global ranks 0 and 3
    --profile-dir ./prof_data # directory where the profiler files are saved
)
```
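
These arguments only take effect if the launch script forwards them to the training entry point. A minimal sketch of splicing the array into the python command inside the launch script, assuming the script builds its command from bash argument arrays as in the stock Megatron-LM examples (`pretrain_gpt.py` and the other arrays are placeholders for whatever your script already uses):

```shell
# sketch only: append the profiler arguments to the existing training command
python pretrain_gpt.py \
    "${MODEL_ARGS[@]}" \
    "${TRAINING_ARGS[@]}" \
    "${TORCH_PROFIE_ARGS[@]}"
```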


# Environment Setup
1. Install the base dependencies
<pre>
pip install -r requirements.txt
</pre>
2. Install the HCU-related whl packages

HCU package download directory: [https://cancon.hpccube.com:65024/4/main](https://cancon.hpccube.com:65024/4/main)

pytorch whl package: pytorch ---> dtk-24.04.1
Download the pytorch whl package that matches your Python version.

<pre>
pip install torch* (the downloaded torch whl package)
</pre>
torchvision whl package: vision ---> dtk-24.04.1
Download the torchvision whl package that matches your Python version.

<pre>
pip install torchvision* (the downloaded torchvision whl package)
</pre>
apex whl package: apex ---> dtk-24.04.1
Download the apex whl package that matches your Python version.

<pre>
pip install apex* (the downloaded apex whl package)
</pre>

3. Install the lightop operator package
http://10.6.10.68/dcutoolkit/deeplearing/lightop

If pip install downloads are slow, add a mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/
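
Putting the steps above together, a minimal install sketch; the wheel filenames are placeholders for the files you actually downloaded for your Python version and dtk-24.04.1:

```shell
# base dependencies (optionally through the Tsinghua mirror)
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/

# HCU wheels downloaded from the directory above (placeholder filenames)
pip install torch-*.whl
pip install torchvision-*.whl
pip install apex-*.whl
```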

# Pretraining
## GPT
### Download the vocabulary files

<pre>
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
</pre>

### Download the training data
Use the 1GB, 79K jsonl dataset:
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
After decompression there is a single `oscar-1GB.jsonl` file.

### Data preprocessing

```shell
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix ./dataset/oscar-1GB-gpt \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8

# Parameter descriptions
# --input            path to the input dataset, i.e. the file produced by decompressing oscar-1GB.jsonl.xz
# --output-prefix    output path prefix (the output directory must already exist); the suffix _text_document is appended automatically
# --vocab-file       path to the downloaded gpt2-vocab.json vocabulary file
# --tokenizer-type   tokenizer type
# --merge-file       path to the downloaded gpt2-merges.txt file
# --append-eod       append an end-of-document token
# --workers          number of worker processes
```
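
Note that `--output-prefix` requires the output directory to exist beforehand. A minimal sketch of preparing the directory and checking the result; the exact output filenames assume the default preprocessing settings:

```shell
mkdir -p ./dataset
# run the preprocessing command above, then verify the output:
ls ./dataset
# expected: oscar-1GB-gpt_text_document.bin  oscar-1GB-gpt_text_document.idx
```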


### GPT pretraining
Script directory: `examples/gpt3/`

Modify the dataset and vocabulary file paths:
```shell
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH="./dataset/oscar-1GB-gpt_text_document"
```
- Single-node multi-GPU training
  ```shell
  # modify the distributed launch parameters in the script
  # on a single node, localhost can be used as the communication address
  # -np 8 runs 8 processes (8 GPUs) in parallel
  # --allow-run-as-root launches as root
  mpirun --allow-run-as-root -np 8 GPT_pretraining.sh localhost >& GPT_pretraining.log
  ```
  Note: the `localhost` argument is passed to the script's `--dist-url`.

View the training log in `GPT_pretraining.log`.

- Multi-node multi-GPU training

  Multi-node docker setup (a sketch of these steps follows the list below):
  1. Inside each container, run /usr/sbin/sshd -p 12345 to start an ssh daemon on that port.
  2. Containers can then ssh into each other through that port: ssh ip -p 12345.
  3. For passwordless login, mount the .ssh directory when running the container: docker -v /root/.ssh.
  4. Run mpirun across the containers: `mpirun -np .. --hostfile hosts -mca plm_rsh_args "-p 12345" ./xx.sh master_ip`
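
  A minimal sketch of this docker/ssh setup, run on every node; the image name, `--network=host`, and the volume flags are assumptions about a typical setup rather than the exact commands for this repo:

  ```shell
  # start the training container (placeholder image name and flags)
  docker run -d --name megatron --network=host -v /root/.ssh:/root/.ssh <your_image> sleep infinity

  # inside each container: start an ssh daemon on the agreed port
  /usr/sbin/sshd -p 12345

  # verify that the containers can reach each other over that port
  ssh 192.168.1.2 -p 12345 hostname
  ```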

  **Example**: two nodes, 192.168.1.1 and 192.168.1.2, each with 8 GPUs; 192.168.1.1 serves as the master node.

  hosts file:
  ```txt
  192.168.1.1 slots=8 
  192.168.1.2 slots=8
  ```

  Run the following command on the master node:

  ```shell
  mpirun --allow-run-as-root -np 16 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_args "-p 12345" --bind-to none ./GPT_pretraining.sh 192.168.1.1 >& GPT_pretraining.log
  ```
View the training log in `GPT_pretraining.log`.

## Llama
### Download the tokenizer files

Link: https://www.modelscope.cn/models/shakechen/Llama-2-7b-hf/files
Download the tokenizer* files from that page.
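
The tokenizer.model file is the one consumed by the preprocessing step below. A sketch of the assumed local layout; the directory path is a placeholder and the exact file set on the page may differ:

```shell
ls /path/to/llama2_7b_hf/
# tokenizer.model  tokenizer_config.json  special_tokens_map.json
```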

### Download the training data
Use the 1GB, 79K jsonl dataset:
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
After decompression there is a single `oscar-1GB.jsonl` file.

### Data preprocessing

```shell
python tools/preprocess_data.py \
  --input oscar-1GB.jsonl \
  --output-prefix /datasets/oscar-1GB-llama \
  --tokenizer-type Llama2Tokenizer \
  --tokenizer-model /path/to/llama2_7b_hf/tokenizer.model \
  --workers 16 \
  --append-eod
```

### Llama pretraining
Script directory: `examples/llama`

Modify the dataset and tokenizer paths:
```shell
DATA_PATH="/datasets/oscar-1GB-llama_text_document"
--tokenizer-model /path/to/llama2_7b_hf/tokenizer.model
```
- Single-node multi-GPU training
  ```shell
  # see the GPT section above for detailed parameter descriptions
  mpirun --allow-run-as-root -np 8 Llama_pretraining.sh localhost >& Llama_pretraining.log
  ```
View the training log in `Llama_pretraining.log`.

- Multi-node multi-GPU training
  
  **Example**: two nodes, 192.168.1.1 and 192.168.1.2, each with 8 GPUs; 192.168.1.1 serves as the master node.

  Configure the hosts file as shown in the GPT section above.

  Run the following command on the master node:

  ```shell
  mpirun --allow-run-as-root -np 16 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_args "-p 12345" --bind-to none ./Llama_pretraining.sh 192.168.1.1 >& Llama_pretraining.log
  ```

View the training log in `Llama_pretraining.log`.

# Fine-tuning
Convert the HF-format checkpoint to the Megatron (pt) format:
```shell
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver megatron \
    --target-tensor-parallel-size 1  \
    --target-pipeline-parallel-size 2 \
    --checkpoint-type hf \
    --model-size llama2-7Bf \
    --load-dir /data/model_weights/Llama-2-7b-hf/ \
    --save-dir ./tmp_modelconvert \
    --tokenizer-model /data/model_weights/Llama-2-7b-hf/
```
Then add the fine-tuning arguments to the training script:
```shell
FINETUNE_ARGS=(
    # --finetune
    # --pretrained-checkpoint $CHECKPOINT_PATH
    --load $CHECKPOINT_PATH
    --no-load-optim
    --no-load-rng
)
```
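
A minimal sketch of wiring these pieces into the launch command; `TRAINING_ARGS` and the `pretrain_gpt.py` entry point are placeholders for whatever your script already uses, and the training script's parallel sizes must match the conversion targets above (TP=1, PP=2):

```shell
CHECKPOINT_PATH=./tmp_modelconvert   # the --save-dir used during conversion

# sketch only: append the fine-tuning arguments to the existing training command
python pretrain_gpt.py \
    "${TRAINING_ARGS[@]}" \
    "${FINETUNE_ARGS[@]}"
```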

# References

- [README_ORIGIN](README_ORIGIN.md)