# DCU Megatron
# This repository has been migrated to the internal network and this link is no longer maintained. For the latest code, please visit http://112.11.119.99:10068/dcutoolkit/deeplearing/dcu_megatron/-/tree/core_v0.13.0


# Contents
- [Contents](#contents)
- [Environment Setup](#environment-setup)
- [Pretraining](#pretraining)
  - [GPT](#gpt)
    - [Download vocabulary files](#download-vocabulary-files)
    - [Download training data](#download-training-data)
    - [Data preprocessing](#data-preprocessing)
    - [GPT pretraining](#gpt-pretraining)
  - [Llama](#llama)
    - [Download tokenizer files](#download-tokenizer-files)
    - [Download training data](#download-training-data-1)
    - [Data preprocessing](#data-preprocessing-1)
    - [Llama pretraining](#llama-pretraining)
- [References](#references)

# Environment Setup
1. Install the base dependencies
<pre>
pip install -r requirements.txt
</pre>
2. Install the HCU-related whl packages

Download directory for the HCU-related packages: [https://cancon.hpccube.com:65024/4/main](https://cancon.hpccube.com:65024/4/main)

pytorch whl package: pytorch ---> dtk-24.04.1
Download the pytorch whl package that matches your Python version.

<pre>
pip install torch* (the downloaded torch whl package)
</pre>
torchvision whl package: vision ---> dtk-24.04.1
Download the torchvision whl package that matches your Python version.

<pre>
pip install torchvision* (the downloaded torchvision whl package)
</pre>
apex whl package: apex ---> dtk-24.04.1
Download the apex whl package that matches your Python version.

<pre>
pip install apex* (the downloaded apex whl package)
</pre>
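
After installing the three whl packages, an optional sanity check (a minimal sketch, not part of the official setup) is to import them and print the versions; on a machine with DCU devices and drivers installed, the DTK build is expected to report torch.cuda.is_available() as True:
<pre>
python -c "import torch, torchvision, apex; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"
</pre>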

If pip install downloads are slow, add a mirror source: -i https://pypi.tuna.tsinghua.edu.cn/simple/
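
For example, to install the base dependencies through the Tsinghua mirror:
<pre>
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
</pre>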

# Pretraining
## GPT
### Download vocabulary files

<pre>
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
</pre>

### Download training data
Use the 1GB (79K records) jsonl dataset:
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
Decompression yields a single `oscar-1GB.jsonl` file.
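
Optionally, inspect the beginning of the file to confirm the format. Each line should be a JSON object with a "text" field, which is what tools/preprocess_data.py reads by default (a quick check, assuming the default --json-keys):
<pre>
head -c 300 oscar-1GB.jsonl
</pre>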

### Data preprocessing

```shell
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix ./dataset/oscar-1GB-gpt \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8

# Parameter description
# --input            path to the input dataset, i.e. the file obtained by decompressing oscar-1GB.jsonl.xz
# --output-prefix    output path prefix (the output directory must already exist); the _text_document suffix is appended automatically
# --vocab-file       path to the downloaded gpt2-vocab.json vocabulary file
# --tokenizer-type   tokenizer type
# --merge-file       path to the downloaded gpt2-merges.txt file
# --append-eod       append an end-of-document token
# --workers          number of worker processes
```
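
If preprocessing succeeds, the prefix given by `--output-prefix` produces an indexed dataset on disk. A minimal check (the `.bin`/`.idx` pair below is the usual output of Megatron's indexed dataset format; create `./dataset` before running the script, since it does not create the directory itself):

```shell
mkdir -p ./dataset   # run this before preprocess_data.py
ls ./dataset
# expected files:
# oscar-1GB-gpt_text_document.bin  oscar-1GB-gpt_text_document.idx
```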


### GPT pretraining
Script: `GPT_pretraining.sh`

Update the dataset and vocabulary file paths:
```shell
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH="./dataset/oscar-1GB-gpt_text_document"
```
- Single-node multi-card training
  ```shell
  # Modify the distributed launch parameters in the script as needed.
  # On a single node, localhost can be used as the communication address.
  # -np 8 launches 8 processes (8 cards) in parallel.
  # --allow-run-as-root allows launching as root.
  mpirun --allow-run-as-root -np 8 GPT_pretraining.sh localhost >& GPT_pretraining.log
  ```
  Note: the `localhost` argument here is passed to `--dist-url` in the script.

View the training log in `GPT_pretraining.log`.

- Multi-node multi-card training
  
  Multi-node Docker setup (a sketch of these steps follows the list):
  1. Inside each container, run /usr/sbin/sshd -p 12345 to start sshd on that port.
  2. Containers can then reach each other over SSH on that port: ssh <ip> -p 12345.
  3. For passwordless login, mount the .ssh directory when starting the container (docker run ... -v /root/.ssh).
  4. Run mpirun across the containers: `mpirun -np .. --hostfile hosts -mca plm_rsh_args "-p 12345" ./xx.sh master_ip`
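
  A minimal sketch of steps 1-3 above (the image and container names are placeholders, not taken from this repository; add whatever device flags your Docker setup requires):
  ```shell
  # start the container on each node with the host network and the host .ssh directory mounted for passwordless login
  docker run -it --network=host --name megatron_node -v /root/.ssh:/root/.ssh <image_name> /bin/bash
  # inside the container, start sshd on the agreed port
  /usr/sbin/sshd -p 12345
  # from the container on the other node, verify ssh connectivity over that port
  ssh <peer_ip> -p 12345
  ```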

  Suppose there are two nodes, 192.168.1.1 and 192.168.1.2, each with 8 cards, and 192.168.1.1 is the master node.

  hosts file:
  ```txt
  192.168.1.1 slots=8 
  192.168.1.2 slots=8
  ```

  Run the following command on the master node:

  ```shell
  mpirun --allow-run-as-root -np 8 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_args "-p 12345" --bind-to none ./GPT_pretraining.sh 192.168.1.1 >& GPT_pretraining.log
  ```
View the training log in `GPT_pretraining.log`.

## Llama
### Download tokenizer files

Link: https://www.modelscope.cn/models/shakechen/Llama-2-7b-hf/files
Download the tokenizer* files from that repository.
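
If you prefer the command line over the web page, the repository can usually be cloned with git as well (a sketch; the git URL pattern is assumed from ModelScope's usual layout, and a full clone also downloads the model weights):
<pre>
git lfs install
git clone https://www.modelscope.cn/shakechen/Llama-2-7b-hf.git
ls Llama-2-7b-hf/tokenizer*
</pre>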

### Download training data
Use the 1GB (79K records) jsonl dataset:
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
Decompression yields a single `oscar-1GB.jsonl` file.

### Data preprocessing

```shell
python tools/preprocess_data.py \
  --input oscar-1GB.jsonl \
  --output-prefix /datasets/oscar-1GB-llama \
  --tokenizer-type Llama2Tokenizer \
  --tokenizer-model /path/to/llama2_7b_hf/tokenizer.model \
  --workers 16 \
  --append-eod
```

### Llama pretraining
Script: `Llama_pretraining.sh`

Update the dataset and tokenizer paths:
```shell
DATA_PATH="/datasets/oscar-1GB-llama_text_document"
--tokenizer-model /path/to/llama2_7b_hf/tokenizer.model
```
- Single-node multi-card training
  ```shell
  # See the GPT section above for detailed parameter descriptions.
  mpirun --allow-run-as-root -np 8 Llama_pretraining.sh localhost >& Llama_pretraining.log
  ```
View the training log in `Llama_pretraining.log`.

- Multi-node multi-card training
  
  Suppose there are two nodes, 192.168.1.1 and 192.168.1.2, each with 8 cards, and 192.168.1.1 is the master node.

  Configure the hosts file as shown in the GPT section above.

  Run the following command on the master node:

  ```shell
  mpirun --allow-run-as-root -np 8 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_args "-p 12345" --bind-to none ./Llama_pretraining.sh 192.168.1.1 >& Llama_pretraining.log
  ```

View the training log in `Llama_pretraining.log`.

# References

- [README_ORIGIN](README_ORIGIN.md)