# Contents
- [Contents](#contents)
- [Environment Setup](#environment-setup)
- [Pretraining](#pretraining)
  - [gpt](#gpt)
    - [Download the vocabulary files](#download-the-vocabulary-files)
    - [Download the training data](#download-the-training-data)
    - [Data preprocessing](#data-preprocessing)
    - [GPT pretraining](#gpt-pretraining)
  - [llama](#llama)
    - [Download the tokenizer files](#download-the-tokenizer-files)
    - [Download the training data](#download-the-training-data-1)
    - [Data preprocessing](#data-preprocessing-1)
    - [llama pretraining](#llama-pretraining)
- [References](#references)

# Environment Setup
1. Install the base dependencies
<pre>
pip install -r requirements.txt
</pre>
2. Install the DCU-related whl packages

DCU package download page: [https://cancon.hpccube.com:65024/4/main](https://cancon.hpccube.com:65024/4/main)

pytorch whl package: pytorch ---> dtk-24.04.1
Download the pytorch whl package that matches your Python version.

<pre>
pip install torch*   # the downloaded torch whl package
</pre>
torchvision whl package: vision ---> dtk-24.04.1
Download the torchvision whl package that matches your Python version.

<pre>
pip install torchvision*   # the downloaded torchvision whl package
</pre>
apex whl package: apex ---> dtk-24.04.1
Download the apex whl package that matches your Python version.

<pre>
pip install apex*   # the downloaded apex whl package
</pre>

If `pip install` downloads are slow, you can add a mirror: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`
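
To sanity-check the environment afterwards, here is a minimal verification sketch (an addition of this guide, not part of the original scripts; it assumes the DTK builds of PyTorch expose DCU devices through the usual `torch.cuda` API):
```shell
# Hedged sanity check: confirm the wheels import and devices are visible.
python -c "import torch, torchvision, apex; print(torch.__version__)"
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```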

# Pretraining
## gpt
### Download the vocabulary files

<pre>
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
</pre>
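
As a quick integrity check, a hedged sketch (the GPT-2 BPE vocabulary is expected to parse as JSON with 50257 entries):
```shell
# Hedged check: the vocabulary should load as valid JSON; GPT-2's has 50257 entries.
python -c "import json; print(len(json.load(open('gpt2-vocab.json'))))"
```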

### Download the training data
This uses the 1 GB jsonl dataset (79K records).
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
Decompression yields a single `oscar-1GB.jsonl` file.
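
To peek at a record, a minimal sketch (each line is one JSON object; the preprocessing step reads its `text` field):
```shell
# Hedged peek at the first record of the dataset.
head -n 1 oscar-1GB.jsonl | python -m json.tool | head
```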

### Data preprocessing

```shell
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix ./dataset/oscar-1GB-gpt \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8

# Arguments
# --input            path to the input dataset, i.e. the decompressed oscar-1GB.jsonl
# --output-prefix    output path prefix; the suffix _text_document is appended automatically
# --vocab-file       path to the downloaded gpt2-vocab.json vocabulary file
# --tokenizer-type   tokenizer type
# --merge-file       path to the downloaded gpt2-merges.txt file
# --append-eod       append an end-of-document token
# --workers          number of worker processes
```
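
Preprocessing emits an indexed binary dataset; the expected outputs (a hedged sketch, with names following from `--output-prefix` plus the automatic `_text_document` suffix) look like:
```shell
ls ./dataset/
# oscar-1GB-gpt_text_document.bin
# oscar-1GB-gpt_text_document.idx
```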


### GPT pretraining
Script: `GPT_pretraining.sh`

Set the dataset and vocabulary file paths:
```shell
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH="./dataset/oscar-1GB-gpt_text_document"
```
- Single-node multi-GPU training
  ```shell
  # Adjust the distributed launch arguments in the script
  # (see the launcher sketch at the end of this section):
  # nproc_per_node   number of GPUs per node
  # nnodes           number of nodes
  # node_rank        rank of the current node
  # master_addr      address of the master node
  # master_port      communication port
  bash GPT_pretraining.sh >& GPT_pretraining.log
  ```
View the training log in `GPT_pretraining.log`.

- Multi-node multi-GPU training

  Suppose there are two nodes, 192.168.1.1 and 192.168.1.2.

  ```shell
  # On node 192.168.1.1, run:
  bash GPT_pretraining.sh --NNODES 2 --NODE_RANK 0 --MASTER_ADDR 192.168.1.1 >& GPT_pretraining_rank0.log
  # On node 192.168.1.2, run:
  bash GPT_pretraining.sh --NNODES 2 --NODE_RANK 1 --MASTER_ADDR 192.168.1.1 >& GPT_pretraining_rank1.log
  ```
View the training logs in `GPT_pretraining_rank0.log` and `GPT_pretraining_rank1.log`.
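
For orientation, a plausible shape of the launch line inside `GPT_pretraining.sh` (a hedged sketch only; the entry point `pretrain_gpt.py` and all values here are assumptions, and the real arguments live in the script itself):
```shell
# Hedged sketch of a torchrun-style launcher; all values are placeholders.
torchrun --nproc_per_node 8 \
         --nnodes ${NNODES:-1} \
         --node_rank ${NODE_RANK:-0} \
         --master_addr ${MASTER_ADDR:-localhost} \
         --master_port ${MASTER_PORT:-6000} \
         pretrain_gpt.py \
         --vocab-file $VOCAB_FILE \
         --merge-file $MERGE_FILE \
         --data-path $DATA_PATH
         # ...plus the model and training arguments defined in the script
```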

## llama
### Download the tokenizer files

Link: https://www.modelscope.cn/models/shakechen/Llama-2-7b-hf/files
Download the tokenizer* files from that page.
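
For a scripted download, one option is the modelscope CLI (a hedged sketch; it assumes a recent `modelscope` package whose `download` subcommand accepts a model id and file names):
```shell
# Hedged sketch: fetch only the tokenizer files via the modelscope CLI.
pip install modelscope
modelscope download --model shakechen/Llama-2-7b-hf tokenizer.model tokenizer.json tokenizer_config.json
```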

### Download the training data
This uses the 1 GB jsonl dataset (79K records).
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
Decompression yields a single `oscar-1GB.jsonl` file.

### Data preprocessing

```shell
python tools/preprocess_data.py \
  --input oscar-1GB.jsonl \
  --output-prefix /datasets/oscar-1GB-llama \
  --tokenizer-type Llama2Tokenizer \
  --tokenizer-model /path/to/llama2_7b_hf/tokenizer.model \
  --workers 16 \
  --append-eod
```
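
As with the GPT run, the outputs are an indexed binary pair named after `--output-prefix` (a hedged sketch of the names):
```shell
ls /datasets/
# oscar-1GB-llama_text_document.bin
# oscar-1GB-llama_text_document.idx
```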

### llama pretraining
Script: `llama_pretraining.sh`

Set the dataset and tokenizer paths:
```shell
DATA_PATH="/datasets/oscar-1GB-llama_text_document"
--tokenizer-model /path/to/llama2_7b_hf/tokenizer.model
```
- Single-node multi-GPU training
  ```shell
  bash llama_pretraining.sh >& llama_pretraining.log
  ```
View the training log in `llama_pretraining.log`.

- Multi-node multi-GPU training

  Suppose there are two nodes, 192.168.1.1 and 192.168.1.2.

  ```shell
  # On node 192.168.1.1, run:
  bash llama_pretraining.sh --NNODES 2 --NODE_RANK 0 --MASTER_ADDR 192.168.1.1 >& llama_pretraining_rank0.log
  # On node 192.168.1.2, run:
  bash llama_pretraining.sh --NNODES 2 --NODE_RANK 1 --MASTER_ADDR 192.168.1.1 >& llama_pretraining_rank1.log
  ```
View the training logs in `llama_pretraining_rank0.log` and `llama_pretraining_rank1.log`.

# References

- [README_ORIGIN](README_ORIGIN.md)