# Contents
- [Contents](#contents)
- [Environment Setup](#environment-setup)
- [Pretraining](#pretraining)
  - [GPT](#gpt)
    - [Download the vocabulary files](#download-the-vocabulary-files)
    - [Download the training data](#download-the-training-data)
    - [Data preprocessing](#data-preprocessing)
    - [GPT pretraining](#gpt-pretraining)
  - [LLaMA](#llama)
    - [Download the tokenizer files](#download-the-tokenizer-files)
    - [Download the training data](#download-the-training-data-1)
    - [Data preprocessing](#data-preprocessing-1)
    - [LLaMA pretraining](#llama-pretraining)
- [References](#references)

# Environment Setup
1. Install the base dependencies
<pre>
pip install -r requirements.txt
</pre>
2. Install the DCU-related whl packages

Download directory for the DCU packages: [https://cancon.hpccube.com:65024/4/main](https://cancon.hpccube.com:65024/4/main)

pytorch whl package: pytorch ---> dtk-24.04.1
Download the pytorch whl that matches your Python version.

<pre>
pip install torch* (the downloaded pytorch whl package)
</pre>
torchvision whl package: vision ---> dtk-24.04.1
Download the torchvision whl that matches your Python version.

<pre>
pip install torchvision* (the downloaded torchvision whl package)
</pre>
apex whl package: apex ---> dtk-24.04.1
Download the apex whl that matches your Python version.

<pre>
pip install apex* (the downloaded apex whl package)
</pre>

If installation via pip install is slow, you can add a mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/
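
As an optional sanity check, the following one-liner confirms that the DCU builds imported correctly (a minimal sketch; the DTK build of PyTorch reports DCU devices through the usual `torch.cuda` API):

```shell
python -c "import torch, torchvision, apex; \
print('torch', torch.__version__, '| torchvision', torchvision.__version__); \
print('visible devices:', torch.cuda.device_count())"
```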

# Pretraining
## GPT
### Download the vocabulary files

<pre>
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
</pre>

### Download the training data
Use the 1GB jsonl dataset (about 79K samples).
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
Decompression yields a single `oscar-1GB.jsonl` file.
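
Each line of the file is a single JSON document; by default `tools/preprocess_data.py` tokenizes its `text` field. A quick way to inspect the first record:

```shell
head -n 1 oscar-1GB.jsonl | cut -c 1-200
```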

### Data preprocessing

```shell
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix ./dataset/oscar-1GB-gpt \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8

# Argument notes
# --input           path to the input dataset, i.e. the decompressed oscar-1GB.jsonl
# --output-prefix   output path prefix; the suffix _text_document is appended automatically
# --vocab-file      path to the downloaded gpt2-vocab.json vocabulary file
# --tokenizer-type  tokenizer type
# --merge-file      path to the downloaded gpt2-merges.txt file
# --append-eod      append an end-of-document token to each document
# --workers         number of worker processes
```
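
If preprocessing succeeds, the output prefix gains a Megatron-style binary/index pair, which is what the training script's `DATA_PATH` later points at (expected layout, assuming the command above):

```shell
ls ./dataset/
# oscar-1GB-gpt_text_document.bin   (tokenized documents)
# oscar-1GB-gpt_text_document.idx   (index into the .bin file)
```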


### GPT pretraining
Script: `GPT_pretraining.sh`

Edit the dataset and vocabulary file paths in the script:
```shell
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH="./dataset/oscar-1GB-gpt_text_document"
```
- Single-node multi-GPU training
  ```shell
  # Adjust the distributed launch arguments in the script:
  # nproc_per_node  number of GPUs on a single node
  # nnodes          number of nodes
  # node_rank       rank of the current node
  # master_addr     address of the master node
  # master_port     communication port
  bash GPT_pretraining.sh >& GPT_pretraining.log
  ```
The training log is written to `GPT_pretraining.log`.

- Multi-node multi-GPU training

  Suppose there are two nodes, 192.168.1.1 and 192.168.1.2.

  ```shell
  # Run the following on node 192.168.1.1:
  bash GPT_pretraining.sh --NNODES 2 --NODE_RANK 0 --MASTER_ADDR 192.168.1.1 >& GPT_pretraining_rank0.log
  # Run the following on node 192.168.1.2:
  bash GPT_pretraining.sh --NNODES 2 --NODE_RANK 1 --MASTER_ADDR 192.168.1.1 >& GPT_pretraining_rank1.log
  ```
The training logs are written to `GPT_pretraining_rank0.log` and `GPT_pretraining_rank1.log`.
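
For reference, the script arguments above map onto a torchrun-style launch roughly as follows. This is a sketch only: the actual launch line and flag list live in `GPT_pretraining.sh`, and `pretrain_gpt.py` is the standard Megatron-LM entry point assumed here.

```shell
# Hypothetical expansion of the distributed settings (illustrative values)
torchrun --nproc_per_node 8 \
         --nnodes 2 \
         --node_rank 0 \
         --master_addr 192.168.1.1 \
         --master_port 6000 \
         pretrain_gpt.py \
         --vocab-file $VOCAB_FILE \
         --merge-file $MERGE_FILE \
         --data-path $DATA_PATH
```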

## LLaMA
### Download the tokenizer files

Link: https://www.modelscope.cn/models/shakechen/Llama-2-7b-hf/files
Download the tokenizer* files from that page.
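
Only `tokenizer.model` is referenced by the preprocessing step below; the other tokenizer files can be kept alongside it. One illustrative way to fetch it (assuming the ModelScope CLI, `pip install modelscope`; flags may need adapting to your setup):

```shell
modelscope download --model shakechen/Llama-2-7b-hf tokenizer.model --local_dir ./llama2_7b_hf
```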

### Download the training data
Use the same 1GB jsonl dataset (about 79K samples) as in the GPT example.
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
Decompression yields a single `oscar-1GB.jsonl` file.

### Data preprocessing

```shell
python tools/preprocess_data.py \
  --input oscar-1GB.jsonl \
  --output-prefix /datasets/oscar-1GB-llama \
  --tokenizer-type Llama2Tokenizer \
  --tokenizer-model /path/to/llama2_7b_hf/tokenizer.model \
  --workers 16 \
  --append-eod
```
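
The arguments mirror the GPT preprocessing above, with the BPE vocab/merge files replaced by the sentencepiece `tokenizer.model`. On success the prefix gains the `.bin`/`.idx` pair that `DATA_PATH` references below:

```shell
ls /datasets/
# oscar-1GB-llama_text_document.bin
# oscar-1GB-llama_text_document.idx
```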

### LLaMA pretraining
Script: `llama_pretraining.sh`

Edit the dataset and tokenizer paths in the script:
```shell
DATA_PATH="/datasets/oscar-1GB-llama_text_document"
--tokenizer-model /path/to/llama2_7b_hf/tokenizer.model
```
- Single-node multi-GPU training
  ```shell
  bash llama_pretraining.sh >& llama_pretraining.log
  ```
The training log is written to `llama_pretraining.log`.

- Multi-node multi-GPU training

  Suppose there are two nodes, 192.168.1.1 and 192.168.1.2.

  ```shell
  # Run the following on node 192.168.1.1:
  bash llama_pretraining.sh --NNODES 2 --NODE_RANK 0 --MASTER_ADDR 192.168.1.1 >& llama_pretraining_rank0.log
  # Run the following on node 192.168.1.2:
  bash llama_pretraining.sh --NNODES 2 --NODE_RANK 1 --MASTER_ADDR 192.168.1.1 >& llama_pretraining_rank1.log
  ```
The training logs are written to `llama_pretraining_rank0.log` and `llama_pretraining_rank1.log`.
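
To confirm that training is progressing, the per-iteration summary lines can be pulled from the logs (illustrative; Megatron-style training prints one such line per logging interval):

```shell
grep -a "iteration" llama_pretraining_rank*.log | tail -n 5
```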

# References

- [README_ORIGIN](README_ORIGIN.md)