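# Contents
- [Contents](#contents)
- [Environment Setup](#environment-setup)
- [Download Vocabulary Files](#download-vocabulary-files)
- [Download Training Data](#download-training-data)
- [Training](#training)
  - [Data Preprocessing](#data-preprocessing)
  - [GPT Pretraining](#gpt-pretraining)
    - [Distributed Training](#distributed-training)
- [References](#references)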

# Environment Setup
1. Install the base dependencies
<pre>
pip install -r requirements.txt
</pre>

2. Install the DCU-specific wheel packages
DCU package download directory: [https://cancon.hpccube.com:65024/4/main](https://cancon.hpccube.com:65024/4/main)

PyTorch wheel: pytorch ---> dtk-24.04
Download the PyTorch wheel that matches your Python version.

<pre>
pip install torch-*.whl  # the downloaded torch wheel
</pre>
torchvision wheel: vision ---> dtk-24.04
Download the torchvision wheel that matches your Python version.

<pre>
pip install torchvision-*.whl  # the downloaded torchvision wheel
</pre>
apex wheel: apex ---> dtk-24.04
Download the apex wheel that matches your Python version.

<pre>
pip install apex-*.whl  # the downloaded apex wheel
</pre>
If `pip install` downloads are slow, add a mirror index: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`
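
Once the wheels are installed, a quick check can confirm that they are visible in the active Python environment. The following is a minimal sketch, not part of the original instructions: the helper name `check_packages` is hypothetical, and it only tests that each package can be located, not that it was built for DCU.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Packages installed in the steps above.
    for name, found in check_packages(["torch", "torchvision", "apex"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Because `find_spec` does not import the package, this stays fast and does not touch the accelerator runtime.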

# Download Vocabulary Files

<pre>
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
</pre>
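
A truncated download is easy to miss, so it can help to sanity-check both files before preprocessing. The sketch below assumes the usual layout of these files (a JSON object mapping tokens to ids, and a merges file whose first line is a `#version` header); the function names are hypothetical.

```python
import json
import os


def count_merge_rules(lines):
    """Count merge rules, skipping blank lines and the '#version' header."""
    return sum(1 for ln in lines if ln.strip() and not ln.startswith("#version"))


def summarize_vocab(vocab_path, merges_path):
    """Return (number of vocab entries, number of merge rules)."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)  # maps token string -> integer id
    with open(merges_path, encoding="utf-8") as f:
        n_merges = count_merge_rules(f)
    return len(vocab), n_merges


if __name__ == "__main__":
    # File names as saved by the wget commands above.
    if os.path.exists("gpt2-vocab.json") and os.path.exists("gpt2-merges.txt"):
        n_vocab, n_merges = summarize_vocab("gpt2-vocab.json", "gpt2-merges.txt")
        print(f"vocab entries: {n_vocab}, merge rules: {n_merges}")
```

For the standard GPT-2 files, the counts should come out as 50257 vocabulary entries and 50000 merge rules; a JSON parse error or much smaller numbers suggest an incomplete download.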


# Download Training Data
Use the 1 GB, 79K-record JSONL dataset:
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
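
To see what the decompressed file looks like, the first few records can be read without loading the whole 1 GB into memory. This is a sketch under the assumption that each line is a JSON object with a `text` field (the field name is an assumption, not stated in this README); `peek_records` is a hypothetical helper.

```python
import json
import os
from itertools import islice


def peek_records(lines, n=3, key="text"):
    """Return the `key` field of the first n JSON Lines records."""
    return [json.loads(line)[key] for line in islice(lines, n)]


if __name__ == "__main__":
    path = "oscar-1GB.jsonl"  # produced by `xz -d` above
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for text in peek_records(f):
                print(text[:80])  # show only the start of each document
```

A `KeyError` here would mean the records use a different field name, which the preprocessing step would also need to know about.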

# Training

## Data Preprocessing

<pre>
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix ./dataset/my-gpt2 \
    --vocab gpt2-vocab.json \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8
</pre>

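Parameter description:

- `--input`: path to the input dataset, i.e. the file obtained by decompressing oscar-1GB.jsonl.xz
- `--output-prefix`: output path prefix; the `_text_document` suffix is appended automatically after processing
- `--vocab`: path to the downloaded gpt2-vocab.json vocabulary file
- `--dataset-impl`: dataset implementation type
- `--tokenizer-type`: tokenizer type
- `--merge-file`: path to the downloaded gpt2-merges.txt file
- `--append-eod`: append an end-of-document token
- `--workers`: number of worker processes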
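
After preprocessing finishes, it is worth confirming that the output files exist before launching training. The sketch below builds the expected file names from the output prefix and the `_text_document` suffix described above; the `.bin` and `.idx` extensions are an assumption based on the usual output of the `mmap` dataset implementation, and both function names are hypothetical.

```python
import os


def expected_outputs(output_prefix, key="text"):
    """Build the file names preprocessing is expected to write for one JSON key."""
    return [f"{output_prefix}_{key}_document{ext}" for ext in (".bin", ".idx")]


def missing_outputs(output_prefix, key="text"):
    """Return the expected output files that are not present on disk."""
    return [p for p in expected_outputs(output_prefix, key) if not os.path.exists(p)]


if __name__ == "__main__":
    # Same prefix as --output-prefix in the command above.
    missing = missing_outputs("./dataset/my-gpt2")
    print("all outputs present" if not missing else f"missing: {missing}")
```

Note that the training script takes the shared prefix (`./dataset/my-gpt2_text_document`) rather than either file name.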

## GPT Pretraining

### Distributed Training
- Set `DATA_PATH` (the example script uses the Mixtral8x7B dataset; change it to your own data before running)

  ```bash
  VOCAB_FILE=gpt2-vocab.json
  MERGE_FILE=gpt2-merges.txt
  DATA_PATH=./dataset/my-gpt2_text_document
  ```

- Run multi-card training

  ```
  # -np is the number of processes to launch; set -np and the hostfile to match your setup.
  # This example assumes a single node with four cards.
  mpirun -np 4 --hostfile hostfile train_mixtral_8x7B_1nodes.sh
  ```
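
The hostfile format depends on the MPI implementation in use, which this README does not specify. As a rough sketch, assuming Open MPI's `hostname slots=N` syntax, the hostfile for the single-node, four-card case could be generated as follows; the host name `node1` is a placeholder and `render_hostfile` is a hypothetical helper.

```python
def render_hostfile(hosts):
    """Render hostfile text from a mapping of hostname -> slots (processes per host).

    Assumes Open MPI's 'hostname slots=N' line format.
    """
    return "".join(f"{host} slots={slots}\n" for host, slots in hosts.items())


if __name__ == "__main__":
    # One node running four processes, matching `-np 4` above.
    print(render_hostfile({"node1": 4}), end="")
```

The total number of slots across all hosts should be at least the value passed to `-np`.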

# References

- [README_ORIGIN](README_ORIGIN.md)