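# Contents
- [Contents](#contents)
- [Environment setup](#environment-setup)
- [Download vocabulary files](#download-vocabulary-files)
- [Download training data](#download-training-data)
- [Training](#training)
  - [Data preprocessing](#data-preprocessing)
  - [GPT pretraining](#gpt-pretraining)
    - [Distributed training](#distributed-training)
- [References](#references)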

# Environment setup
1. Install the base dependencies
<pre>
pip install -r requirements.txt
</pre>
2. Install the DCU-specific wheel packages

DCU package download directory: [https://cancon.hpccube.com:65024/4/main](https://cancon.hpccube.com:65024/4/main)

PyTorch wheel: pytorch ---> dtk-24.04.1
Download the PyTorch wheel that matches your Python version.

<pre>
pip install torch*  # the torch wheel you downloaded
</pre>
torchvision wheel: vision ---> dtk-24.04.1
Download the torchvision wheel that matches your Python version.

<pre>
pip install torchvision*  # the torchvision wheel you downloaded
</pre>
Apex wheel: apex ---> dtk-24.04.1
Download the Apex wheel that matches your Python version.

<pre>
pip install apex*  # the apex wheel you downloaded
</pre>
If `pip install` downloads are too slow, add a mirror index: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`

3. Install unsloth

<pre>
git clone https://github.com/unslothai/unsloth.git
cd ./unsloth
pip3 install -e .
</pre>
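
After the three installation steps, it can help to confirm which packages actually landed in the active Python environment. The following is a minimal sketch, not part of this repository: it assumes the distributions are registered under the names `torch`, `torchvision`, `apex`, and `unsloth`, which may differ for the DCU builds.

```python
from importlib import metadata

# Distribution names are an assumption based on the install steps above;
# adjust them if the DCU wheels register under different names.
PACKAGES = ("torch", "torchvision", "apex", "unsloth")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry usually means the wheel was installed into a different interpreter than the one running the check.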

# Download vocabulary files

<pre>
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
</pre>
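
A failed download can leave an error page on disk in place of the expected file, so a quick sanity check before preprocessing may save time. This sketch assumes the usual GPT-2 layout: the vocabulary file is a single JSON object mapping tokens to ids, and the merges file is plain text with an optional `#version` header line. The helper names are hypothetical.

```python
import json


def vocab_size(vocab_path):
    """Return the number of tokens in a GPT-2 style vocab JSON (token -> id)."""
    with open(vocab_path, encoding="utf-8") as handle:
        vocab = json.load(handle)
    if not isinstance(vocab, dict):
        raise ValueError(f"{vocab_path} is not a JSON object")
    return len(vocab)


def merge_count(merges_path):
    """Count merge rules, skipping blank lines and the '#version' header."""
    with open(merges_path, encoding="utf-8") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#version")
        )


if __name__ == "__main__":
    print("vocab entries:", vocab_size("gpt2-vocab.json"))
    print("merge rules:", merge_count("gpt2-merges.txt"))
```

The standard GPT-2 vocabulary has 50,257 entries, so a very different count suggests the wrong file was fetched.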


# Download training data
This guide uses a 1 GB JSONL dataset with 79K records.
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
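
Before preprocessing, it can be useful to confirm that the decompressed file really is one JSON object per line. The sketch below assumes the preprocessing script reads the `text` field of each record, which is the default in upstream Megatron-LM; the function name and the `limit` parameter are hypothetical.

```python
import json


def missing_text_lines(path, key="text", limit=1000):
    """Return 1-based line numbers, among the first `limit` lines, that lack `key`."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if lineno > limit:
                break
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            if not isinstance(record, dict) or key not in record:
                bad.append(lineno)
    return bad


if __name__ == "__main__":
    problems = missing_text_lines("oscar-1GB.jsonl")
    print("lines without a 'text' field:", problems if problems else "none")
```

Only the first `limit` lines are inspected, so this is a spot check rather than a full validation of the dataset.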

# Training

## Data preprocessing

<pre>
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix ./dataset/my-gpt2 \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8
</pre>

Parameter descriptions:

- `--input`: path to the input dataset, i.e. the file obtained by decompressing oscar-1GB.jsonl.xz
- `--output-prefix`: output path prefix; the `_text_document` suffix is appended automatically after processing
- `--vocab-file`: path to the downloaded gpt2-vocab.json vocabulary file
- `--tokenizer-type`: tokenizer type
- `--merge-file`: path to the downloaded gpt2-merges.txt file
- `--append-eod`: append an end-of-document token
- `--workers`: number of worker processes
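
To illustrate how `--output-prefix` relates to the files on disk, the sketch below derives the expected output paths and reports any that are missing. It assumes the script writes a `.bin` and `.idx` pair, as upstream Megatron-LM's indexed dataset format does; the helper names are hypothetical.

```python
import os


def expected_outputs(output_prefix, json_key="text"):
    """Return the (.bin, .idx) paths expected for a given --output-prefix."""
    data_path = f"{output_prefix}_{json_key}_document"
    return data_path + ".bin", data_path + ".idx"


def missing_outputs(output_prefix, json_key="text"):
    """List the expected output files that do not exist yet."""
    return [
        path for path in expected_outputs(output_prefix, json_key)
        if not os.path.isfile(path)
    ]


if __name__ == "__main__":
    missing = missing_outputs("./dataset/my-gpt2")
    print("missing:", missing if missing else "none")
```

The training data path is the shared prefix without the extension, here `./dataset/my-gpt2_text_document`.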

## GPT pretraining

### Distributed training
- Update the `DATA_PATH` path to point to the preprocessed dataset

  ```bash
  VOCAB_FILE=gpt2-vocab.json
  MERGE_FILE=gpt2-merges.txt
  DATA_PATH="./dataset/my-gpt2_text_document"
  ```

- Launch multi-card training

  ```bash
  # -np is the number of processes to launch; set both -np and the hostfile for your own setup.
  # This example runs on a single node with four cards.
  mpirun -np 4 --hostfile hostfile single.sh localhost
  ```
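
The hostfile contents are not shown in this README. As a rough sketch, assuming `mpirun` here is Open MPI (whose hostfile format is one `host slots=N` line per node) and that one process is launched per card, a hostfile and the matching `-np` value could be produced like this. The function names are hypothetical.

```python
def render_hostfile(slots_per_host):
    """Return Open MPI hostfile text with one 'host slots=N' line per node."""
    return "".join(
        f"{host} slots={slots}\n" for host, slots in slots_per_host.items()
    )


def total_slots(slots_per_host):
    """Total process count across all nodes, assuming one process per card."""
    return sum(slots_per_host.values())


if __name__ == "__main__":
    nodes = {"localhost": 4}  # single node with four cards
    with open("hostfile", "w", encoding="utf-8") as handle:
        handle.write(render_hostfile(nodes))
    print(f"wrote hostfile; use -np {total_slots(nodes)}")
```

For a multi-node run, add one entry per node to `nodes` and pass the resulting total to `-np`.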

# References

- [README_ORIGIN](README_ORIGIN.md)