## LLaMA & LLaMA2

## Papers
`LLaMA: Open and Efficient Foundation Language Models`

- [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971)

`Llama 2: Open Foundation and Fine-Tuned Chat Models`

- [https://arxiv.org/abs/2307.09288](https://arxiv.org/abs/2307.09288)

## Model Architecture
**Note**: For LLaMA2, this repository only supports the 7B and 13B models.

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and demonstrate that state-of-the-art models can be trained exclusively on publicly available datasets, without relying on proprietary or inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. The LLaMA network is based on the Transformer architecture and incorporates various improvements that were subsequently proposed and used in other models such as PaLM.


LLaMA model parameters:

| Model | Hidden size | Layers | Heads | Vocab size | Training data (tokens) | Positional encoding | Max length |
| -------- | -------- | -------- | -------- |  -------- | -------- | -------- | -------- |
| LLaMA-7B | 4,096 | 32 | 32 | 32,000 |1T | RoPE | 2048 |
| LLaMA-13B | 5,120 | 40 | 40 | 32,000 |1T | RoPE | 2048 |



<div align="center">
<img src="data/media/llama_arc.png" width="500" height="400">
</div>

LLaMA 2 is the next generation of LLaMA and comes with a commercially friendly license. It is available in three sizes: 7B, 13B, and 70B. Compared with LLaMA, LLaMA 2 is trained on 40% more data, and the context length is doubled from 2048 to 4096 tokens, allowing it to understand and generate longer text. LLaMA 2 adopts most of the pretraining setup and model architecture of LLaMA 1: a standard Transformer architecture with pre-normalization via RMSNorm, the SwiGLU activation function, and rotary position embeddings (RoPE).



## Algorithm


<div align="center">
<img src="data/media/llama_alg.png" width="300" height="400">
</div>

The main differences from the original Transformer architecture are:

**Pre-normalization.** To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.

**SwiGLU activation.** The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a hidden dimension of 2/3 · 4d rather than the 4d used in PaLM.

**Rotary embeddings.** Absolute position embeddings are removed; instead, rotary position embeddings (RoPE) are added at every layer of the network.
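
For illustration, here is a minimal PyTorch sketch of pre-normalization with RMSNorm and a SwiGLU feed-forward block as described above. It is a reference sketch only, assuming PyTorch is available; the class names, the hidden-size rounding, and the example shapes are illustrative and do not correspond to modules in this repository.

```python
# Minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer normalization: no mean subtraction, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal RMS of the last dimension, then apply a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: SiLU(x W1) * (x W3), projected back to dim by W2."""

    def __init__(self, dim: int, hidden_multiple: int = 256):
        super().__init__()
        # Hidden size of roughly 2/3 * 4d, rounded up to a multiple for hardware efficiency.
        hidden = int(2 * (4 * dim) / 3)
        hidden = hidden_multiple * ((hidden + hidden_multiple - 1) // hidden_multiple)
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


if __name__ == "__main__":
    x = torch.randn(2, 16, 4096)                   # (batch, sequence, hidden)
    y = SwiGLUFeedForward(4096)(RMSNorm(4096)(x))  # pre-normalize, then feed forward
    print(y.shape)                                 # torch.Size([2, 16, 4096])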


## Environment Setup

### Docker (Option 1)
Running via Docker is recommended. Pull the provided image:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
```

Start and enter the container, then install any dependencies not already included in the image:
```
docker run -dit --network=host --name=llama-tencentpretrain --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 -v /opt/hyhal:/opt/hyhal:ro image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker exec -it llama-tencentpretrain /bin/bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Dockerfile (Option 2)

```
docker build -t llama:latest .
docker run -dit --network=host --name=llama-tencentpretrain --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 -v /opt/hyhal:/opt/hyhal:ro llama:latest
docker exec -it llama-tencentpretrain /bin/bash
``` 
### Conda (Option 3)
1. Create a conda virtual environment:
```
conda create -n llama-tencentpretrain python=3.10
```
2. The toolkits and deep learning libraries this project requires for DCU GPUs can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community:
```
DTK stack: dtk24.04.1
python:python3.10
torch:2.1.0
torchvision:0.16.0
deepspeed: 0.12.3
```

    Tips: the DTK driver, Python, DeepSpeed, and other tool versions listed above must correspond exactly, one to one.

3. Install the remaining dependencies according to requirements.txt:
```
pip install -r requirements.txt
```

## Dataset
The Chinese public instruction dataset [alpaca_gpt4_data_zh.json](https://huggingface.co/datasets/shibing624/alpaca-zh) is bundled in the [data](./data) directory so that users can quickly verify the pipeline:
```
$ tree ./data/
  ├── alpaca_gpt4_data_zh.json
  └── dataset.pt
```

### Model Weight Download
1. Option 1: download a Hugging Face format model. Taking the 7B model as an example, first download the pretrained [llama-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), then convert it to TencentPretrain format:
```commandline
python3 scripts/convert_llama_from_huggingface_to_tencentpretrain.py --input_model_path $LLaMA_HF_PATH \
                       --output_model_path  models/llama-7b.bin --type 7B
``` 
2. Option 2: alternatively, download a [model already in TencentPretrain format](https://huggingface.co/Linly-AI/) and fine-tune it directly; no format conversion is required.


## Training
### Full-Parameter Continued Pretraining
#### Data Preprocessing
1. Build the pretraining dataset

txt pretraining corpus: merge multiple txt files into a single .txt file and randomly shuffle it by line. The corpus format is:
```commandline
doc1
doc2
doc3
``` 
jsonl pretraining corpus: to support data that contains newlines (such as code), the pretraining data can also be organized in jsonl format (a helper sketch for producing both formats is shown after the preprocessing step below). The format is:
```commandline
{"text": "doc1"}
{"text": "doc2"}
{"text": "doc3"}
``` 

2. Preprocess the data as follows
```commandline
python3 preprocess.py --corpus_path $CORPUS_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
                      --dataset_path $OUTPUT_DATASET_PATH --data_processor lm --seq_length 1024
``` 
Optional arguments: --json_format_corpus: use jsonl-format data; --full_sentences: pack samples shorter than the sequence length together with other samples (no pad token is used).
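
As referenced above, the sketch below (hypothetical helper functions, standard-library Python only) shows one way to produce the two corpus formats; file names such as `part1.txt` and `corpus.jsonl` are placeholders rather than files shipped with this repository.

```python
# Hypothetical helpers for building the two pretraining corpus formats described above.
import json
import random
from pathlib import Path


def build_txt_corpus(input_files, output_path):
    """Merge several txt files (one document per line) into one file and shuffle the lines."""
    lines = []
    for path in input_files:
        lines.extend(Path(path).read_text(encoding="utf-8").splitlines())
    random.shuffle(lines)
    Path(output_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def build_jsonl_corpus(documents, output_path):
    """Write documents (which may contain newlines, e.g. code) as one JSON object per line."""
    with open(output_path, "w", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Placeholder inputs; replace them with your own raw corpus files or documents.
    # build_txt_corpus(["part1.txt", "part2.txt"], "corpus.txt")
    build_jsonl_corpus(["def add(a, b):\n    return a + b", "doc2"], "corpus.jsonl")
```
The resulting `corpus.txt` or `corpus.jsonl` is then passed to preprocess.py through --corpus_path (adding --json_format_corpus for the jsonl variant).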

#### Training
1. Single node
```commandline
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 \
                      --pretrained_model_path models/llama-7b.bin \
                      --dataset_path $OUTPUT_DATASET_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
                      --config_path models/llama/7b_config.json \
                      --output_model_path models/llama_zh_7b \
                      --world_size 8 --data_processor lm  --deepspeed_checkpoint_activations \
                      --total_steps 300000 --save_checkpoint_steps 5000 --batch_size 24
```

2. Multi-node
```commandline
cd multi_node
```
On node 1, edit the hostfile for your environment and make sure both nodes have identical file paths and configuration. In run-13b-pretrain.sh, adjust --mca btl_tcp_if_include enp97s0f1 as needed, replacing enp97s0f1 with the network interface name that corresponds to the node's IP address (check with `ip a`); the NUMA binding can also be changed to match the node topology. Launch pretraining with:
```commandline
bash run-13b-pretrain.sh
```


### Full-Parameter Instruction Fine-Tuning
#### Data Preprocessing
1. Build the instruction dataset: the instruction data is in json format with three fields, instruction, input, and output (any field may be empty), one sample per line (a sketch for writing this format is shown after the preprocessing step below).
Example:
```commandline
{"instruction": "在以下文本中提取所有的日期。", "input": "6月21日是夏至,这是一年中白天最长的一天。", "output": "6月21日"}
{"instruction": "", "input": "请生成一个新闻标题,描述一场正在发生的大型自然灾害。\\n\n", "output": "\"强烈飓风肆虐,数百万人疏散!\""}
``` 
2. Preprocess the data as follows
```commandline
python3 preprocess.py --corpus_path $INSTRUCTION_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
                      --dataset_path $OUTPUT_DATASET_PATH --data_processor alpaca --seq_length 1024
``` 
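
As referenced above, the sketch below (a hypothetical helper, standard-library Python only) writes instruction samples in the expected one-JSON-object-per-line layout; the output file name `instructions.json` is a placeholder, and the samples are taken from the example above.

```python
# Hypothetical helper that writes instruction data in the format described above:
# one JSON object per line with "instruction", "input", and "output" fields.
import json

samples = [
    {"instruction": "在以下文本中提取所有的日期。",
     "input": "6月21日是夏至,这是一年中白天最长的一天。",
     "output": "6月21日"},
    {"instruction": "",
     "input": "请生成一个新闻标题,描述一场正在发生的大型自然灾害。",
     "output": "\"强烈飓风肆虐,数百万人疏散!\""},
]

# "instructions.json" is a placeholder output path ($INSTRUCTION_PATH in the commands above).
with open("instructions.json", "w", encoding="utf-8") as f:
    for sample in samples:
        # Any of the three fields may be empty, but all three keys should be present.
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```
The resulting file is what the preprocessing command above expects as $INSTRUCTION_PATH.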
#### Training
1. Single node
```commandline
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 \
                      --pretrained_model_path models/llama_zh_7b.bin \
                      --dataset_path $OUTPUT_DATASET_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
                      --config_path models/llama/7b_config.json \
                      --output_model_path models/chatflow_7b \
                      --world_size 8 --data_processor alpaca --prefix_lm_loss --deepspeed_checkpoint_activations \
                      --total_steps 20000 --save_checkpoint_steps 2000 --batch_size 24
```

2. Multi-node
```commandline
cd multi_node
```
On node 1, edit the hostfile for your environment and make sure both nodes have identical file paths and configuration. In run-13b.sh, adjust --mca btl_tcp_if_include enp97s0f1 as needed, replacing enp97s0f1 with the network interface name that corresponds to the node's IP address (check with `ip a`); the NUMA binding can also be changed to match the node topology. Launch fine-tuning with:
```commandline
bash run-13b.sh
```

### Splitting the Model into Blocks
At training initialization, each GPU loads a full copy of the model, so the memory requirement is the model size multiplied by the number of GPUs (for example, a 13B model stored in fp16 takes roughly 26 GB, so 8 GPUs would need roughly 208 GB). If memory is insufficient, the model can be split into blocks as follows and then loaded block by block.
```commandline
python3 scripts/convert_model_into_blocks.py \
        --input_model_path path/to/chinese_llama_13b.bin \
        --output_model_path path/to/chinese_llama_13b \
        --block_size 10
```
Here --input_model_path is the input model path, --output_model_path is the output model directory, and --block_size is the block size. When loading the model for training, simply set pretrained_model_path to the output directory above.

## Inference
For inference with TencentPretrain-format models, see [llama_inference_pytorch](https://developer.sourcefind.cn/codes/modelzoo/llama_inference_pytorch).

## Results
- Input
```
 请问“手臂”的英文是什么
```
- Output
```
手臂的英文是“arm”。
```

### Accuracy
- Using the public instruction dataset [alpaca_gpt4_data_zh.json](https://huggingface.co/datasets/shibing624/alpaca-zh), we ran instruction fine-tuning experiments on the Chinese-adapted ChineseLLaMA 7B and 13B base models. The training loss curves are shown below:
<div align="center">
<figure class="half">
    <img width = '300' height ='250' src="./data/media/ift_7B_bs2_32node_128cards.jpg">
    <img width = '300' height ='250' src="./data/media/ift_13B_bs2_32node_128cards.jpg">
</figure>
</div>

- Using the public instruction dataset [alpaca_gpt4_data_zh.json](https://huggingface.co/datasets/shibing624/alpaca-zh), we ran Chinese instruction fine-tuning experiments on Meta's open-source [meta-llama/Llama-2-7b-chat-hf](https://pan.xunlei.com/s/VN_kQa1_HBvV-X9QVI6jV2kOA1?pwd=xmra). The training loss is shown below:
<div align="center">
<img src="./data/media/ift_llama2_7B_bs2_32node_128cards.jpg" width="300" height="250">
</div>




## Application Scenarios

### Algorithm Category

`Dialogue Q&A`

### Key Application Industries

`Healthcare, Education, Scientific Research, Finance`

## Pretrained Weights

Fast download center for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels)

The pretrained weights used in this project can be downloaded from the fast-download channel:

* [llama-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
* [llama-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf)

## Source Repository and Issue Reporting

- https://developer.sourcefind.cn/codes/modelzoo/llama_tencentpretrain_pytorch

## References

* https://github.com/CVI-SZU/Linly
* https://github.com/Tencent/TencentPretrain/
* https://github.com/ProjectD-AI/llama_inference