<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-04-25 10:38:07
 * @LastEditTime: 2024-12-11 17:18:01
-->

# Llama

## Paper

- [https://arxiv.org/pdf/2302.13971.pdf](https://arxiv.org/pdf/2302.13971.pdf)

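## Model Architecture

The Llama network is based on the Transformer architecture. It adopts several improvements that were proposed later and used in other models, such as PaLM. The main differences from the original architecture are:

- Pre-normalization. To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
- SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance. The hidden dimension is 2/3 · 4d instead of the 4d used in PaLM.
- Rotary embeddings. The absolute positional embeddings are removed and rotary positional embeddings (RoPE) are added at each layer of the network.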
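The three changes above can be sketched in plain Python. This is a minimal illustration, not Llama's actual implementation: the function names, the `eps` value, the RoPE base of 10000, the adjacent-pair rotation layout, and the rounding of the SwiGLU width up to a multiple of 256 are assumptions made for the example.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def swiglu_hidden_dim(d, multiple_of=256):
    """FFN width of 2/3 * 4d, rounded up to a multiple of `multiple_of` (assumed rounding)."""
    hidden = int(2 * (4 * d) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) * up, element-wise, over two linear projections."""
    return [silu(g) * u for g, u in zip(gate, up)]

def rope(x, position, base=10000.0):
    """Rotate pair k = (x[2k], x[2k+1]) by the angle position * base**(-2k/d)."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):  # i = 2k is the index of the first element of pair k
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

For `d = 4096`, `swiglu_hidden_dim(4096)` evaluates to 11008, and `rope(x, 0)` returns `x` unchanged because position 0 has a zero rotation angle.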

![img](./docs/llama_str.png)

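## Algorithm

Llama is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively, without relying on proprietary and inaccessible datasets.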

![img](./docs/llama_pri.png)

## Environment Setup

### Docker (Method 1)

The inference Docker image can be pulled from [光源](https://www.sourcefind.cn/#/image/dcu/custom):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.1-rc5-rocblas104381-0915-das1.6-py3.10-20250916-rc2
# Replace <Image ID> with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
# To map ports between the host and the container, remove the --network host option
docker run -it --name llama_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

`Tips: On K100/Z100L, use the custom image docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1. K100/Z100L does not support AWQ quantization.`

### Dockerfile (Method 2)

```
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker build -t llama:latest .
docker run -it --name llama_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> llama:latest /bin/bash
```

### Anaconda (Method 3)

```
conda create -n llama_vllm python=3.10
```

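The special deep learning libraries that this project requires for DCU accelerator cards can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.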

* DTK driver: dtk25.04.01
* Pytorch: 2.4.0
* triton: 3.0.0
* lmslim: 0.2.1
* flash_attn: 2.6.1
68
* flash_mla: 1.0.0
69
* vllm: 0.9.2
* python: python3.10

`Tips: Install the dependencies above first, and install the vllm package last.`

Environment variables:
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export VLLM_NUMA_BIND=1
export VLLM_RANK0_NUMA=0
export VLLM_RANK1_NUMA=1
export VLLM_RANK2_NUMA=2
export VLLM_RANK3_NUMA=3
export VLLM_RANK4_NUMA=4
export VLLM_RANK5_NUMA=5
export VLLM_RANK6_NUMA=6
export VLLM_RANK7_NUMA=7
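The per-rank variables follow a regular pattern, so they can be generated instead of typed out. A minimal sketch, assuming rank i is bound to NUMA node i as in the list above; the helper name `numa_exports` is hypothetical, and a machine with a different topology would need a different mapping.

```python
def numa_exports(num_ranks=8):
    """Return the export lines listed above for `num_ranks` ranks (rank i -> NUMA node i)."""
    lines = [
        "export ALLREDUCE_STREAM_WITH_COMPUTE=1",
        "export VLLM_NUMA_BIND=1",
    ]
    # One binding per tensor-parallel rank; the identity mapping is an assumption.
    lines += [f"export VLLM_RANK{i}_NUMA={i}" for i in range(num_ranks)]
    return lines

if __name__ == "__main__":
    print("\n".join(numa_exports()))
```

The printed lines can be saved to a file and sourced in the shell before vLLM is started.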

## Dataset



## Inference

### Model Download

The following models can be downloaded from HF:

| Base model | Chat model | GPTQ model | AWQ model |
| --- | --- | --- | --- |
| Llama-2-7b-hf | Llama-2-7b-chat-hf | Llama-2-7B-Chat-GPTQ | Llama-2-7B-AWQ |
| Llama-2-13b-hf | Llama-2-13b-chat-hf | Llama-2-13B-GPTQ | Llama-2-13B-AWQ |
| Llama-2-70b-hf | | Llama-2-70B-Chat-GPTQ | Llama-2-70B-AWQ |
| Meta-Llama-3-8B | Meta-Llama-3-8B-Instruct | | Meta-Llama-3-8B-Instruct-AWQ |
| Meta-Llama-3-70B | Meta-Llama-3-70B-Instruct | | Meta-Llama-3-70B-Instruct-AWQ |

### Offline Batch Inference

```bash
python examples/offline_inference/basic/basic.py
```

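The example script defines `prompts` directly in code and sets `temperature=0.8`, `top_p=0.95`, and `max_tokens=16`; edit these parameters in the script to change them. `model` is set in the script to the local model path, `tensor_parallel_size=1` means one card is used, and `dtype="float16"` is the inference data type. This example does not use the `quantization` parameter; for quantized inference, pass `-q gptq` (GPTQ) in the benchmark examples or refer to the corresponding AWQ example, and make sure the matching quantized weights are downloaded.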
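The settings described above correspond roughly to the following sketch of vLLM's offline API. It is not the contents of `basic.py`: the prompt list and the helper name `run` are placeholders, and `vllm` is imported lazily so that the settings can be inspected on a machine without vLLM installed.

```python
# Settings taken from the description above.
SAMPLING = {"temperature": 0.8, "top_p": 0.95, "max_tokens": 16}
ENGINE = {"tensor_parallel_size": 1, "dtype": "float16"}

# Placeholder prompts; the real script defines its own list.
PROMPTS = ["I believe the meaning of life is"]

def run(model_path, prompts=PROMPTS):
    """Generate completions for `prompts` with the model stored at `model_path`."""
    from vllm import LLM, SamplingParams  # imported here so the module loads without vllm

    llm = LLM(model=model_path, **ENGINE)
    outputs = llm.generate(prompts, SamplingParams(**SAMPLING))
    # Each result carries the original prompt and a list of candidate completions.
    return [(o.prompt, o.outputs[0].text) for o in outputs]
```

Calling `run` with a local model directory returns `(prompt, generated text)` pairs; adding a `quantization` entry to `ENGINE` would be the place to select a quantized checkpoint.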

### Offline Batch Inference Benchmark

1. Specify input and output lengths

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model meta-llama/Llama-2-7b-chat-hf -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

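Here `--num-prompts` is the number of prompts (the batch size), `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards used, and `--dtype float16` is the inference data type. Setting `--output-len 1` measures the first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.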
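Because the first-token latency run is the same benchmark with `--output-len 1`, both measurements can share one command builder. A minimal sketch that uses only the flags shown above; the helper name `throughput_cmd` and its default values are hypothetical.

```python
import shlex

def throughput_cmd(model, num_prompts=1, input_len=32, output_len=128,
                   tp=1, dtype="float16", quantization=None):
    """Assemble the benchmark_throughput.py command line as an argument list."""
    cmd = [
        "python", "benchmarks/benchmark_throughput.py",
        "--num-prompts", str(num_prompts),
        "--input-len", str(input_len),
        "--output-len", str(output_len),
        "--model", model,
        "-tp", str(tp),
        "--trust-remote-code", "--enforce-eager",
        "--dtype", dtype,
    ]
    if quantization:  # e.g. "gptq", passed through the -q flag
        cmd += ["-q", quantization]
    return cmd

# Print a throughput run, then a first-token-latency run (output length 1).
for out_len in (128, 1):
    print(shlex.join(throughput_cmd("meta-llama/Llama-2-7b-chat-hf", output_len=out_len)))
```

The returned list can be passed directly to `subprocess.run` to launch the benchmark from a sweep script.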

2. Use a dataset

Download the dataset:
[sharegpt_v3_unfiltered_cleaned_split](https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split)

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model meta-llama/Llama-2-7b-chat-hf --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

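Here `--num-prompts` is the number of prompts (the batch size), `--model` is the model path, `--dataset-name` and `--dataset-path` select the dataset and its local JSON file, `-tp` is the number of cards used, and `--dtype float16` is the inference data type. If the model weights are stored in bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision.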
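For orientation, the ShareGPT file is a JSON list of conversations. The sketch below shows how prompt/completion pairs might be sampled from it, assuming each record has a `conversations` list whose entries carry their text under `value`. The function name and the sample records are illustrative, and the benchmark script's own loader may filter differently.

```python
import json
import random

def sample_pairs(records, num_prompts, seed=0):
    """Pick up to `num_prompts` (prompt, completion) pairs from ShareGPT-style records."""
    # Keep conversations with at least two turns: first turn = prompt, second = completion.
    pairs = [
        (r["conversations"][0]["value"], r["conversations"][1]["value"])
        for r in records
        if len(r.get("conversations", [])) >= 2
    ]
    random.Random(seed).shuffle(pairs)
    return pairs[:num_prompts]

# Tiny in-memory stand-in for ShareGPT_V3_unfiltered_cleaned_split.json.
sample = json.loads("""
[
  {"id": "a", "conversations": [{"from": "human", "value": "Hi"},
                                 {"from": "gpt", "value": "Hello!"}]},
  {"id": "b", "conversations": [{"from": "human", "value": "Only one turn"}]}
]
""")
print(sample_pairs(sample, num_prompts=1))  # [('Hi', 'Hello!')]
```

The single-turn record is dropped because it has no reply to use as a reference completion.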

### OpenAI API Serving Benchmark

1. Start the server:

```bash
vllm serve meta-llama/Llama-2-7b-chat-hf --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 1
```

2. Start the client:

```bash
python benchmarks/benchmark_serving.py --model meta-llama/Llama-2-7b-chat-hf --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --trust-remote-code
```

The parameters are the same as in the dataset-based offline batch inference benchmark; see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py) for details.

### OpenAI-Compatible Server

Start the server:

```bash
vllm serve meta-llama/Llama-2-7b-chat-hf --enforce-eager --dtype float16 --trust-remote-code
```

The argument after `serve` is the path of the model to load, and `--dtype` is the data type (float16 here). By default the chat template predefined in the tokenizer is used; `--chat-template` supplies a new template that overrides the default. `-q gptq` runs inference with a GPTQ-quantized model.

List the available models:

```bash
curl http://localhost:8000/v1/models
```
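The same query can be issued from Python with the standard library. A minimal sketch, assuming the server is listening on `localhost:8000` and that the response follows the OpenAI list format (a `data` array of objects with an `id` field); `list_models` and `model_ids` are hypothetical helper names.

```python
import json
import urllib.request

def model_ids(payload):
    """Extract model names from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url="http://localhost:8000/v1"):
    """GET {base_url}/models and return the names of the served models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return model_ids(json.load(resp))
```

The names returned by `list_models()` are the values to put in the `model` field of later requests.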

### Using the OpenAI Completions API with vLLM

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "I believe the meaning of life is",
        "max_tokens": 7,
        "temperature": 0
    }'
```

Alternatively, use [examples/openai_completion_client.py](examples/openai_completion_client.py).
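For a dependency-free variant, the curl request above can be mirrored with the standard library. A minimal sketch, assuming the server is reachable at `localhost:8000` and returns the OpenAI completions format (`choices[0].text`); the helper names `completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

def completion_request(model, prompt, max_tokens=7, temperature=0,
                       base_url="http://localhost:8000/v1"):
    """Build the POST request that the curl command above sends."""
    body = {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(model, prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    req = completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]
```

The `model` argument has to match a name that the running server reports under `/v1/models`.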

### Using the OpenAI Chat API with vLLM

```bash
curl http://localhost:8000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "max_tokens": 128,
    "messages": [
      {
        "role": "user",
        "content": "I believe the meaning of life is"
      }
    ]
  }'
```

Alternatively, use [examples/online_serving/openai_chat_completion_client.py](examples/online_serving/openai_chat_completion_client.py).
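A chat request differs from a completion request mainly in that it sends a list of role-tagged messages, which makes multi-turn use a matter of appending to that list. A minimal standard-library sketch, assuming the server is at `localhost:8000` and replies in the OpenAI chat format (`choices[0].message.content`); `chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

def chat_request(model, messages, max_tokens=128,
                 base_url="http://localhost:8000/v1"):
    """Build the POST request for /v1/chat/completions."""
    body = {"model": model, "max_tokens": max_tokens, "messages": messages}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model, user_text, history=None, **kwargs):
    """Append a user turn to `history`, send it, and return (reply, updated history)."""
    messages = list(history or []) + [{"role": "user", "content": user_text}]
    with urllib.request.urlopen(chat_request(model, messages, **kwargs), timeout=60) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return reply, messages + [{"role": "assistant", "content": reply}]
```

Passing the returned history back into the next `chat` call keeps the conversation context.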

### **Using Gradio with vLLM**

1. Install gradio

```
pip install gradio
```

2. Install the required files

    2.1 Start the gradio service and follow the prompts

```
python examples/online_serving/gradio_openai_chatbot_webserver.py --model "meta-llama/Llama-2-7b-chat-hf" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
```

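    2.2 Change the file permissions

Open the directory of the downloaded file named in the prompt and run the following command to grant execute permission: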

```
chmod +x frpc_linux_amd64_v0.*
```

    2.3 Port forwarding

```
ssh -L 8000:<compute-node-IP>:8000 -L 8001:<compute-node-IP>:8001 <username>@<login-node> -p <login-node-port>
```

3. Start the OpenAI-compatible server

```
vllm serve meta-llama/Llama-2-7b-chat-hf --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
```

4. Start the gradio service

```
python examples/online_serving/gradio_openai_chatbot_webserver.py --model "meta-llama/Llama-2-7b-chat-hf" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids "" --host "0.0.0.0" --port 8001
```

5. Use the chat service

Open the local URL in a browser to use the chat service provided by Gradio.

## Result

Accelerator used: 1 × DCU-K100_AI-64G

```
Prompt: 'I believe the meaning of life is', Generated text: ' to find purpose, happiness, and fulfillment. Here are some reasons why:\n\n1. Purpose: Having a sense of purpose gives life meaning and direction. It helps individuals set goals and work towards achieving them, which can lead to a sense of accomplishment and fulfillment.\n2. Happiness: Happiness is a fundamental aspect of life that brings joy and satisfaction.
```

### Accuracy



## Application Scenarios

### Algorithm Category

Conversational question answering

### Key Application Industries

Finance, scientific research, education

## Source Repository and Issue Feedback

* [https://developer.sourcefind.cn/codes/modelzoo/llama_vllm](https://developer.sourcefind.cn/codes/modelzoo/llama_vllm)

## References

* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)