<!--
 * @Author: laibai
 * @email: laibao@sugon.com
 * @Date: 2024-05-24 14:15:07
 * @LastEditTime: 2024-09-30 08:30:01
-->

# Qwen2.5

## Paper



## Model Architecture

Qwen2.5 is the latest generation of open-source large language models from Alibaba Cloud, marking another leap in performance and capability for the Qwen series. This release focuses on multilingual support, covering more than 29 languages including Chinese, English, French, Spanish, Portuguese, and German. Models of every size now support context lengths of up to 128K tokens and can generate up to 8K tokens of output. The pretraining corpus has been expanded from 7T to 18T tokens, substantially enriching the models' knowledge. Qwen2.5 also follows system prompts more reliably, improving role-play and chatbot persona configuration. The series spans parameter scales from 0.5B to 72B to suit different application scenarios.

<div align=center>
    <img src="./doc/qwen2.5.jpg"/>
</div>

## Algorithm Principles

Like Qwen, Qwen2.5 is still a decoder-only transformer model, using the SwiGLU activation function, rotary position embeddings (RoPE), and multi-head attention.
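
As an illustration, below is a minimal PyTorch sketch of the SwiGLU feed-forward block used in this model family (layer names follow the Hugging Face Qwen2 implementation; the sizes in the usage example are illustrative):

```python
import torch
import torch.nn as nn

class SwiGLUMLP(nn.Module):
    """Qwen-style feed-forward block: down_proj(SiLU(gate_proj(x)) * up_proj(x))."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()  # Swish: x * sigmoid(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU branch gates the linear "up" branch elementwise
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))

# Usage: a (batch, seq_len, hidden) activation tensor
mlp = SwiGLUMLP(hidden_size=896, intermediate_size=4864)
out = mlp(torch.randn(2, 8, 896))
```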

<div align=center>
    <img src="./doc/qwen2.5.png"/>
</div>

## Environment Setup

### Docker (Option 1)

Pull the inference Docker image from the [光源](https://www.sourcefind.cn/#/image/dcu/custom) registry:

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.1-rc5-rocblas104381-0915-das1.6-py3.10-20250916-rc2

# <Image ID>: replace with the ID of the image pulled above
# <Host Path>: host-side path
# <Container Path>: container mount path
# To map ports between the host and the container, remove the --network host flag
docker run -it --name qwen2.5_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

`Tips: On K100/Z100L, use the dedicated image docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1; K100/Z100L does not support AWQ quantization.`

### Dockerfile (Option 2)

```
# <Host Path>: host-side path
# <Container Path>: container mount path
docker build -t qwen2.5:latest .
docker run -it --name qwen2.5_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> qwen2.5:latest /bin/bash
```

### Anaconda (Option 3)

```
conda create -n qwen2.5_vllm python=3.10
```

The DCU-specific deep learning libraries required by this project can be downloaded from the [光合](https://developer.sourcefind.cn/tool/) developer community.

* DTK driver: dtk25.04.01
* Pytorch: 2.4.0
* triton: 3.0.0
* lmslim: 0.2.1
* flash_attn: 2.6.1
* flash_mla: 1.0.0
* vllm: 0.9.2
* python: python3.10

`Tips: Install the required dependencies first; install the vllm package last.`

Environment variables (these bind each tensor-parallel rank to a NUMA node):

```bash
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export VLLM_NUMA_BIND=1
export VLLM_RANK0_NUMA=0
export VLLM_RANK1_NUMA=1
export VLLM_RANK2_NUMA=2
export VLLM_RANK3_NUMA=3
export VLLM_RANK4_NUMA=4
export VLLM_RANK5_NUMA=5
export VLLM_RANK6_NUMA=6
export VLLM_RANK7_NUMA=7
```

## Dataset



## Inference

### Model Download

| Base model | Chat model | GPTQ model | AWQ model |
| --- | --- | --- | --- |
| [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | [Qwen2.5-3B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4) | [Qwen2.5-3B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-AWQ) |
| [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | [Qwen2.5-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4) | [Qwen2.5-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-AWQ) |
| [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) | [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | [Qwen2.5-14B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4) | [Qwen2.5-14B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-AWQ) |
| [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | [Qwen2.5-32B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4) | [Qwen2.5-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-AWQ) |
| [Qwen2.5-72B](https://huggingface.co/Qwen/Qwen2.5-72B) | [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | [Qwen2.5-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4) | [Qwen2.5-72B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-AWQ) |
| [Qwen2.5-Coder-1.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B) | [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) | [Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4) | [Qwen2.5-Coder-1.5B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ) |
| [Qwen2.5-Coder-7B](https://huggingface.co/Qwen/Qwen2.5-Coder-7B) | [Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) | [Qwen2.5-Coder-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-Int4) | [Qwen2.5-Coder-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-AWQ) |
| [Qwen2.5-Coder-32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B) | [Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) | [Qwen2.5-Coder-32B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4) | [Qwen2.5-Coder-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ) |
| [Qwen2.5-Math-1.5B](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B) | [Qwen2.5-Math-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct) | | |
| [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) | [Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) | | |

### Offline Batch Inference

```bash
python examples/offline_inference/basic/basic.py
```

Here, `prompts` is the list of prompts; `temperature` controls sampling randomness (lower values make generation more deterministic, higher values more random; 0 means greedy sampling; the default is 1); `max_tokens=16` is the generation length (default 16); `model` is the model path; `tensor_parallel_size=1` is the number of cards to use (default 1); `dtype="float16"` is the inference data type; and `block_size` is the KV-cache block size (default 64).
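
For reference, a minimal sketch of what this script does with the parameters above (the model path is a placeholder):

```python
from vllm import LLM, SamplingParams

prompts = ["What is deep learning?"]

# temperature=0 would mean greedy sampling; max_tokens caps the generated length
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

llm = LLM(
    model="/your/model/path",  # placeholder for a local Qwen2.5 checkpoint
    tensor_parallel_size=1,    # number of DCU cards
    dtype="float16",           # inference data type
    block_size=64,             # KV-cache block size
)

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```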

### Offline Batch Inference Performance Test

1. Specifying input and output lengths

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model /your/model/path -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

Here, `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of generated tokens, `--model` is the model path, `-tp` is the number of cards, and `--dtype float16` sets the inference data type. If the model weights are bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision. Setting `--output-len 1` measures first-token latency.

2. Using a dataset

Download the dataset:
[sharegpt_v3_unfiltered_cleaned_split](https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split)

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model /your/model/path --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

Here, `--num-prompts` is the batch size, `--model` is the model path, `--dataset-name` is the name of the dataset to use, `--dataset-path` is the dataset path, `-tp` is the number of cards, and `--dtype float16` sets the inference data type. If the model weights are bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision. Adding `-q gptq` runs inference with a GPTQ-quantized model.

### OpenAI API Serving Performance Test

1. Start the server:

```bash
vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 1
```

2. Start the client:

```bash
python benchmarks/benchmark_serving.py --model /your/model/path --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --trust-remote-code
```

The parameters are the same as in the dataset-based offline batch inference performance test; see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py) for details.

### OpenAI-Compatible Server

Start the server:

```bash
vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code
```

The argument after `serve` is the path of the model to load, and `--dtype` is the data type (float16 here). By default, the predefined chat template from the tokenizer is used; `--chat-template` supplies a new template that overrides the default. `-q gptq` runs inference with a GPTQ-quantized model, and `-q awq` with an AWQ-quantized model.

List the available models:

```bash
curl http://localhost:8000/v1/models
```

### Using the OpenAI Completions API with vLLM

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/your/model/path",
        "prompt": "What is deep learning?",
        "max_tokens": 7,
        "temperature": 0
    }'
```

Alternatively, use [examples/online_serving/openai_completion_client.py](examples/online_serving/openai_completion_client.py).

### Using the OpenAI Chat API with vLLM

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/your/model/path",
        "max_tokens": 128,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ]
    }'
```

Alternatively, use [examples/online_serving/openai_chat_completion_client.py](examples/online_serving/openai_chat_completion_client.py).
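
The same request can also be sent from Python with the `openai` client package (a minimal sketch; the model name must match the path the server was launched with):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not validate the API key by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/your/model/path",  # must match the served model path
    max_tokens=128,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
)
print(response.choices[0].message.content)
```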

### **Using Gradio with vLLM**

1. Install Gradio

```
pip install gradio
```

2. Set up the required files

    2.1 Start the Gradio service and follow the prompts

```
python examples/online_serving/gradio_openai_chatbot_webserver.py --model "/your/model/path" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
```

    2.2 Change file permissions

Open the download directory indicated in the prompt and run the following command to grant execute permission:

```
chmod +x frpc_linux_amd64_v0.*
```

    2.3 Port mapping

```
ssh -L 8000:<compute node IP>:8000 -L 8001:<compute node IP>:8001 <user>@<login node> -p <login node port>
```

3. Start the OpenAI-compatible server

```
vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
```

4. Start the Gradio service

```
python examples/online_serving/gradio_openai_chatbot_webserver.py --model "/your/model/path" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids "" --host "0.0.0.0" --port 8001
```

5. Use the chat service

Open the local URL in a browser to use the chat service provided by Gradio.

## Results

Accelerator used: 1× DCU-K100_AI-64G

```
Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
```

### Accuracy



## Application Scenarios

### Algorithm Category

Conversational Q&A

### Key Application Industries

Finance, scientific research, education

## Source Repository and Issue Feedback

* [https://developer.sourcefind.cn/codes/modelzoo/qwen2.5_vllm](https://developer.sourcefind.cn/codes/modelzoo/qwen-2-5-vllm)

## References

* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)