<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-05-24 14:15:07
 * @LastEditTime: 2024-09-30 08:30:01
-->

# Qwen1.5

## Paper



## Model Architecture

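Qwen1.5 is a series of open-source large language models from Alibaba Cloud and is the beta version of Qwen2.0. Compared with earlier releases, this update focuses on aligning the Chat models more closely with human preferences and substantially improves multilingual capability. All model sizes support a context length of 32768 tokens. The quality of the pretrained Base models has also been improved in key respects, which is expected to give a better fine-tuning experience.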

<div align=center>
    <img src="./doc/qwen1.5.jpg"/>
</div>

## Algorithm Principles

Like Qwen, Qwen1.5 is a decoder-only transformer model that uses the SwiGLU activation function, RoPE, multi-head attention, and related techniques.

<div align=center>
    <img src="./doc/qwen1.5.png"/>
</div>

## Environment Setup

### Docker (Method 1)

Pull the inference Docker image from [SourceFind (光源)](https://www.sourcefind.cn/#/image/dcu/custom):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
# Replace <Image ID> with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
# To map ports between the host and the container, remove the --network host argument
docker run -it --name qwen1.5_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

`Tips: on K100/Z100L, use the custom image docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1; K100/Z100L does not support AWQ quantization`

### Dockerfile (Method 2)

```
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker build -t qwen1.5:latest .
docker run -it --name qwen1.5_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> qwen1.5:latest /bin/bash
```

### Anaconda (Method 3)

```
conda create -n qwen1.5_vllm python=3.10
```

The special deep learning libraries that this project requires for DCU accelerators can be downloaded and installed from the [HPCCube (光合)](https://developer.hpccube.com/tool/) developer community.

* DTK driver: dtk24.04.3
* PyTorch: 2.3.0
* triton: 2.1.0
* lmslim: 0.1.2
* flash_attn: 2.6.1
* vllm: 0.6.2
* python: python3.10

`Tips: install the related dependencies first, and install the vllm package last`  
Environment variables:  
export ALLREDUCE_STREAM_WITH_COMPUTE=1  
export VLLM_RANK0_NUMA=0  
export VLLM_RANK1_NUMA=1  
export VLLM_RANK2_NUMA=4  
export VLLM_RANK3_NUMA=5  
export VLLM_RANK4_NUMA=2  
export VLLM_RANK5_NUMA=3  
export VLLM_RANK6_NUMA=6  
export VLLM_RANK7_NUMA=7  
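
When the service is launched from a Python wrapper rather than an interactive shell, the same settings can be applied to the process environment before vLLM starts. The following is a minimal sketch under that assumption; the helper name `numa_env` and the constant `RANK_TO_NUMA` are hypothetical, and the variable names and values are copied from the export lines above.

```python
import os

# Rank -> NUMA node bindings, copied from the export lines above.
RANK_TO_NUMA = {0: 0, 1: 1, 2: 4, 3: 5, 4: 2, 5: 3, 6: 6, 7: 7}


def numa_env(rank_to_numa=RANK_TO_NUMA):
    """Return the environment variables listed above as a dict of strings."""
    env = {"ALLREDUCE_STREAM_WITH_COMPUTE": "1"}
    for rank, node in sorted(rank_to_numa.items()):
        env[f"VLLM_RANK{rank}_NUMA"] = str(node)
    return env


# Apply to the current process so that child processes inherit the settings.
os.environ.update(numa_env())
```

The update has to happen before any vLLM worker process is created, because a child process only sees the environment that exists when it is started.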

## Dataset



## Inference

### Model Download

| Base model                                                            | Chat model                                                                    | GPTQ model                                                                                              | AWQ model                                                                                    |
| --------------------------------------------------------------------- | ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| [Qwen-7B](http://113.200.138.88:18080/aimodels/qwen/Qwen-7B.git)         | [Qwen-7B-Chat](http://113.200.138.88:18080/aimodels/Qwen-7B-Chat)                | [Qwen-7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)                                    |                                                                                              |
| [Qwen-14B](http://113.200.138.88:18080/aimodels/qwen/Qwen-14B)           | [Qwen-14B-Chat](http://113.200.138.88:18080/aimodels/Qwen-14B-Chat)              | [Qwen-14B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen-14B-Chat-Int4.git)                |                                                                                              |
| [Qwen-72B](http://113.200.138.88:18080/aimodels/qwen/Qwen-72B)           | [Qwen-72B-Chat](http://113.200.138.88:18080/aimodels/Qwen-72B-Chat)              | [Qwen-72B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen-72B-Chat-Int4.git)                |                                                                                              |
| [Qwen1.5-7B](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-7B.git)   | [Qwen1.5-7B-Chat](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-7B-Chat.git) | [Qwen1.5-7B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-7B-Chat-GPTQ-Int4.git)       | [Qwen1.5-7B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-7B-Chat-AWQ)       |
| [Qwen1.5-14B](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B.git) | [Qwen1.5-14B-Chat](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat)   | [Qwen1.5-14B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat-GPTQ-Int4.git)     | [Qwen1.5-14B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat-AWQ)     |
| [Qwen1.5-32B](http://113.200.138.88:18080/aimodels/Qwen1.5-32B)          | [Qwen1.5-32B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-32B-Chat)        | [Qwen1.5-32B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/Qwen1.5-32B-Chat-GPTQ-Int4)              | [Qwen1.5-32B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-32B-Chat-AWQ.git) |
| [Qwen1.5-72B](http://113.200.138.88:18080/aimodels/Qwen1.5-72B)          | [Qwen1.5-72B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-72B-Chat)        | [Qwen1.5-72B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-72B-Chat-GPTQ-Int4.git)     | [Qwen1.5-72B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-72B-Chat-AWQ)     |
| [Qwen1.5-110B](http://113.200.138.88:18080/aimodels/Qwen1.5-110B)        | [Qwen1.5-110B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-110B-Chat)      | [Qwen1.5-110B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-110B-Chat-GPTQ-Int4.git)   | [Qwen1.5-110B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-110B-Chat-AWQ)   |
| [Qwen2-7B](http://113.200.138.88:18080/aimodels/Qwen2-7B)                | [Qwen2-7B-Instruct](http://113.200.138.88:18080/aimodels/Qwen2-7B-Instruct)      | [Qwen2-7B-Instruct-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-7B-Instruct-GPTQ-Int4.git)   | [Qwen2-7B-Instruct-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-7B-Instruct-AWQ)   |
| [Qwen2-72B](http://113.200.138.88:18080/aimodels/Qwen2-72B)              | [Qwen2-72B-Instruct](http://113.200.138.88:18080/aimodels/Qwen2-72B-Instruct)    | [Qwen2-72B-Instruct-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-72B-Instruct-GPTQ-Int4.git) | [Qwen2-72B-Instruct-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-72B-Instruct-AWQ) |

### Offline Batch Inference

```bash
python examples/offline_inference.py
```

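Here, `prompts` is the list of prompts. `temperature` controls sampling randomness: lower values make generation more deterministic, higher values make it more random, and 0 means greedy sampling (default: 1). `max_tokens=16` is the generation length (default: 16).
`model` is the model path. `tensor_parallel_size=1` is the number of cards used (default: 1). `dtype="float16"` is the inference data type; if the model weights are bfloat16, change it to float16 for inference. `quantization="gptq"` runs inference with GPTQ quantization and requires one of the GPTQ models listed above. `quantization="awq"` runs inference with AWQ quantization and requires one of the AWQ models listed above.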
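
To show how these parameters fit together, here is a minimal sketch of an offline batch run. It is not the content of `examples/offline_inference.py`: the helper names `build_llm_kwargs` and `generate` are hypothetical, and the temperature of 0.8 is an illustrative value. `LLM`, `SamplingParams`, and `llm.generate` are part of vLLM's Python API, and vLLM is imported lazily so the helpers can be loaded on a machine without it.

```python
def build_llm_kwargs(model, tensor_parallel_size=1, dtype="float16", quantization=None):
    """Collect the engine arguments described above into keyword arguments for vllm.LLM."""
    kwargs = {
        "model": model,                                # model path or name
        "tensor_parallel_size": tensor_parallel_size,  # number of cards
        "dtype": dtype,                                # use float16 even for bfloat16 weights
        "trust_remote_code": True,
    }
    if quantization is not None:  # "gptq" or "awq", with a matching quantized model
        kwargs["quantization"] = quantization
    return kwargs


def generate(prompts, llm_kwargs, temperature=0.8, max_tokens=16):
    """Run batched generation and return (prompt, generated text) pairs."""
    from vllm import LLM, SamplingParams  # imported lazily; requires vLLM

    llm = LLM(**llm_kwargs)
    params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return [(out.prompt, out.outputs[0].text) for out in outputs]
```

With vLLM installed, `generate(["What is deep learning?"], build_llm_kwargs("Qwen/Qwen1.5-7B-Chat"))` would load the model once and return one pair per prompt.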

### Offline Batch Inference Performance Test

1. Specify the input and output lengths

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model Qwen/Qwen1.5-7B-Chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

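Here, `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards used, and `--dtype float16` sets the inference data type; if the model weights are bfloat16, change it to float16 for inference. Setting `--output-len 1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.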
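
As a rough guide to reading the result, the sketch below shows the arithmetic behind a requests-per-second and tokens-per-second figure, assuming throughput is counted over prompt tokens plus generated tokens. The function name `throughput` and the 4.0 s elapsed time are hypothetical; the other numbers come from the command above.

```python
def throughput(num_prompts, input_len, output_len, elapsed_s):
    """Return (requests per second, tokens per second) for a fixed-length run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_tokens = num_prompts * (input_len + output_len)
    return num_prompts / elapsed_s, total_tokens / elapsed_s


# 1 prompt, 32 input tokens, 128 output tokens, hypothetical 4.0 s wall time.
requests_per_s, tokens_per_s = throughput(1, 32, 128, 4.0)
print(f"{requests_per_s:.2f} requests/s, {tokens_per_s:.2f} tokens/s")
```

With `--output-len 1` the same elapsed time is dominated by the prefill step, which is why that setting serves as a first-token latency measurement.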

2. Use a dataset

Download the dataset:

```bash
wget http://113.200.138.88:18080/aidatasets/vllm_data/-/raw/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

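Here, `--num-prompts` is the batch size, `--model` is the model path, `--dataset` is the dataset to use, `-tp` is the number of cards used, and `--dtype float16` sets the inference data type; if the model weights are bfloat16, change it to float16 for inference. `-q gptq` runs inference with a GPTQ-quantized model.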

### OpenAI API Server Inference Performance Test

1. Start the server:

```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat --dtype float16 --enforce-eager -tp 1
```

2. Start the client:

```bash
python benchmarks/benchmark_serving.py --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --trust-remote-code
```

The parameters are the same as in the dataset-based offline batch inference performance test; see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py) for details.

### OpenAI-Compatible Server

Start the server:

```bash
vllm serve Qwen/Qwen1.5-7B-Chat --enforce-eager --dtype float16 --trust-remote-code --port 8000
```

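The argument after `serve` is the path of the model to load, and `--dtype` is the data type (float16). By default the chat template predefined in the tokenizer is used; `--chat-template` supplies a new template that overrides the default. `-q gptq` runs inference with a GPTQ-quantized model, and `-q awq` runs inference with an AWQ-quantized model.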

List the available models:

```bash
curl http://localhost:8000/v1/models
```
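
The same query can be issued from Python with only the standard library. This is a minimal sketch that assumes the server is reachable at `http://localhost:8000` and returns the usual OpenAI-style list object with a `data` array; the function names `models_request` and `list_model_ids` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed server address, as in the curl example


def models_request(base_url=BASE_URL):
    """Build the GET request for the model list endpoint."""
    return urllib.request.Request(f"{base_url}/models", method="GET")


def list_model_ids(base_url=BASE_URL, timeout=10):
    """Send the request and return the id of every model the server exposes."""
    with urllib.request.urlopen(models_request(base_url), timeout=timeout) as resp:
        body = json.load(resp)
    return [entry["id"] for entry in body["data"]]
```

Calling `list_model_ids()` against a running server returns the model names that later requests can use in their `model` field.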

### Using the OpenAI Completions API with vLLM

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen1.5-7B-Chat",
        "prompt": "What is deep learning?",
        "max_tokens": 7,
        "temperature": 0
    }'
```

Alternatively, use [examples/openai_completion_client.py](examples/openai_completion_client.py).
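
For reference, a completion request can also be built with the Python standard library. This sketch is not the content of `examples/openai_completion_client.py`; the function names `completion_request` and `complete` are hypothetical, and the `model` value is assumed to match the name passed to `vllm serve`.

```python
import json
import urllib.request


def completion_request(prompt, model="Qwen/Qwen1.5-7B-Chat", max_tokens=7,
                       temperature=0, base_url="http://localhost:8000/v1"):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,  # must match a model served by the running server
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=60, **kwargs):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(completion_request(prompt, **kwargs), timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.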

### Using the OpenAI Chat API with vLLM

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen1.5-7B-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ]
    }'
```

Alternatively, use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py).
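
A chat request differs from a completion request mainly in that it sends a list of role-tagged messages and reads the reply from `message.content`. The sketch below illustrates this with the standard library; it is not the content of `examples/openai_chatcompletion_client.py`, the function names `chat_request` and `chat` are hypothetical, and the default system prompt is an illustrative assumption.

```python
import json
import urllib.request


def chat_request(user_message, system_prompt="You are a helpful assistant.",
                 model="Qwen/Qwen1.5-7B-Chat", base_url="http://localhost:8000/v1"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,  # must match a model served by the running server
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message, timeout=60, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(user_message, **kwargs), timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The server applies the chat template to the `messages` list, so the client does not need to format the prompt itself.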

### **Using Gradio with vLLM**

1. Install Gradio

```
pip install gradio
```

2. Install the required files

2.1 Start the Gradio service and follow the prompts

```
python  gradio_openai_chatbot_webserver.py --model "Qwen/Qwen1.5-7B-Chat" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
```

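2.2 Change the file permissions

Open the directory of the downloaded file indicated in the prompt and run the following command to grant execute permission: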

```
chmod +x frpc_linux_amd64_v0.*
```

2.3 Port forwarding

```
ssh -L 8000:<compute-node-IP>:8000 -L 8001:<compute-node-IP>:8001 <username>@<login-node> -p <login-node-port>
```

3. Start the OpenAI-compatible server

```
vllm serve Qwen/Qwen1.5-7B-Chat --enforce-eager --dtype float16 --trust-remote-code --port 8000 --host "0.0.0.0"
```

4. Start the Gradio service

```
python gradio_openai_chatbot_webserver.py --model "Qwen/Qwen1.5-7B-Chat" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids "" --host "0.0.0.0" --port 8001
```

5. Use the chat service

Open the local URL in a browser to use the chat service provided by Gradio.

## Result

Accelerator used: 1 × DCU-K100_AI-64G

```
Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
```

### Accuracy



## Application Scenarios

### Algorithm Category

Conversational question answering

### Key Application Industries

Finance, scientific research, education

## Source Repository and Issue Reporting

* [https://developer.hpccube.com/codes/modelzoo/qwen1.5_vllm](https://developer.hpccube.com/codes/modelzoo/qwen1.5_vllm)

## References

* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)