<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-05-24 14:15:07
 * @LastEditTime: 2024-09-30 08:30:01
-->

# Qwen1.5

## Paper


## Model Architecture
Qwen1.5 is the beta version of Qwen2.0 in Alibaba Cloud's series of open-source large language models. Compared with previous releases, this update focuses on aligning the Chat models more closely with human preferences and significantly strengthens multilingual capability. All model sizes now support a context length of 32768 tokens. The quality of the pretrained Base models has also been substantially improved, which should lead to better results during fine-tuning.
<div align=center>
    <img src="./doc/qwen1.5.jpg"/>
</div>

## Algorithm
Like Qwen, Qwen1.5 is still a decoder-only transformer, using the SwiGLU activation function, rotary position embeddings (RoPE), multi-head attention, and related techniques.

<div align=center>
    <img src="./doc/qwen1.5.png"/>
</div>
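The SwiGLU feed-forward block mentioned above can be sketched in a few lines of NumPy (the dimensions below are illustrative, not Qwen1.5's actual sizes):

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward block: down( silu(gate(x)) * up(x) )
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16  # toy sizes for illustration
x = rng.standard_normal((2, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y.shape)  # (2, 8)
```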

## Environment Setup
### Docker (Option 1)
An inference Docker image is provided on [光源](https://www.sourcefind.cn/#/image/dcu/custom):

```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.2-py3.10
# <Image ID>: replace with the ID of the image pulled above
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker run -it --name qwen1.5_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```
`Tips: to run on K100/Z100L, use the dedicated image docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1; note that K100/Z100L does not support AWQ quantization`

### Dockerfile (Option 2)
```bash
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker build -t qwen1.5:latest .
docker run -it --name qwen1.5_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> qwen1.5:latest /bin/bash
```

### Anaconda (Option 3)
```bash
conda create -n qwen1.5_vllm python=3.10
```
The specialized deep learning libraries this project requires for DCU accelerators can be downloaded from the [光合](https://developer.hpccube.com/tool/) developer community.
* DTK driver: dtk24.04.2
* Pytorch: 2.1.0
* triton: 2.1.0
* lmslim: 0.1.0
* xformers: 0.0.25
* flash_attn: 2.0.4
* vllm: 0.5.0
* python: 3.10

`Tips: install the other dependencies first, and install the vllm package last`

## Dataset


## Inference

### Model Download

| Base model | Chat model | GPTQ model | AWQ model |
| --- | --- | --- | --- |
| [Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)                    | [Qwen-7B-Chat](http://113.200.138.88:18080/aimodels/Qwen-7B-Chat)              | [Qwen-7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)                       |                                                                                              |
| [Qwen-14B](https://huggingface.co/Qwen/Qwen-14B)                  | [Qwen-14B-Chat](http://113.200.138.88:18080/aimodels/Qwen-14B-Chat)            | [Qwen-14B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4)                     |                                                                                              |
| [Qwen-72B](http://113.200.138.88:18080/aimodels/qwen/Qwen-72B)    | [Qwen-72B-Chat](http://113.200.138.88:18080/aimodels/Qwen-72B-Chat)            | [Qwen-72B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-72B-Chat-Int4)                     |                                                                                              |
| [Qwen1.5-7B](https://huggingface.co/Qwen/Qwen1.5-7B)              | [Qwen1.5-7B-Chat](https://huggingface.co/Qwen/Qwen1.5-7B-Chat)                 | [Qwen1.5-7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GPTQ-Int4)            | [Qwen1.5-7B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-7B-Chat-AWQ)       |
| [Qwen1.5-14B](https://huggingface.co/Qwen/Qwen1.5-14B)            | [Qwen1.5-14B-Chat](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat) | [Qwen1.5-14B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-14B-Chat-GPTQ-Int4)          | [Qwen1.5-14B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat-AWQ)     |
| [Qwen1.5-32B](http://113.200.138.88:18080/aimodels/Qwen1.5-32B)   | [Qwen1.5-32B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-32B-Chat)      | [Qwen1.5-32B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/Qwen1.5-32B-Chat-GPTQ-Int4) | [Qwen1.5-32B-Chat-AWQ-Int4](https://huggingface.co/Qwen/Qwen1.5-32B-Chat-AWQ)                   |
| [Qwen1.5-72B](http://113.200.138.88:18080/aimodels/Qwen1.5-72B)   | [Qwen1.5-72B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-72B-Chat)      | [Qwen1.5-72B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GPTQ-Int4)          | [Qwen1.5-72B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-72B-Chat-AWQ)     |
| [Qwen1.5-110B](http://113.200.138.88:18080/aimodels/Qwen1.5-110B) | [Qwen1.5-110B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-110B-Chat)    | [Qwen1.5-110B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-110B-Chat-GPTQ-Int4)        | [Qwen1.5-110B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-110B-Chat-AWQ)   |
| [Qwen2-7B](http://113.200.138.88:18080/aimodels/Qwen2-7B)         | [Qwen2-7B-Instruct](http://113.200.138.88:18080/aimodels/Qwen2-7B-Instruct)    | [Qwen2-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4)        | [Qwen2-7B-Instruct-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-7B-Instruct-AWQ)   |
| [Qwen2-72B](http://113.200.138.88:18080/aimodels/Qwen2-72B)       | [Qwen2-72B-Instruct](http://113.200.138.88:18080/aimodels/Qwen2-72B-Instruct)  | [Qwen2-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-72B-Instruct-GPTQ-Int4)      | [Qwen2-72B-Instruct-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-72B-Instruct-AWQ) |

### Offline Batch Inference
```bash
python examples/offline_inference.py
```
Here `prompts` is the list of prompt strings. `temperature` controls sampling randomness: lower values make generation more deterministic, higher values more random; 0 means greedy sampling, and the default is 1. `max_tokens=16` is the generation length (the default is 16). `model` is the model path. `tensor_parallel_size=1` is the number of cards used (default 1). `dtype="float16"` is the inference data type; if the model weights are bfloat16, change this to float16 for inference. `quantization="gptq"` enables GPTQ-quantized inference and requires one of the GPTQ models above; `quantization="awq"` enables AWQ-quantized inference and requires one of the AWQ models above.
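To build intuition for `temperature`, here is a standalone sketch of temperature-scaled softmax sampling over some made-up logits (pure NumPy, independent of vLLM):

```python
import numpy as np

def sample(logits, temperature, rng):
    # temperature -> 0 approaches greedy (argmax) sampling;
    # higher temperature flattens the distribution.
    if temperature == 0:
        return int(np.argmax(logits))
    z = logits / temperature
    p = np.exp(z - z.max())  # stable softmax
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))

logits = np.array([2.0, 1.0, 0.2])
rng = np.random.default_rng(0)
print(sample(logits, 0, rng))  # 0: greedy always picks the largest logit
counts = np.bincount([sample(logits, 5.0, rng) for _ in range(1000)], minlength=3)
print(counts)  # at high temperature, samples spread across all tokens
```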

### Offline Batch Inference Benchmark
1. Specify input and output lengths
```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model Qwen/Qwen1.5-7B-Chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

Here `--num-prompts` is the batch size, `--input-len` the input sequence length, `--output-len` the number of output tokens, `--model` the model path, and `-tp` the number of cards. `--dtype float16` is the inference data type; if the model weights are bfloat16, change this to float16 for inference. Setting `--output-len 1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.
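As a rough sanity check on the reported numbers, throughput in tokens/s is essentially the total number of processed tokens divided by elapsed wall time; the helper below is a hypothetical illustration, not code from the benchmark script:

```python
def throughput_tokens_per_s(num_prompts, input_len, output_len, elapsed_s):
    # Total tokens processed (prompt + generated) per second of wall time.
    return num_prompts * (input_len + output_len) / elapsed_s

# e.g. 1 prompt, 32 input + 128 output tokens, finished in 2.5 s:
print(throughput_tokens_per_s(1, 32, 128, 2.5))  # 64.0
```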

2. Use a dataset

Download the dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

Here `--num-prompts` is the batch size, `--model` the model path, `--dataset` the dataset to use, and `-tp` the number of cards. `--dtype float16` is the inference data type; if the model weights are bfloat16, change this to float16 for inference. `-q gptq` runs inference with a GPTQ-quantized model.

### API Server Benchmark
1. Start the server:
```bash
python -m vllm.entrypoints.openai.api_server  --model Qwen/Qwen1.5-7B-Chat  --dtype float16 --enforce-eager -tp 1 
```

2. Start the client:
```bash
python benchmarks/benchmark_serving.py --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
```

The parameters are the same as in the dataset-based offline batch inference benchmark above; see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py) for details.

### OpenAI-Compatible Server
Start the server:
```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat --enforce-eager --dtype float16 --trust-remote-code
```

Here `--model` is the path of the model to load and `--dtype` the data type (float16). By default the predefined chat template from the tokenizer is used; `--chat-template` can supply a new template that overrides the default. `-q gptq` runs inference with a GPTQ-quantized model, and `-q awq` with an AWQ-quantized model.

List the available models:
```bash
curl http://localhost:8000/v1/models
```

### Using the OpenAI Completions API with vLLM
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen1.5-7B",
        "prompt": "What is deep learning?",
        "max_tokens": 7,
        "temperature": 0
    }'
```

Alternatively, use [examples/openai_completion_client.py](examples/openai_completion_client.py).

### Using the OpenAI Chat API with vLLM
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen1.5-7B-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ]
    }'
```

Alternatively, use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py).

## Results
Accelerator used: 1× DCU-K100_AI-64G
```
Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
```

### Accuracy


## Application Scenarios

### Algorithm Category
Conversational Q&A

### Key Application Industries
Finance, scientific research, education

## Source Repository and Issue Feedback
* [https://developer.hpccube.com/codes/modelzoo/qwen1.5_vllm](https://developer.hpccube.com/codes/modelzoo/qwen1.5_vllm)

## References

* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)