<!--
 * @Author: laibai
 * @email: laibao@sugon.com
 * @Date: 2024-05-24 14:15:07
 * @LastEditTime: 2024-09-30 08:30:01
-->

# Qwen2.5

## Paper



## Model Architecture

Qwen2.5 is the latest generation of large language models open-sourced by Alibaba Cloud, marking another leap in performance and capability for the Qwen series. This release focuses on strengthening multilingual ability, supporting more than 29 languages including Chinese, English, French, Spanish, Portuguese, and German. Models of every size now support context lengths of up to 128K tokens and can generate up to 8K tokens. The pretraining dataset has grown from 7T tokens to 18T tokens, substantially expanding the models' knowledge. Qwen2.5 is also more responsive to system prompts, improving role-play and chatbot persona configuration. The series spans parameter scales from 0.5B to 72B to suit different application scenarios.

<div align=center>
    <img src="./doc/qwen1.5.jpg"/>
</div>

## Algorithm Principles

Like earlier Qwen models, Qwen2.5 remains a decoder-only transformer, using the SwiGLU activation function, RoPE (rotary position embeddings), multi-head attention, and related techniques.

<div align=center>
    <img src="./doc/qwen1.5.png"/>
</div>
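As a minimal, pure-Python sketch of the SwiGLU activation mentioned above (illustrative only, not the model implementation; in the real model the two inputs are learned gate and up projections of the hidden state):

```python
import math

def silu(x: float) -> float:
    # SiLU / Swish: x * sigmoid(x), the gating nonlinearity inside SwiGLU
    return x / (1.0 + math.exp(-x))

def swiglu(gate: list, up: list) -> list:
    # SwiGLU(x) = SiLU(W_gate @ x) * (W_up @ x), applied elementwise;
    # here the two projected vectors are passed in directly for illustration
    return [silu(g) * u for g, u in zip(gate, up)]

print(swiglu([0.0, 1.0], [2.0, 3.0]))  # first element is 0.0 since SiLU(0) = 0
```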

## Environment Setup

### Docker (Method 1)

Pull the inference Docker image from [光源](https://www.sourcefind.cn/#/image/dcu/custom):

```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.2-py3.10

# <Image ID>: replace with the ID of the image pulled above
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker run -it --name qwen2.5_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

`Tips: on K100/Z100L, use the dedicated image instead (docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1); K100/Z100L does not support AWQ quantization`

### Dockerfile (Method 2)

```bash
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker build -t qwen2.5:latest .
docker run -it --name qwen2.5_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> qwen2.5:latest /bin/bash
```

### Anaconda (Method 3)

```bash
conda create -n qwen2.5_vllm python=3.10
```

The specialized deep-learning libraries this project requires for DCU accelerators can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.

* DTK driver: dtk24.04.2
* Pytorch: 2.1.0
* triton:2.1.0
* lmslim: 0.1.0
* xformers: 0.0.25
* flash_attn: 2.6.1
* vllm: 0.5.0
* python: python3.10

`Tips: install the dependencies above first, and install the vllm package last`

## Dataset



## Inference

### Model Download

| Base model                                                                       | Chat model                                                                         | GPTQ model                                                                                            | AWQ model                                                                                    |
| -------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| [Qwen2.5 3B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-3B)                  | [Qwen2.5 3B Instruct](http://113.200.138.88:18080/aimodels/Qwen-7B-Chat)              | [Qwen2.5-3B-Instruct-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/qwen2.5-3b-instruct-gptq-int4) | [Qwen2.5-3B-Instruct-AWQ](http://113.200.138.88:18080/aimodels/qwen/qwen2.5-3b-instruct-awq)    |
| [Qwen2.5-7B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-7B)                  | [ Qwen2.5 7B Instruct](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-7B-Instruct) | [Qwen2.5-7B-Instruct-GPTQ-Int4](http://113.200.138.88:18080/aimodels/qwen/qwen2.5-7b-instruct-gptq-int4) | [Qwen-7B-Chat](http://113.200.138.88:18080/aimodels/Qwen-7B-Chat)                               |
| [Qwen2.5-14B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-14B)                | [Qwen-14B-Chat](http://113.200.138.88:18080/aimodels/Qwen-14B-Chat)                   | [Qwen-14B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4)                                | [Qwen-7B-Chat](http://113.200.138.88:18080/aimodels/Qwen-7B-Chat)                               |
| [Qwen2.5-32B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-32B)                | [Qwen-72B-Chat](http://113.200.138.88:18080/aimodels/Qwen-72B-Chat)                   | [Qwen-72B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-72B-Chat-Int4)                                | [Qwen-7B-Chat](http://113.200.138.88:18080/aimodels/Qwen-7B-Chat)                               |
| [Qwen2.5-72B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-72B)                | [Qwen1.5-7B-Chat](https://huggingface.co/Qwen/Qwen1.5-7B-Chat)                        | [Qwen1.5-7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GPTQ-Int4)                       | [Qwen1.5-7B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-7B-Chat-AWQ)       |
| [ Qwen2.5 Coder 1.5B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-Coder-1.5B) | [Qwen1.5-14B-Chat](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat)        | [Qwen1.5-14B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-14B-Chat-GPTQ-Int4)                     | [Qwen1.5-14B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat-AWQ)     |
| [Qwen2.5 Coder 7B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-Coder-7B)      | [Qwen1.5-32B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-32B-Chat)             | [Qwen1.5-32B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/Qwen1.5-32B-Chat-GPTQ-Int4)            | [Qwen1.5-32B-Chat-AWQ-Int4](https://huggingface.co/Qwen/Qwen1.5-32B-Chat-AWQ)                   |
| [Qwen2.5 Math 1.5B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-Math-1.5B)    | [Qwen1.5-72B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-72B-Chat)             | [Qwen1.5-72B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GPTQ-Int4)                     | [Qwen1.5-72B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-72B-Chat-AWQ)     |
| [ Qwen2.5 Math 7B](http://113.200.138.88:18080/aimodels/qwen/Qwen2.5-Math-7B)       | [Qwen1.5-110B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-110B-Chat)           | [Qwen1.5-110B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-110B-Chat-GPTQ-Int4)                   | [Qwen1.5-110B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-110B-Chat-AWQ)   |
| [Qwen2-7B](http://113.200.138.88:18080/aimodels/Qwen2-7B)                           | [Qwen2-7B-Instruct](http://113.200.138.88:18080/aimodels/Qwen2-7B-Instruct)           | [Qwen2-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4)                   | [Qwen2-7B-Instruct-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-7B-Instruct-AWQ)   |
| [Qwen2-72B](http://113.200.138.88:18080/aimodels/Qwen2-72B)                         | [Qwen2-72B-Instruct](http://113.200.138.88:18080/aimodels/Qwen2-72B-Instruct)         | [Qwen2-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-72B-Instruct-GPTQ-Int4)                 | [Qwen2-72B-Instruct-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-72B-Instruct-AWQ) |

### Offline Batch Inference

```bash
python examples/offline_inference.py
```

Here, `prompts` is the list of prompts; `temperature` controls sampling randomness (lower values make generation more deterministic, higher values more random; 0 means greedy sampling; default 1); `max_tokens=16` is the generation length (default 16); `model` is the model path; `tensor_parallel_size=1` is the number of cards to use (default 1); `dtype="float16"` is the inference data type (if the model weights are bfloat16, change it to float16 for inference); `quantization="gptq"` runs inference with GPTQ quantization and requires one of the GPTQ models above; `quantization="awq"` runs inference with AWQ quantization and requires one of the AWQ models above.
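These parameters map onto vLLM's offline Python API roughly as follows. This is a sketch only, assuming a DCU/GPU machine with vllm installed; the model path is an example:

```python
from vllm import LLM, SamplingParams

prompts = ["What is deep learning?"]

# temperature=0 -> greedy sampling; max_tokens is the generation length
sampling_params = SamplingParams(temperature=0, max_tokens=16)

# dtype="float16" for inference on DCU; add quantization="gptq" or
# quantization="awq" when loading one of the quantized checkpoints above
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=1, dtype="float16")

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, output.outputs[0].text)
```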

### Offline Batch Inference Performance Test

1. Specify input and output lengths

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model Qwen/Qwen2.5-7B-instruct -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

Here `--num-prompts` is the batch size, `--input-len` the input sequence length, `--output-len` the number of generated tokens, `--model` the model path, `-tp` the number of cards, and `--dtype float16` the inference data type (if the model weights are bfloat16, switch to float16 for inference). Setting `--output-len 1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.
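As a worked example of how these flags relate to the throughput the script reports (the elapsed time below is hypothetical, chosen only to make the arithmetic concrete):

```python
num_prompts = 8     # --num-prompts (batch size)
input_len = 32      # --input-len
output_len = 128    # --output-len
elapsed_s = 16.0    # hypothetical wall-clock time for the run

# Total throughput counts prompt + generated tokens; generation throughput
# counts only the generated tokens.
total_tput = num_prompts * (input_len + output_len) / elapsed_s
gen_tput = num_prompts * output_len / elapsed_s
print(total_tput, gen_tput)  # 80.0 64.0
```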

2. Use a dataset

Download the dataset:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model Qwen/Qwen2.5-7B-instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

Here `--num-prompts` is the batch size, `--model` the model path, `--dataset` the dataset to use, `-tp` the number of cards, and `--dtype float16` the inference data type (if the model weights are bfloat16, switch to float16 for inference). `-q gptq` runs inference with a GPTQ-quantized model.

### OpenAI-Compatible Server

Start the server:

```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-instruct --enforce-eager --dtype float16 --trust-remote-code
```

Here `--model` is the path of the model to load and `--dtype` the data type (float16). By default the predefined chat template in the tokenizer is used; `--chat-template` can supply a new template to override the default. `-q gptq` serves a GPTQ-quantized model, and `-q awq` serves an AWQ-quantized model.

List the available models:

```bash
curl http://localhost:8000/v1/models
```

### Using the OpenAI Completions API with vLLM

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-instruct",
        "prompt": "What is deep learning?",
        "max_tokens": 7,
        "temperature": 0
    }'
```

Or use [examples/openai_completion_client.py](examples/openai_completion_client.py).
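A minimal standard-library sketch of what such a client might look like (hypothetical; the actual example script may differ). The request is only constructed here; uncommenting `urlopen` requires the server started above:

```python
import json
from urllib import request

payload = {
    "model": "Qwen/Qwen2.5-7B-instruct",
    "prompt": "What is deep learning?",
    "max_tokens": 7,
    "temperature": 0,
}
req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:          # needs the running server
#     print(json.load(resp)["choices"][0]["text"])
print(req.get_method(), req.full_url)
```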

### Using the OpenAI Chat API with vLLM

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ]
    }'
```

Or use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py).

## Result

Accelerator used: 1× DCU-K100_AI-64G

```
Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
```

### Accuracy



## Application Scenarios

### Algorithm Category

Conversational question answering

### Key Application Industries

Finance, scientific research, education

## Source Repository and Issue Feedback

* [https://developer.hpccube.com/codes/modelzoo/qwen2.5_vllm](https://developer.hpccube.com/codes/modelzoo/qwen2.5_vllm)

## References

* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)