<!--
 * @Author: laibai
 * @email: laibao@sugon.com
 * @Date: 2024-05-24 14:15:07
 * @LastEditTime: 2024-09-30 08:30:01
-->
# Qwen2.5

## Paper
None

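## Model Architecture
Qwen2.5 is the latest generation of large language models open-sourced by Alibaba Cloud, marking another leap in performance and capability for the Qwen series. This release focuses on multilingual processing and supports more than 29 languages, including Chinese, English, French, Spanish, Portuguese, and German. Models in the series support context lengths of up to 128K tokens and can generate up to 8K tokens. The pretraining dataset has grown from 7T tokens to 18T tokens, substantially expanding the model's knowledge. Qwen2.5 also adapts better to system prompts, which improves role-play and condition-setting for chatbots. The series covers parameter scales from 0.5B to 72B to meet the needs of different application scenarios.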
<div align=center>
    <img src="./doc/qwen1.5.jpg"/>
</div>

## Algorithm
Like Qwen, Qwen2.5 is still a decoder-only transformer model that uses the SwiGLU activation function, RoPE, multi-head attention, and related techniques.
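
To make the SwiGLU feed-forward block mentioned above more concrete, here is a minimal pure-Python sketch. The function names, the tiny list-based matrices, and the absence of bias terms are assumptions made for illustration; this is not Qwen's actual implementation, which runs on tensors inside the model code.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation, x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]  # gating branch
    up = matvec(w_up, x)                         # linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating
    return matvec(w_down, hidden)
```

With identity matrices for all three weights, `swiglu_ffn([0.0, 1.0], I, I, I)` reduces to `silu(x) * x` per element, which is a quick way to sanity-check the gating.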

<div align=center>
    <img src="./doc/qwen1.5.png"/>
</div>

## Environment Setup
### Docker (Option 1)
Pull the inference Docker image from [SourceFind (光源)](https://www.sourcefind.cn/#/image/dcu/custom):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.2-py3.10
# Replace <Image ID> with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker run -it --name qwen2.5_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```
`Tips: On K100/Z100L, use the custom image instead: docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1. K100/Z100L does not support AWQ quantization.`

### Dockerfile (Option 2)
```
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker build -t qwen2.5:latest .
docker run -it --name qwen2.5_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> qwen2.5:latest /bin/bash
```

### Anaconda (Option 3)
```
conda create -n qwen2.5_vllm python=3.10
```
The DCU-specific deep learning libraries this project needs can be downloaded and installed from the [HPCCube (光合)](https://developer.hpccube.com/tool/) developer community.
* DTK driver: dtk24.04.2
* PyTorch: 2.1.0
* triton: 2.1.0
* lmslim: 0.1.0
* xformers: 0.0.25
* flash_attn: 2.6.1
* vllm: 0.5.0
* python: python3.10

`Tips: Install the dependencies first, and install the vllm package last.`
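
One way to confirm that an environment matches the versions listed above is a small check with the standard library's `importlib.metadata`. This is a sketch under assumptions: the distribution names in `EXPECTED` (for example `torch` for PyTorch) and the idea that DCU builds may append a local suffix after `+` are guesses, not something this README states.

```python
from importlib import metadata

# Versions taken from the list above; the distribution names are assumed.
EXPECTED = {
    "torch": "2.1.0",
    "triton": "2.1.0",
    "lmslim": "0.1.0",
    "xformers": "0.0.25",
    "flash_attn": "2.6.1",
    "vllm": "0.5.0",
}


def check_versions(expected, get_version=metadata.version):
    """Return a {package: status} report for the expected versions."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore any local build suffix such as "+abc" (assumed convention).
        public_version = installed.split("+", 1)[0]
        if public_version == wanted:
            report[name] = "ok"
        else:
            report[name] = f"mismatch ({installed})"
    return report
```

Calling `check_versions(EXPECTED)` inside the container or conda environment returns a dictionary that can be printed or asserted on before starting inference.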

## Dataset


## Inference

### Model Download

| Base model | Chat model | GPTQ model | AWQ model |
| ------- | ------- | ------- | ------- |
| [Qwen-7B](https://huggingface.co/Qwen/Qwen-7B) | [Qwen-7B-Chat](http://113.200.138.88:18080/aimodels/Qwen-7B-Chat) | [Qwen-7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4) |  |
| [Qwen-14B](https://huggingface.co/Qwen/Qwen-14B) | [Qwen-14B-Chat](http://113.200.138.88:18080/aimodels/Qwen-14B-Chat) | [Qwen-14B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4) |  |
| [Qwen-72B](http://113.200.138.88:18080/aimodels/qwen/Qwen-72B) | [Qwen-72B-Chat](http://113.200.138.88:18080/aimodels/Qwen-72B-Chat) | [Qwen-72B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen-72B-Chat-Int4) |  |
| [Qwen1.5-7B](https://huggingface.co/Qwen/Qwen1.5-7B) | [Qwen1.5-7B-Chat](https://huggingface.co/Qwen/Qwen1.5-7B-Chat) | [Qwen1.5-7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GPTQ-Int4) | [Qwen1.5-7B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-7B-Chat-AWQ) |
| [Qwen1.5-14B](https://huggingface.co/Qwen/Qwen1.5-14B) | [Qwen1.5-14B-Chat](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat) | [Qwen1.5-14B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-14B-Chat-GPTQ-Int4) | [Qwen1.5-14B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-14B-Chat-AWQ) |
| [Qwen1.5-32B](http://113.200.138.88:18080/aimodels/Qwen1.5-32B) | [Qwen1.5-32B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-32B-Chat) | [Qwen1.5-32B-Chat-GPTQ-Int4](http://113.200.138.88:18080/aimodels/Qwen1.5-32B-Chat-GPTQ-Int4) | [Qwen1.5-32B-Chat-AWQ-Int4](https://huggingface.co/Qwen/Qwen1.5-32B-Chat-AWQ) |
| [Qwen1.5-72B](http://113.200.138.88:18080/aimodels/Qwen1.5-72B) | [Qwen1.5-72B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-72B-Chat) | [Qwen1.5-72B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GPTQ-Int4) | [Qwen1.5-72B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-72B-Chat-AWQ) |
| [Qwen1.5-110B](http://113.200.138.88:18080/aimodels/Qwen1.5-110B) | [Qwen1.5-110B-Chat](http://113.200.138.88:18080/aimodels/Qwen1.5-110B-Chat) | [Qwen1.5-110B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-110B-Chat-GPTQ-Int4) | [Qwen1.5-110B-Chat-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen1.5-110B-Chat-AWQ) |
| [Qwen2-7B](http://113.200.138.88:18080/aimodels/Qwen2-7B) | [Qwen2-7B-Instruct](http://113.200.138.88:18080/aimodels/Qwen2-7B-Instruct) | [Qwen2-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4) | [Qwen2-7B-Instruct-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-7B-Instruct-AWQ) |
| [Qwen2-72B](http://113.200.138.88:18080/aimodels/Qwen2-72B) | [Qwen2-72B-Instruct](http://113.200.138.88:18080/aimodels/Qwen2-72B-Instruct) | [Qwen2-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-72B-Instruct-GPTQ-Int4) | [Qwen2-72B-Instruct-AWQ-Int4](http://113.200.138.88:18080/aimodels/qwen/Qwen2-72B-Instruct-AWQ) |


### Offline Batch Inference
```bash
python examples/offline_inference.py
```
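The script uses the following parameters:
* `prompts`: the input prompts.
* `temperature`: controls sampling randomness. Lower values make generation more deterministic, higher values make it more random, and 0 means greedy sampling. Default: 1.
* `max_tokens=16`: the maximum number of tokens to generate. Default: 16.
* `model`: the model path.
* `tensor_parallel_size=1`: the number of cards to use. Default: 1.
* `dtype="float16"`: the inference data type. If the model weights are bfloat16, set this to float16 for inference.
* `quantization="gptq"`: run inference with GPTQ quantization. Requires one of the GPTQ models listed above.
* `quantization="awq"`: run inference with AWQ quantization. Requires one of the AWQ models listed above.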
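
The parameters above map onto vLLM's `LLM` and `SamplingParams` classes. The following is a minimal sketch of that mapping, not necessarily what `examples/offline_inference.py` contains; the helper names `build_engine_kwargs` and `generate` are hypothetical, and vLLM is imported lazily so the first helper works without it installed.

```python
def build_engine_kwargs(model, tensor_parallel_size=1, dtype="float16",
                        quantization=None):
    """Collect keyword arguments for vllm.LLM from the options above."""
    kwargs = {
        "model": model,                                # model path or name
        "tensor_parallel_size": tensor_parallel_size,  # number of cards
        "dtype": dtype,                                # float16 on DCU
    }
    if quantization is not None:
        kwargs["quantization"] = quantization          # "gptq" or "awq"
    return kwargs


def generate(prompts, model, temperature=1.0, max_tokens=16, **engine_options):
    """Run offline batch inference and return (prompt, text) pairs."""
    # Imported here so build_engine_kwargs stays usable without vLLM.
    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(temperature=temperature,
                                     max_tokens=max_tokens)
    llm = LLM(**build_engine_kwargs(model, **engine_options))
    outputs = llm.generate(prompts, sampling_params)
    return [(output.prompt, output.outputs[0].text) for output in outputs]
```

For example, `generate(["What is deep learning?"], "Qwen/Qwen2.5-7B-instruct", temperature=0, max_tokens=128)` would run greedy decoding on one card, and adding `quantization="gptq"` would select a GPTQ checkpoint.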


### Offline Batch Inference Benchmark
1. Specify input and output lengths
```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model Qwen/Qwen2.5-7B-instruct -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
Here `--num-prompts` is the batch size (number of prompts), `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards to use, and `--dtype float16` is the inference data type; if the model weights are bfloat16, set this to float16 for inference. Setting `--output-len 1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.
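
To interpret the benchmark numbers, it helps to see how throughput relates to the flags above. The sketch below assumes throughput is reported as requests per second and as total (prompt plus generated) tokens per second of wall-clock time; the function name and the elapsed time in the usage note are made up for illustration.

```python
def throughput(num_prompts, input_len, output_len, elapsed_seconds):
    """Return (requests per second, tokens per second) for one run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Every request contributes its prompt tokens and its generated tokens.
    total_tokens = num_prompts * (input_len + output_len)
    return num_prompts / elapsed_seconds, total_tokens / elapsed_seconds
```

With the flags from the command above (`--num-prompts 1 --input-len 32 --output-len 128`) and a hypothetical run time of 4 seconds, `throughput(1, 32, 128, 4.0)` gives 0.25 requests/s and 40.0 tokens/s.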

2. Use a dataset
Download the dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model Qwen/Qwen2.5-7B-instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
Here `--num-prompts` is the batch size (number of prompts), `--model` is the model path, `--dataset` is the dataset to use, `-tp` is the number of cards to use, and `--dtype float16` is the inference data type; if the model weights are bfloat16, set this to float16 for inference. `-q gptq` runs inference with a GPTQ-quantized model.




### OpenAI-Compatible Server
Start the server:
```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-instruct --enforce-eager --dtype float16 --trust-remote-code
```
Here `--model` is the path of the model to load and `--dtype` is the data type (float16). By default the server uses the chat template predefined in the tokenizer; pass `--chat-template` to override it with a new template. `-q gptq` runs inference with a GPTQ-quantized model, and `-q awq` runs inference with an AWQ-quantized model.

List the available models:
```bash
curl http://localhost:8000/v1/models
```

### Using the OpenAI Completions API with vLLM
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-instruct",
        "prompt": "What is deep learning?",
        "max_tokens": 7,
        "temperature": 0
    }'
```
Alternatively, use [examples/openai_completion_client.py](examples/openai_completion_client.py).
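
The same completions request can also be sent from Python with only the standard library. This is a minimal sketch and not the contents of `examples/openai_completion_client.py`; the helper names are hypothetical, and the base URL assumes the server started above is listening on `localhost:8000`.

```python
import json
import urllib.request


def build_completion_request(prompt, model="Qwen/Qwen2.5-7B-instruct",
                             base_url="http://localhost:8000/v1",
                             max_tokens=7, temperature=0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,  # must match the --model value used by the server
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **options):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **options)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("What is deep learning?")` returns the completion text; `build_completion_request` can be inspected on its own without a server.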


### Using the OpenAI Chat API with vLLM
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ]
    }'
```
Alternatively, use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py).
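
For chat requests, the main differences are the `messages` list in the payload and where the reply sits in the response. The sketch below shows both with the standard library only; it is not the contents of `examples/openai_chatcompletion_client.py`, the helper names are hypothetical, and the base URL assumes a local server on port 8000.

```python
import json
import urllib.request


def build_chat_payload(user_message, system_message=None,
                       model="Qwen/Qwen2.5-7B-instruct"):
    """Build the JSON body for /v1/chat/completions."""
    messages = []
    if system_message:
        # The optional system message sets the assistant's behaviour.
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages}


def extract_reply(response_body):
    """Return the assistant text from a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(user_message, system_message=None,
         base_url="http://localhost:8000/v1"):
    """Send one chat turn to the server and return the assistant's reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(user_message, system_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, `chat("What is deep learning?")` returns the assistant's answer as a string.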


## Result
Accelerator used: 1 DCU-K100_AI-64G card
```
Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
```

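### Accuracy


## Application Scenarios

### Algorithm Category
Conversational question answering

### Key Application Industries
Finance, scientific research, education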

## Source Repository and Issue Feedback
* [https://developer.hpccube.com/codes/modelzoo/qwen2.5_vllm](https://developer.hpccube.com/codes/modelzoo/qwen2.5_vllm)

## References
* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)