<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-05-24 14:15:07
 * @LastEditTime: 2024-06-20 08:40:01
-->
# Qwen1.5

## Paper


## Model Architecture
Qwen1.5 is Alibaba Cloud's open-source large language model series and the beta version of Qwen2.0. Compared with previous releases, this update focuses on aligning the Chat models more closely with human preferences and significantly strengthens multilingual capability. All model sizes now support a context length of 32768 tokens. The quality of the pretrained Base models has also been substantially improved, which should yield a better fine-tuning experience.
<div align=center>
    <img src="./doc/qwen1.5.jpg"/>
</div>

## Algorithm
Like Qwen, Qwen1.5 is still a decoder-only Transformer model, using the SwiGLU activation function, RoPE (rotary position embeddings), multi-head attention, and so on.

<div align=center>
    <img src="./doc/qwen1.5.png"/>
</div>

## Environment Setup
### Docker (Option 1)
Pull the inference Docker image from [SourceFind (光源)](https://www.sourcefind.cn/#/image/dcu/custom):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10

# <Image ID>: replace with the ID of the image pulled above
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker run -it --name qwen1.5_vllm --privileged --shm-size=64G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash

pip install aiohttp==3.9.1 outlines==0.0.37 openai==1.23.3 -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
`Tips: On K100/Z100L, flash_attn must be replaced; download: https://forum.hpccube.com/thread/515`

### Dockerfile (Option 2)
```
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker build -t qwen1.5:latest .
docker run -it --name qwen1.5_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> qwen1.5:latest /bin/bash
```
`Tips: On K100/Z100L, flash_attn must be replaced; download: https://forum.hpccube.com/thread/515`

### Anaconda (Option 3)
```
conda create -n qwen1.5_vllm python=3.10
conda activate qwen1.5_vllm
pip install aiohttp==3.9.1 outlines==0.0.37 openai==1.23.3
```
The DCU-specific deep learning libraries required by this project can be downloaded from the [光合 (HPC Cube)](https://developer.hpccube.com/tool/) developer community:
* DTK driver: dtk24.04
* PyTorch: 2.1.0
* Triton: 2.1.0
* vllm: 0.3.3
* xformers: 0.0.25
* flash_attn: 2.0.4
* Python: 3.10

`Tips: On K100/Z100L, flash_attn must be replaced; download: https://forum.hpccube.com/thread/515`

## Dataset


## Inference

### Build and Install from Source
```
# If you use the SourceFind image, you can skip this step: vllm is already installed in the image.
git clone http://developer.hpccube.com/codes/modelzoo/qwen1.5_vllm.git
cd qwen1.5_vllm
git submodule init && git submodule update
cd vllm
pip install wheel
python setup.py bdist_wheel
cd dist && pip install vllm*
```
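
If the build succeeded, a quick sanity check (a minimal sketch) is:
```bash
python -c "import vllm; print(vllm.__version__)"   # should report 0.3.3 per the dependency list above
```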

### Model Download

| Base model | Chat model | GPTQ model |
| --- | --- | --- |
| [Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)   | [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)   | |
| [Qwen-14B](https://huggingface.co/Qwen/Qwen-14B) | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | |
| [Qwen-72B](https://huggingface.co/Qwen/Qwen-72B) | [Qwen-72B-Chat](https://huggingface.co/Qwen/Qwen-72B-Chat) | |
| [Qwen1.5-7B](https://huggingface.co/Qwen/Qwen1.5-7B)   | [Qwen1.5-7B-Chat](https://huggingface.co/Qwen/Qwen1.5-7B-Chat)    | [Qwen1.5-7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GPTQ-Int4)   |
| [Qwen1.5-14B](https://huggingface.co/Qwen/Qwen1.5-14B) | [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) | [Qwen1.5-14B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-14B-Chat-GPTQ-Int4) |
| [Qwen1.5-32B](https://huggingface.co/Qwen/Qwen1.5-32B) | [Qwen1.5-32B-Chat](https://huggingface.co/Qwen/Qwen1.5-32B-Chat) | [Qwen1.5-32B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GPTQ-Int4) |
| [Qwen1.5-72B](https://huggingface.co/Qwen/Qwen1.5-72B) | [Qwen1.5-72B-Chat](https://huggingface.co/Qwen/Qwen1.5-72B-Chat) | [Qwen1.5-72B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GPTQ-Int4) |
| [Qwen1.5-110B](https://huggingface.co/Qwen/Qwen1.5-110B) | [Qwen1.5-110B-Chat](https://huggingface.co/Qwen/Qwen1.5-110B-Chat) | [Qwen1.5-110B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-110B-Chat-GPTQ-Int4) |
| [Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B)   | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)    | [Qwen2-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4)   |
| [Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B) | [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | [Qwen2-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-72B-Instruct-GPTQ-Int4) |


### Offline Batched Inference
```bash
python vllm/examples/offline_inference.py
```
Here, `prompts` is the list of prompts; `temperature` controls sampling randomness (lower values make generation more deterministic, higher values more random; 0 means greedy sampling; default 1); `max_tokens=16` is the maximum number of generated tokens (default 16); `model` is the model path; `tensor_parallel_size=1` is the number of cards to use (default 1); `dtype="float16"` is the inference data type (if the model weights are bfloat16, change this to float16 for inference); `quantization="gptq"` enables GPTQ-quantized inference and requires one of the GPTQ models above.
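For reference, a minimal sketch of what `offline_inference.py` does with these parameters (the model path and prompt here are placeholders; adjust them to your setup):
```python
from vllm import LLM, SamplingParams

# Placeholder prompts; replace with your own.
prompts = ["What is deep learning?"]

# Sampling parameters described above: greedy decoding, up to 128 new tokens.
sampling_params = SamplingParams(temperature=0, max_tokens=128)

# Load the model; add quantization="gptq" when using a GPTQ checkpoint.
llm = LLM(model="Qwen/Qwen1.5-7B-Chat",
          tensor_parallel_size=1,
          dtype="float16")

# Batched generation over all prompts.
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```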


### Offline Batched Inference Performance Test
`Tips: When testing qwen1.5-7b/qwen1.5-72b, set the environment variable LLAMA_NN=1`

1. Specify input and output lengths
```bash
python vllm/benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model Qwen/Qwen1.5-7B-Chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
Here `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards, and `--dtype float16` is the inference data type (if the model weights are bfloat16, change this to float16). Setting `--output-len 1` measures first-token latency. Add `-q gptq` to run inference with a GPTQ-quantized model.
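
For example, a first-token-latency run against a GPTQ checkpoint might look like the following (a sketch; the input length of 1024 is an arbitrary choice):
```bash
python vllm/benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 1024 --output-len 1 \
    --model Qwen/Qwen1.5-7B-Chat-GPTQ-Int4 -q gptq -tp 1 --trust-remote-code --enforce-eager --dtype float16
```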

2. Use a dataset

Download the dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python vllm/benchmarks/benchmark_throughput.py --num-prompts 1 --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
Here `--num-prompts` is the batch size, `--model` is the model path, `--dataset` is the dataset to use, `-tp` is the number of cards, and `--dtype float16` is the inference data type (if the model weights are bfloat16, change this to float16). Add `-q gptq` to run inference with a GPTQ-quantized model.


### API Server Performance Test
1. Start the server:
```bash
python -m vllm.entrypoints.api_server  --model Qwen/Qwen1.5-7B-Chat  --dtype float16 --enforce-eager -tp 1 
```

2. Start the client:
```bash
python vllm/benchmarks/benchmark_serving.py --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
```
The parameters are the same as in the dataset-based offline batched throughput test above; for details, see `vllm/benchmarks/benchmark_serving.py`.


### OpenAI-Compatible Server
Start the server:
```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat --enforce-eager --dtype float16 --trust-remote-code
```
Here `--model` is the path of the model to load and `--dtype` is the data type (float16). By default, the predefined chat template from the tokenizer is used; `--chat-template` can supply a new template that overrides the default. Add `-q gptq` to run inference with a GPTQ-quantized model.

List the served models:
```bash
curl http://localhost:8000/v1/models
```

### Using the OpenAI Completions API with vLLM
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen1.5-7B-Chat",
        "prompt": "What is deep learning?",
        "max_tokens": 7,
        "temperature": 0
    }'
```
Alternatively, use [vllm/examples/openai_completion_client.py](https://developer.hpccube.com/codes/OpenDAS/vllm/-/blob/df6349c78b49a5b8f6f600d0d9490791cd1d32ee/examples/openai_completion_client.py).
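
An equivalent call with the `openai` client installed earlier (a minimal sketch; `api_key` can be any placeholder, since the server does not check it unless started with an API key):
```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    prompt="What is deep learning?",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```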


### Using the OpenAI Chat API with vLLM
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen1.5-7B-Chat",
        "messages": [
            {"role": "system", "content": "What is deep learning?"},
            {"role": "user", "content": "What is deep learning?"}
        ]
    }'
```
Alternatively, use [vllm/examples/openai_chatcompletion_client.py](https://developer.hpccube.com/codes/OpenDAS/vllm/-/blob/df6349c78b49a5b8f6f600d0d9490791cd1d32ee/examples/openai_chatcompletion_client.py).
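
The same request via the `openai` client (a minimal sketch under the same assumptions as above; the system prompt is a generic placeholder):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
)
print(chat.choices[0].message.content)
```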


## Results
Accelerator used: 1x DCU-K100_AI-64G
```
Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
```

### Accuracy


## Application Scenarios

### Algorithm Category
Conversational Q&A

### Key Application Industries
Finance, scientific research, education

## Source Repository and Issue Reporting
* [https://developer.hpccube.com/codes/modelzoo/qwen1.5_vllm](https://developer.hpccube.com/codes/modelzoo/qwen1.5_vllm)

## References
* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)