<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-06-13 14:38:07
 * @LastEditTime: 2024-09-30 09:16:01
-->

# ChatGLM

## Paper

`GLM: General Language Model Pretraining with Autoregressive Blank Infilling`

- [https://arxiv.org/abs/2103.10360](https://arxiv.org/abs/2103.10360)

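## Model Architecture

ChatGLM-6B is an open-source dialogue language model from Tsinghua University that supports both Chinese and English. It is based on the [General Language Model (GLM)](https://github.com/THUDM/GLM) architecture and has 6.2 billion parameters. ChatGLM-6B uses techniques similar to those behind ChatGPT and is optimized for Chinese question answering and dialogue. After training on about 1T tokens of Chinese and English text, supplemented by supervised fine-tuning, feedback bootstrap, and reinforcement learning from human feedback, the 6.2-billion-parameter ChatGLM-6B can generate answers that align well with human preferences. ChatGLM2-6B is the second generation of the open-source bilingual dialogue model ChatGLM-6B. ChatGLM3 is a new generation of dialogue pre-trained models jointly released by Zhipu AI and the KEG Lab at Tsinghua University. ChatGLM3-6B is the open-source model in the ChatGLM3 series; while retaining the fluent dialogue, low deployment threshold, and other strengths of the previous two generations, it offers a stronger base model, more complete feature support, and a more comprehensive open-source lineup.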

<div align="center">
<img src="docs/transformers.jpg" width="300" height="400">
</div>

The main network configuration of the ChatGLM series models is as follows:

| Model       | Hidden size | Layers | Heads | Vocabulary size | Position encoding | Max sequence length |
| ----------- | ----------- | ------ | ----- | --------------- | ----------------- | ------------------- |
| ChatGLM2-6B | 4096        | 28     | 32    | 65024           | RoPE              | 8192                |
| ChatGLM3-6B | 4096        | 28     | 32    | 65024           | RoPE              | 8192                |
| glm-4-9b    | 4096        | 40     | 32    | 151552          | RoPE              | 131072              |
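
As a rough worked example of what the table implies, the sketch below derives the per-head width and the size of one embedding matrix from the listed values. It assumes the hidden size is split evenly across attention heads and counts only a single vocabulary-by-hidden matrix; the helper names are illustrative and not part of this repository.

```python
# Values copied from the configuration table above.
CONFIGS = {
    "ChatGLM2-6B": {"hidden": 4096, "layers": 28, "heads": 32, "vocab": 65024},
    "ChatGLM3-6B": {"hidden": 4096, "layers": 28, "heads": 32, "vocab": 65024},
    "glm-4-9b": {"hidden": 4096, "layers": 40, "heads": 32, "vocab": 151552},
}


def head_dim(cfg):
    """Per-head width, assuming the hidden size is split evenly across heads."""
    return cfg["hidden"] // cfg["heads"]


def embedding_params(cfg):
    """Number of parameters in one vocab-by-hidden embedding matrix."""
    return cfg["vocab"] * cfg["hidden"]


for name, cfg in CONFIGS.items():
    millions = embedding_params(cfg) / 1e6
    print(f"{name}: head_dim={head_dim(cfg)}, embedding={millions:.1f}M params")
```

The larger vocabulary of glm-4-9b mostly shows up in the embedding matrix, which is more than twice the size of the one in the 6B models.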

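## Algorithm

The ChatGLM series models are built on the GLM architecture. GLM is a Transformer-based language model trained with an autoregressive blank infilling objective, which gives it both autoregressive and autoencoding capabilities.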

<div align="center">
<img src="docs/GLM.png" width="550" height="200">
</div>
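
To make the blank infilling objective more concrete, here is a minimal sketch of how a token sequence could be split into the corrupted context (Part A) and the spans to be generated (Part B). It is a simplification for illustration only: the function name and the special-token strings are assumptions, spans are given by the caller instead of being sampled, and the caller's span order stands in for the random shuffle described in the GLM paper.

```python
def blank_infill(tokens, spans, mask="[MASK]", start="[S]", end="[E]"):
    """Split `tokens` into GLM-style Part A and Part B.

    `spans` is a list of non-overlapping (begin, stop) index pairs, stop exclusive.
    Returns (part_a, part_b_inputs, part_b_targets).
    """
    part_a, cursor = [], 0
    for begin, stop in sorted(spans):
        part_a.extend(tokens[cursor:begin])
        part_a.append(mask)  # the whole span collapses into one mask token
        cursor = stop
    part_a.extend(tokens[cursor:])

    part_b_inputs, part_b_targets = [], []
    for begin, stop in spans:  # caller's order stands in for the random shuffle
        span = tokens[begin:stop]
        part_b_inputs.extend([start] + span)   # decoder input: [S] followed by the span
        part_b_targets.extend(span + [end])    # shifted target: the span followed by [E]
    return part_a, part_b_inputs, part_b_targets


tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
part_a, b_in, b_out = blank_infill(tokens, spans=[(4, 6), (2, 3)])
print(part_a)  # ['x1', 'x2', '[MASK]', 'x4', '[MASK]']
print(b_in)    # ['[S]', 'x5', 'x6', '[S]', 'x3']
print(b_out)   # ['x5', 'x6', '[E]', 'x3', '[E]']
```

Part A is read bidirectionally, like an autoencoder input, while Part B is predicted token by token, which is where the autoregressive behaviour comes from.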

## Environment Setup

### Docker (Method 1)

An inference Docker image can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/image/dcu/custom):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
# <Image ID>: replace with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
# To map ports between the host and the container, remove the --network host flag
docker run -it --name chatglm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

`Tips: On K100/Z100L, use the custom image: docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1. K100/Z100L does not support AWQ quantization.`

### Dockerfile (Method 2)

```
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker build -t chatglm:latest .
docker run -it --name chatglm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> chatglm:latest /bin/bash
```

### Anaconda (Method 3)

```
conda create -n chatglm_vllm python=3.10
```

The special deep learning libraries that this project needs for DCU cards can be downloaded and installed from the [Guanghe (光合)](https://developer.sourcefind.cn/tool/) developer community.

* DTK driver: dtk24.04.3
* PyTorch: 2.3.0
* triton: 2.1.0
* lmslim: 0.1.2
* flash_attn: 2.6.1
* vllm: 0.6.2
* python: python3.10

`Tips: install the dependencies above first, and install the vllm package last.`  

Environment variables:  
export ALLREDUCE_STREAM_WITH_COMPUTE=1  
export VLLM_NUMA_BIND=1  
export VLLM_RANK0_NUMA=0  
export VLLM_RANK1_NUMA=1  
export VLLM_RANK2_NUMA=2  
export VLLM_RANK3_NUMA=3  
export VLLM_RANK4_NUMA=4  
export VLLM_RANK5_NUMA=5  
export VLLM_RANK6_NUMA=6  
export VLLM_RANK7_NUMA=7  
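
When the number of cards differs from eight, the same variables can be generated rather than typed by hand. The sketch below assumes, as in the listing above, that rank i is bound to NUMA node i; the function name is illustrative, and the right mapping depends on the topology of the actual host.

```python
import os


def numa_binding_env(num_ranks):
    """Build the environment variables listed above for `num_ranks` ranks.

    Assumes rank i is bound to NUMA node i, mirroring the listing.
    """
    env = {"ALLREDUCE_STREAM_WITH_COMPUTE": "1", "VLLM_NUMA_BIND": "1"}
    for rank in range(num_ranks):
        env[f"VLLM_RANK{rank}_NUMA"] = str(rank)
    return env


env = numa_binding_env(8)
for key, value in env.items():
    print(f"export {key}={value}")

# Merged copy of the current environment, suitable for subprocess.run(..., env=child_env).
child_env = {**os.environ, **env}
```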

## Dataset


## Inference

### Model Download

| Base model                                                  | Long-context model                                                        |
| ----------------------------------------------------------- | ------------------------------------------------------------------------- |
| [chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)     | [chatglm2-6b-32k](https://huggingface.co/THUDM/chatglm2-6b-32k)           |
| [chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)     | [chatglm3-6b-32k](https://huggingface.co/THUDM/chatglm3-6b-32k)           |
| [glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) | [glm-4-9b-chat-1m](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat-1m) |

### Offline Batch Inference

```bash
python examples/offline_inference.py
```

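Here, `prompts` is the list of prompts. `temperature` controls sampling randomness: lower values make generation more deterministic, higher values make it more random, 0 means greedy sampling, and the default is 1. `max_tokens=16` is the generation length (default 16).
`model` is the model path. `tensor_parallel_size=1` is the number of cards used (default 1). `dtype="float16"` is the inference data type; if the model weights are bfloat16, this must be set to float16 for inference. `quantization="gptq"` runs inference with GPTQ quantization, which requires downloading a GPTQ-quantized model.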
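
The parameters above map onto vLLM's Python API roughly as in the following sketch. It is not the contents of `examples/offline_inference.py`: the helper names are made up for illustration, and `trust_remote_code` and `enforce_eager` are set only because the benchmark commands in this README pass the equivalent flags.

```python
def engine_kwargs(model, tensor_parallel_size=1, dtype="float16", quantization=None):
    """Collect the engine arguments described above into keyword arguments for vllm.LLM."""
    kwargs = {
        "model": model,
        "tensor_parallel_size": tensor_parallel_size,
        "dtype": dtype,
        "trust_remote_code": True,  # assumption: mirrors --trust-remote-code
        "enforce_eager": True,      # assumption: mirrors --enforce-eager
    }
    if quantization is not None:
        kwargs["quantization"] = quantization  # e.g. "gptq" with a GPTQ checkpoint
    return kwargs


def run_offline(prompts, model, temperature=1.0, max_tokens=16, **engine_options):
    """Generate completions for `prompts` and print them."""
    # Imported lazily so that engine_kwargs can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(model, **engine_options))
    params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    for output in llm.generate(prompts, params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```

A call such as `run_offline(["晚上睡不着怎么办"], "THUDM/glm-4-9b-chat", temperature=0.8, max_tokens=128)` would then run a single-card float16 generation, provided vLLM and the model weights are available.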

### Offline Batch Inference Benchmark

1. Specify input and output lengths

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model THUDM/glm-4-9b-chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

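Here, `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards used, and `--dtype float16` is the inference data type; if the model weights are bfloat16, this must be set to float16 for inference. Setting `--output-len 1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.

The default `model_max_length` of the glm-4-9b-chat-1m model is 1024000, which official vLLM does not yet support, so `--max-model-len` must be added when launching this model (including in the launch commands later in this document). In testing, a value of around 500000 also ran inference normally.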
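
When sweeping several batch sizes, it can help to generate the command lines rather than edit them by hand. The sketch below only uses the flags described above; the function name and the sweep values are illustrative, and the printed commands are meant to be run from the repository root.

```python
import shlex


def throughput_cmd(model, num_prompts, input_len, output_len, tp=1,
                   dtype="float16", max_model_len=None):
    """Build the argument list for benchmarks/benchmark_throughput.py."""
    cmd = [
        "python", "benchmarks/benchmark_throughput.py",
        "--num-prompts", str(num_prompts),
        "--input-len", str(input_len),
        "--output-len", str(output_len),
        "--model", model,
        "-tp", str(tp),
        "--trust-remote-code", "--enforce-eager",
        "--dtype", dtype,
    ]
    if max_model_len is not None:
        # Needed for glm-4-9b-chat-1m, whose default length is not supported.
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


# Illustrative sweep over batch sizes; pass output_len=1 to measure first-token latency.
for batch in (1, 4, 16):
    print(shlex.join(throughput_cmd("THUDM/glm-4-9b-chat", batch, 32, 128)))
```

Each list can also be passed directly to `subprocess.run` instead of being printed.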

2. Use a dataset

Download the dataset:

```bash
wget http://113.200.138.88:18080/aidatasets/vllm_data/-/raw/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```

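Here, `--num-prompts` is the batch size, `--model` is the model path, `--dataset` is the dataset to use, `-tp` is the number of cards used, and `--dtype float16` is the inference data type; if the model weights are bfloat16, this must be set to float16 for inference. `-q gptq` runs inference with a GPTQ-quantized model.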
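
Before running the benchmark it can be useful to check what the dataset contains. The sketch below assumes the commonly published ShareGPT layout, a JSON list of records whose `conversations` field holds turns with `from` and `value` keys, and pairs the first two turns of each conversation as prompt and reply. The function name and the in-memory sample are illustrative, and the benchmark script may filter requests differently.

```python
import json


def first_turn_pairs(records):
    """Return (prompt, reply) pairs from the first two turns of each conversation."""
    pairs = []
    for record in records:
        turns = record.get("conversations", [])
        if len(turns) >= 2:  # skip conversations that have no reply
            pairs.append((turns[0]["value"], turns[1]["value"]))
    return pairs


# Tiny in-memory stand-in for the downloaded file (hypothetical content).
sample = json.loads("""
[
  {"id": "a", "conversations": [{"from": "human", "value": "Hi"},
                                {"from": "gpt", "value": "Hello!"}]},
  {"id": "b", "conversations": [{"from": "human", "value": "Only one turn"}]}
]
""")
print(first_turn_pairs(sample))  # [('Hi', 'Hello!')]
```

For the real file, load it with `json.load(open("ShareGPT_V3_unfiltered_cleaned_split.json"))` and pass the result to `first_turn_pairs`.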

### OpenAI API Server Inference Benchmark

1. Start the server:

```bash
python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4-9b-chat --dtype float16 --enforce-eager -tp 1
```

2. Start the client:

```bash
python benchmarks/benchmark_serving.py --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --trust-remote-code
```

The parameters are the same as in the dataset-based offline batch inference benchmark above. For details, see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py).

### OpenAI-Compatible Server

Start the server:

```bash
vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code --chat-template template_chatglm2.jinja --port 8000
```

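The argument after `serve` is the path of the model to load, and `--dtype` is the data type (float16). By default the chat template predefined in the tokenizer is used; `--chat-template` supplies a new template that overrides the default. `-q gptq` runs inference with a GPTQ-quantized model.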

List the available models:

```bash
curl http://localhost:8000/v1/models
```

### Using the OpenAI Completions API with vLLM

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "THUDM/glm-4-9b-chat",
        "prompt": "晚上睡不着怎么办",
        "max_tokens": 7,
        "temperature": 0
    }'
```

Alternatively, use [examples/openai_completion_client.py](examples/openai_completion_client.py).
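
For a dependency-free client, the same request can be assembled with the Python standard library. This is a minimal sketch and not the contents of `examples/openai_completion_client.py`; the function names are illustrative, and the defaults mirror the curl example above.

```python
import json
import urllib.request


def completion_request(prompt, model="THUDM/glm-4-9b-chat",
                       base_url="http://localhost:8000/v1",
                       max_tokens=7, temperature=0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


req = completion_request("晚上睡不着怎么办")
print(req.full_url)
```

With the server from the previous section running, `print(send(req))` prints the generated continuation.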

### Using the OpenAI Chat API with vLLM

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "THUDM/glm-4-9b-chat",
        "messages": [
            {"role": "system", "content": "晚上睡不着怎么办"},
            {"role": "user", "content": "晚上睡不着怎么办"}
        ]
    }'
```

Alternatively, use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py).
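
The chat endpoint takes a list of role-tagged messages and returns the reply under `choices[0].message.content`. The sketch below builds such a payload and extracts the reply from a response body. The helper names, the system prompt, and the hand-written response are illustrative assumptions, not output from the server.

```python
import json


def chat_payload(user_text, model="THUDM/glm-4-9b-chat", system_text=None):
    """Build the JSON body for /v1/chat/completions."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages}


def reply_text(response):
    """Extract the assistant text from a chat completion response body."""
    return response["choices"][0]["message"]["content"]


payload = chat_payload("晚上睡不着怎么办", system_text="You are a helpful assistant.")
print(json.dumps(payload, ensure_ascii=False, indent=2))

# Hand-written response in the OpenAI chat-completion shape, for illustration only.
fake_response = {"choices": [{"index": 0, "finish_reason": "stop",
                              "message": {"role": "assistant", "content": "..."}}]}
print(reply_text(fake_response))
```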

### Using gradio with vLLM

1. Install gradio

```
pip install gradio
```

2. Install required files

2.1 Start the gradio service and follow the prompts

```
python  gradio_openai_chatbot_webserver.py --model "THUDM/glm-4-9b-chat" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
```

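2.2 Change file permissions

Go to the download directory named in the prompt and run the following command to make the file executable: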

```
chmod +x frpc_linux_amd64_v0.*
```

2.3 Port mapping

```
ssh -L 8000:<compute-node-IP>:8000 -L 8001:<compute-node-IP>:8001 <username>@<login-node> -p <login-node-port>
```

3. Start the OpenAI-compatible server

```
vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code --chat-template template_chatglm2.jinja --port 8000
```

4. Start the gradio service

```
python gradio_openai_chatbot_webserver.py --model "THUDM/glm-4-9b-chat" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids "" --host "0.0.0.0" --port 8001
```

5. Use the chat service

Open the local URL in a browser to use the chat service provided by Gradio.

## Result

Accelerator used: 1 × DCU-K100_AI-64G

```
Prompt: '晚上睡不着怎么办', Generated text: '?\n晚上睡不着可以尝试以下方法来改善睡眠质量:\n\n1. **调整作息时间**:尽量每天同一时间上床睡觉和起床,建立规律的生物钟。\n\n2. **放松身心**:睡前进行深呼吸、冥想或瑜伽等放松活动,有助于减轻压力和焦虑。\n\n3. **避免咖啡因和酒精**:晚上避免摄入咖啡因和酒精,因为它们可能会干扰睡眠。\n\n'
```

### Accuracy


## Application Scenarios

### Algorithm Category

Dialogue question answering

### Key Application Industries

Healthcare, finance, scientific research, education

## Source Repository and Issue Feedback

* [https://developer.sourcefind.cn/codes/modelzoo/chatglm_vllm](https://developer.sourcefind.cn/codes/modelzoo/chatglm_vllm)

## References

* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
* [https://github.com/THUDM/ChatGLM3](https://github.com/THUDM/ChatGLM3)