## GLM-5
## Paper
[GLM-5: From Vibe Coding to Agentic Engineering](https://z.ai/blog/glm-5)

## Model Overview
As Zhipu AI's new-generation flagship model, GLM-5 targets complex systems engineering and long-horizon agentic tasks. Scaling model size remains one of the most important paths to improving the intelligence efficiency of artificial general intelligence (AGI). Compared with GLM-4.5, GLM-5 grows from 355B parameters (32B active) to 744B parameters (40B active), and its pre-training corpus grows from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), which substantially reduces deployment cost while preserving long-context capability.
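To make the scaling trade-off concrete, a quick back-of-the-envelope check (all numbers taken from the paragraph above) shows that the fraction of parameters active per token actually falls as the model grows:

```python
# Total vs. active (MoE) parameter counts, in billions, from the text above.
glm45_total, glm45_active = 355, 32
glm5_total, glm5_active = 744, 40

glm45_frac = glm45_active / glm45_total  # fraction of weights used per token
glm5_frac = glm5_active / glm5_total

print(f"GLM-4.5 active fraction: {glm45_frac:.1%}")  # ~9.0%
print(f"GLM-5   active fraction: {glm5_frac:.1%}")   # ~5.4%
```

In other words, GLM-5 more than doubles total capacity while keeping per-token compute growth modest.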

## Requirements

|   Software   | Version |
| :----------: | :-----: |
|     DTK      |  26.04  |
|    python    | 3.10.12 |
| transformers |  5.2.0  |
|    torch     |  2.9.0  |
|     vllm     | 0.15.1  |
|    sglang    | 0.5.10rc0 |

Currently, only the following images are supported:
- **vLLM inference image:** harbor.sourcefind.cn:5443/dcu/admin/base/vllm:0.15.1--ubuntu22.04-dtk26.04-py3.10-20260409
- **SGLang inference image:** harbor.sourcefind.cn:5443/dcu/admin/base/custom:sglang-0.5.10-glm5-0416

- Adjust the `-v` mount paths to match where your model actually lives
- The example below launches the `vLLM` image; if you use `SGLang`, substitute the corresponding image address

```bash
docker run -it \
    --shm-size 200g \
    --network=host \
    --name glm-5 \
    --privileged \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mkfd \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -u root \
    -v /opt/hyhal/:/opt/hyhal/:ro \
    -v /path/your_code_data/:/path/your_code_data/ \
    harbor.sourcefind.cn:5443/dcu/admin/base/vllm:0.15.1--ubuntu22.04-dtk26.04-py3.10-20260409 bash
```
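Once inside the container, a quick sanity check (a minimal sketch; the device list mirrors the `--device` flags above) is to confirm the passed-through device nodes actually exist:

```python
import os

# Device nodes passed through by the docker run flags above.
REQUIRED_DEVICES = ["/dev/kfd", "/dev/dri", "/dev/mkfd"]

def missing_devices(paths=REQUIRED_DEVICES):
    """Return the device paths that do not exist in this container."""
    return [p for p in paths if not os.path.exists(p)]

if missing_devices():
    print("missing devices:", missing_devices())
else:
    print("all DCU device nodes present")
```

A missing node usually points at the `--device` flags or the host driver installation.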

More images are available for download from [光源](https://sourcefind.cn/#/service-list).

## Pretrained Weights
**Choose the model that matches your DCU under `Supported DCUs`. The FP8 model is only supported on BW1100/BW1101; do not use it on other models!**

| Model | Size | Dtype | Supported DCUs | Min. Cards | Download |
|:-----:|:----:|:-----:|:--------------:|:----------:|:--------:|
| GLM-5 | 744B | BF16 | BW1000 | 32 | [ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-5) |
| GLM-5 | 744B | BF16 | BW1100 | 16 | [ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-5) |
| GLM-5-FP8 | 744B | FP8 | BW1100 | 8 | [ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-5-FP8) |

## Dataset
`N/A`

## Training
`N/A`

## Inference
> 1. If you see `ImportError: librocm_smi64.so.2: cannot open shared object file: No such file or directory`, the machine's hyhal version is too old; please upgrade it.
> 2. The FP8 model is only supported on BW1100; use the BF16 model on other DCU models.
> 3. MTP is not yet supported.
### SGLang
1. Set environment variables
```bash
export SGLANG_USE_LIGHTOP=1
export HIP_GRAPH_USE_CMD_CACHE=0
export SGLANG_ROCM_USE_AITER_MOE=0
```

2. Launch the server
```bash
model_path=ZhipuAI/GLM-5-FP8 # FP8 model
option="--numa-node 0 0 0 0 1 1 1 1 "
option+=" --disable-radix-cache "
option+=" --chunked-prefill-size 16384"
option+=" --page-size 64 "
option+=" --nsa-prefill-backend flashmla_auto --nsa-decode-backend flashmla_kv "

python3 -m sglang.launch_server --model-path "${model_path}" ${option} \
                                --trust-remote-code \
                                --reasoning-parser glm45 \
                                --tool-call-parser glm47 \
                                --kv-cache-dtype fp8_e4m3 \
                                --dtype bfloat16 \
                                --mem-fraction-static 0.925 \
                                --host 0.0.0.0 \
                                --port 8001 \
                                --tp-size 8 \
                                --context-length 32768 \
                                --served-model-name glm-5-fp8
```

3. Once the server is running, query it with:
```bash
curl http://localhost:8001/v1/chat/completions   \
    -H "Content-Type: application/json"  \
    -d '{
        "model": "glm-5-fp8",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is 15% of 240?"}
        ],
        "max_tokens": 2048,
        "temperature": 0.7,
        "chat_template_kwargs": {"enable_thinking": false}
    }'
```
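The same request can also be sent from Python using only the standard library (a sketch; the endpoint, port, and model name are assumed to match the launch command above):

```python
import json
import urllib.request

# Same chat-completion payload as the curl example above.
payload = {
    "model": "glm-5-fp8",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 15% of 240?"},
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": False},
}

def chat(url="http://localhost:8001/v1/chat/completions"):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the server up, `chat()["choices"][0]["message"]["content"]` holds the reply text.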

### vLLM
#### Single-node Inference
1. Set environment variables
```bash
# Clear stale caches
rm -rf ~/.cache
rm -rf ~/.triton

export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export Allgather_Base_STREAM_WITH_COMPUTE=1
export SENDRECV_STREAM_WITH_COMPUTE=1
export HIP_KERNEL_EVENT_SYSTENFENCE=1
export VLLM_RPC_TIMEOUT=1800000
export VLLM_USE_PD_SPLIT=1
export VLLM_USE_PIECEWISE=1
export VLLM_REJECT_SAMPLE_OPT=1
export USE_FUSED_RMS_QUANT=0
export USE_FUSED_SILU_MUL_QUANT=1
export VLLM_USE_GLOBAL_CACHE13=1
export VLLM_FUSED_MOE_CHUNK_SIZE=16384
export VLLM_CUSTOM_CACHE=1
export VLLM_USE_OPT_CAT=1
export VLLM_USE_FUSED_FILL_RMS_CAT=1
export VLLM_USE_LIGHTOP_MOE_SUM_MUL_ADD=0
export VLLM_USE_LIGHTOP_RMS_ROPE_CONCAT=0
export VLLM_USE_FLASH_MLA=1
export VLLM_DISABLE_DSA=0
export USE_LIGHTOP_TOPK=1
export USE_LIGHTOP_PER_TOKEN_GROUP_QUANT_FP8=1
export USE_LIGHTOP_CONVERT_REQ_INDEX_TO_GLOBAL_INDEX=1
```

2. Launch vllm serve
- **BF16** model
```bash
# --tensor-parallel-size: 32 on BW1000, 16 on BW1100
vllm serve ZhipuAI/GLM-5 \
    --port 8001 \
    --trust-remote-code \
    --tensor-parallel-size 32 \
    --gpu-memory-utilization 0.85 \
    --distributed-executor-backend ray \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-5
```

- **FP8** model. **The FP8 model is only supported on BW1100/BW1101; use the BF16 model on other DCU models**
```bash
vllm serve ZhipuAI/GLM-5-FP8 \
    --gpu-memory-utilization 0.925 \
    --port 8001 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --kv-cache-dtype fp8_ds_mla \
    --served-model-name glm-5-fp8 \
    --disable-log-requests \
    --compilation-config '{"pass_config": {"fuse_act_quant": false}}'
```

3. Once the server is running, query it with (set `model` to the `--served-model-name` you used, `glm-5` or `glm-5-fp8`):
```bash
curl http://localhost:8001/v1/chat/completions   \
    -H "Content-Type: application/json"  \
    -d '{
        "model": "glm-5",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Summarize GLM-5 in one sentence."}
        ],
        "max_tokens": 4096,
        "temperature": 0.7,
        "chat_template_kwargs": {"enable_thinking": false}
    }'
```
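Before sending chat requests, you can confirm which models the server is actually serving via the OpenAI-compatible `/v1/models` endpoint (a sketch; the port matches the commands above):

```python
import json
import urllib.request

def parse_model_ids(payload):
    """Extract model names from a /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url="http://localhost:8001"):
    """Query the server and return the served model names, e.g. ['glm-5']."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return parse_model_ids(json.loads(resp.read()))
```

The returned id must match the `model` field of your requests.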

#### Multi-node Inference
1. Set environment variables
> Note:
> Write each node's environment variables into a `.sh` file, then `source` that `.sh` file on every compute node.
>
> VLLM_HOST_IP: the node's local communication IP; prefer the IP of an IB NIC **to avoid rccl timeout issues**
>
> NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME: the interface name corresponding to that IP
>
> To list interfaces and IPs: ifconfig
>
> To check IB port status: ibstat — a port is usable only in the Active state, and all nodes must be consistent

```bash
export ALLREDUCE_STREAM_WITH_COMPUTE=1
export VLLM_HOST_IP=x.x.x.x # IP of this compute node; use the address bound to the IB SOCKET_IFNAME interface
export NCCL_SOCKET_IFNAME=ibxxxx
export GLOO_SOCKET_IFNAME=ibxxxx
export NCCL_IB_HCA=mlx5_0:1 # name of the IB NIC in this environment
unset NCCL_ALGO
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export NCCL_NET_GDR_READ=1
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_SPEC_DECODE_EAGER=1
export VLLM_MLA_DISABLE=0
export VLLM_USE_FLASH_MLA=1

# Recommended additional env vars on K100_AI clusters:
export VLLM_ENFORCE_EAGER_BS_THRESHOLD=44
export VLLM_RPC_TIMEOUT=1800000

# Core binding on Hygon CPUs
export VLLM_NUMA_BIND=1
export VLLM_RANK0_NUMA=0
export VLLM_RANK1_NUMA=1
export VLLM_RANK2_NUMA=2
export VLLM_RANK3_NUMA=3
export VLLM_RANK4_NUMA=4
export VLLM_RANK5_NUMA=5
export VLLM_RANK6_NUMA=6
export VLLM_RANK7_NUMA=7
```
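To automate the `ibstat` check across nodes, a parser along these lines (a sketch assuming the usual `ibstat` text layout) can flag ports that are not Active:

```python
def port_states(ibstat_output):
    """Return the State value of every port found in `ibstat` output."""
    states = []
    for raw in ibstat_output.splitlines():
        line = raw.strip()
        if line.startswith("State:"):
            states.append(line.split(":", 1)[1].strip())
    return states

# Captured snippet for illustration; on a real node, feed in the stdout of `ibstat`.
sample = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
"""
print(port_states(sample))                              # ['Active']
print(all(s == "Active" for s in port_states(sample)))  # True
```

On a real node, pass `subprocess.run(["ibstat"], capture_output=True, text=True).stdout` instead of the sample string.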

2. Start the RAY cluster
> x.x.x.x is the VLLM_HOST_IP from step 1

```bash
# run on the head node
ray start --head --node-ip-address=x.x.x.x --port=6379 --num-gpus=8 --num-cpus=32
# run on each worker node
ray start --address='x.x.x.x:6379' --num-gpus=8 --num-cpus=32
```
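The cluster size must match the tensor-parallel degree used in the next step: with 8 GPUs per node, `--tensor-parallel-size 32` (BW1000) implies 4 Ray nodes, and `16` (BW1100) implies 2. A small sketch of that arithmetic:

```python
def nodes_required(tp_size, gpus_per_node=8):
    """Number of Ray nodes needed for a given tensor-parallel size."""
    if tp_size % gpus_per_node != 0:
        raise ValueError("tp_size must be a multiple of gpus_per_node")
    return tp_size // gpus_per_node

print(nodes_required(32))  # BW1000 BF16: 4 nodes (1 head + 3 workers)
print(nodes_required(16))  # BW1100 BF16: 2 nodes (1 head + 1 worker)
```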

3. Launch vllm serve
- **BF16** model
```bash
# --tensor-parallel-size: 32 on BW1000, 16 on BW1100
vllm serve ZhipuAI/GLM-5 \
    --port 8001 \
    --trust-remote-code \
    --tensor-parallel-size 32 \
    --gpu-memory-utilization 0.85 \
    --distributed-executor-backend ray \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-5
```

4. Once the server is running, query it with:
```bash
curl http://localhost:8001/v1/chat/completions   \
    -H "Content-Type: application/json"  \
    -d '{
        "model": "glm-5",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Summarize GLM-5 in one sentence."}
        ],
        "max_tokens": 200,
        "temperature": 1
    }'
```

## Results
<div align=center>
    <img src="./doc/result.png"/>
</div>

### Accuracy
`DCU accuracy is consistent with GPU; inference frameworks: vllm, sglang.`

## Source Repository & Feedback
- https://developer.sourcefind.cn/codes/modelzoo/glm-5_vllm

## References
- https://github.com/zai-org/GLM-5