Update README.md

Commit 09c6cf7c authored Jun 04, 2026 by raojy

Showing with 75 additions and 3 deletions

README.md +75 -3

--- a/README.md
+++ b/README.md
@@ -21,9 +21,10 @@ Qwen3.5 achieves efficient native multimodal training through heterogeneous infrastructure: in the vision
 | vllm |      0.18.1+das.dtk2604     |
 | triton |      3.6.0+das.opt1.dtk2604     |
 | torch |   2.10.0+das.opt1.dtk2604   |
+| SGLang |   0.5.10rc0+das.opt2.alpha.dtk2604   |
-Currently only the custom image is supported: harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.18.1-ubuntu22.04-dtk26.04-py3.10-20260510-2242
+- **vLLM currently supports only the custom image:** harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.18.1-ubuntu22.04-dtk26.04-py3.10-20260510-2242
+- **For SGLang inference, use:** harbor.sourcefind.cn:5443/dcu/admin/base/custom:sglang0.5.10rc0-ubuntu22.04-dtk26.04-py3.10-20260518
 - Adjust the `-v` mount path to match where your model is actually stored
 ```bash
 docker run -it \
@@ -178,14 +179,85 @@ curl http://localhost:8001/v1/chat/completions   \
    }'
 ```
+### SGLang
+#### Single-node inference
+##### BF16
+1. Start the server with `sglang serve`, using `Qwen/Qwen3.5-35B-A3B` as an example (this command is for chips other than K100AI)
+```bash
+export SGLANG_ENABLE_SPEC_V2=1
+export SGLANG_USE_FUSED_TOPK_SOFTMAX=1
+export SGLANG_USE_LIGHTOP=1
+export SGLANG_USE_CAUSAL_CONV1D=1
+export SGLANG_USE_AITER_LINEAR_ATTN=1
+export SGLANG_USE_CUDA_IPC_TRANSPORT=1
+sglang serve --model-path Qwen/Qwen3.5-35B-A3B \
+    --attention-backend fa3 \
+    --mm-attention-backend fa3 \
+    --enable-piecewise-cuda-graph \
+    --tp-size 1 --pp-size 1 \
+    --page-size 64  \
+    --mem-fraction-static 0.95 \
+    --mamba-scheduler-strategy extra_buffer \
+    --kv-cache-dtype fp8_e5m2  \
+    --trust-remote-code \
+    --chunked-prefill-size -1 --context-length 8192 --port 8001
+```
+2. Start the server with `sglang serve`, using `Qwen/Qwen3.5-35B-A3B` as an example (this command is for K100AI chips)
+```bash
+export SGLANG_ENABLE_SPEC_V2=1
+export SGLANG_USE_FUSED_TOPK_SOFTMAX=1
+export SGLANG_USE_LIGHTOP=1
+export SGLANG_USE_CAUSAL_CONV1D=1
+export SGLANG_USE_AITER_LINEAR_ATTN=1
+export SGLANG_USE_CUDA_IPC_TRANSPORT=1
+export HIP_VISIBLE_DEVICES=4,5,6,7
+export SGLANG_KV_LAYOUT_DCU_FA=0
+sglang serve --model-path Qwen/Qwen3.5-35B-A3B \
+    --attention-backend fa3 \
+    --mm-attention-backend fa3 \
+    --disable-cuda-graph \
+    --tp-size 2 --pp-size 1 \
+    --page-size 64  \
+    --mem-fraction-static 0.9 \
+    --kv-cache-dtype bf16  \
+    --trust-remote-code \
+    --chunked-prefill-size -1 --port 8001 \
+    --disable-radix-cache
+```
+3. Send a request from the client
+```bash
+curl http://localhost:8001/v1/chat/completions   \
+    -H "Content-Type: application/json"  \
+    -d '{
+        "model": "Qwen/Qwen3.5-35B-A3B",
+        "messages": [
+          {"role": "user", "content": "Type \"I love Qwen3.5\" backwards"}
+        ],
+        "temperature": 0.8,
+        "chat_template_kwargs": {
+            "enable_thinking": true
+        }
+    }'
+```
 ## Results
 <div align=center>
    <img src="./doc/result-dcu.png"/>
 </div>
 ### Accuracy
-Accuracy on DCU matches GPU; inference framework: vllm.
+- Inference framework: SGLang
+- Test datasets: humaneval, gsm8k
+- Accelerator used: BW1000
+| Model | humaneval | gsm8k |
+| :------: | :------: | :------: |
+| Qwen3.5-27B | 0.92 | 0.98 |
+| Qwen3.5-35B-A3B | 0.92 | 0.98 |
+| Qwen3.5-122B-A10B | 0.93 | 0.98 |
 ## Source Repository and Issue Feedback
 - https://developer.sourcefind.cn/codes/modelzoo/qwen3.5
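
The client request added in the SGLang section can also be wrapped in a small script, which makes it easier to swap the model name or port when switching between the two serve commands. The following is a minimal sketch, not part of the README: the helper names `build_payload` and `send_chat` are hypothetical, the model name is taken from the serve command, and port 8001 is taken from the curl example, so it assumes the server is actually listening on that port. The escaping only handles backslashes and double quotes in the prompt.

```bash
#!/usr/bin/env sh
# Hypothetical helpers (not part of the README): build and send the chat request.

# build_payload MODEL PROMPT [ENABLE_THINKING]
# Prints the JSON body used by the curl example. The function body runs in a
# subshell, so its variables do not leak into the caller.
build_payload() (
    model=$1
    thinking=${3:-true}
    # Escape backslashes first, then double quotes, so the prompt stays a valid
    # JSON string. Assumption: the prompt has no newlines or control characters.
    prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "temperature": 0.8, "chat_template_kwargs": {"enable_thinking": %s}}\n' \
        "$model" "$prompt" "$thinking"
)

# send_chat MODEL PROMPT [PORT]
# Posts the payload to the OpenAI-compatible endpoint; PORT defaults to 8001.
send_chat() {
    curl -s "http://localhost:${3:-8001}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d "$(build_payload "$1" "$2")"
}

# Print the payload without contacting any server.
build_payload "Qwen/Qwen3.5-35B-A3B" 'Type "I love Qwen3.5" backwards'
```

Once the server is up, `send_chat "Qwen/Qwen3.5-35B-A3B" 'Type "I love Qwen3.5" backwards'` sends the same request as the curl example; pass a third argument to target a different port.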