Commits · b91b3028d2a7982831db3af07260297f8fffeebb · OpenDAS / vllm_cscc

04 Feb, 2026 1 commit

feat: Support shared experts fusion. · b91b3028

wanglong3 authored Jan 29, 2026

feat: support moe sum when topk==9

bugfix: Fix MTP model load error when shared experts fusion is enabled.

b91b3028

26 Jan, 2026 1 commit
- fix local kv_cache_dtype_str · 718337a7
  zhuwenwen authored Jan 26, 2026
  
  718337a7
23 Jan, 2026 1 commit
- support fp8 KV cache for flash attention, add VLLM_USE_QUERY_QUANT to allow disabling q quant (todo) · b3062dab
  zhuwenwen authored Jan 23, 2026
  
  b3062dab
22 Jan, 2026 2 commits
- Add epsp with zero overhead · cc4d1002
  王敏 authored Jan 22, 2026
  
  cc4d1002
- Optimize epsp code · 9135afe4
  王敏 authored Jan 22, 2026
  
  9135afe4
21 Jan, 2026 4 commits

update VLLM_USE_FUSED_RMS_ROPE=0 (default) · 0d5dd2da
zhuwenwen authored Jan 21, 2026
```
for qwen3, VLLM_USE_FUSED_RMS_ROPE=1 (default)
```
0d5dd2da

feat(moe/marlin): auto-detect and pre-pack Marlin W16A16 MoE (remove the manual switch) · de588fab

laibao authored Jan 21, 2026

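  - Remove the VLLM_USE_MARLIN_W16A16_MOE environment variable
  - During initialization, probe via lightop and cache the result in _marlin_w16a16_moe_enabled; when the conditions are met, force use_nn_moe=False
  - After weight loading, run a one-time Marlin pack based on the cached result; at runtime, take the Marlin fast path according to the packed flag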

de588fab
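
The probe-once flow described in de588fab can be pictured roughly as follows. This is a minimal sketch, not the repository's code: `probe_marlin_support`, `MoELayer`, and the stand-in packing step are hypothetical names chosen for illustration. Only `_marlin_w16a16_moe_enabled` and `use_nn_moe` come from the commit message.

```python
def probe_marlin_support(has_lightop_kernel, dtype):
    """Hypothetical probe; stands in for the lightop-based capability check."""
    return has_lightop_kernel and dtype in ("float16", "bfloat16")


class MoELayer:
    """Illustrative layer: decides once at init, packs once after loading."""

    def __init__(self, has_lightop_kernel, dtype):
        self.use_nn_moe = True
        self.weights = []
        self.packed = False
        # Probe once and cache the answer instead of reading an env switch.
        self._marlin_w16a16_moe_enabled = probe_marlin_support(
            has_lightop_kernel, dtype)
        if self._marlin_w16a16_moe_enabled:
            self.use_nn_moe = False

    def process_weights_after_loading(self):
        # One-time pack driven by the cached flag; repeated calls are no-ops.
        if self._marlin_w16a16_moe_enabled and not self.packed:
            self.weights = [("packed", w) for w in self.weights]  # stand-in pack
            self.packed = True

    def forward_path(self):
        # Runtime dispatch looks only at the packed flag.
        return "marlin" if self.packed else "default"
```

Caching the probe result means the runtime path never re-evaluates the conditions; it only checks whether packing actually happened.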

perf(qwen3): fuse q/k RMSNorm + RoPE · 7cd7bf8a

laibao authored Jan 21, 2026

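Add a VLLM_USE_FUSED_RMS_ROPE branch that takes the fused path
Register torch.ops.vllm.rms_rotary_embedding_fuse (via direct_register_custom_op)
Automatically convert cos_sin_cache to the target device/dtype and cache it, avoiding a repeated copy on every call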

7cd7bf8a
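
The cos_sin_cache point in 7cd7bf8a amounts to memoizing the converted table per (device, dtype). A minimal sketch under that assumption is shown below; `CosSinCacheHolder` and the injected `convert` callable are hypothetical, and the real change may store the converted tensor differently.

```python
class CosSinCacheHolder:
    """Keeps one converted copy of a cos/sin table per (device, dtype)."""

    def __init__(self, base_cache, convert):
        self._base = base_cache
        # `convert` is injected so the sketch runs without torch; with torch
        # it could be: lambda t, dev, dt: t.to(device=dev, dtype=dt)
        self._convert = convert
        self._converted = {}
        self.num_conversions = 0  # counts how often a real copy happened

    def get(self, device, dtype):
        key = (device, dtype)
        if key not in self._converted:
            # First request for this target: convert once and remember it.
            self._converted[key] = self._convert(self._base, device, dtype)
            self.num_conversions += 1
        return self._converted[key]


# Stand-in conversion that just tags the table with its target.
holder = CosSinCacheHolder([0.0, 1.0], lambda t, dev, dt: (dev, dt, tuple(t)))
first = holder.get("cuda:0", "bfloat16")
second = holder.get("cuda:0", "bfloat16")  # served from the cache, no copy
```

Keying on both device and dtype keeps the cache correct if the same layer is ever called with a different target.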

Integrate fused_moe_fp8 with lmslim · 5b7f2c7d
SAC_fanth authored Jan 21, 2026

5b7f2c7d

20 Jan, 2026 2 commits
- feat: Support w8a8-fp8 GEMM backend. · 900f4720
  wanglong3 authored Jan 17, 2026
  
  900f4720
- update VLLM_USE_TOPK_RENORM · 5a1e9359
  zhuwenwen authored Jan 20, 2026
  
  5a1e9359
19 Jan, 2026 2 commits
- remove SUPPORT_MOE_MARLIN_W16A16 · 564cbe7a
  zhuwenwen authored Jan 19, 2026
  
  564cbe7a
- [qwen3-235b] MoE(TN&NN) configs for nmz TP=8 · 0328ef06
  zhuwenwen authored Jan 19, 2026
```
[qwen3-480b] MoE(TN) configs for nmz TP=8
```
  0328ef06
17 Jan, 2026 5 commits
- Optimize deepep-related code · 76695c0a
  王敏 authored Jan 17, 2026
  
  76695c0a
- set VLLM_USE_FLASH_ATTN_FP8=1 and VLLM_USE_FLASH_MLA_FP8=1 · 25e16eea
  zhuwenwen authored Jan 17, 2026
  
  25e16eea
- set VLLM_USE_FUSED_FILL_RMS_CAT=1 · cf7d1166
  zhuwenwen authored Jan 17, 2026
  
  cf7d1166
- update q_quant dtype · 8f30468c
  zhuwenwen authored Jan 17, 2026
  
  8f30468c
- fix: update_state, improve performance and remove redundant operations · e9cfa85e
  jujl1 authored Jan 17, 2026
  
  e9cfa85e
16 Jan, 2026 7 commits
- update unified_attention_with_output_fake · 9d16d5aa
  zhuwenwen authored Jan 16, 2026
  
  9d16d5aa
- add VLLM_USE_FUSED_CACHE_QUANT_BMM_MLA to use fused rmsnorm + contiguous +... · 9dd70f0e
  zhuwenwen authored Jan 16, 2026
```
add VLLM_USE_FUSED_CACHE_QUANT_BMM_MLA to use fused rmsnorm + contiguous + rope(for dpsk-v3) + concat_and_cache_mla + q quant, control bmm(todo) + cat +mla (fp8)
```
  9dd70f0e
- MoE router capture: add router_capture toolchain and unified envs configuration · a2f0ce42
  laibao authored Jan 16, 2026
```
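Add environment variables VLLM_MOE_ROUTER_CAPTURE / DIR / RANK / MAX_LAYERS / NUM_TOKENS_* to toggle capture and control filtering
Add router_capture.py, which captures router logits bucketed by num_tokens and writes them to disk
Hook the capture logic into qwen3_moe; disabled by default, records only when enabled
Pin skip_profile / skip_stack_funcs to enabled by default, to avoid capturing warmup/profile shapes
Unify the configuration entry point in vllm.envs as the runtime baseline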
```
  a2f0ce42
- update custom_all_reduce · 2c560dc5
  zhuwenwen authored Jan 16, 2026
  
  2c560dc5
- set VLLM_USE_FUSED_RMS_ROPE=1 · d4df43b0
  zhuwenwen authored Jan 16, 2026
  
  d4df43b0
- set VLLM_CUSTOM_CACHE=1 · 30559839
  zhuwenwen authored Jan 16, 2026
  
  30559839
- add SUPPORT_MOE_MARLIN_W16A16 to use moe marlin on bw · cabf690f
  zhuwenwen authored Jan 16, 2026
  
  cabf690f
15 Jan, 2026 3 commits
- Switch default w8a8 gemm impl to blaslt. · 5663e01d
  wanglong3 authored Jan 15, 2026
  
  5663e01d
- Fix where VLLM_USE_LIGHTOP_MOE_SUM_MUL_ADD is set for AWQ models · ab66909d
  yangql authored Jan 15, 2026
  
  ab66909d
- Fix AWQ-quantized inference bug and accuracy issues for DeepSeek MoE models · 475dcaa0
  yangql authored Jan 15, 2026
  
  475dcaa0
14 Jan, 2026 2 commits
- fix return of schedule · efd51772
  zhuwenwen authored Jan 14, 2026
  
  efd51772
- set VLLM_USE_PD_SPLIT=1 · 69cfaa53
  zhuwenwen authored Jan 14, 2026
  
  69cfaa53
13 Jan, 2026 3 commits
- fix: input ids update bug with pp + chunked prefill under multiple concurrent requests · be41974c
  jujl1 authored Jan 09, 2026
  
  be41974c
- fix: error raised by the MTP discard code · 0cf05716
  jujl1 authored Jan 09, 2026
  
  0cf05716
- fix: in PP scenarios, tokens wrongly discarded during the decode phase caused a hang · 62a5b28f
  laibao authored Jan 13, 2026
```
  - Once decode has started, no longer discard the sampled token as partial prefill, avoiding a trailing loop with new_token_ids=[]
```
  62a5b28f
12 Jan, 2026 6 commits
- remove log info · ce5b3c9a
  zhuwenwen authored Jan 12, 2026
  
  ce5b3c9a
- Modify the check for asymmetric splitting · f384ee43
  xiabo authored Jan 12, 2026
  
  f384ee43
- [feat] Add dp attention support · cda54326
  王敏 authored Jan 12, 2026
  
  cda54326
- fix: assertion error when the fused graph is not enabled. · 9cf5c476
  wujl5 authored Jan 12, 2026
  
  9cf5c476
- Adapt to block-wise and channel-wise fp8 · db23fcac
  SAC_fanth authored Jan 12, 2026
  
  db23fcac
- fix: input ids update bug with pp + chunked prefill under multiple concurrent requests · 7e3e2339
  jujl1 authored Jan 09, 2026
  
  7e3e2339
09 Jan, 2026 1 commit

perf(fused-moe): pre-pack Marlin W16A16 MoE weights to lower peak GPU memory during warmup · bfaac804

laibao authored Jan 09, 2026

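  - In the post-load hook, run a per-expert Marlin pack on w13/w2 and replace them with the packed parameters
  - The Marlin fast path accepts only packed weights; if they were not pre-packed it fails fast, avoiding the memory peak and unpredictability of runtime packing
  - Update the Marlin wrapper's arguments and shape inference (compute K/N from the packed layout)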

bfaac804
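
The fail-fast rule in bfaac804 can be illustrated with a small sketch. Everything here is a hypothetical stand-in: `PackedWeight` only marks a weight as packed, `pack_experts` does no real Marlin layout work, and the layer is a plain dict rather than a module.

```python
class PackedWeight:
    """Marker wrapper; hypothetical stand-in for a Marlin-packed tensor."""

    def __init__(self, data):
        self.data = data


def pack_experts(weights):
    # Pack each expert separately so only one expert is in flight at a time.
    return [PackedWeight(w) for w in weights]


def post_load_hook(layer):
    # Replace the raw parameters with their packed form, once, after loading.
    layer["w13"] = pack_experts(layer["w13"])
    layer["w2"] = pack_experts(layer["w2"])


def marlin_fast_path(layer):
    # Refuse to pack lazily: an unpacked weight here is a setup error.
    for name in ("w13", "w2"):
        if not all(isinstance(w, PackedWeight) for w in layer[name]):
            raise RuntimeError(
                f"{name} is not pre-packed; refusing to pack at runtime")
    return len(layer["w13"])  # number of experts handled by the fast path
```

Raising instead of packing on demand keeps the memory cost of packing in the load phase, where it is predictable, rather than in warmup or serving.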