- 02 Mar, 2026 4 commits
- 28 Feb, 2026 2 commits
- 27 Feb, 2026 1 commit
-
zhuwenwen authored
feat(sampler): add a reduced top-k + top-p sampling fast path to cut full-vocabulary softmax overhead See merge request dcutoolkit/deeplearing/vllm!447
-
- 26 Feb, 2026 1 commit
-
laibao authored
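Add the VLLM_V1_USE_REDUCED_TOPK_TOPP_SAMPLER switch and document the scenarios it applies to. Precompute max_top_k/has_any_no_top_k in the V1 GPU input batch; the native sampler takes the reduced fast path when the conditions are met and falls back automatically on errors.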
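The idea in the commit above can be sketched in plain Python. This is a minimal illustration, not the code from the merge request: the function names (`batch_topk_summary`, `reference_probs`, `reduced_probs`, `sample`) are hypothetical, and the real sampler works on batched GPU tensors. The sketch assumes top-k is applied before top-p, in which case normalizing over only the top-k candidates gives the same distribution as masking and running softmax over the whole vocabulary.

```python
import heapq
import math
import random


def batch_topk_summary(top_ks):
    """Per-batch precompute: largest top_k and whether any request has none."""
    has_any_no_top_k = any(k is None for k in top_ks)
    max_top_k = max((k for k in top_ks if k is not None), default=0)
    return max_top_k, has_any_no_top_k


def _top_p_cut(candidates, probs, top_p):
    """Keep the shortest descending prefix whose mass reaches top_p."""
    kept, cum = {}, 0.0
    for idx, p in zip(candidates, probs):
        kept[idx] = p
        cum += p
        if cum >= top_p:
            break
    z = sum(kept.values())
    return {idx: p / z for idx, p in kept.items()}


def reference_probs(logits, top_k, top_p):
    """Full path: mask everything outside top-k, softmax over the whole vocab."""
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    cand = order[:top_k]
    keep = set(cand)
    masked = [x if i in keep else float("-inf") for i, x in enumerate(logits)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return _top_p_cut(cand, [exps[i] / total for i in cand], top_p)


def reduced_probs(logits, top_k, top_p):
    """Reduced path: select top-k first, then softmax over those k entries only."""
    cand = heapq.nlargest(top_k, range(len(logits)), key=logits.__getitem__)
    peak = logits[cand[0]]
    exps = [math.exp(logits[i] - peak) for i in cand]
    total = sum(exps)
    return _top_p_cut(cand, [e / total for e in exps], top_p)


def sample(logits, top_k, top_p, rng, use_reduced=True):
    """Use the reduced path when a real top_k exists; otherwise fall back."""
    k = top_k if top_k else len(logits)
    probs = None
    if use_reduced and k < len(logits):
        try:
            probs = reduced_probs(logits, k, top_p)
        except Exception:
            probs = None  # any failure falls back to the full path
    if probs is None:
        probs = reference_probs(logits, k, top_p)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

A request with no top_k (here `None`) cannot benefit from the reduction, which is presumably why the batch tracks `has_any_no_top_k` alongside `max_top_k`.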
-
- 25 Feb, 2026 3 commits
- 24 Feb, 2026 9 commits
-
SAC_fanth authored
-
zhuwenwen authored
V0.15.1 dev w4a8+pp balance See merge request dcutoolkit/deeplearing/vllm!442
-
jujl1 authored
# Conflicts: # vllm/envs.py
-
zhuwenwen authored
feat(moe): support enabling and configuring Qwen3 router logits capture via environment variables See merge request dcutoolkit/deeplearing/vllm!441
-
zhuwenwen authored
perf(v1): add an optional fast token-id copy path See merge request dcutoolkit/deeplearing/vllm!440
-
laibao authored
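Add the router_capture tool, which filters MoE router logits by num_tokens/rank and writes them to disk. Hook the capture call into Qwen3MoeSparseMoeBlock and skip it automatically under torch.compile. Add the VLLM_MOE_ROUTER_CAPTURE* environment variables.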
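A filter of this kind might look roughly like the sketch below. Everything here is hypothetical: `CaptureConfig`, `should_capture`, `maybe_capture`, the JSON file layout, and the field names are illustrative stand-ins. The actual tool's interface and the exact VLLM_MOE_ROUTER_CAPTURE* variable names are not shown in this log, so the sketch takes its settings as plain arguments instead of reading the environment.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class CaptureConfig:
    """Hypothetical settings; the real tool reads them from environment variables."""
    enabled: bool
    out_dir: str
    ranks: Optional[FrozenSet[int]] = None  # None means every rank
    min_tokens: int = 1
    max_tokens: int = 1 << 30


def should_capture(cfg, num_tokens, rank, compiling=False):
    """Filter by rank and token count; never capture while compiling."""
    if not cfg.enabled or compiling:
        return False
    if cfg.ranks is not None and rank not in cfg.ranks:
        return False
    return cfg.min_tokens <= num_tokens <= cfg.max_tokens


def maybe_capture(cfg, router_logits, rank, step, compiling=False):
    """Write one JSON file per captured step; return its path or None."""
    num_tokens = len(router_logits)
    if not should_capture(cfg, num_tokens, rank, compiling):
        return None
    os.makedirs(cfg.out_dir, exist_ok=True)
    path = os.path.join(cfg.out_dir, f"router_rank{rank}_step{step}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"rank": rank, "num_tokens": num_tokens,
                   "logits": router_logits}, f)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = CaptureConfig(enabled=True, out_dir=tmp, ranks=frozenset({0}))
        print(maybe_capture(cfg, [[0.1, 0.9], [0.7, 0.3]], rank=0, step=1))
```

Skipping capture while compiling keeps file I/O out of the traced graph, which matches the commit's note about torch.compile.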
-
laibao authored
- Add the environment variable `VLLM_V1_FAST_TOKEN_ID_COPY` (off by default) - Cache the prompt token ids as an int32 numpy array in `CachedRequestState` - When enabled, copy prompt/output token ids in `InputBatch` with `np.copyto`
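The copy path described above could be sketched as follows. The class `CachedRequest` and the function `write_row` are hypothetical stand-ins for `CachedRequestState` and the `InputBatch` logic, and the way the switch is parsed (`"1"` means on) is an assumption; only `np.copyto` and the int32 caching come from the commit message.

```python
import os

import numpy as np

# Assumption: the switch is treated as on only when set to "1".
FAST_COPY = os.environ.get("VLLM_V1_FAST_TOKEN_ID_COPY", "0") == "1"


class CachedRequest:
    """Hypothetical stand-in that caches prompt ids once as int32."""

    def __init__(self, prompt_token_ids):
        self.prompt_token_ids = list(prompt_token_ids)
        # Converted once at request creation, reused on every batch insert.
        self.prompt_token_ids_np = np.asarray(prompt_token_ids, dtype=np.int32)
        self.output_token_ids = []


def write_row(token_ids_cpu, row, req, fast=FAST_COPY):
    """Copy prompt and output ids into one row of the batch buffer."""
    n_prompt = len(req.prompt_token_ids)
    n_out = len(req.output_token_ids)
    end = n_prompt + n_out
    if fast:
        # Array-to-array copy, no per-call list conversion for the prompt.
        np.copyto(token_ids_cpu[row, :n_prompt], req.prompt_token_ids_np)
        if n_out:
            out = np.asarray(req.output_token_ids, dtype=np.int32)
            np.copyto(token_ids_cpu[row, n_prompt:end], out)
    else:
        # Default path: slice assignment converts the Python lists each time.
        token_ids_cpu[row, :n_prompt] = req.prompt_token_ids
        if n_out:
            token_ids_cpu[row, n_prompt:end] = req.output_token_ids
    return end
```

Both branches are meant to produce identical buffer contents; the fast branch only avoids repeating the list-to-array conversion for the prompt.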
-
zhuwenwen authored
Support qwen3 5 See merge request dcutoolkit/deeplearing/vllm!438
-
zhuwenwen authored
Support moe_fused_gate for non-DeepSeek models See merge request dcutoolkit/deeplearing/vllm!439
-
- 19 Feb, 2026 1 commit
-
王敏 authored
-
- 16 Feb, 2026 3 commits
- 14 Feb, 2026 1 commit
-
zhuwenwen authored
[feat] Support torch compile for glm4_moe_mtp and implement MTP cudagraph mode See merge request dcutoolkit/deeplearing/vllm!436
-
- 13 Feb, 2026 1 commit
-
王敏 authored
-
- 12 Feb, 2026 1 commit
-
zhuwenwen authored
-
- 11 Feb, 2026 5 commits
-
zhuwenwen authored
-
zhuwenwen authored
feat(moe): complete the end-to-end Marlin W16A16 MoE integration in v0.15 See merge request dcutoolkit/deeplearing/vllm!431
-
zhuwenwen authored
Integrate channel and block triton, plus channel-wise marlin See merge request dcutoolkit/deeplearing/vllm!430
-
laibao authored
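Reference and port the key commit logic from 011/vllm. Add the VLLM_USE_MOE_W16A16_TRITON switch and hook in lightop-based runtime capability detection with caching of the enablement result. After weight loading, pre-pack w13 and w2 for W16A16 Marlin. When W16A16 Marlin is enabled, keep the monolithic execution path and add a fast path for packed weights in fused_experts_impl. Fallback behavior when Marlin or lightop is unavailable is unchanged.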
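The "probe once, cache the result, fall back if unavailable" pattern from this commit can be illustrated with the standard library. This is only a sketch of the pattern: `backend_available` and `select_moe_path` are hypothetical names, the probe merely checks whether a module can be found, and the real detection against lightop is likely more involved.

```python
import functools
import importlib.util


@functools.lru_cache(maxsize=None)
def backend_available(module_name):
    """Probe for an optional backend once; later calls reuse the cached answer."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def select_moe_path(switch_on, module_name):
    """Take the packed-weight fast path only if requested and available."""
    if switch_on and backend_available(module_name):
        return "packed_fast_path"
    return "fallback"  # unchanged behavior when the backend is missing
```

Caching matters because the selection runs on every forward call, while availability of the backend does not change during the process lifetime.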
-
SAC_fanth authored
-
- 10 Feb, 2026 6 commits
- 09 Feb, 2026 2 commits