Commits · c1858b7ec8aa571dc0c0e00aded01019cca6a7e6 · OpenDAS / vllm_cscc

05 Feb, 2026 13 commits

[Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943) · c1858b7e

Aaron Hao authored Feb 05, 2026


Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>

c1858b7e

[Bugfix] Fix step3p5 parser when using mtp (#33690) · 82914d2a
Mario Hong authored Feb 06, 2026
```
Signed-off-by: mariohong <mariohong128@gmail.com>
```
82914d2a
[Bugfix] Fix corner case of sparse embedding (#33886) · 1c3a221d
wang.yuqi authored Feb 05, 2026
```
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
```
1c3a221d

Enable Cross layers KV cache layout at NIXL Connector V2 (#33339) · 8322d4e4

liranschour authored Feb 05, 2026


Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

8322d4e4

[ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) (#32710) · 3e472e81

Andreas Karatzas authored Feb 05, 2026


Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>

3e472e81

[Refactor] Move `task` outside of `PoolingParams.verify` (#33796) · 038914b7

Cyrus Leung authored Feb 05, 2026


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>

038914b7

[KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522) · 2abd9759
Mark McLoughlin authored Feb 05, 2026
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
2abd9759

[Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result (#33837) · 1f70313e

Andreas Karatzas authored Feb 05, 2026


Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>

1f70313e

[CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly (#33840) · c1395f72

rasmith authored Feb 04, 2026


Signed-off-by: Randall Smith <Randall.Smith@amd.com>

c1395f72

[Minor] Include `StreamingInput` in inputs package (#33856) · add9f1fb
Nick Hill authored Feb 04, 2026
```
Signed-off-by: Nick Hill <nickhill123@gmail.com>
```
add9f1fb
[CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode (#32762) · fb1270f1
Andreas Karatzas authored Feb 04, 2026
```
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
```
fb1270f1

[CI][torch.compile] Reduce e2e fusion test time (#33293) · 4d951353

Luka Govedič authored Feb 04, 2026


Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

4d951353

feat: Add ColBERT late interaction model support (#33686) · 439afa4e

Ilya Boytsov authored Feb 05, 2026


Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com>
Signed-off-by: Ilya Boytsov <boytsovpanamera@mail.ru>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>

439afa4e

04 Feb, 2026 13 commits

[Core] Don't schedule spec tokens with prefill chunks (#33652) · fa4e0fb0
Nick Hill authored Feb 04, 2026
```
Signed-off-by: Nick Hill <nickhill123@gmail.com>
```
fa4e0fb0
Revert "[torch.compile] Significantly speed up cold start times" (#33820) · 9f14c922
Richard Zou authored Feb 04, 2026
```
Signed-off-by: Richard Zou <zou3519@gmail.com>
```
9f14c922
[Bugfix] Support `RotaryEmbedding` CustomOp for gpt-oss (#33800) · 4292c90a
Simon Danielsson authored Feb 04, 2026
```
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
```
4292c90a
[rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm (#33308) · 2f6d17cb
kourosh hakhamaneshi authored Feb 04, 2026
```
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
```
2f6d17cb
[Bugfix] Fix interns1-pro initialization and PP (#33793) · 192ad464
Isotr0py authored Feb 05, 2026
```
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
```
192ad464
[Bugfix] Fix `normalize` still being passed to `PoolerConfig` (#33794) · 80f921ba
Cyrus Leung authored Feb 04, 2026
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
80f921ba

[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads (#32255) · 8e326908

Or Ozeri authored Feb 04, 2026



Fixes a not-yet-reported case where it was possible for blocks to be
freed by an abort before an async transfer completed, resulting
in corrupted KV data.
Signed-off-by: Or Ozeri <oro@il.ibm.com>

8e326908
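
The fix described above can be pictured with a small toy model. This is only a sketch of the general idea (defer the free until the in-flight transfer reports completion); the class and method names are hypothetical and are not taken from the vLLM scheduler.

```python
class DeferredBlockFreer:
    """Toy model: blocks of an aborted request are released only after
    any in-flight async load for that request has finished.
    All names here are hypothetical, not vLLM APIs."""

    def __init__(self):
        self.free_blocks = set()    # block ids available for reuse
        self.owned = {}             # request id -> block ids it holds
        self.loading = set()        # request ids with an async load in flight
        self.pending_free = set()   # request ids aborted while still loading

    def start_load(self, req_id, blocks):
        self.owned[req_id] = set(blocks)
        self.loading.add(req_id)

    def abort(self, req_id):
        if req_id in self.loading:
            # The transfer may still write into these blocks: defer the free.
            self.pending_free.add(req_id)
        else:
            self.free_blocks |= self.owned.pop(req_id, set())

    def load_finished(self, req_id):
        self.loading.discard(req_id)
        if req_id in self.pending_free:
            # Safe now: nothing is writing into the blocks any more.
            self.pending_free.discard(req_id)
            self.free_blocks |= self.owned.pop(req_id, set())


freer = DeferredBlockFreer()
freer.start_load("req-1", [1, 2])
freer.abort("req-1")
freed_before_completion = set(freer.free_blocks)   # still empty
freer.load_finished("req-1")
freed_after_completion = set(freer.free_blocks)    # blocks 1 and 2
```

Freeing immediately in `abort` would let another request reuse blocks 1 and 2 while the transfer is still writing to them, which is the corruption the commit message describes.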

[Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290) · 4403e3ed

zhanqiuhu authored Feb 04, 2026

Add labeled Prometheus metrics to distinguish where prompt tokens come
from in P/D disaggregated deployments.

In P/D disaggregation, decode instances receive KV cache from prefill instances.
Currently, decode reports inflated prompt throughput because it counts all
prompt tokens as "computed", even though most were transferred.

This PR adds labeled metrics so users can understand actual compute work vs
transferred work:

vllm:prompt_tokens_by_source_total{source="local_compute"} # Tokens prefilled locally
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer
vllm:prompt_tokens_by_source_total{source="local_cache_hit"} # Tokens from local prefix cache
vllm:prompt_tokens_cached_total # Total cached (local + external, -1 when all
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>

4403e3ed
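
As an illustration of how the labeled counter might be consumed, the sketch below parses Prometheus text-format lines for `vllm:prompt_tokens_by_source_total` and computes each source's share of prompt tokens. The sample values and the extra `engine` label are made up for the example, and the helper names are hypothetical.

```python
import re

METRIC = "vllm:prompt_tokens_by_source_total"
# One sample line: name{labels} value [optional timestamp]
LINE_RE = re.compile(
    r"^" + re.escape(METRIC) + r"\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+-?\d+)?$"
)
SOURCE_RE = re.compile(r'(?:^|,)\s*source="([^"]*)"')


def parse_prompt_tokens_by_source(text):
    """Sum the counter per `source` label, ignoring all other lines."""
    totals = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        source = SOURCE_RE.search(match.group("labels"))
        if not source:
            continue
        key = source.group(1)
        totals[key] = totals.get(key, 0.0) + float(match.group("value"))
    return totals


def shares_by_source(totals):
    """Return each source's fraction of all prompt tokens."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {key: 0.0 for key in totals}
    return {key: value / grand_total for key, value in totals.items()}


# Made-up scrape from a decode instance (values are illustrative only).
sample = """\
# TYPE vllm:prompt_tokens_by_source_total counter
vllm:prompt_tokens_by_source_total{engine="0",source="local_compute"} 1000.0
vllm:prompt_tokens_by_source_total{engine="0",source="external_kv_transfer"} 8000.0
vllm:prompt_tokens_by_source_total{engine="0",source="local_cache_hit"} 1000.0
"""

totals = parse_prompt_tokens_by_source(sample)
shares = shares_by_source(totals)
```

With these made-up numbers, only a small fraction of the decode instance's prompt tokens come from `local_compute`, which is the distinction the labeled metric is meant to expose.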

[Feature] Enable `TRITON_ATTN` for Batch Invariance (#33688) · 45f8fd6f
Frank Wang authored Feb 03, 2026
```
Signed-off-by: frankwang28 <frank.wbb@hotmail.com>
```
45f8fd6f
[CPU] Split attention dispatch by head_dim alignment (#32161) · 4dffc5e0
R3hankhan authored Feb 04, 2026
```
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
```
4dffc5e0
[1/N] Initial Implementation of Parser for ResponsesAPI (#32712) · e1bf04b6
Andrew Xia authored Feb 03, 2026
```
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
```
e1bf04b6
[Bugfix] Fix torchrun PP broadcast deadlock with async scheduling (#33701) · 02080179
Isotr0py authored Feb 04, 2026
```
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
```
02080179
[Frontend][4/n] Make pooling entrypoints request schema consensus | ScoreRequest (#33060) · 1b8fe6f7
wang.yuqi authored Feb 04, 2026
```
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
```
1b8fe6f7

03 Feb, 2026 11 commits
[BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729) · 52ee2102
Nick Hill authored Feb 03, 2026
```
Signed-off-by: Nick Hill <nickhill123@gmail.com>
```
52ee2102
[Voxtral Realtime] Change name (#33716) · 3f7662d6
Patrick von Platen authored Feb 03, 2026
```
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
```
3f7662d6
Turn `@config` into a `dataclass_transform` (#31541) · 61e632ae
Harry Mellor authored Feb 03, 2026
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
61e632ae
[torch.compile] Significantly speed up cold start times (#33641) · b1bb18de
Richard Zou authored Feb 03, 2026
```
Signed-off-by: Richard Zou <zou3519@gmail.com>
```
b1bb18de
Feat/add nemotron nano v3 tests (#33345) · 4bc913ae
shaharmor98 authored Feb 03, 2026
4bc913ae
[Models] Intern-S1-Pro (#33636) · a3acfa10
zxy authored Feb 03, 2026
```
Signed-off-by: zxy <zhou0493@e.ntu.edu.sg>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
```
a3acfa10
Fix offline test for Transformers v5 (#33682) · f6af3462
Harry Mellor authored Feb 03, 2026
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
f6af3462
[Refactor] Clean up pooling serial utils (#33665) · 83449a5f
Cyrus Leung authored Feb 03, 2026
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
83449a5f
[CI/Build] Investigate torchrun distributed tests hanging issue (#33650) · 32e84fa1
Isotr0py authored Feb 03, 2026
```
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
```
32e84fa1
[Misc] Remove deprecated VLLM_ALL2ALL_BACKEND environment variable (#33535) · b95cc501
杨朱 · Kiki authored Feb 03, 2026
```
Signed-off-by: carlory <baofa.fan@daocloud.io>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
```
b95cc501
[Frontend] Add sampling parameters to Responses API (#32609) · 4c4b6f7a
Daniel Mescheder authored Feb 03, 2026
```
Signed-off-by: Daniel Mescheder <dmesch@amazon.com>
Co-authored-by: Daniel Mescheder <dmesch@amazon.com>
```
4c4b6f7a
02 Feb, 2026 3 commits

[Voxtral Realtime] Introduce global log mel max (#33574) · 5019c59d

Patrick von Platen authored Feb 02, 2026


Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

5019c59d

fix memory for online fp8 quantization with streaming weight load (#31914) · 0130223b
Vasiliy Kuznetsov authored Feb 02, 2026
```
Signed-off-by: vasiliy <vasiliy@fb.com>
```
0130223b

Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. (#32005) · ffe1fc7a

yugong333 authored Feb 02, 2026


Signed-off-by: Yu Gong <yu3.gong@gmail.com>

ffe1fc7a