Commits · 541a2ef892720489f770569417bc1bc4436dbb21 · OpenDAS / vllm_cscc

07 Dec, 2025 3 commits
- [Misc][Core] Remove unused `req_index` increment in scheduler (#30176) · 1b0482b9
  Yifan Qiao authored Dec 07, 2025
```
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
```
  1b0482b9
- Revert "[Renderer] Separate out `RendererConfig` from `ModelConfig` (#30145)" (#30199) · e83b7e37
  Cyrus Leung authored Dec 07, 2025
  
  e83b7e37
- [Renderer] Separate out `RendererConfig` from `ModelConfig` (#30145) · 27f4c2fd
  Cyrus Leung authored Dec 07, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  27f4c2fd
06 Dec, 2025 2 commits
- [Model] Move `multimodal_cpu_fields` definition to field config (#30181) · 671427ef
  Cyrus Leung authored Dec 06, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  671427ef
- [Model Runner V2] Support min-p sampling (#30171) · a238cbd8
  Woosuk Kwon authored Dec 05, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  a238cbd8
05 Dec, 2025 7 commits

- [KVConnector][Feature] Support KV connector cache reset via /reset_prefix_cache (#27170) · adb31506
  Tova Movshovitz authored Dec 05, 2025

  Signed-off-by: tovam <tovam@pliops.com>
  Signed-off-by: Tova Movshovitz <tovam@pliops.com>
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

  adb31506
- [Compile] Conditional compilation. Introduce compile_ranges (#24252) · 4e26d3b0
  Ilya Markov authored Dec 05, 2025

  Signed-off-by: Luka Govedič <lgovedic@redhat.com>
  Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
  Signed-off-by: ilmarkov <markovilya197@gmail.com>
  Signed-off-by: Luka Govedič <luka.govedic@gmail.com>
  Signed-off-by: ProExpertProg <lgovedic@redhat.com>
  Co-authored-by: Luka Govedič <lgovedic@redhat.com>
  Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
  Co-authored-by: Luka Govedič <luka.govedic@gmail.com>

  4e26d3b0
- [Attention][UX][1/N] Add AttentionConfig and change attention env vars to CLI arguments (#26315) · 66e674cd
  Matthew Bonanni authored Dec 05, 2025

  Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
  Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
  Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>

  66e674cd
- [BugFix] Eagerly abort cancelled final-step requests (#29987) · dc264bce
  Nick Hill authored Dec 05, 2025

  Currently, when a request is cancelled while executing its final
  step, "completion" is handled by the normal stop processing (e.g.
  length or stop token), so the abort has no effect. This is typically
  not a problem, but when a KV connector is involved, the connector
  treats the request as having completed successfully rather than as
  having been aborted.

  This is problematic for disaggregated prefill, which frees KV
  cache blocks if the request was aborted but not if it completed
  successfully. Since the cancelled request will never be sent to
  the decode side, its KV cache blocks remain pinned until the fall-back
  timeout expires. The problem is exacerbated when many requests
  are cancelled and/or there are large prefills whose forward pass
  takes a long time (since the window is bigger).

  This PR fixes the problem by processing pending aborts
  immediately before processing model output in each step. Only
  aborts are processed, not new requests, since it is preferable for
  latency to process model outputs before new incoming requests.

  Fixes #26400.
  Signed-off-by: Nick Hill <nhill@redhat.com>

  dc264bce
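
The ordering described in this commit message can be illustrated with a toy loop. This is only a sketch under stated assumptions: `ToyEngine`, its queues, and `step_output` are hypothetical names invented for illustration and are not vLLM's actual classes or methods.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Hypothetical stand-in for an engine step loop (not vLLM code)."""

    running: set = field(default_factory=set)
    pending_aborts: deque = field(default_factory=deque)
    pending_new: deque = field(default_factory=deque)
    aborted: list = field(default_factory=list)
    finished: list = field(default_factory=list)

    def _drain_aborts(self):
        # Mark cancelled requests as aborted while they are still running.
        while self.pending_aborts:
            req_id = self.pending_aborts.popleft()
            if req_id in self.running:
                self.running.discard(req_id)
                self.aborted.append(req_id)

    def step_output(self, stopped_ids):
        # 1. Aborts first, so a request cancelled during its final step
        #    is recorded as aborted rather than as completed.
        self._drain_aborts()
        # 2. Normal stop processing for whatever is still running.
        for req_id in stopped_ids:
            if req_id in self.running:
                self.running.discard(req_id)
                self.finished.append(req_id)
        # 3. New requests are admitted only after outputs are handled.
        while self.pending_new:
            self.running.add(self.pending_new.popleft())
```

In this toy model, a request that is both cancelled and reaches a stop condition in the same step ends up in `aborted`, which is the outcome the commit message asks for.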

- [Feature] Add Layer-wise NVTX Support (#29990) · c2894d38
  Max Hu authored Dec 05, 2025

  Signed-off-by: Max Hu <hyoung2991@gmail.com>
  Signed-off-by: Max Hu <maxhu@nvidia.com>
  Co-authored-by: Max Hu <maxhu@nvidia.com>

  c2894d38
- [BugFix] Adding env variable to disable async grammar compilation (#29996) · 65ee9728
  Alec S authored Dec 05, 2025

  Signed-off-by: Alec Solder <alecs@fb.com>
  Signed-off-by: Alec S <10566873+alecsolder@users.noreply.github.com>
  Co-authored-by: Alec Solder <alecs@fb.com>
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>

  65ee9728
- [Bugfix] Correct num_q_heads on DCP for Flashinfer backends (#29487) · d698bb38
  Jingchun Gao authored Dec 05, 2025

  Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
  Signed-off-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>
  Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>

  d698bb38
04 Dec, 2025 6 commits
- [BugFix] Fix DBO assert `assert B_block_table == B_q` (#29933) · c8ab988b
  Lucas Wilkinson authored Dec 04, 2025
```
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
```
  c8ab988b
- [Chore] Deprecate `merge_by_field_config` arg (#30035) · b286a311
  Cyrus Leung authored Dec 05, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  b286a311
- [Model Runner V2] Implement get_num_sampled_and_rejected kernel (#30029) · cc050558
  Woosuk Kwon authored Dec 04, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  cc050558
- [Model][6/N] Improve all pooling task | Support chunked prefill with ALL pooling (#27145) · 74c4d80c
  wang.yuqi authored Dec 04, 2025
```
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
```
  74c4d80c
- [ROCm][CI][Bugfix] Fixing the `Multi-Modal Models Test (Extended) 1` group (#30013) · e96a6a6d
  Andreas Karatzas authored Dec 04, 2025
```
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
```
  e96a6a6d
- [Misc] Move functions into `PoolingMetadata` (#30027) · 68eb5c8d
  Cyrus Leung authored Dec 04, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  68eb5c8d
03 Dec, 2025 5 commits
- Fix `LLMEngine.__del__` dp_group cleanup condition (#29954) · 2fc5d6e0
  Yongtao Huang authored Dec 04, 2025
```
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
```
  2fc5d6e0
- [Core] Add xxHash as a high-performance hash option for accelerating prefix caching (#29163) · 9bcf9229
  Lumis Chen authored Dec 04, 2025
```
Signed-off-by: LuminolT <lumischen01@gmail.com>
Signed-off-by: Lumis Chen <lumischen01@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
```
  9bcf9229
- Add logging for cudagraph related info (#29825) · 69520bc6
  Yong Hoon Shin authored Dec 02, 2025
```
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
```
  69520bc6
- [Core] Rename PassConfig flags as per RFC #27995 (#29646) · d7284a26
  Arpit Khandelwal authored Dec 02, 2025
```
Signed-off-by: arpitkh101 <arpit5khandelwal@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
```
  d7284a26
- [BugFix] Fix assert in `build_for_cudagraph_capture` (#29893) · 5cdd6645
  Lucas Wilkinson authored Dec 02, 2025
```
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
```
  5cdd6645
02 Dec, 2025 11 commits
- [Doc] Add allocate_slots parameter docs (#29777) · 5d91d2b2
  maang-h authored Dec 03, 2025
```
Signed-off-by: maang <maang_h@163.com>
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
```
  5d91d2b2
- [Bugfix] fix --scheduling-policy=priority & n>1 crashes engine (#29764) · 0a9caca9
  Chauncey authored Dec 03, 2025
```
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
```
  0a9caca9
- [Perf] Avoid pageable HtoD transfer in MinTokensLogitsProcessor (#29826) · 1528e079
  jthomson04 authored Dec 02, 2025
```
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
```
  1528e079
- [Attention][CUDAGraph] Remove CG padding from attention backends (#29352) · 1d93f116
  Matthew Bonanni authored Dec 02, 2025
```
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
```
  1d93f116
- Add Mistral Large 3 and Ministral 3 (#29757) · d8c6210e
  Julien Denize authored Dec 02, 2025
```
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Mickael Seznec <mickael@mistral.ai>
```
  d8c6210e
- [Core] Eliminate redundant is_encoder_decoder lookups (20-40us/step) (#29800) · 0037b574
  Wushi Dong authored Dec 01, 2025
```
Signed-off-by: Wushi Dong <dongws@meta.com>
```
  0037b574
- [Chore] Move tokenizer initialization methods (#29793) · 653591d5
  Cyrus Leung authored Dec 02, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  653591d5
- [BugFix] Fix index error in ngram_proposer (#29779) · 81fe3f82
  usberkeley authored Dec 02, 2025
```
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
```
  81fe3f82
- [Misc] Add ReplicaId to Ray metrics (#24267) · 22274b21
  Seiji Eicher authored Dec 01, 2025
```
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: rongfu.leng <1275177125@qq.com>
```
  22274b21
- [Core] Support reseting all running requests' KV while calling `reset_prefix_cache` (#28827) · d0cd7289
  Zhuohan Li authored Dec 01, 2025
```
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
```
  d0cd7289
- [BugFix] Preserve spec decoding uniform decode when scheduling (#29759) · 44822d7f
  Nick Hill authored Dec 01, 2025
```
Signed-off-by: Nick Hill <nhill@redhat.com>
```
  44822d7f
01 Dec, 2025 4 commits

- [Core][Observability] Add KV cache residency metrics (#27793) · cabc77cc
  shivampr authored Dec 01, 2025

  Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:

  - `vllm:kv_block_lifetime_seconds`: total lifetime from allocation to free
  - `vllm:kv_block_idle_before_evict_seconds`: idle duration before eviction
  - `vllm:kv_block_reuse_gap_seconds`: time between consecutive reuses of the same block

  These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.

  The implementation uses monotonic timestamps for accuracy and 1% sampling for minimal overhead (~48 bytes/block). It is fully thread-safe and has zero runtime cost when disabled.

  Two new runtime flags are introduced:

  - `--kv-cache-metrics`: enable KV cache residency metrics
  - `--kv-cache-metrics-sample`: control the sampling ratio (default: 0.01)

  Signed-off-by: Shivam <shivamprasad91@gmail.com>

  cabc77cc
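
The sampling approach described above might look roughly like the following. This is a minimal sketch, not the code from the PR: `BlockResidencySampler` and its hook names are hypothetical, it treats "free" and "evict" as a single event for brevity, and it omits the thread safety and Prometheus export that the commit message mentions.

```python
import random
import time


class BlockResidencySampler:
    """Hypothetical sketch of sampled, monotonic-clock residency tracking."""

    def __init__(self, sample_rate=0.01, rng=None, clock=time.monotonic):
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.clock = clock  # monotonic, so wall-clock jumps cannot skew durations
        self.tracked = {}  # block_id -> (alloc_time, last_access_time)
        self.lifetimes = []
        self.reuse_gaps = []
        self.idle_before_evict = []

    def on_alloc(self, block_id):
        # Track only a sampled fraction of blocks to bound the overhead.
        if self.rng.random() < self.sample_rate:
            now = self.clock()
            self.tracked[block_id] = (now, now)

    def on_access(self, block_id):
        entry = self.tracked.get(block_id)
        if entry is None:
            return  # block was not sampled
        alloc_time, last_access = entry
        now = self.clock()
        self.reuse_gaps.append(now - last_access)
        self.tracked[block_id] = (alloc_time, now)

    def on_evict(self, block_id):
        entry = self.tracked.pop(block_id, None)
        if entry is None:
            return  # block was not sampled
        alloc_time, last_access = entry
        now = self.clock()
        self.lifetimes.append(now - alloc_time)
        self.idle_before_evict.append(now - last_access)
```

In this sketch the three lists stand in for the three histograms; a real implementation would observe each value into a histogram instead of appending to a list.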

- [v1] Add real sliding window calculation to FlexAttention direct BlockMask building (#26015) · b95db244
  Isotr0py authored Dec 01, 2025

  Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
  Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
  Co-authored-by: baonudesifeizhai <baonudesifeizhai@gmail.com>

  b95db244
- [crashfix] Eagle + multimodal can crash on mm cache miss (#29750) · 86e178f7
  Mickaël Seznec authored Dec 01, 2025
```
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Co-authored-by: Roger Wang <hey@rogerw.io>
```
  86e178f7
- Make PyTorch profiler gzip and CUDA time dump configurable (#29568) · 1ab8fc81
  Yifei Zhang authored Dec 01, 2025
```
Signed-off-by: Yifei Zhang <yifei.zhang1992@outlook.com>
```
  1ab8fc81
30 Nov, 2025 2 commits
- [Model Runner V2] Use packed mask for prompt bin counts (#29756) · ec38a736
  Woosuk Kwon authored Nov 30, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  ec38a736
- [ROCm][Attention] Sliding window support for `AiterFlashAttentionBackend` (#29234) · 8c363ed6
  Pleaplusone authored Nov 30, 2025
```
Signed-off-by: ganyi <ygan@amd.com>
```
  8c363ed6