Commits · 880be2b1b80fb2d18c32b0ee5a95174cf2e37c7d · OpenDAS / vllm_cscc

20 Mar, 2026 1 commit
- [Metrics] Some small refactoring for better maintainability (#33898) · 880be2b1
  Martin Hickey authored Mar 20, 2026
```
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
```
  880be2b1
18 Mar, 2026 1 commit
- [Bugfix] Expand quantization method support in perf metrics (#37231) · 828f862a
  Thillai Chithambaram authored Mar 18, 2026
```
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
```
  828f862a
23 Feb, 2026 1 commit

[Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) (#30950) · 5cc7c445

Mark McLoughlin authored Feb 23, 2026



Export the existing Model FLOPs Utilization (MFU) metrics via Prometheus.

`--enable-mfu-metrics` is required for these to be exposed.
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>

5cc7c445
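
As a rough illustration of how counters like these can be turned into a utilization figure, the sketch below derives MFU from two readings of a cumulative FLOPs counter. The function name, the sample readings, and the peak-throughput figure are assumptions made for illustration; the actual metric names exported by this commit are not listed here.

```python
def mfu_from_counter(
    flops_start: float,
    flops_end: float,
    elapsed_seconds: float,
    peak_flops_per_second: float,
) -> float:
    """Estimate MFU over an interval from two readings of a cumulative FLOPs counter."""
    if elapsed_seconds <= 0 or peak_flops_per_second <= 0:
        raise ValueError("elapsed time and peak throughput must be positive")
    # Achieved throughput is the counter delta divided by the interval length.
    achieved_flops_per_second = (flops_end - flops_start) / elapsed_seconds
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical readings: 6e15 FLOPs accumulated over 60 s on hardware
# with an assumed peak of 1e15 FLOP/s.
example_mfu = mfu_from_counter(0.0, 6e15, 60.0, 1e15)
```

Working from counter deltas rather than instantaneous values keeps the estimate stable across scrape intervals of different lengths.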

14 Feb, 2026 1 commit

[Renderer] Move InputPreprocessor into Renderer (1/2) (#34510) · 73391a1b

Cyrus Leung authored Feb 15, 2026


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

73391a1b

13 Feb, 2026 1 commit

[Core] Profiler improvements and lazy initialization (#33198) · 4453ba8d

Jaewon authored Feb 12, 2026


Signed-off-by: Jaewon Lee <jaewon@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>

4453ba8d

04 Feb, 2026 1 commit

[Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290) · 4403e3ed

zhanqiuhu authored Feb 04, 2026

Add labeled Prometheus metrics to distinguish where prompt tokens come
from in P/D disaggregated deployments.

In P/D disaggregation, decode instances receive KV cache from prefill instances.
Currently, decode reports inflated prompt throughput because it counts all
prompt tokens as "computed", even though most were transferred.

This PR adds labeled metrics so users can understand actual compute work vs
transferred work:

vllm:prompt_tokens_by_source_total{source="local_compute"} # Tokens prefilled locally
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer
vllm:prompt_tokens_by_source_total{source="local_cache_hit"} # Tokens from local prefix cache
vllm:prompt_tokens_cached_total # Total cached (local + external, -1 when all
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>

4403e3ed
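
To make the distinction between computed and transferred tokens concrete, the sketch below splits one request's prompt tokens across the three `source` labels named above. The function names and the subtraction-based accounting are assumptions for illustration, not the code from this PR.

```python
def split_prompt_tokens(
    num_prompt_tokens: int,
    num_local_cache_hit: int,
    num_external_kv_transfer: int,
) -> dict[str, int]:
    """Attribute prompt tokens to the three source labels (hypothetical accounting)."""
    # Assumption: whatever was neither cached locally nor transferred was prefilled locally.
    local_compute = num_prompt_tokens - num_local_cache_hit - num_external_kv_transfer
    if local_compute < 0:
        raise ValueError("cached and transferred tokens exceed the prompt length")
    return {
        "local_compute": local_compute,
        "local_cache_hit": num_local_cache_hit,
        "external_kv_transfer": num_external_kv_transfer,
    }


def local_compute_fraction(by_source: dict[str, int]) -> float:
    """Share of prompt tokens that required prefill work on this instance."""
    total = sum(by_source.values())
    return by_source["local_compute"] / total if total else 0.0


# A decode instance that received most of its KV cache from a prefill instance.
decode_request = split_prompt_tokens(
    num_prompt_tokens=1000, num_local_cache_hit=100, num_external_kv_transfer=850
)
decode_fraction = local_compute_fraction(decode_request)
```

In this example only 50 of the 1000 prompt tokens count as local compute, which is the gap an unlabeled prompt-token counter would hide.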

31 Jan, 2026 1 commit
- Support clear mm and encoder cache (#33452) · 22d9a056
  jma99_2333 authored Jan 31, 2026
```
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
```
  22d9a056
27 Jan, 2026 1 commit

[Metrics][MFU] Fix UnembedMetrics FLOP overcounting for prefill (#33045) · 5ec44056

omkhalil authored Jan 27, 2026

Fix UnembedMetrics to correctly count FLOPs for the unembedding (LM head) layer.

The bug: UnembedMetrics used total_num_tokens(), which counts every token in the
batch, to compute projection FLOPs. In the autoregressive use case, however, the
vocab projection runs on only the last token.

Co-authored-by: Omar Mohamed Khalil <omarkhalil@meta.com>

5ec44056
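
The size of the overcount can be sketched with the usual matmul cost of about 2·m·k·n FLOPs. The helper name, model dimensions, and prompt lengths below are assumptions for illustration, as is the simplification that each prefill request projects exactly one token.

```python
def unembed_flops(num_projected_tokens: int, hidden_size: int, vocab_size: int) -> int:
    """FLOPs for projecting hidden states onto the vocabulary (LM head)."""
    # An (m x k) @ (k x n) matmul costs roughly 2 * m * k * n FLOPs.
    return 2 * num_projected_tokens * hidden_size * vocab_size


# Hypothetical prefill batch of three requests and hypothetical model sizes.
prompt_lengths = [512, 128, 64]
hidden_size, vocab_size = 4096, 32000

# Overcounted: every prompt token is treated as if it were projected.
overcounted = unembed_flops(sum(prompt_lengths), hidden_size, vocab_size)
# Corrected: only the last token of each request is projected.
corrected = unembed_flops(len(prompt_lengths), hidden_size, vocab_size)
```

Because the cost is linear in the number of projected tokens, the overcount factor is simply the ratio of total prompt tokens to requests (704 to 3 here).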

20 Jan, 2026 1 commit

[Metrics] Complete removal of deprecated vllm:time_per_output_token_seconds metric (#32661) · bb917203

杨朱 · Kiki authored Jan 20, 2026



This PR completes the removal of the vllm:time_per_output_token_seconds metric,
which was deprecated in v0.11, hidden in v0.12, and scheduled for removal in v0.13,
but whose removal was delayed until v0.15.

Signed-off-by: carlory <baofa.fan@daocloud.io>
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>

bb917203

18 Dec, 2025 1 commit

[Metrics] Model FLOPs Utilization estimation (#30738) · a0b782f9

SungMinCho authored Dec 17, 2025


Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>

a0b782f9

09 Dec, 2025 1 commit
- feat(metrics): Add prefill KV compute metric excluding cached tokens (#30189) · f1599ca5
  Victor Ziliang Peng authored Dec 08, 2025
```
Signed-off-by: Ziliang Peng <ziliang@character.ai>
```
  f1599ca5
08 Dec, 2025 1 commit
- [BugFix] Unblock use of LoRA with data parallel mode (#30220) · d726a7b0
  Nick Hill authored Dec 07, 2025
```
Signed-off-by: Nick Hill <nhill@redhat.com>
```
  d726a7b0
03 Dec, 2025 1 commit
- Add logging for cudagraph related info (#29825) · 69520bc6
  Yong Hoon Shin authored Dec 02, 2025
```
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
```
  69520bc6
02 Dec, 2025 1 commit

[Misc] Add ReplicaId to Ray metrics (#24267) · 22274b21

Seiji Eicher authored Dec 01, 2025


Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: rongfu.leng <1275177125@qq.com>

22274b21

01 Dec, 2025 1 commit

[Core][Observability] Add KV cache residency metrics (#27793) · cabc77cc

shivampr authored Dec 01, 2025



Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:

- `vllm:kv_block_lifetime_seconds`: total lifetime from allocation to free
- `vllm:kv_block_idle_before_evict_seconds`: idle duration before eviction
- `vllm:kv_block_reuse_gap_seconds`: time between consecutive reuses of the same block

These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.

The implementation uses monotonic timestamps for accuracy and 1% sampling to keep overhead minimal (~48 bytes/block). It is fully thread-safe and has zero runtime cost when disabled.

Two new runtime flags are introduced:

- `--kv-cache-metrics`: enable KV cache residency metrics
- `--kv-cache-metrics-sample`: control the sampling ratio (default: 0.01)

Signed-off-by: Shivam <shivamprasad91@gmail.com>

cabc77cc
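
A minimal sketch of how the three durations could be collected with monotonic timestamps and sampling is shown below. The class and method names are hypothetical and are not the code from this commit. The sketch is not thread-safe, treats free and eviction as the same event, and counts allocation as a block's first use.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class BlockTimes:
    allocated_at: float
    last_used_at: float


class ResidencyTracker:
    """Collects lifetime, idle-before-evict, and reuse-gap samples (illustrative only)."""

    def __init__(self, sample_rate=0.01, clock=time.monotonic, rng=random.random):
        self.sample_rate = sample_rate
        self.clock = clock  # monotonic clock, so wall-clock jumps do not skew durations
        self.rng = rng
        self.tracked: dict[int, BlockTimes] = {}
        self.lifetimes: list[float] = []
        self.idle_before_evict: list[float] = []
        self.reuse_gaps: list[float] = []

    def on_alloc(self, block_id: int) -> None:
        # Track only a sampled fraction of blocks to bound the overhead.
        if self.rng() < self.sample_rate:
            now = self.clock()
            self.tracked[block_id] = BlockTimes(now, now)

    def on_reuse(self, block_id: int) -> None:
        times = self.tracked.get(block_id)
        if times is None:
            return  # block was not sampled
        now = self.clock()
        self.reuse_gaps.append(now - times.last_used_at)
        times.last_used_at = now

    def on_free(self, block_id: int) -> None:
        times = self.tracked.pop(block_id, None)
        if times is None:
            return  # block was not sampled
        now = self.clock()
        self.lifetimes.append(now - times.allocated_at)
        self.idle_before_evict.append(now - times.last_used_at)


# Usage with a fake clock so the recorded durations are deterministic.
fake_now = [0.0]
tracker = ResidencyTracker(sample_rate=1.0, clock=lambda: fake_now[0])
tracker.on_alloc(7)
fake_now[0] = 2.0
tracker.on_reuse(7)
fake_now[0] = 5.0
tracker.on_reuse(7)
fake_now[0] = 9.0
tracker.on_free(7)
```

In a real exporter the three lists would feed histogram observations; plain lists are used here only to keep the sketch dependency-free.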

28 Nov, 2025 1 commit
- [mypy] Enable type checking for more directories (#29674) · 9e6bcda3
  Cyrus Leung authored Nov 29, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  9e6bcda3
25 Nov, 2025 1 commit
- [Metrics] Scheduled removal of deprecated metrics (#29330) · 9cf4edae
  Mark McLoughlin authored Nov 25, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
  9cf4edae
21 Nov, 2025 1 commit
- [Doc] Update plugin doc (#28532) · 4050bae4
  wangxiyuan authored Nov 21, 2025
```
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
```
  4050bae4
17 Nov, 2025 1 commit

[Metrics] Fix KV cache usage percent metric multiproc (#28792) · d4acf518

Jae-Won Chung authored Nov 17, 2025



The `vllm:kv_cache_usage_perc` Gauge metric is missing `multiprocess_mode="mostrecent"`, so it ends up returning a separate value for each process:

```
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="277"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="275"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="273"} 0.6530455880475035
...
```

By contrast, the deprecated `vllm:gpu_cache_usage_perc` Gauge metric does set `multiprocess_mode="mostrecent"`.

Signed-off-by: Jae-Won Chung <jwnchung@umich.edu>

d4acf518
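
The difference between the two behaviors can be sketched in plain Python. This is a simulation of the aggregation semantics, not a call into the Prometheus client library; the `GaugeSample` type, the function names, and the timestamps are assumptions for illustration, while the pids and values mirror the output above.

```python
from typing import NamedTuple


class GaugeSample(NamedTuple):
    pid: int
    value: float
    updated_at: float  # hypothetical time at which the process last set the gauge


def export_per_process(samples: list[GaugeSample]) -> dict[int, float]:
    """Without aggregation: one exported series per process, keyed by pid."""
    return {sample.pid: sample.value for sample in samples}


def export_most_recent(samples: list[GaugeSample]) -> float:
    """With 'mostrecent' semantics: a single value, taken from the latest update."""
    return max(samples, key=lambda sample: sample.updated_at).value


samples = [
    GaugeSample(pid=277, value=0.0, updated_at=0.0),
    GaugeSample(pid=275, value=0.0, updated_at=0.0),
    GaugeSample(pid=273, value=0.6530455880475035, updated_at=12.5),
]

per_process = export_per_process(samples)
most_recent = export_most_recent(samples)
```

With per-process export, a dashboard sees three series, two of which are stale zeros from processes that never update the gauge; the most-recent view reports only the value that was actually written last.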

14 Nov, 2025 1 commit

[Metrics] Log number of preempted requests (#28522) · ecf8230d

lyn610 authored Nov 14, 2025

Add tracking and periodic logging for the number of preempted requests in the
metrics logger. This helps monitor system behavior under load.
Signed-off-by: Yining Liu <610lyn@gmail.com>
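
A minimal sketch of interval-based tracking is shown below. The class name, the log message, and the choice to reset the count after each log line are assumptions for illustration; the commit itself may report the count differently.

```python
import logging

logger = logging.getLogger("metrics_sketch")


class PreemptionLogger:
    """Accumulates preemption counts and reports them once per logging interval."""

    def __init__(self) -> None:
        self.num_preempted = 0

    def record(self, num_preempted_in_step: int) -> None:
        # Called after each scheduler step with the number of requests preempted.
        self.num_preempted += num_preempted_in_step

    def log(self) -> str:
        # Called periodically; emits the interval total and then resets it.
        message = f"Preempted requests in the last interval: {self.num_preempted}"
        logger.info(message)
        self.num_preempted = 0
        return message


stats = PreemptionLogger()
stats.record(2)
stats.record(1)
summary = stats.log()
```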