1. 20 Mar, 2026 1 commit
  2. 18 Mar, 2026 1 commit
  3. 23 Feb, 2026 1 commit
  4. 14 Feb, 2026 1 commit
  5. 13 Feb, 2026 1 commit
  6. 04 Feb, 2026 1 commit
    • [Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290) · 4403e3ed
      zhanqiuhu authored
      
      
      Add labeled Prometheus metrics to distinguish where prompt tokens come
      from in P/D disaggregated deployments.
      
      In P/D disaggregation, decode instances receive KV cache from prefill instances.
      Currently, decode reports inflated prompt throughput because it counts all
      prompt tokens as "computed", even though most were transferred.
      
      This PR adds labeled metrics so users can understand actual compute work vs
      transferred work:
      
      vllm:prompt_tokens_by_source_total{source="local_compute"}        # Tokens prefilled locally
      vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer  
      vllm:prompt_tokens_by_source_total{source="local_cache_hit"}      # Tokens from local prefix cache
      vllm:prompt_tokens_cached_total                                    # Total cached (local + external, -1 when all 
      Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
      4403e3ed
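
The breakdown described in this commit can be sketched as follows. This is a minimal illustration, not the vLLM implementation: the `PromptTokenBreakdown` class, its field names, and the `record` helper are hypothetical, and only the three `source` label values come from the commit message. It assumes each prompt token falls into exactly one of the three sources.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PromptTokenBreakdown:
    """Hypothetical per-request accounting; field names are assumptions."""
    num_prompt_tokens: int        # total prompt length
    num_local_cache_hit: int      # tokens served from the local prefix cache
    num_external_kv: int          # tokens whose KV cache arrived via transfer


def tokens_by_source(b: PromptTokenBreakdown) -> dict[str, int]:
    """Split a prompt into the three `source` labels named in the commit."""
    local_compute = b.num_prompt_tokens - b.num_local_cache_hit - b.num_external_kv
    if local_compute < 0:
        raise ValueError("cached + transferred tokens exceed the prompt length")
    return {
        "local_compute": local_compute,
        "external_kv_transfer": b.num_external_kv,
        "local_cache_hit": b.num_local_cache_hit,
    }


def record(totals: Counter, b: PromptTokenBreakdown) -> None:
    """Accumulate per-source totals, standing in for a labeled counter."""
    totals.update(tokens_by_source(b))


if __name__ == "__main__":
    totals: Counter = Counter()
    # A decode-side request: most of the prompt arrived via KV transfer.
    record(totals, PromptTokenBreakdown(100, 20, 70))
    print(dict(totals))
```

With this split, a decode instance would report only the `local_compute` share as actual prefill work, rather than the full prompt length.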
  7. 31 Jan, 2026 1 commit
  8. 27 Jan, 2026 1 commit
  9. 20 Jan, 2026 1 commit
  10. 18 Dec, 2025 1 commit
  11. 09 Dec, 2025 1 commit
  12. 08 Dec, 2025 1 commit
  13. 03 Dec, 2025 1 commit
  14. 02 Dec, 2025 1 commit
  15. 01 Dec, 2025 1 commit
    • [Core][Observability] Add KV cache residency metrics (#27793) · cabc77cc
      shivampr authored
      
      
      Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:
      
      vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
      vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
      vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block
      
      These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.
      
      The implementation uses monotonic timestamps for accuracy and 1% sampling to keep overhead minimal (~48 bytes/block). It is fully thread-safe and has zero runtime cost when disabled.
      
      Two new runtime flags are introduced:
      
      --kv-cache-metrics – enable KV cache residency metrics
      --kv-cache-metrics-sample – control sampling ratio (default: 0.01)
      Signed-off-by: Shivam <shivamprasad91@gmail.com>
      cabc77cc
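
A rough sketch of how sampled residency tracking like this might work is shown below. The `ResidencySampler` class and its method names are hypothetical and do not reflect the code in the commit. The sketch is single-threaded (the commit describes the real implementation as thread-safe), it collects raw observations in lists instead of Prometheus histograms, and it assumes the first reuse gap is measured from allocation.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class _BlockState:
    allocated_at: float
    last_access_at: float


class ResidencySampler:
    """Hypothetical tracker for a sampled subset of KV cache blocks."""

    def __init__(self, sample_rate=0.01, clock=time.monotonic, rng=random.random):
        self.sample_rate = sample_rate
        self._clock = clock          # monotonic clock: immune to wall-clock jumps
        self._rng = rng
        self._tracked: dict[int, _BlockState] = {}
        self.lifetimes: list[float] = []          # allocation -> free
        self.idle_before_evict: list[float] = []  # last access -> eviction
        self.reuse_gaps: list[float] = []         # gap between consecutive uses

    def on_alloc(self, block_id: int) -> None:
        # Only a sampled fraction of blocks carries any tracking state.
        if self._rng() < self.sample_rate:
            now = self._clock()
            self._tracked[block_id] = _BlockState(now, now)

    def on_access(self, block_id: int) -> None:
        state = self._tracked.get(block_id)
        if state is None:
            return  # unsampled block: nothing to do
        now = self._clock()
        self.reuse_gaps.append(now - state.last_access_at)
        state.last_access_at = now

    def on_evict(self, block_id: int) -> None:
        state = self._tracked.pop(block_id, None)
        if state is None:
            return
        now = self._clock()
        self.lifetimes.append(now - state.allocated_at)
        self.idle_before_evict.append(now - state.last_access_at)
```

Injecting `clock` and `rng` keeps the sketch deterministic under test. In a real deployment the three lists would be replaced by histogram observations.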
  16. 28 Nov, 2025 1 commit
  17. 25 Nov, 2025 1 commit
  18. 21 Nov, 2025 1 commit
  19. 17 Nov, 2025 1 commit
    • [Metrics] Fix KV cache usage percent metric multiproc (#28792) · d4acf518
      Jae-Won Chung authored
      
      
      The `vllm:kv_cache_usage_perc` Gauge metric is missing `multiprocess_mode="mostrecent"` and ends up returning one sample per process:
      
      ```
      vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="277"} 0.0
      vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="275"} 0.0
      vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="273"} 0.6530455880475035
      ...
      ```
      
      The deprecated `vllm:gpu_cache_usage_perc` Gauge metric has `multiprocess_mode="mostrecent"`.
      Signed-off-by: Jae-Won Chung <jwnchung@umich.edu>
      d4acf518
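
To illustrate the effect of a "most recent" aggregation, the sketch below collapses per-process gauge samples into a single value by picking the one written last. It is a plain-Python model of the idea, not `prometheus_client` code: the `Sample` type, the `collapse_most_recent` helper, and the `written_at` timestamps are invented for illustration. Only the PIDs and values are taken from the output above.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    pid: int
    value: float
    written_at: float  # hypothetical write time; not part of the commit's output


def collapse_most_recent(samples: list[Sample]) -> float:
    """Return the value of the sample that was written last."""
    if not samples:
        raise ValueError("no samples to aggregate")
    return max(samples, key=lambda s: s.written_at).value


if __name__ == "__main__":
    # Per-process samples as in the commit message; timestamps are made up.
    samples = [
        Sample(pid=277, value=0.0, written_at=10.0),
        Sample(pid=275, value=0.0, written_at=10.0),
        Sample(pid=273, value=0.6530455880475035, written_at=42.0),
    ]
    print(collapse_most_recent(samples))
```

Under the assumption that only one process actually updates the gauge, this yields one meaningful series instead of several, most of which are stuck at their initial value.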
  20. 14 Nov, 2025 1 commit
  21. 10 Nov, 2025 1 commit
  22. 05 Nov, 2025 1 commit
  23. 04 Nov, 2025 1 commit
  24. 30 Oct, 2025 1 commit
  25. 29 Oct, 2025 2 commits
  26. 24 Oct, 2025 1 commit
  27. 23 Oct, 2025 1 commit
  28. 18 Oct, 2025 1 commit
  29. 14 Oct, 2025 1 commit
  30. 12 Oct, 2025 1 commit
  31. 10 Oct, 2025 2 commits
  32. 05 Oct, 2025 2 commits
  33. 03 Oct, 2025 1 commit
  34. 27 Sep, 2025 1 commit
  35. 26 Sep, 2025 1 commit
  36. 25 Sep, 2025 1 commit
  37. 24 Sep, 2025 1 commit