    [Core][Observability] Add KV cache residency metrics (#27793) · cabc77cc
    shivampr authored
    
    
    Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:
    
    vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
    vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
    vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block
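
As a rough illustration of how the three durations relate to block events, here is a minimal Python sketch. The `BlockResidency` class and its method names are hypothetical and are not taken from the vLLM code. It also assumes that "idle before eviction" means the time since the block's last access, which this commit message does not spell out.

```python
import time
from typing import Callable, Tuple


class BlockResidency:
    """Hypothetical per-block record; not vLLM's actual data structure."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # A monotonic clock never goes backwards, so durations stay non-negative.
        self._clock = clock
        now = clock()
        self.allocated_at = now
        self.last_access_at = now

    def on_reuse(self) -> float:
        """Return the reuse gap: seconds since the previous access."""
        now = self._clock()
        gap = now - self.last_access_at
        self.last_access_at = now
        return gap

    def on_evict(self) -> Tuple[float, float]:
        """Return (lifetime, idle_before_evict) in seconds."""
        now = self._clock()
        return now - self.allocated_at, now - self.last_access_at
```

In this sketch, each value returned by `on_reuse` would be one observation for the reuse-gap histogram, and the pair returned by `on_evict` would feed the lifetime and idle-before-evict histograms.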
    
    These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.
    
    The implementation uses monotonic timestamps for accuracy and 1% sampling by default to keep overhead minimal (~48 bytes/block). It is fully thread-safe and has zero runtime cost when disabled.
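
One way to picture the sampling and the disabled fast path is the sketch below. `SampledBlockTracker`, its lock, and its dictionary are illustrative assumptions only; the actual change may structure this differently.

```python
import random
import threading
import time
from typing import Dict, Optional


class SampledBlockTracker:
    """Hypothetical tracker that timestamps only a sampled subset of blocks."""

    def __init__(self, enabled: bool, sample_rate: float = 0.01,
                 seed: Optional[int] = None) -> None:
        self.enabled = enabled
        self.sample_rate = sample_rate
        self._rng = random.Random(seed)
        self._lock = threading.Lock()
        self._allocated_at: Dict[int, float] = {}

    def on_alloc(self, block_id: int) -> None:
        if not self.enabled:
            return  # disabled: no timestamp, no lock, no bookkeeping
        with self._lock:
            # random() is in [0, 1), so a rate of 1.0 samples every block.
            if self._rng.random() < self.sample_rate:
                self._allocated_at[block_id] = time.monotonic()

    def on_free(self, block_id: int) -> Optional[float]:
        """Return the lifetime in seconds, or None if the block was not sampled."""
        if not self.enabled:
            return None
        with self._lock:
            start = self._allocated_at.pop(block_id, None)
        return None if start is None else time.monotonic() - start
```

Only sampled blocks carry per-block state here, so memory scales with the sampling ratio rather than with the total number of blocks.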
    
    Two new runtime flags are introduced:
    
    --kv-cache-metrics – enable KV cache residency metrics
    --kv-cache-metrics-sample – control sampling ratio (default: 0.01)
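
For readers unfamiliar with how such flags are typically wired up, the following is a standalone `argparse` sketch. It is not the code in `arg_utils.py`: `build_parser`, `sample_ratio`, and the help strings are hypothetical, and the accepted range (0, 1] for the ratio is an assumption.

```python
import argparse


def sample_ratio(value: str) -> float:
    """Parse a sampling ratio; the (0, 1] range is an assumption of this sketch."""
    ratio = float(value)
    if not 0.0 < ratio <= 1.0:
        raise argparse.ArgumentTypeError("sampling ratio must be in (0, 1]")
    return ratio


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--kv-cache-metrics",
        action="store_true",
        help="enable KV cache residency metrics",
    )
    parser.add_argument(
        "--kv-cache-metrics-sample",
        type=sample_ratio,
        default=0.01,
        help="fraction of blocks to sample",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["--kv-cache-metrics", "--kv-cache-metrics-sample", "0.05"])` enables the metrics and samples 5% of blocks.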

    Signed-off-by: Shivam <shivamprasad91@gmail.com>
arg_utils.py 86.7 KB