    [Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290) · 4403e3ed
    zhanqiuhu authored
    
    
    Add labeled Prometheus metrics to distinguish where prompt tokens come
    from in P/D disaggregated deployments.
    
    In P/D disaggregation, decode instances receive KV cache from prefill instances.
    Currently, decode reports inflated prompt throughput because it counts all
    prompt tokens as "computed", even though most were transferred.
    
    This PR adds labeled metrics so users can understand actual compute work vs
    transferred work:
    
    vllm:prompt_tokens_by_source_total{source="local_compute"}        # Tokens prefilled locally
    vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer  
    vllm:prompt_tokens_by_source_total{source="local_cache_hit"}      # Tokens from local prefix cache
    vllm:prompt_tokens_cached_total                                    # Total cached (local + external, -1 when all 
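
    To illustrate how the three `source` labels could partition a request's
    prompt tokens, here is a minimal sketch in plain Python. It is not the code
    from this PR: the `PromptTokenBreakdown` class, its field names, and the
    assumption that the three sources sum to the prompt length are hypothetical,
    and a `collections.Counter` stands in for the labeled Prometheus counter.

    ```python
    from collections import Counter
    from dataclasses import dataclass

    METRIC = "vllm:prompt_tokens_by_source_total"


    @dataclass(frozen=True)
    class PromptTokenBreakdown:
        """Hypothetical per-request split of prompt tokens by source."""

        num_prompt_tokens: int
        num_local_cache_hit: int       # served from the local prefix cache
        num_external_kv_transfer: int  # received from a prefill instance

        def by_source(self) -> dict:
            # Assumption: whatever was neither cached nor transferred
            # had to be prefilled on this instance.
            local_compute = (
                self.num_prompt_tokens
                - self.num_local_cache_hit
                - self.num_external_kv_transfer
            )
            if local_compute < 0:
                raise ValueError("cached + transferred tokens exceed prompt length")
            return {
                "local_compute": local_compute,
                "local_cache_hit": self.num_local_cache_hit,
                "external_kv_transfer": self.num_external_kv_transfer,
            }


    def record(counters: Counter, breakdown: PromptTokenBreakdown) -> None:
        """Add one request's tokens to the per-source running totals."""
        for source, count in breakdown.by_source().items():
            counters[(METRIC, source)] += count


    # Two requests as a decode instance might see them: most tokens arrive
    # via KV transfer, and only a small remainder is computed locally.
    counters = Counter()
    record(counters, PromptTokenBreakdown(1000, 100, 880))
    record(counters, PromptTokenBreakdown(500, 0, 499))
    ```

    Under these assumptions, the `local_compute` series on a decode instance
    stays small while `external_kv_transfer` carries most of the volume, which
    is the distinction the labeled metric is meant to expose.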
    Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
test_stats.py 6.55 KB