[Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment (#18674)

Signed-off-by: zzzyq <zhangyuqi94@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

[Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment (#18674)
Signed-off-by: zzzyq <zhangyuqi94@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
f2faac74 · Yuqi Zhang · GitHub · 279f8545 · f2faac74
Unverified Commit f2faac74 authored May 25, 2025 by Yuqi Zhang Committed by GitHub May 25, 2025
Hide whitespace changes
Inline Side-by-side

Showing with 14 additions and 0 deletions

vllm/engine/llm_engine.py vllm/engine/llm_engine.py +14 -0

No files found.
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -1650,6 +1650,20 @@ class LLMEngine:
        gpu_prefix_cache_hit_rate = self.scheduler[
            0].get_prefix_cache_hit_rate(Device.GPU)
+        # Exchange the uasge and cache hit stats between gpu and cpu when
+        # running on cpu because the cpu_worker.py intentionally reports the
+        # number of cpu blocks as gpu blocks in favor of cache management.
+        if self.device_config.device_type == "cpu":
+            num_total_gpu, num_total_cpu = num_total_cpu, num_total_gpu
+            gpu_cache_usage_sys, cpu_cache_usage_sys = (
+                cpu_cache_usage_sys,
+                gpu_cache_usage_sys,
+            )
+            gpu_prefix_cache_hit_rate, cpu_prefix_cache_hit_rate = (
+                cpu_prefix_cache_hit_rate,
+                gpu_prefix_cache_hit_rate,
+            )
        # Iteration stats
        num_prompt_tokens_iter = 0
        num_generation_tokens_iter = 0