[Bugfix] CUDA: clean up KV cache scales for fp8/int8_per_token_head (#39224)

Signed-off-by: JartX <sagformas@epdcenter.es>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Commit 140cbb11, authored Apr 08, 2026 by JartX, committed by GitHub Apr 08, 2026

Showing with 7 additions and 0 deletions

vllm/v1/worker/gpu_model_runner.py +7 -0

--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -5859,6 +5859,13 @@ class GPUModelRunner(
                layer.kv_cache = (
                    torch.tensor([]) if isinstance(kv_cache, torch.Tensor) else []
                )
+            # Clean up quantized KV cache scale views
+            # (int8_per_token_head, fp8_per_token_head)
+            if hasattr(layer, "impl"):
+                if hasattr(layer.impl, "_k_scale_cache"):
+                    layer.impl._k_scale_cache = None
+                if hasattr(layer.impl, "_v_scale_cache"):
+                    layer.impl._v_scale_cache = None
        gc.collect()
        torch.accelerator.empty_cache()
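
The reason the extra cleanup matters can be illustrated with a minimal, torch-free sketch. It assumes that the scale caches are references into the same allocation as the KV cache, as the diff comment ("scale views") suggests. `FakeBuffer`, `FakeImpl`, `FakeLayer`, `release_kv_cache`, and `buffer_survives` are hypothetical stand-ins made up for this illustration and are not part of vLLM; only the attribute names `kv_cache`, `impl`, `_k_scale_cache`, and `_v_scale_cache` come from the diff.

```python
import gc
import weakref


class FakeBuffer:
    """Hypothetical stand-in for the KV cache allocation (not a torch.Tensor)."""


class FakeImpl:
    """Hypothetical stand-in for an attention impl that keeps scale views."""

    def __init__(self, buffer):
        # Assumption: both scale caches reference the same underlying buffer.
        self._k_scale_cache = buffer
        self._v_scale_cache = buffer


class FakeLayer:
    """Hypothetical stand-in for an attention layer."""

    def __init__(self, buffer):
        self.kv_cache = buffer
        self.impl = FakeImpl(buffer)


def release_kv_cache(layer, clear_scales):
    """Drop the layer's references to the buffer, optionally including scales."""
    layer.kv_cache = []
    if clear_scales and hasattr(layer, "impl"):
        # Same guarded pattern as the diff: only touch attributes that exist.
        if hasattr(layer.impl, "_k_scale_cache"):
            layer.impl._k_scale_cache = None
        if hasattr(layer.impl, "_v_scale_cache"):
            layer.impl._v_scale_cache = None
    gc.collect()


def buffer_survives(clear_scales):
    """Return True if the buffer is still alive after cleanup."""
    buffer = FakeBuffer()
    probe = weakref.ref(buffer)  # observes the buffer without keeping it alive
    layer = FakeLayer(buffer)
    del buffer
    release_kv_cache(layer, clear_scales)
    return probe() is not None
```

Calling `buffer_survives(False)` models the behavior before this change: resetting `layer.kv_cache` alone leaves the buffer reachable through the impl's scale attributes, so the garbage collector cannot reclaim it. Calling `buffer_survives(True)` models the patched behavior, where no reference remains. Whether the real GPU memory is returned additionally depends on the allocator, which is why the patched code still runs `gc.collect()` and `torch.accelerator.empty_cache()` afterwards.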