[Bugfix][CT] Fix KV cache scale handling (#39418)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Commit d8ddb316, authored Apr 13, 2026 by Yi Liu; committed by GitHub Apr 13, 2026

Showing with 11 additions and 0 deletions

vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py +11 -0

--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -1123,6 +1123,17 @@ class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
         layer._v_scale = layer.v_scale
         layer._q_scale = layer.q_scale
+        # Set the _float variants that the attention backend uses.
+        def _to_scalar(tensor: torch.Tensor) -> float:
+            # For n_scales > 1 (e.g., ATTN_HEAD strategy), take max
+            if tensor.numel() > 1:
+                return tensor.max().item()
+            return tensor.item()
+        layer._k_scale_float = _to_scalar(layer.k_scale)
+        layer._v_scale_float = _to_scalar(layer.v_scale)
+        layer._q_scale_float = _to_scalar(layer.q_scale)
         # Discard all placeholders.
         del layer.k_scale
         del layer.v_scale
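
The reduction added in this hunk can be tried outside vLLM. The sketch below is illustrative only: it copies the helper's logic into a module-level function named `to_scalar` and applies it to a hypothetical stand-in layer built with `SimpleNamespace`, not to vLLM's real attention layer. The scale values are made up for the example.

```python
from types import SimpleNamespace

import torch


def to_scalar(tensor: torch.Tensor) -> float:
    """Collapse a scale tensor to one Python float.

    Mirrors the helper in the diff: if the tensor holds several scales
    (for example one per attention head), take the largest one.
    """
    if tensor.numel() > 1:
        return tensor.max().item()
    return tensor.item()


# Hypothetical stand-in for an attention layer (assumption, not vLLM's class).
layer = SimpleNamespace(
    k_scale=torch.tensor([0.5, 0.25, 1.0, 0.75]),  # several scales, e.g. per head
    v_scale=torch.tensor([0.125]),                 # one scale in a 1-element tensor
    q_scale=torch.tensor(2.0),                     # one scale in a 0-dim tensor
)

# Reduce each scale tensor to a plain float, as the diff does for the
# _k_scale_float, _v_scale_float and _q_scale_float attributes.
scale_floats = {
    name: to_scalar(getattr(layer, name))
    for name in ("k_scale", "v_scale", "q_scale")
}
print(scale_floats)
```

Taking the maximum is presumably the conservative choice when a single float has to represent several scales: the largest scale covers the widest value range, at the cost of some precision for entries whose own scale was smaller.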