fp8 kv cache support fix for torch.compile (#22758)

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

Commit 3053a22b, authored Sep 16, 2025 by Aleksandr Malyshev; committed by GitHub Sep 16, 2025
Showing with 4 additions and 2 deletions

vllm/model_executor/layers/quantization/kv_cache.py +3 -1

vllm/v1/attention/backends/triton_attn.py +1 -1

--- a/vllm/model_executor/layers/quantization/kv_cache.py
+++ b/vllm/model_executor/layers/quantization/kv_cache.py
@@ -125,7 +125,9 @@ class BaseKVCacheMethod(QuantizeMethodBase):
        # These are used in the final Attention.forward()
        layer._q_scale.copy_(q_scale)
-        layer._q_scale_float = q_scale
+        layer._q_scale_float = q_scale.item() if isinstance(
+            q_scale, torch.Tensor) else q_scale
        layer._prob_scale.copy_(prob_scale)
        if layer.kv_cache_dtype == "fp8" and (q_scale == 1.0
                                              or prob_scale == 1.0):

--- a/vllm/v1/attention/backends/triton_attn.py
+++ b/vllm/v1/attention/backends/triton_attn.py
@@ -361,7 +361,7 @@ class TritonAttentionImpl(AttentionImpl):
            key_cache = key_cache.view(self.fp8_dtype)
            value_cache = value_cache.view(self.fp8_dtype)
            num_tokens, num_heads, head_size = query.shape
-            assert layer._q_scale == 1.0, \
+            assert layer._q_scale_float == 1.0, \
                "A non 1.0 q_scale is not currently supported."
            if current_platform.is_cuda():
                # Skip Q quantization on ROCm and XPU, enable this on cuda
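
The commit message does not spell out the mechanism, so the following is only a minimal sketch of the pattern the two hunks appear to implement: cache the scale once as a plain Python float, then assert against that float instead of against the tensor. `FakeTensor`, `to_python_float`, `Layer`, and `check_q_scale` are hypothetical names for illustration, not vLLM or PyTorch APIs. The stand-in class and the duck-typed `.item()` check are assumptions made so the sketch runs without PyTorch; the actual change uses `isinstance(q_scale, torch.Tensor)`.

```python
class FakeTensor:
    """Hypothetical stand-in for a 0-dim torch.Tensor (not part of vLLM)."""

    def __init__(self, value):
        self._value = float(value)

    def item(self):
        # Mirrors torch.Tensor.item(): return the value as a Python number.
        return self._value


def to_python_float(scale):
    """Return a plain float, unwrapping tensor-like objects via .item()."""
    # The diff checks isinstance(q_scale, torch.Tensor); duck typing on
    # .item() is used here only to avoid a PyTorch dependency in the sketch.
    return float(scale.item()) if hasattr(scale, "item") else float(scale)


class Layer:
    """Hypothetical minimal layer that caches the scale as a float."""

    def __init__(self, q_scale):
        self._q_scale_float = to_python_float(q_scale)


def check_q_scale(layer):
    # Plain float comparison: no tensor has to be evaluated as a bool here.
    assert layer._q_scale_float == 1.0, \
        "A non 1.0 q_scale is not currently supported."
    return True
```

Converting once when the scale is set, rather than at every forward call, keeps the check in the attention path a pure Python comparison, which is presumably what makes it friendlier to `torch.compile` tracing.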