add warning when FP8 KV cache misses prefill query quantization (#39752)

Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Albert Cheng (Engrg-Hardware 1) <albecheng@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Unverified Commit dc8df110 authored Apr 14, 2026 by Albert Cheng Committed by GitHub Apr 14, 2026

Showing with 13 additions and 0 deletions

vllm/model_executor/layers/attention/mla_attention.py +13 -0

--- a/vllm/model_executor/layers/attention/mla_attention.py
+++ b/vllm/model_executor/layers/attention/mla_attention.py
@@ -1443,6 +1443,19 @@ class MLACommonMetadataBuilder(AttentionMetadataBuilder[M]):
                scope="local",
            )
            return model_dtype
+        elif (
+            is_quantized_kv_cache(vllm_config.cache_config.cache_dtype)
+            and backend_supports_prefill_query_quantization()
+        ):
+            logger.warning_once(
+                "FP8 KV cache is enabled but prefill queries are not "
+                "quantized to FP8. For long-context workloads (ISL >= 4K), "
+                "enabling FP8 prefill attention can significantly optimize "
+                "prefill latency. To enable, add: "
+                '--attention-config \'{"use_prefill_query_quantization"'
+                ": true}'",
+                scope="local",
+            )
        return model_dtype
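
The warning above tells users to pass a JSON value to `--attention-config`. As a minimal sketch, assuming only the flag name and key that appear in the warning text, the argument can be built programmatically so that the JSON and the shell quoting stay correct. The helper name `attention_config_args` is hypothetical and is not part of vLLM.

```python
import json
import shlex


def attention_config_args(enable_prefill_query_quantization: bool = True) -> list[str]:
    """Build the CLI arguments suggested by the warning message.

    Hypothetical helper for illustration; only the flag name and the
    JSON key are taken from the warning text in the diff.
    """
    config = {"use_prefill_query_quantization": enable_prefill_query_quantization}
    # json.dumps emits lowercase `true`/`false`, as the flag's JSON value requires.
    return ["--attention-config", json.dumps(config)]


args = attention_config_args()
# shlex.join single-quotes the JSON so it survives a POSIX shell intact.
print(shlex.join(args))
# prints: --attention-config '{"use_prefill_query_quantization": true}'
```

Passing the list form straight to a process launcher avoids shell quoting entirely. The joined string is only needed when the command is pasted into a shell.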