Commit dc8df110 authored by Albert Cheng, committed by GitHub

add warning when FP8 KV cache misses prefill query quantization (#39752)


Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Albert Cheng (Engrg-Hardware 1) <albecheng@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
parent be0c855e
@@ -1443,6 +1443,19 @@ class MLACommonMetadataBuilder(AttentionMetadataBuilder[M]):
scope="local",
)
return model_dtype
elif (
is_quantized_kv_cache(vllm_config.cache_config.cache_dtype)
and backend_supports_prefill_query_quantization()
):
logger.warning_once(
"FP8 KV cache is enabled but prefill queries are not "
"quantized to FP8. For long-context workloads (ISL >= 4K), "
"enabling FP8 prefill attention can significantly optimize "
"prefill latency. To enable, add: "
'--attention-config \'{"use_prefill_query_quantization"'
": true}'",
scope="local",
)
return model_dtype
......
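The condition guarding the new warning can be illustrated with a small standalone sketch. It is not the code from this commit: `is_fp8_kv_cache`, `should_hint_fp8_prefill`, and `attention_config_flag` are hypothetical names, the `"fp8"` prefix check is an assumed stand-in for `is_quantized_kv_cache`, and the explicit `query_quant_enabled` argument assumes the branches above this hunk (not shown) already handle the case where prefill query quantization is turned on.

```python
import json


def is_fp8_kv_cache(cache_dtype: str) -> bool:
    # Hypothetical stand-in for is_quantized_kv_cache; assumes FP8 cache
    # dtypes are spelled with an "fp8" prefix (e.g. "fp8", "fp8_e4m3").
    return cache_dtype.startswith("fp8")


def should_hint_fp8_prefill(
    cache_dtype: str, backend_supported: bool, query_quant_enabled: bool
) -> bool:
    # Hint only when the cache is FP8, the backend could quantize prefill
    # queries, and the user has not opted in (the last check is an assumption
    # about what the earlier, elided branches cover).
    return (
        is_fp8_kv_cache(cache_dtype)
        and backend_supported
        and not query_quant_enabled
    )


def attention_config_flag() -> str:
    # Rebuild the CLI argument quoted in the warning text.
    payload = json.dumps({"use_prefill_query_quantization": True})
    return f"--attention-config '{payload}'"


if __name__ == "__main__":
    print(should_hint_fp8_prefill("fp8", True, False))
    print(attention_config_flag())
```

Building the flag with `json.dumps` avoids the nested-quote escaping that the warning string needs when written by hand.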