[Attention][UX][1/N] Add AttentionConfig and change attention env vars to CLI arguments (#26315)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>

Commit 66e674cd, authored Dec 05, 2025 by Matthew Bonanni; committed by GitHub Dec 05, 2025

Showing 2 changed files with 3 additions and 4 deletions

vllm/v1/attention/backends/rocm_attn.py +1 -1

vllm/v1/attention/backends/triton_attn.py +2 -3

--- a/vllm/v1/attention/backends/rocm_attn.py
+++ b/vllm/v1/attention/backends/rocm_attn.py
@@ -165,7 +165,7 @@ class RocmAttentionBackend(AttentionBackend):
            raise ValueError(
                f"Head size {head_size} is not supported by {attn_type}. "
                f"Supported head sizes are: {cls.get_supported_head_sizes()}. "
-                "Set VLLM_ATTENTION_BACKEND=FLEX_ATTENTION to use "
+                "Set --attention-config.backend=FLEX_ATTENTION to use "
                "FlexAttention backend which supports all head sizes."
            )
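
For readers skimming this hunk: only the hint inside the error message changes, pointing users at the `--attention-config.backend` CLI argument instead of the `VLLM_ATTENTION_BACKEND` environment variable. The following is a minimal standalone sketch of that validation pattern, not the vLLM implementation. The names `SUPPORTED_HEAD_SIZES` and `check_head_size`, and the list of sizes, are hypothetical placeholders chosen for illustration.

```python
# Hypothetical placeholder values; the real list comes from
# cls.get_supported_head_sizes() in the backend class.
SUPPORTED_HEAD_SIZES = [64, 128, 256]


def check_head_size(head_size: int, attn_type: str = "RocmAttention") -> None:
    """Raise ValueError with the updated CLI hint for unsupported head sizes."""
    if head_size not in SUPPORTED_HEAD_SIZES:
        raise ValueError(
            f"Head size {head_size} is not supported by {attn_type}. "
            f"Supported head sizes are: {SUPPORTED_HEAD_SIZES}. "
            "Set --attention-config.backend=FLEX_ATTENTION to use "
            "FlexAttention backend which supports all head sizes."
        )


def error_message_for(head_size: int) -> str:
    """Return the error text for a head size, or '' if it is supported."""
    try:
        check_head_size(head_size)
    except ValueError as exc:
        return str(exc)
    return ""
```

With these placeholder sizes, `error_message_for(80)` returns text that mentions the CLI argument, and `error_message_for(128)` returns an empty string.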


--- a/vllm/v1/attention/backends/triton_attn.py
+++ b/vllm/v1/attention/backends/triton_attn.py
@@ -210,9 +210,6 @@ class TritonAttentionImpl(AttentionImpl):
    def fused_output_quant_supported(self, quant_key: QuantKey):
        return quant_key == kFp8StaticTensorSym

-    def supports_quant_query_input(self) -> bool:
-        return current_platform.is_cuda()
-
    def __init__(
        self,
        num_heads: int,
@@ -262,6 +259,8 @@ class TritonAttentionImpl(AttentionImpl):
                f"num_heads: {num_heads}."
            )

+        self.supports_quant_query_input = current_platform.is_cuda()
+
    def forward(
        self,
        layer: torch.nn.Module,
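
The second hunk replaces the `supports_quant_query_input()` method with an instance attribute of the same name set in `__init__`. The call sites are not part of this diff, so the sketch below only illustrates why the two forms must be read differently. `MethodStyleImpl` and `AttributeStyleImpl` are hypothetical stand-ins, and the `is_cuda` parameter stands in for `current_platform.is_cuda()`.

```python
class MethodStyleImpl:
    """Hypothetical stand-in for the pre-change shape: capability is a method."""

    def supports_quant_query_input(self) -> bool:
        return False  # pretend we are on a non-CUDA platform


class AttributeStyleImpl:
    """Hypothetical stand-in for the post-change shape: capability is an attribute."""

    def __init__(self, is_cuda: bool) -> None:
        # Mirrors `self.supports_quant_query_input = current_platform.is_cuda()`
        self.supports_quant_query_input = is_cuda


old_impl = MethodStyleImpl()
new_impl = AttributeStyleImpl(is_cuda=False)

# Reading the method without calling it yields a bound method object,
# which is always truthy regardless of what the method would return.
method_read_without_call = bool(old_impl.supports_quant_query_input)

# Calling the method gives the real answer.
method_called = old_impl.supports_quant_query_input()

# With the attribute form, a plain read gives the real answer.
attribute_read = new_impl.supports_quant_query_input
```

One plausible motivation, stated as an assumption: code that reads `impl.supports_quant_query_input` without parentheses would silently treat the method form as `True`, whereas the attribute form returns the actual boolean.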