[Bugfix] Disable CG for Whisper+FA2 (#33164)

Signed-off-by: NickLucche <nlucches@redhat.com>
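
For illustration, the gating rule this patch adds can be sketched in isolation. This is a minimal sketch, not vLLM code. `CGSupport` and `resolve_cudagraph_support` are hypothetical stand-ins for `AttentionCGSupport` and the new `get_cudagraph_support` classmethod. Only the `NEVER` member appears in the diff. `ALWAYS` is used here merely as a placeholder for whatever the builder reports by default.

```python
from enum import Enum


class CGSupport(Enum):
    """Hypothetical stand-in for vLLM's AttentionCGSupport."""

    NEVER = 0
    ALWAYS = 1  # placeholder for the builder's default support level


def resolve_cudagraph_support(
    is_encoder_decoder: bool,
    fa_version: int,
    default: CGSupport,
) -> CGSupport:
    """Mirror the patch's condition: encoder-decoder model + FA2 -> NEVER."""
    if is_encoder_decoder and fa_version == 2:
        # Same outcome as the patch: CUDA graphs are disabled outright.
        return CGSupport.NEVER
    # Every other combination keeps the builder's default.
    return default


# An encoder-decoder model (such as Whisper) on FA2 loses CUDA graph support.
whisper_fa2 = resolve_cudagraph_support(True, 2, CGSupport.ALWAYS)
# A decoder-only model on FA2 keeps the default.
decoder_only_fa2 = resolve_cudagraph_support(False, 2, CGSupport.ALWAYS)
```

Only the combination of both conditions disables CUDA graphs; either one alone leaves the default untouched.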

Commit 1f3a2c29 (7227d061 → 1f3a2c29) · authored Jan 27, 2026 by Nicolò Lucchesi · committed by GitHub Jan 27, 2026 · unverified

Showing with 20 additions and 0 deletions

vllm/v1/attention/backends/flash_attn.py +20 -0

--- a/vllm/v1/attention/backends/flash_attn.py
+++ b/vllm/v1/attention/backends/flash_attn.py
@@ -257,6 +257,26 @@ class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[FlashAttentionMetad
    )
    supports_update_block_table: bool = True

+    @classmethod
+    def get_cudagraph_support(
+        cls,
+        vllm_config: "VllmConfig",
+        kv_cache_spec: "AttentionSpec",
+    ) -> AttentionCGSupport:
+        # FA2 does not support CUDA graphs with encoder-decoder models due to
+        # accuracy issues reported in https://github.com/vllm-project/vllm/issues/33091
+        if (
+            vllm_config.model_config.is_encoder_decoder
+            and get_flash_attn_version() == 2
+        ):
+            logger.warning_once(
+                "FlashAttention2 does not support CUDA graphs with "
+                "encoder-decoder models due to accuracy issues reported in #33091. "
+                "Disabling CUDA graph."
+            )
+            return AttentionCGSupport.NEVER
+        return cls._cudagraph_support
+
    def __init__(
        self,
        kv_cache_spec: AttentionSpec,