Unverified Commit de81b7df authored by Kshitij Lakhani, committed by GitHub

Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121)


Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
parent 1e2c68d6
@@ -434,8 +434,8 @@ def get_attention_backend(
     #         | FP8            | non-paged/paged | sm90 | thd           | >= 1
     # Unfused | FP32/FP16/BF16 | non-paged/paged | all  | bshd,sbhd,thd | >= 1
     if inference_params is not None:
-        if device_compute_capability == (8, 9) and cudnn_version <= (9, 12, 0):
-            logger.debug("Disabling FusedAttention for KV caching for sm89 and cuDNN <= 9.12")
+        if device_compute_capability == (8, 9) and cudnn_version <= (9, 13, 0):
+            logger.debug("Disabling FusedAttention for KV caching for sm89 and cuDNN <= 9.13")
             use_fused_attention = False
         if context_parallel:
             logger.debug("Disabling all backends for KV caching with context parallelism")
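For context on the gating logic: the compute capability and cuDNN version are held as Python tuples, which compare lexicographically, so `cudnn_version <= (9, 13, 0)` matches every release up to and including 9.13.0 but not 9.13.1 or later, even though the log message reads "cuDNN <= 9.13". Below is a minimal standalone sketch of the condition; should_disable_fused_attention is a hypothetical helper for illustration, not part of the Transformer Engine API.

def should_disable_fused_attention(device_compute_capability, cudnn_version):
    # Hypothetical helper mirroring the condition in the diff above: FusedAttention
    # is disabled for KV caching on sm89 when cuDNN is at or below 9.13.0.
    return device_compute_capability == (8, 9) and cudnn_version <= (9, 13, 0)

# Tuple comparison proceeds element by element, left to right:
assert should_disable_fused_attention((8, 9), (9, 12, 1))      # 9.12.1 <= 9.13.0
assert should_disable_fused_attention((8, 9), (9, 13, 0))      # boundary release, still disabled
assert not should_disable_fused_attention((8, 9), (9, 13, 1))  # 9.13.1 compares greater, gate no longer applies
assert not should_disable_fused_attention((9, 0), (9, 12, 1))  # sm90 is unaffected by this gate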