[NVIDIA] flashinfer TRTLLM attention prefill token limit (#25998)

Signed-off-by: jasonlizhengjian <jason.li@centml.ai>
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com>

Commit 6b6e9877 (unverified), authored Oct 05, 2025 by Jason Li; committed by GitHub Oct 05, 2025

Showing with 12 additions and 5 deletions

vllm/utils/flashinfer.py +12 -5
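
For orientation, the auto-detection rule after this change can be sketched as a standalone function. This is a minimal illustration, not the vLLM implementation: the function name `auto_detect_trtllm` and the two constant names are hypothetical, and the real logic lives inside `use_trtllm_attention`, which also handles the environment-variable override and logging (both omitted here). The effect of the change is that the 256-token limit now applies to decode only, so prefill batches of any token count can be auto-selected for TRTLLM attention.

```python
# Hypothetical constant names; the values are the literals used in the diff.
MAX_SEQ_LEN_LIMIT = 131072
DECODE_TOKEN_LIMIT = 256


def auto_detect_trtllm(
    num_tokens: int,
    max_seq_len: int,
    kv_cache_dtype: str,
    is_prefill: bool,
) -> bool:
    """Sketch of the auto-detection branch after this commit (illustrative only)."""
    # Conditions shared by prefill and decode.
    common_ok = max_seq_len <= MAX_SEQ_LEN_LIMIT and kv_cache_dtype == "auto"
    if is_prefill:
        # Prefill: no limit on the number of tokens.
        return common_ok
    # Decode: the token limit from before the change still applies.
    return common_ok and num_tokens <= DECODE_TOKEN_LIMIT


# A 4096-token batch is accepted for prefill but rejected for decode.
print(auto_detect_trtllm(4096, 8192, "auto", is_prefill=True))
print(auto_detect_trtllm(4096, 8192, "auto", is_prefill=False))
```

Before this commit, both calls above would have been rejected, because the `num_tokens <= 256` check was applied regardless of phase.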

--- a/vllm/utils/flashinfer.py
+++ b/vllm/utils/flashinfer.py
@@ -283,11 +283,18 @@ def use_trtllm_attention(

    if force_use_trtllm is None:
        # Environment variable not set - use auto-detection
-        use_trtllm = (
-            num_tokens <= 256 and max_seq_len <= 131072 and kv_cache_dtype == "auto"
-        )
-        if use_trtllm:
-            logger.warning_once("Using TRTLLM attention (auto-detected).")
+        if is_prefill:
+            # Prefill auto-detection
+            use_trtllm = max_seq_len <= 131072 and kv_cache_dtype == "auto"
+            if use_trtllm:
+                logger.warning_once("Using TRTLLM prefill attention (auto-detected).")
+        else:
+            # Decode auto-detection
+            use_trtllm = (
+                num_tokens <= 256 and max_seq_len <= 131072 and kv_cache_dtype == "auto"
+            )
+            if use_trtllm:
+                logger.warning_once("Using TRTLLM decode attention (auto-detected).")
        return use_trtllm

    # Environment variable is set to 1 - respect it