Fix TURBOQUANT backend selection in cuda.py (#40060)

Signed-off-by: Michael Goin <mgoin64@gmail.com>

Commit 1174723e authored Apr 17, 2026 by Michael Goin, committed by GitHub Apr 17, 2026

Showing with 4 additions and 5 deletions

docs/design/attention_backends.md +2 -0

vllm/platforms/cuda.py +2 -5

--- a/docs/design/attention_backends.md
+++ b/docs/design/attention_backends.md
@@ -106,6 +106,7 @@ Priority is **1 = highest** (tried first).
 | 2 | `FLASH_ATTN` |
 | 3 | `TRITON_ATTN` |
 | 4 | `FLEX_ATTENTION` |
+| 5 | `TURBOQUANT` |
 **Ampere/Hopper (SM 8.x-9.x):**
@@ -115,6 +116,7 @@ Priority is **1 = highest** (tried first).
 | 2 | `FLASHINFER` |
 | 3 | `TRITON_ATTN` |
 | 4 | `FLEX_ATTENTION` |
+| 5 | `TURBOQUANT` |
 ### MLA Attention (DeepSeek-style)

--- a/vllm/platforms/cuda.py
+++ b/vllm/platforms/cuda.py
@@ -131,6 +131,7 @@ def _get_backend_priorities(
                AttentionBackendEnum.FLASH_ATTN,
                AttentionBackendEnum.TRITON_ATTN,
                AttentionBackendEnum.FLEX_ATTENTION,
+                AttentionBackendEnum.TURBOQUANT,
            ]
        else:
            return [
@@ -138,6 +139,7 @@ def _get_backend_priorities(
                AttentionBackendEnum.FLASHINFER,
                AttentionBackendEnum.TRITON_ATTN,
                AttentionBackendEnum.FLEX_ATTENTION,
+                AttentionBackendEnum.TURBOQUANT,
            ]
@@ -255,11 +257,6 @@ class CudaPlatformBase(Platform):
        valid_backends_priorities = []
        invalid_reasons: dict[AttentionBackendEnum, tuple[int, list[str]]] = {}
-        # TurboQuant KV cache: route directly to TQ backend
-        kv_cache_dtype = attn_selector_config.kv_cache_dtype
-        if kv_cache_dtype is not None and kv_cache_dtype.startswith("turboquant_"):
-            return [(AttentionBackendEnum.TURBOQUANT, 0)], {}
        backend_priorities = _get_backend_priorities(
            attn_selector_config.use_mla,
            device_capability,
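
Read together, the hunks above appear to drop the early return that sent any `turboquant_*` KV cache dtype straight to `TURBOQUANT`, and instead list `TURBOQUANT` last in the priority lists so it is considered alongside the other backends. The following is a minimal sketch of that idea, not vLLM's actual code: `Backend`, `supports`, and `select_backend` are hypothetical names, the compatibility rule inside `supports` is an assumption made purely for illustration, `turboquant_example` is a placeholder dtype string, and the list leaves out the priority-1 entry because the hunk does not show it.

```python
from enum import Enum, auto


class Backend(Enum):
    """Stand-in for AttentionBackendEnum; only the member names come from the diff."""
    FLASHINFER = auto()
    TRITON_ATTN = auto()
    FLEX_ATTENTION = auto()
    TURBOQUANT = auto()


# Visible tail of the `else` branch in _get_backend_priorities after this commit.
# Earlier index means higher priority.
PRIORITIES = [
    Backend.FLASHINFER,
    Backend.TRITON_ATTN,
    Backend.FLEX_ATTENTION,
    Backend.TURBOQUANT,
]


def supports(backend, kv_cache_dtype):
    """Assumed rule: only TURBOQUANT handles turboquant_* dtypes, and nothing else."""
    is_tq = kv_cache_dtype is not None and kv_cache_dtype.startswith("turboquant_")
    return is_tq if backend is Backend.TURBOQUANT else not is_tq


def select_backend(priorities, kv_cache_dtype):
    """Return the highest-priority backend that passes the compatibility check."""
    for backend in priorities:
        if supports(backend, kv_cache_dtype):
            return backend
    raise ValueError(f"no valid backend for kv_cache_dtype={kv_cache_dtype!r}")


if __name__ == "__main__":
    print(select_backend(PRIORITIES, "turboquant_example"))
    print(select_backend(PRIORITIES, "auto"))
```

Under these assumptions no special case is needed: a `turboquant_*` dtype filters out every other backend, so the last entry wins, while any other dtype never reaches `TURBOQUANT`.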