[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models (#23716)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Commit 704432af, authored Aug 27, 2025 by Thomas Parnell, committed by GitHub Aug 27, 2025

Showing with 11 additions and 8 deletions

docs/usage/v1_guide.md +6 -4

vllm/model_executor/models/config.py +5 -4

--- a/docs/usage/v1_guide.md
+++ b/docs/usage/v1_guide.md
@@ -107,14 +107,16 @@ to enable simultaneous generation and embedding using the same engine instance i
 #### Mamba Models
 Models using selective state-space mechanisms instead of standard transformer attention are supported.
-Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported. Please note that these models currently require disabling prefix caching in V1.
+Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported.
+Please note that prefix caching is not yet supported for these models.
 Models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
-`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`). Please note that
-these models currently require disabling prefix caching in V1.
+`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`).
+Please note that prefix caching is not yet supported for these models.
 Hybrid models with mechanisms different to Mamba are also supported (e.g, `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`).
-Please note that these models currently require disabling prefix caching and enforcing eager mode in V1.
+Please note that prefix caching is not yet supported for these models.
+It is also necessary to enforce eager mode for these models in V1.
 #### Encoder-Decoder Models
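
Note on the behavior change: the documentation hunk above no longer asks users to disable prefix caching themselves, because the `config.py` hunk in this commit sets `cache_config.enable_prefix_caching = False` for hybrid or mamba-based models and logs a message. The following is a minimal, self-contained sketch of that pattern only. `SketchCacheConfig`, `SketchVllmConfig`, and `disable_prefix_caching` are hypothetical stand-ins invented for illustration, not vLLM's actual classes or functions; only the attribute name `enable_prefix_caching` and the log message are taken from the diff.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("prefix_caching_sketch")


@dataclass
class SketchCacheConfig:
    # Hypothetical stand-in for the cache config object in the diff.
    enable_prefix_caching: bool = True


@dataclass
class SketchVllmConfig:
    # Hypothetical stand-in for the top-level config object in the diff.
    cache_config: SketchCacheConfig = field(default_factory=SketchCacheConfig)


def disable_prefix_caching(vllm_config: SketchVllmConfig) -> None:
    """Mirror the two added statements from the config.py hunk."""
    cache_config = vllm_config.cache_config
    # The assignment is unconditional, so a value of True set earlier
    # is overwritten here.
    cache_config.enable_prefix_caching = False
    logger.info("Hybrid or mamba-based model detected: disabling prefix "
                "caching since it is not yet supported.")


# Usage: a config that starts with prefix caching on ends with it off.
config = SketchVllmConfig(SketchCacheConfig(enable_prefix_caching=True))
disable_prefix_caching(config)
```

In this sketch the flag is forced off regardless of its previous value, which is how the added lines in the hunk read; whether the real code path is reached for a given model depends on logic outside the lines shown in this diff.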

--- a/vllm/model_executor/models/config.py
+++ b/vllm/model_executor/models/config.py
@@ -292,12 +292,13 @@ class MambaModelConfig(VerifyAndUpdateConfig):
            return
        model_config = vllm_config.model_config
+        cache_config = vllm_config.cache_config
        compilation_config = vllm_config.compilation_config
-        model_cls, _ = ModelRegistry.resolve_model_cls(
-            model_config.architecture,
-            model_config=model_config,
-        )
+        # TODO(tdoublep): remove once prefix caching is enabled
+        cache_config.enable_prefix_caching = False
+        logger.info("Hybrid or mamba-based model detected: disabling prefix "
+                    "caching since it is not yet supported.")
        # TODO(tdoublep): remove as full cuda graph support is added
        FCG_NOT_SUPPORTED_MODELS = [