Unverified commit 704432af, authored by Thomas Parnell, committed by GitHub

[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models (#23716)


Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
parent a403d0fa
@@ -107,14 +107,16 @@ to enable simultaneous generation and embedding using the same engine instance i
 #### Mamba Models
 Models using selective state-space mechanisms instead of standard transformer attention are supported.
-Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported. Please note that these models currently require disabling prefix caching in V1.
+Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported.
+Please note that prefix caching is not yet supported for these models.
 Models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
-`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`). Please note that
-these models currently require disabling prefix caching in V1.
+`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`).
+Please note that prefix caching is not yet supported for these models.
 Hybrid models with mechanisms different to Mamba are also supported (e.g, `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`).
-Please note that these models currently require disabling prefix caching and enforcing eager mode in V1.
+Please note that prefix caching is not yet supported for these models.
+It is also necessary to enforce eager mode for these models in V1.
 #### Encoder-Decoder Models
......
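The documentation hunk above amounts to a small lookup from model family to documented constraint. A minimal sketch of that reading follows; the helper `documented_constraints` and the constraint labels `no_prefix_caching` and `enforce_eager` are hypothetical names chosen for this illustration, and none of this is vLLM code. Only the architecture names and the constraints themselves come from the updated text.

```python
# Hypothetical summary of the updated docs text; not part of vLLM.
MAMBA_ONLY = {"Mamba2ForCausalLM", "MambaForCausalLM"}
MAMBA_HYBRID = {
    "BambaForCausalLM", "Zamba2ForCausalLM", "NemotronHForCausalLM",
    "FalconH1ForCausalLM", "GraniteMoeHybridForCausalLM", "JambaForCausalLM",
}
OTHER_HYBRID = {"MiniMaxText01ForCausalLM", "MiniMaxM1ForCausalLM"}


def documented_constraints(architecture):
    """Return the constraints the updated docs state for an architecture."""
    if architecture in MAMBA_ONLY or architecture in MAMBA_HYBRID:
        # Prefix caching is not yet supported for these models
        return {"no_prefix_caching"}
    if architecture in OTHER_HYBRID:
        # Prefix caching is unsupported and eager mode must be enforced in V1
        return {"no_prefix_caching", "enforce_eager"}
    # Architectures outside the example lists are not covered by this section
    raise KeyError(architecture)


print(sorted(documented_constraints("MiniMaxM1ForCausalLM")))
```

The architecture lists are the examples given in the docs ("e.g."), so they are probably not exhaustive.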
@@ -292,12 +292,13 @@ class MambaModelConfig(VerifyAndUpdateConfig):
 return
 model_config = vllm_config.model_config
+cache_config = vllm_config.cache_config
 compilation_config = vllm_config.compilation_config
-model_cls, _ = ModelRegistry.resolve_model_cls(
-    model_config.architecture,
-    model_config=model_config,
-)
+# TODO(tdoublep): remove once prefix caching is enabled
+cache_config.enable_prefix_caching = False
+logger.info("Hybrid or mamba-based model detected: disabling prefix "
+            "caching since it is not yet supported.")
 # TODO(tdoublep): remove as full cuda graph support is added
 FCG_NOT_SUPPORTED_MODELS = [
......
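The pattern in this hunk, overriding a cache setting inside a config-verification hook and logging the override, can be sketched in isolation. The `CacheConfig` and `VllmConfig` dataclasses below are simplified stand-ins defined only for this example (the real vLLM classes carry many more fields, and the `True` default here is an assumption made so the override is visible), and `disable_prefix_caching` is a hypothetical function name.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class CacheConfig:
    # Stand-in with only the field touched by the diff; default is assumed
    enable_prefix_caching: bool = True


@dataclass
class VllmConfig:
    # Stand-in for the top-level config object passed to the hook
    cache_config: CacheConfig = field(default_factory=CacheConfig)


def disable_prefix_caching(vllm_config: VllmConfig) -> None:
    """Mirror the added lines: force the flag off and log the reason."""
    cache_config = vllm_config.cache_config
    cache_config.enable_prefix_caching = False
    logger.info("Hybrid or mamba-based model detected: disabling prefix "
                "caching since it is not yet supported.")


config = VllmConfig()
disable_prefix_caching(config)
print(config.cache_config.enable_prefix_caching)  # prints False
```

Because the assignment is unconditional once this point is reached, an explicit user request to enable prefix caching would also be overridden; the log message is what tells the user this happened.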