Disable chunked prefill and/or prefix caching when MLA is enabled (#12642)

From @mgoin in https://github.com/vllm-project/vllm/pull/12638

I cannot push to that branch, therefore a new PR to unblock release.

---------

Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

Commit 4f4d427a authored Jan 31, 2025 by Simon Mo, committed by GitHub Jan 31, 2025

Showing with 10 additions and 0 deletions

vllm/config.py +10 -0

--- a/vllm/config.py
+++ b/vllm/config.py
@@ -3252,6 +3252,16 @@ class VllmConfig:

        current_platform.check_and_update_config(self)

+        # If MLA is enabled, force disable chunked prefill and prefix caching
+        if self.model_config and self.model_config.use_mla:
+            logger.info("MLA is enabled; forcing chunked prefill and prefix "
+                        "caching to be disabled.")
+            self.scheduler_config.enable_chunked_prefill = False
+            self.scheduler_config.chunked_prefill_enabled = False
+
+            if self.cache_config is not None:
+                self.cache_config.enable_prefix_caching = False
+
        if not self.instance_id:
            self.instance_id = random_uuid()[:5]
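
The override added in this hunk can be exercised in isolation with a small sketch. The dataclasses below are hypothetical stand-ins that carry only the fields the diff touches; they are not vLLM's real config classes, and placing the check in a post-init hook is an assumption made for illustration. Defaults are set to True here only so the effect of the override is visible.

```python
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class ModelConfig:
    # Hypothetical stand-in: only the flag read by the diff.
    use_mla: bool = False


@dataclass
class SchedulerConfig:
    # Hypothetical stand-in: the two scheduler fields the diff resets.
    enable_chunked_prefill: bool = True
    chunked_prefill_enabled: bool = True


@dataclass
class CacheConfig:
    # Hypothetical stand-in: the cache field the diff resets.
    enable_prefix_caching: bool = True


@dataclass
class ToyVllmConfig:
    # Toy container mirroring the attribute names used in the diff.
    model_config: Optional[ModelConfig] = None
    scheduler_config: SchedulerConfig = field(default_factory=SchedulerConfig)
    cache_config: Optional[CacheConfig] = None

    def __post_init__(self) -> None:
        # Same condition and assignments as the added lines in the diff.
        if self.model_config and self.model_config.use_mla:
            logger.info("MLA is enabled; forcing chunked prefill and prefix "
                        "caching to be disabled.")
            self.scheduler_config.enable_chunked_prefill = False
            self.scheduler_config.chunked_prefill_enabled = False

            # The cache config may be absent, so guard before writing to it.
            if self.cache_config is not None:
                self.cache_config.enable_prefix_caching = False


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    cfg = ToyVllmConfig(model_config=ModelConfig(use_mla=True),
                        cache_config=CacheConfig())
    print(cfg.scheduler_config, cfg.cache_config)
```

In this sketch the user-supplied values are overwritten silently apart from the info log, which matches the diff: a request for chunked prefill or prefix caching does not raise an error when MLA is enabled, it is simply turned off.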