[Fix] MLA only supports decode-only full CUDAGraph capture. Make sure all cudagraph capture sizes <= max_num_seq.

b35835a1 · zhuwenwen · e4a84fdc · b35835a1
Commit b35835a1 authored Jul 25, 2025 by zhuwenwen

Showing with 4 additions and 1 deletion

vllm/config.py +4 -1

--- a/vllm/config.py
+++ b/vllm/config.py
@@ -4745,7 +4745,10 @@ class VllmConfig:
            batch_size_capture_list = []
            if self.model_config is not None and \
                not self.model_config.enforce_eager:
-                cuda_graph_sizes = self.scheduler_config.cuda_graph_sizes
+                if self.model_config.use_mla and self.compilation_config.full_cuda_graph:
+                    cuda_graph_sizes = [256]
+                else:
+                    cuda_graph_sizes = self.scheduler_config.cuda_graph_sizes
                if len(cuda_graph_sizes) == 1:
                    batch_size_capture_list = [1, 2, 4] + [
                        i for i in range(8, cuda_graph_sizes[0] + 1, 8)
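The effect of the hunk above on the capture list can be sketched in isolation. This is a minimal standalone illustration, not the code in vllm/config.py: the function names `build_capture_list` and `cap_to_max_num_seqs` are hypothetical, the handling of a multi-element `cuda_graph_sizes` (sorting it) is an assumption because that branch is not shown in the diff, and the filtering helper is only one possible way to enforce the "capture sizes <= max_num_seq" constraint from the commit title (the diff itself hardcodes 256 instead).

```python
from typing import List


def build_capture_list(cuda_graph_sizes: List[int],
                       use_mla: bool = False,
                       full_cuda_graph: bool = False) -> List[int]:
    """Hypothetical standalone version of the logic touched by the diff."""
    # Shown in the diff: with MLA and full CUDA graph capture, the
    # configured sizes are ignored and a single upper bound of 256 is used.
    if use_mla and full_cuda_graph:
        cuda_graph_sizes = [256]

    # Shown in the diff: a single value N expands to 1, 2, 4, then
    # every multiple of 8 up to and including N.
    if len(cuda_graph_sizes) == 1:
        return [1, 2, 4] + list(range(8, cuda_graph_sizes[0] + 1, 8))

    # Assumption (branch not visible in the diff): an explicit list of
    # sizes is used as given, in ascending order.
    return sorted(cuda_graph_sizes)


def cap_to_max_num_seqs(sizes: List[int], max_num_seqs: int) -> List[int]:
    """Hypothetical helper: drop capture sizes above the scheduler limit."""
    return [s for s in sizes if s <= max_num_seqs]


if __name__ == "__main__":
    mla_sizes = build_capture_list([512], use_mla=True, full_cuda_graph=True)
    print(mla_sizes[-1], len(mla_sizes))
    print(cap_to_max_num_seqs(build_capture_list([16]), max_num_seqs=10))
```

Under these assumptions, a configured bound of 512 is replaced by 256 on the MLA path, while non-MLA configurations are unchanged. Filtering against the scheduler limit, rather than hardcoding 256, would keep the list valid when the limit is smaller than 256.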