[Bugfix] Allocate less memory in non-batched CUTLASS MoE (#21121)

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Commit 4adc66f6 authored Jul 18, 2025 by ElizaWszola, committed by GitHub Jul 18, 2025

Showing 1 changed file with 2 additions and 2 deletions

vllm/model_executor/layers/fused_moe/cutlass_moe.py +2 -2

--- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py
+++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py
@@ -283,8 +283,8 @@ class CutlassExpertsFp8(mk.FusedMoEPermuteExpertsUnpermute):
                          (N // 2))
            output = (self.max_experts_per_worker, padded_M, K)
        else:
-            workspace1 = (M * topk, max(2 * N, K))
-            workspace2 = (M * topk, N)
+            workspace1 = (M * topk, max(N, K))
+            workspace2 = (M * topk, N // 2)
            output = (M * topk, K)
        return (workspace1, workspace2, output,
                self.out_dtype if self.out_dtype is not None else a.dtype)
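
To gauge the effect of the change in the non-batched branch, the workspace shapes before and after can be compared by element count. The following is a minimal sketch, not part of the commit: `workspace_elems` is a hypothetical helper, the shapes are copied from the diff above, and the actual byte savings depend on the dtype each workspace is allocated with.

```python
def workspace_elems(M: int, topk: int, N: int, K: int) -> tuple[int, int]:
    """Return (old, new) total element counts for workspace1 + workspace2.

    Hypothetical helper for illustration; shapes mirror the diff's
    non-batched (else) branch only.
    """
    rows = M * topk
    # Before: workspace1 = (M * topk, max(2 * N, K)), workspace2 = (M * topk, N)
    old = rows * max(2 * N, K) + rows * N
    # After:  workspace1 = (M * topk, max(N, K)),     workspace2 = (M * topk, N // 2)
    new = rows * max(N, K) + rows * (N // 2)
    return old, new


if __name__ == "__main__":
    old, new = workspace_elems(M=4, topk=2, N=8, K=6)
    print(f"old={old} elements, new={new} elements")
```

When `N >= K` and `N` is even, both workspaces shrink by half; when `K` dominates `max(N, K)` in both versions, only `workspace2` shrinks.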