Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe (#35088)

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>

Unverified commit a0c70816, authored Feb 24, 2026 by danisereb, committed by GitHub Feb 24, 2026

Showing with 2 additions and 2 deletions

vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py +2 -2

--- a/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
+++ b/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
@@ -348,7 +348,7 @@ def flashinfer_trtllm_fp4_moe(
        hidden_states=hidden_states_fp4,
        hidden_states_scale=hidden_states_scale_linear_fp4.view(
            torch.float8_e4m3fn
-        ).flatten(),
+        ).reshape(*hidden_states_fp4.shape[:-1], -1),
        gemm1_weights=layer.w13_weight.data,
        gemm1_weights_scale=layer.w13_weight_scale.data.view(torch.float8_e4m3fn),
        gemm1_bias=None,
@@ -432,7 +432,7 @@ def flashinfer_trtllm_fp4_routed_moe(
        hidden_states=hidden_states_fp4,
        hidden_states_scale=hidden_states_scale_linear_fp4.view(
            torch.float8_e4m3fn
-        ).flatten(),
+        ).reshape(*hidden_states_fp4.shape[:-1], -1),
        gemm1_weights=layer.w13_weight.data,
        gemm1_weights_scale=layer.w13_weight_scale.data.view(torch.float8_e4m3fn),
        gemm1_bias=None,