[Kernel][MoE] fix computation order of MoE weight multiplication and improve flow (#31962)

Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Unverified commit 629584bf, authored Jan 13, 2026 by xuebwang-amd; committed by GitHub Jan 12, 2026
Showing with 24 additions and 9 deletions

vllm/model_executor/layers/fused_moe/fused_moe.py +24 -9

--- a/vllm/model_executor/layers/fused_moe/fused_moe.py
+++ b/vllm/model_executor/layers/fused_moe/fused_moe.py
@@ -531,22 +531,37 @@ def fused_moe_kernel(
        a_ptrs += BLOCK_SIZE_K * stride_ak
        b_ptrs += BLOCK_SIZE_K * stride_bk
-    # Router weight multiplication MUST happen in float32 before precision
-    # conversion for numerical stability (especially critical on ROCm).
-    if MUL_ROUTED_WEIGHT:
-        moe_weight = tl.load(topk_weights_ptr + offs_token, mask=token_mask, other=0)
-        accumulator = accumulator * moe_weight[:, None]
+    # Dequantization for supported quantization schemes:
+    #   - int8_w8a16
+    #   - fp8_w8a8
+    #   - int8_w8a8
+    # Accumulator and scalings are in float32 to preserve numerical accuracy.
    if use_int8_w8a16:
        accumulator = accumulator * b_scale
    elif (use_fp8_w8a8 or use_int8_w8a8) and not (group_k > 0 and group_n > 0):
        accumulator = accumulator * a_scale * b_scale
-    # Bias is added AFTER dequantization since bias is typically stored in
-    # the output dtype and should not be scaled by quantization factors.
+    # Bias addition:
+    # Bias must be applied after dequantization:
+    #   - Since bias is typically not quantized
+    #   - Bias should not be scaled by quantization factors
    if HAS_BIAS:
-        accumulator = accumulator + bias[None, :]
+        accumulator += bias[None, :]
+    # Router (MoE) weight multiplication:
+    # This multiplication MUST be performed in float32 before any precision
+    # conversion to ensure numerical stability, which is especially critical
+    # on ROCm platforms.
+    if MUL_ROUTED_WEIGHT:
+        moe_weight = tl.load(
+            topk_weights_ptr + offs_token,
+            mask=token_mask,
+            other=0,
+        )
+        accumulator *= moe_weight[:, None]
+    # Final precision conversion:
+    # Cast once at the end to the desired compute/output dtype.
    accumulator = accumulator.to(compute_type)
    # -----------------------------------------------------------
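
Note on the reordering: the hunk above moves the router-weight multiplication from before dequantization to after the bias addition, so the bias is now scaled by the router weight as well. The sketch below is a plain-Python illustration of that difference, not the Triton kernel itself. The scalar values, the helper names (`old_order`, `new_order`, `to_fp16`), and the choice of float16 as the compute dtype are all assumptions made for the example. The second half shows, under the same assumptions, why multiplying in higher precision and casting once can be more accurate than casting the operands first.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def old_order(acc: float, scale: float, bias: float, w: float) -> float:
    # Order before this commit: router weight, then dequantization, then bias.
    return (acc * w) * scale + bias


def new_order(acc: float, scale: float, bias: float, w: float) -> float:
    # Order after this commit: dequantization, then bias, then router weight.
    return (acc * scale + bias) * w


# Illustrative scalars (assumptions, not values taken from the kernel).
acc, scale, bias, w = 8.0, 0.5, 2.0, 0.25

old = old_order(acc, scale, bias, w)  # bias is NOT scaled by the router weight
new = new_order(acc, scale, bias, w)  # bias IS scaled by the router weight

# Without a bias term the two orders agree, since only multiplications remain.
no_bias_old = old_order(acc, scale, 0.0, w)
no_bias_new = new_order(acc, scale, 0.0, w)

# Precision: multiply in higher precision and cast once, versus casting each
# operand to float16 first and casting the product again.
acc_hi, w_hi = 1000.3, 0.3337
cast_once = to_fp16(acc_hi * w_hi)
cast_early = to_fp16(to_fp16(acc_hi) * to_fp16(w_hi))

print(old, new, no_bias_old, no_bias_new, cast_once, cast_early)
```

With these inputs the two orders differ only when a bias is present, and casting early moves the result one float16 step further from the exact product than casting once at the end.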