[Bugfix] Fix cutlass fp8 kernel on hopper for Qwen3.5 (#34914)

Signed-off-by: Roger Wang <hey@rogerw.io>

4fb8beef · Roger Wang · GitHub · 304319c4 · 4fb8beef
Unverified Commit 4fb8beef authored Feb 19, 2026 by Roger Wang Committed by GitHub Feb 19, 2026

Showing with 11 additions and 0 deletions

vllm/model_executor/layers/quantization/utils/flashinfer_utils.py +11 -0

--- a/vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
+++ b/vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
@@ -455,4 +455,15 @@ def prepare_fp8_moe_layer_for_fi(
            w2_input_scale=w2_input_scale,
        )

+    # Clamp block scales to avoid NaN from the FlashInfer CUTLASS kernel.
+    # Some FP8 models have near-zero block scales (~1e-23) for dead/unused
+    # experts. The CUTLASS kernel doesn't handle these correctly on Hopper
+    # (SM 9.0), producing NaN instead of near-zero output. Clamping to a
+    # small minimum prevents this without affecting model accuracy since
+    # these experts' effective weights are already zero.
+    if block_quant:
+        _FI_CUTLASS_MIN_BLOCK_SCALE = 1e-10
+        w13_scale.clamp_(min=_FI_CUTLASS_MIN_BLOCK_SCALE)
+        w2_scale.clamp_(min=_FI_CUTLASS_MIN_BLOCK_SCALE)
+
    return w13, w2, w13_scale
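
The clamp added in this hunk can be tried in isolation. The following is a minimal sketch on toy tensors, assuming float32 block scales; the helper `clamp_block_scales` and the constant `MIN_BLOCK_SCALE` are hypothetical names used for illustration and are not part of vLLM. It only shows the effect of the in-place clamp on near-zero entries and does not reproduce the NaN from the FlashInfer CUTLASS kernel.

```python
import torch

# Hypothetical constant; the value mirrors the 1e-10 minimum used in the patch.
MIN_BLOCK_SCALE = 1e-10


def clamp_block_scales(*scales: torch.Tensor, min_scale: float = MIN_BLOCK_SCALE) -> None:
    """Clamp each scale tensor in place so no entry falls below min_scale."""
    for scale in scales:
        scale.clamp_(min=min_scale)


# Toy block scales: each tensor has one near-zero entry (~1e-23),
# standing in for a dead/unused expert.
w13_scale = torch.tensor([[1e-23, 0.02], [0.5, 3e-4]], dtype=torch.float32)
w2_scale = torch.tensor([[0.1, 1e-23]], dtype=torch.float32)

clamp_block_scales(w13_scale, w2_scale)

# Near-zero entries are raised to the minimum; all other entries are unchanged.
print(w13_scale)
print(w2_scale)
```

Because `clamp_` mutates the tensors in place, the same scale tensors that the function returns carry the clamped values, with no extra allocation.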