OpenDAS / vllm_cscc
Commit 4fb8beef (Unverified)
Authored Feb 19, 2026 by Roger Wang
Committed by GitHub, Feb 19, 2026

[Bugfix] Fix cutlass fp8 kernel on hopper for Qwen3.5 (#34914)

Signed-off-by: Roger Wang <hey@rogerw.io>

parent 304319c4

Showing 1 changed file with 11 additions and 0 deletions

vllm/model_executor/layers/quantization/utils/flashinfer_utils.py (+11 -0)

@@ -455,4 +455,15 @@ def prepare_fp8_moe_layer_for_fi(
         w2_input_scale=w2_input_scale,
     )
 
+    # Clamp block scales to avoid NaN from the FlashInfer CUTLASS kernel.
+    # Some FP8 models have near-zero block scales (~1e-23) for dead/unused
+    # experts. The CUTLASS kernel doesn't handle these correctly on Hopper
+    # (SM 9.0), producing NaN instead of near-zero output. Clamping to a
+    # small minimum prevents this without affecting model accuracy since
+    # these experts' effective weights are already zero.
+    if block_quant:
+        _FI_CUTLASS_MIN_BLOCK_SCALE = 1e-10
+        w13_scale.clamp_(min=_FI_CUTLASS_MIN_BLOCK_SCALE)
+        w2_scale.clamp_(min=_FI_CUTLASS_MIN_BLOCK_SCALE)
+
     return w13, w2, w13_scale
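
The comment in the diff argues that flooring the block scales cannot change model output in any meaningful way. A rough way to see this is sketched below in plain Python. It is not the code from the commit: `clamp_min_` is a hypothetical list-based stand-in for the in-place `clamp_(min=...)` tensor call, the scale values are made up for illustration, and the 448.0 bound assumes the weights are stored in the FP8 E4M3 format.

```python
# Hypothetical stand-in for the in-place tensor call clamp_(min=...),
# written against a plain list so the sketch needs no third-party packages.
FP8_E4M3_MAX = 448.0     # assumption: weights use the FP8 E4M3 format
MIN_BLOCK_SCALE = 1e-10  # same floor as _FI_CUTLASS_MIN_BLOCK_SCALE in the diff


def clamp_min_(scales, floor):
    """Raise every entry below `floor` up to `floor`, modifying the list in place."""
    for i, s in enumerate(scales):
        if s < floor:
            scales[i] = floor
    return scales


# Illustrative values only: one "dead expert" scale and two ordinary ones.
block_scales = [1e-23, 3.5e-4, 2.0e-3]
clamp_min_(block_scales, MIN_BLOCK_SCALE)

# Largest magnitude a block sitting at the floor can dequantize to.
max_dequant = FP8_E4M3_MAX * MIN_BLOCK_SCALE

print(block_scales)
print(max_dequant)
```

Only the near-zero entry is raised; ordinary scales pass through unchanged. Under the E4M3 assumption, a block clamped to the floor can contribute values of magnitude at most about 4.5e-8, which is why the clamp is expected to be harmless for experts whose effective weights were already zero.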