OpenDAS / vllm_cscc
Commit d1fd802f (Unverified)
Authored Jan 10, 2026 by gnovack; committed by GitHub on Jan 11, 2026
fused_moe_kernel - cast accumulator after applying router weights (#32002)
Signed-off-by: gnovack <gnovack@amazon.com>
Parent: 543c23be
Showing 1 changed file with 5 additions and 8 deletions (+5 -8)
vllm/model_executor/layers/fused_moe/fused_moe.py
@@ -539,20 +539,17 @@ def fused_moe_kernel(
         accumulator = accumulator * moe_weight[:, None]
 
     if use_int8_w8a16:
-        accumulator = (accumulator * b_scale).to(compute_type)
-    elif use_fp8_w8a8 or use_int8_w8a8:
-        if group_k > 0 and group_n > 0:
-            accumulator = accumulator.to(compute_type)
-        else:
-            accumulator = (accumulator * a_scale * b_scale).to(compute_type)
-    else:
-        accumulator = accumulator.to(compute_type)
+        accumulator = accumulator * b_scale
+    elif (use_fp8_w8a8 or use_int8_w8a8) and not (group_k > 0 and group_n > 0):
+        accumulator = accumulator * a_scale * b_scale
 
     # Bias is added AFTER dequantization since bias is typically stored in
     # the output dtype and should not be scaled by quantization factors.
     if HAS_BIAS:
         accumulator = accumulator + bias[None, :]
 
+    accumulator = accumulator.to(compute_type)
+
     # -----------------------------------------------------------
     # Write back the block of the output
     offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
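Note on the cast placement: the diff moves the single `.to(compute_type)` so that it runs after dequantization and the bias add, instead of before the bias add. The following is a minimal sketch of why that ordering can change results, not the kernel itself. It assumes half precision as a stand-in for `compute_type` and uses Python floats as a stand-in for the wide accumulator; `to_half`, `epilogue_cast_early`, and `epilogue_cast_late` are hypothetical names introduced only for this illustration.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def epilogue_cast_early(acc: float, b_scale: float, bias: float) -> float:
    # Old ordering (illustrative): dequantize, cast to the narrow type,
    # then add the bias in the narrow type, which rounds a second time.
    narrow = to_half(acc * b_scale)
    return to_half(narrow + bias)


def epilogue_cast_late(acc: float, b_scale: float, bias: float) -> float:
    # New ordering (illustrative): dequantize and add the bias in wide
    # precision, then cast exactly once at the end.
    return to_half(acc * b_scale + bias)


# Values chosen so the wide-precision result is exactly 2050.0, while the
# intermediate 2049.0 is not representable in half precision.
acc, b_scale, bias = 4098.0, 0.5, 1.0
print(epilogue_cast_early(acc, b_scale, bias))  # 2048.0
print(epilogue_cast_late(acc, b_scale, bias))   # 2050.0
```

In this example the early cast rounds 2049.0 down to 2048.0 and then loses the bias entirely in the second rounding, while the late cast returns the exact value. How large the effect is in the real kernel depends on the actual `compute_type` and data, which this sketch does not model.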