OpenDAS / vllm_cscc
Commit bcf2be96, authored Mar 19, 2026 by khluu
[cherry-pick][Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591)#37605
Signed-off-by: khluu <khluu000@gmail.com>
Parent: 89138b21
Changes: 5
Showing 5 changed files with 42 additions and 5 deletions (+42 -5)

.buildkite/test_areas/lm_eval.yaml  +16 -0
tests/evals/gsm8k/configs/Qwen3.5-35B-A3B-DEP2.yaml  +8 -0
tests/evals/gsm8k/configs/Qwen3.5-35B-A3B-FP8-DEP2.yaml  +9 -0
tests/evals/gsm8k/configs/models-qwen35-blackwell.txt  +2 -0
vllm/model_executor/layers/fused_moe/experts/trtllm_fp8_moe.py  +7 -5

.buildkite/test_areas/lm_eval.yaml
...
@@ -45,6 +45,22 @@ steps:
   commands:
   - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt
+- label: LM Eval Qwen3.5 Models (B200)
+  timeout_in_minutes: 120
+  device: b200
+  optional: true
+  num_devices: 2
+  source_file_dependencies:
+  - vllm/model_executor/models/qwen3_5.py
+  - vllm/model_executor/models/qwen3_5_mtp.py
+  - vllm/transformers_utils/configs/qwen3_5.py
+  - vllm/transformers_utils/configs/qwen3_5_moe.py
+  - vllm/model_executor/models/qwen3_next.py
+  - vllm/model_executor/models/qwen3_next_mtp.py
+  - vllm/model_executor/layers/fla/ops/
+  commands:
+  - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-qwen35-blackwell.txt
 - label: LM Eval Large Models (H200)
   timeout_in_minutes: 60
   device: h200
...

tests/evals/gsm8k/configs/Qwen3.5-35B-A3B-DEP2.yaml  (0 → 100644)
+model_name: "Qwen/Qwen3.5-35B-A3B"
+accuracy_threshold: 0.86
+num_questions: 1319
+num_fewshot: 5
+server_args: >-
+  --max-model-len 4096
+  --data-parallel-size 2
+  --enable-expert-parallel

tests/evals/gsm8k/configs/Qwen3.5-35B-A3B-FP8-DEP2.yaml  (0 → 100644)
+model_name: "Qwen/Qwen3.5-35B-A3B-FP8"
+accuracy_threshold: 0.86
+num_questions: 1319
+num_fewshot: 5
+server_args: >-
+  --max-model-len 4096
+  --data-parallel-size 2
+  --enable-expert-parallel
+  --kv-cache-dtype fp8
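
The `server_args` value uses a YAML folded block scalar with strip chomping (`>-`), so its equally indented lines are joined by single spaces and the trailing newline is dropped. The sketch below mimics that folding by hand and splits the result into an argument list; it does not load the file or call the test harness. The `min_correct` figure assumes the test passes when measured accuracy is at least `accuracy_threshold`, which this diff does not show.

```python
import math
import shlex

# Lines of the folded scalar from Qwen3.5-35B-A3B-FP8-DEP2.yaml.
folded_lines = [
    "--max-model-len 4096",
    "--data-parallel-size 2",
    "--enable-expert-parallel",
    "--kv-cache-dtype fp8",
]

# `>-` folds line breaks into spaces and strips the final newline.
server_args = " ".join(folded_lines)

# Split the single string into argv-style tokens.
argv = shlex.split(server_args)

# Assumption: pass criterion is accuracy >= accuracy_threshold.
accuracy_threshold = 0.86
num_questions = 1319
min_correct = math.ceil(accuracy_threshold * num_questions)

print(argv)
print(min_correct)
```

Under that assumption, a run would need 1135 of the 1319 questions answered correctly to clear the 0.86 threshold.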

tests/evals/gsm8k/configs/models-qwen35-blackwell.txt  (0 → 100644)
+Qwen3.5-35B-A3B-DEP2.yaml
+Qwen3.5-35B-A3B-FP8-DEP2.yaml

vllm/model_executor/layers/fused_moe/experts/trtllm_fp8_moe.py
...
@@ -253,23 +253,25 @@ class TrtLlmFp8ExpertsMonolithic(TrtLlmFp8ExpertsBase, mk.FusedMoEExpertsMonolit
         weight_key: QuantKey | None,
         activation_key: QuantKey | None,
     ) -> bool:
-        """Monolithic kernels need to express router support."""
+        """Monolithic kernels need to express router support.
+        Renormalize/RenormalizeNaive are excluded: the monolithic kernel's
+        internal routing for these methods produces output uncorrelated
+        with the modular kernel's output and with Triton kernel's output
+        for Qwen3.5-35B-A3B-FP8.
+        See: https://github.com/vllm-project/vllm/issues/37591
+        """
         # NOTE(dbari): TopK routing could also be enabled, but need to validate models
         # NOTE(dbari): Default is not implemented and should not be enabled until it is
         if (weight_key, activation_key) == (kFp8Static128BlockSym, kFp8Dynamic128Sym):
             # NOTE(rob): potentially allow others here. This is a conservative list.
             return routing_method in [
                 RoutingMethodType.DeepSeekV3,
-                RoutingMethodType.Renormalize,
-                RoutingMethodType.RenormalizeNaive,
             ]
         elif (weight_key, activation_key) == (kFp8StaticTensorSym, kFp8StaticTensorSym):
             # NOTE(dbari): as above, potentially allow others here.
             return routing_method in [
                 RoutingMethodType.DeepSeekV3,
                 RoutingMethodType.Llama4,
-                RoutingMethodType.Renormalize,
-                RoutingMethodType.RenormalizeNaive,
             ]
         else:
             raise ValueError("Unsupported quantization scheme.")
...
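
The net effect of the hunk above can be sketched as a small lookup table, assuming the per-file counts (+7 -5) mean the two `Renormalize`/`RenormalizeNaive` entries were removed from both lists. Everything here is a standalone stand-in: `RoutingMethodType` is a hypothetical enum holding only the members named in the diff, the quantization keys are plain strings instead of vLLM's `QuantKey` objects, and `supports_routing` is an illustrative name, not the method in `trtllm_fp8_moe.py`.

```python
from enum import Enum, auto


class RoutingMethodType(Enum):
    # Hypothetical stand-in; only the members that appear in the diff.
    DeepSeekV3 = auto()
    Llama4 = auto()
    Renormalize = auto()
    RenormalizeNaive = auto()


# String stand-ins for the (weight_key, activation_key) pairs in the diff.
BLOCK_FP8 = ("kFp8Static128BlockSym", "kFp8Dynamic128Sym")
TENSOR_FP8 = ("kFp8StaticTensorSym", "kFp8StaticTensorSym")

# Routing methods still accepted after this commit, per quantization scheme.
SUPPORTED_ROUTING = {
    BLOCK_FP8: frozenset({RoutingMethodType.DeepSeekV3}),
    TENSOR_FP8: frozenset({RoutingMethodType.DeepSeekV3, RoutingMethodType.Llama4}),
}


def supports_routing(weight_key, activation_key, routing_method):
    """Return True if the monolithic path accepts this routing method."""
    try:
        allowed = SUPPORTED_ROUTING[(weight_key, activation_key)]
    except KeyError:
        # Mirrors the final `else` branch of the original method.
        raise ValueError("Unsupported quantization scheme.") from None
    return routing_method in allowed


print(supports_routing(*BLOCK_FP8, RoutingMethodType.DeepSeekV3))
print(supports_routing(*BLOCK_FP8, RoutingMethodType.Renormalize))
print(supports_routing(*TENSOR_FP8, RoutingMethodType.Llama4))
```

In this sketch a `Renormalize` or `RenormalizeNaive` request returns `False` for both schemes, which in the real code path signals that the monolithic kernel should not be selected for that routing method.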