[DeepEP] Reduce routed scaling overhead (#5277)
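
Note on why the reordering in this diff should be safe: the change moves the `routed_scaling_factor` multiply from the expert output (before `combine`) to the combined output (after `combine`) and makes it in-place. Assuming the combine step is a linear, weighted reduction over each token's top-k expert rows, scaling afterwards gives the same result while touching fewer elements. The sketch below illustrates this in pure Python; `toy_combine`, `scale_before`, and `scale_after` are hypothetical stand-ins, not the DeepEP dispatcher API.

```python
import math


def toy_combine(expert_rows, topk_weights):
    """Hypothetical combine: weighted sum of each token's top-k expert rows."""
    combined = []
    for rows, weights in zip(expert_rows, topk_weights):
        hidden = len(rows[0])
        combined.append(
            [sum(w * r[j] for r, w in zip(rows, weights)) for j in range(hidden)]
        )
    return combined


def scale_before(expert_rows, topk_weights, factor):
    # Old order: scale every expert row, then combine.
    scaled = [[[factor * v for v in row] for row in rows] for rows in expert_rows]
    return toy_combine(scaled, topk_weights)


def scale_after(expert_rows, topk_weights, factor):
    # New order: combine first, then scale the smaller result in place.
    out = toy_combine(expert_rows, topk_weights)
    for row in out:
        for j in range(len(row)):
            row[j] *= factor
    return out


# Two tokens, top-2 experts each, hidden size 2 (made-up values).
expert_rows = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, -1.0], [2.0, 0.25]],
]
topk_weights = [[0.75, 0.25], [0.5, 0.5]]
factor = 2.5

before = scale_before(expert_rows, topk_weights, factor)
after = scale_after(expert_rows, topk_weights, factor)

same = all(
    math.isclose(a, b) for ra, rb in zip(before, after) for a, b in zip(ra, rb)
)
# Number of elements multiplied by the scaling factor in each ordering.
mults_before = sum(len(rows) * len(rows[0]) for rows in expert_rows)
mults_after = sum(len(row) for row in after)
print(same, mults_before, mults_after)
```

In this toy setup the post-combine scaling multiplies one row per token instead of one row per (token, expert) pair. Whether the saving in the real model has the same shape depends on how DeepEP lays out the dispatched tensors, which this sketch does not model.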

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

Unverified Commit adca585b authored Apr 14, 2025 by yulei Committed by GitHub Apr 13, 2025

Showing with 9 additions and 10 deletions

python/sglang/srt/models/deepseek_v2.py +9 -10

--- a/python/sglang/srt/models/deepseek_v2.py
+++ b/python/sglang/srt/models/deepseek_v2.py
@@ -337,8 +337,7 @@ class DeepseekV2MoE(nn.Module):
                topk_weights,
                forward_mode=forward_mode,
            )
-        final_hidden_states = (
+        final_hidden_states = self.experts(
-            self.experts(
            hidden_states=hidden_states,
            reorder_topk_ids=reorder_topk_ids,
            seg_indptr=seg_indptr,
@@ -346,8 +345,6 @@ class DeepseekV2MoE(nn.Module):
            expected_m=expected_m,
            forward_mode=forward_mode,
        )
-            * self.routed_scaling_factor
-        )
        if self.ep_size > 1:
            final_hidden_states = self.deepep_dispatcher.combine(
                final_hidden_states,
@@ -355,6 +352,8 @@ class DeepseekV2MoE(nn.Module):
                topk_weights,
                forward_mode,
            )
+        final_hidden_states *= self.routed_scaling_factor
        if shared_output is not None:
            final_hidden_states = final_hidden_states + shared_output