[Quantization][ROCm] Fix MoE weight loading to be robust (Qwen3_MoE/Qwen3_next as example models) (#33173)

Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Commit f451b455 (unverified) authored Jan 31, 2026 by xuebwang-amd, committed by GitHub Jan 30, 2026

Showing with 6 additions and 2 deletions

vllm/model_executor/layers/fused_moe/layer.py +6 -2

--- a/vllm/model_executor/layers/fused_moe/layer.py
+++ b/vllm/model_executor/layers/fused_moe/layer.py
@@ -996,7 +996,9 @@ class FusedMoE(CustomOp):
            shard_size = expert_data.shape[shard_dim] // 2
        else:
            shard_size = expert_data.shape[shard_dim]
-        if not load_full:
+        # Only narrow if the loaded_weight is not a scalar (0-dim tensor)
+        # and we're not loading the full weight
+        if not load_full and loaded_weight.ndim > 0:
            loaded_weight = loaded_weight.narrow(
                shard_dim, shard_size * tp_rank, shard_size
            )
@@ -1022,7 +1024,9 @@ class FusedMoE(CustomOp):
        # down_proj: "RowParallel" so tp sharding on input_dim
        # Narrow parameter and load.
        shard_size = expert_data.shape[shard_dim]
-        if not load_full:
+        # Only narrow if the loaded_weight is not a scalar (0-dim tensor)
+        # and we're not loading the full weight
+        if not load_full and loaded_weight.ndim > 0:
            loaded_weight = loaded_weight.narrow(
                shard_dim, shard_size * tp_rank, shard_size
            )
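Both hunks add the same guard and differ only in how `shard_size` is computed: the gate_up (w13) path halves the parameter's shard dimension, while the down_proj (w2) path uses it whole. The sketch below exercises that guard in isolation. It assumes PyTorch; `narrow_for_tp` is a hypothetical helper (not a vLLM function) that reproduces the patched condition, and all tensor shapes are made up for illustration. That the 0-dim case comes from something like a per-tensor quantization scale is our assumption; the commit message does not say which tensors trigger it.

```python
import torch


def narrow_for_tp(
    loaded_weight: torch.Tensor,
    shard_dim: int,
    shard_size: int,
    tp_rank: int,
    load_full: bool = False,
) -> torch.Tensor:
    """Hypothetical helper mirroring the patched condition in FusedMoE.

    Slices this rank's shard out of a checkpoint tensor, but passes 0-dim
    tensors through unchanged, because Tensor.narrow rejects 0-dim inputs.
    """
    if not load_full and loaded_weight.ndim > 0:
        loaded_weight = loaded_weight.narrow(
            shard_dim, shard_size * tp_rank, shard_size
        )
    return loaded_weight


# Without the ndim check, narrowing a scalar (e.g. an assumed per-tensor
# scale) raises a RuntimeError.
scalar_scale = torch.tensor(0.5)
try:
    scalar_scale.narrow(0, 0, 1)
    narrow_failed = False
except RuntimeError:
    narrow_failed = True

# down_proj (w2) style: split a 2-D weight along dim 1 across tp_size = 2.
tp_size = 2
full_w2 = torch.arange(16.0).reshape(2, 8)
w2_shard_size = full_w2.shape[1] // tp_size
rank1_shard = narrow_for_tp(
    full_w2, shard_dim=1, shard_size=w2_shard_size, tp_rank=1
)

# gate_up (w13) style: the per-rank parameter holds gate and up halves,
# so each checkpoint shard is half its size along shard_dim.
w13_param = torch.empty(8, 4)  # hypothetical per-rank [2 * inter_shard, hidden]
w13_shard_size = w13_param.shape[0] // 2
full_gate = torch.arange(32.0).reshape(8, 4)  # hypothetical full w1
gate_rank0 = narrow_for_tp(
    full_gate, shard_dim=0, shard_size=w13_shard_size, tp_rank=0
)

# The scalar passes through the guard and copies into a 0-dim parameter.
param_scale = torch.zeros(())
param_scale.copy_(
    narrow_for_tp(scalar_scale, shard_dim=0, shard_size=1, tp_rank=1)
)
```

Because a 0-dim tensor holds one value, every tensor-parallel rank needs the full value anyway, so skipping the slice is equivalent to loading it whole rather than a special case that changes semantics.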