Fix cuda illegal mem access with Llama4 TP8 + rms_norm custom op (#22701)

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>

Commit 4f0f844b, authored Aug 13, 2025 by Po-Han Huang (NVIDIA); committed by GitHub Aug 12, 2025

Showing with 6 additions and 2 deletions

vllm/model_executor/models/llama4.py +6 -2
--- a/vllm/model_executor/models/llama4.py
+++ b/vllm/model_executor/models/llama4.py
@@ -224,10 +224,14 @@ class Llama4Attention(nn.Module):
         if self.rotary_emb is not None:
             q, k = self.rotary_emb(positions, q, k)
         if self.qk_norm is not None:
-            q = q.reshape(-1, self.num_heads, self.head_dim)
+            # Normalization is applied on the head_dim dimension. The rest of
+            # the dimensions are collapsed into a single dimension to support
+            # custom rms_norm cuda kernel.
+            q = q.reshape(-1, self.head_dim)
             q = self.qk_norm(q.float()).reshape(-1, self.q_size).to(q.dtype)
-            k = k.reshape(-1, self.num_kv_heads, self.head_dim)
+            k = k.reshape(-1, self.head_dim)
             k = self.qk_norm(k.float()).reshape(-1, self.kv_size).to(k.dtype)
         # We are applying temperature tuning (https://arxiv.org/abs/2501.19399)
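
The reshape change above relies on RMS normalization acting only on the last (head_dim) axis, so collapsing the token and head axes into one should not change the values. The sketch below illustrates that with plain Python lists standing in for tensors. It assumes a weightless RMS norm with a hypothetical eps of 1e-6, and the helpers `rms_norm` and `split_heads` are illustrative names, not vLLM APIs. The commit does not state why the 3D input triggered the illegal memory access, so the sketch only shows that the 2D layout gives the same result.

```python
import math

def rms_norm(row, eps=1e-6):
    """RMS-normalize one vector of length head_dim (no learned weight; eps is assumed)."""
    scale = 1.0 / math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [x * scale for x in row]

def split_heads(flat, head_dim):
    """Cut a flat [num_heads * head_dim] vector into per-head chunks."""
    return [flat[i:i + head_dim] for i in range(0, len(flat), head_dim)]

num_tokens, num_heads, head_dim = 2, 3, 4
q_size = num_heads * head_dim

# q as it arrives in the layer: [num_tokens, q_size]
q = [[float(t * q_size + i + 1) for i in range(q_size)] for t in range(num_tokens)]

# Old layout: [num_tokens, num_heads, head_dim], normalized per head.
out_3d = [[rms_norm(h) for h in split_heads(tok, head_dim)] for tok in q]

# New layout: [num_tokens * num_heads, head_dim], normalized per row.
rows_2d = [h for tok in q for h in split_heads(tok, head_dim)]
out_2d = [rms_norm(r) for r in rows_2d]

# Reshape both results back to [num_tokens, q_size] and compare.
flat_3d = [[x for h in tok for x in h] for tok in out_3d]
flat_2d = [
    [x for r in out_2d[t * num_heads:(t + 1) * num_heads] for x in r]
    for t in range(num_tokens)
]
same = flat_3d == flat_2d
```

Both paths normalize exactly the same head_dim-sized vectors in the same order, which is why the final reshape back to `q_size` (or `kv_size` for `k`) is unaffected by the change.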