[bugfix] fix MHA for models like OpenGVLab/InternVL3_5-38B (#25146)

Signed-off-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

a684c012 · Yan Ma · GitHub · f2718d29 · a684c012
Unverified Commit a684c012 authored Sep 19, 2025 by Yan Ma Committed by GitHub Sep 19, 2025

Showing with 5 additions and 3 deletions

vllm/attention/layer.py +5 -3

--- a/vllm/attention/layer.py
+++ b/vllm/attention/layer.py
@@ -430,9 +430,11 @@ class MultiHeadAttention(nn.Module):
        key: torch.Tensor,
        value: torch.Tensor,
    ) -> torch.Tensor:
-        """Input shape: batch_size x seq_len x hidden_size"""
-        # TODO(Isotr0py): Use existing backend implementations and support FA3
-        bsz, q_len, _ = query.size()
+        """Input shape: 
+        (batch_size x seq_len x hidden_size) or
+        (batch_size x seq_len x num_heads x head_size)
+        """
+        bsz, q_len = query.size()[:2]
        kv_len = key.size(1)
        query = query.view(bsz, q_len, self.num_heads, self.head_size)
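
The change above replaces a three-way unpack of `query.size()` with a slice of its first two entries. A minimal sketch of why that matters for 4-D input follows. It uses plain tuples in place of `torch.Size` (which is a `tuple` subclass) so it runs without PyTorch. The helper names and the example shapes are hypothetical and are not taken from the PR.

```python
# Plain tuples stand in for torch.Size (a tuple subclass).
# Helper names and shapes are illustrative assumptions, not vLLM code.

def leading_dims_old(shape):
    # Pre-fix pattern: the unpack requires exactly three dimensions.
    bsz, q_len, _ = shape
    return bsz, q_len


def leading_dims_new(shape):
    # Post-fix pattern: take the first two dimensions, whatever follows.
    bsz, q_len = shape[:2]
    return bsz, q_len


shape_3d = (2, 16, 1024)    # batch_size x seq_len x hidden_size
shape_4d = (2, 16, 8, 128)  # batch_size x seq_len x num_heads x head_size

# The sliced form handles both layouts.
result_3d = leading_dims_new(shape_3d)
result_4d = leading_dims_new(shape_4d)

# The three-way unpack rejects the 4-D layout with a ValueError.
try:
    leading_dims_old(shape_4d)
    old_fails_on_4d = False
except ValueError:
    old_fails_on_4d = True
```

In this sketch both layouts hold the same number of elements (8 x 128 = 1024), so the `query.view(bsz, q_len, self.num_heads, self.head_size)` call that follows in the diff would see a compatible element count in either case.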