Unverified commit a3b402ff authored by Prathik Rao, committed by GitHub

llama fp16 torch.max bug fix (#24561)



* open llama fp16 bug fix

* bug fix

* bug fixed

* make style

* Update modeling_llama.py

* apply formatting

* Address Amy's comment

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
parent 4e945660
@@ -224,9 +224,10 @@ class LlamaAttention(nn.Module):
                     f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
                 )
             attn_weights = attn_weights + attention_mask
-            attn_weights = torch.max(
-                attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
+            dtype_min = torch.tensor(
+                torch.finfo(attn_weights.dtype).min, device=attn_weights.device, dtype=attn_weights.dtype
             )
+            attn_weights = torch.max(attn_weights, dtype_min)
 
         # upcast attention to fp32
         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
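For context (not part of the commit itself), here is a minimal sketch of the dtype mismatch the patch addresses, assuming a plain PyTorch environment; the tensor shape and variable names are illustrative only:

    import torch

    # Stand-in for fp16 attention scores (attn_weights in the patch).
    attn_weights = torch.randn(1, 1, 4, 4, dtype=torch.float16)

    # Before the fix: torch.tensor() with no dtype argument infers float32
    # from the Python float returned by torch.finfo(...).min, so the clamp
    # value does not match attn_weights when the model runs in fp16.
    old_min = torch.tensor(torch.finfo(attn_weights.dtype).min)
    print(old_min.dtype)  # torch.float32

    # After the fix: the clamp value is created directly in the working dtype,
    # so both operands of torch.max share the same dtype.
    dtype_min = torch.tensor(torch.finfo(attn_weights.dtype).min, dtype=attn_weights.dtype)
    print(dtype_min.dtype)  # torch.float16

    attn_weights = torch.max(attn_weights, dtype_min)
    print(attn_weights.dtype)  # torch.float16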