Unverified Commit 2844c5de authored by Jingya HUANG, committed by GitHub

Fix ORTTrainer failure on gpt2 fp16 training (#18017)



* Ensure value and attn weights have the same dtype

* Remove prints

* Modify decision transformers copied from gpt2

* Nit device
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Fix style
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 2b096508
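
A minimal standalone sketch (not code from this commit; the tensor shapes and the repro itself are assumptions) of why the scaling divisor is wrapped in a tensor with an explicit dtype and device. Dividing by a bare Python float works in eager PyTorch, but once the GPT-2 graph is traced and run through ONNX Runtime for fp16 training, the scalar can surface as a float32 constant and clash with the fp16 attention weights; building the divisor from attn_weights itself keeps the dtypes consistent.

import torch

# Hypothetical repro: "value" stands in for the attention value states,
# "attn_weights" for the query-key scores. Shapes are arbitrary.
value = torch.randn(1, 12, 8, 64, dtype=torch.float16)
attn_weights = torch.randn(1, 12, 8, 8, dtype=torch.float16)

# Old form: divide by a Python float. Fine in eager mode, but the bare
# scalar may become a float32 constant in the exported/ORT graph.
scaled_old = attn_weights / (value.size(-1) ** 0.5)

# New form (as in this commit): materialize the divisor with the same
# dtype and device as attn_weights, so the division stays in fp16.
scale = torch.tensor(
    value.size(-1) ** 0.5, dtype=attn_weights.dtype, device=attn_weights.device
)
scaled_new = attn_weights / scale

assert scaled_new.dtype == torch.float16
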
@@ -178,7 +178,9 @@ class DecisionTransformerGPT2Attention(nn.Module):
         attn_weights = torch.matmul(query, key.transpose(-1, -2))
 
         if self.scale_attn_weights:
-            attn_weights = attn_weights / (value.size(-1) ** 0.5)
+            attn_weights = attn_weights / torch.tensor(
+                value.size(-1) ** 0.5, dtype=attn_weights.dtype, device=attn_weights.device
+            )
 
         # Layer-wise attention scaling
         if self.scale_attn_by_inverse_layer_idx:
@@ -189,7 +189,9 @@ class GPT2Attention(nn.Module):
         attn_weights = torch.matmul(query, key.transpose(-1, -2))
 
         if self.scale_attn_weights:
-            attn_weights = attn_weights / (value.size(-1) ** 0.5)
+            attn_weights = attn_weights / torch.tensor(
+                value.size(-1) ** 0.5, dtype=attn_weights.dtype, device=attn_weights.device
+            )
 
         # Layer-wise attention scaling
         if self.scale_attn_by_inverse_layer_idx: