Unverified commit 9e666aaa authored by Abhi Sharma, committed by GitHub

Fix gradient overflow issue when applying the attention mask

This fix is in reference to issue #382. GPT2 can now be trained in mixed precision, which I've confirmed with testing. I also tested unconditional generation on multiple seeds before and after changing 1e10 to 1e4, and there was no difference in the output. Please let me know if there is anything else I can do to improve this pull request. Thanks for all your work!
parent 3d78e226
@@ -218,7 +218,7 @@ class Attention(nn.Module):
         w = w / math.sqrt(v.size(-1))
         nd, ns = w.size(-2), w.size(-1)
         b = self.bias[:, :, ns-nd:ns, :ns]
-        w = w * b - 1e10 * (1 - b)
+        w = w * b - 1e4 * (1 - b)
         w = nn.Softmax(dim=-1)(w)
         return torch.matmul(w, v)
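As background (not part of the diff), here is a minimal sketch of why the constant matters in mixed precision; the toy logits below are made up for illustration:

import torch

# float16 can only represent magnitudes up to ~65504, so a -1e10 mask value
# overflows to -inf once the logits are cast to half precision, which is what
# caused the gradient overflow during mixed-precision training.
logits_old = torch.tensor([0.0, 0.0, 0.0, -1e10]).half()
logits_new = torch.tensor([0.0, 0.0, 0.0, -1e4]).half()

print(logits_old)  # tensor([0., 0., 0., -inf], dtype=torch.float16)
print(logits_new)  # tensor([0., 0., 0., -10000.], dtype=torch.float16)

# -1e4 is still far enough below any real logit that softmax drives the
# masked position to (numerically) zero, so generation output is unchanged.
print(torch.softmax(logits_new.float(), dim=-1))  # ~[0.3333, 0.3333, 0.3333, 0.0000]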