"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "c49ce3c722c35324803e40efb88b1a3057c7f249"
Unverified commit 9e666aaa, authored by Abhi Sharma and committed by GitHub

Fix gradient overflow issue during attention masking

This fix is in reference to issue #382. GPT2 can now be trained in mixed precision, which I've confirmed with testing. I also tested unconditional generation on multiple seeds before and after changing 1e10 to 1e4, and there was no difference in the output. Please let me know if there is anything else I can do to make this pull request better. Thanks for all your work!
parent 3d78e226
@@ -218,7 +218,7 @@ class Attention(nn.Module):
         w = w / math.sqrt(v.size(-1))
         nd, ns = w.size(-2), w.size(-1)
         b = self.bias[:, :, ns-nd:ns, :ns]
-        w = w * b - 1e10 * (1 - b)
+        w = w * b - 1e4 * (1 - b)
         w = nn.Softmax(dim=-1)(w)
         return torch.matmul(w, v)
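
For context, a minimal sketch (assuming PyTorch; the values below are illustrative and not taken from the issue) of why the constant matters in mixed precision: float16 tops out around 65504, so a mask offset of 1e10 overflows to infinity when the attention logits are kept in half precision, while 1e4 stays representable and still suppresses the masked positions in the softmax.

import torch

# float16 cannot represent 1e10: its largest finite value is about 65504, so
# the old mask offset turns masked logits into -inf in half precision.
print(torch.finfo(torch.float16).max)            # 65504.0
print(torch.tensor(-1e10, dtype=torch.float16))  # tensor(-inf, dtype=torch.float16)
print(torch.tensor(-1e4, dtype=torch.float16))   # tensor(-10000., dtype=torch.float16)

# -1e4 is still large enough to zero out masked positions after the softmax,
# which is consistent with generation output being unchanged (computed in
# float32 here for clarity).
print(torch.softmax(torch.tensor([0.0, -1e4]), dim=0))  # tensor([1., 0.])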