Unverified commit 9e666aaa authored by Abhi Sharma, committed by GitHub

Fix gradient overflow issue when applying the attention mask

This fix is in reference to issue #382. GPT2 can now be trained in mixed precision, which I've confirmed with testing. I also tested unconditional generation on multiple seeds before and after changing 1e10 to 1e4, and there was no difference in the output. Please let me know if there is anything else I can do to improve this pull request. Thanks for all your work!
parent 3d78e226
@@ -218,7 +218,7 @@ class Attention(nn.Module):
         w = w / math.sqrt(v.size(-1))
         nd, ns = w.size(-2), w.size(-1)
         b = self.bias[:, :, ns-nd:ns, :ns]
-        w = w * b - 1e10 * (1 - b)
+        w = w * b - 1e4 * (1 - b)
         w = nn.Softmax(dim=-1)(w)
         return torch.matmul(w, v)
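As background (not part of the diff), here is a minimal sketch of why the constant matters in mixed precision; the toy logits below are made up for illustration:

import torch

# float16 can only represent magnitudes up to ~65504, so a -1e10 mask value
# overflows to -inf once the logits are cast to half precision, which is what
# caused the gradient overflow during mixed-precision training.
logits_old = torch.tensor([0.0, 0.0, 0.0, -1e10]).half()
logits_new = torch.tensor([0.0, 0.0, 0.0, -1e4]).half()

print(logits_old)  # tensor([0., 0., 0., -inf], dtype=torch.float16)
print(logits_new)  # tensor([0., 0., 0., -10000.], dtype=torch.float16)

# -1e4 is still far enough below any real logit that softmax drives the
# masked position to (numerically) zero, so generation output is unchanged.
print(torch.softmax(logits_new.float(), dim=-1))  # ~[0.3333, 0.3333, 0.3333, 0.0000]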