Minor fix to make adafactor work for >2d conv kernels (#1122)

Summary: A .unsqueeze(-1) was missing on line 124. Without it, the division in _approx_sq_grad raises a runtime error for >2d convolutional kernels. With this fix, Adafactor's 2d logic is applied to the final two dimensions.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/1122
Differential Revision: D17431662
Pulled By: myleott
fbshipit-source-id: e7435e77270a9252f75f01b2457ef0048f5bcf36

Commit 8dbee4ab authored Sep 18, 2019 by Akhilesh Gotmare Committed by Facebook Github Bot Sep 18, 2019

Showing with 1 addition and 1 deletion

fairseq/optim/adafactor.py +1 -1

--- a/fairseq/optim/adafactor.py
+++ b/fairseq/optim/adafactor.py
@@ -121,7 +121,7 @@ class Adafactor(torch.optim.Optimizer):
        return tensor.norm(2) / (tensor.numel() ** 0.5)

    def _approx_sq_grad(self, exp_avg_sq_row, exp_avg_sq_col, output):
-        r_factor = (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1)).rsqrt_().unsqueeze(-1)
+        r_factor = (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1).unsqueeze(-1)).rsqrt_().unsqueeze(-1)
        c_factor = exp_avg_sq_col.unsqueeze(-2).rsqrt()
        torch.mul(r_factor, c_factor, out=output)
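
The shape mismatch behind this change can be reproduced outside the optimizer. The following is a minimal sketch, not the fairseq implementation: the two standalone functions are hypothetical copies of the old and new expressions from the hunk above (returning the product instead of writing to `output`), and it assumes the row statistics have the gradient's shape without its last dimension and the column statistics have the gradient's shape without its second-to-last dimension.

```python
import torch


def approx_sq_grad_old(exp_avg_sq_row, exp_avg_sq_col):
    # Pre-fix expression: mean(dim=-1) drops the last dimension, so the
    # division only broadcasts when the row statistics are 1-D.
    r_factor = (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1)).rsqrt_().unsqueeze(-1)
    c_factor = exp_avg_sq_col.unsqueeze(-2).rsqrt()
    return torch.mul(r_factor, c_factor)


def approx_sq_grad_new(exp_avg_sq_row, exp_avg_sq_col):
    # Post-fix expression: unsqueeze(-1) restores the reduced dimension,
    # so each row vector is divided by its own mean.
    row_mean = exp_avg_sq_row.mean(dim=-1).unsqueeze(-1)
    r_factor = (exp_avg_sq_row / row_mean).rsqrt_().unsqueeze(-1)
    c_factor = exp_avg_sq_col.unsqueeze(-2).rsqrt()
    return torch.mul(r_factor, c_factor)


# Hypothetical 4-D conv kernel gradient: (out_channels, in_channels, kH, kW).
grad = torch.rand(8, 4, 3, 5) + 0.1   # strictly positive, keeps rsqrt finite
sq = grad ** 2
row = sq.mean(dim=-1)                  # shape (8, 4, 3)
col = sq.mean(dim=-2)                  # shape (8, 4, 5)

fixed = approx_sq_grad_new(row, col)   # same shape as grad

try:
    approx_sq_grad_old(row, col)       # (8, 4, 3) / (8, 4) cannot broadcast
    old_failed = False
except RuntimeError:
    old_failed = True
```

For a plain 2-D weight the row mean is a scalar, so both expressions give the same result; the difference only shows up once the row statistics have more than one dimension.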