Forcing static shapes in loss computation (LSCE) (#876)

Summary:
Applying non_pad_mask by boolean indexing produces tensors whose shapes depend on the number of padding tokens (dynamic shapes), which is bad for TPUs. This change is an equivalent loss computation (tested), but tensor shapes are constant (in the case of reduce=True).

Pull Request resolved: https://github.com/pytorch/fairseq/pull/876

Differential Revision: D16241621

Pulled By: myleott

fbshipit-source-id: 973254b7e0842f2b55817afd66b2a110a566f149

Commit 8db7b1c7 authored Jul 17, 2019 by Taylan Bilal Committed by Facebook Github Bot Jul 17, 2019

Showing with 5 additions and 2 deletions

fairseq/criterions/label_smoothed_cross_entropy.py +5 -2

--- a/fairseq/criterions/label_smoothed_cross_entropy.py
+++ b/fairseq/criterions/label_smoothed_cross_entropy.py
@@ -52,11 +52,14 @@ class LabelSmoothedCrossEntropyCriterion(FairseqCriterion):
        lprobs = lprobs.view(-1, lprobs.size(-1))
        target = model.get_targets(sample, net_output).view(-1, 1)
        non_pad_mask = target.ne(self.padding_idx)
-        nll_loss = -lprobs.gather(dim=-1, index=target)[non_pad_mask]
-        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)[non_pad_mask]
        if reduce:
+            nll_loss = -lprobs.gather(dim=-1, index=target).masked_fill_(1.0-non_pad_mask, 0.0)
            nll_loss = nll_loss.sum()
+            smooth_loss = -lprobs.sum(dim=-1, keepdim=True).masked_fill_(1.0-non_pad_mask, 0.0)
            smooth_loss = smooth_loss.sum()
+        else:
+            nll_loss = -lprobs.gather(dim=-1, index=target)[non_pad_mask]
+            smooth_loss = -lprobs.sum(dim=-1, keepdim=True)[non_pad_mask]
        eps_i = self.eps / lprobs.size(-1)
        loss = (1. - self.eps) * nll_loss + eps_i * smooth_loss
        return loss, nll_loss
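
The equivalence claimed in the summary can be checked with a small standalone sketch. This is not the fairseq code itself: the function names `lsce_static` and `lsce_dynamic` are hypothetical, and the mask is written as `target.eq(padding_idx)` rather than `1.0-non_pad_mask`, on the assumption of a recent PyTorch where comparison ops return bool tensors (for which `-` is not supported). It also uses the out-of-place `masked_fill` instead of `masked_fill_`.

```python
import torch


def lsce_static(lprobs, target, eps, padding_idx):
    """Hypothetical helper: summed loss, padded rows zeroed instead of removed."""
    target = target.view(-1, 1)
    pad_mask = target.eq(padding_idx)  # bool mask, shape (N, 1)
    # Every intermediate keeps shape (N, 1) regardless of how much padding there is.
    nll_loss = -lprobs.gather(dim=-1, index=target)
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
    nll_loss = nll_loss.masked_fill(pad_mask, 0.0).sum()
    smooth_loss = smooth_loss.masked_fill(pad_mask, 0.0).sum()
    eps_i = eps / lprobs.size(-1)
    return (1.0 - eps) * nll_loss + eps_i * smooth_loss, nll_loss


def lsce_dynamic(lprobs, target, eps, padding_idx):
    """Hypothetical helper: same loss, but boolean indexing drops padded rows."""
    target = target.view(-1, 1)
    non_pad_mask = target.ne(padding_idx)
    # Indexing with a bool mask yields a 1-D tensor whose length depends on the data.
    nll_loss = (-lprobs.gather(dim=-1, index=target))[non_pad_mask].sum()
    smooth_loss = (-lprobs.sum(dim=-1, keepdim=True))[non_pad_mask].sum()
    eps_i = eps / lprobs.size(-1)
    return (1.0 - eps) * nll_loss + eps_i * smooth_loss, nll_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    lprobs = torch.log_softmax(torch.randn(6, 5), dim=-1)
    target = torch.tensor([2, 3, 1, 4, 1, 0])  # 1 plays the role of padding_idx
    static_loss, _ = lsce_static(lprobs, target, eps=0.1, padding_idx=1)
    dynamic_loss, _ = lsce_dynamic(lprobs, target, eps=0.1, padding_idx=1)
    print(torch.allclose(static_loss, dynamic_loss))
```

Zeroed entries contribute nothing to a sum, so the two variants agree whenever the result is reduced. Without reduction the per-token outputs differ in length, which is presumably why the diff keeps the indexing path in the `else` branch.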