"git@developer.sourcefind.cn:OpenDAS/torch-cluster.git" did not exist on "b5ae2c0fa3467b6fa002458fe0345b0a3e317139"
Commit c9c660c0 authored by Liezl Puzon's avatar Liezl Puzon Committed by Facebook Github Bot
Browse files

Denoising autoencoder task (#251)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/251

We should use shared encoder and separate decoders as in:

https://fb.facebook.com/groups/2156114531381111/permalink/2169028113423086/

Generation is a hack; ideally, the net input should carry the language-pair info so that when we pass the sample to the model, it can select the correct encoder/decoder pair.

diff [2/2] will be for flow integration for basic experimentation

TODO in a future diff: figure out how to generalize this so that export will work.

This works with vocab reduction, but we only support vocab reduction for src-tgt, not src-src model. A future (lowpri) task could be to add word prediction vocab reduction for src-src model to speed up training.

Reviewed By: xianxl

Differential Revision: D10512576

fbshipit-source-id: 545d96cad8e814b9da7be102a48cc5cac358b758
parent 5bbd148e
...@@ -32,10 +32,17 @@ class WordNoising(object): ...@@ -32,10 +32,17 @@ class WordNoising(object):
Given a list of BPE tokens, for every index in the tokens list, Given a list of BPE tokens, for every index in the tokens list,
return the index of the word grouping that it belongs to. return the index of the word grouping that it belongs to.
For example, for input x corresponding to ["how", "are", "y@@", "ou"], For example, for input x corresponding to ["how", "are", "y@@", "ou"],
return [0, 1, 2, 2]. return [[0], [1], [2], [2]].
""" """
# x: (T x B) # x: (T x B)
bpe_end = self.bpe_end[x] bpe_end = self.bpe_end[x]
if (x.size(0) == 1 and x.size(1) == 1):
# Special case when we only have one word in x. If x = [[N]],
# bpe_end is a scalar (bool) instead of a 2-dim array of bools,
# which makes the sum operation below fail.
return np.array([[0]])
# do a reduce front sum to generate word ids # do a reduce front sum to generate word ids
word_idx = bpe_end[::-1].cumsum(0)[::-1] word_idx = bpe_end[::-1].cumsum(0)[::-1]
word_idx = word_idx.max(0)[None, :] - word_idx word_idx = word_idx.max(0)[None, :] - word_idx
...@@ -142,7 +149,7 @@ class WordShuffle(WordNoising): ...@@ -142,7 +149,7 @@ class WordShuffle(WordNoising):
noise = np.random.uniform( noise = np.random.uniform(
0, 0,
max_shuffle_distance, max_shuffle_distance,
size=(x.size(0) - 1, x.size(1)), size=(x.size(0), x.size(1)),
) )
noise[0] = -1 # do not move start sentence symbol noise[0] = -1 # do not move start sentence symbol
...@@ -153,7 +160,6 @@ class WordShuffle(WordNoising): ...@@ -153,7 +160,6 @@ class WordShuffle(WordNoising):
length_no_eos = lengths[i] length_no_eos = lengths[i]
if x[lengths[i] - 1, i] == self.dictionary.eos(): if x[lengths[i] - 1, i] == self.dictionary.eos():
length_no_eos = lengths[i] - 1 length_no_eos = lengths[i] - 1
# generate a random permutation # generate a random permutation
scores = word_idx[:length_no_eos, i] + noise[word_idx[:length_no_eos, i], i] scores = word_idx[:length_no_eos, i] + noise[word_idx[:length_no_eos, i], i]
# ensure no reordering inside a word # ensure no reordering inside a word
...@@ -216,6 +222,7 @@ class NoisingDataset(torch.utils.data.Dataset): ...@@ -216,6 +222,7 @@ class NoisingDataset(torch.utils.data.Dataset):
src_dataset, src_dataset,
src_dict, src_dict,
seed, seed,
noiser=None,
noising_class=UnsupervisedMTNoising, noising_class=UnsupervisedMTNoising,
**kwargs, **kwargs,
): ):
...@@ -235,6 +242,8 @@ class NoisingDataset(torch.utils.data.Dataset): ...@@ -235,6 +242,8 @@ class NoisingDataset(torch.utils.data.Dataset):
src_dict: src dict src_dict: src dict
src_dict: src dictionary src_dict: src dictionary
seed: seed to use when generating random noise seed: seed to use when generating random noise
noiser: a pre-initialized noiser. If this is None, a noiser will
be created using noising_class and kwargs.
noising_class: class to use when initializing noiser noising_class: class to use when initializing noiser
kwargs: noising args for configuring noising to apply kwargs: noising args for configuring noising to apply
Note that there is no equivalent argparse code for these args Note that there is no equivalent argparse code for these args
...@@ -246,7 +255,7 @@ class NoisingDataset(torch.utils.data.Dataset): ...@@ -246,7 +255,7 @@ class NoisingDataset(torch.utils.data.Dataset):
self.src_dataset = src_dataset self.src_dataset = src_dataset
self.src_dict = src_dict self.src_dict = src_dict
self.noiser = noising_class( self.noiser = noiser if noiser is not None else noising_class(
dictionary=src_dict, **kwargs, dictionary=src_dict, **kwargs,
) )
self.seed = seed self.seed = seed
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or sign in to comment