Commit 37c9d96f authored by Louis MARTIN, committed by Facebook Github Bot

Add whole word masking for SentencepieceBPE (#1292)

Summary:
Models seem to train fine with this modification. I verified that the beginning-of-word mask is computed correctly, but did not check whether the actual masking is applied correctly.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1292

Differential Revision: D18338307

Pulled By: myleott

fbshipit-source-id: eae9e29d6ab648e768d70921694a898554496704
parent 7ca56cb8
fairseq/data/encoders/sentencepiece_bpe.py
@@ -31,3 +31,13 @@ class SentencepieceBPE(object):
 
     def decode(self, x: str) -> str:
         return x.replace(' ', '').replace('\u2581', ' ').strip()
+
+    def is_beginning_of_word(self, x: str) -> bool:
+        if x in ['<unk>', '<s>', '</s>', '<pad>']:
+            # special elements are always considered beginnings
+            # HACK: this logic is already present in fairseq/tasks/masked_lm.py
+            # but these special tokens are also contained in the sentencepiece
+            # vocabulary which causes duplicate special tokens. This hack makes
+            # sure that they are all taken into account.
+            return True
+        return x.startswith('\u2581')
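For context, here is a minimal standalone sketch (not part of this diff) of how the new predicate behaves on SentencePiece output. SentencePiece prefixes word-initial pieces with the meta symbol '\u2581' (LOWER ONE EIGHTH BLOCK), which is what the check relies on; the example tokenization below is illustrative.

# Minimal sketch (not from this commit) of the predicate's behavior.
# "unbelievable story" may be segmented by SentencePiece as
# ['\u2581un', 'believ', 'able', '\u2581story'].
def is_beginning_of_word(x: str) -> bool:
    if x in ['<unk>', '<s>', '</s>', '<pad>']:
        return True  # special tokens count as word beginnings
    return x.startswith('\u2581')

assert is_beginning_of_word('\u2581un')    # word-initial piece
assert not is_beginning_of_word('believ')  # word-internal piece
assert is_beginning_of_word('<pad>')       # special token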
...@@ -108,7 +108,7 @@ class MaskedLMTask(FairseqTask): ...@@ -108,7 +108,7 @@ class MaskedLMTask(FairseqTask):
# create masked input and targets # create masked input and targets
if self.args.mask_whole_words: if self.args.mask_whole_words:
bpe = encoders.build_bpe(self.args) bpe = encoders.build_bpe(self.args)
if bpe is not None: assert bpe is not None
def is_beginning_of_word(i): def is_beginning_of_word(i):
if i < self.source_dictionary.nspecial: if i < self.source_dictionary.nspecial:
......
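To illustrate why the predicate matters, here is a hypothetical sketch of how a per-token beginning-of-word flag enables whole-word masking: consecutive subword pieces are grouped into words, and every piece of a sampled word is masked together. The function name, sampling scheme, and '<mask>' symbol are illustrative, not the exact fairseq implementation.

import random

def whole_word_mask(pieces, is_beginning_of_word, mask_prob=0.15, mask='<mask>'):
    # Group subword indices into words using the beginning-of-word flags.
    words, current = [], []
    for i, piece in enumerate(pieces):
        if is_beginning_of_word(piece) and current:
            words.append(current)
            current = []
        current.append(i)
    if current:
        words.append(current)
    # Sample whole words, then replace every piece inside a sampled word.
    out = list(pieces)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                out[i] = mask
    return out

pieces = ['\u2581un', 'believ', 'able', '\u2581story']
print(whole_word_mask(pieces, lambda p: p.startswith('\u2581'), mask_prob=1.0))
# ['<mask>', '<mask>', '<mask>', '<mask>'] -- all pieces of a word are masked together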