Commit 4fac3b60 authored by Jingfei Du, committed by Facebook Github Bot

fix bug for masking (#752)

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/752

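Previously we sampled the positions to mask with replace=True (the np.random.choice default). Because of this, the same position could be drawn more than once, so fewer tokens ended up masked than the masking ratio calls for. Sampling with replace=False guarantees that exactly mask_num distinct positions are masked.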
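The effect described above can be reproduced in isolation. The following is a minimal sketch, not code from fairseq: the helper name `count_masked` and the `seed` parameter are hypothetical and exist only for illustration. It compares how many positions are requested with how many distinct positions `np.random.choice` actually returns.

```python
import math

import numpy as np


def count_masked(sent_length, masking_ratio, replace, seed=0):
    """Return (requested, distinct) counts of masked positions.

    Hypothetical helper for illustration; not part of fairseq.
    """
    rng = np.random.RandomState(seed)
    mask_num = math.ceil(sent_length * masking_ratio)
    # Same call shape as in the diff, with `replace` made explicit.
    mask = rng.choice(sent_length, mask_num, replace=replace)
    return mask_num, len(set(mask.tolist()))


if __name__ == "__main__":
    for replace in (True, False):
        requested, distinct = count_masked(20, 0.5, replace=replace)
        print(f"replace={replace}: requested {requested}, distinct {distinct}")
```

With `replace=False` the two counts always match. With `replace=True` the distinct count can only be equal or lower, which is the under-masking this commit fixes.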

Reviewed By: liaimi

Differential Revision: D15403556

fbshipit-source-id: cf12eeb13f9610431136a345de9199ad0292984b
parent ee28411f
@@ -152,7 +152,7 @@ class MaskedLMDataset(FairseqDataset):
 masked_sent = np.copy(sentence)
 sent_length = len(sentence)
 mask_num = math.ceil(sent_length * self.masking_ratio)
-mask = np.random.choice(sent_length, mask_num)
+mask = np.random.choice(sent_length, mask_num, replace=False)
 target = np.copy(sentence)
 for i in range(sent_length):
...
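For readers without the full file at hand, here is a simplified, self-contained sketch of a masking step that uses the corrected sampling call. It does not reproduce the elided loop body from `MaskedLMDataset`. The function name `mask_sentence`, the parameters `mask_idx`, `pad_idx`, and `rng`, and the choice to replace every selected position with a single mask token are all assumptions made for illustration.

```python
import math

import numpy as np


def mask_sentence(sentence, masking_ratio, mask_idx, pad_idx, rng):
    """Mask a fraction of distinct positions in `sentence`.

    Illustrative only: names and masking policy are assumptions,
    not the fairseq implementation.
    """
    sentence = np.asarray(sentence)
    sent_length = len(sentence)
    mask_num = math.ceil(sent_length * masking_ratio)
    # replace=False guarantees mask_num distinct positions.
    mask = rng.choice(sent_length, mask_num, replace=False)

    masked_sent = np.copy(sentence)
    masked_sent[mask] = mask_idx          # hide the selected tokens

    target = np.full_like(sentence, pad_idx)
    target[mask] = sentence[mask]         # predict only the hidden tokens
    return masked_sent, target


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    masked, target = mask_sentence(
        [11, 12, 13, 14, 15, 16, 17, 18], 0.25, mask_idx=3, pad_idx=1, rng=rng
    )
    print(masked, target)
```

Because the positions are distinct, the number of masked tokens equals `mask_num` for every sentence, so the effective masking ratio no longer falls below the configured one.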