Commits · 2b13f3c036898140c6266d8b2d39c7fb3b3112e6 · OpenDAS / Fairseq

07 Nov, 2018 1 commit

Support BPE end of word marker suffix in fairseq noising module · 2b13f3c0

Liezl Puzon authored Nov 06, 2018

Summary:
There are 2 ways to implement BPE:
1. use a continuation marker suffix to indicate that there is at least one more subtoken left in the word
2. use a end of word marker suffix to indicate that there is no more subtokens left in the word

This adds some logic to account for either kind of BPE marker suffix. This diff adds a corresponding test. I also refactored the test setup to reduce the number of boolean args when setting up test data.

Reviewed By: xianxl

Differential Revision: D12919428

fbshipit-source-id: 405e9f346dce6e736c1305288721dfc7b63e872a

2b13f3c0

01 Nov, 2018 1 commit

Denoising autoencoder task (#251) · c9c660c0

Liezl Puzon authored Nov 01, 2018

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/251

We should use shared encoder and separate decoders as in:

https://fb.facebook.com/groups/2156114531381111/permalink/2169028113423086/

Generation is a hack, ideally the net input should have the lang pair info so that when we pass the sample to the model, it can select the correct encoder/decoder pair.

diff [2/2] will be for flow integration for basic experimentation

TODO in a future diff: figure out how to generalize this so export will work??

This works with vocab reduction, but we only support vocab reduction for src-tgt, not src-src model. A future (lowpri) task could be to add word prediction vocab reduction for src-src model to speed up training.

Reviewed By: xianxl

Differential Revision: D10512576

fbshipit-source-id: 545d96cad8e814b9da7be102a48cc5cac358b758

c9c660c0

27 Oct, 2018 1 commit

Extend WordShuffle noising function to apply to non-bpe tokens · 90c01b3a

Xian Li authored Oct 26, 2018

Summary:
We'd like to resue the noising functions and DenoisingDataset in
adversarial training. However, current noising functions assume the input are
subword tokens. The goal of this diff is to extend it so the noising can be
applied to word tokens. Since we're mostly interested in the word shuffle
noising, so I only modified the WordShuffle class.

Reviewed By: liezl200

Differential Revision: D10523177

fbshipit-source-id: 1e5d27362850675010e73cd38850c890d42652ab

90c01b3a

06 Oct, 2018 2 commits

Add denoising dataset for denoising autoencoder (#306) · e286243c

Liezl Puzon authored Oct 05, 2018

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/306

This uses a source dataset to generate a batch of {source: noisy source, target: original clean source} which allows us to train a denoising autoencoding component as part of a seq2seq model.

Reviewed By: xianxl

Differential Revision: D10078981

fbshipit-source-id: 026225984d4a97062ac05dc3a36e79b5c841fe9c

e286243c

Have noising account for sentences with and without EOS (#305) · 8798a240

Liezl Puzon authored Oct 05, 2018

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/305

Previously, noising code assumed that every sentence had an EOS which had to be excluded from noising operations (since we shouldn't drop, blank, or shuffle EOS). This logic allows the noising module to handle sentences with EOS and without EOS

Reviewed By: xianxl

Differential Revision: D10114425

fbshipit-source-id: 04ec8547343eb94266bda1ac7fca3d8a1991c9f4

8798a240

30 Sep, 2018 1 commit
- fbshipit-source-id: 6a835d32f9dc5e0de118f1b46d365d0e0cc85e11 · f8377a70
  myleott authored Sep 30, 2018
  
  f8377a70