1. 22 Feb, 2019 1 commit
  2. 01 Feb, 2019 1 commit
    • Support custom Dictionary implementations in 'preprocess.py' (#448) · bbb4120b
      Davide Caroselli authored
      Summary:
      The `preprocess.py` script has been refactored in order to:
      
      1. Use the `options` module for command-line argument parsing. This lets `preprocess.py` load custom modules via the `--user-dir` flag (already supported by all other binaries).
      2. Move the dictionary loading and building code into the Task implementation. This allows custom Dictionary classes to be used during the data generation step.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/448
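
The second point can be illustrated with a minimal sketch in which the preprocessing step asks the task for its dictionary instead of constructing one itself. All names here (`SimpleDictionary`, `LowercaseDictionary`, `SketchTask`, `preprocess`) are hypothetical and are not the fairseq API; the sketch only shows the delegation pattern under that assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple, Type


class SimpleDictionary:
    """Hypothetical default dictionary: maps whitespace-separated words to ids."""

    def __init__(self) -> None:
        self.symbols: List[str] = []
        self.indices: Dict[str, int] = {}

    def normalize(self, word: str) -> str:
        # Hook that custom dictionaries can override.
        return word

    def add_symbol(self, word: str) -> int:
        word = self.normalize(word)
        if word not in self.indices:
            self.indices[word] = len(self.symbols)
            self.symbols.append(word)
        return self.indices[word]

    def encode(self, line: str) -> List[int]:
        return [self.indices[self.normalize(w)] for w in line.split()]

    @classmethod
    def build(cls, lines: Iterable[str]) -> "SimpleDictionary":
        d = cls()
        counts = Counter(d.normalize(w) for line in lines for w in line.split())
        # Most frequent words get the smallest ids; ties are broken alphabetically.
        for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            d.add_symbol(word)
        return d


class LowercaseDictionary(SimpleDictionary):
    """Hypothetical custom dictionary that folds case."""

    def normalize(self, word: str) -> str:
        return word.lower()


class SketchTask:
    """Hypothetical task: owns the choice of dictionary class."""

    dictionary_cls: Type[SimpleDictionary] = SimpleDictionary

    @classmethod
    def build_dictionary(cls, lines: Iterable[str]) -> SimpleDictionary:
        return cls.dictionary_cls.build(lines)


class LowercaseTask(SketchTask):
    dictionary_cls = LowercaseDictionary


def preprocess(task_cls: Type[SketchTask], lines: List[str]) -> Tuple[SimpleDictionary, List[List[int]]]:
    # The script never names a dictionary class; it delegates to the task.
    dictionary = task_cls.build_dictionary(lines)
    return dictionary, [dictionary.encode(line) for line in lines]
```

Because `preprocess` only talks to the task, swapping `SketchTask` for `LowercaseTask` changes the dictionary behaviour without touching the script.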
      
      Differential Revision: D13674819
      
      Pulled By: myleott
      
      fbshipit-source-id: b40648a98ed6c08284577e5ec25876e018d8c822
  3. 30 Jan, 2019 2 commits
  4. 25 Jan, 2019 1 commit
  5. 05 Jan, 2019 1 commit
  6. 26 Nov, 2018 1 commit
    • Refactor BacktranslationDataset to be more reusable (#354) · 3c19878f
      Myle Ott authored
      Summary:
      - generalize AppendEosDataset -> TransformEosDataset
      - remove EOS logic from BacktranslationDataset (use TransformEosDataset instead)
      - BacktranslationDataset takes a backtranslation_fn instead of building the SequenceGenerator itself
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/354
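
A rough sketch of the resulting composition, assuming datasets are plain lists of token ids: EOS handling lives in a separate wrapper, and the backtranslation dataset only calls whatever function it is given. The class names, the `append_eos`/`remove_eos` parameters, and the toy `toy_backtranslate` function are hypothetical simplifications, not the actual fairseq signatures.

```python
from typing import Callable, Dict, List, Sequence

EOS = 2  # hypothetical EOS token id


class TransformEosSketch:
    """Wraps a dataset and appends or removes a trailing EOS on each item."""

    def __init__(self, dataset: Sequence[Sequence[int]], eos: int,
                 append_eos: bool = False, remove_eos: bool = False) -> None:
        if append_eos and remove_eos:
            raise ValueError("choose at most one of append_eos / remove_eos")
        self.dataset = dataset
        self.eos = eos
        self.append_eos = append_eos
        self.remove_eos = remove_eos

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> List[int]:
        item = list(self.dataset[index])
        has_eos = bool(item) and item[-1] == self.eos
        if self.append_eos and not has_eos:
            item.append(self.eos)
        elif self.remove_eos and has_eos:
            item.pop()
        return item


class BacktranslationFnSketch:
    """Pairs each target with a source produced by an injected function."""

    def __init__(self, tgt_dataset: Sequence[Sequence[int]],
                 backtranslation_fn: Callable[[List[int]], List[int]]) -> None:
        self.tgt_dataset = tgt_dataset
        self.backtranslation_fn = backtranslation_fn

    def __len__(self) -> int:
        return len(self.tgt_dataset)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        tgt = list(self.tgt_dataset[index])
        return {"source": self.backtranslation_fn(tgt), "target": tgt}


def toy_backtranslate(tokens: List[int]) -> List[int]:
    # Stand-in for a tgt->src model: shifts ids by 100 and re-adds EOS.
    return [t + 100 for t in tokens if t != EOS] + [EOS]


tgt_with_eos = TransformEosSketch([[5, 6], [7, 8, EOS]], eos=EOS, append_eos=True)
pairs = BacktranslationFnSketch(tgt_with_eos, toy_backtranslate)
```

Keeping the generator behind a plain callable means the dataset does not need to know how the backtranslation is produced.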
      
      Reviewed By: liezl200
      
      Differential Revision: D12970233
      
      Pulled By: myleott
      
      fbshipit-source-id: d5c5b0e0a75eca1bd3a50382ac24621f35c32f36
  7. 18 Nov, 2018 1 commit
  8. 07 Nov, 2018 1 commit
    • Support BPE end of word marker suffix in fairseq noising module · 2b13f3c0
      Liezl Puzon authored
      Summary:
      There are 2 ways to implement BPE:
      1. use a continuation marker suffix to indicate that at least one more subtoken is left in the word
      2. use an end-of-word marker suffix to indicate that no more subtokens are left in the word
      
      This adds some logic to account for either kind of BPE marker suffix. This diff adds a corresponding test. I also refactored the test setup to reduce the number of boolean args when setting up test data.
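
To make the two conventions concrete, here is a small sketch that maps each subtoken to the index of the word it belongs to under either marker style. The function names, the `marker_type` values (`"cont"`, `"end"`), and the marker strings `@@` and `_EOW` are assumptions chosen for illustration, not necessarily what the noising module uses.

```python
from typing import List


def word_start_flags(tokens: List[str], marker: str, marker_type: str) -> List[bool]:
    """For each subtoken, return True when it begins a new word."""
    if marker_type not in ("cont", "end"):
        raise ValueError("marker_type must be 'cont' or 'end'")
    flags = []
    previous_ended_word = True  # the first subtoken always starts a word
    for token in tokens:
        flags.append(previous_ended_word)
        has_marker = token.endswith(marker)
        if marker_type == "end":
            # End-of-word marker: the word ends exactly when the marker is present.
            previous_ended_word = has_marker
        else:
            # Continuation marker: the word ends exactly when the marker is absent.
            previous_ended_word = not has_marker
    return flags


def word_indices(tokens: List[str], marker: str, marker_type: str) -> List[int]:
    """Map each subtoken to the 0-based index of its word."""
    indices = []
    current = -1
    for starts_word in word_start_flags(tokens, marker, marker_type):
        if starts_word:
            current += 1
        indices.append(current)
    return indices
```

Both conventions yield the same word grouping for the same segmentation; only the test applied to each subtoken is inverted.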
      
      Reviewed By: xianxl
      
      Differential Revision: D12919428
      
      fbshipit-source-id: 405e9f346dce6e736c1305288721dfc7b63e872a
  9. 02 Nov, 2018 2 commits
  10. 01 Nov, 2018 1 commit
  11. 27 Oct, 2018 1 commit
    • Extend WordShuffle noising function to apply to non-bpe tokens · 90c01b3a
      Xian Li authored
      Summary:
      We'd like to reuse the noising functions and DenoisingDataset in
      adversarial training. However, the current noising functions assume the inputs are
      subword tokens. The goal of this diff is to extend them so that noising can also be
      applied to word tokens. Since we're mostly interested in word shuffle
      noising, I only modified the WordShuffle class.
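
One common way to implement a bounded word shuffle is to add uniform noise to each word's position and sort by the noisy score. The sketch below follows that idea and treats every token as its own word when no subword grouping is supplied. It is an assumption about the general technique, not the WordShuffle code itself, and `shuffle_words`, `word_idx`, and `max_shuffle_distance` are illustrative names.

```python
import random
from typing import List, Optional, Sequence


def shuffle_words(tokens: Sequence[str],
                  max_shuffle_distance: float = 3.0,
                  word_idx: Optional[Sequence[int]] = None,
                  rng: Optional[random.Random] = None) -> List[str]:
    """Locally shuffle words; subtokens of one word stay together and in order."""
    rng = rng or random.Random()
    if word_idx is None:
        # Word-level input: every token is its own word.
        word_idx = list(range(len(tokens)))
    n_words = max(word_idx) + 1 if word_idx else 0
    # One noise value per word, so all subtokens of a word share a score.
    noise = [rng.uniform(0, max_shuffle_distance) for _ in range(n_words)]
    order = sorted(
        range(len(tokens)),
        key=lambda i: (word_idx[i] + noise[word_idx[i]], i),
    )
    return [tokens[i] for i in order]
```

With `max_shuffle_distance=0` the noise vanishes and the input order is returned unchanged; larger values let words drift further from their original positions.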
      
      Reviewed By: liezl200
      
      Differential Revision: D10523177
      
      fbshipit-source-id: 1e5d27362850675010e73cd38850c890d42652ab
  12. 23 Oct, 2018 1 commit
  13. 06 Oct, 2018 2 commits
  14. 04 Oct, 2018 1 commit
    • Option to remove EOS at source in backtranslation dataset · b9e29a47
      Liezl Puzon authored
      Summary:
      If we want our parallel data to have EOS at the end of source, we keep the EOS at the end of the generated source dialect backtranslation.
      If we don't want our parallel data to have EOS at the end of source, we **remove** the EOS at the end of the generated source dialect backtranslation.
      
      Note: we always want EOS at the end of our target / reference in parallel data so our model can learn to generate sentences of arbitrary length. So we make sure that the original target has an EOS before returning a batch of {generated src, original target}. If the original targets in the tgt dataset don't have an EOS, we append EOS to each tgt sample before collating.
      We only do this for the purpose of collating a {generated src, original tgt} batch AFTER generating the backtranslations. We don't enforce any EOS before passing tgt to the tgt->src model for generating the backtranslation. Users of this dataset are expected to format tgt dataset examples in the format that the tgt->src model expects.
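
The EOS rules described above can be summarized in a short sketch that builds one {generated src, original tgt} pair. The token id `EOS = 2`, the helper names, and the `remove_eos_at_src` flag name are assumptions for illustration only.

```python
from typing import Dict, List, Sequence

EOS = 2  # hypothetical EOS token id


def strip_eos(tokens: Sequence[int], eos: int = EOS) -> List[int]:
    """Drop a single trailing EOS if present."""
    tokens = list(tokens)
    if tokens and tokens[-1] == eos:
        tokens.pop()
    return tokens


def ensure_eos(tokens: Sequence[int], eos: int = EOS) -> List[int]:
    """Append EOS unless the sequence already ends with it."""
    tokens = list(tokens)
    if not tokens or tokens[-1] != eos:
        tokens.append(eos)
    return tokens


def make_pair(generated_src: Sequence[int], original_tgt: Sequence[int],
              remove_eos_at_src: bool = False, eos: int = EOS) -> Dict[str, List[int]]:
    # Source: keep or remove EOS depending on how the parallel data is formatted.
    src = strip_eos(generated_src, eos) if remove_eos_at_src else list(generated_src)
    # Target: always ends with EOS, applied only after backtranslation is done.
    tgt = ensure_eos(original_tgt, eos)
    return {"source": src, "target": tgt}
```

Note that `make_pair` touches the target only when assembling the pair; the input handed to the tgt->src model is left exactly as the user provided it.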
      
      Reviewed By: jmp84
      
      Differential Revision: D10157725
      
      fbshipit-source-id: eb6a15f13c651f7c435b8db28103c9a8189845fb
  15. 03 Oct, 2018 2 commits
    • Fix proxying in DistributedFairseqModel · fc677c94
      Myle Ott authored
      Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/302
      
      Differential Revision: D10174608
      
      Pulled By: myleott
      
      fbshipit-source-id: 4e2dfc76eae97afc5488f29b47e74f9897a643ff
    • Pass in kwargs and SequenceGenerator class to init BacktranslationDataset · f766c9a0
      Liezl Puzon authored
      Summary: This generalizes BacktranslationDataset to allow us to use any SequenceGenerator class. For example, if we want to use this model in PyTorch Translate, we can pass the following to BacktranslationDataset init: (1) a PyTorch Translate SequenceGenerator class as generator_class and (2) the appropriate args for initializing that class as kwargs.
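
A minimal sketch of this pattern, assuming the generator class accepts a list of models plus keyword arguments and exposes a `generate` method. The classes `BacktranslationSketch` and `TruncatingGenerator` and the `max_len` argument are hypothetical stand-ins, not the fairseq or PyTorch Translate interfaces.

```python
from typing import Any, Callable, Dict, List, Sequence


class TruncatingGenerator:
    """Toy stand-in for a SequenceGenerator: runs the model, then truncates."""

    def __init__(self, models: List[Callable[[List[int]], List[int]]], max_len: int = 3) -> None:
        self.models = models
        self.max_len = max_len

    def generate(self, tokens: List[int]) -> List[int]:
        return self.models[0](tokens)[: self.max_len]


class BacktranslationSketch:
    """Builds its generator from a caller-supplied class and kwargs."""

    def __init__(self, tgt_dataset: Sequence[Sequence[int]],
                 backtranslation_model: Callable[[List[int]], List[int]],
                 generator_class: type, **kwargs: Any) -> None:
        self.tgt_dataset = tgt_dataset
        # Any class with a compatible constructor and generate() can be plugged in.
        self.generator = generator_class([backtranslation_model], **kwargs)

    def __len__(self) -> int:
        return len(self.tgt_dataset)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        tgt = list(self.tgt_dataset[index])
        return {"source": self.generator.generate(tgt), "target": tgt}


def reverse_model(tokens: List[int]) -> List[int]:
    # Toy tgt->src "model": reverses the token sequence.
    return list(reversed(tokens))


dataset = BacktranslationSketch(
    [[1, 2, 3, 4]],
    reverse_model,
    generator_class=TruncatingGenerator,
    max_len=2,
)
```

The dataset forwards `max_len` untouched, so generator-specific options never need to appear in the dataset's own signature.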
      
      Reviewed By: xianxl
      
      Differential Revision: D10156552
      
      fbshipit-source-id: 0495d825bf4727da96d0d9a40dc434135ff3486c
  16. 02 Oct, 2018 1 commit
    • Explicitly list out generation args for backtranslation dataset · 86e93f2b
      Liezl Puzon authored
      Summary:
      Using argparse Namespace hides the actual args that are expected and makes code harder to read.
      
      Note the difference in style for the args list
      
          def __init__(
              self,
              tgt_dataset,
              tgt_dict,
              backtranslation_model,
              unkpen,
              sampling,
              beam,
              max_len_a,
              max_len_b,
          ):
      
      instead of
      
          def __init__(
              self, tgt_dataset, tgt_dict, backtranslation_model, unkpen, sampling,
              beam, max_len_a, max_len_b,
          ):
      
      Reviewed By: dpacgopinath
      
      Differential Revision: D10152331
      
      fbshipit-source-id: 6539ccba09d48acf23759996b7e32fb329b3e3f6
  17. 30 Sep, 2018 1 commit
  18. 25 Sep, 2018 8 commits
  19. 03 Sep, 2018 9 commits
  20. 25 Jul, 2018 1 commit
  21. 25 Jun, 2018 1 commit