1. 10 May, 2019 1 commit
  2. 08 May, 2019 5 commits
  3. 07 May, 2019 4 commits
  4. 06 May, 2019 2 commits
    • allowing sharded dataset (#696) · 0add50c2
      Naman Goyal authored
      
      
      Summary:
      Co-authored-by: myleott <myleott@fb.com>
      
      Changing `data` to be a `str` with a colon-separated list for loading sharded datasets. This change is useful for loading large datasets that cannot fit into memory. The large dataset can be sharded, and each shard is then loaded in one epoch in a round-robin manner.
      
      For example, if there are `5` shards of data and `10` epochs, the shards will be iterated over in the order `[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]`.
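      As a rough sketch of this scheme (not the actual fairseq code; `shard_for_epoch` and its arguments are illustrative names), the colon-separated `data` string can be split into shard paths and the shard for a given epoch picked with a modulo:

      ```python
      def shard_for_epoch(data: str, epoch: int) -> str:
          """`data` is a colon-separated list of shard paths, e.g.
          "/data/shard0:/data/shard1:/data/shard2:/data/shard3:/data/shard4"."""
          paths = data.split(":")
          assert len(paths) > 0
          # With 5 shards and 10 epochs (0-based), the visiting order is
          # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4], i.e. round-robin over the shards.
          return paths[epoch % len(paths)]

      # Example: shard_for_epoch("/d/s0:/d/s1:/d/s2:/d/s3:/d/s4", 7) -> "/d/s2"
      ```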
      
      myleott: we need to look into `translation.py`, as it currently already expects a list and then concatenates the datasets.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/696
      
      Differential Revision: D15214049
      
      fbshipit-source-id: 03e43a7b69c7aefada2ca668abf1eac1969fe013
    • added masked_lm task (#697) · e1ffea87
      Naman Goyal authored
      
      
      Summary:
      Co-authored-by: jingfeidu <jingfeidu@fb.com>
      
      1) Adding a `masked_lm` task for BERT-like training. Code mostly taken from jingfeidu's implementation.
      
      2) Added a `has_eos` option to `block_pair_dataset` for working with datasets that have been preprocessed to include `eos`.
      
      Depends on: https://github.com/pytorch/fairseq/pull/696
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/697
      
      Differential Revision: D15214050
      
      fbshipit-source-id: c179ce2d70e59d2ddc941b13ceda99d929878931
  5. 04 May, 2019 2 commits
  6. 01 May, 2019 4 commits
  7. 30 Apr, 2019 3 commits
  8. 27 Apr, 2019 1 commit
  9. 17 Apr, 2019 3 commits
  10. 16 Apr, 2019 1 commit
  11. 10 Apr, 2019 1 commit
  12. 03 Apr, 2019 1 commit
    • sort dictionary items lexicographically for consistency · 10ad7495
      Paco Guzman authored
      Summary: Sorts dictionary items lexicographically before creating the counter. This makes distributed preprocessing deterministic.
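      A minimal sketch of the idea (not fairseq's `Dictionary` code; `build_counter` and `token_counts` are made-up names): counting tokens in a fixed, lexicographic order means that tokens with tied counts always end up in the same insertion order, no matter how the per-worker counts were gathered.

      ```python
      from collections import Counter

      def build_counter(token_counts: dict) -> Counter:
          """Build a Counter from {token: count}, inserting tokens in sorted order."""
          counter = Counter()
          for token in sorted(token_counts):  # fixed, lexicographic insertion order
              counter[token] += token_counts[token]
          return counter

      # Tokens with tied counts now come out of most_common() in a stable order,
      # since most_common() uses a stable sort and so preserves insertion order:
      # build_counter({"b": 2, "a": 2}).most_common() == [("a", 2), ("b", 2)]
      ```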
      
      Reviewed By: myleott
      
      Differential Revision: D14678214
      
      fbshipit-source-id: 7a9e2f0cb367e8fb76da01e108dda4c6c5aab505
  13. 02 Apr, 2019 1 commit
  14. 15 Mar, 2019 1 commit
    • 0.6.1 -> 0.6.2 (#577) · e6422528
      Myle Ott authored
      Summary:
      Changelog:
      - 998ba4f: Add language models from Baevski & Auli (2018)
      - 4294c4f6: Add mixture of experts code from Shen et al. (2019)
      - 00493490: Add example for multilingual training
      - 48d9afbe: Speed improvements, including fused operators from apex
      - 44d27e64: Add Tensorboard support
      - d17fa851: Add Adadelta optimizer
      - 9e1c880f: Add `FairseqEncoderModel`
      - b65c579b: Add `FairseqTask.inference_step` to modularize generate.py
      - 2ad1178e: Add back `--curriculum`
      - Misc bug fixes and other features
      
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/577
      
      Differential Revision: D14481233
      
      Pulled By: myleott
      
      fbshipit-source-id: 4ff8625ef1c0b24273fc65df7c5658e3c932e8b7
  15. 01 Mar, 2019 2 commits
    • Fixed the issue that no space in string converted from tensor · 88bf8b56
      James King authored
      Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/548
      
      Differential Revision: D14286021
      
      Pulled By: myleott
      
      fbshipit-source-id: 7c725304185e63787220371a812ec860e178872c
    • Refactor BERTDataset to the more general MaskedLMDataset · 92a6c548
      Kartikay Khandelwal authored
      Summary: The current BERTDataset has a lot of the components needed for generic MaskedLM training, but it is too restrictive in the assumptions it makes: two blocks being masked, the special tokens used for the sentence embedding as well as the separator, etc. In this diff I refactor this dataset and, at the same time, make some of the parameters configurable, including the probabilities associated with masking.
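      As an illustrative sketch of what "configurable masking" can look like (assumed names and defaults, not the actual MaskedLMDataset API), the probabilities are passed in rather than hard-coded:

      ```python
      import numpy as np

      def mask_tokens(tokens, mask_idx, vocab_size,
                      masking_ratio=0.15, mask_prob=0.8, random_token_prob=0.1):
          """BERT-style masking with configurable probabilities (illustrative only)."""
          tokens = np.array(tokens)  # work on a copy
          # Choose which positions the model has to predict.
          masked = np.random.rand(len(tokens)) < masking_ratio
          targets = np.where(masked, tokens, -1)  # -1 = ignored by the loss
          # Of the chosen positions: mask_prob -> [MASK], random_token_prob -> a
          # random token, the remainder -> keep the original token.
          rand = np.random.rand(len(tokens))
          tokens[masked & (rand < mask_prob)] = mask_idx
          use_random = masked & (rand >= mask_prob) & (rand < mask_prob + random_token_prob)
          tokens[use_random] = np.random.randint(0, vocab_size, use_random.sum())
          return tokens, targets
      ```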
      
      Reviewed By: rutyrinott
      
      Differential Revision: D14222467
      
      fbshipit-source-id: e9f78788dfe7f56646ba09c62967c4c0bd30aed8
  16. 28 Feb, 2019 1 commit
  17. 26 Feb, 2019 2 commits
  18. 22 Feb, 2019 1 commit
  19. 19 Feb, 2019 1 commit
    • moving masking logic to collate · 08e866f9
      Ruty Rinott authored
      Summary: Move masking logic to data_utils
      
      Reviewed By: kartikayk, jingfeidu
      
      Differential Revision: D14098403
      
      fbshipit-source-id: c7b7e811ab48b9c5a12662dc1e2f2ed694724176
  20. 16 Feb, 2019 1 commit
  21. 30 Jan, 2019 1 commit
    • Merge internal changes (#483) · 42be3ebd
      Myle Ott authored
      Summary:
      Changelog:
      - `4889802`: can now detokenize sentencepiece output with `--remove-bpe=sentencepiece` (fixes #331). Also added `--sacrebleu` for computing detokenized BLEU.
      - `0d76427`: fix an assertion error when training a language model on a dataset containing empty sentences
      - minor bug and style fixes
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/483
      
      Differential Revision: D13867899
      
      Pulled By: myleott
      
      fbshipit-source-id: 25c940b847fe270262ac8f5ac838407b3977fdda
  22. 24 Jan, 2019 1 commit
    • Enforce UTF-8 when open() text files (#460) · 38f1dee9
      Davide Caroselli authored
      Summary:
      When opening text files without specifying the encoding (i.e. `open(path, "r")` or `open(path, "w")`), Python 3 uses the preferred locale encoding (`locale.getpreferredencoding()`), so the result is platform dependent and can change from one machine to another.
      
      I believe fairseq should enforce its own standard (UTF-8 seems like the best choice to me). This pull request explicitly specifies UTF-8 encoding when reading text files.
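      In code terms the change amounts to the following (the path is illustrative):

      ```python
      path = "train.en"  # illustrative path

      # Before: the encoding comes from locale.getpreferredencoding(), so the
      # result differs between machines.
      with open(path, "r") as f:
          lines = f.readlines()

      # After: what this pull request enforces, an explicit UTF-8 encoding.
      with open(path, "r", encoding="utf-8") as f:
          lines = f.readlines()
      ```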
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/460
      
      Differential Revision: D13802525
      
      Pulled By: myleott
      
      fbshipit-source-id: 672fd55707ee559ab36d74bc1c24026166ea2367