1. 20 Sep, 2019 1 commit
    • added multilingual masked LM training (#849) · 32335404
      Naman Goyal authored
      Summary:
      Multilingual RoBERTa training now works with aconneau's XLM data.
      
      Two pieces remaining:
      
      1) `XLM` limits each batch to a single language. I am not 100% sure about the reason for that, but it should be easy to implement: we can add a `batch_by_size_and_language` function in place of the default `batch_by_size`. If it's not critical, I would prefer to leave it out, as that keeps the code clean and simple.
      
      2) `sample_ratio` in `ConcatDataset` works with `int` values by tiling the datasets according to the ratio. Currently I handle non-integer ratios by rounding the ratio to the `first decimal` and then multiplying by `10`. We can see whether such simple heuristics are good enough; there are other options (we can talk about them offline).
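
      The rounding heuristic in item 2 can be sketched as follows. This is a minimal illustration, not the code from the pull request: the helper names `ratios_to_tile_counts` and `tile_datasets` are hypothetical, and plain lists stand in for datasets.

```python
def ratios_to_tile_counts(sample_ratios):
    """Round each ratio to one decimal place, then scale by 10 to get an integer."""
    counts = []
    for ratio in sample_ratios:
        # round(..., 1) keeps the first decimal; the outer round() absorbs
        # floating-point noise such as 0.3 * 10 == 3.0000000000000004.
        counts.append(int(round(round(ratio, 1) * 10)))
    return counts


def tile_datasets(datasets, sample_ratios):
    """Concatenate datasets, repeating each one according to its tile count."""
    combined = []
    for dataset, count in zip(datasets, ratios_to_tile_counts(sample_ratios)):
        combined.extend(list(dataset) * count)
    return combined
```

      One consequence of this sketch is that any ratio below 0.05 rounds to zero tiles, so that dataset would drop out of the concatenation entirely.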
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/849
      
      Differential Revision: D17162460
      
      fbshipit-source-id: d967f3d872f7a1f0aa4ea418bd362b68af9e432f
  2. 19 Sep, 2019 2 commits
    • Add dataset class for weighted sampling with replacement. (#861) · a8a85c26
      Jerry Ma authored
      Summary:
      As discussed with Naman earlier today. Weighted sampling with
      replacement can be done on a per-epoch basis using `set_epoch()`
      functionality, which generates the samples as a function of random seed
      and epoch.
      
      Additionally, `FairseqTask` needs to set the starting epoch for the
      dataset at the very beginning of iterator construction.
      
      Not yet implemented is the per-epoch iterator construction, which
      is necessary to actually regenerate the batches for each epoch.
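
      A minimal sketch of the per-epoch resampling idea described above, using only the standard library. The class name `ResampledDataset` and its constructor arguments are assumptions made for illustration; only the `set_epoch()` hook and the "function of random seed and epoch" behaviour come from the summary.

```python
import random


class ResampledDataset:
    """Hypothetical wrapper that redraws its indices with replacement per epoch."""

    def __init__(self, dataset, weights, seed=0, epoch=0):
        self.dataset = dataset
        self.weights = list(weights)
        self.seed = seed
        self.set_epoch(epoch)

    def set_epoch(self, epoch):
        # Seed from (seed, epoch) so a given epoch always yields the same draw.
        rng = random.Random("{}-{}".format(self.seed, epoch))
        # Weighted sampling with replacement over the index range.
        self.indices = rng.choices(
            range(len(self.dataset)), weights=self.weights, k=len(self.dataset)
        )

    def __getitem__(self, i):
        return self.dataset[self.indices[i]]

    def __len__(self):
        return len(self.indices)
```

      Because the draw depends only on the seed and the epoch number, calling `set_epoch()` with the same epoch again reproduces the same sample, which is what allows every worker to agree on the data without communicating.
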
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/861
      
      Differential Revision: D17460687
      
      Pulled By: jma127
      
      fbshipit-source-id: 1c2a54f04ac96b3561c100a6fd66a9fccbe3c658
    • Add cython language_level hints · 0eaaf355
      Myle Ott authored
      Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1147
      
      Differential Revision: D17468447
      
      Pulled By: myleott
      
      fbshipit-source-id: 0dbac04b92c8df74ad991d5e92cd02036d662369
  3. 18 Sep, 2019 3 commits
  4. 17 Sep, 2019 2 commits
  5. 16 Sep, 2019 1 commit
    • added fast stats sync option (#858) · e1ba32aa
      Naman Goyal authored
      Summary:
      Added `--fast-stat-sync` option.
      This avoids pickle and achieves `~7%` more `wps` on 16 nodes.
      It is less flexible, as it aggregates only basic stats and ignores the aggregate function defined by the criterion.
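
      The trade-off above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the option's actual implementation: the key set in `BASIC_KEYS` is assumed, and `simulated_all_reduce_sum` is a hypothetical stand-in for a real sum all-reduce across workers.

```python
# Assumed set of "basic" stats; the real option may track a different set.
BASIC_KEYS = ("loss", "ntokens", "nsentences", "sample_size")


def pack_stats(logging_output):
    """Flatten one worker's stats into a fixed-order numeric vector (no pickle)."""
    return [float(logging_output.get(key, 0.0)) for key in BASIC_KEYS]


def simulated_all_reduce_sum(vectors):
    """Stand-in for a sum all-reduce: elementwise sum across workers."""
    return [sum(column) for column in zip(*vectors)]


def fast_stat_sync(per_worker_outputs):
    """Aggregate basic stats by summation only; custom aggregation is ignored."""
    totals = simulated_all_reduce_sum([pack_stats(o) for o in per_worker_outputs])
    return dict(zip(BASIC_KEYS, totals))
```

      Sending a fixed-length numeric vector avoids serializing arbitrary Python objects, but anything a criterion would compute beyond plain sums is lost, which matches the reduced flexibility noted above.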
      
      Let me know what you think myleott
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/858
      
      Differential Revision: D17398770
      
      fbshipit-source-id: 36261a1d970e67deeda8211af8f009ef9b4f9c14
  6. 12 Sep, 2019 1 commit
  7. 05 Sep, 2019 1 commit
    • Return predicted token for RoBERTa filling mask · 3e3fe722
      Roman Rädle authored
      Summary:
      Added the `predicted_token` to each `topk` filled output item
      
      Updated RoBERTa filling mask example in README.md
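
      As a rough illustration of what returning the predicted token alongside each filled sentence might look like, here is a self-contained sketch. It does not call RoBERTa: `token_scores` is a hypothetical stand-in for model output, and the `(filled_sentence, score, predicted_token)` tuple layout is an assumption.

```python
def fill_mask(masked_sentence, token_scores, topk=3, mask="<mask>"):
    """Return (filled_sentence, score, predicted_token) for the topk candidates."""
    # Rank candidate tokens by score, highest first.
    ranked = sorted(token_scores.items(), key=lambda item: item[1], reverse=True)
    results = []
    for token, score in ranked[:topk]:
        filled = masked_sentence.replace(mask, token, 1)
        results.append((filled, score, token))
    return results
```

      Including the token itself saves callers from having to diff the filled sentence against the masked input to recover the prediction.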
      
      Reviewed By: myleott
      
      Differential Revision: D17188810
      
      fbshipit-source-id: 5fdc57ff2c13239dabf13a8dad43ae9a55e8931c
  8. 04 Sep, 2019 1 commit
    • Fix multilingual translation bug for to-many case · 1566cfb9
      Peng-Jen Chen authored
      Summary:
      The logic for adding the decoder-side language token was implemented incorrectly.
      We inject the language token by replacing the eos symbol with the language token symbol. However, the parameter for the source / target eos symbol was not set correctly.
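
      The mechanism, and why a wrong eos parameter silently breaks it, can be sketched as follows. The token ids and the helper name `inject_language_token` are hypothetical, and the sketch assumes the decoder input begins with the eos symbol.

```python
EOS = 2          # assumed id of the eos symbol
LANG_DE = 1001   # hypothetical id of a target-language token


def inject_language_token(prev_output_tokens, eos, lang_token):
    """Replace a leading eos symbol in the decoder input with the language token."""
    if prev_output_tokens and prev_output_tokens[0] == eos:
        return [lang_token] + list(prev_output_tokens[1:])
    # If the eos id passed in does not match the data, nothing is replaced.
    return list(prev_output_tokens)
```

      With the correct eos id the language token is injected; with a mismatched id the sequence is returned unchanged and no error is raised, which is the kind of failure the fix addresses.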
      
      Reviewed By: tangyuq
      
      Differential Revision: D17129108
      
      fbshipit-source-id: 6fae385b787370656fd7ca7ab74e6bb91fe5463b
  9. 03 Sep, 2019 2 commits
  10. 01 Sep, 2019 1 commit
  11. 31 Aug, 2019 3 commits
  12. 30 Aug, 2019 2 commits
    • set numpy seed explicitly + other minor fixes (#850) · 4a7cd582
      alexeib authored
      Summary:
      Not setting the numpy seed explicitly at the beginning was an extremely annoying bug to find. It caused different GPUs to have a different view of the data if some randomization was used in the dataset (e.g. subsample dataset).
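
      A minimal sketch of why the explicit seed matters, assuming a dataset that subsamples indices with NumPy's global random state. The function `subsample_indices` is hypothetical and stands in for whatever randomized dataset logic each worker runs.

```python
import numpy as np


def subsample_indices(dataset_size, ratio, seed):
    """Pick a random subset of indices after explicitly seeding NumPy's global RNG."""
    np.random.seed(seed)
    return np.random.choice(
        dataset_size, size=int(dataset_size * ratio), replace=False
    )


# Two workers that seed identically draw the same subset.
worker_a = subsample_indices(100, 0.1, seed=1)
worker_b = subsample_indices(100, 0.1, seed=1)
```

      Without the `np.random.seed` call, each process would start from its own random state and the two workers would generally select different examples.
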
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/850
      
      Differential Revision: D17085006
      
      Pulled By: alexeib
      
      fbshipit-source-id: 62bb2116369fb703df878e6bc24c06f1ea4e75a0
    • Adopt Contributor Covenant · 8777465b
      Paul O'Shannessy authored
      Summary:
      In order to foster healthy open source communities, we're adopting the
      [Contributor Covenant](https://www.contributor-covenant.org/). It has been
      built by open source community members and represents a shared understanding of
      what is expected from a healthy community.
      
      Reviewed By: josephsavona, danobi, rdzhabarov
      
      Differential Revision: D17104640
      
      fbshipit-source-id: d210000de686c5f0d97d602b50472d5869bc6a49
  13. 29 Aug, 2019 1 commit
  14. 28 Aug, 2019 1 commit
    • use numpy function for filter by size when possible (#845) · 108f94bc
      Naman Goyal authored
      Summary:
      For the general masked language modeling use case, this is much faster (`3 minutes vs 1 sec`).
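
      The kind of speedup described here typically comes from replacing a per-example Python loop with one vectorized comparison. The sketch below shows both forms; the function names are hypothetical, and it assumes `sizes` is available as a NumPy array.

```python
import numpy as np


def filter_by_size_loop(indices, sizes, max_size):
    """Baseline: check each index one at a time in Python."""
    return [i for i in indices if sizes[i] <= max_size]


def filter_by_size_numpy(indices, sizes, max_size):
    """Vectorized: a single boolean mask over all indices at once."""
    indices = np.asarray(indices)
    return indices[sizes[indices] <= max_size]
```

      The vectorized path only works when sizes can be looked up as an array, which is why a fallback to the loop is still needed for datasets that compute sizes some other way.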
      
      Let me know what you think about it, myleott. If you don't like all the special-case checking, we can think of reorganizing the dataset APIs to always have `sizes` as a property calculated in `__init__`.
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/845
      
      Reviewed By: myleott
      
      Differential Revision: D16993769
      
      Pulled By: myleott
      
      fbshipit-source-id: 161bba62af2965190c07c47e838ee967cb886e88
  15. 27 Aug, 2019 4 commits
  16. 26 Aug, 2019 1 commit
  17. 23 Aug, 2019 3 commits
    • Suppress leaked semaphore warnings · 833f053d
      Myle Ott authored
      Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/844
      
      Differential Revision: D16985131
      
      Pulled By: myleott
      
      fbshipit-source-id: 66ba3b9aa0cdf329a1e38fc09786f34906afdb43
    • Cythonize token block dataset (#834) · 4fc39538
      Naman Goyal authored
      Summary:
      Cythonized the token block dataset code; it's `> 100x` faster. Building token blocks for the entire `bookwiki+CC+stories+openweb` dataset now takes just ~`39.9` seconds.
      
      TODO:
      1) I think I can make it 2x faster.
      2) cleanup.
      
      EDIT History:
      ~~First pass at parallelizing `token_block_dataset`. The code feels somewhat complicated and cluttered.
      This is 2-3x faster though on my tests on `bookwiki` dataset with both `complete` and `complete_doc` modes.
      myleott Can you take a look for correctness as I am still not 100% sure that I am not missing corner cases.~~
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/834
      
      Test Plan:
      Imported from GitHub, without a `Test Plan:` line.
      
      Test workflow: f133816198
      
      Reviewed By: myleott
      
      Differential Revision: D16970257
      
      Pulled By: myleott
      
      fbshipit-source-id: ec45a308193c9e9f3e7075336c15df4723228d6f
    • wav2vec everstore support · 6e2bd794
      Alexei Baevski authored
      Summary: changes for internal support
      
      Differential Revision: D16646887
      
      fbshipit-source-id: ac5bf6c32901819726249422324eae32a0a6e148
  18. 22 Aug, 2019 3 commits
  19. 21 Aug, 2019 4 commits
    • fix string format to work in python 3.5 (#1050) · 93057cc0
      Trinkle23897 authored
      Summary:
      Change the string format in fairseq/data/subsample_dataset.py#20
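
      For context, f-strings are only available from Python 3.6 onward, so code that must run on 3.5 needs `str.format` (or `%` formatting) instead. The sketch below assumes the original line used an f-string; the message text and variable names are hypothetical.

```python
total, actual_size, ratio = 200, 50, 0.25

# Python 3.6+ only (a syntax error on 3.5):
#   message = f"subsampled dataset from {total} to {actual_size} (ratio={ratio})"

# Works on Python 3.5 as well:
message = "subsampled dataset from {} to {} (ratio={})".format(
    total, actual_size, ratio
)
```
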
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/1050
      
      Differential Revision: D16946060
      
      Pulled By: okhonko
      
      fbshipit-source-id: 0eabf22e7ffd4f658b6d18c87dc6e59c81a355c7
    • Parameterized criterions (#808) · ba5f829f
      Jeff Cai authored
      Summary:
      Support criterion with parameters, such as AutoSegmentationCriterion (ASG) used in wav2letter which has a transition matrix parameter. This is needed to integrate wav2letter's ASG into PySpeech.
      
      With this diff, parameters in criterions will be:
      (1) updated by optimizers, with a configurable learning rate
      (2) saved and loaded from checkpoints, preserving backward compatibility for criterions without parameters
      (3) synchronized across nodes in distributed training.
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/808
      
      Reviewed By: jcai1
      
      Differential Revision: D16934097
      
      Pulled By: okhonko
      
      fbshipit-source-id: 121ec9382459385c6f9cbef3a8274bec1a434038
    • Multiset (#838) · a2f5361d
      alexeib authored
      Summary:
      Adds ability to tag individual examples with the names of their datasets, along with some minor miscellaneous fixes and improvements
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/838
      
      Differential Revision: D16919175
      
      Pulled By: alexeib
      
      fbshipit-source-id: 4bf493299645bae63f3ee6382e15f18a9f73666c
    • vggblock support without pooling and pooling_kernel_size missing self (#839) · 7a31fe06
      Siddharth Dalmia authored
      Summary:
      1) VggBlock was not supported if the pooling kernel size was None.
      2) Since we modify the pooling kernel size using `_pair`, we should use `self.pooling_kernel_size`. That said, I agree it doesn't matter in practice, as PyTorch is robust to this.
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/839
      
      Differential Revision: D16934112
      
      Pulled By: okhonko
      
      fbshipit-source-id: b6b95163b0e7f7203d76d535f01a41912382bdc3
  20. 20 Aug, 2019 2 commits
  21. 19 Aug, 2019 1 commit