1. 07 May, 2019 1 commit
    • Davide Caroselli's avatar
      Memory-Mapped IndexedDataset implementation (#589) · a1c997bd
      Davide Caroselli authored
      Summary:
      Following discussion in https://github.com/pytorch/fairseq/issues/574:
      
      - Implemented MMapIndexedDataset and MMapIndexedDatasetBuilder, compatible with IndexedDataset/IndexedDatasetBuilder
      - Updated scripts/read_binarized.py to support the new MMapIndexedDataset
      - Replaced the '--raw-text' and '--lazy-load' options with '--dataset-impl', and moved the option definition from custom task args to the more high-level options.add_dataset_args() (more appropriate)
      - Also implemented utility functions in indexed_dataset: make_dataset() and dataset_exists()
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/589
      
      Differential Revision: D14597128
      
      Pulled By: myleott
      
      fbshipit-source-id: 4e92d99920cbaa52cfe5a0f1f5d9ae5c92d4268e
      a1c997bd
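The memory-mapped design above can be sketched in a few lines. This is a simplified illustration, not fairseq's actual MMapIndexedDataset layout (which keeps dtype and size metadata in a separate index file): token ids live back-to-back in one binary file, per-sample offsets are kept in an index, and each sample is read lazily through a memory map so the OS pages in only the bytes touched.

```python
import array
import mmap

ITEM = array.array("i").itemsize  # bytes per token id (typically 4)

def write_flat_dataset(path, sentences):
    """Write token-id sentences back-to-back into one binary file;
    return cumulative offsets (in tokens) delimiting each sample."""
    offsets = [0]
    with open(path, "wb") as f:
        for sent in sentences:
            array.array("i", sent).tofile(f)
            offsets.append(offsets[-1] + len(sent))
    return offsets

def read_sample(path, offsets, i):
    """Read only the i-th sample through a memory map, without
    loading the whole corpus into RAM."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            sample = array.array("i")
            sample.frombytes(mm[offsets[i] * ITEM:offsets[i + 1] * ITEM])
            return sample.tolist()
```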
  2. 30 Apr, 2019 1 commit
  3. 01 Mar, 2019 2 commits
  4. 28 Feb, 2019 1 commit
  5. 26 Feb, 2019 1 commit
    • Myle Ott's avatar
      Multilingual training example (#527) · 00493490
      Myle Ott authored
      Summary:
      * Add example for multilingual translation on IWSLT'17
      * Match dataset ordering for multilingual_translation and translation
      * Fix bug with LegacyDistributedDataParallel when calling forward of sub-modules
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/527
      
      Differential Revision: D14218372
      
      Pulled By: myleott
      
      fbshipit-source-id: 2e3fe24aa39476bcc5c9af68ef9a40192db34a3b
      00493490
  6. 07 Feb, 2019 1 commit
    • Ruty Rinott's avatar
      stitch preprocessing pipeline · cea0e4b9
      Ruty Rinott authored
      Summary:
      1. Add a call to binarization to complete the preprocessing pipeline
      2. Add the ability to specify a task to select the dictionary, and add a BERT task
      3. Get rid of function calls that are no longer needed after moving functions from fairseq here
      
      Reviewed By: jingfeidu
      
      Differential Revision: D13977842
      
      fbshipit-source-id: ec9bbb4e98e62e12c20ba68bb52b8bcc94aee91d
      cea0e4b9
  7. 05 Feb, 2019 1 commit
  8. 01 Feb, 2019 1 commit
    • Davide Caroselli's avatar
      Support custom Dictionary implementations in 'preprocess.py' (#448) · bbb4120b
      Davide Caroselli authored
      Summary:
      The `preprocess.py` script has been refactored in order to:
      
      1. Use the `options` module for command-line argument parsing. This gives `preprocess.py` the ability to load custom modules with the `--user-dir` flag (already implemented in all other binaries)
      2. Dictionary loading and building code has been moved to the Task implementation. This allows custom Dictionary classes to be used during the data generation step.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/448
      
      Differential Revision: D13674819
      
      Pulled By: myleott
      
      fbshipit-source-id: b40648a98ed6c08284577e5ec25876e018d8c822
      bbb4120b
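Point 2 enables the pattern sketched below. The class and method names here are illustrative, not fairseq's exact API: because dictionary construction is a method on the task, a subclassed task can substitute any Dictionary-compatible implementation during preprocessing.

```python
class Dictionary:
    """Minimal stand-in for a word-to-index dictionary."""
    def __init__(self):
        self.indices = {}

    def add_symbol(self, word):
        # Assign the next free index the first time a word is seen.
        return self.indices.setdefault(word, len(self.indices))

class TranslationTask:
    dictionary_class = Dictionary  # a custom task may override this

    @classmethod
    def build_dictionary(cls, filenames):
        """Build the vocabulary from raw text files. Because this lives
        on the task, a custom task controls the dictionary type used."""
        d = cls.dictionary_class()
        for fn in filenames:
            with open(fn, "r", encoding="utf-8") as f:
                for line in f:
                    for word in line.split():
                        d.add_symbol(word)
        return d
```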
  9. 29 Jan, 2019 1 commit
  10. 24 Jan, 2019 2 commits
    • Davide Caroselli's avatar
      Enforce UTF-8 when open() text files (#460) · 38f1dee9
      Davide Caroselli authored
      Summary:
      When opening text files without specifying the encoding (i.e. `open(path, "r")` or `open(path, "w")`), Python 3 uses the preferred locale encoding (`locale.getpreferredencoding()`), so the result is platform dependent and can change from one machine to another.
      
      I believe fairseq should enforce its own standard (UTF-8 seems like the best choice to me). This pull request explicitly specifies UTF-8 encoding when reading text files.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/460
      
      Differential Revision: D13802525
      
      Pulled By: myleott
      
      fbshipit-source-id: 672fd55707ee559ab36d74bc1c24026166ea2367
      38f1dee9
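The fix boils down to one habit, shown here in miniature: pass the encoding explicitly so file I/O behaves identically on every machine, regardless of locale.

```python
def read_lines(path):
    # Explicit encoding: never fall back to locale.getpreferredencoding().
    with open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()

def write_lines(path, lines):
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```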
    • vufg's avatar
      change f"{args}" to "{}".format(args) (#467) · 8eb49c84
      vufg authored
      Summary:
      Although both are supported by Python 3.6, I think it would be better to unify the usage of string formatting functions.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/467
      
      Differential Revision: D13802506
      
      Pulled By: myleott
      
      fbshipit-source-id: 5c4877547b1c4ca806ab54c80ae483cfbaa7827a
      8eb49c84
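For reference, the two spellings being unified are equivalent; the difference is that f-string literals require Python 3.6+, while str.format runs on any Python 3 (the message text below is illustrative):

```python
args = {"lr": 0.25, "arch": "fconv"}

msg_fstring = f"| training with args {args}"         # f-string, 3.6+ only
msg_format = "| training with args {}".format(args)  # any Python 3

assert msg_fstring == msg_format
```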
  11. 16 Jan, 2019 1 commit
    • Davide Caroselli's avatar
      FIX: '--user-dir' on multi-gpu (#449) · 7853818c
      Davide Caroselli authored
      Summary:
      In a multi-GPU training scenario, the `train.py` script spawns new processes with `torch.multiprocessing.spawn`. Unfortunately, those child processes don't inherit the modules imported with `--user-dir`.
      
      This pull request fixes the problem: the custom module import is now explicit in every `main()` function.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/449
      
      Differential Revision: D13676922
      
      Pulled By: myleott
      
      fbshipit-source-id: 520358d66155697885b878a37e7d0484bddbc1c6
      7853818c
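A hedged sketch of the fix (helper and argument names are illustrative, simplified from fairseq's real `--user-dir` handling): processes started by spawn are fresh interpreters, so the user-module import must be repeated at the top of each per-process main.

```python
import importlib
import os
import sys

def import_user_module(user_dir):
    """Mimic --user-dir: put the directory's parent on sys.path and
    import the directory itself as a package."""
    if user_dir is None:
        return None
    user_dir = os.path.abspath(user_dir)
    parent, name = os.path.split(user_dir)
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return importlib.import_module(name)

def distributed_main(rank, args):
    # Explicit re-import in the child process; without this, anything the
    # user module registers (models, tasks, ...) would be missing here.
    import_user_module(args.user_dir)
    # ... per-rank training setup would follow ...
```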
  12. 05 Jan, 2019 1 commit
  13. 06 Dec, 2018 1 commit
  14. 18 Nov, 2018 1 commit
  15. 10 Nov, 2018 1 commit
    • Ruty Rinott's avatar
      pipeline for LM training · 880e7cd4
      Ruty Rinott authored
      Summary:
      Step 2 of the pipeline for LM training. It assumes tokenized text data as input, splits it into train/validation/test, and runs binarization
      (step a_ii in https://fb.quip.com/kazzAxvZHBj9)
      
      Reviewed By: borguz
      
      Differential Revision: D10454705
      
      fbshipit-source-id: 74e8679041f5507c4e404c1b719547c2ae9ed983
      880e7cd4
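The split step described above can be sketched as follows (the fractions, seeding, and function name are illustrative choices, not the commit's actual defaults):

```python
import random

def split_corpus(lines, valid_frac=0.01, test_frac=0.01, seed=1):
    """Shuffle tokenized lines and carve out validation and test sets;
    the remainder becomes the training set."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test
```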
  16. 25 Sep, 2018 1 commit
  17. 03 Sep, 2018 1 commit
  18. 31 Jul, 2018 1 commit
  19. 21 Jun, 2018 1 commit
  20. 15 Jun, 2018 3 commits
    • alexeib's avatar
      Conv lm implementation · 4c2ef2de
      alexeib authored
      This implements the convolutional language model from https://arxiv.org/pdf/1612.08083.pdf
      
      There are 3 modes for constructing batches:
      
      - token block: fill each sample with a specified number of tokens without regard for sentence delimiters - this is what was used for training in the paper
      - complete: fill each sample with a specified number of tokens but make sure it contains only complete sentences (i.e. if the next sentence would go over the token block limit, it is moved to the next sample) - this was used for evaluation in the paper
      - eos: one sentence per sample (blank lines are skipped)
      
      some results:
      
      GCNN-13 - GBW - 37.46
      GCNN-14B - GBW - 33.88
      GCNN-8 - Wiki103 - 43.76
      GCNN-14 - Wiki103 - 35.66
      
      train:
      
      python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 35 --save-interval 1 --save-interval-updates 1000 --keep-interval-updates 25 --arch fconv_lm --optimizer nag --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 --decoder-embed-dim 280 --decoder-layers '[(850, 6)] * 3 + [(850,1)] + [(850,5)] * 4 + [(850,1)] + [(850,4)] * 3 + [(1024,4)] + [(2048, 4)]' --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion cross_entropy --max-tokens 1024 --max-target-positions 1024 --seed 1 --log-format json --log-interval 500
      
      eval:
      
      python eval_lm.py ~abaevski/data/wiki103 --path '/checkpoint02/abaevski/2018-04-27/lm_wiki.fp16.mxup300000.fconv.adam.lrs=reduce_lr_on_plateau.emb280.layers(850,6)*3+(850,1)+(850,5)*4+(850,1)+(850,4)*3+(1024,1)+(2048,4).lr0.0005.clp0.1.drp0.3.wd0.0.crt=cross_entropy.mxtk2048.smptk256.seed1.ngpu8/checkpoint_last.pt'
      4c2ef2de
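The three batching modes can be sketched as follows. This is a simplified illustration of the idea only; fairseq's actual token-block dataset works over index offsets rather than materialized token lists, and the function name here is invented:

```python
def token_blocks(sentences, block_size, mode="token"):
    """Group token-id sentences into samples per the three modes above.

    - "token":    pack exactly block_size tokens per sample, ignoring
                  sentence boundaries (training mode in the paper)
    - "complete": pack whole sentences up to block_size; a sentence that
                  would overflow starts the next sample (evaluation mode)
    - "eos":      one sentence per sample, skipping empty ones
    """
    if mode == "eos":
        return [s for s in sentences if s]
    if mode == "complete":
        samples, cur = [], []
        for s in sentences:
            if cur and len(cur) + len(s) > block_size:
                samples.append(cur)
                cur = []
            cur = cur + s
        if cur:
            samples.append(cur)
        return samples
    # "token": flatten everything, then cut fixed-size blocks
    flat = [t for s in sentences for t in s]
    return [flat[i:i + block_size] for i in range(0, len(flat), block_size)]
```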
    • Myle Ott's avatar
      Fix preprocess.py · fa7c575a
      Myle Ott authored
      fa7c575a
    • Myle Ott's avatar
      745d5fbd
  21. 05 Mar, 2018 1 commit
  22. 27 Feb, 2018 2 commits
    • Myle Ott's avatar
      Fix tests and flake8 · 29c82741
      Myle Ott authored
      29c82741
    • Myle Ott's avatar
      fairseq-py goes distributed (#106) · 66415206
      Myle Ott authored
      This PR includes breaking API changes to modularize fairseq-py and adds support for distributed training across multiple nodes.
      
      Changes:
      - c7033ef: add support for distributed training! See updated README for usage.
      - e016299: modularize fairseq-py, adding support for register_model, register_criterion, register_optimizer, etc.
      - 154e440: update LSTM implementation to use PackedSequence objects in the encoder, better following best practices and improving perf
      - 90c2973 and 1da6265: improve unit test coverage
      66415206
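The registration pattern added in e016299 can be sketched generically (a simplified stand-in; fairseq's real registries also wire up argument parsers and enforce base classes):

```python
MODEL_REGISTRY = {}

def register_model(name):
    """Class decorator: file the decorated class under `name` so it can
    later be looked up from an --arch style command-line flag."""
    def wrapper(cls):
        if name in MODEL_REGISTRY:
            raise ValueError("duplicate model name: {}".format(name))
        MODEL_REGISTRY[name] = cls
        return cls
    return wrapper

@register_model("toy_lstm")
class ToyLSTMModel:
    pass
```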
  23. 13 Nov, 2017 1 commit
  24. 08 Nov, 2017 2 commits
  25. 19 Oct, 2017 1 commit
  26. 15 Sep, 2017 1 commit