1. 24 Jan, 2019 4 commits
  2. 17 Jan, 2019 2 commits
    • Fix stories generation · d259ffa9
      Myle Ott authored
      Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/454
      
      Differential Revision: D13708565
      
      Pulled By: myleott
      
      fbshipit-source-id: 5cd0e07e3e1885eef14e3a5e8074f24cf4bde632
    • Fix initial learning rate (#453) · 2210fa71
      Myle Ott authored
      Summary:
      There was a very subtle bug here 😢 When we recently removed this line (7633129b), the learning rate scheduler no longer got initialized until after the first update. Unfortunately pytorch optimizers store the learning rate in their internal state, so some learning rate schedulers use their `__init__` method to reset the learning rate to a sane initial value. This is especially problematic for LR schedulers that include a warmup, where the Optimizer is likely to contain the peak learning rate at initialization, and it's only in the LR scheduler's `__init__` that the (much smaller) warmup value is set.
      
      For example, the inverse_sqrt scheduler resets the learning rate upon initialization:
      https://github.com/pytorch/fairseq/blob/7853818c2e33a63ec17a31bcfe20e4fc75d94130/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py#L48-L50
      
      **Impact:** For the last ~1.5 weeks, the first training update would use the optimizer's default learning rate instead of the initial rate set by the LR scheduler. All subsequent updates used the correct learning rates. This primarily affects LR schedulers with warmups.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/453
      
      Differential Revision: D13704453
      
      Pulled By: myleott
      
      fbshipit-source-id: a946da30100f837c66bdc6b9b77b014ab4eb8764
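      A minimal sketch of the failure mode, assuming a simplified inverse-sqrt schedule with hypothetical names and defaults (not fairseq's actual class): the optimizer starts out holding the peak LR, and only the scheduler's `__init__` resets it to the much smaller warmup value, which is why the scheduler must exist before the first update.

      ```python
      # Simplified inverse-sqrt warmup schedule (hypothetical, not fairseq's code).
      import torch


      class InverseSqrtSchedule:
          def __init__(self, optimizer, warmup_updates=4000, warmup_init_lr=1e-7, peak_lr=5e-4):
              self.optimizer = optimizer
              self.warmup_updates = warmup_updates
              self.warmup_init_lr = warmup_init_lr
              # linear warmup slope, then decay proportional to 1/sqrt(num_updates)
              self.lr_step = (peak_lr - warmup_init_lr) / warmup_updates
              self.decay_factor = peak_lr * warmup_updates ** 0.5
              self.set_lr(warmup_init_lr)  # resets the optimizer's LR immediately

          def set_lr(self, lr):
              for group in self.optimizer.param_groups:
                  group["lr"] = lr

          def step_update(self, num_updates):
              if num_updates < self.warmup_updates:
                  lr = self.warmup_init_lr + num_updates * self.lr_step
              else:
                  lr = self.decay_factor * num_updates ** -0.5
              self.set_lr(lr)
              return lr


      # If the scheduler is only constructed after the first update, that update
      # runs with the optimizer's default (peak) LR instead of warmup_init_lr.
      params = [torch.nn.Parameter(torch.zeros(1))]
      opt = torch.optim.Adam(params, lr=5e-4)
      sched = InverseSqrtSchedule(opt)  # optimizer LR is now 1e-7, not 5e-4
      sched.step_update(1)              # LR then grows linearly during warmup
      ```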
  3. 16 Jan, 2019 3 commits
    • FIX: '--user-dir' on multi-gpu (#449) · 7853818c
      Davide Caroselli authored
      Summary:
      On a multi-gpu training scenario, the `train.py` script spawns new processes with `torch.multiprocessing.spawn`. Unfortunately those child processes don't inherit the modules imported with `--user-dir`.
      
      This pull request fixes the problem: the custom module import is now explicit in every `main()` function.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/449
      
      Differential Revision: D13676922
      
      Pulled By: myleott
      
      fbshipit-source-id: 520358d66155697885b878a37e7d0484bddbc1c6
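      A sketch of the idea behind the fix, with illustrative names (not the actual fairseq code): processes created by `torch.multiprocessing.spawn` start a fresh interpreter, so each spawned entry point has to re-import the `--user-dir` modules itself.

      ```python
      # Illustrative --user-dir re-import in spawned workers (hypothetical names).
      import importlib
      import os
      import sys

      import torch.multiprocessing as mp


      def import_user_module(user_dir):
          """Import a package from an arbitrary directory given on the command line."""
          user_dir = os.path.abspath(user_dir)
          module_name = os.path.basename(user_dir)
          if module_name not in sys.modules:
              sys.path.insert(0, os.path.dirname(user_dir))
              importlib.import_module(module_name)


      def distributed_main(rank, user_dir):
          # Runs in a spawned child: imports done in the parent are gone here,
          # so the user modules must be imported again explicitly.
          import_user_module(user_dir)
          # ... set up device `rank` and run training ...


      def main(user_dir, num_gpus):
          import_user_module(user_dir)  # single-process / parent path
          if num_gpus > 1:
              # spawn() starts fresh interpreters, hence the re-import above
              mp.spawn(distributed_main, args=(user_dir,), nprocs=num_gpus)
          else:
              distributed_main(0, user_dir)
      ```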
    • Add --checkpoint-upper-bound to average_checkpoints.py (#452) · bdec179b
      Myle Ott authored
      Summary:
      This is useful for averaging the last N checkpoints, ending at some "best" checkpoint.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/452
      
      Differential Revision: D13695407
      
      Pulled By: myleott
      
      fbshipit-source-id: 5d9d2bff3706834f01501e9259834c77fb335817
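      A rough sketch of averaging a window of checkpoints that ends at an upper bound (hypothetical helpers and checkpoint layout, not the actual average_checkpoints.py):

      ```python
      # Hypothetical checkpoint-averaging helpers (illustrative only).
      import collections

      import torch


      def average_checkpoints(paths):
          """Element-wise average of the model parameters stored in `paths`."""
          avg = collections.OrderedDict()
          for path in paths:
              state = torch.load(path, map_location="cpu")["model"]  # assumed layout
              for name, param in state.items():
                  avg[name] = avg.get(name, 0) + param.float() / len(paths)
          return avg


      def last_n_paths(template, n, last_epoch, upper_bound=None):
          """Pick n consecutive epoch checkpoints ending at upper_bound (if given)."""
          end = upper_bound if upper_bound is not None else last_epoch
          return [template.format(epoch) for epoch in range(end - n + 1, end + 1)]


      # e.g. average checkpoints 26..30 when epoch 30 had the best validation loss:
      # avg = average_checkpoints(
      #     last_n_paths("checkpoint{}.pt", n=5, last_epoch=35, upper_bound=30))
      ```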
    • optimizations for token_block_dataset · d1dc66d9
      Ruty Rinott authored
      Summary:
      Optimize memory use of token_block_dataset by replacing Python data structures with numpy arrays.
      Applies the needed parts of D13498973 instead of rebasing it on these changes.
      
      Reviewed By: edunov
      
      Differential Revision: D13678485
      
      fbshipit-source-id: c0c827a8b95834a6a5456476040ebdc8e42136d4
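      An illustrative sketch of the kind of change this describes (not the actual diff): block boundaries kept in a single numpy array instead of a Python list of tuples.

      ```python
      # Illustrative comparison (not the actual diff) of storing block boundaries.
      import numpy as np

      sizes = np.array([7, 3, 12, 5, 9], dtype=np.int64)  # token count per sentence
      block_size = 8

      # Python-object version: a list of (start, end) tuples built in a loop costs
      # several Python objects per block and is slow to index:
      #   slice_indices = [(0, 8), (8, 16), ...]

      # Numpy version: one int64 array of shape (num_blocks, 2), computed at once.
      total = int(sizes.sum())
      starts = np.arange(0, total, block_size, dtype=np.int64)
      ends = np.minimum(starts + block_size, total)
      slice_indices = np.stack([starts, ends], axis=1)

      print(slice_indices.tolist())  # [[0, 8], [8, 16], [16, 24], [24, 32], [32, 36]]
      ```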
  4. 15 Jan, 2019 2 commits
  5. 14 Jan, 2019 2 commits
  6. 10 Jan, 2019 1 commit
  7. 09 Jan, 2019 2 commits
  8. 07 Jan, 2019 1 commit
  9. 05 Jan, 2019 3 commits
  10. 28 Dec, 2018 3 commits
  11. 26 Dec, 2018 2 commits
  12. 24 Dec, 2018 2 commits
    • Improve memory efficiency of FP16 optimization (#404) · 03a57dec
      Myle Ott authored
      Summary:
      Previously when training with --fp16, we stored a copy of the model parameters in FP32 for optimization, which consumed a lot of memory. An alternative is to just do the conversions to FP32 on the fly, which allows the caching allocator to reuse/save some memory.
      
      This reduces peak memory usage by ~20% with a negligible reduction in training speed (~2% slower) when training a big transformer on 8 GPUs on wmt en-de with --update-freq=16.
      
      This does not affect convergence, i.e., models will train exactly as they did before.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/404
      
      Differential Revision: D13394376
      
      Pulled By: myleott
      
      fbshipit-source-id: 2b9f808548df4782110513c9cfc9f7c6159bcbbf
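      A minimal sketch of the idea, assuming a plain SGD-with-momentum update (hypothetical code, not fairseq's actual FP16 optimizer): each parameter is upcast to FP32 only for the duration of its own update, so at most one FP32 temporary is alive at a time and the caching allocator can reuse that memory instead of holding a persistent FP32 copy of every parameter.

      ```python
      # Hypothetical memory-efficient FP16 step (not fairseq's implementation).
      import torch


      @torch.no_grad()
      def fp16_step(fp16_params, momentum_buffers, lr=1e-3, momentum=0.9):
          """One SGD-with-momentum step over FP16 params, doing the math in FP32.

          momentum_buffers holds the optimizer state in FP32 (one tensor per param);
          the FP32 copies of the parameter and gradient are created on the fly.
          """
          for p, buf in zip(fp16_params, momentum_buffers):
              grad32 = p.grad.float()           # temporary FP32 gradient
              p32 = p.data.float()              # temporary FP32 parameter
              buf.mul_(momentum).add_(grad32)   # state update in FP32
              p32.add_(buf, alpha=-lr)
              p.data.copy_(p32)                 # write the result back to FP16
              # the FP32 temporaries are released as the loop moves on, so the
              # caching allocator can reuse their memory for the next parameter


      # usage sketch
      params = [torch.nn.Parameter(torch.randn(4, dtype=torch.float16)) for _ in range(3)]
      state = [torch.zeros(4) for _ in range(3)]
      for p in params:
          p.grad = torch.randn(4, dtype=torch.float16)
      fp16_step(params, state)
      ```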
    • Add BufferedIterator (#419) · 0f833526
      Myle Ott authored
      Summary:
      This improves performance for datasets that load data lazily. Enabled by default since it shouldn't compromise performance for non-lazy datasets.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/419
      
      Differential Revision: D13546585
      
      Pulled By: myleott
      
      fbshipit-source-id: f6152e2047291b0d68cd7506cd772b0caafe95be
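      A sketch of a background-thread buffered iterator (hypothetical, not the actual fairseq implementation): a producer thread pulls items from a possibly slow, lazily-loading iterable into a bounded queue so the training loop rarely waits on data loading.

      ```python
      # Hypothetical buffered iterator (not the actual fairseq BufferedIterator).
      import queue
      import threading


      class BufferedIterator:
          _END = object()  # sentinel marking the end of the wrapped iterable

          def __init__(self, iterable, buffer_size=8):
              self._queue = queue.Queue(maxsize=buffer_size)
              self._thread = threading.Thread(target=self._fill, args=(iterable,), daemon=True)
              self._thread.start()

          def _fill(self, iterable):
              for item in iterable:
                  self._queue.put(item)  # blocks while the buffer is full
              self._queue.put(self._END)

          def __iter__(self):
              return self

          def __next__(self):
              item = self._queue.get()
              if item is self._END:
                  raise StopIteration
              return item


      # usage sketch: wrap a lazily-loading batch iterator
      for batch in BufferedIterator(range(10), buffer_size=4):
          pass  # consume batches while the producer thread stays ahead
      ```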
  13. 18 Dec, 2018 1 commit
    • data per gpu change · 9ca82a0e
      Haoran Li authored
      Summary: Avoid loading the entire dataset on each GPU, to reduce the memory footprint
      
      Reviewed By: rutyrinott
      
      Differential Revision: D13163548
      
      fbshipit-source-id: 4ba717c8021ba5723d02225bae5782e2c3a18640
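      An illustrative sketch of per-rank sharding (hypothetical names, not the actual change): each worker computes only the slice of example indices it will use, rather than materializing the full dataset in every GPU process.

      ```python
      # Hypothetical per-rank index sharding (illustrative, not the actual change).
      import numpy as np


      def shard_indices(num_examples, rank, world_size, seed=0):
          """Deterministically pick this rank's slice of the example indices."""
          order = np.random.RandomState(seed).permutation(num_examples)
          return order[rank::world_size]  # every world_size-th index, offset by rank


      # Each of 8 workers touches only ~1/8 of the examples:
      local_indices = shard_indices(num_examples=1_000_000, rank=3, world_size=8)
      # ...load or cache only the examples listed in `local_indices` in this process...
      ```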
  14. 11 Dec, 2018 1 commit
  15. 08 Dec, 2018 1 commit
  16. 07 Dec, 2018 2 commits
    • Add --fp16-scale-tolerance (#397) · 03ef3ab8
      Myle Ott authored
      Summary:
      Let's only decrease the loss scale if a large enough percentage of batches overflow.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/397
      
      Differential Revision: D13355159
      
      Pulled By: myleott
      
      fbshipit-source-id: e17dde73d34a639519b4348c013fdd19d2b314e6
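      A sketch of the tolerance idea (hypothetical names and defaults, not fairseq's actual loss scaler): the loss scale is only halved once the fraction of overflowing updates since the last rescale exceeds the tolerance.

      ```python
      # Hypothetical dynamic loss scaler with an overflow tolerance.
      class DynamicLossScaler:
          def __init__(self, init_scale=2.0 ** 15, scale_window=2000, tolerance=0.05):
              self.loss_scale = init_scale
              self.scale_window = scale_window
              self.tolerance = tolerance
              self._iter = 0
              self._last_overflow_iter = -1
              self._last_rescale_iter = -1
              self._overflows_since_rescale = 0

          def update_scale(self, overflow):
              iters_since_rescale = self._iter - self._last_rescale_iter
              if overflow:
                  self._last_overflow_iter = self._iter
                  self._overflows_since_rescale += 1
                  pct_overflow = self._overflows_since_rescale / float(iters_since_rescale)
                  if pct_overflow >= self.tolerance:
                      self.loss_scale /= 2.0  # only decrease past the tolerance
                      self._last_rescale_iter = self._iter
                      self._overflows_since_rescale = 0
              elif (self._iter - self._last_overflow_iter) % self.scale_window == 0:
                  self.loss_scale *= 2.0      # long stretch with no overflow
                  self._last_rescale_iter = self._iter
              self._iter += 1
      ```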
    • Take a dummy train step under OOM to keep multiprocessing in sync · 6c006a34
      Halil Akin authored
      Summary: This is not a guaranteed solution (processes may still get out of sync if the OOM happens after an all_gather/all_reduce has already been done), but it should still make multiprocessing training more robust in practice, since we usually seem to OOM early enough.
      
      Reviewed By: myleott
      
      Differential Revision: D13086018
      
      fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee
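      An illustrative sketch of the dummy-step idea (hypothetical batch layout, not the actual trainer code): a worker that hits OOM still runs a zero-gradient forward/backward on a small cached batch so that every process reaches the gradient all_reduce together.

      ```python
      # Illustrative OOM fallback (hypothetical helper, not the actual trainer code).
      import torch


      def train_step(model, criterion, batch, dummy_batch):
          try:
              loss = criterion(model(batch["input"]), batch["target"])
              loss.backward()
              return loss.item()
          except RuntimeError as e:
              if "out of memory" not in str(e):
                  raise
              torch.cuda.empty_cache()  # free whatever the failed step allocated
              # Dummy step: same graph, but scale the loss to zero so the gradients
              # are zero and this worker still participates in the all_reduce.
              loss = criterion(model(dummy_batch["input"]), dummy_batch["target"]) * 0.0
              loss.backward()
              return 0.0
      ```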
  17. 06 Dec, 2018 4 commits
  18. 04 Dec, 2018 1 commit
  19. 30 Nov, 2018 1 commit
  20. 29 Nov, 2018 2 commits