1. 20 Jun, 2019 1 commit
  2. 12 Jun, 2019 1 commit
    • Add Model Averaging · 6982c404
      Nayan Singhal authored
      Summary:
      Implemented model averaging for fairseq.
      Removed the DDP wrapper if a global optimizer is provided.
      All models are synced based on the iteration provided in the input (sketched below).
      
      TODO:
      1) Fix the throughput and WPS meters. Other meters need checking too.
      2) Replace the model averaging code with a BMUF algorithm implementation.
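      A minimal sketch of the periodic averaging step described in the summary, assuming torch.distributed is already initialized; the helper name, the sync_interval argument, and its placement in the training loop are illustrative assumptions, not fairseq's actual API:

      ```python
      import torch.distributed as dist


      def average_model_params(model, sync_interval, num_updates):
          """Average model parameters across workers every `sync_interval` updates.

          Hypothetical helper; the real change (later replaced by BMUF) differs
          in detail.
          """
          if sync_interval <= 0 or num_updates % sync_interval != 0:
              return
          world_size = dist.get_world_size()
          for p in model.parameters():
              # Sum each parameter tensor across workers, then divide by the
              # world size to obtain the average.
              dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
              p.data.div_(world_size)
      ```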
      
      Reviewed By: myleott
      
      Differential Revision: D15711044
      
      fbshipit-source-id: 58a4af74db2a61d06762597b95836cbeb1ed82cc
  3. 30 May, 2019 1 commit
  4. 17 May, 2019 2 commits
  5. 09 May, 2019 1 commit
  6. 04 May, 2019 1 commit
  7. 03 May, 2019 1 commit
  8. 02 May, 2019 1 commit
  9. 01 May, 2019 1 commit
  10. 30 Apr, 2019 1 commit
  11. 29 Apr, 2019 1 commit
  12. 10 Apr, 2019 1 commit
  13. 04 Apr, 2019 1 commit
    • aligned training task and CE related changes · 3658fa32
      Jay Mahadeokar authored
      Summary:
      This diff adds:
      
      1. An aligned training task, specifically for cross-entropy criterion training using prod data and prod-like models.
      2. A few changes to correctly register the task and criterions (a registration sketch follows after this list).
      3. Changes to the trainer code for propagating the accuracy metrics we care about during training.
      
      A couple of things are hacky right now:
      - The reporting is not modular (this needs to be thought about in general for fairseq).
      
      - The dummy-batch creation could be specific to the task instead of the dataset.
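      A minimal sketch of how a task and criterion are typically registered via fairseq's decorators, as referenced in item 2 above; the names aligned_training and aligned_cross_entropy, the sample layout, and the accuracy bookkeeping are illustrative assumptions, not the contents of D14670482:

      ```python
      import torch.nn.functional as F

      from fairseq.criterions import FairseqCriterion, register_criterion
      from fairseq.tasks import FairseqTask, register_task


      @register_task('aligned_training')  # hypothetical task name
      class AlignedTrainingTask(FairseqTask):
          """Skeleton only; the real task loads prod data and prod-like models."""

          @classmethod
          def setup_task(cls, args, **kwargs):
              return cls(args)


      @register_criterion('aligned_cross_entropy')  # hypothetical criterion name
      class AlignedCrossEntropyCriterion(FairseqCriterion):
          """Cross-entropy criterion that also reports accuracy to the trainer."""

          def forward(self, model, sample, reduce=True):
              net_output = model(**sample['net_input'])
              lprobs = model.get_normalized_probs(net_output, log_probs=True)
              lprobs = lprobs.view(-1, lprobs.size(-1))
              target = model.get_targets(sample, net_output).view(-1)
              loss = F.nll_loss(lprobs, target,
                                reduction='sum' if reduce else 'none')
              # Accuracy goes into the logging output so the trainer can
              # propagate it to its meters.
              n_correct = (lprobs.argmax(dim=-1) == target).sum().item()
              sample_size = sample['ntokens']
              logging_output = {
                  'loss': loss.data,
                  'n_correct': n_correct,
                  'ntokens': sample['ntokens'],
                  'sample_size': sample_size,
              }
              return loss, sample_size, logging_output
      ```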
      
      Reviewed By: myleott
      
      Differential Revision: D14670482
      
      fbshipit-source-id: dc077247b2ae9d26a8e842a386ec5faa5771e836
  14. 12 Mar, 2019 1 commit
    • Handle 3+ dimensional input in sequence_generator + nits · 860010e9
      Dmytro Okhonko authored
      Summary: sequence_generator assumes that the model input is a 2D tensor of longs, but it can be something like a 3D tensor of floats, and we should be able to handle that as long as the first dimension is the batch size, followed by the source lengths.
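      A small illustration of the relaxed shape assumption, using hypothetical tensors and a hypothetical batch_size helper; only the leading batch dimension (followed by source lengths) matters, so both 2D long inputs and 3D float inputs can be handled:

      ```python
      import torch

      # 2D input the generator originally assumed: token IDs of shape (batch, src_len).
      tokens = torch.randint(0, 1000, (8, 20), dtype=torch.long)

      # 3D input it should also accept: float features (e.g. acoustic frames)
      # of shape (batch, src_len, feat_dim).
      features = torch.randn(8, 20, 80)


      def batch_size(src):
          # Only the leading batch dimension is assumed; any trailing feature
          # dimensions are opaque to the generator.
          return src.size(0)


      assert batch_size(tokens) == batch_size(features) == 8
      ```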
      
      Reviewed By: myleott
      
      Differential Revision: D14420044
      
      fbshipit-source-id: bf8b1e42ad1873f7b803c1a377b0af21648db015
  15. 26 Feb, 2019 1 commit
    • Multilingual training example (#527) · 00493490
      Myle Ott authored
      Summary:
      * Add example for multilingual translation on IWSLT'17
      * Match dataset ordering for multilingual_translation and translation
      * Fix bug with LegacyDistributedDataParallel when calling forward of sub-modules
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/527
      
      Differential Revision: D14218372
      
      Pulled By: myleott
      
      fbshipit-source-id: 2e3fe24aa39476bcc5c9af68ef9a40192db34a3b
  16. 06 Feb, 2019 1 commit
  17. 25 Jan, 2019 1 commit
  18. 17 Jan, 2019 1 commit
    • Fix initial learning rate (#453) · 2210fa71
      Myle Ott authored
      Summary:
      There was a very subtle bug here 😢 When we recently removed this line (7633129b), it meant that the learning rate scheduler didn't get initialized until after the first update. Unfortunately pytorch optimizers store the learning rate in their internal state, so some learning rate schedulers use their `__init__` method to reset the learning rate to some sane initial value. This is especially problematic for LR schedulers that include a warmup, where the Optimizer is likely to contain the peak learning rate at initialization, and it's only in the LR scheduler's `__init__` that the (much smaller) warmup value is set.
      
      For example, the inverse_sqrt scheduler resets the learning rate upon initialization:
      https://github.com/pytorch/fairseq/blob/7853818c2e33a63ec17a31bcfe20e4fc75d94130/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py#L48-L50
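      A minimal sketch of why the scheduler's `__init__` matters, loosely modeled on the inverse_sqrt schedule linked above; the class and argument names are illustrative, not fairseq's actual implementation:

      ```python
      class WarmupInverseSqrtSchedule:
          """Illustrative warmup + inverse-sqrt LR schedule. The peak LR typically
          sits in the optimizer at construction time; the much smaller warmup LR
          is only installed in __init__, so constructing the scheduler after the
          first update means step 1 runs at the peak LR."""

          def __init__(self, optimizer, warmup_init_lr, peak_lr, warmup_updates):
              self.optimizer = optimizer
              self.warmup_init_lr = warmup_init_lr
              self.warmup_updates = warmup_updates
              # Linear warmup slope, then inverse-sqrt decay from the peak LR.
              self.lr_step = (peak_lr - warmup_init_lr) / warmup_updates
              self.decay_factor = peak_lr * warmup_updates ** 0.5
              # Crucial line: overwrite the optimizer's LR with the warmup value
              # at construction time (this is what got skipped by the bug).
              self.set_lr(warmup_init_lr)

          def set_lr(self, lr):
              self.lr = lr
              for group in self.optimizer.param_groups:
                  group['lr'] = lr

          def step_update(self, num_updates):
              if num_updates < self.warmup_updates:
                  self.set_lr(self.warmup_init_lr + num_updates * self.lr_step)
              else:
                  self.set_lr(self.decay_factor * num_updates ** -0.5)
              return self.lr
      ```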
      
      **Impact:** For the last ~1.5 weeks, the first training update would use the optimizer...
  19. 09 Jan, 2019 1 commit
  20. 05 Jan, 2019 1 commit
  21. 28 Dec, 2018 1 commit
  22. 24 Dec, 2018 1 commit
    • Improve memory efficiency of FP16 optimization (#404) · 03a57dec
      Myle Ott authored
      Summary:
      Previously when training with --fp16, we stored a copy of the model parameters in FP32 for optimization, which consumed a lot of memory. An alternative is to just do the conversions to FP32 on the fly, which allows the caching allocator to reuse/save some memory.
      
      This reduces peak memory usage by ~20% with a negligible reduction in training speed (~2% slower) when training a big transformer on 8 GPUs on wmt en-de with --update-freq=16.
      
      This does not affect convergence, i.e., models will train exactly as they did before.
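      A minimal sketch of the on-the-fly conversion idea, using a plain SGD update and hypothetical arguments purely for illustration; fairseq's actual FP16 optimizer wraps its regular optimizers and also handles loss-scale management:

      ```python
      def memory_efficient_fp16_sgd_step(fp16_params, lr, loss_scale):
          """Upcast each FP16 param/grad to FP32 only for the duration of its
          update, instead of keeping a persistent FP32 master copy; the caching
          allocator can then reuse the temporary buffers."""
          for p in fp16_params:
              if p.grad is None:
                  continue
              p32 = p.data.float()                        # temporary FP32 copy
              g32 = p.grad.data.float().div_(loss_scale)  # undo loss scaling
              p32.add_(g32, alpha=-lr)                    # FP32 update (plain SGD)
              p.data.copy_(p32)                           # write back in FP16
      ```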
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/404
      
      Differential Revision: D13394376
      
      Pulled By: myleott
      
      fbshipit-source-id: 2b9f808548df4782110513c9cfc9f7c6159bcbbf
  23. 07 Dec, 2018 1 commit
    • Take a dummy train step under OOM to keep multiprocessing in sync · 6c006a34
      Halil Akin authored
      Summary: This is not a guaranteed solution (processes may still get out of sync if an OOM happens after an all_gather/all_reduce has already run), but it should still make multiprocessing training more robust in practice, since we usually seem to OOM early enough.
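      A minimal sketch of the dummy-step idea, assuming a simple training loop; the batch format, criterion signature, and helper names are assumptions rather than fairseq's trainer code:

      ```python
      import torch


      def train_step(model, criterion, optimizer, batch, dummy_batch):
          """On CUDA OOM, fall back to a zero-scaled dummy batch so this worker
          still reaches the same gradient all-reduce as its peers."""
          optimizer.zero_grad()
          try:
              loss = criterion(model(batch['input']), batch['target'])
          except RuntimeError as e:
              if 'out of memory' not in str(e):
                  raise
              print('| WARNING: ran out of memory, running a dummy batch instead')
              torch.cuda.empty_cache()
              # Multiply by 0 so parameters are unchanged, but the backward pass
              # (and DDP's all-reduce) still runs on every worker.
              loss = criterion(model(dummy_batch['input']),
                               dummy_batch['target']) * 0.0
          loss.backward()
          optimizer.step()
          return loss.item()
      ```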
      
      Reviewed By: myleott
      
      Differential Revision: D13086018
      
      fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee
  24. 19 Nov, 2018 1 commit
    • Protect against failures in case of OOMs · a442244d
      Halil Akin authored
      Summary: Fixing some distributed failures that happen when OOMs are observed.
      
      Reviewed By: myleott
      
      Differential Revision: D13121054
      
      fbshipit-source-id: f71a0a695332acbaa1797e89887b8b7c7ddaa727
  25. 17 Nov, 2018 1 commit
  26. 07 Nov, 2018 1 commit
  27. 01 Nov, 2018 1 commit
  28. 22 Oct, 2018 1 commit
    • Fix another distributed syncing issue · 23e9dc2e
      Halil Akin authored
      Summary:
      This is another failure caused by distributed GPUs getting out of sync.
      We run save_and_eval (which contains the inter-GPU communication calls)
      based on the number of updates, but the number of updates means weight
      updates. Whenever there is an issue in training and the weights can't be
      updated, the nodes go out of sync and start failing. So we should check
      the number of iterations instead.
      
      I am, again, making a small change to save the day, but we should
      decouple/refactor the save_and_eval logic from the training loop to have
      fewer headaches in the future. I plan to work on that later, but this
      should solve some of the issues for now.
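      A minimal sketch of the gating change described above; the helper and variable names are illustrative, not the actual diff:

      ```python
      def should_save_and_eval(num_iterations, save_interval):
          """Gate the collective save/eval call on the iteration count, which
          advances identically on every worker, rather than on the update count,
          which stalls on workers that skip a weight update."""
          return num_iterations > 0 and num_iterations % save_interval == 0


      # In the training loop (sketch):
      # for num_iterations, batch in enumerate(epoch_itr, start=1):
      #     trainer.train_step(batch)  # the update count may not advance on failure
      #     if should_save_and_eval(num_iterations, save_interval):
      #         save_and_eval(trainer)  # all workers reach this point together
      ```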
      
      Reviewed By: jhcross
      
      Differential Revision: D10478427
      
      fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c
  29. 21 Oct, 2018 1 commit
  30. 30 Sep, 2018 1 commit
  31. 25 Sep, 2018 4 commits
  32. 03 Sep, 2018 4 commits
  33. 01 Aug, 2018 1 commit