1. 27 Feb, 2021 1 commit
  2. 26 Feb, 2021 3 commits
  3. 25 Feb, 2021 2 commits
  4. 24 Feb, 2021 1 commit
  5. 23 Feb, 2021 4 commits
    • [test]: add peak mem in checkpoint test (#415) · 4b5b4d3d
      Min Xu authored
      * [test]: add peak mem in checkpoint test
      * more debugging
      * new test
      * more fixes
      * better collection of debug info in case of future failures
      * update the comment
      * fix typo
      * update comment
      * clarify
      * better wording
      4b5b4d3d
    • [perf][ShardedDDP] fp16 gradient reduce (#411) · d52d2186
      Benjamin Lefaudeux authored
      * POC, testing against the DDP comm hook when available
      * docs, adding a reference to DDP's compress hook
      * updating changelog, prep for v0.1.8 release
      d52d2186
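      The benefit of reducing gradients in fp16 rather than fp32 is easy to quantify: each element shrinks from 4 bytes to 2, halving the traffic of every gradient synchronization. A back-of-the-envelope sketch (illustrative arithmetic only, not fairscale code; the model size is an assumption):

      ```python
      # Bandwidth cost of one gradient reduction, fp32 vs fp16.
      def grad_reduce_bytes(num_params: int, bytes_per_elem: int) -> int:
          """Bytes exchanged per gradient reduction for a model of num_params parameters."""
          return num_params * bytes_per_elem

      num_params = 1_000_000_000  # hypothetical 1B-parameter model

      fp32_bytes = grad_reduce_bytes(num_params, 4)  # float32: 4 bytes/elem
      fp16_bytes = grad_reduce_bytes(num_params, 2)  # float16: 2 bytes/elem

      # fp16 reduction moves exactly half the data of fp32.
      assert fp16_bytes * 2 == fp32_bytes
      print(f"fp32: {fp32_bytes / 1e9:.1f} GB, fp16: {fp16_bytes / 1e9:.1f} GB")
      ```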
    • [bug]: not all CUDA memory is freed when model is deleted (#412) · e3035933
      Min Xu authored
      * [bug]: not all CUDA memory is freed when model is deleted
      
      * fixed memory leak
      
      - without this, peak memory will be high when more than one model
        is trained (i.e. the first model leaves stuff around, pushing up
        the peak memory when the second model runs)
      
      * addressed comments
      
      * fix
      
      * changelog
      e3035933
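      The failure mode described here is generic: memory is only reclaimed once the last reference to the model is gone, so any stray reference (a cache, a closure, a saved activation) pins the first model's storage while the second one trains. A plain-Python stand-in for that mechanism (illustrative only; not the actual CUDA fix from this commit):

      ```python
      # A lingering reference keeps a "deleted" model alive; dropping it frees the memory.
      import gc
      import weakref

      class Model:
          def __init__(self):
              self.weights = [0.0] * 100_000  # stand-in for parameter storage

      model = Model()
      cache = {"model": model}      # e.g. a framework-level cache holding the model
      ref = weakref.ref(model)      # lets us observe whether the object is alive

      del model
      gc.collect()
      assert ref() is not None      # still alive: the cache pins it, memory not freed

      del cache["model"]            # the "fix": drop the stray reference
      gc.collect()
      assert ref() is None          # now the storage can actually be reclaimed
      ```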
    • Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e
      Myle Ott authored
      Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.
      
      Compared to PyTorch DDP:
      * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
      * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
      * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
          * all-gather parameters at start of forward pass and start of backward pass
          * reduce-scatter grads at end of backward pass
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
      15512d9e
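      The memory and communication figures quoted in this commit message follow from simple counting: sharding divides per-GPU parameter state by the world size, an all-reduce moves roughly 2x the parameter volume, and the `reshard_after_forward=True` path replaces it with two all-gathers plus one reduce-scatter (3x, i.e. 50% more). A back-of-the-envelope sketch of that arithmetic (illustrative only, not fairscale internals; the helper names and sizes are assumptions):

      ```python
      # Per-GPU parameter memory and gradient-sync volume under DDP vs FSDP-style sharding.
      def per_gpu_param_memory(num_params: int, world_size: int, sharded: bool) -> int:
          """Bytes/GPU for fp16 params + fp32 master copy (optimizer state omitted)."""
          full = num_params * (2 + 4)  # 2 bytes fp16 + 4 bytes fp32 per parameter
          return full // world_size if sharded else full

      def comm_volume(num_params: int, mode: str) -> int:
          """Gradient-sync volume per step, counted in parameter-elements moved."""
          if mode == "ddp":            # all-reduce of grads ~ 2x params
              return 2 * num_params
          if mode == "fsdp_reshard":   # all-gather (fwd) + all-gather (bwd) + reduce-scatter
              return 3 * num_params    # 50% more than DDP, as noted above
          raise ValueError(mode)

      n, gpus = 1_000_000, 8
      # Sharding splits the parameter state evenly across the data parallel GPUs.
      assert per_gpu_param_memory(n, gpus, sharded=True) * gpus == \
             per_gpu_param_memory(n, gpus, sharded=False)
      # reshard_after_forward=True costs 1.5x the communication of DDP.
      assert comm_volume(n, "fsdp_reshard") == comm_volume(n, "ddp") * 3 // 2
      ```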
  6. 19 Feb, 2021 1 commit
  7. 18 Feb, 2021 2 commits
  8. 17 Feb, 2021 1 commit
  9. 12 Feb, 2021 1 commit
  10. 10 Feb, 2021 1 commit
  11. 09 Feb, 2021 1 commit
  12. 04 Feb, 2021 4 commits
  13. 03 Feb, 2021 2 commits
  14. 02 Feb, 2021 1 commit
  15. 30 Jan, 2021 1 commit
  16. 29 Jan, 2021 1 commit
  17. 27 Jan, 2021 1 commit
  18. 23 Jan, 2021 1 commit
  19. 21 Jan, 2021 3 commits
  20. 15 Jan, 2021 1 commit
  21. 11 Jan, 2021 1 commit
  22. 05 Jan, 2021 1 commit
    • [fix] Flaky tests (#283) · 79365ee6
      Benjamin Lefaudeux authored
      * adding the pytest timeout plugin to properly root out hanging tests
      * removing redundant code, slightly more reasonable timeout, works on single cuda
      * finding the root bug for some of the cpu hangs, rpc init
      * propagating all the rpc init test changes to the pipe and model parallel tests
      79365ee6
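      The timeout plugin mentioned in the first bullet can be enabled with a couple of lines of pytest configuration (a minimal sketch of `pytest-timeout` usage; the 60-second value is an assumption, not the setting chosen in this commit):

      ```ini
      [pytest]
      # fail any test that runs longer than 60 seconds instead of hanging CI
      timeout = 60
      ```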
  23. 02 Jan, 2021 1 commit
  24. 30 Dec, 2020 1 commit
  25. 29 Dec, 2020 1 commit
  26. 28 Dec, 2020 1 commit
  27. 19 Dec, 2020 1 commit