1. 12 Mar, 2021 1 commit
  2. 11 Mar, 2021 1 commit
  3. 09 Mar, 2021 1 commit
  4. 05 Mar, 2021 1 commit
  5. 23 Feb, 2021 1 commit
      Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e
      Myle Ott authored
      • Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper; a usage sketch follows the comparison list below.
      
      Compared to PyTorch DDP:
      * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
      * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
      * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
          * all-gather parameters at start of forward pass and start of backward pass
          * reduce-scatter grads at end of backward pass
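      To make the drop-in claim concrete, here is a minimal sketch of an FSDP training step, assuming a `torchrun`-style launcher with one process per GPU; the model, sizes, and hyperparameters are placeholders, not part of the commit:
      ```python
      import torch
      import torch.distributed as dist
      from fairscale.nn import FullyShardedDataParallel as FSDP

      # Placeholder setup: one process per GPU, e.g. launched with torchrun.
      dist.init_process_group(backend="nccl")
      torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

      # Wrap the model exactly where DDP would be used.
      # reshard_after_forward=True is the ZeRO-3-like mode described above.
      model = FSDP(torch.nn.Linear(1024, 1024).cuda(), reshard_after_forward=True)
      optim = torch.optim.Adam(model.parameters(), lr=1e-4)

      for _ in range(10):
          x = torch.randn(32, 1024, device="cuda")
          loss = model(x).sum()
          loss.backward()    # grads are reduce-scattered at the end of backward
          optim.step()       # each rank only updates its own parameter shard
          optim.zero_grad()
      ```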
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
  6. 22 Feb, 2021 1 commit
  7. 14 Feb, 2021 1 commit
  8. 12 Feb, 2021 1 commit
  9. 05 Feb, 2021 1 commit
  10. 03 Feb, 2021 1 commit
  11. 02 Feb, 2021 1 commit
  12. 27 Jan, 2021 1 commit
  13. 20 Jan, 2021 1 commit
  14. 11 Jan, 2021 1 commit
  15. 08 Jan, 2021 3 commits
  16. 05 Jan, 2021 1 commit
      [fix] Flaky tests (#283) · 79365ee6
      Benjamin Lefaudeux authored
      * adding the pytest-timeout plugin to properly root out hanging tests (see the sketch below)
      * removing redundant code and using a slightly more reasonable timeout; works on a single CUDA device
      * finding the root cause of some of the CPU hangs: RPC init
      * propagating all the RPC init test changes to the pipe and model parallel tests
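      For reference, a hedged sketch of how the pytest-timeout plugin is typically wired in; the test name and limit here are illustrative, not the repo's actual settings:
      ```python
      import pytest

      # With the pytest-timeout plugin installed, a hanging collective now
      # fails the test after the limit instead of stalling the whole CI job.
      @pytest.mark.timeout(60)  # illustrative limit
      def test_rpc_init_does_not_hang():
          ...  # the rpc / process-group setup under test would go here
      ```
      The same limit can also be applied globally with `pytest --timeout=60`.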
  17. 29 Dec, 2020 1 commit
  18. 22 Dec, 2020 1 commit
      [OSS] Balance the trainable params only (#262) · c386e937
      Benjamin Lefaudeux authored
      * fix, one-liner
      * adjust so that frozen trunks still get spread across ranks, even if this should have little consequence (the balancing idea is sketched below)
      * removing dead code, hopeful unit test fix
      * now with some linting
      * adding a proper unit test case
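      The balancing idea, sketched outside the actual OSS code: only parameters with `requires_grad=True` are greedily assigned to the least-loaded rank, so a frozen trunk no longer skews the shards. Function and variable names here are hypothetical:
      ```python
      import torch

      def partition_trainable(params, world_size):
          """Greedy bin-packing of trainable params by element count (sketch)."""
          buckets = [[] for _ in range(world_size)]
          sizes = [0] * world_size
          trainable = [p for p in params if p.requires_grad]
          for p in sorted(trainable, key=lambda q: q.numel(), reverse=True):
              rank = sizes.index(min(sizes))  # least-loaded rank so far
              buckets[rank].append(p)
              sizes[rank] += p.numel()
          return buckets

      model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
      for p in model[0].parameters():
          p.requires_grad = False  # the frozen trunk is ignored by the balancing
      shards = partition_trainable(model.parameters(), world_size=2)
      ```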
  19. 06 Dec, 2020 1 commit
  20. 16 Nov, 2020 1 commit
  21. 06 Nov, 2020 1 commit
  22. 14 Oct, 2020 2 commits
  23. 08 Oct, 2020 1 commit
  24. 15 Sep, 2020 2 commits
  25. 09 Sep, 2020 1 commit
      [feat] OSS flatten state dict (#65) · 4f597233
      Benjamin Lefaudeux authored
      Changes the structure of the returned state dict with respect to param_groups so that it is closer to what a vanilla optimizer would return (the groups are un-sharded). The state is sharded again when loading.
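      A sketch of the save/load round trip this enables; the `consolidate_state_dict` call follows the fairscale API as it later stabilized, so treat the exact names as assumptions rather than this revision's code:
      ```python
      import torch
      import torch.distributed as dist
      from fairscale.optim import OSS

      dist.init_process_group(backend="gloo")  # placeholder single-node setup
      model = torch.nn.Linear(16, 16)
      optim = OSS(model.parameters(), optim=torch.optim.Adam, lr=1e-4)

      model(torch.randn(4, 16)).sum().backward()
      optim.step()

      # Gather the shards so state_dict() matches a vanilla optimizer's layout...
      optim.consolidate_state_dict(recipient_rank=0)
      if dist.get_rank() == 0:
          torch.save(optim.state_dict(), "optim.pt")
      dist.barrier()  # make sure the file exists before every rank loads it

      # ...and the un-sharded dict is re-partitioned across ranks on load.
      optim.load_state_dict(torch.load("optim.pt"))
      ```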
  26. 08 Sep, 2020 1 commit
      [feat] OSS: Sync all attributes (#67) · 5a268b25
      Benjamin Lefaudeux authored
      Make sure that all attributes (not just the LR) stay in sync between OSS.param_groups and the actual wrapped optimizer. Some frameworks make it possible to alter any attribute on a schedule, which can be useful depending on the optimizer, so the keys need to be supported generically (not just "lr"). Not syncing these attributes was a worst-case scenario, since the adjustments were silently dropped; this fixes that.
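      What the sync guarantees, in sketch form (the setup is as hypothetical as in the previous example):
      ```python
      import torch
      import torch.distributed as dist
      from fairscale.optim import OSS

      dist.init_process_group(backend="gloo")  # placeholder setup
      model = torch.nn.Linear(8, 8)
      optim = OSS(model.parameters(), optim=torch.optim.Adam, lr=1e-3)

      # Any attribute mutated on OSS.param_groups, not just "lr", must reach
      # the wrapped optimizer; LR schedulers rely on exactly this pattern.
      for group in optim.param_groups:
          group["lr"] = 1e-4
          group["weight_decay"] = 0.01

      model(torch.randn(2, 8)).sum().backward()
      optim.step()  # the sharded Adam instance sees the updated attributes
      ```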
  27. 03 Sep, 2020 1 commit
      [fix] OSS pytorch-compliant state dict (#61) · 1d1d15ea
      Benjamin Lefaudeux authored
      * Aligning the optimizer state dict with what PyTorch expects
      * Adding a check on the dict keys, ensuring that `state` and `param_groups` are there (the invariant is restated below)
      * after installing the specific isort and black versions, a one-liner to please the linter
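      The invariant being checked, stated against vanilla PyTorch:
      ```python
      import torch

      # A PyTorch optimizer state dict has exactly these two top-level keys,
      # which is the layout the OSS state dict now matches.
      sd = torch.optim.SGD(torch.nn.Linear(2, 2).parameters(), lr=0.1).state_dict()
      assert set(sd.keys()) == {"state", "param_groups"}
      ```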
  28. 28 Aug, 2020 1 commit
  29. 27 Aug, 2020 3 commits
  30. 20 Aug, 2020 1 commit
  31. 14 Aug, 2020 1 commit
  32. 13 Aug, 2020 1 commit
  33. 08 Aug, 2020 1 commit
  34. 31 Jul, 2020 1 commit