1. 12 Mar, 2021 1 commit
  2. 11 Mar, 2021 1 commit
  3. 09 Mar, 2021 1 commit
  4. 05 Mar, 2021 1 commit
  5. 04 Mar, 2021 1 commit
    • [test] AdaScale & SDP/FSDP (#468) · efed9cee
      Min Xu authored
      - cover them in terms of code path only
      - numerically, AdaScale is different on SDP/FSDP than on DDP, mainly
        due to each worker's partial view of the gradients
      - this doesn't mean it is definitely not useful, but it has yet to
        be validated
      - not going to spend too much time on this until we have a real use case
  6. 23 Feb, 2021 1 commit
    • Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e
      Myle Ott authored
      Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper; a usage sketch follows the notes below.
      
      Compared to PyTorch DDP:
      * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
      * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
      * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
          * all-gather parameters at start of forward pass and start of backward pass
          * reduce-scatter grads at end of backward pass
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
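      A minimal usage sketch of the drop-in swap, assuming the fairscale API introduced here (requires an initialized `torch.distributed` process group):

      ```python
      import torch
      from fairscale.nn import FullyShardedDataParallel as FSDP

      # Assumes torch.distributed.init_process_group(...) was already called.
      model = torch.nn.Linear(1024, 1024).cuda()

      # reshard_after_forward=True trades ~50% extra communication (ZeRO-3
      # style) for lower peak memory; False matches DDP's communication cost
      # (ZeRO-2 style).
      model = FSDP(model, reshard_after_forward=True)

      optim = torch.optim.SGD(model.parameters(), lr=0.1)
      loss = model(torch.randn(8, 1024).cuda()).sum()
      loss.backward()  # grads are reduce-scattered at the end of backward
      optim.step()     # each rank updates only its own parameter shard
      ```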
  7. 22 Feb, 2021 1 commit
  8. 19 Feb, 2021 1 commit
  9. 14 Feb, 2021 1 commit
  10. 12 Feb, 2021 1 commit
  11. 05 Feb, 2021 1 commit
  12. 03 Feb, 2021 2 commits
    • [chore] disheartening switch off of an OSS cpu test (#356) · 011c0c41
      Benjamin Lefaudeux authored
      * precise skip: only skip if the agent has only a CPU
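      A hedged sketch of the "precise skip" pattern (the actual test and marker in #356 may differ; the test name is hypothetical):

      ```python
      import pytest
      import torch

      # Skip only when the CI agent is CPU-only, instead of disabling the
      # test outright.
      @pytest.mark.skipif(not torch.cuda.is_available(), reason="agent has only CPU")
      def test_oss_needs_gpu():
          assert torch.cuda.device_count() > 0
      ```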
    • [feat] Add AdaScaleWrapper (#347) · a2408eb8
      Min Xu authored
      * [feat] Add AdaScaleWrapper
      
      - This enables a different API for wrapping an optimizer with AdaScale.
      - This also enables AdaScale to be wrapped by OSS.
      - However, OSS wrapping AdaScale results in different optimization
        behavior, whose effects will need to be studied in future research.
      
      testing: add unit tests.
      
      * addressed comment: typo
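      A hedged sketch of the alternate wrapping API; the exact `AdaScaleWrapper` signature and import path are assumed here (roughly `AdaScaleWrapper(params, optim_cls=..., **optim_kwargs)`):

      ```python
      import torch
      from fairscale.optim import AdaScaleWrapper  # import path assumed

      model = torch.nn.Linear(10, 10)

      # Instead of AdaScale(SGD(model.parameters(), lr=0.1)), pass the params
      # plus an optimizer class; this inverted shape is what lets OSS wrap
      # AdaScale.
      optim = AdaScaleWrapper(model.parameters(), optim_cls=torch.optim.SGD, lr=0.1)
      ```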
  13. 02 Feb, 2021 1 commit
  14. 29 Jan, 2021 1 commit
    • [test]: test with py39 + torch 1.8 nightly (#339) · e348806b
      Min Xu authored
      * [test]: test with py39 + torch 1.8 nightly
      
      * version fix
      
      * more fix
      
      * fix version function for nightly version
      
      * fix torch_pg build
      
      * invalidate cache
      
      * separate benchmark requirements
      
      * comment
      
      * fixed mypy
      
      * fixed a test
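      A hedged illustration of the nightly-version issue behind "fix version function for nightly version": nightly builds report strings such as `1.8.0.dev20210128` or `1.8.0a0+ab12cd3`, which break a naive parse of `torch.__version__` (the helper name is illustrative):

      ```python
      import re

      import torch

      def torch_version() -> tuple:
          # Keep only the leading numeric release, e.g. "1.8.0" out of
          # "1.8.0.dev20210128+cu110", then compare as a tuple of ints.
          match = re.match(r"(\d+)\.(\d+)\.(\d+)", torch.__version__)
          assert match is not None, torch.__version__
          return tuple(int(x) for x in match.groups())

      assert torch_version() >= (1, 5, 0)
      ```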
  15. 28 Jan, 2021 1 commit
    • [test]: test adascale with oss (#328) · fa11d338
      Min Xu authored
      * [test]: test adascale with oss
      
      * minor fix
      
      * add a small comment
      
      * refactor: moved find_tensor_by_shape
      
      * refactor: move test golden data into its own module
      
      * refactor: simplified the train function
      
      * refactor: added comments as suggested
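      A hedged sketch of the combination under test (the test's exact setup may differ; requires an initialized process group):

      ```python
      import torch
      from fairscale.optim import AdaScale
      from fairscale.optim.oss import OSS

      model = torch.nn.Linear(10, 10)

      # OSS shards optimizer state across ranks; AdaScale wraps it like any
      # other optimizer to adapt the LR to the effective large batch.
      oss = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)
      optim = AdaScale(oss)
      ```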
  16. 27 Jan, 2021 1 commit
  17. 20 Jan, 2021 1 commit
  18. 11 Jan, 2021 1 commit
  19. 08 Jan, 2021 3 commits
  20. 05 Jan, 2021 1 commit
    • [fix] Flaky tests (#283) · 79365ee6
      Benjamin Lefaudeux authored
      * adding the pytest timeout plugin to properly root out hanging tests
      * removing redundant code; slightly more reasonable timeout; works on a single CUDA device
      * finding the root bug for some of the CPU hangs: rpc init
      * propagating all the rpc init test changes to the pipe and model parallel tests
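      A short sketch of the pytest-timeout plugin mentioned above (the limit value is illustrative): a hanging test is killed instead of stalling CI, and the same limit can be set globally with `pytest --timeout=30`.

      ```python
      import time

      import pytest

      @pytest.mark.timeout(30)  # fail the test if it runs longer than 30s
      def test_does_not_hang():
          time.sleep(0.1)
      ```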
  21. 04 Jan, 2021 1 commit
    • [feat] sync adascale from internal repo, support add_param_group (#266) · 3932a1f6
      Min Xu authored
      * [feat] sync adascale from internal repo
      
      - tbd
      
      testing: tbd
      
      * Update argument documentation of __init__
      
      * update documentation around set_num_gradients_to_accumulate
      
      * added checking code for proper API calling places
      
      * rename internal APIs to make them internal
      
      * updated changelog
      
      * added support for add_param_group and its unit test
      
      * added unit test for set_num_gradients_to_accumulate
      
      * added debias_ewma unit test
      
      * fixed test_set_num_gradients_to_accumulate (need zero_grad() call)
      
      * added missing zero_grad() to test_lr_scheduler
      
      * fixed test_add_param_group with respect to optim.zero_grad()
      
      * added test_gradient_value
      
      * added test_scale_not_equal_default for scale != world_size * grad_accum
      
      * added test_unhook()
      
      * removed print statements
      
      * fixed a typo
      
      * addressed Ben's comment
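      A hedged sketch of the `add_param_group` support added here; AdaScale follows the plain torch optimizer API:

      ```python
      import torch
      from fairscale.optim import AdaScale

      layer1 = torch.nn.Linear(10, 10)
      layer2 = torch.nn.Linear(10, 10)

      optim = AdaScale(torch.optim.SGD(layer1.parameters(), lr=0.1))
      # Groups can be added after construction, e.g. for staged unfreezing.
      optim.add_param_group({"params": list(layer2.parameters()), "lr": 0.01})
      ```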
  22. 29 Dec, 2020 1 commit
  23. 22 Dec, 2020 1 commit
    • [OSS] Balance the trainable params only (#262) · c386e937
      Benjamin Lefaudeux authored
      * fix, one liner
      
      * adjust so that frozen trunks still get spread, even if this should have little consequence
      
      * removing dead code, hopeful unit test fix
      
      * now with some linting..
      
      * adding a proper unit test case
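      A hedged sketch of the idea (not OSS's actual code): frozen parameters still get spread across ranks, but only trainable ones drive the balancing.

      ```python
      import torch

      def partition_params(params, world_size: int):
          sizes = [0] * world_size
          shards = [[] for _ in range(world_size)]
          for p in sorted(params, key=lambda t: t.numel(), reverse=True):
              rank = sizes.index(min(sizes))  # emptiest shard takes the param
              shards[rank].append(p)          # frozen params are still spread...
              if p.requires_grad:
                  sizes[rank] += p.numel()    # ...but only trainable ones count
          return shards
      ```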
  24. 16 Dec, 2020 1 commit
    • [feat]: AdaScale work with lr_scheduler and tests, examples (#229) · d65cd838
      Min Xu authored
      * [doc]: AdaScale example and notes
      
      * formatted notes correctly as suggested by Benjamin
      
      * added feature and unit test to make sure lr_scheduler works
      
      * update the example with lr_scheduler
      
      * fixed doc with "make html"
      
      * addressed Mike's suggestions
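      A hedged, condensed sketch of the pattern this enables (loop details assumed): advance the scheduler by AdaScale's gain so the LR schedule tracks effective progress rather than the raw step count.

      ```python
      import torch
      from fairscale.optim import AdaScale
      from torch.optim.lr_scheduler import LambdaLR

      model = torch.nn.Linear(10, 10)
      optim = AdaScale(torch.optim.SGD(model.parameters(), lr=0.1))
      scheduler = LambdaLR(optim, lr_lambda=lambda step: 1.0 / (step + 1))

      effective_steps = 0.0
      for _ in range(20):
          optim.zero_grad()
          model(torch.randn(4, 10)).sum().backward()
          effective_steps += optim.gain()  # > 1 means faster effective progress
          optim.step()
          while scheduler.last_epoch < int(effective_steps):
              scheduler.step()
      ```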
  25. 14 Dec, 2020 1 commit
  26. 06 Dec, 2020 1 commit
  27. 03 Dec, 2020 1 commit
    • [feat] AdaScale: Gradient Accumulation and Add PyTest unit tests (#202) · ce5860ea
      Min Xu authored
      * added AdaScale to README
      
      * [adascale] added gradient accumulation
      
      - added gradient accumulation
      - tested with full CIFAR trainings with different values of accumulation
        and verified that full accuracy is obtained
      - also removed the patch optimize flag until we need it
      
      * [adascale] adding pytest
      
      - added basic and ddp tests and grad_accum
      - closes #195
      
      * added changelog
      
      * added ddp grad_accum test
      
      * moved ddp and non-ddp tests into separate files
      
      * added checkpoint test
      
      * more doc
      
      * addressed Mike's comments
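      A hedged sketch of the accumulation support, assuming the `num_gradients_to_accumulate` argument described in the AdaScale docs:

      ```python
      import torch
      from fairscale.optim import AdaScale

      accumulate = 4
      model = torch.nn.Linear(10, 10)
      optim = AdaScale(torch.optim.SGD(model.parameters(), lr=0.1),
                       num_gradients_to_accumulate=accumulate)

      for i in range(8 * accumulate):
          model(torch.randn(4, 10)).sum().backward()  # grads pile up in .grad
          if (i + 1) % accumulate == 0:
              optim.step()       # AdaScale folds accumulation into its gain
              optim.zero_grad()
      ```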
  28. 16 Nov, 2020 1 commit
  29. 06 Nov, 2020 1 commit
  30. 28 Oct, 2020 1 commit
  31. 14 Oct, 2020 2 commits
  32. 08 Oct, 2020 1 commit
  33. 15 Sep, 2020 2 commits
  34. 09 Sep, 2020 1 commit
    • [feat] OSS flatten state dict (#65) · 4f597233
      Benjamin Lefaudeux authored
      Changes the structure of the returned state dict with respect to the param_groups, making it closer to what a vanilla optimizer would return (un-shards them); shards again when loading.
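      A hedged sketch of the resulting flow (method names are from the fairscale OSS API; the exact flow at the time of #65 may differ):

      ```python
      import torch
      from fairscale.optim.oss import OSS

      # Run on every rank; assumes an initialized process group.
      model = torch.nn.Linear(10, 10)
      optim = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)

      optim.consolidate_state_dict(recipient_rank=0)  # gather shards on rank 0
      state = optim.state_dict()  # un-sharded, like a vanilla optimizer's
      torch.save(state, "oss_optim.pt")

      optim.load_state_dict(torch.load("oss_optim.pt"))  # shards again on load
      ```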
  35. 08 Sep, 2020 1 commit
    • [feat] OSS: Sync all attributes (#67) · 5a268b25
      Benjamin Lefaudeux authored
      Make sure that all attributes (not just the LR) stay in sync between OSS.param_groups and the actual wrapped optimizer. Some frameworks make it possible to alter any attribute on a scheduled basis, which proves useful depending on the optimizer, so the keys need to be supported generically (not just "lr"). Silently failing to propagate these adjustments is a worst-case scenario; this fixes that.
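      A hedged sketch of the guarantee (values illustrative): any key set on OSS's param_groups, not just "lr", reaches the wrapped optimizer.

      ```python
      import torch
      from fairscale.optim.oss import OSS

      model = torch.nn.Linear(10, 10)
      optim = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1, momentum=0.5)

      # A framework or scheduler may alter any attribute on a schedule:
      optim.param_groups[0]["momentum"] = 0.9

      # With this fix the change propagates to the wrapped optimizer instead
      # of being silently dropped (previously only "lr" was kept in sync).
      optim.step()
      ```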