1. 07 May, 2021 1 commit
  2. 05 May, 2021 2 commits
    • [fix] better assert and better test for frozen weights (#657) · b54eed1b
      Min Xu authored
      
      
      * [fix] better assert and better test for frozen weights
      
      - the precise condition should have been to check m.parameters(), not
        m.params.
      - fixes #643
      
      * add changelog
      
      * using an enum is much better
      Co-authored-by: Min Xu <min.xu@acm.org>
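      The fix hinges on iterating over m.parameters(), which recurses into
      submodules, rather than a flat m.params attribute. A minimal sketch of
      that style of check (hypothetical helper, not FSDP's actual assert):

      ```python
      import torch.nn as nn

      def assert_uniform_requires_grad(m: nn.Module) -> None:
          # m.parameters() walks every submodule, so a partially frozen model
          # is detected; a local m.params list would miss nested parameters.
          flags = {p.requires_grad for p in m.parameters()}
          assert len(flags) <= 1, "parameters must be uniformly frozen or unfrozen"

      model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
      for p in model[0].parameters():
          p.requires_grad = False             # freeze only the first layer
      # assert_uniform_requires_grad(model)   # would fail: mixed frozen/unfrozen weights
      ```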
    • [fix] add clear_autocast_cache flag (#650) · 861b5ce2
      Min Xu authored
      
      
      * [fix] add clear_autocast_cache flag
      
      - when training an AMP model with fp32 weights, FSDP may need to
        optionally clear the autocast cache to avoid GPU OOM
      - this flag defaults to False; doing it automatically is a future TODO
      - also added a verbose flag to make the print(fsdp_model) output a bit shorter
      - updated the memory test to cover the new code
      - added a couple of useful functions in parallel.py and testing.py
      
      * minor
      
      * address comments
      
      * format
      
      * improve the test
      Co-authored-by: Min Xu <min.xu@acm.org>
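      For context, autocast caches lower-precision copies of fp32 weights, and
      clearing that cache between iterations is what the new clear_autocast_cache
      flag automates inside FSDP. A rough stand-alone sketch of the manual
      version (plain PyTorch, no FSDP, needs a CUDA device):

      ```python
      import torch
      import torch.nn as nn

      model = nn.Linear(1024, 1024).cuda()    # weights stay in fp32
      opt = torch.optim.SGD(model.parameters(), lr=0.1)

      for _ in range(10):
          x = torch.randn(32, 1024, device="cuda")
          with torch.cuda.amp.autocast():
              loss = model(x).sum()           # autocast caches fp16 copies of the weights
          loss.backward()
          opt.step()
          opt.zero_grad()
          torch.clear_autocast_cache()        # drop the cached copies to save GPU memory
      ```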
  3. 03 May, 2021 1 commit
  4. 28 Apr, 2021 2 commits
    • msbaines · 2bb2a134
    • [feat] save memory by using bucket buffer only in backward (#633) · a5594032
      Min Xu authored
      
      
      * [feat] save memory by using bucket buffer only in backward
      
      - this fixes bug #627
      - added documentation to clarify the buffer's cost and speed/memory
        tradeoff
      - added setup/teardown calls so that the buffer is only allocated
        during the backward pass, freeing memory during forward and stepping
        for things like activations
      - added a unit test that asserts the memory usage is in range
      
      Compared with DDP:

        1. the buffer size scales with the number of FSDP instances, not with model size
        2. the buffer is only allocated during backward
        3. the buffer is used for small tensors only, to reduce overhead
        4. the overlap of compute and reduction is very different
      
      * add PR number to changelog
      
      * filled in with memory number on 1.9
      
      * addressed comments
      
      * update comments
      
      * fix for 1.6
      
      * add a todo
      Co-authored-by: Min Xu <min.xu@acm.org>
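      A conceptual sketch of the setup/teardown pattern described in this
      commit (not fairscale internals; the class name is made up): the flat
      bucket buffer only lives for the duration of the backward pass, so the
      same memory is free for activations during forward and for the optimizer
      during stepping.

      ```python
      import torch

      class BackwardOnlyBucket:
          """Illustrative only: a small-tensor reduction buffer allocated just for backward."""

          def __init__(self, numel: int, dtype=torch.float32, device="cpu"):
              self.numel, self.dtype, self.device = numel, dtype, device
              self.buffer = None

          def setup(self) -> None:
              # Called when the backward pass starts.
              self.buffer = torch.zeros(self.numel, dtype=self.dtype, device=self.device)

          def teardown(self) -> None:
              # Called when the backward pass ends; the allocator can reuse the memory.
              self.buffer = None

      bucket = BackwardOnlyBucket(1 << 20)
      bucket.setup()
      # ... small gradient tensors would be copied in and reduced in one call ...
      bucket.teardown()
      ```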
  5. 26 Apr, 2021 1 commit
  6. 19 Apr, 2021 1 commit
  7. 13 Apr, 2021 1 commit
  8. 02 Apr, 2021 1 commit
  9. 18 Mar, 2021 3 commits
  10. 12 Mar, 2021 1 commit
  11. 11 Mar, 2021 1 commit
  12. 09 Mar, 2021 1 commit
  13. 25 Feb, 2021 1 commit
  14. 23 Feb, 2021 6 commits
  15. 22 Feb, 2021 1 commit
  16. 19 Feb, 2021 1 commit
  17. 18 Feb, 2021 1 commit
  18. 17 Feb, 2021 1 commit
  19. 12 Feb, 2021 1 commit
  20. 11 Feb, 2021 1 commit
  21. 03 Feb, 2021 1 commit
  22. 02 Feb, 2021 1 commit
  23. 29 Jan, 2021 1 commit
  24. 07 Jan, 2021 1 commit
  25. 05 Jan, 2021 1 commit
  26. 04 Jan, 2021 2 commits
    • [chore] 0.1.2 version bump (#285) · a21f50f9
      Benjamin Lefaudeux authored
    • [feat] sync adascale from internal repo, support add_param_group (#266) · 3932a1f6
      Min Xu authored
      * [feat] sync adascale from internal repo
      
      - tbd
      
      testing: tbd
      
      * Update the argument documentation of __init__
      
      * update documentation around set_num_gradients_to_accumulate
      
      * added checking code for proper API calling places
      
      * rename internal APIs to mark them as internal
      
      * updated changelog
      
      * added support for add_param_group and its unit test
      
      * added unit test for set_num_gradients_to_accumulate
      
      * added debias_ewma unit test
      
      * fixed test_set_num_gradients_to_accumulate (need zero_grad() call)
      
      * added missing zero_grad() to test_lr_scheduler
      
      * fixed test_add_param_group with respect to optim.zero_grad()
      
      * added test_gradient_value
      
      * added test_scale_not_equal_default for scale != world_size * grad_accum
      
      * added test_unhook()
      
      * removed print statements
      
      * fixed a typo
      
      * addressed Ben's comment
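      A hedged usage sketch of the features this commit names (add_param_group
      support and gradient accumulation via set_num_gradients_to_accumulate);
      the exact AdaScale constructor arguments may differ between fairscale
      versions.

      ```python
      import torch
      import torch.nn as nn
      from fairscale.optim import AdaScale

      model = nn.Linear(10, 10)
      base = torch.optim.SGD(model.parameters(), lr=0.1)
      # num_gradients_to_accumulate is assumed from the commit's
      # set_num_gradients_to_accumulate API; check your fairscale version.
      optim = AdaScale(base, num_gradients_to_accumulate=2)

      # add_param_group support is what this PR adds.
      extra = nn.Linear(10, 10)
      optim.add_param_group({"params": list(extra.parameters()), "lr": 0.05})

      for step in range(4):
          loss = model(torch.randn(4, 10)).sum()
          loss.backward()
          if (step + 1) % 2 == 0:   # step only after two accumulated gradients
              optim.step()
              optim.zero_grad()     # the fixed tests needed this explicit zero_grad()
      ```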
  27. 30 Dec, 2020 1 commit
  28. 24 Dec, 2020 1 commit
    • [chore] Update changelog (#268) · 18455bf0
      Min Xu authored
      * Update changelog
      
      this item was missed in the previous AdaScale commit.
      
      * More change log
      
      * Addressed review comments
  29. 03 Dec, 2020 1 commit
    • [feat] AdaScale: Gradient Accumulation and Add PyTest unit tests (#202) · ce5860ea
      Min Xu authored
      * added AdaScale to README
      
      * [adascale] added gradient accumulation
      
      - added gradient accumulation
      - tested with full CIFAR trainings with different accumulation values and
        verified that full accuracy is obtained
      - also removed the patch optimize flag until we need it
      
      * [adascale] adding pytest
      
      - added basic, DDP, and grad_accum tests
      - closes #195
      
      * added changelog
      
      * added ddp grad_accum test
      
      * moved ddp and non-ddp tests into separate files
      
      * added checkpoint test
      
      * more doc
      
      * addressed Mike's comments
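      In the spirit of the unit tests this commit adds, a small pytest-style
      sketch (plain PyTorch, not the actual fairscale test) of the property
      gradient accumulation relies on: two accumulated backward passes over
      half batches give the same gradients as one backward pass over the full
      batch when the loss is a sum.

      ```python
      import torch
      import torch.nn as nn

      def test_grad_accum_matches_full_batch():
          torch.manual_seed(0)
          full = nn.Linear(8, 1)
          accum = nn.Linear(8, 1)
          accum.load_state_dict(full.state_dict())

          data = torch.randn(4, 8)
          full(data).sum().backward()        # one full batch
          accum(data[:2]).sum().backward()   # two accumulated half batches
          accum(data[2:]).sum().backward()

          for p_full, p_accum in zip(full.parameters(), accum.parameters()):
              assert torch.allclose(p_full.grad, p_accum.grad)
      ```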
  30. 02 Dec, 2020 1 commit