1. 12 May, 2021 1 commit
    • [chore] Rename and move checkpoint_activations from misc folder. (#654) · 72c6bab2
      anj-s authored
      * rename files
      
      * add newly renamed file
      
      * rename and move checkpoint activations related files
      
      * add test files to ci list
      
      * fix lint errors
      
      * modify docs
      
      * add changelog
      
      * retain old path for now
      
      * fix lint errors
      
      * add another import test case
      
      * fix merge conflict
      
      * add missing test file
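The "retain old path for now" step above can be pictured as a small compatibility shim. The sketch below is only an illustration under stated assumptions: the module names `demo_pkg_checkpoint` and `demo_pkg_misc` and the `checkpoint` function are hypothetical stand-ins, not the repository's actual layout.

```python
import sys
import types


def checkpoint(fn):
    """Hypothetical stand-in for the moved function; it returns fn unchanged."""
    return fn


# The new home of the function (hypothetical module name).
new_module = types.ModuleType("demo_pkg_checkpoint")
new_module.checkpoint = checkpoint
sys.modules[new_module.__name__] = new_module

# The old path is kept as a thin shim that re-exports the same object,
# so existing imports keep working while callers migrate.
old_module = types.ModuleType("demo_pkg_misc")
old_module.checkpoint = new_module.checkpoint
sys.modules[old_module.__name__] = old_module

# Both import paths resolve to the very same function object.
from demo_pkg_misc import checkpoint as via_old_path
from demo_pkg_checkpoint import checkpoint as via_new_path
```

An import test case like the one the commit mentions could then simply assert that both paths yield the same object.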
  2. 11 May, 2021 1 commit
    • [fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3
      Min Xu authored
      
      
      * [fix] FSDP forward pass overlap between compute and all-gather
      
      - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
      - a new unit test is added to check for this and ensure we detect
        issues with overlapping and CPU/GPU blocking wait calls
      
      * fix
      
      * fix
      
      * fix
      
      * better assertion outputs
      
      * fix format and tune all_gather mb for CI
      
      * more tuning with non_flatten
      
      * undo an accidental change
      
      * tuning all gather mb and del model
      
      * Update + fix overlapping test to use patched all_gather w/ delay (#672)
      
      * fixing get_cycles_per_ms
      
      * add get_smi_memory
      
      * update the docstring
      Co-authored-by: Min Xu <min.xu@acm.org>
      Co-authored-by: Myle Ott <myleott@fb.com>
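The "patched all_gather w/ delay" idea mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the repository's test: `comm`, `with_delay`, `forward`, and `timed_forward` are hypothetical names, and the `all_gather` here is a plain Python function rather than a distributed collective. It shows only the patching step, which makes the communication call artificially slow so that a test can tell whether it blocks the caller.

```python
import time
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a communication namespace; the real collective
# would gather tensors across ranks, this one just flattens nested lists.
comm = SimpleNamespace(all_gather=lambda chunks: [x for c in chunks for x in c])


def with_delay(fn, delay_s):
    """Wrap fn so that every call sleeps for delay_s seconds first."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)
        return fn(*args, **kwargs)
    return delayed


def forward(chunks):
    """Toy forward pass that depends on the gathered values."""
    return sum(comm.all_gather(chunks))


def timed_forward(chunks, delay_s):
    """Run forward() with all_gather patched to be slow; return (result, seconds)."""
    slow = with_delay(comm.all_gather, delay_s)  # captures the original function
    with mock.patch.object(comm, "all_gather", slow):
        start = time.perf_counter()
        result = forward(chunks)
        elapsed = time.perf_counter() - start
    return result, elapsed
```

In this toy version the call is synchronous, so the delay shows up in full in the elapsed time. A real overlap test would compare such timings against a run where communication and compute are expected to proceed concurrently.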
  3. 07 May, 2021 1 commit
  4. 05 May, 2021 2 commits
    • [fix] better assert and better test for frozen weights (#657) · b54eed1b
      Min Xu authored
      
      
      * [fix] better assert and better test for frozen weights
      
      - the precise condition should have been to check m.parameters(), not
        m.params
      - fixes #643
      
      * add changelog
      
      * using enum is so much better
      Co-authored-by: Min Xu <min.xu@acm.org>
    • [fix] add clear_autocast_cache flag (#650) · 861b5ce2
      Min Xu authored
      
      
      * [fix] add clear_autocast_cache flag
      
      - when training in AMP mode with float32 weights, FSDP may need to
        optionally clear the autocast cache to avoid GPU OOM
      - this flag defaults to false; clearing the cache automatically is a future TODO
      - also added a verbose flag to make print(fsdp_model) a bit shorter
      - updated the memory test to cover the new code
      - added a couple of useful functions in parallel.py and testing.py
      
      * minor
      
      * address comments
      
      * format
      
      * improve the test
      Co-authored-by: Min Xu <min.xu@acm.org>
  5. 03 May, 2021 1 commit
  6. 28 Apr, 2021 2 commits
    • msbaines's avatar
      2bb2a134
    • [feat] save memory by using bucket buffer only in backward (#633) · a5594032
      Min Xu authored
      
      
      * [feat] save memory by using bucket buffer only in backward
      
      - this fixes bug #627
      - added documentation to clarify the buffer's cost and the speed/memory
        tradeoff
      - added setup/teardown calls so that the buffer is only allocated
        during the backward pass, saving memory during forward and stepping
        so that it can be used for things like activations
      - added a unit test that asserts the memory is in range
      
      Comparing with DDP:
      
        1. buffer size scales with the number of FSDP instances, not model size
        2. buffer is only allocated during backward
        3. buffer is used for small tensors only to reduce overhead
        4. overlapping of compute-reduction is very different
      
      * add PR number to changelog
      
      * filled in with memory number on 1.9
      
      * addressed comments
      
      * update comments
      
      * fix for 1.6
      
      * add a todo
      Co-authored-by: Min Xu <min.xu@acm.org>
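The setup/teardown pattern described in this commit, where the bucket buffer exists only for the duration of the backward pass, can be sketched as follows. This is a simplified illustration under stated assumptions: `BucketBuffer`, `backward_pass`, and `train_step` are hypothetical names, and a `bytearray` stands in for the device buffer.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional


class BucketBuffer:
    """Hypothetical stand-in for a reduction bucket; not the library's class."""

    def __init__(self, size_bytes: int) -> None:
        self.size_bytes = size_bytes
        self.data: Optional[bytearray] = None  # nothing is allocated up front

    def setup(self) -> None:
        # Allocate lazily, right before the backward pass needs the buffer.
        if self.data is None:
            self.data = bytearray(self.size_bytes)

    def teardown(self) -> None:
        # Drop the reference so the memory can be reused elsewhere.
        self.data = None

    @property
    def allocated(self) -> bool:
        return self.data is not None


@contextmanager
def backward_pass(bucket: BucketBuffer) -> Iterator[BucketBuffer]:
    """Keep the bucket allocated only while the backward pass is running."""
    bucket.setup()
    try:
        yield bucket
    finally:
        bucket.teardown()


def train_step(bucket: BucketBuffer) -> List[bool]:
    """Record whether the bucket is allocated in each phase of one step."""
    trace = [bucket.allocated]          # forward: buffer not allocated
    with backward_pass(bucket):
        trace.append(bucket.allocated)  # backward: buffer allocated
    trace.append(bucket.allocated)      # optimizer step: buffer released
    return trace
```

Calling `train_step(BucketBuffer(1024))` records the allocation state across the three phases, which is the kind of property a memory-range unit test would check with real device memory numbers.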
  7. 26 Apr, 2021 1 commit
  8. 19 Apr, 2021 1 commit
  9. 13 Apr, 2021 1 commit
  10. 02 Apr, 2021 1 commit
  11. 18 Mar, 2021 3 commits
  12. 12 Mar, 2021 1 commit
  13. 11 Mar, 2021 1 commit
  14. 09 Mar, 2021 1 commit
  15. 25 Feb, 2021 1 commit
  16. 23 Feb, 2021 6 commits
  17. 22 Feb, 2021 1 commit
  18. 19 Feb, 2021 1 commit
  19. 18 Feb, 2021 1 commit
  20. 17 Feb, 2021 1 commit
  21. 12 Feb, 2021 1 commit
  22. 11 Feb, 2021 1 commit
  23. 03 Feb, 2021 1 commit
  24. 02 Feb, 2021 1 commit
  25. 29 Jan, 2021 1 commit
  26. 07 Jan, 2021 1 commit
  27. 05 Jan, 2021 1 commit
  28. 04 Jan, 2021 2 commits
    • [chore] 0.1.2 version bump (#285) · a21f50f9
      Benjamin Lefaudeux authored
    • [feat] sync adascale from internal repo, support add_param_group (#266) · 3932a1f6
      Min Xu authored
      * [feat] sync adascale from internal repo
      
      - tbd
      
      testing: tbd
      
      * Update argument document of __init__
      
      * update documentation around set_num_gradients_to_accumulate
      
      * added checking code for proper API calling places
      
      * rename internal APIs to make them internal
      
      * updated changelog
      
      * added support for add_param_group and its unit test
      
      * added unit test for set_num_gradients_to_accumulate
      
      * added debias_ewma unit test
      
      * fixed test_set_num_gradients_to_accumulate (need zero_grad() call)
      
      * added missing zero_grad() to test_lr_scheduler
      
      * fixed test_add_param_group with respect to optim.zero_grad()
      
      * added test_gradient_value
      
      * added test_scale_not_equal_default for scale != world_size * grad_accum
      
      * added test_unhook()
      
      * removed print statements
      
      * fixed a typo
      
      * addressed Ben's comment
  29. 30 Dec, 2020 1 commit
  30. 24 Dec, 2020 1 commit
    • [chore] Update changelog (#268) · 18455bf0
      Min Xu authored
      * Update changelog
      
      missed this item from previous AdaScale commit.
      
      * More change log
      
      * Addressed review comments