1. 12 May, 2021 1 commit
    • [chore] Rename and move checkpoint_activations from misc folder. (#654) · 72c6bab2
      anj-s authored
      * rename files
      
      * add newly renamed file
      
      * rename and move checkpoint activations related files
      
      * add test files to ci list
      
      * fix lint errors
      
      * modify docs
      
      * add changelog
      
      * retain old path for now
      
      * fix lint errors
      
      * add another import test case
      
      * fix merge conflict
      
      * add missing test file
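The "retain old path for now" step above is typically done with a deprecation shim that keeps the legacy import path importable while steering users to the new one. A minimal sketch, framework-free; the module names and the `install_deprecated_alias` helper are hypothetical, and the stdlib `json` module stands in for the relocated module:

```python
import sys
import types
import warnings

def install_deprecated_alias(old_name, new_module):
    """Register old_name in sys.modules so legacy imports keep resolving,
    while warning users about the new location."""
    shim = types.ModuleType(old_name)
    shim.__dict__.update(new_module.__dict__)  # re-export everything
    warnings.warn(
        f"importing {old_name!r} is deprecated; use {new_module.__name__!r}",
        DeprecationWarning,
        stacklevel=2,
    )
    sys.modules[old_name] = shim

# Demo: alias the stdlib json module under a hypothetical legacy name.
import json
install_deprecated_alias("checkpoint_activations_old", json)

import checkpoint_activations_old as legacy  # old import path still works
```

A real shim would emit the warning at import time of the old module; registering in `sys.modules` makes a top-level `import` of the old name resolve without a file on disk.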
  2. 11 May, 2021 1 commit
    • [fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3
      Min Xu authored
      
      
      * [fix] FSDP forward pass overlap between compute and all-gather
      
      - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
      - a new unit test is added to check for this and to ensure we detect
        issues with overlap and CPU/GPU blocking wait calls
      
      * fix
      
      * fix
      
      * fix
      
      * better assertion outputs
      
      * fix format and tune all_gather mb for CI
      
      * more tuning with non_flatten
      
      * undo an accidental change
      
      * tuning all gather mb and del model
      
      * Update + fix overlapping test to use patched all_gather w/ delay (#672)
      
      * fixing get_cycles_per_ms
      
      * add get_smi_memory
      
      * update the docstring
      Co-authored-by: Min Xu <min.xu@acm.org>
      Co-authored-by: Myle Ott <myleott@fb.com>
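The overlap test in this commit patches all_gather with an artificial delay so a lack of compute/communication overlap shows up in wall-clock time. The same idea can be sketched framework-free with threads standing in for CUDA streams; all function names here are hypothetical stand-ins, not fairscale APIs:

```python
import threading
import time

def fake_all_gather(delay_s):
    """Stand-in for a communication call; the artificial delay makes
    (lack of) overlap visible in wall-clock time."""
    time.sleep(delay_s)

def fake_compute(delay_s):
    time.sleep(delay_s)

def run_overlapped(delay_s):
    """Launch the gather on a side 'stream' (a thread) so compute runs
    concurrently, then wait -- analogous to CUDA stream overlap."""
    start = time.perf_counter()
    t = threading.Thread(target=fake_all_gather, args=(delay_s,))
    t.start()
    fake_compute(delay_s)
    t.join()
    return time.perf_counter() - start

def run_serial(delay_s):
    """Blocking wait before compute: total time is the sum of both."""
    start = time.perf_counter()
    fake_all_gather(delay_s)
    fake_compute(delay_s)
    return time.perf_counter() - start
```

With overlap the elapsed time is roughly one delay; a serial (blocking) schedule takes roughly two, which is what the unit test can assert on.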
  3. 08 May, 2021 1 commit
  4. 07 May, 2021 1 commit
  5. 05 May, 2021 3 commits
    • [fix] better assert and better test for frozen weights (#657) · b54eed1b
      Min Xu authored
      
      
      * [fix] better assert and better test for frozen weights
      
      - the precise condition should have checked m.parameters(), not
        m.params.
      - fixes #643
      
      * add changelog
      
      * using an enum is so much better
      Co-authored-by: Min Xu <min.xu@acm.org>
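The point of the fix is that the freeze check must inspect m.parameters(), which recurses into child modules, rather than a module-local parameter list. A minimal CPU sketch; the `freeze_state` helper is hypothetical, not fairscale's actual assert:

```python
import torch.nn as nn

def freeze_state(module):
    """Classify a module by looking at *all* of its parameters, including
    those of nested children -- the recursion is what module.parameters()
    adds over any module-local parameter attribute. Assumes the module
    has at least one parameter."""
    flags = {p.requires_grad for p in module.parameters()}
    if flags == {True}:
        return "trainable"
    if flags == {False}:
        return "frozen"
    return "mixed"

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
for p in model[1].parameters():  # freeze only the nested second layer
    p.requires_grad = False
```

Checking only the top-level module's own attributes would miss the frozen child and report the root as fully trainable.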
    • [draft][chore] SDP: increase code coverage (#653) · 69cbdf5d
      Benjamin Lefaudeux authored
      * increasing the code coverage, which is good practice and surfaces bugs; hopefully getting to 100%
      * small bugfix
    • [fix] add clear_autocast_cache flag (#650) · 861b5ce2
      Min Xu authored
      
      
      * [fix] add clear_autocast_cache flag
      
      - when training in AMP mode with weight dtype fp32, FSDP may need to
        optionally clear the autocast cache to avoid GPU OOM
      - this flag defaults to False; clearing automatically is a future TODO
      - also added a verbose flag to make print(fsdp_model) a bit shorter
      - updated the memory test to cover the new code
      - added a couple of useful functions in parallel.py and testing.py
      
      * minor
      
      * address comments
      
      * format
      
      * improve the test
      Co-authored-by: Min Xu <min.xu@acm.org>
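A flag like the one described above would hook into the training loop by calling torch.clear_autocast_cache() after each optimizer step, dropping the cached lower-precision copies of fp32 weights. A minimal CPU sketch of where such a flag sits; the `train_steps` function and its signature are hypothetical, not FSDP's API:

```python
import torch
from torch import nn

def train_steps(model, steps, clear_autocast_cache=False):
    """Plain training loop; when the flag is set, the autocast weight
    cache is dropped after each step to free memory."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(8, 4)
    for _ in range(steps):
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        if clear_autocast_cache:
            # Safe to call outside an autocast region; frees cached casts.
            torch.clear_autocast_cache()
    return model

model = train_steps(nn.Linear(4, 2), steps=3, clear_autocast_cache=True)
```

The memory saving only matters under autocast with fp32 master weights; the sketch just shows the call site.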
  6. 03 May, 2021 1 commit
  7. 29 Apr, 2021 2 commits
  8. 28 Apr, 2021 2 commits
    • [test] improve BN test coverage (#638) · 21cba91b
      Min Xu authored
      
      
      * [test] improve BN test coverage
      
      - Added sync_bn on/off cases
      - Added conv and linear bias on/off cases
      - clarified in the test when BN wrapping is needed while sync_bn is off
      
      * adding a comment
      Co-authored-by: Min Xu <min.xu@acm.org>
    • [feat] save memory by using bucket buffer only in backward (#633) · a5594032
      Min Xu authored
      
      
      * [feat] save memory by using bucket buffer only in backward
      
      - this fixes bug #627
      - added documentation to clarify the buffer's cost and speed/memory
        tradeoff
      - added setup/teardown calls so that the buffer is only allocated
        during the backward pass, freeing more memory during forward and
        stepping for things like activations
      - added a unit test that asserts the memory is in range
      
      Compared with DDP:

        1. buffer size scales with the number of FSDP instances, not model size
        2. buffer is only allocated during backward
        3. buffer is used for small tensors only, to reduce overhead
        4. overlapping of compute and reduction is very different
      
      * add PR number to changelog
      
      * filled in with memory number on 1.9
      
      * addressed comments
      
      * update comments
      
      * fix for 1.6
      
      * add a todo
      Co-authored-by: Min Xu <min.xu@acm.org>
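The setup/teardown lifecycle described above — allocate the reduction bucket just before backward, free it right after — can be sketched framework-free. The `ReduceBucket` class and sizes are hypothetical; a real implementation would hold a CUDA tensor, not a bytearray:

```python
class ReduceBucket:
    """Flat buffer for coalescing small gradient tensors during backward.
    Allocated in setup() and freed in teardown(), so the memory is
    available to activations during forward and to the optimizer while
    stepping."""

    def __init__(self, size):
        self.size = size
        self.buffer = None  # nothing allocated outside backward

    def setup(self):
        self.buffer = bytearray(self.size)  # stand-in for a CUDA tensor

    def teardown(self):
        self.buffer = None  # release so forward/step can reuse the memory

def backward_pass(bucket):
    """The buffer exists only for the duration of the backward pass."""
    bucket.setup()
    try:
        pass  # ... coalesce and reduce small grads through bucket.buffer ...
    finally:
        bucket.teardown()
```

Because the buffer scales with the number of FSDP instances rather than model size, freeing it outside backward is a cheap way to recover memory for activations.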
  9. 26 Apr, 2021 1 commit
  10. 23 Apr, 2021 1 commit
    • [FSDP] relax checking root condition (#620) · d3b86d65
      shuyingsunshine21 authored
      * relax checking root condition
      
      * formatting
      
      * add unittest
      
      * add unittest to ci test list
      
      * isort for import of unittest
      
      * format black .
      
      * move test to list 1
      
      * add skip no cuda
      
      * black and isort
  11. 22 Apr, 2021 2 commits
  12. 19 Apr, 2021 1 commit
    • FSDP: fixing training with freezing weights (#614) · 24da3b11
      Min Xu authored
      
      
      * FSDP: fixing training with freezing weights
      
      - an assert was changed to catch this case correctly
      - a unit test was added (based on Quentin's test code) for this case,
        comparing DDP and FSDP
      
      fixes: #610
      
      * added test file to list 1
      
      * Use better and simpler code as suggested by Myle
      
      * testing both methods of freezing as well
      Co-authored-by: Min Xu <min.xu@acm.org>
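"Both methods of freezing" refers to the two standard PyTorch idioms: turning off requires_grad on the frozen part, and handing the optimizer only the trainable parameters. A minimal CPU sketch combining both (the model shape is an arbitrary example):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Method 1: turn off gradients on the frozen submodule.
for p in model[0].parameters():
    p.requires_grad = False

# Method 2: give the optimizer only the trainable parameters.
opt = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)

frozen_before = model[0].weight.detach().clone()
model(torch.randn(8, 4)).sum().backward()
opt.step()  # only the second layer is updated
```

Either method alone freezes the weights; using both also avoids computing and storing gradients for the frozen part. The commit's test exercises both paths under FSDP against a DDP baseline.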
  13. 13 Apr, 2021 2 commits
  14. 08 Apr, 2021 1 commit
  15. 07 Apr, 2021 2 commits
  16. 06 Apr, 2021 1 commit
  17. 04 Apr, 2021 1 commit
  18. 31 Mar, 2021 2 commits
    • [fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556) · a0458b98
      Min Xu authored
      
      * [fix] disable single rank process group for auto_wrap_bn
      
      - beefed up unit test with regnet-like model
      - found that single-rank process group is causing problem
      - disabled it to enable convergence tests on the vissl side
      - use `raise e from None` to get a better assertion output
        in testing.py.
      
      * [test] fix regnet test for ddp+mixed_precision
      
      - need AMP context in FSDP
      - workaround different between ddp & fsdp when bias=True
      - fixed a bug in input data generation that caused different ranks to
        have the same data and a wrong iteration count
      - added a TODO for a better loss and grad_scaler; reduced the
        iters so there is no NaN
      - added a (disabled) debugging code
      
      * lint
      
      * lint
      
      * add scaler
      
      * lint
      
      * scaler
      
      * add a real loss
      
      * seeding in the ranks
      
      * balance tests
      
      * run AMP DDP==FSDP test only on cuda version 11 and up
      
      * add relu inplace and comment
      
      * make wrap_bn cover more cases in full precision mode
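The "seeding in the ranks" fix above is the usual pattern of deriving each rank's seed from a base seed plus its rank, so workers see distinct but reproducible data. A pure-Python sketch of the idea (the `make_rank_data` helper and the world size are hypothetical):

```python
import random

def make_rank_data(base_seed, rank, n):
    """Per-rank generator: base_seed + rank gives each worker a distinct,
    reproducible stream instead of every rank seeing identical batches."""
    rng = random.Random(base_seed + rank)
    return [rng.random() for _ in range(n)]

# Simulate a 4-rank world; each rank draws its own batch.
world = [make_rank_data(base_seed=42, rank=r, n=4) for r in range(4)]
```

With a shared base seed and no rank offset, every rank would generate the same batch, which is exactly the bug the regnet test tripped over.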
    • msbaines authored · acb9ef00
  19. 30 Mar, 2021 1 commit
  20. 26 Mar, 2021 1 commit
  21. 25 Mar, 2021 2 commits
  22. 22 Mar, 2021 1 commit
  23. 20 Mar, 2021 1 commit
  24. 18 Mar, 2021 3 commits
  25. 17 Mar, 2021 1 commit
  26. 12 Mar, 2021 1 commit
  27. 11 Mar, 2021 1 commit
  28. 09 Mar, 2021 2 commits