1. 12 May, 2021 2 commits
  2. 11 May, 2021 1 commit
    • [fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3
      Min Xu authored
      
      
      * [fix] FSDP forward pass overlap between compute and all-gather
      
      - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
      - a new unit test is added to check for this and ensure we detect
        issues with overlapping and CPU/GPU blocking wait calls
      
      * fix
      
      * fix
      
      * fix
      
      * better assertion outputs
      
      * fix format and tune all_gather mb for CI
      
      * more tuning with non_flatten
      
      * undo an accidental change
      
      * tuning all gather mb and del model
      
      * Update + fix overlapping test to use patched all_gather w/ delay (#672)
      
      * fixing get_cycles_per_ms
      
      * add get_smi_memory
      
      * update the docstring
      Co-authored-by: Min Xu <min.xu@acm.org>
      Co-authored-by: Myle Ott <myleott@fb.com>
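The idea behind an overlap check with a delayed all_gather can be illustrated with a toy simulation. This is only a sketch and not the fairscale test: `fake_all_gather`, `fake_compute`, and `run_layers` are hypothetical names, and threads with `time.sleep` stand in for CUDA streams and real collectives. It assumes each layer needs one gather before its compute, and that the gather for the next layer can be issued while the current layer computes.

```python
import threading
import time


def fake_all_gather(delay_s: float) -> None:
    """Hypothetical stand-in for an all_gather patched with a fixed delay."""
    time.sleep(delay_s)


def fake_compute(duration_s: float) -> None:
    """Hypothetical stand-in for one layer's forward compute."""
    time.sleep(duration_s)


def run_layers(num_layers: int, compute_s: float, gather_s: float, overlap: bool) -> float:
    """Return the wall-clock time of a simulated forward pass."""
    start = time.perf_counter()
    if not overlap:
        # Serialized: every gather blocks the compute that follows it.
        for _ in range(num_layers):
            fake_all_gather(gather_s)
            fake_compute(compute_s)
    else:
        # The first gather cannot be hidden behind any compute.
        fake_all_gather(gather_s)
        for layer in range(num_layers):
            prefetch = None
            if layer + 1 < num_layers:
                # Issue the next layer's gather while this layer computes.
                prefetch = threading.Thread(target=fake_all_gather, args=(gather_s,))
                prefetch.start()
            fake_compute(compute_s)
            if prefetch is not None:
                prefetch.join()
    return time.perf_counter() - start


serial = run_layers(num_layers=4, compute_s=0.05, gather_s=0.05, overlap=False)
overlapped = run_layers(num_layers=4, compute_s=0.05, gather_s=0.05, overlap=True)
print(f"serial: {serial:.3f} s, overlapped: {overlapped:.3f} s")
```

If the end-to-end time is close to the sum of all compute and gather durations, nothing overlapped; injecting a known delay into the gather makes that difference large enough to assert on.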
  3. 10 May, 2021 2 commits
    • [chore] Updating PR template (#674) · c8d32c30
      Min Xu authored
      * [chore] Updating PR template
      
      Add N/A (Not Applicable) options to some of the questions in the PR template
      
      * Update PULL_REQUEST_TEMPLATE.md
      
      * Update PULL_REQUEST_TEMPLATE.md
    • [minor] clarify a comment (#673) · 6c61887d
      Min Xu authored
      - we do have a use case for empty params inside an FSDP instance: in the
      overlapping FSDP unit test, we use it to measure the timing of compute
      when no params are needed for all_gather
      - therefore, I updated the comment there to be more accurate.
      - fixes #661
  4. 08 May, 2021 5 commits
  5. 07 May, 2021 3 commits
    • [perf] nn.moe: workaround inefficiency in PyTorch's one_hot (#666) · 99b30a04
      msbaines authored
      Workaround for https://github.com/pytorch/pytorch/issues/55579
      
      Co-authored-by: @shruti-bh, @myleott
    • [fix]: support pytorch SyncBatchNorm under AMP & checkpointing with FSDP (#659) · 6db68518
      Min Xu authored
      
      
      * [test]: add a more general test case
      
      - also rebalance the tests a bit
      
      * added missing arg
      
      * balance
      
      * better checking
      
      * balance
      
      * make test smaller and faster
      
      * make ddp results cached and enable sync_bn
      
      * clean up
      
      * fix tests
      
      * changelog
      
      * balance
      
      * fix
      
      * addressing comments
      Co-authored-by: Min Xu <min.xu@acm.org>
    • [feat] experimental.nn.SyncBatchNorm: initial commit (#662) · f0a40046
      msbaines authored
      * [feat] experimental.nn.SyncBatchNorm: initial commit
      
      Fast/simple re-implementation of SyncBatchNorm.
      
      When profiling SSL Vision, I was seeing a majority of cycles spent in
      SyncBatchNorm. With this change, I see a 10% to 20% speedup on the
      model I was profiling.
      
      When running benchmarks/experimental/sync_batchnorm.py on 8 x V100,
      I get a 6x speedup:
      
      <class 'torch.nn.modules.batchnorm.BatchNorm2d'>
      Elapsed time is  0.08709120750427246
      Elapsed time is  0.12632274627685547
      Elapsed time is  0.14095258712768555
      Elapsed time is  0.16529417037963867
      Elapsed time is  0.1419970989227295
      Elapsed time is  0.15166854858398438
      Elapsed time is  0.12000870704650879
      Elapsed time is  0.17534875869750977
      <class 'torch.nn.modules.batchnorm.SyncBatchNorm'>
      Elapsed time is  2.5087168216705322
      Elapsed time is  2.497001886367798
      Elapsed time is  2.5204885005950928
      Elapsed time is  2.526789903640747
      Elapsed time is  2.5080230236053467
      Elapsed time is  2.524489641189575
      Elapsed time is  2.513214588165283
      Elapsed time is  2.5359973907470703
      <class 'fairscale.experimental.nn.sync_batchnorm.SyncBatchNorm'>
      Elapsed time is  0.4126114845275879
      Elapsed time is  0.39051294326782227
      Elapsed time is  0.40685415267944336
      Elapsed time is  0.4159870147705078
      Elapsed time is  0.42383885383605957
      Elapsed time is  0.4080159664154053
      Elapsed time is  0.41202712059020996
      Elapsed time is  0.42400121688842773
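The quoted 6x figure can be checked against the listed timings with a few lines of Python. This is a minimal sketch that only averages the numbers printed above; the dictionary keys are shortened labels chosen for illustration, and treating the mean of the eight values as the summary statistic is an assumption.

```python
from statistics import mean

# Timings copied verbatim from the benchmark output above (seconds).
timings = {
    "torch.nn.BatchNorm2d": [
        0.08709120750427246, 0.12632274627685547, 0.14095258712768555,
        0.16529417037963867, 0.1419970989227295, 0.15166854858398438,
        0.12000870704650879, 0.17534875869750977,
    ],
    "torch.nn.SyncBatchNorm": [
        2.5087168216705322, 2.497001886367798, 2.5204885005950928,
        2.526789903640747, 2.5080230236053467, 2.524489641189575,
        2.513214588165283, 2.5359973907470703,
    ],
    "fairscale.experimental.nn.SyncBatchNorm": [
        0.4126114845275879, 0.39051294326782227, 0.40685415267944336,
        0.4159870147705078, 0.42383885383605957, 0.4080159664154053,
        0.41202712059020996, 0.42400121688842773,
    ],
}

# Average each group, then compare the two SyncBatchNorm implementations.
means = {name: mean(values) for name, values in timings.items()}
speedup = (
    means["torch.nn.SyncBatchNorm"]
    / means["fairscale.experimental.nn.SyncBatchNorm"]
)

for name, value in means.items():
    print(f"{name}: mean {value:.4f} s")
print(f"speedup over torch SyncBatchNorm: {speedup:.2f}x")
```

On these numbers the ratio comes out at roughly 6.1x, consistent with the 6x stated in the commit message, while the experimental implementation remains about three times slower than plain BatchNorm2d.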
  6. 05 May, 2021 6 commits
  7. 04 May, 2021 1 commit
  8. 03 May, 2021 2 commits
  9. 30 Apr, 2021 1 commit
  10. 29 Apr, 2021 2 commits
  11. 28 Apr, 2021 4 commits
    • [test] improve BN test coverage (#638) · 21cba91b
      Min Xu authored
      
      
      * [test] improve BN test coverage
      
      - Added sync_bn on/off cases
      - Added conv and linear bias on/off cases
      - clarified, through the test, when BN wrapping is needed if sync_bn is off
      
      * adding a comment
      Co-authored-by: Min Xu <min.xu@acm.org>
    • adding auto graph generation for distributed pipeline (#615) · bdc0581b
      Mehdi Mirzazadeh authored
      * adding auto graph generation for distributed pipeline
      
      * ignore trace.py for mypy for now, since it needs pytorch 1.8
      
      * fixing tests
      
      * simplifying graph api
      
      * remove unused debug utilities
      
      * use inspect to find argument lists
      
      * use sharded linear layer
      
      * flake8
      
      * comment
      
      * polishing
      
      * polishing
    • 2bb2a134
      msbaines authored
    • [feat] save memory by using bucket buffer only in backward (#633) · a5594032
      Min Xu authored
      
      
      * [feat] save memory by using bucket buffer only in backward
      
      - this fixes bug #627
      - added documentation to clarify the buffer's cost and speed/memory
        tradeoff
      - added setup/teardown calls so that the buffer is only allocated
        during the backward pass, leaving more memory for forward and stepping
        so that it can be used for things like activations.
      - added a unit test that asserts the memory is in range.
      
      Comparing with DDP:
      
        1. buffer size scales with # of FSDP not model size
        2. buffer is only allocated during backward
        3. buffer is used for small tensors only to reduce overhead
        4. overlapping of compute-reduction is very different
      
      * add PR number to changelog
      
      * filled in with memory number on 1.9
      
      * addressed comments
      
      * update comments
      
      * fix for 1.6
      
      * add a todo
      Co-authored-by: Min Xu <min.xu@acm.org>
  12. 27 Apr, 2021 1 commit
  13. 26 Apr, 2021 4 commits
  14. 23 Apr, 2021 2 commits
  15. 22 Apr, 2021 3 commits
  16. 21 Apr, 2021 1 commit