1. 07 Apr, 2021 2 commits
  2. 06 Apr, 2021 1 commit
  3. 04 Apr, 2021 1 commit
  4. 31 Mar, 2021 2 commits
    • [fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556) · a0458b98
      Min Xu authored
      
      * [fix] disable single rank process group for auto_wrap_bn
      
      - beefed up unit test with regnet-like model
      - found that the single-rank process group is causing problems
      - disabled it to enable convergence tests on the vissl side
      - use `raise e from None` to get a better assertion output
        in testing.py.
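The `raise ... from None` idiom mentioned above suppresses implicit exception chaining, so a failing test prints only the final error. A minimal sketch of the idiom follows; `check_equal` and `chained_flags` are hypothetical helpers written for illustration and are not the code in testing.py.

```python
def check_equal(a, b):
    """Hypothetical helper: report a failed comparison without the chained context."""
    try:
        if a != b:
            raise ValueError(f"{a!r} != {b!r}")
    except ValueError as e:
        # `from None` sets __suppress_context__ on the new exception, so the
        # traceback omits the "During handling of the above exception" section.
        raise AssertionError(f"objects differ: {e}") from None


def chained_flags(a, b):
    """Return (__cause__, __suppress_context__) of the raised error, or None if equal."""
    try:
        check_equal(a, b)
    except AssertionError as err:
        return err.__cause__, err.__suppress_context__
    return None
```

Calling `chained_flags(1, 2)` shows that the raised error carries no cause and has context display suppressed, which is what shortens the assertion output.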
      
      * [test] fix regnet test for ddp+mixed_precision
      
      - need AMP context in FSDP
      - work around a difference between ddp & fsdp when bias=True
      - fixed a bug in input data generation that caused different ranks to have
        the same data with the wrong iteration count.
      - added a TODO about needing a better loss and grad_scaler, and reduced
        iters so there is no nan.
      - added (disabled) debugging code
      
      * lint
      
      * lint
      
      * add scaler
      
      * lint
      
      * scaler
      
      * add a real loss
      
      * seeding in the ranks
      
      * balance tests
      
      * run AMP DDP==FSDP test only on cuda version 11 and up
      
      * add relu inplace and comment
      
      * make wrap_bn cover more cases in full precision mode
    • acb9ef00
      msbaines authored
  5. 30 Mar, 2021 1 commit
  6. 26 Mar, 2021 1 commit
  7. 25 Mar, 2021 2 commits
  8. 22 Mar, 2021 1 commit
  9. 20 Mar, 2021 1 commit
  10. 18 Mar, 2021 3 commits
  11. 17 Mar, 2021 1 commit
  12. 12 Mar, 2021 1 commit
  13. 11 Mar, 2021 1 commit
  14. 09 Mar, 2021 2 commits
  15. 08 Mar, 2021 1 commit
    • [fix]: handle inputs with containers in mixed precision (#486) · 2e9a14e7
      Min Xu authored
      * [fix]: handle inputs with containers
      
      - this issue was surfaced by vissl as well
      - fix seems to be super simple
      - also cleaned up two tests with respect to multiple such tests
        running back to back (they don't do that presently)
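Handling inputs held in containers generally means walking nested lists, tuples, and dicts and converting only the leaves. The sketch below shows that idea in pure Python. `apply_to_leaves` and `to_half` are illustrative assumptions (plain floats stand in for tensors, and a half-precision round trip stands in for an FP16 cast); this is not the code from the commit.

```python
import struct
from typing import Any, Callable


def to_half(x: float) -> float:
    """Stand-in for an FP16 cast: round a float to the nearest half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def apply_to_leaves(fn: Callable[[Any], Any], is_leaf: Callable[[Any], bool], x: Any) -> Any:
    """Apply fn to every leaf selected by is_leaf, preserving the container layout."""
    if is_leaf(x):
        return fn(x)
    if isinstance(x, dict):
        return {k: apply_to_leaves(fn, is_leaf, v) for k, v in x.items()}
    if isinstance(x, (list, tuple)):
        # Rebuild the same container type (plain list or tuple; namedtuples
        # would need special handling, which this sketch leaves out).
        return type(x)(apply_to_leaves(fn, is_leaf, v) for v in x)
    # Anything else (ints, strings, None, ...) passes through untouched.
    return x


def is_float(x: Any) -> bool:
    return isinstance(x, float)


batch = {"x": [1.5, (2.0, "label")], "n": 3}
cast_batch = apply_to_leaves(to_half, is_float, batch)
```

Here only the float leaves are converted; the string, the integer, and the dict/list/tuple structure come back unchanged.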
      
      * cleanup
      
      * fix
      
      * lint
  16. 06 Mar, 2021 1 commit
  17. 05 Mar, 2021 1 commit
  18. 04 Mar, 2021 1 commit
  19. 03 Mar, 2021 1 commit
  20. 02 Mar, 2021 1 commit
  21. 01 Mar, 2021 2 commits
    • [chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7
      Min Xu authored
      * [chores]: CI py39 on GPU and more efficiency
      
      * add test list files
      
      * fix
      
      * add test list files
      
      * split benchmark run into 2 runs
      
      * fix 1.8 version and balance benchmarks
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * recording tests
      
      * py39 install fix
      
      * test again
      
      * move tests
      
      * reorg tests
      
      * skip tests for torch 1.8 due to an upstream bug
      
      * removed __init__.py from tests since it confuses pytest
      
      * Revert "removed __init__.py from tests since it confuses pytest"
      
      This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.
      
      * don't include __init__ in file list
      
      * notes on __init__.py and added missing ones
      
      * fixed mypy in a test file
      
      * balance test runtime
      
      * better pip install
      
      * balance more
      
      * pip fix
      
      * balance
      
      * balance more, all tests should finish within 20m now
      
      * minor license update
      
      * trying cu102
      
      * more doc and addressed Ben's comments
      
      * debugging
      
      * debugging...
    • [test] FSDP: add the failing test for #421 (#453) · 5ecac15a
      Min Xu authored
      
      
      * [test] FSDP: add the failing test for #421
      
      * skip on 1.5
      
      * better skipping
      
      * Update tests/nn/data_parallel/test_fsdp_grad_scaler.py
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
  22. 27 Feb, 2021 1 commit
  23. 26 Feb, 2021 3 commits
  24. 25 Feb, 2021 1 commit
  25. 24 Feb, 2021 1 commit
  26. 23 Feb, 2021 2 commits
    • [perf][ShardedDDP] fp16 gradient reduce (#411) · d52d2186
      Benjamin Lefaudeux authored
      * POC, testing against the DDP comm hook when available
      * docs, adding a reference to DDP's compress hook
      * updating changelog, prep for v0.1.8 release
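The idea behind an fp16 gradient reduce is to cast gradients to half precision before the collective and cast back afterwards, halving the bytes on the wire. The sketch below simulates this in pure Python with no real communication. `to_fp16` and `fp16_mean_reduce` are hypothetical names, the pre-divide-then-sum ordering is one possible choice, and none of this is ShardedDDP's implementation.

```python
import struct
from typing import List

FP16_BYTES = struct.calcsize("e")  # 2 bytes per element on the wire
FP32_BYTES = struct.calcsize("f")  # 4 bytes per element on the wire


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    # Note: struct raises OverflowError for magnitudes beyond the fp16 range.
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_mean_reduce(per_rank_grads: List[List[float]]) -> List[float]:
    """Simulate a mean all-reduce whose payload and accumulation are fp16."""
    world_size = len(per_rank_grads)
    # Compress: each rank pre-divides by world size and rounds to fp16.
    compressed = [[to_fp16(g / world_size) for g in grads] for grads in per_rank_grads]
    # Reduce: sum across ranks, rounding to fp16 after every addition.
    reduced = [0.0] * len(compressed[0])
    for grads in compressed:
        reduced = [to_fp16(acc + g) for acc, g in zip(reduced, grads)]
    # Decompress: Python floats are already wider than fp16, so return as-is.
    return reduced


averaged = fp16_mean_reduce([[1.0, 2.0], [3.0, 4.0]])
```

The trade-off is precision: values that are not exactly representable in half precision are rounded before the reduce, which is why such compression is usually optional.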
    • Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e
      Myle Ott authored
      Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.
      
      Compared to PyTorch DDP:
      * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
      * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
      * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
          * all-gather parameters at start of forward pass and start of backward pass
          * reduce-scatter grads at end of backward pass
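The communication figures above can be checked with a small back-of-the-envelope model. It assumes ring-based collectives, where one all-gather or reduce-scatter sends `(W - 1) / W` of the parameter count per rank, and assumes DDP's all-reduce behaves like a reduce-scatter followed by an all-gather. The function names are hypothetical and the model ignores latency and overlap.

```python
def ring_collective_cost(num_params: int, world_size: int) -> float:
    """Elements sent per rank by one ring all-gather or one ring reduce-scatter."""
    return num_params * (world_size - 1) / world_size


def ddp_cost(num_params: int, world_size: int) -> float:
    """All-reduce of gradients, modeled as reduce-scatter + all-gather."""
    return 2 * ring_collective_cost(num_params, world_size)


def fsdp_cost(num_params: int, world_size: int, reshard_after_forward: bool) -> float:
    """All-gather params (once or twice) plus one reduce-scatter of gradients."""
    # With resharding, parameters are gathered again at the start of backward.
    all_gathers = 2 if reshard_after_forward else 1
    return (all_gathers + 1) * ring_collective_cost(num_params, world_size)


ratio = fsdp_cost(1000, 8, True) / ddp_cost(1000, 8)
```

Under these assumptions `reshard_after_forward=False` costs the same as DDP (two collectives), while `reshard_after_forward=True` adds a third, which is where the 50% figure comes from.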
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
  27. 19 Feb, 2021 1 commit
  28. 18 Feb, 2021 2 commits
  29. 17 Feb, 2021 1 commit